1. A Real Example Of Nagios Monitoring

It’s now time to set up proper monitoring to avoid unpleasant surprises in the future.

Monitoring solves two major problems: alerting and trending. Alerting notifies the person in charge about a major event, such as a service failing. Trending tracks how something changes over time: disk or memory usage, replication lag, etc.

This post will be about alerting with Nagios.

The major problem with most Nagios setups I’ve seen is an excessive number of false positives. This kills the whole idea of monitoring. When admins get a false alert, they tend to mute it, explicitly or implicitly: they either filter alerts out or stop taking them seriously. In general, an alert must be worth waking the admin up in the middle of the night. If it isn’t, the real problem will be ignored sooner or later.

For proper monitoring, Nagios checks must be carefully chosen and tuned. That means alerting only about important things. How do you know which ones are important? Your application provides some service. If an event makes it impossible to provide the service at a pre-defined SLA, then the event is important:

  • Host doesn’t respond
  • Service doesn’t work
  • A critical resource is running out: memory, disk space
  • SLA is broken

Do not alert about minor things such as an exceeded threshold. This may sound wrong, but consider the bigger picture. The application provides some service at some SLA. There may be no formal, signed and stamped SLA, but you always set some reasonable SLA, at least for yourself. A good alert tells you: if you don’t take action, your application will fail to provide the service or the SLA will be broken. High load average is a good example of a bad alert. If the load average is high but the SLA isn’t broken, who cares about the load average? So, monitor SLA, not LA.

For monitoring, I use a dedicated server that does nothing but monitor. To run checks on remote servers, I use NRPE because it needs fewer resources than SSH. TCP port 5666 on the monitored servers must be open to the monitoring box only.

Let me show what is monitored on each server, and how, using a real example.

Storage Server

IP availability:

-w and -c are mandatory, so let’s leave them. Besides, non-zero packet loss indicates a problem in the network, and we don’t want to store corrupted backups.

Disk space:

Disk space on the root partition is critical for the OS. However, disk space on the partitions with backups is critical too: no space, no service.
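As an illustration of what such a check boils down to, here is a minimal sketch in Python. It is not the stock Nagios disk plugin; the function names and the 20%/10% thresholds are assumptions made for the example.

```python
import shutil

# Exit codes defined by the Nagios plugin convention
OK, WARNING, CRITICAL = 0, 1, 2


def classify_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map the percentage of free space to a Nagios status code.

    The default thresholds are illustrative assumptions.
    """
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


def check_partition(path, warn_pct=20.0, crit_pct=10.0):
    """Return (status, message) for the partition that holds `path`."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    status = classify_free_space(free_pct, warn_pct, crit_pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return status, f"DISK {label} - {path} has {free_pct:.1f}% free"
```

A plugin built on this would print the message and exit with the status code, once for the root partition and once for each backup partition.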

SSH admin access:

For obvious reasons, an admin must be able to log in to the server at any time.

SSH access for TwinDB agents:

TwinDB agents must be able to access the storage server too.

For the TwinDB sshd process I want one more check: /proc/<PID>/oom_adj must be -17, so the OOM killer doesn’t kill it.
There is no check for that in the nagios_plugins package, so let’s put one in /usr/lib64/nagios/plugins:
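A minimal sketch of such a check, written here in Python. The function names, the message format, and the idea of locating the process through a pid file are assumptions made for the example.

```python
import os

# Exit codes defined by the Nagios plugin convention
OK, CRITICAL, UNKNOWN = 0, 2, 3


def read_oom_adj(pid, proc_root="/proc"):
    """Read /proc/<pid>/oom_adj and return it as an integer."""
    with open(os.path.join(proc_root, str(pid), "oom_adj")) as f:
        return int(f.read().strip())


def check_oom_adj(pid_file, expected=-17, proc_root="/proc"):
    """Return (status, message); OK only if oom_adj equals `expected`.

    `pid_file` is assumed to hold the pid of the sshd instance to check.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        value = read_oom_adj(pid, proc_root)
    except (OSError, ValueError) as err:
        return UNKNOWN, f"OOM_ADJ UNKNOWN - {err}"
    if value == expected:
        return OK, f"OOM_ADJ OK - pid {pid} has oom_adj {value}"
    return CRITICAL, (
        f"OOM_ADJ CRITICAL - pid {pid} has oom_adj {value}, expected {expected}"
    )
```

The `proc_root` parameter exists only so the logic can be exercised against a fake directory tree; a real plugin would print the message and exit with the returned status.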

I added the command to /etc/nagios/nrpe.cfg on the storage server:

And the respective check to /etc/nagios/conf.d/storage-01.cfg on the monitoring host:

Application Server app-01

This server hosts Apache for this website, the TwinDB dispatcher, the TwinDB web console, and the MySQL server for all of them.

Here, I check IP connectivity, free disk space on the root partition, and SSH availability, the same way as for the storage server.

I don’t monitor load average, total processes, or swap usage, even though the default Nagios config suggests it.

Instead, Nagios checks the state of httpd, as it’s very important:

When a user accesses http://twindb.com, i.e. via HTTP, not HTTPS, Apache redirects them to https://twindb.com. Apache must respond within one second; this is my SLA. If it takes longer, Nagios will send an alert.

Next check is the HTTPS service:

The service must respond with OK. The --ssl and --sni options tell Nagios to check HTTPS, not HTTP.

The -w 1 -c 5 options define my SLA: the index page must return in one second or less.

And most importantly, Apache must respond with reasonable content. The string “About MySQL backups with TwinDB” is something from the MySQL database, so I can be sure the index page is more or less valid.
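The three requirements above (a successful response, the one-second SLA, and the expected content) can be sketched in Python as follows. This is an illustration of the logic, not a replacement for the Nagios HTTP plugin; the function names are hypothetical, and the exact boundary handling of the thresholds is an assumption.

```python
import http.client
import time
import urllib.request

# Exit codes defined by the Nagios plugin convention
OK, WARNING, CRITICAL = 0, 1, 2


def classify_response(elapsed, body, expected, warn=1.0, crit=5.0):
    """Apply the SLA: the expected text must be present and the page fast."""
    if expected not in body:
        return CRITICAL, "expected string not found"
    if elapsed > crit:
        return CRITICAL, f"response took {elapsed:.2f}s"
    if elapsed > warn:
        return WARNING, f"response took {elapsed:.2f}s"
    return OK, f"response took {elapsed:.2f}s"


def check_url(url, expected, warn=1.0, crit=5.0):
    """Fetch `url`, time the request, and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=crit) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as err:
        # Connection errors, timeouts and HTTP error statuses all land here
        return CRITICAL, f"request failed: {err}"
    elapsed = time.monotonic() - start
    return classify_response(elapsed, body, expected, warn, crit)
```

Checking for a string that comes from the database is what turns this from a web server check into a check of the whole stack behind the index page.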

TwinDB dispatcher responds on api.twindb.com.

As before, Apache must respond within one second. Option “-u /get_config.php” tells Nagios to check URL http://api.twindb.com/get_config.php

Next, MySQL checks follow. I use checks from Percona Monitoring Plugins for Nagios:

The test checks whether some important MySQL files have been deleted, for example ibdata1. InnoDB opens ibdata1 at startup and never closes it until MySQL exits. If somebody deletes the ibdata1 file, the running MySQL won’t notice: the file is still open and normally accessible to the mysqld process. But when MySQL stops, the operating system removes the file completely, and at the next start InnoDB will create a new ibdata1… and it will be empty. At TwinDB, we handle data recovery cases, and we have seen ones like that.
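To illustrate the mechanism, here is a rough Python sketch of how deleted-but-still-open files can be spotted on Linux. It is not how the Percona plugin is implemented. It relies on the fact that the links under /proc/<PID>/fd point to a path ending in " (deleted)" once the file has been unlinked. The function names and the choice to ignore files under /tmp/ are assumptions.

```python
import os

DELETED_SUFFIX = " (deleted)"


def open_file_targets(pid, proc_root="/proc"):
    """Return the link targets of every open file descriptor of `pid`."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # the descriptor was closed while we were scanning
    return targets


def deleted_targets(link_targets, ignore_prefixes=("/tmp/",)):
    """Return paths that are still open but no longer exist on disk.

    Temporary files are expected to be unlinked while open, so paths
    under `ignore_prefixes` are skipped (an assumption for this sketch).
    """
    found = []
    for target in link_targets:
        if not target.endswith(DELETED_SUFFIX):
            continue
        path = target[: -len(DELETED_SUFFIX)]
        if path.startswith(ignore_prefixes):
            continue
        found.append(path)
    return found
```

Calling `deleted_targets(open_file_targets(pid))` with the mysqld pid would return a non-empty list in the ibdata1 scenario described above, which is the condition worth alerting on.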

Check MySQL file privileges to make sure MySQL can write to them, but the permissions are not too open.

Check whether InnoDB has any threads that have been blocked for a long time. This check also relates to SLA compliance: if a MySQL query is blocked for more than 60 seconds, TwinDB doesn’t provide good service.

Check the MySQL pid file: it must exist and belong to the running MySQL instance. If two instances access the same InnoDB files, they may corrupt them. Again, that has been proven in the field.
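The idea behind the pid file check can be sketched in Python as below. This is a simplified illustration, not the Percona plugin: it only verifies that the pid file is readable and that the pid it names is a running process called mysqld, using /proc/<PID>/comm on Linux. The function name and parameters are hypothetical.

```python
import os

# Exit codes defined by the Nagios plugin convention
OK, CRITICAL = 0, 2


def check_pid_file(pid_file, expected_name="mysqld", proc_root="/proc"):
    """Return (status, message) for the given pid file."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError) as err:
        return CRITICAL, f"cannot read pid file: {err}"
    try:
        # /proc/<pid>/comm holds the command name of the process
        with open(os.path.join(proc_root, str(pid), "comm")) as f:
            name = f.read().strip()
    except OSError:
        return CRITICAL, f"pid {pid} from {pid_file} is not running"
    if name != expected_name:
        return CRITICAL, f"pid {pid} is '{name}', not '{expected_name}'"
    return OK, f"pid {pid} is a running {expected_name}"
```

A missing or stale pid file is exactly the situation in which a second instance could be started against the same data directory, so every failure path here is treated as critical.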

And the last check for MySQL is the query response time. We need to check the execution time of a query. To choose the query, I collected the slow log with long_query_time=0. To generate high traffic, I ran a stress test with JMeter. The goal was to capture all queries MySQL executes.

Then, with pt-query-digest, I found the query on which MySQL spends most of its time. Its maximum execution time was taken as the threshold.
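The timing logic of such a check is simple enough to sketch in Python. In this example `run_query` is a hypothetical callable that executes the chosen query through whatever MySQL client library is available; the function names and the warning/critical split are assumptions.

```python
import time

# Exit codes defined by the Nagios plugin convention
OK, WARNING, CRITICAL = 0, 1, 2


def time_call(run_query):
    """Run the callable once and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    run_query()
    return time.monotonic() - start


def check_query_time(run_query, warn, crit):
    """Return (status, message) based on how long `run_query` takes.

    `warn` and `crit` are thresholds in seconds; the critical one would be
    derived from the maximum execution time seen in the digest report.
    """
    try:
        elapsed = time_call(run_query)
    except Exception as err:
        # A query that fails outright cannot meet the SLA either
        return CRITICAL, f"query failed: {err}"
    if elapsed > crit:
        return CRITICAL, f"query took {elapsed:.3f}s"
    if elapsed > warn:
        return WARNING, f"query took {elapsed:.3f}s"
    return OK, f"query took {elapsed:.3f}s"
```

Passing the query in as a callable keeps the timing and classification independent of the database driver.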

So, the Nagios check is:

And the script that checks MySQL query execution time is:

The command is defined in /etc/nagios/nrpe.cfg:

Application Server app-02

This server hosts the TwinDB package repository, the TwinDB bug tracker, and an SMTP/POP3 mail server.

Besides the common checks described above, Nagios has other rules:

Comments and suggestions are welcome. I hope this helps you in your environment.