About Nagios monitoring in real example

Now it’s time to set up proper monitoring to avoid unpleasant surprises in the future.

Monitoring solves two major problems: alerting and trending. Alerting notifies a responsible person about a major event, such as a service that stopped working. Trending tracks how something changes over time: disk or memory usage, replication lag, etc.

This post will be about alerting with Nagios.

The major problem with most Nagios setups I’ve seen is an excessive number of false positives. This kills the whole idea of monitoring. When admins get a false alert, they tend to mute it, explicitly or implicitly: they either filter alerts out or stop taking them seriously. In general, an alert must be worth waking the admin up in the middle of the night. If it isn’t, a real problem will be ignored sooner or later.

For proper monitoring, Nagios checks must be carefully chosen and tuned: alert about important things and don’t alert about unimportant ones. How do you know which are important? Your application provides some service. If an event makes it impossible to provide the service at the pre-defined SLA, then the event is important:

  • Host doesn’t respond
  • Service doesn’t work
  • The host is running out of a critical resource: memory, disk space
  • SLA is broken

Do not alert about minor stuff, such as some threshold being exceeded. This may sound wrong, but open your mind and take a bird’s-eye view of the problem. The application provides some service at some SLA. There may be no formal, signed and stamped SLA, but you always set a reasonable SLA, at least for yourself. A good alert tells you that if you don’t take action, your application will fail to provide the service or the SLA will be broken. High Load average is a good example of a bad alert. If Load average is high but the SLA isn’t broken, who cares about high Load average? So, monitor SLA, not LA.

For monitoring I use a dedicated server that does nothing but monitoring. To run checks on remote servers I use NRPE because it needs fewer resources than SSH. Obviously, TCP port 5666 on the monitored servers must be open to the monitoring box only.

Let me show with a real example what is monitored on each server and how.

Storage server

IP availability:

-w and -c are mandatory, so let’s leave them. Besides, non-zero packet loss indicates a problem in the network, and we don’t want to store corrupted backups.
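
check_ping takes each threshold as a pair `<rta>,<pl>%`: round-trip average in milliseconds and packet loss in percent. As a rough illustration of how such a pair turns into a Nagios status, here is a minimal Python sketch. The function names and the example threshold values are assumptions made for this illustration, not the values used on this server, and the real plugin may compare slightly differently.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def parse_threshold(value):
    """Parse a check_ping style threshold such as '100.0,20%'."""
    rta, loss = value.split(",")
    return float(rta), float(loss.rstrip("%"))


def ping_status(rta_ms, loss_pct, warn, crit):
    """Map a measured round-trip average and packet loss to a status."""
    warn_rta, warn_loss = parse_threshold(warn)
    crit_rta, crit_loss = parse_threshold(crit)
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return CRITICAL
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return WARNING
    return OK
```

With hypothetical thresholds of `100.0,20%` and `500.0,60%`, a host answering in 12 ms with 25% packet loss would be reported as WARNING.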

Disk space:

Disk space on the root partition is critical for the OS. Disk space on the partitions with backups is critical too: no space, no service.
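
To make the idea concrete, a free-space check can be sketched in a few lines of Python. This is not the stock check_disk plugin, only an illustration of the same logic; the 20% and 10% thresholds and the function names are assumptions.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def free_percent(path):
    """Return the percentage of space still available on the filesystem."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_status(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Alert when free space drops below the (hypothetical) thresholds."""
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK
```

A plugin built on this would call `disk_status(free_percent("/"))` for the root partition and repeat the call for each backup partition.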

SSH admin access:

For obvious reasons an admin must be able to log in to the server at any time.

SSH access for TwinDB agents:

TwinDB agents must be able to access the storage server too.

For the TwinDB sshd process I want one more check: /proc/<PID>/oom_adj must be -17, so the oom-killer doesn’t kill it.
There is no check for that in the nagios_plugins package, so let’s put one in /usr/lib64/nagios/plugins:
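
A plugin of this kind could look roughly like the Python sketch below. It is an illustration, not the original script: the function name, the `proc_root` parameter (there only to make the function easy to test) and the way the PID is obtained are all assumptions. A real plugin would still need to find the PID of the right sshd, for example from its pid file.

```python
import os

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def check_oom_adj(pid, expected=-17, proc_root="/proc"):
    """Compare /proc/<pid>/oom_adj with the expected value."""
    path = os.path.join(proc_root, str(pid), "oom_adj")
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (OSError, ValueError) as err:
        # Process is gone or the file is unreadable: we cannot tell.
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {err}"
    if value == expected:
        return OK, f"OK - oom_adj of PID {pid} is {value}"
    return CRITICAL, (
        f"CRITICAL - oom_adj of PID {pid} is {value}, expected {expected}"
    )
```

The plugin would print the message and exit with the returned code, which is how Nagios and NRPE read a check result.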

I added the command to /etc/nagios/nrpe.cfg on the storage server:

And the respective check in /etc/nagios/conf.d/storage-01.cfg on the monitoring host:

Application server app-01

This server hosts Apache for this website, the TwinDB dispatcher, the TwinDB web console and the MySQL server for all of them.

Here I check IP connectivity, free disk space on the root partition and SSH availability, the same way as for the storage server.

I don’t monitor Load average, total processes or swap usage, even though the default Nagios config suggests it.

Instead, Nagios checks the very important state of httpd:

When a user accesses http://twindb.com, i.e. via HTTP, not HTTPS, Apache redirects them to https://twindb.com. Apache must respond within one second; this is my SLA. If it takes longer, Nagios will send an alert.

The next check is the HTTPS service:

The service must respond with OK. The --ssl and --sni options tell Nagios to check HTTPS, not HTTP.

The -w 1 -c 5 options define my SLA: the index page must return in one second or less.

And most important, Apache must respond with reasonable content. The string “About MySQL backups with TwinDB” comes from the MySQL database, so I can be sure the index page is more or less valid.
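
The two conditions, response time and expected content, can be combined as in the Python sketch below. It only illustrates the logic; it is not how check_http is implemented, and the function names and the timeout choice are assumptions.

```python
import time
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def evaluate(body, elapsed, expect, warn=1.0, crit=5.0):
    """Judge a response by its content first, then by its latency."""
    if expect not in body:
        return CRITICAL, "CRITICAL - expected string not found"
    if elapsed > crit:
        return CRITICAL, f"CRITICAL - response took {elapsed:.2f}s"
    if elapsed > warn:
        return WARNING, f"WARNING - response took {elapsed:.2f}s"
    return OK, f"OK - response took {elapsed:.2f}s"


def check_page(url, expect, warn=1.0, crit=5.0):
    """Fetch the page, time the request and evaluate the result."""
    start = time.monotonic()
    try:
        # The timeout of twice the critical threshold is an arbitrary choice.
        with urllib.request.urlopen(url, timeout=crit * 2) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as err:  # URLError and HTTPError are OSError subclasses
        return CRITICAL, f"CRITICAL - request failed: {err}"
    return evaluate(body, time.monotonic() - start, expect, warn, crit)
```

Checking the content before the timing reflects the point above: a fast page with the wrong content is still a broken service.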

TwinDB dispatcher responds on api.twindb.com.

As before, Apache must respond within one second. The option “-u /get_config.php” tells Nagios to check the URL http://api.twindb.com/get_config.php

Then MySQL checks follow. I use checks from Percona Monitoring Plugins for Nagios:

This check looks for important MySQL files that have been deleted, for example ibdata1. InnoDB opens ibdata1 at startup and never closes it until MySQL exits. If somebody deletes ibdata1, the running MySQL won’t notice: the file is still open and normally accessible to the mysqld process. But when MySQL stops, the operating system removes the file completely, and at the next start InnoDB creates a new, empty ibdata1. At TwinDB we handle data recovery cases, and we have seen ones like that.
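
On Linux, a file that was deleted while still open shows up in /proc/<PID>/fd as a symlink whose target ends with " (deleted)". The Python sketch below uses that to list such files for one process. It is an illustration of the idea, not the Percona plugin's implementation; the function name and the `proc_root` parameter are assumptions.

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_open_files(pid, proc_root="/proc"):
    """Return paths of files the process holds open but that were deleted."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    deleted = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.endswith(DELETED_SUFFIX):
            deleted.append(target[: -len(DELETED_SUFFIX)])
    return sorted(deleted)
```

A real check would also have to ignore files that MySQL deletes on purpose, such as temporary files in its tmpdir, otherwise it would produce exactly the false positives this post warns about.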

Check MySQL file privileges to make sure MySQL can write to its files, but the permissions are not too open.

Check whether InnoDB has any threads blocked for a long time. This check also relates to SLA compliance: if a MySQL query is blocked for more than 60 seconds, TwinDB doesn’t provide good service.

Check the MySQL pid file: it must exist and belong to the running MySQL instance. If two instances access the same InnoDB files, they may corrupt them. Again, this has been proven in the field.
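
The core of such a check is small: read the PID from the file and verify that a process with that PID exists. The Python sketch below shows only this part, under the assumption that the pid file contains a single number. It is not the Percona plugin, and it does not verify that the process is actually mysqld.

```python
import os

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def process_alive(pid):
    """Check process existence with signal 0, which delivers nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_pidfile(path):
    """The pid file must exist and point to a running process."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError) as err:
        return CRITICAL, f"CRITICAL - cannot read PID from {path}: {err}"
    if pid <= 0 or not process_alive(pid):
        return CRITICAL, f"CRITICAL - no process with PID {pid}"
    return OK, f"OK - PID {pid} from {path} is running"
```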

And the last check for MySQL is query response time. We need to check the execution time of some query. To choose the query, I collected the slow log with long_query_time=0. To generate high traffic, I ran a stress test with jmeter. The goal is to collect all queries MySQL executes.

Then with pt-query-digest I found the query where MySQL spends most of its time. Its maximum execution time was taken as the threshold.
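
The timing logic itself can be sketched in Python as below. This is not the original script: `run_query` is a hypothetical callable that executes the chosen query through whatever MySQL client library is available, and `threshold` stands for the maximum execution time reported by pt-query-digest.

```python
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_query_time(run_query, threshold):
    """Run the query once and compare its duration with the threshold."""
    start = time.perf_counter()
    run_query()
    elapsed = time.perf_counter() - start
    if elapsed > threshold:
        return CRITICAL, (
            f"CRITICAL - query took {elapsed:.3f}s (limit {threshold}s)"
        )
    return OK, f"OK - query took {elapsed:.3f}s"
```

Passing the query as a callable keeps the timing code independent of the database driver, so the same function can be exercised with a stub.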

So, the Nagios check is:

And the script that checks MySQL query execution time is:

The command is defined in /etc/nagios/nrpe.cfg:

Application server app-02

This server hosts the TwinDB package repository, the TwinDB bug tracker and an SMTP/POP3 mail server.

Besides the common checks described above, Nagios has other rules:

Comments and suggestions are welcome. I hope this helps you in your environment.