Monitoring MySQL Backups With Datadog and TwinDB Backup Tool

Monitoring MySQL backups is a vital part of any reliable backup solution. By monitoring the two most common disaster recovery metrics, the Recovery Time Objective and the Recovery Point Objective, you can find out whether a backup job succeeded and produced a usable backup copy. Together, the TwinDB Backup Tool and Datadog make it easier to monitor both.

Recovery Point Objective

The Recovery Point Objective (RPO) is the amount of data you can afford to lose if a disaster happens. If you take backups every hour, you can lose up to an hour of data. If you take backups every day, you can lose up to a day. Sometimes, people ask us to recover their data. They have a backup copy from yesterday, but they can’t tolerate a day of data loss. Unfortunately, they realized that only when it was too late.

Recovery Time Objective

The Recovery Time Objective (RTO) is the time needed to fully restore the database. It’s important to measure it because that way you can check if your backups are usable at all. Just take a look at some of our data recovery customer cases. Many of our clients thought they had backups. But when they needed to restore the database, it turned out the backup job didn’t run; or it produced corrupt backups; or full copies were OK, but incremental ones weren’t, and so on. After a decade in the data recovery business, I’ve seen thousands of cases where backups were supposed to be available, but they weren’t. Hence, don’t forget to verify your backups.

Needless to say, downtime hurts business. If you know your RTO, you can make certain preparations and, for example, buy insurance that covers losses in case of a disaster.

As with any other Service Level Agreement (SLA) metric, recording it is not enough. The database administrator (DBA) must be notified about an SLA breach. Thus, if the last backup was made too long ago, or the RPO or the RTO exceeds its threshold value, the DBA must be notified and take appropriate action to remediate the problem.

How we measure disaster recovery metrics

Technically, the Recovery Point Objective is not measured. Instead, it is configured as a threshold, and an alert is sent when that threshold is exceeded. For example, if the RPO is one hour, we take backups every hour and send an alert if the most recent copy is older than that.
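The RPO check described here boils down to comparing the age of the newest copy with the threshold. The following is a minimal Python sketch of that comparison, not the way TwinDB Backup or Datadog implement it; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(last_backup_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True when the newest backup copy is older than the RPO allows."""
    return now - last_backup_at > rpo


# Hypothetical values: a one-hour RPO and a copy taken 90 minutes ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_copy = now - timedelta(minutes=90)
print(rpo_breached(last_copy, now, timedelta(hours=1)))  # True
```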

To measure Recovery Time Objective, we restore the database from the latest copy and record the time it took to do that.
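As an illustration of the idea, measuring the restore time amounts to wrapping the restore job in a timer. The sketch below assumes the restore is exposed as a plain Python callable; `timed_restore` and `fake_restore` are hypothetical names, and the placeholder only sleeps instead of restoring anything.

```python
import time
from typing import Callable


def timed_restore(restore: Callable[[], None]) -> float:
    """Run the restore callable and return the elapsed wall-clock seconds."""
    started = time.monotonic()
    restore()
    return time.monotonic() - started


def fake_restore() -> None:
    # Placeholder for a real restore job; it only sleeps briefly.
    time.sleep(0.05)


restore_seconds = timed_restore(fake_restore)
print(f"restore took {restore_seconds:.2f}s")
```

A monotonic clock is used so that system clock adjustments during a long restore do not distort the result.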

When the TwinDB Backup Tool takes or restores a backup copy, it sends the corresponding metrics to Datadog. In Datadog, we chart the metrics to track changes and configure monitors that alert us if our SLA is broken.

How to configure monitoring of MySQL backups

In TwinDB Backup, you need to enable the export of metrics. In Datadog, you accept the metrics and configure monitors for alerting.

TwinDB Backup Tool

First of all, TwinDB Backup installs a cron configuration that runs a backup every hour by default:

If you need to back up more often, change the cron config accordingly. Don’t forget to check how often the tool will take full copies: if the database is too big, a full copy may take longer than the backup interval.

In the example above, full copies will be taken every day and incremental copies will be taken every hour.

Now, you need to configure the export of metrics from TwinDB Backup to Datadog. Every time TwinDB Backup takes or restores a backup, it will report respective metrics to Datadog.

In the export configuration, app_key and api_key are the credentials of your Datadog account.
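Before relying on the export, it can help to check that both credentials are actually set. The sketch below parses an INI-style config with Python's standard configparser and lists any missing credential. The `[export]` section name and the `transport` key are assumptions made for illustration; only app_key and api_key come from the text above, so check the TwinDB Backup documentation for the real layout.

```python
import configparser

# Hypothetical config text: the [export] section name and the "transport" key
# are assumptions; only app_key and api_key are named in the article.
SAMPLE_CONFIG = """
[export]
transport = datadog
app_key = YOUR_APP_KEY
api_key = YOUR_API_KEY
"""


def missing_export_keys(config_text: str, section: str = "export") -> list:
    """Return the credential keys that are absent or empty in the section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        return ["app_key", "api_key"]
    return [
        key for key in ("app_key", "api_key")
        if not parser.get(section, key, fallback="").strip()
    ]


print(missing_export_keys(SAMPLE_CONFIG))  # []
```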

Datadog

On the Datadog side, you need to enable the Python integration, create keys, and create graphs and monitors. Let’s go over the whole process step by step.

1. Enable Python integration on https://app.datadoghq.com/account/settings.

2. Generate API and APP keys.
The generated keys should be used in the twindb-backup config as shown above.

Note: Steps 1 and 2 are prerequisites for the export feature in the TwinDB Backup Tool.

3. Create a dashboard with new graphs or add new graphs to an existing dashboard.

The disaster recovery metrics are recorded as twindb.mysql.backup_time and twindb.mysql.restore_time.
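TwinDB Backup sends these metrics on its own, so no extra code is needed. Purely to show what such a data point might look like, here is a Python sketch that builds a gauge point in the shape accepted by Datadog's v1 series endpoint. The host name, the sample value, and the assumption that the value is a duration in seconds are ours, not taken from the tool.

```python
import json
import time


def build_series_payload(metric: str, seconds: float, host: str, timestamp: int) -> dict:
    """Build a gauge data point in the shape of Datadog's v1 series API."""
    return {
        "series": [
            {
                "metric": metric,
                "points": [[timestamp, seconds]],
                "type": "gauge",
                "host": host,
            }
        ]
    }


# Hypothetical sample: a backup that took 412 seconds on host db01.example.com.
payload = build_series_payload(
    "twindb.mysql.backup_time", 412.0, "db01.example.com", int(time.time())
)
print(json.dumps(payload, indent=2))
```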

MySQL disaster recovery metrics

TwinDB Backup Tool reports backup and restore time for file backups, too.

Files disaster recovery metrics

4. Datadog monitors will alert the DBA if the RPO or the RTO SLA is no longer met.

We will create two monitors: “Backup time is too high” and “Restore time is too high”. Each monitor has two functions. It sends an alert if the backup/restore threshold is exceeded, and it sends an alert if TwinDB Backup hasn’t reported the backup/restore time metric for a long time.
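The same two monitors can also be described as data, for example to create them through the Datadog API instead of the UI. The Python sketch below only builds the monitor definitions and does not send them anywhere. The thresholds, evaluation windows, and no-data timeframes are hypothetical and should be derived from your own RPO and RTO.

```python
def build_monitor(name: str, metric: str, window: str,
                  threshold_seconds: int, no_data_minutes: int) -> dict:
    """Build a metric monitor that covers both threshold and missing data."""
    return {
        "name": name,
        "type": "metric alert",
        # Alert when the highest reported value in the window exceeds the threshold.
        "query": f"max({window}):max:{metric}{{*}} > {threshold_seconds}",
        "message": f"{name}. Please check the backup jobs.",
        "options": {
            "thresholds": {"critical": threshold_seconds},
            # Alert as well when the metric stops arriving.
            "notify_no_data": True,
            "no_data_timeframe": no_data_minutes,
        },
    }


# Hypothetical thresholds: backups within 1 hour, restores within 4 hours.
monitors = [
    build_monitor("Backup time is too high", "twindb.mysql.backup_time",
                  "last_1h", 3600, 120),
    build_monitor("Restore time is too high", "twindb.mysql.restore_time",
                  "last_1d", 14400, 2880),
]
for monitor in monitors:
    print(monitor["name"], "->", monitor["query"])
```

Keeping the threshold and the no-data condition in one monitor means a silent backup job raises the same alert as a slow one.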

Backup time exceeds the threshold.

Restore time is higher than the threshold.