Monitoring MySQL Backups With Datadog and TwinDB Backup Tool
Monitoring MySQL backups is a vital part of any reliable backup solution. By monitoring the two most common disaster recovery metrics, the Recovery Point Objective and the Recovery Time Objective, you can tell whether a backup job succeeded and produced a usable backup copy. The TwinDB Backup Tool, together with Datadog, lets you monitor both of them efficiently.
Recovery Point Objective
The Recovery Point Objective (RPO) defines how much data you can afford to lose if a disaster happens. If you take backups every hour, you can lose up to an hour of data. If you take backups every day, you can lose a day. Sometimes, people ask us to recover their data: they have a backup copy from yesterday, but they can't tolerate a day of data loss. Unfortunately, they realize it only when it is too late.
Recovery Time Objective
The Recovery Time Objective (RTO) is the time needed to fully restore the database. It's important to measure it because that way you can check whether your backups are usable at all. Just take a look at some of our data recovery customer cases. Many of our clients thought they had backups. But when they needed to restore the database, it turned out the backup job didn't run; or it produced corrupt backups; or the full copies were fine, but the incremental ones weren't, and so on. After a decade in the data recovery business, I've seen thousands of cases where backups were supposed to be available but weren't. Hence, don't forget to verify your backups.
Needless to say, downtime hurts business. If you know your RTO, you can make certain preparations and get, for example, an insurance that would cover losses in case of disaster.
As with any other Service Level Agreement metric, recording it is not enough: the database administrator must be notified about an SLA breach. Thus, if the last backup was taken too long ago, and the RPO or the RTO exceeds its threshold, the DBA must be alerted so they can take appropriate action to remediate the problem.
How we measure disaster recovery metrics
Strictly speaking, the Recovery Point Objective is not measured; it is a target you configure as a threshold, and you alert when that threshold is exceeded. For example, if the RPO is an hour, we take backups every hour and send an alert if the most recent copy is older than that.
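The RPO check described above can be sketched in a few lines. This is a minimal illustration, assuming a flat directory of backup files and a one-hour threshold; it is not the TwinDB Backup Tool's actual implementation.

```python
import os
import time

def rpo_breached(backup_dir, rpo_seconds=3600):
    """Return True if the newest file in backup_dir is older than the RPO."""
    paths = [os.path.join(backup_dir, name) for name in os.listdir(backup_dir)]
    if not paths:
        return True  # no backup copies at all is the worst possible breach
    newest = max(os.path.getmtime(p) for p in paths)
    return (time.time() - newest) > rpo_seconds
```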
To measure Recovery Time Objective, we restore the database from the latest copy and record the time it took to do that.
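Recording the RTO boils down to timing a restore run. A hedged sketch, where the restore command itself is a placeholder you would replace with your real restore invocation:

```python
import subprocess
import time

def measure_rto(restore_cmd):
    """Run the restore command and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(restore_cmd, check=True)  # raises if the restore fails
    return time.monotonic() - start
```

Failing loudly on a non-zero exit code matters here: a restore that errors out should never be recorded as a successful RTO measurement.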
When the TwinDB Backup Tool takes or restores a backup copy, it sends respective metrics to Datadog. In Datadog, we put the metrics into a chart to see changes and configure monitors to signal if our SLA is broken.
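For illustration, the payload for a single gauge value has roughly the shape below; the metric name `twindb.backup_time` is an assumption, so check your Datadog metric explorer for the names the tool actually reports.

```python
import time

def metric_payload(name, value, ts=None):
    """Build a single-gauge payload in the shape Datadog's metrics API expects."""
    ts = ts if ts is not None else int(time.time())
    return {"metric": name, "points": [(ts, value)]}

# With the official "datadog" Python package, a payload like this could
# be submitted as (credentials and metric name are placeholders):
#   from datadog import initialize, api
#   initialize(api_key="***", app_key="***")
#   api.Metric.send(**metric_payload("twindb.backup_time", 42.5))
```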
How to configure monitoring MySQL backups
On the TwinDB Backup side, you configure the export of metrics; on the Datadog side, you accept the metrics and configure monitors for alerting.
TwinDB Backup Tool
First of all, TwinDB Backup installs a cron configuration where it runs backup every hour by default:
```
# cat /etc/cron.d/twindb-backup
@hourly  root twindb-backup backup hourly
@daily   root twindb-backup backup daily
@weekly  root twindb-backup backup weekly
@monthly root twindb-backup backup monthly
@yearly  root twindb-backup backup yearly
```
If you need to back up more often, change the cron configuration accordingly. Don't forget to check how often the tool takes full copies: if the database is large, taking a full copy may take longer than the backup interval.
```
# cat /etc/twindb/twindb-backup.cfg
...
[mysql]
full_backup=daily
...
```
In the example above, full copies will be taken every day and incremental copies will be taken every hour.
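The schedule implied by `full_backup=daily` can be sketched as follows. This is assumed logic for illustration only, not the tool's actual implementation:

```python
def copy_type(run_interval, full_backup="daily"):
    """Return which kind of copy a given cron run would take."""
    return "full" if run_interval == full_backup else "incremental"
```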
Now, you need to configure the export of metrics from TwinDB Backup to Datadog. Every time TwinDB Backup takes or restores a backup, it will report respective metrics to Datadog.
```
# cat twindb-backup.cfg
...
[export]
transport=datadog
app_key=***
api_key=***
...
```
The app_key and api_key are the credentials of your Datadog account.
On the Datadog side, you need to enable Python integration, create keys, create graphs and monitors. Let’s go over the whole process step by step.
1. Enable Python integration on https://app.datadoghq.com/account/settings.
2. Generate API and APP keys.
The generated keys should be used in the twindb-backup config as shown above.
Note: Steps 1 and 2 are prerequisites for the export feature in the TwinDB Backup Tool.
3. Create your dashboard with new graphs or add new graphs to the existing dashboard.
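As a sketch of the alerting side, a Datadog metric-alert monitor for the RTO could use a query like the one below. The metric name `twindb.restore_time` and the one-hour threshold are assumptions; substitute the metric names your account actually receives and a threshold matching your SLA.

```
max(last_1d):max:twindb.restore_time{*} > 3600
```

A similar monitor on the backup-time metric, or on the absence of fresh data points, covers the RPO side.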
Disaster Recovery metrics will be recorded in Datadog and shown on these graphs.
TwinDB Backup Tool reports backup and restore time for file backups, too.