Recover Table Structure From InnoDB Dictionary


When a table gets dropped, MySQL removes the respective .frm file. This post explains how to recover the structure of a dropped table.

You need the table structure to recover a dropped table from the InnoDB tablespace. The B+tree structure of an InnoDB index doesn’t contain any information about field types, yet MySQL needs to know them in order to access the records of an InnoDB table. Normally, MySQL gets the table structure from the .frm file, but when MySQL drops a table, the respective .frm file is removed too.

Fortunately, there’s one more place where MySQL keeps the table structure. It’s the InnoDB dictionary.

The InnoDB dictionary is a set of tables where InnoDB keeps information about the tables. I reviewed them in detail in a separate InnoDB Dictionary post earlier. After the DROP, InnoDB deletes records related to the dropped table from the dictionary. So, we need to recover deleted records from the dictionary and then get the table structure.

Compiling Data Recovery Tool

First, we need to get the source code. The code is hosted on GitHub.

git clone https://github.com/twindb/undrop-for-innodb.git

To compile it, we need gcc, bison and flex. Install these packages with a package manager (yum/apt/etc). Then it’s time to compile.

# make
cc -g -O3 -I./include -c stream_parser.c
cc -g -O3 -I./include  -pthread -lm stream_parser.o -o stream_parser
flex  sql_parser.l
bison  -o sql_parser.c sql_parser.y
sql_parser.y: conflicts: 6 shift/reduce
cc -g -O3 -I./include -c sql_parser.c
cc -g -O3 -I./include -c c_parser.c
cc -g -O3 -I./include -c tables_dict.c
cc -g -O3 -I./include -c print_data.c
cc -g -O3 -I./include -c check_data.c
cc -g -O3 -I./include  sql_parser.o c_parser.o tables_dict.o print_data.o check_data.o -o c_parser -pthread -lm
cc -g -O3 -I./include -o innochecksum_changer innochecksum.c

Recover The InnoDB Dictionary

Now, let’s create dictionary tables in the sakila_recovered database. The data recovery tool comes with the structure of the dictionary tables.

# cat dictionary/SYS_* | mysql sakila_recovered

The dictionary is stored in the ibdata1 file. So, let’s parse it.

# ./stream_parser -f /var/lib/mysql/ibdata1
 
...
 
Size to process:                  79691776 (76.000 MiB)
Worker(0): 84.13% done. 2014-09-03 16:31:20 ETA(in 00:00:00). Processing speed: 7.984 MiB/sec
Worker(2): 84.21% done. 2014-09-03 16:31:20 ETA(in 00:00:00). Processing speed: 8.000 MiB/sec
Worker(1): 84.21% done. 2014-09-03 16:31:21 ETA(in 00:00:00). Processing speed: 4.000 MiB/sec
All workers finished in 2 sec

Now, we need to extract the dictionary records from the InnoDB pages. Let’s create a directory for the table dumps.

# mkdir -p dumps/default

And now we can generate table dumps and LOAD DATA INFILE commands to load them. We also need to specify the -D option to c_parser because the records we need were deleted from the dictionary when the table was dropped.

SYS_TABLES

# ./c_parser -4Df pages-ibdata1/FIL_PAGE_INDEX/0000000000000001.page \
    -t dictionary/SYS_TABLES.sql \
    > dumps/default/SYS_TABLES \
    2> dumps/default/SYS_TABLES.sql

SYS_INDEXES

# ./c_parser -4Df pages-ibdata1/FIL_PAGE_INDEX/0000000000000003.page \
    -t dictionary/SYS_INDEXES.sql \
    > dumps/default/SYS_INDEXES \
    2> dumps/default/SYS_INDEXES.sql

SYS_COLUMNS

# ./c_parser -4Df pages-ibdata1/FIL_PAGE_INDEX/0000000000000002.page \
    -t dictionary/SYS_COLUMNS.sql \
    > dumps/default/SYS_COLUMNS \
    2> dumps/default/SYS_COLUMNS.sql

SYS_FIELDS

# ./c_parser -4Df pages-ibdata1/FIL_PAGE_INDEX/0000000000000004.page \
    -t dictionary/SYS_FIELDS.sql \
    > dumps/default/SYS_FIELDS \
    2> dumps/default/SYS_FIELDS.sql

With the generated LOAD DATA INFILE commands it’s easy to load the dumps.

# cat dumps/default/*.sql | mysql sakila_recovered

Now we have the InnoDB dictionary loaded into normal InnoDB tables.
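
Since the dictionary now sits in ordinary tables, a quick SQL query is enough to check that the records of the dropped table were indeed recovered. A minimal sketch (the column names come from the SYS_* definitions shipped in the dictionary/ directory):

mysql> SELECT NAME, ID, N_COLS FROM sakila_recovered.SYS_TABLES WHERE NAME LIKE 'sakila/%';

If sakila/actor shows up in the result, its dictionary records made it back and we can move on to sys_parser.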

Compiling sys_parser

sys_parser is a tool that reads the dictionary from the tables loaded into MySQL and generates the CREATE TABLE statement for a given table.

To compile it, we need the MySQL libraries and development files. Depending on the distribution, they may be in a -devel or -dev package. On RedHat-based systems, you can check it with the command yum provides "*/mysql_config". On my server it was the mysql-community-devel package.

If all necessary packages are installed, the compilation boils down to a simple command:

# make sys_parser
/usr/bin/mysql_config
cc `mysql_config --cflags` `mysql_config --libs` -o sys_parser sys_parser.c

Recover Table Structure

Now sys_parser can do its magic. Just run it to get the CREATE statement in the standard output.

# ./sys_parser
sys_parser [-h <host>] [-u <user>] [-p <password>] [-d <db>] database/table

It will use "root" as the username and "qwerty" as the password to connect to MySQL. The dictionary is stored in the SYS_* tables in the sakila_recovered database. What we want to recover is sakila.actor. InnoDB uses a slash "/" as the separator between the database name and the table name, and so does sys_parser.

# ./sys_parser -u root -p qwerty  -d sakila_recovered sakila/actor
CREATE TABLE `actor`(
`actor_id` SMALLINT UNSIGNED NOT NULL,
`first_name` VARCHAR(45) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci' NOT NULL,
`last_name` VARCHAR(45) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci' NOT NULL,
`last_update` TIMESTAMP NOT NULL,
PRIMARY KEY (`actor_id`)
) ENGINE=InnoDB;
# ./sys_parser -u root -p qwerty  -d sakila_recovered sakila/customer
CREATE TABLE `customer`(
`customer_id` SMALLINT UNSIGNED NOT NULL,
`store_id` TINYINT UNSIGNED NOT NULL,
`first_name` VARCHAR(45) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci' NOT NULL,
`last_name` VARCHAR(45) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci' NOT NULL,
`email` VARCHAR(50) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci',
`address_id` SMALLINT UNSIGNED NOT NULL,
`active` TINYINT NOT NULL,
`create_date` DATETIME NOT NULL,
`last_update` TIMESTAMP NOT NULL,
PRIMARY KEY (`customer_id`)
) ENGINE=InnoDB;

There are a few caveats though.

  1. InnoDB doesn’t store all the information you can find in the .frm file. For example, if a field is AUTO_INCREMENT, the InnoDB dictionary knows nothing about it, so sys_parser won’t recover that property. If there were any field or table level comments, they’re lost too.
  2. sys_parser generates a table structure suitable for further data recovery. It could, but does not, recover secondary indexes or foreign keys.
  3. InnoDB stores the DECIMAL type as a binary string and doesn’t store the precision of a DECIMAL field, so that information will be lost.

For example, table payment uses DECIMAL to store money.

# ./sys_parser -u root -p qwerty  -d sakila_recovered sakila/payment
CREATE TABLE `payment`(
        `payment_id` SMALLINT UNSIGNED NOT NULL,
        `customer_id` SMALLINT UNSIGNED NOT NULL,
        `staff_id` TINYINT UNSIGNED NOT NULL,
        `rental_id` INT,
        `amount` DECIMAL(6,0) NOT NULL,
        `payment_date` DATETIME NOT NULL,
        `last_update` TIMESTAMP NOT NULL,
        PRIMARY KEY (`payment_id`)
) ENGINE=InnoDB;
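
Properties lost this way are usually easy to put back with ALTER TABLE once the data is recovered. A sketch for the payment table, assuming the original schema (in the stock sakila database amount is DECIMAL(5,2) and payment_id is AUTO_INCREMENT; adjust to your real definition):

ALTER TABLE `payment`
    MODIFY `payment_id` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    MODIFY `amount` DECIMAL(5,2) NOT NULL;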

Fortunately, Oracle is planning to extend the InnoDB dictionary and finally get rid of .frm files. I salute that decision: having the structure in two places leads to inconsistencies.


A Simple Shell Script Is A Bad Idea

Torre Grossa

The Internet is full of simple shell backup scripts. Every DBA and system administrator (including myself) has written one. Bash is a bad tool for this job, and in this post I will explain why.

Let’s find out why the need to write anything arises in the first place. There are a lot of great backup tools for databases as well as for files. Following the Unix paradigm, they do one thing and do it well. With tar I can copy files; with mysqldump I can copy a MySQL database. But there are other problems to solve: scheduling (although it’s the easiest one), encryption (both in transit and at rest), compression, backup retention, working with storage. Surely I’ve missed many more smaller but important details. That’s exactly the reason why people write wrappers on top of their favorite backup tools.

Don’t you think it’s too much for a shell script?

At the beginning I also needed something quick and dirty to back up our databases and files. In one evening I wrote a simple script (!) for MySQL and for files. It was a Bash script, and its config was also a piece of Bash. I put it in cron and was content for the moment.

Then I tried to reuse the script for our customers. The script wasn’t good for them in that state: the databases were bigger, they needed S3 storage, incremental backups and other features. That was the moment when the simplicity backfired.

First of all, the shell script had zero test coverage. There were neither unit nor integration tests. If you make even the smallest change, there is no way to guarantee that something else won’t break. Same with bug fixes: if you fix a bug once, there is no way to make sure the bug won’t appear again. I’m not aware of any unit test frameworks for Bash and have never seen anybody using one. That’s why Bash is a good tool for stuff like “do step 1; do step 2; do step 3; exit”. Anything more complex than that is reckless driving without a fastened seatbelt: if you’re skilled enough you’ll get to point B, but if anything goes wrong, the consequences are catastrophic.

Another problem with Bash is the lack of means to work with data. Everything that can be elegantly organized into Python classes becomes spaghetti code in Bash. For any complex algorithm this is vitally important.

Error handling in Bash is limited and tricky. The best it has is set -o errexit. And I hope you learned about set -o pipefail not the hard way, like I did :).
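
A minimal sketch of why these two options matter in a backup job (the dump command and paths are made up for illustration):

#!/bin/bash
set -o errexit   # stop the script on the first failing command
set -o pipefail  # a pipeline fails if any command in it fails

# Without pipefail, a mysqldump failure here would be masked, because
# the exit status of the pipeline would be gzip's (normally 0).
mysqldump --all-databases | gzip > /backups/all.sql.gz

Without those two lines, a half-written dump can happily land in the backup storage and nobody notices until the restore.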

Instead of a conclusion, I want to say that the vast majority of our data recovery cases have a similar story: there was a home-baked backup script that wasn’t running, was silently failing, or was buggy. When the backup copy became needed, the script failed to deliver.

That happens because backup is not a simple problem and it cannot be solved by a simple script.

P.S.

I’m writing this post looking at a tower (Torre Grossa) built in the 14th century. It wouldn’t have survived this long if the old masters hadn’t followed best practices.

Recovering A Corrupt MySQL Database

Author: Andriy Lysyuk.

The unDROP for InnoDB tool can be used to recover corrupt MySQL databases. In this post, we will show how to repair a MySQL database if its files got corrupted and even innodb_force_recovery=6 doesn’t help.

Corruption of an InnoDB tablespace may be caused by many things. A dying hard drive can write garbage, so a page checksum will be wrong. InnoDB then reports to the error log:

InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 4.

MySQL is well known for its poor start-up script. A simple upgrade procedure may end up with two mysqld processes writing to the same tablespace, which leads to corruption too. Sometimes a power failure corrupts not only the InnoDB files; the whole file system becomes unusable for the operating system.

InnoDB is very strict when it works with pages. If a checksum doesn’t match or some field in the header carries an unexpected value, InnoDB wisely prefers to crash to avoid further corruption.

The manual suggests starting MySQL with the innodb_force_recovery option. The purpose of this option is to let the user dump their data; it provides no means to repair the tablespace. The user must drop the tablespace, create a new one, and load the data back.

innodb_force_recovery accepts values from one to six. The higher the value, the more checks InnoDB disables.
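
For reference, the option is usually set in my.cnf and removed again as soon as the data is dumped. A sketch (the file location varies by distribution):

# /etc/my.cnf
[mysqld]
# start with 1 and increase only as far as needed to let mysqld start
innodb_force_recovery = 4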

In this post, we will assume that MySQL cannot start even with innodb_force_recovery=6.

The recovery toolkit works directly with the InnoDB files; it can read records from InnoDB pages. If some part of a page is damaged, it simply skips that piece and continues reading records further in the page.

So, let’s corrupt some InnoDB file and recover the table.

InnoDB Corruption

For the sake of simplicity we will overwrite part of a .ibd file in the area with user data. In real life, the corruption may be anywhere in the PRIMARY index. In the middle of the PRIMARY index of the sakila.actor table, we overwrite the data with 128 “A” characters:

0000C058  00 00 00 02 00 32 01 00 02 00 1C 69 6E 66 69 6D 75 6D 00 05 .....2.....infimum..
0000C06C  00 0B 00 00 73 75 70 72 65 6D 75 6D 07 08 00 00 10 00 29 00 ....supremum......).
0000C080  01 00 00 00 00 05 1E 9F 00 00 01 4D 01 10 50 45 4E 45 4C 4F ...........M..PENELO
0000C094  50 45 47 55 49 4E 45 53 53 43 F2 F5 A9 08 04 00 00 18 00 26 PEGUINESSC.........&
0000C0A8  00 02 00 00 00 00 05 1E 9F 00 00 01 4D 01 1A 4E 49 43 4B 57 ............M..NICKW
0000C0BC  41 48 4C 42 45 52 47 43 F2 F5 A9 05 02 00 00 20 00 21 00 03 AHLBERGC....... .!..
0000C0D0  00 00 00 00 05 1E 9F 00 00 01 4D 01 24 45 44 43 48 41 53 45 ..........M.$EDCHASE
0000C0E4  43 F2 F5 A9 05 08 04 00 28 00 27 00 04 00 00 00 00 05 1E 9F C.......(.'.........
0000C0F8  00 00 01 4D 01 2E 4A 45 4E 4E 49 46 45 52 44 41 56 49 53 43 ...M..JENNIFERDAVISC
0000C10C  F2 F5 A9 0C 06 00 00 30 00 2C 00 05 00 00 00 00 05 1E 9F 00 .......0.,..........
0000C120  00 01 4D 01 38 4A 4F 48 4E 4E 59 4C 4F 4C 4C 4F 42 52 49 47 ..M.8JOHNNYLOLLOBRIG
0000C134  49 44 41 43 F2 F5 A9 09 05 00 00 38 00 28 00 06 00 00 00 00 IDAC.......8.(......
0000C148  05 1E 9F 00 00 01 41 41 41 41 41 41 41 41 41 41 41 41 41 41 ......AAAAAAAAAAAAAA
0000C15C  41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAAAAAA
0000C170  41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAAAAAA
0000C184  41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAAAAAA
0000C198  41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAAAAAA
0000C1AC  41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAAAAAA
0000C1C0  41 41 41 41 41 41 41 41 41 41 41 41 41 41 4E 4B 43 F2 F5 A9 AAAAAAAAAAAAAANKC...
0000C1D4  05 09 00 00 58 00 28 00 0A 00 00 00 00 05 1E 9F 00 00 01 4D ....X.(............M
0000C1E8  01 6A 43 48 52 49 53 54 49 41 4E 47 41 42 4C 45 43 F2 F5 A9 .jCHRISTIANGABLEC...
0000C1FC  04 04 00 00 60 00 22 00 0B 00 00 00 00 05 1E 9F 00 00 01 4D ....`."............M

Corrupted InnoDB Table Crashes MySQL

When MySQL reads a page with the user data, the checksum is wrong and the server crashes.

mysql> SELECT COUNT(*) FROM sakila.actor;
+----------+
| COUNT(*) |
+----------+
|      200 |
+----------+
 
ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query

Before the crash, MySQL writes to the error log what exactly went wrong and dumps the faulty page:

Version: '5.6.19-67.0'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), Release 67.0, Revision 618
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 4.
InnoDB: You may have to recover from a backup.
2014-07-14 20:18:44 7f060bfff700 InnoDB: Page dump in ascii and hex (16384 bytes):
len 16384; hex 1bce9a5000000004ffffffffffffffff0000000026c3095945bf00000000000000000

Recovering InnoDB Table Corruption

When you see corruption in the InnoDB tablespace, the first thing to try is to start MySQL with the innodb_force_recovery option. It makes sense to try all values, starting from one up to six.

We assume that MySQL doesn’t start even with innodb_force_recovery=6 or it does start, but any SELECT crashes it.

The recovery plan is the following:

  • Split corrupted InnoDB tablespace into pages; sort the pages by type and index_id.
  • Fetch records from the PRIMARY index of the table.
  • DROP the corrupted table and create a new one.
  • Load records back into MySQL.

We need to parse two tablespaces: ibdata1 and actor.ibd (since innodb_file_per_table=ON). The InnoDB dictionary is stored in ibdata1; we need it to find the index_id of the PRIMARY index of the sakila.actor table.

Split Corrupted InnoDB Tablespace

# ./stream_parser -f /var/lib/mysql/ibdata1
Opening file: /var/lib/mysql/ibdata1
File information:
 
ID of device containing file:        64768
inode number:                         8028
protection:                         100660 (regular file)
number of hard links:                    1
user ID of owner:                      106
group ID of owner:                     114
device ID (if special file):             0
blocksize for filesystem I/O:         4096
number of blocks allocated:          36864
time of last access:            1406832698 Thu Jul 31 14:51:38 2014
time of last modification:      1406833058 Thu Jul 31 14:57:38 2014
time of last status change:     1406833058 Thu Jul 31 14:57:38 2014
total size, in bytes:             18874368 (18.000 MiB)
 
Size to process:                  18874368 (18.000 MiB)
All workers finished in 0 sec

Now it’s actor.ibd‘s turn.

# ./stream_parser -f /var/lib/mysql/sakila/actor.ibd
Opening file: /var/lib/mysql/sakila/actor.ibd
File information:
 
ID of device containing file:        64768
inode number:                         8037
protection:                         100660 (regular file)
number of hard links:                    1
user ID of owner:                      106
group ID of owner:                     114
device ID (if special file):             0
blocksize for filesystem I/O:         4096
number of blocks allocated:            224
time of last access:            1406832349 Thu Jul 31 14:45:49 2014
time of last modification:      1406832300 Thu Jul 31 14:45:00 2014
time of last status change:     1406832300 Thu Jul 31 14:45:00 2014
total size, in bytes:               114688 (112.000 kiB)
 
Size to process:                    114688 (112.000 kiB)
All workers finished in 0 sec

Recover InnoDB Dictionary

We need to know the index_id of the PRIMARY index of the sakila.actor table. The dictionary tables are described in detail in the InnoDB Dictionary post; for now, we’ll just get the index_id of sakila.actor:

# ./c_parser -4f pages-ibdata1/FIL_PAGE_INDEX/0000000000000001.page -t dictionary/SYS_TABLES.sql |grep actor
000000000504    85000001320110  SYS_TABLES      "sakila/actor"  13      4       1       0       0       ""      1
00000000050D    8E0000013B0110  SYS_TABLES      "sakila/film\_actor"    20      3       1       0       0       ""      8
...
 
# ./c_parser -4f pages-ibdata1/FIL_PAGE_INDEX/0000000000000003.page -t dictionary/SYS_INDEXES.sql |grep 13
000000000300    810000012D01D3  SYS_INDEXES     11      13      "REF\_IND"      1       0       0       304
000000000504    85000001320178  SYS_INDEXES     13      15      "PRIMARY"       1       3       1       3
000000000504    850000013201A6  SYS_INDEXES     13      16      "idx\_actor\_last\_name"        1       0       1       4
000000000505    860000013301CE  SYS_INDEXES     14      17      "PRIMARY"       1       3       2       3
...

So, the index_id of the PRIMARY index of the sakila.actor table is 15, the fifth column in the dump.

Recover Records From The PRIMARY Index Of The Table

c_parser reads InnoDB pages, matches them with a given table structure, and dumps records in a tab-separated values format.

Unlike InnoDB, when c_parser hits a corrupted area, it skips it and continues reading the page.

Let’s read the records from index_id 15, which is the PRIMARY index according to the dictionary.

# ./c_parser -6f pages-actor.ibd/FIL_PAGE_INDEX/0000000000000015.page \
    -t sakila/actor.sql \
    > dumps/default/actor \
    2> dumps/default/actor_load.sql
 
# cat dumps/default/actor
-- Page id: 3, Format: COMPACT, Records list: Invalid, Expected records: (0 200)
72656D756D07    08000010002900  actor   30064   "\0\0\0\0"      ""      "1972-09-20 23:07:44"
1050454E454C    4F50454755494E  actor   19713   "ESSC▒" ""      "2100-08-09 07:52:36"
00000000051E    9F0000014D011A  actor   2       "NICK"  "WAHLBERG"      "2006-02-15 04:34:33"
00000000051E    9F0000014D0124  actor   3       "ED"    "CHASE" "2006-02-15 04:34:33"
00000000051E    9F0000014D012E  actor   4       "JENNIFER"      "DAVIS" "2006-02-15 04:34:33"
00000000051E    9F0000014D0138  actor   5       "JOHNNY"        "LOLLOBRIGIDA"  "2006-02-15 04:34:33"
00000000051E    9F000001414141  actor   6       "AAAAA" "AAAAAAAAA"     "2004-09-10 01:53:05"
 
00000000051E    9F0000014D016A  actor   10      "CHRISTIAN"     "GABLE" "2006-02-15 04:34:33"
...

We have identified some valid records, but certainly there’s also some “garbage”. Pay attention to the recovered records before “Nick Wahlberg”: there should definitely be a record for “Penelope Guiness”, since we haven’t overwritten that data in the actor.ibd file.

Filters

We can improve the recovery quality by applying filters on the possible values of certain fields. There are 200 records in the original table, but the first two “garbage” records have some weird identifiers (30064 and 19713). We know that the actor identifier should be in the range [1..300], so we tell the parser to match that condition. For this purpose, we add a hint in the comments of the actor.sql file that defines the actor table. The comment must be in a special format for the parser to recognize it. Here is the relevant part of the actor.sql file (note the comma after the comment!):

CREATE TABLE `actor` (
  `actor_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT
    /*!FILTER
     int_min_val: 1
     int_max_val: 300 */,
  `first_name` varchar(45) NOT NULL,
  `last_name` varchar(45) NOT NULL,
  `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`actor_id`),
  KEY `idx_actor_last_name` (`last_name`)
) ENGINE=InnoDB AUTO_INCREMENT=201 DEFAULT CHARSET=utf8;

After applying the filter, the recognized records look much better:

# ./c_parser -6f pages-actor.ibd/FIL_PAGE_INDEX/0000000000000015.page \
    -t sakila/actor.sql \
    > dumps/default/actor \
    2> dumps/default/actor_load.sql
 
# head -10 dumps/default/actor
-- Page id: 3, Format: COMPACT, Records list: Invalid, Expected records: (0 200)
00000000051E    9F0000014D0110  actor   1       "PENELOPE"      "GUINESS"       "2006-02-15 04:34:33"
00000000051E    9F0000014D011A  actor   2       "NICK"  "WAHLBERG"      "2006-02-15 04:34:33"
00000000051E    9F0000014D0124  actor   3       "ED"    "CHASE" "2006-02-15 04:34:33"
00000000051E    9F0000014D012E  actor   4       "JENNIFER"      "DAVIS" "2006-02-15 04:34:33"
00000000051E    9F0000014D0138  actor   5       "JOHNNY"        "LOLLOBRIGIDA"  "2006-02-15 04:34:33"
00000000051E    9F000001414141  actor   6       "AAAAA" "AAAAAAAAA"     "2004-09-10 01:53:05"
00000000051E    9F0000014D016A  actor   10      "CHRISTIAN"     "GABLE" "2006-02-15 04:34:33"
00000000051E    9F0000014D0174  actor   11      "ZERO"  "CAGE"  "2006-02-15 04:34:33"
00000000051E    9F0000014D017E  actor   12      "KARL"  "BERRY" "2006-02-15 04:34:33"

You see, the record for “Penelope Guiness” is already here. The only remaining issue is the invalid record 6-“AAAAA”-“AAAAAAAAA”. This happens because the garbage record happened to have an actor_id of 6, which passes the filter. Ideally the dump shouldn’t contain any junk records, so you may try to add more filters on other fields. Or, we can delete such records from the database manually later.
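
A sketch of such a manual cleanup, assuming the only junk row that slipped through is the “AAAAA” record that reused actor_id 6 (verify before deleting; the real actor 6 has to come from another source or be accepted as lost):

mysql> DELETE FROM sakila.actor WHERE actor_id = 6 AND first_name = 'AAAAA';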

DROP The Corrupted Table And Create A New One

As soon as we have dumps of all the affected tables, we need a working MySQL instance to load them into.

If it’s a single table corruption it makes sense to try innodb_force_recovery=6 to DROP the table.

If MySQL can’t even start, try moving the corrupt actor.ibd elsewhere. In recovery mode, after DROP TABLE actor, MySQL removes the record from the dictionary. Remove actor.frm if it still remains.

The point is to get a clean, up-and-running MySQL, ready to import the table dump.
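
A sketch of that cleanup for a single corrupt table (paths, service names and the exact order of steps depend on your system and MySQL version):

# set innodb_force_recovery in my.cnf, then
service mysql stop
mv /var/lib/mysql/sakila/actor.ibd /root/corrupt/    # keep the corrupt file, just in case
service mysql start
mysql -e "DROP TABLE sakila.actor"
rm -f /var/lib/mysql/sakila/actor.frm                # only if it still remains
# remove innodb_force_recovery from my.cnf and restart MySQL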

Once MySQL is ready, create an empty actor table:

mysql> CREATE TABLE `actor` (
    ->     `actor_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
    ->     `first_name` varchar(45) NOT NULL,
    ->     `last_name` varchar(45) NOT NULL,
    ->     `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ->      PRIMARY KEY (`actor_id`),
    ->      KEY `idx_actor_last_name` (`last_name`)
    ->      ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
Query OK, 0 rows affected (0.01 sec)

Load Records Back Into MySQL

Then, we load information from the recovered dump:

# mysql --local-infile -uroot -p$mypass
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> use sakila;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
 
Database changed
mysql> source dumps/default/actor_load.sql
Query OK, 0 rows affected (0.00 sec)
 
Query OK, 199 rows affected, 1 warning (0.00 sec)
Records: 198  Deleted: 1  Skipped: 0  Warnings: 1

The final step is to find out how much data we’ve lost due to the corruption. c_parser provides counts of expected and actually found records. At the beginning of each page it gives the number of expected records:

-- Page id: 3, Format: COMPACT, Records list: Invalid, Expected records: (0 200)

This means 200 records are expected, but the list of records is broken (thus Records list: Invalid).

At the end of each page it gives a summary of what was actually found.

-- Page id: 3, Found records: 197, Lost records: YES, Leaf page: YES
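
In this example 197 records were found where 200 were expected. To check the loss quickly, you can pull these summary lines out of the dump, for example (a sketch; the summaries are written to the dump file together with the records, like the header line above):

# grep -e "Expected records" -e "Found records" dumps/default/actor
-- Page id: 3, Format: COMPACT, Records list: Invalid, Expected records: (0 200)
-- Page id: 3, Found records: 197, Lost records: YES, Leaf page: YES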

Show Differences Found by pt-table-checksum

First of all, pt-table-checksum is perfect for checking whether the master and its slaves are in sync. It answers the question “Are the slaves consistent with the master?”. However, if they’re not, pt-table-checksum doesn’t actually tell you what exactly is wrong.

master# pt-table-checksum -q
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
09-03T22:21:10 0 1 2 1 0 0.013 mysql.proxies_priv
09-03T22:21:10 0 1 9 1 0 0.010 mysql.user

From pt-table-checksum to pt-table-sync

Nevertheless, pt-table-sync may give you a partial answer. It can print SQL statements to sync the replication cluster. Thus, reading the SQL code, you may guess which records were missing, extra, or different.

master# pt-table-sync --print --replicate percona.checksums localhost
DELETE FROM `mysql`.`proxies_priv` WHERE `host`='slave.box' AND `user`='root' AND `proxied_host`='' AND `proxied_user`='' LIMIT 1 /*percona-toolkit src_db:mysql src_tbl:proxies_priv src_dsn:h=localhost dst_db:mysql dst_tbl:proxies_priv dst_dsn:h=ip-192-168-35-251.ec2.internal lock:1 transaction:0 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:21261 user:root host:master.box*/;
REPLACE INTO `mysql`.`proxies_priv`(`host`, `user`, `proxied_host`, `proxied_user`, `with_grant`, `grantor`, `timestamp`) VALUES ('localhost', 'root', '', '', '1', '', '2016-09-03 20:02:28') /*percona-toolkit src_db:mysql src_tbl:proxies_priv src_dsn:h=localhost dst_db:mysql dst_tbl:proxies_priv dst_dsn:h=ip-192-168-35-251.ec2.internal lock:1 transaction:0 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:21261 user:root host:master.box*/;
REPLACE INTO `mysql`.`proxies_priv`(`host`, `user`, `proxied_host`, `proxied_user`, `with_grant`, `grantor`, `timestamp`) VALUES ('master.box', 'root', '', '', '1', '', '2016-09-03 20:02:28') /*percona-toolkit src_db:mysql src_tbl:proxies_priv src_dsn:h=localhost dst_db:mysql dst_tbl:proxies_priv dst_dsn:h=ip-192-168-35-251.ec2.internal lock:1 transaction:0 changing_src:percona.checksums replicate:percona.checksums bidirectional:0 pid:21261 user:root host:master.box*/;
...

This is not very convenient. To be fair, pt-table-sync serves its main purpose: syncing two or more servers. Showing you the inconsistencies between two servers was never its goal. Yet that is something you’d like to know.

Using TwinDB Table Compare

TwinDB Table Compare is a tool that shows the exact difference between a slave and its master. It’s published on PyPI, so you can install it right away:

slave# pip install twindb-table-compare

Old versions of twindb-table-compare used MySQLdb, which depended on libmysqlclient. Not anymore: now it uses the pure-Python implementation PyMySQL. A little convenience for users.

TwinDB Table Compare works side by side with pt-table-checksum. As you know, pt-table-checksum runs against the master and populates the percona.checksums table.

twindb-table-compare reads the mismatched chunks from percona.checksums, then runs a SELECT against both the master and the slave, compares the results, and shows them to the user as a diff.

Let’s see twindb-table-compare in action. To illustrate how it works, I intentionally deleted a record directly from the slave.

slave> delete from t1 where id = 393188;

Then I ran pt-table-checksum on the master, and it found an inconsistency. To check whether there are any mismatched chunks, you can run a query against percona.checksums.

slave# mysql -e "SELECT * FROM percona.checksums WHERE (this_cnt<>master_cnt OR this_crc<>master_crc) AND db='test'\G"
*************************** 1. row ***************************
            db: test
           tbl: t1
         chunk: 3
    chunk_time: 0.202156
   chunk_index: PRIMARY
lower_boundary: 319858
upper_boundary: 393197
      this_crc: 376517ca
      this_cnt: 73339
    master_crc: 0
    master_cnt: 73340
            ts: 2020-09-13 22:31:28

Now that we know there is an inconsistency, we need to find out what exactly it is.

slave# twindb-table-compare --user dba --password qwerty
...
2020-09-13 22:33:47,642: INFO: compare.get_inconsistencies():543: Executing: SELECT chunk FROM percona.checksums WHERE (this_crc<>master_crc OR this_cnt<>master_cnt) AND db='test' AND tbl='t1'
2020-09-13 22:33:47,643: INFO: compare.get_inconsistencies():554: Found 1 inconsistent chunk
2020-09-13 22:33:47,643: INFO: compare.build_chunk_query():360: # test.t1, chunk 3
2020-09-13 22:33:47,643: INFO: compare.get_chunk_index():63: Executing SELECT chunk_index FROM percona.checksums WHERE db='test' AND tbl='t1' AND chunk = 3
2020-09-13 22:33:47,700: INFO: compare.build_chunk_query():362: # chunk index: PRIMARY
2020-09-13 22:33:47,700: INFO: compare.get_index_fields():86: Executing SELECT COLUMN_NAME FROM information_schema.STATISTICS WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1' AND INDEX_NAME='PRIMARY' ORDER BY SEQ_IN_INDEX
2020-09-13 22:33:47,702: INFO: compare.get_boundary():129: Executing SELECT lower_boundary, upper_boundary FROM percona.checksums WHERE db='test' AND tbl='t1' AND chunk = 3
2020-09-13 22:33:47,705: INFO: compare.print_horizontal():402: Executing: SELECT id, name FROM test.t1 USE INDEX (PRIMARY) WHERE ( 0 OR ( 1 AND id >= '319858' ) ) AND ( 0 OR ( 1 AND id <= '393197' ) )
2020-09-13 22:33:52,175: INFO: compare.print_horizontal():424: Executing: SELECT id, name FROM test.t1 USE INDEX (PRIMARY) WHERE ( 0 OR ( 1 AND id >= '319858' ) ) AND ( 0 OR ( 1 AND id <= '393197' ) )
2020-09-13 22:33:56,944: INFO: compare.get_inconsistencies():566: Differences between slave localhost and its master:
@@ -73328,7 +73328,6 @@
393185 40.79178826957176
393186 40.79178826957176
393187 40.79178826957176
-393188 40.79178826957176
393189 40.79178826957176
393190 40.79178826957176
393191 40.79178826957176

From the output we see that the slave is missing the record with id 393188, the one I deleted.

Efficient Use Of Indexes In MySQL

These are the slides of the “Efficient Use Of Indexes In MySQL” talk we delivered at the SFMySQL Meetup.

This is an introductory talk for developers on MySQL indexes. In my opinion, it’s quite important to understand how InnoDB organizes data. If you know how MySQL accesses data, it’s easier to write optimal queries.

When working with queries, I imagine a secondary index as a table with records sorted by the secondary key fields. This is a powerful concept that helps to understand MySQL’s logic. It also makes it easy to understand complex optimizations like the loose index scan.

For example, for index (last_name, rank) the secondary index table looks like:
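
(a sketch with made-up rows; assume the table also has PRIMARY KEY (id))

last_name | rank | id
----------+------+-----
Berry     |   12 | 104
Cage      |    3 |  57
Cage      |    8 |  12
Chase     |    1 | 201

The entries are sorted by (last_name, rank), and the primary key value is stored with every entry, so InnoDB can always go back to the clustered PRIMARY index for the full row.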

Enjoy the slides!

The talk has been recorded. Unfortunately, the sound is very quiet. Perhaps a webinar on the topic will make up for it. Let me know in the comments below if you’d be interested!

InnoDB Dictionary

Why Does InnoDB Need A Dictionary

An InnoDB dictionary is a set of internal tables InnoDB uses to maintain various information about user tables. It serves as an API between a human and the database: while humans refer to tables by their names, InnoDB works with integer identifiers. The dictionary stores the correspondence between table names and internal identifiers such as table_id and index_id.

The dictionary tables are normal InnoDB tables, but they’re not visible to the user. However, some versions of MySQL provide read-only access to the dictionary in the information_schema database.

The dictionary is stored in ibdata1. The root page of SYS_TABLES, for example, has page id 8, so it sits eight pages from the beginning of ibdata1 (page ids start at zero).
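
With the default 16 KiB page size that is byte offset 8 × 16384 = 131072, so you can peek at the page directly. A sketch:

# page size is 16384 bytes; page id 8 => offset 8 * 16384 = 131072
dd if=/var/lib/mysql/ibdata1 bs=16384 skip=8 count=1 2>/dev/null | xxd | head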

The dictionary pages are in the REDUNDANT format even if you use MySQL 5.6. I hope to write more about record formats in future posts. For now, it’s enough to mention that REDUNDANT is the oldest record format: it has been available since MySQL 4.0 and was the only format at that time.

SYS_TABLES

CREATE TABLE `SYS_TABLES` (
    `NAME` varchar(255) NOT NULL DEFAULT '',
    `ID` bigint(20) unsigned NOT NULL DEFAULT '0',
    `N_COLS` int(10) DEFAULT NULL,
    `TYPE` int(10) unsigned DEFAULT NULL,
    `MIX_ID` bigint(20) unsigned DEFAULT NULL,
    `MIX_LEN` int(10) unsigned DEFAULT NULL,
    `CLUSTER_NAME` varchar(255) DEFAULT NULL,
    `SPACE` int(10) unsigned DEFAULT NULL,
    PRIMARY KEY (`NAME`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1; 

All dictionary tables have predefined, hard-coded index_ids. No surprise, since InnoDB doesn’t have a dictionary for the dictionary :). SYS_TABLES has index_id 0-1. The root page of SYS_TABLES is page id 8.

NAME is the human readable table name. Well, the database name and the table name. For example, for table sakila.actor NAME will be “sakila/actor”.

ID is the table identifier. Just an integer number. The table is referred to by this table_id in other dictionary tables.

N_COLS is the number of columns in the table.

I will pretend TYPE, MIX_ID, MIX_LEN and CLUSTER_NAME are not important. Actually, I have no idea what these fields are. Anyway, they were never used for data recovery, so let’s leave them for future investigations.

SPACE is an identifier of a tablespace. Thus ibdata1 is SPACE 0, ibdata2 is SPACE 1, and every .ibd file has its own SPACE identifier. SPACE tells in which tablespace the table’s indexes are stored.

Sample record for table sakila.actor:

# ./c_parser \
        -4f pages-1402815586/FIL_PAGE_INDEX/0-1 \
        | grep sakila/actor
...
"sakila/actor" 61 4 1 0 80 "" 47

“sakila/actor” is the table name obviously. 61 is the table id. 4 is the number of fields in the table:

CREATE TABLE `actor` (
  `actor_id` smallint(5) unsigned NOT NULL 
             AUTO_INCREMENT,
  `first_name` varchar(45) NOT NULL,
  `last_name` varchar(45) NOT NULL,
  `last_update` timestamp NOT NULL 
                DEFAULT CURRENT_TIMESTAMP
                ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`actor_id`),
  KEY `idx_actor_last_name` (`last_name`)
) ENGINE=InnoDB 
  AUTO_INCREMENT=201 
  DEFAULT CHARSET=utf8 

47 is the SPACE identifier. It would be 0 if innodb_file_per_table were disabled. It’s not disabled in my environment:

 mysql> show variables like 'innodb_file_per_table';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_file_per_table | ON    |
+-----------------------+-------+
1 row in set (0.00 sec) 

SYS_INDEXES

Another important dictionary table is SYS_INDEXES. It stores information about InnoDB indexes.

Its index_id is 0-3 and the root page is stored in page 11.

CREATE TABLE `SYS_INDEXES` (
  `TABLE_ID` bigint(20) unsigned NOT NULL 
             DEFAULT '0',
  `ID` bigint(20) unsigned NOT NULL DEFAULT '0',
  `NAME` varchar(120) DEFAULT NULL,
  `N_FIELDS` int(10) unsigned DEFAULT NULL,
  `TYPE` int(10) unsigned DEFAULT NULL,
  `SPACE` int(10) unsigned DEFAULT NULL,
  `PAGE_NO` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`TABLE_ID`,`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Fields of the table:

TABLE_ID is the table identifier; it’s the one from SYS_TABLES.

ID is the index identifier. This is a very important field: every InnoDB page carries, in its header, the index_id of the index it belongs to. So, if we want to recover a table, we need to find all pages that belong to the PRIMARY index of the table, which means we need all pages with a specific index_id.
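
This is exactly why stream_parser sorts pages into per-index files: in the recovery example above, the pages of actor.ibd ended up grouped by index_id under FIL_PAGE_INDEX. A sketch of what that looks like (15 was the PRIMARY index and 16 was idx_actor_last_name in that run):

# ls pages-actor.ibd/FIL_PAGE_INDEX/
0000000000000015.page  0000000000000016.page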

NAME is the name of an index. Table sakila.actor has two indexes: PRIMARY and idx_actor_last_name. So, these are names in SYS_INDEXES.

N_FIELDS is the number of fields in an index. idx_actor_last_name has one field, and so it is in SYS_INDEXES.

TYPE is not an important field 🙂

SPACE is an InnoDB tablespace identifier showing where the index is stored. Probably, InnoDB developers foresaw the possibility of storing indexes in different files, but for current MySQL versions it’s the same as in SYS_TABLES.

PAGE_NO is the identifier of the root page of the index. As you know, InnoDB stores indexes in a B+tree data structure, so the root node of the B+tree is the root page identified by this field.

SYS_COLUMNS

SYS_COLUMNS describes individual fields in a table. It is stored as index_id 0-2.

 CREATE TABLE `SYS_COLUMNS` (
  `TABLE_ID` bigint(20) unsigned NOT NULL,
  `POS` int(10) unsigned NOT NULL,
  `NAME` varchar(255) DEFAULT NULL,
  `MTYPE` int(10) unsigned DEFAULT NULL,
  `PRTYPE` int(10) unsigned DEFAULT NULL,
  `LEN` int(10) unsigned DEFAULT NULL,
  `PREC` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`TABLE_ID`,`POS`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1; 

TABLE_ID is the table identifier; it comes from SYS_TABLES.

POS is a relative position of the field in the table.

NAME is the name of the field.

MTYPE and PRTYPE encode the MySQL type of the field. InnoDB started its history as a separate product, not related to MySQL, and these two fields are a legacy from that time. There’s no strict one-to-one correspondence between MySQL types and InnoDB types; InnoDB uses its own type that suits a particular MySQL type best. The character set and collation are stored here as well.

LEN is the length of the field.

PREC is precision for some MySQL types.

Isn’t it interesting to see an example of this table for sakila.actor? Here it goes. Remember, the table_id is 61? We know that from SYS_TABLES.

61      0       "actor\_id"     6       1794    2       0
61      1       "first\_name"   12      2162959 135     0
61      2       "last\_name"    12      2162959 135     0
61      3       "last\_update"  3       525575  4       0 

As we see, smallint is two bytes and timestamp is four bytes. And varchar(45) is 135 bytes. This is the maximum size, not the actual size of a field, which differs from record to record. But why 135 and not 45? Because the field is utf8, where each character may take up to three bytes, and 3 × 45 adds up to 135.

SYS_FIELDS

The last table for today (but not least) is SYS_FIELDS. It’s stored in index_id 0-4 and lists the individual fields of all indexes.

 CREATE TABLE `SYS_FIELDS` (
  `INDEX_ID` bigint(20) unsigned NOT NULL,
  `POS` int(10) unsigned NOT NULL,
  `COL_NAME` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`INDEX_ID`,`POS`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1; 

The table is quite simple.

INDEX_ID is the index identifier that we can find in SYS_INDEXES.

POS is the position of the field in the index.

And COL_NAME is the field name.

Why Do We Care?

Why do we need an InnoDB dictionary anyway?

There are two reasons. First, if we need to recover table A, we need to know the index_id of its PRIMARY index, and the easiest way to learn it is from the InnoDB dictionary.

Second, as you see, the InnoDB dictionary stores information about field names and types. This means that even if we don’t know the table structure, we can recover it from the dictionary.
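
This is essentially what sys_parser automates. Conceptually it boils down to queries like the following sketch against the recovered dictionary tables (assuming they were loaded into sakila_recovered as in the first post):

-- list the columns of sakila.actor in order, with their raw InnoDB type codes
SELECT c.POS, c.NAME, c.MTYPE, c.PRTYPE, c.LEN
FROM sakila_recovered.SYS_TABLES t
JOIN sakila_recovered.SYS_COLUMNS c ON c.TABLE_ID = t.ID
WHERE t.NAME = 'sakila/actor'
ORDER BY c.POS;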