Galera setup troubleshooting

In a previous post on Galera, I mentioned it’s important to look at both the joiner and donor logs to get a full picture of what problems you may be running into. Sometimes even that is not enough and you’ll need to spend time narrowing down the issue. Keep in mind that Galera is relatively young, so its documentation, error messages and online help are not comparable to MySQL’s.

We had a situation where we were building a staging environment that duplicated our production Galera setup, yet we ran into problems we had not encountered before. Two nodes (one an arbitrator) were online, and we were attempting to join a third node and have xtrabackup transfer the data over. The node would initiate xtrabackup and appear to be working, with both nodes participating in the operation, but then quit after a few moments. Typically you might run into permissions issues, wrong file locations and the like with xtrabackup, and the logs, either /var/log/mysqld.log on the donor or the accompanying /var/lib/mysql/innobackup.backup.log, will tell you what the problem is. In this case, however, we were getting no obvious error.

The joiner log simply quit with:

SST failed: 32 (Broken pipe)

The donor had a little more information:

WSREP_SST: [ERROR] innobackupex finished with error: 2.  Check /var/lib/mysql//innobackup.backup.log (20130422 11:23:08.361)

[ERROR] WSREP: Failed to read from: Process completed with error: wsrep_sst_xtrabackup --role 'donor' --address '' --auth 'xxxxx:xxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0'

Process completed with error: wsrep_sst_xtrabackup --role 'donor' --address '' --auth 'xxxxx:xxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0': 22 (Invalid argument)

In retrospect this did subtly point at the issue, but the pointer to innobackup.backup.log distracted from the real problem. That log again showed only a broken pipe (error 32), which doesn’t really tell you anything.

So we started testing: port, database and file permissions; swapping which node was the donor and which the joiner, to see whether the issue occurred in both directions; comparing against our production environment; and so on. MySQL seemed to be fine, and xtrabackup seemed to work correctly.

The test that made it obvious was switching the SST method to mysqldump, which failed in the same way but with much clearer error messages:

ERROR 2003 (HY000): Can't connect to MySQL server on '' (4)

130422 11:24:11 [ERROR] WSREP: Process completed with error: wsrep_sst_mysqldump --user 'root' --password 'xxxxxxxx' --host '' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0': 1 (Operation not permitted)

130422 11:24:11 [ERROR] WSREP: Try 1/3: 'wsrep_sst_mysqldump --user 'root' --password 'xxxxxxx' --host '' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0'' failed: 1 (Operation not permitted)

hostname: Unknown host

hostname: Unknown host

The ‘invalid argument’ in the first set of errors was referring to an unrecognized host (note the empty --address and --host values in the commands above). The staging servers were using internal IPs, which worked fine as far as Galera itself was concerned. However, SST (xtrabackup/mysqldump) does not directly use the Galera node IPs specified in ‘wsrep_cluster_address’. Instead, a lookup is done, and SST subsequently used the public IPs, which were not open.

The solution is simple: you can use the variable,

in your my.cnf to explicitly specify which IP SST should use.
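
Assuming the variable meant here is wsrep_sst_receive_address (the Galera option that sets the address a joining node advertises for receiving a state transfer), a minimal sketch of the joiner's my.cnf might look like the following. The IP 10.0.0.13 is a hypothetical internal address, and 4444 is the conventional SST port for xtrabackup; both are placeholders rather than values from our setup.

```ini
[mysqld]
# Assumption: wsrep_sst_receive_address is the variable referred to above.
# 10.0.0.13 is a hypothetical internal IP of this (joining) node.
# 4444 is the port commonly used for xtrabackup SST; with mysqldump the
# donor connects to the joiner's MySQL port instead (3306 in the logs above).
wsrep_sst_receive_address=10.0.0.13:4444
```

With this set on the joiner, the donor should be handed the internal address explicitly rather than whatever the automatic lookup returns, so the --address and --host arguments in the SST command line should no longer come through empty.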