Boo! a frightful hour in production, ghost of BIGINT

Well, we had quite an hour of uncertainty this morning. At 7:30am, the behaviour of one of our data servers changed radically, with no obvious reason in the process list, error log, etc. We’ve done a number of things over the past few days to ramp up for a version launch of one of our top games, including new code, new data structures, etc., along with a change in hardware for the memcached instances.

Processes were slowly building up, yet the load was low, and at least some queries certainly were processing. In retrospect I regret not looking at show engine innodb status, as that may have made it much more apparent what was going on; however, considering those other factors, I first looked for explanations to do with the recent changes in code, the memcached changes, etc.

The head developer mentioned he had looked at one of our big tables and it had 2.1 billion rows. However, we had already anticipated the datatype limitation of INT and gone through the exercise of changing the primary key column to BIGINT, so it seemed unlikely this was the cause. Still, the number bothered me, so I took a look at exactly what it was, and sure enough the table passing the threshold of a signed INT (2147483647) coincided exactly with whatever problem we were having.

Looking at the processlist, sure enough there were many queries acting on the id 2147483647 (about 10% of the total). So what could be the problem? We use a lot of stored procedures to minimize network traffic, and the developer then remembered that in the stored procedure declarations the primary key parameter was defined as INT, not BIGINT. All incoming queries (and new users) were getting a value truncated to the max of 2147483647 by the stored procedure, not the table. Users created before 7:30am were processing normally, albeit much more slowly; anyone signing up after 7:30am was hammering that same row, hence the slow query build-up and, of course, the contention due to the hotspot.
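
To make the failure mode concrete, here’s a minimal sketch with hypothetical table and procedure names: the table was altered to BIGINT, but the procedure parameter stayed INT,

CREATE TABLE player_data (
  id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  payload VARCHAR(255)
) ENGINE=InnoDB;

DELIMITER //
CREATE PROCEDURE get_player_data(IN p_id INT)  -- should have been BIGINT UNSIGNED
BEGIN
  SELECT payload FROM player_data WHERE id = p_id;
END//
DELIMITER ;

-- With a non-strict sql_mode, CALL get_player_data(2147483648) silently clips the
-- argument to 2147483647, so every id past the INT limit hits that same row.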

What’s scary is the dumb luck of discovering this sooner rather than much later; I never would have looked at the row count, thinking it was a problem already solved. I’d like to think an eventual show engine innodb status would have revealed the row-level locking on that id. Apologies to the user assigned id 2147483647 for their game data, I can’t imagine what the game experience must have been like for them.

Galera gotchas

We’ve recently implemented Galera clustering and have been pleased with the relatively easy install and implementation. A quick description of Galera: it joins individual mysql dbs as nodes to create a cluster featuring multi-threaded synchronous replication, which allows for true high availability while still using your original db and engine (innodb); likewise, you can quickly break the cluster back down into individual servers if need be. Our specific setup includes two dbs and an arbitrator to avoid ‘split-brain’ in case of a node failure. Adding a node simply involves connecting to the group, and the data transfer and sync are done automatically via SST (state snapshot transfer), with the typical overhead associated with the familiar backup methods: mysqldump, rsync and, more recently, xtrabackup. With xtrabackup you truly have a non-blocking method to add and re-add nodes. Recent improvements also include IST (incremental state transfer), which allows you to disconnect a node, do work, reconnect, and quickly catch up on the transactions that node missed.
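
For orientation, here is a minimal sketch of the kind of wsrep settings involved in my.cnf; the provider path, cluster name, IPs and SST credentials below are placeholders, not our actual values,

# Galera/wsrep section of my.cnf (values are placeholders)
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name=my_cluster
wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2
wsrep_sst_method=xtrabackup
wsrep_sst_auth=sstuser:sstpassword
# Galera requires row-based binlogging and interleaved auto-increment locking
binlog_format=ROW
innodb_autoinc_lock_mode=2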

As mentioned, the install and implementation have been quite smooth; however, here are a few things to keep in mind,

1. On installation, when running into errors, it’s important to analyze both the joiner AND the donor error logs, as they will of course have differing messages. We ran into what ended up being an xtrabackup issue, which was misleading in the joiner logs but clear as day in the donor logs.

2. On initial installation, you’ll be told to set wsrep_cluster_address=gcomm:// for your first node, as there is nothing to join to. However, DON’T keep this in your my.cnf: on restart, you’ll end up creating a new cluster (again), not joining the one you’ve made. Change it to specify the IP of one of the other nodes.

3. Similar to multi-master replication, Galera will offset the auto-increment values by the number of nodes. This is automatic; however, keep it in mind regarding large tables and datatype limits (a quick illustration follows).
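
A rough sketch of what to expect with two data nodes and the default wsrep_auto_increment_control=ON (the exact values depend on your cluster),

SHOW VARIABLES LIKE 'auto_increment%';
-- node1: auto_increment_increment=2, auto_increment_offset=1  -> ids 1, 3, 5, ...
-- node2: auto_increment_increment=2, auto_increment_offset=2  -> ids 2, 4, 6, ...
-- ids are consumed faster and interleaved, so datatype headroom shrinks accordingly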

4. You may be surprised to learn that, without some care, some fundamental operations will lock the entire cluster for the duration of the operation. Here are two examples, by no means the only statements that can be long-running and cause grief,

  • An alter table locks the entire cluster, even on a table that is not in use
  • A load data infile also locks the entire cluster even on a table that is not in use

The first is partially due to how Galera handles DDL statements: total order isolation (TOI) is the default. More info can be found here,

http://www.codership.com/wiki/doku.php?id=rolling_schema_upgrade

The fact that it affects all tables and dbs is a bug / feature; more details here,

https://bugs.launchpad.net/galera/+bug/928919

I assume the load data infile lock is due to the synchronous requirements of the cluster waiting to receive confirmation of the commit on the second node.

You have a few options to avoid a cluster-wide lock,

a) for the alter table scenario, as detailed at the link above, you can use a DDL handling method known as rolling schema upgrade (RSU), which automatically detaches the node for the duration of the operation, then resynchronizes when finished (see the sketch after these options).

b) for both the load data infile and the alter table, you can do a more manual version of this by simply disconnecting each node, performing the operation, and reconnecting.

c) A third option is to tell the node to apply the statement locally only, by turning replication off for your session,

SET wsrep_on=0;
-- run your statement here (ALTER TABLE, LOAD DATA INFILE, etc.)
SET wsrep_on=1;

All of these methods have to be performed on every node, and particularly with ALTER, you’ll need to consider whether temporarily inconsistent schemas across the nodes will cause replication problems.
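
To expand on option (a): the DDL handling method is controlled by the wsrep_OSU_method variable, so a rolling schema upgrade looks roughly like the following, run on each node in turn (the table and column here are just examples),

SET GLOBAL wsrep_OSU_method='RSU';        -- detach this node for DDL
ALTER TABLE events ADD COLUMN score INT;  -- applied locally only
SET GLOBAL wsrep_OSU_method='TOI';        -- back to the default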

5. As noted in the limitations section,

http://www.codership.com/wiki/doku.php?id=limitations

Only innodb is supported (myisam support is starting to become experimental). When dealing with the mysql permissions db (whose tables are myisam), use ‘grant’ and ‘create’ instead of ‘insert’ if you want the commands to replicate.
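
For example (hypothetical user and db names), the first two statements replicate cluster-wide, while the direct DML against the myisam grant table stays on the local node,

-- replicates: account statements go through Galera
CREATE USER 'app_user'@'10.0.0.%' IDENTIFIED BY 'secret';
GRANT SELECT, INSERT, UPDATE ON gamedb.* TO 'app_user'@'10.0.0.%';

-- does NOT replicate: direct DML on the myisam mysql.user table is local only
UPDATE mysql.user SET Password = PASSWORD('secret2') WHERE User = 'app_user';
FLUSH PRIVILEGES;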

6. A caveat and further explanation to the last point: for tables other than innodb, DML statements are not replicated, but DDL statements are, because DDL is passed on using a different method than DML. This difference has implications that can cause some confusion. On our two nodes, we were mystified as to why the dbs would replicate DDL statements in both directions (node1 <–> node2), yet would not replicate DML statements except node1 –> node2. Since the DDL was replicating both ways, we ruled out a configuration restriction issue, which was wrong: the eventual cause was a replicate-wild-do-table= specification in my.cnf for a particular db, while we were using the ‘test’ db for the test. The setting would not allow DML replication, yet Galera still replicated the DDL (create and alter) of the table in the ‘test’ db.
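
In other words, a filter like the following (the db name is just an example) blocks DML in other schemas from replicating, while Galera’s DDL path ignores it,

# my.cnf replication filter
replicate-wild-do-table=gamedb.%
# result: DML against test.* (e.g. INSERT INTO test.t1 ...) is filtered out and
# does not replicate, yet DDL against test.* (CREATE/ALTER) still comes through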

7. It may take some time for a node to become fully operational; for instance, you might not be able to log in at first, or you’ve logged in, issued a command, and ‘command not found’ is returned. Just give it a little time and/or check the error logs to confirm it’s on track.

mitigating the pain of mysql restarts

Every now and then you’ll have no choice but to restart the mysql server, whether it be for editing server variables, upgrading, etc. When the server is shut down, among the operations performed, it commits active transactions and flushes all dirty pages (modified data in memory) to disk; on restart, the buffers start out empty and data is re-cached as needed. On a busy server, this set of events is not trivial: you’ll find it can take quite a bit of time for the pages to flush, and there’s increased load on the server as it re-populates the cache with its working data set. There are several things you can do to help with this process.

Be sure there are no active transactions

The server can’t shut down cleanly while there is a running transaction, so do a ‘show processlist’ to be sure there are none.

Flush pages before server shutdown

Have you ever shut down the server, watching the dots add up on the command line, wondering what it’s doing and when it’s actually going to stop? Most of that time is taken by the flushing of dirty pages. Innodb keeps a certain amount of dirty pages in memory, loosely capped by,

innodb_max_dirty_pages_pct

The default as of 5.5.24 is 75 (%). The reason it keeps a certain percentage of dirty pages is to group data together and rearrange it in sequential order so it can be written to disk in one go, resulting in much less I/O. You can change this setting dynamically. To cause innodb to flush as aggressively as possible, you can issue the following from the mysql client,

SET GLOBAL innodb_max_dirty_pages_pct=0;

Keep in mind this will cause additional load on the server, as you’ve now directed innodb to write data out more or less as it’s modified, in real time, losing the benefit of grouped sequential writes. You can monitor the current number of dirty pages with the following,

mysqladmin extended -r -i 1 | grep Innodb_buffer_pool_pages_dirty

You’ll see the number of dirty pages drop until it levels out (you’ll never hit 0 on a busy server). You are now at the minimum number of dirty pages and can issue your restart command.
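
Putting it together, the pre-restart sequence from the mysql client looks roughly like this (the status check is simply the SQL equivalent of the mysqladmin loop above),

SHOW PROCESSLIST;                                 -- confirm no long-running transactions
SET GLOBAL innodb_max_dirty_pages_pct=0;          -- start flushing aggressively
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';  -- repeat until it levels out
-- then issue the restart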

repopulating the buffer pool

If you are worried about the extra load on the server when restarting, due to all pages needing to be read from disk instead of memory, you can quickly repopulate the cache if you have Percona’s flavour of mysql. No reason to repeat the information here; go to,

http://www.percona.com/doc/percona-server/5.5/management/innodb_lru_dump_restore.html?id=percona-server:features:innodb_lru_dump_restore

warehousing and data analytics in a mysql world

Mysql in its early days had the reputation of being simple and fast but lacking critical features for many use cases. Over the years, it’s become more stable and mature, with a ton of advancements in the innodb storage engine and features such as stored code, replication improvements, query planner improvements, etc. One area mysql has not tackled is BI and analytic OLAP; I can only assume it’s not been a priority with the user base to date. Oracle and SQL Server cover a different market and have these options; however, in this new era of big data, a wider array of businesses want to collect and analyze data such as performance statistics to improve their business offering. Fortunately, there are a few 3rd party vendors who have taken on this area with the spirit of open source and/or a tiered pay structure in mind, as you’ll find it can be very costly otherwise.

In our search to analyze the ever increasing volume of statistical data we were collecting, we considered the following products,

  • infobright
  • infinidb
  • luciddb
  • monetdb
  • fastbit
  • metakit
  • greenplum
  • hive

Almost all of these solutions implement a columnar architecture in one way or another: essentially, rows become columns and columns become rows. This is done to take advantage of the layout of the data on disk and allow sequential reads (vs. random reads) of each long stored row (which was a column), which is what happens much of the time with OLAP-type queries, aggregating over a dataset much larger than memory.

Back to the products: a number of them we rejected for one reason or another, generally because they were either developed more for the scientific community (metakit, fastbit), geared towards another open source db such as postgresql (greenplum), or focused on distributed systems (hive), which is a bit of overkill for us at this point.

so we were left with,

  • infobright
  • infinidb
  • luciddb
  • monetdb

infobright and infinidb each have a free and a paid version (the free versions both have limitations); luciddb and monetdb are open source.

our priorities in order were as follows,

– high compression
– ease of use (ideally a mysql client and connection)
– full DDL, DML statement use

There is a very good benchmark comparison done by Percona on these 4 dbs, http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/

Keep in mind it’s a couple of years old and run on a specific set of data, so it’s definitely worth testing yourself for your particular needs.

So why do we care about compression? The benefits are two-fold. First, significant compression obviously saves space, one of the premiums of an OLAP system: the better the compression, the longer you can put off sharding (a distributed system). Typical compression on the full products ranges from a very impressive 10:1 to 40:1, so at 10:1 you can essentially fit 10TB of data on a typical 1TB disk. Second, the compression means the disk arm only travels a fraction of the distance for a sequential read, making the operation faster.

Right off the bat, infinidb offers only a very modest 22% compression reduction, which is the major limitation of their free version (paid, you get 40:1 compression). With monetdb it’s not clear there is any meaningful compression at all; our tests matched their non-committal statements about the product’s compression.

Infobright and LucidDB both offer in the neighborhood of 10:1 compression, very nice for a free product.

Note: if we were in the market for a paid solution, we would take a hard look at InfiniDB; along with the 40:1 compression, it is a mysql build with their proprietary storage engine incorporated, making it an easy transition from stock mysql.

So we are down to,

  • Infobright
  • LucidDB

The two have very different remaining pros and cons. Infobright is also a mysql build with a proprietary storage engine incorporated; however, the drawback is the limitations they’ve implemented on the free version, namely that most of the DDL and DML statement functionality has been stripped away. That means you cannot insert, update, delete, truncate, or alter a table. At first glance that may sound insane, but keep in mind that an analytics environment in many cases primarily needs only two things: to load data and to select data, and to do both quickly, which it can do very well, mysql style. However, if you wish to change the table or the data, you’ll need to fully reload the table, and these restrictions may be tough for an environment needing to serve customers. Our requirements, though, are only a daily load appending to the existing data and internal analysis, and we rarely change the table structure on these particular tables, so we are willing to deal with a dump and reload of a table occasionally if need be.

LucidDB is a completely different animal, requiring Java to run, and it uses its own SQL variant and data loading methods (for instance, there is no datetime datatype; your only choice is varchar to keep the data intact). It does have full DDL and DML statement functionality and no other obvious restrictions. With a learning curve we could certainly use this product, but when considering the people who would use this database, such as BAs and developers, it made sense to choose a product that is as consistent as possible with what we are currently using.

So Infobright was the choice, with some workarounds to make it usable. Truncate and rename table can be emulated, and other issues had to be dealt with; for instance, in many cases you cannot simply copy an innodb schema over to a brighthouse schema, as it will complain about keys, foreign keys, auto increment, unsigned datatypes, etc. We wrote a script to parse this stuff out; it’s not even close to being generally usable, just for our specific tables, otherwise I’d include it here. Another example: the utf8 charset is handled badly, and you’ll need to increase your datatype definition size 4x, e.g. varchar(32) -> varchar(128). There are a number of odd things like this, and I’m not so sure I’d be as eager to use it for outside customers, but for our internal purposes it will fit just fine.
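
As a rough illustration of the kind of conversion the script does (a hypothetical table; the exact substitutions depend on your schema), an innodb definition becomes a stripped-down brighthouse one,

-- original innodb table (illustrative)
CREATE TABLE daily_stats (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id BIGINT UNSIGNED NOT NULL,
  event_name VARCHAR(32) NOT NULL,
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id),
  KEY idx_user (user_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- brighthouse version: keys, auto_increment and unsigned removed,
-- utf8 varchars sized 4x as noted above
CREATE TABLE daily_stats (
  id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  event_name VARCHAR(128) NOT NULL,
  created_at DATETIME NOT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=utf8;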

pt-online-schema-change

This is one of many Perl script tools from Percona that make life easier when administering, monitoring and testing mysql; this one solves the need to alter a table (adding a key, a column, partitioning, etc.) on a production system that can’t be taken down. Many of the current Percona tools came from maatkit, developed by Baron Schwartz. This particular one, from my understanding, was initiated by the Facebook development team for the myriad of db servers they administer; however, there seem to be two distinct OSC tools, and the one we are going to talk about is pt-online-schema-change.

It’s very impressive, from a conceptual and safe use point of view. The official description and download page is at http://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-change.html

how it works under the hood

The description above will tell you the basics of how it works: essentially, an empty altered copy of the table is created, then filled by selecting data from the active table in manageable chunks as a precaution against overloading a busy server. Triggers are temporarily created on the active table to keep the altered copy up to date with real-time changes. You may question how a trigger can update a row on the new table that may not be there yet. This is done by using ‘replace’ (an insert, or a delete and insert) for the insert and update statements in the triggers. Inserts are of course inserted, updates become inserts on the new table (or a delete and insert if the row is already there), and deletes simply remove the row from the new table if it exists. To avoid conflicts with the rows being selected and inserted from the active table, an ‘ignore’ flag is used on those bulk inserts so that rows already written by the triggers are left as is. Once the copy is completed, an atomic rename of the two tables is done.
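
Here’s a conceptual sketch of the idea in plain SQL; this is not the tool’s exact statements, and the table, column and trigger names are made up,

-- empty altered copy of the table
CREATE TABLE _events_new LIKE events;
ALTER TABLE _events_new ADD COLUMN score INT;

-- triggers keep the copy in sync while the bulk copy runs
CREATE TRIGGER osc_ins AFTER INSERT ON events FOR EACH ROW
  REPLACE INTO _events_new (id, data) VALUES (NEW.id, NEW.data);
CREATE TRIGGER osc_upd AFTER UPDATE ON events FOR EACH ROW
  REPLACE INTO _events_new (id, data) VALUES (NEW.id, NEW.data);
CREATE TRIGGER osc_del AFTER DELETE ON events FOR EACH ROW
  DELETE IGNORE FROM _events_new WHERE id = OLD.id;

-- bulk copy in primary key chunks; IGNORE leaves rows already written by the triggers
INSERT LOW_PRIORITY IGNORE INTO _events_new (id, data)
  SELECT id, data FROM events WHERE id BETWEEN 1 AND 10000;
-- ...repeat for each chunk...

-- atomic swap; the triggers and old table are then dropped
RENAME TABLE events TO _events_old, _events_new TO events;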

safety first

There are many safeguards to ensure your data doesn’t get screwed up or your server overloaded. Among the options are the ability to control the chunk size of the selects, limits on the maximum number of threads connected and running (from the triggered queries and general load), a dry-run vs. execute option that you have to specify explicitly, a rollback feature if the process is terminated at any point along the line, options to not drop the old table or not swap the tables, and so on. The link above has the specific implementation details.
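
As a rough idea of usage (the database, table, alter and thresholds here are just examples), a dry run followed by the real thing looks something like,

pt-online-schema-change --alter "ADD COLUMN score INT" D=gamedb,t=events \
  --chunk-size 1000 --max-load Threads_running=50 --dry-run

pt-online-schema-change --alter "ADD COLUMN score INT" D=gamedb,t=events \
  --chunk-size 1000 --max-load Threads_running=50 --execute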

things to keep in mind (as of pt 2.1.2)

A few things that I’ve run across are,

1) When dropping foreign keys, you will need to specify the foreign key name as it appears on the altered copy, which will not be the same as on the active table. The tool prefixes a ‘_’ to the name on the altered copy, so you’ll need to do the same when specifying those foreign keys.

2) As per mysql, partitioning does not support foreign keys, and you will not be able to both partition and drop foreign keys in the same operation. You’ll most likely need to run it twice, separately.

3) If you are running bin-logs with statement-based replication and without tx_isolation=READ-COMMITTED or innodb_locks_unsafe_for_binlog, the ‘insert into … select from …’ for each chunk in the copying process will need a read lock on the source rows, and depending on what else is going on, such as backups, this can potentially cause a lock wait timeout, with the process exiting without completing.

4) There can be a momentary stall while the tables are being switched; see below for details.

production results

We successfully used pt-online-schema-change to indirectly alter a 1.7 billion row table handling 50,000 queries per second, with no downtime, taking about 15 hours to complete. There was a momentary stall (a fraction of a second) resulting in a load spike when the tables were switched out, although max_used_connections does not seem to corroborate query build-up as the reason. We’ll take a closer look next time.

Where Partitioning hurts

As a mobile gaming platform provider, we collect a tremendous amount of data; our biggest tables take in upwards of 50 million rows per day, so partitioning is a must for both our statistical collection and especially analytics, where we piece together the entire dataset over time. Partitioning essentially breaks a table up into smaller tables, arranged by a partitioning key, in this case a ‘dayofweek’ or ‘dayofyear’ column. This allows a query to be restricted to only the partitions relevant to it, such as a count of the records for a given day. That’s a big advantage for both reads and writes that would otherwise need to scan the entire index of a table that does not fit into memory.
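
A rough illustration with a made-up table, partitioned by a dayofweek column; EXPLAIN PARTITIONS shows whether the pruning actually kicks in,

CREATE TABLE analytics_events (
  id BIGINT UNSIGNED NOT NULL,
  dayofweek TINYINT NOT NULL,
  event_name VARCHAR(64),
  PRIMARY KEY (id, dayofweek)
) ENGINE=InnoDB
PARTITION BY LIST (dayofweek) (
  PARTITION p1 VALUES IN (1), PARTITION p2 VALUES IN (2),
  PARTITION p3 VALUES IN (3), PARTITION p4 VALUES IN (4),
  PARTITION p5 VALUES IN (5), PARTITION p6 VALUES IN (6),
  PARTITION p7 VALUES IN (7)
);

-- restricted to one day: only partition p3 is scanned
EXPLAIN PARTITIONS SELECT COUNT(*) FROM analytics_events WHERE dayofweek = 3;

-- no restriction on the partitioning key: all seven partitions are touched
EXPLAIN PARTITIONS SELECT COUNT(*) FROM analytics_events WHERE event_name = 'level_up';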

For the most part, for the sorts of queries we do, it’s almost automatic for us to partition big tables; we started with a couple of the biggest and recently partitioned everything for both stat collection and analytics in preparation for ever-increasing volume. All of the queries are based on a particular day, week, month, etc., so I was surprised to hear a report that a query seemed to be about 2x as slow once we partitioned its table.

However, looking at the query, it made sense. The query involved two joined tables, both partitioned. One table was our largest (analytics events), restricted to a particular day of the week and already partitioned prior, so no gain there. The other table was our sessions table, newly partitioned, and the query DID NOT restrict it to a particular time frame (such as the same day as the analytics events). Now, the query was correct, because some sessions may have been started weeks ago (an app can sit in the tray unused, so the session is still technically valid and is the correct identifier related to that day’s analytics events).

The point being: if you’ve partitioned your tables and you run queries that do not take advantage of the partitioning, the server has to do a lot more work looking at all of the partitions, and the query will be slower than against a non-partitioned table. What may not be so obvious is that the same holds true when you are joining tables and even one of them is partitioned. Now, we also do queries on the sessions table that do restrict to a particular time frame, and those indeed improved, but many of the crucial queries join to that table over its entire time range. For us, the solution was to remove the partitioning on the sessions table. You’ll of course have to make your best judgement as to the best trade-off, depending on the range of the majority of your queries.