My version of consistent hashing

When sharding, a common way to distribute your keys randomly is to use some sort of hashing algorithm in your code, rather than a scheme that is time dependent, key dependent (i.e. evens, odds), etc. You also want to keep in mind the work involved in reallocating data if you either add or remove shards.

There is a good amount of material on the subject,

here are a few links,

http://michaelnielsen.org/blog/consistent-hashing/
http://www.paperplanes.de/2011/12/9/the-magic-of-consistent-hashing.html

and the original paper,

http://www.akamai.com/dl/technical_publications/ConsistenHashing…

if you like technical papers.

Why am I writing about it again? I don't love the usual visual explanation out there (the circle) and the way it breaks things into whole sections, so I'm going to explain it the way I understand it best.

Naive hashing

Let's say you had shards 1, 2 and 3, and wanted to add shard 4.

With naive hashing, you take some sort of modulus of the hash to assign a key to a shard,

whichshard = hash('key') % N

where N = the number of shards and % is the modulus operator, which produces a number between 0 and N-1 (add 1 if you want your shards numbered 1 to N). A typical hash function might be crc32().
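
As a rough sketch of that formula in MySQL (which has a built-in CRC32()), using a made-up key and the 3 shards from our example,

-- naive hashing with N = 3 shards, numbered 1 to 3
SELECT MOD(CRC32('user:1234'), 3) + 1 AS whichshard;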

Reallocation of data when adding or removing a shard is a real problem with this method. To demonstrate, we'll assign some data, represented by the numbers 1 to 15, across the shards.

 shard1 1, 4, 7, 10, 13
 shard2 2, 5, 8, 11, 14
 shard3 3, 6, 9, 12, 15

now let’s add a shard, and redistribute these same pieces of data,

 shard1 1, 5, 9, 13
 shard2 2, 6, 10, 14
 shard3 3, 7, 11, 15
 shard4 4, 8, 12

You'll see that when you compare the two layouts, most of the data has shifted to other shards. In fact, roughly 3/4 of your data will move when you go to 4 shards, 7/8 when you go to 8 shards, etc. You'll have to move almost all of your data each time you adjust the number of shards, because changing the modulus changes the result for nearly every key.
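
You can see the same effect in the formula itself: for roughly 3 out of 4 keys, the result changes as soon as N goes from 3 to 4 (the key below is just a made-up example),

-- the same key will usually land on a different shard once N changes
SELECT MOD(CRC32('user:1234'), 3) + 1 AS shard_with_3,
       MOD(CRC32('user:1234'), 4) + 1 AS shard_with_4;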

Consistent hashing

So we turn to consistent hashing. The explanation is usually accompanied by a unit circle divided up into sections, one per shard, with a range of values assigned to each shard. To add a shard, the explanation goes, you essentially divide one of these sections in two and move only the keys you need from the shard you divided onto the new one. Of course, you now have two shards that are half the size of the rest, when what you really want is to balance the load across all shards. So the explanation continues: instead of dividing one shard's section in two, take bits from all the shards to make up your new shard.

Now, why push the idea that the shards are assigned whole sections (i.e. 0 -> 1/5 of 15 = (1, 2, 3)) when your additional shard will need to be assigned scattered values (4, 9, 14) anyway? Let's just start with a random-looking distribution for all shards, with the understanding that it doesn't matter which values the algorithm assigns to which node, as long as it keeps assigning them the same way once additional shards are in the mix. So let's start with the same distribution as before (but really, any of these numbers could be assigned anywhere),

 shard1 1, 4, 7, 10, 13
 shard2 2, 5, 8, 11, 14
 shard3 3, 6, 9, 12, 15

now let’s add a shard, and redistribute manually,

 shard1 1, 4, 7, 10
 shard2 2, 5, 8, 11
 shard3 3, 6, 9, 12
 shard4 13, 14, 15

We see that we have chosen to assign shard4 one value from each of the first three shards. It does not matter which ones go to shard4; the point is that none of the first three shards need data moved onto them, only moved off to the new shard. The data moved is also only 1/5 of the total; if we went to 20 shards, we would only have to move about 1/20 of the data. Obviously you are going to have some sort of logic that maps the formula results to a shard,

if (result in (2,5,8,11)){
   shard=2
}
etc...
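
As a sketch in SQL, that grouping for the 4-shard layout above could be a simple CASE expression (or a small lookup table mapping result to shard),

SET @result = 5;   -- the output of the hash formula described below
SELECT CASE
         WHEN @result IN (1, 4, 7, 10)  THEN 1
         WHEN @result IN (2, 5, 8, 11)  THEN 2
         WHEN @result IN (3, 6, 9, 12)  THEN 3
         WHEN @result IN (13, 14, 15)   THEN 4
       END AS shard;   -- returns 2 for a result of 5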

So for our demo, how do we get a value from 1 to 15 out of a formula? We now have,

whichshard = ceiling(hash('key')/maxvalue(hash())*N)

ceiling is a function that rounds up to the nearest integer. Dividing by the maximum possible hash value normalizes the hash to a number between 0 and 1, and multiplying by N scales that to a result between 1 and N, matching your assigned values.

if we are using crc32, the max value is 4294967295, so something like,

whichshard = ceiling(crc32('key')/4294967295 * N)
e.g. ceiling(crc32('1234567')/4294967295 * 15) = 5

and we see that the result 5 was assigned to shard2, and will always map to shard2 unless we move it to another shard in our grouping definition.

Lastly, you can run the above formula as a SQL SELECT statement, but in production, do it in your code so you are not hitting the db.
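
For example, something along these lines works in MySQL, since CRC32() and CEILING() are built in,

-- the formula as a SELECT; per the example above, this comes out to 5 for this key,
-- which the grouping logic then maps to shard2
SELECT CEILING(CRC32('1234567') / 4294967295 * 15) AS result;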

Boo! a frightful hour in production, ghost of BIGINT

Well, we had quite an hour of uncertainty this morning. At 7:30am, the behaviour of one of our data servers changed radically, with no obvious reason visible in the process list, error log, etc. We'd done a number of things over the past few days to ramp up for a version launch of one of our top games, including new code, new data structures, and a change in hardware for the memcached instances.

Processes were slowly building up, yet the load was low, and at least some queries were certainly being processed. In retrospect I regret not looking at SHOW ENGINE INNODB STATUS, as that may have made it much more apparent what was going on; however, considering the other factors, I was first looking at explanations to do with the recent changes in code, the memcached changes, etc.

The head developer mentioned he had looked at one of our big tables and it had 2.1 billion rows. However, we had already anticipated the datatype limitation of INT and gone through the exercise of changing the primary key column to BIGINT, so it seemed unlikely this was the cause. Still, the number bothered me, so I took a look at exactly what it was, and sure enough, the table passing the signed INT threshold (2147483647) coincided exactly with whatever problem we were having.

Looking at the processlist, sure enough there were many queries acting on the id 2147483647 (about 10% of the total queries). So what could be the problem? We use a lot of stored procedures to minimize network traffic, and the developer then remembered that in the stored procedure declarations the primary key parameter was defined as INT, not BIGINT. All incoming queries (and new users) were getting a value truncated to the max of 2147483647, limited by the stored procedure, not the table. All users created before 7:30am were processing normally, albeit much more slowly; anyone signing up after 7:30am was hammering the same row, hence the slow query build-up and, of course, the contention on that hotspot.
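
A stripped-down sketch of the failure mode (the table and procedure names here are made up, not our actual schema):

-- the table's primary key had already been widened to BIGINT...
CREATE TABLE game_data (
  id BIGINT NOT NULL PRIMARY KEY,
  payload TEXT
);

-- ...but the procedure parameter was still declared as INT
DELIMITER //
CREATE PROCEDURE get_game_data (IN p_id INT)   -- should have been BIGINT
BEGIN
  SELECT * FROM game_data WHERE id = p_id;
END //
DELIMITER ;

-- with a non-strict sql_mode, an out-of-range value is silently clamped to 2147483647,
-- so every id past the INT threshold ends up hitting the same row
CALL get_game_data(2147483650);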

What's scary is the dumb luck of discovering this sooner rather than much later; I never would have looked at the row count on my own, thinking it was a problem already solved. I'd like to think an eventual SHOW ENGINE INNODB STATUS would have revealed the row-level locking on that id. Apologies to the user assigned id 2147483647 for their game data, I can't imagine what the game experience must have been like for them.

Galera gotchas

We've recently implemented Galera clustering and have been pleased with the relatively easy install and implementation. A quick description of Galera: it joins individual mysql dbs as nodes to create a cluster featuring multi-threaded synchronous replication, which allows for true high availability while still using your original db and engine (innodb). Likewise, you can quickly break the cluster back down into individual servers if need be. Our specific setup includes two dbs and an arbitrator to avoid 'split-brain' in case of a node failure. The process of adding a node simply involves connecting to the group; the data transfer and sync are done automatically via SST (state snapshot transfer), with the typical overhead associated with the familiar backup methods: mysqldump, rsync and, more recently, xtrabackup. With xtrabackup you truly have a non-blocking method to add and re-add nodes. Recent improvements also include the addition of IST (incremental state transfer), which allows you to disconnect a node, do work, reconnect, and quickly catch up on the transactions that node missed.

As mentioned, the install and implementation have been quite smooth; however, here are a few things to keep in mind,

1. On installation, when running into errors, it’s important you analyze both the joiner AND donor error logs, as they of course will have differing messages. We ran into what ended up being an xtrabackup issue, which was misleading from the joiner logs, but clear as day in the donor logs.

2. On initial installation, you'll be told to set wsrep_cluster_address=gcomm:// for your first node, as there is nothing to join to. However, DON'T keep this in your my.cnf, as on restart you'll end up creating a new cluster (again) rather than joining the one you've made. Change it to specify the ip of one of the other nodes.

3. Similar to multi-master replication setups, Galera will automatically set the auto-increment increment and offset based on the number of nodes. This is automatic, but keep it in mind regarding large tables and datatype limits.

4. You may be surprised to learn that, without some care, some fundamental operations will lock the entire cluster for the duration of the operation. Here are two examples, by no means the only statements that can be long-running and cause grief,

  • An alter table locks the entire cluster, even on a table that is not in use
  • A load data infile also locks the entire cluster even on a table that is not in use

The first is partly due to how Galera handles DDL statements: total order isolation (TOI) is the default. More info can be found here,

http://www.codership.com/wiki/doku.php?id=rolling_schema_upgrade

The fact that it affects all tables and dbs is a bug / feature, more details here,

https://bugs.launchpad.net/galera/+bug/928919

I assume the load data infile lock is due to the synchronous requirements of the cluster waiting to receive confirmation of the commit on the second node.

You have a few options to avoid a cluster-wide lock,

a) for the alter table scenario, as detailed at the link above, you can use a DDL handling method known as rolling schema upgrade (RSU) which automatically detaches the node for the duration of the operation, then resynchronizes when finished.

b) for both the load data infile and alter table scenarios, you can do a more manual version of this by simply disconnecting each node, performing the operation, and reconnecting.

c) A third option is to set a session variable so the statement is only applied locally,

SET wsrep_on=0;
-- run your statement here (ALTER TABLE, LOAD DATA INFILE, etc.)
SET wsrep_on=1;

All of these methods have to be performed on each node, and particularly with ALTER, you'll need to consider whether temporarily inconsistent schemas across the nodes will cause replication problems.

5. As noted in the limitations section,

http://www.codership.com/wiki/doku.php?id=limitations

Only innodb is supported (myisam support is starting to become experimental). When dealing with the mysql permissions db (whose tables are myisam), use 'grant' and 'create' instead of 'insert' if you want the commands to replicate.

6. A caveat and further explanation of the last point: for tables other than innodb, DML statements are not replicated, but DDL statements are. This is because DDL statements seem to be passed on using a different method than DML statements. This difference also has implications that can cause some confusion; on our two nodes, we were mystified as to why DDL statements would replicate in both directions (node1 <-> node2), yet DML statements would only replicate node1 -> node2. Figuring that since the DDL was replicating both ways we could rule out a configuration restriction, we were wrong: the eventual cause was a replicate-wild-do-table= specification in my.cnf for a particular db, while we were using the 'test' db for the test. That setting would not allow DML replication, yet Galera still allowed the replication of the DDL (create and alter) for the table in the 'test' db.

7. It may take some time for a node to become fully operational; for instance, you might not be able to log in at first, or you log in, issue a command, and get a 'command not found' type error back. Just give it a little time and/or check the error logs to confirm it's on track.
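
One way to confirm that from the SQL side, once you can log in, is to watch a few of the standard wsrep status variables,

SHOW STATUS LIKE 'wsrep_ready';                -- ON once the node will accept queries
SHOW STATUS LIKE 'wsrep_cluster_status';       -- should be Primary
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- e.g. Joining, Donor/Desynced, Joined, Synced
SHOW STATUS LIKE 'wsrep_cluster_size';         -- number of members currently in the cluster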