Well, we had quite an hour of uncertainty this morning, at 7:30am, the behaviour of one of our data servers changed radically, with no obvious reasons looking at the process list, error log, etc… We’ve done a number of things the past few days to ramp up for a version launch of one of our top games, which included new code, data structures, etc.. along with a change in hardware for the memcached instances.
processes were slowly building up, yet the load was low, and at least some queries certainly were processing. In retrospect I regret not looking at the show engine innodb status as that may have made it much more apparent what was going on, however considering the other two factors, I was first looking at other explanations to do with recent changes in code, the memcached changes, etc..
The head developer mentioned he looked at one of our big tables and it had 2.1 billion rows, however, we had already anticipated the datatype limitation of INT and went through the exercise of changing the primary key column to BIGINT, and so it seemed unlikely this was the cause. However, the number bothered me, and so I took a look as to exactly what is was and certainly it seemed the the table passing the threshold of int (2147483647 signed) exactly conincided with whatever problem we were having.
Looking at the processlist, sure enough there were many queries that were acting on the id 2147483647 (about 10% of the total queries). So what could be the problem? We use a lot of stored procedures to minimize network traffic, and the developer then remembered that in the stored procedure declarations the primary key was defined as INT, not BIGINT. All incoming queries (and new users) were getting a truncated max value of 2147483647 limited by the stored procedure, not the table. All users before 7:30am were processing normally albeit much more slowly, anyone signing up after 7:30am was hammering the same row, thus the slow query build up and of course contention due to the hotspot.
What’s scary is the dumb luck of us discovering this sooner than much later I never would have looked at the row count thinking it was a problem already solved. I’d like to think an eventual show engine innodb status would have revealed the row level locking on that id. Apologies to the user assigned id 2147483647 for their game data, I can’t imagine what the game experience must have been like for them.