warehousing and data analytics in a mysql world

Mysql in it’s early days had the reputation of being simple and fast but lacking critical features for many use cases. Over the years, it’s become more stable and mature with a ton of advancements in the innodb storage engine, features such as stored code, replication improvements, query planning improvements, etc.. One area mysql has not tackled is BI and analytic OLAP. I can only assume it’s not been a priority with the user base to date. Oracle and SQL server cover a different market and have these options, however, in this new era of big data, a wider array of businesses want to collect and analyze data such as performance statistics to improve their business offering. Fortunately, there are a few 3rd party vendors who have taken on this area with the spirit of open source and/or tiered pay structure in mind, as you’ll find it can be very costly otherwise.

In our search to analyze the ever increasing volume of statistical data we were collecting, we considered the following products,

  • infobright
  • infinidb
  • luciddb
  • monetdb
  • fastbit
  • metakit
  • greenplum
  • hive

most all of these solutions implement a columnar architecture to the data in one way or another, essentially rows become columns, columns become rows. This is done to take advantage of the layout of the data and allow sequential reading (vs. random reads) of each long row (which was a column) on disk, which will happen much of the time doing OLAP type queries, aggregating on a dataset much larger than memory.

Back to the products, a number we rejected for one reason or another, generally because they were either developed more for the scientific community (metakit, fastbit), geared towards another open source db such as postgresql (greenplum), or focused on distributed systems (hive) which is a bit of overkill for us at this point.

so we were left with,

  • infobright
  • infinidb
  • luciddb
  • monetdb

infobright and infinidb have a free and paid version (the free versions both have limitations), luciddb and monetdb are open source.

our priorities in order were as follows,

– high compression
– ease of use (ideally a mysql client and connection)
– full DDL, DML statement use

there is a very good benchmark comparison done by percona on these 4 dbs, http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/

keep in mind it’s a couple of years old, on a specific set of data, so definitely worth testing yourself for your particular needs.

so why do we care about compression? The pros are two-fold, first, getting significant compression obviously saves space, one of the premiums of an OLAP system, the longer you can put off sharding (a distributed system). Typical compression on full products range from a very impressive 10:1 – 40:1, so essentially, you can have 10 TBs of data on a typical 1TB disk. Second, the compression means the disk arm for a sequential read only travels a fraction of the distance, making the operation and read faster.

Right off the bat, infinidb offers only a very modest 22% compression reduction and is their major limitation on the free version (paid you get 40:1 compression), and monetdb it’s not clear there is any meaningful compression at all, the tests matching their non-commital statement about the products compression

Infobright and LucidDb both offer in the neighborhood of 10:1 compression, very nice for a free product.

note: Now, if we were in the market for a paid solution, we take a hard look at InfiniDB along with the 40:1 compression it is a mysql build with their proprietary storage engine incorporated making it an easy transition from stock mysql.

So we are down to,

  • Infobright
  • LucidDB

The two have very different remaining pros and cons, Infobright is also a mysql build with a proprietary storage engine incorporated, however, the drawback is the limitations they’ve implemented on the free version, namely most of the DDL and DML statement functionality has been stripped away. That means you can not insert, update, delete, truncate, or alter a table. At first glance that may sound insane, but keep in mind an analytics environment in many cases needs only primarily two things, to be able to load data, and to be able to select data, and do it quickly, which it can do very well, mysql style. However if you wish to change or alter the table or data, you’ll need to fully reload the table, and these restriction may be tough for an environment needing to serve customers, however our requirements are only a daily load, appending the existing data, internal analysis, and we rarely change table structure on these particular tables, and so we are willing to deal with a dump and reload of a table occasionally if need be.

LucidDB is a completely different animal, requiring Java to run, and uses it’s own SQL variant and data loading methods (for instance, there is no datetime datatype, you’re only choice is varchar to keep the data integrity). It does have full DDL, DML statement functionality and no other obvious restrictions. Certainly with a learning curve, we could use this product, but when considering the people that would use this database, such as BAs and developers, it made sense to choose a product that is as consistent as possible with what we are currently using.

So infobright was the choice, with some work arounds to make it usable. Truncate and rename table can be emulated, and other issues had to be dealt with, for instance, you can not simply copy over an innodb schema to a brighthouse schema in many cases, it will complain about keys, foreign keys, auto increment, unsigned datatypes, etc.. We wrote a script to parse this stuff out. Not even close to being usable generally, just our specific tables, otherwise I’d include it here. Another example the utf8 charset is handled badly, you’ll need to increase your datatype definition size 4x e.g. varchar(32) -> varchar(128). There are number of odd things such as this, I’m not so sure I’d be as eager to use for outside customers, but for our internal purposes, it will fit just fine.

One Reply to “warehousing and data analytics in a mysql world”

Leave a Reply to Eduardo Cancel reply

Your email address will not be published. Required fields are marked *