As with most data centric companies, there comes a time where it’s necessary to be able to aggregate data to provide meaningful insight into raw data that is collected. At Fuse Powered that is essentially one of the cores services we provide, whether it be via ad-hoc queries, reports, or our dashboard.
The core database that we use is of course mysql with a default storage engine Innodb, which is geared toward OLTP and does it very well. However, when it comes to aggregating a large number of rows, and/or loading a large amount of data quickly, this default storage engine is not geared towards these sorts of operations efficiently when performing a very large amount of data. A class of storage engine called columnar storage is designed for just this sort of work, where the data is re-oriented, and the data is put into separate column based data files, where each of the columns can be read from disk sequentially and stored with high compression (resulting in disk storage savings and less data to read), as the speed in reading far outweighs the cpu overhead of decompression.
There are a number of columnar based offerings out there, however, almost all are closed source and quite pricey. Up until now we’ve utilized Infobright, a mysql client-based solution with a watered-down community version, stripping much of the functionality and performance from it’s full version. Recently a company, Calpont, (recently re-branded as it’s product name Infinidb) has come out with a open source licensing of it’s full 4.0.x version, a move that may potentially revolutionize how columnar databases are marketed.
This full version fills a number of core needs, such as compression, parallel processing, distributed storage, filesystem redundancy (HDFS, GlusterFS) and high speed data loading, and so we’ve been eager to test and implement.
The architecture is comprised of two main functional units, user modules (UM) and performance modules (PM). Performance modules work with and store data, while user modules coordinate queries across the entire system. Scaling performance modules gives you more storage and potentially better performance from parallel work being done. Scaling user modules allows for more concurrent queries to be run and redundancy for a failed um. You can start with a single server that contains all units, and grow out to a large distributed system as both type of modules can be added as your data increases. Each PM has a set number of dbroots, which can easily be reallocated/moved to other PMs on other servers as your data grows. It also uses a form of partitioning called extent mapping which will rule out sections of data in storage to speed up queries where possible. Queries are processed in parallel by the available processors and nodes that are appropriate.
the following is our general experience with it to date,
Documentation – Very good, answers quite a bit in terms of concepts, installation, options, etc.. However, because it’s just released as open source, really needs much more use and documentation by the user community to fish out bugs, suggest fixes, etc..
Installation/configuration – straight forward, enable passwordless ssh between nodes, download/install the initial rpm/deb/binary files and then run their script which will ask a series of questions. You can adjust the main configuration file later only on the root pm, usually pm1 at /usr/local/Calpont/etc/Calpont.xml.
Schema transformation – you need to create infinidb versions of the table structure, this is going to take some time, there are all kinds of incompatibilities when converting innodb schema to infinidb.
Loading – load data infile vs. the infinidb loader. As can be guessed, the loader is much faster, anywhere from 100,000 – 1 million+ rows/s for a single module load. The source of the loader can be at the user module, which will automatically distribute the data in 10,000 rows chunks, or at the performance modules individually using pre-split files to be able to scale. You can set an error threshold over which the load will abort, with errors logged.
Compression – our data is fairly typical, largely numerical with some varchar and text/blob columns here and there. Our data is compressed at about 5X compared to uncompressed innodb.
Backups – pretty rudimentary, really just a file copy which requires it be in read-only mode to ensure data consistency. However, it being a file copy, straight forward and fast.
Queries – performance is very impressive, especially for more complex queries, we had a class of aggregation queries that was taking 5-10 minutes on a dedicated box with 256GB ram, on a three-node public/shared cloud system 16GB each, finished in 10-15 seconds.
Logging – very well done, especially the loading, there a two files per load per table that detail the error type and the exact characters that cause the issue. Also various server logs to debug issues.
Administration – much of it is done through an independent ‘cmconsole’ which allows you to set and view various parameters and processes as well as add and reallocate modules. There are several processes per node that need to be up and running and can be a source of problems if they aren’t, viewable with the console.
Monitoring – built in monitoring of a number of key metrics such as disk/memory/cpu usage, networking issues, module issues, etc..
Cross engine compatibiity – not just the infinidb storage engine, also a full version of mysql and you can add innodb capability with the plugin. However, pretty darn old at 5.1.39. You can also mix storage engines in your queries, although you’ll need to specify it in the configuration file.
Before we get in the issues, it should be said that we have not attempted to contact infinidb reps, and any resources we used were simply the documentation and the infinidb forums (with posted responses by infinidb reps). Now, some of the issues are annoyances, other are fairly serious.
The most serious revolves around loading of data, we typical re(load) and append about 1000 tables per day and we’ve run into this behavior a few times in the last couple of weeks.
Locking – each load requires a table lock and occasionally, one of two things happen,
a) a table is locked and does not release the lock – we’ve even had a time or two where we’ve restarted the system and still has not released the lock (contrary to documentation). There is a view and clear lock utility, but for some reason it’s currently not working for us at the moment either. Our solution was to hunt for the file that contains this lock, and sure enough, you can find it on your parent pm at /usr/local/Calpont/data1/systemFiles/dbrm/tablelocks. Simply removing it and restart seems to work. This behavior seems to depend on the particular data being loaded, this happened on the same table 3 times, and we ran into most of this when initially loading our full data set.
b) table lock can not be acquired – In this case, a load is attempting to get a table lock, can’t get it and aborts the load. There were no other queries on the same table running at the same time, and in fact seemed to happen mainly when loading of other tables/dbs simultaneously.
Critical errors – we’ve ran into this a few times, where an error such as ‘image inconsistency’ initiates the system to go into ‘read-only’ mode. We only experienced it during a load which I suppose makes sense as it’s the only time we are significantly writing. When encountering this during a load it also attempts to rollback but can’t because again it’s been put into read-only mode. You might deal with the lock separately as above, and the read-only issue is suppose to clear after a system restart, but we’ve seen where it doesn’t. You then have two options. The advice at this link may help, http://infinidb.co/community/infinidb-not-starting otherwise, you may have to remove the entire infinidb installation and replace with your latest backup. We are not using a redundant file system such as HDFS or glusterFS, which may or may not help.
Not enough memory to perform select query – Infinidb uses a distributed hash join to evaluate large data sets quickly, the memory requirements of which is dictated by the second largest table in the join. There is a variable, PmMaxMemorySmallSide at a default of 64M up to a max of 4GB on the PM nodes, or all of TotalUmMemory on the UM depending on the data set size (if larger than the PM setting, uses the latter). We frequently join two tables, one at 40+ Billion rows the next at 1 Billion rows which has been an issue for certain queries, although our UM node is very modest at 16GB. Query span does make a difference (i.e. one day at a time vs. the entire year) so it’s not just a total table size metric.
backups needing to be in read only mode – is a self-explanatory issue. hopefully your needs don’t require frequent/real-time writes/loads 24 hours a day.
no real query plan evaluation – explain doesn’t give you much as it’s without indexes and use an alternate storage mechanism and execution algorithm, although probably irrelevant as it’s evaluated in a completely different way. Query profiling (set profiling=1) works. The question arises whether differences in the query structure matters and it would be interesting to see the performance of the same query written two different ways that take can longer one way in Innodb, and how they behave differently here (e.g. join vs. subquery class of queries) * – indeed there is a detail analyzer ‘select calgetstats()’ and ‘select calgettrace()’ directly after a query on the command line
sql fickleness – various issues such as,
– not being able to add a column anywhere other than the end of the table
– when doing mysqldump, must do –single-transaction, otherwise will err with lock issue
– All non-aggregate columns in the SELECT and ORDER BY clause must be included in the GROUP BY clause.
– ‘insert into (select..)’ errs, while ‘insert into select’ works
And so to sum up, we are using Infinidb and recommend it, however I’d suggest a thorough evaluation to your specific needs before making a commitment, luckily with an open source license, you can do just that.
9 Replies to “Infinidb 4.0.2 review”
Thanks for the review and noting the availability of high performance of our fully open source Community version !
I would like to give some tips regarding some of the issues you are having:
Table is locked only when loading only on the PM node on which the data is being loaded. You can use cpimport mode 3 to do loads directly on PM node as well as specify the the PM nodes on which you want to load data when using cpimport mode 1. This lock is only for writing not for reading – that is while data is being loaded/appended you can still do read query.
Not enough memory to perform select query:
In order to get the the best performance for joins requiring large memory, For evaluation we recommend 32G memory, but for production environment based on Active Data Set – recommendation is provided in Memory Requirement section of the Hardware Sizing guide available on our website at http://infinidb.co/phocadownload/documentation/4.5/InfiniDB_Hardware_Sizing_Guide.pdf – The recommendation also applies to 4.0 release. Our current GA vesion is 4.5. However in the next release in near future we are also introducing option to have disk based joins that will allow you to do joins that cannot fit in the memory. As a practice, for best performance plan your memory size based on recommendation in our sizing guide.
No real query plan evaluation –
You can see infinidb’s query plan showing each query step – including whether this step is at UM or at PM, and details on IO performance, number of rows, Table/Column involved, using calgettrace() utility. The details on this utility can be found in the Performance and Tuning Guide of our documentation at http://infinidb.co/phocadownload/documentation/4.5/InfiniDBTuningGuide_45-1.pdf
– Add a column to the end of the table only – We are aware of this and if it is pressing issue for you please engage with our support team on it.
– non-aggregate columns in SELECT and ORDER BY clause must be included in GROUP BY clause – if you want to include non-aggregate columns in projection list of your aggregate queries – please use our new Windowing Function feature that is available as of 4.0 . Please refer to SQL Syntax Guide in the documentation set. Available online at (http://infinidb.co/phocadownload/documentation/4.5/InfiniDBSyntaxGuide_45-1.pdf)
– insert into (select..)’ errs, while ‘insert into select’ works – we will create an issue to address this in future release.
Thanks again for sharing your experience with InfiniDB on your blog. Please download the new 4.5 release if you have not already done it and let us know your experience.
thanks for the input, I’ll update the review accordingly… flat out missed the calget tools, I’ll take a look. Regarding memory, certainly, Calpont does indeed recommend 32GB per node, although, of course, memory size is relative, and at some point, depending on the size of the data and query, it becomes an issue. Really the sticking point is more that a query flat out fails with insufficient allocated memory (as opposed to going to disk), and as you mentioned the solution is disk-based joins, good to see that’s coming. We’ve jacked up our swap space as a poor man’s fix.
regarding locking, really the main issue as mentioned is the occasional non-release of a lock (with no attempt to go back and release), or the failure/timeout to acquire a lock (with no attempt to retry or timeout adjustable setting, correct me if I’m wrong) and operation is aborted. Perhaps there being a global file for all table locks has something to do with it, that’s a question best left to your team. Regardless, it would be nice to see odd locking issues resolved more gracefully by the system automatically without a restart.
Yes, we’re experimenting with the parallel load now, although dealing with a similar ‘not failing gracefully’ situation where total memory can hit the threshold while loading (or other query) and one or more nodes are restarted, corrupting the load (literally loading a fraction of the records). I understand the restart is to protect the system from exhausting memory and it is indeed adjustable, however in it’s default setting and especially with the propensity to use lots of memory for queries and no disk option, the likelihood of hitting that scenario is likely at some point, the resulting cure being worse than the disease. Even more importantly, the load results should really be either be a successful load, or a rolled back operation, not half loaded.
Doing ad hoc queries, these issues are not as important, but in scenarios like automated aggregation and dashboard queries by the public, it’s important that system locks behave properly and the system attempt to deliver the resulting queries to the absolute best of it’s ability and above all, not require a restart and/or restart automatically to resolve an issue.
You guys have a great product, really just of matter of smoothing out some failure scenarios and other bugs and fixes, which I suppose is what open source is all about. Kudos for releasing it as open source and certainly hope you are seeing a large uptick in downloads and distribution.
“and text/blob columns here and there”
So how did you go around with blobs which aren’t supported?
yeah, lots of issues supporting data types, no text field support either, but does allow up to varchar(8000). We don’t have anything beyond that so that was a solution. As for no blob/binary field support at all, really we more or less circumvented the issue due to the fact that we only have a couple of blob columns in a few tables isolated in a relatively small database compared to many others, and we wanted to have this db replicated in real time into the warehouse instead of an ETL process for the larger dbs. We also decided to keep the tables in this replicated db as innodb to mitigate any compatibility issues (which lack of blob support is a prime example) and enabled cross engine compatibility to support queries using both innodb and infinidb.
If you’re looking for suggestions, I’d say this. If you can,
a) be 100% sure your blob data will not be used to be compared/filtered or be sorted/grouped on, and simply stored and read
b) max length of any stored data in the blob column less than 8000
then using a text (varchar(8000)) field instead is probably ok.
If you can’t guarantee that, then I see a few options, mainly variants of each other and not ideal,
a) strip out the blob column entirely in your ETL process (if you can do without it)
a) make all tables with blobs innodb (or myisam)
b) duplicate tables with blobs as both infinidb and innodb and use the latter for any queries needing to operate on that column
c) break up the table(s) and/or duplicate the columns that have blob(s) by making a new table as innodb with the primary key of the original table and your blob column(s) and then join for all needed queries.
I imagine resulting performance of mixing engines will depend on the comparative size of the innodb table and exactly which option you go with.
Thanks for the advise!
“we wanted to have this db replicated in real time into the warehouse instead of an ETL”
I’m pretty much following your path 🙂 just bumped into yet another thing – Infinidb outdated mysql 5.1 doesn’t even know “slave_type_conversions”…
Sorry, I should clarify a little bit, because we went with keeping the tables as innodb, there was no need to worry about datatype conversions (i.e. we kept the text/blob fields), now if we wanted to go with infinidb engine, varchar(8000) would have been a solution for our text fields. You make a good point regarding replicating into a different datatype…
Are you using row based replication? I believe setting your replication method to ‘statement’ should get around this..
No, we have “mixed” due to some circumstances, plus, moving currently from 5.5 to 5.6 which may make it incompatible… So, today it was decided that instead we’re going to create a set of scripts to add/update data and keep it relatively fresh (3-5 min), leaving out unneeded columns including blob/text.
Could you post more details on how you were able to achieve this speedup? What was the bottleneck of those queries?
(Cool blog btw)
Really we haven’t done anything so far, except literally drop the query into the Infinidb instance. I say so far because it might be possible to further optimize the query using calgetstats/calgettrace tools, although not necessarily the traditional rules of query optimization used for an OLTP database.
I should also mention before going further these sorts of queries were not part of our former warehousing processes (Infobright), these particular queries were being performed on a mysql server. Now we also ran a few rudimentary tests between Infinidb and the community version of Infobright, which is stripped of numerous performance and functionality features, and not surprisingly Infinidb performed drastically better in a number of areas such as loading and queries that required more than one filter/operation (i.e. two parameters in a where clause). Maybe Infobright will make their full product open source too one of these days and it can be a fair comparison 🙂
The reason for the speed up (infinidb vs. mysql) generally comes from the nature of the type of queries we were doing, each query contained several unioned sub-queries, the result set of which could be up to 1 billion rows, which was then grouped on several fields to get counts on several factors. We have about 1.5TB of data on this machine and only 256GB of ram and almost the entire data set is active on an hourly basis, and so generally most queries such as this have no choice but to go to disk, thus you could consider that the bottleneck.
Infinidb speeds this up by,
a) Compression of data (we got 5x) which cuts down on the data read as well as no additional storage for indexes
b) The architecture of the storage, columnar, which stores data sequentially much like indexes of which aggregation type queries (counts, sums, etc..) can take advantage.
Considering A and B, you’re more likely to find your data in memory, and even when going to disk, it’s much smaller, always sequential and only the columns you need, which for many counts and sums queries is potentially only one.
c) Extent mapping which act like a form of partitioning, so the queries don’t have to do full column scans each and every time
d) The parallel processing ability of Infinidb distributes the work across all the pm nodes in your system simultaneously
And so the secret is in identifying the types of queries and processes that would benefit most from this sort of architecture. Now if you tried to run many simultaneous single row reads/writes as an OLTP database (mysql) does, it would be quickly overwhelmed.