When ParAccel announced its TPC-H figures for 100GB, 300GB and 1TB data warehouses at the end of last October it must have been pretty pleased with itself. At the one terabyte level, for example, it not only offered performance more than four times better than Microsoft's or Oracle's, but its total cost of ownership was more than five and ten times better respectively. However, it must have been disappointed when, by the middle of December, EXASOL had published results beating it on all counts for both the 100GB and 1TB tests, with, in the latter case, nearly double the performance at only 40% of the total cost of ownership. What's more, EXASOL has just published record-breaking 3TB performance as well.
Now, long-time readers will know that I do not place much faith in benchmarks but, for a new company, they do put a stake in the ground. So, who is EXASOL and what do they do?
To begin with, the name EXASOL derives from exabyte (1 million terabytes) and solutions. It is a German company whose founders were previously engaged in university research into parallel computing. What they have built is a hybrid column-based/in-memory data warehouse that can be delivered either pre-installed on appropriate hardware (the company has a number of partners, most notably Hitachi who will be reselling the EXASOL data warehouse in Japan) or simply as software.
From the columnar perspective, EXASOL hasn't, as far as I can see, done anything particularly clever: it compresses the data and it avoids the need for whole-table scans or for pre-aggregating data, but all column-based databases do those things. However, the in-memory capability is interesting. When a query comes in, the optimiser determines whether it can be answered directly from the column-based storage or whether an index needs to be created. In the latter case, it can create either look-up indexes or (multi-way) join indexes. These indexes (if explicit rather than implicit) are saved to disk (and updated as appropriate) for subsequent re-use. In the short term, however, they are also retained in memory, along with any data that is retrieved (which applies whether you need an index or not), so that future queries can be answered directly from memory. Which, of course, raises the question: how much memory can you have?
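To make the query path described above concrete, here is a minimal sketch in Python. This is purely illustrative, not EXASOL's actual implementation: the class, method names and data are all invented. It shows a columnar store where an aggregation is answered by scanning a column directly, while a point look-up triggers the lazy construction of a look-up index that is then cached in memory for subsequent queries.

```python
# Illustrative sketch only (hypothetical names, not EXASOL's code):
# answer from columnar storage where possible, otherwise build a
# look-up index on demand and keep it in memory for later queries.

class ColumnStore:
    def __init__(self, columns):
        self.columns = columns          # {column name: list of values}
        self.index_cache = {}           # in-memory look-up indexes, built lazily

    def _lookup_index(self, column):
        # Build (or reuse) a value -> row-ids index; a real system would
        # also persist explicit indexes to disk for re-use, as described above.
        if column not in self.index_cache:
            idx = {}
            for row_id, value in enumerate(self.columns[column]):
                idx.setdefault(value, []).append(row_id)
            self.index_cache[column] = idx
        return self.index_cache[column]

    def scan_sum(self, column):
        # Aggregations can be answered directly from the column itself.
        return sum(self.columns[column])

    def select_eq(self, column, value, project):
        # Point look-ups use an index: built once, then reused from memory.
        rows = self._lookup_index(column).get(value, [])
        return [self.columns[project][r] for r in rows]

store = ColumnStore({"region": ["EU", "US", "EU"], "sales": [100, 200, 300]})
print(store.scan_sum("sales"))                   # 600
print(store.select_eq("region", "EU", "sales"))  # [100, 300]
```

The second call to `select_eq` on the same column would be served entirely from the cached index, which is the effect the in-memory retention is after.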
This leads me on to my next point, which is that EXASOL is a clustered solution. In fact, the data warehouse runs on top of EXACluster OS, the company's own operating system built on top of the Linux kernel. I intend to discuss this (and clustering and grid technologies) in some detail in a future article, so suffice it to say that separating the clustering software from the database has significant advantages. For example, the 3TB benchmark mentioned above was run on an 80-node cluster, which is far more nodes than most databases with built-in clustering can manage. Indeed, EXASOL tells me that, in theory at least, there is no reason why you couldn't run a thousand-node cluster.
Anyway, the point is that queries (and data) are distributed across the nodes in the cluster, so it is the total memory across all nodes that is the pertinent consideration. Of course, that does not mean you have to have an 80-node system for a 3TB warehouse. Indeed, one of the company's existing customers runs a warehouse of that size on a four-node system: it depends on your query profile and performance requirements. In practice the company has tested systems with up to 5TB of memory, and it estimates current scalability at around 50TB of raw data.
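A toy sketch may help show why aggregate memory is the figure that matters. The hash-partitioning scheme and the numbers below are my own invention for illustration, not EXASOL's actual distribution algorithm: rows are spread across the nodes so that each node holds, and answers queries over, only its own share of the warehouse, and a distributed aggregate is just the sum of per-node partial results.

```python
# Toy illustration (hypothetical scheme, not EXASOL's algorithm) of
# distributing rows across cluster nodes and aggregating per-node results.

def assign_node(key, num_nodes):
    # Simple hash partitioning; real systems use more careful schemes.
    return hash(key) % num_nodes

def distribute(rows, num_nodes):
    shards = [[] for _ in range(num_nodes)]
    for row in rows:
        shards[assign_node(row["id"], num_nodes)].append(row)
    return shards

rows = [{"id": i, "amount": i * 10} for i in range(1000)]
shards = distribute(rows, 4)

# Every row lands on exactly one node...
assert sum(len(s) for s in shards) == len(rows)
# ...and the cluster-wide total is the sum of per-node partial sums.
total = sum(sum(r["amount"] for r in s) for s in shards)
assert total == sum(r["amount"] for r in rows)
print(total)
```

Adding nodes shrinks each shard (and each node's memory requirement) without changing the total data held, which is why a four-node and an 80-node cluster can both host the same 3TB warehouse at different performance points.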
With so many new data warehousing vendors entering the market there is the inevitable question of who will survive, because some surely will not. While I cannot give any guarantees, the partnership with Hitachi, in particular, should see EXASOL gain significant traction, which will give them a better chance than most.