Let's start out with some figures comparing HPCC with Hadoop (yes, yes, I know you don't know who/what HPCC is - I'll come to that in a moment). Specifically, let's talk about the TeraSort benchmark, which is a sort of standard benchmark within the NoSQL community for sorting data. HPCC currently holds the record, albeit only by sorting 100 GB rather than the terabyte suggested by the name. When questioned about this, they told me that the previous record holder had also only sorted 100 GB and they wanted to compare like with like. Anyway, that's not really the point; the really interesting thing is that, with HPCC, the sorting routine in ECL (Enterprise Control Language) required just 4 lines of code. Writing the equivalent routine in Java, to make use of MapReduce, would take several hundred (the HPCC guys claim 700) lines of code. Of course, you have to learn ECL but, if that ratio is typical, that's a huge productivity gain.
So, who are these "HPCC guys" and what is HPCC? HPCC stands for High Performance Computing Cluster and it is the database developed by, and underpinning, LexisNexis. LexisNexis, a subsidiary of Reed Elsevier, is possibly the world's leading data aggregator. As its name and the foregoing discussion suggest, HPCC is a NoSQL, open source, schema-free, clustered database that looks on the surface very much like Hadoop, except that it has been in production for a decade, has no single point of failure (no NameNode or JobTracker issues), has built-in data integration capabilities and, by all indications, out-performs Hadoop. This shouldn't be surprising because ECL code (which is declarative rather than procedural) is compiled, whereas Hadoop depends on a Java Virtual Machine. There's also the sort of optimiser that you would expect from a conventional relational database, and the company offers various text and predictive analytic capabilities such as clustering, regression and so on.
The question is, if it's so good, how come everybody is flocking to Hadoop and no-one is talking about HPCC? Well, as is usually the case with such things, it comes down to marketing or, in this case, to one particular marketing decision. HPCC used to be a proprietary product that was marketed and sold in a conventional manner. With increased interest in Hadoop, the HPCC guys recognised that they needed to go down the open source route too, but it took them some time to persuade the bosses at Reed Elsevier that this was a good idea. So the product lost time: it didn't become open source until the middle of last summer, when a Community Edition was introduced (the Enterprise Edition is chargeable). Needless to say, it therefore doesn't have the momentum that Hadoop does.
All of this raises a further question. If HPCC can scale to support multiple petabytes (LexisNexis stores more than 4 PB), if it has a declarative programming language and optimiser, if it is able to support multiple types of data model (tabular, relational, ontological and so on), if it supports complex analytics against both structured and unstructured data, and if it runs on low-cost commodity hardware (which can be delivered as an appliance on a turnkey basis), then should we be thinking about HPCC as a data warehouse in its own right, as opposed to just a Hadoop competitor? Indeed, given that all the warehousing vendors are talking about co-existing with Hadoop, doesn't HPCC represent an alternative that has been designed to have all of your data in one place rather than two? The answer to both of these questions has to be yes and, in fact, LexisNexis has customers that do indeed use HPCC as a traditional data warehouse.
Regardless, HPCC is definitely worth a look.