IT-Analysis.com
Cassandra and Hadoop
By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 25th January 2012
Copyright Bloor Research © 2012

I am continuing to investigate Hadoop storage options as I get briefed by more vendors and as new products get released. In this article I want to focus on Cassandra.

DataStax is the leading commercial provider of Cassandra distributions. Cassandra is a BDDB (big data database) but, unlike HDFS (the standard storage mechanism for Hadoop) or GPFS (IBM's alternative), it is not a key-value store but a column-family store. This is not to be confused with a column-based relational database such as HP Vertica or ParAccel. In fact, it is unfortunate that whoever coined the name "column-family" didn't think of something else. The point is that while Infobright and Sensage (two more columnar relational databases) and Cassandra all use columns, that is the limit of their similarity: the former two are relational and Cassandra is not.

I don't intend to go into the details of column-family databases and how they are architected, at least not now. The main difference between a column-family database such as Cassandra and a key-value data store such as HDFS is that the latter stores just a key and a value, while the former stores tuples that consist of a name, a value and a timestamp. It is the timestamp that makes the big difference: there are lots of environments - smart metering, security logs and so on - where understanding time series is important, and this means that Cassandra can support applications that Hadoop cannot. Not surprisingly, DataStax is exploiting this capability. For example, timestamps can record either the order in which data arrives in the database or the order in which the events actually occurred (which may not be the same thing). You can also index against the timestamps and, indeed, the software supports secondary indexes as well. One further notable feature is that DataStax has introduced CQL, a query language modelled on a subset of SQL, although you can't do such things as joins, because there are no relational tables.
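The name, value and timestamp tuple, and the difference between arrival order and event order, can be illustrated with a small sketch. This is a toy in-memory model written for this article, not Cassandra's API or its storage layout; the class names (Column, ColumnFamily), the method names and the "meter-42" smart-metering data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column: a name, a value and a timestamp (hypothetical model)."""
    name: str
    value: float
    timestamp: int  # when the event occurred, e.g. seconds since the epoch


class ColumnFamily:
    """Toy column family: each row key maps to a list of columns."""

    def __init__(self):
        self._rows = defaultdict(list)

    def insert(self, row_key, name, value, timestamp):
        # Columns are appended in the order in which they arrive.
        self._rows[row_key].append(Column(name, value, timestamp))

    def in_arrival_order(self, row_key):
        """Columns in the order they reached the store."""
        return list(self._rows.get(row_key, []))

    def in_event_order(self, row_key):
        """Columns ordered by when the events actually occurred."""
        return sorted(self._rows.get(row_key, []), key=lambda c: c.timestamp)

    def time_slice(self, row_key, start, end):
        """Columns whose timestamp lies in the inclusive range [start, end]."""
        return [c for c in self.in_event_order(row_key)
                if start <= c.timestamp <= end]


# Assumed example data: three meter readings, one of which arrives late.
readings = ColumnFamily()
readings.insert("meter-42", "kwh", 1.7, 1000)
readings.insert("meter-42", "kwh", 1.2, 900)   # occurred earlier, arrived later
readings.insert("meter-42", "kwh", 2.1, 1100)

arrival = [c.timestamp for c in readings.in_arrival_order("meter-42")]
event = [c.timestamp for c in readings.in_event_order("meter-42")]
window = [c.value for c in readings.time_slice("meter-42", 950, 1100)]
```

A plain key-value store would need the application to encode the time into the key or the value itself; here the timestamp is part of every stored tuple, which is what makes time-range queries such as time_slice straightforward.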

As far as Hadoop is concerned, you can deploy Hadoop and Cassandra on the same cluster. This means that you can have your time-based and real-time applications (real-time being a strength of Cassandra) running under Cassandra, while batch-based analytics and queries that do not require a timestamp run on Hadoop. In practice, in this environment, Cassandra replaces HDFS under the covers, but this is invisible to the developer. You can reassign nodes, dynamically where appropriate, between the Cassandra and Hadoop environments to suit your workload. The other major upside is that using Cassandra removes the single points of failure that are associated with a standard Hadoop deployment, namely the HDFS NameNode and the MapReduce JobTracker, which I have discussed in previous articles.

One final point is that Cassandra has a reputation for being difficult to get started with. To simplify this process, DataStax provides installers, examples and so forth within its Community Edition, while the Enterprise Edition includes, amongst other things, a visual, point-and-click, web-based management environment that integrates with third-party environments such as Tivoli and OpenView.

Reader Comments

25th January 2012: 'Andy Ormsby' said:

Great to see you covering Cassandra, but it's worth mentioning that the commercial organisations supporting Cassandra extend beyond DataStax. Acunu, a company founded in the UK with offices in London and San Francisco, provides support and training for Cassandra, as well as its own distribution of the software together with a high-performance back-end that eliminates some of the complexity of realising the great performance that Cassandra promises. (Disclaimer - I'm an Acunu employee.)

Published by: IT Analysis Communications Ltd.