IT-Analysis.com
Problems with Hadoop
Philip Howard By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 10th November 2011
Copyright Bloor Research © 2011

Hadoop sounds great, but it has a number of issues associated with it. The first is high availability. In particular, Hadoop has a single NameNode, which is where the metadata about the Hadoop cluster is stored. Because there is only one of them, the NameNode is a single point of failure for the entire environment. If you can live with that then fine; otherwise you will either need a much more expensive and robust server to house the NameNode or you will need to take an alternative approach, of which there are several. One is to go with a different distribution of Hadoop, such as MapR, which fixes the NameNode problem. Or there are companies such as ZettaSet that have built additional tooling around Hadoop, including NameNode high availability, without forking the Apache distribution. Or, since this NameNode issue is specific to HDFS (the Hadoop distributed file system), you could replace HDFS with IBM's GPFS-SNC, which similarly averts the problem. GPFS is also POSIX (Portable Operating System Interface) compliant, which HDFS is not.
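To see why a single NameNode matters so much, here is a deliberately simplified Python sketch (a hypothetical illustration, not Hadoop's actual code): every client read must first consult the one process holding the metadata, so if that process dies, no file can be located even though the DataNodes still hold all the blocks.

```python
# Toy illustration (not real Hadoop code): all file reads must first
# consult the single NameNode's metadata to find block locations.
class ToyNameNode:
    """Holds the only copy of the filesystem metadata."""
    def __init__(self):
        # file path -> list of (block_id, datanode) pairs
        self.metadata = {}

    def add_file(self, path, blocks):
        self.metadata[path] = blocks

    def locate(self, path):
        # If this one process is lost, no client can resolve any path,
        # even though the DataNodes still hold the actual data blocks.
        return self.metadata[path]

namenode = ToyNameNode()
namenode.add_file("/logs/2011-11-10.txt",
                  [("blk_1", "datanode-3"), ("blk_2", "datanode-7")])
print(namenode.locate("/logs/2011-11-10.txt"))
```

The fixes described above (MapR, ZettaSet, GPFS-SNC) all amount to removing this one-copy-of-the-metadata bottleneck in different ways.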

Another associated problem is the JobTracker. This manages MapReduce jobs and assigns tasks to relevant servers (close to where the data is stored). Unfortunately, the JobTracker, too, usually runs on a single node, so it also represents a single point of failure. Fortunately, the same approaches that fix the NameNode issue will generally also handle JobTracker failures.
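The "close to where the data is stored" point is worth making concrete. The JobTracker's scheduling is locality-aware: it prefers to place a task on a node that already holds the task's input block, so the data need not cross the network. A minimal Python sketch of that idea (hypothetical, not the real JobTracker logic):

```python
# Toy sketch (not the real JobTracker) of locality-aware task
# assignment: prefer a free node that already stores the input block.
def assign_task(block_locations, free_nodes):
    """block_locations: set of nodes holding the task's input block.
    free_nodes: list of nodes with spare task slots."""
    local = [n for n in free_nodes if n in block_locations]
    if local:
        return local[0]   # data-local: no network copy needed
    return free_nodes[0]  # otherwise fall back to any free node

# The input block lives on nodes 2 and 5; node 5 has a free slot.
print(assign_task({"node-2", "node-5"}, ["node-1", "node-5"]))  # node-5
```

Because all of this scheduling state lives in one JobTracker process, losing that process stalls every running job, which is why it is the second single point of failure mentioned above.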

The second major issue is that MapReduce requires programming skills, and programming skills, as we all know, are in short supply. As a result there have been a number of developments to make things easier and to de-skill the requirements. For example, there is Pig, which consists of PigLatin (the language) and a runtime environment that executes PigLatin code. This doesn't remove the need to program, but Pig has been designed specifically for analysis purposes and spares you from writing explicit map and reduce functions. Next there is Hive, which provides a SQL-like interface to Hadoop. Unfortunately, it is limited: some functions are not available and others perform very poorly. Then there is Jaql, a query language based on JSON (JavaScript Object Notation) that IBM donated to the open source community. In its BigInsights product IBM offers an ANSI-compliant SQL interface that sits on top of Jaql.
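To illustrate the programming burden that Pig and Hive are trying to abstract away, here is a word count written by hand in the MapReduce style. This is plain Python rather than an actual Hadoop job (a real one would also need job configuration, serialisation and cluster setup), but the map and reduce functions are the part the developer must write in any case:

```python
# Hand-rolled word count in the MapReduce style (plain Python, no
# Hadoop): the developer writes the map and reduce functions that
# tools such as Pig and Hive generate for you.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: sum all the counts emitted for one word.
    return (word, sum(counts))

lines = ["Hadoop sounds great", "but Hadoop has issues"]
# The framework's shuffle/sort phase, simulated with a plain sort:
pairs = sorted(kv for line in lines for kv in map_fn(line))
result = dict(reduce_fn(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=itemgetter(0)))
print(result)
```

In Hive the same computation collapses to one SQL-style statement (roughly a GROUP BY with COUNT), which is precisely the de-skilling these tools offer, at the cost of the limitations noted above.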

In addition to this menagerie, some traditional vendors are exploiting MapReduce directly within their own products. For example, Syncsort, in its newly announced DMExpress 7.0, supports MapReduce functions directly from its GUI. In other words, you can define a data integration task in DMExpress using traditional drag-and-drop methods, and the product takes care of exploiting MapReduce for you. This is great for data integration, but unfortunately it doesn't help with query processing.

The third issue is that Hadoop and a number of associated products perform poorly. Again, various vendors have stepped into the breach. MapR has re-written the shuffle phase so that it is 30% faster, while ZettaSet has made Pig multi-threaded. MapR has also made numerous other improvements, which it estimates will halve your hardware requirements. Pervasive DataRush, meanwhile, supports Hadoop clusters and can be used either with HDFS as an alternative to MapReduce or in conjunction with Hadoop, in either case providing significantly improved performance. Pervasive also has a product called TurboRush for Hive, which the company claims improves the performance of Hive queries while halving the hardware required; in internal benchmarks it out-performed native Hive by a factor of three, with no changes needed to the Hive queries themselves.

Finally, Hadoop is not an easy environment to manage, which is not surprising when you consider that you might have hundreds of servers in a cluster. Both the alternative distributions (MapR and so forth) and the build-around products (ZettaSet, BigInsights et al) aim to help here, and there is also Apache's ZooKeeper project, which provides synchronisation, configuration management and other cross-cluster services.

The bottom line is that there are a lot of considerations around Hadoop. It is by no means a mature environment and it is likely that you will require multiple additional products to make it work properly, especially if you go down the open source route. If you are happy to go commercial then you will probably need fewer such add-ons but then, of course, you will have to pay for them.

Published by: IT Analysis Communications Ltd.