
Analysis

Another look at big data
Philip Howard By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 8th November 2011
Copyright Bloor Research © 2011

In the previous article in this series I discussed what big data is and talked about the ability to query ALL data that is relevant to the organisation. There are several ways to look at this issue. In that previous article I focused on where the data comes from: transactional, content, instrumented and external. However, there are other viewpoints to be considered as well, notably the type of data involved and the type of query capabilities that are available. These are closely linked.

There are various forms of structured data: transactional data is structured and you can access it via conventional query tools, analytics and SQL. XML-based documents are also structured and can be accessed via XQuery. Several vendors offer extended SQL capabilities that allow XML documents to be queried alongside transactional data. Unfortunately, neither of these approaches is of much use with unstructured data.
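To illustrate the kind of structured access XML allows, here is a minimal sketch using Python's standard-library ElementTree (which supports a subset of XPath) as a stand-in for a full XQuery engine; the document and element names are illustrative assumptions.

```python
# Querying a structured XML document: extract and aggregate values,
# much as an XQuery FLWOR expression would. ElementTree's XPath subset
# stands in for XQuery here; the sample document is invented.
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><customer>Acme</customer><total>120.50</total></order>
  <order id="2"><customer>Globex</customer><total>75.00</total></order>
</orders>
"""

root = ET.fromstring(doc)
# Pull out every order total and sum them.
totals = [float(o.findtext("total")) for o in root.findall("order")]
print(sum(totals))  # 195.5
```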

There are various other forms of structured data. For example, the data in a spreadsheet is structured: it's just that there is no metadata to describe what the rows and columns mean. Other examples are sensor-based, clickstream and log data. All of these are essentially structured. However, they are not really relational: they don't typically have the primary-foreign key relationships that are characteristic of relational data. For this reason something like Hadoop is well suited to storing this sort of information, simply because you don't need the complexity (and expense) of a full relational database. Nevertheless, such data has historically been held either in relational databases, accessed via relational methods, or in flat file systems.
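The point about clickstream and log data being structured but non-relational can be sketched as follows: each line has a regular shape you can parse, yet there are no keys or joins involved. The log format below is a hypothetical one; real layouts vary widely.

```python
# Clickstream log lines are structured but not relational: every line
# has the same field layout, but there are no keys and no joins.
# The key=value format here is an assumption for illustration.
import collections

log_lines = [
    "2011-11-08T10:00:01 user=42 page=/home",
    "2011-11-08T10:00:05 user=42 page=/products",
    "2011-11-08T10:00:09 user=7 page=/home",
]

hits = collections.Counter()
for line in log_lines:
    # Skip the timestamp, then split each remaining field on '='.
    fields = dict(f.split("=", 1) for f in line.split()[1:])
    hits[fields["page"]] += 1

print(hits["/home"])  # 2
```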

Instrumented data is also frequently time-stamped. This imposes two different requirements: the ability to store time series data and the ability to analyse it. The former is clearly an advantage when it comes to supporting the latter. Nevertheless, while a number of data warehousing products support time series analysis, very few relational databases (or databases of other types) store data this way. One of the few exceptions is Informix, which has supported time series since it acquired Illustra back in the 90s.
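A small sketch of why ordered time series storage helps analysis: once timestamped readings are held in sequence, an analysis such as a moving average becomes a simple scan. The sensor readings below are invented for illustration.

```python
# With time-stamped readings stored in order, time series analyses
# such as a moving average reduce to simple sequential scans.
# The sensor data here is a fabricated example.
from datetime import datetime, timedelta

start = datetime(2011, 11, 8)
# One reading per minute from a hypothetical sensor.
series = [(start + timedelta(minutes=i), 20.0 + i) for i in range(5)]

window = 3
values = [v for _, v in series]
moving_avg = [
    sum(values[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(values))
]
print(moving_avg)  # [21.0, 22.0, 23.0]
```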

The other type of data is unstructured. Search is the historic way of querying this or, if you have built a suitable taxonomy, you can do a more thorough analysis of content. However, if you want to analyse tweets, for example, it is unlikely that you will have such a taxonomy, in which case you will really need to parse the data. Consider a 140-character product description: data quality tools can parse this description for things like colour, model number, number of pixels, horsepower, voltage, dimensions and other characteristics of relevant products. In other words, you can extract structured information from the text. You would really like to be able to do the same with tweets and, as it happens, Informatica has just released HParser, which is a parser for Hadoop. While this is the first such product (as far as I know) it surely won't be the last.
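The kind of extraction a data quality tool performs on a short product description can be sketched with a few regular expressions. The sample text and the patterns are assumptions for illustration, not any vendor's actual rules.

```python
# Extracting structured attributes from a short free-text product
# description, in the spirit of a data quality parsing tool.
# The text and the patterns are illustrative assumptions.
import re

text = "Red compact camera, model DX-100, 12 megapixels, 240V"

patterns = {
    "colour": r"\b(red|blue|black|silver)\b",
    "model": r"model\s+([A-Z]+-\d+)",
    "pixels": r"(\d+)\s*megapixels",
    "voltage": r"(\d+)\s*V\b",
}

record = {}
for field, pat in patterns.items():
    m = re.search(pat, text, re.IGNORECASE)
    if m:
        record[field] = m.group(1)

print(record)
# {'colour': 'Red', 'model': 'DX-100', 'pixels': '12', 'voltage': '240'}
```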

Leaving that aside, traditional products aren't very good at querying text, even those that have text indexing (as products such as Sybase IQ do), and this is precisely where Hadoop comes into play.
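Hadoop's strength with text comes from brute-force MapReduce scanning rather than indexes. A toy, in-process word count below illustrates the map, shuffle and reduce phases that Hadoop would distribute across a cluster; it is a conceptual sketch, not Hadoop code.

```python
# A toy MapReduce word count, run in-process to show the phases
# Hadoop would distribute: map (emit pairs), shuffle (group by key),
# reduce (aggregate per key). The documents are invented.
from collections import defaultdict

docs = ["big data is big", "data beats opinion"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group the pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["big"], counts["data"])  # 2 2
```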

Finally, it isn't just a question of structured and unstructured data. Frequently it will make sense to combine the two, and there are different ways you can do this. One is to use a business intelligence tool that takes an index-based approach to both structured and unstructured data: examples are Endeca Latitude (now Oracle) and Connexica, and these will run on top of a standard relational data warehouse. The second, theoretical, possibility is to put all the data into Hadoop, but you probably wouldn't want to be without your data warehouse. The third option is to use a data warehouse that directly supports MapReduce, such as Aster Data (Teradata); a fourth would be to implement HBase (a column-oriented store) on top of Hadoop; and the fifth is to use a data warehouse linked to Hadoop via federation (companies like Denodo and Composite Software support this) or via ETL processes (IBM, Informatica, Talend, Syncsort et al). Most companies are opting for this last option, an approach I will discuss further in a forthcoming article.

Published by: IT Analysis Communications Ltd.
T: +44 (0)190 888 0760 | F: +44 (0)190 888 0761
Email: