As I continue to explore big data and how it can be exploited through analytical applications, I keep discovering that those in the know use a language they assume is self-explanatory, but which is a major barrier for those outside the inner core. I think most people can now get a handle on the idea that what we have stored up to now is largely transactional data (what we sold, what we installed, how many calls we received in a call centre and so on), and that this is only a small part of the data that is now generated and that can hold vital business information. What is less clear is how storing all of that data is actually translated into something of value.
The processing environment most commonly referred to when discussing Big Data is the Hadoop cluster, together with the Hadoop Distributed File System (HDFS) and the MapReduce function. The most commonly held misconception is that this is an ancillary database, used alongside the traditional data warehouse to store and process unstructured data just as the data warehouse stores and processes structured transactional data. It is not intended to be a database; it should instead be viewed as a tool, a place to process masses of data to find something of value.
So what we have is a cluster of cheap processors providing a massively parallel processing environment, a means of distributing data sets across those nodes, and a low-level programming function for querying that data. The MapReduce function allows a query to be broken down into sub-queries by a master node and distributed to worker nodes, which may themselves subdivide the work further. This is the Map step: each worker applies the same function to every element of its share of the data. In that way we can order a list, find the frequency of occurrence of key values, identify the sequence of values in the list and so on. All of this is done over the distributed data in parallel, and each worker node produces its own result set. Those partial results are then built back up into a single answer in the Reduce step.
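To make the Map and Reduce steps a little more concrete, here is a minimal sketch in plain Python that counts how often each word occurs in a handful of records. It is only an illustration of the idea, not Hadoop's own API: the function names, the three "workers" and the sample call-centre notes are all assumptions made up for the example, and everything runs on one machine rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, n_workers):
    """Master node: deal the records out across the workers."""
    return [records[i::n_workers] for i in range(n_workers)]


def map_step(chunk):
    """Worker node: emit a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def group_by_key(mapped_chunks):
    """Collect the pairs emitted by all workers, grouped by word."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_chunks):
        grouped[key].append(value)
    return grouped


def reduce_step(grouped):
    """Reduce: collapse each word's list of 1s into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_frequencies(records, n_workers=3):
    chunks = split_into_chunks(records, n_workers)
    # On a real cluster each chunk would be mapped on a different node in parallel.
    mapped = [map_step(chunk) for chunk in chunks]
    return reduce_step(group_by_key(mapped))


# Hypothetical sample data: free-text notes from a call centre.
calls = ["billing query", "installation fault", "billing complaint", "fault report"]
counts = word_frequencies(calls)
```

Here `counts` ends up as a dictionary mapping each word to the number of times it appears across all of the records. The point of the split is that no worker ever needs to see the whole data set: each one only handles its own chunk, and the totals are assembled at the end.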
To illustrate what that can be used for, I came across a description of what I think is the sort of analytical application that Big Data is really going to facilitate: the Klout score. The Klout score uses data from Facebook and Twitter to measure the influence that individuals have. It does this by building up a score, so it is rather like the traditional calculation of lifetime value applied to a new domain. The score is based on over 35 variables. The first component is what they call True Reach, a measure of the size of the individual's engaged audience, which uses factors such as followers, friends, retweets, unique comments and unique likes to size the reach of a message. The second component is Amplification Probability, the likelihood that the content is acted upon, which is measured by analysing retweeting rates, generation of new followers, likes per post, comments per post and so on. The third component is Network Influence, a measure of the authority and quality of the content, which is measured by list inclusion, unique likers, unique retweeters, the influence of followers and friends and so on. In a world where consumers are now making decisions based on peer pressure from social networks, such insights are potentially invaluable.
Another area that is grabbing attention is sentiment analysis, where comments on social networks are analysed to identify how people are actually feeling about something. For those of us who have used questionnaires and traditional tools, and have become increasingly aware that people tell you what they think you want to hear, this too is invaluable.
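As a rough idea of what analysing comments can mean in practice, here is a toy sketch in Python that scores each comment by counting words from a small positive list and a small negative list. The word lists, function names and sample comments are all made up for illustration, and real sentiment analysis tools are considerably more sophisticated than this (they have to cope with negation, sarcasm and context, for a start).

```python
import re

# Hypothetical, deliberately tiny word lists; a real lexicon would be far larger.
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "poor", "bad", "angry", "terrible"}


def sentiment_score(comment):
    """Positive word count minus negative word count for one comment."""
    words = re.findall(r"[a-z']+", comment.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(comment):
    """Turn the numeric score into a simple three-way label."""
    score = sentiment_score(comment)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical sample comments.
comments = [
    "Love the new handset, great battery",
    "Terrible service, I am angry",
    "It arrived on Tuesday",
]
labels = [label(comment) for comment in comments]
```

With these sample comments the first is labelled positive, the second negative and the third neutral. Scoring each comment is independent of every other comment, which is why this kind of work fits naturally into the Map step described earlier.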
Hopefully that goes some way to illustrating what Big Data is capable of doing and explaining why people are getting so excited. It has massive potential, but we need to learn how to exploit it effectively, and that is what I am trying to get to the bottom of. Is this a technology that will be broadly applicable, or is it going to require highly sophisticated users to make anything of it? The problem, if it is going to be constrained to just the brightest and best, is that we probably do not have enough of them to go round. That is why I am also looking at Agile BI, as that may be the way to unlock this potential for a broader audience.