It's Early Days for Big Data. Just sayin'...
By: Robin Bloor, Co-Founder, The Bloor Group
Published: 27th March 2013
Copyright The Bloor Group © 2013
The main thrust of Big Data is analytics. The simple fact is that we never had the possibility of analyzing very large data heaps until recently because of practical factors: the cost of buying and networking the hardware, the absence of analytics software that could run in parallel across server grids, the lack of fast scale-out databases, the lack of cloud deployment options, and so on. Most of what was needed has gradually emerged, and because of that Big Data is off and running.
The Upside and the Downside
We can look at this in two ways, from the positive side and the negative side. The positive side requires little explanation. Companies can quickly assemble Big Data pools that were previously too expensive or too slow to work with. In analyzing them they may discover extremely valuable knowledge that they can apply to good effect. This is the promise of Big Data and everyone knows it.
I'll deal with the downside simply by providing a list:
- There is a whole new software stack to get to grips with, starting—for many companies—with Hadoop and its many children (HBase, Pig, Hive, Mahout, Flume, etc.). Hadoop is a little like the old woman who lived in a shoe: "She had so many children, she didn't know what to do." Right now Hadoop is highly capable but immature. There are effective ways to use it, but it is easy to abuse it.
- There is a data flow issue. The old data flow of transactional systems -> data cleansing and staging -> data warehouse -> data marts -> personal data extracts is now being displaced. That might be fine if it were simply a matter of rip and replace.
- But only people with extreme courage and an overdose of optimism will rip and replace, for many reasons. Ripping and replacing databases is rarely simple and usually delivers few business payoffs beyond reducing the number of legacy systems.
- There is no proven "new" data flow model. The old data flow was about internal data. The new world embraces external data (partner data, unstructured data, social media data, web data, even event data). It is no longer clear what corporate data actually is.
- In-memory processing is a new factor in all of this. The nice thing it provides is speed, but it is not yet clear which in-memory technologies will prove to be strategic. We now have the possibility of holding quite large databases in memory, treating memory as the prime source of that data, and processing the data far faster than ever before. But no consensus has yet emerged about the best way to leverage memory.
- Master Data Management was a credible idea, even though in practice it has been hard to pull off. When you add in unstructured data, social media data, other external data, and so on, it is not clear how you achieve a consensus on metadata and enable broad usage of such data.
- If you are wondering why I've not yet mentioned Data Governance, it's because I was saving the best to last. There are four strands worth worrying about: data security (who can use what data, and who can even view it), compliance (there are data laws, and there are usage rules), data cleansing (a thorny problem when the data comes from other sources—there needs to be an audit trail) and data lifecycle (when do we throw the data away?).
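The old data flow described in the list above can be sketched as a simple pipeline. This is a minimal illustration only; the stage names, record fields, and cleansing rules here are hypothetical assumptions, not any particular vendor's architecture.

```python
# Sketch of the traditional flow: transactional systems -> cleansing/staging
# -> data warehouse -> data marts. All field names are illustrative.

def cleanse(record):
    """Staging step: trim strings and drop records missing a key field."""
    if record.get("customer_id") is None:
        return None
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def load_warehouse(raw_records):
    """Warehouse: an integrated copy of all records that pass cleansing."""
    return [r for r in (cleanse(x) for x in raw_records) if r is not None]

def build_mart(warehouse, region):
    """Data mart: a subject-oriented slice of the warehouse."""
    return [r for r in warehouse if r["region"] == region]

# Raw records emitted by transactional systems (one fails cleansing).
raw = [
    {"customer_id": 1, "region": "EMEA ", "amount": 100},
    {"customer_id": None, "region": "APAC", "amount": 50},
    {"customer_id": 2, "region": "EMEA", "amount": 75},
]

warehouse = load_warehouse(raw)
emea_mart = build_mart(warehouse, "EMEA")
print(len(warehouse), len(emea_mart))  # prints: 2 2
```

The point of the sketch is that each stage assumes internal, structured data with known keys; once external and unstructured data enter the picture, no equivalent agreed-upon pipeline exists.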
The world of events
Behind all of this, a fundamental change is occurring in the world of IT. We are moving from the processing of transactions to the processing of events. Most machine-generated data, for example, is event data. Transactions are events too, but now they are in the minority. When we move to the "Internet of Things" there will be an explosion of event data way beyond its current volume, which already exceeds transaction data by a wide margin.
Put simply, Big Data is about events. They have become the atoms of data.
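The relationship between events and transactions can be sketched in a few lines. The event types and field names below are illustrative assumptions, not a standard schema; the point is only that a transaction is one kind of event in a much larger stream.

```python
from collections import Counter

# Hypothetical event stream: field names and types are illustrative.
events = [
    {"type": "sensor_reading", "ts": 1, "value": 21.5},
    {"type": "page_view",      "ts": 2, "url": "/home"},
    {"type": "purchase",       "ts": 3, "amount": 99.0},  # a transaction
    {"type": "sensor_reading", "ts": 4, "value": 21.7},
]

# Transactions are the subset of events that change business state;
# most of the stream is machine- or user-generated event data.
transactions = [e for e in events if e["type"] == "purchase"]
counts = Counter(e["type"] for e in events)

print(counts["sensor_reading"], len(transactions))  # prints: 2 1
```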
27th March 2013: 'Philip Howard' said:
Robin, you raise an interesting point about the distinction between events and transactions (even though the latter is a subset of the former). If we assume that there are some businesses and/or companies where transactions will continue to predominate over events, does this mean that we end up with twin-track solutions? So, there's a transaction architecture and an architecture around big data, and some companies are in one stream and some in the other? And then there's probably a gradual transition from the transaction-based environment to the big data environment? Which probably means that we need some sort of sensible migration path from one to the other.
27th March 2013: 'Robin Bloor' said:
You ask a good question.
I'm not sure what the eventual outcome will be - it's difficult to predict beyond the "near horizon." As you suggest, there are some businesses that may not need to care about events, or maybe not care much about them, or may even care about them in part of the business but not for the whole business.
This is likely to be the case in many parts of the mid-market. Let's say you manufacture widgets and just two or three companies buy from you. Maybe your existing ERP system, which is transaction-based, tells you everything you need to know about the business. Your BI will be transactional. No need to change, at least not in the near future.
Alternatively, at the other extreme, think of Google, LinkedIn, Yahoo and other such web companies. Of course there are transactions in the mix, but there are also constant tsunami waves of events, and they have to make sense of them.
I suspect (but I'm not sure) that the former situation is fundamentally database-oriented, and that the current data warehouse architecture is stable and works reasonably well there. I suspect that the latter situation is fundamentally data-flow-oriented, and it is not so easy to work out what data heaps to deploy and how to flow data between them.
Data flow orientation suggests a distributed arrangement of databases, or what James Kobielus calls a "logical data warehouse" (see his comment on http://www.it-director.com/technology/big-data/content.php?cid=13743).
So I guess a migration architecture would be to move towards a distributed arrangement. This would be complicated enough if all you were concerned about was data access at a practical latency for applications, but when you add in data cleansing, MDM, governance, and the rest it could get really complex.
I've not come across any technology that makes that easy. Have you?
I'm not even sure many companies see the problem coming at them yet.