Internet-scale data collection, swarms of sensor outputs, and content clouds from the
mobile device fabric, along with enterprises piling up ever more
kinds of metadata to analyze, have stretched traditional data management models to the breaking point.
Yet advances in parallel processing using multi-core chipsets have prompted new software approaches such as MapReduce
that can handle these data chores at surprisingly low total cost. The
technical response to oceans of data has been
building for some time. But the time now seems ripe to match
lower-cost parallel computing with the economic imperatives of crunching huge data sets.
And
so just what are the technical underpinnings that support the new
demands being placed on, and by, extreme data sets? What economies of
scale can we anticipate? How will these advances spur the movement of
data to Internet cloud models?
BriefingsDirect's Dana Gardner
put these and other questions to a panel of new data architecture
experts to explore how parallelism, modern data infrastructure, and
MapReduce technologies come together. He spoke with Joe Hellerstein, professor of computer science at UC Berkeley; Robin Bloor, analyst at Hurwitz & Associates; and Luke Lonergan, CTO and co-founder at Greenplum.
Here are some excerpts:
Data growth has been following and exceeding Moore's Law
over time. What we've been seeing is that the data sets that people are
gathering and storing have been doubling even
faster than every 18 months. ... We're going to see all kinds of large
organizations gathering data from all sorts of automated sources.
...
What's changed in the last few years is that clock speeds on processors
have stopped doubling every 18 months. ... Instead, what they are doing
is putting more processing cores on every chip. You can expect the
number of processors on your chip to double every 18 months, but
they're not going to get any faster.
So data is growing faster,
and we have chips basically standing still, but you're getting more of
them. If you want to take advantage of that data, you're going to have
to program in parallel to make use of all those processors on the
chips. That's the confluence that's happening.
There are very
many people storing and analyzing more data. We're very encouraged that
most of our customers are finding new uses for data that are earning
them more money. Consequently, the driver to analyze more and more data
continues to grow. As our customers get more successful, this use of
data is becoming really important.
It's easy to parallelize the
data. You break it up into little chunks and you throw it out to
different machines. What can we do cleverly in computing with that kind
of a framework? There are a lot of ideas for how to move forward ...
where you are taking this massively parallel data-flow approach.
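
As a rough illustration of the "break it up into little chunks" idea, here is a minimal Python sketch of a map step and a reduce step over chunked data. It is not taken from the discussion: the function names are hypothetical, and a local thread pool stands in for the "different machines" a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Break the data set into roughly equal pieces."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: each worker counts words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def parallel_word_count(records, n_workers=4):
    """Chunk the data, map each chunk on a worker, then reduce the partials."""
    chunks = split_into_chunks(records, n_workers)
    # Assumption: threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())
```

Because each chunk is processed independently, adding workers only changes how the chunks are spread out, not the logic of the map or reduce steps.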
One
thing that's kind of invisible is that there is a lot of data out there
that's not being analyzed fast enough to be analyzed effectively.
That's something that I think parallelism is going to address. ... The
only reason not to gather that data is when you run out of affordable
processing and storage. Anybody with the budget will have as much data
as they can budget for and will try to monetize that. It's going to be
pervasive.
The core problem we've solved is the ability for our engine
to redistribute the data and the computation on the fly, as these
queries and analyses are being performed. ... The combination of the
software-switch interconnect, which Greenplum built into the Greenplum
product, and the underlying use of commodity parallel computers, is
brought together in this database system that makes it possible to run SQL queries and languages like MapReduce with automatic parallelism.
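
To make "redistribute the data and the computation on the fly" more concrete, the following Python sketch shows one generic way a parallel engine can join two tables whose rows start out scattered across segments: rehash both sides on the join key, then join locally on each segment. This is an assumption-laden illustration of hash redistribution in general, not Greenplum's implementation, and every name in it is hypothetical.

```python
from collections import defaultdict


def redistribute(segments, key, n_segments):
    """Move every row to the segment chosen by hashing its join key."""
    shuffled = [[] for _ in range(n_segments)]
    for segment in segments:
        for row in segment:
            shuffled[hash(row[key]) % n_segments].append(row)
    return shuffled


def local_hash_join(left_rows, right_rows, key):
    """Join the rows that ended up on the same segment."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def distributed_join(left_segments, right_segments, key, n_segments):
    """Rehash both tables on the join key, then join segment by segment."""
    left = redistribute(left_segments, key, n_segments)
    right = redistribute(right_segments, key, n_segments)
    joined = []
    # After redistribution, matching keys are guaranteed to share a segment,
    # so each pair of segments could be joined on a separate machine.
    for l_seg, r_seg in zip(left, right):
        joined.extend(local_hash_join(l_seg, r_seg, key))
    return joined
```

The caller never says where the rows live at the start; the redistribution step takes care of bringing matching rows together.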
Businesses
have invested a tremendous amount of their time over the last 15 to 25
years in SQL, and some of the more traditional kinds of business
analysis that pay off very well are ensconced in that programming
model. So, packaging a system that can do transactional, mixed
workloads with large amounts of concurrency, with applications that use
the SQL paradigm, is very important.
Packaging this together as
software plus hardware, making that available as a reference
architecture for customers, has been very important and has been very
successful in our accounts at New York Stock Exchange, Fox, MySpace, and many others.
The
combination of SQL and MapReduce in a unified way in programming
environments ... is a very pragmatic [step] that can help with people's
ability to get their hands on data in an organization. ... You want to
have the same access to all your data via either an SQL interface or a
MapReduce programming interface. ... You ought to be able to access
those with whatever language suits you, mix and match.
Some
things are easier to do in MapReduce, and some things are easier to do
in SQL, even when you know both. Good programmers have a lot of tools
in their tool belt. They like to be able to use whatever tool is
appropriate for the task. Having both of these things interleaved is
really quite helpful.
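
As a small illustration of reaching the same data through either interface, the Python sketch below computes one aggregate twice: once with a SQL GROUP BY (using the standard-library sqlite3 module as a stand-in database) and once with a hand-written map, group, and reduce. The table, column names, and sample rows are invented for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) pairs.
SALES = [("east", 10), ("west", 4), ("east", 6), ("west", 1)]


def totals_with_sql(rows):
    """Declarative version: let the database group and sum."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor)
    conn.close()
    return result


def totals_with_mapreduce(rows):
    """Procedural version: emit key/value pairs, group by key, then reduce."""
    grouped = defaultdict(list)
    for region, amount in rows:              # map: emit (region, amount)
        grouped[region].append(amount)       # shuffle: group values by key
    return {region: sum(amounts)             # reduce: sum each group
            for region, amounts in grouped.items()}
```

Both functions return the same totals per region; which one is more convenient depends on the task and on the programmer.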
[The solution] is about users being able
to gain access to all that power. What really turned the corner for
general data analysis using SQL is the ability for a user not to
have to worry about what kind of table structure they have. They can
have lots of small tables joining to lots of big tables, and big tables
joining to each other.
What the developer needs is an engine
that doesn't care how the data is distributed, per se, just being able
to use all of that parallelism on the problems of interest. ... The
physical model of how the database is distributed in a shared-nothing architecture in a Greenplum system is not visible to the developer.
There
are a couple of questions about how an individual organization's data
will end up in the cloud. Inevitably it will, but in the short-term,
people like to keep their data close, particularly database data that's
traditionally been in the warehouses, very carefully managed. ... It's
going to be some time until we really see everybody's data warehouses
up in the cloud. ... How long will it be until you really get big
volumes of data in the cloud[?] The answer is that certainly new
applications will be up there. We may start to see old data getting
uploaded in the cloud as well.
We'll start to see big data sets
up there that don't necessarily belong to anyone, and they are going to
be big. In that environment, you can imagine big data analytics will
have to run in the cloud, because that's where the data will be. One of
the fun things about the cloud that's really exciting is the elasticity
of the resources. You don't buy yourself a data center full of
machines, but you rent as many machines as you need for a task.
If
you have a task that's going to look at a lot of data, you would rent a
lot of machines for a few hours, and then you would shrink your pool.
What this is going to allow people to do is that even small
organizations may, for a short period of time, look at an enormous
amount of data, which perhaps doesn't originate in their own data
production environment, but is something that they want to utilize for
their purposes.
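
The arithmetic behind renting "a lot of machines for a few hours" can be sketched in a few lines of Python. The sketch assumes the work divides perfectly across machines and that pricing is a flat hourly rate per machine; both are simplifying assumptions, and the function name and parameters are hypothetical.

```python
def rental_plan(machine_hours_of_work, machines, hourly_rate):
    """Return (elapsed hours, total cost) for a job spread over rented machines.

    Assumes perfect linear scaling and a flat per-machine hourly rate.
    """
    hours = machine_hours_of_work / machines
    cost = machines * hours * hourly_rate
    return hours, cost
```

Under those assumptions, a job needing 200 machine-hours costs the same on one machine as on 100, but it finishes in 2 hours instead of 200. That is why a small organization can briefly look at an enormous amount of data without owning the hardware.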
Disk densities show no signs of slowing down.
So, data is going to be essentially no cost. The data-gathering
infrastructure is also going to be mechanized. We're going through what
I call the industrial revolution of data production. We're just going
to build machines to generate data, because we think we can get value
out of that data, and we can store it essentially for free.
The
compute cost of multi-core with parallelism is going to continue
Moore's Law. It's just going to continue it in a parallel programming
environment. If we can get all those cores looking at all that data, it
won't cost much to do that, and the cost of that will continue to
shrink by half.
The only real barrier to the process is to make
those systems easy to program and manageable. Cloud helps somewhat with
manageability, and programming environments like SQL and MapReduce are
well-suited to parallelism. We're going to just see an enormous use of
data analysis over time. It's just going to grow, because it gets
cheaper and cheaper and bigger and bigger.
Read a full transcript of the discussion. The full podcast is also available for download here.