These days “big data” is a hot topic, but in fact there are many distinct issues around analysing high-volume data, depending on the use case. Relational databases are very good at analysing data that is well structured and relatively stable, and column-oriented databases bring advantages to many analytic use cases, partly because they allow greater data compression. Hadoop storage has proved well suited to highly parallel processing of less structured data, such as web traffic. However, certain kinds of analysis remain difficult, even with all the improvements in hardware and newer database approaches. One such problem is detecting patterns in large volumes of unpartitionable, volatile, poorly structured data. For example, a government intelligence agency would like to detect potential threats by analysing the myriad sources that such agencies have access to: purchase records (perhaps someone buying a lot of fertiliser), signals intelligence, human intelligence, known connections to suspicious people. Finding threats is not just a matter of writing a simple SQL query; it means traversing large datasets in ad hoc ways that cannot be tackled with traditional techniques such as caching or pre-fetching.
Another use case with a similar characteristic is in medicine: detecting adverse reactions across a large database of patients and trying to spot patterns. There may be a huge number of “dimensions”, to use data warehouse parlance, to model here: family history, genetic markers, prescriptions, treatments, patient notes, clinical trial data and research papers. In mathematics such data relationships are modelled by graph theory, in which a set of objects is connected by links. Unfortunately, traditional database approaches make it inefficient to traverse this kind of unpartitionable graph data in real time as data volumes grow, because most techniques for handling high data volumes revolve around partitioning the data in some form, such as clustering related data together on disk or in memory so it can be conveniently processed. If you need to reach data on different storage media, or to fetch data into memory rather than processing it directly in hardware, things can become slow.
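To make the idea of graph traversal concrete, here is a minimal sketch in Python. The nodes and links are entirely invented for illustration: a hypothetical patient connected to drugs, genes, notes and trial data, with a breadth-first search discovering the shortest chain of links to an adverse event. The point is that the path is not known in advance, so there is no obvious way to partition or pre-fetch the data.

```python
from collections import deque

# Hypothetical, hand-built graph of patient-related entities.
# Node names are illustrative only, not from any real medical dataset.
graph = {
    "patient_42":       ["drug_X", "gene_BRCA1", "patient_note_7"],
    "drug_X":           ["trial_9", "adverse_event_A"],
    "gene_BRCA1":       ["research_paper_3"],
    "patient_note_7":   [],
    "trial_9":          ["adverse_event_A"],
    "adverse_event_A":  [],
    "research_paper_3": ["adverse_event_A"],
}

def find_path(start, goal):
    """Breadth-first search: return the shortest chain of links, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for neighbour in graph.get(path[-1], []):
            if neighbour == goal:
                return path + [neighbour]
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connection found

print(find_path("patient_42", "adverse_event_A"))
# → ['patient_42', 'drug_X', 'adverse_event_A']
```

In a toy dictionary this is instant, but at scale each hop may land on a different disk or machine, which is exactly where conventional partitioning strategies break down.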
A recent product launch takes an unusual approach to this particular problem. Working with early adopters, including certain US government organisations, YarcData, a spin-off from Cray, developed a proprietary processor that can handle 128 threads simultaneously and ingest data at a rate of 350 TB an hour. This technology has been wrapped in a standards-based layer, so that Java or query languages like SPARQL can take advantage of this processing ability. The result is a product called uRiKA, an appliance aimed at a new problem segment called graph analytics: the real-time visualisation and discovery of unknown relationships in complex, high-volume data. Launched in March 2012, uRiKA already has some early customers in the US government, but also in medicine, such as the Mayo Clinic and the Institute of Systems Biology.
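SPARQL, the query language mentioned above, works by matching patterns of subject-predicate-object triples against a graph. As a rough illustration of the idea, here is a plain-Python sketch of a tiny triple store and a three-pattern query in the spirit of the intelligence example; all the triples and names are invented, and a real SPARQL engine is vastly more sophisticated.

```python
# A tiny in-memory triple store, in the spirit of the RDF data that
# SPARQL queries. All subjects, predicates and objects are invented.
triples = [
    ("alice", "purchased",  "fertiliser"),
    ("alice", "knows",      "bob"),
    ("bob",   "flagged_as", "suspicious"),
    ("carol", "purchased",  "fertiliser"),
]

def match(pattern):
    """Match one (s, p, o) pattern; terms starting with '?' are variables.
    Yields one variable-binding dict per matching triple, roughly as a
    SPARQL engine would for a single triple pattern."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break  # constant term does not match this triple
        else:
            yield binding

# Analogue of: SELECT ?who WHERE { ?who purchased fertiliser .
#                                  ?who knows ?contact .
#                                  ?contact flagged_as suspicious . }
suspects = [
    b1["?who"]
    for b1 in match(("?who", "purchased", "fertiliser"))
    for b2 in match((b1["?who"], "knows", "?contact"))
    for b3 in match((b2["?contact"], "flagged_as", "suspicious"))
]
print(suspects)  # → ['alice']
```

Carol also bought fertiliser but has no link to a flagged contact, so only alice is returned: the answer emerges from joining patterns across the graph, not from any single record.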
Of course graph analytics is a highly specialised area, and it is early days. But clearly this type of use case, detecting hidden relationships in poorly structured, changing, high-volume data, should have a variety of applications. If you have a problem of this type then uRiKA is worth a look.