This is the third in my series of articles about graph databases, and here I am going to highlight Neo4j from Neo Technology. First, though, a further discussion of the use of graph databases versus (other) NoSQL approaches.
A graph database has a different storage paradigm from, say, Hadoop, typically storing data in triples as a subject entity, a relationship and an object entity. So, for example, you could have "Philip Howard" "is bridge partner of" "Dave" where Dave and I are both nodes in the graph and "bridge partner" is the edge or relationship.
Now you could store this information in a key-value store such as Hadoop as name-value, relationship-value and name-value. But suppose you now want to add that I am also a partner of Wendy, and that Wendy also plays with Dave; that when I play with Dave we play a Precision Club system, when I play with Wendy we play 2 over 1 game forcing, and when Dave plays with Wendy they play Acol. This all gets a lot more complicated, and it becomes easier to represent this sort of information in a graph, not least because graph databases allow entities and relationships to be qualified. So Precision Club, 2 over 1 game forcing and Acol (all of which, for the uninitiated, are bidding systems) could all be qualifiers to the relationship, and you could further add our rankings (life master and so forth) as attributes of each of us. It would be incredibly difficult and time-consuming to manage this sort of environment in a key-value store or, for that matter, a traditional data warehouse, and then run queries against it (and, in any case, Hadoop doesn't support ad hoc queries), because the essential element that you are interested in is the relationships rather than the entities.
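To make the idea of qualified relationships concrete, here is a minimal sketch in plain Python of the bridge example above. It is not Neo4j code and makes no claim about how Neo4j stores data; the class names (`TinyGraph`, `Node`, `Edge`), the relationship label `BRIDGE_PARTNER_OF` and the `system` and `ranking` property names are all assumptions made for illustration. The ranking values are placeholders, since no actual rankings are given here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    properties: dict = field(default_factory=dict)  # qualifiers on the relationship


class TinyGraph:
    """Illustrative in-memory property graph (hypothetical, not a Neo4j API)."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name, **properties):
        self.nodes[name] = Node(name, properties)

    def add_edge(self, source, relation, target, **properties):
        self.edges.append(Edge(source, relation, target, properties))

    def partners_of(self, name):
        """Return sorted (partner, bidding system) pairs; the edge is treated as undirected."""
        result = []
        for edge in self.edges:
            if edge.relation != "BRIDGE_PARTNER_OF":
                continue
            if edge.source == name:
                result.append((edge.target, edge.properties.get("system")))
            elif edge.target == name:
                result.append((edge.source, edge.properties.get("system")))
        return sorted(result)


graph = TinyGraph()
# Ranking values are placeholders: the article does not state anyone's ranking.
graph.add_node("Philip Howard", ranking="<ranking>")
graph.add_node("Dave", ranking="<ranking>")
graph.add_node("Wendy", ranking="<ranking>")

# The bidding system qualifies the relationship, not either player.
graph.add_edge("Philip Howard", "BRIDGE_PARTNER_OF", "Dave", system="Precision Club")
graph.add_edge("Philip Howard", "BRIDGE_PARTNER_OF", "Wendy", system="2 over 1 game forcing")
graph.add_edge("Dave", "BRIDGE_PARTNER_OF", "Wendy", system="Acol")

for partner, system in graph.partners_of("Wendy"):
    print(f"Wendy plays {system} with {partner}")
```

The point of the sketch is that the bidding system lives on the edge itself, so a question such as "who does Wendy play with, and what system do they use?" is answered by walking relationships rather than by joining separate records.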
Moving away from Bridge, the same applies to social network analysis of other types, in retail for example. Moreover, relationships don't have to be between people and people or even people and things: things can also have relationships to one another. For example, any network management environment, whether it is pipelines or traffic or an IT network or the Cloud, essentially consists of things and relationships so it might make sense to build relevant applications in these areas on top of a graph database. This would include such things as SIEM (security information and event management) where a graph-based approach might make a lot more sense than the file-based systems that typify the SIEM market. Other potential markets include bioinformatics, medicine, capital markets and, of course, security services and agencies.
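As a rough illustration of things having relationships to other things, the following Python sketch models a small IT network and asks what is affected when one component fails. The device names and the `FEEDS` mapping are invented for the example, and the breadth-first traversal simply stands in for the kind of relationship-following query a graph database is designed to run; it is not taken from any particular product.

```python
from collections import deque

# Hypothetical IT network: each key "feeds" the components listed against it.
FEEDS = {
    "core-router": ["switch-a", "switch-b"],
    "switch-a": ["web-server"],
    "switch-b": ["db-server", "web-server"],
    "web-server": [],
    "db-server": [],
}


def affected_by(failed, feeds):
    """Breadth-first traversal returning everything downstream of a failed component."""
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for neighbour in feeds.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(seen)


print(affected_by("switch-b", FEEDS))
```

The same traversal would apply equally to pipelines or traffic networks: only the meaning of the nodes and edges changes.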
So, to talk about Neo4j: it is probably (almost certainly) the leading and best-known commercial product in the graph database market. It has implementations at Adobe, Cisco, Deutsche Telekom, Viadeo, Comparex and the Telenor Group, amongst others. The applications for which these companies are using Neo4j are diverse. As I have previously mentioned, Neo4j is suitable not only for running queries against relationships but also for transaction processing. To that end, it is ACID compliant and supports XA-compliant two-phase commit.
One particularly interesting use case at one of its customers is for master data management. And the reason is interesting: the company in question had previously implemented MDM on Oracle RAC but the number of relationships and hierarchies that had to be managed was so large and complex that, in the words of the CIO, "performance was killing us". Hence the move to an environment that was designed to understand and manage relationships, as opposed to the relational database world which, despite its name, was not designed with that in mind. Well, it was, but very much on a one-to-one-to-one basis rather than for the many-to-many environment which reflects the real world. Indeed, a relational database is very good for transaction processing where the queries are understood in advance, so that you can pre-determine the types of queries and design the tables to support those kinds of queries. However, where the queries are not predictable, the schema-free approach provided by graph databases can bring significant benefit.
It is also worth pointing out that while Neo4j is not a clustered solution (at present at least) the company does offer a high availability option whereby you can have a second server acting as back-up. This is an active-active arrangement, so that during normal operations you can run transactions on one system and queries on the other, thereby providing a genuine hybrid approach. Because this is a scale-up rather than a scale-out approach, you don't have, or need, MapReduce. You can, nevertheless, use various common languages for programming, just as with a conventional database. Thus Neo4j doesn't just support Java (as the j implies) but also Python, Perl, Ruby and so on, as well as SPARQL.
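The active-active arrangement described above can be sketched as a simple routing rule: transactions go to one server, queries to the other, and either server picks up both workloads if its peer is unavailable. The Python below is only a sketch under those assumptions. `HybridRouter`, the server names and the `mark_down` mechanism are hypothetical and do not represent Neo4j's high availability API.

```python
class HybridRouter:
    """Hypothetical router: transactions to one server, queries to the other."""

    def __init__(self, transaction_server, query_server):
        self.transaction_server = transaction_server
        self.query_server = query_server
        self.down = set()  # servers currently considered unavailable

    def mark_down(self, server):
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def route(self, kind):
        """Return the server that should handle a 'transaction' or a 'query'."""
        if kind == "transaction":
            preferred, fallback = self.transaction_server, self.query_server
        elif kind == "query":
            preferred, fallback = self.query_server, self.transaction_server
        else:
            raise ValueError(f"unknown workload kind: {kind!r}")
        if preferred not in self.down:
            return preferred
        if fallback not in self.down:
            return fallback  # the surviving server acts as back-up
        raise RuntimeError("no server available")


router = HybridRouter("server-1", "server-2")
print(router.route("transaction"), router.route("query"))

router.mark_down("server-2")
print(router.route("query"))  # falls back to the remaining server
```

In normal operation both servers do useful work, which is what distinguishes this from a purely passive standby.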
To conclude, astute readers will have gathered that I think graph databases are very interesting. At this stage I have not looked at a wide enough range of graph database products to say definitively that Neo4j should be recommended over other such products, but it is a good place to start, especially for transactional and hybrid environments where relationships are key to your application.