Sunday, April 5, 2009

The current database debate and graph databases

Over the last month or so there has been a lot of new energy in the database debate. People are questioning the RDBMS to an extent that we're not used to. Tony Bain asks: "Is the Relational Database Doomed?" He discusses the simplicity, robustness, flexibility, performance, scalability, and compatibility of databases, noting that the RDBMS has been "good enough" in all of these areas in most cases. He goes on to say:

Today, we are in a slightly different situation. For an increasing number of applications, one of these benefits is becoming more and more critical; and while still considered a niche, it is rapidly becoming mainstream, so much so that for an increasing number of database users this requirement is beginning to eclipse others in importance. That benefit is scalability.

He goes on to describe alternatives to the traditional RDBMS, foremost giving an overview of key/value databases.

Robin Bloor commented on the article and, among other things, pointed out three flaws in the relational data model:

  1. it has problems representing common data structures like ordered lists, hierarchies, trees or web page content
  2. it isn't helpful when the data model evolves over time
  3. it lacks access semantics beyond those provided by storing items as rows in tables

What about the alternatives to the RDBMS? Most of the buzz is around key/value stores and schema-less databases. Richard Jones wrote the article Anti-RDBMS: A list of distributed key-value stores, listing the most important projects. Going schema-less can take many different forms, and the solution FriendFeed uses may not be the most common way to implement it.

What, then, is the most important difference between key/value stores and a graph database? Key/value stores lack built-in support for representing relationships between entities, and capturing such relationships is fundamental to data modeling. Since Neo4j supports arbitrary key/value pairs on nodes and relationships, it could just as well be viewed as a key/value store with full built-in support for relationships.
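To make that concrete, here is a minimal sketch of what this looks like with Neo4j's embedded Java API (exact class names have shifted between versions, and the KNOWS relationship type and example data are made up for illustration):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class PropertyGraphSketch {
    // Relationship types are user-defined; KNOWS is just an example.
    enum RelTypes implements RelationshipType { KNOWS }

    public static void store(GraphDatabaseService graphDb) {
        Transaction tx = graphDb.beginTx();
        try {
            // Arbitrary key/value pairs on nodes, just as in a key/value store...
            Node alice = graphDb.createNode();
            alice.setProperty("name", "Alice");
            Node bob = graphDb.createNode();
            bob.setProperty("name", "Bob");

            // ...plus first-class relationships, which key/value stores lack.
            Relationship knows = alice.createRelationshipTo(bob, RelTypes.KNOWS);
            knows.setProperty("since", 2009); // relationships take properties too

            tx.success();
        } finally {
            tx.finish();
        }
    }
}

A key/value store gives you the setProperty half of this; the createRelationshipTo half is what the graph model adds.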

Because of the advantages of using a graph database and the lack of products available on the market today, companies tend to build their own graph databases as the underpinning of highly scalable applications. Recently Scott Wheeler of Directed Edge wrote about their in-house database solution: On Building a Stupidly Fast Graph Database. In the associated Hacker News discussion he also mentioned that they tried Neo4j but were disappointed with the performance. This puzzles us, since Neo4j is used in production with far larger amounts of data than those mentioned in the discussion.

In summary, there's an increasing awareness that the 30+ year old relational model may not be optimal for a lot of the use cases that we as developers encounter today. It's great to see that the community is seriously starting to tackle the challenges of efficiently handling information in the 21st century!

10 comments:

Anonymous said...

There is one novel and rather unusual approach where partially ordered sets are used instead of graphs: http://c2.com/cgi/wiki?ConceptOrientedModel What is attractive is that it is claimed to be joinless and integrated with programming (like LINQ?). Yet, it is not clear if it is something really new or simply a number of old techniques and patterns.

leandro said...

This is stupid. People condemn the relational model for SQL limitations, but SQL is not relational at all. Much better than reinventing the wheel would be to implement truly relational database management systems.

Ben said...

"People are questioning the RDBMS to an extent that we're not used to."

I commented on that article on my blog, so I won't repeat my arguments here. After debating with a commenter, I'm inclined to think that this "new energy" is another round of inventors trying to attract attention to their new DBMS. I don't think there's anything wrong with that; after all, if they really do have a good idea it ought to get attention. I just think there's no reason we should be less skeptical of hype or FUD than we normally are.

Richard said...

One of the other big issues is that a lot of database work ends up as follows...

1) Using JPA (or whatever), serialize and de-serialize a lot of object graphs. This happens a lot interactively.

2) Occasionally (less often than (1) but still fairly often), create broad-based sweeping queries against the data represented by those object graphs.

RDBMSs handle (1) moderately poorly, and handle (2) really, really well. Graph and object systems handle (1) well and (2) poorly. The thing is that, based on application usage, I'd almost always take the trade-off where the interactive graph building is 10x slower and the big, hairy, complicated reporting is 10x faster.

Every now and then I start feeling differently, and make a system that's more optimized for OO usage than for data usage. Every single time, so far, I've come to regret that decision.

Phil Goetz said...

I was excited to hear about Neo4j. I was disappointed to see how it's implemented.

It's a vast improvement over relational DBs. You should write examples for the SQL people who can't see that. For example, here's an SQL query in a program I work on:

SELECT r.role_id
FROM omnium..ident_names i, omnium..role_link r, omnium..asm_feature f, omnium..db_data d
WHERE i.lower_com_name = ?
  AND i.locus = r.locus
  AND i.locus = f.locus
  AND f.db_data_id = d.id
  AND (d.original_db < 'nt' OR d.original_db >= 'nu')

With a logic system, the query would be

com_name(L, ?), original_db(L, DB), or(less(DB, 'nt'), less('nu', DB)), role_id(L, R).

The query is simpler because, for each node, every fact about that node is connected to it, so you don't have to know which table the info is in, or query fields like db_data_id whose only purpose is to link two tables together.

Also, with a logic system, you can add rules that will then be automatically used to answer queries. So if you have to look up role_id by com_name a lot, you add the rule

comname_roleid(CN, R) :- com_name(L, CN), original_db(L, DB), or(less(DB, 'nt'), less('nu', DB)), role_id(L, R).

and then can just ask

comname_roleid('ATP synthase', ?)

You can put more and more intelligence into the graph itself.

More importantly, using a relational DB hinders you from using the information you have. For instance, if you have a reaction pathway database, it would be really hard to write an SQL query that meant "Find all the proteins X whose mutation could account for the loss of production of protein Y." It would be simple with a graph database.
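For instance, sketching with a Neo4j-style traversal (the PRODUCES relationship type and the pathway schema are made up; the API names follow Neo4j's traversal framework and may vary by version):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.ReturnableEvaluator;
import org.neo4j.graphdb.StopEvaluator;
import org.neo4j.graphdb.Traverser;

public class UpstreamQuery {
    // Invented schema: producerProtein -[PRODUCES]-> product.
    enum RelTypes implements RelationshipType { PRODUCES }

    // Walk PRODUCES edges backwards, breadth first, returning every node
    // upstream of proteinY -- i.e. the candidates whose mutation could
    // account for Y's loss of production.
    public static Iterable<Node> upstreamOf(Node proteinY) {
        return proteinY.traverse(
                Traverser.Order.BREADTH_FIRST,
                StopEvaluator.END_OF_GRAPH,
                ReturnableEvaluator.ALL_BUT_START_NODE,
                RelTypes.PRODUCES, Direction.INCOMING);
    }
}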

Sadly, you can't do all of these things with neo4j either! It falls short of the basic principles of knowledge representation developed in the 1970s in 3 critical ways:

1. It stores only triples. It can't represent unary or ternary relations. This means that it can't represent the "not" predicate. So if your schema lets you store "relation(A,B)" in the DB, you can't store the fact "not(relation(A,B))". Also, you can't store the relation "true(relation(A,B))", which means you can't distinguish between facts that are asserted as true and facts that are merely the constituents of larger, opposite assertions, like "not(relation(A,B))". This means you can't do reasoning, because you can't represent what one agent believes that another agent believes.

2. The 'link' part of a neo4j relation is not a node! So you can't write a relation about another relation. So you can't even represent boolean statements like and(or(A,B), or(C,D)). All the info about a link has to be stored in that link, so it is impossible to add arbitrary modifiers about links! For instance, if you have a link saying that a gene has an annotation, you can't then say how confident you are about that annotation, except by building that into the link schema. And even if you do that, you can't then make another assertion about your confidence in the data you used to make the first confidence assertion. (Yes, it is necessary to do this in the real world.) Which defeats the purpose of moving away from a relational DB.

3. It doesn't use the same representation for rules and data. So there are a lot of things you can't do. You can't use machine learning to learn rules; only programmers can write rules, because they are in Java code, not in Java data structures. You can't write rules to inspect other rules; for example, to prioritize rules based on their complexity.

What I want is a Prolog that works using terabyte knowledge stores on disk. See SNePS for an example of a system that gets knowledge representation right. Unfortunately, I think it requires its knowledge to all be in memory.

Anders Nawroth said...

@Phil Goetz

Thanks for your input!

Neo4j was designed to provide a simple, flexible and performant way to persist directed labeled graphs. Other representations can then easily be built on top of Neo4j, and as different use cases have different needs, it wouldn't be a good design choice to try to put everything into the core Neo4j engine.

At the moment Neo4j is quite new and doesn't come with lots of different layers built on top of it, but this is steadily improving. One example of a layer on top of Neo4j that already exists is the RDF/SPARQL implementation; maybe that is of some interest to you.

To solve your problems, you could for example add a layer to represent rules using Neo4j. You could also add a layer that makes relationships able to have relationships themselves, simply by using an intermediate node, roughly as in the sketch below. I'm sure you could get some help from people on the mailing list.
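As a sketch of the intermediate-node pattern (the SUBJECT/OBJECT relationship type names are just placeholders, and an open transaction is assumed):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;

public class ReifiedStatement {
    // Placeholder relationship types for the reified statement.
    enum RelTypes implements RelationshipType { SUBJECT, OBJECT }

    // Instead of gene -[ANNOTATED_AS]-> annotation, create an intermediate
    // "statement" node, so the assertion itself can carry properties and be
    // the target of further relationships.
    public static Node reify(GraphDatabaseService graphDb,
                             Node gene, Node annotation, double confidence) {
        Node statement = graphDb.createNode();
        statement.createRelationshipTo(gene, RelTypes.SUBJECT);
        statement.createRelationshipTo(annotation, RelTypes.OBJECT);
        statement.setProperty("confidence", confidence);
        return statement; // later assertions can point at this node in turn
    }
}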

Puneet said...

You are correct in your assessment that key/value stores don't have any relationships between the key/value pairs, but you should also mention that the main draw of key/value stores is their distributability. I just started reading about Neo4j, so I don't know how ready it is to run on data spread over hundreds of different machines, or to be run in an online environment.

Jens Østergaard Petersen said...

At least some of the things that Phil Goetz misses in Neo4j are, as far as I can see, to be found in Topic Maps. Check out e.g. the implementation by Ontopia: http://code.google.com/p/ontopia/. The query language that is being developed for Topic Maps is also based on Prolog.

Anonymous said...

I'm interested to hear whether the problems Scott Wheeler (of DirectedEdge) had with Neo4j performance have been solved?

Anders Nawroth said...

@Anonymous: At the time Scott Wheeler tried out Neo4j, it didn't have a batch inserter, so inserting large datasets wasn't nearly as fast as it is now. I guess this solves at least part of their problem. As we never got any detailed information regarding the problem, it's hard to be more specific.
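For the curious, the batch inserter skips the transactional layer and writes straight to the store files. A rough sketch of how it's used (from memory, so the exact class and factory names may differ between versions):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BulkLoad {
    enum RelTypes implements RelationshipType { KNOWS }

    public static void main(String[] args) {
        BatchInserter inserter = BatchInserters.inserter(new File("target/batchdb"));
        try {
            // No transactions here: entries go straight into the store files,
            // which is what makes large initial imports fast.
            Map<String, Object> aliceProps = new HashMap<String, Object>();
            aliceProps.put("name", "Alice");
            long alice = inserter.createNode(aliceProps);

            Map<String, Object> bobProps = new HashMap<String, Object>();
            bobProps.put("name", "Bob");
            long bob = inserter.createNode(bobProps);

            inserter.createRelationship(alice, bob, RelTypes.KNOWS, null);
        } finally {
            inserter.shutdown(); // flush everything and close the store
        }
    }
}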