Thursday, February 25, 2010

Access control lists the graph database way

In many contexts you need to handle user permissions to access, create or change some kind of resources. A common example is a file system, and that's what we are going to dive into in this blog post. We're going to use Ruby bindings for the Neo4j graph database to create a small - but working - example application.

Preparation

To set up the environment for this example on Ubuntu, I used the following commands:

sudo apt-get install jruby
sudo jruby -S gem install neo4j

To import the libraries, the following code was used:

require 'rubygems'
require 'neo4j'
require 'neo4j/extensions/find_path'

Heading for the node space

So user permissions, what are they all about? Obviously it's about users, and usually user groups as well. We'll abstract this away a bit and use the term principals, which can be single users or groups.

The other side of user permissions are the resources which are to be protected. In our case we'll have a file system, so there will be folders and files. Here we'll use the term content.

Let's start out building a graph to support the application from what we have gathered so far! When working with a graph it's beneficial to think in a graphy manner, so that's where we'll begin. Graphs are presumably about connecting things, so our first step is to create some relationships. Neo4j comes with a built-in reference node, which is easily accessible at all times. We use this to create our own "subreference nodes", one for principals and one for content. This is how our graph looks so far:

To create (and get) the subreference nodes, we use this function:

def get_or_create_sub_ref( name )
result = Neo4j.ref_node.rels.outgoing( name ).nodes.first
if ( result.nil? )
result = Neo4j::Node.new :name => name.to_s.capitalize.gsub("_", " ")
Neo4j.ref_node.rels.outgoing( name ) << result
end
return result
end

This function is then called whenever we need to use a subreference node. The important parts here are:

  • ref_node: the built-in reference node
  • rels: relationships connected to a node
  • outgoing: the direction of the relationship (the relationships are always directed, but you can choose to ignore the direction in traversals)
  • ( name ): the type of relationships to follow (the type can be ignored in traversals as well, but in our case we want to use it)
  • nodes: the nodes in the other end of the relationships
  • first: the first node found - there sould only be one subreference node of each type

If the subreference node isn't found, it will be created and connected to the reference node. As you can see, we're adding a property with the key name to the nodes as well, which is there solely for the purpose of visualization (the images in this post are created using Neoclipse).

Basic structure

For the principals part, we are going to connect the top-level ones to the corresponding subreference node using a PRINCIPAL type of relationship. Other than that, there's just users and groups, so let's use a IS_MEMBER_OF_GROUP relationship type to encode that. This is how that looks in the graph:

And here's the code to create it:

def new_principal( name, member_of_groups = [] )
principal = Neo4j::Node.new
principal[ :name ] = name
if member_of_groups.empty?
get_or_create_sub_ref( :PRINCIPALS ).rels.outgoing( :PRINCIPAL ) << principal
else
for group in member_of_groups
principal.rels.outgoing( :IS_MEMBER_OF_GROUP ) << group
end
end
return principal
end

If a new principal isn't member of any groups, it's added as a top-level principal, connected to the principals subrefererence node. In other case, it's simply added to the groups.

With Neo4j all operations on the graph have to be encapsulated in a transaction, so this is how we'll call the above function:

Neo4j::Transaction.run do
all_principals = new_principal( "All principals" )
root = new_principal( "root", [ all_principals ] )
regular_users = new_principal( "Regular users", [ all_principals ] )
user1 = new_principal( "user1", [ regular_users ] )
user2 = new_principal( "user2", [ regular_users ] )
end

For the content part, things are very similar to the principals part. The main difference is that in this case, an item can have only a single parent item. Here's the graphical view on that:

And this is the code to create the structure:

def new_content( name, parent = nil )
content = Neo4j::Node.new
content[ :name ] = name
if ( parent.nil? )
get_or_create_sub_ref( :CONTENT_ROOTS ).rels.outgoing( :CONTENT_ROOT ) << content
else
parent.rels.outgoing( :HAS_CHILD_CONTENT ) << content
end
return content
end

Similar to how the principals were created, this is the code to create the content data:

Neo4j::Transaction.run do
root_folder = new_content( "Root folder" )
temp_folder = new_content( "Temp", root_folder )
home_folder = new_content( "Home", root_folder )
user1_home_folder = new_content( "user1 home", home_folder )
user2_home_folder = new_content( "user2 home", home_folder )
a_file = new_content( "MyFile.pdf", user1_home_folder )
end

At the core

Now that we have the basic structure in place, what's left regarding our data is a small but crucial part: the permissions information! We're using a simple scheme: adding security relationships with optional boolean flags for read and write permission. Not much to say here, this is what we want the full graph to look like (click for a bigger version):

A small function will help us add the security information:

def apply_security( content, principal, map_with_flags )
security_relationship = Neo4j::Relationship.new( :SECURITY, principal, content )
map_with_flags.each_pair {|key, value| security_relationship[ key ] = value}
end

It's time to add the security data:

Neo4j::Transaction.run do
apply_security( root_folder, root, { "w" => true } )
apply_security( root_folder, all_principals, { "r" => true } )
apply_security( temp_folder, all_principals, { "w" => true } )
apply_security( user1_home_folder, regular_users, { "r" => false, "w" => false } )
apply_security( user1_home_folder, user1, { "r" => true, "w" => true } )
apply_security( user2_home_folder, user2, { "r" => true, "w" => true } )
end

To check the permission for some action by an actual principal for some content, there's some work to do. This is the algorithm we use to retrieve a permission flag:

  1. Move from the content node and upwards through the file system structure and investigate each level for permission information.
  2. On each level, see if there are any principals related to or identical with the principal concerned.
  3. Make sure to use the permission information from the principal closest to the principal concerned.
  4. If permission information was found, return it; otherwise, continue traversing to the next level in the file system.

In the code for this, we'll use a function named depth_of_principal() to calculate the distance between the principal we have traversed to and the principal concerned. More on that later, here's the code to check the permissions:

def has_access( content, principal, flag )
for current_content in content.incoming( :HAS_CHILD_CONTENT ).depth( :all )
lowest_score = nil
lowest_modifier = nil
for rel in current_content.rels.incoming( :SECURITY )
rel_principal = rel.start_node
if !rel[ flag ].nil?
score = depth_of_principal( rel_principal, principal )
if !score.nil?
modifier = rel[ flag ]
if lowest_score.nil? || score < lowest_score ||
( score == lowest_score && modifier )
lowest_score = score
lowest_modifier = modifier
end
end
end
end
if !lowest_modifier.nil?
return lowest_modifier
end
end
return false
end

Here's our function to check the distance between principals (and to see if they're on the same path at all).

def depth_of_principal( principal, reference_principal )
result = reference_principal.outgoing( :IS_MEMBER_OF_GROUP ).depth( :all ).path_to( principal )
return result.nil? ? nil : result.size
end

Finally, we want to see that everything works, so here's a utility function to print permission information:


def print_has_access( content, principal, flag )
print principal[ :name ] + " +" + flag.upcase + " access to " + content[ :name ] + "? " +
has_access( content, principal, flag ).to_s + "\n"
end

And here's how to use the function:

Neo4j::Transaction.run do
print_has_access( home_folder, root, "w" )
print_has_access( home_folder, user1, "w" )
print_has_access( a_file, root, "r" )
print_has_access( a_file, user2, "r" )
print_has_access( a_file, user1, "w" )
end

Next steps

The full source code is found here

Here's a few useful resources to help you on your way:

Thanks for reading - any feedback is welcome!

Tuesday, February 16, 2010

The top 10 ways to get to know Neo4j

Today is a big day in Neo4j land because after ten long years of development and seven years of commercial 24/7 production we just announced Neo4j 1.0!

We're very excited about this and this post will outline the ten most interesting and fun ways of getting started with Neo4j. Without further ado, let's go!
  1. Wait, what is Neo4j?

    Neo4j is a graph database, that is, it stores data as nodes and relationships. Both nodes and relationships can hold properties in a key/value fashion. Here's a small example:

    You can navigate the structure either by following the relationships or use declarative traverser features to get to the data you want.

  2. Introduction

    For a high-level, 9 minutes cocktail-party introduction of Neo4j, check out this interview with Emil Eifrem:

    (blip.tv)

    To watch a longer introduction, see the no:sql(east) 2009 presentation by Emil Eifrém

  3. Handling complexity

    Most applications will not only have to scale to a huge volumes, but also scale to the complexity of the domain at hand. Typically, there may be many interconnected entities and optional properties. Even simple domains can be complex to handle because of the queries you want to run on them, for example to find paths. Two coding examples are the social network example (partial Ruby implementation) and the Neo4j IMDB example (Ruby variation of the code). For more examples of different domains modeled in a graph database, visit the Domain Modeling Gallery

  4. Storing objects

    The common domain implementation pattern when using Neo4j is to let the domain objects wrap a node, and store the state of the entity in the node properties. To relieve you from the boilerplate code needed for this, you can use a framework like jo4neo (intro, blog posts), where you use annotations to declare properties and relationships, but still have the full power of the graph database available for deep traversals and other graphy stuff. Here's a code sample showing jo4neo in action:

    public class Person {
    //used by jo4neo
    transient Nodeid node;
    //simple property
    @neo String firstName;
    //helps you store a java.util.Date to neo4j
    @neo Date date;
    // jo4neo will index for you
    @neo(index=true) String email;
    // many to many relation
    @neo Collection roles;

    /* normal class oriented
    * programming stuff goes here
    */
    }

    Another way to persist objects is by using the neo4j.rb Neo4j wrapper for Ruby. Time for a few lines of sample code again:

    require "rubygems"
    require "neo4j"

    class Person
    include Neo4j::NodeMixin
    # define Neo4j properties
    property :name, :salary, :age, :country

    # define an one way relationship to any other node
    has_n :friends

    # adds a Lucene index on the following properties
    index :name, :salary, :age, :country
    end
  5. REST API

    Of course you want a RESTful API in front of the graph database as well. There's been plenty of work going on in that area and here are some options:

    • The neo4j.rb Ruby bindings comes with a REST extension.
    • The neo4jr-simple Ruby wrapper has the neo4jr-social example project, which exposes social network data over a REST API.
    • Similarly, the Scala bindings has a companion example project which will show you how to set up a project exposing your data over REST.
    • Last but not least, Jim Webber has joined up with the core Neo4j team to create a kick-ass REST API. The current code base is only in the laboratory but a lot of people are already kicking its tires.
  6. Language bindings

    The Neo4j graph engine is written in Java, so you can easily add the jar file and start using the simple and minimalistic API right away. Your first stop should be the Getting started guide, or if you want to add a package of useful add-on components to the mix, go for Getting started with Apoc. Other language bindings:

  7. Frameworks

    Work is being done on using Neo4j as backend of different frameworks. Follow the links to get more information!

  8. Tools

    • Shell: a command-line shell for browsing the graph and manipulate it.
    • Neoclipse: Eclipse plugin (and standalone application) for Neo4j. Visual interface to browse and edit the graph.
    • Batch inserter: tool to bulk upload big datasets quickly.
    • Online backup: performs backup of a running Neo4j instance.
  9. Query languages

    Beyond using Neo4j programmatically, you can also issue queries using a query language. These are the supported options at the moment:

    • SPARQL: Neo4j can be used as a triple- or quadstore, and has SAIL and SPARQL implementations. Go to the components site to find out more about the related components.
    • Gremlin: a graph-based programming-language with different backend implementations in the works as well as a supporting toolset.
  10. Inspiration

    Have a look at the Neo4j in the wild page to see what others are doing with Neo4j. Here's a selection:

Hopefully this post was a good starting guide to the Neo4j ecosystem. As always, please ask any questions on the mailing list or come hang out with us in the #neo4j channel on IRC.

Sunday, February 7, 2010

Yay! The Graph Processing Infrastructure is starting to emerge!




Hi all,
in the last months, the Tinkerpop team has been starting to venture into the big task of starting a unified ecosystem for the world of graphs and related projects and products. Now, I am proud to say that it seems things are starting to get some traction and see increasing contributions from outside the core team, mainly the awesome Marko Rodriguez:



Logo contributed by Ketrina Yim
  • The JUNG graph library got adapted to Gremlin
  • HyperGraphDB is being adapted to work with Gremlin
  • a REST API based on the awesome work of Jim Webber and the Neo4j team is in the making by Michael Hunger and Pavel Yaskevich
So, here is the current project ecosystem - great work of everyone involved!

Gremlin:
  • mainly driven by Marko Rodriguez
  • a library and standalone, single-user Java project, defining a
  • number of data models - to start with the Property Graph Model (PGM) and the
  • General Document Model (GDM) , soon to be broken out of the core Gremlin code.
  • Adapters to different underlying graph implementations, from Neo4j to SAIL, integrating anything from Sesame to a live LinkedData SAIL
  • Adapters to other interesting graph frameworks like JUNG, suggested by Seth@Automenta
  • A Turing complete scripting language for querying, modification and transformation of PGM and GDM compliant data structures
  • All selectors are XPath-based in syntax
  • Pluggable external path elements and function implementation.

RESTling:
Webling:
  • Driven mainly by Pavel Yaskevich via financing from Neo Technology
  • A web based visual end-user interface to Restling
  • A web based terminal supporting execution of Gremlin operations and logic
  • Visualization support with graph libraries
  • Multi-user support
  • Via REST support to connect to remote Restling instances
Gargamel:
  • Driven mainly by Marko Rodriguez
  • a execution framework primarily targeted at Bulk Synchronous Parallel graph algos
  • A number of highly parallel base graph algos integrated into Gremlin to use this framework
  • A communication framework for execution of gremlin tasks on different (partitioned or replicated) graph instances, firstly using LinkedProcess (financed by the LANL) and XMPP, but replaceable with e.g. an Erlang-implementation (kudos to Ingo Schramm for suggesting it) or RESTling- based communication for optimization of different aspects like inter-process communication during execution

All in all, I just wanted to express my excitement over the whole emerging community around Gremlin, Neo4j and graphs in general! It is thrilling to see that the easy use of graphs and the internet-scale processing of complex data structures is starting to take shape in an open world, getting the different views on graphy data onto one page and providing a broader audience the possibility to use graph structures in the real world.

/peter neubauer

Friday, December 25, 2009

Holiday fun with Neo4j

Looking for something fun to do during the holidays? Here are a few suggestions for some new cool Neo4j things that you can play around with.

A very recent addition to the Neo4j space is the JRuby library Neo4jr-social by Matthew Deiters:

Neo4jr-Social is a self contained HTTP REST + JSON interface to the graph database Neo4j. Neo4jr-Social supports simple dynamic node creation, building relationships between nodes and also includes a few common social networking queries out of the box (i.e. linkedin degrees of seperation and facebook friend suggestion) with more to come. Think of Neo4jr-Social is to Neo4j like Solr is to Lucene.

Neo4jr-social is built on top of Neo4jr-simple:

A simple, ready to go JRuby wrapper for the Neo4j graph database engine.

There's also the Neo4j.rb JRuby bindings by Andreas Ronge which have been developed for quite a while by multiple contributors.

Staying in Ruby land, there's also some visualization and other social network analysis stuff going on.

Looking for something in Java? Then you definitely want to take a look at jo4neo by Taylor Cowan:

Simple object mapping for neo. No byte code interweaving, just plain old reflection and plain old objects.

There's also a blog post where Taylor shows how to model a User/Roles pattern using jo4neo.

There's apparently a lot of work going on right now in the Django camp to enable support for SQL and NOSQL databases alike. Tobias Ivarsson (who's the author and maintainer of the Neo4j Python bindings) recently implemented initial support for Neo4j in Django. Read his post Seamless Neo4j integration in Django for a look at what's new.

One more recent project is the Neo4j plugin for Grails. There are already some projects out there using it. We want to make sure Neo4j is a first-class Grails backend so expect more noise in this area in the future.

You can find (some of the) projects using Neo4j on the Neo4j In The Wild page. From the front page of the Neo4j wiki you'll find even more language bindings, tutorials and other things that will support you when playing around with Neo4j!

Happy Holidays and Happy Hacking wishes from the Neo4j team!

Tuesday, September 15, 2009

Social networks in the database: using a graph database

Recently Lorenzo Alberton gave a talk on Trees In The Database where he showed the most used approaches to storing trees in a relational database. Now he has moved on to an even more interesting topic with his article Graphs in the database: SQL meets social networks. Right from the beginning of his excellent article Alberton puts this technical challenge in a proper context:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively.

After a brief explanation of what a graph data structure is, the article goes on to show how graphs can be represented in a table-based database. The rest of the article shows in detail how an adjacency list model can be used to represent a graph in a relational database. Different examples are used to illustrate what can be done in this way.

This post is going to show how the same things can be done when using a native graph database, namely Neo4j. To begin with it's good to have a grip on how your data actually looks, and when it comes to graphs I prefer a graphical representation, so here goes:

As you can see it's a fairly small social network, so it should be easy to comprehend what's going on in the examples. Anyhow, please read the original article for the full context of the examples; this article will mainly deal with how to solve the same problems in a true graph database. For the sample code, Java will be used. If you're no fan of Java, go check out the current language bindings of Neo4j.

Representing a graph

This part is very different when dealing with a graph database: there's no need to think much about the representation at all when using a graph database. Nodes and relationships (also called edges) are native to a graph database and this includes making sure that the graph is consistent (for example ensuring that every relationship has a start node and an end node). As a developer you can focus more on what to model and less on how to represent it.

Traversing the graph

Our first task is to find the nodes that are directly connected to a node in the case of directed relationships. We assume the node variable as input and then the code is as easy as follows:

for ( Relationship rel : node.getRelationships( RelationshipTypes.KNOWS, Direction.OUTGOING ) )
{
Node otherNode = rel.getEndNode();
// Here goes your code to make use of otherNode.
// You can aggregate it or whatever you want to do.
}

What we are doing here is to fetch all the outgoing relationships and then get the node at the other end of the relationship. In Neo4j all relationships between nodes are typed, and as seen from both the image above and the code, a KNOWS relationship type was used in this case. The relationship types are explained on the RelationshipType page of the Neo4j API reference.

To following image shows node1 and all the nodes that are directly connected to it. When following all outgoing relationships, the result will be a single node, namley node3.

Next it's time to do the same thing again but using undirected relationships. Both Neo4j and the model in Alberton's article store the relationships as directed but then you can retrieve them in a different way to ignore their direction. Here's how to do this using Neo4j:

for ( Relationship rel : node.getRelationships( RelationshipTypes.KNOWS, Direction.BOTH ) )
{
Node otherNode = rel.getOtherNode( node );
// Make use of otherNode here ...
}

Did you spot the difference? Where the previous code had Direction.OUTGOING we know have Direction.BOTH. For those of you who are curious now: yes, there's Direction.INCOMING as well.

When viewed with a perspective that ignores directions, the whole graph looks like this::

When running the new code with Direction.BOTH, all the directly connected nodes will be retrieved irregardless of direction:

Difference graph-based/table-based: First, obviously the SQL solution is declarative, while the Neo4j solution is procedural. As we'll see later on there is a declarative API to traverse the node space (graph) in Neo4j as well. Second, changing from directed to undirected relationships almost didn't change the code neccesary at all with the graph database approach, while the relational one requires some changes.

Transitive closure

Transitive closures in the sense Alberton uses it is irrelevant in a graph database context as you have access to the full graph from the beginning. The need for defining a transitive closure over the graph in relational databases is caused by its lack of native support for graph data structures.

What is still of course relevant in the graph database setting is how to find all nodes that are reachable from one node, using directed relationships. So let's see how to do that. As previously we assume the node variable as input to this piece of code:

Traverser traverser = node.traverse(
Order.BREADTH_FIRST,
StopEvaluator.END_OF_GRAPH,
ReturnableEvaluator.ALL_BUT_START_NODE,
RelationshipTypes.KNOWS,
Direction.OUTGOING );
for ( Node foundNode : traverser )
{
final int depth = traverser.currentPosition().depth();
// Use foundNode and depth to whatever you want.
}

So what's happening in this code? We start out by setting up a traverser. The traverser will use a breadth first algorithm, so that we can intercept the depth/distance to each of the returned node correctly. The traverser will not stop until the end of the graph and will return all nodes but the start node. It will follow relationships of the KNOWS type in the outgoing direction. Finally the loop fetches the result from the traverser.

If we for some reason want to do this for all nodes in the graph, we simply feed the code above with all existing nodes. This time we assume having all nodes in our network in the allNodes variable. It's easy to get them as the Neo Service API provides a getAllNodes() method. What we need to do is just a regular loop:

for ( Node node : allNodes )
{
// perform the traversal for each node here
}

How then to do the same thing but using undirected relationships? I rest assured that readers will have no problems solving this so I leave it as a homework excercise!

Difference graph-based/table-based: This time it's less obvious what the SQL actually does; you need a bit of SQL fu to parse that code! The corresponding Neo4j code this time uses a declarative approach, which depicts the wanted result.

LinkedIn: degrees of separation

Now it's definitely time to get closer to what a graph database could do for you in an application. In this case we want to know the possible paths from one person in a social network to an other. With those paths at hand we can do things similar to the "How you're connected to N.N." feature of LinkedIn.

For a deeper analysis it could be of interest to know all possible paths between two nodes within a limited distance (and not only go for the shortest path), so we will see how that would work. This task could be performed with the low level API of Neo4j, but in this case we will use the graph-algo package instead. Of course a graph database should come with basic graph algorithms included as well! This time we will assume that we (quite obviously) know the start and end nodes and how long paths we are looking for (maxPathLenght). Let's get to the code:

AllSimplePaths simplePaths = new AllSimplePaths( 
startNode,
endNode,
maxPathLenght,
Direction.BOTH,
RelationshipTypes.KNOWS );
List<List<Node>> pathsAsNodes = simplePaths.getPathsAsNodes();
// Use pathsAsNodes in your application.

Do you want to sort the paths by distance (a.k.a. lenght of the paths)? Then add this code:

Collections.sort( pathsAsNodes, new Comparator<List<Node>>()
{
public int compare( final List<Node> left, final List<Node> right )
{
return left.size() - right.size();
}
} );

When looking for paths from node1 to node6 using undirected relationships all nodes but node7 will be part of at least one path:

If we for some reason use directed relationships, the result will be found among the nodes that are colored yellow in this graph:

Difference graph-based/table-based: In this case you could argue that it's not a fair comparison, since a relational database can't ship with built-in functions for such specific things as finding paths in a graph. There's some validity to that. But that's also my point: graphs are, as Alberton and Google say, ubiquitous these days and no longer a narrow special case. In an increasingly connected world, a graph database many times brings a lot more functionality and performance to the table -- pun intended! -- than a relational system designed for the payroll systems of the 70s.

Facebook: you might also know

Our last example will require some more complexity to solve. But on the other hand I regard it as the most fun and rewarding piece. The challenge this time is to traverse into friends of friends (or even deeper) and find people with something in common. Based on this we can distill someone the user might know, or could be interested in knowing. The persons in our social network all have different values for two features, feat1 and feat2. The idea of the solution presented here is to keep track of the features that are common in each branch of the traversal and to stop when there is nothing in common any more or the maximum distance has been reached. To do this we use recursion (which was never a problem in graph databases, unlike in relational database systems). Before we start to recurse, we have to set some things up:

class YouMightKnow
{
List<List<Node>> result = new ArrayList<List<Node>>();
int maxDistance;
Set<Node> buddies = new HashSet<Node>();

public YouMightKnow( Node node, String[] features, int maxDistance )
{
this.maxDistance = maxDistance;
for ( Relationship rel : node.getRelationships( RelationshipTypes.KNOWS ) )
{
buddies.add( rel.getOtherNode( node ) );
}
findFriends( node, Arrays.asList( new Node[] { node } ), Arrays.asList( features ), 1 );
}

The features parameter contains the names of the features we want to check. As all features of course matches at the starting point, we can send them in as the matches list. To make sure we don't suggest people who are already buddies we keep track of those in a set. That's all we need to prepare, so off we go to the recursion! It's initially fed with the root node and path to start from, together with the list of the matching features and finally the current distance. Here's how things continue from there:

    void findFriends( Node rootNode, List<Node> path, List<String> matches, int depth )
{
for ( Relationship rel : rootNode.getRelationships( RelationshipTypes.KNOWS ) )
{
Node node = rel.getOtherNode( rootNode );
if ( (depth > 1 && buddies.contains( node )) || path.contains( node ) )
{
continue;
}
List<String> newMatches = new ArrayList<String>();
for ( String match : matches )
{
if ( node.getProperty( match ).equals( rootNode.getProperty( match ) ) )
{
newMatches.add( match );
}
}
if ( newMatches.size() > 0 )
{
List<Node> newPath = new ArrayList<Node>( path );
newPath.add( node );
if ( depth > 1 )
{
result.add( newPath );
}
if ( depth != maxDistance )
{
findFriends( node, newPath, newMatches, depth + 1 );
}
}
}
}
}

Now there's a few things we need to think about to get this code right. It's not hard if you keep the network structure in your mind. To begin with there are some nodes we have to avoid. As soon as we have passed by (depth > 1) the buddies, we don't want to meet them again. We also want to avoid nodes which are already contained in the path we are investigating, thus avoiding cycles in the graph.

When we know we are looking at the right nodes it's time to compare their features. We only look at the features that are matching everyone in the path behind us, so the newMatches list will get shorter than the matches list if one or more features don't match any more. If there are no newMatches we're finished with this branch. Otherwise we construct a newPath. Then we have to make sure not to add buddies to the result and to stop recursing when the maxDistance has been reached.

Finally, this is how to use the class:

String[] features = new String[] { "feat1", "feat2" };
YouMightKnow mightKnow = new YouMightKnow( node5, features, 2 );
// get result from mightKnow.result

The graphical view of this problem looks like this, when searching for "might-know" nodes of node5:

If you would like to do something more advanced it would be a small step to add in you own feature matching operations instead of the simple equals() method used in this example.

Difference graph-based/table-based: If you compare the code, you'll note that the graph database code is more intuitive. This is because it works with abstractions (nodes, relationships etc.) that fit the domain better. But I think the most important difference in this case appears when it's time to evolve the code. How do you change the traversal depth in the SQL solution? How to change the traversal depth dynamically depending on facts found in the attributes/features?

Conclusion

It should be noted that the examples above are all very trivial. When dealing with real-world problems differences that may seem small from the tiny code examples could prove to have a strong impact. Also, I didn't mention performance so far, but in many cases this could be the main reason to move graph data to a native graph database.

What is actually the root cause of the differences between relational databases and graph databases outlined in this article? It lies in a fundamental difference in design of the underlying data models. The relational model is based on the assumption that database records are what primarily matters while relationships between records come second place.

Graph databases instead assume that the relationships are as important as the records. This difference has numerous consequences for ease of use as well as for performance. A table-based system makes a good fit for static and simple data structures, while a graph-based fits complex and dynamic data better. What actually matters in your data? Think about it!

Further reading

Update 2009-09-21: changes to the you-might-know code and description

Saturday, June 6, 2009

Community buzz heating up!

It's been a lot of fun to take part in the Neo4j community the last few weeks. There's been quite some mentions on Twitter and increasing contributions to the mailing list. Among the themes discussed on the mailing list are new features like online backup and batch insert. There was also recently a design thread on how to model entities with versions.

Mattias Ask (founder of Open Causes) published the article Spring and load-time weaving of Neo4j-based domain objects which demonstrates how Spring can be used to inject Neo4j nodes into domain objects. Mattias previously wrote the article Neo4j matches my mental model of information. In this article he shows an alternative to the domain entity pattern suggested by the Neo4j Design Guide. Here's how Mattias describes his encounter with Neo4j:

When I first started looking at Neo4j I was blown away by how precise the graph database structure matched my mental model of information. You have blobs of informations (nodes with properties) that relates to other blobs of information. Perfect!

Sujit Pal contributed some semantic web goodness with his article Using Neo4J to load and query OWL ontologies. The code is part of his Java Text Mining Toolkit open source project. In the blog post, he shows how to load ontologies using Jena for parsing and Neo4j for storage, thus achieving a cleaner solution than the previous one which didn't use a graph database. From his conclusion:

I have barely scratched the surface of the Jena API with this, but I think I have exercised quite a bit of the Neo4J API, and I was quite impressed with the latter.

Irene Gabashvili commented on Web 3.0 databases in an awesome way:

This world is too complex to be described by rigid tables. The 40-year old relational technology may be here to stay, but not to rule.

Her blog post goes on to mention graph databases as her favorite amongst the emerging storage alternatives and links to two introductory graph db presentations.


Sunday, April 5, 2009

The current database debate and graph databases

During the latest month or so there has been a lot of new energy in the database debate. People are questioning the RDBMS to an extent that we're not used to. Tony Bain asks: Is the Relational Database Doomed?. He talks about the simplicity, robustness, flexibility, performance, scalability, and compatibility of databases, where the RDBMS has been "good enough" in all areas in most cases. He goes on to say:

Today, we are in a slightly different situation. For an increasing number of applications, one of these benefits is becoming more and more critical; and while still considered a niche, it is rapidly becoming mainstream, so much so that for an increasing number of database users this requirement is beginning to eclipse others in importance. That benefit is scalability.

He goes on to describe alternatives to the traditional RDBMS, foremost he gives an overview over key/value databases.

Robin Bloor commented on the article and among other things comes up with three flaws in the relational data model:

  1. it has problems representing common data structures like ordered lists, hierarchies, trees or web page content
  2. it isn't helpful when the data model evolves over time
  3. there's a lack in access semantics, being restricted to the semantics provided by storing items as rows in tables

What about the alternatives to RDBMS? Most of the buzz is around key/value stores and schema-less databases. Richard Jones wrote the article Anti-RDBMS: A list of distributed key-value stores, listing the most important projects. Going schema-less can look very different, and the solution FriendFeed uses may not be the most common way to implement this.

What's then the most important difference between key/value stores and a graph database? Key/value stores lack built-in support for representing relationships between entities, and to capture such relationships is fundamental for data modeling. As Neo4j supports arbitrary key/value pairs on nodes and relationships it could as well be viewed as a key/value store with full built-in support for relationships.

Because of the advantages of using a graph database and the lack of products available in the market today, companies tend to build their own graph database as underpinning to highly scalable applications. Recently Scott Wheeler of Directed Edge wrote about their in-house database solution: On Building a Stupidly Fast Graph Database. In the associated Hacker News discussion he also mentioned that they tried Neo4j but were disappointed with the performance. This puzzles us since Neo4j is used in production with way larger amounts of data than what's mentioned in the discussion.

In summary, there's an increasing awareness that the 30+ year old relational model may not be optimal for a lot use cases that we as developers encounter today. It's great to see that the community is seriously starting to tackle the challenges of efficiently handling information in the 21st century!

Wednesday, February 18, 2009

The most important algorithm

Danny Nieuwegein wrote in the blog post Persistency solution decision algorithm about his strategy for picking a persistence solution:


IF rdbms is already in use THEN
IF problem is trivial THEN
Spring JDBC
ELSE
iBATIS
ELSE
a graph DB

He also gives some thoughts on the problems with object-relational mapping as well. Last but not least neo4j gets mentioned. It's really nice to see that developers are starting to think about using a graph database!

Saturday, January 10, 2009

What's the buzz?

Here's what was buzzing around Neo4j last year on planet.neo4j.org.


Wednesday, January 7, 2009

Future of databases

Stefan Kleineikenscheidt wrote a short entry on his thoughts/wishes for databases in 2009. This is part the future according to him:

Non-relational databases will become more popular this year. Examples include document oriented data stores or as graphs databases.

It seems quite clear that developers want alternatives to tradtitional RDBMS, either as a replacement or a complement.