Friday, July 30, 2010

The top seven news in Neo4j 1.1

The Neo4j graph database release 1.1 has just arrived, so here's some information on the new things that have been included. The main points are the additions of monitoring support, an event framework and a new traversal framework to the kernel. Then two useful components have been added to the default distribution (called "Apoc"): graph algorithms and online backup.

1. Graph algorithms

Since the previous release, the graph algorithm component has been promoted to the default Neo4j distribution package. Here you'll find implementations of algorithms that will help you find the shortest paths, all simple paths or, if you want, all paths between two nodes. You can also use the Dijkstra algorithm to handle weighted paths, or why not the cool A* algorithm, which is useful in a geospatial context among other uses:

astar node space

The above image is from an A* routing example project with code in Ruby or Java.

Since the previous release of the graph algorithm component, much work has been done to improve memory efficiency and speed. It now also uses the new traversal framework under the hood. As a developer, your starting point is the GraphAlgoFactory, which provides access to the algorithms.
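
To give a feel for what a weighted shortest-path search does, here's a minimal standalone sketch of the Dijkstra idea in plain Java. It is not the Neo4j implementation and doesn't use GraphAlgoFactory; the class WeightedGraphSketch and its methods are made up for this illustration, and nodes are just strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative only: a tiny undirected weighted graph, NOT the Neo4j graph algo component.
class WeightedGraphSketch {
    // A queue entry: a node together with the cost of reaching it.
    private static final class Visit implements Comparable<Visit> {
        final String node;
        final double cost;
        Visit(String node, double cost) { this.node = node; this.cost = cost; }
        public int compareTo(Visit other) { return Double.compare(cost, other.cost); }
    }

    // adjacency: node -> (neighbour -> cost of the connecting edge)
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    void connect(String a, String b, double cost) {
        edges.computeIfAbsent(a, k -> new HashMap<>()).put(b, cost);
        edges.computeIfAbsent(b, k -> new HashMap<>()).put(a, cost);
    }

    // Returns the cheapest path from start to end, or an empty list if there is none.
    List<String> cheapestPath(String start, String end) {
        Map<String, Double> best = new HashMap<>();
        Map<String, String> previous = new HashMap<>();
        PriorityQueue<Visit> queue = new PriorityQueue<>();
        best.put(start, 0.0);
        queue.add(new Visit(start, 0.0));
        while (!queue.isEmpty()) {
            Visit current = queue.poll();
            if (current.cost > best.get(current.node)) continue; // stale queue entry
            if (current.node.equals(end)) break;                 // cheapest path found
            Map<String, Double> out = edges.get(current.node);
            if (out == null) continue;
            for (Map.Entry<String, Double> edge : out.entrySet()) {
                double cost = current.cost + edge.getValue();
                Double known = best.get(edge.getKey());
                if (known == null || cost < known) {
                    best.put(edge.getKey(), cost);
                    previous.put(edge.getKey(), current.node);
                    queue.add(new Visit(edge.getKey(), cost));
                }
            }
        }
        if (!best.containsKey(end)) return Collections.emptyList();
        LinkedList<String> path = new LinkedList<>();
        for (String at = end; at != null; at = previous.get(at)) path.addFirst(at);
        return new ArrayList<>(path);
    }
}
```

With edges A-B and B-C of cost 1 and a direct A-C edge of cost 5, the cheapest path from A to C goes via B. A* works along the same lines, but also uses an estimate of the remaining distance to decide which node to expand next.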

2. Event framework

The Neo4j 1.1 kernel includes support for a simple but powerful event framework, which allows you to hook into and react to any substantial change of the graph. For example, let's say that you have a UI widget that displays a specific property on a node. In previous releases, you'd have to manually add code to refresh the widget wherever you modify that specific node. With the 1.1 release, you instead write a simple listener that detects changes to that node and repaints the widget.

You can listen to the following events:

  • beforeCommit
  • afterCommit
  • afterRollback

This will allow you to perform actions such as:

  • read changes before commit
  • modify the transaction
  • decide if the transaction should be committed or not

The most powerful use of the event framework is likely in framework-like components that implement horizontal concerns such as validation and integration of data. You can imagine writing a simple component that automatically keeps a Neo4j IndexService up to date with the graph, so you won't have to manually maintain indices.

You can also hook into the database life cycle:

  • shutdown event
  • kernel panic - for instance when the disk is full

In this case, a typical use is to make sure layers on top of Neo4j are properly shut down before the database shuts down.
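
To make the three transaction events a bit more concrete, here's a small standalone sketch of the pattern in plain Java. Everything in it (TxListener, TinyStore, NameValidator) is a hypothetical stand-in written for this post, not the Neo4j API; it only shows how a beforeCommit hook can read the changes, modify them, or veto the commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a transaction listener; NOT the Neo4j interface.
interface TxListener {
    // May inspect or modify the pending changes; throwing vetoes the commit.
    void beforeCommit(Map<String, Object> pending) throws Exception;
    void afterCommit(Map<String, Object> applied);
    void afterRollback(Map<String, Object> discarded);
}

// Toy key/value store that fires the three callbacks around each commit.
class TinyStore {
    private final Map<String, Object> data = new HashMap<>();
    private final List<TxListener> listeners = new ArrayList<>();

    void register(TxListener listener) { listeners.add(listener); }
    Object get(String key) { return data.get(key); }

    // Returns true if the changes were committed, false if a listener vetoed them.
    boolean commit(Map<String, Object> changes) {
        Map<String, Object> pending = new HashMap<>(changes);
        try {
            for (TxListener l : listeners) l.beforeCommit(pending);
        } catch (Exception veto) {
            for (TxListener l : listeners) l.afterRollback(pending);
            return false;
        }
        data.putAll(pending);
        for (TxListener l : listeners) l.afterCommit(pending);
        return true;
    }
}

// Example listener: validates one property and adds another of its own.
class NameValidator implements TxListener {
    final List<String> log = new ArrayList<>();

    public void beforeCommit(Map<String, Object> pending) {
        if ("".equals(pending.get("name"))) {
            throw new IllegalArgumentException("name must not be empty");
        }
        pending.put("validated", Boolean.TRUE); // modify the transaction
    }
    public void afterCommit(Map<String, Object> applied) { log.add("commit"); }
    public void afterRollback(Map<String, Object> discarded) { log.add("rollback"); }
}
```

Committing a change with a non-empty name goes through and picks up the extra "validated" entry, while a change with an empty name is rejected and leaves the store untouched.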

3. Traversal framework

To extend the possibilities for how to traverse a graph, a new traversal framework has been introduced. It's still in an early stage and has been included in the 1.1 release to gather feedback from users. Even though it's already very useful, be prepared for the API to change somewhat!

One main design goal of the new traversal framework has been to increase the flexibility in how a traverser can be controlled. Examples of improvements compared to the old traversal framework are:

  • The user can select in which order relationships will be followed, which opens up for best-first traversals and fine grained traversal control, e.g. weighted traversals. Breadth-first and Depth-first traversals are just trivial examples of a global static branch selection policy implemented for convenience.
  • Paths now play a central role: the current path during traversal is exposed in the traversal context and paths can be returned as the traversal result.
  • For convenience, traversal results can be returned as nodes, relationships or paths.
  • The uniqueness constraints in a traversal now have a wide range of possibilities, like visiting nodes only once, visiting relationships only once (but possibly revisiting nodes), visiting the same nodes and relationships but in different paths and so on.
  • To reduce the memory footprint of the traversal, uniqueness constraints on nodes or relationships can be set to only guarantee uniqueness among the most recent visited nodes, with a configurable count.
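
The last point, uniqueness among only the most recently visited nodes, can be sketched in a few lines of plain Java. This is just an illustration of the idea under the assumption that "recent" means the last N distinct items seen; the class RecentlyVisited is made up here and is not part of the traversal framework.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: remembers at most `capacity` items and forgets the oldest one first.
class RecentlyVisited<T> {
    private final Set<T> recent;

    RecentlyVisited(final int capacity) {
        // A LinkedHashMap keeps insertion order, so the eldest entry is the oldest visit.
        this.recent = Collections.newSetFromMap(new LinkedHashMap<T, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> eldest) {
                return size() > capacity;
            }
        });
    }

    // Returns true if the item was not among the remembered ones, i.e. it should be visited.
    boolean firstVisit(T item) {
        return recent.add(item);
    }
}
```

The trade-off is the one described above: memory use is bounded by the configured count, but an item that has dropped out of the window may be visited again.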

There's more to it, but the above list should suffice for this blog post! Let's look at a code example to get a view of how the new traversal framework is used.

As seen from this example the framework uses a fluent API:

for ( Path position : Traversal.description()
        .depthFirst()
        .relationships( KNOWS )
        .relationships( LIKES, Direction.INCOMING )
        .prune( Traversal.pruneAfterDepth( 5 ) )
        .traverse( myStartNode ) )
{
    System.out.println( "Path from start node to current position is " + position );
}

The traversal descriptions are immutable and can be reused to create new traversal descriptions. Here's an example of how this is done:

static final TraversalDescription FRIENDS_TRAVERSAL = Traversal.description()
        .relationships( KNOWS )
        .depthFirst()
        .uniqueness( Uniqueness.RELATIONSHIP_GLOBAL );
// ...
// Don't go further than depth 3
for ( Path position : FRIENDS_TRAVERSAL
        .prune( Traversal.pruneAfterDepth( 3 ) )
        .traverse( myNode ) ) {}
// Don't go further than depth 4
for ( Path position : FRIENDS_TRAVERSAL
        .prune( Traversal.pruneAfterDepth( 4 ) )
        .traverse( myNode ) ) {}
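
If you wonder why reusing a description like this is safe, here's a stripped-down sketch of the immutable fluent style in plain Java. DescriptionSketch is a made-up class, not the real TraversalDescription, and it only tracks relationship type names and a maximum depth; the point is that every call returns a new object and leaves the original untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a minimal immutable "description", NOT the Neo4j TraversalDescription.
final class DescriptionSketch {
    private final List<String> relationshipTypes;
    private final int maxDepth;

    private DescriptionSketch(List<String> relationshipTypes, int maxDepth) {
        this.relationshipTypes = Collections.unmodifiableList(relationshipTypes);
        this.maxDepth = maxDepth;
    }

    // Starting point: no relationship types, no depth limit.
    static DescriptionSketch description() {
        return new DescriptionSketch(new ArrayList<String>(), Integer.MAX_VALUE);
    }

    // Copies the current state, adds the type, and returns the copy.
    DescriptionSketch relationships(String type) {
        List<String> copy = new ArrayList<>(relationshipTypes);
        copy.add(type);
        return new DescriptionSketch(copy, maxDepth);
    }

    // Returns a new description with a different depth limit.
    DescriptionSketch pruneAfterDepth(int depth) {
        return new DescriptionSketch(new ArrayList<>(relationshipTypes), depth);
    }

    List<String> relationshipTypes() { return relationshipTypes; }
    int maxDepth() { return maxDepth; }
}
```

Because nothing is ever modified in place, a shared constant can be refined in two different ways without the variants affecting each other or the base description.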

Please note again that the API is likely to change before going final in the next release. We would love feedback on the new traversal framework! Just head over to the mailing list (which is a great place to hang out if you want to learn more about graph databases!) or say something on Twitter.

4. Monitoring

Neo4j now supports monitoring over JMX, so you can use a tool like JConsole to inspect what's happening in a live Neo4j instance. For example, say that your Neo4j-backed web site has been up and running for two days since the last restart. You can then go in and get statistics on how many transactions have been committed, the number of transactions open right now, the total number of transactions that have been rolled back, and LOTS more. Here's an image of the transaction information that is available:

jmx.transactions

Find out more on the wiki!
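
If you'd rather read the numbers from code than from JConsole, the standard JMX API in the JDK is enough. The sketch below is an assumption-laden example: the class JmxPeek is made up, and the "org.neo4j:*" pattern is only a guess at the domain the Neo4j beans are registered under, so check the names JConsole shows for your instance. It runs inside the same JVM as the beans it reads.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustrative helper around the JDK's platform MBean server.
class JmxPeek {
    // Lists the names of all registered MBeans matching a pattern such as "java.lang:*".
    static Set<ObjectName> list(String pattern) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return server.queryNames(new ObjectName(pattern), null);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads a single attribute from the MBean with the given name.
    static Object read(String beanName, String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return server.getAttribute(new ObjectName(beanName), attribute);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed domain pattern; prints nothing if no matching beans are registered.
        for (ObjectName name : list("org.neo4j:*")) {
            System.out.println(name);
        }
        // A bean that every JVM provides, to show what reading an attribute looks like.
        System.out.println(read("java.lang:type=Runtime", "VmName"));
    }
}
```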

If you're using the Neo4j REST server (also see this blog post) there's a lot to look forward to with regard to monitoring. Namely, there's the new Neo4j webadmin project. At the moment it includes:

  • Lifecycle management
  • Monitoring of memory usage, disk usage, cache status and database primitives (nodes, relationships and properties)
  • JMX overview
  • Data browsing
  • Advanced data manipulation via Gremlin console
  • Server configuration
  • Online backups

Here's how the webadmin tool looks at the moment (click for bigger version):

neo4j webadmin dashboard

Currently, the webadmin tool builds on the Neo4j REST layer. In subsequent releases, it will be adapted to also work with embedded Neo4j.

5. Kernel

Much of the news in the Neo4j kernel has already been mentioned, but there are still some points to be addressed:

  • Read operations are not required to be performed inside a transaction any more. This can for example help a lot when doing traversals and gives you more options on the architecture side of things. Note however that in order to read uncommitted data, the read operations still have to be carried out in the corresponding transactional context.
  • At startup, the Neo4j kernel will look at the available amount of RAM and heap space and configure itself accordingly. In most cases there should be no more need to use a detailed configuration. If you have added a lot of data, a restart of Neo4j will let the automatic configuration catch up on what's happened and optimize the configuration.
  • At the creation of a new database, the block sizes for strings and arrays on the storage level can be configured. These settings can't be changed after the database has been created. If you know your data very well, this could be useful if the default settings don't cut it for you.
  • The GraphDatabaseService can now be accessed from every node and relationship, so you don't need to pass the instance around or inject it or whatever you did before.
  • For your convenience, a helpers package has been added to the kernel (previously it was a separate component named "commons"). The most interesting part is the collection helpers. By using them, creating a traverser that returns your domain objects instead of nodes/relationships is a breeze. There are other goodies in there as well, so take a look!

6. Index

Other than bug fixes and performance optimizations, the integrated Lucene index has got some new features:

  • Improved support for removing indexes.
  • Index lookups can be performed without being in a transaction.
  • Exact lookups can be carried out (even when using a fulltext service).
  • Indexing of array values. If a value is an array it's split up and each value in that array is indexed separately.
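
The array handling in the last point is easy to picture with a toy in-memory index. This sketch is not the Lucene-backed index and the class ArrayIndexSketch is invented for illustration; it only shows the rule that an array value is split up and each element gets its own index entry.

```java
import java.lang.reflect.Array;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: key -> value -> node ids, kept in memory.
class ArrayIndexSketch {
    private final Map<String, Map<Object, Set<Long>>> index = new HashMap<>();

    // Indexes a value for a node; arrays are split and each element is indexed separately.
    void index(long nodeId, String key, Object value) {
        if (value != null && value.getClass().isArray()) {
            for (int i = 0; i < Array.getLength(value); i++) {
                index(nodeId, key, Array.get(value, i));
            }
            return;
        }
        index.computeIfAbsent(key, k -> new HashMap<>())
             .computeIfAbsent(value, v -> new TreeSet<>())
             .add(nodeId);
    }

    // Exact lookup: returns the ids of all nodes indexed with this key and value.
    Set<Long> lookup(String key, Object value) {
        Map<Object, Set<Long>> byValue = index.get(key);
        if (byValue == null) return Collections.emptySet();
        Set<Long> ids = byValue.get(value);
        return ids == null ? Collections.<Long>emptySet() : ids;
    }
}
```

So a node indexed with an array of tags can be found by an exact lookup on any single tag.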

7. Online backup

Of course you want to back up your Neo4j database while it's running, and in this release the online backup component is included in the default distribution package. This is an example of how to use it from your code:

EmbeddedGraphDatabase graphDb = getTheGraphDbFromApp();
String location = "/var/backup/neo4j-db";
// this will include the integrated lucene indexes as well
Backup backup = Neo4jBackup.allDataSources( graphDb, location );
backup.doBackup();

Conclusion

If you haven't already played with Neo4j, there are now some more reasons to do so! And if you have, the fun will be even greater now!

The main starting points are:
