Wednesday, May 29, 2013

New Milestone Release Neo4j 2.0.0-M03

The latest M03 milestone release of Neo4j 2.0 is, as you might have expected, all about improvements to Cypher. This blog post also discusses some changes made in the previous milestone (M02) which we didn't fully cover at the time.

Cypher now contains a MERGE clause, which is a pretty big deal: it will replace CREATE UNIQUE, as it also takes indexes and labels into account and can even be used for single-node creation. MERGE either matches the pattern in the graph and returns what is there (one or more results) or, if it doesn't find anything, creates the path given. So after the MERGE operation completes, Neo4j guarantees that the declared pattern is present.

We also added additional clauses to the MERGE statement which allow you to create or update properties as a function of whether the node was matched or created. Please note that -- as patterns can contain multiple named nodes and relationships -- you will have to specify the element for which you want to trigger an update operation upon creation or match.

MERGE (keanu:Person { name:'Keanu Reeves' })
ON CREATE keanu SET keanu.created = timestamp()
ON MATCH  keanu SET keanu.lastSeen = timestamp()
RETURN keanu

We put MERGE out mainly to collect feedback on the syntax and usage. There are still some caveats: for example, it does not yet grab locks for unique creation, so you might end up with duplicate nodes for now. That will all be fixed by the final release.

Going along with MERGE, MATCH now also supports single node patterns, both with and without labels.
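For example, both of the following are now valid queries on their own (a quick sketch; the Person label and name property are just illustrative):

MATCH (n:Person)
WHERE n.name = 'Keanu Reeves'
RETURN n

MATCH n
RETURN count(*)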

Cypher Changes
Two new functions startNode(rel) and endNode(rel) allow quick access to both ends of a relationship.
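A quick sketch of their use (the KNOWS relationship type and name property are just assumptions for illustration):

MATCH (a)-[r:KNOWS]->(b)
RETURN startNode(r).name, endNode(r).name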

Besides the existing USING index hints, you can now also require Cypher to scan the given labels for nodes (if it doesn’t do so automatically).
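A sketch of such a label-scan hint, as of this milestone (the Person label and name property are illustrative):

MATCH (p:Person)
USING SCAN p:Person
WHERE p.name = 'Keanu Reeves'
RETURN p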

The Cypher ExecutionResult is now closeable and will immediately release resources upon closing. It’s no longer necessary to exhaust it.

We also fixed an issue with UNION and textual output, now close read-only index results, and fixed an issue where index lookups failed with literal collections.

Transactional Cypher HTTP endpoint
With Milestone 2 we already added a new Cypher HTTP endpoint that allows transactions to span multiple HTTP requests. This endpoint only works with Cypher statements, of which you can post several in a single go. It streams data from and to the server and has a much more concise format for returning the data: returned nodes and relationships are represented just by their properties (a JSON map). Aside from improved support for transactionality, this API should perform better than the existing (older) Cypher REST endpoint because of the decreased verbosity of the response.

You can use this endpoint to post a bunch of statements in one go, or post multiple read and write statements in sequence. Rollbacks are requested by HTTP DELETE requests and commits by POSTing to a commit URL.
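For instance, rolling back an open transaction is just a DELETE against its URL (a sketch following the endpoint's URL scheme):

>> DELETE http://localhost:7474/db/data/transaction/3

<< 200: OK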

Here is an example session:

Create Transaction and do some work.
>> POST http://localhost:7474/db/data/transaction
{ "statements" : [ {
   "statement" : "CREATE (n {props}) RETURN n",
   "parameters" : { "props" : { "name": "My Node" } }
} ] }

<< 201: Created
Location: http://localhost:7474/db/data/transaction/3
{ "commit" : "http://localhost:7474/db/data/transaction/3/commit",
  "results" : [ { "columns" : [ "n" ],
      "data" : [ [ { "name" : "My Node" } ] ] } ],
  "transaction" : { "expires" : "Tue, 28 May 2013 13:19:59 +0000" },
  "errors" : [ ] }

Commit transaction
>> POST http://localhost:7474/db/data/transaction/3/commit
{ "statements" : [ {
   "statement" : "MATCH n RETURN id(n)"
 } ] }

<< 200: OK
{ "results" : [ { "columns" : [ "id(n)" ],
      "data" : [ [ 2 ] ] } ],
  "errors" : [ ] }

The endpoint now supports transaction timeouts and keep-alive.
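For example, you can reset the timeout of an open transaction by POSTing an empty statement list to it (a sketch, using the same URL scheme):

>> POST http://localhost:7474/db/data/transaction/3
{ "statements" : [ ] }

<< 200: OK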

In the Neo4j-Shell we added commands for listing automatic indexes and their state.
Also the BatchInserter is now label aware, allowing you to create initial Neo4j stores containing labels.

Breaking Changes
Please note that you cannot upgrade a graph database store from 2.0.0-M02 to 2.0.0-M03 due to incompatible changes in the store files. Please recreate your database for the newer version.

Milestone 2 contained some breaking changes:

  • Replaced protected fields from org.neo4j.graphdb.factory.GraphDatabaseFactory
  • Removed org.neo4j.graphdb.index.BatchInserterIndex and BatchInserterIndexProvider; please use the ones in the package org.neo4j.unsafe.batchinsert instead
  • The BatchInserter and the BatchGraphDatabase are not binary compatible with 1.9, because some methods now take a varargs array of labels as their last argument; please recompile your code
  • We removed the alternative WITH syntax (==== a,b,c ====)

As always we welcome you to try out the new milestone release and report back any feedback, issues or suggestions.

Latest documentation is here. You can also find Neo4j 2.0.0-M03 on Maven Central.

Thanks a lot to Michael Bach, Rich Simon, Robert Herschke, Wes Freeman, Aseem Kishore, Morteza Milani, Javier de la Rosa and many more for great feedback on Neo4j 2.0. Keep it coming.

Friday, May 24, 2013

Graph Databases and Software Metrics & Analysis

This is the first in a series of blog posts that discuss the usage of a graph database like Neo4j to store, compute and visualize a variety of software metrics and other types of software analytics (method call hierarchies, transitive closure, critical path analysis, volatility & code quality). Follow-up posts by different contributors will be linked from this one.

Everyone who works in software development comes across software metrics at some point.
Whether out of simple curiosity about the quality or complexity of the code we've written, or out of a real interest in improving quality and reducing technical debt, there are many reasons to look at them.
In general there are many ways of approaching this topic, from just gathering and rendering statistics in diagrams to visualizing the structure of programs and systems.

There are a number of commercial and free tools available that compute software metrics and help expose the current trend in your projects development.
Software metrics can cover different areas. Computing cyclomatic complexity or analysing dependencies and call traces is probably easy; using static analysis to find smaller or larger issues is more involved; and detecting code smells can be an interesting challenge in AST parsing.

Interestingly, many visualizations in and around software development are graph visualizations: from class and other (UML) diagrams, via dependency tracing between and within projects, to architectural analysis. One of the reasons for this might be that source code in general can easily be represented as graphs. On the one hand we have trees, especially (abstract) syntax or parse trees (per file, class or structural element); on the other, the actual dependencies from project, package and class down to method level form a huge directed (cyclic) graph. Related topics like application wiring (DI), system orchestration, and hard- and software networks are effectively graph structures too.

So, having a graph database like Neo4j at hand, what could be more obvious than parsing software projects at a certain level and importing the information into the graph database? A graph structure that would accommodate the information quite well is a direct representation of the concepts in the software projects: projects, packages, classes, interfaces, types, methods and fields, connected by relationships like dependencies, usage, creation, containment, calls, coverage, etc.
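As a tiny illustration of such a model in Cypher (the relationship types and property names here are made-up examples, not a fixed schema):

CREATE (p {name:'org.example'}),
       (c {name:'org.example.Foo'}),
       (d {name:'org.example.Bar'}),
       (p)-[:CONTAINS]->(c),
       (c)-[:DEPENDS_ON]->(d)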

Simple Graph Model for Dependency Analysis

Having achieved this, what you do with the data is completely up to your interests and needs. Computing metrics, visualizing and tracing dependencies, finding violations of architectural rules, detecting co-usage of classes, spotting interesting patterns or code smells: there are many possibilities.

Just to give one example, here is a Cypher query that calculates the top 10 classes with the longest inheritance paths:

START root=node:types(class="java.lang.Object")
MATCH chain = (root)<-[:EXTENDS*]-(leaf)
RETURN extract(class IN nodes(chain) : class.name) AS classes,
       length(chain) AS depth
ORDER BY depth DESC
LIMIT 10

Other tools besides Cypher to help you with this endeavour are:

  • ASM, Antlr or similar parsers for parsing byte- or source code.
  • Neo4j-Shell for exploration
  • Visualisation with GraphViz, D3, VivaGraphJS, Linkurious or others

Another option is to take a time dimension into account to see how structure, elements and relationships change over time.

So it is not surprising that quite a number of people found this topic interesting enough to invest time and energy into creating intriguing and insightful examples of using graph databases in this field. We asked all the participants listed below to write a blog post detailing their idea and to make their code/approach accessible. We start by linking to existing resources but will update the links as soon as the blog posts are online.

  • Raoul-Gabriel Urma: Expressive and Scalable Source Code Queries with Graph Databases (Paper)
  • Rickard Öberg: NeoMVN is tracing maven dependencies (GitHub)
  • Pavlo Baron: Graphlr, an ANTLR storage in Neo4j (GitHub)
  • Dirk Mahler: jQAssistant Enforcing architectural constraints as part of your build process with Neo4j, Cypher and Labels
  • Michael Hunger: Class-Graph, leverages Cypher to collect structural insights about your Java projects (GitHub), (Slideshare)

Thursday, May 23, 2013

Neo4j Community Ecosystem Project Releases for 1.9.GA

In the wake of Tuesday's Neo4j 1.9 GA release, the Neo4j community released a number of dependent ecosystem projects for Neo4j 1.9 to the Neo4j Maven repository.
Some of them follow a new versioning scheme.

Here is the list for 1.9:

There were also a few releases for Neo4j 1.8.2:
  • Neo4j Spatial org.neo4j:neo4j-spatial:0.9-neo4j-1.8.2
  • Neo4j Graph Collections: org.neo4j:neo4j-graph-collections:0.4-neo4j-1.8.2
We'll update that list with further releases.

Please note that we plan to use a GitHub-based release repository for ecosystem projects in the future (starting with Neo4j 2.0).

We will also transition the projects that are still hosted in the Neo4j GitHub repository to Neo4j-Contrib.


Michael for the Neo4j community (team)

Tuesday, May 21, 2013

Neo4j 1.9 General Availability Announcement!

After over a year of R&D, five milestone releases, and two release candidates, we are happy to release Neo4j 1.9 today! It is available for download effective immediately. And the latest source code is available, as always, on Github.

The 1.9 release adds primarily three things:
  1. Auto-Clustering, which makes Neo4j Enterprise clustering more robust & easier to administer, with fewer moving parts
  2. Cypher language improvements make the language more functionally powerful and more performant, and
  3. New welcome pages make learning easier for new users

Auto-Clustering Capability

The vision for 1.9 started over a year ago. We were exploring ways to make Neo4j Enterprise clustering more operationally self-sufficient, as well as easy to set up and deploy. An idea that emerged was to take on the cluster coordination functions that were at the time being delegated to Zookeeper. This would mean being able to run a high-availability cluster without having to maintain and operate a separate Zookeeper cluster. After some amount of research, we embarked on the path of taking everything that we’d learned about high-availability clustering--including bringing in a leading-edge consensus protocol known as Paxos-- and set about the task of baking all of this knowledge and resilience into the product. The result, Neo4j 1.9 Enterprise, is an even more robust and manageable high availability graph cluster that runs wholly autonomously, without the need for Zookeeper. Fewer moving parts means fewer things to set up, configure, and manage. We figure that's a good deal.

Cypher Language Improvements

Over the course of building 1.9, we added a number of other improvements spanning all of the editions of the product. Cypher received quite a bit of attention, with improvements ranging from a new set of functions (string handling, REDUCE, TIMESTAMP), to improved memory utilization with aggregate and LIMIT operations. Also new in 1.9 is a Cypher profiler (rudimentary but already very useful, and slotted to improve over time). We also introduced the ability to “ORDER BY”, “SKIP” and “LIMIT” in conjunction with a WITH clause and semi-automatic string conversion. And of course we made a number of general performance improvements to the Cypher optimizer, as we now do with every Neo4j release; and upgraded the Scala version to 2.10.

Other Additions

Besides Cypher and auto-clustering, the clustering architecture now includes a new neo4j-arbiter, for use in consensus making in clusters with an even number of instances. New REST endpoints have been added for inspecting cluster status information (master, slave, etc.). Online Backup in Neo4j Enterprise now auto-detects full vs. incremental backup based on the existing content at the backup location. Performance improvements have been made across the board, and for new users, the Neo4j Web UI now sports a Welcome Guide that explains the basics of getting started.

This release is brought to you not only by the efforts of the Neo Technology team, but owes its existence to your contributions, questions, and requests. We'd like to thank you -- the Neo4j community -- for all of your engaged feedback. Please keep it coming!

We hope you enjoy this new release!

The Neo4j Team

P.S. Please be sure to have a look at the list of deprecations, to make sure your application aligns with where the product is going.

P.P.S. If you're interested in learning more about this release, we encourage you to have a look at the release notes. This is of particular note if you plan to upgrade from an earlier version of Neo4j HA, as the new clustering architecture involves some operational changes.

We also welcome you to attend one of our world-wide trainings or a GraphConnect conference near you.

Saturday, May 18, 2013

Reloading my Beergraph - using an in-graph-alcohol-percentage-index

What happened before

As  you may remember, I created a little beer graph some time ago to experiment and have fun with beer, and graphs. And yes, I have been having LOTS of fun with it - using it to explain graph concepts to lots of not-so-technical folks, like myself. Many people liked it, and even more people had some questions about it - started thinking in graphs, basically. Which is way more than what I ever hoped for - so that's great!

One of the questions that people always asked me was about the model. Why did I model things the way I did? Are there no other ways to model this domain? What would be the *best* way to model it? All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model - it's the QUERY that will drive the modelling decisions.

One of the things that spurred the discussion was - probably not coincidentally - the AlcoholPercentage. Many people were expecting that to be a *property* of the Beerbrand - but instead in my beergraph, I had "pulled it out". The main reason at the time was more coincidence than anything else, but when you think of it - it's actually a fantastic thing to "pull things out" and normalise the data model much further than you probably would in a relational model. By making the alcoholpercentage a node of its own, it allowed me to do more interesting queries and pathfinding operations - which led to interesting beer recommendations. Which is what this is all about, right?

Taking the AlcoholPercentage to the next level

So in my new version of my beergraph, I have done something different. I used the example of Peter to create an in-graph index of AlcoholPercentages - a bit like the picture of the new model that you see here.

Essentially what I am doing is connecting all the alcohol-percentages into a chain of alcoholpercentages, using the [:PRECEDES] relationship. In Cypher-style ascii-art that would be something like

... -(alcperc-0.2)-[:PRECEDES]->(alcperc-0.1)-[:PRECEDES]->(alcperc)-[:PRECEDES]->(alcperc+0.1)-[:PRECEDES]->(alcperc+0.2)- ...

To do this, I of course did have to modify my beer-spreadsheet a little bit. You can find the new version over here. But from the screenshot below you can see that all I did was create another tab that had all the alcoholpercentages and that "PRECEDES" relationship between them. Easy peasy.

Nice. So what? The resulting dataset is very similar to what we had before - it's just a little bit richer. You immediately notice it as you start "walking" the graph in the WebUI: the links into the AlcoholPercentage chain give me a new and interesting way to explore the graph.

But what else what can we do with this? Well, querying it is the obvious answer. Let me give you a couple of examples:
  • how can I find beers that have the same beertype and a "same or similar" alcoholpercentage (let's say + or - 1%) as a beer that I really like (Orval)? That's now become very easy:
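In Cypher it looks something like this (a sketch: the IS_A and HAS_ALCOHOL_PERCENTAGE relationship names and the auto-index lookup are assumptions against my spreadsheet import; with 0.1% steps, plus or minus 1% means at most ten PRECEDES hops):

START orval=node:node_auto_index(name="Orval")
MATCH (orval)-[:IS_A]->(beertype)<-[:IS_A]-(other),
      (orval)-[:HAS_ALCOHOL_PERCENTAGE]->()-[:PRECEDES*0..10]-(perc)<-[:HAS_ALCOHOL_PERCENTAGE]-(other)
RETURN DISTINCT other.name, perc.name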


Or another example:

  • how can I find other beers from the same brewery that have a similar AlcoholPercentage as a beer that I also like (Duvel)?
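Sketched in Cypher (again, the BREWS and HAS_ALCOHOL_PERCENTAGE relationship names and the index lookup are assumptions against my model):

START duvel=node:node_auto_index(name="Duvel")
MATCH (brewery)-[:BREWS]->(duvel),
      (brewery)-[:BREWS]->(other),
      (duvel)-[:HAS_ALCOHOL_PERCENTAGE]->()-[:PRECEDES*0..10]-(perc)<-[:HAS_ALCOHOL_PERCENTAGE]-(other)
WHERE other <> duvel
RETURN DISTINCT other.name, perc.name
ORDER BY other.name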

Both of the queries above gave me some new, interesting insights that I did not have before, allowing me to discover even more and nicer Belgian beers. But what's important, of course, is that these in-graph indexes are fantastically interesting. By "pulling the data out", normalising even further, and then indexing the normalised data as a subgraph of its own, we can much more easily derive new and interesting insights. And that, my dear friends, is what graphs are all about :) ...

Hope this was useful. If you like this post and want to discuss more about graphs and beer, please come to our Graph Café in June in Antwerp or Amsterdam - or at a pub near you?

Wednesday, May 15, 2013

New Incubator Project: Neo4j Mobile for Android v0.1!

During this busy week of Android hacking at Google I/O, we are pleased to announce an amazing new Community project, for all of those who have been yearning to run Neo4j on mobile: Neo4j Mobile for Android v0.1! This project is available today on GitHub for hacking, experimenting, evolution, and use. As the 0.1 version number indicates, this is an incubation project. This means that it's fully functional, but early days, and hence experimental and as-yet unsupported. We'll leave this in your capable hands to play with, extend, and comment upon.

For those of you who are in San Francisco, we'll be holding a launch event tomorrow at #GoogleIO on Wednesday May 15.

A few facts to get you started:

  • First and foremost: credit where credit is due! You can thank Noser Engineering AG for this amazing bit of work. Noser has deep expertise in mobile and embedded devices, leading to this port, which was originally done for a client of theirs.
  • How does it work? Neo4j Mobile for Android runs as a service, and is accessed via Android Inter-Process Communication using an AIDL (Android Interface Definition Language) connector. This makes it possible for multiple apps on an Android device to access the same database service. The picture below summarizes an example application  architecture, taken from the original project, showing Neo4j Mobile for Android interacting with other solution elements:

  • What version of Neo4j is the Android port based on? The Android version is based on an older milestone version: 1.5 M05, which has been modified to run on Android. The primary thing to be aware of is that this version doesn’t support Cypher. (It also doesn’t include a number of Server components, that simply aren’t necessary or relevant on Android.) Upgrading to a more recent version of Neo4j that does have better support for Cypher is clearly something that will need to get done.
  • Building the Project. The project builds on Ant. Eventually this could be changed to use Maven. As Ant is pretty common and well-accepted in the Android world, we decided not to change the build system.
  • Licensing. Like our Community edition, the license is GPL. If you want to redistribute Neo4j with your app, but aren’t planning on making your app open source, come talk to us, and we’ll work with you to make that happen. This is an incubator project, so we’re not expecting you’ll go live with this tomorrow.
  • Where do I go for questions? This is a community-supported project. Post your questions on the Neo4j Google Group. If you need consulting assistance or have a specific project in mind, you can contact Noser (the authors of this port). For questions about commercial licensing, please contact Neo Technology.
  • Is there a test app that I can use? Yes. There’s a (very) rudimentary test app that is included in the project. (See the screenshot to the right.) We’re working on a webcast and some more detailed instructions, and will get those posted when we can.

  • Can I replicate this database? There are no replication services built in, but you can certainly do this at the application layer. Noser has done this and will cover it in their upcoming webinar.
  • How can I learn more? We’re planning a webinar with Urs Boehm, who is Noser’s lead engineer on the Android port. We’ll post webinar details online, and will send invites to the general Neo4j mailing list once it’s been scheduled. You can sign up to the Neo mailing list here.


Philip Rathle
Senior Director of Products

Monday, May 13, 2013

Neo Technology and the LDBC project - an update

A bit has happened since I (Alex Averbuch) last updated you about progress in the LDBC (Linked Data Benchmark Council), and Neo’s part in it. So, without further ado and in no particular order, here’s what we’ve been doing...

Second Technical User Community meeting

Of high priority for the LDBC is getting industry input on benchmark development - benchmarks that are not interesting to industry are generally not very interesting. To address this we engage with industry via bi-annual Technical User Community (TUC) meetings, where experts from both industry and academia are invited to present their data management use cases and participate in the LDBC benchmark development process.

This past April the second TUC meeting was hosted in Munich by the Technical University of Munich, an academic partner of the LDBC project.

The two-day meeting, dominated by presentations and subsequent discussion, was a complete success. Many thought leaders from leading graph/RDF data management organizations (both academic and industry) were there to give talks. Among them were: Wolters Kluwer, BBC, R.J. Lee Group, Oracle, Dshini, BNF, St. Judes Medical, UCB, Brox, ACCESO, Actify, Max Planck Institute for Informatics (presenting the YAGO project), Mediapro, University of Cyprus, AGT International, UIBK, and OpenPhacts.

One highlight was the talk by Klaus Großmann (Dshini CTO) entitled Neo4j at Dshini (Dshini is a German social network that aims to generate purchasing power through activity only - members earn virtual currency, save up and redeem it to fulfill their wishes).

In his presentation, Klaus shared his experience of using Neo4j as the main data storage technology at Dshini, and provided many insights regarding graph data modeling in the real world. A great talk and very useful input to our benchmark design process - perfect illustration of the value gained by involving industry in the LDBC!  

For anyone that’s interested, slides from most of the talks are available here. Thanks to all who participated!

Neo Technology in upcoming workshops and conferences

A natural byproduct of Neo’s participation in the LDBC is a general increased presence in academic circles. In the coming months Neo will be present and participating in a number of exciting events, including (but not limited to) the GRADES and GraphLab workshops.

GRADES workshop (23rd of June in NYC): co-sponsored by the LDBC, this workshop is designed to spark discussion and descriptions of application areas and open challenges related to the management of large-scale graph data. Neo will contribute both as organizer (as a member of the program committee) and as participant; in collaboration with the Institute for Scientific Interchange Foundation and the SocioPatterns project, Ciro Cattuto, André Panisson, Marco Quaggiotto, and I will present a paper at GRADES about modeling time-varying social graphs in Neo4j.

GraphLab workshop (1st of July in SFO): also co-sponsored by the LDBC, this event will focus on large scale machine learning on sparse graphs. Here too Neo is a member of the program committee, and we will have a number of representatives at the event.
Both I and my colleague Philip Rathle will be at the event, to represent Neo and the LDBC project.

Not to mention GraphConnect (@GraphConnect)... this will be a series of five conferences across the USA and England, held between June-November of this year!

Recent benchmark efforts, their relevance, and what we're busy building

Lately a number of graph database-related micro-benchmarking efforts have been published; these are obviously interesting to Neo, both in general and in the context of LDBC. Though a growing number of such examples are popping up, a recent one that stands out is LinkBench from Facebook. More specifically, what stands out is the data generator embedded in LinkBench.  

The general 'problem' with generators is they generate synthetic data, the data is not real and its characteristics perhaps not representative of the real world. LinkBench is unique in that it was developed at Facebook - few organizations have access to a real social network dataset as immense or rich as that of Facebook’s. This puts Facebook researchers in the unique position of being able to verify the “realisticness” (I just made it a word...) of the data generators they develop - and, now, Facebook have made LinkBench public, along with details of its data generator!

How does this relate to the LDBC?  

It assists us in developing more meaningful benchmarks. 
We (Vrije University and the Polytechnic University of Catalonia in particular) are in the process of developing the LDBC data generator - a continuation of the work performed by Vrije University on the SIB social network generator. We've now gone through the process of evaluating LinkBench (and a number of real datasets) and are modifying the LDBC data generator, applying the lessons learned to improve the generator's "realisticness".

In parallel, we've also started development of a benchmark driver, for future LDBC benchmarks to use. More on that in a later post!  

The first versions of both the LDBC benchmark driver and LDBC data generator will be published on our public github account as soon as we have something to share!   

In the meantime, stay up to date with the LDBC project via LinkedIn, Twitter (@LDBCproject), Facebook, or the main project page.