Friday, February 17, 2012

Modeling a multilevel index in Neo4j

Hi all,

Today, for my lab project, I decided to model an in-graph index in Neo4j and query it with the Cypher Query Language.

The basic problem we are trying to solve here is ordering events in a timeline and asking for ranges of events ordered in time, without needing to load the whole timeline or having an external index like Lucene do the sorting (which is very costly). A simple approach is a multilevel tree, where you attach the domain nodes to the leaves of the index tree and query by traversing that structure.



Now, to ask for all Events between 2011-01-01 and 2011-01-03, you simply find the starting and ending paths for these dates in the index (in this case they share the upper part of the tree). Then you collect the Events hanging off the Day nodes, which are ordered via the NEXT relationships, by following the VALUE relationships where they exist.


All five of these segments of the query structure can be expressed in a single Cypher query:

START root=node:node_auto_index(name = 'Root')
MATCH 
  commonPath=root-[:`2011`]->()-[:`01`]->commonRootEnd,
  startPath=commonRootEnd-[:`01`]->startLeaf,
  endPath=commonRootEnd-[:`03`]->endLeaf,
  valuePath=startLeaf-[:NEXT*0..]->middle-[:NEXT*0..]->endLeaf,
  values=middle-[:VALUE]->event
RETURN event.name
ORDER BY event.name ASC

This returns Event2 and Event3. That may seem surprising at first, since the query only collects events from the middle nodes, but notice that the variable-length path [:NEXT*0..] includes length 0 and has no upper limit, so middle can also be startLeaf or endLeaf. Because startLeaf and endLeaf are bound through the previous path definitions, they act as the boundaries of the range.
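To make the traversal easier to follow outside the database, here is a minimal Python sketch of the same idea: a Root -> year -> month -> day tree whose day leaves are chained with NEXT and carry their events via VALUE. The `Node` class, the helper names, and the sample events are assumptions made up for illustration; this is not Neo4j code and not the dataset used in the post.

```python
from datetime import date


class Node:
    """A toy graph node: name plus outgoing relationships keyed by type."""

    def __init__(self, name):
        self.name = name
        self.rels = {}  # relationship type -> list of target nodes

    def relate(self, rel_type, target):
        self.rels.setdefault(rel_type, []).append(target)
        return target

    def follow(self, rel_type):
        """Return the first target of the given relationship type, or None."""
        targets = self.rels.get(rel_type, [])
        return targets[0] if targets else None


def path_keys(day):
    """Relationship types used on the way down: year, month, day."""
    return (f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}")


def build_index(events_by_day):
    """Build the index tree and chain the day leaves in time order via NEXT."""
    root = Node("Root")
    previous_leaf = None
    for day in sorted(events_by_day):
        node = root
        for key in path_keys(day):
            child = node.follow(key)
            if child is None:
                child = node.relate(key, Node(key))
            node = child
        if previous_leaf is not None:
            previous_leaf.relate("NEXT", node)
        previous_leaf = node
        for event_name in events_by_day[day]:
            node.relate("VALUE", Node(event_name))
    return root


def find_leaf(root, day):
    """Walk Root -> year -> month -> day; None if that day is not indexed."""
    node = root
    for key in path_keys(day):
        node = node.follow(key)
        if node is None:
            return None
    return node


def events_between(root, start, end):
    """Collect event names from start leaf to end leaf, both included."""
    start_leaf, end_leaf = find_leaf(root, start), find_leaf(root, end)
    if start_leaf is None or end_leaf is None:
        return []
    names = []
    leaf = start_leaf
    while leaf is not None:
        names.extend(event.name for event in leaf.rels.get("VALUE", []))
        if leaf is end_leaf:
            return names
        leaf = leaf.follow("NEXT")
    return []  # end leaf is not reachable from start leaf via NEXT


# Hypothetical sample data, chosen only to exercise the range query.
sample = {
    date(2010, 12, 31): ["Event1"],
    date(2011, 1, 1): ["Event2"],
    date(2011, 1, 3): ["Event3"],
}
root = build_index(sample)
print(events_between(root, date(2011, 1, 1), date(2011, 1, 3)))
```

As in the Cypher query, both boundary days must exist in the index, and the boundary leaves are part of the result because the walk along NEXT may take zero steps.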

Some more examples on this data structure are available as part of the Neo4j Manual in the Cypher Cookbook section.

Happy hacking!

/peter


Tuesday, February 14, 2012

Webinar Follow Up: How to Get Started with Neo4j

Hey everyone,

We held our How to Get Started with Neo4j webinar last week, and received lots of great questions from our participants.


Here are the questions captured in the Q&A section. If you don't see your question here, please be sure to join our Neo4j User Group, where our community will be sure to help you out.


What are your experiences in the Medicare/Medicaid business world, and/or real-world cases that handle thousands of simultaneous requests?
All of our commercial customers using Neo4j in production can be found here. Neo4j is used within the social media space, geo-spatial arena, telcos, and many other sectors.

As for our open-source community, there are so many projects going on that it is better to ask the community directly. Go to our User Group and ask if anyone is using Neo4j for medical records.


How is the performance and scalability of Neo4j compared with something like MongoDB?
This all depends on what type of data you have. If you want to be able to throw your data somewhere quickly, Mongo is a great tool for that. If you have complex data with lots of connections, and want to be able to quickly retrieve data between different data points, Neo4j is a better fit. The great thing about NOSQL databases is that they are not a "one size fits all" model. In fact, you can use more than one database for your set of data. We happen to be seeing that data is becoming more connected by nature, and the benefits of using a graph database are growing rapidly.


How do you compare Neo4j with Cassandra or Hadoop?
Again, this all depends on what type of data you have. Cassandra is in the column-family category of NOSQL databases, all of which offer great scalability on a very simple data model. Neo4j is on the other end of the curve, with a rich data model but less scalability.

Since Hadoop is a framework for conducting analytics on large data sets, it is more comparable to projects like Golden Orb if you're interested in Pregel-style graph analytics versus map-reduce.


How do you decide between modeling tags as nodes or relations as more and more actions can be performed on tags themselves?
Just as you consider queries when designing an RDBMS schema, it's helpful to consider the graph traversals (queries) you want to run when laying out the structure of your graph. Use a whiteboard, sketch out example data, and see how natural it is to answer questions by following paths in the graph.

On the topic of ACLs, with the rise of OAuth 2, have you seen OAuth 2 token ACLs modeled using a graph DB?
OAuth 2 ACLs do make perfect sense to capture in a graph, but they haven't yet come up in our discussion group. Let us know if you embark on such a project; I'm sure a lot of people would find it interesting.


How do you resolve duplicates?
Nodes and Relationships can be created with unique properties. See the documentation for details. For existing data, your application would have to scan through all nodes to check.
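As a rough illustration of the "scan through all nodes" step for existing data, here is a small Python sketch. It works on plain dictionaries rather than a live database; the node layout, the `find_duplicates` helper, and the `email` property are hypothetical and only stand in for whatever your application reads from the graph.

```python
from collections import defaultdict


def find_duplicates(nodes, key):
    """Group node ids by the value of `key`; keep values shared by 2+ nodes."""
    groups = defaultdict(list)
    for node in nodes:
        value = node["properties"].get(key)
        if value is not None:  # nodes without the property are skipped
            groups[value].append(node["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


# Hypothetical nodes as an application might collect them during a full scan.
nodes = [
    {"id": 1, "properties": {"email": "a@example.com"}},
    {"id": 2, "properties": {"email": "b@example.com"}},
    {"id": 3, "properties": {"email": "a@example.com"}},
    {"id": 4, "properties": {}},
]
print(find_duplicates(nodes, "email"))  # {'a@example.com': [1, 3]}
```

What to do with each duplicate group (merge, delete, re-point relationships) depends on your domain and is left out here.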

The documentation for Cypher seems to be a bit sparse. Are there efforts to more fully document the Cypher query language formally?
While Cypher is very mature, stable, and capable, we have not released a fixed language specification because it continues to evolve. For now, the Neo4j Manual contains all the latest (fully tested) information about Cypher syntax while we work towards a feature-complete language.

Is there a project underway to run Neo4j natively in .NET?
Unfortunately, no. There are client drivers available, but the Neo4j kernel is targeted very specifically at Java 1.6.

What kind of automated operational monitoring support is there in Neo4j (e.g. JMX)?
Neo4j does expose monitoring through JMX, which is also available through REST endpoints that are visible in Webadmin.

How do I migrate existing data from a relational DB to Neo4j?
For initial data import, Neo4j offers a batch insertion mode that relaxes transactional requirements to enable higher throughput. Common practice is to migrate incrementally, identifying the tables involved in complex joins and mapping the schema to a graph layout.

Is there a strategy for some complex relational databases to be migrated to the Neo4j model, for example Sybase to Neo4j?
There is no automated tool for migration from a relational database, though we have done lab work on synchronizing relational tables with a companion graph. Migration is typically achieved with a custom importer written in Java (or any JVM language) that uses the batch insertion mode. See the Neo4j Manual for some guidance.
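To give a feel for what "mapping the schema to a graph layout" can mean, here is a minimal Python sketch that turns rows from two entity tables into nodes and rows from a join table into relationships. The table names, column names, and the `WORKS_AT` relationship type are assumptions invented for the example, and the sketch deliberately uses plain dictionaries instead of the batch insertion mode; a real importer would write the resulting nodes and relationships through that mode.

```python
def rows_to_graph(people, companies, employment):
    """Map entity rows to nodes and join-table rows to relationships."""
    nodes = {}
    for row in people:
        nodes[("person", row["id"])] = {"name": row["name"]}
    for row in companies:
        nodes[("company", row["id"])] = {"name": row["name"]}

    relationships = []
    for row in employment:
        start = ("person", row["person_id"])
        end = ("company", row["company_id"])
        # Extra join-table columns become relationship properties.
        relationships.append((start, "WORKS_AT", end, {"since": row["since"]}))
    return nodes, relationships


# Hypothetical relational rows.
people = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
companies = [{"id": 10, "name": "Acme"}]
employment = [
    {"person_id": 1, "company_id": 10, "since": 2009},
    {"person_id": 2, "company_id": 10, "since": 2011},
]

nodes, relationships = rows_to_graph(people, companies, employment)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

The join table disappears as a separate entity: each of its rows becomes one relationship, which is usually where complex joins turn into simple traversals.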

Is there something more on data modelling in Neo4j, and how to structure data?
We have given workshops about best practices, and will consider scheduling a webinar and writing some blog posts to discuss best practices for structuring data in a graph.

Is there a way to apply continuous location-dependent queries (over moving objects) on graph-based spatial models in Neo4j?
This sounds like an RDBMS view, which doesn't have an equivalent in Neo4j. Traversal queries execute quickly, but are lazy-loaded without read-locking. So the trick here would be balancing write updates for the moving objects with the timing of the spatial reads.



We have some great meetups and events coming up, and don't forget to sign up for our next webinar on Spring Data Neo4j in the Cloud, taking place February 16th.


-ayeeson

Friday, February 3, 2012

Webinar Follow Up: Intro to Graph Databases

Hey everyone,

Another awesome turnout at our Intro to Graph Databases webinar last week. We had loads of questions throughout the session, and we thank all of you for attending and participating!


Here are the questions captured in the Q&A section. If you don't see your question here, please be sure to join our Neo4j User Group, where our community will be sure to help you out.


To model a graph database, do you start directly with nodes and do not provide an ER diagram first?
  • Graph modeling often begins with whiteboarding the data in your domain. Usually, what you draw is what you graph.
Can I have custom RelTypes?
  • Absolutely. All relationship types are defined by the application, so you can create them as appropriate.
Is there a way to keep the graph in memory all the time (except using a ram-disk)?
  • While there is no memory-storage mode, it is possible to keep the entire graph in memory by configuring large enough caches, then reading the entire graph (and properties) into memory.
Is it possible to version nodes and relationships in the graph?
  • Neo4j does not have native versioning, so you would have to model versioning of nodes using a linked list. Relationships could be versioned by using a unique Property to indicate the version.
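The linked-list idea from the versioning answer can be sketched in a few lines of Python. Each update creates a new version that points back at the one it replaces, so the full history stays reachable from the current version. The `Version` and `VersionedNode` names and the sample properties are hypothetical; in a graph, the `previous` pointer would be a relationship between version nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    number: int
    properties: dict
    previous: Optional["Version"] = None  # link to the version this one replaced


class VersionedNode:
    """Holds the current version; older versions hang off it as a linked list."""

    def __init__(self, **properties):
        self.current = Version(1, dict(properties))

    def update(self, **changes):
        merged = {**self.current.properties, **changes}
        self.current = Version(self.current.number + 1, merged, previous=self.current)

    def history(self):
        """Yield versions from newest to oldest."""
        version = self.current
        while version is not None:
            yield version
            version = version.previous


person = VersionedNode(name="Alice", city="Oslo")
person.update(city="Berlin")
person.update(title="Dr.")
for version in person.history():
    print(version.number, version.properties)
```

Older versions are never modified here, which keeps reads of historical state simple at the cost of storing a full property map per version.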
Does Neo4j support XA transactions?
  • Yes, Neo4j is a proper XA transaction citizen.
How are nodes with defined relationships between them located? Do they have embedded pointers stored with the node that point to the address where the related nodes reside in the database? I'm thinking of the network model used by IDMS.
  • On disk, there are separate stores for nodes, relationships and properties. For details, consider reading posts from our own Chris Gioran's blog.
How can I return the node which is the last node of a traversal (basically the leaf nodes)?
  • With Cypher, you would bind to and return the last node. For instance, in `start a=node(0) match (a)--(b)--(c) return c` the result would list all of the nodes 'c' that are at the end of a depth-2 traversal from 'a'.
Is subgraph isomorphism possible?
  • Subgraph matching is not directly supported, just path pattern-matching. So the match would have to be expressed as a path pattern.
What's the impact of node v. relationship? i.e. is a database more performant with a lot of nodes or relationships?
  • The database handles both nodes and relationships very well, though query performance generally favors following relationships over checking property constraints.
Aside from social networks, what other types of applications might a graph database like Neo4j be well suited for?

Graph databases are extremely useful when dealing with large amounts of complex and highly connected data. Social networks are one example; here are some others:
  • Collaboration programs
  • Configuration Management
  • Geo-Spatial applications
  • Impact Analysis
  • Master Data Management
  • Network Management
  • Product Line Management
  • Recommendation Engines
How does Neo4j handle nodes that have a lot of relationships (let's say one node connected to all other nodes)? Is there an index on all relationships a node has?
  • There is work ongoing now to address what we call "supernodes" with huge numbers (more than 100k) of relationships.
When will production-level sharding (even with eventual consistency) be available?
  • Our most bearded developers are locked away working on this right now, though we can't promise a time-frame other than sometime this year.
I am used to thinking of a graph database as a set of RDF triples. What are the differences between RDF triples and the Neo4j data model, if any?
  • With RDF, each property of an entity requires another triple. In a Property Graph, both the nodes and relationships of the graph can store properties, making it much more efficient.
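To illustrate the "one triple per property" point from the RDF answer, here is a tiny Python sketch that flattens a single property-graph node into subject-predicate-object triples. The node layout, the identifier `person:1`, and the `node_to_triples` helper are made up for the example, and the sketch says nothing about how either model is stored on disk.

```python
def node_to_triples(node_id, properties):
    """Emit one (subject, predicate, object) triple per property of the node."""
    return [(node_id, predicate, value) for predicate, value in properties.items()]


# One node in a property graph: a single record carrying three properties.
node = {"id": "person:1", "properties": {"name": "Alice", "age": 30, "city": "Oslo"}}

triples = node_to_triples(node["id"], node["properties"])
for triple in triples:
    print(triple)
print(len(triples), "triples for 1 node")
```

The same three facts that sit on one node in the property graph become three separate statements in the triple form.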
Is there any known commercial product that uses Neo4j?
  • Absolutely! Be sure to check out our Customer Page for a highlighted list of Neo4j in production.

We have some great meetups and events coming up, and don't forget to sign up for our next webinar discussing Neo4j, taking place February 9th.

-ayeeson