Friday, February 17, 2012

Modeling a multilevel index in neoj4

Hi all,

Today, for my lab project, I decided to model an in-graph index in Neo4j and query it with the Cypher Query Language.

The basic problem we try to solve here is the ordering of events in a timeline and asking for ranges of events ordered in time without needing to load the whole timeline, or let an external index like Lucene doing the sorting (which is very costly). So, a simple approach to do this is a multilevel tree, where you attach the domain nodes to the leafs of the index tree and query by traversing through that structure.



Now, to ask for all Events between 2011-01-01 and 2011-01-03 you simply find the starting and ending path (in this case they share the upper part of the tree) for these levels in the index, and then collect the Events hanging off the Day-nodes ordered via the NEXT relationships, following the VALUE relationships, if they exist.


All these five segments of the query structure can be expressed in one single Cypher query:

START root=node:node_auto_index(name = 'Root')
MATCH 
  commonPath=root-[:`2011`]->()-[:`01`]->commonRootEnd,
  startPath=commonRootEnd-[:`01`]->startLeaf,
  endPath=commonRootEnd-[:`03`]->endLeaf,
  valuePath=startLeaf-[:NEXT*0..]->middle-[:NEXT*0..]->endLeaf,
  values=middle-[:VALUE]->event
RETURN event.name
ORDER BY event.name ASC

Returning Event2 and Event3. This may seem surprising at first, since we've asked for the middle events, but notice that variable length path [:NEXT*0..] includes length 0 and has no upper limit. Because the startLeaf and endLeaf are bound through the previous path definitions, they will be the boundaries of the range.  

Some more examples on this data structure are available as part of the Neo4j Manual in the Cypher Cookbook section.

Happy hacking!

/peter


15 comments:

pehrlich said...

This is interesting, I like it. I imagine if you wanted to query for the next 20 days, rather than from a date range, you could do this:

START root=node:node_auto_index(name = 'Root')
MATCH
commonPath=root-[:`2011`]->()-[:`01`]->()-[:`01`]->,
valuePath=startLeaf-[:NEXT*0..]->middle-[:NEXT*0..20]->endLeaf,
values=middle-[:VALUE]->event
RETURN event.name
DISTINCT event
# natural order is by event start date
# would there be a good way to order by end date?

Peter Neubauer said...

Exactly, very nice use of the potential of Cypher IMHO :)

bww00 said...

Thanks!
I have a problem for a clinic setting, where there are multiple patients with multiple events per interval (say 1 - 15 minutes), and multiple state changes for an event. Would it be better to have tree for each patient, or a common tree for all ? What impact would the granularity of the time be, down to the minute or second ?

Regards,
Bryan Webb

bww00 said...

Thanks!
I have a problem for a clinic setting, where there are multiple patients with multiple events per interval (say 1 - 15 minutes), and multiple state changes for an event. Would it be better to have tree for each patient, or a common tree for all ? What impact would the granularity of the time be, down to the minute or second ?

Regards,
Bryan Webb

Peter Neubauer said...

@bww00 depends a bit on the amount of data. How big is the total number of events going to be?

Peter Neubauer said...

@bww00 I think you should just create a testcase and try it out. It sounds like doable with one big tree for all events, but I wouldn't do that without performance testing.

panisson said...

Hello Peter, thank you for this nice example.
We are trying to use this approach to represent a timeline of frames, and a graph on each frame. However, we are using the Neo4j createRelationshipTo method to create the relationships, and we found difficult to create the same structure of this example. The problem comes if you want to give types as `2011` and `01` to the relationship types. As RelationshipTypes are all enums, it is not possible to give such names to the relationships; moreover, we should create a long list of enum types, each type repesenting a year, a month or a day.
Our solution was to give all relationships the same type (NEXT_LEVEL), and add an attribute with the given year, month or day. In this case, the Cypher query would be something like this:

START root=node:node_auto_index(name = 'Root')
MATCH
commonPath=root-[y:`NEXT_LEVEL`]->()-[m:`NEXT_LEVEL`]->commonRootEnd,
startPath=commonRootEnd-[d1:`NEXT_LEVEL`]->startLeaf,
endPath=commonRootEnd-[d2:`NEXT_LEVEL`]->endLeaf,
valuePath=startLeaf-[:NEXT*0..]->middle-[:NEXT*0..]->endLeaf,
values=middle-[:VALUE]->event
WHERE y.year=2011 and m.month=01 and d1.day=01 and d2.day=03
RETURN event.name
ORDER BY event.name ASC

Is it correct or there is a better approach?

Regards,
André

Peter Neubauer said...

Hi think the key here is to use DynamicRelationshipsType.withName("MONTH_01") which will let you create dynamic relationship types.

André Panisson said...

Thanks Peter, I was not aware of the availability of the DynamicRelationshipsType class.

André Panisson said...

By the way, there's any performance advantage in using relationship types in place of relationship properties?

Simon Hallé said...

Hi Peter, thanks for the example, looks pretty good! Although I haven't done much tests with Lucene, I was wondering why you noted that using such index would be "very costly" to do the sorting? In my situation, I need to implement a filter that returns all Nodes within a date range. Nodes are tagged with a start and optional end date (node properties). I noticed that org.neo4j.index.lucene.LuceneTimeline is available, but doesn't seem to be used much? Is it deprecated? Have tests been done using the Lucene date "string-based" indexing and querying method? (i.e., instead of using long to represent time as the Neo4J Timeline, Lucene uses the yyyyMMddHHmmss string representation). Finally, if you have any pointers on the lucene index vs in-graph node indexing, I'm still trying to sort out what's best... thanks for any info you may have about that! And thanks for all the info I can get from your posts on the Net! ;)

bww00 said...

peter,
what would be the fastest way to create a test graph.
I also need to add a physician node with a relation to the event and a function node with a relation to the event.
I would need to query the events by date/time. I will need to query events by physician and also by function.

I am having trouble getting started in creating the graph and relationships.

I really appreciate all of your assistance.

Cheers

Unknown said...

Thanks Peter, for a nice explanation. I'm seriously looking at the graph databases now for our reporting suite - I have tried to model the time series data as you had suggested but since it goes down up to the seconds level I'm quite conscious about performance. I have devices sending the data so I can append to the timeline graph with seconds precision. These devices are tied to a client in RDBMS. Concept of client and device are static(usually) which I can represent in graph as well and I'm thinking of relating static device reference with timeline which I think will result in a lot of relationships with device and timeline graph. For a week its 604,800 (7[days] * 24[hours] * 60[mins] * 60[secs]) relationships to some point in time line graph. Anyway before I move on in this direction, could you please share your thoughts on my approach? I feel like I'm still thinking in relational way. Your comments would be much appreciated! Cheers

Roman Rytov said...

This is nice but only for the cases when a node is defined for in a timeline once and for all. What if the relationship is time-dependant? Reporting lines, period of studies, marriage, place of living, etc. How to model then relationship right to perform effectively? Create "since" and "until" attributes and run subfilters? What's the right way?

flip101 said...

@bww00 Can you share your data model? I would like to make a similar setup. I have the same question about: Where to introduce several "users" into this model. Have you gotten any test results back?

@Unknown I don't think you have to specify the frames to the smallest unit of time. You can have an hour as smallest frame and then attach all those events to this hour. But every event should have it's own timestamp and you should still be able to order them.

@Roman Rytov It's true that this model only puts nodes to a certain timeframe. If you want to make relationships time dependent you could put them as nodes. This page here shows you very nicely how to do that. Also if you Java you could try this solution you can find the slides here. Notice that this uses Java and Tinkerpop blueprint on top of Datomic database.

I'm using Neo4j database myself and i'm particularly interested in time dependent graphs to use in my own application. When someone reads this and has good datamodel/solution please leave a comment! The page i linke before has a nice model, but does not incoorperate any users. (similar to the questions of bww00)