Monday, April 8, 2013

Nodes are people, too



Neo4j 2.0 will let you define sets of nodes within the graph
Philip Rathle
Senior Director of Products

Update: 2.0.0-M02 is now available


Today we are releasing Neo4j 2.0.0-M01, a milestone release in the Neo4j 2.0 series, which we expect to be generally available (GA) in the next couple of months. This release is significant in that it is the first time since the inception of Neo4j thirteen years ago that we are making a change to the property graph model. Specifically, we are adding a new construct: labels.


We’ve completed a first cut at a significant addition to the data model, and are opening the code up now for early comment. Consider this milestone to be an experimental release, intended to solicit input. We look forward to hearing how you'd like to use these new features, and can't wait to hear what you think.  


It’s a What?

Let’s say you created a node for a person named Joe. Joe is not just any node: he is a person. Therefore you would probably want to designate the node for Joe as being a “Person”. If you’ve worked with Neo4j before, chances are that you’ve done this by adding a property called “type” with value “Person”, as follows:
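
As a minimal sketch in Cypher, the property-based convention might be written like this. Only the "type" property comes from the description above; the `name` property and the identifier `joe` are assumptions made for illustration.

```cypher
// Pre-label convention: record the node's kind in an ordinary property.
// "name" and the identifier "joe" are illustrative assumptions.
CREATE (joe {name: "Joe", type: "Person"})
RETURN joe;
```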



This is useful, because now you can differentiate Joe from things in your graph that are quite different, such as “household goods” nodes and “geo location” nodes. Rightly so: these things should receive very different treatment.


Now let’s say you also want to give Joe a party affiliation: Left-Wing, Right-Wing, or the moderate Middle-Wing. While you could do this with a property as well, you may decide that you want to easily find all people of a given party affiliation. Knowing that Joe is "Middle-Wing", you might decide to break the parties into nodes, and then associate Joe with his party, as below:
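
One way to sketch this party-node model in Cypher is shown here. The relationship type `AFFILIATED_WITH` and the `name` properties are hypothetical, chosen only to make the example concrete.

```cypher
// Hypothetical model: one node per party, linked from each member.
// AFFILIATED_WITH is an assumed relationship type, not one named in this post.
CREATE (joe {name: "Joe", type: "Person"}),
       (party {name: "Middle-Wing", type: "Party"}),
       (joe)-[:AFFILIATED_WITH]->(party)
RETURN joe, party;
```

Every additional member of the party adds one more relationship to the same party node.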

One thing you’d now naturally want the graph to do is automatically index the “Person” nodes (and no other nodes) according to the unique identifier for “Person”. (Let’s oversimplify and say this is "name".) If you’re using Cypher, this is a challenge today. In fact, it’s not possible at all, because Neo4j doesn’t inherently know that a “Person” is any different from a geo location. If you want to index “name”, you end up doing it for everything in the graph, which mixes concerns. Geo location names aren’t the same as person names, any more than a city is like a person. As for the “Middle-Wing” node, it ends up becoming extremely dense, cluttering the graph with lots of connections whose sole purpose is to designate nodes as belonging to a group.


We’ve been looking at better ways to do this. The ideal solution would help to make one’s graph more understandable, as well as to make Cypher more powerful, by allowing it to home in on nodes (as well as to index them) according to what they are.


Neo4j 2.0 therefore introduces a means of grouping or categorizing nodes. Provisionally we are calling this construct a “Label”. The term “Label” speaks to its generic use, and to the fact that nodes can have multiple labels. One of the many uses of labels, and perhaps the most intuitive one at first, is to provide "hooks" in the graph that you can associate with your application's type system. Because the facility isn't itself explicitly hierarchical (a label is literally just a tag, and a node can have zero or more of them), we're calling them labels.


Labels


A graph is a graph because it has relationships in the data. In a Property Graph, a relationship always has a type, describing how two nodes are related. Labels expand on that idea, describing how entire sets of nodes are related. This is a grouping mechanism for nodes. How does it work? Very simply: in the example above, rather than adding a “type” property and connecting Joe to a Party node, you would add two labels: one for “Person”, and one for “Middle-Wing”, just like so:
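
A minimal Cypher sketch of the labeled version of Joe follows. It assumes the label syntax used in the examples later in this post; the label is spelled `MiddleWing` because a hyphenated label name would need backtick quoting, and the `name` property is again an illustrative assumption.

```cypher
// Two labels on one node stand in for the "type" property and the party node.
// MiddleWing is written without a hyphen; `Middle-Wing` would need backticks.
CREATE (joe:Person:MiddleWing {name: "Joe"})
RETURN joe, labels(joe);
```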




This opens up quite a few possibilities, and probably stirs up a lot of ideas in your head. Rather than color your thinking about how to use labels, let’s look at an example using different color sets.

Color me happy


Let’s say we have an arbitrary domain of loosely related stuff, within which we at least know that things can be red, green, or blue. We could just add a “color” property to each node, or relate them to a value node for each color. But because we want to always work within this group, we’ll use labels to identify members of the sets.


First, create something red:

CREATE a node with a Label


CREATE (thing:Red {uid: "TK-421", make: 191860 })
RETURN thing;


To find the thing we just created, we can search within just the Red nodes, then return the labels:

Find the Labels on a node


MATCH (thing:Red)
WHERE thing.uid = "TK-421"
RETURN labels(thing);


Why labels, plural? Because nodes can have multiple labels. Let's say that "TK-421" also belongs to the blue set. Add a blue label like this:


Add a Label to a node


MATCH (thing:Red)
WHERE thing.uid = "TK-421"
SET thing :Blue;
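
Once a node carries more than one label, the labels can be combined in a single pattern. The sketch below assumes that label predicates can be chained in `MATCH` and that `REMOVE` accepts a label; both are worth checking against the manual for the milestone you are running.

```cypher
// Match only nodes that are in both the Red and the Blue set.
MATCH (thing:Red:Blue)
RETURN thing.uid, labels(thing);

// Take the node back out of the Blue set; its Red label is left untouched.
MATCH (thing:Red)
WHERE thing.uid = "TK-421"
REMOVE thing:Blue;
```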

The benefits of intentional labeling


While some Danes may be nervous about labels, much good comes from their use. Applying a label to a set of nodes makes your intention obvious: "these nodes are accessed frequently and thought of as a group." The database itself can benefit from having your intention made explicit, because it can now act on this information.

For starters, Neo4j can create indexes that improve performance when looking up nodes within a set (note the new Cypher syntax for index creation):

CREATE INDEXES to speed up finding Red and Blue nodes

CREATE INDEX ON :Red(uid);
CREATE INDEX ON :Blue(uid);
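
As a hedged sketch of how these indexes come into play: a label-scoped lookup on the indexed property is the kind of query they are meant to serve, and an explicit `USING INDEX` hint can request one by name. The hint and `DROP INDEX` forms below are assumptions about this milestone's syntax and may differ from what your build accepts.

```cypher
// An ordinary label-scoped lookup; the :Red(uid) index can serve this query.
MATCH (thing:Red)
WHERE thing.uid = "TK-421"
RETURN thing;

// The same lookup with an explicit index hint.
MATCH (thing:Red)
USING INDEX thing:Red(uid)
WHERE thing.uid = "TK-421"
RETURN thing;

// Drop the index again when it is no longer wanted.
DROP INDEX ON :Red(uid);
```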

Create a second labeled node and a relationship

CREATE (other_thing:Blue {uid: "TURK-182", make: 181663})
WITH other_thing
MATCH (thing:Red)
WHERE thing.uid = "TK-421"
CREATE (thing)-[:HONORS]->(other_thing)
RETURN thing, other_thing;

There is much more fun to be had. Details are, as always, in the Neo4j Manual. Again, this simple change can have a profound impact. As we're exploring the possibilities and tuning the language and APIs, we'd love for you to play around with labels. Let us know how you want to use them by providing feedback on the Google Group. (That way other people can see your feedback and respond with their own opinions and observations.)

One more thing...

Just in CASE


Cypher has a new CASE expression for mapping inputs to result values: a cousin to similar constructs found in every common programming language.  

  • In its simple form, CASE uses a direct comparison of a property for picking the result value from the first matching WHEN:

MATCH (r:Red) RETURN CASE r.uid
    WHEN "TK-421" THEN "Why aren’t you at your post?"
    WHEN "TURK-182" THEN "the work of one man"
    ELSE "..."
END

  • In the general form, each WHEN uses an arbitrary predicate for picking the result:

MATCH (r:Red) RETURN CASE
   WHEN r.color > 180000 THEN "reddish"
   WHEN r.color < 180000 THEN "purplish"
   ELSE "simply red"
END
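
The nodes created earlier in this post carry a `make` property rather than `color`, so a variant of the general form that can be tried against that sample data might look like the sketch below. The thresholds and result strings are arbitrary, chosen only for illustration.

```cypher
// General-form CASE over the "make" property set in the CREATE examples.
// The 190000 and 180000 thresholds are arbitrary illustrative values.
MATCH (r:Red)
RETURN r.uid,
       CASE
         WHEN r.make > 190000 THEN "high"
         WHEN r.make > 180000 THEN "middle"
         ELSE "low or unset"
       END AS make_band;
```

The WHEN branches are tested in order, so the second branch only sees values that failed the first.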

Summary


Enjoy this preview milestone! Use the Neo4j Google Group to tell the Neo4j team and other members of the Neo4j community what you think. There are a few other improvements baked into this release as well, including improvements to the shell, which we'll cover in upcoming blog posts. And of course you'll be seeing more in upcoming milestones of Neo4j 2.0. Meanwhile, we have upgraded a preview of the online console so you can test the new features; it now features the Matrix graph enhanced with labels.

One final note: if you are planning to go into production soon, we strongly recommend developing against 1.9, which we expect to go GA in the next couple of weeks (look for an RC this week).


Update - 2.0.0-M02 introduces Remote Transactions


The latest 2.0 milestone introduces a new HTTP endpoint for managing multiple Cypher statements within a single transaction. Just create the transaction with the first batch of statements. You'll receive a URL to which additional requests can be submitted, and through which the transaction can be committed or rolled back. See the Neo4j manual for all the details.


Enjoy, from the Neo4j Team!

19 comments:

Joe Parry said...

Thanks Philip - looks very interesting.

Can I use labels for aggregating bits of graphs? I'd like to search for all 'Red' nodes, all 'Blue' nodes and all the links between the two node types.

Could you provide a cypher query which does this? I'm interested to find out what happens when nodes are both Red and Blue.

Also - could you link to specific documentation changes rather than the whole documentation? I'd like to read which bits of Neo are impacted by this change.

And lastly: can you point to a good example data set that has labels present already?

Thanks! I'm looking forward to playing with this.

Hendy Irawan said...

...that?

Hendy Irawan said...

I think your blog's mobile theme is broken; I can't read it in the Android browser.

Andreas Kollegger said...

Hey Joe,

You can do graph-global queries using labels. Your example would look like:

MATCH (reddish:Red)-[related]-(bluish:Blue)
RETURN reddish, related, bluish;

This would return all pairs of Red nodes that have any relationship to Blue nodes. If the same node is both Red and Blue, it can appear on either side, and even by itself if it has a self-relationship.

Here are some deeper documentation links:
- CREATE with labels
- MATCH with labels

All the code with this is still under active development, so we have not yet produced substantial data sets which use labels. We will have additional blogs which explore specific use cases with best practices for modeling.

Cheers,
Andreas

Florent Biville said...

That is an awesome piece of news! Cypher is now very very close to being a full-fledged alternative to the traversal framework :)

Cypher and you guys rock!

Dmitry Serebrennikov said...

Hi guys! Awesome pace of progress!

But I'm not sure I fully see the benefit of the Type-Labels vs. the usual properties with an index. After all, Spring-Data has a @Indexed annotation that allows creation of separate indexes by "type". Seems very similar.

Is the point of Type-Labels to bring this functionality into the core of Neo, and thus make it more powerful?

Also, don't labels themselves need a type? Suppose I label some People as "Red" and "Blue" (due to their political affiliation) and then also label cars as "Red" and "Blue" (due to their actual color). Wouldn't the graph still confuse the two?

Neo4J Fan said...

Cool! When will 1.9 get released though?

bryan roberts said...

The index added using the labels is not working for me.

Could you please tell me whether this is the right way to use it?

CREATE INDEX ON :nodes(id);

No errors here.

then I try to get a node using
start n=node:nodes(id="bryan.roberts")
return n;


I am getting an error,
Index `nodes` does not exist


Any help is appreciated...

Thanks,
Bryan

Peter Neubauer said...

Bryan, indexes are no longer explicit in that form, you do something like

create index on :Crew(name)

and then see http://docs.neo4j.org/chunked/preview/query-using.html

match n:Crew
using index n:Crew(name)
where n.name='Neo'
return n;

Philip Rathle said...

Neo4j Fan: We're currently in the final process of preparing a 1.9 Release Candidate. Current expectation for 1.9 GA is that it will be available in the next couple-to-few weeks.

Pieter-Jan Van Aeken said...

Looks very interesting indeed. I remember Stefan Armbruster mentioning this in the tutorial back in the Netherlands. I am curious though about the possibilities :

Since labels are basically categories, can you create a category and then some subcategories? For instance Employee and then SalesEmployee and DevelopmentEmployee?

And can you add multiple categories to someone, say a person is both a SalesEmployee and a DevelopmentEmployee?

Pieter-Jan Van Aeken said...

I'm sorry, I didn't realise that the API was already available. I found out that you can in fact add multiple labels to a single node which is very useful for me. So ignore that question.

I couldn't find something though about "sublabels". ATM, I'm able to create a CategoryNode who has a specific relation to each of his SubCategoryNodes. I index all these category nodes in a custom index, and with a simple traversal I can easily get all nodes for a specific category or subcategory.

Is this possible with labels, and if it is, how's the performance? my personal experience tells me that index isn't really that fast with big datasets. Hence why I only indexed categories until now and did search operations with traversals rather than adding indexed category properties to every node... The labels approach seems to work similar to the property approach?

Mattias Persson said...

Pieter-Jan Van Aeken:
So you're thinking about performance of doing an index lookup, looping through that result vs. getting the relationships of a node directly, right?

That is very much up to the index in use. We've given much thought about the index provider API for labels&indexing so that it's as easy as possible to plug in new ones. The default for now is Lucene, although there are other providers probably much better suited for this use case. That will come later on.

On the topic of performance (and diving into the internals of Neo4j transaction management): the indexing here will happen in the same data source, that is there will be no two-phase-commit overhead due to the graph AND an index being involved, other than updating the index itself. So that's a win already.

Pieter-Jan Van Aeken said...

@Mattias Persson

You're right. I'm interested in the performance difference between retrieving all nodes of a specific category by traversing the relationships of that category node, vs. doing an index lookup based on the label of that node, where the label indicates that node's category.

I'm still somewhat worried though. The only nodes I need to index at the moment are the category nodes. They are indexed in the same transaction as the one in which I create them. Unless I misunderstood what you meant, I believe there is thus no two-phase commit. But even if I'm mistaken, there are only about 30 category nodes to add to my index. From what I understood (and again I may be wrong), all nodes with a specific label would be added to the label's index, potentially adding a very large group of nodes to the index. I'd expect such an index lookup to be a lot slower than actually looking up a category node and traversing its incoming IS_OF_CATEGORY relationships.

Perhaps you are right that a different index provider would be better for my use case. It's definitely something I should look into.

Andreas Kollegger said...

@Pieter

Labels would just present one more option for modeling your domain. Of course, picking the best approach really depends on the questions you want to ask.

What kind of queries do you need for things of a particular sub-category? Do you need to scan through all of them, or find a particular one (or few)?

Pieter-Jan Van Aeken said...

I'd like to be able to fill up tables with an entire category or subcategory. So if I have three categories, Person, Employee and Employer where the latter 2 are a subcategory of the first, I want to be able to create a Person table, a Employee table and an Employer Table.

In that table I want either all nodes, or a subset if I'm using pagination. And I want to be able to loop through all the nodes, because in the table I need the properties, not the actual nodes.

Now I have all three categories indexed, so it's not hard to find my entry point with Lucene. Then I do a traversal over the incoming relationships IS_OF_CATEGORY, and I get a list of all nodes in that category.

I'm just really curious to see how the label lookup vs my current method would do when you have about 10000 persons.

Livio Ribeiro said...

Labels look like a very good way to map classes in your application to nodes in the graph. Also, with inheritance, one could label the node with both the base class name and the subclass name. This is a very welcome addition to Neo4j.

Bryan Vine said...

Can you create label indexes with fulltext and to_lower_case enabled?

Anonymous said...

Hi, maybe I do not understand it fully yet, but IMHO a "label" is still a (multi-valued) "property" of a node. What exactly is the differentiator? Why can't a "property" as we know it today be used for exactly the same purpose, perhaps through some syntax extension of a few commands, instead of adding labels to the data model (and making it more complex)?

and i guess we will have "labels" on relations too?

reg koen