Monday, January 28, 2013

Demining the “Join Bomb” with graph queries

For the past couple of months, and even more so since the beer post, people have been asking me a question that I have been struggling to answer myself for quite some time: what is so nice about graphs? What can you do with a graph database that you could not do, or could do only at great pains, in a traditional relational database system? Conceptually, everyone understands that this is because of the inherent query power of a graph traversal - but how do we make this tangible? How do we show this to people in a real and straightforward way?

And then Facebook Graph Search came along, with its many crazy search examples - and it sort of hit me: we need to illustrate this with *queries*. Queries that you would not be able to do - or only with a substantial amount of effort - in a traditional database system, and that are trivial in a graph.

This is what I will be trying to do in this blog post, using an imaginary dataset that was inspired by the Telecommunications industry. You can download the dataset here, but really it is very simple: a number of “general” data elements (countries, languages, cities), a number of “customer” data elements (person, company) and a number of more telecom-related data elements: operators (I actually have the full list of all mobile operators in the countries in the dataset, coming from here and here), phones, and conference call service providers.

So to start off with: what would this data set look like in a relational model?

What is immediately clear is that there is *so* much overhead in the model. In order to query anything meaningful from this normalised RDBMS, you *need* to implement these things called “join tables”. And really: these things stink. It is just one example of what a poor match the relational model is to the real world - and of the complexity it introduces when you start using it.

Compare this to the elegance of the graph model:

It is such a good match to reality - it is just great. But the beauty is not just in the model - it’s in what you do with the model, in the queries.

So let’s see how we could ask some very interesting, Facebook-style queries of this model:
  • Find a person in London who speaks more than one language and who owns an iPhone5
  • Find a city where someone from Neo Technology lives who speaks English and has Three as his operator
  • Find a city where someone from Neo Technology lives who speaks English and has Three as his operator in the city that he lives in
  • Find a person, not living in Germany, who roams to more than 2 countries and who emails people who live in London
  • Find a person who roams to more than 2 countries and who lives in the USA and uses a ConferenceCallService there

These are all very realistic queries that could serve real business purposes (pattern recognition, recommendation engines, fraud detection, new product suggestions, etc), and that would be terribly ugly to implement in a traditional system, and surprisingly elegant on a graph. To do that, we would use our favourite graph query language, Cypher, to describe our patterns and get the data out.

So let’s explore a couple of examples with some real world queries.

The first thing to realise here is the relevance of a very important concept in the world of databases, and more specifically in the world of graph databases - the use of indexes. In a traditional database, indexes are expensive but indispensable tools for quickly finding the appropriate records in a table using a “key”. And when joining two tables, those indexes have to be consulted again and again, once for every row that potentially matches the join criteria. This is why “joins” are so expensive computationally - and this is also why graph queries are so incredibly fast for join-intensive requests. The thing is that in a graph database, you *only* use the index on the data *once*, at the start of the query, to find the starting points of the “traversals”. Once you have the starting points, you can just “walk the network” and find the next data element by hopping along the edges/relationships, NOT using any indexes. This is what we call “index-free adjacency” - and it is a fundamental concept in understanding graph traversals.
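To make that contrast concrete, here is a tiny, purely hypothetical Python sketch (nothing to do with Neo4j's actual internals): the index is consulted exactly once, to find the starting node, and every hop after that follows a direct reference.

```python
# Toy illustration of index-free adjacency: each node stores direct
# references to its neighbours, so traversal never touches an index.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to neighbouring nodes

    def link(self, other):
        self.out.append(other)

# Build a tiny graph, keeping an index of name -> node.
index = {name: Node(name) for name in ("Alice", "Bob", "London")}
index["Alice"].link(index["London"])
index["Bob"].link(index["London"])

# One index lookup to "parachute" into the graph...
start = index["Alice"]

# ...then walk the relationships directly, without the index.
print([n.name for n in start.out])  # prints ['London']
```

The point of the sketch: once `start` is in hand, finding its neighbours is a pointer chase, not another index scan.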

In the example below you can see that we are using three index lookups (depicted in green, and I even added a nice little parachute symbol to illustrate what we are doing here) to “parachute” or land into the network, and start walking the graph from there.

The query above looks for a city where someone from Neo Technology lives, who speaks English and has Three as his operator in the city that he lives in.

// These are the three parachutes, landing by doing an index lookup for nodes using the node_auto_index of Neo4j.
START neo=node:node_auto_index(name="Neo Technology"),
      english=node:node_auto_index(name="English"),
      three=node:node_auto_index(name="Three")

// Here we describe the pattern that we are looking for. From the three starting points, we are looking for a city that has very specific, directed relationships that need to match this pattern. (The relationship type names below are reconstructed for illustration.)
MATCH (person)-[:WORKS_FOR]->(neo),
      (person)-[:SPEAKS]->(english),
      (person)-[:LIVES_IN]->(city),
      (person)-[:HAS_OPERATOR]->(three),
      (three)-[:OPERATES_IN]->(city)

// We return the city’s name and the person’s name as a result set from this query.
RETURN city.name, person.name

// and order them by the name of the city
ORDER BY city.name

And here’s another example:

Here we are looking for two people in the same country but on different home operators, who call, mail or text each other.

// Here we use just one index lookup to find a “country” and then we start looking for the pattern. (The country name and the relationship type names below are reconstructed for illustration.)
START country=node:node_auto_index(name="United Kingdom")

// The pattern in this case is quite a complex one, with quite a few hops across the different relationship types.
MATCH (person)-[:LIVES_IN]->(city)-[:LOCATED_IN]->(country),
      (otherperson)-[:LIVES_IN]->(othercity)-[:LOCATED_IN]->(country),
      (person)-[:HAS_OPERATOR]->(operator),
      (otherperson)-[:HAS_OPERATOR]->(otheroperator),
      (person)-[:CALLS|MAILS|TEXTS]->(otherperson)

// Here we limit the results to a specific condition that has to be applied.
WHERE person <> otherperson
  AND operator <> otheroperator

// And here we return the distinct set of person names and country names.
RETURN DISTINCT person.name, otherperson.name, country.name

I am hoping that you can see that these kinds of queries, which directly address the nodes and relationships rather than going through endless join tables, are a much cleaner way to pull this kind of data from the database.

The nice thing about this way of querying is that in principle, its performance is extremely scalable and constant: we will not suffer the typical performance degradation that relational databases suffer when doing lots of joins over very long tables. The reason for this is simple: because we only use the indexes to find the starting points, and because the other “joins” will be done implicitly by following the relationships from node to node, we can actually know that performance will remain constant as the dataset grows. Finding the starting point may slow down (a bit) as the index grows, but exploring the network will typically not - as we know that not everything will be connected to everything, and the things unconnected to the starting nodes will simply “not exist” from the perspective of the running query. 
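As a hypothetical back-of-the-envelope check in Python (again, not Neo4j code): the cost of exploring from a starting point depends only on what is reachable from that point, so burying the same connected component in a vastly larger dataset changes nothing about the traversal.

```python
# Depth-first exploration from a starting node over an adjacency dict.
def reachable(adj, start):
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adj.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

small = {"rik": ["london"], "london": ["uk"]}

# The same connected component, buried in a graph 100,000 nodes bigger.
big = dict(small)
big.update({f"other{i}": [f"other{i+1}"] for i in range(100_000)})

# The traversal result - and the work done - is identical in both graphs.
print(reachable(small, "rik") == reachable(big, "rik"))  # prints True
```

The unconnected `other…` nodes simply “do not exist” from the perspective of the running traversal, which is exactly the point made above.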

Obviously there is a lot more to say about graph queries, but with the simple examples above, I hope to have given you a nice introduction to where exactly the power of graph traversals lies - in these complex, join-intensive queries.

Yours sincerely,

Rik Van Bruggen.

Thursday, January 24, 2013

2013: What's Coming Next in Neo4j!

Following the recent 2012 retrospective, we’d like to share some of our plans for the coming year. If you’ve been following our latest progress, you already know that we have a 1.9 version in the works. Neo4j 1.9 makes great strides in HA cluster manageability by eliminating the need to run Zookeeper. We will release 1.9 when it’s ready. (We are readiness driven more so than we are date driven.) As of this writing, the latest Neo4j 1.9-M04 milestone has just rolled out. We are expecting a release announcement for Neo4j 1.9 GA some time in February.

Beyond Neo4j 1.9:

Even though roadmaps can change, and it's nice not to spoil all of the surprises, we do feel it's important to discuss priorities within our community. We've spent a lot of time over the last year taking to heart all of the discussions we've had, publicly and privately, with our users, and closely looking at the various ways in which Neo4j is used. Our aim in 2013 is to build upon the strengths of today's Neo4j database, and make a great product even better.

The 2013 product plan breaks down into a few main themes. This post is dedicated to the top two, which are:

1. Ease of Use: making the product easier to learn, use, and maintain, for new & existing users, and
2. Big(ger) Data: handling ever-bigger data and transaction volumes.

Ease of Use

Our goal is to make it as easy as possible to learn Neo4j, to build applications using Neo4j, and to maintain Neo4j. One observation we've made is that most Neo4j users develop their own way to classify nodes: either with type nodes, or with properties. We've decided it's time that we build in support for this. So for the first time since Neo4j founder and CEO Emil Eifrem sketched out what today is called the property graph model on the back of a napkin on a flight to Bombay in Fall of 2000, we will be extending Neo4j's data model. 

The key idea is that you will be able to tag nodes with labels, for example: “Person”, "Employee", “Parcel”, “Gene”, etc. We will soon invite discussion on the Neo4j Google Group concerning the implementation details. Expect to hear more about this soon.
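The exact syntax was still under discussion at the time, so treat the following as a purely illustrative sketch of the idea (names hypothetical) rather than the final design: a node tagged with one or more labels, and a pattern match that uses a label instead of a separate type node.

```
// Hypothetical sketch - final syntax subject to the community discussion
CREATE (p:Person:Employee { name: "Anna" })

// A label can anchor a match directly, with no "type node" in the data
MATCH (p:Person)
RETURN p.name
```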

Another area where we are planning improvements is indexing. Auto indexing reduces the effort required to manage indexes. However, auto indexes are not suitable in all cases, and Cypher is likewise still lacking in its support for index creation and maintenance. We have a number of improvements in store here, but the most significant one is that as of the 2.0 release, you will be able to perform every single index operation via Cypher. This means that many applications can be built with Cypher as the sole and exclusive way to access Neo4j.
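As a sketch of the direction (the exact syntax was still to be settled at the time of writing), creating and dropping an index from within Cypher might look something like:

```
// Hypothetical sketch of index management from within Cypher
CREATE INDEX ON :Person(name)
DROP INDEX ON :Person(name)
```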

These improvements are both planned for a 2.0 release, which will be the next major release after 1.9, and is expected in the first half of the year. Designating 2.0 as a major release also means that we will be removing all features and functions deprecated in 1.x. Now is a good time to start making sure that you’re not using anything that’s been deprecated.

Big(ger) Data

Before we dive into futures, let’s revisit what Neo4j can do today. Presently, Neo4j's clustering solution has excellent availability and read-scaling characteristics for up to dozens of servers. Features such as cache sharding (which allows cluster members to each keep a different portion of the graph in memory) and turbo cache (which provides up to 10x improvements in cache speed for very large cache sizes) are just a few of the technologies that allow Neo4j to scale extremely well. We have seen customers running globally-distributed 24x7 Neo4j clusters across EC2 regions on three continents, replicating hundreds of thousands of write transactions between continents within seconds. And we have seen mission-critical use of graphs range from tens of thousands of nodes & relationships at the low end, to billions of nodes and relationships in a single (usually clustered) graph.

Tens of billions turns out to be enough for the vast majority of uses we see. It allows a social graph (for example) to scale to right about the size of the people and friendships in Facebook. Today, a single Neo4j instance can handle graphs into the tens of billions of nodes/relationships/properties. Using the right design patterns and some application development, it is possible to partition a graph oneself and scale even higher. Whilst we have some users doing this, our goal is to minimize the amount of effort required to scale, no matter how large the graph.

Looking ahead, we do see a need to support even bigger graphs, particularly as data volumes, together with the demand for graph databases, continue to grow. For this reason, we have several initiatives in 2013 aimed at scaling into the 100B+ range and beyond (in a single graph), horizontally or vertically. Let's talk about vertical scalability first. We will be increasing the upper size limits of a single instance, to a number that's high enough not to be a concern. These limits were never meant to be obstacles, but are rather storage optimizations intended to keep on-disk footprint as modest as possible. Few users ever bump into these limits. But we want to raise them before anyone does. Another project we have planned around “bigger data” is to add some specific optimizations to handle traversals across densely-connected nodes, having very large numbers (millions) of relationships. (This problem is sometimes referred to as the "supernodes" problem.)

The last item on the Big(ger) Data list is definitely not the least. It is a horizontally-scalable, distributed graph, designed to spread massive graphs across clusters of machines. This is a multi-year project that's been going on in the background for some time, code-named Rassilon. Rassilon extends Neo4j's existing horizontal scalability story to allow distributed read & write processing across clusters of machines for graphs of any size. Rassilon fully supports sharding of graphs. Together with the improvements mentioned above, it will ensure that whatever graph you have will be able to be stored in Neo4j, no matter how large. We are continuing to work actively on this, and will share further details on this as the year progresses.

Some Other Items

Other key items in this year's plan include the following:
  • Neo4j will become easier to run on EC2 and Azure, to cater to the increasing number of users who are running on these platforms.
  • Cypher will continue to see improvements in both functionality and performance, including the indexing improvements mentioned above. Cypher has been popular because of its power, compactness, and readability.
  • The remote API will continue to improve, as Server becomes the primary means of accessing Neo4j, over Embedded. We're looking at a variety of ways to improve the experience, beyond simply making REST tweaks, though we will do that too. We look forward to opening this topic up for community discussion when we start the design. One important improvement is the ability for a transaction to span multiple REST calls. We expect to have this out by mid year.
  • Work is planned to make it easier for our community members to make and share contributions to the Neo4j ecosystem, such as visualization tools, language bindings, and adapters.
  • We will improve the learning experience, via improvements to documentation, the web UI, etc.
  • And finally, robustness. We will continue the steady stream of behind-the-scenes changes to ensure that Neo4j remains robust across a wide and growing range of applications.

Closing Note

We at Neo are very optimistic for the coming year. We’re looking forward to great things happening in the community, with the product, and in the industry. Market forces are driving more and more people to look at graph databases. One factor is the enormous value that can be had in exploiting the connectedness inherent in one’s data. Another is the agility and speed that users and applications experience when using Neo4j. 

I'd like to invite you to reach out to me personally (philip - at - neotechnology - dot - com), or in reply to this post, if you should have any feedback on product direction, including areas that you don't see covered in the 2013 plans above.

We look forward to a successful year together in 2013. Thanks for being a part of our community!

Philip Rathle
Senior Director of Products, Neo Technology

Tuesday, January 22, 2013

Neo4j Milestone 1.9.M04 released

Today we're happy to announce Neo4j 1.9.M04, the next milestone on our way to the Neo4j 1.9 release.

For this milestone we have worked on further improvements in Cypher, resolving several issues and continued to improve performance.

Something many users have asked for is Scala 2.10 support, which we are providing now that a stable Scala 2.10 release is available.

There were some binary changes in the Scala runtime, so by adapting to these, Cypher became incompatible with Scala 2.9. Please ping us if that is an issue for you.

In the Kernel we finally resolved a recovery problem that caused the recovery process to fail under certain conditions.

Thanks to a report from Jérémie Grodziski, we identified a performance issue with REST batch operations which caused a massive slowdown on large requests (more than a thousand commands). Solving this gave us a 30-times performance increase for these kinds of operations. So if you are inserting large amounts of data into Neo4j using the REST batch API, please try 1.9.M04 and see if it improves things for you.

We invested more time in making our High Availability Cluster easier to operate by automatically taking care of some failure conditions.

We now also provide a new master information endpoint for an HA cluster that can be used, for instance, to configure a load balancer to always route write requests to the master.

It is usually found under /db/manage/server/ha/isMaster, which returns true or false depending on whether the target server is the master, and /db/manage/server/ha/getMaster, which returns the host and port of the current master machine in a Location header.

If you run Neo4j Enterprise, please note that the setting online_backup_port is now deprecated in favour of online_backup_server, which supports a port range. Also, we now enable online backup by default where possible.
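For example, in neo4j.properties (the host and port range shown here are illustrative):

```
# Deprecated:
# online_backup_port=6362

# Replacement - accepts a host with a single port or a port range:
online_backup_server=127.0.0.1:6362-6372
```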

As always, feedback is warmly welcome so we can make Neo4j 1.9.GA as stable as possible!


Michael Hunger and the Neo4j contributor team

Thursday, January 17, 2013

Why the most important part of Facebook Graph Search is "Graph"

You've heard the news: Facebook has announced a new offering called Graph Search. Use of graph technologies has been growing over the last few years, and there’s been quite the buzz around graph databases of late.

We believe that Graph Search is part of a trend that is much bigger than Facebook, and more widespread than search. Facebook is tapping into a fundamentally new way to exploit the information that exists in all the world’s databases. In this post, we will look at the Facebook announcement from a different angle, that of connected data: a growing trend that is on the verge of changing how companies large and small understand their data.

Graphs and Search: a bit of history

Web search and graphs have a long history. Throughout most of the 1990s, the technology behind web search was based on “atomic data”: it indexed each page and ranked it in isolation, based solely on its contents, and without any reference to other pages. 

But in 1999, a small startup called Google adopted a new graph-centric approach, invented by co-founder Larry Page, called PageRank. PageRank changed the fundamentals of web search, and catapulted Google ahead of its competitors, who to this day have not caught up. What was novel about this new algorithm is that instead of ranking pages in isolation, without any reference to one another, it achieved markedly better results by taking into account how the pages are connected.

Connected Data as a New Source of Insight

In his keynote at last year’s GraphConnect Conference in San Francisco, social researcher James Fowler (author of the book “Connected”) shared his latest research findings, indicating how one can learn more about someone by knowing how they interact with the people and things around them, than by just learning discrete facts about that person. The difference between insights gained from atomic data, and the intelligence that can be discovered from connected data, is vast, and calls for specialized technologies that are designed to exploit connectedness.

How Does Graph Search Work?

Graphs are inherently visual. It’s not so difficult to understand how the technology works, even if you’re not that technical. Let’s take one of Facebook’s example Graph Search queries, which is to find all of the Sushi restaurants in New York that my friends like. Below is an illustration of what the underlying graph looks like:

The data stored inside of the graph database looks exactly like the drawing. Getting the answer is a very simple matter for a graph database. You just need to formulate the question in a way that the database understands. Those who are more technically inclined can see an example below for the query that answers the question: 

Sushi restaurants in New York that my friends like

START me=node:person(name = 'Philip'),
      location=node:location(location = 'New York'),
      cuisine=node:cuisine(cuisine = 'Sushi')

MATCH (me)-[:IS_FRIEND_OF]->(friend)-[:LIKES]->(restaurant)-[:LOCATED_IN]->(location),
      (restaurant)-[:SERVES]->(cuisine)

RETURN restaurant

Cypher Query Language Example: Sushi restaurants in New York that my friends like

Other Applications for Graphs

Thinking in graphs is natural, and contagious. The more you think in terms of connections, the more you realize that graphs are the way we implicitly think. What is a decision tree, for example, but a graph of possibilities? The more you look, the more you start to notice that graphs are, in fact, everywhere.

Graph database users regularly use queries like the one above to answer questions, and the more you ask, the more you think of new questions that never occurred to you to ask previously. Graph queries can get quite elaborate, and it’s entirely possible to run queries that scan within a social network that is two, three, or more levels of friends apart.
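As a sketch of what "two, three, or more levels" looks like in Cypher (reusing the friendship relationship type from the example above; variable names are illustrative), widening the search from direct friends to friends-of-friends-of-friends is a tiny change to the pattern:

```
// Restaurants liked by anyone within three friendship hops of me
MATCH (me)-[:IS_FRIEND_OF*1..3]->(foaf)-[:LIKES]->(restaurant)
RETURN DISTINCT restaurant
```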

Opportunities for leveraging connected data extend far beyond social and search. The pattern that applies to Graph Search is also applicable to bioinformatics, fraud detection, network management, logistics, and a variety of other use cases. Neo Technology has customers in all these areas (and more!) using the Neo4j graph database to achieve new and higher levels of insight.

I’m not Facebook... How can I get this?

Technology giants such as Facebook, Google, and Twitter have all built graph technologies from the ground up to differentiate and grow their businesses. Building and maintaining one’s own database management system, however, is not a practical solution if you’re not Facebook.

The good news is that companies wanting functionality like Graph Search are a click away from getting the tools they need to build it. At its core, Graph Search is a database. Unlike a decade ago, one can now find commercial off-the-shelf graph databases that are proven and robust, and built from the ground up to support connected data. 

Neo4j is the most widely used graph database today. Companies have adopted it because, for queries over connected data, it can be as much as 1000 times faster than a relational database, and because it is much easier to work with than shoehorning graphs into tables.

Neo4j is freely available as open source software, with a Community Edition available under the same open source license as MySQL, and an Enterprise Edition. Commercial subscriptions are available from Neo4j creator and sponsor Neo Technology.

Commercial users include Cisco, Adobe, Deutsche Telekom, Accenture, and many more; as well as lots of startups, including FiftyThree (makers of Paper, winner of Apple’s 2012 iPad App of the Year), Seth Godin’s Squidoo, and Justdial (one of India’s most talked-about startups).

As we move into an era where more and more companies are benefiting from understanding connected data, having the right tools available to anyone means that no one needs to get left behind. Neo4j is available for download today. Give it a try, or check out the interactive Cypher web console, to try out the Cypher graph query language immediately from your web browser.

Emil Eifrem and Philip Rathle, co-authors

Click on the image below to view the example query above in an online interactive Cypher console: