Monday, April 23, 2012

Streaming REST API - Interview with Michael Hunger

Recently, Michael Hunger blogged about his lab work on streaming in Neo4j's REST interface. On lab days, everyone on the Neo4j team gets to bump the priority of any engineering work that has been lingering in a background thread. I chatted with Michael about his work with streaming.

ABK:  What inspired you to focus on streaming for Neo4j?
MH:  Because it is a major factor in making Neo4j perform as well as possible, especially with so many languages / stacks connecting via the REST API. The existing approach is several orders of magnitude slower than embedded [note: Neo4j is embeddable on the JVM], not just one order of magnitude as was originally envisioned.

ABK:  What do you mean by "streaming" in this context? Is this HTTP streaming?
MH:  Yes, it is HTTP streaming combined with JSON streaming, with the internal calls to Neo4j generating lazy results (Iterables) instead of pulling all results from the db in one go. So writing to the stream advances the database operations (or their "cursors"). This applies to indexing, Cypher, and traversals.

ABK:  Ah, so this isn't for streaming binary data, like video or something, right?
MH:  Right. The "binary" data here is actually the JSON results from the Neo4j REST API.

ABK:  Does it require any changes to the existing clients?
MH:  The only change that is required is to signal to the server to return the data in a streaming manner. Right now that is done through an extended Accept header (application/json;stream=true), but that will probably change to a more standards-compliant transport encoding header.

ABK:  Could a client use this streaming to "page" results?
MH:  Good question. In theory yes, but then it would have to keep the connection open and stop receiving until the next page is requested, which would probably result in hogged connections and a timeout. But it can be used to retrieve just as much as is needed and then close the connection.

ABK:  Is multipart/mixed used to indicate chunks, or is the JSON stream left intact?
MH:  Right now the JSON stream is left intact. The chunking will be part of the transport encoding negotiation that will be added later. We put this in now (without requiring additional changes on the client) so that we can gather feedback from driver authors and users.

ABK:  When will this be available in Neo4j for public review?
MH:  It is already available in the SNAPSHOT version of Neo4j and is part of the first 1.8 milestone release, which is due this week.

ABK:  Thanks, Michael. I'm looking forward to trying it out.
MH:  And I look forward to your feedback. Thanks, Andreas.

Performance Results

With a freshly installed Neo4j 1.8-SNAPSHOT server started, I tried out the streaming as Michael recommended. First, I created a sample data set of 50,000 nodes with a Gremlin one-liner:
(0..50000).inject(0) { count,idx -> v2=g.addVertex(); g.addEdge(g.v(idx),v2,"TYPE"); count+1;}
Then from bash I ran curl to compare retrieving query results with and without streaming, the only difference being the ";stream=true" in the Accept header.
curl -i -o streamed.txt -XPOST -d'{ "query" : "start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p" }' -H "accept:application/json;stream=true" -H content-type:application/json http://localhost:7474/db/data/cypher
curl -i -o nonstreamed.txt -XPOST -d'{ "query" : "start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p" }' -H "accept:application/json" -H content-type:application/json http://localhost:7474/db/data/cypher
Running on my humble Mac laptop, the streaming request took 10 seconds to return a complete result, transferring 130 MB of data at between 8 and 15 MB/s. The normal non-streaming request took 1 minute, 8 seconds to provide the same result and used a heap of 2 GB. Pretty impressive. This is something to look forward to in the upcoming milestone release. Check out Michael's blog for even more detail.
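
To make the idea from the interview a bit more concrete (writing to the stream advances the database "cursor"), here is a minimal sketch in Python. It is only an illustration of lazy versus eager result handling, not Neo4j's server code: fake_cursor and stream_json_array are hypothetical names invented for this example, and the "database" is just a generator that logs which rows it has produced.

```python
import json
from typing import Iterable, Iterator, List


def fake_cursor(n: int, log: List[int]) -> Iterator[dict]:
    """Hypothetical stand-in for a lazy database result.

    A row is only produced (and logged) when the consumer asks for it.
    """
    for i in range(n):
        log.append(i)  # record that the "database" did work for row i
        yield {"id": i}


def stream_json_array(rows: Iterable[dict]) -> Iterator[str]:
    """Yield a JSON array chunk by chunk instead of building it in memory."""
    yield "["
    for index, row in enumerate(rows):
        prefix = "," if index else ""
        yield prefix + json.dumps(row, separators=(",", ":"))
    yield "]"


# Non-streaming: every row is pulled from the cursor before anything is sent.
eager_log: List[int] = []
eager_body = json.dumps(list(fake_cursor(5, eager_log)), separators=(",", ":"))

# Streaming: each chunk requested by the writer advances the cursor one row.
lazy_log: List[int] = []
chunks = stream_json_array(fake_cursor(5, lazy_log))
first_two = [next(chunks), next(chunks)]  # the opening "[" and row 0
rows_touched_so_far = list(lazy_log)      # the cursor has only produced row 0
chunks.close()                            # the client hangs up early

# Reading the stream to the end yields the same document as the eager path.
full_log: List[int] = []
streamed_body = "".join(stream_json_array(fake_cursor(5, full_log)))
```

In the eager path all five rows exist in memory before the first byte is written; in the streaming path a client that stops after the first row never causes the remaining rows to be computed at all.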

Cheers,
Andreas

Friday, April 20, 2012

Neo4j 1.7 GA "Bastuträsk Bänk" released


We’re very pleased to announce that Neo4j 1.7 GA, codenamed "Bastuträsk Bänk", is now generally available. The many improvements ushered in through milestones have been properly QA’d and documented, making 1.7 the preferred stable version for all production deployments. Let’s review the highlights.

Welcome to the Enterprise Turbo package

Johan. Speed.
For enterprise deployments contending with high-volume requests, we are working on a number of features that support these scenarios in terms of better and more predictable speed.

First out is the new Garbage Collection Resistant (GCR) cache - it's 10x faster and accommodates 10x more primitives. The GCR cache smooths out the rough spots that can occur when processing huge graphs with thousands of simultaneous user operations, by directly managing a fixed amount of memory with thread-safe operations.

While the other cache types are almost maintenance free and are very effective for general use, the GCR cache’s finer control over memory usage can achieve more consistent responsiveness when tuned correctly. The GCR is a good choice for large cluster deployments where you’re tuning every aspect of your system.

Cypher-[:DESCRIBES]->Results

Andres Taylor - Cypher champion
The craftsmanship that guides the Cypher language cares deeply about balancing clarity with comprehensive expressions, working towards a grammar that naturally matches the questions you would ask a graph. From the START to the MATCH and the WHERE, the clauses are intentionally unsurprising; they’re what you’d likely write when describing a question in an email.

With 1.7, Cypher now has a full range of common math functions for use in the RETURN and WHERE clauses. Combined with basic arithmetic, you can now say things like:
START a=node(3), c=node(2) RETURN a.age, c.age, abs(a.age - c.age)
START a=node(1) RETURN round(3.141592)
START a=node(*) WHERE sqrt(a.prop) > 5  RETURN a
START a=node(3) RETURN sign(-17), sign(0.1)
Sometimes it makes sense to consider multiple relationship types at the same time, as in “my friends and my neighbors.” Now Cypher matches allow you to combine relationships into a single path like so:
START me=node(1) MATCH (me)-[:FRIEND|NEIGHBOR]->(fandn) RETURN fandn
Similarly, in the WHERE clause you might want to accept a few different values for a property. Cypher’s IN operator lets you present the alternatives in a collection like this:
START a=node(3, 1, 2) WHERE a.name IN ["Peter", "Tobias"] RETURN a
Collections may contain more than you want, so you can pick which elements of the collection to return with HEAD, TAIL, LAST or FILTER. To complement the ‘?’ operator for optional properties, the new missing property operator ‘!’ defaults to a false value when the property is missing. Consider the difference in these two statements:
START n=node(*) WHERE n.belt? = 'white' AND n.age>32 RETURN n
START n=node(*) WHERE n.belt! = 'white' AND n.age>32 RETURN n
With the optional operator ‘?’ you’d get all the nodes with an age greater than 32, and if they have a belt, it must be white. With the missing operator ‘!’ you’d get all the nodes with an age greater than 32 that *must* also have a white belt. In the previous example, we also used a convenient new notation for indicating all nodes:
START a=node(*)
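
For readers who think better in code than in prose, the difference between the ‘?’ and ‘!’ operators described above can be modelled in a few lines of Python. This is only a sketch of the semantics as stated in this post (a missing property counts as a match for ‘?’ and as a non-match for ‘!’); the helper names and the sample nodes are made up for illustration and have nothing to do with how Cypher is implemented.

```python
from typing import Any, Dict, List


def optional_equals(node: Dict[str, Any], key: str, expected: Any) -> bool:
    """Model of `n.key? = expected`: a missing property counts as a match."""
    return key not in node or node[key] == expected


def strict_equals(node: Dict[str, Any], key: str, expected: Any) -> bool:
    """Model of `n.key! = expected`: a missing property counts as a non-match."""
    return key in node and node[key] == expected


# Hypothetical sample nodes, invented for this illustration.
nodes: List[Dict[str, Any]] = [
    {"name": "Ann", "age": 40, "belt": "white"},
    {"name": "Bob", "age": 35, "belt": "black"},
    {"name": "Cid", "age": 50},  # no belt property at all
    {"name": "Dee", "age": 20, "belt": "white"},
]

# WHERE n.belt? = 'white' AND n.age > 32
with_optional = [n["name"] for n in nodes
                 if optional_equals(n, "belt", "white") and n["age"] > 32]

# WHERE n.belt! = 'white' AND n.age > 32
with_strict = [n["name"] for n in nodes
               if strict_equals(n, "belt", "white") and n["age"] > 32]

print(with_optional)  # ['Ann', 'Cid']
print(with_strict)    # ['Ann']
```

Cid has no belt property, so he passes the ‘?’ version and is filtered out by the ‘!’ version; everyone else is treated the same by both.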

Other notable improvements

Like all of our releases, Neo4j 1.7 GA incorporates important performance improvements under the hood and fixes for various bugs - discovered both through the open community and by the field team working with customers. We try to get them into the open codebase as fast as possible, so everyone can benefit. Other notable features that have been added include:

  • SSL Support - When you want remote access to Neo4j across the public internet, yet secured, HTTPS is of course the way to go. Now you can provide your own certificate, or have a self-signed certificate auto-generated, and access Neo4j remotely across HTTPS.
  • Wildcards in security authorization rules - With this simple change, security rules can be more easily applied to branches of a URL like /protected/*/.
  • System properties can set configuration - We’ve made some changes to how Neo4j exposes configuration parameters, making it possible to use system properties to override settings. This is particularly convenient when deploying to different staging and production environments.
  • Cypher performance improvements through a first round of query execution optimizations.

The Changelogs

Full details about each of the changes are included in the changes.txt files, available for the community, advanced and enterprise components.

This is a strong new release that we think you’ll enjoy. For reference and more information on the new features that came in the different milestones, refer back to the 1.7.M01, 1.7.M02 and 1.7.M03 blogs and the changelogs (see above).

Please download it now. Heroku users also have it available immediately as the default version for new applications, so heroku addons:add neo4j now. Wherever you use Neo4j, join us in the Google group to chat about your progress and to share any new ideas you’d like to see as we move on to developing the 1.8 series.

The Neo4j Team

Thursday, April 12, 2012

Neo4j 1.7.M03 - Feature Complete


The full general release of Neo4j 1.7 is now in view, with this milestone marking feature completeness. This 1.7.M03 release is recommended for migrating your test servers, client applications and drivers in anticipation of 1.7.GA, since there will be no more visible API changes.

Atomic Array -[:renamed_to]-> Garbage Collection Resistant

I think we can all agree, “there are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors.” (Cheers Tim Bray). Naturally, we started by creating a new cache, which focused on occupying a fixed size in memory with improved performance. Check.

Then, there’s the name. We liked the sound of Atomic Array Cache, because it sounds like something you could wield against hordes of invading aliens. But in reality, that name only described the implementation details of the cache. The actual behavior of the cache, the reason why you’d choose it over the other caches, is that by occupying a fixed amount of space in memory, it is effective at avoiding those horrid Garbage Collection pauses. Appropriately, we’ve renamed it to the Garbage Collection Resistant (GCR) cache, to reflect its purpose.

And off by one? Well, that is the tricky thing about the GCR cache: getting it configured correctly so that you actually have a good, responsive application takes some hard thinking. For now, we’re only including it with the enterprise edition of Neo4j, generally used when you’re going into large scale deployment and are tweaking every knob in your system.
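
As a rough mental model of what "occupying a fixed amount of space" can mean for a cache, here is a toy sketch in Python. To be clear, this is not the GCR cache and makes no claim about its internals: FixedSlotCache is a hypothetical class invented for this illustration, it counts slots rather than bytes, and it ignores thread safety entirely. It only shows the general trade-off of a cache that is allocated once and never grows.

```python
from typing import Any, List, Optional, Tuple


class FixedSlotCache:
    """Toy cache with a fixed number of slots (hypothetical, for illustration).

    Each key maps to exactly one slot; a colliding key overwrites the
    previous occupant, so the cache never grows after construction.
    """

    def __init__(self, slots: int) -> None:
        # Allocated once up front; put() only ever replaces slot contents.
        self._slots: List[Optional[Tuple[int, Any]]] = [None] * slots

    def put(self, key: int, value: Any) -> None:
        self._slots[key % len(self._slots)] = (key, value)

    def get(self, key: int) -> Optional[Any]:
        entry = self._slots[key % len(self._slots)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None  # miss: never cached, or evicted by a colliding key

    def __len__(self) -> int:
        return sum(1 for entry in self._slots if entry is not None)


cache = FixedSlotCache(slots=4)
cache.put(1, "node-1")
cache.put(2, "node-2")
cache.put(5, "node-5")  # 5 % 4 == 1, so this replaces the entry for key 1
```

The tuning difficulty mentioned above shows up even in this toy: with too few slots, hot entries keep evicting each other; with too many, memory is reserved that the rest of the application could have used.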

Feedback Appreciated

We always appreciate your feedback, whether tweeting praise or concerns, or discussing things with us in the Google group. Please go download Neo4j 1.7.M03 right now, and let us know what you think.

Cheers,
your friendly neighborhood graphistas

Monday, April 9, 2012

Rabbithole, the Neo4j REPL console

Over the last few days the Neo4j community team worked on the initial iteration of an interactive Neo4j tutorial.
The first result we are proud to publish is a sharable console that runs an in-memory Neo4j instance in a web session.
It supports Cypher for querying the graph and Geoff for importing and modifying the graph data. The graph itself and the Cypher results are visualized in an overlay using d3.js.
You can easily get a link to share your current graph content and even tweet it.
For the web application we use the minimal Spark web framework.
The app is deployed to Heroku and available on GitHub.

Now, with all these nice things in place, imagine what we can start doing:

• Share a graph setup with just one URL in emails, Twitter and bookmarks, like http://tinyurl.com/cak4oc8
• Embed live example consoles into the Neo4j manual (we are currently experimenting with this)
• Easily embed working graph visualizations via iframes and widgets, possibly with different visualization options (see Max De Marzi's blog for some great examples)
• Build more interesting visualizations like geographic rendering and hive plots

And much, much more. Let's see if together we can build cool and innovative stuff around this! Feel free to discuss on the Neo4j mailing list!

Yours truly

Michael, Andreas and Peter