Friday, April 26, 2013

Data migration between MySQL and Neo4j

Luanne Misquitta, IDMission LLC

Many organizations that are looking at modeling highly connected data to add business intelligence or analytical capabilities using Neo4j already have a database in place.
Introducing a graph database into the mix does not have to be a disruptive process. As with any technology, understanding its place and contribution to the entire system is key to determining how the pieces fit together.

Dealing with two data stores

We picked a common approach to managing our data after determining that:
1. The existing database, MySQL in our case, would continue to be the primary system of record.
2. Neo4j would be a secondary data store with a much smaller subset of data, used in two specific ways:
a. OLTP mode where it is essential to the business to have some questions answered in near real time and therefore have certain sets of data as current as possible
b. Batch mode where some data is collected and processed in a delayed manner

This implied that first, we needed a way to bring the new Neo4j system up to speed with data already collected over time in our primary RDBMS, and secondly, a way to keep them both in sync once the Neo4j system was up and running.
Instead of exporting data and then importing it into Neo4j for the initial load, and then planning for keeping the databases in sync, we decided to first not worry about how the data would be supplied and just design our application without assumptions about the data source.

The model
For the purpose of this blog, I will use an extremely simplified set of entities to better describe the process. Note that the design and code are put together to illustrate the example and it is not necessarily working code.

Assume that we wish to model Customers buying Products sold by Merchants.
We set up POJOs representing the domain objects: a Product, a Merchant and a Customer class.

Their corresponding DAOs take care of persistence to and from Neo4j. We used mutating Cypher to store a representation of the object as nodes and relationships, in a few cases dropping down to the Neo4j API for complex objects.
Every entity ties back to the RDBMS via an id which is the primary key in the system of record and it is this key that is indexed.
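
To make the idea concrete, here is a minimal sketch of a domain object whose save is a create-or-update keyed by the RDBMS primary key. The names CustomerStore, InMemoryCustomerStore and CustomerSketch are hypothetical and not taken from our code base; the in-memory map only stands in for the DAO that would issue Cypher against Neo4j.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical persistence boundary; a real DAO would issue Cypher against Neo4j.
interface CustomerStore {
    /** Creates the entry if the id is unknown, otherwise updates it. Returns true when created. */
    boolean upsert(int id, String name);

    /** Returns the stored name for the id, or null if the id is unknown. */
    String nameOf(int id);
}

// In-memory stand-in keyed by the RDBMS primary key, so the domain can be tested without a database.
class InMemoryCustomerStore implements CustomerStore {
    private final Map<Integer, String> namesById = new HashMap<>();

    @Override
    public boolean upsert(int id, String name) {
        boolean created = !namesById.containsKey(id);
        namesById.put(id, name);
        return created;
    }

    @Override
    public String nameOf(int id) {
        return namesById.get(id);
    }
}

// Domain object: it knows its id and name, but not where the data came from.
class CustomerSketch {
    private final int id;
    private final String name;

    CustomerSketch(int id, String name) {
        this.id = id;
        this.name = name;
    }

    /** Creates or updates this customer in the given store. */
    boolean save(CustomerStore store) {
        return store.upsert(id, name);
    }
}
```

Because the domain object only talks to the store interface, the same class can be fed from SQL rows, an API or a message queue.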

The graph model for this looks like:

So, once that part is done, the application is testable, independent of the data source.

Initial Import

Now, for the initial import of data, the only thing that matters is to be able to transform the data received into the domain objects and save them.
Options include SQL queries to fetch only the data you need, an API exposed by the primary system if it exists, or an exported set of data.
Let us use SQL:

// In each fragment, rs is the java.sql.ResultSet of the query shown in the comment above it.

// SELECT id, name FROM customers WHERE ...
while (rs.next()) {
    Customer customer = new Customer(rs.getInt("id"), rs.getString("name"));
    customer.save(); // Creates or updates a customer node. Indexes the id if it is created.
}

// SELECT id, name FROM products WHERE ...
while (rs.next()) {
    Product product = new Product(rs.getInt("id"), rs.getString("name"));
    product.save(); // Creates or updates a product node. Indexes the id if it is created.
}

// SELECT id, name, type FROM merchants WHERE ...
while (rs.next()) {
    Merchant merchant = new Merchant(rs.getInt("id"), rs.getString("name"));
    merchant.setType(rs.getString("type"));
    // Creates or updates a merchant node. Creates a "type" node if it does not already exist.
    // Creates a relationship between the merchant and type node.
    // Indexes the type and merchant ID if they are created.
    merchant.save();
}

// SELECT customer_id, product_id, purchase_date FROM customerPurchases
while (rs.next()) {
    Customer customer = repository.getById(rs.getInt("customer_id"));
    // Creates a relation from the customer to the product and sets the purchase date as a property on the relation.
    customer.purchaseProduct(rs.getInt("product_id"), rs.getDate("purchase_date"));
}

Keeping data in sync

Once the data is imported and both systems are now running, the next task is to keep the data in sync. Depending on what you plan to do with Neo4j, you might decide that periodic imports of data serve the purpose and you might run a scheduled process to do essentially what we did in the initial import.
Or you might need to know about things as they happen. In that case, an event-based integration is a simple but powerful solution. As “events” take place in the primary system, such as a new product being created or a customer purchasing a product, the event is published and the Neo4j application picks it up and deals with it. The link between the two systems can be as simple as a messaging queue where the content of the message might be custom content, the result of an API call, or anything else that carries the information the secondary system needs to make sense of the event.
Whatever that content, again, all we need is the ability to pick it up, parse it, and call business methods on our domain.
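
As an illustration only, the "pick it up, parse it, call business methods" step could look like the sketch below. The pipe-delimited message format, the event names and the DomainActions interface are assumptions made for this example; they are not our actual message format, and a real handler would also validate malformed messages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical facade over the domain; in practice these would be methods on Product, Customer, etc.
interface DomainActions {
    void productCreated(int productId, String name);

    void productPurchased(int customerId, int productId, String purchaseDate);
}

class SyncEventHandler {
    private final DomainActions domain;

    SyncEventHandler(DomainActions domain) {
        this.domain = domain;
    }

    /** Parses one message of the assumed form TYPE|field|field... and calls the matching domain method. */
    void handle(String message) {
        String[] parts = message.split("\\|");
        switch (parts[0]) {
            case "PRODUCT_CREATED":
                domain.productCreated(Integer.parseInt(parts[1]), parts[2]);
                break;
            case "PRODUCT_PURCHASED":
                domain.productPurchased(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]), parts[3]);
                break;
            default:
                throw new IllegalArgumentException("Unknown event type: " + parts[0]);
        }
    }
}

// Recording implementation, so the handler can be exercised without Neo4j or a message queue.
class RecordingDomain implements DomainActions {
    final List<String> calls = new ArrayList<>();

    @Override
    public void productCreated(int productId, String name) {
        calls.add("product " + productId + " " + name);
    }

    @Override
    public void productPurchased(int customerId, int productId, String purchaseDate) {
        calls.add("purchase " + customerId + " " + productId + " " + purchaseDate);
    }
}
```

The handler knows nothing about the transport, so swapping the queue for a scheduled poll or an API callback only changes who calls handle.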

The result

The applications are loosely coupled and the problem of consuming data from multiple sources in multiple formats is reduced to a simple problem of parsing.

We found that this approach works well for us. The import can take a bit of time since it is transactional, but considering that the initial import is not a frequent process, the wait is worth it to avoid introducing another data import tool. Once past that point, the event-based sync works nicely.

As mentioned at the start of this article, understanding the pattern of data sync for your application is very important to determine how to go about it. If you need a one-time migration of data, the approach above might be overkill and you should consider some of the excellent tools available such as the Batch Importer, GEOFF, or the REST batch API if using the Neo4j Server.
Also, Spring Data is something to look into if you wish to use annotated automatically-mapped entity classes.

-Luanne Misquitta, IDMission LLC.

Wednesday, April 24, 2013

Gmail Email analysis with Neo4j - and spreadsheets

A bunch of different graphistas have pointed out to me in recent months that there is something funny about Graphs and email. Specifically, about graphs and email analysis. From my work in previous years at security companies, I know that Email Forensics is actually big business. Figuring out who emails whom, about what topics, with what frequency, at what times - is important. Especially when the proverbial sh*t hits the fan and fraud comes to light - like in the Enron case. How do I get insight into email traffic? How do I know what was communicated to whom? And how do I get that insight, without spending a true fortune?


So a couple of days ago I came across an article or two that featured Gmail Meter. This is of course not one of these powerful enterprise-ready forensics tools, but it is interesting. Written by Romain Vialard, it provides you with a really straightforward way to get all kinds of stats and data about your use of Gmail. It does so using a Google Apps Script that is available to anyone using Google Docs’ spreadsheet functionality. In this blog post, we’ll take a look at how we can actually use Romain’s output and generate a Neo4j database that will allow us to visually and graph-ically explore the email traffic in our Gmail inboxes - without doing any coding of course. Because I don’t know how to do that - but I do do spreadsheets, as you know by now.


Using Gmail Meter to create the dataset


The first thing you need to do to get going is to get Gmail Meter installed. To do that, you just create a Google Doc Spreadsheet, and insert Gmail Meter from the Script Gallery.



You will need to give it permission to analyse your mailbox (and potentially remove that when you want to uninstall it), but that’s easy. More instructions can be found over here should you need them. But really it’s dead easy. Once you have given it that permission, the Apps Script starts churning away at your mailbox.


The end result is a Google spreadsheet with two tabs:
  • the first tab (Sheet1) contains information about which email addresses you have been exchanging with, and how many emails you have been exchanging with them (sending, and receiving)
  • the second tab (Sheet2) contains more information about the conversations, number of words, etc.
Now all we need to do is create a neo4j database based on this data - and that too is really easy.

Importing the Gmail Meter data into Neo4j


For the graph import that I will illustrate here, we will only be using the first sheet. Basically just getting to grips with
  • the people that we are emailing to
  • the people that we are receiving emails from
  • the frequency that we are emailing to/from these contacts
There’s definitely more data here - but let’s start with this.

The way I have done it - which is probably not the only way to do it, but still - is to add two sheets to the workbook coming from Gmail Meter.
  • “Graph”: to convert the data from GmailMeter’s first worksheet into nodes and relationships
  • “Cypher”: to generate the Cypher statements that we can use to generate the Neo4j database and start playing around.
I have shared the worksheet over here - so please take a look and customize it wherever necessary. You should end up with something like this sheet:


Note that - as mentioned above - in the Graph sheet we have, in addition to the nodes and relationships, also added the number of emails to the “EMAIL” relationship as a property/weight. This will come in handy later when visualising the email traffic. In a larger graph you could actually use these weights for pathfinding algorithms as well - in case you would want to find out the volume of email traffic between two persons at different places in the graph, through other people.

You see that I am using the exact same techniques of my spreadsheet import blog post to generate the Cypher statements required. All I need to do after that is to put all statements into the “Cypher” worksheet, wrap it with a transaction - and copy paste that into the neo4j shell of my empty neo4j database. That will execute the queries, insert all nodes, relationships and properties, and voila - we have our Gmail Meter Graph!
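
For readers who would rather script this step than use spreadsheet formulas, the string concatenation that the “Cypher” sheet performs can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up, the generated statements assume Cypher 1.9-style syntax and a node auto-index on the "name" property (as the queries further down do), and the transaction wrapping for the shell is left out.

```java
class CypherStatementSketch {
    /** Builds a node-creation statement for one email address. Quotes inside the address are not escaped. */
    static String createNode(String address) {
        return "CREATE (n {name: \"" + address + "\"});";
    }

    /** Builds an EMAIL relationship that carries the number of emails as its "number" property. */
    static String createEmail(String from, String to, int number) {
        return "START a=node:node_auto_index(name=\"" + from + "\"), "
                + "b=node:node_auto_index(name=\"" + to + "\") "
                + "CREATE a-[:EMAIL {number: " + number + "}]->b;";
    }

    public static void main(String[] args) {
        // Two hypothetical rows from the Graph sheet.
        System.out.println(createNode("myaddress@gmail.com"));
        System.out.println(createNode("friend@example.com"));
        System.out.println(createEmail("myaddress@gmail.com", "friend@example.com", 5));
    }
}
```

Each spreadsheet row maps to one call, which is exactly what the formulas in the sheet do cell by cell.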

Exploring the Gmail Meter Graph


Once we have generated the database, we can explore it in all the ways that we can with Neo4j’s traditional visualisation/querying tools. First thing I did of course is to look at it with Webadmin. That gives you some ideas already, but things get a lot more interesting when you can visualise the weights (= numbers of emails between two persons).


To do that, I plugged another visualisation tool on top of the database to get a feel for the weight of certain relationships. Our friends at Linkurio.us actually have some very neat (and above all: simple AND powerful) ways to do this - as you can see below, it immediately gave me an idea of where the traffic is coming from and going to.



Obviously we can explore the network, and also query it with Cypher for interesting relationships:
  • find out who I am emailing,

START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail-[email:EMAIL]->otherperson
RETURN
id(otherperson), email.number, otherperson.name;

  • find out who is mailing me more than 4 emails,

START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail<-[email:EMAIL]-otherperson
WHERE
email.number > 4
RETURN
id(otherperson), email.number, otherperson.name;

  • find out who is at the same time sending mail to me, and receiving mail from me.

START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail<-[email:EMAIL]-otherperson,
mymail-[email2:EMAIL]->otherperson
WHERE
email.number > 2
AND
email2.number > 2
RETURN
id(otherperson), email.number as From, email2.number as To,
otherperson.name;

  • And of course Cypher offers many more interesting possibilities…

I hope you understand that - because we are only looking at data from one and only one mailbox, the dataset’s power is quite limited. But I am hoping you get the point that the graph exploration of this dataset is great. I did a little experiment where I actually put my professional email data and my personal email data (both use Gmail) into one neo4j database - and that was really interesting.

For those of you wanting more detailed info on this topic, I would encourage you to take a look at the wonderful Graph Databases book, which has a specific chapter about email analysis.

Hoping this was useful - it sure was a learning experience (again) for me.

Monday, April 22, 2013

Neo4j goes Nasa Space Apps Challenge

Bonjour, namaste, aloha, hej!



This past weekend, a team from Neo Technology participated in the NASA International Space Apps Challenge. Pernilla, Tobias, and Mattias from Neo Technology joined forces with our friend Hatim, who is an organizer of the Stockholm Neo4j meetup group. Together we joined the event in Gothenburg, and formed Team Awesome (team name was not where we spent our imaginative energy) to tackle one of the challenges NASA threw our way.

After the introduction and other meet-and-greet activities that were part of the event, we got to work on a challenge called “A database for Near Earth Objects”. What could be more fitting? It was a challenge involving databases, and the things to store are called NEOs...



Our proposed approach to the problem was to make it really simple for the everyday person to contribute findings of Near Earth Objects to these databases. The data used for tracking these objects contains many complex numbers and terms that you’d have to be at least a hobby-level astronomer to understand and calculate. After spending some time analysing the meaning of these numbers, we realized that much of it could be computed from the EXIF data present in photos taken by modern cameras. We set out to build a service that allows users to upload photos of objects they’ve seen in the sky. The service would then use the EXIF data from each photo to compute where and when the object was seen, and match it up with other potential sightings of the same object. These possible matches would be presented on the web site for users to determine, in a crowdsourced way, which photos were of the same objects. Given multiple photos of the same object, the system can compute the exact location of the object in space, as well as its direction. This allows the system to find even more possible matches, which, when verified, further improve the accuracy of the computed data and allow for the computation of the arc of the object, etc., giving us all the data used for tracking Near Earth Objects. Essentially, this takes all the hard parts out of the process and makes discovering Near Earth Objects easy and fun.

While we didn’t get much further than the above idea description and a crude mock-up of the web app, we don’t think that it would be a very hard project to pull off. Had we done the research before arriving in Gothenburg, we would probably have made pretty good progress towards our end goal. As things are now, most of what we have to show for our work is the presentation we made for the closing of the event.

Successful app or not, we still had a great time. It was wonderful to get to spend some time on a challenge from an area where we don’t usually work, and it was a great learning experience. But possibly even more rewarding was meeting all the interesting people. Not just the other participants and the amazing organizers, but also Olle Norberg of The Swedish National Space Board (Rymdstyrelsen), and of course Christer Fuglesang, Sweden’s only astronaut, whom it was a privilege to meet and talk with. It’s not every day you get to pitch your space app idea to someone who has actually been in space.




Take care, and keep watching the skies,

Pernilla, Tobias, and Mattias


Monday, April 15, 2013

Almost There: Neo4j 1.9-RC1!

Today is Leonhard Euler’s birthday, and we’re celebrating by announcing a first Release Candidate for Neo4j 1.9, now available for download! This release candidate includes a number of incremental changes from the last Milestone (1.9-M05), and with them the last set of features we'd love our community to try out as we prepare Neo4j 1.9 for General Availability (GA).

Google is celebrating Euler’s birthday with this Doodle.
Neo4j rings it in with a 1.9 Release Candidate!


Key changes since the last milestone are as follows:


High Availability


  • Introduction of pseudo quorum writes. If half or more of the instances are unreachable (i.e. have gone down), the instance will stop accepting write requests and all subsequent transactions will time out. Transactions will be able to resume once quorum is re-established.
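
The rule in the bullet above can be written down as a one-line check. This is a sketch of the stated rule only, not Neo4j's implementation; the class and method names are hypothetical, and it assumes that both counts include the instance doing the check.

```java
class QuorumSketch {
    /**
     * Returns true while fewer than half of the cluster members are unreachable,
     * i.e. while a strict majority is still reachable.
     */
    static boolean acceptsWrites(int clusterSize, int reachable) {
        int unreachable = clusterSize - reachable;
        return 2 * unreachable < clusterSize;
    }
}
```

With three instances, losing one still allows writes; with four, losing two ("half or more") does not.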


Backup


  • Neo4j now automatically determines what type of backup should be performed based on the contents of the target directory. -full and -incremental backup flags are now deprecated.


Cluster

  • The experimental mechanism for automatically assigning server ids based on the instance's URI has been changed. The administrator must now explicitly set integer server ids in exactly the same manner as in 1.8.
  • We have removed an experimental feature introduced in earlier 1.9 milestones, which added the ability to specify a central cluster definition URI. This turned out to be underused and to introduce unneeded complexity.
  • It is now possible to introduce a new instance to the cluster to replace a failed one. This requires the new instance to have the same ID as the one that failed.
  • Cluster formation requires a majority of instances to be available, based on the instance count implied by initial_hosts
  • Fixes addressing cluster formation, when instances are concurrently started up
  • The cluster will explicitly deny instances from joining if they have a server_id that is already in use

Index Provider

  • Lucene upgraded to 3.6.2

Server

  • Introduces a new welcome screen in the web UI, containing a guide to Neo4j, aimed at helping new users find their way around the basics. Also includes several small aesthetic improvements.

Cypher

  • Fixes #578 - problem with profiling queries that use LIMIT
  • Fixes #550 - problem when using START after WITH
  • Allows single node patterns in MATCH
  • Fixes problem for some patterns and all-nodes start points
  • Fixes #650 - issue when doing multiple aggregation expressions on the same identifier
  • Added timestamp function

Packaging

  • plugins/ subdirectory is searched recursively for server plugins

Manual
  • The HA Setup tutorial has been fully rewritten to reflect the latest functionality. It also provides two distinct paths in the examples: one for local testing, and one for production setup.


As always with a release candidate, please download and examine this version thoroughly and help us spot anything that is amiss. We’re looking forward to the final 1.9 GA release in a short time.


Enjoy!


Philip Rathle & The Neo4j Team

Heroku Addon News: New "Try" Plan Migration and Request for Feedback




We’ve been working to improve our architecture in our Heroku Add-on.  We’ve also been working on making it possible for you guys to migrate off of our deprecated Test plan, and onto our supported Try plan.  That’s taken longer than we thought, and we’ve learned a lot along the way.

Now, it’s possible for you to upgrade to our Try plan. We’ve implemented feature toggles for this, and you’ll get an email when the toggle is enabled on one of your databases.  We’ll be enabling this for all users shortly.



Now here’s the good bit: The first ten people who convert to the Try plan are going to get a Neo4j T-shirt.  It’s our way of thanking you for sticking with us.  The Try plan is free and will remain so.  We’re quite happy that we now have a reliable way of converting you to a future paid plan, however!

We’re still making progress towards our goal of a robust Neo4j offering in the cloud, and we think you’ll like the improvements that we’re making: we’re going to make it more reliable, secure and informative.

However: we want to ask for the opinion of the people who matter most: you, the users.  So, if you have comments or suggestions, please don’t hold back: email us at heroku-feedback@neotechnology.com.

Julian Simpson

From the land of the Long White Cloud

Monday, April 8, 2013

Nodes are people, too



Neo4j 2.0 will let you define sets of nodes within the graph
Philip Rathle
Senior Director of Products

Update: 2.0.0-M02 is now available


Today we are releasing Milestone Release Neo4j 2.0.0-M01 of the Neo4j 2.0 series which we expect to be generally available (GA) in the next couple months. This release is significant in that it is the first time since the inception of Neo4j thirteen years ago that we are making a change to the property graph model. Specifically, we will be adding a new construct: labels.


We’ve completed a first cut at a significant addition to the data model, and are opening the code up now for early comment. Consider this milestone to be an experimental release, intended to solicit input. We look forward to hearing how you'd like to use these new features, and can't wait to hear what you think.  


It’s a What?

Let’s say you created a node for a person named Joe. Joe is not just any node: he is a person. Therefore you would probably want to designate the node for Joe as being a “Person”. If you’ve worked with Neo4j before, chances are that you’ve done this by adding a property called “type” with value “Person”, as follows:



This is useful, because now I can differentiate Joe from things in my graph that are quite different, such as “household goods” nodes and “geo location” nodes. Rightly so, these things should receive very different treatment.


Now let’s say you also want to give Joe a party affiliation: Left-Wing, Right-Wing, or the moderate Middle-Wing. While you could do this with a property as well, you may decide that you want to easily find all people of a given party affiliation. Knowing that Joe is "Middle-Wing", you might decide to break the parties into nodes, and then associate Joe with his party, as below:

One thing you’d now naturally want the graph to do, is to automatically index the “Person” nodes (and no other nodes), according to the unique identifier for “Person”. (Let’s oversimplify and say this is "name"). If you’re using Cypher, this is a challenge today. In fact it’s not possible at all, because Neo4j doesn’t inherently know anything about “Person” being different from geo locations. If you want to index “name”, you end up doing it for everything in the graph, which mixes concerns. Geo Location names aren’t the same as person names, any more than a city is like a person. As for the “Middle-Wing” node, it ends up becoming extremely dense, cluttering the graph with lots of connections whose sole purpose is to designate nodes as belonging to a group.


We’ve been looking at better ways to do this. The ideal solution would help to make one’s graph more understandable, as well as to make Cypher more powerful, by allowing it to home in on nodes (as well as to index them) according to what they are.


2.0 therefore introduces a means of grouping or categorizing nodes. Provisionally we are calling this construct a “Label”. The term “Label” speaks to its generic use, and to the fact that nodes can have multiple labels. One of the many uses of labels--and perhaps the most intuitive one at first--is to provide "hooks" in the graph that you can associate with your application's type system. Because the facility isn't itself explicitly hierarchical (it's just literally a tag, of which you can have zero to many per node), they're being called labels.


Labels


A graph is a graph because it has relationships in the data. In a Property Graph, a relationship always has a type, describing how two nodes are related. Labels expand on that idea, describing how entire sets of nodes are related. This is a grouping mechanism for nodes. How does it work? Very simple: in the example above, rather than adding a “Type” property and connecting Joe to a Party node, you would add two labels: one for “Person”, and one for “Middle-Wing”, just like so:




This opens up quite a few possibilities, and probably stirs up a lot of ideas in your head. Rather than color your thinking about how to use labels, let’s look at an example using different color sets.

Color me happy


Let’s say we have an arbitrary domain of loosely related stuff, within which we at least know that things can be red, green, or blue. We could just add a “color” property to each node, or relate them to a value node for each color. But because we want to always work within this group, we’ll use labels to identify members of the sets.


First, create something red:

CREATE a node with a Label


CREATE (thing:Red {uid: "TK-421", make: 191860})
RETURN thing;


To find the thing we just created, we can search within just the Red nodes, then return the labels:

Find the Labels on a node


MATCH (thing:Red)
WHERE thing.uid = "TK-421"
RETURN labels(thing);


Why labels, plural? Because nodes can have multiple labels. Let's say that "TK-421" also belongs to the blue set. Add a blue label like this:


Add a Label to a node


MATCH (thing:Red)
WHERE thing.uid = "TK-421"
SET thing :Blue;

The benefits of intentional labeling


While some Danes may be nervous about labels, much good comes from their use. Applying a label to a set of nodes makes your intention obvious — "these nodes are accessed frequently and thought of as a group." The database itself can gain benefit from having your intention be explicit, because it can now do things with this information.

For starters, Neo4j can create indexes that will improve the performance when looking for nodes within the set. (Note the new Cypher syntax for index creation!):

CREATE INDEXES to speed up finding Red and Blue nodes

CREATE INDEX ON :Red(uid);
CREATE INDEX ON :Blue(uid);
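
To see why scoping an index to a label keeps concerns separate, consider the toy model below. It is purely illustrative and says nothing about how Neo4j implements its indexes; LabelIndexSketch and its methods are hypothetical names. Each label gets its own lookup table, so a uid lookup among Red nodes never touches Blue nodes.

```java
import java.util.HashMap;
import java.util.Map;

class LabelIndexSketch {
    // label -> (uid value -> node id): one independent lookup table per label
    private final Map<String, Map<String, Long>> byLabel = new HashMap<>();

    /** Registers a node id under the given label and uid. */
    void index(String label, String uid, long nodeId) {
        byLabel.computeIfAbsent(label, k -> new HashMap<>()).put(uid, nodeId);
    }

    /** Looks up a node id within one label only; returns null if it is not found there. */
    Long find(String label, String uid) {
        Map<String, Long> index = byLabel.get(label);
        return index == null ? null : index.get(uid);
    }
}
```

A node with two labels, like "TK-421" above, simply appears in both tables.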

Create a second labeled node and a relationship

CREATE (other_thing:Blue {uid: "TURK-182", make: 181663})
WITH other_thing
MATCH (thing:Red)
WHERE thing.uid = "TK-421"
CREATE (thing)-[:HONORS]->(other_thing)
RETURN thing, other_thing;

There is much more fun to be had. Details are, as always, in the Neo4j Manual. Again, this simple change can have profound impact. As we're exploring the possibilities and tuning the language and APIs, we'd love for you to play around with labels. Let us know how you want to use them, by providing feedback on the Google Group. (That way other people can see your feedback and respond with their own opinions and observations.)

One more thing...

Just in CASE


Cypher has a new CASE expression for mapping inputs to result values: a cousin to similar constructs found in every common programming language.  

  • In its simple form, CASE uses a direct comparison of a property for picking the result value from the first matching WHEN:
MATCH (r:Red) RETURN CASE r.uid
    WHEN "TK-421" THEN "Why aren’t you at your post?"
    WHEN "TURK-182" THEN "the work of one man"
    ELSE "..."
END

  • In the general form, each WHEN uses an arbitrary predicate for picking the result:

MATCH (r:Red) RETURN CASE
    WHEN r.make > 180000 THEN "reddish"
    WHEN r.make < 180000 THEN "purplish"
    ELSE "simply red"
END

Summary


Enjoy this preview milestone! Use the Neo4j Google Group to tell the Neo4j team and other members of the Neo4j community what you think. There are a few other improvements baked into this release as well, including to the shell, that we'll cover in upcoming blog posts. And of course you'll be seeing more in upcoming Milestones of Neo4j 2.0. Meanwhile, we have upgraded a preview of the online console for you to test the new features; it now features the Matrix graph enhanced with labels.

One final note: if you are planning to go into production soon, we strongly recommend developing against 1.9, which we expect to be going GA in the next couple weeks (look for an RC this week).


Update - 2.0.0-M02 introduces Remote Transactions


The latest 2.0 milestone introduces a new HTTP endpoint for managing multiple Cypher statements within a single transaction. Just create the transaction with the first batch of statements. You'll receive a URL to which additional requests can be submitted, and for committing or rolling back the transaction. See the Neo4j manual for all the details. 


Enjoy, from the Neo4j Team!