Wednesday, August 28, 2013

GraphConnect SF: Innovate. Share. Connect.

I’m super stoked for our conference and want to tell you why!

GraphConnect, the highly successful graph database conference, returns to San Francisco on October 3 and 4, welcoming all graph database enthusiasts to explore new ideas, share innovations in graph technology, and make connections with researchers and developers from around the globe.

This year’s GraphConnect SF will include four different tutorials (including two brand new ones!), 20+ scheduled presentations, an unconference track, and a GraphClinic, all packed into two days.

The agenda is packed with great presentations from the world's leading graph users and researchers, including representatives from Neo Technology, GraphAlchemist, AtomRain, Brinqa Risk Analytics, Infoclear, Glowbl, Information Analysis and more! Some of the highlights are:

Building Hybrid Apps with Neo4j and Windows Azure
David Makogon, Sr. Cloud Architect, Microsoft
In this session, we’ll see how to deploy and configure Neo4j, both standalone and HA. We’ll also show how to access your Neo4j deployment from IaaS, PaaS, and Web Sites, and walk through security best practices.

The Increasing Value of Graph Visual Analysis in National Security, Public Safety, Communication Networks, and Financial Transactions
Jin Kim, Product Marketing, Tom Sawyer
In this session, we will explore the role of graph visualization in complex analysis projects such as national security, public safety, communication networks, and financial fraud.

How to search, explore and visualize Neo4j with Linkurious
Jean Villedieu, Co-founder, Linkurious
Now you can explore your data with Linkurious, a web-based graph visualization solution for Neo4j. We’ll see how everyone can use it to solve common problems like correcting errors, identifying patterns or finding and communicating insights.

Topics at GraphConnect SF include:
  • Concrete use-cases from a variety of industries and customers
    • Supply chain and master data management solutions
    • Entertainment with content, conversations and consumers
    • The many graphs of telecommunications
    • Operational and security risk management
  • Technical talks on how to implement these solutions using Neo4j
  • Delightful detours into visualization and analytics
  • Beer (especially Belgian!)
  • and the many other use cases and projects in the emerging graph database ecosystem!

WHAT? 
GraphConnect SF is the graph database conference, focusing on graph databases and applications using connected data.

WHERE? 
GraphConnect SF will be held at UCSF’s Mission Bay Conference Center.

WHEN? 
This year’s conference is on Friday, October 4 and the tutorials will run on Thursday, October 3.

HOW DO YOU PARTICIPATE?
Register: Sign up soon since the Early Bird Pricing ends August 31.

Submit a lightning talk: We’re running an unconference session featuring lightning talks at GraphConnect SF. If you have something interesting to present in 5 minutes, submit your proposal for a lightning talk.

Tweet GraphConnect: Help spread the word! Everyone who posts an original tweet mentioning @graphconnect gets a printed copy of the Graph Databases book for free, either to pick up at the conference or at one of our offices. Just email us with a link to your tweet.

Sponsor: Get your brand out in front of the GraphConnect audience; email us at graphconnect@neotechnology.com

See you in San Francisco in October!

Adam Herzog

Wednesday, August 21, 2013

And Now for Something Completely Different: Using OWL with Neo4j

Why would you want to do this?

OWL has been around for a while now and is used for a variety of semantic applications. Ontologies are freely available and help developers to create models for real world scenarios. They can be instantiated, combined and enriched using SWRL rules and a reasoner such as Hermit or Pellet. The reasons for creating such a representation of data differ: natural language processing, reusing data across domains or contextualisation are just some of many. The data obtained is then stored in a knowledge base, from which it can be retrieved using SPARQL queries. But if it's all already there, what's the point of combining it with a graph database?

While SPARQL certainly has its strong points, like using different ontologies at the same time and its similarity to the well-known SQL, it also has weaknesses. Triple stores, which are the starting point for most SPARQL applications, consume a lot of disk space compared to relational databases. They are also slow for very large datasets.
Neo4j stores whole graphs as opposed to “just” triples. It has an easy-to-learn and easy-to-use query language and a web-based graphical interface that allows users to easily browse and explore the graph. It is also fast for querying and scales well to larger datasets.

As is usually the case, there is no ideal solution; it really depends on the use case, considering, for example:
  • the volume, frequency and connectedness of incoming data
  • importance of speed and size
  • the type of query executed on the database

The playground

There is the PROV-O ontology, which models causation and influence between activities, agents and entities. This concept is fairly abstract but useful to answer a number of questions related to the origin of entities (and what may have influenced them throughout their lifetime). It is applicable to a number of fields, for example social networking (“Who was the author of the blog post that influenced Peter to write his mashup?”) or experimenting (“Who was the last person to access the experiment before it failed and when did he access it?”).

PROV-O is used in the BonFIRE project, which is a multi-site cloud experimentation and testing facility. There are people (agents) conducting experiments using resources (entities). At an infrastructure level, to perform their experiments, they create, use and destroy (activities) compute nodes, storages and virtual networks (entities). After their experiment has finished, they download the results (entities) from the virtual machine for further analysis. These results are influenced by a large number of activities and agents, and often it is difficult to determine how such a result came to be, who was involved in its formation or why it is different from other results. Using provenance, these questions can be answered.

Preparations

In BonFIRE, the data arrives on a RabbitMQ queue as a set of JSON messages that look like this:

{"timestamp":1375801302,"eventType":"state.shutdown","objectType":"compute","objectId":"/locations/server1/computes/123","groupId":"group1","userId":"bert"}

In this case, bert shut down compute node 123 located on server1. The message is deserialized into Java classes, which are used to transform it (“manually”) into triples. From the single message in the above example, we derive several triples that look something like this:

:Action_state.shutdown_1375801302 rdf:type :Action
:Compute_/locations/server1/computes/123 rdf:type :Compute
:Compute_/locations/server1/computes/123 prov:invalidatedBy :Action_state.shutdown_1375801302
:Experimenter_Bert rdf:type :Experimenter
:Experimenter_Bert prov:wasAssociatedWith :Action_state.shutdown_1375801302
...

The prefixes used are defined in the ontology into which these triples are going to be imported.
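This mapping can be sketched in plain Java. The following is an illustrative sketch, not the actual BonFIRE code: the class name, the helper and the exact naming scheme for individuals are assumptions based on the example triples above.

```java
// Hypothetical sketch of the message-to-triples mapping shown above.
// Field values come from the example JSON message; the naming scheme
// mirrors the example triples.
public class EventToTriples {

    static String[] toTriples(long timestamp, String eventType,
                              String objectType, String objectId, String userId) {
        String action = ":Action_" + eventType + "_" + timestamp;
        String object = ":" + capitalize(objectType) + "_" + objectId;
        String agent  = ":Experimenter_" + capitalize(userId);
        return new String[] {
            action + " rdf:type :Action",
            object + " rdf:type :" + capitalize(objectType),
            object + " prov:invalidatedBy " + action,
            agent + " rdf:type :Experimenter",
            agent + " prov:wasAssociatedWith " + action
        };
    }

    static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        for (String t : toTriples(1375801302L, "state.shutdown", "compute",
                "/locations/server1/computes/123", "bert")) {
            System.out.println(t);
        }
    }
}
```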

The above step is not necessary if the messages are supposed to go into the ontology directly – OWLAPI could be used instead to create individuals, properties and so on. Transforming them to triples, however, provides an interface that can read data from all kinds of sources, as long as it's formatted as triples. If OWLAPI were used directly, the code would have to be changed every time the data format changes.

These triples can then be added to an ontology using the OWLRDFConsumer class from the OWLAPI. This adds the triples to the ontology where the reasoner can be invoked to enrich the data. So far, that's not really special. The interesting bit follows after the reasoning has taken place.

Getting graphy

Now there is this ontology object sitting in the memory, which contains the ontology itself as well as the individuals that came from the triples. Now it could simply be stored in a knowledge base but if it was, you wouldn't be reading about it here :)

An ontology is a graph. It has a top node (owl:Thing) and classes extending it. There are individuals that belong to classes and object properties connecting the individuals. Individuals can have data properties and annotations that can be represented as node properties and relationship properties or as relationship types.

The import of an ontology is pretty straightforward:

Step 1

The only object you need is the ontology object created earlier. It could also be loaded from a file; that doesn't make a difference.

private void importOntology(OWLOntology ontology) throws Exception {
    OWLReasoner reasoner = new Reasoner(ontology);

    if (!reasoner.isConsistent()) {
        logger.error("Ontology is inconsistent");
        // throw your exception of choice here
        throw new Exception("Ontology is inconsistent");
    }
    Transaction tx = db.beginTx();
    try {

Step 2

Create a starting node in Neo4j representing the owl:Thing node. This is the root node of the graph we're going to create.

            Node thingNode = getOrCreateNodeWithUniqueFactory("owl:Thing");


Step 3

Get all the classes defined in the ontology and add them to the graph.

            for (OWLClass c : ontology.getClassesInSignature(true)) {
                String classString = c.toString();
                if (classString.contains("#")) {
                    classString = classString.substring(
                        classString.indexOf("#") + 1, classString.lastIndexOf(">"));
                }
                Node classNode = getOrCreateNodeWithUniqueFactory(classString);

Step 4

Find out if they have any super classes. If they do, link them. If they don't, link back to owl:Thing. Make sure only to link to the direct super classes! A custom relationship type named “isA” is used here for both the subclass and instance relations.

                NodeSet<OWLClass> superclasses = reasoner.getSuperClasses(c, true);

                if (superclasses.isEmpty()) {
                    classNode.createRelationshipTo(thingNode,
                        DynamicRelationshipType.withName("isA"));
                } else {
                    for (org.semanticweb.owlapi.reasoner.Node<OWLClass> parentOWLNode : superclasses) {
                        OWLClassExpression parent = parentOWLNode.getRepresentativeElement();
                        String parentString = parent.toString();
                        if (parentString.contains("#")) {
                            parentString = parentString.substring(
                                parentString.indexOf("#") + 1, parentString.lastIndexOf(">"));
                        }
                        Node parentNode = getOrCreateNodeWithUniqueFactory(parentString);
                        classNode.createRelationshipTo(parentNode,
                            DynamicRelationshipType.withName("isA"));
                    }
                }

Step 5

Now for each class, get all the individuals. Create nodes and link them back to their parent class.

                for (org.semanticweb.owlapi.reasoner.Node<OWLNamedIndividual> in
                        : reasoner.getInstances(c, true)) {
                    OWLNamedIndividual i = in.getRepresentativeElement();
                    String indString = i.toString();
                    if (indString.contains("#")) {
                        indString = indString.substring(
                            indString.indexOf("#") + 1, indString.lastIndexOf(">"));
                    }
                    Node individualNode = getOrCreateNodeWithUniqueFactory(indString);

                    individualNode.createRelationshipTo(classNode,
                        DynamicRelationshipType.withName("isA"));

Step 6

For each individual, get all object properties and all data properties. Add them to the graph as node properties or relationships. Make sure to get all axioms, not just the asserted ones.

                    for (OWLObjectPropertyExpression objectProperty :
                            ontology.getObjectPropertiesInSignature()) {
                        for (org.semanticweb.owlapi.reasoner.Node<OWLNamedIndividual> object
                                : reasoner.getObjectPropertyValues(i, objectProperty)) {
                            String reltype = objectProperty.toString();
                            reltype = reltype.substring(
                                reltype.indexOf("#") + 1, reltype.lastIndexOf(">"));

                            String s = object.getRepresentativeElement().toString();
                            s = s.substring(s.indexOf("#") + 1, s.lastIndexOf(">"));
                            Node objectNode = getOrCreateNodeWithUniqueFactory(s);
                            individualNode.createRelationshipTo(objectNode,
                                DynamicRelationshipType.withName(reltype));
                        }
                    }

                    for (OWLDataPropertyExpression dataProperty :
                            ontology.getDataPropertiesInSignature()) {
                        for (OWLLiteral object : reasoner.getDataPropertyValues(
                                i, dataProperty.asOWLDataProperty())) {
                            String reltype = dataProperty.asOWLDataProperty().toString();
                            reltype = reltype.substring(
                                reltype.indexOf("#") + 1, reltype.lastIndexOf(">"));

                            String s = object.toString();
                            individualNode.setProperty(reltype, s);
                        }
                    }
                }
            }
            tx.success();
        } finally {
            tx.finish();
        }
    }
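The substring logic that trims an OWLAPI rendering down to its local name appears five times in the listing above. It could be factored into a small helper; this is a sketch under the same rendering assumption, not part of the original code:

```java
// Sketch: the IRI-trimming pattern from the listing, factored into one
// helper. Assumes entities render as "<http://some/iri#LocalName>";
// anything without a "#" (e.g. "owl:Thing") is returned unchanged.
public class OntologyNames {

    static String localName(String rendered) {
        if (rendered.contains("#")) {
            return rendered.substring(
                rendered.indexOf("#") + 1, rendered.lastIndexOf(">"));
        }
        return rendered;
    }

    public static void main(String[] args) {
        System.out.println(localName("<http://www.w3.org/ns/prov#Activity>"));  // Activity
        System.out.println(localName("owl:Thing"));                             // owl:Thing
    }
}
```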


That's it, you're done! Now for the fun bit: querying the ontology!

Graphwalking

This is the graph now sitting in the database:



It has the ontology as well as all the individuals and properties, represented in their “natural” form. Now the querying can begin. Whether it is a simple query to find out what happened to a specific VM (entity) during its lifecycle

START e=node:name(name="experiment123"), ag=node:name(name="Agent")
MATCH e-[r:hadActivity]->ac-->a-[:isA*]->ag
RETURN distinct e.name as experiment, type(r) as relationship, a.name as agent,
       ac.name as activity, ac.startedAtTime as starttime, ac.endedAtTime as endtime
ORDER BY starttime

or do some more complicated pattern matching to find out how two experiments are different when they look the same at first glance – the only boundary is imagination.

Conclusion

Protégé comes with a simple visualisation and the ability to execute SPARQL queries. Neo4j has Cypher, which makes querying the imported ontology much more intuitive - ontologies are graphs after all. The webadmin interface also allows better "exploring" of the graph. Time is not an issue in this case, because the ontology import is not time-critical: it's done only once, after the experiment has finished, and imports the whole ontology. For an ontology containing several hours of experiment data, the import takes only a few seconds. Once the graph has been imported, querying is fast, which makes it a great tool to analyse and visualise ontologies.

by Stefanie Wiegand


Monday, August 19, 2013

Finding the Shortest Path - through the Park

As you may know by now, I like beer. A lot - why else would I keep writing and talking about it? But there’s more to life than sweet beverages, and one of the things that I have been doing for as long as I can remember is Orienteering. I have been practicing the sport in Belgium since 1984 - I was 11 years old. My dad used to take me to races all across the continent - we truly had a blast. And we still do: I still orienteer almost every week, and so does my dad. Now I take my 8 and 10-year old kids with me to the races, and their granddad cheers them on every step of the way. It’s a fantastic family sport.


One of the reasons why it is so fantastic is that orienteering is a “thinking sport”. You have to concentrate to navigate. You have to run to have the best time (it’s a race), but if you run too fast, you are sure to make navigation mistakes. You have to find the balance between physical and mental fitness - which is hard, but completely awesome when you succeed. And: it’s outdoors - in the woods and fields. What’s not to like?



So what does that have to do with neo4j? Well, orienteering is all about “finding the shortest path”: the *fastest* route from start to finish. Fast can be short. Fast can also mean that it is better to take a detour: if it is easier to run the longer route than to walk the shorter route, you are better off choosing the longer route. In essence, every orienteering race is … a graph problem waiting to be solved in the middle of nature.


Orienteering = a green graph problem

In case you don’t know: orienteering races are a bit like an obstacle race. Every participant gets assigned a course, out there in the green forests and fields, and along that course are sequences of beacons that one needs to get to in order. Such a sequence is … a path on a graph - you have to choose how to navigate from obstacle to obstacle, from node to node.


Essentially, the orienteer has to navigate and choose the fastest route. Effectively, finding the fastest route boils down to a “weighted shortest path” calculation. You calculate the shortest path using
  • distance: shorter = better
  • runnability: higher = better. Runnability can be affected by the type of terrain (running on a road? through a field? through a forest? through a forest with soil covered with plants? over a hill? through a valley? …)
as your parameters. For every “leg” of the race, you estimate the likely “best route”, based on the assumption that distance / runnability is a good indicator of your likely speed.

Example: a 2 control race in Antwerp, Belgium

This is the map of a training race that I did with my kids in a beautiful Antwerp park.



As you can see - the race assignment is a graph.



If we then look at every leg separately, you can see that for every leg, there are a number of route options.


  • Leg 1: The red route is the safe choice - running along the roads - but takes quite a detour. The blue route cuts straight across the field - but then requires me to go straight through the forest for a short distance.
  • Leg 2: The red route is the shortest - but requires me to run straight through the forest. The blue route just races along the forest road.
  • Leg 3: The red route just goes straight to the finish line. The blue route cuts through the forest and then follows the road. The green route safely hurries along the roads.



So 3 controls, and different routes with different characteristics. As you can see in the schematic representations, every route has different “waypoints” - specific points of interest that I can identify on the map and recognize in the “field”. These waypoints are extremely important for the navigation exercise that we are doing - they allow us to break the problem up into smaller pieces and evaluate our options.


Intuitively, all of us will have a “feeling” about what would be the best route choice, but now let’s use graph algorithms to do this for us!


Graph database model to navigate

In order to apply a graph algorithm, we first need to create a graph. These are my nodes:
  • Control nodes: the race beacons that I have to pass by
  • The alternative route choices, decomposed in waypoints.





Then, let’s create the relationships between these nodes. We will have “COURSE_TO” relationships between controls, and “NAVIGATE_TO” relationships between waypoints. Effectively, these will become “paths” on my graph, hopping from node to node along the relationships.
    • From the start to control 1: I have 0->0.11->0.12->0.13->0.14->0.15 as one route and 0->0.21->0.22->1 as another route
    • From control 1 to control 2 I have 1->1.11->1.12->1.13->2 as one route option, and 1->0.15->1.21->2 as another option.
    • From control 2 to the finish I have 3 options: 2->2.11->3, 2->2.21->2.22->3 and 2->1.21->2.31->2.32->3.



As you can see, I have immediately added “distance” (in meters) and “runnability” (in %) properties to my relationships.


When I then generate a neo4j database using the spreadsheet method, I get a nice little database - ready to be queried and ready for my algorithms.




Graph algorithms to win the race!

In order to calculate the best route to win the race, I need to calculate the shortest path across the graph - which is standard functionality of neo4j. But because there’s more to it than running the course in straight lines between controls, I need to incorporate weights (distance, runnability) to get a realistic estimate of what would be the best route choice. To do so I am using a technique so well demonstrated by Ian Robinson on his blog last June.


Let’s go through two versions of this calculation:


  1. find the shortest path by distance only:


To do this, we will be using ‘reduce’ to sum the distance properties.


START  startNode=node:node_auto_index(name="Start"),
      endNode=node:node_auto_index(name="Finish")
MATCH  p=(startNode)-[:NAVIGATE_TO*]->(endNode)
RETURN p AS shortestPath,
      reduce(distance=0, r in relationships(p) : distance+r.distance) AS totalDistance
      ORDER BY totalDistance ASC
      LIMIT 1;



  2. find the best estimate of the fastest path, as a function of distance/runnability. In a real race this would probably be the route that I would choose - as it would give me the best chance of winning the race.


To do this, we will be using ‘reduce’ to sum the distance divided by the runnability: a longer distance with superior runnability is possibly faster than a shorter distance with lower runnability:


START  startNode=node:node_auto_index(name="Start"),
      endNode=node:node_auto_index(name="Finish")
MATCH  p=(startNode)-[:NAVIGATE_TO*]->(endNode)
RETURN p AS shortestPath,
      reduce(EstimatedTime=0, r in relationships(p) : EstimatedTime+(r.distance/r.runnability)) AS TotalEstimatedTime
      ORDER BY TotalEstimatedTime ASC
      LIMIT 1;
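Cypher’s reduce() is just a fold over the relationships of a path, so the arithmetic behind the two queries can be checked in plain Java. The leg numbers below are invented for illustration - they are not the actual distances from the map:

```java
// Sketch of the two reduce() folds above: total distance, and estimated
// time as the sum of distance / runnability per leg. Route data is made up.
public class RouteChoice {

    // each leg: { distance in meters, runnability as a fraction }
    static double totalDistance(double[][] legs) {
        double d = 0;
        for (double[] leg : legs) d += leg[0];
        return d;
    }

    static double estimatedTime(double[][] legs) {
        double t = 0;
        for (double[] leg : legs) t += leg[0] / leg[1];
        return t;
    }

    public static void main(String[] args) {
        double[][] roadRoute   = { {250, 0.9}, {150, 0.9} }; // longer, easy running
        double[][] forestRoute = { {180, 0.5}, {80, 0.6} };  // shorter, slow going

        // shortest by distance: the forest route wins...
        System.out.println(totalDistance(forestRoute) < totalDistance(roadRoute));
        // ...but by estimated time, the road route is the better choice
        System.out.println(estimatedTime(roadRoute) < estimatedTime(forestRoute));
    }
}
```

This is exactly why the second query can recommend a longer leg: dividing by runnability lets an easy detour along the road beat a short slog through the forest.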



As you can see, the first and second queries take the same (shortest) path from start to control 1 and from control 2 to finish, but recommend a clearly different path from control 1 to control 2 (following the forest road instead of cutting through the forest).




Many applications for weighted shortest paths!


Obviously orienteering is not a business application, but in logistics, planning, impact analysis and many other domains, weighted shortest path algorithms hold great potential. Whether it is finding out how things are related to each other, determining the most efficient way to get something from point A to point B, or finding out who would be affected by a particular type of capacity outage - the approach that I used for my orienteering problem would work just as nicely!