Gustav Ehrenborg fullstack developer

Email symbol LinkedIn logo GitHub logo Instagram logo

Neo4j and Cypher Learning the basics of graph databases

Tuesday, March 5, 2019
A graph

I guess you know what relational databases like MySQL and PostgreSQL are. Did you know they are pretty bad at handling relations? Yes, one-to-one and one-to-many relations are no problems, but many-to-many requires a junction table. SQL queries that query multiple relations can be a hassle to write and can fast become very complex too. Let's look at an example:

A simple people table with a one-to-one pointing out the mother of the person.

CREATE TABLE people (
    id SERIAL PRIMARY KEY,
    mother_id integer,
    name text
);

Selecting a person

SELECT people.name 
FROM   people 
WHERE  people.name = 'Gustav' 

Selecting the mother of a person

SELECT m.name
FROM   people m
       INNER JOIN people n
               ON n.mother_id = m.id
WHERE  n.name = 'Gustav'

Selecting the grandmother of a person

SELECT m.name
FROM   people m
       INNER JOIN people n
               ON n.mother_id = m.id
       INNER JOIN people o
               ON o.mother_id = n.id
WHERE  o.name = 'Gustav'

That grew rapidly! In many cases you would have an ORM to handle this, but all those joins uses performance and one day you might need to debug queries like these. So much for relational database.

Graph databases

Let's look on how to handle this with a graph database. A graph database doesn't have columns and rows, it has nodes and relationships. One thing I really like about graph databases is that they are "white board friendly", that means that the same way you would illustrate something on a white board, the same way it looks in the database. Have a look at this person with a mother with a mother, could have been drawn on a white board.

Some relationships

Neo4j and Cypher

The probably most common graph database is Neo4j. It uses a query language called Cypher. Neo4j has a very good online sandbox that all the following code snippets can be run in. Create a database here at if you want to follow along.

To create the structure above, enter the following code.

CREATE
(gustav: Person {name: "Gustav"}) - [:MOTHER] -> (mother: Person {name: "His mother"}),
(mother) - [:MOTHER] -> (mothermother: Person {name: "Her mother"})

The CREATE statement takes one or more statements on the following form:

(node) -- (node)

The nodes can be defined directly and, if given a name, the name can be used to reference them

(<node name>: Node type {attribute key: attribute value, ...})

The relationships can be without direction or type

--

or have a direction

-->

or have a direction and type

- [:TYPE_OF_RELATIONSHIP] ->

In the case above, the relationships are called MOTHER, because the node, to which the relationship points, is the mother.

Querying

Let's fetch the same data as we did with the relational database.

Selecting a person

MATCH (n:Person)
WHERE n.name = "Gustav"
RETURN *

"MATCH (n:Person)" matches any node of type Person into the n variable. "WHERE" then filters on the n variable, much like in SQL. A return statement is also needed, like SELECT in SQL.

Selecting the mother of a person

MATCH (n:Person) - [:MOTHER] -> (m)
WHERE n.name = "Gustav"
RETURN m

This time we match any Person node into n, which has an outgoing mother relationship to a node, which we call m, and then return the nodes m, filtered with the where clause.

Selecting the grandmother of a person

MATCH (n:Person) - [:MOTHER] -> (m) - [:MOTHER] -> (o)
WHERE n.name = 'Gustav'
RETURN o

This is just like the mother case, but we make a second jump and store the grandmother in variable o.

There is a shorthand version to this, which come in handy if we want to fetch someones great great grandmother or further down the line:

MATCH (n:Person) - [:MOTHER*2] -> (m) WHERE n.name = 'Gustav'
RETURN m 

More

Neo4j is schemaless, you can store pretty much anything on a node, like you do with a document database. It caches a list of all incoming and outgoing relationship which make querying fast. Inserting new nodes and relationships is slow though, since the relationships are added to both the nodes. Some operations like counting sums, averages and so on is also slow. Where a relational database easily can take the average of a column, the graph database cannot to this fast on an attribute.

Just because my small genealogical example works very well on graph databases and not on relational databases doesn't mean relational databases are bad. It is about selecting the correct database for the task. Or working out really clever ways of modelling the data if you want to use graph databases for everything.