Using Graph Databases for Investigative Journalism workshop with Leila Haddou


Expediting and visualising data analysis using Neo4j.

Building a node (your data point) and your database:

Note: Cypher keywords (CREATE, MATCH, RETURN and so on) are not case sensitive, but labels, relationship types and property names are.

CREATE (n:Node {id: "ID", attribute: "value"})

() – creates the bubble, aka the node
CREATE – blindly creates a new node for every line of the CSV, without checking whether a matching node already exists.
MERGE – checks on every line whether a matching node already exists and only creates one if it does not, so multiple entries are treated as the same thing.
n – the variable used to refer to the node in the rest of the query
Node – the label, i.e. the node's type (case sensitive)
{} – the attributes of the node; there can be several, separated with commas
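
The difference between CREATE and MERGE is easiest to see on a throwaway example. This is a minimal sketch, assuming an empty database; the names "Example Donor" and "Example Party" are made up for illustration and are not in the workshop data. Run each statement separately.

```cypher
// CREATE never checks for an existing node, so running it twice gives two nodes.
CREATE (:Donor {name: "Example Donor"});
CREATE (:Donor {name: "Example Donor"});

// MERGE only creates the node if no matching one exists, so running it twice gives one node.
MERGE (:Recipient {name: "Example Party"});
MERGE (:Recipient {name: "Example Party"});

// Count what ended up in the database.
MATCH (d:Donor {name: "Example Donor"}) RETURN count(d);     // 2
MATCH (r:Recipient {name: "Example Party"}) RETURN count(r); // 1
```

This is why the import example uses MERGE for donors and recipients, which appear on many lines, and CREATE for each individual donation.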

Note: If a CSV header contains spaces, you need to wrap it in backticks (`), e.g. line.`description 1`

Example:
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/leilahaddou/graph-data/master/pef.csv" AS line
MERGE (d:Donor {name:line.DonorName, status:line.DonorStatus})
MERGE (r:Recipient {name:line.EntityName})
CREATE (d)-[dt:DONATED_TO {amount:line.Value, ref:line.ECRef, date:line.RecivedDate}]->(r)
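
Connecting nodes
CREATE (l)-[r:knows]->(t)
r – the variable used to refer to the relationship
knows – the relationship type
l and t must already refer to nodes found earlier in the same query (e.g. by MATCH or MERGE); otherwise CREATE makes two new empty nodes.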

Bringing in CSV data
LOAD CSV WITH HEADERS FROM "http://www.nifer.co.uk/wp-content/uploads/2015/10/CAA500.csv" AS line
Viewing your
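Strings vs integers

Values loaded from a CSV are strings, so "47" + "23" would output "4723".

amount:toInt(line.Value)

This stores the value as an integer rather than a string. (In newer Neo4j versions toInt is called toInteger.)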
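
To see the difference directly, the two kinds of addition can be compared in a single query. This is a small sketch; it uses toInteger, the name that replaced toInt in newer Neo4j releases, so on an older install swap in toInt.

```cypher
// The same two values added as strings and as integers.
RETURN "47" + "23" AS as_strings,
       toInteger("47") + toInteger("23") AS as_integers
// as_strings is "4723", as_integers is 70
```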

Querying
MATCH (x:Donor) RETURN x LIMIT 10
MATCH (d:Donor)-[dt:DONATED_TO]->(r:Recipient) RETURN * LIMIT 10

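This example matches the donors whose status is "Company":

MATCH (d:Donor) WHERE d.status = "Company"
RETURN d
LIMIT 20

The node variable (in this case ‘d’) goes in brackets () inside a CREATE or MATCH pattern; in WHERE and RETURN you just refer to it by name. (Older Neo4j 2.x releases also accepted a bare MATCH d, but current versions require the brackets.)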

Fuzzy matching – a regular-expression query that finds any string which includes “a bit of” the text you give it. Replace NAME with the name, or part of the name, you are looking for.

MATCH (r) WHERE r.name =~ ".*NAME.*"
RETURN r
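
Regular-expression matches are case sensitive by default. This is a hedged sketch of a case-insensitive variant, using the (?i) flag from Java regular expressions, which Cypher's =~ operator is based on. The search term labour is only an example.

```cypher
// (?i) makes the pattern ignore case, so "Labour", "LABOUR" and "labour" all match.
MATCH (r:Recipient)
WHERE r.name =~ "(?i).*labour.*"
RETURN r.name
LIMIT 10
```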

Who has given to both the Conservative and Labour parties:

MATCH (d)-[dt1:DONATED_TO]->(l), (d)-[dt2:DONATED_TO]->(c)
WHERE l.name=~".*Labour.*"
AND c.name=~".*Conservative.*"
RETURN d

Find a donor who has donated to recipient 1 and recipient 2 where recipient 1 is not recipient 2 and they are a company:

MATCH (d:Donor)-[dt1:DONATED_TO]->(r1), (d)-[dt2:DONATED_TO]->(r2)
WHERE r1 <> r2
AND d.status = "Company"
RETURN *
LIMIT 20

Bristol Cable and Centre for Investigative Journalism (Goldsmiths)

  • CAA plane data: CAA500
    Register of private planes registered by the UK Civil Aviation Authority (first 500 only). Source: CAA
  • To delete all nodes and relationships:
    MATCH (n)
    OPTIONAL MATCH (n)-[r]-()
    DELETE n,r;
  • Political donations: pef
    Cash donations made to politicians and political parties between January 2010 and June 2015. Source: PEF online
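
The delete query in the list above can also be written more compactly. This is a sketch, assuming Neo4j 2.3 or later, where DETACH DELETE removes a node together with any relationships attached to it:

```cypher
// Remove every node and, with it, every relationship.
MATCH (n)
DETACH DELETE n
```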

Part two

  • Cypher refcard
    Creating Nodes and Relationships
    Building nodes and relationships always begins with the CREATE command.
    To create a node, you need to use the following syntax: CREATE (x:Label {property: "value"})
    To create a relationship: CREATE (x)-[r:CONNECTED_TO]->(y)
    Delete the entire database from the terminal (the path below assumes a Homebrew install on macOS):
    neo4j stop && rm -rf /usr/local/Cellar/neo4j/*/libexec/data/* && neo4j start

MISC:

Example: the OpenCorporates API allows you to retrieve Companies House data (retrieved using a scraper), so you can match data based on company number, for example. Cross matching?

Cleaning data – when finding patterns, things like blanks in the data or several entries with the same name spelled different ways will misrepresent your results. So the methodology for your data visualisation might need to state explicitly that you are grouping Shell plc and Shell Ltd into one.

Gephi – alternative software for visualising and analysing graphs (a desktop tool rather than a database).

OpenRefine – data cleaning automation

USING PERIODIC COMMIT 100 – periodic commit (save every 100 lines) when loading a large CSV
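
As a rough sketch of how the periodic commit note fits with the donations import from earlier: the USING PERIODIC COMMIT prefix goes in front of LOAD CSV and commits after the given number of lines. This assumes a Neo4j version from around the time of the workshop; the prefix was later removed, and Neo4j 5 uses CALL { ... } IN TRANSACTIONS instead.

```cypher
USING PERIODIC COMMIT 100 // commit after every 100 lines of the CSV
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/leilahaddou/graph-data/master/pef.csv" AS line
MERGE (d:Donor {name: line.DonorName, status: line.DonorStatus})
MERGE (r:Recipient {name: line.EntityName})
CREATE (d)-[:DONATED_TO {amount: toInt(line.Value), ref: line.ECRef, date: line.RecivedDate}]->(r)
```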

Neo4j is a NoSQL database, therefore Structured Query Language (SQL) cannot be used to query it; Cypher is used instead.

linkurio.us – visualising graph data – great inspiration blog including explanations of how they do what they do

JournoCoders – London-based group run by Leila Haddou, learning tools together

Global Investigative Journalism Network – 100 best databases tip sheet

Web scraper tools: Kimono, import.io

David Donald weeklong bootcamp

Paul Mayer’s digital investigation tools and tips