URL = "bolt://localhost:7687"
auth = ("neo4j", "123456789")Jaccard similarity using GDS
In a related note on entity resolution we highlighted how Jaccard similarity can be used as a simple link prediction or similarity measure. In this note we show how within Neo4j you can quickly compute Jaccard via GDS.
You can use the GDS procedures inside the Neo4j browser but here we’ll go via the Python client.
Note that this client is a wrapper around the GDS procedures, you need to have it installed on the database prior to using the client.
If all is well, the following should print out the GDS version:
from graphdatascience import GraphDataScience
gds = GraphDataScience(URL, auth=auth)
print(gds.version())2.9.0
The following is some dummy data:
gds.run_cypher("""
CREATE
(alice:Person {name: 'Alice'}),
(bob:Person {name: 'Bob'}),
(carol:Person {name: 'Carol'}),
(dave:Person {name: 'Dave'}),
(eve:Person {name: 'Eve'}),
(guitar:Instrument {name: 'Guitar'}),
(synth:Instrument {name: 'Synthesizer'}),
(bongos:Instrument {name: 'Bongos'}),
(trumpet:Instrument {name: 'Trumpet'}),
(alice)-[:LIKES]->(guitar),
(alice)-[:LIKES]->(synth),
(alice)-[:LIKES {strength: 0.5}]->(bongos),
(bob)-[:LIKES]->(guitar),
(bob)-[:LIKES]->(synth),
(carol)-[:LIKES]->(bongos),
(dave)-[:LIKES]->(guitar),
(dave)-[:LIKES {strength: 1.5}]->(trumpet),
(dave)-[:LIKES]->(bongos);
""")which looks like the following 
This simple data is representative for what Jaccard tries to achieve: the more two nodes have neighbors in common, the more they are similar.
To proceed with GDS one has to project the data into a virtual/memory graph:
G, result = gds.graph.cypher.project(
"""
MATCH (source:Person)
OPTIONAL MATCH (source)-[r:LIKES]->(target:Instrument)
RETURN gds.graph.project(
'myGraph',
source,
target,
{ relationshipProperties: r { strength: coalesce(r.strength, 1.0) } }
)
""", # Cypher query
database="neo4j", # Target database
graph_name="myGraph", # Query parameter
)
assert G.node_count() == result["nodeCount"]GDS has four execution channels and mutate alters the projected data without changing the database:
gds.nodeSimilarity.mutate(G, mutateRelationshipType="SIMILAR", mutateProperty="score", similarityCutoff=0)preProcessingMillis 0
computeMillis 3
mutateMillis 5
postProcessingMillis 0
nodesCompared 4
relationshipsWritten 12
similarityDistribution {'min': 0.0, 'p5': 0.0, 'max': 0.6666679382324...
configuration {'mutateProperty': 'score', 'jobId': '9793e7e1...
Name: 0, dtype: object
Note that you don’t need to mutate before the write (see below). This method is useful if you wish to accumulate changes in memory, say, as part of a pipeline. There is no direct way to use Cypher on projected graphs (it used to be but got deprecated), you can use however things like the gds.graph.nodeProperty.stream utility to inspect changes.
The stream method returns the algorithm without changing the projected graph or the underlying one:
gds.nodeSimilarity.stream(G, topK=1 )| node1 | node2 | similarity | |
|---|---|---|---|
| 0 | 0 | 1 | 0.571429 |
| 1 | 1 | 0 | 0.571429 |
| 2 | 2 | 0 | 0.428571 |
| 3 | 3 | 0 | 0.500000 |
Finally, you can write the similarity to the real graph like so:
gds.nodeSimilarity.write(G, writeRelationshipType= 'SIMILAR', writeProperty= 'score', topK=1 )preProcessingMillis 0
computeMillis 1
writeMillis 12
postProcessingMillis 0
nodesCompared 4
relationshipsWritten 4
similarityDistribution {'min': 0.4285697937011719, 'p5': 0.4285697937...
configuration {'writeProperty': 'score', 'writeRelationshipT...
Name: 0, dtype: object
The results are not surprising, just by looking at the image above one could compute the Jaccard coefficient. The topK parameter picks up the most similar sibling for each node.
If you want to use a threshold or refine the similarity, the nodeSimilarity has a sub-method where you can supply things like a minimum degree or a similariy threshold:
gds.nodeSimilarity.filtered.stream(G,similarityCutoff=0.5 )| node1 | node2 | similarity | |
|---|---|---|---|
| 0 | 0 | 1 | 0.571429 |
| 1 | 0 | 3 | 0.500000 |
| 2 | 1 | 0 | 0.571429 |
| 3 | 3 | 0 | 0.500000 |