Cora

A straightforward ML example using Mathematica to learn connectivity in the Cora dataset.
You can learn about the Cora dataset here.

Author: Francois Vanderseypen, Orbifold Consulting (https://orbifold.net).
Last update: July 2023.

Data Setup

Set the base path for the data and download the two Cora files (cora.cites and cora.content) via the links provided in the article above:

index_1.gif

The edges are contained in the TSV-formatted “cora.cites” file:

index_2.png
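A minimal sketch of such an import, assuming the files were saved under a hypothetical `basePath` folder (the folder name is an assumption, not taken from the article). When the file is missing, the same parser runs on a made-up inline sample so the cell still evaluates:

```wolfram
(* Hypothetical base path; point it at the folder holding the downloaded files. *)
basePath = FileNameJoin[{$HomeDirectory, "data", "cora"}];
citesFile = FileNameJoin[{basePath, "cora.cites"}];

(* Each row of cora.cites holds two tab-separated ids.
   The fallback string is toy data, not taken from the real file. *)
edges = If[FileExistsQ[citesFile],
   Import[citesFile, "TSV"],
   ImportString["1\t2\n1\t3\n2\t3", "TSV"]];

Take[edges, UpTo[3]]
```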

A sample shows that the edges are just tuples:

index_3.png

index_4.png

The node content sits in the “cora.content” file and can be imported in a similar fashion:

index_5.png

The textual content is already encoded, and the last entry represents the label of the encoded article:

index_6.png

index_7.png
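To make the row layout concrete, here is a small sketch that splits rows into features and labels. It assumes each row starts with the node id, ends with the label and carries the encoded content in between; the rows and the names `features` and `labels` are made up for illustration:

```wolfram
(* Toy rows in the assumed layout {id, encoded content..., label}. *)
rows = {
   {1, 0, 1, 0, 1, "Neural_Networks"},
   {2, 1, 0, 0, 1, "Theory"},
   {3, 0, 0, 1, 0, "Neural_Networks"}};

(* Node id -> feature vector (everything between the id and the label). *)
features = AssociationThread[rows[[All, 1]] -> rows[[All, 2 ;; -2]]];

(* Node id -> label (the last entry of each row). *)
labels = AssociationThread[rows[[All, 1]] -> rows[[All, -1]]];

features[2]  (* {1, 0, 0, 1} *)
labels[2]    (* "Theory" *)
```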

The graph from the edges is simply:

index_8.png
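One way to build a graph from a list of tuples is sketched below on toy edges. Treating the edges as undirected is an assumption; swap in `DirectedEdge` if the direction of a citation matters:

```wolfram
(* Toy tuples standing in for the imported edge list. *)
toyEdges = {{1, 2}, {1, 3}, {2, 3}, {3, 4}};

(* Turn every {a, b} tuple into an undirected edge and build the graph. *)
g = Graph[UndirectedEdge @@@ toyEdges];

VertexCount[g]  (* 4 *)
EdgeCount[g]    (* 4 *)
```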

Mathematica has no issue rendering the graph:

index_9.png

index_10.gif

Learning

This gets the content of a single node:

index_11.png

To reduce dimensionality we’ll fetch a reducer:

index_12.gif
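If no reducer is at hand, one can be trained with the built-in `DimensionReduction`. The sketch below does this on random toy vectors; the data, the target dimension 3 and the default method are all assumptions rather than the settings used in this notebook:

```wolfram
SeedRandom[1];

(* 30 made-up nodes with 12 binary flags each. *)
toyFeatures = RandomInteger[1, {30, 12}];

(* Learn a map from 12 dimensions down to 3. *)
reducer = DimensionReduction[toyFeatures, 3];

(* Applying the reducer to one feature vector gives a length-3 numeric vector. *)
reducer[First[toyFeatures]]
```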

A reducer in Mathematica is really an embedding, so this reducer is actually a node embedding. It’s not as good as a real node2vec embedding because the topology is not taken into account. To reproduce node2vec you would have to implement biased random walks that linearize the local neighborhood of each node, and then use the reducer to encode the payload together with the topology of the nodes. This would produce a much better model than the one created here.
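For the curious, a biased walk in the spirit of node2vec could look roughly like the sketch below. It is not part of this notebook's model. The names `biasedStep` and `biasedWalk` and the parameters `p` (return) and `q` (in-out) are illustrative, and the code assumes an undirected graph in which the start node has at least one neighbor:

```wolfram
SeedRandom[11];
g = Graph[UndirectedEdge @@@ {{1, 2}, {1, 3}, {2, 3}, {3, 4}, {4, 5}}];

(* One biased step from cur, having arrived from prev.
   Weights: 1/p to go back to prev, 1 for a neighbor shared with prev,
   1/q for a neighbor that moves further away from prev. *)
biasedStep[prev_, cur_, p_, q_] := Module[{nbrs, prevNbrs, weights},
  nbrs = AdjacencyList[g, cur];
  prevNbrs = AdjacencyList[g, prev];
  weights = Map[
    Function[x, Which[x === prev, 1/p, MemberQ[prevNbrs, x], 1, True, 1/q]],
    nbrs];
  RandomChoice[weights -> nbrs]]

(* A walk with the given number of vertices (at least 2). *)
biasedWalk[start_, length_, p_, q_] := Module[{first, pairs},
  first = RandomChoice[AdjacencyList[g, start]];
  pairs = NestList[
    Function[pair, {pair[[2]], biasedStep[pair[[1]], pair[[2]], p, q]}],
    {start, first}, length - 2];
  Join[{start}, pairs[[All, 2]]]]

biasedWalk[1, 6, 1, 2]
```

Each walk is a sequence of node ids, which is the "linearized neighborhood" that a sequence-based embedding would then consume.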

For a given tuple, this returns one record combining the data of the two corresponding nodes:

index_13.png
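A minimal sketch of such a combiner, assuming the record is the concatenation of the two reduced vectors (other combinations, such as a difference or a product, are equally possible). The `embedding` values are made up:

```wolfram
(* Made-up reduced vectors, one per node id. *)
embedding = <|1 -> {0.1, 0.9}, 2 -> {0.4, 0.2}, 3 -> {0.7, 0.5}|>;

(* Concatenate the vectors of both endpoints into one record. *)
pairRecord[{a_, b_}] := Join[embedding[a], embedding[b]];

pairRecord[{1, 3}]  (* {0.1, 0.9, 0.7, 0.5} *)
```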

This tells you whether the tuple represents an edge in the graph:

index_14.png
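Such a test can be written with the built-in `EdgeQ`. The sketch below uses a toy undirected graph, and `isEdge` is an illustrative name:

```wolfram
g = Graph[UndirectedEdge @@@ {{1, 2}, {1, 3}, {2, 3}, {3, 4}}];

(* True when the tuple {a, b} is an edge of g. *)
isEdge[{a_, b_}] := EdgeQ[g, UndirectedEdge[a, b]];

isEdge[{1, 2}]  (* True *)
isEdge[{1, 4}]  (* False *)
```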

This returns the specified number of tuples, each tuple being an edge in the graph:

index_15.png

The same for pairs of nodes that are not connected by an edge:

index_16.png
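Both samplers can be sketched as follows on a toy graph. The names are illustrative, and the rejection loop for negatives is just one simple approach; it assumes the graph has at least `n` unconnected pairs, otherwise it never terminates:

```wolfram
SeedRandom[42];
g = Graph[UndirectedEdge @@@ {{1, 2}, {1, 3}, {2, 3}, {3, 4}, {4, 5}}];

(* True when a and b are directly connected. *)
connectedQ[a_, b_] := MemberQ[AdjacencyList[g, a], b];

(* n random edges, each returned as an {a, b} tuple. *)
sampleEdges[n_] := List @@@ RandomSample[EdgeList[g], n];

(* n distinct random pairs of vertices that are not connected. *)
sampleNonEdges[n_] := Module[{found = {}, pair},
  While[Length[found] < n,
   pair = Sort[RandomSample[VertexList[g], 2]];
   If[! connectedQ[pair[[1]], pair[[2]]] && ! MemberQ[found, pair],
    AppendTo[found, pair]]];
  found]

sampleEdges[2]
sampleNonEdges[2]
```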

Finally, this samples data suitable for machine learning. The binary label is 1 if the combined data belongs to an edge and 0 otherwise:

index_17.png
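Put together, a labelled sampler might look like the sketch below. Everything here is a toy: the graph, the random 2-dimensional `embedding` and the helper names are assumptions, and enumerating all unconnected pairs with `Subsets` only makes sense for a small graph:

```wolfram
SeedRandom[7];
g = Graph[UndirectedEdge @@@ {{1, 2}, {1, 3}, {2, 3}, {3, 4}, {4, 5}}];

(* Made-up 2-dimensional embeddings, one per vertex. *)
embedding = AssociationThread[VertexList[g] -> RandomReal[1, {VertexCount[g], 2}]];

connectedQ[a_, b_] := MemberQ[AdjacencyList[g, a], b];
record[{a_, b_}] := Join[embedding[a], embedding[b]];

(* All connected and all unconnected pairs of the toy graph. *)
positives = List @@@ EdgeList[g];
negatives = Select[Subsets[VertexList[g], {2}],
   Function[pair, ! connectedQ[pair[[1]], pair[[2]]]]];

(* n positives labelled 1 and n negatives labelled 0, shuffled together. *)
sampleData[n_] := RandomSample[Join[
    Map[Function[pair, record[pair] -> 1], RandomSample[positives, n]],
    Map[Function[pair, record[pair] -> 0], RandomSample[negatives, n]]]];

sampleData[3]
```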

Sample the training, test and validation data. Note that the sampling ensures there are as many positives as negatives.

index_18.gif
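One way to get balanced sets is to cut equally sized slices from the positives and from the negatives. The sketch below does this on made-up records; `balancedSplit` and the sizes are illustrative, and keeping the three sets disjoint is a choice made here, not something the notebook states:

```wolfram
SeedRandom[3];

(* Made-up labelled records standing in for the sampler's output. *)
positives = Table[RandomReal[1, 4] -> 1, 20];
negatives = Table[RandomReal[1, 4] -> 0, 20];

(* For sizes {12, 4, 4} the slices are 1..12, 13..16 and 17..20 of each class,
   so every resulting set holds as many positives as negatives. *)
balancedSplit[pos_, neg_, sizes_] := Module[{bounds},
  bounds = Partition[Accumulate[Prepend[sizes, 0]], 2, 1];
  (RandomSample[Join[pos[[#1 + 1 ;; #2]], neg[[#1 + 1 ;; #2]]]] &) @@@ bounds]

{train, test, validation} = balancedSplit[positives, negatives, {12, 4, 4}];

(* Label counts in the training set: 12 of each. *)
Counts[Last /@ train]
```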

index_19.png

index_20.png

index_21.png
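With labelled records of this shape, a natural next step is the built-in `Classify`. The sketch below is only an illustration on made-up, well-separated toy records, so its accuracy says nothing about Cora; the default classifier settings and the helper `makeRecords` are assumptions:

```wolfram
SeedRandom[5];

(* Toy records: label 1 clusters around 1, label 0 around 0. *)
makeRecords[center_, label_, n_] :=
  Table[RandomVariate[NormalDistribution[center, 0.1], 4] -> label, n];

train = RandomSample[Join[makeRecords[1, 1, 40], makeRecords[0, 0, 40]]];
test = RandomSample[Join[makeRecords[1, 1, 10], makeRecords[0, 0, 10]]];

(* Train a binary classifier and measure it on the held-out records. *)
classifier = Classify[train];
ClassifierMeasurements[classifier, test, "Accuracy"]
```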

Created with the Wolfram Language