<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Orbifold Consulting</title>
<link>https://discovery.graphsandnetworks.com/</link>
<atom:link href="https://discovery.graphsandnetworks.com/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<generator>quarto-1.9.35</generator>
<lastBuildDate>Sun, 29 Mar 2026 22:00:00 GMT</lastBuildDate>
<item>
  <title>AI &amp; The Future of Expert Work</title>
  <link>https://discovery.graphsandnetworks.com/reflections/ai.html</link>
  <description><![CDATA[ 





<section id="what-is-actually-changing" class="level2">
<h2 class="anchored" data-anchor-id="what-is-actually-changing">What Is Actually Changing</h2>
<p>AI is genuinely, measurably displacing routine cognitive work. This is not hype. Tasks that require assembling known information, generating standard artifacts, or pattern-matching against large corpora are being compressed — in time and cost — by an order of magnitude. Pretending otherwise would be both intellectually dishonest and strategically dangerous.</p>
<p>I work with junior devs and observe daily how programming is becoming a matter of formulating things (aka vibe coding) and dispatching tasks, rather than a learned skill or API expertise. UI development in particular is dramatically affected, and that includes graph and data visualization. The reason customers used to contact me is quickly fading away, replaced by a chat with Claude. I hear of innovators and entrepreneurs spending sometimes thousands a month on AI rather than on consulting.</p>
<p>For myself, I experience AI with awe and despair in equal measure. The excitement of seeing AI do things in a few minutes is also the reason a customer no longer needs my (UI) skills.</p>
<p>The Ikea effect is immense: coding and advice are becoming commodities, and there is an uneasy sentiment in the consulting market.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="Market shift">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Market shift
</div>
</div>
<div class="callout-body-container callout-body">
<p>The market for assembling known facts is contracting. The market for structuring unknown problems is expanding.</p>
</div>
</div>
<p>The critical distinction is between <strong>execution work</strong> — doing things we already know how to do — and <strong>sense-making work</strong> — figuring out what needs to be done, why, and whether the result is trustworthy. AI is collapsing the cost of the former while raising the premium on the latter. There is a polarization.</p>
<hr>
</section>
<section id="what-ai-systematically-cannot-replace" class="level2">
<h2 class="anchored" data-anchor-id="what-ai-systematically-cannot-replace">What AI Systematically Cannot Replace</h2>
<p>AI systems are powerful interpolators. They synthesize within their training distribution extremely well. But several categories of expert value remain structurally outside that reach:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th>Capability</th>
<th>Why It Endures</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>✤ <strong>Ill-posed Problem Framing</strong></td>
<td>Clients rarely arrive with clean problem definitions. Recognizing what the real problem is — beneath the stated one — is a deeply human, contextual skill.</td>
</tr>
<tr class="even">
<td>✤ <strong>Judgment Under Ambiguity</strong></td>
<td>When data is sparse, stakes are high, and the answer is “it depends” — clients need a trusted human to take a stance and own it.</td>
</tr>
<tr class="odd">
<td>✤ <strong>Cross-Domain Synthesis</strong></td>
<td>Connecting physics, software architecture, and business strategy in a single coherent insight is where deep specialist generalists thrive.</td>
</tr>
<tr class="even">
<td>✤ <strong>Relational Trust</strong></td>
<td>Complex decisions are rarely signed off on without a person behind the recommendation. Authority and accountability remain human properties.</td>
</tr>
<tr class="odd">
<td>✤ <strong>Domain Taste &amp; Standards</strong></td>
<td>Knowing what “good” looks like in graph modeling or knowledge architecture requires years of calibrated exposure — not retrieval.</td>
</tr>
<tr class="even">
<td>✤ <strong>Verification &amp; Validation</strong></td>
<td>As AI-generated outputs proliferate, someone must audit them. The demand for expert review is growing, not shrinking.</td>
</tr>
</tbody>
</table>
</section>
<section id="ai-as-a-force-multiplier-for-specialists" class="level2">
<h2 class="anchored" data-anchor-id="ai-as-a-force-multiplier-for-specialists">AI as a Force Multiplier for Specialists</h2>
<p>The more productive frame is not “AI vs.&nbsp;experts” but <strong>AI-augmented experts vs.&nbsp;unaugmented experts.</strong> The productivity differential between those who can direct AI precisely and those who cannot is widening rapidly.</p>
<div class="callout callout-style-default callout-warning callout-titled" title="What to worry about">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>What to worry about
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>The risk is not being replaced by AI.</strong> The risk is being replaced by another expert who uses AI better than you do.</p>
</div>
</div>
<p>Four concrete moves follow from this framing:</p>
<ol type="1">
<li><p><strong>Compress routine deliverables.</strong> Reports, summaries, code scaffolding, documentation — tasks that once took days now take hours. This frees capacity for higher-value engagement.</p></li>
<li><p><strong>Scale proprietary knowledge.</strong> Domain expertise becomes a durable moat only when embedded in systems, workflows, and products — not just in a person’s head. AI makes this extraction tractable.</p></li>
<li><p><strong>Shift from time-billing to outcome-pricing.</strong> When execution time falls, the old hourly model erodes. The new model prices the value of the outcome — insight, decision quality, risk reduction — not the hours spent.</p></li>
<li><p><strong>Build AI-native products around expertise.</strong> The highest-leverage play: encode specialist knowledge into a product or platform that scales independently of billable time. This is the transition from practitioner to product company.</p></li>
</ol>
<hr>
</section>
<section id="where-graph-expertise-is-uniquely-defensible" class="level2">
<h2 class="anchored" data-anchor-id="where-graph-expertise-is-uniquely-defensible">Where Graph Expertise Is Uniquely Defensible</h2>
<p>The graph domain sits at a particularly interesting intersection. As AI systems grow more capable, their own internal reasoning is increasingly structured as graphs — knowledge graphs, semantic networks, entity stores. The infrastructure to build, validate, visualize, and query these structures requires expertise that is sparse and deeply technical.</p>
<table class="table-striped table-hover caption-top table">
<caption>Capability exposure and premium assessment</caption>
<thead>
<tr class="header">
<th>Capability Area</th>
<th style="text-align: center;">AI Exposure</th>
<th style="text-align: center;">Expert Premium</th>
<th style="text-align: center;">Trajectory</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Graph schema / ontology design</td>
<td style="text-align: center;">Medium</td>
<td style="text-align: center;"><strong>High</strong></td>
<td style="text-align: center;">↑ Growing</td>
</tr>
<tr class="even">
<td>Graph architecture</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;"><strong>Very High</strong></td>
<td style="text-align: center;">↑ Growing</td>
</tr>
<tr class="odd">
<td>Document → knowledge graph pipelines</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;"><strong>High</strong></td>
<td style="text-align: center;">↑ Growing</td>
</tr>
<tr class="even">
<td>Graph ML / analytics</td>
<td style="text-align: center;">Medium</td>
<td style="text-align: center;"><strong>High</strong></td>
<td style="text-align: center;">→ Stable</td>
</tr>
<tr class="odd">
<td>Generic report writing</td>
<td style="text-align: center;">Very High</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">↓ Declining</td>
</tr>
<tr class="even">
<td>Standard code generation</td>
<td style="text-align: center;">Very High</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">↓ Declining</td>
</tr>
</tbody>
</table>
<p>As a person or as a company, if you invest considerable amounts of time and money in something, or if the strategic position of your product needs consideration, you will want to talk to a human being. If you have a terminal disease, you want to talk to a person and not AI. Trust needs eye contact and a handshake. People are complex beings and the things that matter are always a combination of reason, emotions and intuition. Many challenges don’t have a clear formulation or answer, and dealing with undefined aims (and the things between the lines) is what makes consulting and human contact as important as ever.</p>
<p>Personally, I think business is about people. Life is about people. Having a coffee or a beer with AI is not around the corner, and the COVID years have proven we are more than information processors.</p>
<hr>
</section>
<section id="three-moves-still-that-matter" class="level2">
<h2 class="anchored" data-anchor-id="three-moves-still-that-matter">Three Moves (Still) That Matter</h2>
<p><strong>Productize</strong></p>
<p>Convert expertise into repeatable, scalable products. Every engagement that requires the same deep knowledge is a product waiting to be built. The goal is revenue that does not require proportional time investment.</p>
<p><strong>Specialize Deeper</strong></p>
<p>Generalism is more exposed. The more specific and verifiable your domain mastery, the less substitutable you are. Breadth is available via AI; depth is not. Go narrow and go deep.</p>
<p><strong>Adopt Aggressively</strong></p>
<p>Use AI across every workflow. The goal is to work at a pace and scope that was previously impossible for a one-person operation. The one-person firm that builds AI-augmented products is more competitive today than a five-person firm that doesn’t.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="Thought">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Thought
</div>
</div>
<div class="callout-body-container callout-body">
<p>The small consulting firm that builds AI-augmented products is more competitive today than a five-person firm that doesn’t. This is a structural inversion worth taking seriously.</p>
</div>
</div>
<hr>
</section>
<section id="a-considered-positive-stance" class="level2">
<h2 class="anchored" data-anchor-id="a-considered-positive-stance">A Considered, Positive Stance</h2>
<p>The framing of “AI replacing skills” is both true and misleading. True, in that a specific set of execution-oriented tasks is now cheaper and faster to produce. Misleading, in that it obscures what is growing in value: judgment, synthesis, accountability, and the ability to direct AI toward problems that actually matter. Talking matters: trust comes with communication and time.</p>
<p>For deep specialists, particularly those who can reason across technical and business domains, this is one of the more favorable environments in a generation. The barriers to building products have dropped. The demand for structured, trustworthy knowledge systems is rising. The combination of deep expertise and AI leverage is genuinely rare. The threshold to build amazing things has lowered. If you are creative and/or innovative these are golden times to go where no one has gone before.</p>
<p>The professional attitude for this moment is neither dismissal nor alarm. It is <strong>clear-eyed adaptation</strong>: understanding precisely which parts of your work are being commoditized, doubling down on what is not, and using the tools available to operate at a scale that was previously inaccessible.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>Paths are made by walking.</p>
</div>
</div>


</section>

 ]]></description>
  <category>Business</category>
  <category>Opinion</category>
  <guid>https://discovery.graphsandnetworks.com/reflections/ai.html</guid>
  <pubDate>Sun, 29 Mar 2026 22:00:00 GMT</pubDate>
</item>
<item>
  <title>Twenty RAG techniques</title>
  <link>https://discovery.graphsandnetworks.com/graphAI/rags.html</link>
  <description><![CDATA[ 





<p>⚠️ Work in progress</p>
<p>Retrieval-Augmented Generation (RAG) blends retrieval systems with LLMs in order to extend language models beyond their training knowledge. By leveraging a knowledge base or external dataset, RAG enhances the relevance and accuracy of generative AI outputs, making it invaluable for applications such as customer support, content creation, and research assistance.</p>
<p>But like any technology, the effectiveness of RAG depends on how it’s implemented. Small tweaks and strategic optimizations can transform a functional RAG system into an exceptional one. This article explores twenty practical and advanced techniques to refine a RAG pipeline.</p>
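<p>To make the retrieve-then-generate idea concrete, here is a minimal sketch that uses only the Python standard library. It assumes a toy bag-of-words similarity in place of a real embedding model, and it stops at building the prompt, since the call to an LLM depends on your provider. The names <code>retrieve</code>, <code>build_prompt</code> and <code>CHUNKS</code> are illustrative and do not come from any library.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks that are most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Assemble the prompt that would be sent to the LLM (the call itself is omitted)."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, already split into chunks.
CHUNKS = [
    "GLiNER is a transformer model designed for named entity recognition.",
    "BM25 is a lexical ranking function used in search engines.",
    "Reranking reorders retrieved chunks with a stronger model.",
]

if __name__ == "__main__":
    print(build_prompt("What is BM25?", CHUNKS, k=1))
```

<p>Most of the techniques below can be read as a refinement of one of these two functions: better chunks and better ranking on the retrieval side, better context on the prompt side.</p>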
<section id="concepts-and-dimensions" class="level2">
<h2 class="anchored" data-anchor-id="concepts-and-dimensions">Concepts and dimensions</h2>
<p>The ingredients of a RAG solution are fairly simple and if you take a step back you can observe the following elements:</p>
<ul>
<li>documents</li>
<li>chunks</li>
<li>metadata (doc, chunk, node…)</li>
<li>entities (links, graphs…)</li>
</ul>
<p>Each of these elements can be tuned or improved via diverse techniques. Together they form a structural dimension along which you can experiment.</p>
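<p>One possible way to picture the structural elements is as a small set of nested records. The sketch below is an assumption about how they could be modeled, not a schema taken from any framework; the class names, the field names and the fixed-size splitter <code>chunk_document</code> are all hypothetical.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """An entity mentioned in a chunk, e.g. a person or an organization."""
    name: str
    label: str


@dataclass
class Chunk:
    """A slice of a document with its own metadata and extracted entities."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    entities: List[Entity] = field(default_factory=list)


@dataclass
class Document:
    """A source document, its metadata and the chunks derived from it."""
    doc_id: str
    metadata: Dict[str, str] = field(default_factory=dict)
    chunks: List[Chunk] = field(default_factory=list)


def chunk_document(doc_id: str, text: str, size: int = 40,
                   metadata: Optional[Dict[str, str]] = None) -> Document:
    """Split text into fixed-size character chunks and record where each one came from."""
    doc = Document(doc_id, dict(metadata or {}))
    for start in range(0, len(text), size):
        chunk_meta = {"doc_id": doc_id, "offset": str(start)}
        doc.chunks.append(Chunk(text[start:start + size], chunk_meta))
    return doc


if __name__ == "__main__":
    doc = chunk_document("d1", "x" * 100, size=40, metadata={"source": "demo"})
    print(len(doc.chunks), [c.metadata["offset"] for c in doc.chunks])
```

<p>Tuning along the structural dimension then means changing how these records are produced: a smarter splitter, richer metadata, or an entity extractor that fills the <code>entities</code> list.</p>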
<p>There are also three stages along the processing dimension of a RAG pipeline:</p>
<ul>
<li>ingestion</li>
<li>query</li>
<li>feedback (optional)</li>
</ul>
<p>Finally, you can also play with the more technical aspects in a pipeline:</p>
<ul>
<li>chunking strategy (semantic, size…)</li>
<li>vector comparisons, embedding models</li>
<li>indexing techniques and (re)ranking (BM25, neural reranking…)</li>
</ul>
<p>The twenty techniques highlighted below can all be placed within these three dimensions. I call them dimensions because each has its own level of complexity and can sometimes be adopted incrementally. Whether you need one or the other is more art than science; many factors can play a role in deciding what to use:</p>
<ul>
<li>the business context</li>
<li>the corpus or repository</li>
<li>the budget</li>
<li>the accuracy of the end-result</li>
<li>the type of user-experience</li>
<li>the time allowed for processing queries</li>
<li>how dynamic the corpus is and the necessity for updates</li>
<li>whether knowledge has to be structured (hierarchic, graph, visualizations…)</li>
</ul>
<p>You don’t necessarily need a knowledge graph, ontology and agentic frameworks to be successful. Many RAG projects can do without a feedback pipeline or explainable AI. What matters is to understand the business case and the options at your disposal.</p>
<p>Take a look at our article regarding <a href="../graphAI/qaDeepeval.html" target="_blank">QA generation for DeepEval</a> which highlights the process of evaluating a RAG solution.</p>
</section>
<section id="i-document-compressors" class="level2">
<h2 class="anchored" data-anchor-id="i-document-compressors">I Document compressors</h2>
<ul>
<li><a href="https://python.langchain.com/docs/how_to/contextual_compression/" target="_blank">LangChain contextual compression</a></li>
<li><a href="https://www.llamaindex.ai/blog/longllmlingua-bye-bye-to-middle-loss-and-save-on-your-rag-costs-via-prompt-compression-54b559b9ddf7" target="_blank">LlamaIndex research on compression</a></li>
<li><a href="https://arxiv.org/abs/2407.09014" target="_blank">CompAct: Compressing Retrieved Documents Actively for Question Answering</a></li>
</ul>
</section>
<section id="ii-header-augmentation-aka-out-of-context-chunks-contextual-chunk-headers-or-cch" class="level2">
<h2 class="anchored" data-anchor-id="ii-header-augmentation-aka-out-of-context-chunks-contextual-chunk-headers-or-cch">II Header augmentation (aka out of context chunks, contextual chunk headers or CCH)</h2>
<ul>
<li><a href="https://python.langchain.com/docs/how_to/parent_document_retriever/#with-contextual-chunk-headers" target="_blank">LangChain implementation</a></li>
<li><a href="https://arxiv.org/abs/2406.00456v1" target="_blank">Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation</a></li>
<li><a href="https://arxiv.org/abs/2407.01219v1" target="_blank">Searching for Best Practices in Retrieval-Augmented Generation</a></li>
<li><a href="https://arxiv.org/abs/2402.05131v3" target="_blank">Financial Report Chunking for Effective Retrieval Augmented Generation</a></li>
<li><a href="https://arxiv.org/abs/2407.19794v2" target="_blank">Introducing a new hyper-parameter for RAG: Context Window Utilization</a></li>
<li><a href="https://arxiv.org/abs/2411.03920v1" target="_blank">RAGulator: Lightweight Out-of-Context Detectors for Grounded Text Generation</a></li>
</ul>
</section>
<section id="iii-query-augmentation-aka-query-expansion" class="level2">
<h2 class="anchored" data-anchor-id="iii-query-augmentation-aka-query-expansion">III Query augmentation (aka query expansion)</h2>
<ul>
<li><a href="https://python.langchain.com/docs/how_to/MultiQueryRetriever/">LangChain query expansion via a multi-query retriever</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/query_transformations/query_transform_cookbook/" target="_blank">LlamaIndex query transformation cookbook</a></li>
</ul>
</section>
<section id="iv-hypothetical-document-embedding" class="level2">
<h2 class="anchored" data-anchor-id="iv-hypothetical-document-embedding">IV Hypothetical document embedding</h2>
<ul>
<li><a href="https://arxiv.org/abs/2212.10496" target="_blank">Precise Zero-Shot Dense Retrieval without Relevance Labels</a></li>
<li><a href="https://python.langchain.com/api_reference/langchain/chains/langchain.chains.hyde.base.HypotheticalDocumentEmbedder.html" target="_blank">LangChain HypotheticalDocumentEmbedder</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/optimizing/advanced_retrieval/query_transformations/" target="_blank">LlamaIndex query transformations (including HyDE)</a></li>
</ul>
</section>
<section id="v-graph-rag" class="level2">
<h2 class="anchored" data-anchor-id="v-graph-rag">V Graph RAG</h2>
<ul>
<li><a href="https://github.com/Graph-RAG/GraphRAG/" target="_blank">An extensive compilation of graph RAG research papers</a></li>
<li><a href="https://microsoft.github.io/graphrag/" target="_blank">Microsoft Graph RAG</a></li>
<li>Formerly Neo4j GenAI but now called <a href="https://neo4j.com/docs/neo4j-graphrag-python/" target="_blank">Neo4j GraphRAG for Python</a></li>
<li><a href="https://github.com/gusye1234/nano-graphrag" target="_blank">Nano Graph RAG</a></li>
<li><a href="https://github.com/HKUDS/LightRAG" target="_blank">LightRAG</a></li>
<li><a href="https://github.com/circlemind-ai/fast-graphrag">Fast Graph RAG</a></li>
<li><a href="https://trustgraph.ai">TrustGraph</a></li>
</ul>
</section>
<section id="vi-adaptive-rag" class="level2">
<h2 class="anchored" data-anchor-id="vi-adaptive-rag">VI Adaptive RAG</h2>
<p>This technique classifies a given question before routing it to one of several specialized RAG pipelines. For instance, a question can be categorized as “factual” or “opinion”, and for each category an appropriate prompt handles the question accordingly. This can also mean that the question is expanded into multiple questions. If it’s an “opinion”, this would involve first asking the LLM which aspects are relevant and then generating a separate question for each perspective.</p>
<p>In essence, this is prompt tuning more than RAG tuning, but the difference is subtle and the approach can lead to better answers.</p>
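<p>The routing step can be sketched as follows. This is a minimal illustration, assuming a keyword heuristic as a stand-in for the LLM-based classifier and a fixed list of aspects in place of the aspects an LLM would propose; <code>classify</code>, <code>route</code> and the two pipeline functions are hypothetical names.</p>

```python
from typing import Callable, Dict, List, Sequence, Tuple


def classify(question: str) -> str:
    """Toy classifier; a real system would ask an LLM for the category."""
    opinion_markers = ("should", "better", "best", "worth", "opinion")
    q = question.lower()
    return "opinion" if any(marker in q for marker in opinion_markers) else "factual"


def factual_pipeline(question: str) -> List[str]:
    """Factual questions go to retrieval unchanged."""
    return [question]


def opinion_pipeline(question: str,
                     aspects: Sequence[str] = ("cost", "accuracy", "maintenance")) -> List[str]:
    """Opinion questions are expanded into one sub-question per perspective.

    The aspects are hard-coded here; the article's approach would ask the LLM for them.
    """
    return [f"{question} (perspective: {aspect})" for aspect in aspects]


PIPELINES: Dict[str, Callable[[str], List[str]]] = {
    "factual": factual_pipeline,
    "opinion": opinion_pipeline,
}


def route(question: str) -> Tuple[str, List[str]]:
    """Classify the question and return the category with the queries to retrieve for."""
    category = classify(question)
    return category, PIPELINES[category](question)


if __name__ == "__main__":
    print(route("When was BM25 introduced?"))
    print(route("Should we use a knowledge graph?"))
```

<p>Each query returned by <code>route</code> would then run through its own retrieval and prompt, which is where the category-specific prompt comes in.</p>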
<ul>
<li><a href="../graphAI/adaptiveRAG.html" target="_blank">Our own take on adaptive RAG</a></li>
<li><a href="https://arxiv.org/abs/2403.14403" target="_blank">Adaptive RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity</a></li>
<li><a href="https://arxiv.org/abs/2410.11321" target="_blank">Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG)</a></li>
<li><a href="https://arxiv.org/abs/2412.01572" target="_blank">MBA-RAG: A Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity</a></li>
<li><a href="https://arxiv.org/abs/2405.18727" target="_blank">CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control</a></li>
<li><a href="https://arxiv.org/abs/2406.19215" target="_blank">SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation</a></li>
<li><a href="https://arxiv.org/abs/2402.16457" target="_blank">RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering</a></li>
<li><a href="https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_adaptive_rag/" target="_blank">LangGraph implementation</a></li>
<li><a href="https://github.com/mistralai/cookbook/blob/main/third_party/LlamaIndex/Adaptive_RAG.ipynb" target="_blank">LlamaIndex implementation</a></li>
</ul>
</section>
<section id="vii-context-enrichment" class="level2">
<h2 class="anchored" data-anchor-id="vii-context-enrichment">VII Context Enrichment</h2>
<ul>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MetadataExtractionSEC/" target="_blank">Extracting Metadata for Better Document Indexing and Understanding</a></li>
</ul>
</section>
<section id="viii-corrective-rag-aka-crag" class="level2">
<h2 class="anchored" data-anchor-id="viii-corrective-rag-aka-crag">VIII Corrective RAG (aka CRAG)</h2>
<ul>
<li><a href="https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_crag/" target="_blank">LangGraph CRAG</a></li>
<li><a href="https://arxiv.org/abs/2401.15884" target="_blank">Corrective RAG</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/workflow/corrective_rag_pack/" target="_blank">LlamaIndex CRAG</a></li>
</ul>
</section>
<section id="ix-explainable-rag" class="level2">
<h2 class="anchored" data-anchor-id="ix-explainable-rag">IX Explainable RAG</h2>
<ul>
<li><a href="https://arxiv.org/abs/2407.11005" target="_blank">RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems</a></li>
<li><a href="https://arxiv.org/abs/2405.00449" target="_blank">RAG-based Explainable Prediction of Road Users Behaviors for Automated Driving using Knowledge Graphs and Large Language Models</a></li>
</ul>
</section>
<section id="x-fusion-retrieval" class="level2">
<h2 class="anchored" data-anchor-id="x-fusion-retrieval">X Fusion retrieval</h2>
<ul>
<li><a href="https://python.langchain.com/docs/additional_resources/arxiv_references/#rag-fusion-a-new-take-on-retrieval-augmented-generation" target="_blank">RAG-Fusion: a New Take on Retrieval-Augmented Generation</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/retrievers/simple_fusion/" target="_blank">LlamaIndex simple fusion retriever</a></li>
</ul>
</section>
<section id="xi-hierarchical-rag" class="level2">
<h2 class="anchored" data-anchor-id="xi-hierarchical-rag">XI Hierarchical RAG</h2>
<ul>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/query_engine/multi_doc_auto_retrieval/multi_doc_auto_retrieval/" target="_blank">LlamaIndex Structured Hierarchical Retrieval</a></li>
<li><a href="https://pixion.co/blog/rag-strategies-hierarchical-index-retrieval" target="_blank">RAG Strategies - Hierarchical Index Retrieval</a></li>
</ul>
</section>
<section id="xii-propositional-chunking" class="level2">
<h2 class="anchored" data-anchor-id="xii-propositional-chunking">XII Propositional chunking</h2>
<p>Propositional chunking breaks text down into atomic units called propositions, each expressing a distinct fact or idea. The choice of retrieval unit significantly affects dense retrieval performance: fine-grained units such as propositions outperform passage-level units in retrieval tasks and improve downstream QA tasks.</p>
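<p>A rough illustration of the idea: the sketch below splits a passage on sentence boundaries and semicolons and keeps a pointer back to the source passage. This rule-based splitter is only a stand-in. Approaches such as Dense X Retrieval use a model to rewrite text into self-contained propositions (resolving pronouns, for instance), which this code does not attempt. <code>Proposition</code> and <code>naive_propositions</code> are hypothetical names.</p>

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Proposition:
    text: str
    source_id: str  # identifier of the passage the proposition came from


def naive_propositions(passage: str, source_id: str) -> List[Proposition]:
    """Split a passage into small units on sentence ends and semicolons.

    A stand-in for a model-based propositionizer: it does not rewrite the
    text, so pronouns such as "he" are left unresolved.
    """
    parts = re.split(r"(?<=[.!?])\s+|;\s*", passage.strip())
    return [Proposition(part.strip(), source_id) for part in parts if part.strip()]


if __name__ == "__main__":
    passage = "John Field was an Irish pianist. He was born in 1782; he died in 1837."
    for proposition in naive_propositions(passage, "field-bio"):
        print(proposition)
```

<p>Each proposition is embedded and indexed on its own, while <code>source_id</code> lets you hand the full passage to the LLM once a proposition matches.</p>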
<ul>
<li><a href="https://arxiv.org/abs/2410.12788" target="_blank">Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception</a></li>
<li><a href="https://arxiv.org/abs/2312.06648" target="_blank">Dense X Retrieval: What Retrieval Granularity Should We Use?</a></li>
</ul>
</section>
<section id="xiii-query-rewriting" class="level2">
<h2 class="anchored" data-anchor-id="xiii-query-rewriting">XIII Query rewriting</h2>
<ul>
<li><a href="https://arxiv.org/abs/2305.14283" target="_blank">Query Rewriting for Retrieval-Augmented Large Language Models</a></li>
<li><a href="https://python.langchain.com/docs/integrations/retrievers/re_phrase/" target="_blank">RePhraseQuery LangChain retriever</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline/" target="_blank">An Introduction to LlamaIndex Query Pipelines</a></li>
</ul>
</section>
<section id="xiv-raptor" class="level2">
<h2 class="anchored" data-anchor-id="xiv-raptor">XIV Raptor</h2>
<ul>
<li><a href="https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raptor/examples/raptor.ipynb" target="_blank">LlamaIndex Raptor implementation</a></li>
<li><a href="https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb" target="_blank">LangChain Raptor cookbook</a></li>
<li><a href="https://arxiv.org/abs/2401.18059v1" target="_blank">RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval</a></li>
</ul>
</section>
<section id="xv-relevant-segment-extraction" class="level2">
<h2 class="anchored" data-anchor-id="xv-relevant-segment-extraction">XV Relevant segment extraction</h2>
<p>Relevant Segment Extraction (RSE) is an optional (but strongly recommended) post-processing step that takes clusters of relevant chunks and intelligently combines them into longer sections of text that we call segments. These segments provide better context to the LLM than any individual chunk can.</p>
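<p>The merging step can be sketched as below. This is a simplification under stated assumptions: chunks are identified by their position in the document, relevance has already been decided upstream, and nearby hits are merged when the gap between them is at most <code>max_gap</code> chunks. It is not the scoring procedure of the linked RSE article, and <code>merge_into_segments</code> is a hypothetical name.</p>

```python
from typing import List, Sequence, Tuple


def merge_into_segments(chunks: Sequence[str], relevant: Sequence[int],
                        max_gap: int = 1) -> List[Tuple[int, int, str]]:
    """Merge relevant chunk positions into contiguous segments.

    chunks   -- all chunks of one document, in reading order
    relevant -- positions of the chunks the retriever flagged as relevant
    max_gap  -- number of non-relevant chunks that may be bridged between two hits

    Returns (start, end, text) tuples with inclusive positions.
    """
    spans: List[List[int]] = []
    for position in sorted(set(relevant)):
        if spans and position - spans[-1][1] <= max_gap + 1:
            spans[-1][1] = position  # close enough: extend the current segment
        else:
            spans.append([position, position])  # too far: start a new segment
    return [(start, end, " ".join(chunks[start:end + 1])) for start, end in spans]


if __name__ == "__main__":
    document_chunks = [f"c{i}" for i in range(10)]
    print(merge_into_segments(document_chunks, [1, 2, 4, 8]))
```

<p>Bridging a small gap pulls in a chunk that did not match on its own but sits between two that did, which is often exactly the connective text the LLM needs.</p>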
<ul>
<li><a href="https://superpowered.ai/blog/introducing-relevant-segment-extraction" target="_blank">Introducing Relevant Segment Extraction (RSE)</a></li>
</ul>
</section>
<section id="xvi-reliable-rag" class="level2">
<h2 class="anchored" data-anchor-id="xvi-reliable-rag">XVI Reliable RAG</h2>
<p>The “Reliable-RAG” method improves RAG by incorporating layers of validation and refinement to enhance the accuracy and relevance of retrieved information. It checks document relevance, guards against hallucination, and highlights the exact segments used in generating the final response.</p>
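<p>The validation layers could look roughly like this. The sketch assumes a simple token-overlap score as a stand-in for the LLM or model-based graders a real setup would use, and the thresholds are arbitrary. <code>filter_relevant</code> and <code>check_answer</code> are hypothetical names.</p>

```python
import re
from typing import List, Optional, Sequence, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Fraction of the tokens of a that also occur in b."""
    tokens_a = tokens(a)
    return len(tokens_a & tokens(b)) / len(tokens_a) if tokens_a else 0.0


def filter_relevant(query: str, docs: Sequence[str], threshold: float = 0.5) -> List[str]:
    """Layer 1: drop retrieved documents that barely overlap with the query."""
    return [doc for doc in docs if overlap(query, doc) >= threshold]


def check_answer(answer: str, docs: Sequence[str],
                 threshold: float = 0.6) -> List[Tuple[str, Optional[str]]]:
    """Layer 2: pair each answer sentence with its supporting document, or None.

    A sentence without support is a candidate hallucination; the supporting
    document is the segment to highlight for the user.
    """
    report = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        best = max(docs, key=lambda doc: overlap(sentence, doc), default=None)
        supported = best is not None and overlap(sentence, best) >= threshold
        report.append((sentence, best if supported else None))
    return report


if __name__ == "__main__":
    docs = [
        "Maxwell was a Scottish scientist in mathematical physics.",
        "John Field was an Irish pianist and composer.",
    ]
    print(filter_relevant("Who was Maxwell?", docs))
    print(check_answer("Maxwell was a Scottish scientist. Field invented the telephone.", docs))
```

<p>Lexical overlap is a weak proxy for support, so treat the output as a flag for review rather than a verdict.</p>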
<ul>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/cookbooks/cleanlab_tlm_rag/" target="_blank">Trustworthy RAG with the Trustworthy Language Model</a></li>
<li><a href="https://arxiv.org/abs/2401.05561" target="_blank">TrustLLM: Trustworthiness in Large Language Models</a></li>
<li><a href="https://arxiv.org/abs/2308.05374" target="_blank">Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models’ Alignment</a></li>
</ul>
</section>
<section id="xvii-reranking" class="level2">
<h2 class="anchored" data-anchor-id="xvii-reranking">XVII Reranking</h2>
<ul>
<li><a href="https://python.langchain.com/docs/integrations/retrievers/flashrank-reranker/" target="_blank">FlashRank reranker</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/workflow/rag/" target="_blank">RAG Workflow with Reranking</a></li>
</ul>
</section>
<section id="xviii-feedback-loop" class="level2">
<h2 class="anchored" data-anchor-id="xviii-feedback-loop">XVIII Feedback loop</h2>
<ul>
</ul>
</section>
<section id="xix-self-rag" class="level2">
<h2 class="anchored" data-anchor-id="xix-self-rag">XIX Self RAG</h2>
<ul>
<li><a href="https://arxiv.org/abs/2310.11511" target="_blank">Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/" target="_blank">Self Correcting Query Engines - Evaluation &amp; Retry</a></li>
<li><a href="https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_self_rag/" target="_blank">LangGraph SelfRAG</a></li>
</ul>
</section>
<section id="xx-semantic-chunking" class="level2">
<h2 class="anchored" data-anchor-id="xx-semantic-chunking">XX Semantic chunking</h2>
<ul>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/" target="_blank">LlamaIndex semantic chunker</a></li>
<li><a href="https://python.langchain.com/docs/how_to/semantic-chunker/" target="_blank">How to split text based on semantic similarity</a></li>
</ul>


</section>

 ]]></description>
  <category>LLM</category>
  <category>GraphAI</category>
  <guid>https://discovery.graphsandnetworks.com/graphAI/rags.html</guid>
  <pubDate>Wed, 01 Jan 2025 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/ragtree.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>GliNER</title>
  <link>https://discovery.graphsandnetworks.com/graphAI/gliner.html</link>
  <description><![CDATA[ 





<p>Before transformers were around you needed things like <a href="https://spacy.io/">spaCy</a> to do Named Entity Recognition (NER). Now you can use a transformer model like <a href="https://github.com/urchade/GLiNER?tab=readme-ov-file" target="_blank">GliNER</a> (see also <a href="https://arxiv.org/abs/2311.08526" target="_blank">the original research article</a>). Although generic models like Llama and GPT can extract entities, GliNER is specifically designed for NER. It’s also fast and more flexible. spaCy remains a good choice for general NLP and there is a <a href="https://spacy.io/universe/project/gliner-spacy" target="_blank">Gliner-Spacy wrapper</a> giving you the best of both worlds.</p>
<p>Let’s take a look at how it works.</p>
<div id="cell-2" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> gliner <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GLiNER</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize GLiNER with the base model</span></span>
<span id="cb1-4">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GLiNER.from_pretrained(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"urchade/gliner_medium-v2.1"</span>)</span>
<span id="cb1-5">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span></code></pre></div></div>
</div>
<p>Like everything loaded from Hugging Face, the above automatically downloads the necessary tensors and configs. The following extracts the entities with staggering speed:</p>
<div id="cell-4" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb2-2">start_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb2-3">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb2-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- John Field was born January 26, 1782, and died January 23, 1837. He was an Irish pianist, composer, and teacher.</span></span>
<span id="cb2-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- James Clerk Maxwell was born June 13, 1831, and died November 5, 1879. He was a Scottish scientist in the field of mathematical physics.</span></span>
<span id="cb2-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb2-7">labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Person"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>]</span>
<span id="cb2-8">entities <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict_entities(text, labels, threshold<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb2-9">end_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb2-10">elapsed_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> end_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_time</span>
<span id="cb2-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> entity <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> entities:</span>
<span id="cb2-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(entity[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=&gt;"</span>, entity[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span>])</span>
<span id="cb2-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Time taken: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>elapsed_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> seconds"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>John Field =&gt; Person
January 26, 1782 =&gt; Date
January 23, 1837 =&gt; Date
James Clerk Maxwell =&gt; Person
June 13, 1831 =&gt; Date
November 5, 1879 =&gt; Date
Time taken: 0.09 seconds</code></pre>
</div>
</div>
<p>You can specify anything you like for the labels, but not every word is automatically a named entity. For instance, if you tell GliNER to extract ‘Math’ or ‘Number’, this happens:</p>
<div id="cell-6" class="cell" data-execution_count="14">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">The transcendental number e ≈ 2.71828 is Euler's number, and it is the base of the natural logarithm. It's also crucial in calculus, e.g. $e^{i</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">pi}=-1$.</span></span>
<span id="cb4-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-4">labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Person"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Number"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Math"</span>]</span>
<span id="cb4-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> entity <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.predict_entities(text, labels, threshold<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>):</span>
<span id="cb4-6">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(entity[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=&gt;"</span>, entity[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>e =&gt; Number
Euler =&gt; Person
natural logarithm =&gt; Math
calculus =&gt; Math
e =&gt; Number</code></pre>
</div>
</div>
<p>Semantically it is correct that ‘e’ is the symbol for a number, but the actual value (2.71828) should have been extracted as well. It’s remarkable that ‘natural logarithm’ is correctly identified as a mathematical entity. The threshold is the minimum confidence an entity needs in order to be returned. If you lower it, more candidates come through:</p>
<div id="cell-8" class="cell" data-execution_count="15">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> entity <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.predict_entities(text, labels, threshold<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>):</span>
<span id="cb6-2">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(entity[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=&gt;"</span>, entity[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>The transcendental number =&gt; Number
e =&gt; Number
2 =&gt; Number
71828 =&gt; Number
Euler =&gt; Person
natural logarithm =&gt; Math
calculus =&gt; Math
e =&gt; Number
e =&gt; Number
i =&gt; Number
pi =&gt; Number
-1 =&gt; Number</code></pre>
</div>
</div>
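<p>The model does not recognize 2.71828 as a float: the number is split at the dot into ‘2’ and ‘71828’, as if the dot were the end of a sentence. The threshold is a trade-off between precision and recall: a lower value returns more entities but also more noise. The default is 0.5.</p>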
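<p>One possible workaround for the split float, sketched here as an assumption rather than anything GliNER provides, is to pick up numeric literals with a plain regular expression and treat them as additional ‘Number’ candidates. The helper name <code>find_numeric_literals</code> is hypothetical:</p>

```python
import re

# Hypothetical helper, not part of GLiNER: matches integer and decimal
# literals (optionally signed) so that 2.71828 is kept as a single value.
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def find_numeric_literals(text):
    """Return all numeric literals in the text, in order of appearance."""
    return NUMBER_PATTERN.findall(text)


# Raw string so that the backslash in the LaTeX fragment is kept as-is.
sample = r"The transcendental number e ≈ 2.71828 is Euler's number. It's also crucial in calculus, e.g. $e^{i\pi}=-1$."
print(find_numeric_literals(sample))
```

<p>On this sample the helper returns <code>['2.71828', '-1']</code>. A hyphen directly in front of a digit is read as a minus sign, so year ranges such as 1782-1837 would need extra care.</p>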
<p>GliNER does not replace the need for sophisticated graph extraction as explained in <a href="../graphAI/graphRAG.html">our Graph RAG article</a>, but it can speed it up. Given its extraction speed, GliNER can be used to extract the entities first and hand them over to a more generic LLM that extracts the relationships. The graph RAG prompt then becomes less complex, which speeds up the graph extraction process. See also our <a href="../graphAI/nuextract.html">Nuextract article</a> for an alternative approach based on structured data extraction.</p>
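<p>The hand-over of entities to a generic LLM could look roughly like the sketch below. It assumes the entities arrive as a list of dictionaries with <code>text</code> and <code>label</code> keys, as printed in the examples above. The function name <code>build_relation_prompt</code> and the prompt wording are illustrative assumptions, and the call to the LLM itself is left out:</p>

```python
def build_relation_prompt(text, entities):
    """Build a relationship-extraction prompt from pre-extracted entities.

    `entities` is assumed to be a list of dicts with "text" and "label" keys.
    """
    # Deduplicate while preserving the order of first appearance.
    seen = set()
    unique = []
    for entity in entities:
        key = (entity["text"], entity["label"])
        if key not in seen:
            seen.add(key)
            unique.append(key)

    entity_lines = "\n".join(f"- {name} ({label})" for name, label in unique)
    return (
        "The following entities were already extracted from the text:\n"
        f"{entity_lines}\n\n"
        "List only the relationships between these entities as "
        "(source, relation, target) triples.\n\n"
        f"Text:\n{text.strip()}"
    )


# Entity dicts shaped like the GliNER output shown earlier (illustrative input).
entities = [
    {"text": "John Field", "label": "Person"},
    {"text": "January 26, 1782", "label": "Date"},
    {"text": "John Field", "label": "Person"},
]
prompt = build_relation_prompt("John Field was born January 26, 1782.", entities)
print(prompt)
```

<p>Because the entity list is fixed up front, the LLM only has to connect known items instead of discovering them, which is what keeps the prompt short.</p>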



 ]]></description>
  <category>GraphAI</category>
  <guid>https://discovery.graphsandnetworks.com/graphAI/gliner.html</guid>
  <pubDate>Fri, 06 Dec 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/gears.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Marker PDF Parser</title>
  <link>https://discovery.graphsandnetworks.com/graphAI/marker.html</link>
  <description><![CDATA[ 





<p><a href="https://github.com/VikParuchuri/marker" target="_blank">Marker</a> by <a href="https://www.datalab.to/" target="_blank">Datalab</a> is an open source Python library that can be used to extract structured data from PDFs. We listed various PDF parsing solutions in the lengthy <a href="../graphAI/graphRAG.html">Graph RAG article</a> and Marker stands out because you can run it locally, thus avoiding sending data to a service. Services like <a href="https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/" target="_blank">LlamaParse</a> are indeed very powerful but they require sending your data to a remote server. Marker is a good alternative if you want to keep your data local.</p>
<p>Setting Marker up is as easy as <code>pip install marker-pdf</code>. To parse an article to markdown you can use:</p>
<pre class="{shell}"><code>marker_single /my-article.pdf</code></pre>
<p>You can also output JSON or HTML, and a more detailed command could be:</p>
<pre class="{shell}"><code>marker_single --output_dir /tmp/output/ --page_range 0-2 --output_format json  /my-article.pdf</code></pre>
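<p>If you want to drive the same command from a script, a thin wrapper around the CLI is one option. The sketch below uses only the flags shown above; the function name <code>build_marker_command</code> is hypothetical and Marker’s own Python API is deliberately not used here:</p>

```python
import os
import shutil
import subprocess


def build_marker_command(pdf_path, output_dir=None, page_range=None, output_format=None):
    """Assemble a marker_single call using only the flags shown above."""
    command = ["marker_single"]
    if output_dir is not None:
        command += ["--output_dir", output_dir]
    if page_range is not None:
        command += ["--page_range", page_range]
    if output_format is not None:
        command += ["--output_format", output_format]
    command.append(pdf_path)
    return command


command = build_marker_command(
    "/my-article.pdf",
    output_dir="/tmp/output/",
    page_range="0-2",
    output_format="json",
)
print(" ".join(command))

# Only run the conversion when marker_single is installed and the PDF exists.
if shutil.which("marker_single") and os.path.isfile("/my-article.pdf"):
    subprocess.run(command, check=True)
```

<p>Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.</p>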
<p>Some notable features of Marker are:</p>
<ul>
<li>it extracts images from PDFs</li>
<li>it recognizes formulas and can extract them as LaTeX</li>
<li>it can extract tables and output them as markdown tables</li>
<li>it also handles footnotes and references!</li>
</ul>
<p>Marker is free to use <a href="https://www.datalab.to/onprem" target="_blank">under $5M in TTM revenue</a>, which is really generous. If you are a larger company, you can contact Datalab for a quote.</p>
<p>One aspect that is of particular interest in some domains like legal is the heavy use of references and footnotes. Marker correctly extracts them and in the snippet below you can see the original pdf with the markdown preview next to it.</p>
<p><img src="https://discovery.graphsandnetworks.com/images/ParsingFootnotes.png" class="img-fluid" width="500"></p>
<p>On a markdown level the footnotes are extracted as:</p>
<pre class="{text}"><code>&lt;sup&gt;2&lt;/sup&gt; David M. Scobey, Empire City: Politics, Culture, and Urbanism in Gilded-Age New York (New Haven: Yale University, 1989), 25, 26, 188.
&lt;sup&gt;3&lt;/sup&gt; Scobey, 29, 334.
&lt;sup&gt;4&lt;/sup&gt; James Miller, Miller's New York as it Is (New York, James Miller Press, 1872), 23.
&lt;sup&gt;5&lt;/sup&gt; Thomas Bender, New York Intellect: A History of Intellectual Life in New York City from 1750 to the Beginnings of Our Own Time (Baltimore: Johns Hopkins University Press, 1987), 171.</code></pre>
<p>Note that the citation is neatly converted as well. All in all, Marker is quite a catch.</p>



 ]]></description>
  <category>GraphAI</category>
  <guid>https://discovery.graphsandnetworks.com/graphAI/marker.html</guid>
  <pubDate>Fri, 06 Dec 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/gears.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Nuextract</title>
  <link>https://discovery.graphsandnetworks.com/graphAI/nuextract.html</link>
  <description><![CDATA[ 





<p><a href="https://huggingface.co/numind/NuExtract-1.5" target="_blank">Nuextract</a> by Numind is dedicated to extracting information in a predefined format. It is less like named entity recognition and more like JSON extraction: you supply a JSON template and the model fills it in.</p>
<p>This model can effectively remove the need to obtain structured output from other LLMs via Pydantic specs. In addition, it handles complex, nested JSON structures.</p>
<p>Let’s set up the preliminaries:</p>
<div id="cell-2" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> AutoModelForCausalLM, AutoTokenizer</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict_NuExtract(model, tokenizer, texts, template, batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, max_length<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10_000</span>, max_new_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4_000</span>):</span>
<span id="cb1-6">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> json.dumps(json.loads(template), indent<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb1-7">    prompts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""&lt;|input|&gt;</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">### Template:</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>template<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">### Text:</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>text<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">&lt;|output|&gt;"""</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> text <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> texts]</span>
<span id="cb1-8">    </span>
<span id="cb1-9">    outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb1-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(prompts), batch_size):</span>
<span id="cb1-12">            batch_prompts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> prompts[i:i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>batch_size]</span>
<span id="cb1-13">            batch_encodings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenizer(batch_prompts, return_tensors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pt"</span>, truncation<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, max_length<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>max_length).to(model.device)</span>
<span id="cb1-14"></span>
<span id="cb1-15">            pred_ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.generate(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>batch_encodings, max_new_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>max_new_tokens)</span>
<span id="cb1-16">            outputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> tokenizer.batch_decode(pred_ids, skip_special_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-17"></span>
<span id="cb1-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> [output.split(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;|output|&gt;"</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> output <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> outputs]</span>
<span id="cb1-19"></span>
<span id="cb1-20">model_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"numind/NuExtract-v1.5"</span></span>
<span id="cb1-21">device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mps"</span></span>
<span id="cb1-22">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> AutoModelForCausalLM.from_pretrained(model_name, torch_dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.bfloat16, trust_remote_code<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>).to(device).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb1-23">tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> AutoTokenizer.from_pretrained(model_name, trust_remote_code<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<script type="application/vnd.jupyter.widget-view+json">
{"model_id":"31abd989949345fba4e036f90f24e54e","version_major":2,"version_minor":0,"quarto_mimetype":"application/vnd.jupyter.widget-view+json"}
</script>
</div>
</div>
<p>The above will automatically download the model (~10GB).</p>
<p>Let’s start with a simple text and a person/children extraction:</p>
<div id="cell-5" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""My name is John and I am 56 years old. I have two daughters, Anna and Lisa. Anna is 28 years old and Lisa 25."""</span></span>
<span id="cb2-2"></span>
<span id="cb2-3">template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb2-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "Person": {</span></span>
<span id="cb2-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        "Name": "",</span></span>
<span id="cb2-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        "Age": "",</span></span>
<span id="cb2-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        "Children": [</span></span>
<span id="cb2-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">            {</span></span>
<span id="cb2-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                "Name": "",</span></span>
<span id="cb2-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                "Age": 0</span></span>
<span id="cb2-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">            }</span></span>
<span id="cb2-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        ]</span></span>
<span id="cb2-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    }</span></span>
<span id="cb2-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">     </span></span>
<span id="cb2-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">}"""</span></span>
<span id="cb2-16"></span>
<span id="cb2-17">prediction <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> predict_NuExtract(model, tokenizer, [text], template)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(prediction)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{"Person": {"Name": "John", "Age": "56", "Children": [{"Name": "Anna", "Age": 28}, {"Name": "Lisa", "Age": 25}]}}
  </code></pre>
</div>
</div>
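<p>Note that the default in the template also defines the data type. The children’s ‘Age’ defaults to 0 and comes back as an integer, whereas the parent’s ‘Age’ defaults to an empty string and comes back as the string "56". The ‘Name’ fields are strings.</p>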
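<p>To make the type behaviour concrete, the prediction printed above can be parsed with the standard <code>json</code> module and inspected. This is a minimal sketch on the exact output string; the final normalisation step is an illustrative assumption, not something Nuextract does for you:</p>

```python
import json

# The prediction string printed above, copied verbatim.
prediction = '{"Person": {"Name": "John", "Age": "56", "Children": [{"Name": "Anna", "Age": 28}, {"Name": "Lisa", "Age": 25}]}}'

person = json.loads(prediction)["Person"]

# The parent's age used "" as template default, so it is returned as a string.
print(type(person["Age"]).__name__, person["Age"])

# The children's age used 0 as template default, so it is returned as an integer.
for child in person["Children"]:
    print(type(child["Age"]).__name__, child["Name"], child["Age"])

# Hypothetical post-processing step: normalise all ages to integers.
ages = {person["Name"]: int(person["Age"])}
ages.update({child["Name"]: int(child["Age"]) for child in person["Children"]})
print(ages)
```

<p>If you want integers everywhere, the simpler route is to use 0 as the default for every ‘Age’ field in the template.</p>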
<p>Does it work with a graph structure? Yes, it does. Let’s try a simple example:</p>
<div id="cell-8" class="cell" data-execution_count="14">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- John knows Mary</span></span>
<span id="cb4-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- Mary knows Peter</span></span>
<span id="cb4-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- Peter knows John</span></span>
<span id="cb4-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-6"></span>
<span id="cb4-7">template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb4-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "nodes": [{"name":""}],</span></span>
<span id="cb4-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "edges":[{ "source": "", "target": "" }]</span></span>
<span id="cb4-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">     </span></span>
<span id="cb4-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">}"""</span></span>
<span id="cb4-12"></span>
<span id="cb4-13">prediction <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> predict_NuExtract(model, tokenizer, [text], template)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(prediction)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{"nodes": [
    {
        "name": "John"
    },
    {
        "name": "Mary"
    },
    {
        "name": "Peter"
    }
],
    "edges": [
    {
        "source": "John",
        "target": "Mary"
    },
    {
        "source": "Mary",
        "target": "Peter"
    },
    {
        "source": "Peter",
        "target": "John"
    }
]}
  </code></pre>
</div>
</div>
<p>Let’s try something a bit more complex: does it understand SPO (subject-predicate-object) triples?</p>
<div id="cell-10" class="cell" data-execution_count="17">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb6-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- John knows Mary</span></span>
<span id="cb6-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- Mary knows Peter</span></span>
<span id="cb6-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- Peter knows John</span></span>
<span id="cb6-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- John is 23 years old</span></span>
<span id="cb6-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- John has a red car</span></span>
<span id="cb6-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- Peter is 25 years old</span></span>
<span id="cb6-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">- Mary has a blue car</span></span>
<span id="cb6-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb6-10"></span>
<span id="cb6-11">template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb6-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "nodes": [{"name":"", "age": 0}],</span></span>
<span id="cb6-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "edges":[{ "subject": "", "predicate": "", "object": "" }]</span></span>
<span id="cb6-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">     </span></span>
<span id="cb6-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">}"""</span></span>
<span id="cb6-16"></span>
<span id="cb6-17">prediction <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> predict_NuExtract(model, tokenizer, [text], template)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb6-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(prediction)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{"nodes": [
    {
        "name": "John",
        "age": 23
    },
    {
        "name": "Mary",
        "age": 0
    },
    {
        "name": "Peter",
        "age": 25
    }
],
"edges": [
    {
        "subject": "John",
        "predicate": "knows",
        "object": "Mary"
    },
    {
        "subject": "Mary",
        "predicate": "knows",
        "object": "Peter"
    },
    {
        "subject": "Peter",
        "predicate": "knows",
        "object": "John"
    }
]}
  </code></pre>
</div>
</div>
<p>This is a simple example, but it shows that the model can understand a graph structure and, specifically, SPO triples.</p>
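<p>To make the SPO reading concrete, here is a minimal sketch that turns the JSON printed above into plain (subject, predicate, object) tuples. The helper name <code>to_triples</code> is hypothetical and not part of NuExtract; the JSON string is copied from the cell output above with whitespace condensed.</p>

```python
import json

# Output copied from the cell above (whitespace condensed).
prediction = """{"nodes": [{"name": "John", "age": 23}, {"name": "Mary", "age": 0}, {"name": "Peter", "age": 25}],
"edges": [{"subject": "John", "predicate": "knows", "object": "Mary"},
          {"subject": "Mary", "predicate": "knows", "object": "Peter"},
          {"subject": "Peter", "predicate": "knows", "object": "John"}]}"""


def to_triples(raw):
    """Parse the model output and return its edges as (subject, predicate, object) tuples.

    Hypothetical helper for illustration; it assumes the edge keys used in the template above.
    """
    graph = json.loads(raw)
    return [(e["subject"], e["predicate"], e["object"]) for e in graph.get("edges", [])]


triples = to_triples(prediction)
for s, p, o in triples:
    print(s, p, o)
```

<p>Tuples like these can be loaded into most graph libraries or triple stores without further reshaping.</p>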
<p>Does it also work for graph RAG (LightRAG, to be precise)?</p>
<div id="cell-12" class="cell" data-execution_count="20">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">This site serves as a comprehensive resource hub (blog or journal) for all things graphs and our consulting services. It offers a rich collection of notebooks, articles, tutorials, and other resources, covering a wide range of topcs from basic graph theory to advanced visualization techniques and AI applications. Whether you’re a beginner looking to learn the fundamentals or an expert seeking to deepen your understanding, you’ll find valuable content here to support your journey.</span></span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">In addition to the educational materials, the site also provides a wealth of code snippets, practical tricks, and tools designed to enhance your workflow. These resources are ideal for developers, data scientists, and researchers who need quick, effective solutions to common challenges in graph-related projects. The code and tips shared here are the result of years of hands-on experience and experimentation, making them highly practical and immediately applicable to real-world problems.</span></span>
<span id="cb8-5"></span>
<span id="cb8-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">The site also features reflections on the evolution of graph consulting over the past 20 years. These insights are drawn from decades of experience in the field, offering a unique perspective on how graph technologies have evolved and the impact they have had on various industries. Alongside these reflections, you will find research papers and experimental results that push the boundaries of what is possible with graphs, providing inspiration and guidance for those looking to innovate in this dynamic area.</span></span>
<span id="cb8-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-8"></span>
<span id="cb8-9">template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb8-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "nodes": [{"name":"", "description": ""}],</span></span>
<span id="cb8-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "edges":[{ "source": "", "description": "", "target": "", "keywords": [] }]</span></span>
<span id="cb8-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">     </span></span>
<span id="cb8-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">}"""</span></span>
<span id="cb8-14"></span>
<span id="cb8-15">prediction <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> predict_NuExtract(model, tokenizer, [text], template)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb8-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(prediction)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{"nodes": [{"name": "site", "description": "comprehensive resource hub for all things graphs and our consulting services"}, {"name": "educational materials", "description": "rich collection of notebooks, articles, tutorials, and other resources"}, {"name": "code snippets", "description": "practical tricks, and tools designed to enhance your workflow"}, {"name": "reflections", "description": "on the evolution of graph consulting over the past 20 years"}, {"name": "research papers", "description": "experimental results that push the boundaries of what is possible with graphs"}], "edges": [{"source": "site", "description": "serves as a comprehensive resource hub", "target": "educational materials", "keywords": ["graphs", "consulting services", "resources"]}, {"source": "site", "description": "serves as a comprehensive resource hub", "target": "code snippets", "keywords": ["developers", "data scientists", "researchers"]}, {"source": "site", "description": "serves as a comprehensive resource hub", "target": "reflections", "keywords": ["evolution", "graph consulting", "industries"]}, {"source": "site", "description": "serves as a comprehensive resource hub", "target": "research papers", "keywords": ["graphs", "innovate", "dynamic area"]}]}
  </code></pre>
</div>
</div>
<p>NuExtract is not specifically designed to generate knowledge graphs, but a shallow test shows that it can. This is quite remarkable.</p>
<p>Beyond KG creation, it works well for collecting form-like information from arbitrary input. One could scan the generated JSON to check that all required information has been supplied (and ask again if necessary).</p>
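<p>Scanning the generated JSON for gaps could look like the sketch below. It assumes that a field still holding the template default (an empty string or 0) was not found in the text, which is a heuristic and would misreport a genuine value of 0. The function name <code>missing_fields</code> and the shortened sample strings are illustrative only.</p>

```python
import json


def missing_fields(raw, template):
    """Return (node name, field) pairs whose value still equals the template default.

    Hypothetical helper: it treats "unchanged from the template" as "not supplied".
    """
    graph = json.loads(raw)
    defaults = json.loads(template)["nodes"][0]
    gaps = []
    for node in graph.get("nodes", []):
        for field, default in defaults.items():
            if node.get(field, default) == default:
                gaps.append((node.get("name", "?"), field))
    return gaps


# Shortened versions of the template and output from the first example.
template = '{"nodes": [{"name": "", "age": 0}], "edges": []}'
prediction = '{"nodes": [{"name": "John", "age": 23}, {"name": "Mary", "age": 0}], "edges": []}'

print(missing_fields(prediction, template))
```

<p>A non-empty result is the signal to go back to the user, or to the source text, for the missing values.</p>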



 ]]></description>
  <category>GraphAI</category>
  <guid>https://discovery.graphsandnetworks.com/graphAI/nuextract.html</guid>
  <pubDate>Thu, 05 Dec 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/gears.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Nuextract v2</title>
  <link>https://discovery.graphsandnetworks.com/graphAI/nuextractV2.html</link>
  <description><![CDATA[ 





<p>Let’s take a look at v2 of the <a href="https://huggingface.co/numind" target="_blank">NuExtract</a> model, which we previously visited <a href="../graphAI/nuextract.html" target="_blank">here</a>. It’s a possible alternative for extracting knowledge graphs from text to feed into <a href="https://knwl.ai">knwl</a> or other knowledge graph building tools.</p>
<p>The uv pyproject.toml file should look like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode toml code-with-copy"><code class="sourceCode toml"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[project]</span></span>
<span id="cb1-2"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">name</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nuex"</span></span>
<span id="cb1-3"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">version</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0.1.0"</span></span>
<span id="cb1-4"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">description</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Add your description here"</span></span>
<span id="cb1-5"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">readme</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"README.md"</span></span>
<span id="cb1-6"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">requires-python</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&gt;=3.12"</span></span>
<span id="cb1-7"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">dependencies</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb1-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"accelerate&gt;=1.11.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"haystack-ai&gt;=2.19.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"jupyter&gt;=1.1.1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"networkx&gt;=3.5"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pillow&gt;=12.0.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-13">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pyvis&gt;=0.3.2"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-14">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"qwen-vl-utils&gt;=0.0.14"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-15">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"torch&gt;=2.9.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"torchvision&gt;=0.24.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"trafilatura&gt;=2.0.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"transformers&gt;=4.57.1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-19"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span></span></code></pre></div></div>
<p>There are three models (2B, 4B, 8B) available on Hugging Face and, from a little experimentation, the 2B model seems to give the best results. What makes this model so interesting is the speed at which it can extract structured data from text. If you follow the quick-start of <a href="https://knwl.ai">Knwl</a> you will see that it takes a lot longer to extract entities and relationships from text than NuExtract v2 does. NuExtract does the job in seconds, compared to minutes when using big models like GPT-5 or Claude 4.</p>
<div id="a561e38e-58e5-4f7e-80ee-eabc2d05e49a" class="cell" data-tags="[]" data-execution_count="18">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> AutoProcessor, AutoModelForImageTextToText</span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Model cards: https://huggingface.co/numind (2B, 4B and 8B variants; the 4B one is loaded here)</span></span>
<span id="cb2-6">model_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"numind/NuExtract-2.0-4B"</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"></span>
<span id="cb2-9">processor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> AutoProcessor.from_pretrained(model_name, </span>
<span id="cb2-10">                                          trust_remote_code<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, </span>
<span id="cb2-11">                                          padding_side<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'left'</span>,</span>
<span id="cb2-12">                                          use_fast<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb2-13">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> AutoModelForImageTextToText.from_pretrained(model_name, </span>
<span id="cb2-14">                                               trust_remote_code<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, </span>
<span id="cb2-15">                                               dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.bfloat16,                                        </span>
<span id="cb2-16">                                               device_map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)</code></pre>
</div>
</div>
<p>We take a short passage from the Wikipedia article on Charles Dickens and provide the LLM with some examples of the kind of structured data we want to extract. The <a href="https://huggingface.co/docs/transformers/main_classes/text_generation">Transformers parameters</a> are of particular interest here: after some experimentation we found that greedy decoding (do_sample=False, num_beams=1) seems to work best. That is, picking the most likely next token at each step seems to give better results than sampling from a distribution of possible next tokens.</p>
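<p>As a toy illustration of the difference between the two strategies (this is not how Transformers implements generation internally), the sketch below picks a next token from three made-up candidates. The vocabulary and scores are hypothetical; only the standard library is used.</p>

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sampled_pick(logits, rng):
    """Sampling: draw an index according to the softmax probabilities."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


vocab = ["}", "]", ","]      # hypothetical candidate tokens
logits = [2.0, 1.5, 0.1]     # hypothetical scores for the next token

rng = random.Random(0)       # fixed seed so repeated runs draw the same samples
print("greedy :", vocab[greedy_pick(logits)])
print("sampled:", [vocab[sampled_pick(logits, rng)] for _ in range(5)])
```

<p>Greedy decoding returns the same token every time, which suits structured output where a single stray token can break the JSON; sampling can occasionally pick a lower-ranked token.</p>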
<div id="146baa04-2206-4eca-86bd-66f94924e293" class="cell" data-tags="[]" data-execution_count="29">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"></span>
<span id="cb4-2">template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb4-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "output": {</span></span>
<span id="cb4-4"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        "entities": ["string"],</span></span>
<span id="cb4-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        "relationships": [["string", "string", "string"]]</span></span>
<span id="cb4-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    }</span></span>
<span id="cb4-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">}"""</span></span>
<span id="cb4-8">document <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""Charles John Huffam Dickens (7 February 1812 – 9 June 1870) was an English novelist, journalist, short story writer and social critic. He created some of literature's best-known fictional characters, and is regarded by many as the greatest novelist of the Victorian era. His works enjoyed unprecedented popularity during his lifetime and, by the 20th century, critics and scholars had recognised him as a literary genius. His novels and short stories are widely read today.</span></span>
<span id="cb4-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Born in Portsmouth, Dickens left school at age 12 to work in a boot-blacking factory when his father John was incarcerated in a debtors' prison. After three years, he returned to school before beginning his literary career as a journalist. Dickens edited a weekly journal for 20 years; wrote 15 novels, five novellas, hundreds of short stories and nonfiction articles; lectured and performed readings extensively; was a tireless letter writer; and campaigned vigorously for children's rights, education and other social reforms.</span></span>
<span id="cb4-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Dickens's literary success began with the 1836 serial publication of The Pickwick Papers, a publishing phenomenon—thanks largely to the introduction of the character Sam Weller in the fourth episode—that sparked Pickwick merchandise and spin-offs. Within a few years, Dickens had become an international literary celebrity, famous for his humour, satire and keen observation of character and society. His novels, most of them published in monthly or weekly instalments, pioneered the serial publication of narrative fiction, which became the dominant Victorian mode for novel publication. Cliffhanger endings in his serial publications kept readers in suspense. The instalment format allowed Dickens to evaluate his audience's reaction, and he often modified his plot and character development based on such feedback. For example, when his wife's chiropodist expressed distress at the way Miss Mowcher in David Copperfield seemed to reflect her own disabilities, Dickens improved the character with positive features. His plots were carefully constructed and he often wove elements from topical events into his narratives. Masses of the illiterate poor would individually pay a halfpenny to have each new monthly episode read to them, opening up and inspiring a new class of readers.</span></span>
<span id="cb4-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">His 1843 novella A Christmas Carol remains especially popular and continues to inspire adaptations in every creative medium. Oliver Twist and Great Expectations are also frequently adapted and, like many of his novels, evoke images of early Victorian London. His 1853 novel Bleak House, a satire on the judicial system, helped support a reformist movement that culminated in the 1870s legal reform in England. A Tale of Two Cities (1859; set in London and Paris) is regarded as his best-known work of historical fiction. The most famous celebrity of his era, he undertook, in response to public demand, a series of public reading tours in the later part of his career. The term Dickensian is used to describe something that is reminiscent of Dickens and his writings, such as poor social or working conditions, or comically repulsive characters."""</span></span>
<span id="cb4-12"></span>
<span id="cb4-13">examples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb4-14">    {</span>
<span id="cb4-15">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Stephen is the manager of Anna. Stephen works in Belgium."</span>,</span>
<span id="cb4-16">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"output"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb4-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">            "entities": ["Stephen", "Anna", "Belgium"],              </span></span>
<span id="cb4-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">            "relationships": [["Stephen", "manager_of", "Anna"], </span></span>
<span id="cb4-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                              ["Stephen", "works_in", "Belgium"]]</span></span>
<span id="cb4-20"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        }"""</span></span>
<span id="cb4-21">    },</span>
<span id="cb4-22">    {</span>
<span id="cb4-23">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."</span>,</span>
<span id="cb4-24">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"output"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""{</span></span>
<span id="cb4-25"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">            "entities": ["Google", "Larry Page", "Sergey Brin", "Stanford University"],              </span></span>
<span id="cb4-26"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">            "relationships": [["Larry Page", "co_founder_of", "Google"], </span></span>
<span id="cb4-27"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                              ["Sergey Brin", "co_founder_of", "Google"],</span></span>
<span id="cb4-28"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                              ["Larry Page", "student_at", "Stanford University"],</span></span>
<span id="cb4-29"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                              ["Sergey Brin", "student_at", "Stanford University"]]</span></span>
<span id="cb4-30"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        }"""</span></span>
<span id="cb4-31">    }</span>
<span id="cb4-32">    </span>
<span id="cb4-33">]</span>
<span id="cb4-34">messages <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: document}]</span>
<span id="cb4-35">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> processor.tokenizer.apply_chat_template(</span>
<span id="cb4-36">    messages,</span>
<span id="cb4-37">    template<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template,</span>
<span id="cb4-38">    examples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>examples, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># examples provided here</span></span>
<span id="cb4-39">    tokenize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb4-40">    add_generation_prompt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb4-41">)</span>
<span id="cb4-42"></span>
<span id="cb4-43"></span>
<span id="cb4-44">inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> processor(</span>
<span id="cb4-45">    text<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[text],</span>
<span id="cb4-46">    images<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,</span>
<span id="cb4-47">    padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb4-48">    return_tensors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pt"</span>,</span>
<span id="cb4-49">)</span>
<span id="cb4-50">inputs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> inputs.to(model.device)</span>
<span id="cb4-51"></span>
<span id="cb4-52">start_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb4-53"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://huggingface.co/docs/transformers/main_classes/text_generation</span></span>
<span id="cb4-54">generated_ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.generate(</span>
<span id="cb4-55">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>inputs,</span>
<span id="cb4-56">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>{</span>
<span id="cb4-57">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"do_sample"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># False means greedy decoding</span></span>
<span id="cb4-58">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"num_beams"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, </span>
<span id="cb4-59">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"max_new_tokens"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4096</span>, </span>
<span id="cb4-60">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"temperature"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, </span>
<span id="cb4-61">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"top_p"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, </span>
<span id="cb4-62">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"top_k"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,</span>
<span id="cb4-63">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># "max_time": 120,</span></span>
<span id="cb4-64">}</span>
<span id="cb4-65">)</span>
<span id="cb4-66">end_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb4-67">generated_ids_trimmed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb4-68">    out_ids[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(in_ids) :] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> in_ids, out_ids <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(inputs.input_ids, generated_ids)</span>
<span id="cb4-69">]</span>
<span id="cb4-70">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> processor.batch_decode(</span>
<span id="cb4-71">    generated_ids_trimmed, skip_special_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, clean_up_tokenization_spaces<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb4-72">)</span>
<span id="cb4-73"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb4-74">    g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> json.loads(result[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb4-75">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(json.dumps(g, indent<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb4-76">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Found </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(g[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'entities'</span>]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'entities'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> g <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> entities and </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(g[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'relationships'</span>]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'relationships'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> g <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> relationships in </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>(end_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_time)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> seconds."</span>)</span>
<span id="cb4-77"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> json.JSONDecodeError <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> e:</span>
<span id="cb4-78">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bad format JSON output:"</span>, result[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{
    "entities": [
        "Charles John Huffam Dickens",
        "Sam Weller",
        "Miss Mowcher",
        "David Copperfield",
        "A Christmas Carol",
        "Oliver Twist",
        "Great Expectations",
        "Bleak House",
        "A Tale of Two Cities",
        "Victorian era",
        "Victorian London",
        "legal reform in England"
    ],
    "relationships": [
        [
            "Charles John Huffam Dickens",
            "born_in",
            "Portsmouth"
        ],
        [
            "Charles John Huffam Dickens",
            "created",
            "Sam Weller"
        ],
        [
            "Charles John Huffam Dickens",
            "wrote",
            "A Christmas Carol"
        ],
        [
            "Charles John Huffam Dickens",
            "adapted",
            "Oliver Twist"
        ],
        [
            "Charles John Huffam Dickens",
            "adapted",
            "Great Expectations"
        ],
        [
            "Charles John Huffam Dickens",
            "adapted",
            "Bleak House"
        ],
        [
            "Charles John Huffam Dickens",
            "adapted",
            "A Tale of Two Cities"
        ],
        [
            "Charles John Huffam Dickens",
            "published",
            "The Pickwick Papers"
        ],
        [
            "Charles John Huffam Dickens",
            "lectured",
            "public reading tours"
        ]
    ]
}
Found 12 entities and 9 relationships in 16.8 seconds.</code></pre>
</div>
</div>
<p>Doing this in under 17 seconds on a Mac M4 Pro is quite impressive. A bit of NetworkX and PyVis magic at the end visualizes the extracted knowledge graph.</p>
<div id="7608b846" class="cell" data-execution_count="30">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a new graph</span></span>
<span id="cb6-4">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.Graph()</span>
<span id="cb6-5"></span>
<span id="cb6-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add nodes and edges</span></span>
<span id="cb6-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> e <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> g[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"entities"</span>]:</span>
<span id="cb6-8">    </span>
<span id="cb6-9">    G.add_node(e, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>e)</span>
<span id="cb6-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> r <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> g[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"relationships"</span>]:</span>
<span id="cb6-11">    from_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb6-12">    relation_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb6-13">    to_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span>
<span id="cb6-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> from_node <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes() <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> to_node <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes():</span>
<span id="cb6-15">        G.add_edge(from_node, to_node, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>relation_type)</span></code></pre></div></div>
</div>
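<p>One thing to keep in mind with the cell above: the guard <code>if from_node in G.nodes() and to_node in G.nodes()</code> silently skips every triple whose endpoints are not both listed under <code>entities</code>. In the output shown earlier this affects targets such as "Portsmouth" and "The Pickwick Papers". The following is a minimal sketch in plain Python (no NetworkX needed) that makes the skipped triples visible. The dictionary is an abbreviated copy of the model output and the helper name <code>split_relationships</code> is purely illustrative.</p>

```python
# Abbreviated copy of the model output shown above (same shape: entities + triples).
g = {
    "entities": [
        "Charles John Huffam Dickens",
        "Sam Weller",
        "A Christmas Carol",
        "Oliver Twist",
    ],
    "relationships": [
        ["Charles John Huffam Dickens", "born_in", "Portsmouth"],
        ["Charles John Huffam Dickens", "created", "Sam Weller"],
        ["Charles John Huffam Dickens", "wrote", "A Christmas Carol"],
        ["Charles John Huffam Dickens", "adapted", "Oliver Twist"],
        ["Charles John Huffam Dickens", "published", "The Pickwick Papers"],
    ],
}


def split_relationships(graph):
    """Separate triples whose endpoints are both listed entities from the rest."""
    known = set(graph["entities"])
    kept, dropped = [], []
    for source, relation, target in graph["relationships"]:
        if source in known and target in known:
            kept.append((source, relation, target))
        else:
            dropped.append((source, relation, target))
    return kept, dropped


kept, dropped = split_relationships(g)
print(f"{len(kept)} edges kept, {len(dropped)} triples skipped")
for source, relation, target in dropped:
    print(f"skipped: ({source}, {relation}, {target})")
```

<p>If the skipped triples matter, one option is to add the missing endpoints as nodes before adding the edge. Whether that is desirable depends on how much you trust the extracted relationships.</p>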
<div id="49181f19" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyvis.network <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Network</span>
<span id="cb7-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> IPython.display <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> display, HTML</span>
<span id="cb7-3"></span>
<span id="cb7-4"></span>
<span id="cb7-5">net <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Network(notebook<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, cdn_resources<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'in_line'</span>)</span>
<span id="cb7-6">net.from_nx(G)</span>
<span id="cb7-7"></span>
<span id="cb7-8">net.show(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Charles_Dickens.html'</span>)</span>
<span id="cb7-9">display(HTML(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Charles_Dickens.html'</span>))</span></code></pre></div></div>
</div>
<p>Resulting in something like this:</p>
<p><img src="https://discovery.graphsandnetworks.com/images/NuextractV2.png" class="img-fluid"></p>
<p>The v2 is definitely a step up from the previous version and a very interesting model to extract structured data from text very quickly. When the structure becomes more complex however, it seems to struggle a bit more. But for simple entity and relationship extraction tasks, it’s a great tool to have in your toolbox.</p>



 ]]></description>
  <category>GraphAI</category>
  <guid>https://discovery.graphsandnetworks.com/graphAI/nuextractV2.html</guid>
  <pubDate>Thu, 05 Dec 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/gears.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Graph Laplacian</title>
  <link>https://discovery.graphsandnetworks.com/graphAnalytics/graphLaplacian.html</link>
  <description><![CDATA[ 





<p>The following is a classic setup in physics with springs attached to gliders. The vertical position is continuous but the horizontal position is fixed. This dynamical system can be described by simple Newtonian mechanics.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/graph-laplacian.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>If <img src="https://latex.codecogs.com/png.latex?x_i"> is the vertical position of each glider we can form a vector <img src="https://latex.codecogs.com/png.latex?%5Cvec%7Bx%7D%20=%20(x_1,%20x_2%5Cldots)"> and if every spring has Hooke constant <img src="https://latex.codecogs.com/png.latex?k"> the system can be described as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Am%5Cfrac%7Bd%5E2%20x_i%7D%7Bdt%5E2%7D%20=%20-k%20(x_i-x_%7Bi+1%7D)%20-k(x_i-x_%7Bi-1%7D)%0A"> If we divide by the mass <img src="https://latex.codecogs.com/png.latex?m"> and write <img src="https://latex.codecogs.com/png.latex?%5Comega%20=%20k/m"> (or simply assume unit mass), this simplifies to:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5E2%20x_i%7D%7Bdt%5E2%7D%20=%20-%5Comega%5C,(-x_%7Bi-1%7D%20+2x_i%20-x_%7Bi+1%7D)%0A"> Or equivalently</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5E2%20%5Cvec%7Bx%7D%7D%7Bdt%5E2%7D%20=%20-%5Comega%5C,%5Cunderset%7B%5CDelta%7D%7B%5Cunderbrace%7B%5Cbegin%7Bbmatrix%7D1&amp;-1&amp;&amp;&amp;&amp;&amp;%5C%5C-1&amp;2&amp;-1&amp;&amp;&amp;&amp;%5C%5C&amp;-1&amp;2&amp;-1&amp;&amp;&amp;%5C%5C&amp;&amp;&amp;%5Cddots%5C%5C&amp;&amp;&amp;-1&amp;2&amp;-1%20%5C%5C&amp;&amp;&amp;&amp;-1&amp;1%5Cend%7Bbmatrix%7D%7D%7D%5Cvec%7Bx%7D%0A"> If we define <img src="https://latex.codecogs.com/png.latex?D%20=%20%5Cmathrm%7Bdiag%7D(d_1,%20d_2%5Cldots)"> as the diagonal matrix of vertex degrees and <img src="https://latex.codecogs.com/png.latex?A"> as the adjacency matrix we get <img src="https://latex.codecogs.com/png.latex?%5CDelta%20=%20D%20-%20A"> and this is called the Laplacian of the graph. Which graph? Well, the undirected line graph</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/line-graph.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>with adjacency matrix</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA%20=%20%5Cbegin%7Bbmatrix%7D0&amp;1&amp;&amp;&amp;&amp;&amp;%5C%5C1&amp;0&amp;1&amp;&amp;&amp;&amp;%5C%5C&amp;1&amp;0&amp;1&amp;&amp;&amp;%5C%5C&amp;&amp;&amp;%5Cddots%5C%5C&amp;&amp;&amp;1&amp;0&amp;1%20%5C%5C&amp;&amp;&amp;&amp;1&amp;0%5Cend%7Bbmatrix%7D%0A"></p>
<p>Calling the matrix above the Laplacian is based on the similarity of the equation with the continuous case. In classic field theory this would lead to <img src="https://latex.codecogs.com/png.latex?%5CBox%5C,%20%5Cphi=%20(%5Cpartial%5E2/%5Cpartial%20t%5E2%20-%20%5Cpartial%5E2/%5Cpartial%20x%5E2)%5Cvec%7B%5Cphi%7D%20=%200."></p>
<p>This graph Laplacian is derived on the basis of a line graph but the definition generalizes to arbitrary graphs.</p>
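<p>To make the generalization concrete, here is a minimal sketch in plain Python that builds the Laplacian as degree matrix minus adjacency matrix from an edge list. The function name <code>laplacian</code> and the integer vertex labels are assumptions made for this illustration. For the line graph it reproduces the tridiagonal matrix of the spring system, and the same code handles any undirected simple graph.</p>

```python
def laplacian(n, edges):
    """Return the dense Laplacian D - A of an undirected simple graph on n vertices."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    degrees = [sum(row) for row in A]
    return [
        [(degrees[i] if i == j else 0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]


# Line graph on five vertices: 0 - 1 - 2 - 3 - 4
path_edges = [(i, i + 1) for i in range(4)]
L_path = laplacian(5, path_edges)
for row in L_path:
    print(row)

# The same function works for any edge list, e.g. a triangle with a pendant vertex.
L_other = laplacian(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
for row in L_other:
    print(row)
```

<p>Every row sums to zero in both cases, a property that is used further on.</p>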
<p>The graph Laplacian can also be obtained by looking at the discrete change of a scalar field on a graph <img src="https://latex.codecogs.com/png.latex?f"> (i.e.&nbsp;every vertex <img src="https://latex.codecogs.com/png.latex?i"> has value <img src="https://latex.codecogs.com/png.latex?f_i">) and defining</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20x_%7Bij%7D%7D%20=%20f_j%20-%20f_i%0A"> That is, we take the difference in a direction at a vertex provided there is an edge. Summing this gradient in all directions leads to:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Csum_j%7B%5Cfrac%7B%5Cpartial%20f%7D%7B%5Cpartial%20x_%7Bij%7D%7D%7D%20=%20%5Csum_j%20f_j%20-%5Csum_j%20f_i%20=%20(Af-Df)_i%20=%20-(%5CDelta%20f)_i%0A"> Up to a sign, the total derivative at a vertex is the graph Laplacian. This is somewhat difficult to translate to the continuous case since the sum <img src="https://latex.codecogs.com/png.latex?%5Cpartial/%5Cpartial%20x%20+%20%5Cpartial/%5Cpartial%20y%20+%20%5Cpartial/%5Cpartial%20z"> is a first order differential operation while the continuous Laplacian is a second order differential. In this sense, the graph Laplacian measures change and not acceleration.</p>
<p>If we take the definition of the graph Laplacian like this we can write down the diffusion on graphs:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Cvec%7Bu%7D%7D%7Bdt%7D%20=%20-c%5C,%20%5CDelta%20%5Cvec%7Bu%7D%0A"> with <img src="https://latex.codecogs.com/png.latex?c"> some arbitrary diffusion constant. Let <img src="https://latex.codecogs.com/png.latex?%5Clambda_1,%20%5Clambda_2%5Cldots"> be the eigenvalues of the Laplacian and <img src="https://latex.codecogs.com/png.latex?%5Cvec%7Bv%7D_1,%20%5Cvec%7Bv%7D_2%5Cldots"> the corresponding eigenvectors:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20%5Cvec%7Bv%7D_i%20=%20%5Clambda_i%5C,%5Cvec%7Bv%7D_i%0A"> We can decompose any (time-dependent) solution as <img src="https://latex.codecogs.com/png.latex?%5Cvec%7Bu%7D(t)%20=%20%5Csum_i%20a_i(t)%20%5Cvec%7Bv%7D_i">. Substituting this into the diffusion equation and comparing the coefficients of each eigenvector gives <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%20a_i(t)%7D%7Bdt%7D%20+c%5C,%5Clambda_i%20a_i(t)%20=%200.%0A"> leading to</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aa_i(t)%20=%20a_i(0)%20%5C,e%5E%7B%20-%20c%5C,%5Clambda_i%20%5C,%20t%7D.%0A"> This is called the spectral solution of the diffusion equation.</p>
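<p>As a small worked example, consider the simplest possible graph: two vertices joined by one edge. Its Laplacian has eigenvalue 0 with eigenvector (1, 1) and eigenvalue 2 with eigenvector (1, -1), so the spectral solution can be written out by hand. The sketch below, in plain Python with illustrative function names, compares that closed form with a brute-force numerical integration. The coefficients are obtained by projecting the initial state on the two eigenvectors.</p>

```python
import math


def spectral_solution(u0, c, t):
    """Closed-form diffusion on the two-vertex graph with a single edge.

    The Laplacian [[1, -1], [-1, 1]] has eigenpairs
    (0, (1, 1)) and (2, (1, -1)).
    """
    a1 = (u0[0] + u0[1]) / 2  # coefficient of (1, 1), constant in time
    a2 = (u0[0] - u0[1]) / 2  # coefficient of (1, -1), decays with rate 2c
    decay = math.exp(-c * 2 * t)
    return [a1 + a2 * decay, a1 - a2 * decay]


def euler_solution(u0, c, t, steps=100_000):
    """Integrate du/dt = -c * Laplacian * u with small explicit Euler steps."""
    u = list(u0)
    dt = t / steps
    for _ in range(steps):
        flow = c * (u[0] - u[1]) * dt  # amount moving from vertex 0 to vertex 1
        u = [u[0] - flow, u[1] + flow]
    return u


u0 = [1.0, 0.0]
exact = spectral_solution(u0, c=0.5, t=3.0)
numeric = euler_solution(u0, c=0.5, t=3.0)
print(exact)
print(numeric)
```

<p>Both approaches agree up to the discretization error of the Euler scheme, and the total amount on the two vertices stays constant.</p>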
<p>In the spring system above you can see that the sum of each row in the Laplacian is zero. This holds in general, simply because of the definition of the degree and of the adjacency matrix. This implies that the Laplacian is singular. Why? Every row summing to zero can be expressed as <img src="https://latex.codecogs.com/png.latex?%5CDelta.%5Cmathbb%7B1%7D=0"> with <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7B1%7D%20=%20(1,1%5Cldots)">. An inverse cannot exist, since it would lead to the contradiction <img src="https://latex.codecogs.com/png.latex?%5CDelta%5E%7B-1%7D.%5CDelta.%5Cmathbb%7B1%7D%20=%20%5Cmathbb%7B1%7D=0."></p>
<p>Since the Laplacian is singular it necessarily has at least one zero eigenvalue and we can rearrange things to assume that <img src="https://latex.codecogs.com/png.latex?%5Clambda_1%20=%200."> The corresponding eigenvector consists of all 1’s since this is the sum of the rows leading to zeros:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta.%5Cmathbb%7B1%7D%20=%20%5Clambda_1%5C,%20%5Cmathbb%7B1%7D%20=%200.%0A"> If <img src="https://latex.codecogs.com/png.latex?B"> is the incidence matrix of an orientation of <img src="https://latex.codecogs.com/png.latex?G">, then <img src="https://latex.codecogs.com/png.latex?%5CDelta=BB%5ET">. So</p>
<p><img src="https://latex.codecogs.com/png.latex?%20x%5ET%5CDelta%20x%20=%20%5C%7CB%5ETx%5C%7C%5E2%20%5Cge%200"> for all <img src="https://latex.codecogs.com/png.latex?x">. The incidence matrix has rows indexed by the vertices, columns by the edges, and the <img src="https://latex.codecogs.com/png.latex?ij">-entry is 1 if the <img src="https://latex.codecogs.com/png.latex?i">-th vertex is the head of the <img src="https://latex.codecogs.com/png.latex?j">-th edge, <img src="https://latex.codecogs.com/png.latex?-1"> if it is the tail and 0 otherwise. The eigenvalues are hence all non-negative. This fits with the expectation that the diffusion equation should, well, diffuse and not explode. Likewise, if a disease were spreading on a graph, the spreading should decrease exponentially over time.</p>
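<p>The incidence-matrix factorization says that the quadratic form of the Laplacian is a sum of squared differences across the edges, which can never be negative. A minimal sketch in plain Python to check this numerically on a small graph follows. The function names and the example graph are chosen for illustration only.</p>

```python
def quadratic_form(n, edges, x):
    """Compute x^T (D - A) x directly from the matrix definition."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    total = 0
    for i in range(n):
        degree = sum(A[i])
        total += degree * x[i] * x[i]
        for j in range(n):
            total -= A[i][j] * x[i] * x[j]
    return total


def edge_sum(edges, x):
    """Sum of squared differences across edges; independent of edge orientation."""
    return sum((x[u] - x[v]) ** 2 for u, v in edges)


# A triangle with one pendant vertex.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
for x in ([1, 1, 1, 1], [3, -2, 0, 5], [0, 1, 0, -1]):
    print(x, quadratic_form(4, edges, x), edge_sum(edges, x))
```

<p>The two numbers coincide for every vector, and the constant vector gives zero, in line with the zero eigenvalue found above.</p>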
<p>An interesting property of the Laplacian eigenvalues is the following: <strong>the multiplicity of the zero eigenvalue equals the number of connected components</strong>. <a href="https://math.uchicago.edu/~may/REU2022/REUPapers/Li,Hanchen.pdf">See here for a proof</a>, but it’s available in many textbooks as well.</p>
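<p>This property can be checked without any eigenvalue solver. Because the Laplacian is symmetric, the multiplicity of its zero eigenvalue equals the dimension of its null space, that is, the number of vertices minus the rank. The following sketch in plain Python computes the rank with exact rational arithmetic and compares the result with a direct traversal of the graph. All function names and the example graph are assumptions for the sake of illustration.</p>

```python
from fractions import Fraction


def laplacian(n, edges):
    """Dense Laplacian D - A of an undirected simple graph on n vertices."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][v] -= 1
        L[v][u] -= 1
        L[u][u] += 1
        L[v][v] += 1
    return L


def rank(matrix):
    """Rank via exact Gaussian elimination on rational numbers."""
    rows = [[Fraction(value) for value in row] for row in matrix]
    rank_count = 0
    for col in range(len(rows[0])):
        pivot = next(
            (r for r in range(rank_count, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        for r in range(rank_count + 1, len(rows)):
            factor = rows[r][col] / rows[rank_count][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank_count])]
        rank_count += 1
    return rank_count


def count_components(n, edges):
    """Count connected components with a depth-first traversal."""
    neighbours = {i: set() for i in range(n)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, components = set(), 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)
    return components


# Six vertices: a path 0-1-2, a single edge 3-4 and the isolated vertex 5.
n, edges = 6, [(0, 1), (1, 2), (3, 4)]
zero_multiplicity = n - rank(laplacian(n, edges))
print(zero_multiplicity, count_components(n, edges))
```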
<p>Coming back to the diffusion equation, the initial state <img src="https://latex.codecogs.com/png.latex?%5Cvec%7Bu%7D(0)"> can be used to express the coefficients:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aa_i(0)%20=%20%5Cfrac%7B%5Cvec%7Bu%7D(0).%5Cvec%7Bv%7D_i%7D%7B%5Cvec%7Bv%7D_i.%5Cvec%7Bv%7D_i%7D%0A"> If we look at the asymptotic solution of the diffusion equation, knowing that for a connected graph all eigenvalues except <img src="https://latex.codecogs.com/png.latex?%5Clambda_1"> are positive, we can conclude that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvec%7Bu%7D_%5Cinfty%20=%20a_1(0)%20%5C,%20%5Cvec%7Bv%7D_1%0A"> and since the first eigenvector consists of 1’s only:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvec%7Bu%7D_%5Cinfty%20=%20%5Cfrac%7Bu_1(0)+%5Cldots+u_n(0)%7D%7Bn%7D%5C,%5Cmathbb%7B1%7D%0A"> meaning that all the nodes in the graph get the average of the initial state. The diffusion leads to equilibrium across the graph. This corresponds to the intuition in the continuous case that a droplet of ink in water diffuses equally across the fluid after a long time.</p>
<p>Note that the asymptotic solution is also the solution to <img src="https://latex.codecogs.com/png.latex?%5CDelta%20%5Cvec%7Bu%7D_%5Cinfty%20%20=0."></p>
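<p>The convergence to the average is easy to observe numerically. Below is a minimal sketch in plain Python that integrates the diffusion equation on a five-vertex line graph with explicit Euler steps. The step size, the number of steps and the function name <code>diffuse</code> are choices made for this illustration, not part of the derivation above.</p>

```python
def diffuse(u0, edges, c=1.0, dt=0.01, steps=5000):
    """Explicit Euler integration of du/dt = -c * (D - A) u on an undirected graph."""
    u = list(u0)
    for _ in range(steps):
        change = [0.0] * len(u)
        for a, b in edges:
            flow = c * (u[a] - u[b]) * dt  # amount moving from a to b
            change[a] -= flow
            change[b] += flow
        u = [value + delta for value, delta in zip(u, change)]
    return u


path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
initial = [10.0, 0.0, 0.0, 0.0, 0.0]  # all the "ink" starts on one vertex
final = diffuse(initial, path_edges)

print([round(value, 4) for value in final])
print(sum(initial) / len(initial))
```

<p>Each Euler step only moves value along edges, so the total is conserved, and after a long enough time every vertex ends up close to the initial average.</p>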
<section id="networkx-exploration" class="level2">
<h2 class="anchored" data-anchor-id="networkx-exploration">NetworkX Exploration</h2>
<p>The above can be easily explored in NetworkX. Let’s take the <a href="https://en.wikipedia.org/wiki/Zachary%27s_karate_club" target="_blank">classic Karate Club graph</a>:</p>
<div id="1fc63105" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"></span>
<span id="cb1-4">g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.karate_club_graph()</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the adjacency matrix</span></span>
<span id="cb1-7">A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.adjacency_matrix(g)</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the Laplacian</span></span>
<span id="cb1-10">Delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.laplacian_matrix(g)</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># from the formula we can get the degree matrix</span></span>
<span id="cb1-13">D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A.toarray() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Delta.toarray()</span>
<span id="cb1-14"></span>
<span id="cb1-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to confirm, you can take the degrees explicitly</span></span>
<span id="cb1-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(np.diag(D))</span>
<span id="cb1-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>([nx.degree(g, u) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> g.nodes])</span></code></pre></div></div>
</div>
<pre><code>[16  9 10  6 ... 17]
[16  9 10  6 ... 17]</code></pre>
<p>The eigenvalues can be obtained via the <code>laplacian_spectrum</code> method:</p>
<div id="50e43c27" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">eigenvalues <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.laplacian_spectrum(g)</span>
<span id="cb3-2">np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(eigenvalues,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div></div>
</div>
<pre><code>array([-0.  ,  0.47,  0.91,  1.13,  1.26,  1.6 ,  1.76,  1.83,  1.96,
    2.  ,  2.  ,  2.  ,  2.  ,  2.  ,  2.49,  2.75,  3.01,  3.24,
    3.38,  3.38,  3.47,  4.28,  4.48,  4.58,  5.38,  5.62,  6.33,
    6.52,  7.  ,  9.78, 10.92, 13.31, 17.06, 18.14])</code></pre>
<p>You can see that the first eigenvalue is indeed zero and that it has multiplicity one, meaning that the graph is connected. The Karate Club has one (weakly) connected component.</p>
<p>The eigenvectors are not directly available in NetworkX but you can use SciPy for this:</p>
<div id="725e4902" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.linalg <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> eigh</span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Calculate the eigenvalues and eigenvectors of the Laplacian</span></span>
<span id="cb5-3">eigenvalues, eigenvectors <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eigh(Delta.toarray())</span>
<span id="cb5-4">np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(eigenvectors, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
</div>
<pre><code>array([[-0.2, -0.1, -0.1, ...,  0.1,  0.9, -0.2],
   [-0.2, -0. , -0.1, ..., -0.1, -0.1, -0. ],
   [-0.2,  0. , -0. , ...,  0.3, -0.1, -0. ],
   ...,
   [-0.2,  0.1,  0. , ...,  0.1, -0.1,  0.1],
   [-0.2,  0.1,  0. , ..., -0.9,  0.1,  0.1],
   [-0.2,  0.1,  0. , ..., -0. , -0.2, -0.9]])</code></pre>
<p>Note that the first column is the all-ones vector (up to an irrelevant scaling factor) corresponding to the zero eigenvalue.</p>


</section>

 ]]></description>
  <category>Graphs</category>
  <guid>https://discovery.graphsandnetworks.com/graphAnalytics/graphLaplacian.html</guid>
  <pubDate>Sun, 01 Dec 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/graph-laplacian.png" medium="image" type="image/png" height="107" width="144"/>
</item>
<item>
  <title>Graph RAG</title>
  <link>https://discovery.graphsandnetworks.com/graphAI/graphRAG.html</link>
  <description><![CDATA[ 





<p>Over the past two years, I’ve explored Retrieval-Augmented Generation (RAG) and its intersection with graph-based systems. While the field of AI has seen dramatic advancements during this time, certain foundational principles remain relevant and stable. This article highlights those principles and explains how to build a robust graph RAG solution, emphasizing clarity and accessibility over technical depth. Although including code could simplify some explanations, I’ve deliberately avoided it to keep the content approachable for non-technical readers.</p>
<p><strong>Keeping the Focus Broad and Flexible</strong></p>
<p>The ecosystem surrounding RAG is vast, with a multitude of frameworks, tools, and databases. My aim is to provide a conceptual understanding rather than focusing on specific tools or implementations. For example, knowledge graphs can be stored in numerous formats—graph databases, relational databases, or even document stores. Similarly, frameworks like LangChain, LlamaIndex, or Crew AI simplify agent-based implementations but aren’t essential to understanding the core principles. I encourage practitioners to begin without relying heavily on frameworks to grasp the finer details. Frameworks are powerful but can obscure the underlying mechanisms.</p>
<p>While the implementation of RAG relies on a language model (LLM), the specifics of the LLM used—whether it’s running locally (e.g., Ollama) or via a hosted service—are secondary to the conceptual structure. At the time of writing, I favor models like Qwen 2.5, but the landscape evolves rapidly. In a year, we may see models capable of rendering knowledge graphs dynamically and reasoning over them in real time.</p>
<section id="introduction-the-challenge-of-limited-context-in-llms" class="level2">
<h2 class="anchored" data-anchor-id="introduction-the-challenge-of-limited-context-in-llms">Introduction: The Challenge of Limited Context in LLMs</h2>
<p>At its core, the issue we address with RAG stems from the limitations of LLMs. These models are trained on a fixed corpus of data at a specific point in time. This creates two major constraints:</p>
<ol type="1">
<li>Data Scope: LLMs cannot access proprietary, specialized, or private data unless explicitly provided.</li>
<li>Time Sensitivity: Models become outdated as new information emerges.</li>
</ol>
<p>The good news is that LLMs are designed to incorporate additional information when it’s included in a prompt. This is the essence of Retrieval-Augmented Generation (RAG). By retrieving and supplying relevant context to the LLM during inference, companies can utilize LLMs with their private or proprietary data without retraining or exposing sensitive information.</p>
<p>However, RAG introduces its own challenges:</p>
<ul>
<li><strong>Token Limits</strong>: LLMs have finite prompt sizes, limiting the amount of contextual information you can provide.</li>
<li><strong>Context Relevance</strong>: How do you identify and retrieve the most relevant information to include in the prompt?</li>
</ul>
<p>These challenges divide the problem into two main areas:</p>
<ol type="1">
<li><strong>Retrieval</strong>: Identifying the data relevant to a query.</li>
<li><strong>Prompting</strong>: Formatting the retrieved data to maximize LLM performance.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-challenges.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p><strong>Prompt engineering</strong> has emerged as a distinct discipline, focused on crafting input formulations that elicit accurate and predictable responses from LLMs. This process is inherently model-specific, with each LLM exhibiting unique characteristics that render universal prompting strategies ineffective. Consequently, efforts to codify and standardize prompt engineering have been undertaken (see e.g.&nbsp;<a href="https://github.com/stanfordnlp/dspy">DSPy</a>), albeit with varying degrees of success. Notwithstanding these challenges, the efficacy of well-crafted prompts in yielding high-quality responses remains a consistent truth, regardless of the context from which they are derived.</p>
<p>How to retrieve (business) data relevant to a question is where knowledge graphs enter the picture. The problem is in fact independent of LLMs: given a question, how do you query a repository and retrieve the information relevant to that question? Systems like Confluence, SharePoint and others index documents and, much like a table of contents or the index in a book, return a list based on indexing and keyword matches. Frameworks like <a href="https://lucene.apache.org">Apache Lucene</a> have been integrated in countless systems to this end. This keyword-based approach dates from the 20th century and has largely been superseded by vectors and embeddings.</p>
<p>The advent of machine learning has introduced a paradigm shift in natural language processing (NLP), predicated upon the concept of utilizing vectors to encapsulate linguistic information. Traditional NLP techniques such as <strong>chunking and tokenizing</strong>, which dissect text into constituent parts, remain relevant within the context of large language models.</p>
<p>In contrast, <strong>the notion of text vectors represents a novel approach to NLP</strong>, wherein the focus shifts from analytical examination of syntax, grammatical structure, coreference, and other linguistic phenomena to a more holistic treatment of language as a classification and feature engineering problem. By creating comprehensive vectors for words, this methodology ensures that pertinent information is preserved. The underlying principle is that text can be regarded as a numerical construct, analogous to an image being represented by a sequence of RGB values. Furthermore, the resultant vectors exhibit a degree of independence from their original textual context, capturing the essence of language in a manner akin to how children absorb and later specialize in linguistic terminology throughout their lives.</p>
<p>The process of converting linguistic inputs into vector representations enables facile comparison and analysis. This methodology allows for the identification of semantic relationships between words, such as those linking “child” and “parents”, and facilitates the quantification of their proximity. In contrast, direct comparisons of words are often impractical due to their inherent complexity. Prior to the advent of LLMs, knowledge graphs and bases were constructed by domain experts to establish connections between concepts. However, this approach was labor-intensive and prone to errors, as it relied on manual curation. The development of biochemical and oncological knowledge bases exemplifies this challenge, requiring painstaking effort from human experts. These knowledge bases, while valuable, are often difficult to query and maintain due to their complexity. In response, a distinct academic discipline emerged, encompassing <strong>knowledge management, ontologies, SPARQL, and SHACL</strong>. Notably, the advent of LLM-based approaches has largely supplanted these traditional methodologies, offering a more efficient and effective means of information representation and retrieval.</p>
<p>The advent of vectorized knowledge enables the comparison of semantic representations, thereby facilitating the identification of related concepts and entities. This approach obviates the need for explicit knowledge graphs or indexing systems, instead leveraging the inherent relationships between vectors to reveal connections.</p>
<p>For instance, when applied to a dataset containing information on climate change, this methodology can uncover associations between CO2 emissions and sea level rise, among other topics, without the necessity of manual graph construction. This technique underlies the transformative potential of artificial intelligence, empowering innovative applications and insights.</p>
<p>With vectorized knowledge in place, it becomes straightforward to extract relevant context from an information base to support an LLM. The process involves:</p>
<ol type="1">
<li>Utilizing a vast collection of vectors representing various pieces of information.</li>
<li>Generating a vector for the query or question at hand.</li>
<li>Comparing the query vector with the existing information-based vectors.</li>
<li>Selecting a set of similar words, paragraphs, or sentences, commonly referred to as k-nearest vectors/words/paragraphs.</li>
<li>Incorporating these contextual elements into the LLM prompt to inform and enhance its responses.</li>
</ol>
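<p>The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production recipe: the <code>embed</code> function is a toy bag-of-words counter standing in for a real embedding model, and the names <code>retrieve</code> and <code>build_prompt</code> are hypothetical helpers introduced only for this example.</p>

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model here.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Steps 2-4: embed the query, compare with every chunk, keep the k nearest.
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    # Step 5: place the retrieved context in front of the question.
    lines = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{lines}\n\nQuestion: {query}"


chunks = [
    "CO2 emissions rose sharply during the last century",
    "Sea level rise threatens coastal cities",
    "Apple pie is a popular dessert",
]
question = "what drives sea level rise"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

<p>In practice the chunk vectors are computed once and kept in a vector store, so only the query needs to be embedded at question time.</p>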
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-enhance.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>Transcending traditional NLP methodologies involves seamlessly transitioning between text and vectors, thereby circumventing the need for programmatic or analytical processing. This approach enables rapid and accurate information retrieval. The technical complexities associated with this paradigm are multifaceted; however, they can be effectively managed through a simple download and minimal coding requirements, allowing for efficient text analysis via vectorization.</p>
<p>While this article primarily focuses on text-based applications, it is essential to note that the concept of <strong>vectorizing information is inherently generic</strong>. Recent advancements have enabled the embedding of various forms of data, including audio, video, and protein structures, into vectorized formats. Furthermore, multi-modal models have been developed, which can uniformly process and analyze all media types, thereby expanding the scope of this technology.</p>
<p>The concept of retrieving context via text embedding is referred to as <strong>basic Retrieval-Augmented Generation</strong> (RAG). This approach can be implemented in a relatively short code snippet and has been extensively modified and adapted by numerous developers. The storage and comparison of vectors have also given rise to an entire industry, with various vector databases and database systems that incorporate vector capabilities being widely available.</p>
<p>Most relational database vendors have integrated vector indexes or developed plugins to support this functionality. Furthermore, the NoSQL movement’s emphasis on multi-modal data stores (key-value, document, graph) has led to the incorporation of vectors and embeddings into these systems.</p>
<p>In conjunction with <strong>bot and agentic frameworks</strong>, the RAG mechanism has enabled the creation of numerous corporate bots and applications that allow users to access information bases via natural language. However, it has been observed that this approach does not function optimally, primarily due to the inherent ambiguity of words and sentences.</p>
<p>The term “apple,” for instance, can refer to a fruit, a corporate entity, or stock information, among other possibilities. Consequently, when retrieving information related to “apple” from a typical database, users may receive an excessive amount of data on both the fruit and the stock market. This has resulted in confusing responses from corporate bots, not to mention issues with <strong>hallucinations</strong>.</p>
<p>The crux of the matter lies not in the contextual information provided to an LLM, but rather in the context of the query itself. To furnish an LLM with the requisite context, one must first comprehend the underlying intent behind the question. For instance, a query regarding the caloric content of an apple would necessitate an understanding that this pertains to fruit, whereas a request for pricing information likely relates to stock valuation.</p>
<p>This dichotomy underscores two primary concerns: how to store contextual information and how to retrieve it effectively. It is at this juncture that knowledge graphs and graph-based Retrieval Augmented Generation (RAG) come into play, offering a solution to the accuracy and context-related issues inherent in naive RAG approaches.</p>
<p>Studies, many of them published by commercial entities with a vested interest in promoting graph databases, have substantiated the subjective experience of bots providing inaccurate responses. The efficacy of text-to-SQL, naive RAG, graph RAG, and their variants can be quantitatively measured, which underscores the commercial appeal of these technologies.</p>
<p>The utility of graphs in this context may seem counterintuitive at first, as one might expect vectorization to provide a suitable solution. However, attaching context to vectors proves challenging, as words like ‘apple’ assume unique vector representations independent of their contextual usage. This leads to a situation where the same word can be situated within multiple topic clusters, necessitating the capture of entities or concepts and the attachment of contextual information.</p>
<p>Graphs excel in this regard by capturing the affinity between entities while also highlighting the importance of context. The question of how to create knowledge graphs and interrogate them via LLMs is the central focus of this article.</p>
</section>
<section id="what-is-a-knowledge-graph" class="level2">
<h2 class="anchored" data-anchor-id="what-is-a-knowledge-graph">What is a knowledge graph?</h2>
<p>The concept of a knowledge graph can be succinctly defined as <strong>an organizational framework for structured data, represented in a graph-like structure</strong>. This differs from relational databases, which implicitly create graphs but do not explicitly utilize them.</p>
<p>However, the notion of a knowledge graph is more complex and encompasses various knowledge cultures, including <strong>semantic stacks, property graphs, strongly typed graphs, ontologies, document stores</strong>, and others. These different approaches enable diverse methods for storing and organizing data as graphs, each with its own strengths and weaknesses.</p>
<p>Key elements that distinguish knowledge graphs from other data storage systems include:</p>
<ul>
<li>The concept of <strong>traversals</strong>: Knowledge graphs prioritize the traversal of data as a graph, returning results in a graph format. In contrast, relational databases store and query data as tables.</li>
<li><strong>Graph-specific metrics</strong>: Centrality, hops, connected components, and similar measures are unique to knowledge graphs and not found in key-value stores or document stores.</li>
</ul>
<p>Additional concepts that are relevant to knowledge graphs include:</p>
<ul>
<li><strong>Taxonomies and ontologies</strong>, which can be beneficial but also burdensome</li>
<li><strong>Hypergraphs</strong>, temporal graphs, geospatial graphs, and other specialized graph types</li>
<li>The various methods for storing and <strong>scaling graph storage</strong></li>
<li><strong>Querying speed</strong> and flexibility</li>
<li><strong>Combining storage dimensions</strong>, such as vectors, documents, and graphs</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-graphtech.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>The significance of various features and functionalities often varies depending on the specific requirements and objectives of a particular business, research initiative, or project. <strong>Database vendors typically emphasize the capabilities that are most relevant to their industry</strong>, while academic researchers focus on aspects that contribute to the advancement of knowledge or enhance their professional reputation.</p>
<p>Similarly, <strong>cloud service providers tend to highlight the scalability and speed of their offerings</strong>. However, when it comes to developing a proof-of-concept (POC) or minimum viable product (MVP), the need for sophisticated infrastructure is not always necessary.</p>
<p>In fact, a POC or MVP can be effectively developed without incurring significant expenses. The following section outlines some essential Python components that can be used to establish a solid graph-based retrieval-augmented generation (graph RAG) solution, thereby facilitating the creation of a functional POC at minimal cost.</p>
<p>For the purposes of this discussion, <strong>I will refer to a knowledge graph as a data structure comprising named entities and relationships between them</strong>. This definition is independent of how such a graph is stored or queried, and it provides a useful framework for exploring various operationalization and implementation strategies.</p>
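<p>To make this working definition concrete, here is a minimal in-memory sketch. The class and field names (<code>Entity</code>, <code>Relationship</code>, <code>KnowledgeGraph</code>, <code>neighbors</code>) are assumptions made for illustration only; they are not taken from any particular framework, and a real setup would delegate storage to a graph library or database.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    description: str = ""


@dataclass
class Relationship:
    source: str
    target: str
    description: str = ""


@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relationship(self, rel: Relationship) -> None:
        # Refuse edges whose endpoints are not named entities of the graph.
        if rel.source not in self.entities or rel.target not in self.entities:
            raise KeyError("both endpoints must be known entities")
        self.relationships.append(rel)

    def neighbors(self, name: str) -> set[str]:
        # One-hop traversal, ignoring edge direction.
        found = set()
        for rel in self.relationships:
            if rel.source == name:
                found.add(rel.target)
            elif rel.target == name:
                found.add(rel.source)
        return found


kg = KnowledgeGraph()
kg.add_entity(Entity("John", "PERSON", "John is a lawyer living in Montana"))
kg.add_entity(Entity("Maria", "PERSON"))
kg.add_relationship(Relationship("John", "Maria", "John met Maria on the train"))
print(kg.neighbors("John"))
```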
</section>
<section id="creating-knowledge-graphs" class="level2">
<h2 class="anchored" data-anchor-id="creating-knowledge-graphs">Creating Knowledge Graphs</h2>
<p>The recipe for ingesting data into a knowledge graph consists of the following steps:</p>
<ol type="1">
<li><strong>Scraping</strong>: raw information, typically in the form of PDFs, is extracted and parsed into markdown format. Each raw source is referred to as a document.</li>
<li><strong>Chunking</strong>: documents are subsequently divided into smaller components, known as chunks or nodes (although this terminology does not yet relate to graph structures).</li>
<li><strong>Entity extraction</strong>: the named entities within each chunk, along with their relationships, are identified and extracted. This process involves consolidating duplicate entity references across multiple chunks.</li>
<li><strong>Graph consolidation</strong>: the resulting entities and relationships form the foundation of the knowledge graph, with metadata retained for referencing original chunk and document sources.</li>
<li><strong>Semantic consolidation</strong>: as new chunks are extracted, the knowledge graph expands accordingly, with descriptions truncated if necessary to maintain data integrity.</li>
<li><strong>Vectorize everything:</strong> the nodes, edges, and chunks within the knowledge graph are then vectorized, enabling cross-referencing and tracking of information origins.</li>
<li><strong>Variations</strong> on this process may necessitate community detection algorithms, such as the Louvain algorithm, which is discussed in further detail in the appendix.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-steps.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>The following sections will provide a detailed explanation of each step in this process.</p>
<section id="what-is-being-ingested" class="level3">
<h3 class="anchored" data-anchor-id="what-is-being-ingested">What is being ingested?</h3>
<p>Text is a natural source for the majority of LLMs, but graph RAG can handle any type of data provided that:</p>
<ul>
<li>you have a multi-modal LLM capable of extracting topics/entities from the medium</li>
<li>the raw data (say, images or audio) is stored appropriately, so that it can be picked up in the same way that text is referenced</li>
<li>the raw data can be vectorized or embedded.</li>
</ul>
<p>In many cases it means that you need multiple LLMs. Call it <strong>a multi-agent system</strong> with each agent coupled to an LLM for a specific task.</p>
</section>
<section id="storage" class="level3">
<h3 class="anchored" data-anchor-id="storage">Storage</h3>
<p>You need three storage flavors to run a graph RAG solution:</p>
<ul>
<li><strong>JSON</strong> or document store to save the chunks</li>
<li>a <strong>graph</strong> store to, well, consolidate the knowledge graph</li>
<li>a <strong>vector</strong> store to embed nodes, edges and chunks.</li>
</ul>
<p>You can combine the stores: solutions like Neo4j can store graphs and vectors uniformly, and stores like <a href="https://unum-cloud.github.io/ustore/" target="_blank">Unum</a> can even hold everything in one place. I often use <a href="https://www.trychroma.com" target="_blank">ChromaDB</a> for prototyping and <a href="https://networkx.org" target="_blank">NetworkX</a> for graphs in memory. Graph RAG is really an architecture and does not depend on a particular technology, despite all the marketing claims.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-stores.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
</section>
<section id="documents-chunks-tokens-and-all-that" class="level3">
<h3 class="anchored" data-anchor-id="documents-chunks-tokens-and-all-that">Documents, chunks, tokens and all that</h3>
<p>The optimal size of a document or text segment for referencing purposes lies between the extremes of a single sentence and a comprehensive, thousand-page document. A paragraph typically represents the most suitable chunk size, as it strikes a balance between conciseness and contextual richness.</p>
<p>However, determining the ideal chunk size often involves a degree of artistry rather than strict scientific calculation, as it can vary significantly depending on the specific business requirements and the nature of the raw data being processed.</p>
<p>Specialized techniques such as <a href="https://python.langchain.com/docs/how_to/semantic-chunker/" target="_blank">semantic chunking</a> and character <a href="https://python.langchain.com/docs/concepts/text_splitters/" target="_blank">splitters</a> can greatly enhance the effectiveness of text segmentation. For instance, when dealing with diverse corpora, employing different chunking strategies for distinct sources may be necessary to achieve optimal results.</p>
<p>Ultimately, the goal is to produce a collection of manageable chunks that retain references to their original sources. <strong>These references serve as verifiable anchors</strong>, enabling users to fact-check answers directly from the underlying data if needed. This process, known as <strong>grounding</strong>, provides a simple yet reliable means of detecting potential hallucinations and ensuring the accuracy of responses.</p>
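<p>As a minimal sketch of paragraph-level chunking that keeps a reference to the source document, consider the following. The <code>Chunk</code> record and the <code>doc-id#index</code> identifier scheme are assumptions for this example; splitting on blank lines is the simplest possible strategy and is not a substitute for semantic chunking.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str      # hypothetical scheme: document id plus paragraph index
    document_id: str   # reference back to the parent document, used for grounding
    text: str


def chunk_by_paragraph(document_id: str, text: str) -> list[Chunk]:
    # Split on blank lines and drop empty paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        Chunk(f"{document_id}#{index}", document_id, paragraph)
        for index, paragraph in enumerate(paragraphs)
    ]


doc = "First paragraph about graphs.\n\nSecond paragraph about vectors.\n\n"
chunks = chunk_by_paragraph("doc-1", doc)
for chunk in chunks:
    print(chunk.chunk_id, "from", chunk.document_id)
```

<p>Because every chunk carries its <code>document_id</code>, an answer built from a chunk can always be traced back to the text it came from.</p>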
</section>
<section id="the-importance-of-metadata" class="level3">
<h3 class="anchored" data-anchor-id="the-importance-of-metadata">The importance of metadata</h3>
<p>Metadata matters in general, and especially when the raw data is not text:</p>
<ul>
<li>you need a reference from a chunk to the (parent) document</li>
<li>the entities need one or more references to the chunks where they appear</li>
<li>the relationships need a reference to one or more chunks where they were discovered</li>
<li>the embedding of nodes, edges and chunks needs to have a reference.</li>
</ul>
<p>The metadata isn’t just for end-user references, grounding and remembering how things got created; it also matters if you wish to delete documents or chunks. The removal of documents and chunks needs to cascade to all secondary elements in order to keep the <strong>integrity of the data.</strong></p>
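<p>A cascading delete can be sketched with plain dictionaries. The store layout below (chunk id to document id, entity name to the set of chunks where it appears, one flat vector map) is an assumption made for illustration; relationships would be handled in the same way as entities.</p>

```python
def delete_document(doc_id, chunks, entities, vectors):
    """Remove a document and everything that only exists because of it.

    chunks:   chunk_id -> document_id
    entities: entity name -> set of chunk_ids where the entity appears
    vectors:  key (chunk_id or entity name) -> embedding
    """
    removed = {cid for cid, did in chunks.items() if did == doc_id}
    for cid in removed:
        del chunks[cid]
        vectors.pop(cid, None)
    for name in list(entities):
        entities[name] -= removed
        if not entities[name]:
            # No remaining chunk mentions this entity, so drop it and its vector.
            del entities[name]
            vectors.pop(name, None)
    return removed


chunks = {"d1#0": "d1", "d1#1": "d1", "d2#0": "d2"}
entities = {"John": {"d1#0", "d2#0"}, "Maria": {"d1#1"}}
vectors = {"d1#0": [0.1], "d1#1": [0.2], "d2#0": [0.3], "John": [0.4], "Maria": [0.5]}

removed = delete_document("d1", chunks, entities, vectors)
print(sorted(removed), entities, sorted(vectors))
```

<p>Here John survives because he is still referenced by a chunk of the second document, while Maria disappears together with the only chunk that mentioned her.</p>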
</section>
<section id="scraping-and-formats" class="level3">
<h3 class="anchored" data-anchor-id="scraping-and-formats">Scraping and formats</h3>
<p>Markdown is ideal for AI because it’s plain text that preserves structure, unlike binary large objects (blobs) and PDFs, where the binary format has to be transformed or scraped. Scraping is a business on its own and the technology has tremendously improved beyond basic OCR. Services like <a href="https://www.llamaindex.ai/blog/launching-the-first-genai-native-document-parsing-platform" target="_blank">LlamaParse</a> return quality markdown from fairly complex documents. The open source <a href="https://github.com/VikParuchuri/marker" target="_blank">Marker</a> project also delivers astounding results.</p>
<p>Since the quality of the raw data is the basis of a good knowledge graph it matters to investigate the various options and cost.</p>
<p>It is also worth noting that multi-modal models and vision-based models can be beneficial in this context. Providing a screenshot of a PDF document and requesting a structured output (headers, footnotes, etc.) can often yield satisfactory results. The effectiveness of this approach, however, ultimately depends on the specific business domain and the characteristics of the documents being processed. In some cases, particularly those involving complex tables, formulas, or quoted text, more advanced quality parsing may be required to achieve accurate output.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-parse.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
</section>
<section id="prompting" class="level3">
<h3 class="anchored" data-anchor-id="prompting">Prompting</h3>
<p>It is with some reluctance that I acknowledge the pivotal role played by carefully crafted prompts in the efficacy of graph RAG (and numerous other sophisticated AI applications). A thorough examination of Microsoft’s GraphRAG, Neo4j’s GenAI, LightRAG, and various graph RAG variants reveals that their success or failure is largely contingent upon the precision and effectiveness of their respective prompts.</p>
<p>Several techniques, such as DSPy, have been developed to treat prompting as a traditional machine learning task; however, their reliability has yet to be conclusively established. Companies like LangChain have created tools and applications that utilize agents (Large Language Models) to iteratively generate prompts. Notably, the efficiency of a prompt is heavily dependent on the specific LLM employed. Consequently, prompting remains an indispensable component and a nuanced craft, rather than a clearly defined or universally applicable formula for success.</p>
</section>
<section id="entities" class="level3">
<h3 class="anchored" data-anchor-id="entities">Entities</h3>
<p>Extracting entities from chunks can be done in various ways:</p>
<ul>
<li>there are dedicated models like <a href="https://arxiv.org/abs/2402.15343" target="_blank">NuNER</a>, <a href="../nuextract.html" target="_blank">NuExtract</a> or <a href="https://github.com/urchade/GLiNER" target="_blank">GLiNER</a></li>
<li>you simply ask any model to extract them (possibly with smart prompting)</li>
<li>there are models focused on triple extraction and knowledge graph creation like <a href="https://www.sciphi.ai/blog/triplex" target="_blank">Triplex</a>.</li>
</ul>
<p>If you extract an entity on its own it does have a pointer to the chunk it came from, but for downstream tasks like querying it’s beneficial to have a description or explanation within the entity. That is, rather than simply extract the name and type of an entity, it’s beneficial to also ask the model what the entity is. For instance, the entity <em>(John, PERSON)</em> is a lot less useful than <em>(John, PERSON, John is a lawyer living in Montana)</em>. The thing is of course that the more data becomes available about an entity, the more the description has to be fine-tuned.</p>
<p>Whether or not you have bare entities or you adorn them with context relates to deep issues with knowledge graphs:</p>
<ul>
<li>can the KG be incremented easily or does it require the whole corpus?</li>
<li>how to organize entities in semantically related domains?</li>
<li>how to ensure that when querying the KG you pick up the correct entity?</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-enrich.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
<p>Every framework or company has its own view on the matter. Frameworks like Microsoft GraphRAG are not incremental and very expensive in terms of LLM processing. RDF databases rely on ontologies to embed meaning. Like all else, it depends on many factors and what your end game is.</p>
<p>As of early 2025, the most efficient way to graph RAG things is, in my view, the <a href="https://arxiv.org/abs/2410.05779" target="_blank">LightRAG</a> approach, but who knows how things will be a few years from now.</p>
</section>
<section id="relationships" class="level3">
<h3 class="anchored" data-anchor-id="relationships">Relationships</h3>
<p>The extraction of relationships within knowledge graphs presents various complexities. A primary consideration revolves around whether entities should conform to a pre-defined schema (ontology) or allow relationships to emerge organically (inferred schema). However, this dichotomy poses a risk: the pursuit of a strictly structured knowledge graph may compromise its querying efficiency. Conversely, a KG without some form of schema can rapidly become an unmanageable repository of information.</p>
<p>Creating ontologies based on RDF leads to triple stores and you potentially find value in reasoning engines, constraints (SHACL) and lots of things you will not find in the less rigid world of property graphs. My experience is that developing an ontology is a burden and asking domain experts to master ontologies (in order to implement their know-how into an ontology) is often difficult. People find it hard to make a leap of abstraction even if they understand their domain very well. Things like transitivity, superclasses, the difference between object and value properties…are quite abstract for most people and they correctly wonder what the benefits are. Of course, I am not saying that ontologies are useless, just that you have to understand what it entails and where it leads to.</p>
<p>Just like entities, it is useful to embed descriptions in relationships. It’s the alternative to a predefined ontology. Rather than a triple <em>(John, KNOWS, Maria)</em> you can have <em>(John, Maria, John met Maria on the train during a trip to a conference in London)</em>. The latter will be more helpful when embedded in a vector while the former will be more useful if you wish to have a pure KG. Reasoning engines will not work without ontologies but the latter form will be better for bot apps.</p>
<p>Utilizing a predefined ontology within the prompt context enables an LLM to generate entities and relationships. This methodology is favored by Neo4j and yields satisfactory results; however, its efficacy is limited when dealing with extensive ontologies, as it may lead to token size truncation. Consequently, this approach is more suitable for tutorial purposes rather than practical applications. Furthermore, if a complex schema is provided for each NER extraction, the process becomes computationally slow due to the LLM’s need to ingest the ontology on every occasion.</p>
<p>Note that simple predicates also lead to issues with disambiguation. The better approach is to keep a description and use its vector embedding to figure out whether an entity is new or already known. If you only have (John, PERSON) it is hardly possible to tell whether the ‘John’ appearing in a chunk is the same as the one already ingested.</p>
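<p>The idea of resolving an entity through its description can be sketched as follows. Note the assumptions: Jaccard token overlap stands in for the similarity of real embedding vectors, the threshold of 0.3 is arbitrary, and <code>resolve</code> is a hypothetical helper rather than part of any framework.</p>

```python
def tokens(text: str) -> set[str]:
    # Lowercase and strip basic punctuation before splitting into words.
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a: str, b: str) -> float:
    # Toy similarity; a real system would compare embedding vectors instead.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def resolve(candidate_description: str, known: dict[str, str], threshold: float = 0.3):
    """Return the id of the best-matching known entity, or None if it looks new."""
    best_id, best_score = None, 0.0
    for entity_id, description in known.items():
        score = jaccard(candidate_description, description)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


known = {
    "john-1": "John is a lawyer living in Montana",
    "john-2": "John is a composer born in Dublin",
}
same = resolve("John, a lawyer from Montana", known)
new = resolve("Maria is a pianist", known)
print(same, new)
```

<p>With only the bare name, both known entries would match ‘John’ equally well; the description is what allows the two to be told apart.</p>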
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-ontology.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
</section>
</section>
<section id="querying-knowledge-graphs" class="level2">
<h2 class="anchored" data-anchor-id="querying-knowledge-graphs">Querying Knowledge Graphs</h2>
<section id="global-and-local" class="level3">
<h3 class="anchored" data-anchor-id="global-and-local">Global and local</h3>
<p>The relationships between KG entities can be approached in two ways:</p>
<ul>
<li>use vector similarity with respect to the k-nearest nodes and take the attached edges, this is called <strong>local querying</strong></li>
<li>use vector similarity across all edges and take the k-nearest edges with the edge endpoints giving entities, this is called <strong>global querying</strong>.</li>
</ul>
<p>If you combine both approaches and merge the results you get <strong>hybrid querying</strong>.</p>
<p>It can be shown that results are improved if global search occurs with global topics. For instance, the question ‘What did John Field compose in early 1830?’ leads to global topics ‘music, composer, person’ and local keywords ‘John Field, 1830’. Global queries work best with global topics and local search with local keywords.</p>
<p>This is not written in stone, nothing prevents one from using for instance depth-first traversals or multi-hop queries. What works best depends on the KG and the business domain.</p>
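<p>The difference between local, global and hybrid querying can be illustrated with a small sketch. It assumes that nodes and edges both carry a textual description, and it uses a toy token-overlap score in place of real vector similarity; the function names are hypothetical and the data is made up for the example.</p>

```python
def similarity(a: str, b: str) -> float:
    # Toy token overlap standing in for embedding similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_k(query, items, k):
    # items is a list of (key, text) pairs; returns the keys of the k best matches.
    ranked = sorted(items, key=lambda pair: similarity(query, pair[1]), reverse=True)
    return [key for key, _ in ranked[:k]]


def local_query(keywords, nodes, edges, k=1):
    # k-nearest nodes first, then the edges attached to them.
    hits = set(top_k(keywords, list(nodes.items()), k))
    return [e for e in edges if e[0] in hits or e[1] in hits]


def global_query(topics, edges, k=1):
    # k-nearest edges directly; their endpoints give the entities.
    indexed = [(i, e[2]) for i, e in enumerate(edges)]
    return [edges[i] for i in top_k(topics, indexed, k)]


def hybrid_query(keywords, topics, nodes, edges, k=1):
    # Merge both result sets, keeping order and dropping duplicates.
    local = local_query(keywords, nodes, edges, k)
    extra = [e for e in global_query(topics, edges, k) if e not in local]
    return local + extra


nodes = {
    "John Field": "John Field Irish composer and pianist",
    "Nocturne": "Nocturne a short piano piece",
    "Dublin": "Dublin capital city of Ireland",
}
edges = [
    ("John Field", "Nocturne", "John Field composed nocturne music for piano"),
    ("John Field", "Dublin", "John Field was born in Dublin"),
]

local_hits = local_query("John Field 1830", nodes, edges)
global_hits = global_query("music composer person", edges)
hybrid_hits = hybrid_query("John Field 1830", "music composer person", nodes, edges)
print(len(local_hits), len(global_hits), len(hybrid_hits))
```

<p>The local query is driven by the specific keywords, the global query by the broader topics, mirroring the split described above.</p>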
</section>
<section id="graph-analytics" class="level3">
<h3 class="anchored" data-anchor-id="graph-analytics">Graph analytics</h3>
<p>A knowledge graph can be approached like any other graph and various tools can be useful:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Centrality" target="_blank">centrality</a> is the answer to the question ‘what are the most important nodes and edges?’, it helps discover dense hubs, uncoupled subgraphs (weakly connected components) and more. These measures can be used during querying to sort results.</li>
<li>cluster and <a href="https://en.wikipedia.org/wiki/Community_structure" target="_blank">community detection</a> help identify global topic clusters. Some approaches like <a href="https://microsoft.github.io/graphrag/" target="_blank">Microsoft GraphRAG</a> and <a href="https://github.com/neo4j-labs/llm-graph-builder" target="_blank">Neo4j’s graph builder</a> advocate this approach.</li>
<li>introducing <strong>meta-nodes</strong> to organize knowledge in some way. This often happens with respect to a company’s internal organization rather than purely on the basis of the raw corpus.</li>
</ul>
<p>Note that certain things require the whole graph and are difficult to implement incrementally. Conversely, deleting things in a KG can have a direct impact on topological metrics. You should carefully analyze the long-term impact of certain measures and how they affect a changing KG.</p>
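<p>Two of the simplest measures, degree centrality and connected components, can be sketched in plain Python to show why they depend on the whole graph. The helper names are illustrative assumptions; in a real project a graph library such as NetworkX would typically provide these measures.</p>

```python
from collections import defaultdict


def build_adjacency(edges):
    # Undirected adjacency sets built from (node, node) pairs.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_centrality(adjacency):
    # Fraction of the other nodes that each node is directly connected to.
    n = len(adjacency)
    return {
        node: (len(neighbors) / (n - 1) if n > 1 else 0.0)
        for node, neighbors in adjacency.items()
    }


def connected_components(adjacency):
    # Depth-first search from every node that has not been visited yet.
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


edges = [("John", "Maria"), ("John", "London"), ("Maria", "London"), ("Apple", "Fruit")]
adjacency = build_adjacency(edges)
centrality = degree_centrality(adjacency)
components = connected_components(adjacency)
print(centrality["John"], [sorted(c) for c in components])
```

<p>Adding or deleting a single node changes the denominator for every other node, which is one reason such metrics are awkward to maintain incrementally.</p>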
</section>
<section id="naive-rag" class="level3">
<h3 class="anchored" data-anchor-id="naive-rag">Naive RAG</h3>
<p>The Graph RAG model encompasses a basic variant, often referred to as standard or naive RAG. This approach is applicable when a knowledge graph with embeddings is available, allowing for the computation of chunk embeddings against which queries can be posed. Consequently, naive RAG constitutes the fourth query method, alongside local, global, and hybrid approaches. In typical applications, these four options are presented to the end-user.</p>
<p>Notably, naive RAG does not involve the knowledge graph in its operation, thereby rendering it the fastest of the four query techniques.</p>
</section>
</section>
<section id="interactions" class="level2">
<h2 class="anchored" data-anchor-id="interactions">Interactions</h2>
<p>The initial introduction of a dataset into a system or process is a critical phase; however, a thorough examination of the subsequent long-term interactions and consequences is equally essential for a comprehensive understanding.</p>
<p>Any ingestion will lead to errors and issues. The end-user or domain expert needs to be able to adjust and correct things in the KG. This can take many forms:</p>
<ul>
<li>entities and relationships can be deleted</li>
<li>entities can be added or existing ones adorned (change the description, add keyword)</li>
<li>chunks might not contain relevant information</li>
<li>weights can be altered to increase the importance of some relationships.</li>
</ul>
<p>Each of these actions should update the vector storage, the KG and the chunk/document store:</p>
<ul>
<li>when a description is changed, the entity/relationship should be vectorized again</li>
<li>if a chunk is removed or changed, all stores should get updated</li>
</ul>
<p>and so on. The central idea here is <strong>data integrity</strong> or change propagation. On a code level this means that various bits used by the ingestion should be triggered in other ways as well. The CRUD (create/read/update/delete) has to be in place on all levels: document, chunk, node and relationship. Once again, if you understand conceptually how graph RAG hangs together it’s just a lot of coding and unit testing. I mean, it’s not algorithmically demanding, just correct plumbing and wiring of methods.</p>
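<p>Change propagation can be as simple as making sure that every write goes through one place. The sketch below is an assumption-laden illustration: <code>EntityStore</code> is a hypothetical class, and <code>toy_embed</code> merely counts characters and words where a real embedding model would be called.</p>

```python
def toy_embed(text: str) -> list[float]:
    # Hypothetical stand-in for an embedding model: character and word counts.
    return [float(len(text)), float(len(text.split()))]


class EntityStore:
    """Keeps entity descriptions and their vectors in sync."""

    def __init__(self, embed):
        self.embed = embed
        self.descriptions = {}
        self.vectors = {}

    def upsert(self, name: str, description: str) -> None:
        # Create or update: the vector is always recomputed from the description.
        self.descriptions[name] = description
        self.vectors[name] = self.embed(description)

    def delete(self, name: str) -> None:
        # Removing an entity removes its vector as well.
        self.descriptions.pop(name, None)
        self.vectors.pop(name, None)


store = EntityStore(toy_embed)
store.upsert("John", "John is a lawyer")
before = store.vectors["John"]
store.upsert("John", "John is a lawyer living in Montana")
after = store.vectors["John"]
store.delete("John")
print(before, after, "John" in store.vectors)
```

<p>The same pattern applies to chunks and relationships: if the only way to change a description is through a method that also re-vectorizes it, the stores cannot drift apart.</p>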
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/images/rag-slang.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="500"></p>
</figure>
</div>
</section>
<section id="minimal-setup" class="level2">
<h2 class="anchored" data-anchor-id="minimal-setup">Minimal setup</h2>
<p>In this section I describe a typical POC/MVP, running fully local.</p>
<p>You need <strong>three types of data storage</strong>:</p>
<ul>
<li>for chunks and raw documents (text) you can use JSON files or, more advanced, a document database like MongoDB.</li>
<li>for vectors, <a href="https://www.trychroma.com" target="_blank">ChromaDB</a> is the easiest of them all but any of <a href="https://en.wikipedia.org/wiki/Vector_database" target="_blank">the ones listed on Wikipedia</a> will do</li>
<li>for the knowledge graph you can use NetworkX and a file (e.g.&nbsp;GraphML); a more advanced setup would entail <a href="https://graphsandnetworks.com/graph-databases/memgraph/" target="_blank">Memgraph</a>, <a href="https://graphsandnetworks.com/graph-databases/neo4j/" target="_blank">Neo4j</a>, <a href="https://graphsandnetworks.com/graph-databases/neptune/" target="_blank">AWS Neptune</a> or one of many others.</li>
</ul>
<p>A minimal setup based on files is obviously not going to scale well but can go a long way. You can also combine everything in one and the same storage. For instance, Neo4j can store vectors, graphs and text at the same time. There are also some open source options like <a href="https://unum-cloud.github.io/ustore/" target="_blank">Unum or UStore</a>.</p>
<p>The <strong>storage and querying part is straightforward</strong>, but depends on the underlying API. You can also explore the wrappers of LlamaIndex and LangChain for inspiration.</p>
<p><strong>Tokenizing</strong> can be based on <a href="https://github.com/openai/tiktoken" target="_blank">Tiktoken</a> or <a href="https://huggingface.co/docs/transformers/en/main_classes/tokenizer" target="_blank">the Huggingface tokenizer</a>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> tiktoken</span>
<span id="cb1-2">ENCODER <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tiktoken.encoding_for_model(settings.tokenize_model)</span>
<span id="cb1-3">tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ENCODER.encode(content)</span></code></pre></div></div>
<p><strong>Chunking</strong> of documents can be done via tokenizing by taking a fixed number of tokens each time, or you can use something smarter like semantic chunking. The size of a chunk and what works best depends on the type of data. You don’t need a fixed chunk size; using a fixed number of sentences can work too. Some advocate overlapping chunks, where adjacent chunks share a certain number of tokens.</p>
<p>The most basic fixed token length approach looks like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encode(content)</span>
<span id="cb2-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(tokens) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> max_token_size:</span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> content</span>
<span id="cb2-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb2-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> decode(tokens[:max_token_size])</span></code></pre></div></div>
<p>The original documents need to be stored with a unique identifier and each chunk has to point to this unique identifier so that grounding and showing references can be done with each request.</p>
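<p>The snippet above only truncates the content to a single window. A sketch of splitting a whole document into overlapping fixed-size chunks, each pointing back to its document, could look as follows. The function names, the record fields and the hash-based identifiers are illustrative assumptions, and a whitespace split stands in for a real tokenizer so that the example runs without extra dependencies.</p>

```python
import hashlib


def chunk_tokens(tokens, max_token_size=200, overlap=20):
    """Split a token list into windows of max_token_size sharing `overlap` tokens."""
    if overlap >= max_token_size:
        raise ValueError("overlap must be smaller than max_token_size")
    step = max_token_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_token_size])
        # Stop once the current window reaches the end of the document.
        if start + max_token_size >= len(tokens):
            break
    return windows


def chunk_document(content, max_token_size=200, overlap=20):
    """Return a document id and chunk records that reference it."""
    # Assumption: whitespace tokens instead of tiktoken or a Hugging Face tokenizer.
    tokens = content.split()
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    doc_id = "doc-" + digest
    records = []
    for order, window in enumerate(chunk_tokens(tokens, max_token_size, overlap)):
        records.append({
            "id": f"{doc_id}-chunk-{order}",
            "doc_id": doc_id,   # the pointer used for grounding and references
            "order": order,
            "content": " ".join(window),
        })
    return doc_id, records
```

<p>With a real tokenizer you would encode the content, window the token ids in the same way and decode each window back to text.</p>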
<p>Once chunking is done you need to feed each chunk to the ingestion process:</p>
<ul>
<li>the entities are extracted with some relations</li>
<li>every chunk produces a little graph (it can be a single entity, nothing at all or a small graph with a dozen nodes and edges)</li>
<li>this little graph has to be merged into the global KG</li>
</ul>
<p>In addition:</p>
<ul>
<li>every entity and relationship is vectorized</li>
<li>every chunk is vectorized</li>
<li>all the entities and relationships need to point to the chunk where they were found.</li>
</ul>
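<p>Merging a little chunk graph into the global KG can be sketched with NetworkX as follows. The function name and the <code>description</code> and <code>chunk_ids</code> attribute names are assumptions made for this illustration, and descriptions are simply concatenated where a real pipeline would ask the LLM for a merged summary.</p>

```python
import networkx as nx


def merge_chunk_graph(kg, chunk_graph, chunk_id):
    """Merge the graph extracted from one chunk into the global KG, in place."""
    for name, data in chunk_graph.nodes(data=True):
        if not kg.has_node(name):
            kg.add_node(name)
        node = kg.nodes[name]
        node.setdefault("description", "")
        node.setdefault("chunk_ids", [])
        new_description = data.get("description", "")
        # Naive merge: concatenate descriptions that are not already present.
        if new_description and new_description not in node["description"]:
            parts = [node["description"], new_description]
            node["description"] = " | ".join(p for p in parts if p)
        # Every entity points to the chunks where it was found.
        if chunk_id not in node["chunk_ids"]:
            node["chunk_ids"].append(chunk_id)
    for source, target, data in chunk_graph.edges(data=True):
        if not kg.has_edge(source, target):
            kg.add_edge(source, target, description=data.get("description", ""), chunk_ids=[])
        edge = kg.edges[source, target]
        edge.setdefault("chunk_ids", [])
        if chunk_id not in edge["chunk_ids"]:
            edge["chunk_ids"].append(chunk_id)
    return kg
```

<p>Note that GraphML cannot store list attributes, so the chunk identifiers would have to be joined into a string before writing the graph to such a file.</p>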
<p>The most difficult part here is how to extract this ‘little graph’ from a chunk. As mentioned earlier, the magic resides in a sophisticated prompt. In essence, you ask very explicitly and by giving examples to the LLM what you need. This can look something like this:</p>
<pre class="text"><code>--Goal--

You get a piece of text and a list of entity types. Identify all entities of those types from the text and all relationships among the identified entities.

--Steps--
1. Identify the entities:
- the entity name
- the entity type
- a description of this entity
Format each entity as follows:
(entity name, entity type, entity description)
2. From the entities in step 1 identify all pairs that are related.
For each pair (source_entity, target_entity) you extract the following information:
- source_entity: the name as identified in step 1
- target_entity: the name as identified in step 1
- description: a description of this relationship
3. Identify high-level keywords that summarize the essence of the given text.

--Examples--
&lt;give concrete example here&gt;

--Given text--
&lt;inject the chunk here&gt;</code></pre>
<p>The magic sauce here is the neatly structured format and the very explicit guidelines and examples.</p>
<p>There are various models which focus on named entity extraction but they typically will not generate the high-level keywords and the description. These are, however, essential for the querying part and for the vectorization. If you leave these out you will generate a typical triple store but the incremental aspect will be much harder.</p>
<p><strong>What happens when the same entity appears in multiple places with a different description?</strong> This is the smart merging part. The LLM is used once again to create a comprehensive merge of the descriptions. This idea also applies to the description and keywords on relationships. The trickier part is to identify an entity within a context. For example, if two paragraphs contain the word ‘apple’ and one refers to the company while the other refers to the fruit, the merged description will not make sense. There are a few ways to proceed:</p>
<ul>
<li>you can ask the LLM very clearly whether the two entities with the given descriptions refer to the same thing</li>
<li>you can analytically look at the similarity of the nodes’ neighborhoods</li>
<li>you can use the neighborhood as context to the question for the LLM.</li>
</ul>
<p>If the LLM thinks the two are the same you can proceed and if not you need to store the entity separately. This could mean that the unique key for a node in the KG is not the name but the combination (name, type) or possibly more. Whether this is necessary depends on the knowledge domain. For example, technical terms like ‘cyanobacteria’ or ‘synechococcus’ are unlikely to refer to more than one thing, but names like ‘phoenix’ or ‘apple’ call for explicit disambiguation.</p>
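<p>A composite key is easy to sketch. Below, a plain dictionary stands in for the node store and the helper names <code>node_key</code> and <code>upsert</code> are hypothetical. Names and types are normalized so that trivial spelling variations end up on the same node while homonyms of a different type stay apart.</p>

```python
def node_key(name, entity_type):
    """Build a KG key from name and type so that homonyms stay separate."""
    return (name.strip().lower(), entity_type.strip().lower())


def upsert(index, name, entity_type, description):
    """Collect descriptions per (name, type) key; merging them is a later step."""
    key = node_key(name, entity_type)
    index.setdefault(key, []).append(description)
    return key


index = {}
company = upsert(index, "Apple", "company", "A technology company")
fruit = upsert(index, "apple", "fruit", "An edible fruit")
```

<p>Here <code>company</code> and <code>fruit</code> are two different keys, so the descriptions of the company and the fruit are never merged into one node.</p>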
<p>The little chunk graphs can be merged into the KG and in addition:</p>
<ul>
<li>every entity is vectorized together with the description</li>
<li>every relationship is vectorized together with the description and the keywords</li>
</ul>
<p>All of this can fit in a single Python file. A scalable solution would entail some queues and separate processes. A sophisticated implementation is <a href="https://trustgraph.ai">TrustGraph</a> for instance. A codebase like TrustGraph contains a lot of plumbing code related to Apache Pulsar and other things, meaning that the essence of a graph RAG pipeline is hard to discover. It helps tremendously to have a minimal solution as a map to develop something like TrustGraph.</p>
<p>In any case, at this point the KG is ready to be queried. As mentioned above there are four types and each type has been described. The implementation is quite straightforward, if you understand conceptually what needs to be assembled.</p>
<p>For instance, the local query requires:</p>
<ul>
<li>fetching the most important nodes via the vector database</li>
<li>collecting for these nodes the relationships attached (edge neighbors)</li>
<li>collecting the unique chunks</li>
<li>sorting the chunks on the basis of some order, typically the edge degree</li>
<li>using these chunks as context for the question.</li>
</ul>
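<p>The local query steps can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: a dictionary of vectors with a hand-written cosine similarity replaces the vector database, the <code>chunk_ids</code> attribute on nodes and edges is an assumed convention, and the "edge degree" is taken to be the sum of the degrees of the two endpoints.</p>

```python
import math
import networkx as nx


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def local_query_context(kg, node_vectors, chunks, query_vector, top_k=2):
    """Assemble the ordered chunk texts used as context for a local query."""
    # 1. The most relevant nodes according to vector similarity.
    ranked = sorted(node_vectors, key=lambda n: cosine(node_vectors[n], query_vector), reverse=True)
    seeds = [n for n in ranked if n in kg][:top_k]
    # 2. The relationships attached to these nodes.
    edges = list(kg.edges(seeds, data=True))
    # 3. The unique chunks, scored by the combined degree of the edge endpoints.
    scores = {}
    for source, target, data in edges:
        degree = kg.degree(source) + kg.degree(target)
        for chunk_id in data.get("chunk_ids", []):
            scores[chunk_id] = max(scores.get(chunk_id, 0), degree)
    for node in seeds:
        for chunk_id in kg.nodes[node].get("chunk_ids", []):
            scores.setdefault(chunk_id, kg.degree(node))
    # 4. Highest score first; ties are broken by chunk id for a stable order.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    # 5. These texts are what you would hand to the LLM as context.
    return [chunks[c] for c in ordered if c in chunks]
```

<p>Swapping the dictionary for ChromaDB or the NetworkX graph for a graph database only changes steps 1 and 2; the assembly logic stays the same.</p>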
<p>It means you need to use the NetworkX API, the ChromaDB API and so on. The fact that these APIs are very plain makes it easy to see what is graph RAG specific and what is storage specific. Here again, if you use Neo4j, Pinecone, MongoDB and whatnot, you likely will get lost in APIs. Start simple and grow from there.</p>
</section>
<section id="appendix" class="level2">
<h2 class="anchored" data-anchor-id="appendix">Appendix</h2>
<section id="a1-things-to-consider" class="level3">
<h3 class="anchored" data-anchor-id="a1-things-to-consider">A1 Things to consider</h3>
<ul>
<li><strong>ontologies</strong> in a semantic context are based on OWL/RDF and if you use <a href="https://protege.stanford.edu" target="_blank">Protégé</a> for a while you will discover that it can be a vocation to create something great. People often refer to an ontology as an emerging schema (e.g.&nbsp;when using property graphs) or do not understand why it matters. My advice here is to research carefully and consider in advance the pros and cons of an ontology. You can explicitly instruct an LLM via smart prompts to stick to a schema but if your ontology is substantial this will not work (due to the window size). The type of graph database will often help or hinder your schema development. TypeDB and TigerGraph enforce a schema from the start, most RDF databases have dynamic inference and reasoning based on OWL, and property graphs typically lack support for ontologies. In this sense, make sure that if you don’t wish to embark on a long ontological journey you also don’t pick a database which enforces it. Equally, I have seen many projects discover rather late that having some schema is a good thing. If you store your graph in MongoDB and a dozen devs are working with it, you can be certain that the schema will diverge over time.</li>
<li>if you go big, you need big infrastructure. Systems like <a href="https://trustgraph.ai" target="_blank">Trustgraph</a> use async communication via Apache Pulsar to scale the various elements highlighted above. This is wonderful provided you understand it all and you know that this is necessary. Most customers don’t have petabyte knowledge bases but if one of them does, make sure you focus on scale, cost and performance.</li>
<li>for medium size developments you might benefit from ETL platforms like Apache Airflow, Apache Hop or (my new favorite) <a href="https://kestra.io" target="_blank">Kestra</a>. Orchestration, logging, security (encryption of chunks for instance) and many other enterprise factors matter. Just don’t let those aspects get in the way of the graph RAG intelligence you are after.</li>
<li><a href="https://graphsandnetworks.com/graphviz/" target="_blank">visualization of knowledge graphs</a> is one of my core competences and if you need assistance with this, <a href="https://graphsandnetworks.com/contact/" target="_blank">gimme a call</a>. Graphs beg to be visualized and in many projects it’s essential to have some kind of diagram.</li>
<li>the <strong>LLM cost</strong> is something you need to consider carefully. Development benefits from local setups like Ollama but you will discover that cloud services (like OpenAI, LlamaParse, Vertex AI…) are often cheaper if you wish to ingest large amounts of data, and a lot faster than anything you can achieve on a laptop. There are tools on the market to help evaluate cost, like <a href="https://docs.ragas.io/en/stable/howtos/applications/_cost/#implement-tokenusageparser">Ragas</a> and <a href="https://docs.confident-ai.com/docs/getting-started">DeepEval</a>.</li>
<li>If you use frameworks like CrewAI you will need LLMOps tools like <a href="https://docs.arize.com/phoenix" target="_blank">Arize Phoenix</a>, <a href="https://www.trulens.org" target="_blank">TruLens</a> and <a href="https://www.langchain.com/langsmith" target="_blank">LangSmith</a>. With a minimal setup you will be able to use standard Python debugging. There is the incorrect perception that one has to use a framework to have (agentic) AI, but that is not the case. I think that implementing graph RAG without a framework will give a deeper understanding and a foundation to proceed (if needed) to these frameworks. There is also the general agreement that frameworks come with lots of things you don’t need. For example, why would you include code for all the different vector databases if you use only one? Frameworks capable of everything are often complex and fixing issues can lead to rabbit holes. Be careful with embracing agent frameworks (<a href="https://www.youtube.com/watch?v=U4uBHs0gym0" target="_blank">this video</a> is interesting in this context).</li>
</ul>
</section>
<section id="a2-on-clustering-and-global-topics" class="level3">
<h3 class="anchored" data-anchor-id="a2-on-clustering-and-global-topics">A2 On clustering and global topics</h3>
<p>Extracting a KG from a corpus leads to knowledge on a low level. That is, there is no semantic hierarchy. You can compare it with Wikipedia articles without content organization. There are no topics; you have an index but no table of contents. This works but experiments show that it leads to low accuracy. In addition, some questions are about broad topics (What is oncology?) and can’t be answered by having thousands of details (entities).</p>
<p>Frameworks like Microsoft GraphRAG remedy this by post-processing the knowledge graph with clustering methods like the Leiden algorithm to infer a hierarchy. This works but it’s demanding on many levels and requires the whole KG to be present. Incremental intelligence is difficult via this route.</p>
<p>The more scalable approach is to embed description/information within the KG and at query time prompt the LLM for high-level and low-level topics. For example, if the question is ‘When did John Field die?’ the high-level topics would be ‘person, composer, music’ while the low-level keywords would be ‘John Field, death’. The combination bypasses the need for community creation in the KG and it has been shown to be more accurate.</p>
<p>Clustering can have benefits outside RAG and if you want a clean KG, drill-down visualization, ML development (GNN)… clustering is the way to go. Once again, don’t blindly follow the marketing claims and understand how certain choices affect your business aims (app goals).</p>
</section>
<section id="a3-document-parsers" class="level3">
<h3 class="anchored" data-anchor-id="a3-document-parsers">A3 Document Parsers</h3>
<p>Turning documents into text or markdown has become a business avenue on its own. What works best depends on the type of documents you need to parse and on your budget. Documents with charts, tables, maths are more complex than <a href="https://www.gutenberg.org" target="_blank">Gutenberg books</a>, legal texts with references, footnotes and complex hierarchies can be hard to chunk. Some parsers combine OCR with computer vision and LLM, some are fast, some can handle noise (smudges), some deal well with many languages…there is a spectrum of solutions and features.</p>
<section id="open-source" class="level4">
<h4 class="anchored" data-anchor-id="open-source">Open Source</h4>
<ul>
<li><a href="https://github.com/Unstructured-IO/unstructured" target="_blank">Unstructured</a> The&nbsp;library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and&nbsp;<a href="https://docs.unstructured.io/open-source/core-functionality/partitioning" target="_blank">many more</a>. The use cases of&nbsp;<code>unstructured</code>&nbsp;revolve around streamlining and optimizing the data processing workflow for LLMs.&nbsp;The Unstructured&nbsp;modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.</li>
<li><a href="https://github.com/nlmatics/llmsherpa" target="_blank">LLMSherpa</a> Open source solution tackling generic problems that other parsers have, see <a href="https://ambikasukla.substack.com/p/efficient-rag-with-document-layout?r=ft8uc&amp;utm_campaign=post&amp;utm_medium=web&amp;triedRedirect=true" target="_blank">this article for an interesting overview</a>.</li>
<li><a href="https://github.com/VikParuchuri/marker" target="_blank">Marker</a> focuses on scientific papers and lots of languages.</li>
<li><a href="https://github.com/atlanhq/camelot" target="_blank">Camelot</a> an older but still relevant package.</li>
<li><a href="https://github.com/Filimoa/open-parse" target="_blank">Open Parse</a></li>
<li><a href="https://github.com/lazyFrogLOL/llmdocparser" target="_blank">LlmDocParser</a> A package for parsing PDFs and analyzing their content using LLMs.</li>
<li><a href="https://tika.apache.org" target="_blank">Apache Tika</a> See this <a href="https://medium.com/wellcome-data/how-to-parse-millions-of-pdf-documents-asynchronously-with-apache-tika-d27e06e57b22" target="_blank">article</a> looking into the scalability of Tika.</li>
</ul>
</section>
<section id="cloud-services" class="level4">
<h4 class="anchored" data-anchor-id="cloud-services">Cloud Services</h4>
<ul>
<li><a href="https://github.com/run-llama/llama_parse">LlamaParse</a> is part of LlamaIndex and a closed cloud service.</li>
<li><a href="https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence" target="_blank">Azure Document Intelligence</a> is a cloud service with strict compliance and security, well integrated with other cloud services.</li>
<li><a href="https://aws.amazon.com/textract/" target="_blank">Amazon Textract</a> much like the Azure service.</li>
</ul>
</section>
</section>
<section id="a4-graph-rag-frameworks" class="level3">
<h3 class="anchored" data-anchor-id="a4-graph-rag-frameworks">A4 Graph RAG Frameworks</h3>
<ul>
<li><a href="https://microsoft.github.io/graphrag/" target="_blank">Microsoft Graph RAG</a></li>
<li>Formerly Neo4j GenAI but now called <a href="https://neo4j.com/docs/neo4j-graphrag-python/" target="_blank">Neo4j GraphRAG for Python</a></li>
<li><a href="https://github.com/gusye1234/nano-graphrag" target="_blank">Nano Graph RAG</a></li>
<li><a href="https://github.com/HKUDS/LightRAG" target="_blank">LightRAG</a></li>
<li><a href="https://github.com/circlemind-ai/fast-graphrag">Fast Graph RAG</a></li>
<li><a href="https://trustgraph.ai">TrustGraph</a></li>
</ul>


</section>
</section>

 ]]></description>
  <category>Graphs</category>
  <category>LLM</category>
  <category>GraphAI</category>
  <guid>https://discovery.graphsandnetworks.com/graphAI/graphRAG.html</guid>
  <pubDate>Sat, 30 Nov 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/knowledgeTree.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Label spreading and propagation on graphs.</title>
  <link>https://discovery.graphsandnetworks.com/graphML/LabelPropagation.html</link>
  <description><![CDATA[ 





<p>Label propagation is a semi-supervised machine learning algorithm used for classification and community detection tasks on graphs. It works by propagating labels through the graph based on the structure and connectivity of the nodes:</p>
<ul>
<li>Initialization: Assign initial labels to a subset of nodes (these are the labeled nodes). The rest of the nodes are unlabeled.</li>
<li>Propagation: Iteratively update the labels of the unlabeled nodes based on the labels of their neighbors. This is typically done by majority voting or averaging the labels of neighboring nodes.</li>
<li>Convergence: Repeat the propagation step until the labels stabilize (i.e., no further changes occur) or a maximum number of iterations is reached.</li>
<li>Prediction: Once the algorithm converges, the labels of the unlabeled nodes are used as the final predictions.</li>
</ul>
<p>Label propagation is particularly useful in scenarios where labeled data is scarce but the graph structure is informative. It leverages the assumption that connected nodes are likely to share the same label.</p>
<p>Note that Neo4j has the <a href="https://neo4j.com/docs/graph-data-science/current/algorithms/label-propagation/">label propagation algorithm via GDS</a>. Other vendors, like Memgraph, prefer the <a href="https://arxiv.org/abs/1305.2006">LabelRankT</a> variation with the advantage that it can be <a href="https://memgraph.com/docs/advanced-algorithms/available-algorithms/community_detection_online">executed via a trigger</a>. This triggered detection makes it very useful for streaming data and <strong>realtime anomaly detection</strong>.</p>
<section id="simplistic" class="level2">
<h2 class="anchored" data-anchor-id="simplistic">Simplistic</h2>
<p>The most basic version of label propagation could be like this:</p>
<div id="cell-2" class="cell" data-execution_count="16">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> label_propagation(G, labels, max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>):</span>
<span id="cb1-5">    nodes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(G.nodes())</span>
<span id="cb1-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(max_iter):</span>
<span id="cb1-7">        new_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> labels.copy()</span>
<span id="cb1-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> nodes:</span>
<span id="cb1-9">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> labels[node] <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb1-10">                neighbor_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [labels[neighbor] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> neighbor <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.neighbors(node) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> labels[neighbor] <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb1-11">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> neighbor_labels:</span>
<span id="cb1-12">                    new_labels[node] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(neighbor_labels), key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>neighbor_labels.count)</span>
<span id="cb1-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> new_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> labels:</span>
<span id="cb1-14">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb1-15">        labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> new_labels</span>
<span id="cb1-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> labels</span></code></pre></div></div>
</div>
<p>If we apply this to the karate club graph and assign <code>None</code> to all nodes except nodes 0 and 33:</p>
<div id="cell-4" class="cell" data-execution_count="17">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.karate_club_graph()</span>
<span id="cb2-2">initial_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {node: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes()}</span>
<span id="cb2-3">initial_labels[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Label node 0 as class 0</span></span>
<span id="cb2-4">initial_labels[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Label node 33 as class 1</span></span>
<span id="cb2-5"></span>
<span id="cb2-6">final_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_propagation(G, initial_labels)</span>
<span id="cb2-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(final_labels)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 1, 10: 0, 11: 0, 12: 0, 13: 0, 14: 1, 15: 1, 16: 0, 17: 0, 18: 1, 19: 0, 20: 1, 21: 0, 22: 1, 23: 1, 24: 0, 25: 0, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 0, 32: 1, 33: 1}</code></pre>
</div>
</div>
<p>This can be visualized for instance with PyVis:</p>
<div id="cell-6" class="cell" data-execution_count="18">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyvis <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> network <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> net</span>
<span id="cb4-2">g<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>net.Network(notebook<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, cdn_resources<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'in_line'</span>)</span>
<span id="cb4-3">g.from_nx(G)</span>
<span id="cb4-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> set_options(net_g):</span>
<span id="cb4-5">    options <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">var options = {</span></span>
<span id="cb4-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">   "configure": {</span></span>
<span id="cb4-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">        "enabled": true</span></span>
<span id="cb4-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">   },</span></span>
<span id="cb4-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  "edges": {</span></span>
<span id="cb4-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "color": {</span></span>
<span id="cb4-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      "inherit": false</span></span>
<span id="cb4-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    },</span></span>
<span id="cb4-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "smooth": true</span></span>
<span id="cb4-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  },</span></span>
<span id="cb4-16"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  "physics": {</span></span>
<span id="cb4-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "barnesHut": {</span></span>
<span id="cb4-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      "gravitationalConstant": -3000.0</span></span>
<span id="cb4-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    },</span></span>
<span id="cb4-20"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "minVelocity": 0.75</span></span>
<span id="cb4-21"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  }</span></span>
<span id="cb4-22"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb4-23"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-24">    net_g.set_options(options) </span>
<span id="cb4-25"></span>
<span id="cb4-26"></span>
<span id="cb4-27">set_options(g)</span>
<span id="cb4-28">g.show(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"simple.html"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>simple.html</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="18">

        <iframe width="100%" height="600px" src="simple.html" frameborder="0" allowfullscreen=""></iframe>
        
</div>
</div>
<p>Enlarging the two initially labelled nodes (0 and 33) and coloring every node by its final label gives:</p>
<div id="cell-8" class="cell" data-execution_count="19">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span></span>
<span id="cb6-2">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb6-3">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span></span>
<span id="cb6-4">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb6-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> final_labels:</span>
<span id="cb6-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> final_labels[i]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb6-7">        g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span></span>
<span id="cb6-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb6-9">        g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span></span>
<span id="cb6-10">    g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i)</span>
<span id="cb6-11">g.show(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"example.html"</span>)  </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>example.html</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="19">

        <iframe width="100%" height="600px" src="example.html" frameborder="0" allowfullscreen=""></iframe>
        
</div>
</div>
<p>You can see that this simplistic propagation algorithm isn’t very good: some nodes seem to be isolated.</p>
</section>
<section id="label-propagation" class="level2">
<h2 class="anchored" data-anchor-id="label-propagation">Label propagation</h2>
<p>A more rigorous approach can be seen in <a href="http://pages.cs.wisc.edu/~jerryzhu/pub/CMU-CALD-02-107.pdf">Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002</a> and the algorithm there can be summed up as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Barray%7D%7Bl%7D%7B%0A%5Cmathbf%7BW%7D%20%5Ctext%20%7B:%20adjacency%20matrix%20of%20the%20graph%7D%20%5C%5C%0A%5Ctext%20%7B%20Compute%20the%20diagonal%20degree%20matrix%20%7D%20%5Cmathbf%7BD%7D%20%5Ctext%20%7B%20by%20%7D%20%5Cmathbf%7BD%7D_%7Bi%20i%7D%20%5Cleftarrow%20%5Csum_%7Bj%7D%20W_%7Bi%20j%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Initialize%20%7D%20%5Chat%7BY%7D%5E%7B(0)%7D%20%5Cleftarrow%5Cleft(y_%7B1%7D,%20%5Cldots,%20y_%7Bl%7D,%200,0,%20%5Cldots,%200%5Cright)%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Iterate%20%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%201.%20%7D%20%5Chat%7BY%7D%5E%7B(t+1)%7D%20%5Cleftarrow%20%5Cmathbf%7BD%7D%5E%7B-1%7D%20%5Cmathbf%7BW%7D%20%5Chat%7BY%7D%5E%7B(t)%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%202.%20%7D%20%5Chat%7BY%7D_%7Bl%7D%5E%7B(t+1)%7D%20%5Cleftarrow%20Y_%7Bl%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%20until%20convergence%20to%20%7D%20%5Chat%7BY%7D%5E%7B(%5Cinfty)%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Label%20point%20%7D%20x_%7Bi%7D%20%5Ctext%20%7B%20by%20the%20sign%20of%20%7D%20%5Chat%7By%7D_%7Bi%7D%5E%7B(%5Cinfty)%7D%7D%5Cend%7Barray%7D%0A"></p>
<p>Here the graph starts with only a few labelled nodes, and the iteration assigns labels to the remaining nodes via diffusion.</p>
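<p>To make the iteration concrete, here is a minimal NumPy sketch on a hypothetical four-node path graph (not the graph used in this post), with the two end nodes labelled +1 and -1. The graph, the variable names and the fixed iteration count are assumptions made purely for illustration.</p>

```python
import numpy as np

# Hypothetical toy graph: a path 0 - 1 - 2 - 3
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Known labels: node 0 is +1, node 3 is -1, the others start at 0
y = np.array([1.0, 0.0, 0.0, -1.0])
labeled = np.array([True, False, False, True])

# D^-1 W: divide every row of W by the degree of its node
P = W / W.sum(axis=1, keepdims=True)

Y = y.copy()
for _ in range(100):          # fixed iteration count instead of a tolerance test
    Y = P @ Y                 # step 1: diffuse the current estimates
    Y[labeled] = y[labeled]   # step 2: clamp the known labels

predicted = np.sign(Y)        # label each node by the sign of its score
print(Y, predicted)
```

<p>On this graph the scores settle at values that interpolate linearly between the two clamped ends, so each inner node takes the sign of its nearest labelled neighbour.</p>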
<p>Below you will also find a label spreading algorithm, so we’ll base both algorithms on the same base class:</p>
<div id="cell-11" class="cell" data-execution_count="20">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> abc <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ABC, abstractmethod</span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> BaseLabelPropagation(ABC):</span>
<span id="cb8-5">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, adj_matrix):</span>
<span id="cb8-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.norm_adj_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._normalize(adj_matrix)</span>
<span id="cb8-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_nodes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> adj_matrix.size(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> </span>
<span id="cb8-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb8-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labeled_mask <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb8-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb8-12"></span>
<span id="cb8-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@staticmethod</span></span>
<span id="cb8-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@abstractmethod</span></span>
<span id="cb8-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _normalize(adj_matrix):</span>
<span id="cb8-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">NotImplementedError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_normalize must be implemented"</span>)</span>
<span id="cb8-17"></span>
<span id="cb8-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@abstractmethod</span></span>
<span id="cb8-19">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _propagate(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb8-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">NotImplementedError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_propagate must be implemented"</span>)</span>
<span id="cb8-21"></span>
<span id="cb8-22">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _one_hot_encode(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, labels):</span>
<span id="cb8-23">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the number of classes</span></span>
<span id="cb8-24">        classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.unique(labels)</span>
<span id="cb8-25">        classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> classes[classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb8-26">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_classes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> classes.size(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-27"></span>
<span id="cb8-28">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># One-hot encode labeled data instances and zero rows corresponding to unlabeled instances</span></span>
<span id="cb8-29">        unlabeled_mask <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-30">        labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> labels.clone()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># defensive copying</span></span>
<span id="cb8-31">        labels[unlabeled_mask] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb8-32">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.zeros((<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_nodes, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_classes), dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>)</span>
<span id="cb8-33">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels.scatter(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, labels.unsqueeze(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-34">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels[unlabeled_mask, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb8-35"></span>
<span id="cb8-36">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labeled_mask <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>unlabeled_mask</span>
<span id="cb8-37"></span>
<span id="cb8-38">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, labels, max_iter, tol):</span>
<span id="cb8-39">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Fits a semi-supervised learning label propagation model.</span></span>
<span id="cb8-40"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        </span></span>
<span id="cb8-41"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        labels: torch.LongTensor</span></span>
<span id="cb8-42"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            Tensor of size n_nodes indicating the class number of each node.</span></span>
<span id="cb8-43"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            Unlabeled nodes are denoted with -1.</span></span>
<span id="cb8-44"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        max_iter: int</span></span>
<span id="cb8-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            Maximum number of iterations allowed.</span></span>
<span id="cb8-46"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        tol: float</span></span>
<span id="cb8-47"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            Convergence tolerance: threshold to consider the system at steady state.</span></span>
<span id="cb8-48"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb8-49">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._one_hot_encode(labels)</span>
<span id="cb8-50"></span>
<span id="cb8-51">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels.clone()</span>
<span id="cb8-52">        prev_predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.zeros((<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_nodes, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_classes), dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>)</span>
<span id="cb8-53"></span>
<span id="cb8-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(max_iter):</span>
<span id="cb8-55">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Stop iterations if the system is considered at a steady state</span></span>
<span id="cb8-56">            variation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> prev_predictions).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().item()</span>
<span id="cb8-57">            </span>
<span id="cb8-58">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> variation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> tol:</span>
<span id="cb8-59">                <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The method stopped after </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> iterations, variation=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>variation<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">."</span>)</span>
<span id="cb8-60">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb8-61"></span>
<span id="cb8-62">            prev_predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions</span>
<span id="cb8-63">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._propagate()</span>
<span id="cb8-64"></span>
<span id="cb8-65">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb8-66">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions</span>
<span id="cb8-67"></span>
<span id="cb8-68">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict_classes(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb8-69">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).indices</span></code></pre></div></div>
</div>
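<p>The effect of <code>_one_hot_encode</code> is easiest to see on a small label vector. The sketch below mirrors its logic in plain NumPy rather than torch. The helper name <code>one_hot_with_unlabeled</code> is hypothetical, and like the method above it assumes the classes are numbered 0, 1, 2, ... with -1 marking unlabeled nodes.</p>

```python
import numpy as np

def one_hot_with_unlabeled(labels):
    """Hypothetical helper: one-hot rows for labeled nodes, zero rows for -1."""
    labels = np.asarray(labels)
    unlabeled_mask = labels == -1
    labeled_mask = ~unlabeled_mask

    # Number of distinct classes among the labeled nodes
    n_classes = np.unique(labels[labeled_mask]).size

    one_hot = np.zeros((labels.size, n_classes))
    # Set a single 1 per labeled row, in the column of its class
    one_hot[np.flatnonzero(labeled_mask), labels[labeled_mask]] = 1.0
    return one_hot, labeled_mask

one_hot, labeled_mask = one_hot_with_unlabeled([0, -1, 1, -1])
print(one_hot)
print(labeled_mask)
```

<p>Rows of unlabeled nodes stay at zero, which is the initial state that the propagation then fills in.</p>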
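<p>With this base class, the Zhu and Ghahramani algorithm reduces to a row normalization and a matrix product followed by clamping the known labels:</p>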
<div id="cell-13" class="cell" data-execution_count="21">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> LabelPropagation(BaseLabelPropagation):</span>
<span id="cb9-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, adj_matrix):</span>
<span id="cb9-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(adj_matrix)</span>
<span id="cb9-4"></span>
<span id="cb9-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@staticmethod</span></span>
<span id="cb9-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _normalize(adj_matrix):</span>
<span id="cb9-7">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Computes D^-1 * W"""</span></span>
<span id="cb9-8">        degs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> adj_matrix.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb9-9">        degs[degs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># avoid division by zero for isolated nodes</span></span>
<span id="cb9-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> adj_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> degs[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb9-11"></span>
<span id="cb9-12">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _propagate(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb9-13">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.matmul(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.norm_adj_matrix, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions)</span>
<span id="cb9-14"></span>
<span id="cb9-15">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Put back already known labels</span></span>
<span id="cb9-16">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labeled_mask] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labeled_mask]</span>
<span id="cb9-17"></span>
<span id="cb9-18">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, labels, max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, tol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-3</span>):</span>
<span id="cb9-19">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().fit(labels, max_iter, tol)</span></code></pre></div></div>
</div>
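<p>As a sanity check on the clamped iteration, its fixed point on the unlabeled nodes can also be found by solving a linear system. Splitting P = D^-1 W into labeled (l) and unlabeled (u) blocks, the steady state satisfies Y_u = P_uu Y_u + P_ul Y_l. The NumPy sketch below compares both routes on a hypothetical four-node path graph. The graph and all names are illustrative assumptions, and the solve only works when every unlabeled node is connected to some labeled node.</p>

```python
import numpy as np

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 with the end nodes labeled
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
y = np.array([1.0, 0.0, 0.0, -1.0])
labeled = np.array([True, False, False, True])
unlabeled = ~labeled

P = W / W.sum(axis=1, keepdims=True)   # D^-1 W

# Route 1: iterate and clamp, as in the propagation step
Y = y.copy()
for _ in range(100):
    Y = P @ Y
    Y[labeled] = y[labeled]

# Route 2: solve (I - P_uu) Y_u = P_ul Y_l directly
P_uu = P[np.ix_(unlabeled, unlabeled)]
P_ul = P[np.ix_(unlabeled, labeled)]
Y_u = np.linalg.solve(np.eye(int(unlabeled.sum())) - P_uu, P_ul @ y[labeled])

print(np.allclose(Y[unlabeled], Y_u))
```

<p>For large graphs the iteration is usually the more practical route, since it only needs matrix-vector products.</p>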
</section>
<section id="label-spreading" class="level2">
<h2 class="anchored" data-anchor-id="label-spreading">Label spreading</h2>
<p>The Label Spreading algorithm from <a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.3219">Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schoelkopf. Learning with local and global consistency (2004)</a> is as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Barray%7D%7Bl%7D%7B%0A%5Cmathbf%7BW%7D%20%5Ctext%20%7B:%20adjacency%20matrix%20of%20the%20graph%7D%20%5C%5C%20%5Ctext%20%7B%20Compute%20the%20diagonal%20degree%20matrix%20%7D%20%5Cmathbf%7BD%7D%20%5Ctext%20%7B%20by%20%7D%20%5Cmathbf%7BD%7D_%7Bi%20i%7D%20%5Cleftarrow%20%5Csum_%7Bj%7D%20W_%7Bi%20j%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Compute%20the%20normalized%20graph%20Laplacian%20%7D%20%5C%5C%20%5Cmathcal%7BL%7D%20%5Cleftarrow%20%5Cmathbf%7BD%7D%5E%7B-1%20/%202%7D%20%5Cmathbf%7BW%7D%20%5Cmathbf%7BD%7D%5E%7B-1%20/%202%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Initialize%20%7D%20%5Chat%7BY%7D%5E%7B(0)%7D%20%5Cleftarrow%5Cleft(y_%7B1%7D,%20%5Cldots,%20y_%7Bl%7D,%200,0,%20%5Cldots,%200%5Cright)%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Choose%20a%20parameter%20%7D%20%5Calpha%20%5Cin%5B0,1)%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Iterate%20%7D%20%5Chat%7BY%7D%5E%7B(t+1)%7D%20%5Cleftarrow%20%5Calpha%20%5Cmathcal%7BL%7D%20%5Chat%7BY%7D%5E%7B(t)%7D+(1-%5Calpha)%20%5Chat%7BY%7D%5E%7B(0)%7D%20%5Ctext%20%7B%20until%20convergence%20to%20%7D%20%5Chat%7BY%7D%5E%7B(%5Cinfty)%7D%7D%20%5C%5C%20%7B%5Ctext%20%7B%20Label%20point%20%7D%20x_%7Bi%7D%20%5Ctext%20%7B%20by%20the%20sign%20of%20%7D%20%5Chat%7By%7D_%7Bi%7D%5E%7B(%5Cinfty)%7D%7D%0A%5Cend%7Barray%7D%0A"></p>
<p>Note that the normalized <a href="https://en.wikipedia.org/wiki/Laplacian_matrix">Laplacian</a> is used here rather than the row-normalized update <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7BD%7D%5E%7B-1%7D%20%5Cmathbf%7BW%7D%20%5Chat%7BY%7D%5E%7B(t)%7D"> of label propagation above. Because of this symmetric normalization, the iteration resembles <a href="https://en.wikipedia.org/wiki/Diffusion_equation">the continuous diffusion equation</a> more closely.</p>
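<p>The difference between the two normalizations can be checked on a toy graph. The following is a minimal sketch in plain Python (no PyTorch, to keep it dependency-free); the three-node path graph and all variable names are illustrative assumptions, not part of the notebook:</p>

```python
import math

# Adjacency matrix of a small path graph 0 - 1 - 2 (illustrative example).
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
n = len(W)
degrees = [sum(row) for row in W]

# Row normalization D^-1 W: every row sums to 1.
random_walk = [[W[i][j] / degrees[i] for j in range(n)] for i in range(n)]

# Symmetric normalization D^-1/2 W D^-1/2, as computed by _normalize in LabelSpreading.
symmetric = [
    [W[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
    for i in range(n)
]


def is_symmetric_matrix(matrix):
    """Check matrix[i][j] == matrix[j][i] up to floating point tolerance."""
    size = len(matrix)
    return all(
        math.isclose(matrix[i][j], matrix[j][i])
        for i in range(size)
        for j in range(size)
    )


row_sums = [sum(row) for row in random_walk]
is_symmetric = is_symmetric_matrix(symmetric)
rw_is_symmetric = is_symmetric_matrix(random_walk)

print("row sums of D^-1 W:", row_sums)
print("D^-1/2 W D^-1/2 symmetric:", is_symmetric)
print("D^-1 W symmetric:", rw_is_symmetric)
```

<p>The row-normalized matrix has rows that sum to one but is in general not symmetric, whereas the symmetric normalization preserves the symmetry of the adjacency matrix.</p>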
<div id="cell-15" class="cell" data-execution_count="22">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> LabelSpreading(BaseLabelPropagation):</span>
<span id="cb10-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, adj_matrix):</span>
<span id="cb10-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(adj_matrix)</span>
<span id="cb10-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb10-5"></span>
<span id="cb10-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@staticmethod</span></span>
<span id="cb10-7">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _normalize(adj_matrix):</span>
<span id="cb10-8">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Computes D^-1/2 * W * D^-1/2"""</span></span>
<span id="cb10-9">        degs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> adj_matrix.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb10-10">        norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">pow</span>(degs, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb10-11">        norm[torch.isinf(norm)] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> adj_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> norm[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> norm[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, :]</span>
<span id="cb10-13"></span>
<span id="cb10-14">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _propagate(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb10-15">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb10-16">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> torch.matmul(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.norm_adj_matrix, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.predictions)</span>
<span id="cb10-17">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.one_hot_labels</span>
<span id="cb10-18">        )</span>
<span id="cb10-19">    </span>
<span id="cb10-20">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, labels, max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3000</span>, tol<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-5</span>, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>):</span>
<span id="cb10-21">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb10-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Parameters</span></span>
<span id="cb10-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        ----------</span></span>
<span id="cb10-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        alpha: float</span></span>
<span id="cb10-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            Clamping factor.</span></span>
<span id="cb10-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb10-27">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha</span>
<span id="cb10-28">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().fit(labels, max_iter, tol)</span></code></pre></div></div>
</div>
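<p>To see the update rule at work without the base class, here is a minimal, dependency-free sketch of the same iteration in plain Python. The function name <code>label_spreading</code>, the <code>seeds</code> dictionary and the six-node path graph are assumptions made for illustration; the <code>LabelSpreading</code> class above remains the actual implementation:</p>

```python
import math


def label_spreading(W, seeds, n_classes, alpha=0.8, max_iter=1000, tol=1e-9):
    """Plain-Python label spreading; `seeds` maps node index to class index."""
    n = len(W)
    degrees = [sum(row) for row in W]
    # Symmetric normalization S = D^-1/2 W D^-1/2 (isolated nodes keep a zero row).
    S = [
        [W[i][j] / math.sqrt(degrees[i] * degrees[j]) if W[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Y0: one-hot rows for labeled nodes, all zeros for unlabeled nodes.
    Y0 = [
        [1.0 if seeds.get(i) == c else 0.0 for c in range(n_classes)]
        for i in range(n)
    ]
    Y = [row[:] for row in Y0]
    for _ in range(max_iter):
        # Y(t+1) = alpha * S * Y(t) + (1 - alpha) * Y(0)
        Y_next = [
            [
                alpha * sum(S[i][k] * Y[k][c] for k in range(n))
                + (1 - alpha) * Y0[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
        variation = sum(
            abs(Y_next[i][c] - Y[i][c])
            for i in range(n)
            for c in range(n_classes)
        )
        Y = Y_next
        if variation < tol:
            break
    return Y


# Path graph 0 - 1 - 2 - 3 - 4 - 5 with one labeled node at each end (made-up example).
n = 6
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
Y = label_spreading(W, seeds={0: 0, 5: 1}, n_classes=2)

# Predicted class = index of the largest score in each row.
predicted = [max(range(2), key=lambda c: row[c]) for row in Y]
print(predicted)
```

<p>On this path each node ends up with the class of the nearer seed, i.e. <code>[0, 0, 0, 1, 1, 1]</code>.</p>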
</section>
<section id="testing-models-on-synthetic-data" class="level2">
<h2 class="anchored" data-anchor-id="testing-models-on-synthetic-data">Testing models on synthetic data</h2>
<p>Let’s apply both algorithms to the karate club graph, labeling only node 0 (class 0) and node 33 (class 1):</p>
<div id="cell-18" class="cell" data-execution_count="23">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.karate_club_graph()</span>
<span id="cb11-2">adj_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.adjacency_matrix(G).toarray()</span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create labels</span></span>
<span id="cb11-4">labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.full(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(G.nodes), <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>)</span>
<span id="cb11-5">labels[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb11-6">labels[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb11-7"></span>
<span id="cb11-8">adj_matrix_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.FloatTensor(adj_matrix)</span>
<span id="cb11-9">labels_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.LongTensor(labels)</span>
<span id="cb11-10"></span>
<span id="cb11-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Learn with Label Propagation</span></span>
<span id="cb11-12">label_propagation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LabelPropagation(adj_matrix_t)</span>
<span id="cb11-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Label Propagation: "</span>, end<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb11-14">label_propagation.fit(labels_t)</span>
<span id="cb11-15">label_propagation_output_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_propagation.predict_classes()</span>
<span id="cb11-16"></span>
<span id="cb11-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Learn with Label Spreading</span></span>
<span id="cb11-18">label_spreading <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LabelSpreading(adj_matrix_t)</span>
<span id="cb11-19"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Label Spreading: "</span>, end<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb11-20">label_spreading.fit(labels_t, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb11-21">label_spreading_output_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_spreading.predict_classes()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Label Propagation: The method stopped after 39 iterations, variation=0.0008.
Label Spreading: The method stopped after 31 iterations, variation=0.0000.</code></pre>
</div>
</div>
<p>The result of label propagation can be rendered as follows, with class 0 in green and class 1 in red:</p>
<div id="cell-20" class="cell" data-execution_count="24">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">prop_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_propagation_output_labels.numpy()</span>
<span id="cb13-2">g<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>net.Network(notebook<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, cdn_resources<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'in_line'</span>)</span>
<span id="cb13-3">g.from_nx(G)</span>
<span id="cb13-4">set_options(g)</span>
<span id="cb13-5">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span></span>
<span id="cb13-6">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb13-7">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span></span>
<span id="cb13-8">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb13-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i,v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(prop_labels):</span>
<span id="cb13-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prop_labels[i]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb13-11">        g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span></span>
<span id="cb13-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb13-13">        g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span></span>
<span id="cb13-14">    g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i)</span>
<span id="cb13-15">g.show(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"propagation.html"</span>) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>propagation.html</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="24">

        <iframe width="100%" height="600px" src="propagation.html" frameborder="0" allowfullscreen=""></iframe>
        
</div>
</div>
<p>Label spreading gives the following result:</p>
<div id="cell-22" class="cell" data-execution_count="25">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">spread_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_spreading_output_labels.numpy()</span>
<span id="cb15-2">g<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>net.Network(notebook<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, cdn_resources<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'in_line'</span>)</span>
<span id="cb15-3">g.from_nx(G)</span>
<span id="cb15-4">set_options(g)</span>
<span id="cb15-5">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span></span>
<span id="cb15-6">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb15-7">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span></span>
<span id="cb15-8">g.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb15-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i,v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(spread_labels):</span>
<span id="cb15-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> spread_labels[i]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb15-11">        g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span></span>
<span id="cb15-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb15-13">        g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span></span>
<span id="cb15-14">    g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i)</span>
<span id="cb15-15">g.show(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"spreading.html"</span>) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>spreading.html</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="25">

        <iframe width="100%" height="600px" src="spreading.html" frameborder="0" allowfullscreen=""></iframe>
        
</div>
</div>
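<p>Rather than comparing the two renderings by eye, the outputs can also be compared numerically. The helper below is a hypothetical sketch (the name <code>compare_labelings</code> and the example lists are assumptions); in the notebook its inputs would be <code>prop_labels.tolist()</code> and <code>spread_labels.tolist()</code>:</p>

```python
from collections import Counter


def compare_labelings(labels_a, labels_b):
    """Return the share of nodes with identical labels and the indices that differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same nodes")
    differing = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
    agreement = 1 - len(differing) / len(labels_a)
    return agreement, differing


# Made-up outputs for six nodes, used only to illustrate the helper.
example_prop = [0, 0, 0, 1, 1, 1]
example_spread = [0, 0, 1, 1, 1, 1]

agreement, differing = compare_labelings(example_prop, example_spread)
print("agreement:", agreement)
print("nodes labeled differently:", differing)

# Class sizes under each method.
print(Counter(example_prop), Counter(example_spread))
```

<p>The list of differing nodes points directly at the vertices whose color changes between the two renderings.</p>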
<p>If you apply these algorithms to <a href="https://mathworld.wolfram.com/CavemanGraph.html">the caveman graph</a>, the propagation reflects the community topology:</p>
<div id="cell-24" class="cell" data-execution_count="26">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb17-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb17-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb17-4"></span>
<span id="cb17-5"></span>
<span id="cb17-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create caveman graph</span></span>
<span id="cb17-7">n_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb17-8">size_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb17-9">caveman_graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.connected_caveman_graph(n_cliques, size_cliques)</span>
<span id="cb17-10">adj_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.adjacency_matrix(caveman_graph).toarray()</span>
<span id="cb17-11"></span>
<span id="cb17-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create labels</span></span>
<span id="cb17-13">labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.full(n_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> size_cliques, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.</span>)</span>
<span id="cb17-14"></span>
<span id="cb17-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Only one node per clique is labeled. Each clique belongs to a different class.</span></span>
<span id="cb17-16">labels[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb17-17">labels[size_cliques] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb17-18">labels[size_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb17-19">labels[size_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb17-20"></span>
<span id="cb17-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create input tensors</span></span>
<span id="cb17-22">adj_matrix_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.FloatTensor(adj_matrix)</span>
<span id="cb17-23">labels_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.LongTensor(labels)</span>
<span id="cb17-24"></span>
<span id="cb17-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Learn with Label Propagation</span></span>
<span id="cb17-26">label_propagation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LabelPropagation(adj_matrix_t)</span>
<span id="cb17-27"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Label Propagation: "</span>, end<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb17-28">label_propagation.fit(labels_t)</span>
<span id="cb17-29">label_propagation_output_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_propagation.predict_classes()</span>
<span id="cb17-30"></span>
<span id="cb17-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Learn with Label Spreading</span></span>
<span id="cb17-32">label_spreading <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LabelSpreading(adj_matrix_t)</span>
<span id="cb17-33"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Label Spreading: "</span>, end<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb17-34">label_spreading.fit(labels_t, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb17-35">label_spreading_output_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_spreading.predict_classes()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Label Propagation: The method stopped after 73 iterations, variation=0.0010.
Label Spreading: The method stopped after 39 iterations, variation=0.0000.</code></pre>
</div>
</div>
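<p>Whether the result really follows the communities can be checked per clique. The following sketch assumes, as the labeling code above does, that clique <code>c</code> occupies the node indices <code>c * size_cliques</code> up to <code>(c + 1) * size_cliques - 1</code>; the function name <code>clique_purity</code> and the example predictions are hypothetical:</p>

```python
from collections import Counter


def clique_purity(predicted, n_cliques, size_cliques):
    """For each clique, return (majority label, share of the clique carrying it)."""
    result = []
    for c in range(n_cliques):
        members = predicted[c * size_cliques:(c + 1) * size_cliques]
        label, count = Counter(members).most_common(1)[0]
        result.append((label, count / size_cliques))
    return result


# Made-up predictions for 2 cliques of 4 nodes; one node of the second clique is off.
example = [0, 0, 0, 0, 1, 1, 0, 1]
purity = clique_purity(example, n_cliques=2, size_cliques=4)
print(purity)
```

<p>A share of 1.0 for every clique, each with a different majority label, means the labels match the community structure exactly. In the notebook the first argument would be <code>label_spreading_output_labels.numpy().tolist()</code>.</p>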
<div id="cell-25" class="cell" data-execution_count="27">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">spread_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> label_spreading_output_labels.numpy()</span>
<span id="cb19-2">g<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>net.Network(notebook<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, cdn_resources<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'in_line'</span>)</span>
<span id="cb19-3">g.from_nx(caveman_graph)</span>
<span id="cb19-4">set_options(g)</span>
<span id="cb19-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Enlarge the four labeled seed nodes (one per clique); colors are assigned in the loop below</span></span>
<span id="cb19-6">seeds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, size_cliques, size_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, size_cliques <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span>
<span id="cb19-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> seed <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> seeds:</span>
<span id="cb19-8">    g.nodes[seed][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"size"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb19-9">color_map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff0000"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#00ff00"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#0000ff"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff00ff"</span>]</span>
<span id="cb19-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i,v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(spread_labels):</span>
<span id="cb19-11">    g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"color"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>color_map[spread_labels[i]]</span>
<span id="cb19-12">    g.nodes[i][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i)</span>
<span id="cb19-13">g.show(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"caveman.html"</span>) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>caveman.html</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="27">

        <iframe width="100%" height="600px" src="caveman.html" frameborder="0" allowfullscreen=""></iframe>
        
</div>
</div>
<p>Aside from the three anomalies, this indeed gives a good result.</p>


</section>

 ]]></description>
  <category>Graphs</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/LabelPropagation.html</guid>
  <pubDate>Tue, 19 Nov 2024 23:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/LabelPropagation.png" medium="image" type="image/png" height="100" width="144"/>
</item>
<item>
  <title>Graph BI</title>
  <link>https://discovery.graphsandnetworks.com/graphViz/qwiery.html</link>
  <description><![CDATA[ 





<p><a href="https://qwiery.com" target="_blank"><img src="https://discovery.graphsandnetworks.com/images/Qwiery0.png" class="img-fluid"></a></p>
<p><a href="https://graphsandnetworks.com/contact" class="btn btn-primary" data-toggle="tooltip" title="Contact us for more info or consulting options" target="_blank">Contact us</a></p>
<p><a href="https://qwiery.com" target="_blank">Qwiery</a> is a Graph Application Framework, an open source toolbox to create graph-driven apps. Being able to visualize and interpret data is crucial, and <strong>Qwiery Dashboards</strong> offer an intuitive, graphical way to present your data, making it easier to understand and analyze.</p>
<p><strong>Overview:</strong></p>
<ul>
<li><strong>Customizable Widgets</strong>: Qwiery Dashboards allow you to arrange various types of widgets, such as charts, tables, and maps, in a grid format. This flexibility ensures that you can tailor the dashboard to meet your specific needs.</li>
<li><strong>Plugin Support</strong>: Extend the functionality of Qwiery by using plugins. Whether you need additional features or want to integrate with other tools, creating and using plugins is straightforward and enhances the overall capability of your dashboards.</li>
<li><strong>Database Compatibility</strong>: Qwiery supports your favorite databases, ensuring seamless storage and access to your graphs. This compatibility makes it a versatile choice for different data sources and use cases.</li>
<li><strong>Types of Graph Databases</strong>: There are various types, including <strong>property graphs</strong>, <strong>triple stores</strong>, and others with unique query languages like <strong>GSQL</strong> and <strong>TypeQL</strong>.</li>
<li><strong>Qwiery’s Role</strong>: Qwiery provides an <strong>API</strong> that allows uniform interaction with different graph databases, including a JSON in-memory graph storage for prototyping.</li>
<li><strong>Flexibility</strong>: Qwiery supports multiple query languages and databases, but does not natively support some exotic vendors like TigerGraph or TypeDB without additional implementation. You can switch between local and real graph stores, use plugins and adapters to customize functionality, and enforce consistency with optional schemas.</li>
<li><strong>Advantages and Disadvantages</strong>: Qwiery simplifies app creation by abstracting backend storage details but may limit access to vendor-specific features.</li>
<li><strong>Limitations</strong>: It is not a replacement for high-performance graph databases like Neo4j and does not support advanced query languages like Cypher, SPARQL, or Gremlin.</li>
<li><strong>Open Source</strong>: Qwiery is open source, allowing for modification, distribution, and community contributions.</li>
</ul>
<div class="mosaic">
<p><a href="../images/Qwiery1.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery1.png" class="mosaic-image"> </a> <a href="../images/Qwiery2.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery2.png" class="mosaic-image"> </a> <a href="../images/Qwiery3.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery3.png" class="mosaic-image"> </a> <a href="../images/Qwiery4.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery4.png" class="mosaic-image"> </a> <a href="../images/Qwiery5.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery5.png" class="mosaic-image"> </a> <a href="../images/Qwiery6.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery6.png" class="mosaic-image"> </a> <a href="../images/Qwiery7.png" target="_blank"> <img src="https://discovery.graphsandnetworks.com/images/Qwiery7.png" class="mosaic-image"> </a></p>
</div>



 ]]></description>
  <category>Graphs</category>
  <category>yFiles</category>
  <guid>https://discovery.graphsandnetworks.com/graphViz/qwiery.html</guid>
  <pubDate>Wed, 21 Aug 2024 22:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/Qwiery5.png" medium="image" type="image/png" height="71" width="144"/>
</item>
<item>
  <title>Keras LSTM setup</title>
  <link>https://discovery.graphsandnetworks.com/graphML/lstm.html</link>
  <description><![CDATA[ 





<section id="keras-lstm" class="level1">
<h1>Keras LSTM</h1>
<p><a href="https://colab.research.google.com/drive/12BJVod8SSbLdUUf0K22v0RYlkuqzReL8" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p>
<p><a href="https://keras.io">Keras</a> has been around for a long time, and in the early days I used it to assemble networks. That was before PyTorch and TensorFlow became what they are today (this was written in September 2024). The ML scene has evolved tremendously since. As a result, code that used to work no longer does, and if you want to get anywhere you need to revisit all the basics you thought you had at your fingertips.</p>
<p>Below is the basic LSTM setup and I record it here in order to use it as a stepping stone in the next project. In this way I don’t have to rewire and scan documentation all over again next time.</p>
<ul>
<li>Torch: 2.2.1+cu121</li>
<li>Keras: 3.5.0</li>
<li>Python: 3.10.10</li>
<li><a href="https://lightning.ai/" target="_blank">Runs on Lightning.AI</a>, such a wonderful platform.</li>
</ul>
<div id="cell-2" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="15648eb4-1282-479a-94b4-d82efbd2efa4" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install keras</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Requirement already satisfied: keras in /usr/local/lib/python3.10/dist-packages (3.4.1)
Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras) (1.4.0)
...</code></pre>
</div>
</div>
<p>The following has to run before anything else to define the backend, in our case PyTorch. You can use TensorFlow if you prefer.</p>
<p>The MPS fallback is only needed if you run on Apple Silicon; it tells PyTorch to fall back to the CPU if an operation is not implemented for MPS.</p>
<div id="cell-4" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb3-4"></span>
<span id="cb3-5">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"KERAS_BACKEND"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"torch"</span></span>
<span id="cb3-6">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PYTORCH_ENABLE_MPS_FALLBACK"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span></span>
<span id="cb3-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> keras</span></code></pre></div></div>
</div>
<p>Keras has direct access to IMDB:</p>
<div id="cell-6" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="924cc816-a986-40b1-9a32-7f5e5e1baa27" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sequence</span>
<span id="cb4-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.models <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Sequential</span>
<span id="cb4-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.layers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Dense, Dropout, Embedding, LSTM</span>
<span id="cb4-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.callbacks <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> EarlyStopping</span>
<span id="cb4-7"></span>
<span id="cb4-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> imdb</span>
<span id="cb4-9"></span>
<span id="cb4-10">n_words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb4-11">(X_train, y_train), (X_test, y_test) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imdb.load_data(num_words<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n_words)</span>
<span id="cb4-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Train seq: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_train)))</span>
<span id="cb4-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Test seq: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_test)))</span>
<span id="cb4-14"></span>
<span id="cb4-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Train example: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span>
<span id="cb4-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Test example: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(X_test[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<div class="ansi-escaped-output">
<pre>Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz

<span class="ansi-bold">17464789/17464789</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">0s</span> 0us/step

Train seq: 25000

Test seq: 25000

Train example: [1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]

Test example: [1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 2, 394, 354, 4, 123, 9, 2, 2, 2, 10, 10, 13, 92, 124, 89, 488, 2, 100, 28, 2, 14, 31, 23, 27, 2, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 2, 38, 32, 25, 2, 451, 202, 14, 6, 717]
</pre>
</div>
</div>
</div>
<div id="cell-7" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="b7c1917c-2112-47a4-a17b-ee6d1a27d950" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Vector size: "</span>,<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Vector size:  550</code></pre>
</div>
</div>
<p>The vector size is the length of the sequence, and it is unrelated to the range of values within the vector. That range (the vocabulary size, here <code>n_words</code>) is what is somewhat confusingly called the ‘input dimension’ <code>input_dim</code> of the <code>Embedding</code> layer below.</p>
<p>How the length of the tensors affects outcome and accuracy is an interesting question on its own. For now, we will pad (or truncate) the vectors to a length of 200:</p>
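<p>To make the distinction concrete, here is a minimal NumPy sketch of what an embedding lookup does. It is not the Keras implementation: the random <code>table</code> stands in for the trainable weights, and the names <code>embed_dim</code>, <code>short_review</code> and <code>long_review</code> are made up for the illustration. The number of rows of the table is the vocabulary size, while the sequence length only determines how many rows are fetched.</p>

```python
import numpy as np

n_words = 1000      # vocabulary size: token ids lie in the range 0..n_words-1
embed_dim = 50      # length of the vector each token id is mapped to

# Stand-in for the trainable weight matrix of an embedding layer (random here).
rng = np.random.default_rng(0)
table = rng.normal(size=(n_words, embed_dim))

# Two reviews of different lengths, using ids taken from the examples above.
short_review = np.array([1, 591, 202, 14, 31, 6, 717])
long_review = np.array([1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458])

# The lookup is plain row indexing: one row of the table per token.
short_vectors = table[short_review]
long_vectors = table[long_review]

print(short_vectors.shape)  # (7, 50)
print(long_vectors.shape)   # (11, 50)
```

<p>The table size never changes with the review length, which is why the vocabulary size and the sequence length can be chosen independently.</p>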
<div id="cell-10" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">max_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span></span>
<span id="cb7-2">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sequence.pad_sequences(X_train, maxlen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>max_len)</span>
<span id="cb7-3">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sequence.pad_sequences(X_test, maxlen<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>max_len)</span></code></pre></div></div>
</div>
<div id="cell-11" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="57a6d84a-9c9a-4c5a-8fe7-e1366971f8f7" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Vector size after padding: "</span>,<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Vector size after padding:  200</code></pre>
</div>
</div>
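<p>It is worth knowing what happened to the 550 tokens of that review. With its default settings, <code>pad_sequences</code> pads at the front and also truncates at the front, so the last 200 tokens are kept. The helper below is a hypothetical re-implementation of that default behaviour in plain NumPy, meant only as an illustration (it assumes a positive <code>maxlen</code> and is not the Keras code itself).</p>

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq` and left-pad with `value`."""
    kept = list(seq)[-maxlen:]                  # truncate from the front
    padding = [value] * (maxlen - len(kept))    # fill values go in front
    return np.array(padding + kept)

print(pad_pre([5, 6, 7], 5))            # [0 0 5 6 7]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))   # [3 4 5 6]
```

<p>If the beginning of a review matters more than its end, the <code>padding</code> and <code>truncating</code> arguments of <code>pad_sequences</code> can be set to <code>'post'</code>.</p>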
<p>Our network with LSTM looks like the following:</p>
<div id="cell-13" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:321}}" data-editable="true" data-outputid="34899ae2-5119-4921-cb7c-28ce071317cd" data-tags="[]" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Sequential([</span>
<span id="cb10-2">    Embedding(n_words, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Embedding"</span>),</span>
<span id="cb10-3">    Dropout(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dropout 1"</span>),</span>
<span id="cb10-4">    LSTM(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, dropout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, recurrent_dropout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"LSTM"</span>),</span>
<span id="cb10-5">    Dense(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>, activation<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'relu'</span>, name  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dense"</span>),</span>
<span id="cb10-6">    Dropout(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dropout 2"</span>),</span>
<span id="cb10-7">    Dense(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, activation<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sigmoid'</span>, name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sigmoid"</span>),</span>
<span id="cb10-8">    ])</span>
<span id="cb10-9">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(loss<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'binary_crossentropy'</span>,  optimizer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'adam'</span>, metrics<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'accuracy'</span>])</span>
<span id="cb10-10">model.summary()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="font-weight: bold">Model: "sequential"</span>
</pre>
</div>
<div class="cell-output cell-output-display">
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃<span style="font-weight: bold"> Layer (type)                         </span>┃<span style="font-weight: bold"> Output Shape                </span>┃<span style="font-weight: bold">         Param # </span>┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Embedding (<span style="color: #0087ff; text-decoration-color: #0087ff">Embedding</span>)                │ ?                           │     <span style="color: #00af00; text-decoration-color: #00af00">0</span> (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout 1 (<span style="color: #0087ff; text-decoration-color: #0087ff">Dropout</span>)                  │ ?                           │     <span style="color: #00af00; text-decoration-color: #00af00">0</span> (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ LSTM (<span style="color: #0087ff; text-decoration-color: #0087ff">LSTM</span>)                          │ ?                           │     <span style="color: #00af00; text-decoration-color: #00af00">0</span> (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dense (<span style="color: #0087ff; text-decoration-color: #0087ff">Dense</span>)                        │ ?                           │     <span style="color: #00af00; text-decoration-color: #00af00">0</span> (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout 2 (<span style="color: #0087ff; text-decoration-color: #0087ff">Dropout</span>)                  │ ?                           │     <span style="color: #00af00; text-decoration-color: #00af00">0</span> (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Sigmoid (<span style="color: #0087ff; text-decoration-color: #0087ff">Dense</span>)                      │ ?                           │     <span style="color: #00af00; text-decoration-color: #00af00">0</span> (unbuilt) │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
</pre>
</div>
<div class="cell-output cell-output-display">
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="font-weight: bold"> Total params: </span><span style="color: #00af00; text-decoration-color: #00af00">0</span> (0.00 B)
</pre>
</div>
<div class="cell-output cell-output-display">
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="font-weight: bold"> Trainable params: </span><span style="color: #00af00; text-decoration-color: #00af00">0</span> (0.00 B)
</pre>
</div>
<div class="cell-output cell-output-display">
<pre style="white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace"><span style="font-weight: bold"> Non-trainable params: </span><span style="color: #00af00; text-decoration-color: #00af00">0</span> (0.00 B)
</pre>
</div>
</div>
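<p>The summary reports zero parameters and unknown shapes because the model has not seen an input shape yet; the weights are only created on the first call, for instance during <code>fit</code>. Adding a <code>keras.Input(shape=(max_len,))</code> as the first layer is one way to get a populated summary. The expected counts can also be worked out by hand, as in the sketch below, which assumes the default layer settings (biases enabled) and uses illustrative variable names.</p>

```python
n_words = 1000     # vocabulary size passed to Embedding
embed_dim = 50     # output size of Embedding
lstm_units = 100
dense_units = 250

# One vector of length embed_dim per vocabulary entry.
embedding = n_words * embed_dim
# Four gates, each with input weights, recurrent weights and a bias.
lstm = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)
# Fully connected layers: weights plus one bias per unit.
dense = lstm_units * dense_units + dense_units
sigmoid = dense_units * 1 + 1

total = embedding + lstm + dense + sigmoid
print(embedding, lstm, dense, sigmoid, total)  # 50000 60400 25250 251 135901
```

<p>The dropout layers have no weights, so they do not contribute to the total.</p>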
<div id="cell-14" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="387c3fdf-7fe3-4b4b-ba5c-6b307cab04f2" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">callbacks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [EarlyStopping(monitor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_accuracy'</span>, patience<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"max"</span>)]</span>
<span id="cb11-2">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span></span>
<span id="cb11-3">n_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb11-4">model.fit(X_train, y_train, batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>batch_size, epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n_epochs, validation_split<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, callbacks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>callbacks)</span>
<span id="cb11-5"></span>
<span id="cb11-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Accuracy on test set: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(model.evaluate(X_test, y_test)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<div class="ansi-escaped-output">
<pre>Epoch 1/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">59s</span> 354ms/step - accuracy: 0.6012 - loss: 0.6486 - val_accuracy: 0.7794 - val_loss: 0.4697

Epoch 2/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">49s</span> 311ms/step - accuracy: 0.8050 - loss: 0.4342 - val_accuracy: 0.7854 - val_loss: 0.4532

Epoch 3/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">48s</span> 303ms/step - accuracy: 0.8158 - loss: 0.4184 - val_accuracy: 0.8240 - val_loss: 0.4078

Epoch 4/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">47s</span> 299ms/step - accuracy: 0.8226 - loss: 0.4063 - val_accuracy: 0.8316 - val_loss: 0.3862

Epoch 5/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">47s</span> 302ms/step - accuracy: 0.8309 - loss: 0.3955 - val_accuracy: 0.8132 - val_loss: 0.4175

Epoch 6/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">47s</span> 302ms/step - accuracy: 0.8317 - loss: 0.3896 - val_accuracy: 0.8222 - val_loss: 0.3975

Epoch 7/100

<span class="ansi-bold">157/157</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">48s</span> 308ms/step - accuracy: 0.8435 - loss: 0.3702 - val_accuracy: 0.8280 - val_loss: 0.3951

<span class="ansi-bold">782/782</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">122s</span> 156ms/step - accuracy: 0.8290 - loss: 0.3906

Accuracy on test set: 0.8313199877738953
</pre>
</div>
</div>
</div>
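<p>The log above stops after epoch 7 although 100 epochs were requested. A minimal sketch of the patience rule, assuming the simplified logic "stop once three consecutive epochs fail to beat the best validation accuracy", reproduces this. The function name <code>stopping_epoch</code> is hypothetical and this is not the Keras implementation itself; the accuracies are copied from the log.</p>

```python
# Validation accuracies copied from the training log above (epochs 1 to 7).
val_accuracy = [0.7794, 0.7854, 0.8240, 0.8316, 0.8132, 0.8222, 0.8280]


def stopping_epoch(history, patience):
    """Return (stop_epoch, best_epoch), 1-indexed, for a metric to maximize.

    Simplified patience rule (hypothetical helper, not the Keras source):
    stop once `patience` consecutive epochs fail to beat the best value.
    """
    best_value = float("-inf")
    best_epoch = 0
    waited = 0
    for epoch, value in enumerate(history, start=1):
        if value > best_value:
            # New best value: remember it and reset the patience counter.
            best_value, best_epoch, waited = value, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(history), best_epoch


stop_epoch, best_epoch = stopping_epoch(val_accuracy, patience=3)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

<p>The best validation accuracy is reached at epoch 4, and the three following epochs do not improve on it. Since <code>restore_best_weights</code> is not set in the callback above, the evaluated model is presumably the one from the last epoch rather than the best one.</p>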
<section id="arithmetic" class="level2">
<h2 class="anchored" data-anchor-id="arithmetic">Arithmetic</h2>
<p>Another fun LSTM example is simple addition, which we also assembled in <a href="../diverse/wolfram-lstmadd.html">Wolfram</a>. It’s interesting to note that a single LSTM layer with a one-unit dense output is sufficient: no extra hidden dense layers, dropout or anything else.</p>
<div id="cell-16" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-editable="true" data-outputid="32350a49-ff5d-4321-b809-4aa1fb74b3d6" data-tags="[]" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb12-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.models <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Sequential</span>
<span id="cb12-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> keras.layers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LSTM, Dense</span>
<span id="cb12-4"></span>
<span id="cb12-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate data for addition</span></span>
<span id="cb12-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate_addition_data(num_samples, max_value):</span>
<span id="cb12-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Generates data for addition task."""</span></span>
<span id="cb12-8">  X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, max_value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(num_samples, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb12-9">  y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(X, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb12-10">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> X, y</span>
<span id="cb12-11"></span>
<span id="cb12-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate training and testing data</span></span>
<span id="cb12-13">num_samples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span></span>
<span id="cb12-14">max_value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb12-15">X_train, y_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> generate_addition_data(num_samples, max_value)</span>
<span id="cb12-16">X_test, y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> generate_addition_data(num_samples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, max_value)</span>
<span id="cb12-17"></span>
<span id="cb12-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Reshape data for LSTM</span></span>
<span id="cb12-19">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.reshape(X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb12-20">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_test.reshape(X_test.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_test.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb12-21"></span>
<span id="cb12-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create LSTM model</span></span>
<span id="cb12-23">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Sequential()</span>
<span id="cb12-24">model.add(LSTM(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, input_shape<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], X_train.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])))</span>
<span id="cb12-25">model.add(Dense(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb12-26"></span>
<span id="cb12-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compile the model</span></span>
<span id="cb12-28">model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(loss<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mean_squared_error'</span>, optimizer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'adam'</span>)</span>
<span id="cb12-29"></span>
<span id="cb12-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train the model</span></span>
<span id="cb12-31">model.fit(X_train, y_train, epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, validation_split<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>)</span>
<span id="cb12-32"></span>
<span id="cb12-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Evaluate the model</span></span>
<span id="cb12-34">loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.evaluate(X_test, y_test)</span>
<span id="cb12-35"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Test Loss:'</span>, loss)</span>
<span id="cb12-36"></span>
<span id="cb12-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make predictions</span></span>
<span id="cb12-38">predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict(X_test)</span>
<span id="cb12-39"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Predictions:'</span>, predictions[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>])</span>
<span id="cb12-40"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Actual values:'</span>, y_test[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<div class="ansi-escaped-output">
<pre>Epoch 1/100

<span class="ansi-bold"> 15/250</span> <span class="ansi-green-fg">━</span><span class="ansi-white-fg">━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">1s</span> 8ms/step - loss: 11907.7441</pre>
</div>
</div>
<div class="cell-output cell-output-stdout">
<div class="ansi-escaped-output">
<pre><span class="ansi-bold">250/250</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">2s</span> 9ms/step - loss: 11319.8467 - val_loss: 9217.2393

Epoch 2/100

<span class="ansi-bold">250/250</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">2s</span> 9ms/step - loss: 8795.1885 - val_loss: 7656.9336

Epoch 3/100

<span class="ansi-bold">250/250</span> <span class="ansi-green-fg">━━━━━━━━━━━━━━━━━━━━</span> <span class="ansi-bold">3s</span> 11ms/step - loss: 7350.7256 - val_loss: 6516.9985

...

Predictions: [[ 95.97442 ]

 [103.975876]

 [ 47.84532 ]

 [144.85732 ]

 [118.88151 ]

 [157.88887 ]

 [ 89.905106]

 [ 67.921936]

 [ 47.886024]

 [ 38.99708 ]]

Actual values: [ 96 104  48 145 119 158  90  68  48  39]
</pre>
</div>
</div>
</div>
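<p>The network treats addition as a regression problem, so its outputs are floats near the true sums rather than exact integers. As a small sketch using only the ten values printed above, rounding to the nearest integer shows how close the predictions are. Nothing here re-runs the model.</p>

```python
# Values copied from the output above (first ten test samples).
predictions = [95.97442, 103.975876, 47.84532, 144.85732, 118.88151,
               157.88887, 89.905106, 67.921936, 47.886024, 38.99708]
actual = [96, 104, 48, 145, 119, 158, 90, 68, 48, 39]

# Round each regression output to the nearest integer and compare.
rounded = [round(p) for p in predictions]
exact_matches = sum(r == a for r, a in zip(rounded, actual))
max_abs_error = max(abs(p - a) for p, a in zip(predictions, actual))

print(f"exact matches after rounding: {exact_matches}/{len(actual)}")
print(f"largest absolute error: {max_abs_error:.3f}")
```

<p>On these ten samples every rounded prediction equals the true sum. This says nothing about the rest of the test set or about operands outside the training range of 0 to 100.</p>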


</section>
</section>

 ]]></description>
  <category>Graphs</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/lstm.html</guid>
  <pubDate>Wed, 05 Jun 2024 22:00:00 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/lstm.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Barbell Embedding with PyG</title>
  <link>https://discovery.graphsandnetworks.com/graphML/Barbell_Embedding_with_PyG.html</link>
  <description><![CDATA[ 





<section id="barbell-embedding-with-pyg" class="level1">
<h1>Barbell Embedding with PyG</h1>
<p>The <a href="https://networkx.org/documentation/stable/reference/generated/networkx.generators.classic.barbell_graph.html">Barbell graph</a> is an interesting graph to look at with respect to node embeddings because it has two distinct blobs which should get embedded in latent space as two clusters. Simply put, the graph visual and its embedding should be very similar. This is only true if you don’t have a payload on the nodes and the embedding is purely based on the topology. The notebook explores this idea and, at the same time, gives an example of how to approach node embeddings with PyTorch Geometric.</p>
<p><a href="https://colab.research.google.com/drive/1AhbbvRoVS-sz6evdJ4WPzLjMsAAGwWl2" target="_parent"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p>
<div id="cell-2" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="101ba037-7cfa-416e-8866-ebeabd4bb154">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># not a must for this simple example but in general good to have GPUs when doing graph ML</span></span>
<span id="cb1-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>nvidia<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>smi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>L</span>
<span id="cb1-3"></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add this in a Google Colab cell to install the correct version of Pytorch Geometric.</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Torch Scatter can in particular be a frustrating experience</span></span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> format_pytorch_version(version):</span>
<span id="cb1-10">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> version.split(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb1-11"></span>
<span id="cb1-12">TORCH_version <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.__version__</span>
<span id="cb1-13">TORCH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> format_pytorch_version(TORCH_version)</span>
<span id="cb1-14"></span>
<span id="cb1-15"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> format_cuda_version(version):</span>
<span id="cb1-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cu'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> version.replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb1-17"></span>
<span id="cb1-18">CUDA_version <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.version.cuda</span>
<span id="cb1-19">CUDA <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> format_cuda_version(CUDA_version)</span>
<span id="cb1-20"></span>
<span id="cb1-21"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>scatter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>f https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>data.pyg.org<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>whl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>{TORCH}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>{CUDA}.html</span>
<span id="cb1-22"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>sparse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>f https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>data.pyg.org<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>whl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>{TORCH}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>{CUDA}.html</span>
<span id="cb1-23"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>cluster <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>f https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>data.pyg.org<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>whl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>{TORCH}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>{CUDA}.html</span>
<span id="cb1-24"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>spline<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>conv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>f https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>data.pyg.org<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>whl<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>{TORCH}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>{CUDA}.html</span>
<span id="cb1-25"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install torch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>geometric</span></code></pre></div></div>
</div>
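<p>To see what the two helper functions in the cell above produce, here is a sketch that assembles the wheel index URL without needing torch installed. The version strings are hypothetical stand-ins for <code>torch.__version__</code> and <code>torch.version.cuda</code>, and the <code>"cpu"</code> fallback is an assumption added for CPU-only installs, where <code>torch.version.cuda</code> is <code>None</code>.</p>

```python
def format_pytorch_version(version):
    # Drop the local build suffix: "2.3.0+cu121" becomes "2.3.0".
    return version.split("+")[0]


def format_cuda_version(version):
    # "12.1" becomes "cu121". Assumption: fall back to "cpu" when no CUDA
    # build is present (torch.version.cuda is None on CPU-only installs).
    if version is None:
        return "cpu"
    return "cu" + version.replace(".", "")


# Hypothetical values standing in for torch.__version__ and torch.version.cuda.
torch_version = "2.3.0+cu121"
cuda_version = "12.1"

TORCH = format_pytorch_version(torch_version)
CUDA = format_cuda_version(cuda_version)
wheel_index = f"https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html"
print(wheel_index)
```

<p>The resulting URL is what the <code>-f</code> flag of <code>pip install</code> receives, so that pip picks wheels built against the installed PyTorch and CUDA combination.</p>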
<p>Let’s render the Barbell graph so you have an idea of what it looks like.</p>
<div id="cell-4" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:406}}" data-outputid="6c614d9d-c95b-4116-a97c-ced5ce15b715">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb2-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb2-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.decomposition <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PCA</span>
<span id="cb2-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> tqdm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> tqdm</span>
<span id="cb2-9"></span>
<span id="cb2-10">pd.set_option(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'display.max_columns'</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb2-11">pd.set_option(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'display.float_format'</span>, <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%.4f</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> x)</span>
<span id="cb2-12">graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.barbell_graph(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb2-13">nx.draw_networkx(graph, with_labels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphML/Barbell_Embedding_with_PyG_files/figure-html/cell-3-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>The embedding should resemble this layout, with the intermediate nodes hanging somewhere between the two clusters.</p>
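<p>To make the structure concrete, the sketch below rebuilds the edge list of a barbell graph in plain Python: two cliques of 15 nodes joined by a path of 3 nodes. The helper <code>barbell_edges</code> is hypothetical and only mirrors the idea behind <code>nx.barbell_graph(15, 3)</code>; it is not the NetworkX implementation.</p>

```python
from itertools import combinations


def barbell_edges(m1, m2):
    """Edges of two m1-node cliques joined by a path of m2 nodes."""
    left = range(0, m1)
    path = range(m1, m1 + m2)
    right = range(m1 + m2, 2 * m1 + m2)
    # Every pair inside a clique is connected.
    edges = list(combinations(left, 2)) + list(combinations(right, 2))
    # The path runs from the last left node to the first right node.
    chain = [m1 - 1, *path, m1 + m2]
    edges += list(zip(chain, chain[1:]))
    return edges


edges = barbell_edges(15, 3)
nodes = {n for edge in edges for n in edge}

# Count how many edges touch each node.
degree = {n: 0 for n in nodes}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

bridge_nodes = sorted(n for n in nodes if degree[n] == 2)
print(len(nodes), len(edges), bridge_nodes)
```

<p>Only the three path nodes have degree 2, while every clique node has at least 14 neighbours. That imbalance is why the cliques are expected to form tight clusters, with the path nodes pulled less strongly in between.</p>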
<div id="cell-6" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> graph.number_of_nodes()</span>
<span id="cb3-2">embedding_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span></span>
<span id="cb3-3">EPS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-15</span></span>
<span id="cb3-4">embedding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Embedding(N, embedding_dim)</span>
<span id="cb3-5">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(embedding.parameters()), lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span></code></pre></div></div>
</div>
<p>The embedding parameters are optimized so that the vectors embedding two nodes end up close to each other whenever there is an edge between them. Note that there are no linear layers, weight matrices or anything else; training purely adjusts the embedding vectors.</p>
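<p>The per-edge objective used in the next cell is the negative log of the sigmoid of the dot product of the two node vectors. As a rough illustration in plain Python, with made-up two-dimensional vectors and a hypothetical helper <code>edge_loss</code>, the loss is small when the vectors point the same way and large when they point in opposite directions.</p>

```python
import math


def edge_loss(z_u, z_v, eps=1e-15):
    """Negative log-sigmoid of the dot product of two embedding vectors."""
    score = sum(a * b for a, b in zip(z_u, z_v))
    sigmoid = 1.0 / (1.0 + math.exp(-score))
    return -math.log(sigmoid + eps)


# Made-up vectors, chosen only to show the three regimes.
aligned = edge_loss([2.0, 1.0], [2.0, 1.0])       # dot product 5
orthogonal = edge_loss([1.0, 0.0], [0.0, 1.0])    # dot product 0
opposed = edge_loss([2.0, 1.0], [-2.0, -1.0])     # dot product -5

print(f"aligned: {aligned:.4f}")
print(f"orthogonal: {orthogonal:.4f}")
print(f"opposed: {opposed:.4f}")
```

<p>Because only connected pairs contribute (there is no term pushing unconnected nodes apart), the total can be driven very close to zero simply by making the dot products along edges large.</p>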
<div id="cell-8" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}}" data-outputid="b1a6147a-b94d-4bf0-f20b-117d5e7f9dc2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> train():</span>
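<span id="cb4-2">    embedding.train()
    optimizer.zero_grad()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># reset gradients accumulated in the previous epoch</span></span>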
<span id="cb4-3">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (u, v) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> graph.edges:</span>
<span id="cb4-5">        z_u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> embedding.weight[u]</span>
<span id="cb4-6">        z_v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> embedding.weight[v]</span>
<span id="cb4-7"></span>
<span id="cb4-8">        out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (z_u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> z_v).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).view(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb4-9">        pos_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>torch.log(torch.sigmoid(out) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> EPS).mean()</span>
<span id="cb4-10">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> pos_loss</span>
<span id="cb4-11">    loss.backward()</span>
<span id="cb4-12">    optimizer.step()</span>
<span id="cb4-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> loss.item()</span>
<span id="cb4-14"></span>
<span id="cb4-15"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> e <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">201</span>):</span>
<span id="cb4-16">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train()</span>
<span id="cb4-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> e <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb4-18">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"epoch: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>e<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>epoch: 10, loss: 350.039306640625
epoch: 20, loss: 192.37416076660156
epoch: 30, loss: 81.0612564086914
epoch: 40, loss: 20.464614868164062
epoch: 50, loss: 2.725670576095581
epoch: 60, loss: 0.14878813922405243
epoch: 70, loss: 0.011140757240355015
epoch: 80, loss: 0.00128898024559021
epoch: 90, loss: 0.00021363147243391722
epoch: 100, loss: 6.902414315845817e-05
epoch: 110, loss: 3.3379146771039814e-05
epoch: 120, loss: 1.8358399756834842e-05
epoch: 130, loss: 1.0848104466276709e-05
epoch: 140, loss: 6.7949526965094265e-06
epoch: 150, loss: 4.410753263073275e-06
epoch: 160, loss: 3.099446303167497e-06
epoch: 170, loss: 2.2649790025752736e-06
epoch: 180, loss: 1.7881409348774469e-06
epoch: 190, loss: 1.4305124977909145e-06
epoch: 200, loss: 1.1920935776288388e-06</code></pre>
</div>
</div>
<p>The loss drops below 1e-4 within a hundred epochs because the amount of data is so small. If we now project the 30-dimensional vectors onto two dimensions with PCA we get:</p>
<div id="cell-10" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;colab&quot;,&quot;value&quot;:{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:452}}" data-outputid="943c945b-7284-4aa4-b757-fb0326db9e22">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@torch.no_grad</span>()</span>
<span id="cb6-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> plot_embedding(embedding, title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Barbell embedding"</span>):</span>
<span id="cb6-3">    plt.figure()</span>
<span id="cb6-4">    pca <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PCA(n_components<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb6-5">    z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pca.fit_transform(embedding.weight.numpy())</span>
<span id="cb6-6">    sns.scatterplot(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>z[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>z[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb6-7"></span>
<span id="cb6-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># show the label of each node on the figure</span></span>
<span id="cb6-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (x, y) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(z):</span>
<span id="cb6-10">        plt.text(x, y, i)</span>
<span id="cb6-11">    plt.title(title)</span>
<span id="cb6-12"></span>
<span id="cb6-13">plot_embedding(embedding)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphML/Barbell_Embedding_with_PyG_files/figure-html/cell-6-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>You can see that the two clusters neatly reflect the barbell structure and that the three intermediate nodes hang between the clusters. Note, however, that this is a two-dimensional projection of a 30-dimensional vector space, so it is not what the embedding really looks like.</p>
<p>This is a simple embedding of the immediate neighborhood of a node. Every node attached to a given one pulls it in the embedding space. The interesting question is how to go beyond the 1-hop embedding: how to include more of a node’s neighborhood? The challenge is that beyond the parents and children you get a set which can be very diverse: some nodes might have no 2-hop neighbors, some might have a lot. Machine learning frameworks do not deal well with varying (jagged) tensors, so either you pad the tensors or you find a way to generate tensors of equal size. Padding leads to sparse vectors and hence to the curse of dimensionality. Generating equal-size tensors leads to a smart trick: random walks of predefined length.</p>
<p>The random walk also solves the problem of dense graphs. If a node has a dense neighborhood, or the neighborhoods vary widely across a graph, collecting every neighbor would mean harvesting lots of data. The random approach generates tensors of the same size and ensures that the data collection does not get out of hand. Of course, the random walk inevitably introduces randomness, and it may miss the interesting adjacency. To remedy this the walk is not purely random: the pseudo-random algorithm has two factors defining how close it stays to the starting point and how far it ventures downstream in the graph.</p>
<p>The pseudo-random walk leads to n-hop embeddings and is more generic than the simple algorithm we used above. It also leads to the message-passing algorithms widely used in graph machine learning.</p>
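<p>To make the idea concrete, here is a minimal sketch of a fixed-length biased walk on a small barbell graph. It is not the code used above: the two factors are assumed to behave like the return parameter <code>p</code> and the in-out parameter <code>q</code> of node2vec, and the helper names (<code>barbell</code>, <code>biased_walk</code>) and the graph sizes are hypothetical.</p>

```python
import random


def barbell(clique_size, path_len):
    """Adjacency dict of a barbell graph: two cliques joined by a path (sizes are illustrative)."""
    n = 2 * clique_size + path_len
    adj = {i: set() for i in range(n)}
    left = range(clique_size)
    right = range(clique_size + path_len, n)
    for clique in (left, right):
        for u in clique:
            for v in clique:
                if u != v:
                    adj[u].add(v)
    # Chain: last node of the left clique, the path nodes, first node of the right clique.
    chain = range(clique_size - 1, clique_size + path_len + 1)
    for u, v in zip(chain, chain[1:]):
        adj[u].add(v)
        adj[v].add(u)
    return adj


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """Walk of `length` nodes (assumes no isolated nodes).

    Small p favours stepping back, small q favours moving away (assumed node2vec-style weights).
    """
    walk = [start]
    for _ in range(length - 1):
        current = walk[-1]
        neighbours = sorted(adj[current])
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)  # return to where we came from
            elif candidate in adj[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # venture further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


adj = barbell(clique_size=5, path_len=3)
rng = random.Random(42)
walks = [biased_walk(adj, node, length=6, p=0.5, q=2.0, rng=rng) for node in adj]
print(len(walks), {len(w) for w in walks})
```

<p>Every walk has the same length regardless of the degree of the nodes it visits, which is what makes the result easy to pack into a regular tensor.</p>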
<p>It’s interesting to note (on a more philosophical level) that <strong>randomness is a solution to a scalability problem</strong>. You trade certainty for flexibility and universality.</p>


</section>

 ]]></description>
  <category>PyG</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/Barbell_Embedding_with_PyG.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/BarbellEmbedding.png" medium="image" type="image/png" height="115" width="144"/>
</item>
<item>
  <title>Cora Dataset</title>
  <link>https://discovery.graphsandnetworks.com/graphML/cora.html</link>
  <description><![CDATA[ 





<p>The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.</p>
<p>This dataset is the MNIST equivalent in graph learning, and we explore it in some detail here because other articles use it again and again as a testbed.</p>
<ul>
<li><a href="https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz">Direct link to download the Cora dataset</a></li>
<li><a href="https://temprl.com/cora.tgz">Alternative link to download the Cora dataset</a></li>
<li><a href="https://temprl.com/cora.graphml.zip">GraphML file with applied layout (same as image above)</a></li>
<li><a href="https://temprl.com/nodes.csv">The nodes in CSV format</a></li>
<li><a href="https://temprl.com/edges.csv">The edges in CSV format</a></li>
<li><a href="https://temprl.com/Cora52.dump">Neo4j v5.2 dump to restore (works with v5.9 and below)</a></li>
</ul>
<section id="networkx" class="level1">
<h1>NetworkX</h1>
<p><a href="https://temprl.com/cora.tgz">Download and unzip</a>, say in <code>~/data/cora/</code>:</p>
<div id="cell-2" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-4">data_dir <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.path.expanduser(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"~/data/cora"</span>)</span></code></pre></div></div>
</div>
<p>Import the edges:</p>
<div id="cell-4" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">edgelist <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(os.path.join(data_dir, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cora.cites"</span>), sep<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, header<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"target"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"source"</span>])</span>
<span id="cb2-2">edgelist[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cites"</span></span></code></pre></div></div>
</div>
<p>The edgelist is a simple table with the source citing the target. All edges have the same label:</p>
<div id="cell-6" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">edgelist.sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).head(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="3">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">target</th>
<th data-quarto-table-cell-role="th">source</th>
<th data-quarto-table-cell-role="th">label</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">3876</th>
<td>94229</td>
<td>1111733</td>
<td>cites</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3939</th>
<td>101660</td>
<td>1107095</td>
<td>cites</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">1280</th>
<td>6213</td>
<td>6155</td>
<td>cites</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">574</th>
<td>2440</td>
<td>1117786</td>
<td>cites</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">3827</th>
<td>89308</td>
<td>103528</td>
<td>cites</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
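<p>Because the first column of <code>cora.cites</code> is the cited paper and the second the citing one, the direction matters as soon as you count citations. A small sketch using only the standard library, with the five rows sampled above as inline data (the variable names are illustrative):</p>

```python
import csv
import io
from collections import Counter

# Five rows in the cora.cites layout: cited paper <TAB> citing paper.
raw = "94229\t1111733\n101660\t1107095\n6213\t6155\n2440\t1117786\n89308\t103528\n"

cited_by = Counter()  # how often a paper is cited (in-degree)
cites = Counter()     # how many papers a paper cites (out-degree)
for target, source in csv.reader(io.StringIO(raw), delimiter="\t"):
    cited_by[int(target)] += 1
    cites[int(source)] += 1

# In these five rows, paper 94229 is cited once and cites nothing.
print(cited_by[94229], cites[94229])
```

<p>The same distinction applies in NetworkX: <code>nx.from_pandas_edgelist</code> returns an undirected graph unless <code>create_using=nx.DiGraph</code> is passed.</p>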
<p>Create a NetworkX graph from this edge list:</p>
<div id="cell-8" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">Gnx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.from_pandas_edgelist(edgelist, edge_attr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span>)</span>
<span id="cb4-2">nx.set_node_attributes(Gnx, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"paper"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span>)</span></code></pre></div></div>
</div>
<p>A typical node looks like this:</p>
<div id="cell-10" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">Gnx.nodes[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1103985</span>]</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>{'label': 'paper'}</code></pre>
</div>
</div>
<p>The data attached to the nodes consists of flags indicating whether a word in a 1433-long dictionary is present or not:</p>
<div id="cell-12" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">feature_names <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(ii) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> ii <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1433</span>)]</span>
<span id="cb7-2">column_names <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  feature_names <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"subject"</span>]</span>
<span id="cb7-3">node_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(os.path.join(data_dir, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cora.content"</span>), sep<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, header<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>column_names)</span></code></pre></div></div>
</div>
<p>Each node has a subject and 1433 flags corresponding to word occurrence:</p>
<div id="cell-14" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">node_data.head(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">w_0</th>
<th data-quarto-table-cell-role="th">w_1</th>
<th data-quarto-table-cell-role="th">w_2</th>
<th data-quarto-table-cell-role="th">w_3</th>
<th data-quarto-table-cell-role="th">w_4</th>
<th data-quarto-table-cell-role="th">w_5</th>
<th data-quarto-table-cell-role="th">w_6</th>
<th data-quarto-table-cell-role="th">w_7</th>
<th data-quarto-table-cell-role="th">w_8</th>
<th data-quarto-table-cell-role="th">w_9</th>
<th data-quarto-table-cell-role="th">...</th>
<th data-quarto-table-cell-role="th">w_1424</th>
<th data-quarto-table-cell-role="th">w_1425</th>
<th data-quarto-table-cell-role="th">w_1426</th>
<th data-quarto-table-cell-role="th">w_1427</th>
<th data-quarto-table-cell-role="th">w_1428</th>
<th data-quarto-table-cell-role="th">w_1429</th>
<th data-quarto-table-cell-role="th">w_1430</th>
<th data-quarto-table-cell-role="th">w_1431</th>
<th data-quarto-table-cell-role="th">w_1432</th>
<th data-quarto-table-cell-role="th">subject</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">31336</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Neural_Networks</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1061127</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Rule_Learning</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">1106406</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Reinforcement_Learning</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">13195</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Reinforcement_Learning</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">37879</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>...</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Probabilistic_Methods</td>
</tr>
</tbody>
</table>

<p>5 rows × 1434 columns</p>
</div>
</div>
</div>
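<p>For reference, each line of <code>cora.content</code> is laid out as paper id, then the 0/1 word flags, then the subject. Since only 1434 column names are passed for 1435 fields, pandas uses the leading paper id as the index, which is why it shows up in the first column above. A minimal standard-library sketch of that layout, using a made-up line with five flags instead of 1433 (<code>parse_content_line</code> is a hypothetical helper):</p>

```python
def parse_content_line(line):
    """Split one tab-separated line into (paper_id, word_flags, subject)."""
    fields = line.rstrip("\n").split("\t")
    paper_id = int(fields[0])
    flags = [int(f) for f in fields[1:-1]]
    subject = fields[-1]
    return paper_id, flags, subject


# Hypothetical line with 5 word flags; real lines carry 1433 of them.
sample = "31336\t0\t0\t1\t0\t0\tNeural_Networks\n"
paper_id, flags, subject = parse_content_line(sample)

# Indices of the dictionary words that occur in this paper.
present_words = [i for i, flag in enumerate(flags) if flag == 1]
print(paper_id, subject, present_words)
```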
<p>The different subjects are:</p>
<div id="cell-16" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(node_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"subject"</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>{'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}</code></pre>
</div>
</div>
<p>Typical ML challenges with this dataset in mind:</p>
<ul>
<li>label prediction: predict the subject of a paper (node) on the basis of the surrounding node data and the structure of the graph</li>
<li>edge prediction: given node data, can one predict the papers that should be cited?</li>
</ul>
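<p>As a point of reference for the first task, a deliberately naive baseline is to give a node the most common subject among its already-labelled neighbours. The sketch below uses a hypothetical toy graph and made-up labels, and it is not one of the models discussed on this site:</p>

```python
from collections import Counter


def predict_subject(node, adj, known):
    """Most common label among the labelled neighbours of `node`, or None if there are none."""
    votes = Counter(known[n] for n in adj[node] if n in known)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy undirected citation graph (node ids and labels are made up).
adj = {
    1: {2, 3, 4},
    2: {1},
    3: {1},
    4: {1, 5},
    5: {4},
}
known = {2: "Theory", 3: "Theory", 4: "Rule_Learning"}

print(predict_subject(1, adj, known))
print(predict_subject(5, adj, known))
```

<p>This baseline ignores the word vectors entirely, so it mainly serves as a sanity check for models that combine node data with graph structure.</p>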
<p>You will find on this site plenty of articles which are based on the Cora dataset.</p>
<p>For your information, the visualization above was created by exporting the Cora network to GML (Graph Modelling Language), importing it into yEd and applying a balloon layout. It shows some interesting characteristics which can best be analyzed via centrality.</p>
</section>
<section id="wolfram" class="level1">
<h1>Wolfram</h1>
<p>The TSV-formatted datasets linked above are easily loaded into Mathematica, and it’s also a lot of fun to apply its black-box machine learning functionality to Cora. As an example, you can find in this gist an edge-prediction model based on node content and adjacency. The model is poor (accuracy around 70%) but has potential, especially considering how easy it is to experiment within Mathematica.</p>
<p><i class="fa-solid fa-book" aria-label="book"></i> <a href="../graphML/wolfram-cora.html">The Wolfram Cora experiment notebook</a></p>
</section>
<section id="pytorch" class="level1">
<h1>PyTorch</h1>
<p><a href="https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html">PyTorch Geometric has various graph datasets</a> and it’s straightforward to download Cora:</p>
<div id="cell-20" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch_geometric.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Planetoid</span>
<span id="cb11-2">dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Planetoid(root<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'~/somewhere/Cora'</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Cora'</span>)</span>
<span id="cb11-3"></span>
<span id="cb11-4">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dataset[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb11-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Dataset: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dataset<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">:'</span>)</span>
<span id="cb11-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'======================'</span>)</span>
<span id="cb11-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Number of graphs: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(dataset)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Number of features: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dataset<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_features<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Number of classes: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dataset<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_classes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-10"></span>
<span id="cb11-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Number of nodes: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Number of edges: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Average node degree: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_edges <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Number of training nodes: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>train_mask<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Training node label rate: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(data.train_mask.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Contains isolated nodes: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>contains_isolated_nodes()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Contains self-loops: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>contains_self_loops()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb11-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Is undirected: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>is_undirected()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True</code></pre>
</div>
</div>
<p>You can also convert the dataset to NetworkX and use its drawing functions, but the resulting picture is not particularly pretty.</p>
<div id="cell-22" class="cell" data-execution_count="21">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb14-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb14-3"></span>
<span id="cb14-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch_geometric.utils <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> to_networkx</span>
<span id="cb14-5">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> to_networkx(data, to_undirected<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb14-6">nx.draw_networkx(G, with_labels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb14-7">plt.show()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphML/cora_files/figure-html/cell-11-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>PyG has lots of interesting datasets, both from an ML point of view and from a visualization point of view. Below is a generic approach to download the data and convert it to NetworkX and GML.</p>
<div id="cell-24" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch_geometric.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Amazon</span>
<span id="cb15-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb15-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.sparse <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> coo_matrix</span>
<span id="cb15-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb15-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the name can only be 'Computers' or 'Photo' in this case.</span></span>
<span id="cb15-6">amazon <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Amazon(root<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'~/Amazon'</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Computers"</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-7">edges <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> amazon.data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"edge_index"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-8">row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> edges[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].numpy()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-9">column <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> edges[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].numpy()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-10">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.repeat(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, edges.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-11">adj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> coo_matrix((data, (row, column)))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-12">graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.from_scipy_sparse_array(adj)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb15-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># nx.write_gml(graph, "//graph.gml")</span></span></code></pre></div></div>
</div>
<p>The COO (coordinate) format referred to is a way to store sparse matrices, see <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html">the SciPy documentation</a>. As outlined below, from here on you can use various tools to visualize the graph.</p>
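<p>To make the COO idea concrete, here is a minimal sketch in plain Python (no SciPy or PyG needed). The toy <code>row</code> and <code>column</code> lists and the helper name <code>coo_to_adjacency</code> are illustrative assumptions, not part of the Amazon dataset or of any library. Each stored entry is one (row, column, value) triplet, which mirrors the two rows of <code>edge_index</code> used above.</p>

```python
from collections import defaultdict


def coo_to_adjacency(row, column, data):
    """Turn COO triplets (row[i], column[i], data[i]) into an adjacency dict."""
    adjacency = defaultdict(dict)
    for source, target, value in zip(row, column, data):
        adjacency[source][target] = value
    return {node: dict(neighbors) for node, neighbors in adjacency.items()}


# A toy undirected triangle 0-1-2 (hypothetical data), stored as directed
# pairs in both directions, as PyG does for undirected graphs.
row = [0, 1, 1, 2, 2, 0]
column = [1, 0, 2, 1, 0, 2]
data = [1] * len(row)  # every stored entry is an unweighted edge

adjacency = coo_to_adjacency(row, column, data)
print(adjacency)
```

<p>Only the non-zero entries are kept, which is why the format suits sparse graphs such as Cora or the Amazon co-purchase networks.</p>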
</section>
<section id="neo4j" class="level1">
<h1>Neo4j</h1>
<p>Download the two CSV files (<a href="https://temprl.com/nodes.csv">nodes</a> and <a href="https://temprl.com/edges.csv">edges</a>) and put them in the import directory of the database (see screenshot). By default, Neo4j only resolves <code>file:///</code> URLs against this directory, so the files can’t live anywhere else.</p>
<p><img src="https://discovery.graphsandnetworks.com/images/Neo4jImportDir.png" class="img-fluid"></p>
<p>Run the queries:</p>
<pre class="cypher"><code>LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CREATE (n:Paper {id: row["nodeId"], subject: row["subject"], features: row["features"]});

LOAD CSV WITH HEADERS FROM 'file:///edges.csv' AS row
MATCH (u:Paper {id: row["sourceNodeId"]})
MATCH (v:Paper {id: row["targetNodeId"]})
CREATE (u)-[:Cites]-&gt;(v);
</code></pre>
<p>This creates something like 2708 nodes and 10556 edges. You can also directly <a href="https://temprl.com/Cora52.dump">download this Neo4j v5.2 database dump</a> if you prefer.</p>
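<p>If you prefer to generate such CSV files yourself, the sketch below shows the shape the queries above expect. The column names are taken from the Cypher statements; the two toy papers, the helper name <code>to_csv</code> and the semicolon-separated encoding of the features are assumptions for illustration only. Note that <code>LOAD CSV</code> reads every field as a string, so the ids match as strings in both files.</p>

```python
import csv
import io


def to_csv(header, rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data (hypothetical): two papers and one citation between them.
nodes = [(0, "Neural_Networks", "0;1;0"), (1, "Theory", "1;0;0")]
edges = [(0, 1)]

nodes_csv = to_csv(["nodeId", "subject", "features"], nodes)
edges_csv = to_csv(["sourceNodeId", "targetNodeId"], edges)
print(nodes_csv)
print(edges_csv)
```

<p>Writing the two strings to <code>nodes.csv</code> and <code>edges.csv</code> in the import directory would make them loadable with the queries above.</p>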
</section>
<section id="visualization" class="level1">
<h1>Visualization</h1>
<p>You can easily visualize the dataset with various tools. The easiest way is via <a href="https://www.yworks.com/yed-live/">yEd Live</a>: open <a href="https://temprl.com/cora.graphml">this GraphML file</a> in it. There is also <a href="https://www.yworks.com/products/yed">the desktop version of yEd</a>.</p>
<p><img src="https://discovery.graphsandnetworks.com/images/Cora-yEd.png" class="img-fluid"></p>
<p>If you wish to reproduce the layout shown, use the balloon layout with the settings shown below.</p>
<p><img src="https://discovery.graphsandnetworks.com/images/LayoutSettings.png" class="img-fluid"></p>
<p>The <a href="https://gephi.org/">Gephi app</a> is a popular (free) app to explore graphs but offers fewer graph layout algorithms. Use <a href="https://temprl.com/cora.gephi">this Gephi file</a> or import the aforementioned GraphML file.</p>
<p>Finally, there is <a href="https://cytoscape.org/">Cytoscape</a> and if you download <a href="https://www.yworks.com/products/yfiles-layout-algorithms-for-cytoscape">the yFiles layout algorithms for Cytoscape</a> you can create beautiful visualizations with little effort.</p>


</section>

 ]]></description>
  <category>Graphs</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/cora.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/art/coraballoons_small.png" medium="image" type="image/png" height="113" width="144"/>
</item>
<item>
  <title>PyTorch basic pattern</title>
  <link>https://discovery.graphsandnetworks.com/graphML/torchBasic.html</link>
  <description><![CDATA[ 





<p>It’s through simple examples and the basics that one grasps a framework. This is true for scientific (mathematical) frameworks and software stacks. The snippet below is the essence of a Torch model and I always start from this simple setup to assemble more complex things.</p>
<div id="cell-2" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-3"></span>
<span id="cb1-4">x_input <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.FloatTensor([[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]])</span>
<span id="cb1-5">y_input <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.FloatTensor([[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span>],[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>]])</span>
<span id="cb1-6"></span>
<span id="cb1-7">x, y <span class="op" style="color: #5E5E5E;
background-color: null;
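font-style: inherit;">=</span> x_input, y_input  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plain tensors suffice, torch.autograd.Variable is deprecated</span></span>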
<span id="cb1-8"></span>
<span id="cb1-9">net <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Sequential(</span>
<span id="cb1-10">          torch.nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>),</span>
<span id="cb1-11">          torch.nn.Tanh(),</span>
<span id="cb1-12">          torch.nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>),</span>
<span id="cb1-13">          torch.nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-14">        )</span>
<span id="cb1-15"></span>
<span id="cb1-16">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(net.parameters(), lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb1-17">loss_func <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.MSELoss() </span>
<span id="cb1-18">loss_sequence <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>()</span>
<span id="cb1-19"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>):</span>
<span id="cb1-20">    prediction <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> net(x)     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># input x and predict based on x</span></span>
<span id="cb1-21"></span>
<span id="cb1-22">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loss_func(prediction, y)     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># must be (1. nn output, 2. target)</span></span>
<span id="cb1-23">    loss_sequence.append(loss.item())</span>
<span id="cb1-24">    optimizer.zero_grad()   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># clear gradients left over from the previous step</span></span>
<span id="cb1-25">    loss.backward()         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># backpropagation, compute gradients</span></span>
<span id="cb1-26">    optimizer.step()        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># apply gradients</span></span>
<span id="cb1-27"></span>
<span id="cb1-28">plt.plot(loss_sequence)   </span>
<span id="cb1-29">net(x)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="1">
<pre><code>tensor([[ 8.0000],
        [10.0000],
        [ 9.0000],
        [21.0001],
        [12.0001]], grad_fn=&lt;AddmmBackward0&gt;)</code></pre>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphML/torchBasic_files/figure-html/cell-2-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
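<p>To see what the loop does under the hood, here is a rough sketch of the same pattern (forward pass, loss, gradients, update) written in plain Python for a much simpler model. The linear model <code>y = w * x + b</code>, the hand-derived gradients and the learning rate are assumptions chosen for illustration; Torch computes the gradients automatically and the network above is considerably more expressive.</p>

```python
# Same five data points as in the Torch snippet.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [8.0, 10.0, 9.0, 21.0, 12.0]

w, b = 0.0, 0.0          # parameters of the model y = w * x + b
learning_rate = 0.05
loss_sequence = []

for step in range(2000):
    # forward pass: predict from the inputs
    predictions = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(predictions, ys)]

    # mean squared error, the counterpart of torch.nn.MSELoss
    loss = sum(e * e for e in errors) / len(xs)
    loss_sequence.append(loss)

    # "backward": gradients of the loss with respect to w and b
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)

    # "step": move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

<p>A straight line cannot pass through all five points, so this toy model settles on the least-squares fit (slope about 1.9, intercept about 8.2), whereas the Torch network has enough parameters to reproduce the targets almost exactly.</p>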



 ]]></description>
  <category>GraphML</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/torchBasic.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
</item>
<item>
  <title>Analytics or ML?</title>
  <link>https://discovery.graphsandnetworks.com/graphML/analyticsVsML.html</link>
  <description><![CDATA[ 





<p>For many, graph analytics and graph machine learning are synonymous terms. This is not the case, however. In what follows we highlight the main differences and ingredients.</p>
<p>We offer consulting services across the whole graph spectrum, but it might be useful to differentiate the two domains since they lead to different project types and efforts.</p>
<section id="definition-and-scope" class="level3">
<h3 class="anchored" data-anchor-id="definition-and-scope">1. <strong>Definition and Scope</strong></h3>
<ul>
<li><strong>Graph Analytics:</strong>
<ul>
<li>Graph analytics refers to a set of techniques and algorithms used to analyze graph structures. It focuses on understanding the (topological) properties and behavior of graphs, such as the relationships between nodes, the overall structure, and specific patterns within the graph. Common graph analytics tasks include calculating <strong>centrality measures</strong>, detecting <strong>communities</strong>, finding <strong>shortest paths</strong>, and identifying <strong>cliques</strong> or <strong>connected components</strong>.</li>
<li>Graph analytics often involves predefined algorithms that are applied to a graph to extract insights or solve specific problems. The techniques are usually more <strong>deterministic</strong> and involve direct computation on the graph’s structure.</li>
</ul></li>
<li><strong>Graph Machine Learning (GML):</strong>
<ul>
<li>Graph machine learning involves the application of machine learning models to graph-structured data. GML aims to <strong>learn patterns</strong> and <strong>make predictions</strong> based on the data represented in a graph. This might include predicting properties of nodes, edges, or entire subgraphs, generating embeddings for nodes or graphs, and identifying new connections within the graph.</li>
<li>Graph ML is more dynamic and involves training machine learning models on (a lot of) graph data to generalize from patterns in the data, enabling predictions or classifications that go beyond the capabilities of traditional graph analytics.</li>
</ul></li>
</ul>
<blockquote class="blockquote">
<p><strong>Summary:</strong> In essence, analytics is descriptive and deterministic. You can apply it to a graph or a set of graphs of any size. Graph ML is predictive and probabilistic. You need big data to create ML models.</p>
</blockquote>
</section>
<section id="techniques-and-algorithms" class="level3">
<h3 class="anchored" data-anchor-id="techniques-and-algorithms">2. <strong>Techniques and Algorithms</strong></h3>
<ul>
<li><strong>Graph Analytics:</strong>
<ul>
<li><strong>Centrality Measures:</strong> Algorithms like PageRank, degree centrality, betweenness centrality, and eigenvector centrality are used to identify the most important nodes within a graph. Centrality is all about describing which are the important nodes and edges in a graph. The definition of ‘important’ leads to different notions of centrality.</li>
<li><strong>Community Detection:</strong> Algorithms such as modularity optimization (for example the Louvain method) and spectral clustering are used to detect clusters or communities within a graph.</li>
<li><strong>Pathfinding:</strong> Algorithms like Dijkstra’s and A* are used to find the shortest paths between nodes.</li>
<li><strong>Subgraph Matching:</strong> Techniques for identifying specific patterns or motifs within a graph, such as triangles, stars, or more complex structures.</li>
</ul></li>
<li><strong>Graph Machine Learning:</strong>
<ul>
<li><strong>Graph Neural Networks (GNNs):</strong> A class of neural networks designed to operate on graph-structured data, allowing for tasks like node classification, link prediction, and graph classification.</li>
<li><strong>Node/Edge/Graph Embeddings:</strong> Techniques like DeepWalk, Node2Vec, and GraphSAGE that learn low-dimensional vector representations of nodes, edges, or entire graphs, preserving the structure and properties of the graph.</li>
<li><strong>Supervised and Unsupervised Learning:</strong> GML applies both supervised (with labeled data) and unsupervised learning (without labels) techniques to graphs, learning patterns and making predictions based on the graph’s structure.</li>
</ul></li>
</ul>
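<p>As an illustration of the analytics side, the following is a minimal sketch of PageRank via power iteration in plain Python. The function name, the toy graph and the parameter defaults are assumptions for this example; in practice you would call the implementation in NetworkX or in your graph database.</p>

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration on a dict mapping each node to its list of out-neighbors."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # a node without out-links spreads its rank evenly over all nodes
                for other in nodes:
                    new_rank[other] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Toy directed graph (hypothetical): every node links to "hub".
graph = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))
```

<p>There is no training involved: the same graph always yields the same scores, which is the deterministic character referred to above.</p>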
<p>Note that graph machine learning goes by various names:</p>
<ul>
<li>graph machine learning</li>
<li>geometric deep learning</li>
<li>graph neural networks (GNN)</li>
<li>network embedding</li>
<li>graph representation learning</li>
<li>machine learning on graphs</li>
<li>graph embeddings.</li>
</ul>
<p>Graph analytics can be done in-memory with frameworks like NetworkX or igraph. If you have a large amount of data, you can use a graph database, where vendors provide their own implementations: Neo4j’s GDS has lots of graph analytics and Memgraph has strong support for NetworkX.</p>
<p>Graph machine learning typically relies on GPU/CUDA-capable frameworks like PyTorch or DGL. Like all data science efforts, it’s hard work: it comes with a lot of experimenting, and designing an ML model requires skills and experience. Graph analytics is, in general, much more straightforward.</p>
</section>
<section id="goals-and-applications" class="level3">
<h3 class="anchored" data-anchor-id="goals-and-applications">3. <strong>Goals and Applications</strong></h3>
<ul>
<li><strong>Graph Analytics:</strong>
<ul>
<li>The primary goal is <strong>to explore and understand the structure and properties</strong> of the graph. It helps in answering questions like “<em>Who are the key influencers in a social network?</em>” or “<em>What is the shortest path between two points in a transportation network?</em>”</li>
<li>Applications include network optimization, fraud detection, social network analysis, and infrastructure management, where the main focus is on interpreting the existing structure of the graph.</li>
</ul></li>
<li><strong>Graph Machine Learning:</strong>
<ul>
<li>The goal of GML is to learn from graph data to <strong>make predictions, classifications, or generate embeddings</strong> that can be used in downstream tasks. This could involve predicting future interactions in a network, classifying nodes based on their features and connections, or even generating new graph structures.</li>
<li>Applications include recommendation systems, drug discovery (predicting molecular interactions), predictive maintenance (predicting failures based on equipment graphs), and any scenario where the goal is to predict unknown information based on the graph’s structure and existing data.</li>
</ul></li>
</ul>
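<p>The transportation question above can be sketched in a few lines. The example uses breadth-first search, which finds the path with the fewest hops in an unweighted graph; for weighted distances you would swap in Dijkstra’s algorithm. The network and the function name are hypothetical.</p>

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: fewest hops between two nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # walk back from the goal to the start
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical transport network as an undirected adjacency dict.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(shortest_path(network, "A", "E"))
```

<p>The answer is read directly off the existing structure; nothing is learned or predicted.</p>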
</section>
<section id="complexity-and-flexibility" class="level3">
<h3 class="anchored" data-anchor-id="complexity-and-flexibility">4. <strong>Complexity and Flexibility</strong></h3>
<ul>
<li><strong>Graph Analytics:</strong>
<ul>
<li>Generally involves <strong>straightforward</strong>, <strong>deterministic algorithms</strong>. The complexity depends on the size of the graph and the specific algorithms used but is often bounded by the need to directly compute on the graph’s structure.</li>
<li>Less flexible in adapting to new patterns or unseen data since the analysis is usually based on predefined metrics and algorithms.</li>
</ul></li>
<li><strong>Graph Machine Learning:</strong>
<ul>
<li>More <strong>complex</strong>, as it involves training models that learn from data. The complexity also comes from the need to process graph data in a way that machine learning models can understand (e.g., through embeddings or GNNs).</li>
<li>Highly flexible and adaptable to new data, allowing models to generalize and make predictions even on unseen parts of the graph.</li>
</ul></li>
</ul>
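<p>To give a feel for how a GNN makes a graph digestible for a model, here is a rough sketch of a single message-passing step in plain Python: every node averages its own feature vector with those of its neighbors. The function name and the toy graph are assumptions; a real GNN layer would additionally apply learned weights and a non-linearity, and would be trained on data.</p>

```python
def mean_aggregate(features, neighbors):
    """One untrained message-passing step: average a node's features with its neighbors'."""
    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [sum(values) / len(group) for values in zip(*group)]
    return updated


# Hypothetical path graph 0 - 1 - 2 with two features per node.
features = {0: [1.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

hidden = mean_aggregate(features, neighbors)
print(hidden)
```

<p>After one step the middle node carries information from both ends of the path. Stacking several such steps, each with trainable parameters, is what lets a model generalize to unseen nodes.</p>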
</section>
<section id="output" class="level3">
<h3 class="anchored" data-anchor-id="output">5. <strong>Output</strong></h3>
<ul>
<li><strong>Graph Analytics:</strong>
<ul>
<li>Outputs are usually descriptive metrics or patterns. For example, you might get a list of nodes ranked by their centrality, clusters of nodes that form communities, or a visualization of the shortest path between nodes.</li>
</ul></li>
<li><strong>Graph Machine Learning:</strong>
<ul>
<li>Outputs are predictive in nature. A model embodies learned patterns. For example, predictions about the category of a node, probabilities of new edges forming between nodes, or learned embeddings that can be used in further machine learning tasks.</li>
</ul></li>
</ul>


</section>

 ]]></description>
  <category>ML</category>
  <category>Analytics</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/analyticsVsML.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/art/radial_small.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Cora in Wolfram</title>
  <link>https://discovery.graphsandnetworks.com/graphML/wolfram-cora.html</link>
  <description><![CDATA[ 





<iframe src="../wolfram/cora/index.htm" width="100%" height="100%" frameborder="0" marginheight="0" marginwidth="0" scrolling="auto" title="Cora in Wolfram">
</iframe>



 ]]></description>
  <category>Code</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/wolfram-cora.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/wolfram/cora/HTMLFiles/index_10.gif" medium="image" type="image/gif"/>
</item>
<item>
  <title>Primes Graph</title>
  <link>https://discovery.graphsandnetworks.com/graphML/wolfram-primes.html</link>
  <description><![CDATA[ 





<p>Prime numbers are the atoms of the numeric world and they can be used to uncover relationships between any pair of numbers. To be precise, the relationship we discuss below is defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax%5Csim%20y%20%5CLeftrightarrow%20%7Cx-y%7C=0%5Cmod%202%5Ep,%20%5C;p%5Cin%20%5CBbb%7BP%7D%0A"> with <img src="https://latex.codecogs.com/png.latex?%5CBbb%7BP%7D"> the prime numbers. For each prime you get another network and as can be seen below the topology of the network is very different for each value.</p>
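<p>Reading the definition literally, two numbers are linked when their difference is divisible by 2 to the power p. The sketch below builds the edge list for the numbers 1 to n under that reading; the function name and the range are assumptions for illustration, and the linked notebooks may construct the graph differently.</p>

```python
from itertools import combinations


def relation_edges(n, p):
    """Pairs (x, y) from 1..n whose difference is divisible by 2**p."""
    modulus = 2 ** p
    return [(x, y) for x, y in combinations(range(1, n + 1), 2)
            if (y - x) % modulus == 0]


# For the prime p = 2 the modulus is 4, so numbers 4 or 8 apart are linked.
edges = relation_edges(10, 2)
print(edges)
```

<p>Changing <code>p</code> changes the modulus and hence which numbers end up connected, which is why each prime gives a different network.</p>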
<p><i class="fa-brands fa-github" aria-label="github"></i> <a href="https://github.com/Orbifold/primes-graph" target="_blank">The Wolfram and Python notebooks can be found on Github.</a></p>
<iframe src="../wolfram/PrimesGraph/index.htm" width="100%" height="1500px" frameborder="0" marginheight="0" marginwidth="0" scrolling="auto" title="Primes Graph in Wolfram">
</iframe>



 ]]></description>
  <category>Code</category>
  <category>Mathematics</category>
  <category>Wolfram</category>
  <guid>https://discovery.graphsandnetworks.com/graphML/wolfram-primes.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/wolfram/cora/HTMLFiles/index_10.gif" medium="image" type="image/gif"/>
</item>
<item>
  <title>NetworkX to Wolfram</title>
  <link>https://discovery.graphsandnetworks.com/graphAnalytics/wolfram-nx2w.html</link>
  <description><![CDATA[ 





<iframe src="../wolfram/nx2w/index.htm" width="100%" height="100%" frameborder="0" marginheight="0" marginwidth="0" scrolling="auto" title="Wolfram NetworkX import">
</iframe>



 ]]></description>
  <category>Code</category>
  <category>Wolfram</category>
  <guid>https://discovery.graphsandnetworks.com/graphAnalytics/wolfram-nx2w.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/wolfram/nx2w/HTMLFiles/nx22_10.gif" medium="image" type="image/gif"/>
</item>
<item>
  <title>Diffusion on graphs</title>
  <link>https://discovery.graphsandnetworks.com/graphAnalytics/diffusion.html</link>
  <description><![CDATA[ 





<div id="cell-1" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> random</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span></code></pre></div></div>
</div>
<p>We will use Erdos-Renyi and Barabasi-Albert graphs as a basis for our diffusions (the cell below also creates a Gaussian random partition graph):</p>
<div id="cell-3" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span></span>
<span id="cb2-2">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb2-3">G1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb2-4">G2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.barabasi_albert_graph(N, k)</span>
<span id="cb2-5">G3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.gaussian_random_partition_graph(N,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span></code></pre></div></div>
</div>
<section id="si-model" class="level1">
<h1>SI model</h1>
<p>The <a href="https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology">SI model</a> is the simplest form of all disease models. Individuals are born into the simulation with no immunity (susceptible). Once infected and with no treatment, individuals stay infected and infectious throughout their life, and remain in contact with the susceptible population. This model matches the behavior of diseases like cytomegalovirus (CMV) or herpes.</p>
<p>In a continuous setting the model is defined via this set of coupled differential equations:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A&amp;%20%5Cfrac%7Bd%20S%7D%7Bdt%7D%20=%20-%5Cfrac%7B%5Cbeta%20S%20I%7D%7BN%7D%20%5C%5C%0A&amp;%20%5Cfrac%7Bd%20I%7D%7Bdt%7D%20=%20%5Cbeta%20I%20(1%20-%20%5Cfrac%7BI%7D%7BN%7D)%0A%5Cend%7Balign*%7D%0A"> with <img src="https://latex.codecogs.com/png.latex?N%20=%20S%20+%20I"> the total population consisting of <img src="https://latex.codecogs.com/png.latex?S"> susceptible individuals and <img src="https://latex.codecogs.com/png.latex?I"> infected individuals. The factor <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is the diffusion rate leading to a logistic growth of the infected population.</p>
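<p>As a quick sanity check of the logistic claim, the second equation has the closed-form solution <code>I(t) = N / (1 + (N/I0 - 1) * exp(-beta * t))</code>, where <code>I0</code> is the number of infected individuals at time zero. The sketch below compares that formula with a plain Euler integration of the equation. The function names and the parameter values are illustrative assumptions, not part of the original notebook.</p>

```python
import math


def logistic_infected(t: float, n: float, i0: float, beta: float) -> float:
    """Closed-form solution of dI/dt = beta * I * (1 - I/N) with I(0) = i0."""
    return n / (1.0 + (n / i0 - 1.0) * math.exp(-beta * t))


def euler_infected(t_end: float, n: float, i0: float, beta: float,
                   dt: float = 0.01) -> float:
    """Integrate the same equation with explicit Euler steps of size dt."""
    steps = int(round(t_end / dt))
    i = i0
    for _ in range(steps):
        i += dt * beta * i * (1.0 - i / n)
    return i


# Illustrative values only.
N, I0, BETA = 1500, 2, 0.5
exact = logistic_infected(20.0, N, I0, BETA)
approx = euler_infected(20.0, N, I0, BETA)
print(exact, approx)
```

<p>Both numbers approach <code>N</code> as time grows, which is the saturation visible in the graph simulations.</p>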
<p>This simple diffusion model is straightforward to implement on a graph. Given a connection from an infected node to a susceptible one, a random value below a threshold (the diffusion parameter <code>beta</code>) decides whether the susceptible node gets infected. The loop runs for a fixed number of time steps, which is normally enough for the infection to reach the whole connected network. In this model every individual that can be reached gets infected eventually, which corresponds to the logistic curve seen below.</p>
<div id="cell-5" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> SI(G, initial_infected:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, beta:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, N:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>, T:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>):</span>
<span id="cb3-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Simulates the Susceptible-Infected (SI) model on a given graph.</span></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Parameters:</span></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        G (networkx.Graph): The input graph where nodes represent individuals.</span></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        initial_infected (int): The initial number of infected individuals.</span></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        beta (float): The probability of infection transmission between connected nodes.</span></span>
<span id="cb3-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        N (int): The total number of individuals in the graph.</span></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        T (int): The number of time steps to simulate.</span></span>
<span id="cb3-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Returns:</span></span>
<span id="cb3-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        tuple: A tuple containing three numpy arrays:</span></span>
<span id="cb3-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            - susceptible (numpy.ndarray): The number of susceptible individuals at each time step.</span></span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            - infected (numpy.ndarray): The number of infected individuals at each time step.</span></span>
<span id="cb3-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            - infection_delta (numpy.ndarray): The number of new infections at each time step.</span></span>
<span id="cb3-15"></span>
<span id="cb3-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb3-17">    </span>
<span id="cb3-18">    susceptible <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(T) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># how many susceptible at time t</span></span>
<span id="cb3-19">    infected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(T) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># how many infected at time t</span></span>
<span id="cb3-20">    infection_delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(T) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># increase in infection compared to previous time</span></span>
<span id="cb3-21">    infected[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> initial_infected</span>
<span id="cb3-22">    susceptible[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> initial_infected</span>
<span id="cb3-23">    infection_delta[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> infected[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb3-24"></span>
<span id="cb3-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes():</span>
<span id="cb3-26">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb3-27">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neighbors"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [n <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.neighbors(u)]</span>
<span id="cb3-28">    init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random.sample(list(G.nodes()), initial_infected)</span>
<span id="cb3-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> init:</span>
<span id="cb3-30">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-31"></span>
<span id="cb3-32">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,T):</span>
<span id="cb3-33">        susceptible[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> susceptible[t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb3-34">        infected[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> infected[t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb3-35">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes:            </span>
<span id="cb3-36">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb3-37">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># nb_friend_infected = [G.nodes[n]["infected"] == 1 for n in G.nodes[u]["neighbors"]].count(True)</span></span>
<span id="cb3-38">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neighbors"</span>]:</span>
<span id="cb3-39">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> G.nodes[n][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>: </span>
<span id="cb3-40">                        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># infect with probability beta</span></span>
<span id="cb3-41">                        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> np.random.rand() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> beta:</span>
<span id="cb3-42">                            G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-43">                            infected[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-44">                            susceptible[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-45">                            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb3-46">        infection_delta[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> infected[t]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>infected[t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb3-47">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> susceptible, infected, infection_delta</span></code></pre></div></div>
</div>
<p>This is a simple implementation without using adjacency matrices or vectorization. It ain’t the most performant one but it shows best what is going on.</p>
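<p>For comparison, here is a minimal sketch of what a vectorized variant could look like with an adjacency matrix. It is not equivalent to the loop above: all nodes are updated synchronously from the state of the previous step, and a susceptible node with <code>m</code> infected neighbours is infected with probability <code>1 - (1 - beta)**m</code>. The name <code>si_vectorized</code>, the <code>seed</code> argument and the way the test matrix is built are assumptions made for this illustration.</p>

```python
import numpy as np


def si_vectorized(A: np.ndarray, initial_infected: int = 1, beta: float = 0.01,
                  T: int = 1000, seed=None) -> np.ndarray:
    """Synchronous SI on adjacency matrix A; returns the infected count per step."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    state = np.zeros(n, dtype=bool)
    state[rng.choice(n, size=initial_infected, replace=False)] = True
    infected = np.zeros(T)
    infected[0] = state.sum()
    for t in range(1, T):
        # number of infected neighbours of every node
        m = A @ state.astype(float)
        # probability that at least one of m independent contacts transmits
        p_infect = 1.0 - (1.0 - beta) ** m
        newly = np.logical_and(~state, p_infect > rng.random(n))
        state |= newly
        infected[t] = state.sum()
    return infected


# Small Erdos-Renyi style adjacency matrix built directly with NumPy.
rng = np.random.default_rng(0)
n, k = 300, 20
upper = np.triu(k / n > rng.random((n, n)), 1)  # independent edges above the diagonal
A = (upper | upper.T).astype(float)
curve = si_vectorized(A, initial_infected=2, beta=0.03, T=100, seed=1)
print(curve[:10])
```

<p>For a NetworkX graph, <code>nx.to_numpy_array(G)</code> gives a matrix that can be passed in as <code>A</code>.</p>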
<section id="erdos-renyi" class="level2">
<h2 class="anchored" data-anchor-id="erdos-renyi">Erdos Renyi</h2>
<p>Let’s show explicitly what happens when using an Erdos-Renyi graph.</p>
<div id="cell-9" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb4-2">N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span> </span>
<span id="cb4-3">beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03</span></span>
<span id="cb4-4">initial_infected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> </span>
<span id="cb4-5"></span>
<span id="cb4-6">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb4-7">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb4-8">s_erdos, inf_erdos,infection_delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G,initial_infected,beta, N, T)</span>
<span id="cb4-9">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>s_erdos, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'b'</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Susceptible p=0.04"</span>)</span>
<span id="cb4-10">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>inf_erdos, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Infected  p=0.04"</span>)</span>
<span id="cb4-11">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time"</span>)</span>
<span id="cb4-12">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Percentage of population infected"</span>)</span>
<span id="cb4-13">plt.legend()</span>
<span id="cb4-14">plt.show()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphAnalytics/diffusion_files/figure-html/cell-5-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Saturation happens extremely fast, which is due to the density of the network. If you reduce the mean degree <code>k</code>, some individuals never get infected because the sparse network leaves them unreachable:</p>
<div id="cell-11" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"></span>
<span id="cb6-2">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb6-3">T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb6-4">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb6-5">s_erdos, inf_erdos,infection_delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G,initial_infected,beta, N, T)</span>
<span id="cb6-6">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>s_erdos,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Susceptible p=0.01"</span>)</span>
<span id="cb6-7">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>inf_erdos,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Infected  p=0.01"</span>)</span>
<span id="cb6-8"></span>
<span id="cb6-9">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time"</span>)</span>
<span id="cb6-10">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">% i</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">nfected"</span>)</span>
<span id="cb6-11">plt.legend()</span>
<span id="cb6-12">plt.show()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphAnalytics/diffusion_files/figure-html/cell-6-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>If you reduce <img src="https://latex.codecogs.com/png.latex?k"> sufficiently there will be no infections at all. The problem with the model is that there is some randomness, but given enough time infection will always happen along any existing connection.</p>
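<p>Since infection only travels along edges, the final infected share is bounded by the connected components that contain the initial seeds. The sketch below measures that bound for a few mean degrees. The helper <code>reachable_fraction</code>, the fixed seeds and the chosen values of <code>k</code> are illustrative assumptions.</p>

```python
import random

import networkx as nx


def reachable_fraction(G: nx.Graph, seeds) -> float:
    """Fraction of nodes lying in a connected component that contains a seed."""
    reachable = set()
    for s in seeds:
        reachable |= nx.node_connected_component(G, s)
    return len(reachable) / G.number_of_nodes()


N = 1500
for k in (20, 3, 0.5):
    G = nx.erdos_renyi_graph(N, k / N, seed=42)
    # sample from a list, since sampling from a node view is not supported
    seeds = random.Random(0).sample(list(G.nodes()), 2)
    print(k, round(reachable_fraction(G, seeds), 3))
```

<p>When the mean degree is well above one, almost every node sits in a single giant component, so the bound is close to one. Below that, the seeds are typically stuck in tiny components and the infection cannot spread.</p>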
<p>If we now look at the infection deltas you can see that there is an initial explosion of infections with a Poisson-like distribution:</p>
<div id="cell-14" class="cell" data-execution_count="16">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"></span>
<span id="cb8-2">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the mean degree of the networks</span></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># defining an erdos renyi network</span></span>
<span id="cb8-4">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb8-5">s_erdos, inf_erdos,infection_delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G,initial_infected,beta, N, T)</span>
<span id="cb8-6">plt.plot(infection_delta,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Infected  p=0.01"</span>)</span>
<span id="cb8-7">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time"</span>)</span>
<span id="cb8-8">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Number of new cases"</span>)</span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##</span></span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphAnalytics/diffusion_files/figure-html/cell-7-output-3.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="barabasi-albert" class="level2">
<h2 class="anchored" data-anchor-id="barabasi-albert">Barabasi-Albert</h2>
<p>Let’s look at the Barabasi-Albert graph propagation:</p>
<div id="cell-16" class="cell" data-execution_count="17">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"></span>
<span id="cb11-2">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb11-3">N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb11-4">initial_infected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb11-5"></span>
<span id="cb11-6">G1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb11-7">G1_demi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb11-8">G2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.barabasi_albert_graph(N, k)</span>
<span id="cb11-9">G2_demi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.barabasi_albert_graph(N, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span>
<span id="cb11-10"></span>
<span id="cb11-11">s_ER, inf_ER,nb_inf_t_ER <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G1,initial_infected,beta, N, T)</span>
<span id="cb11-12">s_ER_demi, inf_ER_demi,nb_inf_t_ER_demi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G1_demi,initial_infected,beta, N, T)</span>
<span id="cb11-13"></span>
<span id="cb11-14">s_BA, inf_BA,nb_inf_t_BA <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G2,initial_infected,beta, N, T)</span>
<span id="cb11-15">s_BA_demi, inf_BA_demi,nb_inf_t_BA_demi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SI(G2_demi,initial_infected,beta, N, T)</span>
<span id="cb11-16">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb11-17">plt.plot(nb_inf_t_ER,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Network:ER Infected  p=0.04"</span>)</span>
<span id="cb11-18">plt.plot(nb_inf_t_ER_demi,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r--"</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Network:ER Infected  p=0.02"</span>)</span>
<span id="cb11-19"></span>
<span id="cb11-20"></span>
<span id="cb11-21">plt.plot(nb_inf_t_BA,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b"</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Network:BA Infected  p=0.04"</span>)</span>
<span id="cb11-22">plt.plot(nb_inf_t_BA_demi,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"b--"</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Network:BA Infected  p=0.02"</span>)</span>
<span id="cb11-23">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time"</span>)</span>
<span id="cb11-24">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Number of new cases"</span>)</span>
<span id="cb11-25">plt.xlim(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>)</span>
<span id="cb11-26"></span>
<span id="cb11-27">plt.legend()</span>
<span id="cb11-28">plt.show()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphAnalytics/diffusion_files/figure-html/cell-8-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
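<p>One caveat when reading this comparison, sketched here as plain arithmetic rather than as part of the original notebook: passing the same <code>k</code> to both generators does not give the same mean degree. The sketch assumes that the Barabási-Albert generator attaches exactly <code>m</code> edges for every node added after the initial ones, for <code>m * (N - m)</code> edges in total. Treat that edge count, and the helper names, as assumptions made for illustration.</p>

```python
N = 500   # number of nodes, as in the cell above
k = 20    # value passed to both generators

def er_mean_degree(n, p):
    """Expected mean degree of an Erdos-Renyi graph G(n, p)."""
    return p * (n - 1)

def ba_mean_degree(n, m):
    """Mean degree of a Barabasi-Albert graph, assuming m * (n - m) edges in total."""
    return 2 * m * (n - m) / n

print("ER, p = k/N    :", er_mean_degree(N, k / N))
print("ER, p = k/(2N) :", er_mean_degree(N, 0.5 * k / N))
print("BA, m = k      :", ba_mean_degree(N, k))
print("BA, m = k/2    :", ba_mean_degree(N, k // 2))
```

<p>Under that assumption the solid blue curve runs on a graph with roughly twice the mean degree of the solid red one, so part of the gap between the two may come from density rather than from the degree distribution. The dashed blue curve (<code>m = k/2</code>) is the closer match to the solid red one.</p>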
</section>
</section>
<section id="sir-model" class="level1">
<h1>SIR model</h1>
<p>The SI model is not satisfactory because everyone eventually gets infected: it does not take recovery, immunity or death into account. The SIR model improves on the SI model by adding a compartment of removed individuals (recovered and immune, or deceased), denoted by <img src="https://latex.codecogs.com/png.latex?R">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Balign*%7D%0A&amp;%20%5Cfrac%7Bd%20S%7D%7Bdt%7D%20=%20-%5Cfrac%7B%5Cbeta%20S%20I%7D%7BN%7D,%20%5C%5C%0A&amp;%20%5Cfrac%7Bd%20I%7D%7Bdt%7D%20=%20%5Cfrac%7B%5Cbeta%20S%20I%7D%7BN%7D%20-%20%5Cgamma%20I,%20%5C%5C%0A&amp;%20%5Cfrac%7Bd%20R%7D%7Bdt%7D%20=%20%5Cgamma%20I.%0A%5Cend%7Balign*%7D%0A"></p>
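<p>The <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> coefficient is the removal rate: it sets how fast infected individuals recover or die, so its inverse is the average duration of an infection. In the network simulation below, the <code>gamma</code> argument plays the role of that duration, i.e. the number of time steps a node stays infected before it is removed.</p>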
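<p>Before moving to networks, it can help to integrate the three equations directly. The following is a minimal sketch using a forward-Euler step, with the infection term <code>beta * S * I / N</code> leaving the susceptible compartment and entering the infected one. Here <code>gamma</code> is a rate per unit time, and the function name, the step size and the parameter values are illustrative assumptions, not values taken from the notebook.</p>

```python
def sir_euler(N, I0, beta, gamma, T, dt=0.1):
    """Integrate the mean-field SIR equations with a forward-Euler scheme.

    Returns a list of (S, I, R) tuples, one per time step.
    """
    S, I, R = float(N - I0), float(I0), 0.0
    history = [(S, I, R)]
    for _ in range(round(T / dt)):
        new_infections = beta * S * I / N * dt   # flow S -> I
        new_removals = gamma * I * dt            # flow I -> R
        S -= new_infections
        I += new_infections - new_removals
        R += new_removals
        history.append((S, I, R))
    return history

# Illustrative values only: beta / gamma = 3, so the outbreak takes off.
history = sir_euler(N=5000, I0=2, beta=0.6, gamma=0.2, T=100)
peak_infected = max(i for _, i, _ in history)
print("peak number of infected:", round(peak_infected))
print("final state (S, I, R):", tuple(round(x) for x in history[-1]))
```

<p>Because every flow that leaves one compartment enters another, <code>S + I + R</code> stays equal to <code>N</code> at every step, which is a quick sanity check for any implementation.</p>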
<div id="cell-18" class="cell" data-execution_count="19">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> SIR(G, initial_infected, gamma, beta, N, T):</span>
<span id="cb13-2">    </span>
<span id="cb13-3">    susceptible <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(T)</span>
<span id="cb13-4">    removed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(T)</span>
<span id="cb13-5">    infected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(T)</span>
<span id="cb13-6">    infected[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> initial_infected</span>
<span id="cb13-7">    susceptible[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> initial_infected</span>
<span id="cb13-8">    </span>
<span id="cb13-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes():</span>
<span id="cb13-10">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb13-11">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infectedDuration"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb13-12">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neighbors"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [n <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.neighbors(u)]</span>
<span id="cb13-13"></span>
<span id="cb13-14">    init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random.sample(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(G.nodes()), initial_infected)</span>
<span id="cb13-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> init:</span>
<span id="cb13-16">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-17">        G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infectedDuration"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># running simulation</span></span>
<span id="cb13-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,T):</span>
<span id="cb13-20">        susceptible[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> susceptible[t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb13-21">        infected[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> infected[t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb13-22">        removed[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> removed[t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb13-23">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check which persons have recovered</span></span>
<span id="cb13-24">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes:</span>
<span id="cb13-25">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if infected</span></span>
<span id="cb13-26">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:</span>
<span id="cb13-27">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infectedDuration"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> gamma:</span>
<span id="cb13-28">                    G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infectedDuration"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-29">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb13-30">                    G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#"recovered"</span></span>
<span id="cb13-31">                    removed[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-32">                    infected[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-33">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># check contagion    </span></span>
<span id="cb13-34">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> u <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes:</span>
<span id="cb13-35">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#if susceptible</span></span>
<span id="cb13-36">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb13-37">                nb_friend_infected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [G.nodes[n][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neighbors"</span>]].count(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb13-38">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#print(nb_friend_infected)</span></span>
<span id="cb13-39">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> n <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neighbors"</span>]:</span>
<span id="cb13-40">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> G.nodes[n][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>: <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if friend is infected</span></span>
<span id="cb13-41">                        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># infect with probability beta</span></span>
<span id="cb13-42">                        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> np.random.rand() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> beta:</span>
<span id="cb13-43">                            G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"infected"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-44">                            infected[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-45">                            susceptible[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb13-46">                            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb13-47">    </span>
<span id="cb13-48">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> susceptible, infected,removed</span></code></pre></div></div>
</div>
<p>Once again, this isn’t the most efficient implementation, but it is easy to follow.</p>
<p>If we apply the algorithm to an Erdős-Rényi graph:</p>
<div id="cell-20" class="cell" data-execution_count="21">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb14-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># time of simulation</span></span>
<span id="cb14-3">T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb14-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of agents</span></span>
<span id="cb14-5">N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span></span>
<span id="cb14-6">beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03</span></span>
<span id="cb14-7">gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb14-8">initial_infected <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb14-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mean degree of the networks</span></span>
<span id="cb14-10">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb14-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># defining an erdos renyi network</span></span>
<span id="cb14-12">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb14-13"></span>
<span id="cb14-14">s_erdos, inf_erdos,r_erdos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SIR(G,initial_infected,gamma,beta, N, T)</span>
<span id="cb14-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#plt.plot((100/N)*s_erdos, color='b',marker='+', label="Susceptible k=20")</span></span>
<span id="cb14-16">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>inf_erdos, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Infected p=0.004"</span>)</span>
<span id="cb14-17">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>r_erdos, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Recovered p=0.004"</span>)</span>
<span id="cb14-18"></span>
<span id="cb14-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#</span></span>
<span id="cb14-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mean degree of the networks</span></span>
<span id="cb14-21">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb14-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># defining an erdos renyi network</span></span>
<span id="cb14-23">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.erdos_renyi_graph(N,k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)</span>
<span id="cb14-24"></span>
<span id="cb14-25">s_erdos, inf_erdos,r_erdos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SIR(G,initial_infected,gamma,beta, N, T)</span>
<span id="cb14-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#plt.plot((100/N)*s_erdos,color="b",marker='o',label="Susceptible k=10")</span></span>
<span id="cb14-27">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>inf_erdos,color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>,label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Infected p=0.002"</span>)</span>
<span id="cb14-28">plt.plot((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>N)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>r_erdos,color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>,marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>,label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Recovered p=0.002"</span>)</span>
<span id="cb14-29"></span>
<span id="cb14-30">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time"</span>)</span>
<span id="cb14-31">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Percentage of population infected"</span>)</span>
<span id="cb14-32">plt.legend()</span>
<span id="cb14-33">plt.show()</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://discovery.graphsandnetworks.com/graphAnalytics/diffusion_files/figure-html/cell-10-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>These epidemiological models are not limited to flat spaces, and it’s interesting to apply them to other topologies, like a torus. For the images below we used Python inside <a href="https://www.maxon.net/en/cinema-4d">Cinema 4D</a> to simulate the infection on a flat surface and on a torus:</p>
<p><img src="https://discovery.graphsandnetworks.com/images/InfectionFlat.gif" class="img-fluid" width="300"> <img src="https://discovery.graphsandnetworks.com/images/InfectionTorus.png" class="img-fluid" width="300"></p>
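<p>The Cinema 4D script is not shown here, so the following is only a minimal sketch of the same idea in plain Python. It assumes a square lattice whose edges wrap around (which makes it a torus), four neighbours per cell, a synchronous update and a fixed infection duration. The function names and parameter values are illustrative.</p>

```python
import random

def torus_neighbors(x, y, width, height):
    """Four nearest neighbours on a grid whose edges wrap around (a torus)."""
    return [((x - 1) % width, y), ((x + 1) % width, y),
            (x, (y - 1) % height), (x, (y + 1) % height)]

def sir_on_torus(width, height, beta, duration, steps, seed=0):
    """Run a discrete SIR process on a torus and return (S, I, R) counts per step."""
    rng = random.Random(seed)
    # state: 0 = susceptible, 1 = infected, 2 = removed
    state = {(x, y): 0 for x in range(width) for y in range(height)}
    age = dict.fromkeys(state, 0)
    state[(0, 0)] = 1  # patient zero; on a torus no cell is a corner or an edge
    counts = []
    for _ in range(steps):
        nxt = dict(state)  # synchronous update: read old state, write new state
        for (x, y), s in state.items():
            if s == 1:
                age[(x, y)] += 1
                if age[(x, y)] >= duration:
                    nxt[(x, y)] = 2
            elif s == 0:
                for nb in torus_neighbors(x, y, width, height):
                    # each infected neighbour transmits with probability beta
                    if state[nb] == 1 and beta > rng.random():
                        nxt[(x, y)] = 1
                        break
        state = nxt
        values = list(state.values())
        counts.append((values.count(0), values.count(1), values.count(2)))
    return counts

counts = sir_on_torus(width=30, height=30, beta=0.3, duration=5, steps=60)
print("peak number of infected cells:", max(i for _, i, _ in counts))
```

<p>Replacing the modulo in <code>torus_neighbors</code> with a bounds check gives the flat variant, where cells on the border have fewer neighbours. That is the only difference between the two topologies in this sketch.</p>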


</section>

 ]]></description>
  <category>Graphs</category>
  <category>Epidemiology</category>
  <guid>https://discovery.graphsandnetworks.com/graphAnalytics/diffusion.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
  <media:content url="https://discovery.graphsandnetworks.com/images/InfectionTorus.png" medium="image" type="image/png" height="107" width="144"/>
</item>
<item>
  <title>Entity Resolution</title>
  <link>https://discovery.graphsandnetworks.com/graphAnalytics/entityResolution1.html</link>
  <description><![CDATA[ 





<p>See also this <a href="../graphDB/jaccard_gds.html">related note on how to use Jaccard in Neo4j via GDS</a>.</p>
<p>Many companies use common sense when it comes to entity resolution:</p>
<ul>
<li>use regex and heuristics to sniff out similarities</li>
<li>go a step further with NLP techniques (e.g.&nbsp;<a href="https://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">Levenshtein distance</a>)</li>
<li>Google stuff and end up with complex SQL queries or <a href="https://aws.amazon.com/entity-resolution/" target="_blank">AWS Entity Resolution Service</a>.</li>
</ul>
<p>These approaches work well in some cases, but none of them can access the topological information that every enterprise entity carries, sometimes implicitly, sometimes explicitly. That is, an entity is never a data island; it always comes with a cloud of related information. This cloud of relations is a graph (network). Accurate entity resolution is, hence, a problem of matching not only the properties of an entity but also its context. The context sharpens the similarity measure and you should not omit it.</p>
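<p>To make the point about context concrete, here is a minimal sketch that blends a property score (string similarity on names) with a context score (Jaccard overlap of neighbouring nodes). The records, the field names and the equal weighting are hypothetical and chosen only for illustration. <code>difflib.SequenceMatcher</code> from the standard library stands in for a Levenshtein-style measure.</p>

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Property score: similarity ratio of two names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    """Context score: overlap of two neighbour sets, between 0 and 1."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def combined_score(rec_a, rec_b, weight=0.5):
    """Weighted blend of the property score and the context score."""
    prop = name_similarity(rec_a["name"], rec_b["name"])
    ctx = jaccard(rec_a["neighbors"], rec_b["neighbors"])
    return weight * prop + (1 - weight) * ctx

# Hypothetical records: the same person twice, plus a near-namesake.
alice_1 = {"name": "Alice Smith", "neighbors": {"ip:10.0.0.1", "device:A", "movie:42"}}
alice_2 = {"name": "A. Smith", "neighbors": {"ip:10.0.0.1", "device:A"}}
other = {"name": "Alice Smyth", "neighbors": {"ip:10.9.9.9", "device:Z"}}

for label, rec in [("alice_2", alice_2), ("other", other)]:
    print(label,
          "name only:", round(name_similarity(alice_1["name"], rec["name"]), 2),
          "combined:", round(combined_score(alice_1, rec), 2))
```

<p>On these made-up records the name-only comparison favours the near-namesake, while the combined score favours the record that shares an IP address and a device. That is the kind of correction the surrounding graph can provide.</p>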
<p>There can be many reasons why companies don’t approach entity resolution from a graph point of view:</p>
<ul>
<li>lack of expertise with graphs and graph databases</li>
<li>missing know-how in teams</li>
<li>additional complexity</li>
<li>time to market</li>
<li>licensing cost, additional infrastructure</li>
</ul>
<p>and more. The transition from SQL to graphs can indeed be a project of its own, and I have had customers who decided to avoid graphs knowing full well that graphs would increase accuracy, but also increase cost in equal measure.</p>
<p>In this and subsequent notebooks we will try to explain various techniques to approach entity resolution. You can find vendor-specific articles about this but they (no surprise there) are usually focused on selling you a product rather than a solution.</p>
<section id="sample-data" class="level2">
<h2 class="anchored" data-anchor-id="sample-data">Sample Data</h2>
<p>We’ll use a network originally from Neo4j (see <a href="https://github.com/neo4j-graph-examples/entity-resolution" target="_blank">this GitHub repo</a>) and converted to NetworkX format <a href="../graphAnalytics/neo2nx.html">here</a>.</p>
<p>The schema is straightforward and the main objective is to consolidate users (label: User):</p>
<p><img src="https://discovery.graphsandnetworks.com/images/EntityResolutionSchema.png" class="img-fluid"></p>
<p>The data can be fetched from the pickle back into NetworkX like this:</p>
<div id="cell-2" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> networkx <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nx</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pickle</span>
<span id="cb1-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../data/EntityResolution.pkl'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rb'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb1-4">    G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pickle.load(f)</span>
<span id="cb1-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(G)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>MultiDiGraph with 1237 nodes and 1819 edges</code></pre>
</div>
</div>
<p>The data in the nodes consists of the usual suspects:</p>
<div id="cell-4" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes():</span>
<span id="cb3-2">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(G.nodes[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>])</span>
<span id="cb3-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{'labels': ['User'], 'properties': {'lastName': 'Burbidge', 'country': 'US', 'firstName': 'Dorette', 'gender': 'Male', 'phone': '834-424-8856', 'state': 'Ohio', 'userId': 1, 'email': 'dburbidge0@japanpost.jp'}}</code></pre>
</div>
</div>
</section>
<section id="properties" class="level2">
<h2 class="anchored" data-anchor-id="properties">Properties</h2>
<p>As mentioned above, common sense dictates looking at the properties to identify similarities. At the string level you have various options, and Python ships with <code>SequenceMatcher</code> in the standard <code>difflib</code> module. Taking random pairs of User nodes and computing the similarity of their full names goes like this:</p>
<div id="cell-6" class="cell" data-execution_count="21">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> difflib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SequenceMatcher</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> random</span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take the ids of the User labels</span></span>
<span id="cb5-4">user_ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"User"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> G.nodes(data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"labels"</span>]]</span>
<span id="cb5-5">pairs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [random.sample(user_ids , k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)]</span>
<span id="cb5-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> pair <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> pairs:</span>
<span id="cb5-7">    u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> G.nodes(data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)[pair[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>]</span>
<span id="cb5-8">    v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> G.nodes(data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)[pair[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>]</span>
<span id="cb5-9">    u_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> u[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"firstName"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> u[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lastName"</span>]</span>
<span id="cb5-10">    v_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> v[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"firstName"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lastName"</span>]</span>
<span id="cb5-11">    w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SequenceMatcher(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,u_name, v_name)</span>
<span id="cb5-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(u_name, v_name, w.ratio())</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Nert Picopp Deni Jest 0.2
Konstanze Absolem Ronda Skilling 0.25806451612903225
Maxine Clousley Tarrah Caw 0.24
Port Prozescky Cacilie Nicklin 0.20689655172413793
Brana Cordeau Aurora Broadis 0.2962962962962963
Meryl Boncore Thedrick Dwine 0.37037037037037035
Riki Prujean Cece Brownstein 0.14814814814814814
Lynnelle Minkin Bar O'Hear 0.08
Carr Keemar Thedrick Dwine 0.24
Bobbe Wittering Maxine Francescone 0.24242424242424243</code></pre>
</div>
</div>
<p>This is not a practical solution because the number of possible pairs grows quadratically with the number of nodes. There is also the issue of missing data and how to impute it, a classic data science problem.</p>
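<p>One common way to keep the number of comparisons in check is blocking: only compare records that share a cheap key. The sketch below is an illustration and not part of the original notebook; the toy records and the choice of key (first letter of the last name) are assumptions made for the example.</p>

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; in the notebook these would be the User properties.
records = {
    1: {"firstName": "Wanda", "lastName": "Wroath"},
    2: {"firstName": "Wanda", "lastName": "W"},
    3: {"firstName": "Nadine", "lastName": "Garnsey"},
    4: {"firstName": "Ondrea", "lastName": "Garnsey"},
    5: {"firstName": "Gerard", "lastName": "Toy"},
}


def blocking_key(record):
    # Illustrative key: first letter of the last name, lower-cased.
    return record["lastName"][:1].lower()


def full_name(record):
    return record["firstName"] + " " + record["lastName"]


# Group record ids by key so that only ids in the same block get compared.
blocks = defaultdict(list)
for record_id, record in records.items():
    blocks[blocking_key(record)].append(record_id)

candidate_pairs = [
    pair for ids in blocks.values() for pair in combinations(sorted(ids), 2)
]

# String similarity is computed for the candidate pairs only.
scores = {
    (u, v): SequenceMatcher(None, full_name(records[u]), full_name(records[v])).ratio()
    for (u, v) in candidate_pairs
}

all_pairs = len(list(combinations(records, 2)))
print(f"{len(candidate_pairs)} candidate pairs instead of {all_pairs}")
for pair, score in scores.items():
    print(pair, round(score, 3))
```

<p>The price of blocking is recall: two records that belong together but end up in different blocks are never compared, so the key has to be chosen with the data in mind.</p>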
</section>
<section id="topology" class="level2">
<h2 class="anchored" data-anchor-id="topology">Topology</h2>
<p>The most basic topological similarity is based on the idea that the more friends you share, the more likely you know each other. The more two nodes share the same nodes in their 1-hop neighborhood, the more similar they are. This is expressed in the <strong>Jaccard similarity</strong> and is <a href="https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_prediction.jaccard_coefficient.html#networkx.algorithms.link_prediction.jaccard_coefficient">easy to compute</a>. We are only interested in the similarity between User nodes, so we pass the list of User pairs as the second parameter. Note that the given tuples have nothing to do with edges; we are only specifying which pairs are of interest:</p>
<div id="cell-9" class="cell" data-execution_count="34">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.Graph(G) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Jaccard will not work with multi-graphs</span></span>
<span id="cb7-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> itertools</span>
<span id="cb7-3">user_combinations <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(itertools.combinations(user_ids, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># around 125K combinations</span></span>
<span id="cb7-4">jaccard_all_sim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nx.jaccard_coefficient(G, user_combinations)</span>
<span id="cb7-5">jaccard_user_sim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [(u,v,w) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (u,v,w) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> jaccard_all_sim <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> w<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>]</span>
<span id="cb7-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (u,v,w) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> jaccard_user_sim:</span>
<span id="cb7-7">    user_u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>]</span>
<span id="cb7-8">    user_v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> G.nodes[v][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>]</span>
<span id="cb7-9">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(user_u[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"firstName"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> user_u[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lastName"</span>], user_v[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"firstName"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> user_v[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lastName"</span>], w)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Shirlene Borres Forrester Borres 1.0
Ondrea Garnsey Nadine Garnsey 1.0
Wanda W Wanda Wroath 1.0</code></pre>
</div>
</div>
<p>It’s important to stress here that the similarity is purely topological and does not depend on the properties of the nodes. That is, the method suggests that the nodes are similar because their neighborhood has some similarities.</p>
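<p>To make the definition concrete, here is a minimal sketch that computes the coefficient by hand (size of the shared neighborhood divided by the size of the combined neighborhood) and compares it with NetworkX. The toy graph and its node names are hypothetical and only serve the illustration.</p>

```python
import networkx as nx

# Hypothetical toy graph: users connected to the phone and email nodes they use.
T = nx.Graph()
T.add_edges_from([
    ("user_a", "phone_1"), ("user_a", "email_1"),
    ("user_b", "phone_1"), ("user_b", "email_1"),
    ("user_c", "phone_1"), ("user_c", "email_2"),
])


def jaccard(graph, u, v):
    # Shared 1-hop neighbors divided by all 1-hop neighbors of either node.
    nu, nv = set(graph[u]), set(graph[v])
    union = nu.union(nv)
    return len(nu.intersection(nv)) / len(union) if union else 0.0


pairs = [("user_a", "user_b"), ("user_a", "user_c")]
by_hand = {(u, v): jaccard(T, u, v) for (u, v) in pairs}
by_networkx = {(u, v): w for (u, v, w) in nx.jaccard_coefficient(T, pairs)}
print(by_hand)
print(by_networkx)
```

<p>Here <code>user_a</code> and <code>user_b</code> have identical neighborhoods, while <code>user_a</code> and <code>user_c</code> share one neighbor out of three. Nothing about names or emails as strings enters the calculation.</p>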
<p>In order to include the topology with the properties you need to step into graph machine learning and use node embeddings (node2vec).</p>
<p>Alternatively, you can use your heuristics together with the Jaccard coefficient to increase accuracy. Yet another road is to use classic machine learning and use the properties together with the topological similarity in your feature engineering.</p>
<ul>
<li>The Jaccard coefficient is <a href="https://neo4j.com/docs/graph-data-science/current/algorithms/similarity-functions/">available in Neo4j via GDS</a></li>
<li><a href="https://docs.aws.amazon.com/neptune-analytics/latest/userguide/jaccard-similarity.html">Neptune Analytics has Jaccard</a></li>
<li>Memgraph calls it simply <a href="https://memgraph.com/docs/advanced-algorithms/available-algorithms/node_similarity">node similarity</a>.</li>
</ul>
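<p>As a rough sketch of combining a heuristic with the Jaccard coefficient, one can blend a string similarity with the topological score. The function name <code>combined_score</code> and the weight <code>alpha</code> are assumptions for illustration, not a recommended setting; the pairs are the three reported by the Jaccard cell above.</p>

```python
from difflib import SequenceMatcher


def combined_score(name_u, name_v, jaccard, alpha=0.5):
    # alpha is a hypothetical weight: 1.0 means names only, 0.0 means topology only.
    name_sim = SequenceMatcher(None, name_u.lower(), name_v.lower()).ratio()
    return alpha * name_sim + (1 - alpha) * jaccard


# The three pairs found by the Jaccard cell, all with coefficient 1.0.
candidates = [
    ("Shirlene Borres", "Forrester Borres", 1.0),
    ("Ondrea Garnsey", "Nadine Garnsey", 1.0),
    ("Wanda W", "Wanda Wroath", 1.0),
]

ranked = sorted(
    ((u, v, combined_score(u, v, w)) for (u, v, w) in candidates),
    key=lambda row: row[2],
    reverse=True,
)
for u, v, score in ranked:
    print(u, "|", v, "|", round(score, 3))
```

<p>A linear blend is the simplest possible choice. Once labelled examples are available, the weighting is better learned than guessed, which is where feature engineering comes in.</p>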
</section>
<section id="feature-engineering" class="level2">
<h2 class="anchored" data-anchor-id="feature-engineering">Feature Engineering</h2>
<p>For data scientists this topic is obvious, but experience with customers shows that it’s much less clear to most people. So, here’s a note on features and how extra info can be used.</p>
<p>No matter the ML technique you use (SVM, XGBoost, neural networks…), you have a table with columns containing the characteristics of stuff and a special column (the target) that you wish to predict. If you wish to train an ML model which learns the similarity of nodes, you collect the relevant properties that each node carries and add them as columns. Since you have two nodes, you have these columns twice. The target column is a boolean saying whether or not the two nodes are similar. There are plenty of variations on this theme but let’s keep it simple.</p>
<p>The more rows you give to the algorithm, the more it will learn the characteristics of ‘the same’ two things (nodes).</p>
<p>Feature engineering is in this respect all about using meaningful features, combining them and converting them into a format suitable for ML. The important bit is also that you can add whatever you think can contribute to the process of detecting patterns. With (property) graphs you can add the payload of the nodes and edges AND the topological information, provided you can encode the topology in a numerical format.</p>
<p>The way a node sits in a graph, the immediate neighborhood (1-hop) or a bit more (N-hops), can be encoded in a vector via a technique called node embeddings. In essence, a vector captures the neighborhood (topology) and this augments the features with information that an ML algorithm can use.</p>
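<p>The table described above can be sketched in a few lines. Everything here is illustrative: the labelled pairs, the feature names and the helper <code>pair_features</code> are assumptions, and a real pipeline would add many more columns (including embedding-based ones).</p>

```python
from difflib import SequenceMatcher

# Hypothetical labelled pairs: (record_u, record_v, jaccard, is_same_entity).
labelled_pairs = [
    ({"name": "Wanda W", "state": "Ohio"},
     {"name": "Wanda Wroath", "state": "Ohio"}, 1.0, True),
    ({"name": "Nert Picopp", "state": "Texas"},
     {"name": "Deni Jest", "state": "Ohio"}, 0.0, False),
]


def pair_features(u, v, jaccard):
    # One row of the training table: property features plus a topological one.
    return {
        "name_similarity": SequenceMatcher(None, u["name"], v["name"]).ratio(),
        "same_state": int(u["state"] == v["state"]),
        "jaccard": jaccard,
    }


rows = [pair_features(u, v, j) for (u, v, j, _) in labelled_pairs]

# Matrix form expected by most ML libraries: one list of numbers per pair,
# plus a target vector saying whether the pair is the same entity.
columns = ["name_similarity", "same_state", "jaccard"]
X = [[row[c] for c in columns] for row in rows]
y = [int(label) for (_, _, _, label) in labelled_pairs]

print(columns)
for features, target in zip(X, y):
    print(features, target)
```

<p><code>X</code> and <code>y</code> can then be handed to whichever classifier you prefer. The topological column sits next to the property columns like any other feature.</p>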
</section>
<section id="embeddings" class="level1">
<h1>Embeddings</h1>
<p>The creation of vectors from nodes (typically abbreviated as node2vec) is a topic we’ll explain elsewhere, but there are various easy-to-use libraries which hide the complexity of assembling these vectors. The best known one is <code>Node2Vec</code> (<code>pip install node2vec</code>), and there is a faster version, <code>fastnode2vec</code> (<code>pip install fastnode2vec</code>), with linear memory usage.</p>
<p>To create an embedding you need to convert the graph to another type of graph (since fastnode2vec inherits from the <a href="https://radimrehurek.com/gensim/">Gensim</a> framework), define the parameters and train things:</p>
<div id="cell-14" class="cell" data-execution_count="57">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastnode2vec <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Graph, Node2Vec</span>
<span id="cb9-2">graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Graph(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(G.edges()), directed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, weighted<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb9-3">n2v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Node2Vec(graph, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, walk_length<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, window<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, workers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-4">n2v.train(epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span></code></pre></div></div>
</div>
<p>This happens in less than a second and you can fetch node vectors like so:</p>
<div id="cell-16" class="cell" data-execution_count="45">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(n2v.wv[user_ids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[ 0.00309572 -0.07105517  0.03536741 -0.00647308  0.0246973   0.01980644
 -0.10460257  0.01136639  0.05422755 -0.06904547]</code></pre>
</div>
</div>
<p>With this in place we can start to use the vectors, and the easiest thing is to use cosine similarity. Since the created vectors are NumPy arrays this is straightforward:</p>
<div id="cell-18" class="cell" data-execution_count="58">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb12-2"></span>
<span id="cb12-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> cosim(id_1, id_2):</span>
<span id="cb12-4">    v1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> n2v.wv[id_1]</span>
<span id="cb12-5">    v2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> n2v.wv[id_2]</span>
<span id="cb12-6">    dot_product <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.dot(v1, v2)</span>
<span id="cb12-7">    norm_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linalg.norm(v1)</span>
<span id="cb12-8">    norm_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linalg.norm(v2)</span>
<span id="cb12-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> dot_product <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (norm_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> norm_b)</span>
<span id="cb12-10"></span>
<span id="cb12-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Cosine Similarity: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cosim(user_ids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], user_ids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Cosine Similarity: 0.4215</code></pre>
</div>
</div>
<p>Cosine similarity measures how closely two vectors point in the same direction. The closer the value is to 1.0, the more aligned they are. The embedding is trained so that vectors are close if their topologies are similar, so high cosine values mean the two nodes have similar neighborhoods.</p>
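<p>When many pairs are involved, the same quantity can be computed for all of them at once by normalising the vectors and taking a matrix product. This is a minimal sketch with a small made-up matrix standing in for the embeddings; the variable names and the threshold are assumptions.</p>

```python
import numpy as np

# Hypothetical stand-in for the embedding matrix: one row per node.
vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
    [1.0, 1.0],
])

# Normalise each row to unit length; dot products are then cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

# Keep each unordered pair once (upper triangle, diagonal excluded).
rows, cols = np.triu_indices(len(vectors), k=1)
threshold = 0.9
close_pairs = [
    (int(i), int(j), float(similarity[i, j]))
    for i, j in zip(rows, cols)
    if similarity[i, j] > threshold
]

print(np.round(similarity, 3))
print(close_pairs)
```

<p>Rows 0 and 1 point in the same direction and differ only in length, so their cosine similarity is 1.0 while every other pair stays below the threshold. The full similarity matrix is quadratic in the number of nodes, so for large graphs an approximate nearest-neighbor index is the usual next step.</p>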
<p>Once again, we look at the user nodes:</p>
<div id="cell-20" class="cell" data-execution_count="59">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">user_combinations <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(itertools.combinations(user_ids, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)) </span>
<span id="cb14-2">cosim_users <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [(u,v, cosim(u,v)) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (u,v) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> user_combinations <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> cosim(u,v) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span> ]</span>
<span id="cb14-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(cosim_users))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>21</code></pre>
</div>
</div>
<p>In decreasing similarity:</p>
<div id="cell-22" class="cell" data-execution_count="61">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">sorted_cosim_users <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>(cosim_users, key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>], reverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb16-2"></span>
<span id="cb16-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Iterate through the sorted list</span></span>
<span id="cb16-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (u, v, w) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> sorted_cosim_users[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>]:</span>
<span id="cb16-5">    user_u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> G.nodes[u][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>]</span>
<span id="cb16-6">    user_v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> G.nodes[v][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"properties"</span>]</span>
<span id="cb16-7">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(user_u[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"firstName"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> user_u[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lastName"</span>], user_v[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"firstName"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> user_v[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lastName"</span>], w)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Wandis Blewett Kristian Twaits 0.96341246
Riki Prujean Huey Ferneyhough 0.95864064
Eal Grombridge Konstanze Absolem 0.9390747
Perceval Pilbury Berk Mauser 0.93898416
Morgan Mongenot Maurine Mesant 0.93397725
Tova Canton Christopher Cowmeadow 0.9336317
Robyn Haskett Christiana Goodey 0.9315399
Danyette Pinnigar Dalt Le Quesne 0.9289893
Gardiner Dudman Lexy Canton 0.92787033
Gerard Toy Loydie Regis 0.9258602</code></pre>
</div>
</div>
<p>You might wonder what the advantage is over the Jaccard method above, especially since both methods only take the topology into account and none of the properties.</p>
<p>The embedding is more general:</p>
<ul>
<li>it’s an ML model trained specifically on the graph</li>
<li>the <code>walk_length</code> parameter reveals that underneath the hood there is a random walk collecting information about a node’s neighborhood. By increasing the walk’s length you increase the size of the topology taken into account. The Jaccard method is 1-hop only. The additional parameters (<code>p</code> and <code>q</code> above) are also related to the random walk (how adventurous or exotic the walk is).</li>
<li>the vector dimension allows one to be more sparse or dense depending on the graph and downstream tasks.</li>
</ul>
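<p>To give a feeling for what <code>p</code> and <code>q</code> do, here is a simplified sketch of a node2vec-style biased walk on an unweighted graph. It is not the implementation used by <code>fastnode2vec</code>; the function <code>biased_walk</code>, its <code>seed</code> argument and the toy graph are assumptions, and <code>walk_length</code> counts steps here.</p>

```python
import random

import networkx as nx


def biased_walk(graph, start, walk_length, p=1.0, q=1.0, seed=None):
    # Second-order walk: the next step depends on the current AND previous node.
    rng = random.Random(seed)
    walk = [start]
    for _ in range(walk_length):
        current = walk[-1]
        neighbors = sorted(graph.neighbors(current))
        if not neighbors:
            break
        if len(walk) == 1:
            # First step has no history, so pick a neighbor uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # going back: discouraged by a large p
            elif graph.has_edge(candidate, previous):
                weights.append(1.0)      # staying near the previous node
            else:
                weights.append(1.0 / q)  # moving outward: encouraged by a small q
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: a ring of eight nodes.
toy = nx.cycle_graph(8)
walk = biased_walk(toy, start=0, walk_length=5, p=2.0, q=0.5, seed=42)
print(walk)
```

<p>With <code>p=2.0</code> and <code>q=0.5</code>, as in the training cell above, stepping back is half as likely as a neutral move and stepping outward twice as likely, so the walks tend to explore rather than linger. Many such walks are then fed to a word2vec-style model to produce the vectors.</p>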
<p>One could say that Jaccard is the simplistic measure and Node2Vec fits in the AI (neural network) space. Whether one is better than the other depends on your tech stack, team, amount of data and so on.</p>


</section>

 ]]></description>
  <category>Business</category>
  <guid>https://discovery.graphsandnetworks.com/graphAnalytics/entityResolution1.html</guid>
  <pubDate>Mon, 30 Mar 2026 09:22:32 GMT</pubDate>
</item>
</channel>
</rss>
