NuExtract

GraphAI

A model dedicated to structured data extraction.

Published

December 6, 2024

NuExtract by NuMind is dedicated to extracting information in a predefined format. It is less like named entity recognition and more like JSON extraction: you supply a JSON template and the model fills it in from the text.

This model can effectively replace structured output driven by Pydantic specs with other LLMs. It also handles complex, nested JSON structures.

Let’s set up the preliminaries:

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    # Round-trip the template through json so it is valid JSON with consistent indentation.
    template = json.dumps(json.loads(template), indent=4)
    # One prompt per input text, in the <|input|> ... <|output|> layout.
    prompts = [f"""<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>""" for text in texts]

    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)

            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    # The decoded text still contains the prompt; keep only what follows the <|output|> marker.
    return [output.split("<|output|>")[1] for output in outputs]

model_name = "numind/NuExtract-v1.5"
device = "mps"  # Apple Silicon GPU; use "cuda" or "cpu" on other hardware
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

The above downloads the model (~10 GB) automatically on the first run.

Let’s start with a simple text and a person/children extraction:

text = """My name is John and I am 56 years old. I have two daughters, Anna and Lisa. Anna is 28 years old and Lisa is 25."""

template = """{
    "Person": {
        "Name": "",
        "Age": "",
        "Children": [
            {
                "Name": "",
                "Age": 0
            }
        ]
    }
     
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

{"Person": {"Name": "John", "Age": "56", "Children": [{"Name": "Anna", "Age": 28}, {"Name": "Lisa", "Age": 25}]}}

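Note that the default value in the template also determines the data type. The children’s ‘Age’ defaults to 0, so it comes back as an integer, whereas the parent’s ‘Age’ defaults to an empty string, so it comes back as the string "56". The ‘Name’ fields are strings.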
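
A quick way to confirm the types is to parse the prediction with `json.loads` and inspect what Python gets back. This is only a minimal sketch; the prediction string is copied from the output above so that the snippet runs without the model.

```python
import json

# Prediction string copied from the output shown above.
prediction = '{"Person": {"Name": "John", "Age": "56", "Children": [{"Name": "Anna", "Age": 28}, {"Name": "Lisa", "Age": 25}]}}'

person = json.loads(prediction)["Person"]

# The parent's age follows the "" default (string),
# the children's ages follow the 0 default (integer).
print("John", type(person["Age"]).__name__)
for child in person["Children"]:
    print(child["Name"], type(child["Age"]).__name__)
```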

Does it work with a graph structure? Yes, it does. Let’s try a simple example:

text = """
- John knows Mary
- Mary knows Peter
- Peter knows John
"""

template = """{
    "nodes": [{"name":""}],
    "edges":[{ "source": "", "target": "" }]
     
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

{"nodes": [
    {
        "name": "John"
    },
    {
        "name": "Mary"
    },
    {
        "name": "Peter"
    }
],
    "edges": [
    {
        "source": "John",
        "target": "Mary"
    },
    {
        "source": "Mary",
        "target": "Peter"
    },
    {
        "source": "Peter",
        "target": "John"
    }
]}
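
Once the nodes and edges are available as JSON, turning them into a usable graph takes only a few lines. The sketch below builds a plain adjacency dictionary; the helper name `to_adjacency` is made up for this illustration, and the prediction is the output above in compact form.

```python
import json

# The prediction from above, written compactly.
prediction = """{"nodes": [{"name": "John"}, {"name": "Mary"}, {"name": "Peter"}],
 "edges": [{"source": "John", "target": "Mary"},
           {"source": "Mary", "target": "Peter"},
           {"source": "Peter", "target": "John"}]}"""

def to_adjacency(graph: dict) -> dict:
    """Map every node name to the list of nodes it points to."""
    adjacency = {node["name"]: [] for node in graph["nodes"]}
    for edge in graph["edges"]:
        # setdefault also covers a source that is missing from the node list.
        adjacency.setdefault(edge["source"], []).append(edge["target"])
    return adjacency

adjacency = to_adjacency(json.loads(prediction))
print(adjacency)  # {'John': ['Mary'], 'Mary': ['Peter'], 'Peter': ['John']}
```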

Let’s try something a bit more complex. Does it understand SPO (subject, predicate, object) triples?

text = """
- John knows Mary
- Mary knows Peter
- Peter knows John
- John is 23 years old
- John has a red car
- Peter is 25 years old
- Mary has a blue car
"""

template = """{
    "nodes": [{"name":"", "age": 0}],
    "edges":[{ "subject": "", "predicate": "", "object": "" }]
     
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

{"nodes": [
    {
        "name": "John",
        "age": 23
    },
    {
        "name": "Mary",
        "age": 0
    },
    {
        "name": "Peter",
        "age": 25
    }
],
"edges": [
    {
        "subject": "John",
        "predicate": "knows",
        "object": "Mary"
    },
    {
        "subject": "Mary",
        "predicate": "knows",
        "object": "Peter"
    },
    {
        "subject": "Peter",
        "predicate": "knows",
        "object": "John"
    }
]}

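This is a simple example, but it shows that the model can understand a graph structure and, specifically, SPO triples. It is not complete, though: the two car statements did not become edges, and Mary’s age is 0 only because the text never mentions it and the template default was kept.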
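
Values that merely repeat the template default, such as Mary’s age of 0 in the output above, can be flagged by comparing each extracted item with the template entry. A minimal sketch follows; the helper `fields_left_at_default` is hypothetical, and a value that genuinely equals the default (a real age of 0) cannot be told apart this way.

```python
import json

# Node part of the template and of the prediction shown above.
template = json.loads('{"nodes": [{"name": "", "age": 0}]}')
prediction = json.loads(
    '{"nodes": [{"name": "John", "age": 23},'
    ' {"name": "Mary", "age": 0},'
    ' {"name": "Peter", "age": 25}]}'
)

def fields_left_at_default(defaults: dict, item: dict) -> list:
    """Return the keys whose value is missing or equals the template default."""
    return [key for key, default in defaults.items() if item.get(key, default) == default]

node_defaults = template["nodes"][0]
unfilled = {}
for node in prediction["nodes"]:
    keys = fields_left_at_default(node_defaults, node)
    if keys:
        unfilled[node["name"]] = keys

print(unfilled)  # {'Mary': ['age']}
```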

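Does it work for graph RAG (LightRAG, to be precise)? The template below asks for a description on every node, and for a description and keywords on every edge: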

text = """
This site serves as a comprehensive resource hub (blog or journal) for all things graphs and our consulting services. It offers a rich collection of notebooks, articles, tutorials, and other resources, covering a wide range of topics from basic graph theory to advanced visualization techniques and AI applications. Whether you’re a beginner looking to learn the fundamentals or an expert seeking to deepen your understanding, you’ll find valuable content here to support your journey.

In addition to the educational materials, the site also provides a wealth of code snippets, practical tricks, and tools designed to enhance your workflow. These resources are ideal for developers, data scientists, and researchers who need quick, effective solutions to common challenges in graph-related projects. The code and tips shared here are the result of years of hands-on experience and experimentation, making them highly practical and immediately applicable to real-world problems.

The site also features reflections on the evolution of graph consulting over the past 20 years. These insights are drawn from decades of experience in the field, offering a unique perspective on how graph technologies have evolved and the impact they have had on various industries. Alongside these reflections, you will find research papers and experimental results that push the boundaries of what is possible with graphs, providing inspiration and guidance for those looking to innovate in this dynamic area.
"""

template = """{
    "nodes": [{"name":"", "description": ""}],
    "edges":[{ "source": "", "description": "", "target": "", "keywords": [] }]
     
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

{"nodes": [{"name": "site", "description": "comprehensive resource hub for all things graphs and our consulting services"}, {"name": "educational materials", "description": "rich collection of notebooks, articles, tutorials, and other resources"}, {"name": "code snippets", "description": "practical tricks, and tools designed to enhance your workflow"}, {"name": "reflections", "description": "on the evolution of graph consulting over the past 20 years"}, {"name": "research papers", "description": "experimental results that push the boundaries of what is possible with graphs"}], "edges": [{"source": "site", "description": "serves as a comprehensive resource hub", "target": "educational materials", "keywords": ["graphs", "consulting services", "resources"]}, {"source": "site", "description": "serves as a comprehensive resource hub", "target": "code snippets", "keywords": ["developers", "data scientists", "researchers"]}, {"source": "site", "description": "serves as a comprehensive resource hub", "target": "reflections", "keywords": ["evolution", "graph consulting", "industries"]}, {"source": "site", "description": "serves as a comprehensive resource hub", "target": "research papers", "keywords": ["graphs", "innovate", "dynamic area"]}]}

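NuExtract is not specifically designed to generate knowledge graphs, but this shallow test shows that it can. This is quite remarkable, even though the result is simple: every edge starts at "site" and carries the same description, so the graph is a star around a single hub.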
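
Before feeding such output into a graph pipeline, it is worth checking that every edge endpoint also appears in the node list. The sketch below uses only the names from the output above (descriptions and keywords omitted); `dangling_edges` is a hypothetical helper and not part of NuExtract or LightRAG.

```python
# Node and edge names taken from the prediction above, descriptions omitted.
graph = {
    "nodes": [
        {"name": "site"},
        {"name": "educational materials"},
        {"name": "code snippets"},
        {"name": "reflections"},
        {"name": "research papers"},
    ],
    "edges": [
        {"source": "site", "target": "educational materials"},
        {"source": "site", "target": "code snippets"},
        {"source": "site", "target": "reflections"},
        {"source": "site", "target": "research papers"},
    ],
}

def dangling_edges(graph: dict) -> list:
    """Return the edges whose source or target is not a known node."""
    names = {node["name"] for node in graph["nodes"]}
    return [
        edge for edge in graph["edges"]
        if edge["source"] not in names or edge["target"] not in names
    ]

print(dangling_edges(graph))  # []
```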

Outside KG creation, it works well for collecting form-like information from arbitrary input. One could scan the generated JSON to check that all required information has been supplied, and iterate again if necessary.
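
Such a scan-and-iterate loop could look roughly like the sketch below. Everything in it is an assumption made for illustration: `extract` stands for any function that returns the model’s JSON string (for example a wrapper around the prediction function above), `ask` stands for whatever mechanism requests the missing details, and the stub functions exist only so the example runs without the model.

```python
import json
from typing import Callable, List

def missing_fields(record: dict, required: List[str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if record.get(key) in (None, "", [], {})]

def collect(extract: Callable[[str], str], ask: Callable[[List[str]], str],
            text: str, required: List[str], max_rounds: int = 3) -> dict:
    """Extract, check the required fields, and ask for more input if needed."""
    record: dict = {}
    for _ in range(max_rounds):
        record = json.loads(extract(text))
        missing = missing_fields(record, required)
        if not missing:
            break
        # Append the follow-up answer and extract again from the longer text.
        text += "\n" + ask(missing)
    return record

# Stand-ins for the model and the user, for demonstration only.
def fake_extract(text: str) -> str:
    email = "john@example.com" if "john@example.com" in text else ""
    return json.dumps({"name": "John" if "John" in text else "", "email": email})

def fake_ask(missing: List[str]) -> str:
    return "My email is john@example.com"

record = collect(fake_extract, fake_ask, "My name is John.", ["name", "email"])
print(record)  # {'name': 'John', 'email': 'john@example.com'}
```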