Knwl Benchmarks
Finding the Right Model for Your Data
What model works best for a particular dataset? Is bigger always better? How much does latency matter for knowledge extraction? These are critical questions when building production-ready graph RAG systems.
The Knwl benchmark suite provides empirical answers by systematically testing diverse LLM models across multiple providers. Whether you’re choosing between local privacy-focused models or cloud-based solutions, these benchmarks help you make informed decisions based on real performance data rather than assumptions.
Overview
The benchmark suite in the benchmarks directory is a comprehensive testing framework with the following characteristics:
- Standalone execution: Self-contained script using Knwl to ingest sets of sentences and facts from various domains
- Multi-dimensional metrics: Measures ingestion time, node extraction count, edge detection, and failure rates
- Persistent caching: LLM outputs are captured in knwl_data for analysis, comparison, and rapid re-testing
- Cross-provider testing: Automatically tests local models (Ollama) and cloud providers (OpenAI, Anthropic, Groq) with multiple models per provider
- Configurable: Easy to customize for your specific content types and performance requirements
- Reproducible: Timestamped results with consistent testing methodology
Current Limitations
While comprehensive, this benchmark focuses on specific aspects of knowledge extraction. Keep in mind:
- Scope: Measures only knowledge ingestion time, not document parsing, OCR, or preprocessing overhead
- Content complexity: Tests standard text; doesn’t evaluate performance on mathematical notation, complex tables, code snippets, or highly technical diagrams
- Quality metrics: Focuses on quantitative measures (node/edge counts, latency) rather than semantic accuracy or knowledge graph coherence
- Language coverage: Currently English-only; multilingual performance may vary significantly
- Scale: Tests individual fact extraction; doesn’t benchmark batch processing or large document ingestion
Key Insights
These findings challenge common assumptions about LLM performance on knowledge extraction tasks. The results suggest that model selection for graph RAG requires different criteria than general-purpose LLM applications.
Size vs. Performance
Bigger isn’t better for knowledge extraction. While larger models excel at reasoning and creative tasks, they show diminishing returns for structured knowledge extraction. In fact:
- Larger models consistently exhibit higher latency (sometimes 10-20x slower) without proportional improvements in node/edge detection
- Mid-size models (7b-14b parameters) often match or exceed larger models in extraction quality
- The relationship between model size and extraction accuracy is non-linear and highly dependent on model architecture
The Speed-Quality Trade-off
Knowledge extraction is computationally expensive. Even simple facts require substantial processing:
- Extracting 3 nodes and 2 edges can take 20+ seconds with some models
- Latency varies dramatically: from under 1 second to over 200 seconds for equivalent tasks
- This makes model choice critical for production systems processing thousands of documents
Reasoning Models
Reasoning models underperform for knowledge extraction. Models optimized for chain-of-thought reasoning (like o1-style models) consistently show worse performance:
- Higher latency due to extended reasoning chains
- No measurable improvement in entity or relationship detection
- The structured, deterministic nature of knowledge extraction doesn’t benefit from exploratory reasoning
Local vs. Cloud Models
Local models offer privacy at a performance cost. Ollama-based local models provide data privacy and cost predictability but:
- Generally slower inference than cloud alternatives
- Quality varies widely by model architecture
- Exception: With high-end GPU setups, some local models approach cloud performance
- Best use case: Development, testing, and privacy-sensitive applications
Model-Specific Findings
Performance leaders:
- Cloud: OpenAI GPT-5.2 and Anthropic Claude Sonnet 4.5 balance speed with extraction quality
- Local: Qwen2.5 (across 7b, 14b, 32b variants) consistently delivers strong results
- Emerging: GLM 4.7 Flash shows promising performance for its size
Underperformers:
- GPT-OSS models: High latency despite reasonable extraction counts
- Llama 3.1: Surprisingly weak on edge detection, often returning zero relationships
- GPT-5-mini: Extremely high latency (150-230s) makes it impractical for production
Development Recommendations
For local development and testing, 7b parameter models provide the best balance:
- Gemma3 7b: Fast, consistent, low resource requirements
- Qwen2.5 7b: Better extraction quality, slightly higher resource needs
- Both are “good enough” for development workflows and integration testing
For production deployments, consider:
- Cloud APIs for high-throughput, low-latency requirements
- Mid-size local models (14b-32b) for privacy-sensitive or cost-optimized scenarios
- Qwen2.5 family for best overall local performance
Performance Metric
Selecting the “best” model requires defining what “best” means for your specific use case. Knwl uses a composite metric that balances extraction completeness against processing speed:
score = (w1 * nodes + w2 * edges) / (w3 * time)
Where:

- w1: Weight for node extraction (entities, concepts)
- w2: Weight for edge extraction (relationships, connections)
- w3: Weight for processing time (latency penalty)
- Weights sum to 1.0
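As a sketch, the composite score can be written as a small Python function. The name benchmark_score and the zero-latency guard are illustrative assumptions, not part of Knwl's API:

```python
# Illustrative implementation of the composite score described above.
# `benchmark_score` is a hypothetical name, not part of Knwl's API.

def benchmark_score(nodes: int, edges: int, seconds: float,
                    w1: float = 0.33, w2: float = 0.33, w3: float = 0.34) -> float:
    """Reward extracted nodes and edges; penalize latency. Higher is better."""
    if seconds <= 0:
        return 0.0  # treat failed or degenerate runs as score 0
    return (w1 * nodes + w2 * edges) / (w3 * seconds)

# With balanced defaults, doubling latency halves the score,
# while extra nodes or edges raise it linearly.
```

Note that the benchmark suite may scale or normalize the raw value differently, so the scores in the results tables below should not be expected to reproduce exactly from this formula alone.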
Interpreting the Metric
Higher scores indicate better performance. The metric rewards models that extract more knowledge elements (nodes and edges) while penalizing longer processing times.
Customizing for Your Needs
Prioritize completeness (detailed knowledge graphs):

- Increase w1 and w2 (e.g., w1=0.45, w2=0.45, w3=0.10)
- Use case: Archival systems, research databases, comprehensive knowledge bases

Prioritize speed (real-time applications):

- Increase w3 (e.g., w1=0.25, w2=0.25, w3=0.50)
- Use case: Live chat systems, real-time document processing, interactive applications

Prioritize relationships (highly connected graphs):

- Increase w2 (e.g., w1=0.30, w2=0.50, w3=0.20)
- Use case: Social network analysis, dependency tracking, causal reasoning

Balanced approach (general purpose):

- Equal weights (w1=0.33, w2=0.33, w3=0.34)
- Use case: Most production graph RAG applications
The default weights in the benchmark suite are balanced, but you can adjust them in the configuration to match your application’s requirements.
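To see how weight choices can reorder models, here is a hedged sketch with invented model statistics. One property of the formula worth noting: since w3 divides every model's score by the same factor, changing w3 alone rescales scores uniformly; relative rankings shift with the w1:w2 balance and the models' actual latencies.

```python
# Hypothetical comparison of two models under two of the weight presets
# described above. The node/edge/latency figures are invented for illustration.

PRESETS = {
    "balanced":      (0.33, 0.33, 0.34),
    "relationships": (0.30, 0.50, 0.20),
}

def score(nodes, edges, seconds, w):
    w1, w2, w3 = w
    return (w1 * nodes + w2 * edges) / (w3 * seconds)

node_heavy = dict(nodes=20, edges=2, seconds=2.0)   # many entities, few links
edge_heavy = dict(nodes=8, edges=12, seconds=2.0)   # fewer entities, rich links

for name, w in PRESETS.items():
    a = score(**node_heavy, w=w)
    b = score(**edge_heavy, w=w)
    print(f"{name}: node_heavy={a:.2f} edge_heavy={b:.2f}")
```

Under the balanced preset the node-heavy model scores higher; under the relationship-weighted preset the edge-heavy model overtakes it, which is exactly the kind of ranking shift to watch for when tuning weights.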
Usage
Running the Benchmark
You can run the benchmark via the CLI utility:
uv run knwl-benchmark

Or if you installed Knwl via pipx:

knwl-benchmark

Interactive Configuration
The benchmark suite provides an interactive setup process:
- Select providers: Choose which LLM providers to test
  - Local: Ollama (with available models auto-detected)
  - Cloud: OpenAI, Anthropic, Groq, and others
  - You can select multiple providers in a single run
- Choose test data: Select the facts/content to ingest
  - Pre-configured datasets (biography, biomedical, technical)
  - Custom text files
  - The content complexity affects extraction results
- Select extraction strategy: Pick your knowledge extraction approach
  - Different strategies balance precision vs. recall
  - Strategies affect both quality and processing time
- Configure weights (optional): Adjust the performance metric weights to match your priorities
Understanding Results
Timestamped outputs: Each run generates a CSV file with timestamp, making it easy to track performance evolution across different configurations or model versions.
Caching behavior: LLM API calls are automatically cached in knwl_data/. This means:

- First run with a model: Full API calls, normal latency
- Subsequent runs: Cached responses, near-instant processing
- Enables rapid experimentation with different metrics without re-running expensive API calls
- Cache can be cleared to force fresh API calls for updated model versions
Output files: Results include:

- Model name and provider
- Computed metric score
- Node and edge counts
- Latency measurements
- Failure status
- Sortable by any column for easy comparison
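A quick way to compare runs is to load the results CSV and rank it yourself. The column names below ("model", "metric", "failed") are assumptions for illustration; check the header of an actual output file, since the exact schema Knwl writes may differ:

```python
# Hypothetical post-processing of a benchmark results CSV.
# Column names ("model", "metric", "failed") are assumed; verify them
# against a real output file before use.
import csv

def top_models(path: str, n: int = 5) -> list[tuple[str, float]]:
    """Return the n best (model, score) pairs, skipping failed runs."""
    with open(path, newline="") as f:
        rows = [r for r in csv.DictReader(f)
                if r.get("failed", "").strip().lower() not in ("yes", "true")]
    rows.sort(key=lambda r: float(r["metric"]), reverse=True)
    return [(r["model"], float(r["metric"])) for r in rows[:n]]
```

Because results are timestamped, the same helper can be pointed at files from different runs to track how a model's score evolves across configurations or versions.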
Benchmark Results (January 2026)
These results come from systematic testing across two distinct content types: biographical narrative text and biomedical technical content. This dual-domain approach reveals how model performance varies with content characteristics.
Domain-Specific Performance
Different content types reveal distinct model strengths:
Biographical content (narrative, temporal, person-centric):

- Tests entity recognition for people, places, events
- Relationships are often implicit and contextual
- Requires understanding of temporal sequences and causality

Biomedical content (technical, entity-dense, terminology-heavy):

- Tests scientific entity extraction (proteins, diseases, compounds)
- Relationships are more explicit but domain-specific
- Requires handling of specialized terminology and abbreviations
Top Performers
| Content Type | Best Model | Metric Score |
|---|---|---|
| Biography | ollama/gpt-oss:20b | 34.52 |
| Biomedical | ollama/gpt-oss:20b | 42.88 |
Notable finding: GPT-OSS achieves the highest extraction scores despite moderate latency. This model excels at detecting both entities and relationships, particularly in technical domains where it extracts significantly more edges than competitors.
Model Observations
Cloud vs. Local Performance Gap:

- OpenAI GPT-5.2 delivers strong performance (32.4 biography, 30.47 biomed) with reasonable latency
- Anthropic models (Claude Sonnet/Opus 4.5) show consistent but conservative extraction with excellent speed
- Local models sacrifice extraction completeness for lower absolute latency
GLM 4.7 Flash: This newer model demonstrates competitive performance for its size, making it an attractive option for resource-constrained environments.
Size-Performance Disconnect: The data confirms that parameter count doesn’t correlate with extraction quality. Models like Qwen3 14b (14.98 score) and Qwen2.5 32b (13.54 score) show that larger variants don’t necessarily extract more knowledge.
Edge Detection Challenges: Several models struggle specifically with relationship extraction:

- Llama 3.1, Qwen3 14b, and others often return zero edges despite finding nodes
- This suggests edge detection requires different model capabilities than entity recognition
Practical Recommendations
Based on these results:
For production systems:

- Start with OpenAI GPT-5.2 or Anthropic Claude for balanced performance
- Use GPT-OSS if extraction completeness matters more than latency
- Implement caching to amortize latency across repeated content patterns

For development workflows:

- Use Qwen2.5 7b or Gemma3 7b for fast iteration
- GLM 4.7 Flash offers a middle ground between speed and quality
- Local models enable offline development and privacy-sensitive testing

For cost optimization:

- Smaller cloud models (GPT-5-nano) may be too slow for practical use
- Mid-size local models (14b-20b) offer best price-performance for high-volume processing
- Consider hybrid approaches: local for bulk processing, cloud for quality assurance
Detailed Results
Complete performance data for all tested models, sorted by metric score. These tables reveal the nuanced trade-offs between extraction completeness, relationship detection, and processing speed.
Biography Domain
Narrative text about a person’s life, testing temporal reasoning, entity co-reference, and implicit relationship extraction.
| Model | Metric | Nodes | Edges | Latency (s) | Failed |
|---|---|---|---|---|---|
| ollama/gpt-oss:20b | 34.52 | 38 | 43 | 8.42 | No |
| openai/gpt-5.2-2025-12-11 | 32.4 | 35 | 42 | 11.47 | No |
| openai/gpt-5-mini | 20.09 | 26 | 24 | 233.95 | No |
| anthropic/claude-sonnet-4-5-20250929 | 18.05 | 16 | 18 | 3.49 | No |
| anthropic/claude-opus-4-5-20251101 | 18.02 | 17 | 17 | 3.52 | No |
| anthropic/claude-haiku-4-5-20251001 | 16.32 | 15 | 13 | 2.91 | No |
| ollama/glm-4.7-flash:latest | 15.93 | 17 | 10 | 2.9 | No |
| ollama/gemma3:27b | 15.42 | 16 | 9 | 2.69 | No |
| ollama/devstral-small-2:latest | 15.41 | 14 | 11 | 2.7 | No |
| ollama/mistral | 15.03 | 12 | 11 | 2.43 | No |
| ollama/qwen3:14b | 14.98 | 21 | 2 | 2.46 | No |
| ollama/qwen3:8b | 14.96 | 12 | 11 | 2.47 | No |
| ollama/gemma3:4b | 14.82 | 15 | 8 | 2.56 | No |
| ollama/gemma3:12b | 14.31 | 12 | 8 | 2.17 | No |
| ollama/llama3.1 | 14.16 | 6 | 0 | 0.7 | No |
| openai/gpt-5-nano-2025-08-07 | 14.13 | 17 | 0 | 1.73 | No |
| ollama/qwen2.5:32b | 13.54 | 9 | 5 | 1.52 | No |
| ollama/cogito:14b | 13.52 | 6 | 4 | 1.1 | No |
| ollama/qwen2.5:14b | 13.51 | 12 | 2 | 1.53 | No |
| ollama/qwen2.5:7b | 5.56 | 8 | 4 | 25.21 | No |
Biomedical Domain
Technical biomedical text with specialized terminology, testing scientific entity recognition and complex relationship extraction in a domain-specific context.
| Model | Metric | Nodes | Edges | Latency (s) | Failed |
|---|---|---|---|---|---|
| ollama/gpt-oss:20b | 42.88 | 38 | 66 | 14.6 | No |
| openai/gpt-5.2-2025-12-11 | 30.47 | 32 | 38 | 7.09 | No |
| openai/gpt-5-mini | 19.73 | 24 | 25 | 150.95 | No |
| ollama/mistral | 17.63 | 19 | 14 | 3.51 | No |
| ollama/gemma3:27b | 16.26 | 15 | 13 | 2.95 | No |
| anthropic/claude-opus-4-5-20251101 | 16.05 | 13 | 14 | 2.81 | No |
| anthropic/claude-sonnet-4-5-20250929 | 16.04 | 13 | 14 | 2.82 | No |
| ollama/devstral-small-2:latest | 15.48 | 13 | 12 | 2.65 | No |
| ollama/glm-4.7-flash:latest | 14.87 | 12 | 11 | 2.53 | No |
| ollama/gemma3:4b | 14.59 | 21 | 0 | 2.23 | No |
| ollama/qwen2.5:7b | 14.54 | 18 | 17 | 36.29 | No |
| ollama/gemma3:12b | 14.33 | 12 | 8 | 2.16 | No |
| ollama/qwen2.5:14b | 13.78 | 9 | 7 | 1.71 | No |
| ollama/llama3.1 | 13.73 | 5 | 3 | 0.9 | No |
| anthropic/claude-haiku-4-5-20251001 | 13.69 | 8 | 7 | 1.6 | No |
| ollama/qwen2.5:32b | 13.63 | 8 | 7 | 1.62 | No |
| ollama/qwen3:8b | 13.63 | 8 | 7 | 1.62 | No |
| ollama/cogito:14b | 13.6 | 8 | 7 | 1.63 | No |
| ollama/qwen3:14b | 13.43 | 13 | 0 | 1.43 | No |
| openai/gpt-5-nano-2025-08-07 | 0.0 | 0 | 0 | 0.0 | Yes |
Key Takeaways from Results
- Domain variation matters: Models show different relative performance across biography vs. biomedical content
- Latency spreads are extreme: From <1s to 230s for similar extraction tasks
- Edge detection is harder: Many models excel at node extraction but fail on relationships
- Model failures are rare: Only one failure (GPT-5-nano on biomedical content) across all tests indicates robust API reliability
- Score gaps are significant: Top performers (30-40) dramatically outperform median models (13-17)
Generated by the Knwl.ai Benchmarking Utility on 2026-01-26 12:00:40
Running Your Own Benchmarks
These results reflect specific test content and default metric weights. Your results may vary based on:
- Content domain and complexity
- Custom metric weight preferences
- Model version updates and fine-tuning
- API endpoint variations and geographic latency
Run knwl-benchmark with your own content to find the optimal model for your specific use case.