Knwl Benchmarks

What model works best for a particular dataset? Is bigger always better? How much time does it take to ingest some facts?

The benchmark suite in the benchmarks directory is designed to help answer these questions.

This benchmark is a stepping stone rather than a definitive comparison.

Some Insights

The following observations are not set in stone, but they might help you rethink some assumptions about LLMs and their performance on knowledge ingestion tasks:

  • bigger models take more time
  • smaller models are qualitatively on par with bigger models for knowledge ingestion
  • local models (Ollama) are great for privacy, but performance is worse than cloud models (unless you have a very powerful GPU setup)
  • extracting knowledge is expensive: 3 nodes and 2 edges can take up to 20 seconds with some models
  • reasoning models perform worse than non-reasoning models on knowledge ingestion tasks
  • bigger models sometimes extract more nodes/edges, but not always; smaller models occasionally do better
  • for local development and testing you can use 7b models (gemma3, qwen2.5), which are fast and qualitatively good enough (see the sketch after this list)
  • the worst models (in latency and errors) are gpt-oss and llama3.1
  • the best local model is qwen2.5, across all sizes (7b, 14b, 32b)
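If you go the local route, it is worth sanity-checking that a small model is pulled and responding before kicking off a full benchmark run. Below is a minimal sketch using the ollama Python client (an assumption, it is not necessarily part of this repo's requirements) against one of the 7b models mentioned above; the prompt is purely illustrative and unrelated to Knwl's actual extraction prompts.

```python
import time

import ollama  # assumes the `ollama` Python package and a running local Ollama server

model = "qwen2.5:7b"  # one of the small models recommended above

start = time.perf_counter()
response = ollama.chat(
    model=model,
    messages=[{"role": "user", "content": "State two short facts about Paris."}],
)
elapsed = time.perf_counter() - start

print(f"{model} answered in {elapsed:.2f}s")
print(response["message"]["content"])
```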

Metric

To pick the model that works best for your data, you need to define what “best” means for you, i.e. a metric that combines ingestion time, number of nodes and number of edges in a way that reflects your priorities. A simple metric could be:

score = (nodes + edges) / time

This metric gives equal weight to nodes and edges and divides by time, so higher scores are better. You could adjust the weights or the formula according to your needs:

score = (w1 * nodes + w2 * edges) / (w3 * time)

where w1, w2, and w3 are weights (summing to one) that you can set based on what matters most to you (e.g., if ingestion time is more critical, increase w3).
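As a concrete illustration, here is a minimal Python sketch of such a scoring function. The weight values are arbitrary placeholders, not recommendations derived from the benchmark.

```python
def score(nodes: int, edges: int, time_s: float,
          w1: float = 0.4, w2: float = 0.4, w3: float = 0.2) -> float:
    """Weighted extraction-per-second score: higher is better."""
    if time_s <= 0:
        return 0.0  # guard against failed runs reported with zero latency
    return (w1 * nodes + w2 * edges) / (w3 * time_s)

# Example with the qwen2.5:7b "geo" row from the sample output below:
# (0.4 * 3 + 0.4 * 2) / (0.2 * 11.99) ≈ 0.83
print(score(3, 2, 11.99))
```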

Usage

  • customize the models and strategies variables in benchmarks/run.py to select which models and strategies to benchmark
  • run uv run run.py inside the benchmarks directory
  • results are printed to the console and saved as a CSV file in the benchmarks/results directory

Note that the results are time-stamped, so every run creates a new CSV file. Because the LLM calls are cached, subsequent runs with the same configuration will be much faster.
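Once you have a results file, ranking the models by the metric from the previous section takes only a few lines of pandas. The snippet below is a sketch: it assumes the CSV columns match the sample output (key, provider, model, failed, node_count, edge_count, latency (s), error) and that the newest file in benchmarks/results is the run you care about.

```python
from pathlib import Path

import pandas as pd

# Pick the most recent timestamped results file.
results_dir = Path("benchmarks/results")
latest = max(results_dir.glob("*.csv"), key=lambda p: p.stat().st_mtime)

df = pd.read_csv(latest)
# Drop failed extractions (works whether the column is boolean or the string "True"/"False").
ok = df[df["failed"].astype(str).str.lower() != "true"].copy()

# Same illustrative weights as in the Metric section.
w1, w2, w3 = 0.4, 0.4, 0.2
ok["score"] = (w1 * ok["node_count"] + w2 * ok["edge_count"]) / (w3 * ok["latency (s)"])

# Average score per provider/model, best first.
ranking = (
    ok.groupby(["provider", "model"])["score"]
    .mean()
    .sort_values(ascending=False)
)
print(ranking)
```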

Sample output from November 2025 can be found below:

key provider model failed node_count edge_count latency (s) error
married ollama qwen2.5:7b False 2 1 6.74
family ollama qwen2.5:7b False 3 3 0.42
work ollama qwen2.5:7b False 2 1 0.24
geo ollama qwen2.5:7b False 3 2 11.99
married ollama qwen2.5:14b False 2 1 0.26
family ollama qwen2.5:14b False 3 2 0.38
work ollama qwen2.5:14b False 2 0 0.19
geo ollama qwen2.5:14b False 3 1 33.03
married ollama qwen2.5:32b False 2 1 0.26
family ollama qwen2.5:32b False 3 3 0.46
work ollama qwen2.5:32b False 3 2 0.43
geo ollama qwen2.5:32b False 2 1 67.29
married ollama gemma3:4b False 2 0 5.12
family ollama gemma3:4b False 3 2 5.55
work ollama gemma3:4b False 3 1 5.21
geo ollama gemma3:4b False 4 2 5.94
married ollama gemma3:12b False 2 1 20.62
family ollama gemma3:12b False 3 3 18.29
work ollama gemma3:12b False 3 2 17.05
geo ollama gemma3:12b False 4 2 16.77
married ollama gemma3:27b False 3 1 42.74
family ollama gemma3:27b False 4 3 49.41
work ollama gemma3:27b False 3 2 44.17
geo ollama gemma3:27b False 4 3 48.58
married ollama llama3.1 True 0 0 0.0 Given text is likely not an LLM record: Note: The output format follows the specified guidelines, with each entity and relationship formatted as described. The content keywords are also extracted and provided in the required format.
family ollama llama3.1 True 0 0 0.0 Given text is likely not an LLM record: **
work ollama llama3.1 False 2 0 4.77
geo ollama llama3.1 False 3 1 3.91
married ollama qwen3:8b False 2 1 19.14
family ollama qwen3:8b False 3 3 46.35
work ollama qwen3:8b False 3 2 18.98
geo ollama qwen3:8b False 3 0 50.16
married ollama qwen3:14b False 2 0 36.13
family ollama qwen3:14b False 3 3 24.14
work ollama qwen3:14b True 0 0 0.0 Given text is likely not an LLM record: , and ending with
geo ollama qwen3:14b False 1 0 42.74
married ollama gpt-oss:20b False 2 1 36.13
family ollama gpt-oss:20b False 3 3 62.75
work ollama gpt-oss:20b True 0 0 0.0 list index out of range
geo ollama gpt-oss:20b False 4 4 168.09
married ollama mistral False 2 1 12.27
family ollama mistral False 3 1 4.86
work ollama mistral False 2 1 5.06
geo ollama mistral False 3 0 4.93
married openai gpt-5-mini False 2 1 0.27
family openai gpt-5-mini False 4 6 0.77
work openai gpt-5-mini False 3 3 0.52
geo openai gpt-5-mini False 4 6 38.61
married openai gpt-5-nano-2025-08-07 False 2 1 12.72
family openai gpt-5-nano-2025-08-07 False 4 4 38.27
work openai gpt-5-nano-2025-08-07 False 3 3 26.57
geo openai gpt-5-nano-2025-08-07 False 3 3 41.24
married openai gpt-4.1-2025-04-14 False 2 1 3.52
family openai gpt-4.1-2025-04-14 False 4 6 14.46
work openai gpt-4.1-2025-04-14 False 3 3 6.11
geo openai gpt-4.1-2025-04-14 False 4 4 7.84
married anthropic claude-sonnet-4-5-20250929 False 2 1 4.12
family anthropic claude-sonnet-4-5-20250929 False 4 6 8.05
work anthropic claude-sonnet-4-5-20250929 False 3 3 6.29
geo anthropic claude-sonnet-4-5-20250929 False 4 3 7.11
married anthropic claude-haiku-4-5-20251001 False 2 1 1.8
family anthropic claude-haiku-4-5-20251001 False 4 6 4.62
work anthropic claude-haiku-4-5-20251001 False 3 3 4.31
geo anthropic claude-haiku-4-5-20251001 False 4 3 3.8