# Knwl Benchmarks
What model works best for a particular dataset? Is bigger always better? How much time does it take to ingest some facts?
The benchmark suite in the benchmarks directory is designed to help answer these questions:
- it’s a standalone script using Knwl to ingest a set of sentences/facts
- it measures ingestion time and the number of nodes and edges returned (a sketch of this measurement follows the list)
- LLM outputs are captured in `knwl_data` and can be re-used or analyzed
- it loops over local (Ollama) and cloud (OpenAI, Anthropic…) LLM providers and different models within each provider
- it’s easy to customize and to run.
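
Conceptually, each benchmark run boils down to timing an ingestion call and counting the nodes and edges it returns. The sketch below only illustrates that pattern: the `ingest` call and the `nodes`/`edges` attributes are hypothetical names, not the actual Knwl API (see `benchmarks/run.py` for the real implementation).

```python
import time

def benchmark_one(knwl, provider: str, model: str, facts: list[str]) -> dict:
    """Time the ingestion of a few facts and count the extracted nodes/edges.

    NOTE: `knwl.ingest`, `result.nodes` and `result.edges` are hypothetical
    names used only to illustrate the measurement; the real script in
    benchmarks/run.py uses the actual Knwl API.
    """
    start = time.perf_counter()
    result = knwl.ingest(facts, provider=provider, model=model)  # assumed call
    latency = time.perf_counter() - start
    return {
        "provider": provider,
        "model": model,
        "node_count": len(result.nodes),  # assumed attribute
        "edge_count": len(result.edges),  # assumed attribute
        "latency_s": round(latency, 2),
    }
```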
This benchmark is a stepping stone rather than a complete evaluation, since:
- it does not consider parsing/OCR time for documents (only ingestion time)
- it does not consider complex content (math, tables, code…)
- it does not consider knowledge graph quality (only quantity of nodes/edges)
- it does not consider multiple languages (only English).
## Some Insights
The following is not set in stone, but it might help you rethink some assumptions about LLMs and their performance on knowledge ingestion tasks:
- bigger models take more time
- smaller models are qualitatively as able as bigger models for knowledge ingestion
- local models (Ollama) are great for privacy, but performance is worse than cloud models (unless you have a very powerful GPU setup)
- extracting knowledge is expensive: 3 nodes and 2 edges can take up to 20 seconds with some models
- reasoning models perform worse than non-reasoning models for knowledge ingestion tasks
- bigger models do sometimes extract more nodes/edges but not always: sometimes smaller models do better
- for local development and testing you can use 7b models (gemma3, qwen2.5) which are fast and qualitatively good enough
- the worst models (in terms of latency and errors) are gpt-oss and llama3.1
- the best local model is qwen2.5, consistently across all sizes (7b, 14b, 32b)
## Metric
In order to pick out the model that works best for your data you need to define what “best” means for you. That is, you need to define a metric that combines ingestion time, number of nodes and number of edges in a way that reflects your priorities. A simple metric could be:
`score = (nodes + edges) / time`
This metric gives equal weight to nodes and edges and divides by time, so higher scores are better. You could adjust the weights or the formula according to your needs:
`score = (w1 * nodes + w2 * edges) / (w3 * time)`
where `w1`, `w2`, and `w3` are weights (summing to one) that you can set based on what matters most to you (e.g., if time is more critical, increase `w3`).
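
As a concrete illustration, the weighted score can be written as a small helper. The default weights and the handling of zero latency below are illustrative choices, not part of the benchmark script:

```python
def score(nodes: int, edges: int, time_s: float,
          w1: float = 0.4, w2: float = 0.4, w3: float = 0.2) -> float:
    """Weighted extraction-per-second score: higher is better."""
    if time_s <= 0:
        return 0.0  # failed runs report 0 s latency and should not win
    return (w1 * nodes + w2 * edges) / (w3 * time_s)

# Example: 3 nodes and 2 edges extracted in 20 seconds
print(score(3, 2, 20.0))  # 0.5
```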
## Usage
- customize the `models` and `strategies` variables in `benchmarks/run.py` to select which models and strategies to benchmark (a possible configuration is sketched below)
- run `uv run run.py` inside the `benchmarks` directory
- results are printed in the console and saved in a CSV file in the `benchmarks/results` directory.
Note that the results are time-stamped and every run creates a new CSV file. Because LLM calls are cached, subsequent runs with the same configuration are much faster.
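
The exact shape of the `models` and `strategies` variables is defined in `benchmarks/run.py`; the snippet below only sketches a plausible configuration. The structure and the strategy name are assumptions, while the provider and model names come from the sample output below:

```python
# Hypothetical shape; check benchmarks/run.py for the actual structure.
models = {
    "ollama": ["qwen2.5:7b", "gemma3:4b", "mistral"],
    "openai": ["gpt-4.1-2025-04-14", "gpt-5-mini"],
    "anthropic": ["claude-haiku-4-5-20251001"],
}
strategies = ["default"]  # placeholder name, not a documented strategy
```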
Sample output from November 2025 can be found below:
| key | provider | model | failed | node_count | edge_count | latency (s) | error |
|---|---|---|---|---|---|---|---|
| married | ollama | qwen2.5:7b | False | 2 | 1 | 6.74 | |
| family | ollama | qwen2.5:7b | False | 3 | 3 | 0.42 | |
| work | ollama | qwen2.5:7b | False | 2 | 1 | 0.24 | |
| geo | ollama | qwen2.5:7b | False | 3 | 2 | 11.99 | |
| married | ollama | qwen2.5:14b | False | 2 | 1 | 0.26 | |
| family | ollama | qwen2.5:14b | False | 3 | 2 | 0.38 | |
| work | ollama | qwen2.5:14b | False | 2 | 0 | 0.19 | |
| geo | ollama | qwen2.5:14b | False | 3 | 1 | 33.03 | |
| married | ollama | qwen2.5:32b | False | 2 | 1 | 0.26 | |
| family | ollama | qwen2.5:32b | False | 3 | 3 | 0.46 | |
| work | ollama | qwen2.5:32b | False | 3 | 2 | 0.43 | |
| geo | ollama | qwen2.5:32b | False | 2 | 1 | 67.29 | |
| married | ollama | gemma3:4b | False | 2 | 0 | 5.12 | |
| family | ollama | gemma3:4b | False | 3 | 2 | 5.55 | |
| work | ollama | gemma3:4b | False | 3 | 1 | 5.21 | |
| geo | ollama | gemma3:4b | False | 4 | 2 | 5.94 | |
| married | ollama | gemma3:12b | False | 2 | 1 | 20.62 | |
| family | ollama | gemma3:12b | False | 3 | 3 | 18.29 | |
| work | ollama | gemma3:12b | False | 3 | 2 | 17.05 | |
| geo | ollama | gemma3:12b | False | 4 | 2 | 16.77 | |
| married | ollama | gemma3:27b | False | 3 | 1 | 42.74 | |
| family | ollama | gemma3:27b | False | 4 | 3 | 49.41 | |
| work | ollama | gemma3:27b | False | 3 | 2 | 44.17 | |
| geo | ollama | gemma3:27b | False | 4 | 3 | 48.58 | |
| married | ollama | llama3.1 | True | 0 | 0 | 0.0 | Given text is likely not an LLM record: Note: The output format follows the specified guidelines, with each entity and relationship formatted as described. The content keywords are also extracted and provided in the required format. |
| family | ollama | llama3.1 | True | 0 | 0 | 0.0 | Given text is likely not an LLM record: ** |
| work | ollama | llama3.1 | False | 2 | 0 | 4.77 | |
| geo | ollama | llama3.1 | False | 3 | 1 | 3.91 | |
| married | ollama | qwen3:8b | False | 2 | 1 | 19.14 | |
| family | ollama | qwen3:8b | False | 3 | 3 | 46.35 | |
| work | ollama | qwen3:8b | False | 3 | 2 | 18.98 | |
| geo | ollama | qwen3:8b | False | 3 | 0 | 50.16 | |
| married | ollama | qwen3:14b | False | 2 | 0 | 36.13 | |
| family | ollama | qwen3:14b | False | 3 | 3 | 24.14 | |
| work | ollama | qwen3:14b | True | 0 | 0 | 0.0 | Given text is likely not an LLM record: , and ending with |
| geo | ollama | qwen3:14b | False | 1 | 0 | 42.74 | |
| married | ollama | gpt-oss:20b | False | 2 | 1 | 36.13 | |
| family | ollama | gpt-oss:20b | False | 3 | 3 | 62.75 | |
| work | ollama | gpt-oss:20b | True | 0 | 0 | 0.0 | list index out of range |
| geo | ollama | gpt-oss:20b | False | 4 | 4 | 168.09 | |
| married | ollama | mistral | False | 2 | 1 | 12.27 | |
| family | ollama | mistral | False | 3 | 1 | 4.86 | |
| work | ollama | mistral | False | 2 | 1 | 5.06 | |
| geo | ollama | mistral | False | 3 | 0 | 4.93 | |
| married | openai | gpt-5-mini | False | 2 | 1 | 0.27 | |
| family | openai | gpt-5-mini | False | 4 | 6 | 0.77 | |
| work | openai | gpt-5-mini | False | 3 | 3 | 0.52 | |
| geo | openai | gpt-5-mini | False | 4 | 6 | 38.61 | |
| married | openai | gpt-5-nano-2025-08-07 | False | 2 | 1 | 12.72 | |
| family | openai | gpt-5-nano-2025-08-07 | False | 4 | 4 | 38.27 | |
| work | openai | gpt-5-nano-2025-08-07 | False | 3 | 3 | 26.57 | |
| geo | openai | gpt-5-nano-2025-08-07 | False | 3 | 3 | 41.24 | |
| married | openai | gpt-4.1-2025-04-14 | False | 2 | 1 | 3.52 | |
| family | openai | gpt-4.1-2025-04-14 | False | 4 | 6 | 14.46 | |
| work | openai | gpt-4.1-2025-04-14 | False | 3 | 3 | 6.11 | |
| geo | openai | gpt-4.1-2025-04-14 | False | 4 | 4 | 7.84 | |
| married | anthropic | claude-sonnet-4-5-20250929 | False | 2 | 1 | 4.12 | |
| family | anthropic | claude-sonnet-4-5-20250929 | False | 4 | 6 | 8.05 | |
| work | anthropic | claude-sonnet-4-5-20250929 | False | 3 | 3 | 6.29 | |
| geo | anthropic | claude-sonnet-4-5-20250929 | False | 4 | 3 | 7.11 | |
| married | anthropic | claude-haiku-4-5-20251001 | False | 2 | 1 | 1.8 | |
| family | anthropic | claude-haiku-4-5-20251001 | False | 4 | 6 | 4.62 | |
| work | anthropic | claude-haiku-4-5-20251001 | False | 3 | 3 | 4.31 | |
| geo | anthropic | claude-haiku-4-5-20251001 | False | 4 | 3 | 3.8 | |
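
To turn a results CSV like the one above into a ranking, a short post-processing script can apply the score metric from the Metric section. The snippet below assumes the column names shown in the sample output; the file path and the weights are illustrative:

```python
import pandas as pd

# Path is illustrative: results files are time-stamped in benchmarks/results.
df = pd.read_csv("benchmarks/results/benchmark_2025-11.csv")

# Keep successful runs only, then compute the weighted score.
ok = df[~df["failed"]].copy()
w1, w2, w3 = 0.4, 0.4, 0.2
ok["score"] = (w1 * ok["node_count"] + w2 * ok["edge_count"]) / (w3 * ok["latency (s)"])

# Average the score per provider/model and list the best performers first.
ranking = (
    ok.groupby(["provider", "model"])["score"]
      .mean()
      .sort_values(ascending=False)
)
print(ranking.head(10))
```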