Knwl Benchmarks
Finding the Right Model for Your Data
What model works best for a particular dataset? Is bigger always better? How much does latency matter for knowledge extraction? These are critical questions when building production-ready graph RAG systems.
The Knwl benchmark suite provides empirical answers by systematically testing diverse LLM models across multiple providers. Whether you’re choosing between local privacy-focused models or cloud-based solutions, these benchmarks help you make informed decisions based on real performance data rather than assumptions.
Overview
The benchmark suite in the benchmarks directory is a comprehensive testing framework with the following characteristics:
- Standalone execution: Self-contained script using Knwl to ingest sets of sentences and facts from various domains
- Multi-dimensional metrics: Measures ingestion time, node extraction count, edge detection, and failure rates
- Persistent caching: LLM outputs are captured in knwl_data for analysis, comparison, and rapid re-testing
- Cross-provider testing: Automatically tests local models (Ollama) and cloud providers (OpenAI, Anthropic, Groq) with multiple models per provider
- Configurable: Easy to customize for your specific content types and performance requirements
- Reproducible: Timestamped results with consistent testing methodology
Current Limitations
While comprehensive, this benchmark focuses on specific aspects of knowledge extraction. Keep in mind:
- Scope: Measures only knowledge ingestion time, not document parsing, OCR, or preprocessing overhead
- Content complexity: Tests standard text; doesn’t evaluate performance on mathematical notation, complex tables, code snippets, or highly technical diagrams
- Quality metrics: Focuses on quantitative measures (node/edge counts, latency) rather than semantic accuracy or knowledge graph coherence
- Language coverage: Currently English-only; multilingual performance may vary significantly
- Scale: Tests individual fact extraction; doesn’t benchmark batch processing or large document ingestion
Key Insights
These findings challenge common assumptions about LLM performance on knowledge extraction tasks. The results suggest that model selection for graph RAG requires different criteria than general-purpose LLM applications.
Size vs. Performance
Bigger isn’t better for knowledge extraction. While larger models excel at reasoning and creative tasks, they show diminishing returns for structured knowledge extraction. In fact:
- Larger models consistently exhibit higher latency (sometimes 10-20x slower) without proportional improvements in node/edge detection
- Mid-size models (7b-14b parameters) often match or exceed larger models in extraction quality
- The relationship between model size and extraction accuracy is non-linear and highly dependent on model architecture
The Speed-Quality Trade-off
Knowledge extraction is computationally expensive. Even simple facts require substantial processing:
- Extracting 3 nodes and 2 edges can take 20+ seconds with some models
- Latency varies dramatically: from under 1 second to over 200 seconds for equivalent tasks
- This makes model choice critical for production systems processing thousands of documents
Reasoning Models
Reasoning models underperform for knowledge extraction. Models optimized for chain-of-thought reasoning (like o1-style models) consistently show worse performance:
- Higher latency due to extended reasoning chains
- No measurable improvement in entity or relationship detection
- The structured, deterministic nature of knowledge extraction doesn’t benefit from exploratory reasoning
Local vs. Cloud Models
Local models offer privacy at a performance cost. Ollama-based local models provide data privacy and cost predictability but:
- Generally slower inference than cloud alternatives
- Quality varies widely by model architecture
- Exception: With high-end GPU setups, some local models approach cloud performance
- Best use case: Development, testing, and privacy-sensitive applications
Model-Specific Findings
Performance leaders:
- Cloud: OpenAI GPT-5.2 and Anthropic Claude Sonnet 4.5 balance speed with extraction quality
- Local: Qwen2.5 (across 7b, 14b, 32b variants) consistently delivers strong results
- Emerging: GLM 4.7 Flash shows promising performance for its size
Underperformers:
- GPT-OSS models: High latency despite reasonable extraction counts
- Llama 3.1: Surprisingly weak on edge detection, often returning zero relationships
- GPT-5-mini: Extremely high latency (150-230s) makes it impractical for production
Development Recommendations
For local development and testing, 7b parameter models provide the best balance:
- Gemma3 7b: Fast, consistent, low resource requirements
- Qwen2.5 7b: Better extraction quality, slightly higher resource needs
- Both are “good enough” for development workflows and integration testing
For production deployments, consider:
- Cloud APIs for high-throughput, low-latency requirements
- Mid-size local models (14b-32b) for privacy-sensitive or cost-optimized scenarios
- Qwen2.5 family for best overall local performance
Performance Metric
Selecting the “best” model requires defining what “best” means for your specific use case. Knwl uses a composite metric that balances extraction completeness against processing speed:
score = (w1 * nodes + w2 * edges) / (w3 * time)
Where:

- w1: Weight for node extraction (entities, concepts)
- w2: Weight for edge extraction (relationships, connections)
- w3: Weight for processing time (latency penalty)
- Weights sum to 1.0
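As a sketch, the composite score can be written as a small Python function. The name benchmark_score and the zero-latency guard are illustrative assumptions, not part of Knwl's API:

```python
# Illustrative implementation of the composite score described above.
# `benchmark_score` is a hypothetical name, not part of Knwl's API.

def benchmark_score(nodes: int, edges: int, seconds: float,
                    w1: float = 0.33, w2: float = 0.33, w3: float = 0.34) -> float:
    """Reward extracted nodes and edges; penalize latency. Higher is better."""
    if seconds <= 0:
        return 0.0  # treat failed or degenerate runs as score 0
    return (w1 * nodes + w2 * edges) / (w3 * seconds)

# With balanced defaults, doubling latency halves the score,
# while extra nodes or edges raise it linearly.
```

Note that the benchmark suite may scale or normalize the raw value differently, so the scores in the results tables below should not be expected to reproduce exactly from this formula alone.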
Interpreting the Metric
Higher scores indicate better performance. The metric rewards models that extract more knowledge elements (nodes and edges) while penalizing longer processing times.
Customizing for Your Needs
Prioritize completeness (detailed knowledge graphs):

- Increase w1 and w2 (e.g., w1=0.45, w2=0.45, w3=0.10)
- Use case: Archival systems, research databases, comprehensive knowledge bases

Prioritize speed (real-time applications):

- Increase w3 (e.g., w1=0.25, w2=0.25, w3=0.50)
- Use case: Live chat systems, real-time document processing, interactive applications

Prioritize relationships (highly connected graphs):

- Increase w2 (e.g., w1=0.30, w2=0.50, w3=0.20)
- Use case: Social network analysis, dependency tracking, causal reasoning

Balanced approach (general purpose):

- Equal weights (w1=0.33, w2=0.33, w3=0.34)
- Use case: Most production graph RAG applications
The default weights in the benchmark suite are balanced, but you can adjust them in the configuration to match your application’s requirements.
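To see how weight choices can reorder models, here is a hedged sketch with invented model statistics. One property of the formula worth noting: since w3 divides every model's score by the same factor, changing w3 alone rescales scores uniformly; relative rankings shift with the w1:w2 balance and the models' actual latencies.

```python
# Hypothetical comparison of two models under two of the weight presets
# described above. The node/edge/latency figures are invented for illustration.

PRESETS = {
    "balanced":      (0.33, 0.33, 0.34),
    "relationships": (0.30, 0.50, 0.20),
}

def score(nodes, edges, seconds, w):
    w1, w2, w3 = w
    return (w1 * nodes + w2 * edges) / (w3 * seconds)

node_heavy = dict(nodes=20, edges=2, seconds=2.0)   # many entities, few links
edge_heavy = dict(nodes=8, edges=12, seconds=2.0)   # fewer entities, rich links

for name, w in PRESETS.items():
    a = score(**node_heavy, w=w)
    b = score(**edge_heavy, w=w)
    print(f"{name}: node_heavy={a:.2f} edge_heavy={b:.2f}")
```

Under the balanced preset the node-heavy model scores higher; under the relationship-weighted preset the edge-heavy model overtakes it, which is exactly the kind of ranking shift to watch for when tuning weights.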
Usage
Running the Benchmark
You can run the benchmark via the CLI utility:
uv run knwl-benchmark

Or if you installed Knwl via pipx:

knwl-benchmark

Interactive Configuration
The benchmark suite provides an interactive setup process:
- Select providers: Choose which LLM providers to test
  - Local: Ollama (with available models auto-detected)
  - Cloud: OpenAI, Anthropic, Groq, and others
  - You can select multiple providers in a single run
- Choose test data: Select the facts/content to ingest
  - Pre-configured datasets (biography, biomedical, technical)
  - Custom text files
  - The content complexity affects extraction results
- Select extraction strategy: Pick your knowledge extraction approach
  - Different strategies balance precision vs. recall
  - Strategies affect both quality and processing time
- Configure weights (optional): Adjust the performance metric weights to match your priorities
Understanding Results
Timestamped outputs: Each run generates a CSV file with timestamp, making it easy to track performance evolution across different configurations or model versions.
Caching behavior: LLM API calls are automatically cached in knwl_data/. This means:

- First run with a model: Full API calls, normal latency
- Subsequent runs: Cached responses, near-instant processing
- Enables rapid experimentation with different metrics without re-running expensive API calls
- Cache can be cleared to force fresh API calls for updated model versions
Output files: Results include:

- Model name and provider
- Computed metric score
- Node and edge counts
- Latency measurements
- Failure status
- Sortable by any column for easy comparison
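A quick way to compare runs is to load the results CSV and rank it yourself. The column names below ("model", "metric", "failed") are assumptions for illustration; check the header of an actual output file, since the exact schema Knwl writes may differ:

```python
# Hypothetical post-processing of a benchmark results CSV.
# Column names ("model", "metric", "failed") are assumed; verify them
# against a real output file before use.
import csv

def top_models(path: str, n: int = 5) -> list[tuple[str, float]]:
    """Return the n best (model, score) pairs, skipping failed runs."""
    with open(path, newline="") as f:
        rows = [r for r in csv.DictReader(f)
                if r.get("failed", "").strip().lower() not in ("yes", "true")]
    rows.sort(key=lambda r: float(r["metric"]), reverse=True)
    return [(r["model"], float(r["metric"])) for r in rows[:n]]
```

Because results are timestamped, the same helper can be pointed at files from different runs to track how a model's score evolves across configurations or versions.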
Benchmark Results (January 2026)
These results come from systematic testing across two distinct content types: biographical narrative text and biomedical technical content. This dual-domain approach reveals how model performance varies with content characteristics.
Domain-Specific Performance
Different content types reveal distinct model strengths:
Biographical content (narrative, temporal, person-centric):

- Tests entity recognition for people, places, events
- Relationships are often implicit and contextual
- Requires understanding of temporal sequences and causality

Biomedical content (technical, entity-dense, terminology-heavy):

- Tests scientific entity extraction (proteins, diseases, compounds)
- Relationships are more explicit but domain-specific
- Requires handling of specialized terminology and abbreviations
Top Performers
| Content Type | Best Model | Metric Score |
|---|---|---|
| Biography | ollama/gpt-oss:20b | 34.52 |
| Biomedical | ollama/gpt-oss:20b | 42.88 |
Notable finding: GPT-OSS achieves the highest extraction scores despite moderate latency. This model excels at detecting both entities and relationships, particularly in technical domains where it extracts significantly more edges than competitors.
Model Observations
Cloud vs. Local Performance Gap:

- OpenAI GPT-5.2 delivers strong performance (32.4 biography, 30.47 biomed) with reasonable latency
- Anthropic models (Claude Sonnet/Opus 4.5) show consistent but conservative extraction with excellent speed
- Local models sacrifice extraction completeness for lower absolute latency
GLM 4.7 Flash: This newer model demonstrates competitive performance for its size, making it an attractive option for resource-constrained environments.
Size-Performance Disconnect: The data confirms that parameter count doesn’t correlate with extraction quality. Models like Qwen3 14b (14.98 score) and Qwen2.5 32b (13.54 score) show that larger variants don’t necessarily extract more knowledge.
Edge Detection Challenges: Several models struggle specifically with relationship extraction:

- Llama 3.1, Qwen3 14b, and others often return zero edges despite finding nodes
- This suggests edge detection requires different model capabilities than entity recognition
Practical Recommendations
Based on these results:
For production systems:

- Start with OpenAI GPT-5.2 or Anthropic Claude for balanced performance
- Use GPT-OSS if extraction completeness matters more than latency
- Implement caching to amortize latency across repeated content patterns

For development workflows:

- Use Qwen2.5 7b or Gemma3 7b for fast iteration
- GLM 4.7 Flash offers a middle ground between speed and quality
- Local models enable offline development and privacy-sensitive testing

For cost optimization:

- Smaller cloud models (GPT-5-nano) may be too slow for practical use
- Mid-size local models (14b-20b) offer best price-performance for high-volume processing
- Consider hybrid approaches: local for bulk processing, cloud for quality assurance
Detailed Results
Complete performance data for all tested models, sorted by metric score. These tables reveal the nuanced trade-offs between extraction completeness, relationship detection, and processing speed.
Biography Domain
Narrative text about a person’s life, testing temporal reasoning, entity co-reference, and implicit relationship extraction.
| Model | Metric | Nodes | Edges | Latency (s) | Failed |
|---|---|---|---|---|---|
| ollama/gpt-oss:20b | 34.52 | 38 | 43 | 8.42 | No |
| openai/gpt-5.2-2025-12-11 | 32.4 | 35 | 42 | 11.47 | No |
| openai/gpt-5-mini | 20.09 | 26 | 24 | 233.95 | No |
| anthropic/claude-sonnet-4-5-20250929 | 18.05 | 16 | 18 | 3.49 | No |
| anthropic/claude-opus-4-5-20251101 | 18.02 | 17 | 17 | 3.52 | No |
| anthropic/claude-haiku-4-5-20251001 | 16.32 | 15 | 13 | 2.91 | No |
| ollama/glm-4.7-flash:latest | 15.93 | 17 | 10 | 2.9 | No |
| ollama/gemma3:27b | 15.42 | 16 | 9 | 2.69 | No |
| ollama/devstral-small-2:latest | 15.41 | 14 | 11 | 2.7 | No |
| ollama/mistral | 15.03 | 12 | 11 | 2.43 | No |
| ollama/qwen3:14b | 14.98 | 21 | 2 | 2.46 | No |
| ollama/qwen3:8b | 14.96 | 12 | 11 | 2.47 | No |
| ollama/gemma3:4b | 14.82 | 15 | 8 | 2.56 | No |
| ollama/gemma3:12b | 14.31 | 12 | 8 | 2.17 | No |
| ollama/llama3.1 | 14.16 | 6 | 0 | 0.7 | No |
| openai/gpt-5-nano-2025-08-07 | 14.13 | 17 | 0 | 1.73 | No |
| ollama/qwen2.5:32b | 13.54 | 9 | 5 | 1.52 | No |
| ollama/cogito:14b | 13.52 | 6 | 4 | 1.1 | No |
| ollama/qwen2.5:14b | 13.51 | 12 | 2 | 1.53 | No |
| ollama/qwen2.5:7b | 5.56 | 8 | 4 | 25.21 | No |
Biomedical Domain
Technical biomedical text with specialized terminology, testing scientific entity recognition and complex relationship extraction in a domain-specific context.
| Model | Metric | Nodes | Edges | Latency (s) | Failed |
|---|---|---|---|---|---|
| ollama/gpt-oss:20b | 42.88 | 38 | 66 | 14.6 | No |
| openai/gpt-5.2-2025-12-11 | 30.47 | 32 | 38 | 7.09 | No |
| openai/gpt-5-mini | 19.73 | 24 | 25 | 150.95 | No |
| ollama/mistral | 17.63 | 19 | 14 | 3.51 | No |
| ollama/gemma3:27b | 16.26 | 15 | 13 | 2.95 | No |
| anthropic/claude-opus-4-5-20251101 | 16.05 | 13 | 14 | 2.81 | No |
| anthropic/claude-sonnet-4-5-20250929 | 16.04 | 13 | 14 | 2.82 | No |
| ollama/devstral-small-2:latest | 15.48 | 13 | 12 | 2.65 | No |
| ollama/glm-4.7-flash:latest | 14.87 | 12 | 11 | 2.53 | No |
| ollama/gemma3:4b | 14.59 | 21 | 0 | 2.23 | No |
| ollama/qwen2.5:7b | 14.54 | 18 | 17 | 36.29 | No |
| ollama/gemma3:12b | 14.33 | 12 | 8 | 2.16 | No |
| ollama/qwen2.5:14b | 13.78 | 9 | 7 | 1.71 | No |
| ollama/llama3.1 | 13.73 | 5 | 3 | 0.9 | No |
| anthropic/claude-haiku-4-5-20251001 | 13.69 | 8 | 7 | 1.6 | No |
| ollama/qwen2.5:32b | 13.63 | 8 | 7 | 1.62 | No |
| ollama/qwen3:8b | 13.63 | 8 | 7 | 1.62 | No |
| ollama/cogito:14b | 13.6 | 8 | 7 | 1.63 | No |
| ollama/qwen3:14b | 13.43 | 13 | 0 | 1.43 | No |
| openai/gpt-5-nano-2025-08-07 | 0.0 | 0 | 0 | 0.0 | Yes |
Key Takeaways from Results
- Domain variation matters: Models show different relative performance across biography vs. biomedical content
- Latency spreads are extreme: From <1s to 230s for similar extraction tasks
- Edge detection is harder: Many models excel at node extraction but fail on relationships
- Model failures are rare: Only one failure (GPT-5-nano on biomedical content) across all tests indicates robust API reliability
- Score gaps are significant: Top performers (30-40) dramatically outperform median models (13-17)
Generated by the Knwl.ai Benchmarking Utility on 2026-01-26 12:00:40
Running Your Own Benchmarks
These results reflect specific test content and default metric weights. Your results may vary based on:
- Content domain and complexity
- Custom metric weight preferences
- Model version updates and fine-tuning
- API endpoint variations and geographic latency
Run knwl-benchmark with your own content to find the optimal model for your specific use case.