Knwl Benchmarks


Finding the Right Model for Your Data

What model works best for a particular dataset? Is bigger always better? How much does latency matter for knowledge extraction? These are critical questions when building production-ready graph RAG systems.

The Knwl benchmark suite provides empirical answers by systematically testing diverse LLM models across multiple providers. Whether you’re choosing between local privacy-focused models or cloud-based solutions, these benchmarks help you make informed decisions based on real performance data rather than assumptions.

Overview

The benchmark suite in the benchmarks directory is a comprehensive testing framework:

  • Standalone execution: Self-contained script using Knwl to ingest sets of sentences and facts from various domains
  • Multi-dimensional metrics: Measures ingestion time, node extraction count, edge detection, and failure rates
  • Persistent caching: LLM outputs are captured in knwl_data for analysis, comparison, and rapid re-testing
  • Cross-provider testing: Automatically tests local models (Ollama) and cloud providers (OpenAI, Anthropic, Groq) with multiple models per provider
  • Configurable: Easy to customize for your specific content types and performance requirements
  • Reproducible: Timestamped results with consistent testing methodology

Current Limitations

While comprehensive, this benchmark focuses on specific aspects of knowledge extraction. Keep in mind:

  • Scope: Measures only knowledge ingestion time, not document parsing, OCR, or preprocessing overhead
  • Content complexity: Tests standard text; doesn’t evaluate performance on mathematical notation, complex tables, code snippets, or highly technical diagrams
  • Quality metrics: Focuses on quantitative measures (node/edge counts, latency) rather than semantic accuracy or knowledge graph coherence
  • Language coverage: Currently English-only; multilingual performance may vary significantly
  • Scale: Tests individual fact extraction; doesn’t benchmark batch processing or large document ingestion

Key Insights

These findings challenge common assumptions about LLM performance on knowledge extraction tasks. The results suggest that model selection for graph RAG requires different criteria than general-purpose LLM applications.

Size vs. Performance

Bigger isn’t better for knowledge extraction. While larger models excel at reasoning and creative tasks, they show diminishing returns for structured knowledge extraction. In fact:

  • Larger models consistently exhibit higher latency (sometimes 10-20x slower) without proportional improvements in node/edge detection
  • Mid-size models (7b-14b parameters) often match or exceed larger models in extraction quality
  • The relationship between model size and extraction accuracy is non-linear and highly dependent on model architecture

The Speed-Quality Trade-off

Knowledge extraction is computationally expensive. Even simple facts require substantial processing:

  • Extracting 3 nodes and 2 edges can take 20+ seconds with some models
  • Latency varies dramatically: from under 1 second to over 200 seconds for equivalent tasks
  • This makes model choice critical for production systems processing thousands of documents

Reasoning Models

Reasoning models underperform for knowledge extraction. Models optimized for chain-of-thought reasoning (like o1-style models) consistently show worse performance:

  • Higher latency due to extended reasoning chains
  • No measurable improvement in entity or relationship detection
  • The structured, deterministic nature of knowledge extraction doesn’t benefit from exploratory reasoning

Local vs. Cloud Models

Local models offer privacy at a performance cost. Ollama-based local models provide data privacy and cost predictability but:

  • Generally slower inference than cloud alternatives
  • Quality varies widely by model architecture
  • Exception: With high-end GPU setups, some local models approach cloud performance
  • Best use case: Development, testing, and privacy-sensitive applications

Model-Specific Findings

Performance leaders:

  • Cloud: OpenAI GPT-5.2 and Anthropic Claude Sonnet 4.5 balance speed with extraction quality
  • Local: Qwen2.5 (across 7b, 14b, 32b variants) consistently delivers strong results
  • Emerging: GLM 4.7 Flash shows promising performance for its size

Underperformers:

  • GPT-OSS models: High latency despite reasonable extraction counts
  • Llama 3.1: Surprisingly weak on edge detection, often returning zero relationships
  • GPT-5-mini: Extremely high latency (150-230s) makes it impractical for production

Development Recommendations

For local development and testing, 7b parameter models provide the best balance:

  • Gemma3 7b: Fast, consistent, low resource requirements
  • Qwen2.5 7b: Better extraction quality, slightly higher resource needs
  • Both are “good enough” for development workflows and integration testing

For production deployments, consider:

  • Cloud APIs for high-throughput, low-latency requirements
  • Mid-size local models (14b-32b) for privacy-sensitive or cost-optimized scenarios
  • Qwen2.5 family for best overall local performance

Performance Metric

Selecting the “best” model requires defining what “best” means for your specific use case. Knwl uses a composite metric that balances extraction completeness against processing speed:

score = (w1 * nodes + w2 * edges) / (w3 * time)

Where:

  • w1: Weight for node extraction (entities, concepts)
  • w2: Weight for edge extraction (relationships, connections)
  • w3: Weight for processing time (latency penalty)
  • Weights sum to 1.0
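As a rough sketch, the composite can be computed directly from a run's counts and latency. The weights below are the balanced defaults; note that the benchmark's published scores likely apply additional normalization, so this literal formula will not reproduce the values in the tables further down.

```python
def composite_score(nodes: int, edges: int, seconds: float,
                    w1: float = 0.33, w2: float = 0.33, w3: float = 0.34) -> float:
    """Reward extracted nodes/edges, penalize latency. Illustrative only."""
    if seconds <= 0:
        return 0.0  # treat failed or zero-latency runs as a zero score
    return (w1 * nodes + w2 * edges) / (w3 * seconds)

# A run that extracts 38 nodes and 43 edges in 8.42 seconds:
print(round(composite_score(38, 43, 8.42), 2))
```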

Interpreting the Metric

Higher scores indicate better performance. The metric rewards models that extract more knowledge elements (nodes and edges) while penalizing longer processing times.

Customizing for Your Needs

Prioritize completeness (detailed knowledge graphs):

  • Increase w1 and w2 (e.g., w1=0.45, w2=0.45, w3=0.10)
  • Use case: Archival systems, research databases, comprehensive knowledge bases

Prioritize speed (real-time applications):

  • Increase w3 (e.g., w1=0.25, w2=0.25, w3=0.50)
  • Use case: Live chat systems, real-time document processing, interactive applications

Prioritize relationships (highly connected graphs):

  • Increase w2 (e.g., w1=0.30, w2=0.50, w3=0.20)
  • Use case: Social network analysis, dependency tracking, causal reasoning

Balanced approach (general purpose):

  • Equal weights (w1=0.33, w2=0.33, w3=0.34)
  • Use case: Most production graph RAG applications

The default weights in the benchmark suite are balanced, but you can adjust them in the configuration to match your application’s requirements.
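To see how a weight profile can change a ranking, here is a comparison with two invented models (the numbers are illustrative, not benchmark data). One caveat: in the literal formula above, the time weight w3 rescales every score by the same factor, so rankings shift mainly through the node/edge balance; the shipped implementation may normalize latency differently.

```python
def score(nodes, edges, seconds, w1, w2, w3):
    # Same illustrative formula: extraction reward over latency penalty.
    return (w1 * nodes + w2 * edges) / (w3 * seconds)

# Two hypothetical models with equal latency but different extraction profiles.
node_heavy = dict(nodes=22, edges=4,  seconds=2.0)
edge_heavy = dict(nodes=10, edges=14, seconds=2.0)

balanced     = dict(w1=0.33, w2=0.33, w3=0.34)
relationship = dict(w1=0.30, w2=0.50, w3=0.20)

# Balanced weights favour the node-heavy model; the relationship
# profile flips the ranking toward the edge-heavy one.
print(score(**node_heavy, **balanced) > score(**edge_heavy, **balanced))          # True
print(score(**node_heavy, **relationship) < score(**edge_heavy, **relationship))  # True
```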

Usage

Running the Benchmark

You can run the benchmark via the CLI utility:

uv run knwl-benchmark

Or if you installed Knwl via pipx:

knwl-benchmark

Interactive Configuration

The benchmark suite provides an interactive setup process:

  1. Select providers: Choose which LLM providers to test
    • Local: Ollama (with available models auto-detected)
    • Cloud: OpenAI, Anthropic, Groq, and others
    • You can select multiple providers in a single run
  2. Choose test data: Select the facts/content to ingest
    • Pre-configured datasets (biography, biomedical, technical)
    • Custom text files
    • The content complexity affects extraction results
  3. Select extraction strategy: Pick your knowledge extraction approach
    • Different strategies balance precision vs. recall
    • Strategies affect both quality and processing time
  4. Configure weights (optional): Adjust the performance metric weights to match your priorities

Understanding Results

Timestamped outputs: Each run generates a CSV file with timestamp, making it easy to track performance evolution across different configurations or model versions.

Caching behavior: LLM API calls are automatically cached in knwl_data/. This means:

  • First run with a model: Full API calls, normal latency
  • Subsequent runs: Cached responses, near-instant processing
  • Enables rapid experimentation with different metrics without re-running expensive API calls
  • Cache can be cleared to force fresh API calls for updated model versions
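The cache's on-disk layout is internal to Knwl, but the general pattern — hash the model and prompt into a key, and return the stored response on a hit — can be sketched as follows. All names here (the directory, functions, and file format) are hypothetical, not Knwl's real API.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("knwl_data_demo")  # hypothetical; Knwl's real layout differs

def cache_key(model: str, prompt: str) -> str:
    # Same model + prompt -> same key, so repeat runs hit the cache.
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, prompt: str, call_llm) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt)}.json"
    if path.exists():                       # cache hit: no API call made
        return json.loads(path.read_text())["response"]
    response = call_llm(model, prompt)      # cache miss: real (slow) API call
    path.write_text(json.dumps({"response": response}))
    return response
```

This is why a second benchmark run over the same content completes almost instantly: every model/prompt pair resolves to an existing cache entry.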

Output files: Each results CSV includes:

  • Model name and provider
  • Computed metric score
  • Node and edge counts
  • Latency measurements
  • Failure status

Results are sortable by any column for easy comparison.
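Because the output is plain CSV, post-processing needs nothing beyond the standard library. A minimal sketch with invented column names (check your actual CSV header for the real ones):

```python
import csv
import io

# Hypothetical column names and sample rows; real headers may differ.
sample = """model,provider,metric,nodes,edges,latency_s,failed
gpt-oss:20b,ollama,34.52,38,43,8.42,False
gpt-5-mini,openai,20.09,26,24,233.95,False
claude-sonnet-4-5,anthropic,18.05,16,18,3.49,False
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Sort by the composite metric, best first.
rows.sort(key=lambda r: float(r["metric"]), reverse=True)
for r in rows:
    print(f'{r["model"]:20s} metric={r["metric"]:>6s} latency={r["latency_s"]}s')
```

Swap `io.StringIO(sample)` for `open("results_<timestamp>.csv")` to process a real run.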

Benchmark Results (January 2026)

These results come from systematic testing across two distinct content types: biographical narrative text and biomedical technical content. This dual-domain approach reveals how model performance varies with content characteristics.

Domain-Specific Performance

Different content types reveal distinct model strengths:

Biographical content (narrative, temporal, person-centric):

  • Tests entity recognition for people, places, events
  • Relationships are often implicit and contextual
  • Requires understanding of temporal sequences and causality

Biomedical content (technical, entity-dense, terminology-heavy):

  • Tests scientific entity extraction (proteins, diseases, compounds)
  • Relationships are more explicit but domain-specific
  • Requires handling of specialized terminology and abbreviations

Top Performers

| Content Type | Best Model         | Metric Score |
|--------------|--------------------|--------------|
| Biography    | ollama/gpt-oss:20b | 34.52        |
| Biomedical   | ollama/gpt-oss:20b | 42.88        |

Notable finding: GPT-OSS achieves the highest extraction scores despite moderate latency. This model excels at detecting both entities and relationships, particularly in technical domains where it extracts significantly more edges than competitors.

Model Observations

Cloud vs. Local Performance Gap:

  • OpenAI GPT-5.2 delivers strong performance (32.4 biography, 30.47 biomedical) with reasonable latency
  • Anthropic models (Claude Sonnet/Opus 4.5) show consistent but conservative extraction with excellent speed
  • Local models sacrifice extraction completeness for lower absolute latency

GLM 4.7 Flash: This newer model demonstrates competitive performance for its size, making it an attractive option for resource-constrained environments.

Size-Performance Disconnect: The data confirms that parameter count doesn’t correlate with extraction quality. Models like Qwen3 14b (14.98 score) and Qwen2.5 32b (13.54 score) show that larger variants don’t necessarily extract more knowledge.

Edge Detection Challenges: Several models struggle specifically with relationship extraction:

  • Llama 3.1, Qwen3 14b, and others often return zero edges despite finding nodes
  • This suggests edge detection requires different model capabilities than entity recognition

Practical Recommendations

Based on these results:

For production systems:

  • Start with OpenAI GPT-5.2 or Anthropic Claude for balanced performance
  • Use GPT-OSS if extraction completeness matters more than latency
  • Implement caching to amortize latency across repeated content patterns

For development workflows:

  • Use Qwen2.5 7b or Gemma3 7b for fast iteration
  • GLM 4.7 Flash offers a middle ground between speed and quality
  • Local models enable offline development and privacy-sensitive testing

For cost optimization:

  • Smaller cloud models (GPT-5-nano) extract few or no relationships and can fail outright, limiting practical use
  • Mid-size local models (14b-20b) offer the best price-performance for high-volume processing
  • Consider hybrid approaches: local for bulk processing, cloud for quality assurance

Detailed Results

Complete performance data for all tested models, sorted by metric score. These tables reveal the nuanced trade-offs between extraction completeness, relationship detection, and processing speed.

Biography Domain

Narrative text about a person’s life, testing temporal reasoning, entity co-reference, and implicit relationship extraction.

| Model                                | Metric | Nodes | Edges | Latency (s) | Failed |
|--------------------------------------|--------|-------|-------|-------------|--------|
| ollama/gpt-oss:20b                   | 34.52  | 38    | 43    | 8.42        | No     |
| openai/gpt-5.2-2025-12-11            | 32.4   | 35    | 42    | 11.47       | No     |
| openai/gpt-5-mini                    | 20.09  | 26    | 24    | 233.95      | No     |
| anthropic/claude-sonnet-4-5-20250929 | 18.05  | 16    | 18    | 3.49        | No     |
| anthropic/claude-opus-4-5-20251101   | 18.02  | 17    | 17    | 3.52        | No     |
| anthropic/claude-haiku-4-5-20251001  | 16.32  | 15    | 13    | 2.91        | No     |
| ollama/glm-4.7-flash:latest          | 15.93  | 17    | 10    | 2.9         | No     |
| ollama/gemma3:27b                    | 15.42  | 16    | 9     | 2.69        | No     |
| ollama/devstral-small-2:latest       | 15.41  | 14    | 11    | 2.7         | No     |
| ollama/mistral                       | 15.03  | 12    | 11    | 2.43        | No     |
| ollama/qwen3:14b                     | 14.98  | 21    | 2     | 2.46        | No     |
| ollama/qwen3:8b                      | 14.96  | 12    | 11    | 2.47        | No     |
| ollama/gemma3:4b                     | 14.82  | 15    | 8     | 2.56        | No     |
| ollama/gemma3:12b                    | 14.31  | 12    | 8     | 2.17        | No     |
| ollama/llama3.1                      | 14.16  | 6     | 0     | 0.7         | No     |
| openai/gpt-5-nano-2025-08-07         | 14.13  | 17    | 0     | 1.73        | No     |
| ollama/qwen2.5:32b                   | 13.54  | 9     | 5     | 1.52        | No     |
| ollama/cogito:14b                    | 13.52  | 6     | 4     | 1.1         | No     |
| ollama/qwen2.5:14b                   | 13.51  | 12    | 2     | 1.53        | No     |
| ollama/qwen2.5:7b                    | 5.56   | 8     | 4     | 25.21       | No     |

Biomedical Domain

Technical biomedical text with specialized terminology, testing scientific entity recognition and complex relationship extraction in a domain-specific context.

| Model                                | Metric | Nodes | Edges | Latency (s) | Failed |
|--------------------------------------|--------|-------|-------|-------------|--------|
| ollama/gpt-oss:20b                   | 42.88  | 38    | 66    | 14.6        | No     |
| openai/gpt-5.2-2025-12-11            | 30.47  | 32    | 38    | 7.09        | No     |
| openai/gpt-5-mini                    | 19.73  | 24    | 25    | 150.95      | No     |
| ollama/mistral                       | 17.63  | 19    | 14    | 3.51        | No     |
| ollama/gemma3:27b                    | 16.26  | 15    | 13    | 2.95        | No     |
| anthropic/claude-opus-4-5-20251101   | 16.05  | 13    | 14    | 2.81        | No     |
| anthropic/claude-sonnet-4-5-20250929 | 16.04  | 13    | 14    | 2.82        | No     |
| ollama/devstral-small-2:latest       | 15.48  | 13    | 12    | 2.65        | No     |
| ollama/glm-4.7-flash:latest          | 14.87  | 12    | 11    | 2.53        | No     |
| ollama/gemma3:4b                     | 14.59  | 21    | 0     | 2.23        | No     |
| ollama/qwen2.5:7b                    | 14.54  | 18    | 17    | 36.29       | No     |
| ollama/gemma3:12b                    | 14.33  | 12    | 8     | 2.16        | No     |
| ollama/qwen2.5:14b                   | 13.78  | 9     | 7     | 1.71        | No     |
| ollama/llama3.1                      | 13.73  | 5     | 3     | 0.9         | No     |
| anthropic/claude-haiku-4-5-20251001  | 13.69  | 8     | 7     | 1.6         | No     |
| ollama/qwen2.5:32b                   | 13.63  | 8     | 7     | 1.62        | No     |
| ollama/qwen3:8b                      | 13.63  | 8     | 7     | 1.62        | No     |
| ollama/cogito:14b                    | 13.6   | 8     | 7     | 1.63        | No     |
| ollama/qwen3:14b                     | 13.43  | 13    | 0     | 1.43        | No     |
| openai/gpt-5-nano-2025-08-07         | 0.0    | 0     | 0     | 0.0         | Yes    |

Key Takeaways from Results

  1. Domain variation matters: Models show different relative performance across biography vs. biomedical content
  2. Latency spreads are extreme: From <1s to 230s for similar extraction tasks
  3. Edge detection is harder: Many models excel at node extraction but fail on relationships
  4. Model failures are rare: Only 1 failure (GPT-5-nano on biomedical) across all tests indicates robust API reliability
  5. Score gaps are significant: Top performers (30-40) dramatically outperform median models (13-17)

Generated by the Knwl.ai Benchmarking Utility on 2026-01-26 12:00:40

Running Your Own Benchmarks

These results reflect specific test content and default metric weights. Your results may vary based on:

  • Content domain and complexity
  • Custom metric weight preferences
  • Model version updates and fine-tuning
  • API endpoint variations and geographic latency

Run knwl-benchmark with your own content to find the optimal model for your specific use case.