QA generation for DeepEval

LLM
RAG evaluation using DeepEval

This notebook shows how to generate QA pairs for a given document repo for use in RAG evaluation. Frameworks like DeepEval make it easy to evaluate the quality of a RAG solution, but you need to feed them QA pairs, and these can themselves be generated with an LLM. We also show how to use DeepEval once the QA pairs have been assembled.

LangChain is used here, but the approach works with other frameworks as well, or with no framework at all (i.e. calling the LLMs directly). Both the LLM (Qwen 2.5) and the embedding model (nomic) run locally.

Setup and imports

You can install the necessary packages via poetry install if you save the following as pyproject.toml:


[tool.poetry]
name = "rageval"
version = "0.1.0"
description = ""
authors = ["Swa <swa@orbifold.net>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.9,<3.13"
deepeval = "^2.1.6"
langchain-ollama = "^0.2.2"
jupyter = "^1.1.1"
langchain-community = "^0.3.14"
pypdf = "^5.1.0"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

The following imports are used below:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama
from langchain_core.vectorstores import InMemoryVectorStore
from langchain import hub
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
 

from typing import List, Dict, Any, Tuple
import numpy as np
import random
import json
import re

Language and embedding models

To use a vector DB you need an embedding model, i.e. a model that turns text into a vector:

embedder = OllamaEmbeddings(model="nomic-embed-text")
LangChainDeprecationWarning: The class `OllamaEmbeddings` was deprecated in LangChain 0.3.1 and will be removed in 1.0.0. An updated version of the class exists in the langchain-ollama package and should be used instead. To use it run `pip install -U langchain-ollama` and import as `from langchain_ollama import OllamaEmbeddings`.

You can use such an embedder directly to see the vector like so:

dummy_chunk = "The climate of the Earth has been changing since the planet was formed. The Earth has experienced periods of cooling and warming, and these changes have been driven by a variety of factors. The most recent period of warming, which began in the late 19th century, is known as global warming. Global warming is caused by an increase in the concentration of greenhouse gases in the Earth's atmosphere. These gases trap heat from the Sun, preventing it from escaping back into space. As a result, the Earth's surface temperature has been rising, leading to changes in the climate. The effects of global warming are already being felt around the world, with rising sea levels, more frequent heatwaves, and changes in precipitation patterns."
vector = embedder.embed_query(dummy_chunk)
print("vector dimension:", np.array(vector).shape[0])
vector dimension: 768

The language model used below can be defined via the ChatOllama class from LangChain. It’s a simple wrapper around the ollama library.

llm = ChatOllama(model="qwen2.5:14b", temperature=0, num_predict=2000)
LangChainDeprecationWarning: The class `ChatOllama` was deprecated in LangChain 0.3.1 and will be removed in 1.0.0. An updated version of the class exists in the langchain-ollama package and should be used instead. To use it run `pip install -U langchain-ollama` and import as `from langchain_ollama import ChatOllama`.

You can check it works via a simple invoke:

print(llm.invoke("What is loop quantum gravity?").content)
Loop Quantum Gravity (LQG) is a theoretical framework in physics that attempts to merge the principles of general relativity with quantum mechanics. It aims to provide a theory of quantum spacetime, which would describe how space and time are quantized at the smallest scales.

Key aspects of Loop Quantum Gravity include:

1. **Quantization of Space**: In LQG, space is not continuous but made up of discrete units or "atoms" of volume and area. These fundamental building blocks are represented by loops (closed paths) in spacetime, which give the theory its name.

2. **Spin Networks**: The basic structure of space in LQG is described using spin networks, which are graphs with edges labeled by spins (quantum numbers). These networks evolve over time to form a dynamic picture of quantum geometry.

3. **Diffeomorphism Invariance**: Unlike some other approaches to quantum gravity, LQG maintains the principle that physics should be independent of how spacetime is parameterized or "labeled". This property, known as diffeomorphism invariance, is crucial for maintaining general covariance (a key feature of Einstein's theory of relativity).

4. **Quantum Geometry**: In this framework, geometry itself becomes a quantum mechanical object. The area and volume of regions in space are quantized, meaning they can only take on certain discrete values.

5. **Black Hole Entropy**: LQG provides a way to calculate the entropy of black holes based on the number of possible spin network states that can exist at the event horizon, offering an explanation for the Bekenstein-Hawking formula which relates black hole entropy to its surface area.

6. **Cosmological Implications**: The theory also has implications for cosmology, potentially providing insights into the early universe and the Big Bang singularity problem by suggesting a natural cutoff for extreme curvature conditions.

While Loop Quantum Gravity is an active area of research with promising theoretical developments, it remains one of several approaches to quantum gravity. Its predictions are not yet directly testable due to the extremely small scales at which quantum gravitational effects would become significant.

PDF loading

To generate QA pairs you need chunks of text the LLM can work with. Before that, you need to load the PDF and convert it to plain text:

path = "./Sapiens.pdf"
loader = PyPDFLoader(path)
documents = loader.load()

You can use any PDF or collection of PDFs you like. PyPDFLoader returns one Document per page, so each "chunk" below is a page of the book.

print("chunk count:",len(documents))
print("\n=== chunk 100 ===:\n",documents[100].page_content)
chunk count: 481

=== chunk 100 ===:
 procreation. In good times females reach puberty earlier, and their
chances of getting pregnant are a bit higher. In bad times puberty is
late and fertility decreases.
To these natural population controls were added cultural
mechanisms. Babies and small children, who move slowly and
demand much attention, were a burden on nomadic foragers. People
tried to space their children three to four years apart. Women did so
by nursing their children around the clock and until a late age
(around-the-clock suckling signicantly decreases the chances of
getting pregnant). Other methods included full or partial sexual
abstinence (backed perhaps by cultural taboos), abortions and
occasionally infanticide.4
During these long millennia people occasionally ate wheat grain,
but this was a marginal part of their diet. About 18,000 years ago,
the last ice age gave way to a period of global warming. As
temperatures rose, so did rainfall. The new climate was ideal for
Middle Eastern wheat and other cereals, which multiplied and
spread. People began eating more wheat, and in exchange they
inadvertently spread its growth. Since it was impossible to eat wild
grains without rst winnowing, grinding and cooking them, people
who gathered these grains carried them back to their temporary
campsites for processing. Wheat grains are small and numerous, so
some of them inevitably fell on the way to the campsite and were
lost. Over time, more and more wheat grew along favourite human
trails and near campsites.
When humans burned down forests and thickets, this also helped
wheat. Fire cleared away trees and shrubs, allowing wheat and
other grasses to monopolise the sunlight, water and nutrients.
Where wheat became particularly abundant, and game and other
food sources were also plentiful, human bands could gradually give
up their nomadic lifestyle and settle down in seasonal and even
permanent camps.
At rst they might have camped for four weeks during the
harvest. A generation later, as wheat plants multiplied and spread,
the harvest camp might have lasted for ve weeks, then six, and
nally it became a permanent village. Evidence of such settlements

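Note that the extracted text is of poor quality: line breaks follow the page layout and ligatures such as "fi" are lost (for example "signicantly" and "rst"). Cleaning this up is a separate topic.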

At this point you have enough to generate QA pairs.

Generating QA pairs

Generating QA via an LLM is in essence just good prompt engineering:

QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

Let’s take a chunk of text and generate a question for it:

context = documents[100].page_content
prompt = QA_generation_prompt.format(context=context)
print(llm.invoke(prompt).content)
Output:::
Factoid question: What cultural method did nomadic foragers use to space their children three to four years apart?
Answer: Women nursed their children around the clock until a late age, which significantly decreased the chances of getting pregnant. Other methods included full or partial sexual abstinence, abortions, and occasionally infanticide.

This free-text output is awkward for downstream tasks. Structured or JSON output may or may not work, depending on the model and the framework. The safe way forward is to parse the output and assemble the JSON yourself:

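def extract_qa_pair(output):
    """Parse the 'Factoid question: ... / Answer: ...' format into a dict."""
    question_match = re.search(r"Factoid question:\s*(.*)", output)
    answer_match = re.search(r"Answer:\s*(.*)", output)
    return {
        # None signals that the model did not follow the requested format
        "question": question_match.group(1).strip() if question_match else None,
        "answer": answer_match.group(1).strip() if answer_match else None,
    }
extract_qa_pair(llm.invoke(prompt).content)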
{'question': 'What cultural method did nomadic foragers use to space their children three to four years apart?',
 'answer': 'Women nursed their children around the clock until a late age, which significantly decreased the chances of getting pregnant. Other methods included full or partial sexual abstinence, abortions, and occasionally infanticide.'}
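The single-line regexes above only capture the first line of an answer. As a hedged sketch of a more tolerant variant, the following parser keeps multi-line answers and returns None when the model ignores the format. The name parse_qa_output and the sample text (including its second answer line) are made up for illustration and are not part of the original notebook.

```python
import json
import re

# Hypothetical helper: one pattern for both fields; DOTALL lets the answer span lines.
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*\n\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)

def parse_qa_output(output):
    """Return {'question', 'answer'} or None if the format was not followed."""
    match = QA_PATTERN.search(output)
    if match is None:
        return None
    return {
        "question": match.group("question").strip(),
        "answer": match.group("answer").strip(),
    }

# Made-up sample in the format requested by QA_generation_prompt
sample = """Output:::
Factoid question: What religion did Emperor Constantine choose for his empire?
Answer: Christianity.
He chose it in the fourth century AD."""

parsed = parse_qa_output(sample)
print(json.dumps(parsed, indent=2))
print(parse_qa_output("I cannot answer that."))
```

Returning None instead of raising leaves it to the caller to decide whether to retry the generation or skip the page.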

Let’s wrap all this into a function:

def generate_qa(amount=10):
    qa_pairs = []
    for _ in range(amount):
        # skip front and back matter by sampling away from both ends of the book
        page = random.randint(20, len(documents) - 20)
        content = documents[page].page_content
        prompt = QA_generation_prompt.format(context=content)
        output = llm.invoke(prompt).content
        item = extract_qa_pair(output)
        item["documentId"] = page
        item["content"] = content
        qa_pairs.append(item)
    return qa_pairs
generate_qa(2)
[{'question': 'What is a more accurate term for the Stone Age according to the context?',
  'answer': 'Wood Age',
  'documentId': 56,
  'content': 'stone tools. Artefacts made of more perishable materials – such as\nwood, bamboo or leather – survive only under unique conditions.\nThe common impression that pre-agricultural humans lived in an\nage of stone is a misconception based on this archaeological bias.\nThe Stone Age should more accurately be called the Wood Age,\nbecause most of the tools used by ancient hunter-gatherers were\nmade of wood.\nAny reconstruction of the lives of ancient hunter-gatherers from\nthe surviving artefacts is extremely problematic. One of the most\nglaring di\x00erences between the ancient foragers and their\nagricultural and industrial descendants is that foragers had very few\nartefacts to begin with, and these played a comparatively modest\nrole in their lives. Over the course of his or her life, a typical\nmember of a modern a\x00uent society will own several million\nartefacts – from cars and houses to disposable nappies and milk\ncartons. There’s hardly an activity, a belief, or even an emotion that\nis not mediated by objects of our own devising. Our eating habits\nare mediated by a mind-boggling collection of such items, from\nspoons and glasses to genetic engineering labs and gigantic ocean-\ngoing ships. In play, we use a plethora of toys, from plastic cards to\n100,000-seater stadiums. Our romantic and sexual relations are\naccoutred by rings, beds, nice clothes, sexy underwear, condoms,\nfashionable restaurants, cheap motels, airport lounges, wedding\nhalls and catering companies. Religions bring the sacred into our\nlives with Gothic churches, Muslim mosques, Hindu ashrams, Torah\nscrolls, Tibetan prayer wheels, priestly cassocks, candles, incense,\nChristmas trees, matzah balls, tombstones and icons.\nWe hardly notice how ubiquitous our stu\x00 is until we have to\nmove it to a new house. Foragers moved house every month, every\nweek, and sometimes even every day, toting whatever they had on\ntheir backs. 
There were no moving companies, wagons, or even\npack animals to share the burden. They consequently had to make\ndo with only the most essential possessions. It’s reasonable to\npresume, then, that the greater part of their mental, religious and\nemotional lives was conducted without the help of artefacts. An\narchaeologist working 100,000 years from now could piece together'},
 {'question': 'What religion did Emperor Constantine choose for his empire in the fourth century AD?',
  'answer': 'Christianity',
  'documentId': 267,
  'content': 'Every point in history is a crossroads. A single travelled road leads\nfrom the past to the present, but myriad paths fork o\x00 into the\nfuture. Some of those paths are wider, smoother and better marked,\nand are thus more likely to be taken, but sometimes history – or the\npeople who make history – takes unexpected turns.\nAt the beginning of the fourth century AD, the Roman Empire\nfaced a wide horizon of religious possibilities. It could have stuck to\nits traditional and variegated polytheism. But its emperor,\nConstantine, looking back on a fractious century of civil war, seems\nto have thought that a single religion with a clear doctrine could\nhelp unify his ethnically diverse realm. He could have chosen any of\na number of contemporary cults to be his national faith –\nManichaeism, Mithraism, the cults of Isis or Cybele, Zoroastrianism,\nJudaism and even Buddhism were all available options. Why did he\nopt for Jesus? Was there something in Christian theology that\nattracted him personally, or perhaps an aspect of the faith that made\nhim think it would be easier to use for his purposes? Did he have a\nreligious experience, or did some of his advisers suggest that the\nChristians were quickly gaining adherents and that it would be best\nto jump on that wagon? Historians can speculate, but not provide\nany de\x00nitive answer. They can describe how Christianity took over\nthe Roman Empire, but they cannot explain why this particular\npossibility was realised.\nWhat is the di\x00erence between describing ‘how’ and explaining\n‘why’? To describe ‘how’ means to reconstruct the series of speci\x00c\nevents that led from one point to another. To explain ‘why means to\n\x00nd causal connections that account for the occurrence of this\nparticular series of events to the exclusion of all others.\nSome scholars do indeed provide deterministic explanations of\nevents such as the rise of Christianity. 
They attempt to reduce\nhuman history to the workings of biological, ecological or economic\nforces. They argue that there was something about the geography,\ngenetics or economy of the Roman Mediterranean that made the rise\nof a monotheist religion inevitable. Yet most historians tend to be\nsceptical of such deterministic theories. This is one of the'}]
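Note that the first generated question ends with "according to the context", which the prompt explicitly forbids. A simple post-filter can drop such pairs before evaluation. The sketch below is one possible heuristic, not part of the original notebook: the names FORBIDDEN, is_usable and filter_qa_pairs are hypothetical, the word list is an assumption, and the third sample entry is a made-up parse failure.

```python
import re

# Assumption: questions mentioning these words leak the generation setup.
FORBIDDEN = re.compile(r"\b(?:context|passage)\b", re.IGNORECASE)

def is_usable(pair):
    """A pair is usable if both fields exist and the question is self-contained."""
    question, answer = pair.get("question"), pair.get("answer")
    if not question or not answer:
        return False
    return FORBIDDEN.search(question) is None

def filter_qa_pairs(pairs):
    """Drop unusable pairs and exact duplicate questions, keeping the order."""
    seen = set()
    kept = []
    for pair in pairs:
        if not is_usable(pair):
            continue
        key = pair["question"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept

pairs = [
    {"question": "What is a more accurate term for the Stone Age according to the context?",
     "answer": "Wood Age", "documentId": 56},
    {"question": "What religion did Emperor Constantine choose for his empire in the fourth century AD?",
     "answer": "Christianity", "documentId": 267},
    {"question": None, "answer": None, "documentId": 300},  # made-up failed parse
]
kept = filter_qa_pairs(pairs)
print([p["documentId"] for p in kept])  # [267]
```

The word filter is deliberately crude: it will also reject a legitimate question that happens to contain "context", so it is worth inspecting what gets dropped.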

Vector DB

To implement a basic RAG (which we evaluate below) we need a vector store. We use an in-memory store here, but any other would do:

vectors = InMemoryVectorStore(embedder)
document_ids = vectors.add_documents(documents=documents)
print("vector count:",len(document_ids))
vector count: 481
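Conceptually, such a store keeps one vector per document and ranks them by similarity to the query vector. The following is a minimal pure-Python sketch of that idea, assuming cosine similarity as the ranking measure. It is not LangChain's implementation; the function names and the toy 3-dimensional vectors (standing in for the 768-dimensional nomic embeddings) are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=2):
    """stored is a list of (doc_id, vector); return the k most similar doc ids."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy vectors, purely illustrative
stored = [
    ("page-a", [1.0, 0.0, 0.0]),
    ("page-b", [0.0, 1.0, 0.0]),
    ("page-c", [0.7, 0.7, 0.0]),
]
print(top_k([0.9, 0.1, 0.0], stored, k=2))  # ['page-a', 'page-c']
```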

Thanks to the LangChain implementation you can turn this store directly into a retriever:

retriever = vectors.as_retriever()

A retriever pulls documents out of the store; it does not answer questions. The following call is the essence of finding context for a question:

retriever.invoke("How many men were in Pizarro's expedition when he arrived in the Inca Empire?")
[Document(id='e64b9f49-652c-4bc4-9e2b-fd6d38646817', metadata={'source': './Sapiens.pdf', 'page': 330}, page_content='population of the Americas had shrunk by about 90 per cent, due\nmainly to unfamiliar diseases that reached America with the\ninvaders. The survivors found themselves under the thumb of a\ngreedy and racist regime that was far worse than that of the Aztecs.\nTen years after Cortés landed in Mexico, Pizarro arrived on the\nshore of the Inca Empire. He had far fewer soldiers than Cortés – his\nexpedition numbered just 168 men! Yet Pizarro bene\x00ted from all\nthe knowledge and experience gained in previous invasions. The\nInca, in contrast, knew nothing about the fate of the Aztecs. Pizarro\nplagiarised Cortés. He declared himself a peaceful emissary from the\nking of Spain, invited the Inca ruler, Atahualpa, to a diplomatic\ninterview, and then kidnapped him. Pizarro proceeded to conquer\nthe paralysed empire with the help of local allies. If the subject\npeoples of the Inca Empire had known the fate of the inhabitants of\nMexico, they would not have thrown in their lot with the invaders.\nBut they did not know.\nThe native peoples of America were not the only ones to pay a\nheavy price for their parochial outlook. The great empires of Asia –\nthe Ottoman, the Safavid, the Mughal and the Chinese – very\nquickly heard that the Europeans had discovered something big. Yet\nthey displayed little interest in these discoveries. They continued to\nbelieve that the world revolved around Asia, and made no attempt\nto compete with the Europeans for control of America or of the new\nocean lanes in the Atlantic and the Paci\x00c. Even puny European\nkingdoms such as Scotland and Denmark sent a few explore-and-\nconquer expeditions to America, but not one expedition of either\nexploration or conquest was ever sent to America from the Islamic\nworld, India or China. The \x00rst non-European power that tried to\nsend a military expedition to America was Japan. 
That happened in\nJune 1942, when a Japanese expedition conquered Kiska and Attu,\ntwo small islands o\x00 the Alaskan coast, capturing in the process ten\nUS soldiers and a dog. The Japanese never got any closer to the\nmainland.'),
 Document(id='f9a85335-955c-4d8a-8085-c1254d3f9933', metadata={'source': './Sapiens.pdf', 'page': 326}, page_content='capital was a smouldering ruin, the Aztec Empire was a thing of the\npast, and Hernán Cortés lorded over a vast new Spanish Empire in\nMexico.\nThe Spaniards did not stop to congratulate themselves or even to\ncatch their breath. They immediately commenced explore-and-\nconquer operations in all directions. The previous rulers of Central\nAmerica – the Aztecs, the Toltecs, the Maya – barely knew South\nAmerica existed, and never made any attempt to subjugate it, over\nthe course of 2,000 years. Yet within little more than ten years of\nthe Spanish conquest of Mexico, Francisco Pizarro had discovered\nthe Inca Empire in South America, vanquishing it in 1532.\nHad the Aztecs and Incas shown a bit more interest in the world\nsurrounding them – and had they known what the Spaniards had\ndone to their neighbours – they might have resisted the Spanish\nconquest more keenly and successfully. In the years separating\nColumbus’ \x00rst journey to America (1492) from the landing of\nCortés in Mexico (1519), the Spaniards conquered most of the\nCaribbean islands, setting up a chain of new colonies. For the\nsubjugated natives, these colonies were hell on earth. They were\nruled with an iron \x00st by greedy and unscrupulous colonists who\nenslaved them and set them to work in mines and plantations,\nkilling anyone who o\x00ered the slightest resistance. Most of the\nnative population soon died, either because of the harsh working\nconditions or the virulence of the diseases that hitch-hiked to\nAmerica on the conquerors’ sailing ships. Within twenty years,\nalmost the entire native Caribbean population was wiped out. The\nSpanish colonists began importing African slaves to \x00ll the vacuum.\nThis genocide took place on the very doorstep of the Aztec\nEmpire, yet when Cortés landed on the empire’s eastern coast, the\nAztecs knew nothing about it. 
The coming of the Spaniards was the\nequivalent of an alien invasion from outer space. The Aztecs were\nconvinced that they knew the entire world and that they ruled most\nof it. To them it was unimaginable that outside their domain could\nexist anything like these Spaniards. When Cortés and his men\nlanded on the sunny beaches of today’s Vera Cruz, it was the \x00rst\ntime the Aztecs encountered a completely unknown people.'),
 Document(id='3698b77a-21f7-46c9-81ca-64f7f33ee4e1', metadata={'source': './Sapiens.pdf', 'page': 147}, page_content='knots on di\x00erent cords with di\x00erent colours, it was possible to\nrecord large amounts of mathematical data relating to, for example,\ntax collection and property ownership.2\nFor hundreds, perhaps thousands of years, quipus were essential\nto the business of cities, kingdoms and empires.3 They reached their\nfull potential under the Inca Empire, which ruled 10–12 million\npeople and covered today’s Peru, Ecuador and Bolivia, as well as\nchunks of Chile, Argentina and Colombia. Thanks to quipus, the\nIncas could save and process large amounts of data, without which\nthey would not have been able to maintain the complex\nadministrative machinery that an empire of that size requires.\nIn fact, quipus were so e\x00ective and accurate that in the early\nyears following the Spanish conquest of South America, the\nSpaniards themselves employed quipus in the work of administering\ntheir new empire. The problem was that the Spaniards did not\nthemselves know how to record and read quipus, making them\ndependent on local professionals. The continent’s new rulers realised\nthat this placed them in a tenuous position – the native quipu\nexperts could easily mislead and cheat their overlords. So once\nSpain’s dominion was more \x00rmly established, quipus were phased\nout and the new empire’s records were kept entirely in Latin script\nand numerals. Very few quipus survived the Spanish occupation,\nand most of those remaining are undecipherable, since,\nunfortunately, the art of reading quipus has been lost.\nThe Wonders of Bureaucracy\nThe Mesopotamians eventually started to want to write down things\nother than monotonous mathematical data. Between 3000 BC and\n2500 BC more and more signs were added to the Sumerian system,\ngradually transforming it into a full script that we today call\ncuneiform. 
By 2500 BC, kings were using cuneiform to issue decrees,\npriests were using it to record oracles, and less exalted citizens were'),
 Document(id='6918767b-0940-4d99-b48b-5f5115a469c6', metadata={'source': './Sapiens.pdf', 'page': 329}, page_content='Cortés was now in a very delicate situation. He had captured the\nemperor, but was surrounded by tens of thousands of furious enemy\nwarriors, millions of hostile civilians, and an entire continent about\nwhich he knew practically nothing. He had at his disposal only a\nfew hundred Spaniards, and the closest Spanish reinforcements were\nin Cuba, more than 1,500 kilometres away.\nCortés kept Montezuma captive in the palace, making it look as if\nthe king remained free and in charge and as if the ‘Spanish\nambassador’ were no more than a guest. The Aztec Empire was an\nextremely centralised polity, and this unprecedented situation\nparalysed it. Montezuma continued to behave as if he ruled the\nempire, and the Aztec elite continued to obey him, which meant\nthey obeyed Cortés. This situation lasted for several months, during\nwhich time Cortés interrogated Montezuma and his attendants,\ntrained translators in a variety of local languages, and sent small\nSpanish expeditions in all directions to become familiar with the\nAztec Empire and the various tribes, peoples and cities that it ruled.\nThe Aztec elite eventually revolted against Cortés and\nMontezuma, elected a new emperor, and drove the Spaniards from\nTenochtitlan. However, by now numerous cracks had appeared in\nthe imperial edi\x00ce. Cortés used the knowledge he had gained to\nprise the cracks open wider and split the empire from within. He\nconvinced many of the empire’s subject peoples to join him against\nthe ruling Aztec elite. The subject peoples miscalculated badly. They\nhated the Aztecs, but knew nothing of Spain or the Caribbean\ngenocide. They assumed that with Spanish help they could shake o\x00\nthe Aztec yoke. The idea that the Spanish would take over never\noccurred to them. 
They were sure that if Cortés and his few hundred\nhenchmen caused any trouble, they could easily be overwhelmed.\nThe rebellious peoples provided Cortés with an army of tens of\nthousands of local troops, and with its help Cortés besieged\nTenochtitlan and conquered the city.\nAt this stage more and more Spanish soldiers and settlers began\narriving in Mexico, some from Cuba, others all the way from Spain.\nWhen the local peoples realised what was happening, it was too\nlate. Within a century of the landing at Vera Cruz, the native')]

You can pass various kwargs here, for example the number of documents to return. The default is 4, as the output above shows:

retriever = vectors.as_retriever(search_kwargs={"k": 3})
retriever.invoke("How many men were in Pizarro's expedition when he arrived in the Inca Empire?")
[Document(id='e64b9f49-652c-4bc4-9e2b-fd6d38646817', metadata={'source': './Sapiens.pdf', 'page': 330}, page_content='population of the Americas had shrunk by about 90 per cent, due\nmainly to unfamiliar diseases that reached America with the\ninvaders. The survivors found themselves under the thumb of a\ngreedy and racist regime that was far worse than that of the Aztecs.\nTen years after Cortés landed in Mexico, Pizarro arrived on the\nshore of the Inca Empire. He had far fewer soldiers than Cortés – his\nexpedition numbered just 168 men! Yet Pizarro bene\x00ted from all\nthe knowledge and experience gained in previous invasions. The\nInca, in contrast, knew nothing about the fate of the Aztecs. Pizarro\nplagiarised Cortés. He declared himself a peaceful emissary from the\nking of Spain, invited the Inca ruler, Atahualpa, to a diplomatic\ninterview, and then kidnapped him. Pizarro proceeded to conquer\nthe paralysed empire with the help of local allies. If the subject\npeoples of the Inca Empire had known the fate of the inhabitants of\nMexico, they would not have thrown in their lot with the invaders.\nBut they did not know.\nThe native peoples of America were not the only ones to pay a\nheavy price for their parochial outlook. The great empires of Asia –\nthe Ottoman, the Safavid, the Mughal and the Chinese – very\nquickly heard that the Europeans had discovered something big. Yet\nthey displayed little interest in these discoveries. They continued to\nbelieve that the world revolved around Asia, and made no attempt\nto compete with the Europeans for control of America or of the new\nocean lanes in the Atlantic and the Paci\x00c. Even puny European\nkingdoms such as Scotland and Denmark sent a few explore-and-\nconquer expeditions to America, but not one expedition of either\nexploration or conquest was ever sent to America from the Islamic\nworld, India or China. The \x00rst non-European power that tried to\nsend a military expedition to America was Japan. 
That happened in\nJune 1942, when a Japanese expedition conquered Kiska and Attu,\ntwo small islands o\x00 the Alaskan coast, capturing in the process ten\nUS soldiers and a dog. The Japanese never got any closer to the\nmainland.'),
 Document(id='f9a85335-955c-4d8a-8085-c1254d3f9933', metadata={'source': './Sapiens.pdf', 'page': 326}, page_content='capital was a smouldering ruin, the Aztec Empire was a thing of the\npast, and Hernán Cortés lorded over a vast new Spanish Empire in\nMexico.\nThe Spaniards did not stop to congratulate themselves or even to\ncatch their breath. They immediately commenced explore-and-\nconquer operations in all directions. The previous rulers of Central\nAmerica – the Aztecs, the Toltecs, the Maya – barely knew South\nAmerica existed, and never made any attempt to subjugate it, over\nthe course of 2,000 years. Yet within little more than ten years of\nthe Spanish conquest of Mexico, Francisco Pizarro had discovered\nthe Inca Empire in South America, vanquishing it in 1532.\nHad the Aztecs and Incas shown a bit more interest in the world\nsurrounding them – and had they known what the Spaniards had\ndone to their neighbours – they might have resisted the Spanish\nconquest more keenly and successfully. In the years separating\nColumbus’ \x00rst journey to America (1492) from the landing of\nCortés in Mexico (1519), the Spaniards conquered most of the\nCaribbean islands, setting up a chain of new colonies. For the\nsubjugated natives, these colonies were hell on earth. They were\nruled with an iron \x00st by greedy and unscrupulous colonists who\nenslaved them and set them to work in mines and plantations,\nkilling anyone who o\x00ered the slightest resistance. Most of the\nnative population soon died, either because of the harsh working\nconditions or the virulence of the diseases that hitch-hiked to\nAmerica on the conquerors’ sailing ships. Within twenty years,\nalmost the entire native Caribbean population was wiped out. The\nSpanish colonists began importing African slaves to \x00ll the vacuum.\nThis genocide took place on the very doorstep of the Aztec\nEmpire, yet when Cortés landed on the empire’s eastern coast, the\nAztecs knew nothing about it. 
The coming of the Spaniards was the\nequivalent of an alien invasion from outer space. The Aztecs were\nconvinced that they knew the entire world and that they ruled most\nof it. To them it was unimaginable that outside their domain could\nexist anything like these Spaniards. When Cortés and his men\nlanded on the sunny beaches of today’s Vera Cruz, it was the \x00rst\ntime the Aztecs encountered a completely unknown people.'),
 Document(id='3698b77a-21f7-46c9-81ca-64f7f33ee4e1', metadata={'source': './Sapiens.pdf', 'page': 147}, page_content='knots on di\x00erent cords with di\x00erent colours, it was possible to\nrecord large amounts of mathematical data relating to, for example,\ntax collection and property ownership.2\nFor hundreds, perhaps thousands of years, quipus were essential\nto the business of cities, kingdoms and empires.3 They reached their\nfull potential under the Inca Empire, which ruled 10–12 million\npeople and covered today’s Peru, Ecuador and Bolivia, as well as\nchunks of Chile, Argentina and Colombia. Thanks to quipus, the\nIncas could save and process large amounts of data, without which\nthey would not have been able to maintain the complex\nadministrative machinery that an empire of that size requires.\nIn fact, quipus were so e\x00ective and accurate that in the early\nyears following the Spanish conquest of South America, the\nSpaniards themselves employed quipus in the work of administering\ntheir new empire. The problem was that the Spaniards did not\nthemselves know how to record and read quipus, making them\ndependent on local professionals. The continent’s new rulers realised\nthat this placed them in a tenuous position – the native quipu\nexperts could easily mislead and cheat their overlords. So once\nSpain’s dominion was more \x00rmly established, quipus were phased\nout and the new empire’s records were kept entirely in Latin script\nand numerals. Very few quipus survived the Spanish occupation,\nand most of those remaining are undecipherable, since,\nunfortunately, the art of reading quipus has been lost.\nThe Wonders of Bureaucracy\nThe Mesopotamians eventually started to want to write down things\nother than monotonous mathematical data. Between 3000 BC and\n2500 BC more and more signs were added to the Sumerian system,\ngradually transforming it into a full script that we today call\ncuneiform. 
By 2500 BC, kings were using cuneiform to issue decrees,\npriests were using it to record oracles, and less exalted citizens were')]

Naive RAG

The basic RAG recipe is in essence a prompt containing a question and some context. You don’t need LangChain or any framework to do this, but it helps. With LangChain you can proceed by fetching a predefined RAG prompt:

rag_prompt = hub.pull("rlm/rag-prompt")

This prompt contains some Hub metadata but the core is this:

You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
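
Without any framework, the same prompt can be assembled with a plain Python format string. The following is a minimal sketch: the template text is copied from the core shown above (exact whitespace in the Hub version may differ), and the names `RAG_TEMPLATE` and `build_rag_prompt` are hypothetical, introduced only for illustration.

```python
# Core of the RAG prompt as plain text; {question} and {context} are placeholders.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)


def build_rag_prompt(question: str, context: str) -> str:
    """Fill the template with a question and already-retrieved context text."""
    return RAG_TEMPLATE.format(question=question, context=context)


prompt = build_rag_prompt(
    "How many men were in Pizarro's expedition?",
    "his expedition numbered just 168 men!",
)
print(prompt)
```

The resulting string can be sent to any LLM client, which is all the framework version does under the hood for this simple case.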

This prompt can be improved in various ways, but for the purposes of this notebook it is enough. The function below retrieves the context, fills in the prompt and asks the LLM:

def generate_answer(question):
    # Fetch the chunks most similar to the question from the vector store.
    context = retriever.invoke(question)
    # Fill the RAG prompt; the retrieved documents are passed in as they are.
    prompt = rag_prompt.format(question=question, context=context)
    # Ask the LLM and keep only the text of the reply.
    output = llm.invoke(prompt).content
    return {
        "question": question,
        "answer": output,
        "context": context,
    }

generate_answer("How many men were in Pizarro's expedition when he arrived in the Inca Empire?")
{'question': "How many men were in Pizarro's expedition when he arrived in the Inca Empire?",
 'answer': "Pizarro's expedition numbered just 168 men when he arrived on the shore of the Inca Empire.",
 'context': [Document(id='e64b9f49-652c-4bc4-9e2b-fd6d38646817', metadata={'source': './Sapiens.pdf', 'page': 330}, page_content='population of the Americas had shrunk by about 90 per cent, due\nmainly to unfamiliar diseases that reached America with the\ninvaders. The survivors found themselves under the thumb of a\ngreedy and racist regime that was far worse than that of the Aztecs.\nTen years after Cortés landed in Mexico, Pizarro arrived on the\nshore of the Inca Empire. He had far fewer soldiers than Cortés – his\nexpedition numbered just 168 men! Yet Pizarro bene\x00ted from all\nthe knowledge and experience gained in previous invasions. The\nInca, in contrast, knew nothing about the fate of the Aztecs. Pizarro\nplagiarised Cortés. He declared himself a peaceful emissary from the\nking of Spain, invited the Inca ruler, Atahualpa, to a diplomatic\ninterview, and then kidnapped him. Pizarro proceeded to conquer\nthe paralysed empire with the help of local allies. If the subject\npeoples of the Inca Empire had known the fate of the inhabitants of\nMexico, they would not have thrown in their lot with the invaders.\nBut they did not know.\nThe native peoples of America were not the only ones to pay a\nheavy price for their parochial outlook. The great empires of Asia –\nthe Ottoman, the Safavid, the Mughal and the Chinese – very\nquickly heard that the Europeans had discovered something big. Yet\nthey displayed little interest in these discoveries. They continued to\nbelieve that the world revolved around Asia, and made no attempt\nto compete with the Europeans for control of America or of the new\nocean lanes in the Atlantic and the Paci\x00c. Even puny European\nkingdoms such as Scotland and Denmark sent a few explore-and-\nconquer expeditions to America, but not one expedition of either\nexploration or conquest was ever sent to America from the Islamic\nworld, India or China. 
The \x00rst non-European power that tried to\nsend a military expedition to America was Japan. That happened in\nJune 1942, when a Japanese expedition conquered Kiska and Attu,\ntwo small islands o\x00 the Alaskan coast, capturing in the process ten\nUS soldiers and a dog. The Japanese never got any closer to the\nmainland.'),
  Document(id='f9a85335-955c-4d8a-8085-c1254d3f9933', metadata={'source': './Sapiens.pdf', 'page': 326}, page_content='capital was a smouldering ruin, the Aztec Empire was a thing of the\npast, and Hernán Cortés lorded over a vast new Spanish Empire in\nMexico.\nThe Spaniards did not stop to congratulate themselves or even to\ncatch their breath. They immediately commenced explore-and-\nconquer operations in all directions. The previous rulers of Central\nAmerica – the Aztecs, the Toltecs, the Maya – barely knew South\nAmerica existed, and never made any attempt to subjugate it, over\nthe course of 2,000 years. Yet within little more than ten years of\nthe Spanish conquest of Mexico, Francisco Pizarro had discovered\nthe Inca Empire in South America, vanquishing it in 1532.\nHad the Aztecs and Incas shown a bit more interest in the world\nsurrounding them – and had they known what the Spaniards had\ndone to their neighbours – they might have resisted the Spanish\nconquest more keenly and successfully. In the years separating\nColumbus’ \x00rst journey to America (1492) from the landing of\nCortés in Mexico (1519), the Spaniards conquered most of the\nCaribbean islands, setting up a chain of new colonies. For the\nsubjugated natives, these colonies were hell on earth. They were\nruled with an iron \x00st by greedy and unscrupulous colonists who\nenslaved them and set them to work in mines and plantations,\nkilling anyone who o\x00ered the slightest resistance. Most of the\nnative population soon died, either because of the harsh working\nconditions or the virulence of the diseases that hitch-hiked to\nAmerica on the conquerors’ sailing ships. Within twenty years,\nalmost the entire native Caribbean population was wiped out. The\nSpanish colonists began importing African slaves to \x00ll the vacuum.\nThis genocide took place on the very doorstep of the Aztec\nEmpire, yet when Cortés landed on the empire’s eastern coast, the\nAztecs knew nothing about it. 
The coming of the Spaniards was the\nequivalent of an alien invasion from outer space. The Aztecs were\nconvinced that they knew the entire world and that they ruled most\nof it. To them it was unimaginable that outside their domain could\nexist anything like these Spaniards. When Cortés and his men\nlanded on the sunny beaches of today’s Vera Cruz, it was the \x00rst\ntime the Aztecs encountered a completely unknown people.'),
  Document(id='3698b77a-21f7-46c9-81ca-64f7f33ee4e1', metadata={'source': './Sapiens.pdf', 'page': 147}, page_content='knots on di\x00erent cords with di\x00erent colours, it was possible to\nrecord large amounts of mathematical data relating to, for example,\ntax collection and property ownership.2\nFor hundreds, perhaps thousands of years, quipus were essential\nto the business of cities, kingdoms and empires.3 They reached their\nfull potential under the Inca Empire, which ruled 10–12 million\npeople and covered today’s Peru, Ecuador and Bolivia, as well as\nchunks of Chile, Argentina and Colombia. Thanks to quipus, the\nIncas could save and process large amounts of data, without which\nthey would not have been able to maintain the complex\nadministrative machinery that an empire of that size requires.\nIn fact, quipus were so e\x00ective and accurate that in the early\nyears following the Spanish conquest of South America, the\nSpaniards themselves employed quipus in the work of administering\ntheir new empire. The problem was that the Spaniards did not\nthemselves know how to record and read quipus, making them\ndependent on local professionals. The continent’s new rulers realised\nthat this placed them in a tenuous position – the native quipu\nexperts could easily mislead and cheat their overlords. So once\nSpain’s dominion was more \x00rmly established, quipus were phased\nout and the new empire’s records were kept entirely in Latin script\nand numerals. Very few quipus survived the Spanish occupation,\nand most of those remaining are undecipherable, since,\nunfortunately, the art of reading quipus has been lost.\nThe Wonders of Bureaucracy\nThe Mesopotamians eventually started to want to write down things\nother than monotonous mathematical data. Between 3000 BC and\n2500 BC more and more signs were added to the Sumerian system,\ngradually transforming it into a full script that we today call\ncuneiform. 
By 2500 BC, kings were using cuneiform to issue decrees,\npriests were using it to record oracles, and less exalted citizens were')]}

Comparing with the sample QA above, you can see that this answer is correct: the first retrieved chunk states that Pizarro’s expedition "numbered just 168 men".
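
Note that `generate_answer` hands the retrieved Document objects to the prompt as they are, so the prompt presumably receives their string representation, metadata included. One possible refinement is to flatten them into plain text first. The sketch below assumes each retrieved item exposes `page_content` and a `metadata` dict, as in the output above; the helper name `format_context` is hypothetical, and `SimpleNamespace` objects stand in for LangChain Documents so the snippet runs on its own.

```python
from types import SimpleNamespace


def format_context(docs) -> str:
    """Join retrieved chunks into one plain-text context string."""
    parts = []
    for doc in docs:
        # Keep the page number so an answer can be traced back to its source.
        page = doc.metadata.get("page", "?")
        parts.append(f"[page {page}]\n{doc.page_content.strip()}")
    return "\n\n".join(parts)


# Stand-ins for LangChain Documents (assumption: same attribute names).
docs = [
    SimpleNamespace(page_content="his expedition numbered just 168 men!", metadata={"page": 330}),
    SimpleNamespace(page_content="vanquishing it in 1532.", metadata={"page": 326}),
]
print(format_context(docs))
```

The returned string could then be passed as the `context` argument when formatting the prompt, while the original Document list is still returned for evaluation.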

DeepEval

We are now ready to evaluate our RAG solution using the generated QA pairs and our naive RAG implementation. The DeepEval framework needs a set of test cases (very much like unit tests). As a first look, compare the real and the generated answers for two questions:

qs = generate_qa(2)
for q in qs:
    print("\n")
    print("Question:", q["question"])
    print("Real answer:", q["answer"])
    g = generate_answer(q["question"])
    print("Generated answer:", g["answer"])
    print("Context:", g["context"])


Question: What is an example of an imagined community in ancient China?
Real answer: In ancient China, tens of millions of people saw themselves as members of a single family with the emperor as its father.
Generated answer: An example of an imagined community in ancient China is when tens of millions of people saw themselves as members of a single family with the emperor as its father. This illustrates how the concept of a large, unified community was fostered through imperial rule and cultural unity.
Context: [Document(id='6160b028-adca-4163-83a7-264eeb358a75', metadata={'source': './Sapiens.pdf', 'page': 401}, page_content='Imagined Communities\nLike the nuclear family, the community could not completely\ndisappear from our world without any emotional replacement.\nMarkets and states today provide most of the material needs once\nprovided by communities, but they must also supply tribal bonds.\nMarkets and states do so by fostering ‘imagined communities’ that\ncontain millions of strangers, and which are tailored to national and\ncommercial needs. An imagined community is a community of\npeople who don’t really know each other, but imagine that they do.\nSuch communities are not a novel invention. Kingdoms, empires and\nchurches functioned for millennia as imagined communities. In\nancient China, tens of millions of people saw themselves as\nmembers of a single family, with the emperor as its father. In the\nMiddle Ages, millions of devout Muslims imagined that they were\nall brothers and sisters in the great community of Islam. Yet\nthroughout history, such imagined communities played second\n\x00ddle to intimate communities of several dozen people who knew\neach other well. The intimate communities ful\x00lled the emotional\nneeds of their members and were essential for everyone’s survival\nand welfare. In the last two centuries, the intimate communities\nhave withered, leaving imagined communities to \x00ll in the\nemotional vacuum.\nThe two most important examples for the rise of such imagined\ncommunities are the nation and the consumer tribe. The nation is\nthe imagined community of the state. The consumer tribe is the\nimagined community of the market. Both are imagined communities\nbecause it is impossible for all customers in a market or for all\nmembers of a nation really to know one another the way villagers\nknew one another in the past. 
No German can intimately know the\nother 80 million members of the German nation, or the other 500\nmillion customers inhabiting the European Common Market (which\nevolved \x00rst into the European Community and \x00nally became the\nEuropean Union).'), Document(id='8161e761-9df3-4e4e-96fe-aa56133036fd', metadata={'source': './Sapiens.pdf', 'page': 223}, page_content='world is composed of separate nation states, in China periods of\npolitical fragmentation were seen as dark ages of chaos and\ninjustice. This perception has had far-reaching implications for\nChinese history. Every time an empire collapsed, the dominant\npolitical theory goaded the powers that be not to settle for paltry\nindependent principalities, but to attempt reuni\x00cation. Sooner or\nlater these attempts always succeeded.\nWhen They Become Us\nEmpires have played a decisive part in amalgamating many small\ncultures into fewer big cultures. Ideas, people, goods and technology\nspread more easily within the borders of an empire than in a\npolitically fragmented region. Often enough, it was the empires\nthemselves which deliberately spread ideas, institutions, customs\nand norms. One reason was to make life easier for themselves. It is\ndi\x00cult to rule an empire in which every little district has its own\nset of laws, its own form of writing, its own language and its own\nmoney. Standardisation was a boon to emperors.\nA second and equally important reason why empires actively\nspread a common culture was to gain legitimacy. At least since the\ndays of Cyrus and Qín Shǐ Huángdì, empires have justi\x00ed their\nactions – whether road-building or bloodshed – as necessary to\nspread a superior culture from which the conquered bene\x00t even\nmore than the conquerors.\nThe bene\x00ts were sometimes salient – law enforcement, urban\nplanning, standardisation of weights and measures – and sometimes\nquestionable – taxes, conscription, emperor worship. 
But most\nimperial elites earnestly believed that they were working for the\ngeneral welfare of all the empires inhabitants. China’s ruling class\ntreated their country’s neighbours and its foreign subjects as\nmiserable barbarians to whom the empire must bring the bene\x00ts of\nculture. The Mandate of Heaven was bestowed upon the emperor\nnot in order to exploit the world, but in order to educate humanity.'), Document(id='7c6c6c25-0fe0-4f4a-98d7-6da03ee2b69d', metadata={'source': './Sapiens.pdf', 'page': 52}, page_content='were their societies like? Did they have monogamous relationships\nand nuclear families? Did they have ceremonies, moral codes, sports\ncontests and religious rituals? Did they \x00ght wars? The next chapter\ntakes a peek behind the curtain of the ages, examining what life was\nlike in the millennia separating the Cognitive Revolution from the\nAgricultural Revolution.\n* Here and in the following pages, when speaking about Sapiens language, I refer to the\nbasic linguistic abilities of our species and not to a particular dialect. English, Hindi and\nChinese are all variants of Sapiens language. Apparently, even at the time of the Cognitive\nRevolution, di\x00erent Sapiens groups had di\x00erent dialects.')]


Question: What did foragers do according to the passage?
Real answer: Foragers reshaped the ecology of our planet long before the first agricultural village was built.
Generated answer: Foragers built monumental structures like Göbekli Tepe for mysterious cultural purposes. They spent less time obtaining food compared to agricultural societies and enjoyed more leisure activities. Foragers engaged in some long-term planning but generally focused on immediate needs due to their subsistence lifestyle.
Context: [Document(id='a57d4ee0-5717-4f12-80f9-34f783d42f92', metadata={'source': './Sapiens.pdf', 'page': 106}, page_content='foragers, and the complexity of their cultures, seem to be far more\nimpressive than was previously suspected.\n13. Opposite: The remains of a monumental structure from Göbekli Tepe. Right: One\nof the decorated stone pillars (about \x00ve metres high).\nWhy would a foraging society build such structures? They had no\nobvious utilitarian purpose. They were neither mammoth\nslaughterhouses nor places to shelter from rain or hide from lions.\nThat leaves us with the theory that they were built for some\nmysterious cultural purpose that archaeologists have a hard time\ndeciphering. Whatever it was, the foragers thought it worth a huge\namount of e\x00ort and time. The only way to build Göbekli Tepe was\nfor thousands of foragers belonging to di\x00erent bands and tribes to\ncooperate over an extended period of time. Only a sophisticated\nreligious or ideological system could sustain such e\x00orts.'), Document(id='17d081a6-1b88-4ae1-be4d-b634009ae632', metadata={'source': './Sapiens.pdf', 'page': 64}, page_content='The hunter-gatherer way of life di\x00ered signi\x00cantly from region to\nregion and from season to season, but on the whole foragers seem to\nhave enjoyed a more comfortable and rewarding lifestyle than most\nof the peasants, shepherds, labourers and o\x00ce clerks who followed\nin their footsteps.\nWhile people in today’s a\x00uent societies work an average of forty\nto forty-\x00ve hours a week, and people in the developing world work\nsixty and even eighty hours a week, hunter-gatherers living today in\nthe most inhospitable of habitats – such as the Kalahari Desert work\non average for just thirty-\x00ve to forty-\x00ve hours a week. They hunt\nonly one day out of three, and gathering takes up just three to six\nhours daily. In normal times, this is enough to feed the band. 
It may\nwell be that ancient hunter-gatherers living in zones more fertile\nthan the Kalahari spent even less time obtaining food and raw\nmaterials. On top of that, foragers enjoyed a lighter load of\nhousehold chores. They had no dishes to wash, no carpets to\nvacuum, no \x00oors to polish, no nappies to change and no bills to\npay.\nThe forager economy provided most people with more interesting\nlives than agriculture or industry do. Today, a Chinese factory hand\nleaves home around seven in the morning, makes her way through\npolluted streets to a sweatshop, and there operates the same\nmachine, in the same way, day in, day out, for ten long and mind-\nnumbing hours, returning home around seven in the evening in\norder to wash dishes and do the laundry. Thirty thousand years ago,\na Chinese forager might leave camp with her companions at, say,\neight in the morning. They’d roam the nearby forests and meadows,\ngathering mushrooms, digging up edible roots, catching frogs and\noccasionally running away from tigers. By early afternoon, they\nwere back at the camp to make lunch. That left them plenty of time\nto gossip, tell stories, play with the children and just hang out. Of\ncourse the tigers sometimes caught them, or a snake bit them, but\non the other hand they didn’t have to deal with automobile\naccidents and industrial pollution.\nIn most places and at most times, foraging provided ideal\nnutrition. That is hardly surprising – this had been the human diet'), Document(id='db1ceb71-ff5e-4099-a8e8-0ce5f20f0f8b', metadata={'source': './Sapiens.pdf', 'page': 117}, page_content='and more things – objects, not easily transportable, that tied them\ndown. Ancient farmers might seem to us dirt poor, but a typical\nfamily possessed more artefacts than an entire forager tribe.\nThe Coming of the Future\nWhile agricultural space shrank, agricultural time expanded.\nForagers usually didn’t waste much time thinking about next week\nor next month. 
Farmers sailed in their imagination years and\ndecades into the future.\nForagers discounted the future because they lived from hand to\nmouth and could only preserve food or accumulate possessions with\ndi\x00culty. Of course, they clearly engaged in some advanced\nplanning. The creators of the cave paintings of Chauvet, Lascaux\nand Altamira almost certainly intended them to last for generations.\nSocial alliances and political rivalries were long-term a\x00airs. It often\ntook years to repay a favour or to avenge a wrong. Nevertheless, in\nthe subsistence economy of hunting and gathering, there was an\nobvious limit to such long-term planning. Paradoxically, it saved\nforagers a lot of anxieties. There was no sense in worrying about\nthings that they could not in\x00uence.\nThe Agricultural Revolution made the future far more important\nthan it had ever been before. Farmers must always keep the future\nin mind and must work in its service. The agricultural economy was\nbased on a seasonal cycle of production, comprising long months of\ncultivation followed by short peak periods of harvest. On the night\nfollowing the end of a plentiful harvest the peasants might celebrate\nfor all they were worth, but within a week or so they were again up\nat dawn for a long day in the \x00eld. Although there was enough food\nfor today, next week, and even next month, they had to worry about\nnext year and the year after that.\nConcern about the future was rooted not only in seasonal cycles of\nproduction, but also in the fundamental uncertainty of agriculture.\nSince most villages lived by cultivating a very limited variety of')]
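
The two pairs above differ in how closely the generated answer follows the real one. A crude way to quantify this is token overlap. The sketch below computes a token-level F1 score with the standard library only. It is a rough lexical heuristic added for illustration, not what DeepEval computes, and the names `tokenize` and `token_f1` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(expected: str, actual: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    exp, act = Counter(tokenize(expected)), Counter(tokenize(actual))
    overlap = sum((exp & act).values())  # tokens shared by both answers
    if overlap == 0:
        return 0.0
    precision = overlap / sum(act.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


real = "Foragers reshaped the ecology of our planet long before the first agricultural village was built."
generated = "Foragers built monumental structures like Göbekli Tepe for mysterious cultural purposes."
print(round(token_f1(real, generated), 2))
```

A lexical score like this penalises correct but differently worded answers, which is one reason to prefer an LLM-based metric for the actual evaluation.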

Let’s wrap this into a set of DeepEval test cases. Note that the retrieval context needs to be a list of strings (not LangChain Document objects):

def create_test_cases(amount: int = 10) -> List[LLMTestCase]:
    qas = generate_qa(amount)
    test_cases = []
    for qa in qas:
        gen = generate_answer(qa["question"])
        case = LLMTestCase(
            input=qa["question"],
            expected_output=qa["answer"],
            actual_output=gen["answer"],
            # DeepEval expects plain strings, not LangChain Document objects.
            retrieval_context=[u.page_content for u in gen["context"]] if gen["context"] else None,
        )
        test_cases.append(case)
    return test_cases

test_cases = create_test_cases(10)
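
Since every test case costs several calls to a local LLM, it can be convenient to store the raw records on disk and rebuild the test cases later. The sketch below is one way to do that with plain JSON. The helper names `save_records` and `load_records` and the file name are hypothetical, and the sample record reuses one of the QA pairs from this notebook with a truncated retrieval context.

```python
import json
import tempfile
from pathlib import Path


def save_records(records: list, path: Path) -> None:
    """Write the raw test case records to a JSON file."""
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")


def load_records(path: Path) -> list:
    """Read the records back as a list of dicts."""
    return json.loads(path.read_text(encoding="utf-8"))


# Keys mirror the keyword arguments passed to LLMTestCase above.
records = [{
    "input": "What percentage of unique human DNA in modern populations in the Middle East and Europe is Neanderthal DNA?",
    "expected_output": "1–4 per cent",
    "actual_output": "According to the Interbreeding Theory, 1–4% of the unique human DNA in modern populations in the Middle East and Europe is Neanderthal DNA.",
    # Truncated excerpt of the first retrieved chunk, for illustration only.
    "retrieval_context": ["perhaps a million humans living between the Indonesian archipelago and the Iberian peninsula"],
}]

path = Path(tempfile.gettempdir()) / "rag_test_cases.json"  # hypothetical location
save_records(records, path)
restored = load_records(path)
```

Because the keys match the keyword arguments used in `create_test_cases`, each restored record should be convertible with `LLMTestCase(**record)`.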

To make DeepEval use the local Ollama model as its evaluation model, run the following in a terminal first:

deepeval set-local-model --model-name="qwen2.5:14b" --base-url="http://localhost:11434/v1/"  --api-key="nada"

You can now simply feed DeepEval with the test cases and use the metrics you like. Here a GEval metric named Correctness judges the actual output against the expected output:


 
correctness_metric = GEval(
    name="Correctness",
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)
evaluate(
    test_cases=test_cases,
    metrics=[correctness_metric],
)
✨ You're running DeepEval's latest Correctness (GEval) Metric! (using local model, strict=False, 
async_mode=True)...
Event loop is already running. Applying nest_asyncio patch to allow async execution...
Evaluating 10 test case(s) in parallel: |██████████|100% (10/10) [Time Taken: 00:30,  3.06s/test case]

======================================================================

Metrics Summary

  - ❌ Correctness (GEval) (score: 0.0, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output does not address the problem of specialization as outlined in the expected output and lacks factual correctness., error: None)

For test case:

  - input: What problem does specialization create in an economy?
  - actual output: The context does not directly address the problems created by specialization in an economy. I don't have enough information from these documents to answer that specific question accurately.
  - expected output: Specialization creates the problem of how to manage the exchange of goods between specialists when an economy of favours and obligations doesn't work with large numbers of strangers.
  - context: None
  - retrieval context: ['have repeatedly managed to trounce calculating merchants, and\neven to reshape the economy. It is therefore impossible to\nunderstand the uni\x00cation of humankind as a purely economic\nprocess. In order to understand how thousands of isolated cultures\ncoalesced over time to form the global village of today, we must\ntake into account the role of gold and silver, but we cannot\ndisregard the equally crucial role of steel.', 'The Capitalist Hell\nThere is an even more fundamental reason why it’s dangerous to\ngive markets a completely free rein. Adam Smith taught that the\nshoemaker would use his surplus to employ more assistants. This\nimplies that egoistic greed is bene\x00cial for all, since pro\x00ts are\nutilised to expand production and hire more employees.\nYet what happens if the greedy shoemaker increases his pro\x00ts by\npaying employees less and increasing their work hours? The\nstandard answer is that the free market would protect the\nemployees. If our shoemaker pays too little and demands too much,\nthe best employees would naturally abandon him and go to work for\nhis competitors. The tyrant shoemaker would \x00nd himself left with\nthe worst labourers, or with no labourers at all. He would have to\nmend his ways or go out of business. His own greed would compel\nhim to treat his employees well.\nThis sounds bulletproof in theory, but in practice the bullets get\nthrough all too easily. In a completely free market, unsupervised by\nkings and priests, avaricious capitalists can establish monopolies or\ncollude against their workforces. If there is a single corporation\ncontrolling all shoe factories in a country, or if all factory owners\nconspire to reduce wages simultaneously, then the labourers are no\nlonger able to protect themselves by switching jobs.\nEven worse, greedy bosses might curtail the workers’ freedom of\nmovement through debt peonage or slavery. 
At the end of the\nMiddle Ages, slavery was almost unknown in Christian Europe.\nDuring the early modern period, the rise of European capitalism\nwent hand in hand with the rise of the Atlantic slave trade.\nUnrestrained market forces, rather than tyrannical kings or racist\nideologues, were responsible for this calamity.\nWhen the Europeans conquered America, they opened gold and\nsilver mines and established sugar, tobacco and cotton plantations.\nThese mines and plantations became the mainstay of American\nproduction and export. The sugar plantations were particularly\nimportant. In the Middle Ages, sugar was a rare luxury in Europe. It', 'maxim that ‘When pro\x00ts increase, the landlord or weaver will\nemploy more assistants’ and not ‘When pro\x00ts increase, Scrooge will\nhoard his money in a chest and take it out only to count his coins.’\nA crucial part of the modern capitalist economy was the emergence\nof a new ethic, according to which pro\x00ts ought to be reinvested in\nproduction. This brings about more pro\x00ts, which are again\nreinvested in production, which brings more pro\x00ts, et cetera ad\nin\x00nitum. Investments can be made in many ways: enlarging the\nfactory, conducting scienti\x00c research, developing new products.\nYet all these investments must somehow increase production and\ntranslate into larger pro\x00ts. In the new capitalist creed, the \x00rst and\nmost sacred commandment is: ‘The pro\x00ts of production must be\nreinvested in increasing production.’\nThat’s why capitalism is called ‘capitalism’. Capitalism\ndistinguishes ‘capital’ from mere ‘wealth’. Capital consists of money,\ngoods and resources that are invested in production. Wealth, on the\nother hand, is buried in the ground or wasted on unproductive\nactivities. A pharaoh who pours resources into a non-productive\npyramid is not a capitalist. 
A pirate who loots a Spanish treasure\n\x00eet and buries a chest full of glittering coins on the beach of some\nCaribbean island is not a capitalist. But a hard-working factory hand\nwho reinvests part of his income in the stock market is.\nThe idea that ‘The pro\x00ts of production must be reinvested in\nincreasing production’ sounds trivial. Yet it was alien to most people\nthroughout history. In premodern times, people believed that\nproduction was more or less constant. So why reinvest your pro\x00ts\nif production won’t increase by much, no matter what you do? Thus\nmedieval noblemen espoused an ethic of generosity and conspicuous\nconsumption. They spent their revenues on tournaments, banquets,\npalaces and wars, and on charity and monumental cathedrals. Few\ntried to reinvest pro\x00ts in increasing their manors’ output,\ndeveloping better kinds of wheat, or looking for new markets.']

======================================================================

Metrics Summary

  - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output provides correct numerical values but does not explicitly state the difference in nucleobases between Sapiens and mouse genomes as specified in the expected output., error: None)

For test case:

  - input: How many more nucleobases does the Sapiens genome have compared to the mouse genome?
  - actual output: The Sapiens genome has about 2.9 billion nucleobases, while the mouse genome contains approximately 2.5 billion nucleobases. This means the Sapiens genome is only 14% larger than that of the mouse.
  - expected output: The Sapiens genome has about 400 million more nucleobases compared to the mouse genome.
  - context: None
  - retrieval context: ['But why stop even at Neanderthals? Why not go back to God’s\ndrawing board and design a better Sapiens? The abilities, needs and\ndesires of Homo sapiens have a genetic basis, and the Sapiens\ngenome is no more complex than that of voles and mice. (The\nmouse genome contains about 2.5 billion nucleobases, the Sapiens\ngenome about 2.9 billion bases – meaning the latter is only 14 per\ncent larger.)11 In the medium range – perhaps in a few decades –\ngenetic engineering and other forms of biological engineering might\nenable us to make far-reaching alterations not only to our\nphysiology, immune system and life expectancy, but also to our\nintellectual and emotional capacities. If genetic engineering can\ncreate genius mice, why not genius humans? If it can create\nmonogamous voles, why not humans hard-wired to remain faithful\nto their partners?\nThe Cognitive Revolution that turned Homo sapiens from an\ninsigni\x00cant ape into the master of the world did not require any\nnoticeable change in physiology or even in the size and external\nshape of the Sapiens brain. It apparently involved no more than a\nfew small changes to internal brain structure. Perhaps another small\nchange would be enough to ignite a Second Cognitive Revolution,\ncreate a completely new type of consciousness, and transform Homo\nsapiens into something altogether di\x00erent.\nTrue, we still don’t have the acumen to achieve this, but there\nseems to be no insurmountable technical barrier preventing us from\nproducing superhumans. The main obstacles are the ethical and\npolitical objections that have slowed down research on humans. 
And\nno matter how convincing the ethical arguments may be, it is hard\nto see how they can hold back the next step for long, especially if\nwhat is at stake is the possibility of prolonging human life\ninde\x00nitely, conquering incurable diseases, and upgrading our\ncognitive and emotional abilities.\nWhat would happen, for example, if we developed a cure for\nAlzheimer’s disease that, as a side bene\x00t, could dramatically\nimprove the memories of healthy people? Would anyone be able to\nhalt the relevant research? And when the cure is developed, could', '200,000Homo sapiens evolves in East Africa.\n70,000The Cognitive Revolution. Emergence of \x00ctive language.\nBeginning of history. Sapiens spread out of Africa.\n45,000Sapiens settle Australia. Extinction of Australian\nmegafauna.\n30,000Extinction of Neanderthals.\n16,000Sapiens settle America. Extinction of American megafauna.\n13,000Extinction of Homo \x00oresiensis. Homo sapiens the only\nsurviving human species.\n12,000The Agricultural Revolution. Domestication of plants and\nanimals. Permanent settlements.\n5,000 First kingdoms, script and money. Polytheistic religions.\n4,250 First empire – the Akkadian Empire of Sargon.\n2,500\nInvention of coinage – a universal money.\nThe Persian Empire – a universal political order ‘for the\nbene\x00t of all humans’.\nBuddhism in India – a universal truth ‘to liberate all beings\nfrom su\x00ering’.\n2,000 Han Empire in China. Roman Empire in the\nMediterranean. Christianity.\n1,400 Islam.\n500 The Scienti\x00c Revolution. Humankind admits its ignorance\nand begins to acquire unprecedented power. Europeans\nbegin to conquer America and the oceans. The entire', 'Similarly, when Sapiens reached East Asia, they interbred with the\nlocal Erectus, so the Chinese and Koreans are a mixture of Sapiens\nand Erectus.\nThe opposing view, called the ‘Replacement Theory’ tells a very\ndi\x00erent story – one of incompatibility, revulsion, and perhaps even\ngenocide. 
According to this theory, Sapiens and other humans had\ndi\x00erent anatomies, and most likely di\x00erent mating habits and\neven body odours. They would have had little sexual interest in one\nanother. And even if a Neanderthal Romeo and a Sapiens Juliet fell\nin love, they could not produce fertile children, because the genetic\ngulf separating the two populations was already unbridgeable. The\ntwo populations remained completely distinct, and when the\nNeanderthals died out, or were killed o\x00, their genes died with\nthem. According to this view, Sapiens replaced all the previous\nhuman populations without merging with them. If that is the case,\nthe lineages of all contemporary humans can be traced back,\nexclusively, to East Africa, 70,000 years ago. We are all ‘pure\nSapiens’.\nMap 1. Homo sapiens conquers the globe.']

======================================================================

Metrics Summary

  - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output provides additional context and information about Neanderthal DNA in modern populations, but it introduces a theory (Interbreeding Theory) that was not mentioned in the expected output., error: None)

For test case:

  - input: What percentage of unique human DNA in modern populations in the Middle East and Europe is Neanderthal DNA?
  - actual output: According to the Interbreeding Theory, 1–4% of the unique human DNA in modern populations in the Middle East and Europe is Neanderthal DNA.
  - expected output: 1–4 per cent
  - context: None
  - retrieval context: ['perhaps a million humans living between the Indonesian\narchipelago and the Iberian peninsula, a mere blip on the ecological\nradar.\nOur own species, Homo sapiens, was already present on the world\nstage, but so far it was just minding its own business in a corner of\nAfrica. We don’t know exactly where and when animals that can be\nclassi\x00ed as Homo sapiens \x00rst evolved from some earlier type of\nhumans, but most scientists agree that by 150,000 years ago, East\nAfrica was populated by Sapiens that looked just like us. If one of\nthem turned up in a modern morgue, the local pathologist would\nnotice nothing peculiar. Thanks to the blessings of \x00re, they had\nsmaller teeth and jaws than their ancestors, whereas they had\nmassive brains, equal in size to ours.\nScientists also agree that about 70,000 years ago, Sapiens from\nEast Africa spread into the Arabian peninsula, and from there they\nquickly overran the entire Eurasian landmass.\nWhen Homo sapiens landed in Arabia, most of Eurasia was already\nsettled by other humans. What happened to them? There are two\ncon\x00icting theories. The ‘Interbreeding Theory’ tells a story of\nattraction, sex and mingling. As the African immigrants spread\naround the world, they bred with other human populations, and\npeople today are the outcome of this interbreeding.\nFor example, when Sapiens reached the Middle East and Europe,\nthey encountered the Neanderthals. These humans were more\nmuscular than Sapiens, had larger brains, and were better adapted\nto cold climes. They used tools and \x00re, were good hunters, and\napparently took care of their sick and in\x00rm. (Archaeologists have\ndiscovered the bones of Neanderthals who lived for many years with\nsevere physical handicaps, evidence that they were cared for by\ntheir relatives.) 
Neanderthals are often depicted in caricatures as the\narchetypical brutish and stupid ‘cave people’, but recent evidence\nhas changed their image.\nAccording to the Interbreeding Theory, when Sapiens spread into\nNeanderthal lands, Sapiens bred with Neanderthals until the two\npopulations merged. If this is the case, then today’s Eurasians are\nnot pure Sapiens. They are a mixture of Sapiens and Neanderthals.', 'A lot hinges on this debate. From an evolutionary perspective,\n70,000 years is a relatively short interval. If the Replacement\nTheory is correct, all living humans have roughly the same genetic\nbaggage, and racial distinctions among them are negligible. But if\nthe Interbreeding Theory is right, there might well be genetic\ndi\x00erences between Africans, Europeans and Asians that go back\nhundreds of thousands of years. This is political dynamite, which\ncould provide material for explosive racial theories.\nIn recent decades the Replacement Theory has been the common\nwisdom in the \x00eld. It had \x00rmer archaeological backing, and was\nmore politically correct (scientists had no desire to open up the\nPandora’s box of racism by claiming signi\x00cant genetic diversity\namong modern human populations). But that ended in 2010, when\nthe results of a four-year e\x00ort to map the Neanderthal genome\nwere published. Geneticists were able to collect enough intact\nNeanderthal DNA from fossils to make a broad comparison between\nit and the DNA of contemporary humans. The results stunned the\nscienti\x00c community.\nIt turned out that 1–4 per cent of the unique human DNA of\nmodern populations in the Middle East and Europe is Neanderthal\nDNA. That’s not a huge amount, but it’s signi\x00cant. A second shock\ncame several months later, when DNA extracted from the fossilised\n\x00nger from Denisova was mapped. 
The results proved that up to 6\nper cent of the unique human DNA of modern Melanesians and\nAboriginal Australians is Denisovan DNA.\nIf these results are valid – and it’s important to keep in mind that\nfurther research is under way and may either reinforce or modify\nthese conclusions – the Interbreeders got at least some things right.\nBut that doesn’t mean that the Replacement Theory is completely\nwrong. Since Neanderthals and Denisovans contributed only a small\namount of DNA to our present-day genome, it is impossible to speak\nof a ‘merger’ between Sapiens and other human species. Although\ndi\x00erences between them were not large enough to completely\nprevent fertile intercourse, they were su\x00cient to make such\ncontacts very rare.', 'point, I will often use the term ‘Sapiens’ to denote members of the\nspecies Homo sapiens, while reserving the term ‘human’ to refer to\nall extant members of the genus Homo.\nHumans \x00rst evolved in East Africa about 2.5 million years ago\nfrom an earlier genus of apes called Australopithecus, which means\n‘Southern Ape’. About 2 million years ago, some of these archaic\nmen and women left their homeland to journey through and settle\nvast areas of North Africa, Europe and Asia. Since survival in the\nsnowy forests of northern Europe required di\x00erent traits than those\nneeded to stay alive in Indonesia’s steaming jungles, human\npopulations evolved in di\x00erent directions. The result was several\ndistinct species, to each of which scientists have assigned a pompous\nLatin name.\n2. Our siblings, according to speculative reconstructions (left to right):\nHomo rudolfensis (East Africa); Homo erectus (East Asia); and Homo neanderthalensis\n(Europe and western Asia). All are humans.\nHumans in Europe and western Asia evolved into Homo\nneanderthalensis (‘Man from the Neander Valley), popularly referred\nto simply as ‘Neanderthals’. 
Neanderthals, bulkier and more\nmuscular than us Sapiens, were well adapted to the cold climate of\nIce Age western Eurasia. The more eastern regions of Asia were']

======================================================================

Metrics Summary

  - ❌ Correctness (GEval) (score: 0.2, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output does not provide a specific number of British officials, contradicting the expected factual information about fewer than 5,000 officials governing up to 300 million Indians., error: None)

For test case:

  - input: How many British officials were sufficient to govern up to 300 million Indians during the nineteenth and early twentieth centuries?
  - actual output: The context does not provide a specific number of British officials needed to govern India during that period. However, it's clear from the documents that despite the vast population, the British managed governance along with significant cultural and infrastructural changes. The exact administrative numbers are not detailed in the provided texts.
  - expected output: Fewer than 5,000 British officials were sufficient to govern up to 300 million Indians during the nineteenth and early twentieth centuries.
  - context: None
  - retrieval context: ['passionate cricket players and chai (tea) drinkers, and both game\nand beverage are British legacies. Commercial tea farming did not\nexist in India until the mid-nineteenth century, when it was\nintroduced by the British East India Company. It was the snobbish\nBritish sahibs who spread the custom of tea drinking throughout the\nsubcontinent.\n28. The Chhatrapati Shivaji train station in Mumbai. It began its life as Victoria\nStation, Bombay. The British built it in the Neo-Gothic style that was popular in late\nnineteenth-century Britain. A Hindu nationalist government changed the names of\nboth city and station, but showed no appetite for razing such a magni\x00cent building,\neven if it was built by foreign oppressors.\nHow many Indians today would want to call a vote to divest\nthemselves of democracy, English, the railway network, the legal\nsystem, cricket and tea on the grounds that they are imperial\nlegacies? And if they did, wouldn’t the very act of calling a vote to\ndecide the issue demonstrate their debt to their former overlords?', 'a century, maintaining a huge military force of up to 350,000\nsoldiers, considerably outnumbering the armed forces of the British\nmonarchy. Only in 1858 did the British crown nationalise India\nalong with the company’s private army. Napoleon made fun of the\nBritish, calling them a nation of shopkeepers. Yet these shopkeepers\ndefeated Napoleon himself, and their empire was the largest the\nworld has ever seen.\nIn the Name of Capital\nThe nationalisation of Indonesia by the Dutch crown (1800) and of\nIndia by the British crown (1858) hardly ended the embrace of\ncapitalism and empire. On the contrary, the connection only grew\nstronger during the nineteenth century. 
Joint-stock companies no\nlonger needed to establish and govern private colonies – their\nmanagers and large shareholders now pulled the strings of power in\nLondon, Amsterdam and Paris, and they could count on the state to\nlook after their interests. As Marx and other social critics quipped,\nWestern governments were becoming a capitalist trade union.\nThe most notorious example of how governments did the bidding\nof big money was the First Opium War, fought between Britain and\nChina (1840–42). In the \x00rst half of the nineteenth century, the\nBritish East India Company and sundry British business people made\nfortunes by exporting drugs, particularly opium, to China. Millions\nof Chinese became addicts, debilitating the country both\neconomically and socially. In the late 1830s the Chinese government\nissued a ban on drug tra\x00cking, but British drug merchants simply\nignored the law. Chinese authorities began to con\x00scate and destroy\ndrug cargos. The drug cartels had close connections in Westminster\nand Downing Street – many MPs and Cabinet ministers in fact held\nstock in the drug companies – so they pressured the government to\ntake action.\nIn 1840 Britain duly declared war on China in the name of ‘free\ntrade’. It was a walkover. The overcon\x00dent Chinese were no match', 'Modern science and modern empires were motivated by the restless\nfeeling that perhaps something important awaited beyond the\nhorizon – something they had better explore and master. Yet the\nconnection between science and empire went much deeper. Not just\nthe motivation, but also the practices of empire-builders were\nentangled with those of scientists. 
For modern Europeans, building\nan empire was a scienti\x00c project, while setting up a scienti\x00c\ndiscipline was an imperial project.\nWhen the Muslims conquered India, they did not bring along\narchaeologists to systematically study Indian history,\nanthropologists to study Indian cultures, geologists to study Indian\nsoils, or zoologists to study Indian fauna. When the British\nconquered India, they did all of these things. On 10 April 1802 the\nGreat Survey of India was launched. It lasted sixty years. With the\nhelp of tens of thousands of native labourers, scholars and guides,\nthe British carefully mapped the whole of India, marking borders,\nmeasuring distances, and even calculating for the \x00rst time the exact\nheight of Mount Everest and the other Himalayan peaks. The British\nexplored the military resources of Indian provinces and the location\nof their gold mines, but they also took the trouble to collect\ninformation about rare Indian spiders, to catalogue colourful\nbutter\x00ies, to trace the ancient origins of extinct Indian languages,\nand to dig up forgotten ruins.\nMohenjo-daro was one of the chief cities of the Indus Valley\ncivilisation, which \x00ourished in the third millennium BC and was\ndestroyed around 1900 BC. None of India’s pre-British rulers –\nneither the Mauryas, nor the Guptas, nor the Delhi sultans, nor the\ngreat Mughals – had given the ruins a second glance. But a British\narchaeological survey took notice of the site in 1922. A British team\nthen excavated it, and discovered the \x00rst great civilisation of India,\nwhich no Indian had been aware of.\nAnother telling example of British scienti\x00c curiosity was the\ndeciphering of cuneiform script. This was the main script used\nthroughout the Middle East for close to 3,000 years, but the last\nperson able to read it probably died sometime in the early \x00rst']

======================================================================

Metrics Summary

  - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output provides additional context and details beyond the expected output, but it correctly includes the 7,500 hours mentioned in the expected output., error: None)

For test case:

  - input: How many hours did it take an experienced artisan to prepare 10,000 ivory beads?
  - actual output: It took approximately 7,500 hours for an experienced artisan to prepare 10,000 ivory beads. This equates to well over three years of labor.
  - expected output: 7,500 hours
  - context: None
  - retrieval context: ['and various ivory objects. A skilled craftsman (or craftswoman)\nprobably needed about forty-\x00ve minutes to prepare a single ivory\nbead. In other words, fashioning the 10,000 ivory beads that\ncovered the two children, not to mention the other objects, required\nsome 7,500 hours of delicate work, well over three years of labour\nby an experienced artisan!\nIt is highly unlikely that at such a young age the Sungir children\nhad proved themselves as leaders or mammoth-hunters. Only\ncultural beliefs can explain why they received such an extravagant\nburial. One theory is that they owed their rank to their parents.\nPerhaps they were the children of the leader, in a culture that\nbelieved in either family charisma or strict rules of succession.\nAccording to a second theory, the children had been identi\x00ed at\nbirth as the incarnations of some long-dead spirits. A third theory\nargues that the children’s burial re\x00ects the way they died rather\nthan their status in life. They were ritually sacri\x00ced – perhaps as\npart of the burial rites of the leader – and then entombed with pomp\nand circumstance.9\nWhatever the correct answer, the Sungir children are among the\nbest pieces of evidence that 30,000 years ago Sapiens could invent\nsociopolitical codes that went far beyond the dictates of our DNA\nand the behaviour patterns of other human and animal species.\nPeace or War?\nFinally, there’s the thorny question of the role of war in forager\nsocieties. Some scholars imagine ancient hunter-gatherer societies as\npeaceful paradises, and argue that war and violence began only with\nthe Agricultural Revolution, when people started to accumulate\nprivate property. Other scholars maintain that the world of the\nancient foragers was exceptionally cruel and violent. 
Both schools of\nthought are castles in the air, connected to the ground by the thin\nstrings of meagre archaeological remains and anthropological\nobservations of present-day foragers.', 'production stands at 30 million tons per year. Napoleon III would be\nsurprised to hear that his subjects’ descendants use cheap disposable\naluminium foil to wrap their sandwiches and put away their\nleftovers.\nTwo thousand years ago, when people in the Mediterranean basin\nsu\x00ered from dry skin they smeared olive oil on their hands. Today,\nthey open a tube of hand cream. Below is the list of ingredients of a\nsimple modern hand cream that I bought at a local store:\ndeionised water, stearic acid, glycerin, caprylic/caprictiglyceride, propylene glycol,\nisopropyl myristate, panax ginseng root extract, fragrance, cetyl alcohol, triethanolamine,\ndimeticone, arctostaphylos uva-ursi leaf extract, magnesium ascorbyl phosphate,\nimidazolidinyl urea, methyl paraben, camphor, propyl paraben, hydroxyisohexyl 3-\ncyclohexene carboxaldehyde, hydroxyl-citronellal, linalool, butylphenyl methylproplonal,\ncitronnellol, limonene, geraniol.\nAlmost all of these ingredients were invented or discovered in the\nlast two centuries.\nDuring World War One, Germany was placed under blockade and\nsu\x00ered severe shortages of raw materials, in particular saltpetre, an\nessential ingredient in gunpowder and other explosives. The most\nimportant saltpetre deposits were in Chile and India; there were\nnone at all in Germany. True, saltpetre could be replaced by\nammonia, but that was expensive to produce as well. Luckily for the\nGermans, one of their fellow citizens, a Jewish chemist named Fritz\nHaber, had discovered in 1908 a process for producing ammonia\nliterally out of thin air. When war broke out, the Germans used\nHaber’s discovery to commence industrial production of explosives\nusing air as a raw material. 
Some scholars believe that if it hadn’t\nbeen for Haber’s discovery, Germany would have been forced to\nsurrender long before November 1918.6 The discovery won Haber\n(who during the war also pioneered the use of poison gas in battle)\na Nobel Prize in 1918. In chemistry, not in peace.', 'Part One\nThe Cognitive Revolution\n1. A human handprint made about 30,000 years ago, on the wall of the Chauvet-\nPont-d’Arc Cave in southern France. Somebody tried to say, ‘I was here!’']

======================================================================

Metrics Summary

  - ❌ Correctness (GEval) (score: 0.2, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output introduces new information about empires and their stability, which was not mentioned in the expected output., error: None)

For test case:

  - input: How long has empire been the most common form of political organization according to the context?
  - actual output: Empire has been the world's most common form of political organization for the last 2,500 years. Most humans during this period have lived within empires. This form of government is described as very stable and enduring.
  - expected output: The last 2,500 years
  - context: None
  - retrieval context: ['them to oblivion. Empires, too, ultimately fall, but they tend to\nleave behind rich and enduring legacies. Almost all people in the\ntwenty-\x00rst century are the o\x00spring of one empire or another.\nWhat is an Empire?\nAn empire is a political order with two important characteristics.\nFirst, to qualify for that designation you have to rule over a\nsigni\x00cant number of distinct peoples, each possessing a di\x00erent\ncultural identity and a separate territory. How many peoples\nexactly? Two or three is not su\x00cient. Twenty or thirty is plenty.\nThe imperial threshold passes somewhere in between.\nSecond, empires are characterised by \x00exible borders and a\npotentially unlimited appetite. They can swallow and digest more\nand more nations and territories without altering their basic\nstructure or identity. The British state of today has fairly clear\nborders that cannot be exceeded without altering the fundamental\nstructure and identity of the state. A century ago almost any place\non earth could have become part of the British Empire.\nCultural diversity and territorial \x00exibility give empires not only\ntheir unique character, but also their central role in history. It’s\nthanks to these two characteristics that empires have managed to\nunite diverse ethnic groups and ecological zones under a single\npolitical umbrella, thereby fusing together larger and larger\nsegments of the human species and of planet Earth.\nIt should be stressed that an empire is de\x00ned solely by its\ncultural diversity and \x00exible borders, rather than by its origins, its\nform of government, its territorial extent, or the size of its\npopulation. An empire need not emerge from military conquest. The\nAthenian Empire began its life as a voluntary league, and the\nHabsburg Empire was born in wedlock, cobbled together by a string\nof shrewd marriage alliances. Nor must an empire be ruled by an\nautocratic emperor. 
The British Empire, the largest empire in\nhistory, was ruled by a democracy. Other democratic (or at least', 'republican) empires have included the modern Dutch, French,\nBelgian and American empires, as well as the premodern empires of\nNovgorod, Rome, Carthage and Athens.\nSize, too, does not really matter. Empires can be puny. The\nAthenian Empire at its zenith was much smaller in size and\npopulation than today’s Greece. The Aztec Empire was smaller than\ntoday’s Mexico. Both were nevertheless empires, whereas modern\nGreece and modern Mexico are not, because the former gradually\nsubdued dozens and even hundreds of di\x00erent polities while the\nlatter have not. Athens lorded it over more than a hundred formerly\nindependent city states, whereas the Aztec Empire, if we can trust\nits taxation records, ruled 371 di\x00erent tribes and peoples.1\nHow was it possible to squeeze such a human potpourri into the\nterritory of a modest modern state? It was possible because in the\npast there were many more distinct peoples in the world, each of\nwhich had a smaller population and occupied less territory than\ntoday’s typical people. The land between the Mediterranean and the\nJordan River, which today struggles to satisfy the ambitions of just\ntwo peoples, easily accommodated in biblical times dozens of\nnations, tribes, petty kingdoms and city states.\nEmpires were one of the main reasons for the drastic reduction in\nhuman diversity. The imperial steamroller gradually obliterated the\nunique characteristics of numerous peoples (such as the\nNumantians), forging out of them new and much larger groups.\nEvil Empires?\nIn our time, ‘imperialist’ ranks second only to ‘fascist’ in the lexicon\nof political swear words. The contemporary critique of empires\ncommonly takes two forms:\n1. Empires do not work. In the long run, it is not possible to rule\ne\x00ectively over a large number of conquered peoples.\n2. 
Even if it can be done, it should not be done, because empires\nare evil engines of destruction and exploitation. Every people has a', 'right to self-determination, and should never be subject to the rule\nof another.\nFrom a historical perspective, the \x00rst statement is plain\nnonsense, and the second is deeply problematic.\nThe truth is that empire has been the world’s most common form\nof political organisation for the last 2,500 years. Most humans\nduring these two and a half millennia have lived in empires. Empire\nis also a very stable form of government. Most empires have found it\nalarmingly easy to put down rebellions. In general, they have been\ntoppled only by external invasion or by a split within the ruling\nelite. Conversely, conquered peoples don’t have a very good record\nof freeing themselves from their imperial overlords. Most have\nremained subjugated for hundreds of years. Typically, they have\nbeen slowly digested by the conquering empire, until their distinct\ncultures \x00zzled out.\nFor example, when the Western Roman Empire \x00nally fell to\ninvading Germanic tribes in 476 AD, the Numantians, Arverni,\nHelvetians, Samnites, Lusitanians, Umbrians, Etruscans and\nhundreds of other forgotten peoples whom the Romans conquered\ncenturies earlier did not emerge from the empires eviscerated\ncarcass like Jonah from the belly of the great \x00sh. None of them\nwere left. The biological descendants of the people who had\nidenti\x00ed themselves as members of those nations, who had spoken\ntheir languages, worshipped their gods and told their myths and\nlegends, now thought, spoke and worshipped as Romans.\nIn many cases, the destruction of one empire hardly meant\nindependence for subject peoples. Instead, a new empire stepped\ninto the vacuum created when the old one collapsed or retreated.\nNowhere has this been more obvious than in the Middle East. 
The\ncurrent political constellation in that region – a balance of power\nbetween many independent political entities with more or less\nstable borders – is almost without parallel any time in the last\nseveral millennia. The last time the Middle East experienced such a\nsituation was in the eighth century BC – almost 3,000 years ago!\nFrom the rise of the Neo-Assyrian Empire in the eighth century BC']

======================================================================

Metrics Summary

  - ❌ Correctness (GEval) (score: 0.1, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output discusses Newton's law instead of the Theory of Relativity as expected. It incorrectly addresses gravity and mass without providing the correct equation related to the Theory of Relativity., error: None)

For test case:

  - input: What theory provides the equation for calculating the acceleration of mass under the influence of gravity?
  - actual output: The theory that provides the equation for calculating the acceleration of mass under the influence of gravity is Newton's law of universal gravitation. However, the context mentions Newton but asks about an equation related to gravity and mass; it hints at the Theory of Relativity without providing the specific equation. I don't have the exact equation from the given context.
  - expected output: The Theory of Relativity
  - context: None
  - retrieval context: ['world is a battleground between good and evil. An evil force created\nmatter, while a good force created spirit. Humans are caught\nbetween these two forces, and should choose good over evil. Yet the\nprophet Mani made no attempt to o\x00er a mathematical formula that\ncould be used to predict human choices by quantifying the\nrespective strength of these two forces. He never calculated that ‘the\nforce acting on a man is equal to the acceleration of his spirit\ndivided by the mass of his body’.\nThis is exactly what scientists seek to accomplish. In 1687, Isaac\nNewton published The Mathematical Principles of Natural Philosophy,\narguably the most important book in modern history. Newton\npresented a general theory of movement and change. The greatness\nof Newton’s theory was its ability to explain and predict the\nmovements of all bodies in the universe, from falling apples to\nshooting stars, using three very simple mathematical laws:\nHenceforth, anyone who wished to understand and predict the\nmovement of a cannonball or a planet simply had to make\nmeasurements of the object’s mass, direction and acceleration, and\nthe forces acting on it. By inserting these numbers into Newton’s\nequations, the future position of the object could be predicted. It\nworked like magic. Only around the end of the nineteenth century\ndid scientists come across a few observations that did not \x00t well\nwith Newton’s laws, and these led to the next revolutions in physics\n– the theory of relativity and quantum mechanics.', 'subtraction and multiplication), the basis of modern mathematical\nnotation came into being.\nAlthough this system of writing remains a partial script, it has\nbecome the world’s dominant language. Almost all states,\ncompanies, organisations and institutions – whether they speak\nArabic, Hindi, English or Norwegian – use mathematical script to\nrecord and process data. 
Every piece of information that can be\ntranslated into mathematical script is stored, spread and processed\nwith mind-boggling speed and e\x00ciency.\nA person who wishes to in\x00uence the decisions of governments,\norganisations and companies must therefore learn to speak in\nnumbers. Experts do their best to translate even ideas such as\n‘poverty’, ‘happiness’ and ‘honesty’ into numbers (‘the poverty line’,\n‘subjective well-being levels’, ‘credit rating’). Entire \x00elds of\nknowledge, such as physics and engineering, have already lost\nalmost all touch with the spoken human language, and are\nmaintained solely by mathematical script.\nAn equation for calculating the acceleration of mass i  under the in\x00uence of gravity,\naccording to the Theory of Relativity. When most laypeople see such an equation,\nthey usually panic and freeze, like a deer caught in the headlights of a speeding', '34. Benjamin Franklin disarming the gods.\nA famous example is lightning. Many cultures believed that\nlightning was the hammer of an angry god, used to punish sinners.\nIn the middle of the eighteenth century, in one of the most\ncelebrated experiments in scienti\x00c history, Benjamin Franklin \x00ew\na kite during a lightning storm to test the hypothesis that lightning\nis simply an electric current. Franklins empirical observations,\ncoupled with his knowledge about the qualities of electrical energy,\nenabled him to invent the lightning rod and disarm the gods.\nPoverty is another case in point. Many cultures have viewed\npoverty as an inescapable part of this imperfect world. According to\nthe New Testament, shortly before the cruci\x00xion a woman\nanointed Christ with precious oil worth 300 denarii. Jesus’ disciples\nscolded the woman for wasting such a huge sum of money instead\nof giving it to the poor, but Jesus defended her, saying that ‘The\npoor you will always have with you, and you can help them any\ntime you want. 
But you will not always have me’ (Mark 14:7).\nToday, fewer and fewer people, including fewer and fewer\nChristians, agree with Jesus on this matter. Poverty is increasingly']

======================================================================

Metrics Summary

  - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output provides additional context about Peugeot SA, but it slightly deviates from the expected output which only required the name. The inclusion of extra information is helpful but not aligned with the exact criteria., error: None)

For test case:

  - input: What is the official name of the carmaker that began as a small family business in Valentigney?
  - actual output: The official name of the carmaker that began as a small family business in Valentigney is Peugeot SA.
  - expected output: Peugeot SA
  - context: None
  - retrieval context: ['Modern business-people and lawyers are, in fact, powerful sorcerers.\nThe principal di\x00erence between them and tribal shamans is that\nmodern lawyers tell far stranger tales. The legend of Peugeot a\x00ords\nus a good example.\nAn icon that somewhat resembles the Stadel lion-man appears today\non cars, trucks and motorcycles from Paris to Sydney. It’s the hood\nornament that adorns vehicles made by Peugeot, one of the oldest\nand largest of Europe’s carmakers. Peugeot began as a small family\nbusiness in the village of Valentigney, just 300 kilometres from the\nStadel Cave. Today the company employs about 200,000 people\nworldwide, most of whom are complete strangers to each other.\nThese strangers cooperate so e\x00ectively that in 2008 Peugeot\nproduced more than 1.5 million automobiles, earning revenues of\nabout 55 billion euros.\nIn what sense can we say that Peugeot SA (the company’s o\x00cial\nname) exists? There are many Peugeot vehicles, but these are\nobviously not the company. Even if every Peugeot in the world were\nsimultaneously junked and sold for scrap metal, Peugeot SA would\nnot disappear. It would continue to manufacture new cars and issue\nits annual report. The company owns factories, machinery and\nshowrooms, and employs mechanics, accountants and secretaries,\nbut all these together do not comprise Peugeot. A disaster might kill\nevery single one of Peugeot’s employees, and go on to destroy all of\nits assembly lines and executive o\x00ces. Even then, the company\ncould borrow money, hire new employees, build new factories and\nbuy new machinery. Peugeot has managers and shareholders, but\nneither do they constitute the company. All the managers could be\ndismissed and all its shares sold, but the company itself would\nremain intact.', '5. The Peugeot Lion\nIt doesn’t mean that Peugeot SA is invulnerable or immortal. 
If a\njudge were to mandate the dissolution of the company, its factories\nwould remain standing and its workers, accountants, managers and\nshareholders would continue to live – but Peugeot SA would\nimmediately vanish. In short, Peugeot SA seems to have no essential\nconnection to the physical world. Does it really exist?\nPeugeot is a \x00gment of our collective imagination. Lawyers call\nthis a ‘legal \x00ction’. It can’t be pointed at; it is not a physical object.\nBut it exists as a legal entity. Just like you or me, it is bound by the\nlaws of the countries in which it operates. It can open a bank\naccount and own property. It pays taxes, and it can be sued and\neven prosecuted separately from any of the people who own or\nwork for it.\nPeugeot belongs to a particular genre of legal \x00ctions called\n‘limited liability companies’. The idea behind such companies is\namong humanity’s most ingenious inventions. Homo sapiens lived for\nuntold millennia without them. During most of recorded history\nproperty could be owned only by \x00esh-and-blood humans, the kind\nthat stood on two legs and had big brains. If in thirteenth-century\nFrance Jean set up a wagon-manufacturing workshop, he himself\nwas the business. If a wagon he’d made broke down a week after\npurchase, the disgruntled buyer would have sued Jean personally. If\nJean had borrowed 1,000 gold coins to set up his workshop and the', 'How exactly did Armand Peugeot, the man, create Peugeot, the\ncompany? In much the same way that priests and sorcerers have\ncreated gods and demons throughout history, and in which\nthousands of French curés were still creating Christ’s body every\nSunday in the parish churches. It all revolved around telling stories,\nand convincing people to believe them. In the case of the French\ncurés, the crucial story was that of Christ’s life and death as told by\nthe Catholic Church. 
According to this story, if a Catholic priest\ndressed in his sacred garments solemnly said the right words at the\nright moment, mundane bread and wine turned into God’s \x00esh and\nblood. The priest exclaimed ‘Hoc est corpus meum!’ (Latin for ‘This is\nmy body!’) and hocus pocus – the bread turned into Christ’s \x00esh.\nSeeing that the priest had properly and assiduously observed all the\nprocedures, millions of devout French Catholics behaved as if God\nreally existed in the consecrated bread and wine.\nIn the case of Peugeot SA the crucial story was the French legal\ncode, as written by the French parliament. According to the French\nlegislators, if a certi\x00ed lawyer followed all the proper liturgy and\nrituals, wrote all the required spells and oaths on a wonderfully\ndecorated piece of paper, and a\x00xed his ornate signature to the\nbottom of the document, then hocus pocus – a new company was\nincorporated. When in 1896 Armand Peugeot wanted to create his\ncompany, he paid a lawyer to go through all these sacred\nprocedures. Once the lawyer had performed all the right rituals and\npronounced all the necessary spells and oaths, millions of upright\nFrench citizens behaved as if the Peugeot company really existed.\nTelling e\x00ective stories is not easy. The di\x00culty lies not in\ntelling the story, but in convincing everyone else to believe it. Much\nof history revolves around this question: how does one convince\nmillions of people to believe particular stories about gods, or\nnations, or limited liability companies? Yet when it succeeds, it\ngives Sapiens immense power, because it enables millions of\nstrangers to cooperate and work towards common goals. Just try to\nimagine how di\x00cult it would have been to create states, or\nchurches, or legal systems if we could speak only about things that\nreally exist, such as rivers, trees and lions.']

======================================================================

Metrics Summary

  - ❌ Correctness (GEval) (score: 0.0, threshold: 0.5, strict: False, evaluation model: local model, reason: Expected output was a numeric value (1847), but actual output is a sentence unrelated to the expected format., error: None)

For test case:

  - input: In what year did British train companies agree to use Greenwich Observatory time for all timetables?
  - actual output: British train companies agreed to use Greenwich Observatory time for all timetables in 1847.
  - expected output: 1847
  - context: None
  - retrieval context: ['train or bus had to reach the factory gate by 07:55. A few minutes’\ndelay would lower production and perhaps even lead to the lay-o\x00s\nof the unfortunate latecomers. In 1784 a carriage service with a\npublished schedule began operating in Britain. Its timetable\nspeci\x00ed only the hour of departure, not arrival. Back then, each\nBritish city and town had its own local time, which could di\x00er\nfrom London time by up to half an hour. When it was 12:00 in\nLondon, it was perhaps 12:20 in Liverpool and 11:50 in Canterbury.\nSince there were no telephones, no radio or television, and no fast\ntrains – who could know, and who cared?2\nThe \x00rst commercial train service began operating between\nLiverpool and Manchester in 1830. Ten years later, the \x00rst train\ntimetable was issued. The trains were much faster than the old\ncarriages, so the quirky di\x00erences in local hours became a severe\nnuisance. In 1847, British train companies put their heads together\nand agreed that henceforth all train timetables would be calibrated\nto Greenwich Observatory time, rather than the local times of\nLiverpool, Manchester or Glasgow. More and more institutions\nfollowed the lead of the train companies. Finally, in 1880, the\nBritish government took the unprecedented step of legislating that\nall timetables in Britain must follow Greenwich. For the \x00rst time in\nhistory, a country adopted a national time and obliged its\npopulation to live according to an arti\x00cial clock rather than local\nones or sunrise-to-sunset cycles.\nThis modest beginning spawned a global network of timetables,\nsynchronised down to the tiniest fractions of a second. When the\nbroadcast media – \x00rst radio, then television – made their debut,\nthey entered a world of timetables and became its main enforcers\nand evangelists. 
Among the \x00rst things radio stations broadcast\nwere time signals, beeps that enabled far-\x00ung settlements and ships\nat sea to set their clocks. Later, radio stations adopted the custom of\nbroadcasting the news every hour. Nowadays, the \x00rst item of every\nnews broadcast – more important even than the outbreak of war – is\nthe time. During World War Two, BBC News was broadcast to Nazi-\noccupied Europe. Each news programme opened with a live\nbroadcast of Big Ben tolling the hour – the magical sound of', 'no. 5 has overslept, it stalls all the other machines. In order to\nprevent such calamities, everybody must adhere to a precise\ntimetable. Each worker arrives at work at exactly the same time.\nEverybody takes their lunch break together, whether they are\nhungry or not. Everybody goes home when a whistle announces that\nthe shift is over – not when they have \x00nished their project.\n42. Charlie Chaplin as a simple worker caught in the wheels of the industrial\nassembly line, from the \x00lm Modern Times (1936).\nThe Industrial Revolution turned the timetable and the assembly\nline into a template for almost all human activities. Shortly after\nfactories imposed their time frames on human behaviour, schools\ntoo adopted precise timetables, followed by hospitals, government\no\x00ces and grocery stores. Even in places devoid of assembly lines\nand machines, the timetable became king. If the shift at the factory\nends at 5 p.m., the local pub had better be open for business by\n5:02.\nA crucial link in the spreading timetable system was public\ntransportation. If workers needed to start their shift by 08:00, the', 'freedom. Ingenious German physicists found a way to determine the\nweather conditions in London based on tiny di\x00erences in the tone\nof the broadcast ding-dongs. This information o\x00ered invaluable\nhelp to the Luftwa\x00e. 
When the British Secret Service discovered\nthis, they replaced the live broadcast with a set recording of the\nfamous clock.\nIn order to run the timetable network, cheap but precise portable\nclocks became ubiquitous. In Assyrian, Sassanid or Inca cities there\nmight have been at most a few sundials. In European medieval cities\nthere was usually a single clock – a giant machine mounted on top\nof a high tower in the town square. These tower clocks were\nnotoriously inaccurate, but since there were no other clocks in town\nto contradict them, it hardly made any di\x00erence. Today, a single\na\x00uent family generally has more timepieces at home than an\nentire medieval country. You can tell the time by looking at your\nwristwatch, glancing at your Android, peering at the alarm clock by\nyour bed, gazing at the clock on the kitchen wall, staring at the\nmicrowave, catching a glimpse of the TV or DVD, or taking in the\ntaskbar on your computer out of the corner of your eye. You need to\nmake a conscious e\x00ort not to know what time it is.\nThe typical person consults these clocks several dozen times a\nday, because almost everything we do has to be done on time. An\nalarm clock wakes us up at 7 a.m., we heat our frozen bagel for\nexactly \x00fty seconds in the microwave, brush our teeth for three\nminutes until the electric toothbrush beeps, catch the 07:40 train to\nwork, run on the treadmill at the gym until the beeper announces\nthat half an hour is over, sit down in front of the TV at 7 p.m. to\nwatch our favourite show, get interrupted at preordained moments\nby commercials that cost $1,000 per second, and eventually unload\nall our angst on a therapist who restricts our prattle to the now\nstandard \x00fty-minute therapy hour.\nThe Industrial Revolution brought about dozens of major upheavals\nin human society. Adapting to industrial time is just one of them.\nOther notable examples include urbanisation, the disappearance of']

======================================================================

Metrics Summary

  - ❌ Correctness (GEval) (score: 0.0, threshold: 0.5, strict: False, evaluation model: local model, reason: The actual output does not provide a percentage as expected, but instead states that it lacks specific statistical information from the given documents., error: None)

For test case:

  - input: What percentage of China's population sees themselves and is seen by others as Han?
  - actual output: The provided context does not contain information about the percentage of China's population that identifies as Han or is seen as Han. I don't have this specific statistic from the given documents.
  - expected output: More than 90 percent
  - context: None
  - retrieval context: ['world is composed of separate nation states, in China periods of\npolitical fragmentation were seen as dark ages of chaos and\ninjustice. This perception has had far-reaching implications for\nChinese history. Every time an empire collapsed, the dominant\npolitical theory goaded the powers that be not to settle for paltry\nindependent principalities, but to attempt reuni\x00cation. Sooner or\nlater these attempts always succeeded.\nWhen They Become Us\nEmpires have played a decisive part in amalgamating many small\ncultures into fewer big cultures. Ideas, people, goods and technology\nspread more easily within the borders of an empire than in a\npolitically fragmented region. Often enough, it was the empires\nthemselves which deliberately spread ideas, institutions, customs\nand norms. One reason was to make life easier for themselves. It is\ndi\x00cult to rule an empire in which every little district has its own\nset of laws, its own form of writing, its own language and its own\nmoney. Standardisation was a boon to emperors.\nA second and equally important reason why empires actively\nspread a common culture was to gain legitimacy. At least since the\ndays of Cyrus and Qín Shǐ Huángdì, empires have justi\x00ed their\nactions – whether road-building or bloodshed – as necessary to\nspread a superior culture from which the conquered bene\x00t even\nmore than the conquerors.\nThe bene\x00ts were sometimes salient – law enforcement, urban\nplanning, standardisation of weights and measures – and sometimes\nquestionable – taxes, conscription, emperor worship. But most\nimperial elites earnestly believed that they were working for the\ngeneral welfare of all the empires inhabitants. China’s ruling class\ntreated their country’s neighbours and its foreign subjects as\nmiserable barbarians to whom the empire must bring the bene\x00ts of\nculture. 
The Mandate of Heaven was bestowed upon the emperor\nnot in order to exploit the world, but in order to educate humanity.', 'between rulers and ruled, it has still recognised the basic unity of\nthe entire world, the existence of a single set of principles governing\nall places and times, and the mutual responsibilities of all human\nbeings. Humankind is seen as a large family: the privileges of the\nparents go hand in hand with responsibility for the welfare of the\nchildren.\nThis new imperial vision passed from Cyrus and the Persians to\nAlexander the Great, and from him to Hellenistic kings, Roman\nemperors, Muslim caliphs, Indian dynasts, and eventually even to\nSoviet premiers and American presidents. This benevolent imperial\nvision has justi\x00ed the existence of empires, and negated not only\nattempts by subject peoples to rebel, but also attempts by\nindependent peoples to resist imperial expansion.\nSimilar imperial visions were developed independently of the\nPersian model in other parts of the world, most notably in Central\nAmerica, in the Andean region, and in China. According to\ntraditional Chinese political theory, Heaven (Tian) is the source of\nall legitimate authority on earth. Heaven chooses the most worthy\nperson or family and gives them the Mandate of Heaven. This\nperson or family then rules over All Under Heaven (Tianxia) for the\nbene\x00t of all its inhabitants. Thus, a legitimate authority is – by\nde\x00nition – universal. If a ruler lacks the Mandate of Heaven, then\nhe lacks legitimacy to rule even a single city. If a ruler enjoys the\nmandate, he is obliged to spread justice and harmony to the entire\nworld. 
The Mandate of Heaven could not be given to several\ncandidates simultaneously, and consequently one could not\nlegitimise the existence of more than one independent state.\nThe \x00rst emperor of the united Chinese empire, Qín Shǐ Huángdì,\nboasted that ‘throughout the six directions [of the universe]\neverything belongs to the emperor\xa0…\xa0wherever there is a human\nfootprint, there is not one who did not become a subject [of the\nemperor]\xa0…\xa0his kindness reaches even oxen and horses. There is not\none who did not bene\x00t. Every man is safe under his own roof.’4 In\nChinese political thinking as well as Chinese historical memory,\nimperial periods were henceforth seen as golden ages of order and\njustice. In contradiction to the modern Western view that a just', 'were their societies like? Did they have monogamous relationships\nand nuclear families? Did they have ceremonies, moral codes, sports\ncontests and religious rituals? Did they \x00ght wars? The next chapter\ntakes a peek behind the curtain of the ages, examining what life was\nlike in the millennia separating the Cognitive Revolution from the\nAgricultural Revolution.\n* Here and in the following pages, when speaking about Sapiens language, I refer to the\nbasic linguistic abilities of our species and not to a particular dialect. English, Hindi and\nChinese are all variants of Sapiens language. Apparently, even at the time of the Cognitive\nRevolution, di\x00erent Sapiens groups had di\x00erent dialects.']

======================================================================

Overall Metric Pass Rates

Correctness (GEval): 40.00% pass rate

======================================================================
 Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results on Confident AI. 
‼️  Friendly reminder 😇: You can also run evaluations with ALL of deepeval's metrics directly on Confident AI 
instead.
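Several of the failures above look like judge errors rather than answer errors. In the Greenwich case the actual output contains the expected "1847" and still scores 0.0, because the judge model objects to the format. A cheap way to spot such cases is a plain containment check run next to the GEval score. The following is a minimal sketch and not part of DeepEval: `normalize` and `contains_expected` are hypothetical helpers that use only the standard library. It only catches short factual answers quoted verbatim, so it supplements the LLM judge rather than replacing it.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def contains_expected(actual: str, expected: str) -> bool:
    """True if the normalized expected answer occurs as whole words in the actual answer."""
    needle = normalize(expected)
    if not needle:
        return False
    return f" {needle} " in f" {normalize(actual)} "


# Two test cases copied from the log above.
greenwich = contains_expected(
    "British train companies agreed to use Greenwich Observatory time "
    "for all timetables in 1847.",
    "1847",
)
han = contains_expected(
    "The provided context does not contain information about the percentage "
    "of China's population that identifies as Han or is seen as Han.",
    "More than 90 percent",
)
print(greenwich)  # True: the judge's 0.0 is probably a false negative
print(han)        # False: the model really did not answer
```

A test case that fails GEval but passes this check is worth a manual look. It usually means the evaluation steps or the local judge model need tightening, not the RAG pipeline.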
EvaluationResult(test_results=[TestResult(name='test_case_1', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.0, reason='The actual output does not address the problem of specialization as outlined in the expected output and lacks factual correctness.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='What problem does specialization create in an economy?', actual_output="The context does not directly address the problems created by specialization in an economy. I don't have enough information from these documents to answer that specific question accurately.", expected_output="Specialization creates the problem of how to manage the exchange of goods between specialists when an economy of favours and obligations doesn't work with large numbers of strangers.", context=None, retrieval_context=['have repeatedly managed to trounce calculating merchants, and\neven to reshape the economy. It is therefore impossible to\nunderstand the uni\x00cation of humankind as a purely economic\nprocess. In order to understand how thousands of isolated cultures\ncoalesced over time to form the global village of today, we must\ntake into account the role of gold and silver, but we cannot\ndisregard the equally crucial role of steel.', 'The Capitalist Hell\nThere is an even more fundamental reason why it’s dangerous to\ngive markets a completely free rein. Adam Smith taught that the\nshoemaker would use his surplus to employ more assistants. This\nimplies that egoistic greed is bene\x00cial for all, since pro\x00ts are\nutilised to expand production and hire more employees.\nYet what happens if the greedy shoemaker increases his pro\x00ts by\npaying employees less and increasing their work hours? 
The\nstandard answer is that the free market would protect the\nemployees. If our shoemaker pays too little and demands too much,\nthe best employees would naturally abandon him and go to work for\nhis competitors. The tyrant shoemaker would \x00nd himself left with\nthe worst labourers, or with no labourers at all. He would have to\nmend his ways or go out of business. His own greed would compel\nhim to treat his employees well.\nThis sounds bulletproof in theory, but in practice the bullets get\nthrough all too easily. In a completely free market, unsupervised by\nkings and priests, avaricious capitalists can establish monopolies or\ncollude against their workforces. If there is a single corporation\ncontrolling all shoe factories in a country, or if all factory owners\nconspire to reduce wages simultaneously, then the labourers are no\nlonger able to protect themselves by switching jobs.\nEven worse, greedy bosses might curtail the workers’ freedom of\nmovement through debt peonage or slavery. At the end of the\nMiddle Ages, slavery was almost unknown in Christian Europe.\nDuring the early modern period, the rise of European capitalism\nwent hand in hand with the rise of the Atlantic slave trade.\nUnrestrained market forces, rather than tyrannical kings or racist\nideologues, were responsible for this calamity.\nWhen the Europeans conquered America, they opened gold and\nsilver mines and established sugar, tobacco and cotton plantations.\nThese mines and plantations became the mainstay of American\nproduction and export. The sugar plantations were particularly\nimportant. In the Middle Ages, sugar was a rare luxury in Europe. 
It', 'maxim that ‘When pro\x00ts increase, the landlord or weaver will\nemploy more assistants’ and not ‘When pro\x00ts increase, Scrooge will\nhoard his money in a chest and take it out only to count his coins.’\nA crucial part of the modern capitalist economy was the emergence\nof a new ethic, according to which pro\x00ts ought to be reinvested in\nproduction. This brings about more pro\x00ts, which are again\nreinvested in production, which brings more pro\x00ts, et cetera ad\nin\x00nitum. Investments can be made in many ways: enlarging the\nfactory, conducting scienti\x00c research, developing new products.\nYet all these investments must somehow increase production and\ntranslate into larger pro\x00ts. In the new capitalist creed, the \x00rst and\nmost sacred commandment is: ‘The pro\x00ts of production must be\nreinvested in increasing production.’\nThat’s why capitalism is called ‘capitalism’. Capitalism\ndistinguishes ‘capital’ from mere ‘wealth’. Capital consists of money,\ngoods and resources that are invested in production. Wealth, on the\nother hand, is buried in the ground or wasted on unproductive\nactivities. A pharaoh who pours resources into a non-productive\npyramid is not a capitalist. A pirate who loots a Spanish treasure\n\x00eet and buries a chest full of glittering coins on the beach of some\nCaribbean island is not a capitalist. But a hard-working factory hand\nwho reinvests part of his income in the stock market is.\nThe idea that ‘The pro\x00ts of production must be reinvested in\nincreasing production’ sounds trivial. Yet it was alien to most people\nthroughout history. In premodern times, people believed that\nproduction was more or less constant. So why reinvest your pro\x00ts\nif production won’t increase by much, no matter what you do? Thus\nmedieval noblemen espoused an ethic of generosity and conspicuous\nconsumption. They spent their revenues on tournaments, banquets,\npalaces and wars, and on charity and monumental cathedrals. 
Few\ntried to reinvest pro\x00ts in increasing their manors’ output,\ndeveloping better kinds of wheat, or looking for new markets.']), TestResult(name='test_case_4', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output provides correct numerical values but does not explicitly state the difference in nucleobases between Sapiens and mouse genomes as specified in the expected output.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='How many more nucleobases does the Sapiens genome have compared to the mouse genome?', actual_output='The Sapiens genome has about 2.9 billion nucleobases, while the mouse genome contains approximately 2.5 billion nucleobases. This means the Sapiens genome is only 14% larger than that of the mouse.', expected_output='The Sapiens genome has about 400 million more nucleobases compared to the mouse genome.', context=None, retrieval_context=['But why stop even at Neanderthals? Why not go back to God’s\ndrawing board and design a better Sapiens? The abilities, needs and\ndesires of Homo sapiens have a genetic basis, and the Sapiens\ngenome is no more complex than that of voles and mice. (The\nmouse genome contains about 2.5 billion nucleobases, the Sapiens\ngenome about 2.9 billion bases – meaning the latter is only 14 per\ncent larger.)11 In the medium range – perhaps in a few decades –\ngenetic engineering and other forms of biological engineering might\nenable us to make far-reaching alterations not only to our\nphysiology, immune system and life expectancy, but also to our\nintellectual and emotional capacities. If genetic engineering can\ncreate genius mice, why not genius humans? 
If it can create\nmonogamous voles, why not humans hard-wired to remain faithful\nto their partners?\nThe Cognitive Revolution that turned Homo sapiens from an\ninsigni\x00cant ape into the master of the world did not require any\nnoticeable change in physiology or even in the size and external\nshape of the Sapiens brain. It apparently involved no more than a\nfew small changes to internal brain structure. Perhaps another small\nchange would be enough to ignite a Second Cognitive Revolution,\ncreate a completely new type of consciousness, and transform Homo\nsapiens into something altogether di\x00erent.\nTrue, we still don’t have the acumen to achieve this, but there\nseems to be no insurmountable technical barrier preventing us from\nproducing superhumans. The main obstacles are the ethical and\npolitical objections that have slowed down research on humans. And\nno matter how convincing the ethical arguments may be, it is hard\nto see how they can hold back the next step for long, especially if\nwhat is at stake is the possibility of prolonging human life\ninde\x00nitely, conquering incurable diseases, and upgrading our\ncognitive and emotional abilities.\nWhat would happen, for example, if we developed a cure for\nAlzheimer’s disease that, as a side bene\x00t, could dramatically\nimprove the memories of healthy people? Would anyone be able to\nhalt the relevant research? And when the cure is developed, could', '200,000Homo sapiens evolves in East Africa.\n70,000The Cognitive Revolution. Emergence of \x00ctive language.\nBeginning of history. Sapiens spread out of Africa.\n45,000Sapiens settle Australia. Extinction of Australian\nmegafauna.\n30,000Extinction of Neanderthals.\n16,000Sapiens settle America. Extinction of American megafauna.\n13,000Extinction of Homo \x00oresiensis. Homo sapiens the only\nsurviving human species.\n12,000The Agricultural Revolution. Domestication of plants and\nanimals. Permanent settlements.\n5,000 First kingdoms, script and money. 
Polytheistic religions.\n4,250 First empire – the Akkadian Empire of Sargon.\n2,500\nInvention of coinage – a universal money.\nThe Persian Empire – a universal political order ‘for the\nbene\x00t of all humans’.\nBuddhism in India – a universal truth ‘to liberate all beings\nfrom su\x00ering’.\n2,000 Han Empire in China. Roman Empire in the\nMediterranean. Christianity.\n1,400 Islam.\n500 The Scienti\x00c Revolution. Humankind admits its ignorance\nand begins to acquire unprecedented power. Europeans\nbegin to conquer America and the oceans. The entire', 'Similarly, when Sapiens reached East Asia, they interbred with the\nlocal Erectus, so the Chinese and Koreans are a mixture of Sapiens\nand Erectus.\nThe opposing view, called the ‘Replacement Theory’ tells a very\ndi\x00erent story – one of incompatibility, revulsion, and perhaps even\ngenocide. According to this theory, Sapiens and other humans had\ndi\x00erent anatomies, and most likely di\x00erent mating habits and\neven body odours. They would have had little sexual interest in one\nanother. And even if a Neanderthal Romeo and a Sapiens Juliet fell\nin love, they could not produce fertile children, because the genetic\ngulf separating the two populations was already unbridgeable. The\ntwo populations remained completely distinct, and when the\nNeanderthals died out, or were killed o\x00, their genes died with\nthem. According to this view, Sapiens replaced all the previous\nhuman populations without merging with them. If that is the case,\nthe lineages of all contemporary humans can be traced back,\nexclusively, to East Africa, 70,000 years ago. We are all ‘pure\nSapiens’.\nMap 1. 
Homo sapiens conquers the globe.']), TestResult(name='test_case_3', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output provides additional context and information about Neanderthal DNA in modern populations, but it introduces a theory (Interbreeding Theory) that was not mentioned in the expected output.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='What percentage of unique human DNA in modern populations in the Middle East and Europe is Neanderthal DNA?', actual_output='According to the Interbreeding Theory, 1–4% of the unique human DNA in modern populations in the Middle East and Europe is Neanderthal DNA.', expected_output='1–4 per cent', context=None, retrieval_context=['perhaps a million humans living between the Indonesian\narchipelago and the Iberian peninsula, a mere blip on the ecological\nradar.\nOur own species, Homo sapiens, was already present on the world\nstage, but so far it was just minding its own business in a corner of\nAfrica. We don’t know exactly where and when animals that can be\nclassi\x00ed as Homo sapiens \x00rst evolved from some earlier type of\nhumans, but most scientists agree that by 150,000 years ago, East\nAfrica was populated by Sapiens that looked just like us. If one of\nthem turned up in a modern morgue, the local pathologist would\nnotice nothing peculiar. 
Thanks to the blessings of \x00re, they had\nsmaller teeth and jaws than their ancestors, whereas they had\nmassive brains, equal in size to ours.\nScientists also agree that about 70,000 years ago, Sapiens from\nEast Africa spread into the Arabian peninsula, and from there they\nquickly overran the entire Eurasian landmass.\nWhen Homo sapiens landed in Arabia, most of Eurasia was already\nsettled by other humans. What happened to them? There are two\ncon\x00icting theories. The ‘Interbreeding Theory’ tells a story of\nattraction, sex and mingling. As the African immigrants spread\naround the world, they bred with other human populations, and\npeople today are the outcome of this interbreeding.\nFor example, when Sapiens reached the Middle East and Europe,\nthey encountered the Neanderthals. These humans were more\nmuscular than Sapiens, had larger brains, and were better adapted\nto cold climes. They used tools and \x00re, were good hunters, and\napparently took care of their sick and in\x00rm. (Archaeologists have\ndiscovered the bones of Neanderthals who lived for many years with\nsevere physical handicaps, evidence that they were cared for by\ntheir relatives.) Neanderthals are often depicted in caricatures as the\narchetypical brutish and stupid ‘cave people’, but recent evidence\nhas changed their image.\nAccording to the Interbreeding Theory, when Sapiens spread into\nNeanderthal lands, Sapiens bred with Neanderthals until the two\npopulations merged. If this is the case, then today’s Eurasians are\nnot pure Sapiens. They are a mixture of Sapiens and Neanderthals.', 'A lot hinges on this debate. From an evolutionary perspective,\n70,000 years is a relatively short interval. If the Replacement\nTheory is correct, all living humans have roughly the same genetic\nbaggage, and racial distinctions among them are negligible. 
But if\nthe Interbreeding Theory is right, there might well be genetic\ndi\x00erences between Africans, Europeans and Asians that go back\nhundreds of thousands of years. This is political dynamite, which\ncould provide material for explosive racial theories.\nIn recent decades the Replacement Theory has been the common\nwisdom in the \x00eld. It had \x00rmer archaeological backing, and was\nmore politically correct (scientists had no desire to open up the\nPandora’s box of racism by claiming signi\x00cant genetic diversity\namong modern human populations). But that ended in 2010, when\nthe results of a four-year e\x00ort to map the Neanderthal genome\nwere published. Geneticists were able to collect enough intact\nNeanderthal DNA from fossils to make a broad comparison between\nit and the DNA of contemporary humans. The results stunned the\nscienti\x00c community.\nIt turned out that 1–4 per cent of the unique human DNA of\nmodern populations in the Middle East and Europe is Neanderthal\nDNA. That’s not a huge amount, but it’s signi\x00cant. A second shock\ncame several months later, when DNA extracted from the fossilised\n\x00nger from Denisova was mapped. The results proved that up to 6\nper cent of the unique human DNA of modern Melanesians and\nAboriginal Australians is Denisovan DNA.\nIf these results are valid – and it’s important to keep in mind that\nfurther research is under way and may either reinforce or modify\nthese conclusions – the Interbreeders got at least some things right.\nBut that doesn’t mean that the Replacement Theory is completely\nwrong. Since Neanderthals and Denisovans contributed only a small\namount of DNA to our present-day genome, it is impossible to speak\nof a ‘merger’ between Sapiens and other human species. 
Although\ndi\x00erences between them were not large enough to completely\nprevent fertile intercourse, they were su\x00cient to make such\ncontacts very rare.', 'point, I will often use the term ‘Sapiens’ to denote members of the\nspecies Homo sapiens, while reserving the term ‘human’ to refer to\nall extant members of the genus Homo.\nHumans \x00rst evolved in East Africa about 2.5 million years ago\nfrom an earlier genus of apes called Australopithecus, which means\n‘Southern Ape’. About 2 million years ago, some of these archaic\nmen and women left their homeland to journey through and settle\nvast areas of North Africa, Europe and Asia. Since survival in the\nsnowy forests of northern Europe required di\x00erent traits than those\nneeded to stay alive in Indonesia’s steaming jungles, human\npopulations evolved in di\x00erent directions. The result was several\ndistinct species, to each of which scientists have assigned a pompous\nLatin name.\n2. Our siblings, according to speculative reconstructions (left to right):\nHomo rudolfensis (East Africa); Homo erectus (East Asia); and Homo neanderthalensis\n(Europe and western Asia). All are humans.\nHumans in Europe and western Asia evolved into Homo\nneanderthalensis (‘Man from the Neander Valley), popularly referred\nto simply as ‘Neanderthals’. Neanderthals, bulkier and more\nmuscular than us Sapiens, were well adapted to the cold climate of\nIce Age western Eurasia. 
The more eastern regions of Asia were']), TestResult(name='test_case_9', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.2, reason='The actual output does not provide a specific number of British officials, contradicting the expected factual information about fewer than 5,000 officials governing up to 300 million Indians.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='How many British officials were sufficient to govern up to 300 million Indians during the nineteenth and early twentieth centuries?', actual_output="The context does not provide a specific number of British officials needed to govern India during that period. However, it's clear from the documents that despite the vast population, the British managed governance along with significant cultural and infrastructural changes. The exact administrative numbers are not detailed in the provided texts.", expected_output='Fewer than 5,000 British officials were sufficient to govern up to 300 million Indians during the nineteenth and early twentieth centuries.', context=None, retrieval_context=['passionate cricket players and chai (tea) drinkers, and both game\nand beverage are British legacies. Commercial tea farming did not\nexist in India until the mid-nineteenth century, when it was\nintroduced by the British East India Company. It was the snobbish\nBritish sahibs who spread the custom of tea drinking throughout the\nsubcontinent.\n28. The Chhatrapati Shivaji train station in Mumbai. It began its life as Victoria\nStation, Bombay. The British built it in the Neo-Gothic style that was popular in late\nnineteenth-century Britain. 
A Hindu nationalist government changed the names of\nboth city and station, but showed no appetite for razing such a magni\x00cent building,\neven if it was built by foreign oppressors.\nHow many Indians today would want to call a vote to divest\nthemselves of democracy, English, the railway network, the legal\nsystem, cricket and tea on the grounds that they are imperial\nlegacies? And if they did, wouldn’t the very act of calling a vote to\ndecide the issue demonstrate their debt to their former overlords?', 'a century, maintaining a huge military force of up to 350,000\nsoldiers, considerably outnumbering the armed forces of the British\nmonarchy. Only in 1858 did the British crown nationalise India\nalong with the company’s private army. Napoleon made fun of the\nBritish, calling them a nation of shopkeepers. Yet these shopkeepers\ndefeated Napoleon himself, and their empire was the largest the\nworld has ever seen.\nIn the Name of Capital\nThe nationalisation of Indonesia by the Dutch crown (1800) and of\nIndia by the British crown (1858) hardly ended the embrace of\ncapitalism and empire. On the contrary, the connection only grew\nstronger during the nineteenth century. Joint-stock companies no\nlonger needed to establish and govern private colonies – their\nmanagers and large shareholders now pulled the strings of power in\nLondon, Amsterdam and Paris, and they could count on the state to\nlook after their interests. As Marx and other social critics quipped,\nWestern governments were becoming a capitalist trade union.\nThe most notorious example of how governments did the bidding\nof big money was the First Opium War, fought between Britain and\nChina (1840–42). In the \x00rst half of the nineteenth century, the\nBritish East India Company and sundry British business people made\nfortunes by exporting drugs, particularly opium, to China. Millions\nof Chinese became addicts, debilitating the country both\neconomically and socially. 
In the late 1830s the Chinese government\nissued a ban on drug tra\x00cking, but British drug merchants simply\nignored the law. Chinese authorities began to con\x00scate and destroy\ndrug cargos. The drug cartels had close connections in Westminster\nand Downing Street – many MPs and Cabinet ministers in fact held\nstock in the drug companies – so they pressured the government to\ntake action.\nIn 1840 Britain duly declared war on China in the name of ‘free\ntrade’. It was a walkover. The overcon\x00dent Chinese were no match', 'Modern science and modern empires were motivated by the restless\nfeeling that perhaps something important awaited beyond the\nhorizon – something they had better explore and master. Yet the\nconnection between science and empire went much deeper. Not just\nthe motivation, but also the practices of empire-builders were\nentangled with those of scientists. For modern Europeans, building\nan empire was a scienti\x00c project, while setting up a scienti\x00c\ndiscipline was an imperial project.\nWhen the Muslims conquered India, they did not bring along\narchaeologists to systematically study Indian history,\nanthropologists to study Indian cultures, geologists to study Indian\nsoils, or zoologists to study Indian fauna. When the British\nconquered India, they did all of these things. On 10 April 1802 the\nGreat Survey of India was launched. It lasted sixty years. With the\nhelp of tens of thousands of native labourers, scholars and guides,\nthe British carefully mapped the whole of India, marking borders,\nmeasuring distances, and even calculating for the \x00rst time the exact\nheight of Mount Everest and the other Himalayan peaks. 
The British\nexplored the military resources of Indian provinces and the location\nof their gold mines, but they also took the trouble to collect\ninformation about rare Indian spiders, to catalogue colourful\nbutter\x00ies, to trace the ancient origins of extinct Indian languages,\nand to dig up forgotten ruins.\nMohenjo-daro was one of the chief cities of the Indus Valley\ncivilisation, which \x00ourished in the third millennium BC and was\ndestroyed around 1900 BC. None of India’s pre-British rulers –\nneither the Mauryas, nor the Guptas, nor the Delhi sultans, nor the\ngreat Mughals – had given the ruins a second glance. But a British\narchaeological survey took notice of the site in 1922. A British team\nthen excavated it, and discovered the \x00rst great civilisation of India,\nwhich no Indian had been aware of.\nAnother telling example of British scienti\x00c curiosity was the\ndeciphering of cuneiform script. This was the main script used\nthroughout the Middle East for close to 3,000 years, but the last\nperson able to read it probably died sometime in the early \x00rst']), TestResult(name='test_case_8', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output provides additional context and details beyond the expected output, but it correctly includes the 7,500 hours mentioned in the expected output.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='How many hours did it take an experienced artisan to prepare 10,000 ivory beads?', actual_output='It took approximately 7,500 hours for an experienced artisan to prepare 10,000 ivory beads. 
This equates to well over three years of labor.', expected_output='7,500 hours', context=None, retrieval_context=['and various ivory objects. A skilled craftsman (or craftswoman)\nprobably needed about forty-\x00ve minutes to prepare a single ivory\nbead. In other words, fashioning the 10,000 ivory beads that\ncovered the two children, not to mention the other objects, required\nsome 7,500 hours of delicate work, well over three years of labour\nby an experienced artisan!\nIt is highly unlikely that at such a young age the Sungir children\nhad proved themselves as leaders or mammoth-hunters. Only\ncultural beliefs can explain why they received such an extravagant\nburial. One theory is that they owed their rank to their parents.\nPerhaps they were the children of the leader, in a culture that\nbelieved in either family charisma or strict rules of succession.\nAccording to a second theory, the children had been identi\x00ed at\nbirth as the incarnations of some long-dead spirits. A third theory\nargues that the children’s burial re\x00ects the way they died rather\nthan their status in life. They were ritually sacri\x00ced – perhaps as\npart of the burial rites of the leader – and then entombed with pomp\nand circumstance.9\nWhatever the correct answer, the Sungir children are among the\nbest pieces of evidence that 30,000 years ago Sapiens could invent\nsociopolitical codes that went far beyond the dictates of our DNA\nand the behaviour patterns of other human and animal species.\nPeace or War?\nFinally, there’s the thorny question of the role of war in forager\nsocieties. Some scholars imagine ancient hunter-gatherer societies as\npeaceful paradises, and argue that war and violence began only with\nthe Agricultural Revolution, when people started to accumulate\nprivate property. Other scholars maintain that the world of the\nancient foragers was exceptionally cruel and violent. 
Both schools of\nthought are castles in the air, connected to the ground by the thin\nstrings of meagre archaeological remains and anthropological\nobservations of present-day foragers.', 'production stands at 30 million tons per year. Napoleon III would be\nsurprised to hear that his subjects’ descendants use cheap disposable\naluminium foil to wrap their sandwiches and put away their\nleftovers.\nTwo thousand years ago, when people in the Mediterranean basin\nsu\x00ered from dry skin they smeared olive oil on their hands. Today,\nthey open a tube of hand cream. Below is the list of ingredients of a\nsimple modern hand cream that I bought at a local store:\ndeionised water, stearic acid, glycerin, caprylic/caprictiglyceride, propylene glycol,\nisopropyl myristate, panax ginseng root extract, fragrance, cetyl alcohol, triethanolamine,\ndimeticone, arctostaphylos uva-ursi leaf extract, magnesium ascorbyl phosphate,\nimidazolidinyl urea, methyl paraben, camphor, propyl paraben, hydroxyisohexyl 3-\ncyclohexene carboxaldehyde, hydroxyl-citronellal, linalool, butylphenyl methylproplonal,\ncitronnellol, limonene, geraniol.\nAlmost all of these ingredients were invented or discovered in the\nlast two centuries.\nDuring World War One, Germany was placed under blockade and\nsu\x00ered severe shortages of raw materials, in particular saltpetre, an\nessential ingredient in gunpowder and other explosives. The most\nimportant saltpetre deposits were in Chile and India; there were\nnone at all in Germany. True, saltpetre could be replaced by\nammonia, but that was expensive to produce as well. Luckily for the\nGermans, one of their fellow citizens, a Jewish chemist named Fritz\nHaber, had discovered in 1908 a process for producing ammonia\nliterally out of thin air. When war broke out, the Germans used\nHaber’s discovery to commence industrial production of explosives\nusing air as a raw material. 
Some scholars believe that if it hadn’t\nbeen for Haber’s discovery, Germany would have been forced to\nsurrender long before November 1918.6 The discovery won Haber\n(who during the war also pioneered the use of poison gas in battle)\na Nobel Prize in 1918. In chemistry, not in peace.', 'Part One\nThe Cognitive Revolution\n1. A human handprint made about 30,000 years ago, on the wall of the Chauvet-\nPont-d’Arc Cave in southern France. Somebody tried to say, ‘I was here!’']), TestResult(name='test_case_5', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.2, reason='The actual output introduces new information about empires and their stability, which was not mentioned in the expected output.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='How long has empire been the most common form of political organization according to the context?', actual_output="Empire has been the world's most common form of political organization for the last 2,500 years. Most humans during this period have lived within empires. This form of government is described as very stable and enduring.", expected_output='The last 2,500 years', context=None, retrieval_context=['them to oblivion. Empires, too, ultimately fall, but they tend to\nleave behind rich and enduring legacies. Almost all people in the\ntwenty-\x00rst century are the o\x00spring of one empire or another.\nWhat is an Empire?\nAn empire is a political order with two important characteristics.\nFirst, to qualify for that designation you have to rule over a\nsigni\x00cant number of distinct peoples, each possessing a di\x00erent\ncultural identity and a separate territory. How many peoples\nexactly? Two or three is not su\x00cient. 
Twenty or thirty is plenty.\nThe imperial threshold passes somewhere in between.\nSecond, empires are characterised by \x00exible borders and a\npotentially unlimited appetite. They can swallow and digest more\nand more nations and territories without altering their basic\nstructure or identity. The British state of today has fairly clear\nborders that cannot be exceeded without altering the fundamental\nstructure and identity of the state. A century ago almost any place\non earth could have become part of the British Empire.\nCultural diversity and territorial \x00exibility give empires not only\ntheir unique character, but also their central role in history. It’s\nthanks to these two characteristics that empires have managed to\nunite diverse ethnic groups and ecological zones under a single\npolitical umbrella, thereby fusing together larger and larger\nsegments of the human species and of planet Earth.\nIt should be stressed that an empire is de\x00ned solely by its\ncultural diversity and \x00exible borders, rather than by its origins, its\nform of government, its territorial extent, or the size of its\npopulation. An empire need not emerge from military conquest. The\nAthenian Empire began its life as a voluntary league, and the\nHabsburg Empire was born in wedlock, cobbled together by a string\nof shrewd marriage alliances. Nor must an empire be ruled by an\nautocratic emperor. The British Empire, the largest empire in\nhistory, was ruled by a democracy. Other democratic (or at least', 'republican) empires have included the modern Dutch, French,\nBelgian and American empires, as well as the premodern empires of\nNovgorod, Rome, Carthage and Athens.\nSize, too, does not really matter. Empires can be puny. The\nAthenian Empire at its zenith was much smaller in size and\npopulation than today’s Greece. The Aztec Empire was smaller than\ntoday’s Mexico. 
Both were nevertheless empires, whereas modern\nGreece and modern Mexico are not, because the former gradually\nsubdued dozens and even hundreds of di\x00erent polities while the\nlatter have not. Athens lorded it over more than a hundred formerly\nindependent city states, whereas the Aztec Empire, if we can trust\nits taxation records, ruled 371 di\x00erent tribes and peoples.1\nHow was it possible to squeeze such a human potpourri into the\nterritory of a modest modern state? It was possible because in the\npast there were many more distinct peoples in the world, each of\nwhich had a smaller population and occupied less territory than\ntoday’s typical people. The land between the Mediterranean and the\nJordan River, which today struggles to satisfy the ambitions of just\ntwo peoples, easily accommodated in biblical times dozens of\nnations, tribes, petty kingdoms and city states.\nEmpires were one of the main reasons for the drastic reduction in\nhuman diversity. The imperial steamroller gradually obliterated the\nunique characteristics of numerous peoples (such as the\nNumantians), forging out of them new and much larger groups.\nEvil Empires?\nIn our time, ‘imperialist’ ranks second only to ‘fascist’ in the lexicon\nof political swear words. The contemporary critique of empires\ncommonly takes two forms:\n1. Empires do not work. In the long run, it is not possible to rule\ne\x00ectively over a large number of conquered peoples.\n2. Even if it can be done, it should not be done, because empires\nare evil engines of destruction and exploitation. Every people has a', 'right to self-determination, and should never be subject to the rule\nof another.\nFrom a historical perspective, the \x00rst statement is plain\nnonsense, and the second is deeply problematic.\nThe truth is that empire has been the world’s most common form\nof political organisation for the last 2,500 years. Most humans\nduring these two and a half millennia have lived in empires. 
Empire\nis also a very stable form of government. Most empires have found it\nalarmingly easy to put down rebellions. In general, they have been\ntoppled only by external invasion or by a split within the ruling\nelite. Conversely, conquered peoples don’t have a very good record\nof freeing themselves from their imperial overlords. Most have\nremained subjugated for hundreds of years. Typically, they have\nbeen slowly digested by the conquering empire, until their distinct\ncultures \x00zzled out.\nFor example, when the Western Roman Empire \x00nally fell to\ninvading Germanic tribes in 476 AD, the Numantians, Arverni,\nHelvetians, Samnites, Lusitanians, Umbrians, Etruscans and\nhundreds of other forgotten peoples whom the Romans conquered\ncenturies earlier did not emerge from the empires eviscerated\ncarcass like Jonah from the belly of the great \x00sh. None of them\nwere left. The biological descendants of the people who had\nidenti\x00ed themselves as members of those nations, who had spoken\ntheir languages, worshipped their gods and told their myths and\nlegends, now thought, spoke and worshipped as Romans.\nIn many cases, the destruction of one empire hardly meant\nindependence for subject peoples. Instead, a new empire stepped\ninto the vacuum created when the old one collapsed or retreated.\nNowhere has this been more obvious than in the Middle East. The\ncurrent political constellation in that region – a balance of power\nbetween many independent political entities with more or less\nstable borders – is almost without parallel any time in the last\nseveral millennia. 
The last time the Middle East experienced such a\nsituation was in the eighth century BC – almost 3,000 years ago!\nFrom the rise of the Neo-Assyrian Empire in the eighth century BC']), TestResult(name='test_case_2', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.1, reason="The actual output discusses Newton's law instead of the Theory of Relativity as expected. It incorrectly addresses gravity and mass without providing the correct equation related to the Theory of Relativity.", strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='What theory provides the equation for calculating the acceleration of mass under the influence of gravity?', actual_output="The theory that provides the equation for calculating the acceleration of mass under the influence of gravity is Newton's law of universal gravitation. However, the context mentions Newton but asks about an equation related to gravity and mass; it hints at the Theory of Relativity without providing the specific equation. I don't have the exact equation from the given context.", expected_output='The Theory of Relativity', context=None, retrieval_context=['world is a battleground between good and evil. An evil force created\nmatter, while a good force created spirit. Humans are caught\nbetween these two forces, and should choose good over evil. Yet the\nprophet Mani made no attempt to o\x00er a mathematical formula that\ncould be used to predict human choices by quantifying the\nrespective strength of these two forces. He never calculated that ‘the\nforce acting on a man is equal to the acceleration of his spirit\ndivided by the mass of his body’.\nThis is exactly what scientists seek to accomplish. 
In 1687, Isaac\nNewton published The Mathematical Principles of Natural Philosophy,\narguably the most important book in modern history. Newton\npresented a general theory of movement and change. The greatness\nof Newton’s theory was its ability to explain and predict the\nmovements of all bodies in the universe, from falling apples to\nshooting stars, using three very simple mathematical laws:\nHenceforth, anyone who wished to understand and predict the\nmovement of a cannonball or a planet simply had to make\nmeasurements of the object’s mass, direction and acceleration, and\nthe forces acting on it. By inserting these numbers into Newton’s\nequations, the future position of the object could be predicted. It\nworked like magic. Only around the end of the nineteenth century\ndid scientists come across a few observations that did not \x00t well\nwith Newton’s laws, and these led to the next revolutions in physics\n– the theory of relativity and quantum mechanics.', 'subtraction and multiplication), the basis of modern mathematical\nnotation came into being.\nAlthough this system of writing remains a partial script, it has\nbecome the world’s dominant language. Almost all states,\ncompanies, organisations and institutions – whether they speak\nArabic, Hindi, English or Norwegian – use mathematical script to\nrecord and process data. Every piece of information that can be\ntranslated into mathematical script is stored, spread and processed\nwith mind-boggling speed and e\x00ciency.\nA person who wishes to in\x00uence the decisions of governments,\norganisations and companies must therefore learn to speak in\nnumbers. Experts do their best to translate even ideas such as\n‘poverty’, ‘happiness’ and ‘honesty’ into numbers (‘the poverty line’,\n‘subjective well-being levels’, ‘credit rating’). 
Entire \x00elds of\nknowledge, such as physics and engineering, have already lost\nalmost all touch with the spoken human language, and are\nmaintained solely by mathematical script.\nAn equation for calculating the acceleration of mass i  under the in\x00uence of gravity,\naccording to the Theory of Relativity. When most laypeople see such an equation,\nthey usually panic and freeze, like a deer caught in the headlights of a speeding', '34. Benjamin Franklin disarming the gods.\nA famous example is lightning. Many cultures believed that\nlightning was the hammer of an angry god, used to punish sinners.\nIn the middle of the eighteenth century, in one of the most\ncelebrated experiments in scienti\x00c history, Benjamin Franklin \x00ew\na kite during a lightning storm to test the hypothesis that lightning\nis simply an electric current. Franklins empirical observations,\ncoupled with his knowledge about the qualities of electrical energy,\nenabled him to invent the lightning rod and disarm the gods.\nPoverty is another case in point. Many cultures have viewed\npoverty as an inescapable part of this imperfect world. According to\nthe New Testament, shortly before the cruci\x00xion a woman\nanointed Christ with precious oil worth 300 denarii. Jesus’ disciples\nscolded the woman for wasting such a huge sum of money instead\nof giving it to the poor, but Jesus defended her, saying that ‘The\npoor you will always have with you, and you can help them any\ntime you want. But you will not always have me’ (Mark 14:7).\nToday, fewer and fewer people, including fewer and fewer\nChristians, agree with Jesus on this matter. Poverty is increasingly']), TestResult(name='test_case_6', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output provides additional context about Peugeot SA, but it slightly deviates from the expected output which only required the name. 
The inclusion of extra information is helpful but not aligned with the exact criteria.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='What is the official name of the carmaker that began as a small family business in Valentigney?', actual_output='The official name of the carmaker that began as a small family business in Valentigney is Peugeot SA.', expected_output='Peugeot SA', context=None, retrieval_context=['Modern business-people and lawyers are, in fact, powerful sorcerers.\nThe principal di\x00erence between them and tribal shamans is that\nmodern lawyers tell far stranger tales. The legend of Peugeot a\x00ords\nus a good example.\nAn icon that somewhat resembles the Stadel lion-man appears today\non cars, trucks and motorcycles from Paris to Sydney. It’s the hood\nornament that adorns vehicles made by Peugeot, one of the oldest\nand largest of Europe’s carmakers. Peugeot began as a small family\nbusiness in the village of Valentigney, just 300 kilometres from the\nStadel Cave. Today the company employs about 200,000 people\nworldwide, most of whom are complete strangers to each other.\nThese strangers cooperate so e\x00ectively that in 2008 Peugeot\nproduced more than 1.5 million automobiles, earning revenues of\nabout 55 billion euros.\nIn what sense can we say that Peugeot SA (the company’s o\x00cial\nname) exists? There are many Peugeot vehicles, but these are\nobviously not the company. Even if every Peugeot in the world were\nsimultaneously junked and sold for scrap metal, Peugeot SA would\nnot disappear. It would continue to manufacture new cars and issue\nits annual report. 
The company owns factories, machinery and\nshowrooms, and employs mechanics, accountants and secretaries,\nbut all these together do not comprise Peugeot. A disaster might kill\nevery single one of Peugeot’s employees, and go on to destroy all of\nits assembly lines and executive o\x00ces. Even then, the company\ncould borrow money, hire new employees, build new factories and\nbuy new machinery. Peugeot has managers and shareholders, but\nneither do they constitute the company. All the managers could be\ndismissed and all its shares sold, but the company itself would\nremain intact.', '5. The Peugeot Lion\nIt doesn’t mean that Peugeot SA is invulnerable or immortal. If a\njudge were to mandate the dissolution of the company, its factories\nwould remain standing and its workers, accountants, managers and\nshareholders would continue to live – but Peugeot SA would\nimmediately vanish. In short, Peugeot SA seems to have no essential\nconnection to the physical world. Does it really exist?\nPeugeot is a \x00gment of our collective imagination. Lawyers call\nthis a ‘legal \x00ction’. It can’t be pointed at; it is not a physical object.\nBut it exists as a legal entity. Just like you or me, it is bound by the\nlaws of the countries in which it operates. It can open a bank\naccount and own property. It pays taxes, and it can be sued and\neven prosecuted separately from any of the people who own or\nwork for it.\nPeugeot belongs to a particular genre of legal \x00ctions called\n‘limited liability companies’. The idea behind such companies is\namong humanity’s most ingenious inventions. Homo sapiens lived for\nuntold millennia without them. During most of recorded history\nproperty could be owned only by \x00esh-and-blood humans, the kind\nthat stood on two legs and had big brains. If in thirteenth-century\nFrance Jean set up a wagon-manufacturing workshop, he himself\nwas the business. 
If a wagon he’d made broke down a week after\npurchase, the disgruntled buyer would have sued Jean personally. If\nJean had borrowed 1,000 gold coins to set up his workshop and the', 'How exactly did Armand Peugeot, the man, create Peugeot, the\ncompany? In much the same way that priests and sorcerers have\ncreated gods and demons throughout history, and in which\nthousands of French curés were still creating Christ’s body every\nSunday in the parish churches. It all revolved around telling stories,\nand convincing people to believe them. In the case of the French\ncurés, the crucial story was that of Christ’s life and death as told by\nthe Catholic Church. According to this story, if a Catholic priest\ndressed in his sacred garments solemnly said the right words at the\nright moment, mundane bread and wine turned into God’s \x00esh and\nblood. The priest exclaimed ‘Hoc est corpus meum!’ (Latin for ‘This is\nmy body!’) and hocus pocus – the bread turned into Christ’s \x00esh.\nSeeing that the priest had properly and assiduously observed all the\nprocedures, millions of devout French Catholics behaved as if God\nreally existed in the consecrated bread and wine.\nIn the case of Peugeot SA the crucial story was the French legal\ncode, as written by the French parliament. According to the French\nlegislators, if a certi\x00ed lawyer followed all the proper liturgy and\nrituals, wrote all the required spells and oaths on a wonderfully\ndecorated piece of paper, and a\x00xed his ornate signature to the\nbottom of the document, then hocus pocus – a new company was\nincorporated. When in 1896 Armand Peugeot wanted to create his\ncompany, he paid a lawyer to go through all these sacred\nprocedures. Once the lawyer had performed all the right rituals and\npronounced all the necessary spells and oaths, millions of upright\nFrench citizens behaved as if the Peugeot company really existed.\nTelling e\x00ective stories is not easy. 
The di\x00culty lies not in\ntelling the story, but in convincing everyone else to believe it. Much\nof history revolves around this question: how does one convince\nmillions of people to believe particular stories about gods, or\nnations, or limited liability companies? Yet when it succeeds, it\ngives Sapiens immense power, because it enables millions of\nstrangers to cooperate and work towards common goals. Just try to\nimagine how di\x00cult it would have been to create states, or\nchurches, or legal systems if we could speak only about things that\nreally exist, such as rivers, trees and lions.']), TestResult(name='test_case_0', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.0, reason='Expected output was a numeric value (1847), but actual output is a sentence unrelated to the expected format.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input='In what year did British train companies agree to use Greenwich Observatory time for all timetables?', actual_output='British train companies agreed to use Greenwich Observatory time for all timetables in 1847.', expected_output='1847', context=None, retrieval_context=['train or bus had to reach the factory gate by 07:55. A few minutes’\ndelay would lower production and perhaps even lead to the lay-o\x00s\nof the unfortunate latecomers. In 1784 a carriage service with a\npublished schedule began operating in Britain. Its timetable\nspeci\x00ed only the hour of departure, not arrival. Back then, each\nBritish city and town had its own local time, which could di\x00er\nfrom London time by up to half an hour. 
When it was 12:00 in\nLondon, it was perhaps 12:20 in Liverpool and 11:50 in Canterbury.\nSince there were no telephones, no radio or television, and no fast\ntrains – who could know, and who cared?2\nThe \x00rst commercial train service began operating between\nLiverpool and Manchester in 1830. Ten years later, the \x00rst train\ntimetable was issued. The trains were much faster than the old\ncarriages, so the quirky di\x00erences in local hours became a severe\nnuisance. In 1847, British train companies put their heads together\nand agreed that henceforth all train timetables would be calibrated\nto Greenwich Observatory time, rather than the local times of\nLiverpool, Manchester or Glasgow. More and more institutions\nfollowed the lead of the train companies. Finally, in 1880, the\nBritish government took the unprecedented step of legislating that\nall timetables in Britain must follow Greenwich. For the \x00rst time in\nhistory, a country adopted a national time and obliged its\npopulation to live according to an arti\x00cial clock rather than local\nones or sunrise-to-sunset cycles.\nThis modest beginning spawned a global network of timetables,\nsynchronised down to the tiniest fractions of a second. When the\nbroadcast media – \x00rst radio, then television – made their debut,\nthey entered a world of timetables and became its main enforcers\nand evangelists. Among the \x00rst things radio stations broadcast\nwere time signals, beeps that enabled far-\x00ung settlements and ships\nat sea to set their clocks. Later, radio stations adopted the custom of\nbroadcasting the news every hour. Nowadays, the \x00rst item of every\nnews broadcast – more important even than the outbreak of war – is\nthe time. During World War Two, BBC News was broadcast to Nazi-\noccupied Europe. Each news programme opened with a live\nbroadcast of Big Ben tolling the hour – the magical sound of', 'no. 5 has overslept, it stalls all the other machines. 
In order to\nprevent such calamities, everybody must adhere to a precise\ntimetable. Each worker arrives at work at exactly the same time.\nEverybody takes their lunch break together, whether they are\nhungry or not. Everybody goes home when a whistle announces that\nthe shift is over – not when they have \x00nished their project.\n42. Charlie Chaplin as a simple worker caught in the wheels of the industrial\nassembly line, from the \x00lm Modern Times (1936).\nThe Industrial Revolution turned the timetable and the assembly\nline into a template for almost all human activities. Shortly after\nfactories imposed their time frames on human behaviour, schools\ntoo adopted precise timetables, followed by hospitals, government\no\x00ces and grocery stores. Even in places devoid of assembly lines\nand machines, the timetable became king. If the shift at the factory\nends at 5 p.m., the local pub had better be open for business by\n5:02.\nA crucial link in the spreading timetable system was public\ntransportation. If workers needed to start their shift by 08:00, the', 'freedom. Ingenious German physicists found a way to determine the\nweather conditions in London based on tiny di\x00erences in the tone\nof the broadcast ding-dongs. This information o\x00ered invaluable\nhelp to the Luftwa\x00e. When the British Secret Service discovered\nthis, they replaced the live broadcast with a set recording of the\nfamous clock.\nIn order to run the timetable network, cheap but precise portable\nclocks became ubiquitous. In Assyrian, Sassanid or Inca cities there\nmight have been at most a few sundials. In European medieval cities\nthere was usually a single clock – a giant machine mounted on top\nof a high tower in the town square. These tower clocks were\nnotoriously inaccurate, but since there were no other clocks in town\nto contradict them, it hardly made any di\x00erence. Today, a single\na\x00uent family generally has more timepieces at home than an\nentire medieval country. 
You can tell the time by looking at your\nwristwatch, glancing at your Android, peering at the alarm clock by\nyour bed, gazing at the clock on the kitchen wall, staring at the\nmicrowave, catching a glimpse of the TV or DVD, or taking in the\ntaskbar on your computer out of the corner of your eye. You need to\nmake a conscious e\x00ort not to know what time it is.\nThe typical person consults these clocks several dozen times a\nday, because almost everything we do has to be done on time. An\nalarm clock wakes us up at 7 a.m., we heat our frozen bagel for\nexactly \x00fty seconds in the microwave, brush our teeth for three\nminutes until the electric toothbrush beeps, catch the 07:40 train to\nwork, run on the treadmill at the gym until the beeper announces\nthat half an hour is over, sit down in front of the TV at 7 p.m. to\nwatch our favourite show, get interrupted at preordained moments\nby commercials that cost $1,000 per second, and eventually unload\nall our angst on a therapist who restricts our prattle to the now\nstandard \x00fty-minute therapy hour.\nThe Industrial Revolution brought about dozens of major upheavals\nin human society. 
Adapting to industrial time is just one of them.\nOther notable examples include urbanisation, the disappearance of']), TestResult(name='test_case_7', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.0, reason='The actual output does not provide a percentage as expected, but instead states that it lacks specific statistical information from the given documents.', strict_mode=False, evaluation_model='local model', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]')], conversational=False, multimodal=False, input="What percentage of China's population sees themselves and is seen by others as Han?", actual_output="The provided context does not contain information about the percentage of China's population that identifies as Han or is seen as Han. I don't have this specific statistic from the given documents.", expected_output='More than 90 percent', context=None, retrieval_context=['world is composed of separate nation states, in China periods of\npolitical fragmentation were seen as dark ages of chaos and\ninjustice. This perception has had far-reaching implications for\nChinese history. Every time an empire collapsed, the dominant\npolitical theory goaded the powers that be not to settle for paltry\nindependent principalities, but to attempt reuni\x00cation. Sooner or\nlater these attempts always succeeded.\nWhen They Become Us\nEmpires have played a decisive part in amalgamating many small\ncultures into fewer big cultures. Ideas, people, goods and technology\nspread more easily within the borders of an empire than in a\npolitically fragmented region. Often enough, it was the empires\nthemselves which deliberately spread ideas, institutions, customs\nand norms. One reason was to make life easier for themselves. 
It is\ndi\x00cult to rule an empire in which every little district has its own\nset of laws, its own form of writing, its own language and its own\nmoney. Standardisation was a boon to emperors.\nA second and equally important reason why empires actively\nspread a common culture was to gain legitimacy. At least since the\ndays of Cyrus and Qín Shǐ Huángdì, empires have justi\x00ed their\nactions – whether road-building or bloodshed – as necessary to\nspread a superior culture from which the conquered bene\x00t even\nmore than the conquerors.\nThe bene\x00ts were sometimes salient – law enforcement, urban\nplanning, standardisation of weights and measures – and sometimes\nquestionable – taxes, conscription, emperor worship. But most\nimperial elites earnestly believed that they were working for the\ngeneral welfare of all the empires inhabitants. China’s ruling class\ntreated their country’s neighbours and its foreign subjects as\nmiserable barbarians to whom the empire must bring the bene\x00ts of\nculture. The Mandate of Heaven was bestowed upon the emperor\nnot in order to exploit the world, but in order to educate humanity.', 'between rulers and ruled, it has still recognised the basic unity of\nthe entire world, the existence of a single set of principles governing\nall places and times, and the mutual responsibilities of all human\nbeings. Humankind is seen as a large family: the privileges of the\nparents go hand in hand with responsibility for the welfare of the\nchildren.\nThis new imperial vision passed from Cyrus and the Persians to\nAlexander the Great, and from him to Hellenistic kings, Roman\nemperors, Muslim caliphs, Indian dynasts, and eventually even to\nSoviet premiers and American presidents. 
This benevolent imperial\nvision has justi\x00ed the existence of empires, and negated not only\nattempts by subject peoples to rebel, but also attempts by\nindependent peoples to resist imperial expansion.\nSimilar imperial visions were developed independently of the\nPersian model in other parts of the world, most notably in Central\nAmerica, in the Andean region, and in China. According to\ntraditional Chinese political theory, Heaven (Tian) is the source of\nall legitimate authority on earth. Heaven chooses the most worthy\nperson or family and gives them the Mandate of Heaven. This\nperson or family then rules over All Under Heaven (Tianxia) for the\nbene\x00t of all its inhabitants. Thus, a legitimate authority is – by\nde\x00nition – universal. If a ruler lacks the Mandate of Heaven, then\nhe lacks legitimacy to rule even a single city. If a ruler enjoys the\nmandate, he is obliged to spread justice and harmony to the entire\nworld. The Mandate of Heaven could not be given to several\ncandidates simultaneously, and consequently one could not\nlegitimise the existence of more than one independent state.\nThe \x00rst emperor of the united Chinese empire, Qín Shǐ Huángdì,\nboasted that ‘throughout the six directions [of the universe]\neverything belongs to the emperor\xa0…\xa0wherever there is a human\nfootprint, there is not one who did not become a subject [of the\nemperor]\xa0…\xa0his kindness reaches even oxen and horses. There is not\none who did not bene\x00t. Every man is safe under his own roof.’4 In\nChinese political thinking as well as Chinese historical memory,\nimperial periods were henceforth seen as golden ages of order and\njustice. In contradiction to the modern Western view that a just', 'were their societies like? Did they have monogamous relationships\nand nuclear families? Did they have ceremonies, moral codes, sports\ncontests and religious rituals? Did they \x00ght wars? 
The next chapter\ntakes a peek behind the curtain of the ages, examining what life was\nlike in the millennia separating the Cognitive Revolution from the\nAgricultural Revolution.\n* Here and in the following pages, when speaking about Sapiens language, I refer to the\nbasic linguistic abilities of our species and not to a particular dialect. English, Hindi and\nChinese are all variants of Sapiens language. Apparently, even at the time of the Cognitive\nRevolution, di\x00erent Sapiens groups had di\x00erent dialects.'])], confident_link=None)
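The raw result above is hard to scan, so here is a minimal sketch of how the failing cases could be pulled out and compared. To keep it runnable without DeepEval or a local model, it uses small stand-in dataclasses (`StubMetricData` and `StubTestResult`, both hypothetical names) that mirror only the fields visible in the printed output. The helpers `summarize_failures` and `expected_is_contained` are illustrative, not part of DeepEval. The substring check is a crude heuristic for spotting cases where the judge may have been too strict.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StubMetricData:
    """Stand-in with a subset of the MetricData fields shown in the output."""
    name: str
    threshold: float
    success: bool
    score: float
    reason: str


@dataclass
class StubTestResult:
    """Stand-in with a subset of the TestResult fields shown in the output."""
    name: str
    success: bool
    metrics_data: List[StubMetricData]
    input: str
    actual_output: str
    expected_output: str


def summarize_failures(test_results) -> List[Dict[str, object]]:
    """Return one row per failed metric of every failed test case."""
    rows = []
    for result in test_results:
        if result.success:
            continue
        for metric in result.metrics_data:
            if not metric.success:
                rows.append({
                    "test": result.name,
                    "metric": metric.name,
                    "score": metric.score,
                    "expected": result.expected_output,
                    "actual": result.actual_output,
                    "reason": metric.reason,
                })
    return rows


def expected_is_contained(row: Dict[str, object]) -> bool:
    """Heuristic: does the actual answer literally contain the expected one?"""
    return str(row["expected"]).lower() in str(row["actual"]).lower()


# The two failing cases, copied from the output above (reasons shortened).
stub_results = [
    StubTestResult(
        name="test_case_0",
        success=False,
        metrics_data=[StubMetricData(
            "Correctness (GEval)", 0.5, False, 0.0,
            "Expected output was a numeric value (1847), but actual output is a sentence.",
        )],
        input="In what year did British train companies agree to use "
              "Greenwich Observatory time for all timetables?",
        actual_output="British train companies agreed to use Greenwich "
                      "Observatory time for all timetables in 1847.",
        expected_output="1847",
    ),
    StubTestResult(
        name="test_case_7",
        success=False,
        metrics_data=[StubMetricData(
            "Correctness (GEval)", 0.5, False, 0.0,
            "The actual output does not provide a percentage as expected.",
        )],
        input="What percentage of China's population sees themselves and "
              "is seen by others as Han?",
        actual_output="The provided context does not contain information about "
                      "the percentage of China's population that identifies as Han.",
        expected_output="More than 90 percent",
    ),
]

for row in summarize_failures(stub_results):
    verdict = "answer contains expected text" if expected_is_contained(row) else "answer misses expected text"
    print(f"{row['test']}: {row['metric']} score={row['score']} ({verdict})")
```

The two failures are of different kinds. In `test_case_0` the answer does contain the expected year, so the zero score looks like the local judge penalising the sentence format rather than a wrong answer. In `test_case_7` the retrieved chunks do not contain the passage about the Han, so the model could not answer, which points to retrieval rather than generation. With the real library, the same helper should work on the list of `TestResult` objects shown above, assuming the object returned by `evaluate` exposes that list as `test_results`.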