Neo4j’s take on graph RAG

Neo4j has, for obvious reasons, pushed the graph RAG and knowledge graph momentum under the 'GenAI' banner, later rebranded as Neo4j GraphRAG. Vector support in Neo4j predates this effort, but it was a necessary prerequisite for the Python package they assembled.

The Python package is at v1.0, and I was curious to explore what they cooked up.

At this point the package is relatively shallow. LlamaIndex and LangGraph have a lot more beef, and the contents of the GraphRAG package are something one can easily replicate. So, this version is a good start, but in comparison to, say, Microsoft GraphRAG they have some catching up to do.

To do any type of RAG you need vectors, and in Neo4j you create a vector index which, unlike other index types, does not store its data in some invisible place but as a property on the node itself. The graph elements carry the index, but the index has to be defined at the database level.

from neo4j import GraphDatabase
from neo4j_graphrag.indexes import create_vector_index

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "123456789")

driver = GraphDatabase.driver(URI, auth=AUTH)

Creating an index is straightforward and you can do it with the following command:

INDEX_NAME = "vector-condition"

create_vector_index(
    driver,
    INDEX_NAME,
    label="Condition",
    embedding_property="vector",
    dimensions=768,
    similarity_fn="euclidean",
)
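
You can verify that the index was registered at the database level (a quick sketch; SHOW VECTOR INDEXES requires a recent Neo4j 5.x, on older versions use SHOW INDEXES and filter on the type column):

# List the vector indexes the database knows about.
records, _, _ = driver.execute_query("SHOW VECTOR INDEXES")
for record in records:
    print(record["name"], record["labelsOrTypes"], record["properties"])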

To populate the index you need to create a vector for every node. Speed is crucial here if you have a large graph.

Locally, you can use the Nomic embedder via Ollama, for instance:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint; the API key is required
# by the client but ignored by Ollama.
client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='nah'
)

def get_vector(text):
    resp = client.embeddings.create(input=text, model="nomic-embed-text")
    return resp.data[0].embedding
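
Since one HTTP call per node adds up quickly, batching helps. The OpenAI embeddings endpoint accepts a list of inputs and returns the embeddings in the same order; a sketch, assuming your Ollama version supports list inputs on /v1/embeddings:

def get_vectors(texts):
    # One request for a whole batch; resp.data preserves the input order.
    resp = client.embeddings.create(input=texts, model="nomic-embed-text")
    return [d.embedding for d in resp.data]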

Inserting the vectors into the graph is done as follows:

from neo4j_graphrag.utils import execute_query
from neo4j_graphrag.indexes import upsert_vector
from tqdm import tqdm

# Top-level await: this runs inside a notebook event loop.
r = await execute_query(driver, "MATCH (n:Condition) RETURN n")

for rec in tqdm(r, desc="Processing vectorization"):
    node_id = rec["n"].element_id
    text = dict(rec["n"])["name"]
    vector = get_vector(text)
    upsert_vector(
        driver,
        node_id=node_id,
        embedding_property="vector",
        vector=vector,
    )
Processing vectorization: 100%|██████████| 1467/1467 [01:45<00:00, 13.87it/s]
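
The writes can be batched too: upsert_vector issues one query per node, whereas a single UNWIND statement sets all vectors in one round trip. A sketch, assuming Neo4j 5.13+ for the db.create.setNodeVectorProperty procedure:

names = [dict(rec["n"])["name"] for rec in r]
vectors = get_vectors(names)  # the batched embedder sketched above
rows = [
    {"id": rec["n"].element_id, "vector": vec}
    for rec, vec in zip(r, vectors)
]
driver.execute_query(
    """
    UNWIND $rows AS row
    MATCH (n:Condition) WHERE elementId(n) = row.id
    CALL db.create.setNodeVectorProperty(n, 'vector', row.vector)
    """,
    rows=rows,
)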

To query the vector index against a given text, you either vectorize the text yourself or pass an embedder. The important bit here is that the vector index itself has no notion of how to create a vector, only how to store and compare vectors. This also means that the vector index is of no use unless you know how the vectors were created.

from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
from neo4j_graphrag.retrievers import VectorRetriever
# Create Embedder object
# Note: An OPENAI_API_KEY environment variable is required here
embedder = OpenAIEmbeddings(base_url='http://localhost:11434/v1/', model="nomic-embed-text")

# Initialize the retriever
retriever = VectorRetriever(driver, INDEX_NAME, embedder)

# Run the similarity search
query_text = "Raymond's disease"
response = retriever.search(query_text=query_text, top_k=5)

for item in response.items:
    print(item.metadata)
{'score': 0.6420621871948242, 'nodeLabels': ['Condition'], 'id': '4:b0113d3f-aa2d-4abf-ba51-bb3ed228a437:1105'}
{'score': 0.636494517326355, 'nodeLabels': ['Condition'], 'id': '4:b0113d3f-aa2d-4abf-ba51-bb3ed228a437:999'}
{'score': 0.6272282004356384, 'nodeLabels': ['Condition'], 'id': '4:b0113d3f-aa2d-4abf-ba51-bb3ed228a437:1183'}
{'score': 0.6109791398048401, 'nodeLabels': ['Condition'], 'id': '4:b0113d3f-aa2d-4abf-ba51-bb3ed228a437:819'}
{'score': 0.6109533309936523, 'nodeLabels': ['Condition'], 'id': '4:b0113d3f-aa2d-4abf-ba51-bb3ed228a437:1417'}
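
Under the hood the retriever embeds the question and hands the raw vector to the index procedure. Doing the same by hand makes it explicit that the index only ever sees a finished vector; a sketch using db.index.vector.queryNodes:

query_vector = get_vector("Raymond's disease")
records, _, _ = driver.execute_query(
    """
    CALL db.index.vector.queryNodes($index_name, $top_k, $vector)
    YIELD node, score
    RETURN node.name AS name, score
    """,
    index_name=INDEX_NAME, top_k=5, vector=query_vector,
)
for record in records:
    print(record["name"], record["score"])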

Another important aspect is that querying happens against a given index. This is true for any index but can be an annoyance in a bot context: you can potentially have many indexes, and you need to know which one to query, or query them all and keep the best scores.
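
A minimal sketch of that 'query them all' strategy; the index names besides vector-condition are hypothetical:

candidate_indexes = ["vector-condition", "vector-drug", "vector-treatment"]

hits = []
for index_name in candidate_indexes:
    res = VectorRetriever(driver, index_name, embedder).search(
        query_text="Raymond's disease", top_k=3
    )
    hits.extend((item.metadata["score"], index_name, item) for item in res.items)

# Keep the globally best-scoring hits across all indexes.
hits.sort(key=lambda h: h[0], reverse=True)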

The above assumes you have a KG in place. The process of creating one via Neo4j GraphRAG is highlighted below.

If similarity search turns up a node, you might want to collect more info from its topology/neighborhood:

from neo4j_graphrag.retrievers import VectorCypherRetriever

retrieval_query = """
MATCH (node)-[:HAS_TREATMENT]->(t:Treatment)-[:HAS_DRUG]->(d:Drug)
RETURN DISTINCT node.name AS condition, d.name AS drug
"""
retriever = VectorCypherRetriever(
    driver, INDEX_NAME, retrieval_query, embedder
)
response = retriever.search(query_text="Raymond's disease", top_k=5)
for u in response.items:
    print(u.content)
<Record condition='Bartter disease' drug='Enalapril'>
<Record condition='heart disease' drug='Clopidogrel'>
<Record condition='heart disease' drug='Warfarin'>
<Record condition='heart disease' drug='Furosemide'>
<Record condition='heart disease' drug='Acetylsalicylic acid'>
<Record condition='heart disease' drug='Metoprolol'>
<Record condition='heart disease' drug='Amiodarone'>
<Record condition='heart disease' drug='Valsartan'>
<Record condition='heart disease' drug='Palivizumab'>
<Record condition='heart disease' drug='Apixaban'>
<Record condition='heart disease' drug='Dofetilide'>
<Record condition='heart disease' drug='Rosuvastatin'>
<Record condition='heart disease' drug='Rivaroxaban'>
<Record condition='heart disease' drug='Evolocumab'>
<Record condition='heart disease' drug='Dabigatran etexilate'>
<Record condition='heart disease' drug='Ticagrelor'>
<Record condition='heart disease' drug='Carvedilol'>
<Record condition='heart disease' drug='Atorvastatin'>
<Record condition='Raynaud disease' drug='Prazosin'>
<Record condition='hemorrhagic disease' drug='Desmopressin'>
<Record condition="Behçet's disease" drug='Infliximab'>
<Record condition="Behçet's disease" drug='Apremilast'>
<Record condition="Behçet's disease" drug='Etanercept'>
<Record condition="Behçet's disease" drug='Thalidomide'>
<Record condition="Behçet's disease" drug='Cyclosporine'>
<Record condition="Behçet's disease" drug='Adalimumab'>

This is just a convenience: one can easily loop over the vector search results and do the same with a plain Cypher query.
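
For instance, a minimal sketch of that manual loop, reusing the element ids the vector search returns:

vector_hits = VectorRetriever(driver, INDEX_NAME, embedder).search(
    query_text="Raymond's disease", top_k=5
)
for item in vector_hits.items:
    records, _, _ = driver.execute_query(
        """
        MATCH (c:Condition) WHERE elementId(c) = $id
        MATCH (c)-[:HAS_TREATMENT]->(:Treatment)-[:HAS_DRUG]->(d:Drug)
        RETURN DISTINCT c.name AS condition, d.name AS drug
        """,
        id=item.metadata["id"],
    )
    for record in records:
        print(record["condition"], record["drug"])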

Asking direct questions with automatic schema sniffing can be done via the following:

from neo4j_graphrag.retrievers import Text2CypherRetriever
from neo4j_graphrag.llm import OpenAILLM

# Llama seems to have difficulties with this, but Qwen does it.
llm = OpenAILLM(
    base_url='http://localhost:11434/v1/',
    model_name="qwen2.5:14b",
    # model_name="gpt-4o-mini"
)

retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,  # type: ignore
    neo4j_schema=None,
    examples=None
)

response = retriever.search(query_text="Raymond's disease")

Under the hood this produces a Cypher query, which is executed as part of the call:

MATCH (c:Condition)-[:HAS_TREATMENT]->(t:Treatment)-[:HAS_DRUG]->(d:Drug)
WHERE c.name = "Raymond\'s disease"
RETURN c.id AS condition_id, c.name AS condition_name, d.id AS drug_id, d.name AS drug_name, t.source_id AS treatment_source_id

Note that this has nothing to do with vectors; it's purely an LLM converting a question into a query via clever prompting.
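
If a model struggles with the sniffed schema, the retriever also accepts an explicit schema string and few-shot examples. A sketch; both strings below are illustrative for this graph, not part of the original run:

schema = (
    "Node properties: Condition {name: STRING}, Drug {name: STRING}\n"
    "Relationships: (:Condition)-[:HAS_TREATMENT]->(:Treatment), "
    "(:Treatment)-[:HAS_DRUG]->(:Drug)"
)
examples = [
    "USER INPUT: 'Which drugs treat heart disease?' "
    "QUERY: MATCH (c:Condition {name: 'heart disease'})-[:HAS_TREATMENT]->"
    "(:Treatment)-[:HAS_DRUG]->(d:Drug) RETURN d.name"
]

retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,
    neo4j_schema=schema,
    examples=examples,
)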

On the ingestion side the package has a straightforward pipeline which chunks a given PDF and turns the extracted entities into a KG. It's nice but, again, nothing you can't assemble yourself (a sketch of the end-to-end pipeline follows at the end of this section).

import nest_asyncio
nest_asyncio.apply()

from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader
pdf_loader = PdfLoader()
found = await pdf_loader.run("/Users/swa/Desktop/AI/PdfKnowledgeBase/selfrag.pdf")

Services like LlamaParse are a lot more refined than this loader, and you can hook up your own here if you wish.

import json

json.loads(found.json())
{'text': "Published as a conference paper at ICLR 2024\nSELF-RAG: LEARNING TO RETRIEVE , GENERATE ,AND\nCRITIQUE THROUGH SELF-REFLECTION\nAkari Asai†, Zeqiu Wu†, Yizhong Wang†§, Avirup Sil‡, Hannaneh Hajishirzi†§\n†University of Washington§Allen Institute for AI‡IBM Research AI\n{akari,zeqiuwu1,yizhongw,hannaneh }@cs.washington.edu ,avi@us.ibm.com\nABSTRACT\nDespite their remarkable capabilities, large language models (LLMs) often produce\nresponses containing factual inaccuracies due to their sole reliance on the paramet-\nric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad\nhoc approach that augments LMs with retrieval of relevant knowledge, decreases\nsuch issues. However, indiscriminately retrieving and incorporating a fixed number\nof retrieved passages, regardless of whether retrieval is necessary, or passages are\nrelevant, diminishes LM versatility or can lead to unhelpful response generation.\nWe introduce a new framework called Self-Reflective Retrieval-Augmented Gen-\neration ( SELF-RAG)that enhances an LM’s quality and factuality through retrieval\nand self-reflection. Our framework trains a single arbitrary LM that adaptively\nretrieves passages on-demand, and generates and reflects on retrieved passages\nand its own generations using special tokens, called reflection tokens. Generating\nreflection tokens makes the LM controllable during the inference phase, enabling it\nto tailor its behavior to diverse task requirements. Experiments show that SELF-\nRAG(7B and 13B parameters) significantly outperforms state-of-the-art LLMs\nand retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG\noutperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA,\nreasoning and fact verification tasks, and it shows significant gains in improving\nfactuality and citation accuracy for long-form generations relative to these models.1\n1 I NTRODUCTION\nState-of-the-art LLMs continue to struggle with factual errors (Mallen et al., 2023; Min et al., 2023)\ndespite their increased model and data scale (Ouyang et al., 2022). Retrieval-Augmented Generation\n(RAG) methods (Figure 1 left; Lewis et al. 2020; Guu et al. 2020) augment the input of LLMs\nwith relevant retrieved passages, reducing factual errors in knowledge-intensive tasks (Ram et al.,\n2023; Asai et al., 2023a). However, these methods may hinder the versatility of LLMs or introduce\nunnecessary or off-topic passages that lead to low-quality generations (Shi et al., 2023) since they\nretrieve passages indiscriminately regardless of whether the factual grounding is helpful. Moreover,\nthe output is not guaranteed to be consistent with retrieved relevant passages (Gao et al., 2023) since\nthe models are not explicitly trained to leverage and follow facts from provided passages.\nThis work introduces Self-Reflective Retrieval-augmented Generation ( SELF-RAG)to improve an\nLLM’s generation quality, including its factual accuracy without hurting its versatility, via on-demand\nretrieval and self-reflection. We train an arbitrary LM in an end-to-end manner to learn to reflect on\nits own generation process given a task input by generating both task output and intermittent special\ntokens (i.e., reflection tokens ). Reflection tokens are categorized into retrieval andcritique tokens to\nindicate the need for retrieval and its generation quality respectively (Figure 1 right). 
In particular,\ngiven an input prompt and preceding generations, SELF-RAGfirst determines if augmenting the\ncontinued generation with retrieved passages would be helpful. If so, it outputs a retrieval token that\ncalls a retriever model on demand (Step 1). Subsequently, SELF-RAGconcurrently processes multiple\nretrieved passages, evaluating their relevance and then generating corresponding task outputs (Step\n2). It then generates critique tokens to criticize its own output and choose best one (Step 3) in terms\nof factuality and overall quality. This process differs from conventional RAG (Figure 1 left), which\n1Our code and trained models are available at https://selfrag.github.io/ .\n1\nPublished as a conference paper at ICLR 2024\nStep 1: Retrieve K documentsCalifornia was named after a fictional island in a Spanish book. Prompt How did US states get their names? \nUS states got their names from a variety of sources. Eleven states are named after an individual person (e.g, California was named after Christopher Columbus). Some states including Texas and Utah, are named after Native American tribe.\nRetrieval-Augmented Generation (RAG)Ours: Self-reflective Retrieval-Augmented Generation (Self-RAG) \nPopular names by states. In Texas, Emma is a popular baby name. Of the fifty states, eleven are named after an individual person. \nPrompt How did US states get their names? + Step 2: Prompt LM with K docs and generateRetriever\nLM\nPrompt How did US states get their names? US states got their names from a variety of sources. RetrieveStep 1: Retrieve on demand  \nPrompt +  \n11 of 50 state namesRelevant\nStep 2: Generate segment in parallel \ncome from persons.SupportedIrrelevantTexas is namedafter a Native American tribe. Step 3: Critique outputs and select best segmentorigins in a 16th-century novel Las Sergas de Esplandián. California's name has itsRelevantPartially\nUS states got their names from a variety of sources. 11 of 50 states names are come from persons.    26 states are named after Native Americans, including Utah. \nPrompt: Write an essay of your best summer vacation\nPrompt: Write an essay of your best summer vacation\nNo RetrievalMy best summer vacation is when my family and I embarked on a road trip along …My best… \n>Repeat.…\nNo information in passagesContradictory>Prompt +  \nPrompt +  \nRetrieve\nFigure 1: Overview of SELF-RAG.SELF-RAGlearns to retrieve, critique, and generate text passages\nto enhance overall generation quality, factuality, and verifiability.\nconsistently retrieves a fixed number of documents for generation regardless of the retrieval necessity\n(e.g., the bottom figure example does not require factual knowledge) and never second visits the\ngeneration quality. Moreover, SELF-RAGprovides citations for each segment with its self-assessment\nof whether the output is supported by the passage, leading to easier fact verification.\nSELF-RAGtrains an arbitrary LM to generate text with reflection tokens by unifying them as the\nnext token prediction from the expanded model vocabulary. We train our generator LM on a diverse\ncollection of text interleaved with reflection tokens and retrieved passages. Reflection tokens, inspired\nby reward models used in reinforcement learning (Ziegler et al., 2019; Ouyang et al., 2022), are\ninserted offline into the original corpus by a trained critic model. This eliminates the need to host a\ncritic model during training, reducing overhead. 
The critic model, in part, is supervised on a dataset\nof input, output, and corresponding reflection tokens collected by prompting a propriety LM (i.e.,\nGPT-4; OpenAI 2023). While we draw inspiration from studies that use control tokens to start and\nguide text generation (Lu et al., 2022; Keskar et al., 2019), our trained LM uses critique tokens to\nassess its own predictions after each generated segment as an integral part of the generation output.\nSELF-RAGfurther enables a customizable decoding algorithm to satisfy hard or soft constraints,\nwhich are defined by reflection token predictions. In particular, our inference-time algorithm enables\nus to (1) flexibly adjust retrieval frequency for different downstream applications and (2) customize\nmodels’ behaviors to user preferences by leveraging reflection tokens through segment-level beam\nsearch using the weighted linear sum of the reflection token probabilities as segment score.\nEmpirical results on six tasks, including reasoning and long-form generation, demonstrate that SELF-\nRAGsignificantly outperforms pre-trained and instruction-tuned LLMs that have more parameters and\nwidely adopted RAG approaches with higher citation accuracy. In particular, SELF-RAGoutperforms\nretrieval-augmented ChatGPT on four tasks, Llama2-chat (Touvron et al., 2023) and Alpaca (Dubois\net al., 2023) on all tasks. Our analysis demonstrates the effectiveness of training and inference with\nreflection tokens for overall performance improvements as well as test-time model customizations\n(e.g., balancing the trade-off between citation previsions and completeness).\n2 R ELATED WORK\nRetrieval-Augmented Generation. Retrieval-Augmented Generation (RAG) augments the input\nspace of LMs with retrieved text passages (Guu et al., 2020; Lewis et al., 2020), leading to large\nimprovements in knowledge-intensive tasks after fine-tuning or used with off-the-shelf LMs (Ram\net al., 2023). A more recent work (Luo et al., 2023) instruction-tunes an LM with a fixed number\n2\nPublished as a conference paper at ICLR 2024\nType Input Output Definitions\nRetrieve x/x, y {yes, no, continue } Decides when to retrieve with R\nISREL x, d {relevant , irrelevant } dprovides useful information to solve x.\nISSUP x, d, y {fully supported , partially\nsupported, no support }All of the verification-worthy statement in y\nis supported by d.\nISUSE x, y {5, 4, 3, 2, 1 } yis a useful response to x.\nTable 1: Four types of reflection tokens used in SELF-RAG. Each type uses several tokens to represent\nits output values. The bottom three rows are three types of Critique tokens, and the bold text indicates\nthe most desirable critique tokens. x, y, d indicate input, output, and a relevant passage, respectively.\nof retrieved passages prepended to input, or pre-train a retriever and LM jointly, followed by few-\nshot fine-tuning on task datasets (Izacard et al., 2022b). While prior work often retrieves only\nonce at the beginning, Jiang et al. (2023) propose to adaptively retrieve passages for generation\non top of a proprietary LLM or Schick et al. (2023) train an LM to generate API calls for named\nentities. Yet, the improved task performance of such approaches often comes at the expense of\nrun-time efficiency (Mallen et al., 2023), robustness to irrelevant context (Shi et al., 2023), and lack\nof attributions (Liu et al., 2023a; Gao et al., 2023). 
We introduce a method to train an arbitrary LM to\nlearn to use retrieval on-demand for diverse instruction-following queries and introduce controlled\ngeneration guided by reflections tokens to further improve generation quality and attributions.\nTraining and generating with critics. Training LLMs with reinforcement learning (e.g., Proximal\nPolicy Optimization or PPO; Schulman et al. 2017) from human feedback (RLHF) has proven\neffective in aligning LLMs with human preferences (Ouyang et al., 2022; Wu et al., 2023). Though\nour work also studies fine-grained critique on retrieval and generation, we train our target LM on task\nexamples augmented with reflection tokens from a critic model offline, with a far lower training cost\ncompared to RLHF. Compared to prior work using control tokens to guide LM generation (Lu et al.,\n2022; Korbak et al., 2023), SELF-RAGuses reflection tokens to decide the need for retrieval and to\nself-evaluate generation quality.\n3 S ELF-RAG: LEARNING TO RETRIEVE , GENERATE AND CRITIQUE\nWe introduce Self-Reflective Retrieval-Augmented Generation ( SELF-RAG), shown in Figure 1.\nSELF-RAGis a framework that enhances the quality and factuality of an LLM through retrieval and\nself-reflection, without sacrificing LLM’s original creativity and versatility. Our end-to-end training\nlets an LM Mgenerate text informed by retrieved passages, if needed, and criticize the output by\nlearning to generate special tokens. These reflection tokens (Table 1) signal the need for retrieval\nor confirm the output’s relevance, support, or completeness. In contrast, common RAG approaches\nretrieve passages indiscriminately, without ensuring complete support from cited sources.\n3.1 P ROBLEM FORMALIZATION AND OVERVIEW\nFormally, given input x, we train Mto sequentially generate textual outputs yconsisting of multiple\nsegments y= [y1, . . . , y T], where ytindicates a sequence of tokens for the t-th segment.2Generated\ntokens in ytinclude text from the original vocabulary as well as the reflection tokens (Table 1).\nInference overview. Figure 1 and Algorithm 1 present an overview of S ELF-RAGat inference. For\nevery xand preceding generation y<t, the model decodes a retrieval token to evaluate the utility\nof retrieval. If retrieval is not required, the model predicts the next output segment, as it does in a\nstandard LM. If retrieval is needed, the model generates: a critique token to evaluate the retrieved\npassage’s relevance, the next response segment, and a critique token to evaluate if the information in\nthe response segment is supported by the passage. Finally, a new critique token evaluates the overall\nutility of the response.3To generate each segment, SELF-RAGprocesses multiple passages in parallel\nand uses its own generated reflection tokens to enforce soft constraints (Section 3.3) or hard control\n2In this paper, we treat one sentence as a segment in our experiments, but our framework is applicable to any\nsegment unit (i.e., sub-sentence).\n3We follow Liu et al. (2023a) in using a “perceived” utility value that is independent of retrieved passages.\n3\nPublished as a conference paper at ICLR 2024\nAlgorithm 1 SELF-RAGInference\nRequire: Generator LM M, Retriever R, Large-scale passage collections {d1, . . . 
, d N}\n1:Input: input prompt xand preceding generation y<t,Output: next output segment yt\n2:Mpredicts Retrieve given (x, y<t)\n3:ifRetrieve ==Yes then\n4: Retrieve relevant text passages DusingRgiven (x, yt−1) ▷Retrieve\n5: Mpredicts ISRELgiven x, dandytgiven x, d, y <tfor each d∈D ▷Generate\n6: Mpredicts ISSUPand ISUSEgiven x, yt, dfor each d∈D ▷Critique\n7: Rank ytbased on ISREL,ISSUP,ISUSE ▷Detailed in Section 3.3\n8:else if Retrieve ==Nothen\n9: Mgenpredicts ytgiven x ▷ Generate\n10: Mgenpredicts ISUSEgiven x, yt ▷Critique\n(Algorithm 1) over the generated task output. For instance, in Figure 1 (right), the retrieved passages\nd1is selected at the first time step since d2does not provide direct evidence ( ISRELis Irrelevant)\nandd3output is only partially supported while d1are fully supported.\nTraining overview. SELF-RAGenables an arbitrary LM to generate text with reflection tokens\nby unifying them as next token predictions from the expanded model vocabulary (i.e., the original\nvocabulary plus reflection tokens). Specifically, we train the generator model Mon a curated corpus\nwith interleaving passages retrieved by a retriever Rand reflection tokens predicted by a critic model\nC(summarized in Appendix Algorithm 2). We train Cto generate reflection tokens for evaluating\nretrieved passages and the quality of a given task output (Section 3.2.1). Using the critic model, we\nupdate the training corpus by inserting reflection tokens into task outputs offline. Subsequently, we\ntrain the final generator model ( M) using the conventional LM objective (Section 3.2.2) to enable\nMto generate reflection tokens by itself without relying on the critic at inference time.\n3.2 S ELF-RAGTRAINING\nHere, we describe the supervised data collection and training of two models, the critic C(Section 3.2.1)\nand the generator M(Section 3.2.2).\n3.2.1 T RAINING THE CRITIC MODEL\nData collection for critic model. Manual annotation of reflection tokens for each segment is\nexpensive (Wu et al., 2023). A state-of-the-art LLM like GPT-4 (OpenAI, 2023) can be effectively\nused to generate such feedback (Liu et al., 2023b). However, depending on such proprietary LMs\ncan raise API costs and diminish reproducibility (Chen et al., 2023). Our method requires fine-\ngrained evaluations on multiple passages as well as segments for each input-output instance from the\ntraining dataset, increasing the number of evaluations required to generate SELF-RAGtraining data\nexponentially. To overcome those issues, we create supervised data by prompting GPT-4 to generate\nreflection tokens and then distill their knowledge into an in-house C. For each group of reflection\ntokens, we randomly sample instances from the original training data: {Xsample, Ysample} ∼\n{X, Y}. As different reflection token groups have their definitions and input, as shown in Table 1,\nwe use different instruction prompts for them. Here, we use Retrieve as an example. We prompt\nGPT-4 with a type-specific instruction (“Given an instruction, make a judgment on whether finding\nsome external documents from the web helps to generate a better response.”) followed by few-shot\ndemonstrations Ithe original task input xand output yto predict an appropriate reflection token\nas text: p(r|I, x, y ). Manual assessment reveals that GPT-4 reflection token predictions show high\nagreement with human evaluations. We collect 4k-20k supervised training data for each type and\ncombine them to form training data for C. 
Appendix Section D shows the full list of instructions, and\nA.1 contains more details and our analysis.\nCritic learning. After we collect training data Dcritic , we initialize Cwith a pre-trained LM and\ntrain it on Dcritic using a standard conditional language modeling objective, maximizing likelihood:\nmax\nCE((x,y),r)∼Dcritic logpC(r|x, y), rfor reflection tokens. (1)\n4\nPublished as a conference paper at ICLR 2024\nInput: How did US states get their names? Input: Write an essay of your best summer vacationOutput: My best summer vacation was a magical escape to the coastal town of Santorini. The azure waters, charming white-washed building are unforgettable. \nCritic LMOutput: 1 of 50 states names come from persons. For instance, Louisiana was named in honor of King Louis XIV of France and Georgia was named after King George II. \nRetrieve\nPartially\nAugmented Output:                Retrieve<p>LOUISIANA: Named in<p>Of the fifty states, eleven are named after an individual person</p>.               11 of 50 states’ names come from person. RelevantSupportedhonor of Louis XIV of France.</p>.  RelevantFor instance, Louisiana was named after King Louis XIV, andUtil: 5Georgia was named after King George II. \nUtil: 5Augmented Output:                     My best summer vacation was a magical escape to the coastal town of Santorini.                     The azure waters, charming white-washed building are unforgettable experience.No RetrievalNo Retrieval\nRetriever\nFigure 2: SELF-RAGtraining examples. The left example does not require retrieval while the right\none requires retrieval; thus, passages are inserted. More examples are in Appendix Table 4.\nThough the initial model can be any pre-trained LM, we use the same one as the generator LM\n(i.e., Llama 2-7B; Touvron et al. 2023) for Cinitialization. The critic achieves a higher than 90%\nagreement with GPT-4-based predictions on most reflection token categories (Appendix Table 4).\n3.2.2 T RAINING THE GENERATOR MODEL\nData collection for generator. Given an input-output pair (x, y), we augment the original output\nyusing the retrieval and critic models to create supervised data that precisely mimics the SELF-\nRAGinference-time process (Section 3.1). For each segment yt∈y, we run Cto assess whether\nadditional passages could help to enhance generation. If retrieval is required, the retrieval special\ntoken Retrieve =Yes is added, and Rretrieves the top Kpassages, D. For each passage, Cfurther\nevaluates whether the passage is relevant and predicts ISREL. If a passage is relevant, Cfurther\nevaluates whether the passage supports the model generation and predicts ISSUP. Critique tokens\nISRELand ISSUPare appended after the retrieved passage or generations. At the end of the output, y\n(oryT),Cpredicts the overall utility token ISUSE, and an augmented output with reflection tokens\nand the original input pair is added to Dgen. See the example training data in Figure 2.\nGenerator learning. We train the generator model Mby training on the curated corpus augmented\nwith reflection tokens Dgenusing the standard next token objective:\nmax\nME(x,y,r )∼DgenlogpM(y, r|x). (2)\nUnlike Ctraining (Eq. 1), Mlearns to predict the target output as well as the reflection tokens. During\ntraining, we mask out the retrieved text chunks (surrounded by <p> and</p> in Figure 2) for loss\ncalculation and expand the original vocabulary Vwith a set of reflection tokens {Critique ,Retrieve}.\nConnections to prior work on learning with critique. 
Recent work incorporates additional\ncritique (feedback) during training, e.g., RLHF (Ouyang et al. 2022) via PPO. While PPO relies on\nseparate reward models during training, we compute critique offline and directly insert them into the\ntraining corpus, where the generator LM is trained with a standard LM objective. This significantly\nreduces training costs compared to PPO. Our work also relates to prior work that incorporates special\ntokens to control generation (Keskar et al., 2019; Lu et al., 2022; Korbak et al., 2023). Our SELF-RAG\nlearns to generate special tokens to evaluate its own prediction after each generated segment, enabling\nthe use of a soft re-ranking mechanism or hard constraints at inference (discussed next).\n3.3 S ELF-RAGINFERENCE\nGenerating reflection tokens to self-evaluate its own output makes SELF-RAGcontrollable during the\ninference phase, enabling it to tailor its behavior to diverse task requirements. For tasks demanding\nfactual accuracy (Min et al., 2023), we aim for the model to retrieve passages more frequently to\nensure that the output aligns closely with the available evidence. Conversely, in more open-ended\ntasks, like composing a personal experience essay, the emphasis shifts towards retrieving less and\nprioritizing the overall creativity or utility score. In this section, we describe approaches to enforce\ncontrol to meet these distinct objectives during the inference process.\nAdaptive retrieval with threshold. SELF-RAGdynamically decides when to retrieve text passages by\npredicting Retrieve . Alternatively, our framework allows a threshold to be set. Specifically, if the prob-\n5\nPublished as a conference paper at ICLR 2024\nability of generating the Retrieve =Yes token normalized over all output tokens in Retrieve surpasses a\ndesignated threshold, we trigger retrieval (details in Appendix Section A.4).\nTree-decoding with critique tokens. At each segment step t, when retrieval is required, based either\non hard or soft conditions, Rretrieves Kpassages, and the generator Mprocesses each passage in\nparallel and outputs Kdifferent continuation candidates. We conduct a segment-level beam search\n(with the beam size= B) to obtain the top- Bsegment continuations at each timestamp t, and return\nthe best sequence at the end of generation. The score of each segment ytwith respect to passage dis\nupdated with a critic score Sthat is the linear weighted sum of the normalized probability of each\nCritique token type. For each critique token group G(e.g., ISREL), we denote its score at timestamp\ntassG\nt, and we compute a segment score as follows:\nf(yt, d, Critique ) =p(yt|x, d, y <t)) +S(Critique ),where (3)\nS(Critique ) =X\nG∈GwGsG\ntforG={ISREL,ISSUP,ISUSE}, (4)\nwhere sG\nt=pt(ˆr)PNG\ni=1pt(ri)stands for the generation probability of the most desirable reflection token\nˆr(e.g., ISREL=Relevant ) for the critique token type GwithNGdistinct tokens (that represent\ndifferent possible values for G). The weights wGin Eq. 4 are hyperparameters that can be adjusted\nat inference time to enable customized behaviors at test time. For instance, to ensure that result\nyis mostly supported by evidence, we can set a weight term for the ISSUPscore higher, while\nrelatively lowering weights for other aspects. 
Alternatively, we could further enforce hard constraints\nduring decoding using Critique e.g., filtering out a segment continuation when the model generates an\nundesirable token (e.g., ISSUP=No support ).\n4 E XPERIMENTS\n4.1 T ASKS AND DATASETS\nWe conduct evaluations of our SELF-RAGand diverse baselines on a range of downstream tasks,\nholistically evaluating outputs with metrics designed to assess overall correctness, factuality, and\nfluency. Throughout these experiments, we conduct zero-shot evaluations, where we provide instruc-\ntions describing tasks without few-shot demonstrations (Wei et al., 2022; Sanh et al., 2022). Details of\nour experiments’ settings, including test-time instructions, are available in the Appendix Section B.1.\nClosed-set tasks include two datasets, i.e., a fact verification dataset about public health ( PubHealth ;\nZhang et al. 2023) and a multiple-choice reasoning dataset created from scientific exams ( ARC-\nChallenge ; Clark et al. 2018). We use accuracy as an evaluation metric and report on the test set. We\naggregate the answer probabilities of target classes for both of these datasets (Appendix Section B.2).\nShort-form generations tasks include two open-domain question answering (QA) datasets,\nPopQA (Mallen et al., 2023) and TriviaQA-unfiltered (Joshi et al., 2017), where systems need\nto answer arbitrary questions about factual knowledge. For PopQA, we use the long-tail subset,\nconsisting of 1,399 rare entity queries whose monthly Wikipedia page views are less than 100. As the\nTriviaQA-unfiltered (open) test set is not publicly available, we follow prior work’s validation and\ntest split (Min et al., 2019; Guu et al., 2020), using 11,313 test queries for evaluation. We evaluate\nperformance based on whether gold answers are included in the model generations instead of strictly\nrequiring exact matching, following Mallen et al. (2023); Schick et al. (2023).\nLong-form generation tasks include a biography generation task (Min et al., 2023) and a long-form\nQA task ALCE-ASQA Gao et al. (2023); Stelmakh et al. (2022). We use FactScore (Min et al.,\n2023) to evaluate biographies, and we use official metrics of correctness (str-em), fluency based on\nMAUVE (Pillutla et al., 2021), and citation precision and recall (Gao et al., 2023) for ASQA.4\n4.2 B ASELINES\nBaselines without retrievals. We evaluate strong publicly available pre-trained LLMs,\nLlama2 7B,13B(Touvron et al., 2023), instruction-tuned models, Alpaca 7B,13B(Dubois et al., 2023)\n4https://github.com/princeton-nlp/ALCE\n6\nPublished as a conference paper at ICLR 2024\n(our replication based on Llama2); and models trained and reinforced using private data, Chat-\nGPT (Ouyang et al., 2022) and Llama2-chat 13B. For instruction-tuned LMs, we use the official\nsystem prompt or instruction format used during training if publicly available. We also compare our\nmethod to concurrent work, CoVE 65B(Dhuliawala et al., 2023), which introduces iterative prompt\nengineering to improve the factuality of LLM generations.\nBaselines with retrievals. We evaluate models augmented with retrieval at test time or during training.\nThe first category includes standard RAG baselines, where an LM (Llama2, Alpaca) generates output\ngiven the query prepended with the top retrieved documents using the same retriever as in our system.\nIt also includes Llama2-FT, where Llama2 is fine-tuned on all training data we use without the\nreflection tokens or retrieved passages. 
We also report the result of retrieval-augmented baselines\nwith LMs trained with private data: Ret-ChatGPT and Ret-Llama2-chat, which deploy the same\naugmentation technique above, as well as perplexity.ai, an InstructGPT-based production search\nsystem. The second category includes concurrent methods that are trained with retrieved text\npassages, i.e., SAIL (Luo et al., 2023) to instruction-tune an LM on the Alpaca instruction-tuning\ndata with top retrieved documents inserted before instructions, and Toolformer (Schick et al., 2023)\nto pre-train an LM with API calls (e.g., Wikipedia APIs).5\n4.3 E XPERIMENTAL SETTINGS\nTraining data and settings. Our training data consists of diverse instruction-following input-output\npairs. In particular, we sample instances from Open-Instruct processed data (Wang et al., 2023) and\nknowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018). In\ntotal, we use 150k instruction-output pairs. We use Llama2 7B and 13B (Touvron et al., 2023) as\nour generator base LM, and we use Llama2 7B as our base critic LM. For the retriever model R, we\nuse off-the-shelf Contriever-MS MARCO (Izacard et al., 2022a) by default and retrieve up to ten\ndocuments for each input. More training details are in the Appendix Section B.1.\nInference settings. As a default configuration, we assign the weight terms ISREL,ISSUP,ISUSE\nvalues of 1.0, 1.0 and 0.5, respectively. To encourage frequent retrieval, we set the retrieval threshold\nto 0.2 for most tasks and to 0 for ALCE (Gao et al., 2023) due to citation requirements. We speed\nup inference using vllm (Kwon et al., 2023). At each segment level, we adopt a beam width of 2.\nFor a token-level generation, we use greedy decoding. By default, we use the top five documents\nfrom Contriever-MS MARCO (Izacard et al., 2022a); for biographies and open-domain QA, we\nuse additional top five documents retrieved by a web search engine, following Luo et al. (2023);\nfor ASQA, we use the author-provided top 5 documents by GTR-XXL (Ni et al., 2022) across all\nbaselines for a fair comparison.\n5 R ESULTS AND ANALYSIS\n5.1 M AINRESULTS\nComparison against baselines without retrieval. Table 2 (top) presents the baselines without\nretrieval. Our SELF-RAG(bottom two rows) demonstrates a substantial performance advantage\nover supervised fine-tuned LLMs in all tasks and even outperforms ChatGPT in PubHealth, PopQA,\nbiography generations, and ASQA (Rouge and MAUVE). Our approach also significantly outperforms\na concurrent method that employs sophisticated prompt engineering; specifically, on the bio generation\ntask, our 7B and 13B models outperform the concurrent CoVE (Dhuliawala et al., 2023), which\niteratively prompts Llama2 65Bto refine output.\nComparison against baselines with retrieval. As shown in Tables 2 (bottom), our SELF-RAGalso\noutperforms existing RAG in many tasks, obtaining the best performance among non-proprietary LM-\nbased models on all tasks. Powerful instruction-tuned LMs with retrieval (e.g., LLama2-chat, Alpaca)\nshow large gains from their non-retrieval baselines. However, we found that these baselines provide\nlimited solutions for tasks where we cannot simply copy or extract sub-strings of retrieved passages.\nOn PubHealth and ARC-Challenge, baselines with retrieval do not improve performance notably\nfrom their no-retrieval counterparts. We also observe that most baselines with retrieval struggle to\nimprove citation accuracy. 
On ASQA, our model shows significantly higher citation precision and\n5We report numbers using the results reported in the paper as the implementations are not available.\n7\nPublished as a conference paper at ICLR 2024\nTable 2: Overall experiment results on six tasks. Bold numbers indicate the best performance among\nnon-proprietary models, and gray-colored bold text indicates the best proprietary model when\nthey outperforms all non-proprietary models.∗indicates concurrent or recent results reported by\nconcurrent work. – indicates numbers that are not reported by the original papers or are not applicable.\nModels are sorted based on scale. FS, em, rg, mau, prec, rec denote FactScore (factuality); str-em,\nrouge (correctness); MAUVE (fluency); citation precision and recall, respectively.\nShort-form Closed-set Long-form generations (with citations)\nPopQA TQA Pub ARC Bio ASQA\nLM (acc) (acc) (acc) (acc) (FS) (em) (rg) (mau) (pre) (rec)\nLMs with proprietary data\nLlama2-c 13B 20.0 59.3 49.4 38.4 55.9 22.4 29.6 28.6 – –\nRet-Llama2-c 13B 51.8 59.8 52.1 37.9 79.9 32.8 34.8 43.8 19.8 36.1\nChatGPT 29.3 74.3 70.1 75.3 71.8 35.3 36.2 68.8 – –\nRet-ChatGPT 50.8 65.7 54.7 75.3 – 40.7 39.9 79.7 65.1 76.6\nPerplexity.ai – – – – 71.2 – – – – –\nBaselines without retrieval\nLlama2 7B 14.7 30.5 34.2 21.8 44.5 7.9 15.3 19.0 – –\nAlpaca 7B 23.6 54.5 49.8 45.0 45.8 18.8 29.4 61.7 – –\nLlama2 13B 14.7 38.5 29.4 29.4 53.4 7.2 12.4 16.0 – –\nAlpaca 13B 24.4 61.3 55.5 54.9 50.2 22.9 32.0 70.6 – –\nCoVE 65B* – – – – 71.2 – – – – –\nBaselines with retrieval\nToolformer* 6B – 48.8 – – – – – – – –\nLlama2 7B 38.2 42.5 30.0 48.0 78.0 15.2 22.1 32.0 2.9 4.0\nAlpaca 7B 46.7 64.1 40.2 48.0 76.6 30.9 33.3 57.9 5.5 7.2\nLlama2-FT 7B 48.7 57.3 64.3 65.8 78.2 31.0 35.8 51.2 5.0 7.5\nSAIL* 7B – – 69.2 48.4 – – – – – –\nLlama2 13B 45.7 47.0 30.2 26.0 77.5 16.3 20.5 24.7 2.3 3.6\nAlpaca 13B 46.1 66.9 51.1 57.6 77.7 34.8 36.7 56.6 2.0 3.8\nOur SELF-RAG 7B 54.9 66.4 72.4 67.3 81.2 30.0 35.7 74.3 66.9 67.8\nOur SELF-RAG 13B 55.8 69.3 74.5 73.1 80.2 31.7 37.0 71.6 70.3 71.3\nPQA Med AS\n(acc) (acc) (em)\nSELF-RAG(50k) 45.5 73.5 32.1\nTraining\nNo Retriever R 43.6 67.8 31.0\nNo Critic C 42.6 72.0 18.1\nTest\nNo retrieval 24.7 73.0 –\nHard constraints 28.3 72.6 –\nRetrieve top1 41.8 73.1 28.6\nRemove ISSUP 44.1 73.2 30.6\n(a) Ablation\n1 270.070.5Precision\n1 2\nWeight for IsSupport9095Mauve\n (b) Customization\n0.0 0.2 0.4 0.60.980.990.991.00Accuracy\nPubHealth\n0.0 0.2 0.4 0.6\nRetrieval Threshold0.60.81.0AccuracyPopQA0.00.51.0\nFrequency\n0.250.500.751.00\nFrequency\n (c) Retrieval\nFigure 3: Analysis on SELF-RAG:(a)Ablation studies for key components of SELF-RAGtraining\nand inference based on our 7B model. (b) Effects of soft weights on ASQA citation precision and\nMauve (fluency). (c) Retrieval frequency andnormalized accuracy on PubHealth and PopQA.\nrecall than all models except ChatGPT. Gao et al. (2023) found that ChatGPT consistently exhibits\nsuperior efficacy in this particular task, surpassing smaller LMs. Our SELF-RAGbridges this\nperformance gap, even outperforming ChatGPT in citation precision, which measures whether the\nmodel-generated claim is fully supported by cited evidence. Llama2-FT 7B, which is the baseline\nLM trained on the same instruction-output pairs as SELF-RAGwithout retrieval or self-reflection and\nis retrieval-augmented at test time only, lags behind SELF-RAG. 
This result indicates SELF-RAG\ngains are not solely from training data and demonstrate the effectiveness of SELF-RAGframework.\n8\nPublished as a conference paper at ICLR 2024\n5.2 A NALYSIS\nAblation studies. We conduct a set of ablations of our framework to identify which factors play\nkey roles. We evaluate two model variants trained differently than our model: No Retriever trains an\nLM using the standard instruction-following method given instruction-output pairs, without retrieved\npassages; No Critic trains an LM trained with input-output pairs that are always augmented with the\ntop one retrieved document without reflection tokens. This is similar to SAIL (Luo et al., 2023), and\nwe use our instruction-output data instead of using the Alpaca dataset (Dubois et al., 2023), as in\nSAIL. We also conduct ablation on our inference-time algorithm, including No retrieval disables\nretrieval during inference; Hard constraints indicates the model performance that retrieves when\nRetrieve =Yes instead of using the adaptive threshold; Retrieve top 1 always retrieves and uses the\ntop one document only, similar to standard RAG approaches; Remove ISSUPindicates the model\nperformance that removes ISSUPscore only during critique-guided beam search in Eq. 4. In this\nablation experiment, we use a training instance size of 50k for a more efficient exploration of training\nvariations. Later in this section, we conduct an analysis of the effect of training data size. We conduct\nthe ablation studies on three datasets, PopQA, PubHealth, and ASQA. On ASQA, we evaluate models\non sampled 150 instances and exclude ablations involving adaptive or no retrieval processes.\nWe show in Table 3a the ablation results. The top part of the table shows results for training ablations,\nand the bottom part is for inference ablations. We see that all components play important roles. We\nalso observe a large performance gap between SELF-RAGand No Retriever or Critic baselines across\ntasks, indicating that training an LM with those models largely contributes to the performance gain of\nSELF-RAG. Using the top passages regardless of their relevance (Retrieve top 1) as in conventional\nRAG approaches causes a large drop in PopQA and ASQA, and removing ISSUPduring the beam\nsearch results hurts performance on ASQA. This demonstrates the effectiveness of SELF-RAG’s\ncapabilities of carefully selecting generations based on fine-grained multiple criteria, instead of\nnaively using all passages from the retrieval model or solely depending on relevance scores.\nEffects of inference-time customization. One key benefit of our proposed framework is that it\nenables us to control how much each critique type affects the final generation sampling. We analyze\nthe effects of different parameter weights on the top of our 7B model during inference time on\nASQA, where multiple evaluation aspects are considered. Figure 3b shows the effects of changing\nthe weighting term for ISSUP, which criticizes how supported the output is by the text passage. As\nthe figure shows, increasing the weight leads to positive effects on the models’ citation precision\nsince this puts more emphasis on whether model generation is supported by the evidence. On the\ncontrary, a larger weight results in lower MAUVE scores: when generation gets longer and more\nfluent, there are often more claims that are not fully supported by citations, consistent with findings\nby Liu et al. (2023a). 
Our framework lets practitioners choose and customize models’ behaviors at\ntest time by adjusting such parameters without requiring additional training.\nEfficiency and accuracy trade-off. Using our framework, practitioners can adjust how often retrieval\noccurs using the token probability of reward tokens. We evaluate how this adaptive threshold affects\nthe overall accuracy and frequency of retrieval, and we evaluate the performance with varying\nnumbers of threshold δ(larger δresults in less retrieval) on PubHealth and PopQA. Figure 3c shows\nthat the model’s retrieval frequencies dramatically change on both datasets. as δvaries. On one hand,\nperformance deterioration by retrieving less is smaller on PubHealth but larger in PopQA.\n6 C ONCLUSION\nThis work introduces SELF-RAG, a new framework to enhance the quality and factuality of LLMs\nthrough retrieval on demand and self-reflection. SELF-RAGtrains an LM to learn to retrieve, generate,\nand critique text passages and its own generation by predicting the next tokens from its original\nvocabulary as well as newly added special tokens, called reflection tokens. SELF-RAGfurther enables\nthe tailoring of LM behaviors at test time by leveraging reflection tokens. Our holistic evaluations on\nsix tasks using multiple metrics demonstrate that SELF-RAGsignificantly outperforms LLMs with\nmore parameters or with conventional retrieval-augmented generation approaches.\n9\nPublished as a conference paper at ICLR 2024\nETHICAL CONCERNS\nThis work aims to improve the factuality of LLM outputs, the lack of which continues to cause nu-\nmerous real-world problems (e.g., spread of misinformation and provision of incorrect and dangerous\nadvice). While our method shows significant improvements in terms of performance, factuality, and\ncitation accuracy, it can still generate outputs that are not fully supported by the citations. We hope\nthat explicit self-reflection and fine-grained attribution may help users verify factual errors in the\nmodel outputs.\nACKNOWLEDGMENTS\nWe thank Sewon Min, Scott Wen-tau Yih, Sean Welleck, and Kawin Ethayarajh for fruitful discussions\nin the early stages of this work. We thank Sewon Min, Joongwon (Daniel) Kim, and Sandy Kaplan\nfor valuable feedback on the paper, and Tianyu Gao and Weijia Shi for their help on evaluations.\nAkari Asai is supported by the IBM Fellowship. We thank Stability AI for providing computing\nto train and evaluate the LMs in this work, and Microsoft Accelerate Foundation Models Research\nProgram for the access to OpenAI APIs. This work was funded in part by the DARPA MCS program\nthrough NIWC Pacific (N66001-19-2-4031), NSF IIS-2044660, and gifts from AI2.\nREFERENCES\nAkari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. Learn-\ning to retrieve reasoning paths over wikipedia graph for question answering. In International\nConference on Learning Representations , 2020. URL https://openreview.net/forum?\nid=SJgVHkrYDH .\nAkari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. Retrieval-based language models and appli-\ncations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics\n(Tutorial) , 2023a. URL https://aclanthology.org/2023.acl-tutorials.6 .\nAkari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh\nHajishirzi, and Wen-tau Yih. Task-aware retrieval with instructions. In Findings of the Associ-\nation for Computational Linguistics , 2023b. 
URL https://aclanthology.org/2023.\nfindings-acl.225 .\nBernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob\nEisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. Attributed question answering:\nEvaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037 ,\n2022. URL https://arxiv.org/abs/2212.08037 .\nLingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behavior changing over time? arXiv\npreprint arXiv:2307.09009 , 2023. URL https://arxiv.org/abs/2307.09009 .\nPeter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and\nOyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.\narXiv preprint arXiv:1803.05457 , 2018. URL https://arxiv.org/abs/1803.05457 .\nTri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-\nefficient exact attention with io-awareness. In Advances in Neural Information Processing Systems ,\n2022. URL https://openreview.net/forum?id=H4DqfPSibmx .\nShehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and\nJason Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint\narXiv:2309.11495 , 2023. URL https://arxiv.org/abs/2309.11495 .\nEmily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of\nwikipedia: Knowledge-powered conversational agents. In International Conference on Learning\nRepresentations , 2019. URL https://openreview.net/forum?id=r1l73iRqKm .\nYann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin,\nPercy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that\n10\nPublished as a conference paper at ICLR 2024\nlearn from human feedback. arXiv preprint arXiv:2305.14387 , 2023. URL https://arxiv.\norg/abs/2305.14387 .\nTianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate\ntext with citations. arXiv preprint arXiv:2305.14627 , 2023. URL https://arxiv.org/abs/\n2305.14627 .\nKelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented\nlanguage model pre-training. In International Conference on Machine Learning , 2020. URL\nhttps://dl.acm.org/doi/pdf/10.5555/3524938.3525306 .\nGautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand\nJoulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning.\nTransactions on Machine Learning Research , 2022a. URL https://openreview.net/\nforum?id=jKN1pXi7b0 .\nGautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane\nDwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with\nretrieval augmented language models. arXiv preprint arXiv:2208.03299 , 2022b. URL https:\n//arxiv.org/abs/2208.03299 .\nZhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang,\nJamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint\narXiv:2305.06983 , 2023. URL https://arxiv.org/abs/2305.06983 .\nMandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly\nsupervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual\nMeeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2017. 
URL\nhttps://aclanthology.org/P17-1147 .\nNitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher.\nCtrl: A conditional transformer language model for controllable generation. arXiv preprint\narXiv:1909.05858 , 2019. URL https://arxiv.org/abs/1909.05858 .\nTomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason\nPhang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences.\nInInternational Conference on Machine Learning , 2023. URL https://openreview.net/\nforum?id=AT8Iw8KOeC .\nTom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris\nAlberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion\nJones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and\nSlav Petrov. Natural questions: A benchmark for question answering research. Transactions of\nthe Association for Computational Linguistics , 2019. URL https://aclanthology.org/\nQ19-1026 .\nWoosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E.\nGonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model\nserving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating\nSystems Principles , 2023. URL https://arxiv.org/abs/2309.06180 .\nPatrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,\nHeinrich K ¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, Sebastian Riedel, and Douwe Kiela.\nRetrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Infor-\nmation Processing Systems , 2020. URL https://proceedings.neurips.cc/paper/\n2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf .\nXi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez,\nJacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, and Scott Yih. Ra-dit: Retrieval-\naugmented dual instruction tuning, 2023. URL https://arxiv.org/abs/2310.01352 .\nNelson F Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines.\narXiv preprint arXiv:2304.09848 , 2023a. URL https://arxiv.org/abs/2304.09848 .\n11\nPublished as a conference paper at ICLR 2024\nYang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg\nevaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634 , 2023b.\nURLhttps://arxiv.org/abs/2303.16634 .\nXiming Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Am-\nmanabrolu, and Yejin Choi. QUARK: Controllable text generation with reinforced unlearning.\nInAdvances in Neural Information Processing Systems , 2022. URL https://openreview.\nnet/forum?id=5HaIds3ux5O .\nHongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox,\nHelen Meng, and James Glass. Sail: Search-augmented instruction learning. arXiv preprint\narXiv:2305.15225 , 2023. URL https://arxiv.org/abs/2305.15225 .\nAlex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi.\nWhen not to trust language models: Investigating effectiveness of parametric and non-parametric\nmemories. In Proceedings of the 61st Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers) , 2023. 
URL https://aclanthology.org/2023.\nacl-long.546 .\nJacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick,\nMia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching\nlanguage models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 , 2022.\nURLhttps://arxiv.org/abs/2203.11147 .\nTodor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct\nelectricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference\non Empirical Methods in Natural Language Processing , 2018. URL https://aclanthology.\norg/D18-1260 .\nSewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. A discrete hard EM approach\nfor weakly supervised question answering. In Proceedings of the 2019 Conference on Empirical\nMethods in Natural Language Processing and the 9th International Joint Conference on Natu-\nral Language Processing (EMNLP-IJCNLP) , 2019. URL https://aclanthology.org/\nD19-1284 .\nSewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer,\nLuke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual\nprecision in long form text generation. arXiv preprint arXiv:2305.14251 , 2023. URL https:\n//arxiv.org/abs/2305.14251 .\nReiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher\nHesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted\nquestion-answering with human feedback. arXiv preprint arXiv:2112.09332 , 2021. URL https:\n//arxiv.org/abs/2112.09332 .\nJianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao,\nYi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable\nretrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language\nProcessing , 2022. URL https://aclanthology.org/2022.emnlp-main.669 .\nOpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023. URL https://arxiv.\norg/abs/2303.08774 .\nLong Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong\nZhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton,\nLuke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and\nRyan Lowe. Training language models to follow instructions with human feedback. In Advances in\nNeural Information Processing Systems , 2022. URL https://openreview.net/forum?\nid=TG8KACxEON .\nFabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James\nThorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rockt ¨aschel,\n12\nPublished as a conference paper at ICLR 2024\nand Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Proceedings\nof the 2021 Conference of the North American Chapter of the Association for Computational\nLinguistics: Human Language Technologies , 2021. URL https://aclanthology.org/\n2021.naacl-main.200 .\nKrishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi,\nand Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using\ndivergence frontiers. In Advances in Neural Information Processing Systems , 2021. URL https:\n//openreview.net/forum?id=Tqx7nJp7PR .\nSamyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 
Zero: Memory optimizations\ntoward training trillion parameter models. In Proceedings of the International Conference for High\nPerformance Computing, Networking, Storage and Analysis , 2020. URL https://dl.acm.\norg/doi/10.5555/3433701.3433727 .\nOri Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and\nYoav Shoham. In-context retrieval-augmented language models. Transactions of the Association\nfor Computational Linguistics , 2023. URL https://arxiv.org/abs/2302.00083 .\nVictor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine\nChaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker,\nShanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, De-\nbajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen,\nZheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen,\nAbheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao,\nStella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. Multitask prompted training\nenables zero-shot task generalization. In International Conference on Learning Representations ,\n2022. URL https://openreview.net/forum?id=9Vrb9D0WI4 .\nTimo Schick, Jane Dwivedi-Yu, Roberto Dess `ı, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer,\nNicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to\nuse tools. arXiv preprint arXiv:2302.04761 , 2023. URL https://arxiv.org/abs/2302.\n04761 .\nJohn Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy\noptimization algorithms. arXiv preprint arXiv:1707.06347 , 2017. URL https://arxiv.org/\nabs/1707.06347 .\nFreda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael\nSch¨arli, and Denny Zhou. Large language models can be easily distracted by irrelevant context.\nInProceedings of the 40th International Conference on Machine Learning , 2023. URL https:\n//proceedings.mlr.press/v202/shi23a.html .\nIvan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-\nform answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language\nProcessing , 2022. URL https://aclanthology.org/2022.emnlp-main.566 .\nJames Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-\nscale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the\nNorth American Chapter of the Association for Computational Linguistics: Human Language Tech-\nnologies, Volume 1 (Long Papers) , 2018. URL https://aclanthology.org/N18-1074 .\nHugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay\nBashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation\nand fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023. URL https://arxiv.\norg/abs/2307.09288 .\nYizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu,\nDavid Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go?\nexploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751 , 2023.\nURLhttps://arxiv.org/abs/2306.04751 .\n13\nPublished as a conference paper at ICLR 2024\nJason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,\nAndrew M. Dai, and Quoc V Le. 
Finetuned language models are zero-shot learners. In International\nConference on Learning Representations , 2022. URL https://openreview.net/forum?\nid=gEZrGCozdqR .\nZeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A\nSmith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better\nrewards for language model training. arXiv preprint arXiv:2306.01693 , 2023. URL https:\n//arxiv.org/abs/2306.01693 .\nXiang Yue, Boshi Wang, Kai Zhang, Ziru Chen, Yu Su, and Huan Sun. Automatic evaluation of\nattribution by large language models. arXiv preprint arXiv:2305.06311 , 2023. URL https:\n//arxiv.org/abs/2305.06311 .\nTianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen,\nXixin Wu, Danny Fox, Helen Meng, and James Glass. Interpretable unified language checking.\narXiv preprint arXiv:2304.03728 , 2023. URL https://arxiv.org/abs/2304.03728 .\nDaniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul\nChristiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv\npreprint arXiv:1909.08593 , 2019. URL https://arxiv.org/abs/1909.08593 .\n14\nPublished as a conference paper at ICLR 2024\nAPPENDIX\nA S ELF-RAGDetails 16\nA.1 Reflection Tokens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16\nA.2 Advantages of Learning-based Methods . . . . . . . . . . . . . . . . . . . . . . . 16\nA.3 S ELF-RAGTraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17\nA.4 S ELF-RAGInference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19\nB Experimental Details 19\nB.1 More Details of Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19\nB.2 More Details of Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19\nC Results 20\nC.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20\nC.2 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21\nC.3 Qualitative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21\nD Full List of Instructions and Demonstrations for GPT-4 21\n15\nPublished as a conference paper at ICLR 2024\nA S ELF-RAGDETAILS\nA.1 R EFLECTION TOKENS\nDefinitions of reflection tokens. Below, we provide a detailed definition of reflection type and\noutput tokens. The first three aspects will be provided at each segment level, while the final aspect is\nonly given at each output level.\n•Retrieval-on-demand (Retrieve ): Given an input and previous-step generation (if applicable),\nan LM determines whether the continuation requires factual grounding. Noindicates retrieval\nis unnecessary as the sequence does not require factual grounding or may not be enhanced by\nknowledge retrieval, Yes indicates retrieval is necessary. We additionally have continue\nto use evidence , which indicates that a model can continue to use the evidence retrieved\npreviously. For instance, a passage may contain rich factual information, and thus SELF-RAG\ngenerates multiple segments based on the passage.\n•Relevant (ISREL): Retrieved knowledge may not be always relevant to the input. This aspect\nindicates whether the evidence provides useful information ( Relevant ) or not ( Irrelevant ).\n•Supported (ISSUP): Attribution is the concept of whether the output is fully supported by\ncertain evidence (Menick et al., 2022; Bohnet et al., 2022). 
This aspect judges how much infor-\nmation in the output is entailed by the evidence. We evaluate attributions in three scale, Fully\nsupported ,Partially supported , and No support / Contradictory , follow-\ning Yue et al. (2023); Nakano et al. (2021).\n•Useful (ISUSE): Following the definitions from Liu et al. (2023a), we define the perceived utility\nas whether the response is a helpful and informative answer to the query, independently from\nwhether it is in fact factual or not. This can be also viewed as plausibility in Menick et al. (2022).\nFor usefulness, we use a five-scale evaluation (1 is the lowest and 5 is the highest).\nDetails of GPT-4-based data collections. We use the instruction and demonstration pairs to prompt\nGPT-4, listed in Section D. Following an official recommendation, we separate instructions and\noutputs with “##”. We use the temperature 1 and set the maximum output token counts to be 200. We\ndiscard instances where GPT-4 does not follow the designated output formats or output sequences\nthat do not match our expected category names. As a result, we collected 1,2594 for Retrieve , 11,181\nfor ISSUP, 19,317 for relevance, and 3,831 for utility.\nManual analysis of the GPT-4 predictions. The authors of this paper manually assess randomly\nsampled 20 instances for each aspect and check if GPT-4 predictions match their assessments given\nthe same instruction, demonstrations, and test instances. We found our assessments show high\nagreement with GPT-4 predictions, especially for relevance (95%), retrieval necessity (95%), and\nthe degree of support (90%). Agreement was slightly lower in usefulness (80%), mostly due to\nthe disagreement between 1 and 2 or 4 and 5. Compared to prior efforts on agreement of GPT-4\npredictions and human annotators in pair-wise evaluations, we found our human annotators often\nagree with GPT-4 predictions. We hypothesize this is because our fine-grained evaluation with\nabsolute scoring systems, unlike such relative, overall pair-wise evaluation systems enables GPT-4\nto generate more reliable and agreeable predictions. The effectiveness of GPT-4 evaluations in\nfine-grained aspects has shown to be effective in prior work (Liu et al., 2023a).\nA.2 A DVANTAGES OF LEARNING -BASED METHODS\nWhile recent work (Jiang et al., 2023) proposes a prompting-based method to enable retrieval on-\ndemand, we find a learning-based method is more suitable to enable fine-grained self-reflection\nfeedback and inference-time control. First, Self-RAG requires careful multi-aspect fine-grained\nself-evaluations at inference time. To make an LM to comprehend fine-grained aspects and scoring\nsystems, precise and detailed instructions, as well as few-shot demonstrations, are necessary. This\nsignificantly increases the input sequence length, resulting in higher costs and latency. Nevertheless,\nwe briefly tried prompting-based approaches in our preliminary experiments and found it is nontrivial.\nWhen we combine all instructions for all aspects and feed them to the target pre-trained LMs (GPT-3\ndavinci-003 / 002, Llama2-13B-chat), all models struggle to precisely follow our evaluation scheme,\noften generating output formats that do not suit our scheme or whose reflections are less accurate. 
To\n16\nPublished as a conference paper at ICLR 2024\nDataset name Category Data source # of instances % of Retrieve =Yes\nGPT-4 Alpaca Instruction-following Open-Instruct 26,168 53.2\nStanford Alpaca Instruction-following Open-Instruct 25,153 48.0\nFLAN-V2 Instruction-following Open-Instruct 17,817 15.8\nShareGPT Instruction-following Open-Instruct 13,406 76.8\nOpen Assistant 1 Instruction-following Open-Instruct 9,464 77.1\nWizard of Wikipedia Knowledge-intensive KILT 17,367 22.7\nNatural Questions Knowledge-intensive KILT 15,535 87.7\nFEVER Knowledge-intensive KILT 9,966 63.2\nOpenBoookQA Knowledge-intensive HF Dataset 4,699 2.3\nArc-Easy Knowledge-intensive HF Dataset 2,147 11.0\nASQA Knowledge-intensive ASQA 3,897 91.5\nTable 3: The generator LM Mtraining data statistics.\nmake the most use of the Self-RAG potential, we need to use the token probabilities for the reflection\ntokens, which may not be always available for black box proprietary LM APIs. Note that at the time\nof submission, ChatGPT and GPT-4 do not support long probability information, preventing us from\napplying the Self-RAG algorithm to such models. This limitation is also discussed in the Active\nRetrieval paper, which also requires access to token probabilities.\nA.3 S ELF-RAGTRAINING\nOverview of training. Algorithm 2 provides a high-level overview of our training.\nAlgorithm 2 SELF-RAGTraining\n1:Input input-output data D={X, Y}, generator M,Cθ\n2:Initialize Cwith a pre-trained LM\n3:Sample data {Xsample, Ysample} ∼ { X, Y} ▷Training Critic LM (Section 3.2.1)\n4:for(x, y)∈(Xsample, Ysample)do ▷Data collections for C\n5: Prompt GPT-4 to collect a reflection token rfor(x, y)\n6: Add{(x, y, r )}toDcritic\n7:Update Cwith next token prediction loss ▷Critic learning; Eq. 1\n8:Initialize Mwith a pre-trained LM ▷Training Generator LM (Section 3.2.2)\n9:for(x, y)∈(X, Y )do ▷Data collection for MwithDcritic\n10: RunCto predict rgiven (x, y)\n11: Add(x, y, r )toDgen\n12:Update MonDgenwith next token prediction loss ▷Generator LM learning; Eq. 2\nFull list of seed datasets. To sample diverse input-output pairs, we sample instances of the Open-\nInstruct (Wang et al., 2023) dataset. In particular, we use their ShareGPT, GPT-4 Alpaca, Alpaca,\nOpenAssistant, and FLAN subsets subsets. We also sample instances from a couple of knowledge-\nintensive datasets, Natural Questions (Kwiatkowski et al., 2019), Wizard of Wikipedia (Dinan et al.,\n2019) and FEVER (Thorne et al., 2018) from the KILT benchmark (Petroni et al., 2021), ASQA (Stel-\nmakh et al., 2022) and multiple QA datasets including ARC-Easy and OpenBookQA (Mihaylov\net al., 2018). Table 3 shows the full list of training instances, and in total, we use 145,619 instances.\nWe also present the percentage of the instances where Retrieve =Yes appears at least once. While\nsome instruction-following datasets such as FLAN-T5 show a lower percentage of instances with\nRetrieve =Yes, other datasets show significantly higher percentages, indicating that our Critic model\npredicts the necessity of retrieval according to the given instances. In FLAN-T5, many training data\ncome from non-knowledge-intensive tasks such as grammatical error collections or simple string\nmanipulations that are unlikely to benefit from knowledge retrieval from Wikipedia.\nPerformance of the Critic C.We evaluate the accuracy of reward predictions by splitting GPT-4\ngenerated feedback into training, development, and test sets. The accuracy of the reward model is\nas follows. 
Table 4 shows the model performance of predicting GPT-4 judgments. As you can see,\n17\nPublished as a conference paper at ICLR 2024\nbase LM Retrieve ISSUP ISREL ISUSE\nLlama2-7B 93.8 93.5 80.2 73.5\nFLAN-3B 85.6 73.1 82.0 72.1\nFigure 4: Reward prediction accuracy using GPT-4 predictions as ground-truth predictions.\noverall our fine-tuned reward model shows high prediction matching with GPT-4 predicted feedback.\nWhile our final model uses Llama2-7B as a base LM, we also train and compare FLAN-3B (Wei\net al., 2022) model on the same data, to investigate the effectiveness of different data sizes affect final\nreward predictions. In most aspects, our reward model shows higher than 80% accuracy, indicating\nthe powerful ability of fine-tuned specialized LMs to evaluate text. While both models show relatively\nlower performance on ISUSE, this is because both models often confuse between the two highest\ncases (5 and 4), where human annotators can also disagree.\nDetails of Mdata creation. Here, we provide detailed data creation procedures. Algorithm 3\nsummarizes the process. Here we set yttoyfor simplification. Once we train the critic model, we\nfirst run it on input data from the aforementioned datasets, to predict whether retrieval is needed or\nnot. For the instances where the critic predicts Retrieve =No, we only predict the ISUSEgiven input\nand output. For the instances where the critic predicts Retrieve =Yes, we first retrieve passages using\nthe input and the entire output as queries, to find passages that are relevant to the entire output. We\nthen split output sentences using Spacy.6For each sentence, we run Cto predict whether the retrieval\nis necessary or not, given the input, preceding segments, and the initial retrieved passage. If Cpredicts\nRetrieve =No, then do not insert any paragraph at the tth segment. If Cpredicts Retrieve =Yes, then\nwe use the original input and the tth segment as a retrieval query to find relevant passages for the\nt-th segment. For each retrieved passage, we predict ISRELand ISSUP. If there is any passage and\ncontinuation with ISREL=Relevant and ISSUP=Fully Supported /ISSUP=Partially\nSupported , then we sample it as the continuation, while we discard part of those examples on\nsome cases (see details below). If there is more than one passage satisfying this criterion, we use\nthe one with the highest retrieval score. If there are only ISREL=Irrelevant or ISSUP=No\nSupport passages, we randomly sample one passage.\nTo avoid the dominance of certain reflection tokens in training data, we down-sample training\ninstances. In particular, we down-sample and discard 50% of the instances without any retrieval\ntokens, since large-scale instruction-following datasets (e.g., Alpaca), include many queries that\ndo not require retrieval (e.g., simple and easy facts or not knowledge intensive). We also notice\nthat in Open-domain QA, there are many relevant and fully supported passages, and when we\nalways prioritize such cases, ISREL=Relevant and ISSUP=Fully Supported will be overly\nrepresented and there’s a risk that a model learns to simply output the same reflection tokens. 
We,\ntherefore, up-sample some instances with the ISREL=Irrelevant token for the QA dataset.\nAlgorithm 3 MgenData creation\n1:Input Input-output data D=X, Y\n2:for(x, y)∈ {X, Y}do\n3: Given (x, y)Cpredicts Retrieve\n4: ifRetrieve is predicted then\n5: Retrieve relevant passages DusingRgiven (x, y) ▷Retrieve passages\n6: ford∈Ddo\n7: Cpredicts ISRELfor each d ▷ Predict relevance of passages\n8: Cpredicts ISSUPfor each (y, d) ▷Predict supports of outputs\n9: Cpredicts ISUSEfor each d ▷ Predict overall utility ( t=Tonly)\n10: Sample d\n11: else if Retrieve is not predicted then\n12: Cpredicts ISUSEgiven x, y\nAdd augmented (x, y, d, r )toDgen\n6https://spacy.io/\n18\nPublished as a conference paper at ICLR 2024\nTraining examples. Table 4 show several training examples used for Mtraining.\nA.4 S ELF-RAGINFERENCE\nDetails of beam-search score calculations. We first compute scores for each critique type by\ntaking the normalized probabilities of desirable tokens. For ISREL, we compute the score as follows:\ns(ISREL) =p(ISREL=RELEVANT )\np(ISREL=RELEVANT ) +p(ISREL=IRRELEVANT ).\nFor ISSUP, we compute the score as follows:\ns(ISREL) =p(ISSUP=FULLY )\nS+ 0.5×p(ISSUP=PARTIALLY )\nS,\nwhere S=P\nt∈{FULLY,PARTIALLY ,NO}p(ISSUP=t). For ISUSEwhere we have a five-scale score, we\ncompute the weighted sum of the scores. We assigns weighted scores of w={−1,−0.5,0,0.5,1}\nto the tokens ISUSE={1,2,3,4,5}, and compute the final scores as follows:\ns(ISUSE) =5X\niwip(ISUSE=i)\nS,\nwhere S=P\nt∈{1,2,3,4,5}p(ISUSE=t).\nDetails of adaptive retrieval. For retrieval based on soft constraints, we trigger retrieval if the\nfollowing condition is satisfied:\np(Retrieve =YES)\np(Retrieve =YES) +p(p(Retrieve =NO)> δ.\nB E XPERIMENTAL DETAILS\nB.1 M ORE DETAILS OF TRAINING\nMore details of training and computations. We use 4 Nvidia A100 with 80GB memory to train\nour models. All models are trained for 3 epochs with a batch size of 128, a peak learning rate of 2e-5\nwith 3% warmup steps, and linear decay afterward. We set the maximum token length to be 2,048\nfor the 7B model, and 1,524 for the 13B model due to the memory constraint. We use Deepspeed\nstage 3 (Rajbhandari et al., 2020) to conduct multi-GPU distributed training, with training precision\nBfloat16 enabled. FlashAttention (Dao et al., 2022) is used to make the long-context training more\nefficient. We run inference of our trained models using 1-2 Quadro RTX 6000 GPUs with 24GB\nmemory.\nB.2 M ORE DETAILS OF EVALUATIONS\nRetrieval setup details. By default, we use Contriever-MS MARCO to retrieve the top five\ndocuments from Wikipedia, and use official Wikipedia embeddings based on 2018 English Wikipedia.\nOn PopQA, where question and answer pairs are created based on WikiData in 2022, we found\nthat the 2018 Wikipedia sometimes lacks articles about some entities that have been more recently\nadded to Wikipedia. Therefore, for PopQA, we used the December 2020 preprocessed Wikipedia\ncorpus provided by Izacard et al. (2022b) and generated document embeddings.7The issues of\nperformance variance from different Wikipedia dumps have been reported by prior work (Asai et al.,\n2020; Izacard et al., 2022b). Yet, we observe limited effectiveness of such off-the-shelf retrieval\nmodels trained primarily on knowledge-intensive tasks for open-ended generation (e.g., instruction\nfollowing). 
Recent or concurrent work studies instruction-tuning of retrieval systems (Asai et al.,\n2023b) or joint training of retrieval and LM components (Lin et al., 2023), while we leave exploring\nthe effectivess of such appraoches for future work. For bio generation and open-domain QA tasks,\nwe additionally retrieve five documents using Google Programmable Search8and search documents\nfrom English Wikipedia. As this API only provides snippets, we retrieve Wikipedia introductory\nparagraphs for the corresponding entities.\n7https://github.com/facebookresearch/atlas\n8https://programmablesearchengine.google.com/about/\n19\nPublished as a conference paper at ICLR 2024\nDetailed experimental settings for individual datasets. For OpenQA datasets, we set the max-\nimum new token number to 100 tokens. For closed-set tasks (PubHealth and ARC-C), we set the\nmaximum new token length to 50 for all baselines. For SELF-RAGinference on PubHealth and\nARC-C, instead of determining the output with the highest score 4 as in other tasks, we aggregate the\nscores for each option and select the answer option with the highest score. We found in zero-shot\nsettings of fact checking, some LLMs can generate capitalized class labels (e.g., True) while our\ngold labels are lower-cased. Therefore, across different LMs, for fact checking, we lowercase the\npredictions. In multiple choice tasks, we found some models generate answers in slightly different\nways (e.g., (A) instead of A). We slightly modify instructions for each LLM to avoid such format\nviolations, and further conduct string matching between each candidate and model predictions if\nformat violations still remain. After that processing, in closed set tasks, model predictions match\none of the gold classes in almost all cases. For ALCE, we found that Llama2-chat tend to generate\nsignificantly lower outputs than other models (e.g., on average, their output is nearly 100 token, while\nChatGPT generates 40 tokens on average), resulting in inflated str-em scores. We limit the maximum\ngeneration length to 100 tokens for all baselines to avoid this issue, rather than the original 300\ntokens in the ALCE paper. Consequently, all of the baseline output length is within 30-60 tokens.\nFor FactScore, we set the maximum new token length to 500 for baselines and 200 for SELF-RAGat\neach segment level.\nTask-specific instructions. Table 5 shows the list of the instructions used during evaluations. For\nOpen-domain QA, we do not provide explicit instructions.\nC R ESULTS\nC.1 A NALYSIS\nReliance on parametric- and non-parametric memories. We analyze how frequently model\nanswers come from retrieved passages (non-parametric memories) or their parametric memories.\nOn two open-domain QA datasets, TriviaQA and PopQA, we conduct the following analysis: 1)\nsample query models successfully answer correctly, 2) for each query in this group, check whether the\nmatched ground-truth answer is a sub-string of the retrieved passage or not. We evaluate SELF-RAG\n7B, Alpaca 7B, Alpaca 13B, and Llama2-Chat-13B. We found that SELF-RAGsignificantly less\nfrequently generates answers that are not included in the provided evidence; in particular, in Alpaca\n30B, 20% of the correct predictions are not included in the provided passages, followed by Llama2-\nchat 13B (18%) and Alpaca (15%), while it is only 2% in SELF-RAG. 
When retrieved passages are\nnot relevant, SELF-RAGgenerates ISREL=Irrelevant , indicating that the following answers\nmay not be factually grounded, while those instruction-tuned models continue to generate plausible\nanswers.\nEffects of training data size. We analyze how the data scale affects the model’s performance. In\nparticular, we randomly sample 5k, 10k, 20k, and 50k instances from our original 150k training\ninstances, and fine-tune four SELF-RAG 7Bvariants on those subsets. Then, we compare the model\nperformance on PopQA, PubHealth, and ASQA (citation precision) with our final SELF-RAGtrained\non the full 150k instances. We also evaluate Figures 5a, 5b and 5c shows the models’ performance\ntrained on different amount of data. Across all datasets, increasing data size often shows upward\ntrajectories and the improvements are significantly larger in PopQA and ASQA, while we do not\nobserve such significant improvements on Llama2-FT 7Bwhen increasing the training data from 50k\nto 150k. These results also indicate that further expanding the training data of SELF-RAGmay lead\nto further improvements, although in this work we limit our training data size to 150k.\nReflection token prediction performance. We evaluate the accuracy of the Critic and Generator\nLMs in predicting reflection tokens. For the Critic LM, we evaluate its agreement against GPT-4\npredictions on a validation set of the initially collected GPT-4 predictions. Table 5d shows the model\nperformance of predicting GPT-4 judgments. As you can see, overall our fine-tuned reward model\nshows high prediction matching with GPT-4 predicted feedback. In most aspects, our reward model\nshows higher than 80% accuracy, indicating the powerful ability of fine-tuned specialized LMs to\nevaluate text. While both models show relatively lower performance on ISUSE, this is because both\nmodels often confuse between the two highest cases (5 and 4), where human annotators also disagree.\n20\nPublished as a conference paper at ICLR 2024\n0 50 100 150\nNum of training (k)3540455055Perfomance\n(a) PopQA\n0 100\nNum of training (k)717273\n (b) PubHealth\n0 100\nNum of training (k)4060\n (c) ASQA (prec)Type acc.\nRetrieve 93.8\nISSUP 93.5\nISREL 80.2\nISUSE 73.5\n(d) Reflection pre-\ndiction accuracy.\nFigure 5: Training scale and Human analysis: (a) (b) (c) Training scale analysis shows the effect\nof the training data scale on PopQA, PubHealth and ASQA (citation precision), respectively. (d)\nReflection token prediction accuracy of the Critic LM.\nC.2 H UMAN EVALUATION\nEvaluations on supportiveness and plausibility. We conduct small human evaluations on SELF-\nRAGoutputs, as well as the reliability of predicted reflection tokens. In particular, we sampled 50\nsamples from PopQA and Bio results. Following Menick et al. (2022), human annotators evaluate\nS&P , which indicates whether the model output is plausible (i.e., the output is a reasonable and\non-topic response to the question as if it were occurring in a conversation) and supported (i.e., the\nprovided evidence is sufficient to verify the validity of the answer). For S&P, we do not consider the\ninstances where SELF-RAGpredicts irrelevant orno support . We then ask our annotators\nwhether the model-predicted reflection tokens about ISRELand ISSUPmatch their inspections\n(e.g., whether the fully supported output is supported by the cited evidence). 
Human annotators find\nSELF-RAGanswers are often plausible and supported by relevant passages with higher S&P scores\non short-form PopQA, which is consistent with Menick et al. (2022). Human annotators also find\nISRELand ISSUPreflection token predictions are mostly aligned with their assessments. Appendix\nTable 6 shows several annotated examples and explanations on assessments.\nExamples of human evaluations. Table 6 shows examples with human evaluations on S&P and\ncorrectness of ISRELand ISSUPreflection tokens.\nC.3 Q UALITATIVE EXAMPLES\nTable 7 shows several examples predicted by our SELF-RAG(13B). The first example is the model\noutput to an ASQA question. The first reference states that Emperor Constantine made Sunday a\nday of rest from labor, and further the second citation supports the fact that the official adoption\nof Sunday as a day of rest by Constantine in AD 321. In the second example, the model predicts\nContradictory to the first output as the output says the person has served as the CEO since 2010,\nwhile the passage says he stepped down as CEO in 2015. Indicating those factual contradictions\nas reflection tokens enables to enforcement of hard control and also verification of model outputs\neasily. In the third example, while the generation is mostly correct, SELF-RAGpredicts Partially\nSupport to the statement listing the name of the songs, as they were not explicitly mentioned.\nD F ULL LIST OF INSTRUCTIONS AND DEMONSTRATIONS FOR GPT-4\nHere, we show the instructions and demonstrations used to prompt GPT-4 to collect reflection tokens.\nTable 8 shows the instructions and demonstrations for the initial retrieval token. Table 9 shows\nthe instruction and demonstrations used to collect the three-way output tokens for Retrieve given\ninstruction, preceding sentences, and previously retrieved passages. Due to the longer demonstration\nand test input, we only use a single demonstration. Table 10 shows an instruction and demonstrations\nused to collect the three-way output tokens for ISREL. Table 11 shows an instruction and demon-\nstrations used to collect the three-way output tokens for ISREL. Table 12 shows an instruction and\ndemonstrations used to collect the five-way output tokens for ISUSE.\n21\nPublished as a conference paper at ICLR 2024\nInput Identify the two types of computer memory. (from gpt4 alpaca )\nOutput [Retrieve =Yes]<p>Computer memory Computer memory – In computing, memory refers to the\ncomputer hardware integrated circuits that store information for immediate use in a computer; it is synonymous\nwith the term “primary storage”. Computer memory operates at a high speed, for example, random-access\nmemory (RAM), as a distinction from storage that provides slow-to-access information but offers higher\ncapacities. If needed, contents of the computer memory can be transferred to secondary storage; a very\ncommon way of doing this is through a memory management technique called “virtual memory”. An archaic\nsynonym for memory is store. The term “memory”, meaning “primary storage” or “main memory”, is <p>[\nISREL=Relevant ]Primary memory, also known as main memory, is the memory that the computer can\ndirectly access and manipulate. [ ISSUP=Fully Supported ] [Retrieve =Yes]<p>Computer memory–often\nassociated with addressable semiconductor memory, i.e. 
integrated circuits consisting of silicon-based transistors,\nused for example as primary storage but also other purposes in computers and other digital electronic devices.\nThere are two main kinds of semiconductor memory, volatile and non-volatile. Examples of non-volatile memory\nare flash memory (used as secondary memory) and ROM, PROM, EPROM and EEPROM memory (used for\nstoring firmware such as BIOS). Examples of volatile memory are primary storage, which is typically dynamic\nrandom-access memory (DRAM), and fast CPU cache memory, which is typically static random-access memory\n(SRAM) that is fast but energy-consuming, offering lower memory <p>[ISREL=Relevant ]Primary memory\ncan be accessed quickly, but it is volatile, meaning that the data is lost if the power is turned off.[ ISSUP\n=Partially Supported ]...[ ISUSE:5]\nInput Give an example of a situation where you have to delegate tasks (from gpt4 alpaca )\nOutput [Retrieve =No]The manager realizes that there is an extensive list of tasks that need to be completed in\norder to make the event a success.[ Retrieve =Yes]<p>Delegation–Delegation is the assignment of any responsi-\nbility or authority to another person (normally from a manager to a subordinate) to carry out specific activities. It\nis one of the core concepts of management leadership. However, the person who delegated the work remains\naccountable for the outcome of the delegated work. Delegation empowers a subordinate to make decisions, i.e. it\nis a shifting of decision-making authority from one organizational level to a lower one. Delegation, if properly\ndone, is not fabrication. The opposite of effective delegation is micromanagement, where a manager provides too\nmuch input, <p>[ISREL=Relevant ]This includes tasks such as creating a guest list, designing and sending in-\nvitations, booking a venue, coordinating with vendors, planning the event ´s program, and monitoring the budget.[\nISSUP=Partially Supported ][Retrieve =No]Since it is not possible for the manager to handle everything\nalone, they must delegate tasks to their team members.[ Retrieve =No]The manager would then assign each\nteam member specific responsibilities based on their individual skillsets and strengths.[ Retrieve =Yes]<p>Joint\napplication design – Subject Matter Experts: These are the business users, the IS professionals, and the outside\nexperts that will be needed for a successful workshop. This group is the backbone of the meeting <p>[ISREL\n=Irrelevant ]For instance, the team ´s graphic designer might be tasked with designing the invitations, while\nthe event coordinator would be in charge of planning the program and working with vendors...[ Retrieve =No]By\ndelegating tasks, the manager can ensure that all necessary tasks are completed in a timely manner, while also\nempowering their team members to take on responsibilities and contribute to the success of the event.[ ISUSE\n=5]\nInput Synthesize a poem based on this prompt:“The sun is setting” (from stanford alpaca )\nOutput [Retrieve =No]The stars begin to twinkle bright the trees, scattered throughout the night;\n[Retrieve =No]The stars begin to twinkle bright the trees, scattered throughout the night [ Retrieve =No] They\nflicker like a silent dance, A peaceful moment in its trance. [ Retrieve =No] They flicker like a silent dance, A\npeaceful moment in its trance. 
[ Retrieve =No] A chorus of crickets sing a lullaby, To every creature that roams\nthe sky [ ISUSE=5]\nInput when did walking dead season 7 come out (from nq)\nOutput [Retrieve =Yes]<p>The Walking Dead (season 7)–The Walking Dead (season 7) The seventh season of\n“The Walking Dead”, an American post-apocalyptic horror television series on AMC, premiered on October 23,\n2016, and concluded on April 2, 2017, consisting of 16 episodes. Developed for television by Frank Darabont,\nthe series is based on the eponymous series of comic books by Robert Kirkman, Tony Moore, and Charlie Adlard.\n...<p>[ISREL=Relevant ]October 23 , 2016[ ISSUP=Fully Supported ][ISUSE=5]\nInput Given a chat history separated by new lines, generates an informative, knowledgeable and engaging\nresponse: Hello, I love German shepherd dogs and you? – My favorite dog, they are officially known as German\nShepherd Dogs? – Yes, I have a German shepherd in my house and she is pregnant (from wow)\nOutput hey also are easily trained, obedient, and intelligent.[ Retrieve =Yes]<p>German Shepherd – a simple\ndifference of opinion, it is the fundamental issue of the breed ´s essential conformation and movement. ... They\nare known for being easy to train and good for performing tasks and following instructions. They are especially\nwell known for their police work, being used for tracking criminals, patrolling troubled areas and detecting and\nholding suspects. <p>[ISREL=Relevant ]Because of this, they are often used for disability assistance and\nfor police work. [ ISSUP=Partially Supported ][ISUSE=2]\nTable 4: List of the training examples.\n22\nPublished as a conference paper at ICLR 2024\nDataset Instruction\nARC-C Given four answer candidates, A, B, C and D, choose the best answer choice. Please answer\nwith the capitalized alphabet only, without adding any extra phrase or period.\nPubHealth Is the following statement correct or not? Say true if it’s correct; otherwise, say false. Don’t\ncapitalize or add periods, just say “true” or “false”.\nBio Generation Tell me a bio about [Person Name]\nASQA (baseline) Instruction: Write an accurate, engaging, and concise answer for the given question using only\nthe provided search results (some of which might be irrelevant) and cite them properly. Use\nan unbiased and journalistic tone. Always cite for any factual claim. When citing several\nsearch results, use [1][2][3]. Cite at least one document and at most three documents in each\nsentence. If multiple documents support the sentence, only cite a minimum sufficient subset of\nthe documents.\nASQA (ours) Answer the following question. The question may be ambiguous and have multiple correct\nanswers, and in that case, you have to provide a long-form answer including all correct answers.\nTable 5: Full list of instructions used during zero-shot evaluations. For open-domain QA, we don’t\nuse any task specific instruction and simply use the original questions as input query.\n23\nPublished as a conference paper at ICLR 2024\nInput What genre is A Question and Answer Guide to Astronomy? ( PopQA )\nOutput [Retrieve =Yes]<p>Reference [1] <p>[ISREL=Relevant ] A Question and Answer Guide to As-\ntronomy is a non-fiction book that falls under the genre of Astronomy and Cosmology. [ ISSUP=Fully\nSupported ][ISUSE=5]\nReference [1] A Question and Answer Guide to Astronomy is a book about astronomy and cosmology and is\nintended for a general audience. 
The book was written by Pierre-Yves Bely, Carol Christian, and Jean-Rene\nRoy, and published in English by Cambridge University Press in 2010. It was originally written in French. The\ncontent within the book is written using a question and answer format. It contains some 250 questions, which\nThe Science Teacher states each are answered with a concise and well-formulated essay that is informative and\nreadable. The Science Teacher review goes on to state that many of the answers given in the book are little gems\nof science writing. The Science Teacher summarizes by stating that each question is likely to be thought of by a\nstudent and that the answers are informative, well-constructed, and thorough. The book covers information about\nthe planets, the Earth, the Universe, practical astronomy, history, and awkward questions such as astronomy in\nthe Bible, UFOs, and aliens. Also covered are subjects such as the Big Bang, comprehension of large numbers,\nand the Moon illusion.\nS&P Supported and Plausible.\nCorrectness of ISRELand ISSUP:Both ISRELand ISSUPare correct, as the Reference [1] provides\ndetailed information of the book including the genre, and the model output is fully supported by the reference.\nInput Who is the author of The Lie? ( PopQA )\nOutput [Retrieve =Yes]<p>Reference [1] <p>[ISREL=Relevant ] The author of The Lie is Sam Harris.[\nISSUP=Fully Supported ][ISUSE=5]\nReference [1] Lying is a 2011 long-form essay book by American author and neuroscience expert Sam Harris.\nHarris argues that we can radically simplify our lives and improve society by merely telling the truth in situations\nwhere others often lie.\nS&P Plausible but not supported.\nCorrectness of ISRELand ISSUPBoth ISRELand ISSUPare incorrect, as the Reference [1] is about a\ndifferent book whose title is similar to The Lie.\nInput Tell me a bio about Ronaldinho ( Bio)\nOutput [Retrieve =Yes]<p>Reference [1] <p>[ISREL=Relevant ] Ronaldinho is a retired professional\nfootballer from Brazil.[ ISSUP=Fully Supported ][Retrieve =Continue ]He played as an attacking\nmidfielder and was known for his technical skills, creativity, dribbling ability, and accuracy from free\nkicks[ Retrieve =Continue ]He won two FIFA World Player of the Year awards and a Ballon d’Or.[ ISUSE=5]\nReference [1] Ronaldo de Assis Moreira (born 21 March 1980), commonly known as Ronaldinho Ga ´ucho or\nsimply Ronaldinho, is a Brazilian retired professional footballer who played mostly as an attacking midfielder,\nbut was also deployed as a winger. Considered one of the best players of his generation and regarded by many as\none of the greatest of all time, Ronaldinho won two FIFA World Player of the Year awards and a Ballon d’Or. A\nglobal icon of the sport, he was renowned for his technical skills, creativity, dribbling ability and accuracy from\nfree-kicks, as well as his use of tricks, feints, no-look passes and overhead...\nS&P Plausible and fully supported.\nCorrectness of ISRELand ISSUPBoth ISRELand ISSUPare correct.\nInput Tell me a bio about Sergei Bodrov ( Bio)\nOutput [Retrieve =Yes]<p>Reference [1] <p>[ ISREL=Relevant ] Sergei Vladimirovich Bodrov\n(born June 28, 1948) is a Russian film director, screenwriter, and producer.[ ISSUP=Fully\nSupported ][Retrieve =Continue ]He was born in Moscow, Russia.[ Retrieve =Continue ]He is the presi-\ndent of the Russian Guild of Film Producers.[ ISUSE=5]\nReference [1] Sergei Vladimirovich Bodrov (born June 28, 1948) is a Russian film director, screenwriter, and\nproducer. 
In 2003 he was the President of the Jury at the 25th Moscow International Film Festival.\nS&P Plausible but only partially supported. The the Reference [1] does not explicitly mention the person is born\nin Moscow and there is no description about the Russian Guild of Film Producers.\nCorrectness of ISRELand ISSUP ISRELand ISSUPare correct as the reference document is highly relevant\nand the first sentence is fully supported.\nTable 6: Examples of human evaluation results.\n24\nPublished as a conference paper at ICLR 2024\nInput When did sunday became the seventh day of the week ( asqa )\nOutput [Retrieve =Yes]<p>Reference [1] <p>[ISREL=Relevant ] In 321 AD, the Roman Emperor Con-\nstantine the Great made Sunday a day of rest from labor, stating: “On the venerable Day of the Sun let the\nmagistrates and people residing in cities rest, and let all workshops be closed”.[ ISSUP=Fully Supported ]\n[Retrieve =Yes]<p>Reference [2] <p>[ISREL=Relevant ] The official adoption of Sunday as a day of\nrest by Constantine the Great in AD 321 made Sunday the seventh day of the week. [2][ISSUP=Fully\nSupported ][ISUSE=5]\nReference [1] Emperor Constantine then enacted the first Sunday Laws, for “the venerable Day of the Sun”\nin 321 A.D. On March 7, 321, the Roman emperor Constantine issued a decree making Sunday a day of rest\nfrom labor, stating: We find a good description of this in sources such as the book Rest Days which states:\nEarly Christian observance of both the spiritual seventh-day sabbath and a Lord ´s Day assembly is evidenced in\nIgnatius ´s letter to the Magnesians ca. 110.\nReference [2] Despite the official adoption of Sunday as a day of rest by Constantine, the seven-day week and\nthe nundial cycle continued to be used side-by-side until at least the Calendar of 354 and probably later. ... The\nfact that the canon had to be issued at all is an indication that adoption of Constantine’s decree of 321 was still\nnot universal ...\nInput Tell me a bio about Rory Tapner. ( bio generation )\nOutput <p>Reference [1] <p>[ISREL=Relevant ] Rory Tapner is a British businessman who has served as\nthe Chief Executive Officer of Coutts, a private banking and wealth management company, since 2010.[ ISSUP\n=Contradictory ],Reference [2] <p>[ISREL=Relevant ] Rory Tapner was born on 30 September 1959\nin London, England.[ ISSUP=Fully Supported ]\nReference [1] Rory Tapner Rory Tapner is a UK businessman who, in September 2010 was appointed CEO of\nCoutts, the Wealth Management division of Royal Bank of Scotland Group. In February 2015 he stepped down\nas CEO when he was replaced in the role by Alison Rose\nReference [2] Holdings (from August 2016); and the Council of the University of Buckingham (from July 2017).\nRory Tapner Rory Tapner (born 30 September 1959) is a UK businessman ...\nInput Tell me a bio about G. Venugopa. ( bio generation )\nOutput (ranked 1) [Retrieve =Yes]<p>Reference [1] <p>[ISREL=Relevant ] G. Venugopal is a popu-\nlar playback singer in the Malayalam film industry. [ Retrieve =Continue ] He has sung over 300 songs in\nMalayalam, Tamil, and Hindi movies.[ ISSUP=Partially Supported ] [Retrieve =Yes]<p>Reference\n[2]<p>G. Venugopal was born on December 10, 1960 in India.[ ISSUP=Fully Supported ]\nReference [1] ’G. Venugopal G Venugopal (born 10 December 1960) is an Indian playback singer known for his\nwork in Malayalam films. He started his singing career in the film “Odaruthammaava Aalariyaam” (1984). 
Since\nthen he has sung in more than 300 films; and has over 500 private albums to his credit. The many awards he has\nwon include the coveted Best Playback Singer (Kerala State) which he won thrice, the Film Critics award and\nthe Kerala Kaumudi Gallop Poll award (twice). He won The Kerala State award for Best Playback singer (2004)\nthird time for the song ”aadedee..aadaadedee” from the film ”Ullam”.Venugopal’\nReference [2] Kerala State Film Awards: Kerala Film Critics Award Kerala State Government Award (Profes-\nsional Drama): Asianet Film Awards: Kerala Kaumadi Gallup Poll Award: Academic G. Venugopal G Venugopal\n(born 10 December 1960) is an Indian playback singer known for his work in Malayalam films.\nTable 7: Examples of outputs.\n25\nPublished as a conference paper at ICLR 2024\nInstructions\nGiven an instruction, please make a judgment on whether finding some external documents\nfrom the web (e.g., Wikipedia) helps to generate a better response. Please answer [Yes] or\n[No] and write an explanation.\nDemonstrations\nInstruction Give three tips for staying healthy.\nNeed retrieval? [Yes]\nExplanation There might be some online sources listing three tips for staying healthy or\nsome reliable sources to explain the effects of different behaviors on health. So retrieving\ndocuments is helpful to improve the response to this query.\nInstruction Describe a time when you had to make a difficult decision.\nNeed retrieval? [No]\nExplanation This instruction is asking about some personal experience and thus it does not\nrequire one to find some external documents.\nInstruction Write a short story in third person narration about a protagonist who\nhas to make an important career decision.\nNeed retrieval? [No]\nExplanation This instruction asks us to write a short story, which does not require external\nevidence to verify.\nInstruction What is the capital of France?\nNeed retrieval? [Yes]\nExplanation While the instruction simply asks us to answer the capital of France, which is a\nwidely known fact, retrieving web documents for this question can still help.\nInstruction Find the area of a circle given its radius. Radius = 4\nNeed retrieval? [No]\nExplanation This is a math question and although we may be able to find some documents\ndescribing a formula, it is unlikely to find a document exactly mentioning the answer.\nInstruction Arrange the words in the given sentence to form a grammatically cor-\nrect sentence. quickly the brown fox jumped\nNeed retrieval? [No]\nExplanation This task doesn’t require any external evidence, as it is a simple grammatical\nquestion.\nInstruction Explain the process of cellular respiration in plants.\nNeed retrieval? [Yes]\nExplanation This instruction asks for a detailed description of a scientific concept, and is\nhighly likely that we can find a reliable and useful document to support the response.\nTable 8: Instructions and demonstrations for Retrieve aspect given the input only.\n26\nPublished as a conference paper at ICLR 2024\nInstructions\nYou will be provided with an instruction, evidence, output sentence, and preceding sentences\n(optional). If the preceding sentence is given, the output should be the sentence that follows\nthose preceding sentences. 
Your task is to determine whether the information in the output\nsentence can be fully verified by the evidence or if it requires further external verification.\nThere are three cases:\n- If the output sentence can be verified solely with the evidence, then respond with [Continue\nto Use Evidence].\n- If the sentence doesn’t require any factual verification (e.g., a subjective sentence or a\nsentence about common sense), then respond with [No Retrieval].\n- If additional information is needed to verify the output sentence, respond with [Retrieval].\nPlease provide explanations for your judgments.\nInstruction Explain the use of word embeddings in Natural Language Processing.\nPreceding sentences Word embeddings are one of the most powerful tools available for\nNatural Language Processing (NLP). They are mathematical representations of words or\nphrases in a vector space, allowing similarities between words and the context in which they\nare used to be measured.\nEvidence : Word embedding\nWord embedding is the collective name for a set of language modeling and feature learning\ntechniques in natural language processing (NLP) where words or phrases from the vocabulary\nare mapped to vectors of real numbers. Conceptually it involves a mathematical embedding\nfrom a space with one dimension per word to a continuous vector space with a much lower\ndimension. Output: Word embeddings are useful for tasks such as sentiment analysis, text\nclassification, predicting the next word in a sequence, and understanding synonyms and\nanalogies.\nRating [Retrieval]\nExplanation The output discusses the applications of word embeddings, while the evidence\nonly discusses the definitions of word embeddings and how they work. Therefore, we need to\nretrieve other evidence to verify whether the output is correct or not.\nTable 9: Instructions and demonstrations for Retrieve aspect given the input, preceding generations,\nand retrieved passages.\n27\nPublished as a conference paper at ICLR 2024\nInstructions\nYou’ll be provided with an instruction, along with evidence and possibly some preceding\nsentences. When there are preceding sentences, your focus should be on the sentence that\ncomes after them. Your job is to determine if the evidence is relevant to the initial instruction\nand the preceding context, and provides useful information to complete the task described in\nthe instruction. If the evidence meets this requirement, respond with [Relevant]; otherwise,\ngenerate [Irrelevant].\nInstruction Given four answer options, A, B, C, and D, choose the best answer.\nInput Earth’s rotating causes\nA: the cycling of AM and PM\nB: the creation of volcanic eruptions\nC: the cycling of the tides\nD: the creation of gravity\nEvidence Rotation causes the day-night cycle which also creates a corresponding cycle of\ntemperature and humidity creates a corresponding cycle of temperature and humidity. Sea\nlevel rises and falls twice a day as the earth rotates.\nRating [Relevant]\nExplanation The evidence explicitly mentions that the rotation causes a day-night cycle, as\ndescribed in the answer option A.\nInstruction age to run for US House of Representatives\nEvidence The Constitution sets three qualifications for service in the U.S. Senate: age (at\nleast thirty years of age); U.S. 
citizenship (at least nine years); and residency in the state a\nsenator represents at the time of election.\nRating [Irrelevant]\nExplanation The evidence only discusses the ages to run for the US Senate, not for the\nHouse of Representatives.\nTable 10: Instructions and demonstrations for ISRELaspect given the input only.\n28\nPublished as a conference paper at ICLR 2024\nInstructions\nYou will receive an instruction, evidence, and output, and optional preceding sentences. If the\npreceding sentence is given, the output should be the sentence that follows those preceding\nsentences. Your task is to evaluate if the output is fully supported by the information provided\nin the evidence.\nUse the following entailment scale to generate a score:\n- [Fully supported] - All information in output is supported by the evidence, or extractions\nfrom the evidence. This is only applicable when the output and part of the evidence are\nalmost identical.\n- [Partially supported] - The output is supported by the evidence to some extent, but there\nis major information in the output that is not discussed in the evidence. For example, if an\ninstruction asks about two concepts and the evidence only discusses either of them, it should\nbe considered a [Partially supported].\n- [No support / Contradictory] - The output completely ignores evidence, is unrelated to the\nevidence, or contradicts the evidence. This can also happen if the evidence is irrelevant to the\ninstruction.\nMake sure to not use any external information/knowledge to judge whether the out-\nput is true or not. Only check whether the output is supported by the evidence, and not\nwhether the output follows the instructions or not.\nInstruction Explain the use of word embeddings in Natural Language Processing.\nPreceding sentences Word embeddings are one of the most powerful tools available for\nNatural Language Processing (NLP). They are mathematical representations of words or\nphrases in a vector space, allowing similarities between words and the context in which they\nare used to be measured.\nOutput Word embeddings are useful for tasks such as sentiment analysis, text classification,\npredicting the next word in a sequence, and understanding synonyms and analogies.\nEvidence Word embedding\nWord embedding is the collective name for a set of language modeling and feature learning\ntechniques in natural language processing (NLP) where words or phrases from the vocabulary\nare mapped to vectors of real numbers. Conceptually it involves a mathematical embedding\nfrom a space with one dimension per word to a continuous vector space with a much lower\ndimension. Methods to generate this mapping include neural networks, dimensionality\nreduction on the word co-occurrence matrix, probabilistic models, explainable knowledge\nbase method, and explicit representation in terms of the context in which words appear. Word\nand phrase embeddings, when used as the underlying input representation, have been shown\nto boost the performance in NLP tasks such as syntactic parsing, sentiment analysis, next\ntoken predictions as well and analogy detection.\nScore [Fully supported]\nExplanation The output sentence discusses the application of word embeddings, and the\nevidence mentions all of the applications syntactic parsing, sentiment analysis, next token\npredictions as well as analogy detection as the applications. 
Therefore, the score should be\n[Fully supported].\nTable 11: Instructions and demonstrations for ISSUPtokens.\n29\nPublished as a conference paper at ICLR 2024\nInstructions\nGiven an instruction and an output, rate whether the response appears to be a helpful and\ninformative answer to the query, from 1 (lowest) - 5 (highest). We call this score perceived\nutility. The detailed criterion is as follows: 5: The response provides a complete, highly\ndetailed, and informative response to the query, fully satisfying the information needs. 4: The\nresponse mostly fulfills the need in the query, while there can be some minor improvements\nsuch as discussing more detailed information, having better structure of the response, or im-\nproving coherence. 3: The response is acceptable, but some major additions or improvements\nare needed to satisfy users’ needs. 2: The response still addresses the main request, but it is\nnot complete or not relevant to the query. 1: The response is barely on-topic or completely\nirrelevant.\nInstruction Who is the current prime minister of the UK as of 2023?\nOutput Boris Johnson was the prime minister of the UK from 2019 - 2022.\nPerceived utility 2\nExplanation While the output provides a factually correct statement about the UK prime\nminister from 2019 to 2022, this instruction asks who the prime minister is as of 2023, so it\ndoesn’t answer the instruction. Therefore, the utility is 2.\nInstruction Given a description of a travel destination, recommend 10 tourist attrac-\ntions with detailed descriptions of each. The travel destination is Tokyo, Japan.\nOutput Tokyo is a vibrant city full of exciting tourist attractions. Some must-see sights\ninclude the Tokyo Skytree, Tokyo Disneyland, Sensoji Temple, Meiji Shrine, Tsukiji Fish\nMarket, Harajuku, and Shinjuku Gyoen.\nPerceived utility 3\nExplanation This output doesn’t provide descriptions of each attraction and the number of\nthe attractions is also less than 10. While this output partially answers the instructions, it\ndoesn’t match the instructions strictly.\nTable 12: Instructions and demonstrations for ISUSEtokens.\n30",
 'document_info': {'path': '/Users/swa/Desktop/AI/PdfKnowledgeBase/selfrag.pdf',
  'metadata': None}}
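
So the loader hands back the full extracted text plus a small document_info record. As a quick sanity check you can peek at both; a minimal sketch, assuming the loader result above is bound to found (as the chunking step below suggests) and that it exposes the fields shown above as attributes:

# Quick sanity check on the parsed document (assumes the loader result
# above is bound to `found` and exposes pydantic-style attributes)
print(found.document_info.path)               # the source PDF
print(len(found.text), "characters extracted")  # size of the parsed text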

With the parsing done, you can start chunking things:

from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

# Split the parsed text into fixed-size chunks with a small overlap,
# so context is not lost at the chunk boundaries.
splitter = FixedSizeSplitter(chunk_size=4000, chunk_overlap=200)
chunks = await splitter.run(found.text)

# Inspect one of the resulting chunks
chunks.chunks[2]
TextChunk(text='ces by leveraging reflection tokens through segment-level beam\nsearch using the weighted linear sum of the reflection token probabilities as segment score.\nEmpirical results on six tasks, including reasoning and long-form generation, demonstrate that SELF-\nRAGsignificantly outperforms pre-trained and instruction-tuned LLMs that have more parameters and\nwidely adopted RAG approaches with higher citation accuracy. ... ', index=2, metadata=None)
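Each of these chunks gets embedded by the pipeline further down, but you can also do it by hand with the embedder from earlier, just to see what will be stored. A minimal sketch (embed_query is the method the package's Embedder interface exposes):

# a minimal sketch: vectorize the chunks manually with the embedder defined earlier
vectors = [embedder.embed_query(chunk.text) for chunk in chunks.chunks]
print(len(vectors), len(vectors[0]))  # chunk count and embedding dimension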

At this point you can convert the chunks to vectors, nodes, and relationships. The whole process happens in one go via the SimpleKGPipeline. It’s nice they put this in place since it lowers the experimentation threshold.

You hand it a set of entity and relation types, plus an optional schema of allowed triples, and the resulting KG flows directly into the Neo4j database.
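One thing the snippet below assumes is an llm object, which does the actual extraction. A minimal sketch using the package’s bundled OpenAILLM wrapper (an OPENAI_API_KEY environment variable is required here, and the model name is merely illustrative; asking for JSON output keeps the extractor’s parsing reliable):

from neo4j_graphrag.llm import OpenAILLM

# assumption: gpt-4o as the extraction model; JSON output makes parsing reliable
llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={"response_format": {"type": "json_object"}, "temperature": 0},
)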

from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

entities = ["PERSON", "ORGANIZATION", "LOCATION"]
relations = ["LOCATED_AT", "INTERACTS", "LED_BY"]
potential_schema = [
    ("PERSON", "LOCATED_AT", "LOCATION"),
    ("PERSON", "INTERACTS", "PERSON"),
    ("ORGANIZATION", "LED_BY", "PERSON"),
]

pipeline = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    embedder=embedder,
    entities=entities,
    relations=relations,
    potential_schema=potential_schema,
    from_pdf=False,
)
r = await pipeline.run_async(text="Raymond lives in New York and works at Google. He is the CEO of Apple.")
/Users/swa/conda/envs/grag/lib/python3.13/site-packages/neo4j_graphrag/experimental/components/entity_relation_extractor.py:428: UserWarning: No document metadata provided, the document node won't be created in the lexical graph
  warnings.warn(
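
From here you can close the loop and actually ask questions. The GraphRAG class ties a retriever and an LLM into a single search call; a minimal sketch reusing the retriever and llm objects from above (for the freshly built KG you would first want a vector index over the chunk embeddings):

from neo4j_graphrag.generation import GraphRAG

# a minimal sketch: retrieval plus generation in one call
rag = GraphRAG(retriever=retriever, llm=llm)
response = rag.search(query_text="Who is the CEO of Apple?", retriever_config={"top_k": 5})
print(response.answer)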

All of this is a nice stepping stone, but if you look at Microsoft’s effort or TrustGraph you can see that Neo4j has a long way to go.