Retrieval Accuracy

[Chart: retrieval accuracy by configuration and benchmark. HotpotQA: Agentset 97.9%, RAG + Reranker 91.3%, Standard RAG 88.8%. FinanceBench: Agentset 70.2%, RAG + Reranker 40.4%, Standard RAG 38.6%.]
Configurations:
  • Agentset — Our hosted retrieval pipeline including query expansion, hybrid search, and multi-step reasoning.
  • RAG + Reranker — Same as Standard RAG, plus a reranking model that reorders retrieved chunks by relevance.
  • Standard RAG — Embeds documents, retrieves top-k chunks via vector similarity, passes them to the LLM.
All configurations use the same setup: 2048-character recursive chunking with Chonkie, a Turbopuffer vector database with top-k set to 20, and text-embedding-3-large for embeddings. RAG + Reranker and Agentset use Zerank-2 for reranking.
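
For reference, here is a minimal sketch of the Standard RAG pipeline under these settings, with the optional reranking step. The embedding model, chunk size, and top-k match the setup above; an in-memory cosine search stands in for Turbopuffer, the answering model is an assumption (the post does not name one), and `rerank` is a hypothetical placeholder rather than the actual Zerank-2 API.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(doc: str, size: int = 2048) -> list[str]:
    # Fixed-size split for brevity; the benchmark uses Chonkie's recursive chunker.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], vecs: np.ndarray, top_k: int = 20) -> list[str]:
    # In-memory cosine similarity; Turbopuffer plays this role in the benchmark.
    q = embed([query])[0]
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:top_k]]

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Hypothetical stand-in for Zerank-2: reorder candidates by relevance to
    # the query. An identity reranker keeps this sketch self-contained.
    return candidates

def answer(query: str, context: list[str]) -> str:
    prompt = "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed answering model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Usage: index once, then query.
# chunks = [c for doc in corpus for c in chunk(doc)]
# vecs = embed(chunks)
# answer(q, retrieve(q, chunks, vecs))             # Standard RAG
# answer(q, rerank(q, retrieve(q, chunks, vecs)))  # RAG + Reranker
```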

HotpotQA

HotpotQA is the leading multi-hop reasoning benchmark for RAG systems. It’s a challenging dataset of 113k question-answer pairs in which each answer requires information from two or more documents. For example: “What government position was held by the woman who portrayed Roxie Hart in the film Chicago?” To answer this, a system must first identify the actress, then find her government role.
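
That two-step structure can be sketched as an iterative retrieval loop: retrieve for the original question, ask the model to name the bridge entity (here, the actress), then retrieve again with that entity. This illustrates multi-hop retrieval in general, not Agentset's actual multi-step pipeline, and it reuses the hypothetical helpers from the sketch above.

```python
def multi_hop_answer(question: str, chunks: list[str], vecs: np.ndarray) -> str:
    # Hop 1: retrieve for the original question and extract the bridge
    # entity (e.g. the actress who played Roxie Hart).
    first_pass = retrieve(question, chunks, vecs)
    bridge = answer(
        "Name the intermediate entity needed to answer the following question. "
        f"Reply with the entity name only. Question: {question}",
        first_pass,
    )
    # Hop 2: retrieve again with the bridge entity, then answer from the
    # combined context (e.g. the actress's government position).
    second_pass = retrieve(f"{question} {bridge}", chunks, vecs)
    return answer(question, first_pass + second_pass)
```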
Configuration    Correct Answers   Average Score
Agentset         979 / 1000        9.84
RAG + Reranker   913 / 1000        9.2
Standard RAG     888 / 1000        9.0
View the HotpotQA results and JSON outputs on GitHub.

FinanceBench

FinanceBench is a benchmark for evaluating financial question-answering over real public company filings. It contains 150 question-answer pairs requiring extraction and reasoning over 10-K and 10-Q documents from companies across multiple sectors. For example: “What is the FY2018 capital expenditure amount for 3M?” To answer this, a system must locate and extract the correct value from the company’s cash flow statement.
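
Questions like this are single-hop but extraction-heavy: the answer is a single figure inside one filing. Below is a minimal sketch of such a lookup, again reusing the hypothetical helpers above, with a prompt that asks for the bare figure rather than a free-form answer.

```python
def extract_figure(question: str, chunks: list[str], vecs: np.ndarray) -> str:
    # Retrieve filing chunks (e.g. the cash flow statement in 3M's FY2018
    # 10-K) and ask for the exact value, with units, from the context only.
    context = retrieve(question, chunks, vecs)
    return answer(
        question + " Reply with the exact value and units from the filings only.",
        context,
    )
```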
Configuration    Correct Answers   Average Score
Agentset         80 / 114          7.75
RAG + Reranker   46 / 114          5.1
Standard RAG     44 / 114          5.0
View the FinanceBench results and JSON outputs on GitHub.