Figure: Retrieval accuracy by configuration. HotpotQA — Agentset 97.9%, RAG + Reranker 91.3%, Standard RAG 88.8%. FinanceBench — Agentset 70.2%, RAG + Reranker 40.4%, Standard RAG 38.6%.
- Standard RAG — Embeds documents, retrieves top-k chunks via vector similarity, and passes them to the LLM.
- RAG + Reranker — Same as Standard RAG, plus a reranking model that reorders retrieved chunks by relevance.
- Agentset — Our hosted retrieval pipeline, including query expansion, hybrid search, and multi-step reasoning.
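As a rough illustration, the Standard RAG retrieval step can be sketched as follows. This is a toy sketch, not Agentset's implementation: `embed()` here is a stand-in character-frequency "embedding" in place of a real embedding model, and the corpus is illustrative.

```python
# Sketch of the "Standard RAG" retrieval step: embed documents, score them
# against the query by cosine similarity, and take the top-k chunks.
# embed() is a toy stand-in for a real embedding model.

def embed(text: str) -> list[float]:
    # Toy embedding: character-frequency vector over a-z.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank all chunks by similarity to the query; the top-k chunks are
    # what gets passed to the LLM as context.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "HotpotQA contains 113k multi-hop question-answer pairs.",
    "FinanceBench covers 10-K and 10-Q filings from public companies.",
    "A reranker reorders retrieved chunks by relevance.",
]
print(retrieve_top_k("Which benchmark uses 10-K filings?", docs, k=1))
```

A reranker slots in after `retrieve_top_k`: retrieve a larger candidate set (say k = 50), then let a cross-encoder reorder it before truncating to the context budget.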
HotpotQA
HotpotQA is the leading multi-hop reasoning benchmark for RAG systems. It’s a challenging dataset of 113k question-answer pairs, where each answer requires information from two or more documents. For example: “What government position was held by the woman who portrayed Roxie Hart in the film Chicago?” To answer this, a system must first find the actress, then find her government role.

| Configuration | Correct Answers | Average Score |
|---|---|---|
| Agentset | 979 / 1000 | 9.84 |
| RAG + Reranker | 913 / 1000 | 9.2 |
| Standard RAG | 888 / 1000 | 9.0 |
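The two-hop pattern behind questions like the Roxie Hart example can be sketched as: retrieve a passage answering the first sub-question, extract the bridge entity, then retrieve again with that entity. The corpus and names below are fictional placeholders, and the keyword matcher is a toy stand-in for a real retriever.

```python
# Two-hop retrieval sketch for a multi-hop question: hop 1 finds the actress,
# hop 2 uses her name (the "bridge entity") to find the government position.
# All passages are fictional placeholder data.

CORPUS = [
    "Jane Doe portrayed Roxie Hart in the film Chicago.",   # fictional
    "Jane Doe later served as a city council member.",      # fictional
    "The film Chicago won the Academy Award for Best Picture.",
]

def keyword_retrieve(query: str, corpus: list[str]) -> str:
    # Toy retriever: return the passage sharing the most words with the query.
    q_words = set(query.lower().split())
    return max(corpus, key=lambda p: len(q_words & set(p.lower().split())))

# Hop 1: who played Roxie Hart?
hop1 = keyword_retrieve("who portrayed Roxie Hart in the film Chicago", CORPUS)
bridge_entity = hop1.split(" portrayed")[0]  # extract "Jane Doe"

# Hop 2: what position did she hold?
hop2 = keyword_retrieve(f"{bridge_entity} government position served", CORPUS)
print(hop1)
print(hop2)
```

Single-shot retrieval struggles here because no one document matches the full question, which is why multi-step pipelines pull ahead on this benchmark.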
FinanceBench
FinanceBench is a benchmark for evaluating financial question answering over real public-company filings. It contains 150 question-answer pairs requiring extraction and reasoning over 10-K and 10-Q documents from companies across multiple sectors. For example: “What is the FY2018 capital expenditure amount for 3M?” To answer this, a system must locate and extract the correct value from the company’s cash flow statement.

| Configuration | Correct Answers | Average Score |
|---|---|---|
| Agentset | 80 / 114 | 7.75 |
| RAG + Reranker | 46 / 114 | 5.1 |
| Standard RAG | 44 / 114 | 5.0 |
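As a sanity check, the retrieval-accuracy percentages shown in the figure follow directly from the correct-answer counts in the two tables:

```python
# Derive each configuration's accuracy percentage from the table counts
# (correct answers / questions evaluated).
results = {
    "HotpotQA": {
        "Agentset": (979, 1000),
        "RAG + Reranker": (913, 1000),
        "Standard RAG": (888, 1000),
    },
    "FinanceBench": {
        "Agentset": (80, 114),
        "RAG + Reranker": (46, 114),
        "Standard RAG": (44, 114),
    },
}
for bench, configs in results.items():
    for name, (correct, total) in configs.items():
        print(f"{bench:13s} {name:15s} {100 * correct / total:.1f}%")
```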