*Figure: Retrieval Accuracy. Grouped bar chart of accuracy (%) on the y-axis for each benchmark on the x-axis, with one bar each for Agentset, RAG + Reranker, and Standard RAG.*
Configurations:

* **Agentset**: Our hosted retrieval pipeline, including query expansion, hybrid search, and multi-step reasoning.
* **RAG + Reranker**: Same as Standard RAG, plus a reranking model that reorders retrieved chunks by relevance.
* **Standard RAG**: Embeds documents, retrieves the top-k chunks via vector similarity, and passes them to the LLM.

All configurations share the same setup: 2048-character recursive chunking with Chonkie, a Turbopuffer vector database with top-k set to 20, and text-embedding-3-large for embeddings. RAG + Reranker and Agentset use Zerank-2 for reranking.

## HotpotQA

[HotpotQA](https://hotpotqa.github.io/) is a widely used multi-hop reasoning benchmark for RAG systems. It contains 113k question-answer pairs, and each answer requires information from two or more documents.

For example: *"What government position was held by the woman who portrayed Roxie Hart in the film Chicago?"* To answer this, a system must first find the actress, then find her government role.

| Configuration  | Correct Answers | Average Score |
| -------------- | --------------- | ------------- |
| **Agentset**   | **979 / 1000**  | **9.84**      |
| RAG + Reranker | 913 / 1000      | 9.2           |
| Standard RAG   | 888 / 1000      | 9.0           |

View the HotpotQA results and JSON outputs on [GitHub](https://github.com/agentset-ai/benchmarks).

## FinanceBench

[FinanceBench](https://huggingface.co/datasets/PatronusAI/financebench) is a benchmark for evaluating financial question answering over real public company filings. It contains 150 question-answer pairs requiring extraction and reasoning over 10-K and 10-Q documents from companies across multiple sectors.

For example: *"What is the FY2018 capital expenditure amount for 3M?"* To answer this, a system must locate and extract the correct value from the company's cash flow statement.

| Configuration  | Correct Answers | Average Score |
| -------------- | --------------- | ------------- |
| **Agentset**   | **80 / 114**    | **7.75**      |
| RAG + Reranker | 46 / 114        | 5.1           |
| Standard RAG   | 44 / 114        | 5.0           |

View the FinanceBench results and JSON outputs on [GitHub](https://github.com/agentset-ai/financebench).
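
The correct-answer counts in the two results tables can be converted to accuracy percentages for easier comparison across benchmarks. The snippet below is only an illustrative sketch: `BenchmarkResult` and `accuracyPercent` are hypothetical names, not part of the Agentset benchmark code, and it assumes accuracy means correct answers divided by questions evaluated, which the write-up does not state explicitly.

```typescript
// Hypothetical record shape; counts are copied from the results tables.
interface BenchmarkResult {
  benchmark: string;
  configuration: string;
  correct: number;
  total: number;
}

const results: BenchmarkResult[] = [
  { benchmark: "HotpotQA", configuration: "Agentset", correct: 979, total: 1000 },
  { benchmark: "HotpotQA", configuration: "RAG + Reranker", correct: 913, total: 1000 },
  { benchmark: "HotpotQA", configuration: "Standard RAG", correct: 888, total: 1000 },
  { benchmark: "FinanceBench", configuration: "Agentset", correct: 80, total: 114 },
  { benchmark: "FinanceBench", configuration: "RAG + Reranker", correct: 46, total: 114 },
  { benchmark: "FinanceBench", configuration: "Standard RAG", correct: 44, total: 114 },
];

// Assumption: accuracy = correct / total, reported as a percentage
// rounded to one decimal place.
function accuracyPercent(correct: number, total: number): number {
  if (total <= 0) {
    throw new Error("total must be positive");
  }
  return Math.round((correct / total) * 1000) / 10;
}

for (const r of results) {
  const pct = accuracyPercent(r.correct, r.total);
  console.log(`${r.benchmark} | ${r.configuration}: ${pct}%`);
}
```

Under that assumption, the HotpotQA rows come out to 97.9%, 91.3%, and 88.8%, and the FinanceBench rows to 70.2%, 40.4%, and 38.6%. Whether the chart plots exactly these figures is not confirmed by the text.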