arXiv · 2602.17687 · February 2026
IRPAPERS: A Visual Document Benchmark
for Scientific Retrieval and Question Answering
How do image-based retrieval systems compare to established text-based methods on scientific PDF pages?
Shorten · Skaburskas · Jones · Pierse · Esposito · Trengrove · Dilocker · van Luijt  ·  Weaviate
166 IR papers · 3,230 pages (image + OCR) · 180 needle-in-the-haystack questions · 49% best Recall@1 (multimodal)

Retrieval recall by system and depth
[Chart: Recall@1 (top-1 retrieval accuracy) across text-based, image-based, multimodal/combined, and baseline systems]
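Recall@k, the metric reported throughout, credits a question whenever its gold page appears in the top-k retrieved results. A minimal sketch with hypothetical page ids (not from the benchmark):

```python
def recall_at_k(ranked_page_ids, gold_page_id, k):
    """1.0 if the gold page is among the top-k retrieved pages, else 0.0."""
    return 1.0 if gold_page_id in ranked_page_ids[:k] else 0.0

def mean_recall_at_k(runs, k):
    """Average Recall@k over (ranked_page_ids, gold_page_id) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)

# Hypothetical rankings for four questions (page ids are made up)
runs = [
    (["p7", "p2", "p9"], "p7"),  # gold at rank 1
    (["p4", "p7", "p1"], "p7"),  # gold at rank 2
    (["p3", "p8", "p5"], "p3"),  # gold at rank 1
    (["p8", "p6", "p2"], "p1"),  # gold not retrieved
]
print(mean_recall_at_k(runs, 1))  # 0.5
print(mean_recall_at_k(runs, 3))  # 0.75
```

Raising k can only keep recall the same or increase it, which is why every system's curve rises with retrieval depth.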
Experimental pipeline
[Interactive diagram: stages of the experimental pipeline]

RAG system comparison — question answering accuracy
TextRAG (hybrid text retrieval) vs ImageRAG (ColModernVBERT + GPT-4.1) at k=1 and k=5

TextRAG (OCR text + hybrid retrieval):
  k=1 accuracy: 0.62
  k=5 accuracy: 0.82
  Oracle (gold page): 0.74
  Hard negative: lower
  No-retrieval baseline: 0.16

ImageRAG (ColModernVBERT retrieval, GPT-4.1 vision reader):
  k=1 accuracy: 0.40
  k=5 accuracy: 0.71
  Oracle (gold image): 0.68
  Hard negative: lower
Key insight: At k=5, both systems exceed oracle single-document retrieval (TextRAG 0.82 > 0.74, ImageRAG 0.71 > 0.68) — confirming that scientific QA often requires synthesising evidence across multiple pages, not just finding one definitive source.
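TextRAG's "hybrid" retrieval combines a lexical ranking with a dense-vector ranking over the OCR text. The page does not spell out the fusion rule; one common choice is reciprocal rank fusion, sketched here with hypothetical page ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list contributes 1/(k + rank)
    to a doc's score; return doc ids sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and dense rankings over the same page corpus
bm25_ranking = ["p3", "p1", "p2"]
dense_ranking = ["p1", "p4", "p3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking])[:2])  # ['p1', 'p3']
```

A page that ranks moderately well in both lists (p1 here) can beat a page that tops only one list, which is the usual motivation for hybrid retrieval on noisy OCR text.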
Benchmark question categories
180 needle-in-the-haystack questions span three reasoning types
🔎 Information extraction: precise methodological details and factual lookups from paper text
🔢 Numerical reasoning: calculations and comparisons using numbers from tables or figures
🧩 Logical reasoning: multi-step inference across sections; cannot be answered from memory alone

Key findings
Open-source vs closed-source image retrieval
Closed-source Cohere Embed v4.0 and Voyage 3 Large vs the open-source ColModernVBERT
[Chart: Recall@1 — image embedding comparison]
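ColModernVBERT is a multi-vector (late-interaction) model: a query and a page are each represented by many vectors, and the page is scored by matching every query vector against all page vectors. A minimal MaxSim sketch with toy 2-d vectors (not the model's actual embeddings):

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its max dot
    product over all document vectors, then sum over query vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy "token" and "patch" embeddings (illustrative only)
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query vectors exactly
page_b = [[0.5, 0.5], [0.5, 0.5]]   # only partial matches
print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
```

The cost of this fine-grained matching is that every page keeps many vectors, which is the scaling problem MUVERA (below) addresses.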
MUVERA encoding — efficiency vs quality tradeoff
MUVERA compresses multi-vector image embeddings into single fixed-size vectors for faster HNSW search
[Chart: Recall@1 vs MUVERA compression level]
Trade-off: MUVERA trades retrieval quality for efficiency: higher compression means faster search but lower recall. The full multi-vector ColModernVBERT achieves the best recall but needs MUVERA or a similar encoding to scale to large corpora.
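MUVERA's core idea is a fixed dimensional encoding: partition a set of vectors into buckets via SimHash (sign patterns against random hyperplanes), aggregate each bucket, and concatenate the bucket aggregates into one fixed-size vector, so a single inner product approximates the multi-vector score. A simplified sketch; real MUVERA aggregates queries and documents differently (sums vs centroids), fills empty buckets, and applies a final random projection:

```python
def simhash_bucket(vec, hyperplanes):
    """Bucket id = sign pattern of dot products with the hyperplanes."""
    bits = 0
    for h in hyperplanes:
        s = sum(x * y for x, y in zip(vec, h))
        bits = (bits << 1) | (1 if s > 0 else 0)
    return bits

def fixed_dim_encoding(vecs, hyperplanes, dim):
    """Sum each vector into its SimHash bucket; concatenate bucket sums.
    More hyperplanes -> more buckets -> less compression, higher fidelity."""
    out = [0.0] * (2 ** len(hyperplanes) * dim)
    for v in vecs:
        b = simhash_bucket(v, hyperplanes)
        for i, x in enumerate(v):
            out[b * dim + i] += x
    return out

# With one fixed hyperplane, 2-d vectors land in 2 buckets (4-d encoding)
planes = [[1.0, 0.0]]
print(fixed_dim_encoding([[1.0, 0.0], [-1.0, 0.0]], planes, dim=2))
# [-1.0, 0.0, 1.0, 0.0]
```

Because the output is a single vector, the encodings can be indexed directly in HNSW, which is where the speedup over per-vector MaxSim comes from; the bucket count is the compression knob the chart above varies.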