arXiv · 2602.17687 · February 2026
IRPAPERS: A Visual Document Benchmark
for Scientific Retrieval and Question Answering
How do image-based retrieval systems compare to established text-based methods on scientific PDF pages?
Shorten · Skaburskas · Jones · Pierse · Esposito · Trengrove · Dilocker · van Luijt  ·  Weaviate
166 IR papers · 3,230 pages (image + OCR) · 180 needle-in-the-haystack questions · 49% best Recall@1 (multimodal)

Retrieval recall by system and depth
[Chart: Recall@1 (top-1 retrieval accuracy) across text-based, image-based, multimodal/combined, and baseline systems]
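Recall@k, the metric reported throughout, credits a question whenever its gold page appears in the top-k retrieved results. A minimal sketch with hypothetical page ids (not from the benchmark):

```python
def recall_at_k(ranked_page_ids, gold_page_id, k):
    """1.0 if the gold page is among the top-k retrieved pages, else 0.0."""
    return 1.0 if gold_page_id in ranked_page_ids[:k] else 0.0

def mean_recall_at_k(runs, k):
    """Average Recall@k over (ranked_page_ids, gold_page_id) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)

# Hypothetical rankings for four questions (page ids are made up)
runs = [
    (["p7", "p2", "p9"], "p7"),  # gold at rank 1
    (["p4", "p7", "p1"], "p7"),  # gold at rank 2
    (["p3", "p8", "p5"], "p3"),  # gold at rank 1
    (["p8", "p6", "p2"], "p1"),  # gold not retrieved
]
print(mean_recall_at_k(runs, 1))  # 0.5
print(mean_recall_at_k(runs, 3))  # 0.75
```

Raising k can only keep recall the same or increase it, which is why every system's curve rises with retrieval depth.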
Experimental pipeline
[Interactive diagram: stages of the experimental pipeline]

RAG system comparison — question answering accuracy
TextRAG (hybrid text retrieval) vs ImageRAG (ColModernVBERT + GPT-4.1) at k=1 and k=5

TextRAG (OCR text + hybrid retrieval):
  k=1 accuracy: 0.62
  k=5 accuracy: 0.82
  Oracle (gold page): 0.74
  Hard negative: lower
  No-retrieval baseline: 0.16

ImageRAG (ColModernVBERT retrieval, GPT-4.1 vision reader):
  k=1 accuracy: 0.40
  k=5 accuracy: 0.71
  Oracle (gold image): 0.68
  Hard negative: lower
Key insight: At k=5, both systems exceed oracle single-document retrieval (TextRAG 0.82 > 0.74, ImageRAG 0.71 > 0.68) — confirming that scientific QA often requires synthesising evidence across multiple pages, not just finding one definitive source.
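TextRAG's "hybrid" retrieval combines a lexical ranking with a dense-vector ranking over the OCR text. The page does not spell out the fusion rule; one common choice is reciprocal rank fusion, sketched here with hypothetical page ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each list contributes 1/(k + rank)
    to a doc's score; return doc ids sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and dense rankings over the same page corpus
bm25_ranking = ["p3", "p1", "p2"]
dense_ranking = ["p1", "p4", "p3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking])[:2])  # ['p1', 'p3']
```

A page that ranks moderately well in both lists (p1 here) can beat a page that tops only one list, which is the usual motivation for hybrid retrieval on noisy OCR text.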
Benchmark question categories
180 needle-in-the-haystack questions span three reasoning types
🔎 Information extraction: precise methodological details and factual lookups from paper text
🔢 Numerical reasoning: calculations and comparisons using numbers from tables or figures
🧩 Logical reasoning: multi-step inference across sections; cannot be answered from memory alone

Key findings
Open-source vs closed-source image retrieval
Closed-source Cohere Embed v4.0 and Voyage 3 Large vs the open-source ColModernVBERT
[Chart: Recall@1 — image embedding comparison]
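ColModernVBERT is a multi-vector (late-interaction) model: a query and a page are each represented by many vectors, and the page is scored by matching every query vector against all page vectors. A minimal MaxSim sketch with toy 2-d vectors (not the model's actual embeddings):

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its max dot
    product over all document vectors, then sum over query vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy "token" and "patch" embeddings (illustrative only)
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query vectors exactly
page_b = [[0.5, 0.5], [0.5, 0.5]]   # only partial matches
print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
```

The cost of this fine-grained matching is that every page keeps many vectors, which is the scaling problem MUVERA (below) addresses.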
MUVERA encoding — efficiency vs quality tradeoff
MUVERA compresses multi-vector image embeddings into single fixed-size vectors for faster HNSW search
[Chart: Recall@1 vs MUVERA compression level]
Trade-off: MUVERA trades retrieval quality for efficiency: higher compression means faster search but lower recall. The full multi-vector ColModernVBERT achieves the best recall but needs MUVERA or a similar encoding to scale to large corpora.
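MUVERA's core idea is a fixed dimensional encoding: partition a set of vectors into buckets via SimHash (sign patterns against random hyperplanes), aggregate each bucket, and concatenate the bucket aggregates into one fixed-size vector, so a single inner product approximates the multi-vector score. A simplified sketch; real MUVERA aggregates queries and documents differently (sums vs centroids), fills empty buckets, and applies a final random projection:

```python
def simhash_bucket(vec, hyperplanes):
    """Bucket id = sign pattern of dot products with the hyperplanes."""
    bits = 0
    for h in hyperplanes:
        s = sum(x * y for x, y in zip(vec, h))
        bits = (bits << 1) | (1 if s > 0 else 0)
    return bits

def fixed_dim_encoding(vecs, hyperplanes, dim):
    """Sum each vector into its SimHash bucket; concatenate bucket sums.
    More hyperplanes -> more buckets -> less compression, higher fidelity."""
    out = [0.0] * (2 ** len(hyperplanes) * dim)
    for v in vecs:
        b = simhash_bucket(v, hyperplanes)
        for i, x in enumerate(v):
            out[b * dim + i] += x
    return out

# With one fixed hyperplane, 2-d vectors land in 2 buckets (4-d encoding)
planes = [[1.0, 0.0]]
print(fixed_dim_encoding([[1.0, 0.0], [-1.0, 0.0]], planes, dim=2))
# [-1.0, 0.0, 1.0, 0.0]
```

Because the output is a single vector, the encodings can be indexed directly in HNSW, which is where the speedup over per-vector MaxSim comes from; the bucket count is the compression knob the chart above varies.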