A Framework for Identifying Biases in Retrievers
⚠️ The best accuracy of dense retrievers on the foil (default) subset is below 🔴10%🔴.
Retrievers consistently score document_1 higher than document_2 across all subsets.
⇒ Retrieval biases often outweigh the impact of answer presence.
| Model | Accuracy | Paired t-Test Statistic | p-value |
|---|---|---|---|
| 🥇 ReasonIR-8B 🆕 | 8.0% | -36.92 | < 0.01 |
| 🥈 ColBERT (v2) 🆕 | 7.6% | -20.96 | < 0.01 |
| 🥉 COCO-DR Base MSMARCO | 2.4% | -32.92 | < 0.01 |
| Dragon+ | 1.2% | -40.94 | < 0.01 |
| Dragon RoBERTa | 0.8% | -36.53 | < 0.01 |
| Contriever MSMARCO | 0.8% | -42.25 | < 0.01 |
| RetroMAE MSMARCO FT | 0.4% | -41.49 | < 0.01 |
| Contriever | 0.4% | -34.58 | < 0.01 |
Evaluate any model using this code: https://colab.research.google.com/github/mohsenfayyaz/ColDeR/blob/main/Benchmark_Eval.ipynb
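As a minimal sketch of how such an evaluation can be run, the snippet below scores each query against its paired documents with a Contriever-style bi-encoder (mean pooling, dot-product scores). The `query` / `document_1` / `document_2` field names, the `foil` config name, and the t-test sign convention are assumptions inferred from this card; the notebook above is the reference implementation and may differ in details.

```python
# Sketch only: field, config, and split names are assumptions; see the notebook above.
import torch
from datasets import load_dataset
from scipy.stats import ttest_rel
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever").eval()

def embed(texts, batch_size=32):
    """Mean-pool token embeddings over non-padding positions (Contriever-style)."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        mask = enc["attention_mask"].unsqueeze(-1)
        chunks.append((hidden * mask).sum(1) / mask.sum(1))
    return torch.cat(chunks)

ds = load_dataset("mohsenfayyaz/ColDeR", "foil")   # config name taken from the subset list below
data = ds[next(iter(ds))]                          # take the only split, whatever it is named

q  = embed(data["query"])
d1 = embed(data["document_1"])   # biased foil document without the answer
d2 = embed(data["document_2"])   # evidence document padded with unrelated sentences

s1, s2 = (q * d1).sum(-1), (q * d2).sum(-1)        # dot-product relevance scores

accuracy = (s2 > s1).float().mean().item()           # how often the evidence document wins
t_stat, p_value = ttest_rel(s2.numpy(), s1.numpy())  # negative t => foil preferred on average
print(f"accuracy={accuracy:.1%}  t={t_stat:.2f}  p={p_value:.3g}")
```

Here accuracy is the fraction of pairs where the evidence document (document_2 in the foil subset) outscores the biased foil, and the paired t-test compares the two score columns; negative statistics mean the foil is scored higher on average. Each subset pairs a document_1 and a document_2 per query, as described below.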
- foil (default):
- document_1: Foil Document with Multiple Biases but No Evidence: This document combines multiple biases, such as repetition and position bias. It opens with a sentence containing two mentions of the head entity, followed by a sentence that mentions the head but not the tail (answer), so it contains no evidence.
- document_2: Evidence Document with Unrelated Content: This document includes four unrelated sentences from another document, followed by the evidence sentence with both the head and tail entities. The document ends with the same four unrelated sentences.
- answer_importance:
- document_1: Document with Evidence: Contains a leading evidence sentence with both the head entity and the tail entity (answer).
- document_2: Document without Evidence: Contains a leading sentence with only the head entity but no tail.
- brevity_bias:
- document_1: Single Evidence, consisting of only the evidence sentence.
- document_2: Evidence+Document, consisting of the evidence sentence followed by the rest of the document.
- literal_bias:
- document_1: Both query and document use the shortest name variant (short-short).
- document_2: The query uses the short name but the document contains the long name variant (short-long).
- position_bias:
- document_1: Beginning-Evidence Document: The evidence sentence is positioned at the start of the document.
- document_2: End-Evidence Document: The same evidence sentence is positioned at the end of the document.
- repetition_bias:
- document_1: More Heads, comprising an evidence sentence and two additional sentences containing head mentions but no tails.
- document_2: Fewer Heads, comprising an evidence sentence and two additional sentences from the document without head or tail mentions.
- poison:
- document_1: Poisoned Biased Evidence: We add the evidence sentence to foil document 1 and replace the tail entity in it with a contextually plausible but entirely incorrect entity using GPT-4o.
- document_2: Correct Evidence Document with Unrelated Content: This document includes four unrelated sentences from another document, followed by the evidence sentence with both the head and tail entities. The document ends with the same four unrelated sentences.
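For a quick look at what each pair actually contains, the sketch below iterates over the subsets listed above. The configuration names are taken from that list and the field names from the descriptions, so both are assumptions; check the dataset files on the Hub if they differ.

```python
# Inspect one example from every subset; names below are assumptions based on the list above.
from datasets import load_dataset

SUBSETS = ["foil", "answer_importance", "brevity_bias", "literal_bias",
           "position_bias", "repetition_bias", "poison"]

for name in SUBSETS:
    ds = load_dataset("mohsenfayyaz/ColDeR", name)
    split = ds[next(iter(ds))]                 # take the only split, whatever it is called
    ex = split[0]
    print(f"=== {name}: {len(split)} query/document pairs ===")
    print("query:      ", ex["query"][:100])
    print("document_1: ", ex["document_1"][:100])
    print("document_2: ", ex["document_2"][:100])
```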
- Paper: https://arxiv.org/abs/2503.05037
- Dataset: https://huggingface.co/datasets/mohsenfayyaz/ColDeR
- Repository: https://github.com/mohsenfayyaz/ColDeR
BibTeX: If you find this work useful, please consider citing our paper:
@inproceedings{fayyaz-etal-2025-collapse,
title = "Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence",
author = "Fayyaz, Mohsen and
Modarressi, Ali and
Schuetze, Hinrich and
Peng, Nanyun",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.447/",
pages = "9136--9152",
ISBN = "979-8-89176-251-0",
abstract = "Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid downstream failures. In this work, we repurpose a relation extraction dataset (e.g., Re-DocRED) to design controlled experiments that quantify the impact of heuristic biases, such as a preference for shorter documents, on retrievers like Dragon+ and Contriever. We uncover major vulnerabilities, showing retrievers favor shorter documents, early positions, repeated entities, and literal matches, all while ignoring the answer{'}s presence! Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 10{\%} of cases over a synthetic biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34{\%} performance drop compared to providing no documents at all. https://huggingface.co/datasets/mohsenfayyaz/ColDeR"
}