On Survivorship Bias in MS MARCO

Short conference paper, 10 citations

Authors: Prashansa Gupta, Sean MacAvaney

Appeared in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)

Links/IDs:
DOI 10.1145/3477495.3531832
DBLP conf/sigir/GuptaM22
ACM 3477495.3531832
arXiv 2204.12852
Google Scholar 7wWfoDgAAAAJ:e5wmG9Sq2KIC
Semantic Scholar e0b0dbea8702f0535c47e440bab7406df6e5a0c0
Enlighten 268516
smac.pub sigir2022-survivor

Abstract:

Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset, given that annotators could not find answers to 38–45% of the queries, leading to these queries being discarded in training and evaluation processes. Although we find that some discarded queries in MS MARCO are ill-defined or otherwise unanswerable, many are valid questions that could be answered had the collection been annotated more completely (around two thirds using modern ranking techniques). This survivability problem distorts the MS MARCO collection in several ways. We find that it affects the natural distribution of queries in terms of the type of information needed. When used for evaluation, we find that the bias likely yields a significant distortion of the absolute performance scores observed. Finally, given that MS MARCO is frequently used for model training, we train models based on subsets of MS MARCO that simulate more survivorship bias. We find that models trained in this setting are up to 9.9% worse when evaluated on versions of the dataset with more complete annotations, and up to 3.5% worse at zero-shot transfer. Our findings are complementary to other recent suggestions for further annotation of MS MARCO, but with a focus on discarded queries.
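
To make the evaluation distortion concrete, the following is a minimal sketch, not the paper's method. It compares a mean reciprocal rank computed only over queries that survived annotation with one computed over all queries. The per-query scores, the query IDs, the function names, and the choice to score discarded queries as 0.0 are all hypothetical assumptions for illustration. The abstract itself suggests many discarded queries are answerable, so 0.0 is a deliberately pessimistic fill value.

```python
from statistics import mean

# Hypothetical per-query reciprocal ranks (made-up numbers, not from the paper).
# None marks a query that was discarded because annotators found no answer.
reciprocal_ranks = {"q1": 1.0, "q2": 0.5, "q3": None, "q4": 0.25, "q5": None}


def mrr_survivors(rr):
    """Mean reciprocal rank over annotated (surviving) queries only."""
    kept = [score for score in rr.values() if score is not None]
    return mean(kept)


def mrr_all(rr, fill=0.0):
    """Mean reciprocal rank over all queries.

    Discarded queries are scored with `fill`, an assumed value chosen
    only to show how the absolute score moves.
    """
    scores = [fill if score is None else score for score in rr.values()]
    return mean(scores)


if __name__ == "__main__":
    print(f"survivors only: {mrr_survivors(reciprocal_ranks):.3f}")
    print(f"all queries:    {mrr_all(reciprocal_ranks):.3f}")
```

The gap between the two numbers depends entirely on the assumed fill value and on how many queries were discarded, which is why the absolute score, rather than the ranking of systems, is the quantity most exposed to this bias.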

BibTeX:

@inproceedings{gupta:sigir2022-survivor,
  author = {Gupta, Prashansa and MacAvaney, Sean},
  title = {On Survivorship Bias in MS MARCO},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2022},
  url = {https://arxiv.org/abs/2204.12852},
  doi = {10.1145/3477495.3531832}
}