Appearing in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025)
Abstract:
Cranfield-style test collections are the bedrock of offline IR evaluation, allowing for the reproducible comparison of systems with respect to each other. When offline evaluation became the norm, practitioners contested its validity due to the subjectivity of relevance; when web-scale corpora became the norm in offline evaluation, practitioners expressed concerns that, without exhaustive annotation of a corpus, evaluation is unreliable. In each case, studies were carried out to assess these concerns, ultimately concluding that the Cranfield paradigm remains robust under developments in retrieval. Now that neural and generative ranking models are increasingly the norm in modern ad-hoc retrieval, curators of test corpora have expressed concerns that, through the iterative development of systems, we may accidentally overfit to the particular notions of relevance expressed by annotators. Despite calls to put an "expiration date" on these test collections, it is still unclear when or how to determine that a test collection is "expired", if such a condition exists. The reliability of depth-pooled test collections was previously validated by re-annotating a subset of judgments: under different annotators, system order was conserved, albeit with variations in absolute performance. The continued use of the popular TREC Deep Learning 2019 collection leads us to consider that questions raised earlier in the development of test collections may once again be important, as the landscape of ad-hoc retrieval has changed. By comparing system effectiveness under the official TREC relevance judgments and manually re-annotated judgments from multiple assessors, we find that some models (particularly small models distilled using LLM data) substantially degrade when exposed to alternative, equally "correct" relevance assessments. We analyze several factors beyond modern training processes, which align with previous literature on how different annotators and information needs can affect the utility of a test collection. Furthermore, we observe that some systems are approaching the effectiveness of our alternative assessments (when evaluated over the TREC assessments), suggesting that we may be near a practical upper bound on the effectiveness that can be meaningfully measured on this particular test collection. Overall, we demonstrate one (costly) way to establish that a test collection is nearly "expired", while validating the findings of previous work.
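At a high level, the comparison described in the abstract amounts to scoring the same set of runs under two sets of relevance judgments and checking how well the two induced system orderings agree. The snippet below is a minimal sketch of that kind of analysis (not the paper's actual code), assuming hypothetical TREC-format qrels and run files and using the ir_measures and scipy libraries:

# Hypothetical sketch: evaluate the same runs under two qrel sets
# (e.g., official TREC judgments vs. alternative re-annotations) and
# measure how well the resulting system orderings agree.
# File names below are placeholders, not artifacts from the paper.
from ir_measures import nDCG, calc_aggregate, read_trec_qrels, read_trec_run
from scipy.stats import kendalltau

MEASURE = nDCG @ 10

official_qrels = list(read_trec_qrels("official.qrels"))        # TREC judgments
alternative_qrels = list(read_trec_qrels("alternative.qrels"))  # re-annotations

run_files = ["bm25.run", "monot5.run", "distilled.run"]          # placeholder runs

def score(qrels, run_file):
    # Aggregate nDCG@10 for one run under one set of judgments.
    run = list(read_trec_run(run_file))
    return calc_aggregate([MEASURE], qrels, run)[MEASURE]

official_scores = [score(official_qrels, r) for r in run_files]
alternative_scores = [score(alternative_qrels, r) for r in run_files]

# Kendall's tau over the two system orderings: a high tau means the relative
# ranking of systems is preserved even if absolute scores shift.
tau, p_value = kendalltau(official_scores, alternative_scores)
print(f"Kendall's tau between system orderings: {tau:.3f} (p={p_value:.3f})")

In this framing, a drop in a specific system's score under the alternative judgments (with other systems stable) is what would flag that system as sensitive to the particular annotators' notion of relevance.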
BibTeX:
@inproceedings{parry:sigir2025-anno,
  author    = {Parry, Andrew and Fröbe, Maik and Scells, Harrisen and Schlatt, Ferdinand and Faggioli, Guglielmo and Zerhoudi, Saber and MacAvaney, Sean and Yang, Eugene},
  title     = {Variations in Relevance Judgments and the Shelf Life of Test Collections},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.20937}
}