Appearing in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025)
Abstract:
Cranfield-style test collections are the bedrock of offline IR evaluation, allowing for the reproducible comparison of systems with respect to each other. When offline evaluation became the norm, practitioners contested its validity due to the subjectivity of relevance; when web-scale corpora became the norm in offline evaluation, practitioners expressed concerns that, without exhaustive annotation of a corpus, evaluation is unreliable. In each case, studies were carried out to assess these concerns, ultimately concluding that the Cranfield paradigm remains robust under developments in retrieval. Now that neural and generative ranking models are increasingly the norm in modern ad-hoc retrieval, curators of test corpora have expressed concerns that, through the iterative development of systems, we may accidentally overfit to the particular notions of relevance expressed by annotators. Despite calls to put an "expiration date" on these test collections, it is still unclear when or how to determine that a test collection is "expired", if such a condition exists. The reliability of depth-pooled test collections was previously validated by re-annotating a subset of judgments: under different annotators, system order was conserved, albeit with variations in absolute performance. The continued use of the popular TREC Deep Learning 2019 collection leads us to consider that questions raised earlier in the development of test collections may once again be important, as the landscape of ad-hoc retrieval has changed. By comparing system effectiveness under the official TREC relevance judgments and manually re-annotated judgments from multiple assessors, we find that some models (particularly small models distilled using LLM data) substantially degrade when exposed to alternative, equally "correct" relevance assessments. We analyze several factors beyond modern training processes, which align with previous literature on how different annotators and information needs can affect the utility of a test collection. Furthermore, we observe that some systems are approaching the effectiveness of our alternative assessments (when evaluated over the TREC assessments), suggesting that we may be near a practical upper bound on the effectiveness that can be meaningfully measured on this particular test collection. Overall, we demonstrate one (costly) way to establish that a test collection is nearly "expired", while validating the findings of previous work.
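At a high level, the comparison described in the abstract amounts to scoring the same set of runs under two sets of relevance judgments and checking how well the two induced system orderings agree. The snippet below is a minimal sketch of that kind of analysis (not the paper's actual code), assuming hypothetical TREC-format qrels and run files and using the ir_measures and scipy libraries:

# Hypothetical sketch: evaluate the same runs under two qrel sets
# (e.g., official TREC judgments vs. alternative re-annotations) and
# measure how well the resulting system orderings agree.
# File names below are placeholders, not artifacts from the paper.
from ir_measures import nDCG, calc_aggregate, read_trec_qrels, read_trec_run
from scipy.stats import kendalltau

MEASURE = nDCG @ 10

official_qrels = list(read_trec_qrels("official.qrels"))        # TREC judgments
alternative_qrels = list(read_trec_qrels("alternative.qrels"))  # re-annotations

run_files = ["bm25.run", "monot5.run", "distilled.run"]          # placeholder runs

def score(qrels, run_file):
    # Aggregate nDCG@10 for one run under one set of judgments.
    run = list(read_trec_run(run_file))
    return calc_aggregate([MEASURE], qrels, run)[MEASURE]

official_scores = [score(official_qrels, r) for r in run_files]
alternative_scores = [score(alternative_qrels, r) for r in run_files]

# Kendall's tau over the two system orderings: a high tau means the relative
# ranking of systems is preserved even if absolute scores shift.
tau, p_value = kendalltau(official_scores, alternative_scores)
print(f"Kendall's tau between system orderings: {tau:.3f} (p={p_value:.3f})")

In this framing, a drop in a specific system's score under the alternative judgments (with other systems stable) is what would flag that system as sensitive to the particular annotators' notion of relevance.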
BibTeX:
@inproceedings{parry:sigir2025-anno,
  author    = {Parry, Andrew and Fröbe, Maik and Scells, Harrisen and Schlatt, Ferdinand and Faggioli, Guglielmo and Zerhoudi, Saber and MacAvaney, Sean and Yang, Eugene},
  title     = {Variations in Relevance Judgments and the Shelf Life of Test Collections},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.20937}
}