
Large Language Model Relevance Assessors Agree With One Another More Than With Human Assessors

Short conference paper (to appear)

Authors: Maik Fröbe, Andrew Parry, Ferdinand Schlatt, Sean MacAvaney, Benno Stein, Martin Potthast, Matthias Hagen

Appearing in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025)

Links/IDs:
Google Scholar: 7wWfoDgAAAAJ:CHSYGLWDkRkC
Enlighten: 352747
smac.pub: sigir2025-llmagree

Abstract:

Relevance judgments can differ between assessors, but previous work has shown that such disagreements have little impact on the effectiveness rankings of retrieval systems. This applies to disagreements between humans as well as between human and large language model (LLM) assessors. However, the agreement between different LLM assessors has not yet been systematically investigated. To close this gap, we compare eight LLM assessors with each other and with human assessors on the TREC DL tracks and the retrieval task of the RAG track. We find that the agreement between LLM assessors is higher than between LLMs and humans and, importantly, that LLM assessors favor retrieval systems that use LLMs in their ranking decisions: our analyses with 30 to 50 retrieval systems show that the system rankings obtained by LLM assessors overestimate LLM-based re-rankers by 7 to 16 positions on average.
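To illustrate the rank-shift statistic quoted above, here is a minimal Python sketch (not the paper's code; all system names and scores are invented placeholders) that computes how many rank positions a set of LLM-based re-rankers moves when the same systems are scored with LLM-generated instead of human relevance judgments.

# Minimal sketch (not from the paper): how many rank positions do LLM-based
# re-rankers gain when systems are scored with LLM judgments instead of
# human judgments? All system names and scores below are made up.

def rank_positions(scores):
    """Map each system to its rank (1 = most effective) under the given scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: position for position, system in enumerate(ordered, start=1)}

def mean_rank_shift(human_scores, llm_scores, llm_rerankers):
    """Average number of positions an LLM-based re-ranker moves up under LLM judgments."""
    human_ranks = rank_positions(human_scores)
    llm_ranks = rank_positions(llm_scores)
    shifts = [human_ranks[system] - llm_ranks[system] for system in llm_rerankers]
    return sum(shifts) / len(shifts)

# Toy example with invented nDCG-style scores for four systems.
human = {"bm25": 0.42, "splade": 0.45, "monoT5": 0.40, "gpt-reranker": 0.41}
llm   = {"bm25": 0.40, "splade": 0.44, "monoT5": 0.50, "gpt-reranker": 0.52}
print(mean_rank_shift(human, llm, llm_rerankers={"monoT5", "gpt-reranker"}))  # -> 2.0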

BibTeX:

@inproceedings{fröbe:sigir2025-llmagree,
  author    = {Fröbe, Maik and Parry, Andrew and Schlatt, Ferdinand and MacAvaney, Sean and Stein, Benno and Potthast, Martin and Hagen, Matthias},
  title     = {Large Language Model Relevance Assessors Agree With One Another More Than With Human Assessors},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year      = {2025}
}