Neural Passage Quality Estimation for Static Pruning

Type: long conference paper (5 citations)

Authors: Xuejun Chang, Debabrata Mishra, Craig Macdonald, Sean MacAvaney

Appeared in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

Links/IDs:

DOI: 10.1145/3626772.3657765
DBLP: conf/sigir/ChangMMM24
arXiv: 2407.12170
Google Scholar: 7wWfoDgAAAAJ:rO6llkc54NcC
Semantic Scholar: 8be6a0b8baf3eff8def1a0d976c072317842735a
Enlighten: 323368
smac.pub: sigir2024-pprune

Abstract:

Neural networks—especially those that employ large, pre-trained language models—have improved search engines in a variety of ways. Most prominently, they are used for estimating the relevance of a passage or document to a user’s query. In this work, we depart from this direction by exploring whether neural networks can effectively predict which of a document’s passages are unlikely to be relevant to any query. We refer to this query-agnostic estimation of passage relevance as a passage’s quality. We find that supervised and unsupervised neural methods for estimating passage quality are more effective than existing lexical methods, allowing passage corpora to be pruned considerably with statistically no effect on the quality of the retrieved results (e.g., the best methods can consistently prune 25% or more MSMARCO passages across a variety of retrieval pipelines). This static pruning approach reduces index size and consequently increases retrieval throughput. Further, we find that lightweight models (e.g., a 4-layer transformer) can perform this quality estimation efficiently with minimal impact on effectiveness. Therefore, lightweight neural passage quality models can actually reduce indexing costs for dense and learned sparse models, since low-quality passages can be pruned prior to the expensive passage encoding step. This work sets the stage for developing more advanced neural "learning-what-to-index" methods.
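The static pruning idea in the abstract can be illustrated with a minimal sketch: score every passage once with a query-agnostic quality function, then drop the lowest-scoring fraction before any expensive encoding or indexing step. This is not the authors' implementation. The names `prune_by_quality` and `toy_quality` are hypothetical, and `toy_quality` (a distinct-token count) is only a stand-in for a learned neural quality model.

```python
from typing import Callable, Iterable, List, Tuple

Passage = Tuple[str, str]  # (passage_id, text)


def prune_by_quality(
    passages: Iterable[Passage],
    quality_fn: Callable[[str], float],
    prune_fraction: float = 0.25,
) -> List[Passage]:
    """Drop the lowest-scoring `prune_fraction` of passages, keeping corpus order."""
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    items = list(passages)
    # Query-agnostic: each passage is scored on its own text only.
    scores = [quality_fn(text) for _, text in items]
    n_drop = int(len(items) * prune_fraction)
    # Indices in ascending score order; the first n_drop are pruned.
    order = sorted(range(len(items)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [item for i, item in enumerate(items) if i not in dropped]


def toy_quality(text: str) -> float:
    """Placeholder scorer (assumption): number of distinct lowercase tokens."""
    return float(len(set(text.lower().split())))


corpus = [
    ("p1", "click here to subscribe"),
    ("p2", "the capital of France is Paris"),
    ("p3", "ok"),
    ("p4", "static pruning removes passages from an index before retrieval"),
]

# Only the surviving passages would be passed on to the encoder or indexer.
kept = prune_by_quality(corpus, toy_quality, prune_fraction=0.25)
```

Because pruning happens before indexing, the cost of the quality scorer is paid once per passage, which is why the abstract stresses lightweight models for this step.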

BibTeX:

@inproceedings{chang:sigir2024-pprune,
  author = {Chang, Xuejun and Mishra, Debabrata and Macdonald, Craig and MacAvaney, Sean},
  title = {Neural Passage Quality Estimation for Static Pruning},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2024},
  url = {https://arxiv.org/abs/2407.12170},
  doi = {10.1145/3626772.3657765}
}