PDF · BibTeX · Slides · Long conference paper
Appeared in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)
Abstract:
Neural networks, especially those that employ large, pre-trained language models, have improved search engines in a variety of ways. Most prominently, they are used for estimating the relevance of a passage or document to a user's query. In this work, we depart from this direction by exploring whether neural networks can effectively predict which of a document's passages are unlikely to be relevant to any query. We refer to this query-agnostic estimation of passage relevance as a passage's quality. We find that supervised and unsupervised neural methods for estimating passage quality are more effective than existing lexical methods, allowing passage corpora to be pruned considerably with no statistically significant effect on the quality of the retrieved results (e.g., the best methods can consistently prune 25% or more of the MS MARCO passage corpus across a variety of retrieval pipelines). This static pruning approach reduces index size and consequently increases retrieval throughput. Further, we find that lightweight models (e.g., a 4-layer transformer) can perform this quality estimation efficiently with minimal impact on effectiveness. Therefore, lightweight neural passage quality models can actually reduce indexing costs for dense and learned sparse models, since low-quality passages can be pruned prior to the expensive passage encoding step. This work sets the stage for developing more advanced neural "learning-what-to-index" methods.
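To make the pipeline concrete, here is a minimal Python sketch of quality-based static pruning: score every passage with a lightweight query-agnostic quality model, drop the bottom fraction, and only encode and index the survivors. This is not the authors' released code; the checkpoint name "my-passage-quality-model" is a hypothetical placeholder for any small (e.g., 4-layer) transformer fine-tuned to output a quality score per passage.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint standing in for a lightweight passage quality scorer.
MODEL_NAME = "my-passage-quality-model"
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def quality_scores(passages, batch_size=32):
    """Estimate a query-agnostic quality score for each passage."""
    scores = []
    with torch.no_grad():
        for i in range(0, len(passages), batch_size):
            batch = tok(passages[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="pt")
            scores.extend(model(**batch).logits.squeeze(-1).tolist())
    return np.array(scores)

def prune(passages, prune_fraction=0.25):
    """Keep the top (1 - prune_fraction) of passages by estimated quality."""
    scores = quality_scores(passages)
    threshold = np.quantile(scores, prune_fraction)
    return [p for p, s in zip(passages, scores) if s >= threshold]

# The surviving passages are then passed to the (expensive) dense or
# learned-sparse encoder and indexed; pruned passages never pay that cost.

A fixed quantile threshold is used here so that the pruning fraction (e.g., 25%) is set explicitly, mirroring how the abstract reports pruning levels; in practice the threshold could also be tuned per corpus or pipeline.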
BibTeX:
@inproceedings{chang:sigir2024-pprune,
  author    = {Chang, Xuejun and Mishra, Debabrata and Macdonald, Craig and MacAvaney, Sean},
  title     = {Neural Passage Quality Estimation for Static Pruning},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year      = {2024},
  url       = {https://arxiv.org/abs/2407.12170},
  doi       = {10.1145/3626772.3657765}
}