An Approach for Weakly-Supervised Deep Information Retrieval

See the revised version, published in SIGIR 2019.

Authors: Sean MacAvaney, Kai Hui, Andrew Yates

Appeared in: SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR @ SIGIR 2017)

Links/IDs:

DBLP: journals/corr/MacAvaneyHY17
arXiv: 1707.00189v2
Semantic Scholar: d8191c96a89bb843b5e422583246ec6464cad27a
smac.pub: neuir2017-nyt

Abstract:

Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach that, given a weak training set of pseudo-queries, documents, and relevance information, filters the data to produce effective positive and negative query-document pairs. This allows large corpora to be used as neural IR model training data, while eliminating training examples that do not transfer well to relevance scoring. The filters include unsupervised ranking heuristics and a novel measure of interaction similarity. We evaluate our approach using a news corpus with article headlines acting as pseudo-queries and article content as documents, with implicit relevance between an article's headline and its content. By using our approach to train state-of-the-art neural IR models and comparing to established baselines, we find that training data generated by our approach can lead to good results on a benchmark test collection.

BibTeX:

@inproceedings{macavaney:neuir2017-nyt,
  author = {MacAvaney, Sean and Hui, Kai and Yates, Andrew},
  title = {An Approach for Weakly-Supervised Deep Information Retrieval},
  booktitle = {SIGIR 2017 Workshop on Neural Information Retrieval},
  year = {2017},
  url = {https://arxiv.org/abs/1707.00189v2}
}