
Overcoming Low-Utility Facets for Complex Answer Retrieval


Authors: Sean MacAvaney, Andrew Yates, Arman Cohan, Luca Soldaini, Kai Hui, Nazli Goharian, Ophir Frieder

Appeared in: Proceedings of the Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR @ SIGIR 2018)


Complex Answer Retrieval (CAR) is the process of retrieval for questions whose answers require many details or additional context to explain thoroughly. These questions can be formulated as having a topic entity and a facet. For instance, for the question "Is cheese healthy?", the topic is "Cheese" and the facet is "Health effects". We observe that some facets, such as "Health effects", exhibit low utility: answers to questions about the health effects of cheese are unlikely to use the facet terms directly. Instead, they will mention related entities, such as nutrients or related diseases. We call these low-utility facets because the terms in the facet are not used directly in the text, and thus the terms themselves do not provide much value. In contrast, high-utility facets use language that is specific to the topic and can be found directly in relevant answers (e.g., a facet of "Curdling" for the question "Why does cheese curdle?"); it is unlikely that an answer to this question would omit the term "curdle" or "curdling". In this talk, we propose a two-pronged approach for CAR by modifying a leading neural information retrieval architecture (PACRR).

First, we introduce two estimators of facet utility. We observe that the CAR query structure (provided as a list of headings) itself implies a notion of utility: the root of the query is the topic (which is likely high-utility), any intermediate facets provide structure (which are likely low-utility), and the leaf (main) facet can be either, depending on the nature of the question. To further disambiguate between high- and low-utility facets, we incorporate facet frequency statistics, with the intuition that high-frequency facets (e.g., "Health effects") are low-utility, and low-frequency facets (e.g., "Curdling") are high-utility.
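As a rough illustration of the two signals described above, the following Python sketch labels each heading in a query by its structural position and counts how often each facet heading occurs in a set of training queries. The function names, the list-of-headings input format, and the lowercasing step are assumptions made for this example; the abstract does not specify how these signals are encoded or fed into PACRR.

```python
from collections import Counter


def heading_positions(query):
    """Label each heading in a query as topic, intermediate, or leaf.

    `query` is a list of headings, root first (assumed input format).
    """
    labels = []
    last = len(query) - 1
    for i, heading in enumerate(query):
        if i == 0:
            labels.append((heading, "topic"))         # likely high-utility
        elif i == last:
            labels.append((heading, "leaf"))          # can be either
        else:
            labels.append((heading, "intermediate"))  # likely low-utility
    return labels


def heading_frequencies(training_queries):
    """Count how often each facet heading (every non-root heading) occurs."""
    counts = Counter()
    for query in training_queries:
        for heading in query[1:]:  # skip the topic at the root
            counts[heading.lower()] += 1
    return counts


# Toy training queries, invented for illustration only.
queries = [
    ["Cheese", "Health effects"],
    ["Milk", "Health effects"],
    ["Cheese", "Production", "Curdling"],
]
freq = heading_frequencies(queries)
positions = heading_positions(queries[2])
```

In this toy data, "health effects" is counted more often than "curdling", which matches the intuition that frequent headings are the generic, low-utility ones.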

Second, we incorporate knowledge graph information to overcome low-utility facets. We train knowledge graph embeddings (HolE), and include the top similarity scores during relevance prediction. To avoid bias in our Wikipedia-based evaluation, we propose an approach for building a knowledge graph from the CAR training data. We construct the graph by linking entity mentions to the article topic entity with the containing heading as a label. Since Wikipedia headings are used as facets for CAR, the entity embeddings can be combined with the facet relation embedding to get context-specific scores. Recall that since high-frequency headings correspond to low-utility facets, this approach should naturally provide the most value for the headings that need the most additional context.
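The graph construction and the embedding score can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the `(topic, heading, mentions)` input format, the direction of each triple, and the function names are hypothetical. The scoring function follows the standard HolE formulation, a sigmoid over the relation vector dotted with the circular correlation of the two entity vectors; the embeddings themselves would come from training, which is omitted here.

```python
import math


def build_triples(paragraphs):
    """Build (topic, heading, mention) triples from training paragraphs.

    `paragraphs` is an iterable of (article_topic, heading, entity_mentions).
    The triple direction is an assumption made for this sketch.
    """
    triples = set()
    for topic, heading, mentions in paragraphs:
        for entity in mentions:
            triples.add((topic, heading, entity))
    return triples


def circular_correlation(a, b):
    """Circular correlation: result[k] = sum_i a[i] * b[(i + k) mod d]."""
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]


def hole_score(subject_vec, relation_vec, object_vec):
    """HolE score: sigmoid(relation . (subject correlated with object))."""
    corr = circular_correlation(subject_vec, object_vec)
    dot = sum(r * c for r, c in zip(relation_vec, corr))
    return 1.0 / (1.0 + math.exp(-dot))


# Toy paragraph, invented for illustration only.
triples = build_triples([
    ("Cheese", "Health effects", ["Calcium", "Saturated fat"]),
])
```

Because the heading acts as the relation, scoring a candidate entity against the topic under a given heading yields a value specific to that facet, which is the context-specific behavior the paragraph above describes.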

We evaluate our approach using TREC CAR version 1.5, and find that our approach performs favorably when compared to other leading approaches. We also provide an empirical evaluation of our assumptions and a detailed analysis of our results. As one of the first comprehensive works with CAR, we expect our findings and observations to shape the directions taken for CAR in the future.

BibTeX

@InProceedings{macavaney:kg4ir2018-car,
  author = {MacAvaney, Sean and Yates, Andrew and Cohan, Arman and Soldaini, Luca and Hui, Kai and Goharian, Nazli and Frieder, Ophir},
  title = {Overcoming Low-Utility Facets for Complex Answer Retrieval},
  booktitle = {Proceedings of the Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding},
  year = {2018}
}