On the Evaluation of Machine-Generated Reports

Nominated for Best Paper. Perspective paper.

Authors: James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

Appeared in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

Links/IDs:

DOI: 10.1145/3626772.3657846
DBLP: conf/sigir/MayfieldYLMMOSS24
arXiv: 2405.00982
Google Scholar: 7wWfoDgAAAAJ:u_35RYKgDlwC
Enlighten: 323367
smac.pub: sigir2024-argue

Abstract:

Large Language Models (LLMs) have enabled new ways by which we can satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of analysts. In this perspective paper, we draw together opinions from industry, government, and academia to present our vision for automatic report generation, and, critically, a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of the information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These features, which are desirable, if not required, in many analytic report-writing settings, require rethinking how to build and evaluate systems that exhibit these qualities. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas found in information retrieval evaluations from various settings. To test completeness and accuracy, the framework uses nuggets of information that need to be part of any high-quality generated report. Meanwhile, citations that map claims made in the report to their source documents ensure verifiability. We envisage that each component of our framework could be applied either manually or automatically (as LLM technology further improves). We believe that our framework is practical and that focusing attention on the evaluation of machine-generated reports will help foster new lines of research in an era of generative AI.
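As a rough illustration of the two ideas the abstract names (nuggets for completeness and accuracy, citations for verifiability), the following Python sketch scores a toy report. It is not the paper's method: the class names, the two scoring functions, and the assumption that nugget matches and citations have already been annotated per sentence are all hypothetical choices made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Nugget:
    """A fact a good report should contain (hypothetical structure)."""
    nugget_id: str
    supporting_docs: frozenset[str]  # documents known to attest the nugget


@dataclass
class ReportSentence:
    """One report sentence with its (assumed pre-annotated) judgments."""
    text: str
    cited_docs: set[str] = field(default_factory=set)
    matched_nuggets: set[str] = field(default_factory=set)


def nugget_recall(nuggets: list[Nugget], report: list[ReportSentence]) -> float:
    """Fraction of nuggets matched by at least one report sentence."""
    if not nuggets:
        return 0.0
    covered: set[str] = set()
    for sentence in report:
        covered |= sentence.matched_nuggets
    return sum(n.nugget_id in covered for n in nuggets) / len(nuggets)


def citation_support(nuggets: list[Nugget], report: list[ReportSentence]) -> float:
    """Among nugget-bearing sentences, the fraction whose citations include
    a supporting document for every nugget the sentence matches."""
    by_id = {n.nugget_id: n for n in nuggets}
    bearing = [s for s in report if s.matched_nuggets]
    if not bearing:
        return 0.0
    supported = sum(
        all(
            nid in by_id and s.cited_docs & by_id[nid].supporting_docs
            for nid in s.matched_nuggets
        )
        for s in bearing
    )
    return supported / len(bearing)


# Toy data, invented purely for illustration.
nuggets = [
    Nugget("n1", frozenset({"d1", "d2"})),
    Nugget("n2", frozenset({"d3"})),
    Nugget("n3", frozenset({"d4"})),
]
report = [
    ReportSentence("Claim backed by d1.", {"d1"}, {"n1"}),
    ReportSentence("Claim citing the wrong source.", {"d9"}, {"n2"}),
    ReportSentence("Connective sentence with no claim."),
]

print(f"nugget recall:    {nugget_recall(nuggets, report):.2f}")
print(f"citation support: {citation_support(nuggets, report):.2f}")
```

In this sketch the judgments themselves (which sentence matches which nugget, which document supports it) are inputs, which mirrors the abstract's point that each component could be applied either manually or automatically.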

BibTeX:

@inproceedings{mayfield:sigir2024-argue,
  author = {Mayfield, James and Yang, Eugene and Lawrie, Dawn and MacAvaney, Sean and McNamee, Paul and Oard, Douglas and Soldaini, Luca and Soboroff, Ian and Weller, Orion and Kayi, Efsun and Sanders, Kate and Mason, Marc and Hibbler, Noah},
  title = {On the Evaluation of Machine-Generated Reports},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2024},
  url = {https://arxiv.org/abs/2405.00982},
  doi = {10.1145/3626772.3657846}
}