Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review

Type: long conference paper (44 citations)

Authors: Eugene Yang, Sean MacAvaney, David Lewis, Ophir Frieder

Appeared in: Proceedings of the 44th European Conference on Information Retrieval Research (ECIR 2022)

Links/IDs:

DOI: 10.1007/978-3-030-99736-6_34
DBLP: conf/ecir/YangMLF22
arXiv: 2105.01044
Google Scholar: 7wWfoDgAAAAJ:QIV2ME_5wuYC
Semantic Scholar: 05dc55f5b9930535c584f8076fcf145e9074468d
Enlighten: 259713
smac.pub: ecir2022-tar

Abstract:

Technology-assisted review (TAR) refers to iterative active learning workflows for document review in high recall retrieval (HRR) tasks. TAR research and most commercial TAR software have applied linear models such as logistic regression to lexical features. Transformer-based models with supervised tuning are known to improve effectiveness on many text classification tasks, suggesting their use in TAR. We indeed find that the pre-trained BERT model reduces review cost by 10% to 15% in TAR workflows simulated on the RCV1-v2 newswire collection. In contrast, we find that linear models outperform BERT for simulated legal discovery topics on the Jeb Bush e-mail collection. This suggests that the match between transformer pre-training corpora and the task domain is more important than generally appreciated. Additionally, we show that just-right language model fine-tuning on the task collection before starting active learning is critical. Too little or too much fine-tuning results in performance worse than that of linear models, even for a favorable corpus such as RCV1-v2.
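Illustration (not from the paper): the abstract describes TAR as an iterative active learning workflow built on a linear model over lexical features. The sketch below shows one common shape of such a loop, assuming relevance feedback (review the top-ranked unreviewed documents each round) and a small hand-written logistic regression over bag-of-words counts. The toy documents, the `oracle` function, and all names and hyperparameters are hypothetical; this is not the authors' code or experimental setup.

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words term counts, a stand-in for lexical features."""
    return Counter(text.lower().split())

def score(weights, feats):
    """Logistic regression estimate of the probability of relevance."""
    z = weights.get("__bias__", 0.0)
    z += sum(weights.get(term, 0.0) * count for term, count in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(labeled, epochs=200, lr=0.5, l2=0.01):
    """Fit weights by stochastic gradient descent on (feats, label) pairs."""
    weights = {}
    for _ in range(epochs):
        for feats, label in labeled:
            err = score(weights, feats) - label
            weights["__bias__"] = weights.get("__bias__", 0.0) - lr * err
            for term, count in feats.items():
                w = weights.get(term, 0.0)
                weights[term] = w - lr * (err * count + l2 * w)
    return weights

def tar_review(docs, oracle, seed_id, batch_size=1, rounds=2):
    """Simulated TAR loop: train, rank the unreviewed pool, review the top batch."""
    feats = {doc_id: featurize(text) for doc_id, text in docs.items()}
    labels = {seed_id: oracle(seed_id)}  # start from one reviewed seed document
    for _ in range(rounds):
        model = train([(feats[d], labels[d]) for d in labels])
        pool = [d for d in docs if d not in labels]
        if not pool:
            break
        ranked = sorted(pool, key=lambda d: score(model, feats[d]), reverse=True)
        for doc_id in ranked[:batch_size]:
            labels[doc_id] = oracle(doc_id)  # the reviewer's judgment
    return labels

# Hypothetical toy collection; the relevant topic is corporate finance.
docs = {
    "d1": "merger acquisition stock shares",
    "d2": "stock market shares rally",
    "d3": "football match score goal",
    "d4": "acquisition deal shares investors",
    "d5": "weather rain forecast storm",
    "d6": "goal keeper football injury",
}
relevant = {"d1", "d2", "d4"}

def oracle(doc_id):
    """Simulated reviewer: 1 if the document is relevant, else 0."""
    return 1 if doc_id in relevant else 0

reviewed = tar_review(docs, oracle, seed_id="d1", batch_size=1, rounds=2)
print(sorted(reviewed))
```

In this toy run, the documents sharing terms with already-reviewed relevant documents are ranked first, so reviewing effort concentrates on likely-relevant material. Under the same assumptions, a BERT-based variant would swap `featurize`, `train`, and `score` for a fine-tuned transformer classifier while keeping the loop unchanged.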

BibTeX:

@inproceedings{yang:ecir2022-tar,
  author = {Yang, Eugene and MacAvaney, Sean and Lewis, David and Frieder, Ophir},
  title = {Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review},
  booktitle = {Proceedings of the 44th European Conference on Information Retrieval Research},
  year = {2022},
  url = {https://arxiv.org/abs/2105.01044},
  doi = {10.1007/978-3-030-99736-6_34}
}