An Inspection of the Reproducibility and Replicability of TCT-ColBERT

Authors: Xiao Wang, Sean MacAvaney, Craig Macdonald, Iadh Ounis

Appeared in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)

Links/IDs:
DOI: 10.1145/3477495.3531721
DBLP: conf/sigir/WangMMO22
ACM: 3477495.3531721
Google Scholar: 7wWfoDgAAAAJ:R3hNpaxXUhUC
Semantic Scholar: f5c3d68c905630fb6e7955d0049cb018e91647a9
Enlighten: 268399
smac.pub: sigir2022-tctrepro

Abstract:

Dense retrieval approaches are of increasing interest because they can better capture contextualised similarity compared to sparse retrieval models such as BM25. Among the most prominent of these approaches is TCT-ColBERT, which trains a light-weight "student" model from a more expensive "teacher" model. In this work, we take a closer look at the reproducibility and replicability of TCT-ColBERT. To structure our study, we propose a three-stage perspective on reproducing the training, inference, and evaluation of model-focused papers, each using artefacts produced from different stages in the pipeline. We find that, perhaps as expected, precise reproduction is more challenging when the complete training process is conducted, rather than just inference from a released trained model. Each stage provides the opportunity to perform replication and ablation experiments. We are able to replicate (i.e., produce an effective independent implementation of) model inference and dense indexing/retrieval, but are unable to replicate the training process. We conduct several ablations to cover gaps in the original paper, and make the following observations: (1) the model can function as an inexpensive re-ranker, establishing a new Pareto-optimal result; (2) the index size can be reduced by using lower-precision floating point values, but only if ties in scores are handled appropriately; (3) training needs to be conducted for the entire suggested duration to achieve optimal performance; and (4) student initialisation from the teacher is not necessary.
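
Observation (2) can be illustrated with a small sketch. This is not the paper's code: it assumes, purely for illustration, that the precision loss is applied to the final scores (in a real dense index it would be the stored vectors that are narrowed), and the helper names `to_float16` and `rank` are hypothetical. It shows how two scores that are distinct at higher precision can collapse to the same half-precision value, and how a deterministic tie-break (here, by document ID) keeps the ranking stable.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Uses the standard library's struct format 'e' (binary16).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rank(scores: dict) -> list:
    """Order document IDs by descending score.

    Ties are broken by ascending document ID. This particular tie-break
    rule is an assumption chosen for the example; the point is only that
    some fixed rule is applied so the ordering does not depend on
    insertion order.
    """
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical scores: d1 and d2 differ slightly at full precision.
full_precision = {"d2": 0.10002, "d1": 0.10001, "d3": 0.2}

# After narrowing to half precision, d1 and d2 become indistinguishable.
half_precision = {doc_id: to_float16(s) for doc_id, s in full_precision.items()}

full_ranking = rank(full_precision)
half_ranking = rank(half_precision)
```

At full precision `d2` outranks `d1`; after narrowing the two scores are equal, so their relative order is decided entirely by the tie-break rule. Without a fixed rule, the order of tied documents (and therefore any rank-based metric) could vary between runs or evaluation tools.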

BibTeX:

@inproceedings{wang:sigir2022-tctrepro,
  author = {Wang, Xiao and MacAvaney, Sean and Macdonald, Craig and Ounis, Iadh},
  title = {An Inspection of the Reproducibility and Replicability of TCT-ColBERT},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2022},
  doi = {10.1145/3477495.3531721}
}