Lost in Transliteration: Bridging the Script Gap in Neural IR

Short conference paper, to appear

Authors: Andreas Chari, Iadh Ounis, Sean MacAvaney

Appearing in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025)

Links/IDs:
Google Scholar 7wWfoDgAAAAJ:u9iWguZQMMsC Enlighten 352746 smac.pub sigir2025-translit

Abstract:

Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated, usually Latinized, form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such as BGE-M3, do not generalise to this setting, and their performance rapidly deteriorates when exposed to transliterated queries. This creates a "script gap" between the performance of the same queries when written in their native or transliterated form. We explore whether adapting the popular "translate-train" paradigm to transliterations can enhance the robustness of multilingual Information Retrieval (IR) methods and bridge the gap between native and transliterated scripts. By exploring various combinations of non-Latin and Latinized query text for training, we investigate whether we can enhance the capacity of existing neural retrieval techniques and enable them to apply to this important setting. We show that by further fine-tuning IR models on an even mixture of native and Latinized text, they can perform this cross-script matching at nearly the same performance as when the query was formulated in the native script. Out-of-domain evaluation and further qualitative analysis show that transliterations can also cause queries to lose some of their nuances, motivating further research in this direction.

BibTeX:

@inproceedings{chari:sigir2025-translit,
  author = {Chari, Andreas and Ounis, Iadh and MacAvaney, Sean},
  title = {Lost in Transliteration: Bridging the Script Gap in Neural IR},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2025}
}