Publication details

HFT: High Frequency Tokens for Low-Resource NMT

Authors SIGNORONI Edoardo, RYCHLÝ Pavel

Year of publication 2022
Type Article in Proceedings
Conference Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
MU Faculty or unit Faculty of Informatics

Web https://aclanthology.org/2022.loresmt-1.8
Keywords Machine Translation; Tokenization
Description Tokenization has been shown to impact the quality of downstream tasks, such as Neural Machine Translation (NMT), which is susceptible to out-of-vocabulary words and low frequency training data. Current state-of-the-art algorithms have been helpful in addressing the issues of out-of-vocabulary words, bigger vocabulary sizes and token frequency by implementing subword segmentation. We argue, however, that there is still room for improvement, in particular regarding low-frequency tokens in the training data. In this paper, we present “High Frequency Tokenizer”, or HFT, a new language-independent subword segmentation algorithm that addresses this issue. We also propose a new metric to measure the frequency coverage of a tokenizer’s vocabulary, based on a frequency rank weighted average of the frequency values of its items. We experiment with a diverse set of language corpora, vocabulary sizes, and writing systems and report improvements on both frequency statistics and on the average length of the output. We also observe a positive impact on downstream NMT.
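
The description mentions a metric for the frequency coverage of a tokenizer's vocabulary, based on a frequency rank weighted average of the frequency values of its items. The paper's exact definition is not reproduced here; the sketch below is only one plausible reading of that phrase, assuming weights that decay with the item's frequency rank (1/rank). The function name `frequency_coverage` and the toy data are hypothetical.

```python
# Hypothetical sketch of a frequency-coverage score for a tokenizer vocabulary.
# Assumption: items are weighted by the inverse of their frequency rank; this is
# an illustration of the idea, not the metric defined in the paper.

from collections import Counter
from typing import Iterable, List


def frequency_coverage(vocab: List[str], corpus_tokens: Iterable[str]) -> float:
    """Rank-weighted average of the corpus frequencies of the vocabulary items."""
    freqs = Counter(corpus_tokens)
    # Frequency of each vocabulary item in the corpus (0 if unseen), ranked high to low.
    item_freqs = sorted((freqs.get(tok, 0) for tok in vocab), reverse=True)
    # Weight each item by the inverse of its 1-based frequency rank (assumed weighting).
    weights = [1.0 / rank for rank in range(1, len(item_freqs) + 1)]
    weighted_sum = sum(w * f for w, f in zip(weights, item_freqs))
    return weighted_sum / sum(weights) if weights else 0.0


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat".split()
    vocab = ["the", "cat", "mat", "dog"]
    print(f"coverage score: {frequency_coverage(vocab, corpus):.3f}")
```

Under this reading, a vocabulary whose items occur frequently in the training corpus scores higher than one padded with rare or unseen subwords, which matches the abstract's motivation of favouring high-frequency tokens.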