NMT’s wonderland where people turn into rabbits. A study on the comprehensibility of newly invented words in NMT output
Abstract
Machine translation (MT) quality has improved enormously since the arrival of neural machine translation (NMT). The most noticeable improvement compared to statistical MT systems is the increased grammaticality and fluency of the MT output. At the lexical level, the quality of NMT systems is less promising. New types of lexical mistakes appear in NMT output, such as non-existing words, i.e. words that are not part of the vocabulary of the target language and were thus invented by the NMT system. For MT use cases in which readers only have access to the MT output and not to the source text, such non-existing words can affect comprehension, as the intended source meaning may not be recoverable. To investigate whether and to what extent non-existing words in English-to-Dutch NMT output impair comprehension, an experiment was set up in SurveyMonkey. Eighty-six participants were given 15 non-existing words (5 single words and 10 noun compounds) and were asked either to describe the meaning of these words or to select the correct meaning from a predefined list. The words were presented either in isolation or in sentence context. Participants were also asked to indicate how confident they were about their answer. Results show that non-existing words indeed impair comprehension: in 60% of the cases the participants gave a wrong answer. Sentence context had a positive impact and made it easier for the participants to determine the meaning of the non-existing word. Participants were also more confident about their answer when the words were presented in sentence context.