Preprocessing on bilingual data for Statistical Machine Translation

Fournier, Bas

Machine Translation (MT) is the translation of text from one human language to another by a computer. Computers, like all machines, are excellent at taking over repetitive and mundane tasks from humans. Because translating long texts from one language to another qualifies as such a task, MT is potentially a very economical way of translating. Unfortunately, natural languages are not well suited to machine processing: they are ambiguous, illogical and constantly evolving, qualities that are difficult for a machine to handle. This makes Natural Language Processing, and by extension MT, a difficult problem to solve. In theory, a method that could analyze a text in a natural language and decipher its semantic content could store that content in a language-independent representation. From this representation, another text with the same semantic content could be generated in any language for which a generation mechanism exists. Such an MT architecture would provide high-quality translations and be modular: a new language could be added to the pool of inter-translatable languages simply by developing an analysis and a generation method for that language. Unfortunately, no such method exists. Some existing MT systems attempt to approach it to a degree, but as long as semantic analysis remains an unsolved problem in the field of Natural Language Processing, there can be no true language-independent representation.

Fournier, Bas (2008)