The United Nations has released its official Parallel Corpus, made up of manually translated documents dating from 1990 to 2014, in each of the UN’s six official languages: Arabic, English, Spanish, French, Russian and Chinese.
This official release by the United Nations marks the first time such high-quality parallel corpora have been available in the public domain in Arabic and Russian. Progress in natural language research is driven by the availability of data. This is particularly true of statistical machine translation (SMT), which thrives on large quantities of parallel text: original documents paired with their translations into one or more other languages. Typically, researchers rely for this data on multinational institutions such as the European Union, or on the governments of multilingual jurisdictions such as Canada or Hong Kong.
Statistical machine translation, or SMT, is based on information theory, the study of the transmission, processing, utilisation and extraction of information. The engine is fed training documents together with their translations, and from these it estimates the probability that a source text string corresponds to a translated text string (see diagram below). Until now, SMT researchers have relied almost exclusively on the Europarl corpus, which covers the (then) eleven official languages of the European Union (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish).
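To make the idea of estimating translation probabilities from parallel text a little more concrete, here is a minimal sketch in Python. It uses IBM Model 1, a classic word-alignment model chosen here purely for illustration; the article does not say which models researchers will train on the UN corpus. The three-sentence English–French corpus and the function names (`train_ibm_model1`, `best_translation`) are invented for this example, and the sketch omits refinements such as a NULL source word.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical): English source, French translation.
TOY_CORPUS = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t[(target, source)] with EM."""
    tokenised = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in tokenised for word in tgt}

    # Start with every target word equally likely for every source word.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) co-occurrences
        total = defaultdict(float)  # expected count for each source word

        # E-step: share each target word among the source words in its sentence,
        # in proportion to the current translation probabilities.
        for src, tgt in tokenised:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share

        # M-step: re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]

    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_ibm_model1(TOY_CORPUS)
    for word in ("the", "house", "blue", "flower"):
        print(word, "->", best_translation(table, word))
```

Even on three sentences, the repeated pairing of "the" with "la" and "house" with "maison" is enough for the model to separate the words. With millions of sentence pairs, as in a corpus of this size, the same principle has far more evidence to work with.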
We’ll be watching with interest to see how the machine translation landscape develops over the next few months!