The Russian search giant doesn’t get the attention in the US that other European and Asian internet companies do, so news of its projects slips through the cracks.
Russian search company Yandex has been beta-testing Western (Hill) Mari, Eastern Mari, Papiamento, and Udmurt on its open machine translation service, Yandex.Translate. The effort takes advantage of the company’s local expertise to target language groups that US-based Alphabet and its Google Translate product have not yet cataloged.
While Google has made major headlines by upgrading its machine translation technology with advanced neural networks and expanding its coverage to 103 languages, other companies have also made strides, and not just Microsoft Translator. Yandex.Translate added Elvish last year, and the Moscow-based service has quietly continued to beat Google at adding languages native to the Russian Federation and to overlooked corners of the world.
For those wondering how Papiamento ended up on the list, it was because of a simple request. A Curaçaoan working for Yandex in the Netherlands asked if it could be added to the company’s translation service.
The addition of Papiamento and three languages found only in Russia gives Yandex a niche and some unique characteristics for its machine learning research.
These are not the most widely spoken languages in the world by any means. Only 330,000 people speak Papiamento on the Dutch island of Curaçao and 340,000 speak Udmurt in the Russian republic of Udmurtia, while Eastern Mari has 500,000 native speakers and Western (Hill) Mari only 30,000.
“[With] additional data that would let us improve the quality of translations, we hope any technology developed from the use of these languages will be introduced into the ‘big’ areas in the future to help better understand communication between languages in general, and hence more precisely translate texts,” Anton Dvorkvitch of the Yandex.Translate research team recently wrote on the tech blog N+1 (in Russian).
He highlighted in his post how Yandex is on the cusp of what Google calls “zero-shot translation,” more specifically the interlinguistic algorithms that allow Google Translate to use another language to fill in missing translation data between two other languages. Dvorkvitch’s post illustrates this by citing two more local languages that also feature prominently on Yandex.Translate.
“Take for example the Tatar and Bashkir languages, two very close Turkic languages. They differ in some sounds . . . but their linguistic characteristics, morphology, and syntax are almost identical. Our technology is able to understand the difference between sounds and borrow some words when translating from either language, if either of them were not to have enough information.”
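The back-off idea in that quote can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Yandex's system: the two lexicons, the three letter correspondences, and the function names are all hypothetical, and a real translator would learn such mappings from data rather than from a hand-written table.

```python
from typing import Optional, Tuple

# Assumption: a tiny, hand-picked set of Bashkir-to-Tatar letter
# correspondences. Real correspondences are more numerous and context-dependent.
BASHKIR_TO_TATAR = {"һ": "с", "ҙ": "з", "ҡ": "к"}

# Assumption: toy translation memories (word -> English gloss).
BASHKIR_LEXICON = {"һыу": "water"}
TATAR_LEXICON = {"су": "water", "сүз": "word", "кыз": "girl"}


def to_tatar_spelling(word: str) -> str:
    """Rewrite a Bashkir word letter by letter into a Tatar-style spelling."""
    return "".join(BASHKIR_TO_TATAR.get(ch, ch) for ch in word)


def translate_bashkir(word: str) -> Tuple[Optional[str], str]:
    """Return (gloss, source), borrowing from Tatar when Bashkir data is missing."""
    if word in BASHKIR_LEXICON:
        return BASHKIR_LEXICON[word], "bashkir"
    candidate = to_tatar_spelling(word)
    if candidate in TATAR_LEXICON:
        return TATAR_LEXICON[candidate], "borrowed-from-tatar"
    return None, "unknown"


if __name__ == "__main__":
    for w in ("һыу", "һүҙ", "ҡыҙ"):
        print(w, translate_bashkir(w))
```

A word found in the Bashkir lexicon is answered directly; a word that is missing is respelled the Tatar way and looked up in the Tatar lexicon instead, which is the "borrow some words" step Dvorkvitch describes.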
The same principle is at play with the two Mari languages. There are some noticeable differences between them, but Yandex.Translate’s technology can still make strong guesses as to what a word means when its definition is missing from Yandex’s translation memory (databank). He cites another example more familiar to some Western readers: the overlap between Yiddish and German.
“So many words in Yiddish and German are identical or very similar. Yiddish uses the Hebrew alphabet, but unlike Hebrew, uses vowels in writing – except the words of the Hebrew origin that are now accepted to write as well as in Hebrew.”
Because of that, Dvorkvitch explains, the two lexicons can be compared, and data from either language can be used to complete translations involving a third. “The rest [of] Yiddish [follows the] phonetic principle of writing, and this makes it possible, if the German word and the word in Yiddish coincide, to automatically transliterate them.”
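As a rough sketch of what automatic transliteration could look like, the Python below maps a handful of Yiddish letters to Latin ones and checks whether the result coincides with a German word. Everything here is an assumption made for illustration: the letter table covers only the unpointed letters needed for three sample words, the German lexicon is hypothetical, and the function names are invented. It does not handle pointed letters, digraphs, or words of Hebrew origin, which the quote above says are spelled as in Hebrew.

```python
from typing import Optional

# Assumption: a simplified one-letter-to-one-letter table. Real Yiddish
# romanization needs digraphs, pointed letters, and context rules.
YIDDISH_TO_LATIN = {
    "ב": "b", "ג": "g", "ד": "d", "ו": "u", "ט": "t", "י": "i",
    "נ": "n", "ן": "n", "ע": "e", "ק": "k", "ר": "r",
}

# Assumption: toy German lexicon (lowercased German word -> English gloss).
GERMAN_LEXICON = {"bruder": "brother", "gut": "good", "trinken": "to drink"}


def transliterate(word: str) -> str:
    """Map each Yiddish letter to Latin; unknown letters become '?'."""
    return "".join(YIDDISH_TO_LATIN.get(ch, "?") for ch in word)


def gloss_via_german(yiddish_word: str) -> Optional[str]:
    """Return a gloss only when the transliteration coincides with a German word."""
    return GERMAN_LEXICON.get(transliterate(yiddish_word))


if __name__ == "__main__":
    for w in ("ברודער", "גוט", "טרינקען"):
        print(transliterate(w), "->", gloss_via_german(w))
```

Exact matching keeps the sketch short; a production system would more plausibly score near-matches, since many Yiddish and German cognates are similar rather than identical.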
Few deep learning translation systems are operating today, and even fewer on an open platform. Yandex’s work with languages that have small speaker populations, like Microsoft’s work with the indigenous Mexican languages Yucatec Maya and Querétaro Otomi, offers new hope to a number of seemingly hopeless efforts to preserve dying languages around the world. One example is Rutgers professor Charles Häberl’s work with the Aramaic dialect Neo-Mandaic (full disclosure: I helped Professor Häberl enter data for his projects when I was a student).
It remains to be seen whether Google will make any special effort to launch open machine translation for other groups of languages with small speaker populations or for dialects that face extinction. However, even the least tech-savvy enthusiast can take advantage of neural translation advancements, which will only make such future projects easier.