Russia’s Yandex outpaces Google Translate as it quietly beta tests Papiamento, Udmurt, and Mari languages

Russia's president Vladimir Putin receives a talisman from a well wisher during a meeting of the All-Russian People's Front (ONF) in Mari-El's capital, Yoshkar-Ola. (Photo by Mikhail Metzel/TASS via Getty Images Israel)

The Russian search giant doesn’t get attention in the US like other European and Asian internet companies do, letting news of its own projects slip through the cracks

Russian search engine Yandex has been beta-testing Western (Hill) Mari, Eastern Mari, Papiamento, and Udmurt on its open machine translation service, Yandex.Translate. The effort takes advantage of the company’s local expertise to target language groups that US-based Alphabet and its Google Translate product have not yet cataloged.

While Google has made major headlines by upgrading its machine translation technology with advanced neural networks and expanding its coverage to 103 languages, other companies have also made strides, and not just Microsoft Translator. Yandex.Translate added Elvish last year, and the Moscow-based service has quietly continued to beat Google at adding languages native to the Russian Federation and to overlooked corners of the world.

For those wondering how Papiamento ended up on the list, it was because of a simple request. A Curaçaoan working for Yandex in the Netherlands asked if it could be added to the company’s translation service.

The addition of the Curaçaoan language and three languages found only in Russia gives Yandex a niche and some unique material for its machine learning research.

St. Michael’s Cathedral in the Udmurt-speaking region of Udmurtia (Public Domain via Wikimedia Commons)

These are not the most widely spoken languages in the world by any means. Only 330,000 people speak Papiamento on the Dutch island of Curaçao and 340,000 speak Udmurt in the Russian republic of Udmurtia, while Eastern Mari has 500,000 native speakers and Western (Hill) Mari only 30,000.

“[With] additional data that would let us improve the quality of translations, we hope any technology developed from the use of these languages will be introduced into the ‘big’ areas in the future to help better understand communication between languages in general, and hence more precisely translate texts,” Anton Dvorkvitch of the Yandex.Translate research team recently wrote in tech blog N+1 (Russian).

In his post he highlighted how Yandex is on the cusp of what Google calls “zero-shot translation,” more specifically the interlinguistic algorithms that allow Google Translate to use another language to fill in missing translation data between two other languages. Dvorkvitch’s blog illustrates this by citing two more local languages that also feature prominently on Yandex.Translate.

“Take for example the Tatar and Bashkir languages, two very close Turkic languages. They differ in some sounds . . . but their linguistic characteristics, morphology, and syntax are almost identical. Our technology is able to understand the difference between sounds and borrow some words when translating from either language, if either of them were not to have enough information.”
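The borrowing idea in that quote can be sketched in a few lines of Python. This is only a toy illustration, not Yandex's method: the real system is a trained model, while the sketch below swaps in a plain dictionary lookup. The function name, the two tiny word lists, and the two-letter sound mapping are all assumptions made for the example.

```python
# Toy sketch: fall back to a closely related language when a word is missing.
# All names and data here are hypothetical and chosen only for illustration.

# Two simplified Bashkir -> Tatar spelling correspondences (not a full rule set).
BASHKIR_TO_TATAR = str.maketrans({"һ": "с", "ҙ": "з"})

# Tiny stand-in lexicons. The Bashkir one deliberately lacks the word "һүҙ".
TATAR_TO_ENGLISH = {"сүз": "word", "син": "you"}
BASHKIR_TO_ENGLISH = {"һин": "you"}


def translate_bashkir(word):
    """Translate a Bashkir word, borrowing from the Tatar lexicon if needed."""
    # First choice: the word is known directly.
    if word in BASHKIR_TO_ENGLISH:
        return BASHKIR_TO_ENGLISH[word]
    # Otherwise, respell the word the Tatar way and try the Tatar lexicon.
    candidate = word.translate(BASHKIR_TO_TATAR)
    return TATAR_TO_ENGLISH.get(candidate)  # None if still unknown


if __name__ == "__main__":
    print(translate_bashkir("һин"))  # found directly in the Bashkir lexicon
    print(translate_bashkir("һүҙ"))  # found only after respelling as Tatar
```

A production system would learn such correspondences from data rather than hard-code them, and a respelled word that happens to match an unrelated entry would produce a wrong answer, which is why the guesses are described as strong rather than certain.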

That same principle is at play with the two Mari languages: despite some noticeable differences between them, Yandex.Translate’s technology can make strong guesses as to what a certain word means should that word’s definition be missing from Yandex’s translation memory (databank). He cites another example more familiar to some Western readers: the overlap between Yiddish and German.

Netherlands Antilles, Curaçao, Willemstad, tourism, beach of the Avila Beach Hotel (Photo by Markus Matzel/ullstein bild via Getty Images Israel)

“So many words in Yiddish and German are identical or very similar. Yiddish uses the Hebrew alphabet, but unlike Hebrew, uses vowels in writing, except [for] words of Hebrew origin, which by convention are written as they are in Hebrew.”

Because of that, explains Dvorkvitch, the two lexicons can be compared to grab data from either language to complete translations related to a third. “The rest [of] Yiddish [follows a] phonetic principle of writing, and this makes it possible, if the German word and the word in Yiddish coincide, to automatically transliterate them.”

Few deep learning translation services are operating today, and even fewer have an open platform. Yandex’s work with languages that have small speaker populations, as well as Microsoft’s with Native American languages Mayan and Querétaro Otomi, offers new hope to a number of seemingly hopeless efforts to preserve dying languages around the world, like Rutgers professor Charles Häberl’s efforts with Aramaic dialect Neo-Mandaic (full disclosure, I helped Professor Häberl enter data for his projects when I was a student).

It remains to be seen whether Google will make any special effort to launch open machine translation for other groups of languages with small populations, or for dialects that face extinction. However, even the least tech-savvy enthusiast can take advantage of neural translation advancements that will only make such future projects easier.

 


  • [anonymous]

    Good for Yandex. Hopefully this helps give Google a competitive push into the realm of smaller regional languages.

  • memes333

    Add Google Tatar language in the Google translator

  • JN9

    Google is connected to Islamic terrorists. Google uses its products and services especially that notorius Adsense networks to channel funds to terrorists and also through that infamous Clinton Foundation.

    Google added non Pakistani languages of India almost only several years after the Islamic origin language Hindi was added to the list.

    Google’s Adsense REJECTS websites in any Indian language and only allows the Pakistani Islamic origin language Hindi/Urdu.

    The fact the Yandex has also added major Indian languages in its translate quite quickly after user feedback and also many more languages which the racist company Google does not have is very welcome news.

    Hope Yandex can develop an OS which can topple Google, Google advertising policies are extremely sickening. Google is not the same company which is the 2000s. They have several other evil interests, one of them is implementing the racist Indian governments Hindi hegemony and other is terror funding.

    Yandex might be the good alternative.