Why Google Translate’s neural machine translation is a major gambit against startups
Google made a massive announcement about its Google Translate service on Tuesday while the world was preoccupied with Elon Musk’s plans to blast humanity’s best and bravest to colonies on Mars. The Google Brain Team announced the results of research into replacing its “phrase-based machine translation” (PBMT) with “neural machine translation” (NMT). In practice, that means that rather than analyzing individual words and the occasional group of words, the new algorithms consider the entire sentence as a single unit, along with the clause and word combinations within it.
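To make the distinction concrete, here is a toy sketch (not Google’s actual system, and the phrase table is invented for illustration): a phrase-based approach translates fixed chunks independently, left to right, so it tends to preserve the source language’s word order; a sentence-level model, by contrast, sees the whole input at once.

```python
# Toy illustration of phrase-based translation (NOT Google's system):
# fixed chunks are looked up independently, so the output keeps the
# source sentence's ordering. A sentence-level (neural) model would
# instead score the whole sentence and could reorder freely.

PHRASE_TABLE = {
    ("the", "book"): "el libro",
    ("joshua",): "Josué",
    ("buys",): "compra",
}

def phrase_based(tokens):
    """Greedily match the longest known chunk, left to right."""
    out, i = [], 0
    while i < len(tokens):
        for n in (2, 1):  # try two-word chunks first, then single words
            chunk = tuple(tokens[i:i + n])
            if chunk in PHRASE_TABLE:
                out.append(PHRASE_TABLE[chunk])
                i += n
                break
        else:
            out.append(tokens[i])  # unknown word passes through untouched
            i += 1
    return " ".join(out)

print(phrase_based("joshua buys the book".split()))
# chunk-by-chunk lookup keeps English SVO order: "Josué compra el libro"
```

The chunking cannot look across chunk boundaries, which is exactly the limitation the sentence-as-a-unit approach is meant to remove.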
“Today we announce the Google Neural Machine Translation system (GNMT), which utilizes state-of-the-art training techniques to achieve the largest improvements to date for machine translation quality,” Quoc V. Le and Mike Schuster jointly posted on Google Translate’s official blog. Their full study, titled “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” was also linked in the announcement.
That is a game-changing prospect in a surprisingly uncrowded field of machine translation startups. The unfortunate reality of translation is that not only is everyone a critic of your work, but most of the people criticizing you are actually qualified to trash a bad translation. There is very little wiggle room in the business for ‘so-so’ or merely ‘okay’ translations. You can’t get paid if you don’t make sense.
“GNMT reduces translation errors by more than 55%-85% on several major language pairs measured on sampled sentences from Wikipedia and news websites with the help of bilingual human raters,” Le and Schuster claimed in their report, which they illustrate with the following example:
This is a strange example to choose, considering the researchers did not note that the GNMT translation is actually better than the human one. Another chart shows survey results in which human reviewers usually ranked human translations above GNMT output. But that might be the intention here: to show that humans still make common mechanical errors when they cannot work out phrasing, while the new algorithm will not have that kind of issue. That is not to say it won’t have problems, but the implication is that Google has solved enough of the problems of producing fluent translated sentences to challenge the strongest emergent technology in the field: translation memory.
Translation memory collects data from previous translations, sometimes client- or industry-specific, and uses it as a bank of prior in-context translations of words, phrases, and entire sentences.
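A minimal sketch of the idea, assuming a simple fuzzy-match lookup (this is illustrative only, not any vendor’s actual product): store previous source-to-target pairs and retrieve the closest prior translation for a new sentence, along with a similarity score a human translator could use to judge the match.

```python
import difflib

# Minimal translation-memory sketch (illustrative, not a real product):
# a bank of previous source → target translations.
memory = {
    "The invoice is attached.": "La factura está adjunta.",
    "Please confirm receipt.": "Por favor confirme la recepción.",
}

def tm_lookup(sentence, threshold=0.6):
    """Return (similarity, prior_translation) for the closest stored
    source sentence, or None if nothing clears the threshold."""
    best = None
    for src, tgt in memory.items():
        score = difflib.SequenceMatcher(
            None, sentence.lower(), src.lower()
        ).ratio()
        if best is None or score > best[0]:
            best = (score, tgt)
    return best if best and best[0] >= threshold else None

print(tm_lookup("The invoice is attached."))
# an exact match scores 1.0: (1.0, "La factura está adjunta.")
```

The strength of this approach is precision on repeated, in-domain material; its weakness, as the article goes on to suggest, is that it only knows what it has already seen.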
Geektime has profiled three strong startups that have built their businesses on translation memory: Sino-Indian startup Stepes, LingoHub in Austria, and Unbabel in Portugal. That we have profiled only three is not for lack of trying to find more: few companies in the field are doing well, and new players are having a hard time getting funding and staying in the game. Some have folded and others have hit dead ends. Unbabel has Silicon Valley backing, while LingoHub remains bootstrapped.
What Google has done puts immense pressure on these three companies, and on anyone clever enough in both human language and code to mount a successful, game-changing translation service. By their nature, algorithms can collect data much more quickly than memory banks can.
Another simplified way to describe it is the way we read words: as a unit. An old meme claims that if the letters in a word are scrambled but the first and last letters stay in place, we can still read the misspelled word because of its resemblance to the correct spelling. Research shows that reading this way is slower, so the claim is not completely true, yet it is not pure myth either. A similar idea is at work here. Some languages order sentences Subject-Verb-Object (SVO), as in “Joshua buys the book.” Others use different arrangements or are flexible; in certain contexts Spanish allows “compra Josué el libro.” The algorithm’s job is to group the terms together in the most natural order for the output language.
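As a toy sketch of that reordering step (the language keys and triple representation here are invented for illustration, not how any real system represents a parse): given a parsed subject-verb-object triple, emit it in the target language’s preferred constituent order.

```python
# Toy sketch of constituent reordering (illustrative only): map a
# parsed (subject, verb, object) triple into the word order the
# output language prefers.

ORDERS = {
    "english": ("S", "V", "O"),       # "Joshua buys the book"
    "spanish_vso": ("V", "S", "O"),   # "compra Josué el libro"
}

def reorder(triple, target):
    """Rearrange a (subject, verb, object) triple per the target order."""
    parts = dict(zip(("S", "V", "O"), triple))
    return " ".join(parts[slot] for slot in ORDERS[target])

print(reorder(("Josué", "compra", "el libro"), "spanish_vso"))
# → "compra Josué el libro"
```

A real system has no tidy triple to work with, which is precisely why treating the whole sentence as the unit of translation matters.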
On the other hand, the algorithm is not yet proven. Its best metric is comparison to previous Google Translate results, which can be humorously awful, embarrassing both corner-cutting 5th graders too lazy to do their Spanish homework and well-meaning politicians just trying to convey greetings. There are plenty of multilingual developers out there who might want to take a crack at this, and surely some at Google harboring good ideas and looking for an excuse to get incubated somewhere. The game is by no means over, but Google’s translation updates could prove as debilitating to translation startups as its Search updates have been to SEO companies.
“Machine translation is by no means solved,” Le and Schuster go on to say. “GNMT can still make significant errors that a human translator would never make, like dropping words and mistranslating proper names or rare terms, and translating sentences in isolation rather than considering the context of the paragraph or page. There is still a lot of work we can do to serve our users better.”
“However, GNMT represents a significant milestone. We would like to celebrate it with the many researchers and engineers—both within Google and the wider community—who have contributed to this direction of research in the past few years.”
Google’s full research report includes the following, intimidatingly brilliant authors: Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes and Jeffrey Dean.