Even bigger datasets and more natural language will be needed to go further
A new project from Oxford University promises more accurate lipreading. The software, developed in the university's computer science department and named LipNet, is 93.4% accurate when reading from GRID, a 64,000-sentence database.
This, as Mashable notes, is far better than previous programs, and better than trained human lipreaders. It also processes the data in near real time, like an instantaneous closed caption.
There is still a lot of work to do, though, since the test database is not natural conversation. It follows a conversational structure of giving a command with certain details, including colors and numbers, but the formulations are not your usual everyday dialogue.
For machine learning to become commercially effective, whether in transcribing speech or recognizing visual data, databases will need millions of samples. (GRID has just 34 distinct speakers and 51 word choices, so there's plenty of room for growth.)
So AI lipreading is not quite there yet, but the team predicts that, "given that our model is based on scalable components and that it has CTC [connectionist temporal classification], language modeling and beam search, it will likely do well on datasets with more complex grammar, as research in ASR [automatic speech recognition] has already shown."
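LipNet itself is a deep neural network and is not reproduced here, but the CTC step the team mentions can be illustrated with a minimal greedy decoder. This is a hedged sketch of the general CTC decoding rule, not the paper's actual code: the model emits one label per video frame (including a special "blank"), and decoding collapses repeated labels and then strips the blanks.

```python
# Minimal sketch of greedy CTC decoding (illustrative, not LipNet's code).
# CTC emits one label per frame, including a "blank" symbol; decoding
# collapses consecutive duplicates, then removes blanks.

BLANK = "-"  # conventional CTC blank symbol (an assumption for this sketch)

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate labels, then strip blanks."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:
            collapsed.append(label)
        prev = label
    return "".join(l for l in collapsed if l != BLANK)

# Example: per-frame predictions for the GRID word "bin"
frames = ["b", "b", "-", "i", "i", "-", "-", "n", "n"]
print(ctc_greedy_decode(frames))  # prints "bin"
```

The blank symbol is what lets CTC represent genuinely repeated letters: "bee" would be emitted as `b, e, -, e`, since two adjacent `e` frames with no blank between them collapse into one.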
Lipreading works by determining which vowels and consonants the movements we make while talking correspond to. How we hear words depends on the ways our mouths, jaws, teeth, and tongues move and interact with each other, and it is a very complex process. To give one example, the "th" sound in "tooth" and "bath" is, for lipreaders, distinct from the "th" sound in "teeth" and "bathe," because the change in vowel changes how we move our lips and tongues to produce the "th" phoneme. But overall, there are many more phonemes (around 50) than visemes (10-14), the mouth shapes that lipreaders learn to identify.
This is what makes learning to read lips so difficult: multiple phonemes come off as the same viseme.
Or think of how much harder it is to understand someone mumbling out of the side of their mouth rather than enunciating properly, or how different the same words look whispered versus shouted. So there is a lot of potential for this technology. "Machine lipreaders," the Oxford paper notes, "have enormous practical potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification, and silent-movie processing." And, as The Verge notes, it could even be built into camera glasses or virtual assistants.
(As someone who talks to himself a lot, I’d certainly be wondering, in the day and age when this tech does turn up on Snap Spectacles or Siri and Cortana, if I’m being recorded without knowing it every time I mouth a swear in frustration with someone or something!)
Lipreading done by people can be inaccurate, so a program that reliably recognizes mouth movements alongside auditory input would be an improvement. With video input it can be more accurate still, since the model then works from the whole package rather than from phonemes alone.
Another limitation, for all such programs and not just LipNet, is that the speaker's mouth and tongue must be visible to the camera, something one couldn't count on in the real world when trying to transcribe soundless surveillance footage, or to dub a film whose dialogue needs to be voiced over in another language.