With just 20 minutes of prep work, you can have anyone say anything
The technology lacks the artificial cadence of a virtual assistant that spaces words out unnaturally, and it avoids the problems of early text-to-speech systems, which could not pronounce certain phonetic combinations, like the “oi” in “soy,” and instead produced strange noises.
The software is designed to imitate speech after being fed 20 minutes’ worth of dialogue read in the speaker’s voice. When inserting words, it does not matter whether the speaker has actually said them: if told to insert them into a sentence, it will, in the speaker’s (mimicked) voice.
Using this audio sample, it breaks down everything the speaker says into phonemes, and by splicing together cues from the input audio, it can “predict” what words it hasn’t heard from the speaker would sound like, playing them back in the speaker’s voice. Phonemes, sounds like JH (in “judge”) and ER (in “bird”), have been used for years in more advanced systems to produce natural-sounding results, since they interact with one another to produce the words as we hear them.
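The phoneme-splicing idea can be sketched in a few lines. Everything below, including the phoneme store, the pronunciation lexicon, and the `synthesize` helper, is invented for illustration; a real system works on waveform units with smoothing at the joins, not raw lists of numbers.

```python
# Illustrative sketch of phoneme-level splicing (not Adobe's actual algorithm).
# Each phoneme heard in the input audio is stored as a reusable audio unit;
# a word the speaker never said is assembled from those units.

# Hypothetical store: phoneme label -> audio samples captured from the speaker
phoneme_units = {
    "JH": [0.1, 0.3, 0.2],   # as in "judge"
    "ER": [0.0, -0.2, 0.1],  # as in "bird"
    "B": [0.4, 0.1],
    "D": [0.2, 0.2],
}

# Hypothetical pronunciation lexicon: word -> phoneme sequence
lexicon = {"bird": ["B", "ER", "D"]}

def synthesize(word):
    """Concatenate stored phoneme units to approximate an unheard word."""
    samples = []
    for ph in lexicon[word]:
        samples.extend(phoneme_units[ph])
    return samples

audio = synthesize("bird")
```

Even this toy version shows why 20 minutes of input matters: the speaker has to cover enough phonemes, in enough contexts, for the splices to sound like them.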
Adobe has not divulged much more information about the project yet, such as when a commercial release will take place. Though it was only shown in English, the system could be adapted to other languages. Windows speech recognition software currently processes English, German, Spanish, French, Japanese, and Chinese, for instance, using the phonemes unique to each language.
These processes are very intensive, as Google’s Fernando Pereira recently told Backchannel: “When you try to build a system for understanding natural language, and you don’t have many examples of the kind of understanding you want … then you have to prescribe, you have to write — essentially teach it grammar — so that it can do the understanding. That teaching is very laborious.” Earlier systems trained to pick out words could not make sense of slang, non-standard proper nouns, or unique sentence structures very well. “In order to overcome these challenges,” notes The Tartan, “today’s voice recognition software employs sophisticated statistical modeling algorithms to predict the most likely and most sensible outcome for the input.”
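The statistical prediction The Tartan describes can be illustrated with a toy bigram language model that scores candidate transcriptions and keeps the most probable one. The probabilities and candidates below are made up for the example; production recognizers use far larger models.

```python
# Toy example of "predicting the most likely and most sensible outcome":
# score candidate word sequences with bigram log-probabilities (values invented).

bigram_logprob = {
    ("speak", "to"): -0.5,
    ("to", "an"): -0.7,
    ("an", "operator"): -1.0,
    ("speak", "two"): -6.0,
    ("two", "an"): -8.0,
}

def score(words):
    """Sum log-probabilities of adjacent word pairs; unseen pairs score low."""
    total = 0.0
    for a, b in zip(words, words[1:]):
        total += bigram_logprob.get((a, b), -10.0)
    return total

candidates = [["speak", "to", "an", "operator"],
              ["speak", "two", "an", "operator"]]
best = max(candidates, key=score)
# "speak to an operator" wins because its word pairs are far more probable
```

This is also why the laborious grammar-writing Pereira mentions can be avoided: with enough examples, the statistics stand in for hand-prescribed rules.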
Audio moves in on text
Speech recognition is becoming increasingly accurate. The future of this technology will also incorporate lip reading, improving accuracy on video rather than audio-only material, and will be usable for translating languages.
That VoCo can handle continuous speech separates it from, say, customer service hotlines, where the software learns to recognize only a limited set of commands, perhaps 100 words, such as “speak to an operator” or the digits of a credit card number read aloud.
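A limited-vocabulary system like that hotline can get by with little more than a lookup table over its fixed command set. The commands and action names below are hypothetical, just to show how small the problem becomes when the vocabulary is closed.

```python
# Sketch of closed-vocabulary recognition: match the recognized words against
# a small fixed command set instead of modeling open-ended continuous speech.
# Commands and action names are invented for illustration.

COMMANDS = {
    "speak to an operator": "transfer",
    "check my balance": "balance",
    "repeat that": "repeat",
}

def dispatch(utterance):
    """Map a recognized phrase to an action; anything else is rejected."""
    return COMMANDS.get(utterance.strip().lower(), "unrecognized")

dispatch("Speak to an operator")  # maps to the "transfer" action
```

VoCo’s task is the opposite: it must model arbitrary speech, which is why it needs the heavier machinery described above.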
The processing power and machine learning techniques available today let the software make more accurate guesses about how the words will sound.
VoCo is also speaker-dependent: it needs sufficient input from a given speaker before it can reproduce that person’s speech patterns. But the results sound very natural. Though the demonstration sentences had some stilted parts, to the untrained ear some of the more common words sounded just like natural conversation.
Adobe said it will introduce digital watermarks to reduce the risk of forgeries.
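Adobe has not said how its watermark would work. As a generic illustration only, one classic audio watermarking technique hides a bit pattern in the least significant bits of the samples, where it is inaudible but machine-readable:

```python
# Toy least-significant-bit (LSB) audio watermark, NOT Adobe's method.
# Integer PCM samples carry one watermark bit each in their lowest bit.

def embed(samples, bits):
    """Overwrite the LSB of each sample with one watermark bit."""
    return [(s & ~1) | b for s, b in zip(samples, bits)]

def extract(samples, n):
    """Read the watermark back out of the first n samples."""
    return [s & 1 for s in samples[:n]]

original = [1000, 1001, 1002, 1003]   # hypothetical 16-bit PCM samples
marked = embed(original, [1, 0, 1, 1])
extract(marked, 4)  # recovers [1, 0, 1, 1]
```

Simple LSB marks are fragile (re-encoding destroys them), which hints at why a robust forgery-detection watermark is a harder problem than it first appears.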
Since so little is known, VoCo raises many possible uses and troubling questions. Could it be used to impersonate someone? Almost certainly. For instance, it could be used to fake a celebrity or politician speaking at a private event, or to fool voice-recognition IoT devices in the home. Less potentially harmful uses could include making fixes in “recording voiceovers, dialog, and narration” as well as podcasting, Adobe said in a statement.
Indeed, speech recognition and editing could become a whole new aspect of law enforcement, from identifying suspects to developing a new range of tools and practices to account for potentially falsified evidence.
In theory, though, incriminating leaks could be spun from whole cloth, and media feeding frenzies launched on the strength of a made-up conversation. This would be a serious issue, and it would require some kind of detection system that flags when speech has been manipulated by a program, just as tools and methods exist today to detect photoshopping.