Short Bytes: A new artificial intelligence system has been developed by Google DeepMind and the University of Oxford. The Watch, Listen, Attend, and Spell (WLAS) system can transcribe speech by lip-reading unedited video footage. It even outperformed a professional human lip-reader during the tests.

Researchers at DeepMind (Google's UK-based AI subsidiary) teamed up with researchers from the University of Oxford to create highly accurate lip-reading software using artificial intelligence.
The neural network, known as "Watch, Listen, Attend, and Spell" (WLAS), was trained using around 5,000 hours of TV content from the BBC network. The video content – corrected for audio-video sync issues – came from shows like World Today, Breakfast, and Newsnight, and comprised 118,116 different natural sentences and 17,428 different words.
The AI was able to lip-read directly from the video footage with an accuracy of 46.8%. It even outperformed a professional human lip-reader, who achieved an accuracy of only 12.4% while transcribing the same television content.
"Trained using 118,116 different sentences and 17,500 unique words"
According to the researchers, their Watch, Listen, Attend, and Spell (WLAS) system has surpassed the transcription performance of all previous work in the field by a considerable margin.
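As the name suggests, WLAS is an attention-based sequence-to-sequence model: at each output step, the decoder "attends" over the encoded video frames, weighting the frames most relevant to the character being spelled. Here is a minimal sketch of a single dot-product attention step in plain NumPy; the array shapes, the toy frame vectors, and the `attend` helper are illustrative assumptions, not the actual WLAS implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # Toy dot-product attention (illustrative, not the WLAS code):
    # score each encoded video frame against the current decoder state,
    # normalize the scores, and form a weighted "context" vector.
    scores = encoder_states @ decoder_state   # one score per frame
    weights = softmax(scores)                 # attention distribution
    context = weights @ encoder_states        # weighted sum of frames
    return weights, context

# Four hypothetical encoded video frames, each a 3-dim feature vector
encoder_states = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0]])
decoder_state = np.array([1.0, 0.0, 0.0])

weights, context = attend(decoder_state, encoder_states)
```

In this toy run, frames 0 and 3 align best with the decoder state and so receive the largest attention weights; the context vector the decoder actually consumes is their weighted blend.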
Such a system could have a variety of applications, such as helping hearing-impaired people decipher spoken words. It could also enable personal assistants like Siri and Cortana to take commands without the need for audio input.
The new AI system adds to a series of other efforts by Google. The company's Google Brain team recently created a multilingual system that can translate between language pairs using its own interlingua. Show and Tell is another Google Brain AI, capable of captioning images.
Here is a video clip with subtitles provided by the AI:
To know more, read the research paper Lip Reading Sentences in the Wild.
If you have something to add, tell us in the comments below.