Lipreading using Deep Learning: research from the University of Oxford and Google DeepMind


[Animation: lipreading.gif]

The analysis of articulated speech sounds was already known to ancient Indian phoneticians more than 2,000 years ago: they carried out an analytic study and description of Sanskrit in order to ensure the preservation and propagation of the Vedas, the religious scriptures of the ancient Hindus. Later, a sixteenth-century Benedictine monk named Pietro Ponce was mentioned by Brücke as having applied a Spanish tradition of language teaching to the deaf; he was the first to successfully teach a language without relying on speech sounds[1].

During automated lip reading (ALR), the mouth forms a set of roughly 10 to 14 distinct shapes, called visemes. The computer recognizes letters or words by matching the recorded shapes against a predefined, saved set of shapes. That is not as easy as it seems: speech contains around 50 individual sounds, known as phonemes, so the system may well link a given viseme to the wrong phoneme during visual learning (the sketch after this paragraph illustrates this many-to-one ambiguity). Machine learning approaches have made the predictions far more efficient and relevant, but most classification was done only at the word level, which makes real-time output easier to obtain while limiting the achievable accuracy. In recent research from the University of Oxford and Google DeepMind, a deep learning model working at the sentence level reached an accuracy of 93.4%[2]. A video accompanying the research shows how the model performs.
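To make the ambiguity concrete, here is a minimal Python sketch. The viseme groupings in it are simplified illustrations chosen for this post, not a standard viseme inventory:

```python
# Illustrative only: a toy viseme-to-phoneme table. The groupings below are
# simplified examples for this post, not a standard viseme inventory.
VISEME_TO_PHONEMES = {
    "bilabial": ["p", "b", "m"],   # lips pressed together look identical
    "labiodental": ["f", "v"],     # upper teeth touching the lower lip
    "rounded": ["w", "r"],         # rounded, protruded lips
}

def candidate_phonemes(viseme: str) -> list[str]:
    """Return every phoneme a single mouth shape could correspond to."""
    return VISEME_TO_PHONEMES.get(viseme, [])

if __name__ == "__main__":
    # A single "bilabial" shape cannot distinguish "pat", "bat" and "mat":
    # the visual signal alone leaves three equally plausible phonemes.
    print(candidate_phonemes("bilabial"))  # ['p', 'b', 'm']
```

Sentence-level models like the one described below can use surrounding words and temporal context to resolve which of those candidate phonemes was actually spoken.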

The abstract of the paper[2] gives an overview:

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.
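For readers who want a feel for the architecture described in the abstract, here is a minimal, illustrative PyTorch sketch of a LipNet-style pipeline: spatiotemporal (3D) convolutions over a sequence of mouth crops, a recurrent network over time, and the connectionist temporal classification (CTC) loss. The layer sizes, the 50x100 crop size, and the character vocabulary are assumptions made for this example, not the configuration published by the authors.

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """Toy LipNet-style model: 3D convolutions -> LSTM -> per-frame characters."""

    def __init__(self, num_classes: int = 28):  # 26 letters + space + CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 96, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # 96 channels x 3 x 6 spatial cells, given the assumed 50x100 input crop.
        self.rnn = nn.LSTM(96 * 3 * 6, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels=3, time, height=50, width=100)
        feats = self.conv(video)                          # (B, 96, T, 3, 6)
        feats = feats.permute(0, 2, 1, 3, 4).flatten(2)   # (B, T, 96*3*6)
        feats, _ = self.rnn(feats)                        # (B, T, 512)
        logits = self.fc(feats)                           # (B, T, num_classes)
        # CTC expects (T, B, num_classes) log-probabilities.
        return logits.permute(1, 0, 2).log_softmax(dim=-1)

if __name__ == "__main__":
    model = LipReadingNet()
    frames = torch.randn(2, 3, 75, 50, 100)               # two 75-frame clips
    log_probs = model(frames)                             # (75, 2, 28)
    targets = torch.randint(1, 28, (2, 20))                # dummy character targets
    loss = nn.CTCLoss(blank=0)(
        log_probs, targets,
        input_lengths=torch.full((2,), 75, dtype=torch.long),
        target_lengths=torch.full((2,), 20, dtype=torch.long),
    )
    print(float(loss))
```

Because the CTC loss aligns the per-frame predictions with the target character sequence automatically, such a pipeline can be trained end-to-end from video to text, which is what allows sentence-level prediction rather than isolated word classification.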

Bibliography:

[1] – Koerner, E.F.K. & Asher, R.E., Concise History of the Language Sciences: From the Sumerians to the Cognitivists. Accessed 07.11.2016.

[2] – Assael, Y.M., Shillingford, B., Whiteson, S. & de Freitas, N., LipNet: Sentence-level Lipreading. Accessed 07.11.2016.
