Smart Audio - Applying AI to Audio Processing

We often hear about how AI (Artificial Intelligence) has been used in computer vision, data modeling and machine learning. Less talked about is how it is also used to process audio signals in smart devices like the smart speakers from Amazon, Apple and Google. This is the world of smart audio, and its application is emerging in many fields. Smart audio is not just about producing sound, but also about analyzing spoken words with NLP (Natural Language Processing). Interactive devices use smart audio features to analyze speech and reply intelligibly.

One method used in smart audio is the dilated convolutional neural network, a deep learning model that can generate audio. An example of this is Google DeepMind's WaveNet.

From a device perspective there is Apple's HomePod. The HomePod uses "spatial awareness", meaning it is "aware" of your room's layout and adjusts its audio output accordingly. This is like 3D surround sound without additional speakers. The trick is in the way the HomePod was designed. It's powered by an A8 processor, not the A11, which I thought would have been the best choice. The speaker system consists of a high-excursion woofer, an array of 6 microphones and 7 tweeters. Enclosing the cylindrical body is a seamless mesh fabric. The woofer has about 20 mm of travel, which helps it produce a deep bass sound. Because the body is cylindrical, it can deliver sound at different angles, which the AI software uses to focus sound directionally. The HomePod looks like a roll of toilet paper or another trashcan, but the design is supposed to provide sound from all directions of its cylindrical body.

Smart audio is built on 3 foundations:

Digital Signal Processing
Sound Recognition
Deep Learning

Digital Signal Processing (DSP) deals with signals represented in discrete form, such as sampled audio. A DSP (digital signal processor) is a physical device built into an IC chip. One way this applies to audio is in source separation: DSP chips can be designed to track the various sources in an audio signal, as when 2 or 3 people, or even a whole group, are talking in a room at the same time. Humans don't have problems with these situations, but a computer won't know who is talking. A DSP can help isolate who is speaking by variations in frequency.
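To make the idea of isolating a source by frequency concrete, here is a minimal Python sketch using NumPy. It is a toy illustration, not how a production DSP chip works: the two "voices" are pure tones at assumed frequencies (200 Hz and 1200 Hz), and the function name isolate_band is made up for this example. Real voices overlap in frequency, so practical systems need far more than a simple band mask.

```python
import numpy as np

def isolate_band(signal, sample_rate, low_hz, high_hz):
    """Keep only the frequency components between low_hz and high_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Zero out every frequency bin that falls outside the requested band.
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

sample_rate = 8000                          # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate    # one second of audio

# Two stand-in "speakers": a low tone and a quieter high tone.
low_voice = np.sin(2 * np.pi * 200 * t)
high_voice = 0.5 * np.sin(2 * np.pi * 1200 * t)
mixture = low_voice + high_voice

# Pull each source back out of the mixture by its frequency band.
recovered_low = isolate_band(mixture, sample_rate, 100, 400)
recovered_high = isolate_band(mixture, sample_rate, 1000, 1500)
```

Because the two tones sit in separate bands, each recovered signal matches its original source almost exactly. With overlapping sources the same mask would leak one speaker into the other.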

Sound Recognition analyzes the patterns in audio, particularly speech and other audible sounds. It relies on transducers such as microphones to capture sound waves and convert them to an electrical signal, which a DSP can then digitize. Features are extracted from the captured audio and passed to classification algorithms for analysis. The classifier works on feature vectors, which can be derived with techniques such as linear predictive coding. This helps to identify what the sounds are and, in the case of smart audio, what a person is saying.
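As a rough sketch of what a feature vector from linear predictive coding might look like, the Python below estimates LPC coefficients with the textbook autocorrelation method. The function name lpc_coefficients, the model order and the synthetic test signal are all assumptions made for illustration. A real recognizer would compute such vectors on short frames of recorded audio and hand them to a classifier.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate LPC coefficients with the autocorrelation method.

    Returns a such that frame[n] is approximated by
    sum(a[k] * frame[n - 1 - k] for k in range(order)).
    """
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation of the frame at lags 0..order.
    r = np.array([np.dot(frame[: len(frame) - lag], frame[lag:])
                  for lag in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Synthetic signal (assumed): each sample depends on the previous two.
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
x = np.zeros(5000)
noise = rng.normal(size=x.size)
for n in range(2, x.size):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2] + noise[n]

# The estimated coefficients act as a compact feature vector for the signal.
features = lpc_coefficients(x, order=2)
```

The estimated coefficients come out close to the values used to generate the signal, which is what makes them useful as a compact description of how a sound is shaped.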

Deep Learning software is used with DSP and sound recognition to analyze and identify audio. For example, a model can be trained to recognize what a train sounds like and get accurate results. However, it may encounter other loud noises like explosions and loud banging. With deep learning models, the system can further learn what it needs to listen for and filter out information that is not necessary.

Some Applications of Smart Audio

Smart Speakers allow users at home or in the office to interact with their device using voice commands. NLP can also be used for command chaining, which lets smart speakers process multiple tasks from one request. Currently smart speakers like the Echo, HomePod and Google Home are used for narrow AI operations. They run on AI software: Assistant on Google Home, Alexa on the Echo and Siri on the HomePod. Users can talk to these devices and get information back.

Smartphones are the most widely available devices for smart audio applications. There are apps one can download to do things like identify a song, translate spoken language and transcribe speech to text. Perhaps it is the digital assistants that provide the best example. On Android phones, Google Assistant gives users ways to interact with their phone using speech. A user may want to know what the weather will be like tomorrow in their city and ask this as a question starting with a hot word like "Hey Google ...".

Voice-Assisted Intelligent Assistant Software allows users to perform tasks with voice commands. This can be useful on car navigation systems and infotainment consoles. When driving, it is difficult at times to operate controls by hand. By using voice commands to interact with an intelligent assistant, the driver just needs to say what they need. An example is finding directions to a particular restaurant. With voice commands the driver doesn't have to pull over or take their eyes off the road. They can continue driving while the intelligent assistant guides them. This can also help prevent distracted driving incidents in which a driver is too busy looking at their screen and does not notice a crossing pedestrian.

Musical composition can use AI audio processing when producing or mixing tracks. Composing songs is still a human ability based on talent, but smart audio processing techniques can enhance that and take it to another level. A producer can feed a music sample to AI software, which will then create a different composition based on that information for producers to play around with. They can even combine several tracks or use musicologist software to suggest which tracks to blend to get a specific sound based on certain criteria.

Image matching with spoken words allows users to say something and have the system connect it to what it sees. Take for example a smart appliance like a refrigerator. Asking the fridge "how many apples are left?" helps the user decide whether to go to the grocery store or order online for delivery. In this case an intelligent fridge using computer vision sensors identifies the apples and their quantity, but it must know what it is looking for based on the word "apple".

Assistive technology for people with disabilities can benefit tremendously from smart audio. For example, an app can be installed to help guide people with sight problems. A smart audio app can give the user voice feedback, telling them when it is safe to cross the street or warning them to stop before walking into an obstacle. For people with speech disabilities, it could (theoretically) pick up the words a person forms in their mind and convert them into intelligible speech. Other cases involve people with limited mobility, whether from a permanent disability or a temporary injury. A speech-based command interface can help them make phone calls, watch content on their devices, or call for help in emergency situations when they may not have use of their hands.

Another example of smart audio is in TTS (Text to Speech) processing. DeepMind has developed WaveNet for this application. It builds on the idea of parametric TTS, in which all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. Parametric systems typically produce their audio through signal processing algorithms called vocoders. WaveNet instead models raw waveforms directly, which results in more natural-sounding speech. This would make it possible to develop AI talking machines that sound very human.
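WaveNet was introduced earlier as a dilated convolutional network. The Python sketch below shows, under simplifying assumptions, what "dilated" means and why it helps with raw waveforms. The function names, the fixed weights of 1.0 and the dilation schedule 1, 2, 4, 8 are chosen for illustration only. Real models learn their weights and add nonlinearities between layers, so this is not DeepMind's implementation.

```python
import numpy as np

def dilated_causal_conv(x, weights, dilation):
    """1-D causal convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of x are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift < len(x):
            # Each tap looks `shift` samples into the past, never the future.
            y[shift:] += w * x[: len(x) - shift]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Stack four layers whose dilation doubles each time.
dilations = [1, 2, 4, 8]
impulse = np.zeros(32)
impulse[0] = 1.0

signal = impulse
for d in dilations:
    signal = dilated_causal_conv(signal, weights=[1.0, 1.0], dilation=d)

# How far a single input sample spreads after all four layers.
reach = int(np.count_nonzero(signal))
```

With only four two-tap layers, one input sample influences 16 output samples, and each extra doubling layer doubles that reach again. That growth is what lets a model of this kind cover long stretches of audio without needing very large filters.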

The Future

Smart audio with deep learning can be applied to enhance audio processing in many ways.

Noise cancellation - Further removing noise and distortion raises audio quality toward high-fidelity levels, for both consumer and research-related applications.

New music - Using only ordinary equipment, it may become possible to create high-fidelity audio tracks.

Entertainment - More human-sounding speech in games and animations without human actors.

Speech manipulation - The ability to translate the speech in a recording into another language, or to change its accent, dialect and other attributes.