Localizing Sound in Visual Scenes - [Deep Learning]

in deep-learning •  6 years ago 



Senocak and colleagues (2018) published a deep learning paper in which they start from the question:

"Can the machine learn the correspondence between visual scene and the sound, and localize the sound source only by observing sound and visual scene pairs like human?" [source]

In the paper, they propose an unsupervised learning model that addresses the problem of localizing sound sources in visual scenes.

For the model, they use two convolutional neural networks, one for sound and one for vision. So you might think it has two inputs, and in a sense it does: the model learns from pairs of video frames and sound to work out where the sound comes from. However, each input is processed by its own network, and the two streams are then combined to localize the sound.
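To get a feel for how a two-stream setup like this can point at a sound source, here is a minimal numpy sketch of the general idea, not the authors' exact architecture: compare the audio embedding against every spatial location of the visual feature map and turn the similarities into an attention map. The shapes and the feature dimension are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def localization_map(visual_feats, audio_emb):
    """Cosine similarity between one audio embedding and each spatial
    location of a visual feature map -> a coarse sound-localization map.

    visual_feats: (H, W, D) feature map from a vision CNN (hypothetical)
    audio_emb:    (D,) embedding from a sound CNN (hypothetical)
    """
    v = l2_normalize(visual_feats)              # (H, W, D)
    a = l2_normalize(audio_emb)                 # (D,)
    sim = np.tensordot(v, a, axes=([2], [0]))   # (H, W) cosine similarities
    # softmax over all spatial locations gives an attention map
    e = np.exp(sim - sim.max())
    return e / e.sum()

# toy example with random stand-ins for the two CNN outputs
rng = np.random.default_rng(0)
vis = rng.standard_normal((14, 14, 512))
aud = rng.standard_normal(512)
att = localization_map(vis, aud)
print(att.shape, np.isclose(att.sum(), 1.0))  # (14, 14) True
```

In the toy example, the cell of `att` with the highest value marks the image region whose features best match the sound; in training, such a map is what the unsupervised objective shapes by pairing matching and mismatched frame/sound pairs.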

To be more precise, they actually use a combination of unsupervised, semi-supervised, and supervised learning. For the exact technical details, I'd suggest reading the full paper below. Some take-away messages are:

" By empirically demonstrating the capability of our unsupervised network, we show the model plausibly works in a variety of categories but partially, in that the network can often get to false conclusion without prior knowledge.

We also show that leveraging small amount of human knowledge can discipline the model, so that it can correct to capture semantically meaningful relationships. " [source]

Below is a video presentation of the paper, which might be very insightful if the paper itself is too hard to grasp.





To stay in touch with me, follow @cristi


Cristi Vlad Self-Experimenter and Author
