By isolating sounds, MIT could make robots more aware of their surroundings

The PixelPlayer system could let robots pick out car noises, as well as change the way the music industry works

MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a deep-learning system that can watch (yes, watch - read on) a video of a musical performance and isolate the sounds of specific instruments.

The ‘PixelPlayer' system was trained on more than 60 hours of video, and is now capable of identifying instruments ‘at pixel level' after viewing a performance just once. It can then extract the sounds associated with those instruments and play them back without any other noise.

At the moment PixelPlayer can identify sounds from more than 20 instruments, a number that could grow with more training data.

The system uses three neural networks: one to analyse the visuals of the video; one to analyse the audio; and a third, the ‘synthesiser', which associates specific pixels with specific sound waves to separate the different sounds.
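To make that three-network split more concrete, here is a rough, illustrative sketch in PyTorch of how such a pipeline could be wired together. This is our own simplification, not MIT's released code: the layer sizes, the feature dimension and the sigmoid mask are all assumptions, and a real system would work on spectrograms and video clips rather than the random tensors used here.

```python
# Illustrative sketch only: NOT MIT's code, just the three-network idea in outline.
import torch
import torch.nn as nn

class VideoAnalyzer(nn.Module):
    """Assumed: a small CNN turning a video frame into per-pixel visual features."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )
    def forward(self, frames):          # frames: (batch, 3, H, W)
        return self.conv(frames)        # (batch, feat_dim, H, W)

class AudioAnalyzer(nn.Module):
    """Assumed: a CNN splitting the mixture spectrogram into feat_dim components."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )
    def forward(self, spec):            # spec: (batch, 1, F, T)
        return self.conv(spec)          # (batch, feat_dim, F, T)

class Synthesizer(nn.Module):
    """Combines one pixel's visual features with the audio components to
    predict a spectrogram mask for the sound coming from that pixel."""
    def forward(self, pixel_feat, audio_feats):
        # pixel_feat: (batch, feat_dim); audio_feats: (batch, feat_dim, F, T)
        weights = pixel_feat[:, :, None, None]                     # broadcast over F, T
        return torch.sigmoid((weights * audio_feats).sum(dim=1))   # (batch, F, T)

# Toy forward pass with random data, just to show how the pieces connect.
frames = torch.randn(1, 3, 224, 224)
spec = torch.randn(1, 1, 256, 64)
visual = VideoAnalyzer()(frames)
audio = AudioAnalyzer()(spec)
pixel = visual[:, :, 100, 100]          # features at one chosen pixel
mask = Synthesizer()(pixel, audio)      # mask isolating that pixel's sound
isolated_spec = mask * spec[:, 0]       # apply mask to the mixture spectrogram
print(isolated_spec.shape)              # torch.Size([1, 256, 64])
```

The key idea sits in the last few lines: the visual features at a single pixel act as weights over the audio components, producing a mask that keeps only the sound associated with that pixel.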

So far we have only seen videos of two instruments being played at once, so we do wonder how the system would handle a larger-scale performance like an orchestra.

Lead author Hang Zhao said that the system may have trouble handling subtle differences between subclasses of instruments (such as an alto sax versus a tenor).

The system can adjust the volume of individual instruments after the fact, which has implications for cleaning up old recordings, or previewing how certain instruments sound in a new composition.

A similar system could even be built into robots, helping them better understand environmental sounds such as those made by animals or vehicles.

Previous attempts to isolate sounds have focused on audio alone, which MIT says ‘often requires extensive human labeling'; PixelPlayer's reliance on vision, by contrast, makes that manual labelling unnecessary.

The system locates the image regions that produce sounds (implying that concert recordings would be a much more difficult proposition; we have asked MIT about this), then separates the input sounds into a set of components that represent the sound from each pixel.
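As a loose illustration of what separating the input into per-pixel components means in practice, the toy example below (our own, not from MIT) masks a spectrogram to pull one source out of a two-tone mixture. In PixelPlayer the mask would come from the learned per-pixel features; here a hand-picked frequency cutoff stands in for it.

```python
# Toy spectrogram masking: two synthetic tones stand in for two instruments.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)

f, frames, Z = stft(mixture, fs=fs, nperseg=512)

# Hypothetical mask: keep only bins below 1 kHz, standing in for the mask
# a pixel on the lower-pitched instrument would produce.
mask = (f < 1000)[:, None].astype(float)

_, isolated = istft(Z * mask, fs=fs, nperseg=512)
print(isolated.shape)   # roughly fs samples containing only the 440 Hz component
```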

The downside to this ‘self-supervised' deep learning is that MIT doesn't exactly understand how PixelPlayer learns which instruments make which sounds.