MIT boffins develop system that can recognise speech and objects at the same time

Can highlight in real time the relevant regions of an image being described

MIT computer scientists have developed a system that tackles speech and object recognition at the same time.

The system learns to identify objects within an image, based on a spoken description of the image. So, given an image and an audio caption, the model will highlight in real time the relevant regions of the image being described.

Unlike current speech recognition technologies, the model doesn't require manual transcriptions and annotations of the examples it's trained on. Instead, it learns words directly from recorded speech clips and objects in raw images and associates them with one another.
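To give a rough sense of how such an approach can work, the sketch below computes a grid of similarities between image regions and audio frames and uses it to highlight the regions most relevant to a given moment in a spoken caption. The feature shapes, encoder outputs, and dot-product similarity here are illustrative assumptions, not the researchers' exact implementation.

```python
import numpy as np

def matchmap(image_feats, audio_feats):
    """Similarity between every image region and every audio frame.

    image_feats: (H, W, D) array - one D-dim embedding per spatial region,
                 e.g. the output of a convolutional image encoder.
    audio_feats: (T, D) array - one D-dim embedding per audio frame,
                 e.g. the output of a convolutional speech encoder.
    Returns an (H, W, T) array of dot-product similarities.
    """
    return np.einsum('hwd,td->hwt', image_feats, audio_feats)

def highlight_regions(image_feats, audio_feats, frame):
    """Heatmap over image regions for one moment in the spoken caption."""
    sim = matchmap(image_feats, audio_feats)[:, :, frame]
    sim -= sim.min()
    return sim / (sim.max() + 1e-8)  # normalise to [0, 1] for display

# Illustrative shapes only: a 14x14 grid of 512-dim image embeddings
# and 128 audio frames of 512-dim speech embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal((14, 14, 512))
aud = rng.standard_normal((128, 512))
heat = highlight_regions(img, aud, frame=64)
print(heat.shape)  # (14, 14)
```

During training, a model along these lines would be pushed to score matching image-caption pairs higher than mismatched ones, so the heatmap comes to pick out the object being spoken about without any transcripts or labels.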

The model can currently recognise only several hundred different words and object types. By contrast, speech-recognition systems such as Siri and Google Voice require transcriptions of many thousands of hours of speech recordings, which they use to learn to map speech signals to specific words.

However, this approach becomes problematic when new terms enter our lexicon, and the systems must be retrained.

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you're seeing," said David Harwath, a researcher in MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL).

In their new paper, the researchers demonstrate their model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background. The model learned to associate which pixels in the image corresponded with the words "girl," "blonde hair," "blue eyes," "blue dress," "white lighthouse," and "red roof." When an audio caption was narrated, the model then highlighted each of those objects in the image as they were described.

A potential application is learning translations between different languages, without the need for a bilingual annotator.

"Of the estimated 7,000 languages spoken worldwide, only 100 or so have enough transcription data for speech recognition," explained the study.

"Consider, however, a situation where two different-language speakers describe the same image. If the model learns speech signals from language A that correspond to objects in the image and learns the signals in language B that correspond to those same objects, it could assume those two signals - and matching words - are translations of one another."

Harwath added that this means there's potential for a wearable device that translates between different languages for its user.

He and his fellow MIT researchers also hope that one day their combined speech-object recognition technique could save countless hours of manual labour and open new doors in speech and image recognition.