Microsoft claims new speech recognition accuracy record

Microsoft looking to move beyond speech recognition to computers understanding speech

Microsoft has claimed a new record for speech recognition accuracy.

An announcement from Xuedong Huang, a Microsoft Technical Fellow, indicates that the company's latest tests show a 5.1 per cent word error rate, an improvement on the 5.9 per cent previously recorded, which was itself already better than the error rate of a typical human transcribing regular, casual conversation.

The results are based on the standard Switchboard test, a benchmark built on a corpus of recorded telephone conversations which the machine being tested must transcribe.
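
The figure being quoted is a word error rate: the proportion of words the system gets wrong against a reference transcript, counting substitutions, insertions and deletions. As a rough illustration only (this is not Microsoft's evaluation code), it can be computed with a word-level edit distance:

```python
# Minimal sketch of word error rate (WER) scoring, as used in benchmarks like
# Switchboard. Hypothetical example code, not Microsoft's evaluation pipeline.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of six gives roughly a 16.7 per cent error rate.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```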

According to Microsoft: "We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modelling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels."
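
For readers unfamiliar with the jargon, a CNN-BLSTM acoustic model roughly means: convolutional layers pick out local patterns in the audio features, a bidirectional LSTM adds context from both directions in time, and a final layer scores senones (tied phone states) for every frame. The sketch below is a hypothetical PyTorch illustration of that shape; the layer sizes and structure are assumptions, not Microsoft's configuration.

```python
# Hypothetical sketch of a CNN-BLSTM acoustic model. Illustrative only.
import torch
import torch.nn as nn

class CNNBLSTMAcousticModel(nn.Module):
    def __init__(self, n_mels: int = 80, n_senones: int = 9000):
        super().__init__()
        # Convolutions capture local time-frequency patterns in the features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # pool along frequency only
        )
        # Bidirectional LSTM models long-range context in both time directions.
        self.blstm = nn.LSTM(input_size=32 * (n_mels // 2), hidden_size=512,
                             num_layers=2, batch_first=True, bidirectional=True)
        # Per-frame senone (tied triphone state) scores.
        self.classifier = nn.Linear(2 * 512, n_senones)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_mels) log-mel frames
        x = self.cnn(feats.unsqueeze(1))          # (batch, 32, time, n_mels/2)
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, time, 32 * n_mels/2)
        x, _ = self.blstm(x)
        return self.classifier(x)                 # (batch, time, n_senones)

model = CNNBLSTMAcousticModel()
dummy = torch.randn(2, 100, 80)                   # two utterances, 100 frames each
print(model(dummy).shape)                         # torch.Size([2, 100, 9000])
```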

The company goes on to explain that the system can now adapt its language model to an individual's previous conversations, using that history to better predict what the speaker is likely to say next and to keep track of topic and context - something that its current public bots do less well.
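
One simple way to picture that kind of adaptation is blending a general background word model with counts drawn from the speaker's earlier turns, so words the conversation has already used become more likely. The sketch below is a hypothetical illustration of the idea, not Microsoft's method; the interpolation weight and toy models are assumptions.

```python
# Hypothetical sketch of session-based language-model adaptation:
# interpolate a background model with counts from the conversation so far.
from collections import Counter

def adapted_probability(word: str, background: dict, history: list,
                        weight: float = 0.3) -> float:
    """P(word) = weight * P_history(word) + (1 - weight) * P_background(word)."""
    counts = Counter(history)
    p_hist = counts[word] / len(history) if history else 0.0
    p_back = background.get(word, 1e-6)
    return weight * p_hist + (1.0 - weight) * p_back

background_lm = {"game": 0.001, "match": 0.002, "grain": 0.0005}
session_history = "we watched the game and then another game last night".split()

# "game" is boosted because it has already appeared twice in this conversation,
# so the recogniser is less likely to mishear it as the acoustically similar "grain".
print(adapted_probability("game", background_lm, session_history))
print(adapted_probability("grain", background_lm, session_history))
```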

However, the post warns that the community still has much to do: noisy surroundings, distant microphones, strong accents and, more fundamentally, speaking styles and languages for which there is only limited training data all remain unsolved problems.

Another issue is whether computers can be made not only to interpret the spoken words, but to understand and contextualise them, adding meaning.

An error rate is one thing, but to meaningfully interpret words computers will have to learn to better understand them, with Huang adding: "Moving from recognising to understanding speech is the next major frontier for speech technology."

Mozilla, meanwhile, is part-way through publicly collecting new voice data to make its speech library more accurate. The company then plans to release the audio as an open library, making it easier for smaller businesses and individual coders to incorporate speech data into their own work.