Google DeepMind AI learns to talk with WaveNet tool

Voice almost indistinguishable from humans

Google's DeepMind artificial intelligence unit has built a system that can talk, and the company claims it closes the gap between existing speech technologies and human speech by more than 50 per cent.

A blog post by Google said that the WaveNet technology can produce more convincing English and Mandarin speech than existing systems, based on feedback from human listeners.

"For both Chinese and English, Google's current TTS [text-to-speech] systems are considered among the best worldwide, so improving on both with a single model is a major achievement," the firm said.

A graph in the blog post shows the scores the voice achieved in tests, highlighting how close it is getting to a level indistinguishable from human voices.

Audio files in the blog post demonstrate just how impressive this is, and Google explained that a future in which humans and computers can chat is not far off.

"Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks," the company said.

However, it will be some time before such voices become commonplace, as Google admitted that generating the audio requires a very large amount of computational power.

Raw audio is hard to model, the firm explained, because it runs at "typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely auto-regressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task."
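
To give a rough sense of why this is expensive, the Python sketch below generates audio one sample at a time, with each new sample drawn from a distribution that depends on everything generated so far. It is only an illustration: `toy_next_sample_distribution` is a hypothetical stand-in for a trained network, not WaveNet itself, and the Gaussian output and the weights in it are assumptions made for the example.

```python
import random

SAMPLE_RATE = 16_000  # samples per second, the figure quoted by Google


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained network (an assumption).

    Returns (mean, std) of a Gaussian over the next sample, computed
    from every previous sample rather than from a fixed window.
    """
    if not history:
        return 0.0, 0.1
    mean_so_far = sum(history) / len(history)
    # Lean mostly on the latest sample, slightly on the long-run average.
    return 0.9 * history[-1] + 0.1 * mean_so_far, 0.1


def generate(num_samples, seed=0):
    """Draw samples strictly in order; step t cannot start before step t-1."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        mean, std = toy_next_sample_distribution(samples)
        samples.append(rng.gauss(mean, std))
    return samples


def sequential_steps(seconds, sample_rate=SAMPLE_RATE):
    """Number of one-at-a-time predictions needed for `seconds` of audio."""
    return int(seconds * sample_rate)


clip = generate(100)        # a tiny clip: 100 sequential predictions
one_second = sequential_steps(1)  # predictions needed for one second
```

Because every prediction has to wait for the one before it, a single second of audio at this rate means 16,000 passes through the model in sequence, which is where the heavy computational cost comes from.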

Google said that WaveNet can also generate music, and posted samples of piano music that could plausibly have been produced by a human, underlining how much potential AI has.

"Unlike the TTS experiments, we didn't condition the networks on an input sequence telling it what to play (such as a musical score). Instead, we simply let it generate whatever it wanted to. When we trained it on a dataset of classical piano music, it produced fascinating samples," the firm said.