Communicate with anyone on the planet, with no linguistic divide. Sounds like something out of a cheesy sci-fi flick, doesn’t it? Until recently, it might have, but with Skype’s latest offering, the Skype Translator, the world may have just become a smaller place.

Over a decade of research and development has allowed Microsoft to achieve what a number of Silicon Valley icons, not to mention the U.S. Department of Defense, have not yet been able to. To do so, Microsoft Research (MSR) had to solve some major machine learning problems while pushing technologies like deep neural networks into new territory.

Translation, though, has never been the hardest part of the equation. Effective text translators have been around for a while. Translating spoken language, and especially doing so in real time, requires a whole different set of tools. Spoken words aren’t just a different medium of linguistic communication; we compose our words differently in speech and in text. Then there’s inflection, tone, body language, slang, idiom, mispronunciation, regional dialect and colloquialism. Text offers data; speech, with all its nuances, offers nothing but problems.

To translate an English phrase like “the straw that broke the camel’s back” into, say, German, the system looks for probabilistic matches, selecting the best solution from a number of candidate phrases based on what it thinks is most likely to be correct. Over time the system builds confidence in certain results, reducing errors. With enough use, it figures out that an equivalent phrase, “the drop that tipped the bucket,” will likely sound more familiar to a German speaker.
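The candidate selection described above can be sketched in a few lines of Python. This is a toy illustration, not Microsoft's engine: the `PhraseTable` class, its method names, and the evidence counts are all hypothetical. It assumes the system keeps a running weight per candidate phrase and always picks the most probable one.

```python
from collections import defaultdict


class PhraseTable:
    """Toy phrase table (hypothetical): tracks evidence for candidate translations."""

    def __init__(self):
        # source phrase -> candidate translation -> accumulated evidence
        self.counts = defaultdict(lambda: defaultdict(float))

    def observe(self, source, candidate, weight=1.0):
        """Record evidence that `candidate` is a good translation of `source`."""
        self.counts[source][candidate] += weight

    def probabilities(self, source):
        """Normalise the evidence into a probability for each candidate."""
        candidates = self.counts[source]
        total = sum(candidates.values())
        if not total:
            return {}
        return {cand: n / total for cand, n in candidates.items()}

    def best(self, source):
        """Return the most probable candidate, or None if nothing is known."""
        probs = self.probabilities(source)
        return max(probs, key=probs.get) if probs else None


table = PhraseTable()
source = "the straw that broke the camel's back"
literal = "der Strohhalm, der dem Kamel den Rücken brach"    # word-for-word rendering
idiom = "der Tropfen, der das Fass zum Überlaufen brachte"   # the German equivalent

# Made-up starting evidence: the literal rendering looks more likely at first.
table.observe(source, literal, 3)
table.observe(source, idiom, 1)
first_choice = table.best(source)

# With more use, evidence for the idiomatic phrase accumulates.
for _ in range(5):
    table.observe(source, idiom)
later_choice = table.best(source)

print(first_choice)
print(later_choice)
```

In this sketch the first choice is the literal rendering and the later choice is the idiom, mirroring how confidence shifts with use. A real system would weigh far more signals than a simple count.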

This kind of probabilistic, statistical matching allows the system to get smarter over time, but it doesn’t really represent a breakthrough in machine learning or translation (though MSR researchers would point out that they’ve built some pretty sophisticated and unique syntax parsing algorithms into their engine). And, as noted, translation is not the hardest part of the equation. The real breakthrough for real-time speech-to-speech translation came about in 2009, when a group at MSR decided to return to deep neural network research in an effort to enhance speech recognition and synthesis: the turning of spoken words into text and vice versa.

Deep neural networks (DNNs), biologically inspired computing paradigms designed more like the human brain than a classical computer, enable computers to learn observationally through a powerful process known as deep learning. New DNN-based models that learn as they go proved capable of building larger and more complex bodies of knowledge about the data sets they were trained on, including things like language. Speech recognition accuracy rates shot up by 25 percent. Moreover, DNNs are fast enough to make real-time translation a reality, as 50,000 people found out this week.

So how do all these magical elements come together?

When one party on a Skype Translator call speaks, his or her words touch all of those pieces, traveling first to the cloud, then in series through a speech recognition system, a program that cleans up unnecessary “ums” and “ahs” and the like, a translation engine, and a speech synthesizer that turns that translation back into audible speech. Half a beat after that person stops speaking, an audio translation is already playing while a text transcript of the translation displays within the Skype app.
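The chain of stages described above can be sketched as a simple pipeline. Everything here is an assumption made for illustration: the function names, the dictionary standing in for audio, and the tiny phrasebook are hypothetical stand-ins, not Skype's actual components. Only the filler-word clean-up does real work; the recognition, translation and synthesis stages are stubs.

```python
import re

# Filler words to strip, with any trailing comma/period and whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+|er+)\b[,.]?\s*", re.IGNORECASE)


def recognize_speech(audio):
    """Stub recogniser: assumes the 'audio' already carries its transcript."""
    return audio["transcript"]


def remove_disfluencies(text):
    """Drop 'ums' and 'ahs' and collapse any leftover runs of whitespace."""
    cleaned = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def translate(text, phrasebook):
    """Stub translator: looks the phrase up, falling back to the input text."""
    return phrasebook.get(text.lower(), text)


def synthesize(text):
    """Stub synthesiser: returns a placeholder instead of real audio."""
    return {"audio_for": text}


def translate_call_turn(audio, phrasebook):
    """Run one speaker's turn through the four stages, in series."""
    transcript = recognize_speech(audio)
    cleaned = remove_disfluencies(transcript)
    translated = translate(cleaned, phrasebook)
    # The listener gets both the spoken translation and a text transcript.
    return synthesize(translated), translated


phrasebook = {"good morning": "guten Morgen"}  # hypothetical one-entry phrasebook
spoken, shown = translate_call_turn({"transcript": "Um, good morning"}, phrasebook)
print(shown)
```

Keeping each stage as its own function mirrors the "in series" flow of the call: any one stage could be swapped for a real service without touching the others.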

Skype Translator still isn’t perfect, though: it fumbles uncommon idioms and phrases, and how the system evolves as it tries to keep up with tens of thousands of users testing its capabilities remains to be seen. What is certain is that, through Skype, Microsoft has ushered in an age of digital communication without borders.

