Let’s take a moment to look at the history of voice recognition technology.

Attempts to synthesize the human voice artificially go back to the 18th century. However, making machines recognize the human voice was not possible until computers became available.

In the 1950s, American researchers sought to figure out how people produce voice by taking X-ray photos of the vocal tract of people as they spoke. The vocal tract is the cavity extending from the vocal folds to the mouth and the nose, forming a series of intricately shaped tubes. The researchers observed how each part of the vocal tract changes during vocalization, and described the changes mathematically. They thought they could synthesize human voice by producing the sound that matches the mathematical voice model, and conversely, that they could attain voice recognition by comparing the voice input against the model and identifying the corresponding state of the vocal tract.

The researchers, who were acoustic physicists at Bell Labs in the U.S., successfully developed an elegant mathematical model of human voice in the 1970s. In this model, each speech sound is characterized by its resonances: the vowel classified as [a] in the International Phonetic Alphabet (IPA) has distinctive peaks in its frequency spectrum (called formants), which differ from those of the vowel [i], for instance.

The model made it easy to describe human voice signals: by combining a dozen or so mathematical functions, an approximation of the human voice could be synthesized. The process of converting voice into mathematical functions is known as voice coding.

Once voice coding had been achieved, voice recognition looked easy: all you had to do was analyze the voice input and find the stored pattern that it most closely resembled. At least, that’s how most researchers saw it at first.
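
The pattern-matching idea can be sketched in a few lines of Python. This is only an illustration, not a reconstruction of any historical system: the function name classify_vowel is hypothetical, and the formant values are rough textbook-style figures assumed for the example. Each vowel is stored as a template of its first two formants, and the input is assigned to the closest template.

```python
import math

# Illustrative (F1, F2) formant templates in Hz. These are rough,
# assumed values for demonstration, not measurements from the article.
TEMPLATES = {
    "a": (730.0, 1090.0),
    "i": (270.0, 2290.0),
    "u": (300.0, 870.0),
}

def classify_vowel(f1, f2, templates=TEMPLATES):
    """Return the label whose template is nearest to (f1, f2).

    Distance is plain Euclidean distance in the (F1, F2) plane.
    """
    return min(templates, key=lambda label: math.dist((f1, f2), templates[label]))

if __name__ == "__main__":
    # An input close to, but not exactly on, the stored templates.
    print(classify_vowel(700.0, 1150.0))
    print(classify_vowel(300.0, 2200.0))
```

A scheme like this works only as long as real speech stays close to the stored templates, which is exactly the assumption the rest of this section calls into question.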

Are human voice signals determined by a throw of the dice?

There is no doubt that the mathematical model significantly improved voice recognition technology. In the 1960s, IBM announced a device called Shoebox that could perform voice recognition, and Kyoto University developed a voice typewriter that could recognize one syllable at a time.

Photo: IBM Shoebox, the first voice recognition device in the world based on a mathematical model.

In the 1970s, dynamic programming (DP) algorithms that normalize variations in utterance duration were developed independently in Japan and the Soviet Union. As a result, it became possible for a machine to recognize syllables spoken consecutively.
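
To show how dynamic programming can absorb differences in duration, here is a minimal sketch of the textbook form of this technique, usually called dynamic time warping. It is not the exact algorithm of either research group. The function name dtw_distance is hypothetical, and the sequences are plain numbers for simplicity, whereas a real recognizer would compare sequences of spectral feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Each element of one sequence may be matched to one or more
    consecutive elements of the other, so a slowly spoken pattern
    can still line up with a quickly spoken one.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b element is reused
                cost[i][j - 1],      # b advances, a element is reused
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]

if __name__ == "__main__":
    template = [1, 2, 3]
    slow_utterance = [1, 1, 2, 2, 3, 3]  # same shape, twice as long
    print(dtw_distance(template, slow_utterance))
    print(dtw_distance(template, [3, 2, 1]))
```

The stretched sequence aligns with the template at zero cost, while the reversed one does not, which is the sense in which the method normalizes duration without ignoring the shape of the signal.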

Researchers thought that a voice recognition system would be perfected once the mathematical model became refined enough. However, this type of machine had to be used in a quiet environment by a speaker who could enunciate clearly. These are very special conditions, and a machine that could not operate under any other circumstances was simply not a practical product.

Humans are sloppy speakers. Few people pronounce each syllable as clearly as an announcer. You may develop a mathematical model of an [a] sound all you like, but the actual [a] spoken by real people could be a far cry from the ideal. For example, people change the shape of their mouth as they speak one syllable after another, and yet early mathematical models could not effectively capture this phenomenon, known as coarticulation.

It should be noted that voice coding technology has been used in telecommunications to great effect. A representative voice coding algorithm known as code-excited linear prediction (CELP), for example, expresses human voice as fully as possible using mathematical functions. As a result, human voice can be coded and decoded using far fewer bits of data than when the waveforms are directly converted into digital data. Thanks to voice coding technology, it also became possible to speak normally on mobile phones, which typically have low data rates.
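
A quick calculation gives a feel for the savings. The figures below are assumptions chosen for illustration and do not come from the article: telephone-quality speech digitized directly at 8,000 samples per second with 8 bits per sample, and 4.8 kbit/s as one example rate at which a CELP coder can operate.

```python
# Assumed figures for illustration only (not taken from the article):
# directly digitized telephone speech at 8 kHz, 8 bits per sample,
# and 4.8 kbit/s as one example CELP operating rate.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
CELP_BITRATE = 4800  # bits per second (assumed example rate)

# Bit rate when the waveform itself is digitized sample by sample.
pcm_bitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# How many times fewer bits the coded representation needs.
compression_ratio = pcm_bitrate / CELP_BITRATE

print(f"Direct digitization: {pcm_bitrate} bit/s")
print(f"CELP (example rate): {CELP_BITRATE} bit/s")
print(f"Ratio: about {compression_ratio:.1f} times fewer bits")
```

Under these assumptions the coded voice needs roughly one thirteenth of the bits, which is the kind of reduction that makes speech practical over a slow mobile link.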