In the 1980s, an innovative method emerged that solved many persistent voice recognition problems, such as coarticulation. Carnegie Mellon University applied a statistical model called the hidden Markov model (HMM), developed by IBM in the 1970s, to voice recognition.

Previous approaches to voice recognition involved developing mathematical models that fit the physical data obtained from experiments and using those models to analyze actual voice signals. The HMM takes a totally different approach: it uses statistical data to estimate the probability that a given set of data corresponds to a particular phoneme.
“If you listen to how people utter the vowel [a], for instance, you’ll find that there are many variations, ranging from a standard pronunciation to a number of rather aberrant ones,” says Professor Katsunobu Ito of the Faculty of Computer and Information Sciences at Hosei University. “The HMM recognizes these variations, and attempts to describe how far each variation deviates from the average. So, the HMM allows us to statistically describe voice signals including the time parameter.”
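The statistical scoring Professor Ito describes can be sketched in miniature. The following toy example (all probabilities and state names are invented for illustration, and the acoustic signal is reduced to two discrete symbols rather than real spectral features) shows how a two-state HMM assigns a likelihood to an observed sound sequence via the forward algorithm; a recognizer would compute this likelihood under each phoneme's HMM and pick the best-scoring phoneme.

```python
# A toy two-state HMM for a single hypothetical phoneme.
# States emit one of two discrete acoustic symbols, "lo" and "hi";
# real systems model continuous spectral features instead.
states = ["s0", "s1"]
start = {"s0": 0.8, "s1": 0.2}                      # initial state probabilities
trans = {"s0": {"s0": 0.6, "s1": 0.4},              # state transition probabilities
         "s1": {"s0": 0.1, "s1": 0.9}}
emit = {"s0": {"lo": 0.7, "hi": 0.3},               # emission probabilities
        "s1": {"lo": 0.2, "hi": 0.8}}

def likelihood(observations):
    """Forward algorithm: probability that this HMM produced the sequence."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

# Score one observed acoustic sequence under this phoneme's model.
print(likelihood(["lo", "lo", "hi", "hi"]))
```

Because the model is probabilistic, "aberrant" pronunciations are not rejected outright; they simply receive lower likelihoods than pronunciations close to the average.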

Researchers who had studied the human voice only as a physical phenomenon were at first puzzled by the HMM's premise. Trying to identify phonemes by means of statistical probability seemed to them as haphazard as a throw of the dice. Naturally, they were very skeptical.

For the HMM to work properly, a vast amount of data must be processed statistically. Because the volume of voice data is overwhelmingly larger than that of text data, earlier computers simply could not handle the workload, which held the research back. As computers became faster and storage capacities grew, however, the research picked up speed from the 1980s onward.

In Japan, too, research on voice recognition based on the HMM was undertaken, mainly by the Advanced Telecommunications Research Institute International (ATR), a public and private sector partnership organization established in 1986.

Voice recognition cannot be done by acoustic analysis alone; it also requires a language model. When we talk with others, we don't just listen to the sounds. We have reasonable expectations about which word is likely to follow another, and that helps us understand what is being said even when parts of the speech are unclear. In Japanese, for example, watashi (I) is most likely followed by ga or wa, but never by ki. A comprehensive set of such rules about word sequences is called a language model. Voice recognition thus requires both an acoustic model (such as the HMM) for identifying phonemes and a language model for determining word sequences. The approach ATR employed to construct its language model was also statistical: the n-gram method.
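The watashi/wa example can be made concrete with a minimal bigram (n = 2) model. The tiny romanized-Japanese "corpus" below is invented for illustration; real systems such as ATR's were trained on years of newspaper text. The model simply counts which word follows which and turns the counts into conditional probabilities.

```python
from collections import defaultdict

# A made-up three-sentence corpus in romanized Japanese (illustration only).
corpus = [
    "watashi wa gakusei desu",
    "watashi ga ikimasu",
    "watashi wa sensei desu",
]

# Count how often each word is followed by each other word.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1

def prob(next_word, prev_word):
    """Maximum-likelihood estimate of P(next_word | prev_word)."""
    total = sum(counts[prev_word].values())
    return counts[prev_word][next_word] / total if total else 0.0

# "watashi" is followed by "wa" twice and "ga" once in this corpus,
# so the model favors "wa" after "watashi"; the unseen "ki" gets zero.
print(prob("wa", "watashi"))  # 2/3
print(prob("ki", "watashi"))  # 0.0
```

During recognition, such word-sequence probabilities are combined with the acoustic model's phoneme scores, so that an acoustically ambiguous stretch of speech is resolved in favor of the more plausible word sequence.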

To build a voice recognition system, ATR gathered voice data from hundreds of people, as well as text data consisting of 10 years of newspaper articles. By the mid-1990s, ATR had successfully developed a semi-practical system, and a sightseeing guide system with a voice recognition feature was introduced in the early 2000s.

This statistically based Japanese voice recognition research eventually produced an open-source speech recognition engine called Julius, developed with the cooperation of many researchers and institutions.

The HMM acoustic model and the n-gram language model have become the world standard, and virtually all voice recognition systems today use them. For Japanese, these systems typically achieve a recognition rate of around 90%, although the rate varies with conditions.