Hidden Markov Models
Under real-world operating conditions, characterized by low-quality speech signals and the presence of external noise or background conversations, automatic speech recognition systems cannot provide acceptable recognition performance even when various signal filtering and noise suppression methods are applied. To improve the accuracy and robustness of automatic speech recognition, various approaches to the analysis of visual speech based on computer vision technologies (so-called "lip-reading") are being studied, leading to bimodal models for audio-visual speech recognition.
In this paper, we present a study of a model for automatic bimodal recognition of audio-visual Russian speech based on first-order Coupled Hidden Markov Models, which fuse the feature vector streams of the auditory and visual speech modalities at the state level of a joint stochastic model. This model accounts for possible time mismatches (asynchrony) between corresponding speech units (phonemes and visemes) that are characteristic of conversational speech, and fuses information from both modalities using weight coefficients reflecting their informativity. One of the main problems in developing an audio-visual speech recognizer is to realize an effective way of fusing and synchronizing the speech modalities. Asynchronous fusion is necessitated by non-stationary time discrepancies between the acoustic and visual speech streams, which are caused by limitations of speech production dynamics and by co-articulation effects (the influence and overlapping of neighboring speech units in continuous speech) that affect the audio and video components differently. In this research, several bimodal and unimodal speaker-dependent Russian speech recognizers with a small vocabulary were implemented. They were tested on a previously collected corpus of audio-visual Russian speech (containing continuously pronounced connected digits in phrases of 3 to 6 words) recorded from 6 native Russian speakers. Recognition experiments were carried out with added acoustic noise at varying Signal-to-Noise Ratios (SNR).
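The state-level fusion with informativity weights described above can be sketched as a weighted combination of per-stream log-likelihoods. The following is a minimal illustration, not the paper's implementation: the diagonal-Gaussian state model, the dictionary layout of a state, and the parameter `lam_audio` are all assumptions introduced for the example.

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Log-likelihood of a feature vector under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def fused_state_loglik(o_audio, o_video, state, lam_audio=0.7):
    """Weighted state-level fusion of audio and visual streams
    (hypothetical state layout and weight):
    log b(o) = lam_a * log b_a(o_a) + (1 - lam_a) * log b_v(o_v)."""
    ll_audio = gaussian_loglik(o_audio, state["a_mean"], state["a_var"])
    ll_video = gaussian_loglik(o_video, state["v_mean"], state["v_var"])
    return lam_audio * ll_audio + (1.0 - lam_audio) * ll_video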
The paper presents the accuracy results of continuous Russian speech recognition for a small recognition vocabulary, as well as a comparison of the bimodal models with the unimodal ones.
The experimental results demonstrate that bimodal speech recognition outperforms unimodal audio-only recognition, especially at low SNR values (below 15 dB). Moreover, the asynchronous modality fusion recognizer based on Coupled Hidden Markov Models (CHMM) performs slightly better than the synchronous modality fusion recognizer based on Multi-Stream Hidden Markov Models (MSHMM). At very low SNR values (below 5 dB), the acoustic information becomes less informative, and under these conditions the highest accuracy is achieved by the unimodal video-only speech recognizer.
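The experiments above vary the SNR of the acoustic channel by mixing noise into clean speech. As a minimal sketch of how such a mixture is typically prepared (the function name and signature are assumptions, not the paper's code), the noise can be rescaled so that the resulting mixture has the requested SNR in dB:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise signal so that the speech-to-noise power ratio
    equals snr_db, then add it to the clean speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_speech / p_scaled_noise)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

With this convention, lowering `snr_db` (e.g. from 15 dB to 5 dB) raises the noise power relative to the speech, which is the regime where the audio stream loses informativity in the experiments.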