Combined methods for speaker diarization


V.Yu. Budkov, A.L. Ronzhin

Documenting speech is a required procedure at business meetings, sessions, conferences, and other official events. However, expert analysis and transcription of speech recordings is a time-consuming process carried out by stenographers. Modern methods of speech analysis and speaker diarization are used to automate the labeling of participant speech. One promising approach to improving speaker diarization is the use of features extracted by multichannel and multimodal analysis of participant behavior in a meeting room. The survey of combined speaker diarization methods presented here shows that additional parameters obtained by image processing and audio localization increase the accuracy of speaker change detection in a multichannel signal. In image processing, the most important features are head orientation and changes of the face and lips along the vertical axis. The problems of audio-video synchronization and of fusing a joint feature set, taking into account the weight of each parameter, are mainly solved at the preliminary stage of tuning the speaker diarization system. Methods are also applied for statistical analysis of participant profiles and of how often each participant speaks during meetings, as well as for tracking the meeting situation and determining the speaker from the head direction of participants in the meeting room.
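
The weighted fusion of a joint feature set mentioned above can be illustrated with a minimal sketch. It assumes that each modality has already been reduced to a per-participant score in [0, 1] and that the weights were fixed during system tuning; the function name, modality names, scores, and weights are all hypothetical and are not taken from any of the surveyed systems.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality scores for each participant.

    scores:  {modality: {participant: score}}  (hypothetical layout)
    weights: {modality: weight}, assumed to be chosen at the tuning stage
    """
    total = sum(weights[m] for m in scores)
    participants = next(iter(scores.values())).keys()
    return {
        p: sum(weights[m] * scores[m][p] for m in scores) / total
        for p in participants
    }


# Illustrative values only, not measured data.
scores = {
    "audio_localization": {"A": 0.7, "B": 0.3},
    "head_orientation": {"A": 0.4, "B": 0.6},
    "lip_movement": {"A": 0.8, "B": 0.2},
}
weights = {"audio_localization": 0.5, "head_orientation": 0.2, "lip_movement": 0.3}

fused = fuse_scores(scores, weights)
active = max(fused, key=fused.get)  # participant with the highest fused score
print(active, fused)
```

In this sketch the participant with the highest fused score is taken as the current speaker for the frame; a real system would also have to synchronize the audio and video streams before fusing them.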

June 24, 2020
May 29, 2020

