Publishing house Radiotekhnika


Combined methods for speaker diarization


V.Yu. Budkov, A.L. Ronzhin

Speech documentation is a required procedure at business meetings, sessions, conferences and other official events. However, expert analysis and transcription of speech recordings is a time-consuming process carried out by stenographers. Modern methods of speech analysis and speaker diarization are used to automate the labeling of participant speech. One promising approach to improving speaker diarization is the use of features extracted by multichannel and multimodal analysis of participant behavior inside a meeting room. The survey of combined speaker diarization methods showed that additional parameters obtained by image processing and audio localization increase the accuracy of speaker change detection in a multichannel signal. In image processing, the most important features are head orientation and changes of the face and lips along the vertical axis. The problems of audio and video synchronization and of fusing the joint feature set with a weight for each parameter are mainly solved at the preliminary stage, when the speaker diarization system is tuned. Methods are also implemented for statistical analysis of participant profiles and of how often each participant speaks during meetings, as well as for tracking the meeting situation and determining the speaker from the head direction of participants in the meeting room.
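
The weighted fusion of a joint feature set mentioned above can be illustrated with a minimal sketch. It is not the authors' method: the modality names, the per-participant scores, the weights and the helper functions `fuse_scores` and `pick_speaker` are all hypothetical, and the fusion rule (a normalized weighted sum followed by an argmax) is only one simple way to combine audio localization and visual cues for a single time frame.

```python
from typing import Dict, List


def fuse_scores(scores: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of per-participant scores over modalities.

    scores maps a modality name to one score per participant;
    weights maps the same modality names to non-negative weights.
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(next(iter(scores.values())))
    fused = [0.0] * n
    for name, values in scores.items():
        if len(values) != n:
            raise ValueError(f"modality {name!r} has a different participant count")
        for i, value in enumerate(values):
            fused[i] += weights[name] * value / total
    return fused


def pick_speaker(fused: List[float]) -> int:
    """Return the index of the participant with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three participants in one time frame.
frame_scores = {
    "audio_localization": [0.7, 0.2, 0.1],
    "lip_motion": [0.4, 0.5, 0.1],
    "head_orientation": [0.5, 0.3, 0.2],
}
# Hypothetical weights; in practice they would be set when the system is tuned.
weights = {"audio_localization": 0.5, "lip_motion": 0.3, "head_orientation": 0.2}

fused = fuse_scores(frame_scores, weights)
speaker = pick_speaker(fused)
print(speaker, [round(x, 2) for x in fused])
```

In this sketch the lip motion cue alone would favor the second participant, while the fused score favors the first, which shows how the weights let the stronger modalities decide the outcome.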

