A.A. Aksenov – Junior Research Scientist, St. Petersburg Institute for Informatics and Automation of RAS
D.A. Ryumin – Research Scientist, St. Petersburg Institute for Informatics and Automation of RAS
I.A. Kagirov – Junior Research Scientist, St. Petersburg Institute for Informatics and Automation of RAS
D.V. Ivanko – Research Scientist, St. Petersburg Institute for Informatics and Automation of RAS
Gestures, as a form of nonverbal communication, are of great importance in everyday life and constitute different language systems and subsystems, from "body language" to sign languages. Gesture recognition is increasingly applied in domains associated with computer vision, such as human-machine interaction (HMI) and virtual reality. In a general sense, gesture recognition aims at comprehending any meaningful movement of a person's hand or hands, or of other body parts. The problem remains unresolved because of variation among the sign languages of the world, noisy signing environments, and the small size of the articulators (hands, fingers).
In most cases, gesture recognition comes down to processing a video sequence, which provides information about a part of the human body and its coordinates in space and time. The exceptions are so-called static gestures, which involve no continuous articulator movement, so that the spatial coordinates remain essentially the same for the whole duration of the gesture. Complex gestures involving different articulators and locations add to the difficulty, because spatial features are hard to extract when the articulators are small relative to the whole frame. It therefore seems reasonable to base gesture recognition on processing a video sequence rather than a single frame, so that temporal as well as spatial features can be extracted.
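The distinction between static and dynamic gestures can be illustrated with a minimal sketch. It assumes that a hand position has already been tracked as one (x, y, z) point per frame. The function names and the threshold value are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
from math import dist

def total_displacement(track):
    """Sum of Euclidean distances between consecutive (x, y, z) hand positions."""
    return sum(dist(a, b) for a, b in zip(track, track[1:]))

def is_static(track, threshold=0.05):
    """Treat a gesture as static when the hand barely moves across frames.

    The threshold is an assumed, illustrative value in the same units
    as the coordinates.
    """
    return total_displacement(track) < threshold

# Hypothetical tracks of 10 frames each.
static_track = [(0.1, 0.2, 1.0)] * 10                        # hand stays in place
dynamic_track = [(0.1 * i, 0.2, 1.0) for i in range(10)]     # hand moves along x

static_result = is_static(static_track)
dynamic_result = is_static(dynamic_track)
```

A single frame of either track looks the same to a classifier. Only the sequence shows whether the hand moved, which is why temporal features matter.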
The paper presents an approach to multimodal recognition of dynamic and static gestures of Russian sign language using 3D convolutional and LSTM neural networks. A dataset of color video and depth maps covering 48 one-handed gestures of Russian sign language is presented as well. The dataset was recorded with the Kinect v2 sensor and contains recordings of 13 different native signers of Russian sign language. The results are compared with those of other methods. The classification experiment showed the great potential of neural networks for this problem: the achieved recognition accuracy was 74.07%, the best result among the approaches compared.
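As a rough illustration of why a 3D convolution pairs naturally with a recurrent network, the sketch below applies a naive "valid" 3D convolution to a toy clip and flattens each output time step into a feature vector of the kind an LSTM consumes step by step. The clip size, the kernel, and the function names are assumptions made for this example. The sketch does not reproduce the architecture used in the paper.

```python
def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution (cross-correlation) over a T x H x W clip."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

def to_sequence(volume):
    """Flatten each time step into one feature vector (one vector per step)."""
    return [[v for row in plane for v in row] for plane in volume]

# Hypothetical toy clip: 4 frames of 3x3 pixels, pixel value equal to the frame index.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]

# Assumed 2x1x1 temporal-difference kernel: responds to change between frames.
kernel = [[[-1.0]], [[1.0]]]

features = to_sequence(conv3d_valid(clip, kernel))
```

Because the kernel spans two frames, the output keeps a time axis (here 3 steps of 9 values each), so temporal order is still available to a sequence model placed after the convolution.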