An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2024, 00, pp. 572-577
Issue Date:
2024-01-29
Abstract:
Compared to relying only on audio information, incorporating visual information improves speech recognition accuracy in noisy environments. Existing works tend to design specific architectures for feature extraction while neglecting feature enhancement. In this paper, we propose an end-to-end Mandarin audio-visual speech recognition model with a Feature Enhancement Module. Specifically, we design a Feature Enhancement Module (FEM) that uses deconvolution and up-sampling to obtain twin enhanced data for generating high-resolution feature representations. We further develop the Visual Feature Enhancement Module (Visual-FEM) and the Audio Feature Enhancement Module (Audio-FEM) to enhance feature extraction from visual and audio data, respectively. We incorporate the proposed modules into the blocks of the Residual Network for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k and Chinese Mandarin Lip Reading (CMLR) datasets. The experimental results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of our proposed modules.
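The abstract does not give implementation details of the FEM, but the core operation it names, transposed convolution (deconvolution) used to up-sample a feature map into a higher-resolution "twin", can be illustrated with a minimal NumPy sketch. All names, shapes, and the 2x2 kernel below are hypothetical, chosen only to show how a stride-2 transposed convolution doubles spatial resolution; the paper's actual FEM is not reproduced here.

```python
import numpy as np

def deconv2x(feat, kernel):
    """Stride-2 transposed convolution: each input activation 'paints'
    a scaled copy of the kernel into a larger output grid."""
    h, w = feat.shape
    k = kernel.shape[0]
    s = 2  # stride of the transposed convolution
    out = np.zeros(((h - 1) * s + k, (w - 1) * s + k))
    for i in range(h):
        for j in range(w):
            out[i * s:i * s + k, j * s:j * s + k] += feat[i, j] * kernel
    return out

# Hypothetical 4x4 feature map up-sampled to an 8x8 high-resolution twin
low_res = np.arange(16, dtype=float).reshape(4, 4)
high_res = deconv2x(low_res, np.ones((2, 2)))
print(high_res.shape)  # (8, 8): spatial resolution doubled
```

In a real network this kernel would be learned, and the resulting high-resolution map would be fused back with the original ResNet-block features; the sketch only demonstrates the up-sampling step itself.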