An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module
- Publisher: IEEE
- Publication Type: Conference Proceeding
- Citation: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2024, 00, pp. 572-577
- Issue Date: 2024-01-29
Closed Access
Filename | Description | Size
---|---|---
1706918.pdf | Published version | 2.04 MB
This item is closed access and not available.
Compared with relying on audio information alone, incorporating visual information improves speech recognition accuracy in noisy environments. Existing works tend to design specific architectures for feature extraction while neglecting feature enhancement. In this paper, we propose an end-to-end Mandarin audio-visual speech recognition model with a Feature Enhancement Module. Specifically, we design a Feature Enhancement Module (FEM) that uses deconvolution and up-sampling to obtain twin enhanced data for generating high-resolution feature representations. We further develop the Visual Feature Enhancement Module (Visual FEM) and the Audio Feature Enhancement Module (Audio FEM) to enhance feature extraction from both visual and audio data. We incorporate the proposed modules into the blocks of the Residual Network for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k and Chinese Mandarin Lip Reading (CMLR) datasets. The experimental results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of our proposed modules.
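The abstract names two operations inside the FEM, deconvolution and up-sampling, without giving layer details. The following dependency-free Python sketch only illustrates how those two operations can each double the resolution of a single-channel feature map and yield a pair of enhanced maps. It is not the paper's implementation: the function names, the 2x2 kernel, the stride of 2, and the fusion by element-wise averaging are all assumptions made for illustration.

```python
def upsample_nearest(x, factor=2):
    """Nearest-neighbour up-sampling: (H, W) -> (H*factor, W*factor)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def deconv2d(x, kernel, stride=2):
    """Transposed convolution (deconvolution) of a single-channel map.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output size is (H - 1) * stride + K along each axis.
    """
    h, w = len(x), len(x[0])
    k = len(kernel)
    out_h = (h - 1) * stride + k
    out_w = (w - 1) * stride + k
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for di in range(k):
                for dj in range(k):
                    out[i * stride + di][j * stride + dj] += x[i][j] * kernel[di][dj]
    return out


def feature_enhancement(x, kernel):
    """Hypothetical FEM-style step: build two enhanced maps and fuse them.

    Assumption: a 2x2 kernel with stride 2, so both branches produce a
    (2H, 2W) map. Assumption: the branches are fused by averaging; the
    source does not say how the twin enhanced data are combined.
    """
    if len(kernel) != 2 or any(len(r) != 2 for r in kernel):
        raise ValueError("this sketch expects a 2x2 kernel")
    up = upsample_nearest(x, 2)          # branch 1: up-sampling
    de = deconv2d(x, kernel, stride=2)   # branch 2: deconvolution
    return [[0.5 * (a + b) for a, b in zip(ru, rd)] for ru, rd in zip(up, de)]


features = [[1.0, 2.0], [3.0, 4.0]]
ones_kernel = [[1.0, 1.0], [1.0, 1.0]]
enhanced = feature_enhancement(features, ones_kernel)
```

With an all-ones 2x2 kernel and stride 2, the deconvolution branch reproduces nearest-neighbour up-sampling exactly, so the two branches agree. In a trained model the kernel would be learned, and the branches would differ.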