Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
2023 IEEE International Conference on Multimedia and Expo (ICME), 2023, 2023-July, pp. 642-647
Issue Date:
2023-01-01
Compared to feature fusion or decision fusion alone, hybrid fusion can improve audio-visual speech recognition accuracy. Existing works mainly focus on designing the multi-modality feature extraction, interaction, and prediction processes, neglecting useful information in the multi-modality data and the optimal combination of different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. Our MSHF consists of a feature extraction subnetwork, which exploits the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlation of different modality information and optimizes the weights of prediction results for different modalities to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of our proposed modules.
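The idea of weighting per-modality predictions can be illustrated with a minimal sketch. This is not the MSHF implementation: the abstract does not specify how the weights are parameterized or optimized, so the code below assumes a simple softmax-normalized weight per prediction branch. The function names (`softmax`, `fuse_predictions`), the three branches (audio, visual, joint), and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(modality_logits, weight_logits):
    """Weighted decision-level fusion (illustrative assumption, not the paper's method).

    modality_logits: one list of class scores per prediction branch.
    weight_logits:   one unnormalized weight per branch; in a trained
                     model these would be learned parameters.
    Returns the fused class distribution and the normalized weights.
    """
    weights = softmax(weight_logits)
    probs = [softmax(logits) for logits in modality_logits]
    n_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]
    return fused, weights


# Hypothetical class scores for a 3-word vocabulary.
audio_logits = [2.0, 0.5, 0.1]
visual_logits = [0.2, 1.5, 0.3]
joint_logits = [1.0, 1.2, 0.1]  # branch fed by fused audio-visual features

# Equal weight logits give each branch the same influence.
fused, weights = fuse_predictions(
    [audio_logits, visual_logits, joint_logits],
    [0.0, 0.0, 0.0],
)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights pass through a softmax, the fused output remains a valid probability distribution whatever values the weight logits take, which is one common reason to parameterize fusion weights this way.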