Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

Wang, J; Guo, Z; Yang, C; Li, X; Cui, Z

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2023 IEEE International Conference on Multimedia and Expo (ICME), 2023, 2023-July, pp. 642-647
Issue Date: 2023-01-01

Closed Access

Filename	Description	Size	Format
1665709.pdf	Published version	2.55 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, J
dc.contributor.author	Guo, Z
dc.contributor.author	Yang, C
dc.contributor.author	Li, X
dc.contributor.author	Cui, Z
dc.date	2023-07-10
dc.date.accessioned	2024-05-13T04:20:21Z
dc.date.available	2024-05-13T04:20:21Z
dc.date.issued	2023-01-01
dc.identifier.citation	2023 IEEE International Conference on Multimedia and Expo (ICME), 2023, 2023-July, pp. 642-647
dc.identifier.isbn	978-1-6654-6892-3
dc.identifier.issn	1945-7871
dc.identifier.issn	1945-788X
dc.identifier.uri	http://hdl.handle.net/10453/178897
dc.description.abstract	Compared to feature or decision fusion, hybrid fusion can beneficially improve audio-visual speech recognition accuracy. Existing works are mainly prone to design the multi-modality feature extraction process, interaction, and prediction, neglecting useful information on the multi-modality and the optimal combination of different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. Our MSHF consists of a feature extraction subnetwork to exploit the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features and a hybrid fusion subnetwork to integrate the intrinsic correlation of different modality information, optimizing the weights of prediction results for different modalities to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules.
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	2023 IEEE International Conference on Multimedia and Expo (ICME)
dc.relation.ispartof	2023 IEEE International Conference on Multimedia and Expo
dc.relation.isbasedon	10.1109/icme55011.2023.00116
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition
dc.type	Conference Proceeding
utslib.citation.volume	2023-July
utslib.location.activity	Brisbane, Australia
utslib.copyright.status	closed_access	*
dc.date.updated	2024-05-13T04:20:18Z
pubs.finish-date	2023-07-14
pubs.place-of-publication	Piscataway, USA
pubs.publication-status	Published
pubs.start-date	2023-07-10
pubs.volume	2023-July
dc.location	Piscataway, USA

Abstract:

Compared to feature or decision fusion, hybrid fusion can beneficially improve audio-visual speech recognition accuracy. Existing works are mainly prone to design the multi-modality feature extraction process, interaction, and prediction, neglecting useful information on the multi-modality and the optimal combination of different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. Our MSHF consists of a feature extraction subnetwork to exploit the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features and a hybrid fusion subnetwork to integrate the intrinsic correlation of different modality information, optimizing the weights of prediction results for different modalities to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules.
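
The abstract mentions weighting the prediction results of different modalities but does not say how those weights are obtained. The sketch below is only a generic illustration of weighted decision-level fusion, not the MSHF method: the function names, the three branch names, and the example scores are all hypothetical, and in a trained network the unnormalised scores would be learned parameters rather than constants.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(branch_probs, branch_scores):
    """Combine per-branch class probabilities with softmax-normalised weights.

    branch_probs:  dict mapping branch name -> list of class probabilities.
    branch_scores: dict mapping branch name -> unnormalised weight
                   (hypothetical; a real model would learn these).
    """
    names = sorted(branch_probs)
    weights = softmax([branch_scores[n] for n in names])
    num_classes = len(branch_probs[names[0]])
    fused = [0.0] * num_classes
    for name, weight in zip(names, weights):
        for cls, prob in enumerate(branch_probs[name]):
            fused[cls] += weight * prob
    return fused


# Made-up probabilities over three word classes for three prediction branches.
probs = {
    "audio": [0.6, 0.3, 0.1],
    "visual": [0.2, 0.5, 0.3],
    "audio_visual": [0.5, 0.4, 0.1],
}
# Made-up unnormalised branch weights (assumption, not from the paper).
scores = {"audio": 1.0, "visual": 0.0, "audio_visual": 1.5}

fused = fuse_predictions(probs, scores)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Because the weights pass through a softmax, they stay positive and sum to one, so the fused output remains a valid probability distribution whenever each branch output is one.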

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/178897