An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module

Wang, J; Yang, C; Guo, Z; Li, X; Wang, W

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2024, 00, pp. 572-577
Issue Date: 2024-01-29

Closed Access

Filename	Description	Size
1706918.pdf	Published version	2.04 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, J
dc.contributor.author	Yang, C
dc.contributor.author	Guo, Z
dc.contributor.author	Li, X
dc.contributor.author	Wang, W
dc.date	2023-10-01
dc.date.accessioned	2024-05-13T04:26:55Z
dc.date.available	2024-05-13T04:26:55Z
dc.date.issued	2024-01-29
dc.identifier.citation	2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2024, 00, pp. 572-577
dc.identifier.isbn	979-8-3503-3703-7
dc.identifier.issn	1062-922X
dc.identifier.uri	http://hdl.handle.net/10453/178911
dc.description.abstract	Compared to relying only on audio information, incorporating visual information improves speech recognition accuracy in noisy environments. Existing works are prone to design specific architecture for feature extraction, neglecting feature enhancement. In this paper, we propose an end-to-end Mandarin audio-visual speech recognition model with a Feature Enhancement Module. Specifically, we design a Feature Enhancement Module (FEM) that uses deconvolution and up-sampling to obtain the twin enhanced data for generating high-resolution feature representation. We further develop the Visual Feature Enhancement Module (Visual FEM) and Audio Feature Enhancement Module (Audio FEM) to enhance feature extraction from both visual data and audio data. We incorporate the proposed modules into the blocks of the Residual Network for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k and Chinese Mandarin Lip Reading (CMLR) datasets. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules.
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
dc.relation.ispartof	2023 IEEE International Conference on Systems, Man, and Cybernetics
dc.relation.isbasedon	10.1109/smc53992.2023.10394108
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Honolulu, Oahu, HI, USA
utslib.copyright.status	closed_access	*
dc.date.updated	2024-05-13T04:26:53Z
pubs.finish-date	2023-10-04
pubs.place-of-publication	Piscataway, USA
pubs.publication-status	Published
pubs.start-date	2023-10-01
pubs.volume	00
dc.location	Piscataway, USA

Abstract:

Compared to relying only on audio information, incorporating visual information improves speech recognition accuracy in noisy environments. Existing works are prone to design specific architecture for feature extraction, neglecting feature enhancement. In this paper, we propose an end-to-end Mandarin audio-visual speech recognition model with a Feature Enhancement Module. Specifically, we design a Feature Enhancement Module (FEM) that uses deconvolution and up-sampling to obtain the twin enhanced data for generating high-resolution feature representation. We further develop the Visual Feature Enhancement Module (Visual FEM) and Audio Feature Enhancement Module (Audio FEM) to enhance feature extraction from both visual data and audio data. We incorporate the proposed modules into the blocks of the Residual Network for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k and Chinese Mandarin Lip Reading (CMLR) datasets. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules.
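
Illustrative sketch (not from the paper): the abstract names deconvolution, up-sampling, "twin" enhanced data, and residual blocks, but gives no further detail. The pure-Python toy below shows one way those pieces could fit together on a 1-D feature vector. Everything specific here is an assumption made for illustration: the function names, the fixed kernel, averaging the two branches, pooling back to the input length, and the additive skip connection. The actual FEM design may differ in all of these.

```python
# Hypothetical 1-D illustration; the fusion rule, kernel, and pooling step are
# assumptions for this sketch and are not taken from the paper.

def transposed_conv1d(x, kernel, stride=2):
    """1-D transposed convolution (deconvolution) with no padding."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


def upsample_nearest(x, factor=2):
    """Nearest-neighbour up-sampling: repeat every sample `factor` times."""
    return [value for value in x for _ in range(factor)]


def average_pool(x, size=2):
    """Non-overlapping average pooling, used here to return to input length."""
    return [sum(x[i:i + size]) / size for i in range(0, len(x) - size + 1, size)]


def feature_enhancement(x, kernel=(1.0, 1.0)):
    """Assumed FEM-style block: two high-resolution branches plus a skip path."""
    deconv_branch = transposed_conv1d(x, kernel, stride=2)   # learned in practice
    upsample_branch = upsample_nearest(x, factor=2)          # parameter-free twin
    fused = [(a + b) / 2 for a, b in zip(deconv_branch, upsample_branch)]
    pooled = average_pool(fused, size=2)
    # Residual (skip) connection, as in a ResNet block.
    return [original + enhanced for original, enhanced in zip(x, pooled)]


if __name__ == "__main__":
    print(feature_enhancement([1.0, 2.0, 3.0]))
```

With a two-tap kernel and stride 2, both branches double the input length, so they can be fused element-wise before pooling. In a trained network the kernel weights would be learned and the features would be multi-channel 2-D or 3-D tensors rather than a flat list.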

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/178911