A Front-End Technique for Automatic Noisy Speech Recognition

Naing, HMS; Hidayat, R; Hartanto, R; Miyanaga, Y

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: PROCEEDINGS of 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 2020, 00, pp. 49-54
Issue Date: 2020-12-28

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Naing, HMS
dc.contributor.author	Hidayat, R
dc.contributor.author	Hartanto, R
dc.contributor.author	Miyanaga, Y https://orcid.org/0000-0002-2795-2234
dc.date	2020-11-05
dc.date.accessioned	2021-05-13T01:55:27Z
dc.date.available	2021-05-13T01:55:27Z
dc.date.issued	2020-12-28
dc.identifier.citation	PROCEEDINGS of 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 2020, 00, pp. 49-54
dc.identifier.isbn	978-1-7281-9896-5
dc.identifier.issn	2163-3479
dc.identifier.issn	2472-7695
dc.identifier.uri	http://hdl.handle.net/10453/148883
dc.description.abstract	Sounds in a real environment seldom occur in isolation, because sounds combine into complex mixtures and usually happen concurrently. Auditory masking relates to the perceptual interaction between sound components. This paper proposes incorporating a model of simultaneous masking into Mel-frequency cepstral coefficient (MFCC) extraction to improve the performance of the resulting system. Moreover, Gammatone frequency integration is presented to warp the energy spectrum, which provides gradually decaying weights and compensates for the loss of spectral correlation. Experiments are carried out on the Aurora-2 database, and frame-level cross-entropy training of a deep neural network-hidden Markov model (DNN-HMM) is used to build the acoustic model. With models trained on multi-condition speech data, the proposed feature extraction method achieves accuracies of up to 98.14% at 10 dB, 94.40% at 5 dB, 81.67% at 0 dB, and 51.5% at -5 dB SNR.
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	PROCEEDINGS of 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
dc.relation.ispartof	Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques
dc.relation.isbasedon	10.1109/o-cocosda50338.2020.9295006
dc.rights	info:eu-repo/semantics/openAccess
dc.title	A Front-End Technique for Automatic Noisy Speech Recognition
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Yangon, Myanmar
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.date.updated	2021-05-13T01:55:26Z
pubs.finish-date	2020-11-07
pubs.publication-status	Published
pubs.start-date	2020-11-05
pubs.volume	00

Abstract:

Sounds in a real environment seldom occur in isolation, because sounds combine into complex mixtures and usually happen concurrently. Auditory masking relates to the perceptual interaction between sound components. This paper proposes incorporating a model of simultaneous masking into Mel-frequency cepstral coefficient (MFCC) extraction to improve the performance of the resulting system. Moreover, Gammatone frequency integration is presented to warp the energy spectrum, which provides gradually decaying weights and compensates for the loss of spectral correlation. Experiments are carried out on the Aurora-2 database, and frame-level cross-entropy training of a deep neural network-hidden Markov model (DNN-HMM) is used to build the acoustic model. With models trained on multi-condition speech data, the proposed feature extraction method achieves accuracies of up to 98.14% at 10 dB, 94.40% at 5 dB, 81.67% at 0 dB, and 51.5% at -5 dB SNR.
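
As a rough illustration of what integrating an energy spectrum with gradually decaying Gammatone-shaped weights could look like, the following is a minimal NumPy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the 25 ms / 10 ms framing at 8 kHz, the 23-filter bank, the ERB-rate spacing of centre frequencies, and the common fourth-order approximation of the Gammatone magnitude response. It does not attempt to model simultaneous masking, whose formulation is not given in this record.

```python
import numpy as np

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hz_to_erb_rate(f_hz):
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_weights(n_filters, n_fft, sample_rate, f_min=50.0, order=4):
    """Smoothly decaying frequency weights, one row per filter.

    Assumption: uses the common approximation
    |H(f)| ~ (1 + ((f - fc) / b)^2)^(-order / 2) with b = 1.019 * ERB(fc).
    """
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # Centre frequencies equally spaced on the ERB-rate scale (illustrative choice).
    edges = np.linspace(hz_to_erb_rate(f_min),
                        hz_to_erb_rate(sample_rate / 2.0), n_filters + 2)
    centers = erb_rate_to_hz(edges[1:-1])
    b = 1.019 * erb(centers)
    w = (1.0 + ((freqs[None, :] - centers[:, None]) / b[:, None]) ** 2) ** (-order / 2.0)
    # Normalise each filter so its weights sum to one.
    return w / w.sum(axis=1, keepdims=True)

def cepstral_features(signal, sample_rate=8000, frame_len=200, hop=80,
                      n_fft=256, n_filters=23, n_ceps=13):
    """Frame, window, take the power spectrum, integrate, log, then DCT-II."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    energies = power @ gammatone_weights(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)  # small floor avoids log(0)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n[None, :] + 1)
                 / (2 * n_filters))
    return log_energies @ dct.T

# Usage on one second of synthetic noise (a stand-in for real speech).
rng = np.random.default_rng(0)
features = cepstral_features(rng.standard_normal(8000))
print(features.shape)
```

With these illustrative settings, one second of 8 kHz audio yields 98 frames of 13 coefficients. Unlike triangular Mel filters, which drop to zero outside a narrow band, each weight row here decays gradually, so neighbouring filters overlap and share spectral energy.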

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/148883