Modeling temporal information using discrete Fourier transform for recognizing emotions in user-generated videos

Zhang, H; Xu, M

Publication Type: Conference Proceeding
Citation: Proceedings - International Conference on Image Processing, ICIP, 2016, 2016-August pp. 629 - 633
Issue Date: 2016-08-03

Closed Access

Filename	Description	Size	Format
Haimin.pdf	Published version	271.68 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, H	en_US
dc.contributor.author	Xu, M https://orcid.org/0000-0001-9581-8849	en_US
dc.date.issued	2016-08-03	en_US
dc.identifier.citation	Proceedings - International Conference on Image Processing, ICIP, 2016, 2016-August pp. 629 - 633	en_US
dc.identifier.isbn	9781467399616	en_US
dc.identifier.issn	1522-4880	en_US
dc.identifier.uri	http://hdl.handle.net/10453/56151
dc.description.abstract	© 2016 IEEE. With the widespread availability of user-generated Internet videos, emotion recognition in those videos is attracting increasing research effort. However, most existing works are based on frame-level visual features and/or audio features, which might fail to model temporal information, e.g. characteristics accumulated over time. To capture video temporal information, in this paper we propose to analyse features in the frequency domain, obtained by the discrete Fourier transform (DFT features). Frame-level features are first extracted by a pre-trained deep convolutional neural network (CNN). Then, the time-domain features are transformed and interpolated into DFT features. The CNN and DFT features are further encoded and fused for emotion classification. In this way, static image features extracted from a pre-trained deep CNN and temporal information represented by DFT features are jointly considered for video emotion recognition. Experimental results demonstrate that combining DFT features can effectively capture temporal information and therefore improve emotion recognition performance. Our approach achieves state-of-the-art performance on the largest video emotion dataset (the VideoEmotion-8 dataset), improving accuracy from 51.1% to 55.6%.	en_US
dc.relation.ispartof	Proceedings - International Conference on Image Processing, ICIP	en_US
dc.relation.isbasedon	10.1109/ICIP.2016.7532433	en_US
dc.title	Modeling temporal information using discrete Fourier transform for recognizing emotions in user-generated videos	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2016-August	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US
pubs.volume	2016-August	en_US

Abstract:

© 2016 IEEE. With the widespread availability of user-generated Internet videos, emotion recognition in those videos is attracting increasing research effort. However, most existing works are based on frame-level visual features and/or audio features, which might fail to model temporal information, e.g. characteristics accumulated over time. To capture video temporal information, in this paper we propose to analyse features in the frequency domain, obtained by the discrete Fourier transform (DFT features). Frame-level features are first extracted by a pre-trained deep convolutional neural network (CNN). Then, the time-domain features are transformed and interpolated into DFT features. The CNN and DFT features are further encoded and fused for emotion classification. In this way, static image features extracted from a pre-trained deep CNN and temporal information represented by DFT features are jointly considered for video emotion recognition. Experimental results demonstrate that combining DFT features can effectively capture temporal information and therefore improve emotion recognition performance. Our approach achieves state-of-the-art performance on the largest video emotion dataset (the VideoEmotion-8 dataset), improving accuracy from 51.1% to 55.6%.
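
The idea in the abstract of turning per-frame features into fixed-length DFT features can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the spectrum is represented, interpolated, encoded, or fused, so the use of DFT magnitudes, linear interpolation to `n_bins` points, mean pooling for the static part, and plain concatenation as the fusion step are all assumptions made here for illustration. The function names (`dft_magnitudes`, `resample`, `video_descriptor`) are hypothetical, and the per-frame vectors stand in for CNN activations.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the non-redundant DFT bins (0..T//2) of a real sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]

def resample(values, n_bins):
    """Linearly interpolate a sequence to exactly n_bins points (assumed scheme)."""
    if len(values) == 1 or n_bins == 1:
        return [values[0]] * n_bins
    out = []
    for i in range(n_bins):
        pos = i * (len(values) - 1) / (n_bins - 1)  # position in the source sequence
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

def video_descriptor(frame_feats, n_bins=8):
    """Build one fixed-length vector from T per-frame feature vectors of length D.

    Output length is D + D * n_bins regardless of T.
    """
    n_frames = len(frame_feats)
    dims = list(zip(*frame_feats))               # D sequences, each of length T
    static = [sum(d) / n_frames for d in dims]   # mean pooling (assumption)
    temporal = []
    for d in dims:                               # one spectrum per feature dimension
        temporal.extend(resample(dft_magnitudes(d), n_bins))
    return static + temporal                     # concatenation as fusion (assumption)
```

Because the spectrum of every feature dimension is resampled to the same number of bins, videos with different frame counts yield descriptors of the same length, which is what allows a single classifier to be trained on them. For realistic feature sizes a fast transform such as `numpy.fft.rfft` would replace the quadratic-time loop used here.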

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/56151