Human action recognition and localization in video using structured learning of local space-time features

Publication Type:
Conference Proceeding
Citation:
Proceedings - IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2010, 2010, pp. 204 - 211
Issue Date:
2010-01-01
This paper presents a unified framework for human action classification and localization in video using structured learning of local space-time features. Each human action class is represented by its own compact set of local patches. In our approach, we first use a discriminative hierarchical Bayesian classifier to select the space-time interest points that are most informative for each particular action. These concise local features are then projected by Principal Component Analysis and passed to a Support Vector Machine for the classification task. Meanwhile, action localization is performed with Dynamic Conditional Random Fields, developed to incorporate the spatial and temporal structure constraints of superpixels extracted around those features. Each superpixel in the video is defined by the shape and motion information of its corresponding feature region. Compelling results from experiments on the KTH [22], Weizmann [1], HOHA [13] and TRECVid [23] datasets demonstrate the efficiency and robustness of our framework for human action recognition and localization in video. © 2010 IEEE.
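The classification stage described above, projecting selected feature descriptors with PCA and then classifying them with an SVM, can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn; the descriptor dimensionality, number of PCA components, SVM kernel, and the random toy data all stand in for details the abstract does not specify.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for pooled descriptors of the selected space-time
# interest points (one 64-D vector per video clip; values are random
# here, purely for illustration).
X = rng.normal(size=(120, 64))
y = rng.integers(0, 6, size=120)  # e.g., six action classes as in KTH

# PCA projection followed by an SVM classifier, mirroring the
# "SVM with PCA projection" step; 16 components is an assumption.
clf = make_pipeline(PCA(n_components=16), SVC(kernel="rbf"))
clf.fit(X, y)

pred = clf.predict(X)  # one predicted action label per clip
```

In practice the input vectors would be the concise local features retained by the hierarchical Bayesian selection step, not random data, and the pipeline would be evaluated with a proper train/test split on the benchmark datasets.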