Visual event recognition in videos by learning from web data

Duan, L; Xu, D; Tsang, IWH; Luo, J

Visual event recognition in videos by learning from web data

Duan, L Xu, D

Tsang, IWH

Luo, J

Permalink

Publication Type:: Journal Article
Citation:: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34 (9), pp. 1667 - 1680
Issue Date:: 2012-09-05

Closed Access

	Filename	Description	Size
	2013003701OK.pdf		1.24 MB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Duan, L	en_US
dc.contributor.author	Xu, D https://orcid.org/0000-0003-2775-9730	en_US
dc.contributor.author	Tsang, IWH https://orcid.org/0000-0001-8095-4637	en_US
dc.contributor.author	Luo, J	en_US
dc.date.issued	2012-09-05	en_US
dc.identifier.citation	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34 (9), pp. 1667 - 1680	en_US
dc.identifier.issn	0162-8828	en_US
dc.identifier.uri	http://hdl.handle.net/10453/29703
dc.description.abstract	We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web video domain and consumer video domain). For each pyramid level and each type of local features, we first train a set of SVM classifiers based on the combined training set from two domains by using multiple base kernels from different kernel types and parameters, which are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or all the event classes by minimizing both the structural risk functional and the mismatch between data distributions of two domains. Extensive experiments demonstrate the effectiveness of our proposed framework that requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation on various aspects of the proposed method A-MKL, such as the analysis on the combination coefficients on the prelearned classifiers, the convergence of the learning algorithm, and the performance variation by using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance when compared with A-MKL using the prelearned classifiers only from each individual event class. © 2012 IEEE.	en_US
dc.relation.ispartof	IEEE Transactions on Pattern Analysis and Machine Intelligence	en_US
dc.relation.isbasedon	10.1109/TPAMI.2011.265	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.subject.mesh	Humans	en_US
dc.subject.mesh	Reproducibility of Results	en_US
dc.subject.mesh	Human Activities	en_US
dc.subject.mesh	Internet	en_US
dc.subject.mesh	Video Recording	en_US
dc.subject.mesh	Pattern Recognition, Automated	en_US
dc.subject.mesh	Support Vector Machines	en_US
dc.subject.mesh	Support Vector Machine	en_US
dc.title	Visual event recognition in videos by learning from web data	en_US
dc.type	Journal Article
utslib.citation.volume	9	en_US
utslib.citation.volume	34	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	9	en_US
pubs.publication-status	Published	en_US
pubs.volume	34	en_US

Abstract:

We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web video domain and consumer video domain). For each pyramid level and each type of local features, we first train a set of SVM classifiers based on the combined training set from two domains by using multiple base kernels from different kernel types and parameters, which are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or all the event classes by minimizing both the structural risk functional and the mismatch between data distributions of two domains. Extensive experiments demonstrate the effectiveness of our proposed framework that requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation on various aspects of the proposed method A-MKL, such as the analysis on the combination coefficients on the prelearned classifiers, the convergence of the learning algorithm, and the performance variation by using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance when compared with A-MKL using the prelearned classifiers only from each individual event class. © 2012 IEEE.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/29703