A Discriminative CNN Video Representation for Event Detection

Xu, ZW; Yang, Y; Hauptmann, AG

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
Issue Date: 2015-01-01

Closed Access

	Filename	Description	Size	Format
	ad.pdf	Published version	221.94 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Xu, ZW	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.contributor.author	Hauptmann, AG	en_US
dc.date	2015-06-07	en_US
dc.date.issued	2015-01-01	en_US
dc.identifier.citation	2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015	en_US
dc.identifier.issn	1063-6919	en_US
dc.identifier.uri	http://hdl.handle.net/10453/121625
dc.description.abstract	In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.	en_US
dc.publisher	IEEE	en_US
dc.relation.ispartof	2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)	en_US
dc.relation.ispartof	IEEE Conference on Computer Vision and Pattern Recognition	en_US
dc.relation.isbasedon	10.1109/CVPR.2015.7298789	en_US
dc.rights	© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	en_US
dc.title	A Discriminative CNN Video Representation for Event Detection	en_US
dc.type	Conference Proceeding
utslib.location.activity	Boston, USA	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	closed_access
pubs.consider-herdc	true	en_US
pubs.finish-date	2015-06-12	en_US
pubs.publication-status	Published	en_US
pubs.start-date	2015-06-07	en_US

Abstract:

In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.
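To make the abstract's first contribution concrete, the sketch below contrasts average pooling and max pooling of frame-level descriptors with an encoding-based aggregation. The abstract does not name the encoding method, so a VLAD-style encoding (summing residuals to the nearest codebook center) is used here purely as an illustrative assumption. The function names, the toy two-dimensional descriptors, and the hand-picked centers are all hypothetical; in practice the centers would typically be learned, for example with k-means, and the descriptors would come from a CNN.

```python
import math


def average_pool(frames):
    """Aggregate frame descriptors by taking the mean of each dimension."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def max_pool(frames):
    """Aggregate frame descriptors by taking the max of each dimension."""
    return [max(col) for col in zip(*frames)]


def vlad_encode(frames, centers):
    """VLAD-style encoding (an assumed stand-in for the paper's encoding).

    Each descriptor is assigned to its nearest center, the residuals
    (descriptor minus center) are summed per center, and the concatenated
    result is L2-normalized.
    """
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for x in frames:
        # Index of the nearest center under squared Euclidean distance.
        k = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])),
        )
        for d in range(dim):
            residuals[k][d] += x[d] - centers[k][d]
    flat = [v for r in residuals for v in r]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy input: three "frame descriptors" and a two-center codebook (hypothetical).
frames = [[0.0, 1.0], [2.0, 2.0], [10.0, 9.0]]
centers = [[1.0, 1.0], [10.0, 10.0]]

avg_vec = average_pool(frames)            # one value per descriptor dimension
max_vec = max_pool(frames)                # one value per descriptor dimension
vlad_vec = vlad_encode(frames, centers)   # len(centers) * dim values
```

Both pooling schemes collapse the video to a single vector of the descriptor's own dimensionality, whereas the encoded vector keeps a separate residual block per codebook center. That extra structure is one plausible reason an encoding can be more discriminative than pooling, though the sketch makes no claim to reproduce the reported results.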

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/121625