Robust spatial-temporal deep model for multimedia event detection

Publisher:
Elsevier BV
Publication Type:
Journal Article
Citation:
Neurocomputing, 2016, 213, pp. 48-53
Issue Date:
2016-11-12
Filename:
1-s2.0-S0925231216307275-main.pdf (Published version, Adobe PDF, 1.06 MB)
Abstract:
The task of multimedia event detection (MED) aims to train a set of models that can automatically detect the most event-relevant videos in large datasets. In this paper, we build a robust spatial-temporal deep neural network for large-scale video event detection. In our setting, each video is treated under a multiple-instance assumption, where its visual segments carry both spatial and temporal properties of events. To exploit these properties, we implement the MED system with a two-stage training procedure: unsupervised recurrent video reconstruction followed by supervised fine-tuning. Extensive experiments on the challenging TRECVID MED14 dataset indicate that, by considering both spatial and temporal information, detection performance can be boosted further compared with state-of-the-art MED models.
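The two-stage procedure described in the abstract can be sketched in miniature. This is an illustrative assumption, not the paper's actual architecture: a plain tied-weight autoencoder stands in for the recurrent reconstruction stage, and a logistic scorer with max-pooling over segments stands in for the supervised fine-tuning stage under the multiple-instance assumption. All data, dimensions, and names below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each "video" is a bag of segment feature vectors,
# and the event label applies to the whole video (multiple-instance
# assumption). Features and labels are synthetic.
n_videos, n_segments, dim, hidden = 20, 5, 16, 8
X = rng.normal(size=(n_videos, n_segments, dim))
y = (X.mean(axis=(1, 2)) > 0).astype(float)

# Stage 1: unsupervised reconstruction. A one-layer tied-weight
# autoencoder stands in for the recurrent reconstruction step.
W = rng.normal(scale=0.1, size=(dim, hidden))
flat = X.reshape(-1, dim)
lr = 0.01
losses = []
for _ in range(200):
    H = np.tanh(flat @ W)        # encode segments
    R = H @ W.T                  # decode with tied weights
    err = R - flat
    losses.append(float((err ** 2).mean()))
    # gradient of reconstruction error w.r.t. W (decoder + encoder paths)
    dpre = (err @ W) * (1 - H ** 2)
    gW = (flat.T @ dpre + err.T @ H) / flat.shape[0]
    W -= lr * gW

# Stage 2: supervised fine-tuning. A logistic scorer is trained on the
# encoded segments; a video's score is the max over its segments
# (standard MIL pooling, with the argmax held fixed per step).
w, b = np.zeros(hidden), 0.0
H = np.tanh(X @ W)               # (videos, segments, hidden)
for _ in range(300):
    seg_scores = H @ w + b
    idx = seg_scores.argmax(axis=1)           # most event-like segment
    h_max = H[np.arange(n_videos), idx]
    p = 1 / (1 + np.exp(-(h_max @ w + b)))    # video-level probability
    g = p - y
    w -= 0.1 * (h_max.T @ g) / n_videos
    b -= 0.1 * g.mean()
```

After stage 1 the reconstruction loss should have decreased, and stage 2 yields a per-video event probability `p`; in the paper the two stages are instead realized by a recurrent network over video segments.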