Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning

Publisher: ACM
Publication Type: Conference Proceeding
Citation: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 297-305
Issue Date: 2020-08-23
File: 3394486.3403072.pdf (Published version, Adobe PDF, 11.67 MB)
The flourishing of social media platforms requires techniques for understanding media content at large scale. However, state-of-the-art video event understanding approaches remain very limited in their ability to deal with data sparsity, semantically unrepresentative event names, and a lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions rather than from event names; (2) aggregating the ranks of grounded concepts for MED tasks, with a statistical mean-shift outlier rejection model proposed to remove outlying concepts that are incorrectly grounded; and (3) defining MEC tasks and augmenting the MEC training set with the videos detected by MED in a zero-shot setting. To the best of our knowledge, this work is the first to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.
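
The abstract mentions a statistical mean-shift outlier rejection step used to discard incorrectly grounded concepts. The paper's exact formulation is not given here, so the sketch below is only an illustrative Python example under assumptions: concept relevance is a 1-D score per grounded concept, the mode of the scores is found with a Gaussian-kernel mean-shift, and concepts whose scores fall far from that mode (a hypothetical `max_dev` cutoff in standard deviations) are rejected. The bandwidth, cutoff, and function names are not from the source.

```python
"""Illustrative sketch only: the bandwidth, rejection cutoff, and function
names below are assumptions, not the paper's exact formulation."""
import numpy as np


def mean_shift_mode(scores: np.ndarray, bandwidth: float = 0.1,
                    n_iters: int = 50, tol: float = 1e-6) -> float:
    """Estimate the mode of 1-D concept-relevance scores with a Gaussian kernel."""
    mode = float(np.mean(scores))
    for _ in range(n_iters):
        weights = np.exp(-0.5 * ((scores - mode) / bandwidth) ** 2)
        new_mode = float(np.sum(weights * scores) / np.sum(weights))
        if abs(new_mode - mode) < tol:
            break
        mode = new_mode
    return mode


def reject_outlying_concepts(concepts, scores, bandwidth=0.1, max_dev=2.0):
    """Keep concepts whose grounding scores lie near the estimated mode.

    `max_dev` is a hypothetical cutoff in units of the scores' standard
    deviation; the paper may use a different rejection criterion.
    """
    scores = np.asarray(scores, dtype=float)
    mode = mean_shift_mode(scores, bandwidth)
    spread = scores.std() + 1e-12
    keep = np.abs(scores - mode) / spread <= max_dev
    return [c for c, k in zip(concepts, keep) if k]


if __name__ == "__main__":
    # Toy example: "microscope" is grounded with a score far from the others,
    # so it is treated as an incorrectly grounded concept and removed.
    concepts = ["skateboard", "ramp", "crowd", "microscope", "helmet"]
    scores = [0.82, 0.78, 0.70, 0.05, 0.75]
    print(reject_outlying_concepts(concepts, scores))
```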