Multi-frame feature-fusion-based model for violence detection

Publisher:
Springer Science and Business Media LLC
Publication Type:
Journal Article
Citation:
The Visual Computer, 2020, pp. 1-17
Issue Date:
2020-01-01
File: Draft Journal v3_SH_20190608.pdf (Submitted version, 1.2 MB, Adobe PDF)
© 2020, Springer-Verlag GmbH Germany, part of Springer Nature. Detecting human behavior is essential for public safety and monitoring. Human-operated surveillance systems, however, demand continuous attention and observation, which is difficult to sustain. Automatic detection of violent human behavior is therefore critical for uninterrupted video surveillance. In this paper, we propose a novel method to detect fights or violent actions by learning both spatial and temporal features from equally spaced sequential frames of a video. Multi-level features for two sequential frames, extracted from the top and bottom layers of a convolutional neural network, are combined with the proposed feature fusion method to incorporate motion information. We also propose a wide-dense residual block to learn these combined spatial features from the two input frames. The learned features are then concatenated and fed to long short-term memory units to capture temporal dependencies. The feature fusion method and the additional wide-dense residual blocks enable the network to learn combined features from the input frames effectively and yield better accuracy. Experiments on four publicly available datasets (HockeyFight, Movies, ViolentFlow, and BEHAVE) show that the proposed model outperforms state-of-the-art methods.
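The pipeline the abstract describes (multi-level CNN features for two sequential frames, fused so that frame-to-frame change carries motion information, then passed to a temporal model) can be illustrated with a minimal pure-Python sketch. The feature extractor and the subtraction-based fusion below are hypothetical stand-ins, not the paper's actual operations:

```python
# Hypothetical sketch of multi-frame feature fusion; the feature
# summaries and the subtraction-based fusion are assumptions for
# illustration, not the operations used in the paper.

def extract_multilevel_features(frame):
    """Stand-in for CNN features from a bottom (low-level) and a
    top (high-level) layer; here simple numeric summaries of a 2-D frame."""
    low = [sum(row) / len(row) for row in frame]   # per-row means (low-level)
    high = sum(low) / len(low)                     # global mean (high-level)
    return low, high

def fuse(frame_a, frame_b):
    """Combine features of two equally spaced frames so that motion
    (the change between frames) is represented alongside appearance."""
    low_a, high_a = extract_multilevel_features(frame_a)
    low_b, high_b = extract_multilevel_features(frame_b)
    motion = [b - a for a, b in zip(low_a, low_b)]       # low-level change
    return low_a + low_b + motion + [high_b - high_a]    # fused vector

# Two toy 2x2 frames: the second is uniformly brighter than the first.
frame1 = [[0.0, 0.0], [0.0, 0.0]]
frame2 = [[1.0, 1.0], [1.0, 1.0]]
fused = fuse(frame1, frame2)
print(len(fused))  # 7 entries: 2 low_a + 2 low_b + 2 motion + 1 high diff
```

In the full model, a vector like `fused` for each frame pair would be the input to the LSTM units that capture temporal dependencies across the sequence.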