Taking a Closer Look At Visual Relation: Unbiased Video Scene Graph Generation With Decoupled Label Learning

Wang, W; Luo, Y; Chen, Z; Jiang, T; Yang, Y; Xiao, J

Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication Type: Journal Article
Citation: IEEE Transactions on Multimedia, 2024, 26, pp. 5718-5728
Issue Date: 2024-01-01

Closed Access

	Filename	Description	Size	Format
	1694504.pdf	Published version	7.66 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, W
dc.contributor.author	Luo, Y
dc.contributor.author	Chen, Z
dc.contributor.author	Jiang, T
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Xiao, J
dc.date.accessioned	2024-08-07T04:37:59Z
dc.date.available	2024-08-07T04:37:59Z
dc.date.issued	2024-01-01
dc.identifier.citation	IEEE Transactions on Multimedia, 2024, 26, pp. 5718-5728
dc.identifier.issn	1520-9210
dc.identifier.issn	1941-0077
dc.identifier.uri	http://hdl.handle.net/10453/180212
dc.description.abstract	Current video-based scene graph generation (VidSGG) methods have been found to perform poorly in predicting predicates that are less represented due to the inherently biased distribution of the training data. In this paper, we take a closer look at the inherent characteristics of predicates and identify that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns, respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e., VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
dc.language	English
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.relation.ispartof	IEEE Transactions on Multimedia
dc.relation.isbasedon	10.1109/TMM.2023.3338078
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	08 Information and Computing Sciences, 09 Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	40 Engineering
dc.subject.classification	46 Information and computing sciences
dc.title	Taking a Closer Look At Visual Relation: Unbiased Video Scene Graph Generation With Decoupled Label Learning
dc.type	Journal Article
utslib.citation.volume	26
utslib.for	08 Information and Computing Sciences
utslib.for	09 Engineering
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
dc.date.updated	2024-08-07T04:37:52Z
pubs.publication-status	Published
pubs.volume	26

Abstract:

Current video-based scene graph generation (VidSGG) methods have been found to perform poorly in predicting predicates that are less represented due to the inherently biased distribution of the training data. In this paper, we take a closer look at the inherent characteristics of predicates and identify that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns, respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e., VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
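
The label decoupling described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the label counts in TOY_COUNTS are invented for illustration (they are not VidVRD statistics), the helper names are hypothetical, and scoring a predicate as the product of its actional and spatial scores is only one assumed way to "combine and map back" the two patterns. The sketch splits labels such as sit_above into an actional and a spatial part, compares the head-to-tail imbalance at the predicate level and at the pattern level, and recombines two sets of pattern scores into a predicate prediction.

```python
from collections import Counter

# Toy label counts, invented for illustration only (not VidVRD statistics).
TOY_COUNTS = {"sit_above": 60, "walk_behind": 30, "sit_behind": 5, "walk_above": 5}


def decouple(predicate):
    """Split an 'action_spatial' predicate label into its two patterns."""
    action, spatial = predicate.split("_", 1)
    return action, spatial


def pattern_counts(predicate_counts):
    """Aggregate predicate-level counts into actional and spatial pattern counts."""
    actions, spatials = Counter(), Counter()
    for predicate, n in predicate_counts.items():
        action, spatial = decouple(predicate)
        actions[action] += n
        spatials[spatial] += n
    return actions, spatials


def imbalance(counts):
    """Ratio between the most and least frequent class (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())


def recombine(action_scores, spatial_scores, predicates):
    """Map pattern scores back to predicates.

    Assumption: a predicate's score is the product of its two pattern scores.
    """
    scores = {}
    for predicate in predicates:
        action, spatial = decouple(predicate)
        scores[predicate] = action_scores[action] * spatial_scores[spatial]
    return scores


def predict(action_scores, spatial_scores, predicates):
    """Return the predicate with the highest recombined score."""
    scores = recombine(action_scores, spatial_scores, predicates)
    return max(scores, key=scores.get)


actions, spatials = pattern_counts(TOY_COUNTS)
print("predicate-level imbalance:", imbalance(TOY_COUNTS))
print("actional-level imbalance:", imbalance(actions))
print("spatial-level imbalance:", imbalance(spatials))

# Hypothetical outputs of two separate pattern classifiers for one object pair.
action_scores = {"sit": 0.2, "walk": 0.8}
spatial_scores = {"above": 0.7, "behind": 0.3}
print("predicted predicate:", predict(action_scores, spatial_scores, TOY_COUNTS))
```

In this toy setting the rare predicate walk_above can still be predicted, because each of its patterns (walk, above) is well represented through other, more frequent predicates.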

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/180212