ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

Zhuo, TY; Liao, Y; Lei, Y; Qu, L; de Melo, G; Chang, X; Ren, Y; Xu, Z

Publisher: ACL
Publication Type: Conference Proceeding
Citation: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, 2023, pp. 2192-2207
Issue Date: 2023

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhuo, TY
dc.contributor.author	Liao, Y
dc.contributor.author	Lei, Y
dc.contributor.author	Qu, L
dc.contributor.author	de Melo, G
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Ren, Y
dc.contributor.author	Xu, Z
dc.contributor.editor	Augenstein, I
dc.contributor.editor	Vlachos, A
dc.date	2023-05-02
dc.date.accessioned	2024-07-03T22:39:46Z
dc.date.available	2024-07-03T22:39:46Z
dc.date.issued	2023
dc.identifier.citation	17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, pp. 2192-2207
dc.identifier.isbn	978-1-959429-47-0
dc.identifier.uri	http://hdl.handle.net/10453/179709
dc.description.abstract	We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.
dc.language	en
dc.publisher	ACL
dc.relation.ispartof	17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023
dc.relation.ispartof	Conference of the European-Chapter of the Association-for-Computational-Linguistics
dc.relation.isbasedon	10.18653/v1/2023.findings-eacl.164
dc.rights	info:eu-repo/semantics/openAccess
dc.title	ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities
dc.type	Conference Proceeding
utslib.location.activity	Dubrovnik, Croatia
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.date.updated	2024-07-03T22:39:40Z
pubs.finish-date	2023-05-06
pubs.place-of-publication	USA
pubs.publication-status	Published
pubs.start-date	2023-05-02
dc.location	USA

Abstract:

We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/179709