Ensemble Pretrained Models for Multimodal Sentiment Analysis using Textual and Video Data Fusion
- Publisher: Association for Computing Machinery (ACM)
- Publication Type: Conference Proceeding
- Citation: WWW 2024 Companion - Companion Proceedings of the ACM Web Conference, 2024, pp. 1841-1848
- Issue Date: 2024-05-13
Open Access
This item is open access.
We introduce an ensemble model approach for multimodal sentiment analysis, focusing on the fusion of textual and video data to enhance the accuracy and depth of emotion interpretation. By integrating three foundational models (IFFSA, BFSA, and TBJE) using advanced ensemble techniques, we achieve a significant improvement in sentiment analysis performance across diverse datasets, including MOSI and MOSEI. Specifically, we propose two novel models, IFFSA and BFSA, which use the large language models BERT and GPT-2 to extract features from the text modality and ResNet and VGG to extract features from the video modality. Our work contributes to the field by demonstrating the synergistic potential of combining the analytical strengths of different modalities, thereby addressing the intricate challenge of nuanced emotion detection in multimodal contexts. Through comprehensive experiments and an extensive ablation study, we not only validate the superior performance of our ensemble model against current state-of-the-art benchmarks but also reveal critical insights into the model's capability to discern complex emotional states. Our findings underscore the strategic advantage of ensemble methods in multimodal sentiment analysis and set a new precedent for future research on effectively integrating multimodal data sources.
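To make the pipeline described in the abstract concrete, the sketch below illustrates the general pattern: a pretrained text encoder (BERT) and a pretrained visual encoder (a ResNet) each produce per-modality features, a small head maps the fused features to a sentiment score, and an ensemble combines the scores of several base models by a weighted average. This is a minimal illustration under assumed choices (bert-base-uncased, resnet18, a linear fusion head, fixed ensemble weights); the internals of IFFSA, BFSA, and TBJE and the paper's actual fusion strategy are not reproduced here.

```python
# Illustrative sketch only: pretrained per-modality encoders, simple feature
# fusion, and weighted late-fusion ensembling. Encoder choices, the fusion
# head, and the ensemble weights are assumptions, not the paper's exact code.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet18

class TextVideoSentiment(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        backbone = resnet18(weights="IMAGENET1K_V1")
        # Drop the ImageNet classifier; keep the pooled 512-d visual features.
        self.video_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Single regression score, matching the MOSI/MOSEI sentiment scale.
        self.head = nn.Linear(768 + 512, 1)

    def forward(self, texts, frames):
        # texts: list of utterance strings; frames: (batch, 3, 224, 224)
        # frames sampled from the video and preprocessed for the backbone.
        tokens = self.tokenizer(texts, return_tensors="pt",
                                padding=True, truncation=True)
        text_feat = self.text_encoder(**tokens).pooler_output   # (batch, 768)
        video_feat = self.video_encoder(frames).flatten(1)      # (batch, 512)
        fused = torch.cat([text_feat, video_feat], dim=-1)
        return self.head(fused)                                 # (batch, 1)

def ensemble_predict(models, weights, texts, frames):
    # Weighted late fusion: average the per-model sentiment scores.
    with torch.no_grad():
        scores = [w * m(texts, frames) for m, w in zip(models, weights)]
    return torch.stack(scores).sum(dim=0)
```

In this late-fusion form, each base model can use a different encoder pair (e.g. GPT-2 with VGG), and the ensemble weights can be tuned on a validation split rather than fixed by hand.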