Ensemble Pretrained Models for Multimodal Sentiment Analysis using Textual and Video Data Fusion
- Publisher: Association for Computing Machinery (ACM)
- Publication Type: Conference Proceeding
- Citation: WWW 2024 Companion - Companion Proceedings of the ACM Web Conference, 2024, pp. 1841-1848
- Issue Date: 2024-05-13
Open Access
This item is open access.
We introduce an ensemble model approach for multimodal sentiment analysis, focusing on the fusion of textual and video data to enhance the accuracy and depth of emotion interpretation. By integrating three foundational models (IFFSA, BFSA, and TBJE) using advanced ensemble techniques, we achieve a significant improvement in sentiment analysis performance across diverse datasets, including MOSI and MOSEI. Specifically, we propose two novel models, IFFSA and BFSA, which use the large language models BERT and GPT-2 to extract features from the text modality and ResNet and VGG to extract features from the video modality. Our work contributes to the field by demonstrating the synergistic potential of combining the analytical strengths of different modalities, thereby addressing the intricate challenge of nuanced emotion detection in multimodal contexts. Through comprehensive experiments and an extensive ablation study, we not only validate the superior performance of our ensemble model against current state-of-the-art benchmarks but also reveal critical insights into the model's capability to discern complex emotional states. Our findings underscore the strategic advantage of ensemble methods in multimodal sentiment analysis and set a new precedent for future research on effectively integrating multimodal data sources.
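To make the pipeline described in the abstract concrete, the sketch below illustrates the general pattern: a pretrained text encoder (BERT) and a pretrained visual encoder (a ResNet) each produce per-modality features, a small head maps the fused features to a sentiment score, and an ensemble combines the scores of several base models by a weighted average. This is a minimal illustration under assumed choices (bert-base-uncased, resnet18, a linear fusion head, fixed ensemble weights); the internals of IFFSA, BFSA, and TBJE and the paper's actual fusion strategy are not reproduced here.

```python
# Illustrative sketch only: pretrained per-modality encoders, simple feature
# fusion, and weighted late-fusion ensembling. Encoder choices, the fusion
# head, and the ensemble weights are assumptions, not the paper's exact code.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet18

class TextVideoSentiment(nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        backbone = resnet18(weights="IMAGENET1K_V1")
        # Drop the ImageNet classifier; keep the pooled 512-d visual features.
        self.video_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Single regression score, matching the MOSI/MOSEI sentiment scale.
        self.head = nn.Linear(768 + 512, 1)

    def forward(self, texts, frames):
        # texts: list of utterance strings; frames: (batch, 3, 224, 224)
        # frames sampled from the video and preprocessed for the backbone.
        tokens = self.tokenizer(texts, return_tensors="pt",
                                padding=True, truncation=True)
        text_feat = self.text_encoder(**tokens).pooler_output   # (batch, 768)
        video_feat = self.video_encoder(frames).flatten(1)      # (batch, 512)
        fused = torch.cat([text_feat, video_feat], dim=-1)
        return self.head(fused)                                 # (batch, 1)

def ensemble_predict(models, weights, texts, frames):
    # Weighted late fusion: average the per-model sentiment scores.
    with torch.no_grad():
        scores = [w * m(texts, frames) for m, w in zip(models, weights)]
    return torch.stack(scores).sum(dim=0)
```

In this late-fusion form, each base model can use a different encoder pair (e.g. GPT-2 with VGG), and the ensemble weights can be tuned on a validation split rather than fixed by hand.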