Enhancing interaction accuracy, efficiency, and robustness in multimodal large language models

Publication Type: Thesis
Issue Date: 2025
Abstract:
Recent advances in artificial intelligence have been driven largely by large language models (LLMs) and vision-language models (VLMs), which have demonstrated remarkable performance in language reasoning and perceptual understanding, respectively. The growing need for unified multimodal reasoning has led to the emergence of multimodal large language models (MLLMs), whose rapid evolution has transformed the integration of vision, language, and audio in intelligent systems. Yet their real-world deployment remains constrained by challenges in interaction accuracy, computational efficiency, and resilience to noisy or adversarial inputs. This thesis systematically addresses these limitations along three pivotal dimensions.

First, the thesis introduces a robust multimodal instruction-tuned model built upon a novel image-dialogue generation pipeline. The pipeline synthesizes high-quality, instruction-aligned image-text pairs through multi-stage prompting and model-based filtering, addressing the scarcity of scalable multimodal instruction data. Trained on this synthetic data, the resulting model achieves state-of-the-art performance across multiple benchmarks, demonstrating strong instruction following, spatial reasoning, and resistance to hallucination.

Second, the thesis proposes a lightweight agent framework for multimodal reasoning and task execution on resource-constrained mobile devices. Designed for environments with limited compute and memory, the framework integrates memory-driven reasoning, OCR-based visual parsing, and retrieval-augmented planning to enable dynamic decision-making across applications. It executes complex, multi-step tasks, such as long-horizon workflows and cross-app interactions, without relying on cloud-based inference or retraining. Experiments show that the system outperforms existing mobile agent baselines in task success rate, demonstrating adaptability and deployment potential in real-world settings.

Finally, the thesis presents a comprehensive benchmark for evaluating the robustness of large audio-language models under adversarial and noisy conditions. The benchmark comprises over 1,200 adversarial examples across four categories: content distortion, emotional interference, explicit noise, and implicit noise. It supports evaluation with standard metrics, LLM-as-a-judge scoring, and human assessment. Experiments show that current audio-language models remain vulnerable to adversarial audio, revealing persistent weaknesses in robustness. The benchmark provides a foundation for analyzing the reliability of voice-based language systems and informs future research on more stable audio interaction.

Together, these contributions advance the precision, adaptability, and resilience of multimodal systems. Experimental results validate the proposed methodologies across domains and provide a foundation for deploying MLLMs in dynamic environments, paving the way for future advances in multimodal interaction technologies.
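
The multi-stage synthesis-and-filtering loop behind the first contribution could be organised roughly as in the following Python sketch. The function names, prompts, and score threshold are hypothetical placeholders introduced for illustration, not the thesis implementation; dummy callables stand in for the generation and filtering models.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class ImageDialoguePair:
        image_prompt: str    # prompt used to synthesize the image
        dialogue: List[str]  # instruction-following turns grounded in that image
        score: float         # quality score assigned by the filtering model

    def synthesize_pairs(
        topics: List[str],
        propose_image: Callable[[str], str],            # stage 1: topic -> image-generation prompt
        propose_dialogue: Callable[[str], List[str]],   # stage 2: image prompt -> dialogue turns
        judge: Callable[[str, List[str]], float],       # filtering model: quality in [0, 1]
        threshold: float = 0.8,
    ) -> List[ImageDialoguePair]:
        """Multi-stage prompting followed by model-based filtering."""
        kept = []
        for topic in topics:
            image_prompt = propose_image(topic)
            dialogue = propose_dialogue(image_prompt)
            score = judge(image_prompt, dialogue)
            if score >= threshold:  # keep only high-quality, instruction-aligned pairs
                kept.append(ImageDialoguePair(image_prompt, dialogue, score))
        return kept

    if __name__ == "__main__":
        # Dummy stand-ins so the sketch runs without any model backend.
        pairs = synthesize_pairs(
            topics=["a kitchen scene"],
            propose_image=lambda t: f"A photo of {t} with several labelled objects",
            propose_dialogue=lambda p: ["User: What is left of the stove?", "Assistant: A kettle."],
            judge=lambda p, d: 0.9,
        )
        print(len(pairs), "pairs kept")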
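
The second contribution's on-device loop, combining memory-driven reasoning, OCR-based screen parsing, and retrieval-augmented planning, might look broadly like the sketch below. The class and function names are assumptions for illustration rather than the framework's actual API, and keyword matching stands in for the real retrieval component.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class AgentMemory:
        steps: List[str] = field(default_factory=list)  # past actions and observations

        def retrieve(self, query: str, k: int = 3) -> List[str]:
            # Naive keyword retrieval as a stand-in for the real retrieval component.
            hits = [s for s in self.steps if query.lower() in s.lower()]
            return hits[:k]

    def run_task(
        goal: str,
        read_screen: Callable[[], str],                   # OCR-based parser: screen -> text layout
        plan_step: Callable[[str, str, List[str]], str],  # (goal, screen, retrieved memory) -> action
        execute: Callable[[str], None],                   # performs the UI action on-device
        max_steps: int = 10,
    ) -> AgentMemory:
        memory = AgentMemory()
        for _ in range(max_steps):
            screen = read_screen()
            context = memory.retrieve(goal)
            action = plan_step(goal, screen, context)
            memory.steps.append(f"{action} | seen: {screen[:40]}")
            if action == "DONE":
                break
            execute(action)
        return memory

    if __name__ == "__main__":
        # Toy screen sequence standing in for live OCR output.
        screens = iter(["Inbox: 2 unread", "Compose window open", "Message sent"])
        mem = run_task(
            goal="send a message",
            read_screen=lambda: next(screens, "Home screen"),
            plan_step=lambda g, s, c: "DONE" if "sent" in s else f"tap based on: {s}",
            execute=lambda a: None,
        )
        print(mem.steps)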
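
For the robustness benchmark, per-category LLM-as-a-judge scoring could be aggregated along the following lines. The scoring scale, category labels as identifiers, and the toy exact-match judge are assumptions made only to keep the sketch self-contained.

    from collections import defaultdict
    from typing import Callable, Dict, List, Tuple

    CATEGORIES = ["content_distortion", "emotional_interference", "explicit_noise", "implicit_noise"]

    def evaluate(
        examples: List[Tuple[str, str, str]],   # (category, model_response, reference_answer)
        judge: Callable[[str, str], float],     # LLM-as-a-judge: (response, reference) -> score in [0, 1]
    ) -> Dict[str, float]:
        """Average judge score per adversarial category."""
        totals, counts = defaultdict(float), defaultdict(int)
        for category, response, reference in examples:
            totals[category] += judge(response, reference)
            counts[category] += 1
        return {c: totals[c] / counts[c] for c in CATEGORIES if counts[c]}

    if __name__ == "__main__":
        # Toy exact-match judge standing in for a real LLM grader.
        scores = evaluate(
            examples=[("explicit_noise", "turn on the lights", "turn on the lights"),
                      ("emotional_interference", "I can't help", "turn on the lights")],
            judge=lambda r, ref: 1.0 if r == ref else 0.0,
        )
        print(scores)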