Enhancing interaction accuracy, efficiency, and robustness in multimodal large language models

Publication Type: Thesis
Issue Date: 2025
Abstract:
Recent advances in artificial intelligence have been driven largely by large language models (LLMs) and vision-language models (VLMs), which have demonstrated remarkable performance in language reasoning and perceptual understanding, respectively. The growing need for unified multimodal reasoning has led to the emergence of multimodal large language models (MLLMs), whose rapid evolution has transformed the integration of vision, language, and audio in intelligent systems. Yet their real-world deployment remains constrained by challenges in interaction accuracy, computational efficiency, and resilience to noisy or adversarial inputs. This thesis systematically addresses these limitations along three pivotal dimensions.

First, the thesis introduces a robust multimodal instruction-tuned model built upon a novel image-dialogue generation pipeline. The pipeline synthesizes high-quality, instruction-aligned image-text pairs through multi-stage prompting and model-based filtering, addressing the scarcity of scalable multimodal instruction data. Trained on this synthetic data, the resulting model achieves state-of-the-art performance across multiple benchmarks, demonstrating strong instruction following, spatial reasoning, and resistance to hallucination.

Second, the thesis proposes a lightweight agent framework for multimodal reasoning and task execution on resource-constrained mobile devices. Designed for environments with limited compute and memory, the framework integrates memory-driven reasoning, OCR-based visual parsing, and retrieval-augmented planning to enable dynamic decision-making across applications. It executes complex, multi-step tasks, such as long-horizon workflows and cross-app interactions, without relying on cloud-based inference or retraining. Experiments show that the system outperforms existing mobile agent baselines in task success rate, demonstrating adaptability and deployment potential in real-world settings.

Finally, the thesis presents a comprehensive benchmark for evaluating the robustness of large audio-language models under adversarial and noisy conditions. The benchmark comprises over 1,200 adversarial examples across four categories: content distortion, emotional interference, explicit noise, and implicit noise. It supports evaluation with standard metrics, LLM-as-a-judge scoring, and human assessment. Experiments show that current audio-language models remain vulnerable to adversarial audio, revealing persistent weaknesses in robustness. The benchmark provides a foundation for analyzing the reliability of voice-based language systems and informs future research on more stable audio interaction.

Together, these contributions advance the precision, adaptability, and resilience of multimodal systems. Experimental results validate the proposed methodologies across domains and provide a foundation for deploying MLLMs in dynamic environments, paving the way for future advances in multimodal interaction technologies.
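
The multi-stage synthesis-and-filtering loop behind the first contribution could be organised roughly as in the following Python sketch. The function names, prompts, and score threshold are hypothetical placeholders introduced for illustration, not the thesis implementation; dummy callables stand in for the generation and filtering models.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class ImageDialoguePair:
        image_prompt: str    # prompt used to synthesize the image
        dialogue: List[str]  # instruction-following turns grounded in that image
        score: float         # quality score assigned by the filtering model

    def synthesize_pairs(
        topics: List[str],
        propose_image: Callable[[str], str],            # stage 1: topic -> image-generation prompt
        propose_dialogue: Callable[[str], List[str]],   # stage 2: image prompt -> dialogue turns
        judge: Callable[[str, List[str]], float],       # filtering model: quality in [0, 1]
        threshold: float = 0.8,
    ) -> List[ImageDialoguePair]:
        """Multi-stage prompting followed by model-based filtering."""
        kept = []
        for topic in topics:
            image_prompt = propose_image(topic)
            dialogue = propose_dialogue(image_prompt)
            score = judge(image_prompt, dialogue)
            if score >= threshold:  # keep only high-quality, instruction-aligned pairs
                kept.append(ImageDialoguePair(image_prompt, dialogue, score))
        return kept

    if __name__ == "__main__":
        # Dummy stand-ins so the sketch runs without any model backend.
        pairs = synthesize_pairs(
            topics=["a kitchen scene"],
            propose_image=lambda t: f"A photo of {t} with several labelled objects",
            propose_dialogue=lambda p: ["User: What is left of the stove?", "Assistant: A kettle."],
            judge=lambda p, d: 0.9,
        )
        print(len(pairs), "pairs kept")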
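
The second contribution's on-device loop, combining memory-driven reasoning, OCR-based screen parsing, and retrieval-augmented planning, might look broadly like the sketch below. The class and function names are assumptions for illustration rather than the framework's actual API, and keyword matching stands in for the real retrieval component.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class AgentMemory:
        steps: List[str] = field(default_factory=list)  # past actions and observations

        def retrieve(self, query: str, k: int = 3) -> List[str]:
            # Naive keyword retrieval as a stand-in for the real retrieval component.
            hits = [s for s in self.steps if query.lower() in s.lower()]
            return hits[:k]

    def run_task(
        goal: str,
        read_screen: Callable[[], str],                   # OCR-based parser: screen -> text layout
        plan_step: Callable[[str, str, List[str]], str],  # (goal, screen, retrieved memory) -> action
        execute: Callable[[str], None],                   # performs the UI action on-device
        max_steps: int = 10,
    ) -> AgentMemory:
        memory = AgentMemory()
        for _ in range(max_steps):
            screen = read_screen()
            context = memory.retrieve(goal)
            action = plan_step(goal, screen, context)
            memory.steps.append(f"{action} | seen: {screen[:40]}")
            if action == "DONE":
                break
            execute(action)
        return memory

    if __name__ == "__main__":
        # Toy screen sequence standing in for live OCR output.
        screens = iter(["Inbox: 2 unread", "Compose window open", "Message sent"])
        mem = run_task(
            goal="send a message",
            read_screen=lambda: next(screens, "Home screen"),
            plan_step=lambda g, s, c: "DONE" if "sent" in s else f"tap based on: {s}",
            execute=lambda a: None,
        )
        print(mem.steps)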
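
For the robustness benchmark, per-category LLM-as-a-judge scoring could be aggregated along the following lines. The scoring scale, category labels as identifiers, and the toy exact-match judge are assumptions made only to keep the sketch self-contained.

    from collections import defaultdict
    from typing import Callable, Dict, List, Tuple

    CATEGORIES = ["content_distortion", "emotional_interference", "explicit_noise", "implicit_noise"]

    def evaluate(
        examples: List[Tuple[str, str, str]],   # (category, model_response, reference_answer)
        judge: Callable[[str, str], float],     # LLM-as-a-judge: (response, reference) -> score in [0, 1]
    ) -> Dict[str, float]:
        """Average judge score per adversarial category."""
        totals, counts = defaultdict(float), defaultdict(int)
        for category, response, reference in examples:
            totals[category] += judge(response, reference)
            counts[category] += 1
        return {c: totals[c] / counts[c] for c in CATEGORIES if counts[c]}

    if __name__ == "__main__":
        # Toy exact-match judge standing in for a real LLM grader.
        scores = evaluate(
            examples=[("explicit_noise", "turn on the lights", "turn on the lights"),
                      ("emotional_interference", "I can't help", "turn on the lights")],
            judge=lambda r, ref: 1.0 if r == ref else 0.0,
        )
        print(scores)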