Towards Better Accuracy and Efficiency in Classification and Generation using Deep Multi-Modal Learning

In the rapidly evolving landscape of artificial intelligence, significant strides have been made across key domains such as visual perception, visual-language understanding, virtual human creation, and robotics, all of which fundamentally engage with multi-modal data sources, including images, text, audio, video, 3D pose, and sensor readings. This dissertation addresses the pivotal challenge of effectively utilizing this diverse multi-modal data to enhance the robustness and versatility of AI systems, with a particular focus on classification and generation. It introduces novel methods, including a moderate hard example mining strategy that enhances fine-grained recognition and reduces overfitting, and episodic linear probing that increases network generalizability across a spectrum of visual tasks. The research further extends into generative tasks, presenting an image-compounded captioning method for video, a semantic-energized approach for virtual human pose generation, and a multimodality-aware autoencoder for robotic affordance learning. These methods extract critical cues from multi-modal data, thereby significantly improving performance in both classification and generation tasks. Through extensive experimentation, the thesis not only demonstrates the effectiveness of these approaches over prior methods but also contributes to the field's techniques and understanding in managing complex, multi-modal data.