From Sound to Sign: Deep Learning for Audio Style Adaptation and Multi-modal Sign Language Analysis

Publication Type:
Thesis
Issue Date:
2024
The growing demand for accessible communication tools for the deaf community highlights the need to bridge auditory and visual languages. However, the inherent disparity between spoken and sign languages poses substantial challenges, making their conversion a central problem in multimodal learning. Spoken language varies in speaking patterns, accent, and intonation, requiring translation systems to adapt to diverse speech styles. Sign language likewise combines multiple modalities, including gestures, facial expressions, and body movements, requiring systems to distinguish actions accurately and convey complex context. This study proposes an efficient multimodal learning framework to tackle the bidirectional translation challenges between the two languages. In particular, we integrate audio style adaptation and multimodal sign language analysis within a unified system. For audio style adaptation, we leverage diffusion models and mutual learning mechanisms to convert raw audio into various target styles, minimizing the need for paired audio data. For multimodal sign language analysis, we implement sign language production and translation using sequence diffusion models and large language models, respectively. The resulting system adapts to diverse audio styles and harnesses the extensive prior knowledge of large models to preserve the integrity and accuracy of the original semantics.
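
To make the audio style adaptation component more concrete, the following is a minimal, self-contained sketch (not the thesis implementation) of conditional denoising diffusion applied to mel-spectrogram frames: a small denoiser predicts the added noise given a timestep and a target-style embedding, and DDPM-style ancestral sampling produces a frame in that style. All module names, dimensions, and the linear noise schedule are illustrative assumptions; the model here is untrained and exists only to show the structure of the approach.

```python
# Hypothetical sketch of conditional diffusion for audio style adaptation.
# Assumed components: an 80-bin mel frame, a 64-d style embedding, 1000 steps.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to a mel frame, conditioned on timestep and style."""
    def __init__(self, n_mels=80, style_dim=64, t_dim=64, hidden=256):
        super().__init__()
        self.t_embed = nn.Embedding(1000, t_dim)  # discrete diffusion timesteps
        self.net = nn.Sequential(
            nn.Linear(n_mels + style_dim + t_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, x_t, t, style):
        h = torch.cat([x_t, style, self.t_embed(t)], dim=-1)
        return self.net(h)  # predicted noise eps_hat

@torch.no_grad()
def sample(denoiser, style, n_mels=80, steps=1000):
    """DDPM-style ancestral sampling of a mel frame in the target style."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(style.size(0), n_mels)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        t_batch = torch.full((style.size(0),), t, dtype=torch.long)
        eps_hat = denoiser(x, t_batch, style)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Usage: adapt a batch of 4 frames toward a (hypothetical) target-style embedding.
denoiser = ConditionalDenoiser()
target_style = torch.randn(4, 64)
adapted_mel = sample(denoiser, target_style)
print(adapted_mel.shape)  # torch.Size([4, 80])
```

In the framework described above, such a style-conditioned denoiser would be paired with a mutual learning mechanism so that conversion toward new target styles does not require paired audio data; that pairing is beyond the scope of this sketch.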