Computational Understanding of Figurative Language on Social Media

Publication Type: Thesis
Issue Date: 2022
Figurative language in online user-generated text poses challenges to Natural Language Processing (NLP) systems designed to automate the understanding of natural language. This thesis introduces empirical studies that quantify the presence and describe the nature of figurative language in social media posts, specifically on Twitter. It also quantifies the impact of figurative language on particular NLP applications and introduces new resources (i.e., datasets and methodologies) for the computational processing of figurative language. This thesis contains a focused case study on general figurative language in the context of Public Health Surveillance (PHS) applications that monitor Twitter for health events. Findings indicate that some symptom and disease topics are mentioned in a figurative context more often than in a health context, which results in a biased signal. To address this bias, a new annotated dataset and a text classifier are proposed that reduce bias by targeting figurative expressions of health-related concepts on Twitter. There is limited research on the expression of hyperbole on Twitter compared to other types of figurative language (e.g., metaphor). To address this gap, a dataset of tweets annotated for the presence of hyperbole is collected and explored. Findings show that hyperbole is relatively common on Twitter and that its expression varies from simple and repetitive to complex and novel. A common theme of hyperbole on Twitter is the strongly affect-laden intent of the authors, which heightens the importance of hyperbole understanding for affective computing applications. Several text classifiers are proposed that leverage pre-trained language models, affective signals, and privileged information for the detection of hyperbole.
Experiments show improvements in the detection of hyperbole and, importantly, highlight annotation biases inherent in the current annotation scheme for hyperbole detection, which are likely to be a roadblock to further improvements. This thesis quantifies the occurrence of figurative language on Twitter and demonstrates a considerable and consistent presence. Additionally, figurative language is often mishandled by various NLP resources and is scarcely addressed by existing datasets and methodologies. Experimental results show that through direct targeting and careful handling of figurative language, improvements to its detection are achievable. However, it is concluded that the complexity and novelty of figurative language require further algorithmic and data innovations for continued progress.