DLRG@DravidianLangTech-ACL2022: Abusive Comment Detection in Tamil using Multilingual Transformer Models

Duraphe, A; Rajalakshmi, R; Shibani, A

Publisher: Association for Computational Linguistics (ACL)
Publication Type: Conference Proceeding
Citation: DravidianLangTech 2022 - 2nd Workshop on Speech and Language Technologies for Dravidian Languages, Proceedings of the Workshop, 2022, pp. 207-213
Issue Date: 2022-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Duraphe, A
dc.contributor.author	Rajalakshmi, R
dc.contributor.author	Shibani, A https://orcid.org/0000-0003-4619-8684
dc.date	2022-05
dc.date.accessioned	2023-03-31T12:22:01Z
dc.date.available	2023-03-31T12:22:01Z
dc.date.issued	2022-01-01
dc.identifier.citation	DravidianLangTech 2022 - 2nd Workshop on Speech and Language Technologies for Dravidian Languages, Proceedings of the Workshop, 2022, pp. 207-213
dc.identifier.isbn	9781955917346
dc.identifier.uri	http://hdl.handle.net/10453/169027
dc.language	en
dc.publisher	Association for Computational Linguistics (ACL)
dc.relation.ispartof	DravidianLangTech 2022 - 2nd Workshop on Speech and Language Technologies for Dravidian Languages, Proceedings of the Workshop
dc.relation.ispartof	Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
dc.relation.isbasedon	10.18653/v1/2022.dravidianlangtech-1.32
dc.rights	info:eu-repo/semantics/openAccess
dc.title	DLRG@DravidianLangTech-ACL2022: Abusive Comment Detection in Tamil using Multilingual Transformer Models
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Provost
pubs.organisational-group	/University of Technology Sydney/Provost/TD School
utslib.copyright.status	open_access	*
dc.date.updated	2023-03-31T12:22:00Z
pubs.finish-date	2022-05
pubs.publication-status	Published
pubs.start-date	2022-05

Abstract:

Online social networks let people connect and interact with each other. They also, however, provide a platform for online abusers to propagate abusive content. The majority of these abusive remarks are written in a multilingual style, which allows them to slip past internet inspection easily. This paper presents a system developed for the Shared Task on Abusive Comment Detection (Misogyny, Misandry, Homophobia, Transphobic, Xenophobia, CounterSpeech, Hope Speech) in Tamil at DravidianLangTech@ACL 2022 to detect the abusive category of each comment. We approach the task with three methodologies (machine learning, deep learning, and Transformer-based modeling) on two datasets: Tamil and Tamil+English. The dataset used in our system can be accessed from the competition on CodaLab. For machine learning, eight algorithms were implemented, among which Random Forest gave the best result on the Tamil+English dataset, with a weighted average F1-score of 0.78. For deep learning, a bidirectional LSTM gave the best result with pre-trained word embeddings. In Transformer-based modeling, we fine-tuned IndicBERT and mBERT, of which mBERT gave the best result on the Tamil dataset, with a weighted average F1-score of 0.7.
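
The results in the abstract are reported as weighted average F1-scores. As a rough illustration of that metric only, the sketch below computes the F1-score of each class and averages them weighted by class support (the number of true instances of each class). The function name weighted_f1 and the toy labels are hypothetical and are not taken from the paper or the shared-task data; the authors' actual evaluation code is not shown in this record.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by each class's support."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        # True positives: predicted as this class and actually this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)

        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        # Each class contributes in proportion to its share of the true labels.
        score += f1 * n_true / total

    return score


# Hypothetical toy labels, for illustration only (not shared-task data).
y_true = ["Misogyny", "Misogyny", "Hope-Speech", "Hope-Speech", "Hope-Speech", "Xenophobia"]
y_pred = ["Misogyny", "Hope-Speech", "Hope-Speech", "Hope-Speech", "Xenophobia", "Xenophobia"]

print(round(weighted_f1(y_true, y_pred), 4))
```

Because the average is weighted by support, frequent classes dominate the score, which matters when categories are unevenly represented. In scikit-learn, the equivalent computation is f1_score(y_true, y_pred, average="weighted").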

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/169027