An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Curiskis, SA; Drake, B; Osborn, TR; Kennedy, PJ


Publisher: Elsevier
Publication Type: Journal Article
Citation: Information Processing and Management, 2020, 57, (2)
Issue Date: 2020-03-01

Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Curiskis, SA
dc.contributor.author	Drake, B https://orcid.org/0000-0003-0572-9936
dc.contributor.author	Osborn, TR
dc.contributor.author	Kennedy, PJ
dc.date.accessioned	2020-10-16T22:30:29Z
dc.date.available	2020-10-16T22:30:29Z
dc.date.issued	2020-03-01
dc.identifier.citation	Information Processing and Management, 2020, 57, (2)
dc.identifier.issn	0306-4573
dc.identifier.issn	1873-5371
dc.identifier.uri	http://hdl.handle.net/10453/143319
dc.language	English
dc.publisher	Elsevier
dc.relation.ispartof	Information Processing and Management
dc.relation.isbasedon	10.1016/j.ipm.2019.04.002
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0804 Data Format, 0806 Information Systems, 0807 Library and Information Studies
dc.subject.classification	Information & Library Sciences
dc.title	An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
dc.type	Journal Article
utslib.citation.volume	57
utslib.for	0804 Data Format
utslib.for	0806 Information Systems
utslib.for	0807 Library and Information Studies
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
pubs.consider-herdc	true
dc.date.updated	2020-10-16T22:30:10Z
pubs.issue	2
pubs.publication-status	Published
pubs.volume	57
utslib.citation.issue	2

Abstract:

© 2019 Elsevier Ltd. Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data because such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term frequency-inverse document frequency (tf-idf) matrices and word embedding models, combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.
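To make two of the ingredients named in the abstract concrete, here is a minimal sketch of a tf-idf weighting and of one extrinsic clustering measure. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which tf-idf variant or which extrinsic measures the study uses, so the unsmoothed tf-idf formula, the whitespace tokeniser, the choice of normalised mutual information (NMI), and the function names `tfidf`, `entropy` and `nmi` are all assumptions made here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: relative term frequency times log(N / df), no smoothing.
    Tokenisation is a plain lowercase whitespace split (an assumption).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Normalised mutual information between two labelings.

    Normalised by the arithmetic mean of the two entropies (an assumed choice).
    """
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_truth, count_pred = Counter(truth), Counter(pred)
    # I(T; P) = sum over cells of p(t, p) * log(p(t, p) / (p(t) * p(p))).
    mi = sum(
        (c / n) * math.log(c * n / (count_truth[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    # Convention assumed here: two trivial one-cluster labelings agree fully.
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    posts = ["cat sat", "cat ran"]  # toy documents, not data from the study
    print(tfidf(posts))
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partition
```

An extrinsic measure such as NMI compares cluster assignments against known labels and ignores how the clusters are numbered, so a partition that matches the ground truth up to relabelling scores 1 and an independent one scores 0.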

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/143319