Word embedding-based techniques for text clustering and topic modelling with application in the healthcare domain

Seifollahi, Sattar

Publication Type: Thesis
Issue Date: 2019

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Seifollahi, Sattar
dc.date.accessioned	2020-04-24T02:30:57Z
dc.date.available	2020-04-24T02:30:57Z
dc.date.issued	2019
dc.identifier.uri	http://hdl.handle.net/10453/140254
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_AU
dc.format	Thesis (PhD)
dc.language.iso	en_AU	en_AU
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/140254/2/02whole.pdf
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.rights	au.edu.uts.lib/ppc
dc.title	Word embedding-based techniques for text clustering and topic modelling with application in the healthcare domain	en_AU
dc.type	Thesis	en_AU
utslib.copyright.status	open_access

Abstract:

In the field of text analytics, document clustering and topic modelling are two widely used tools with many applications. Document clustering aims to automatically organize similar documents into groups, which is crucial for document organization, browsing, summarization, classification and retrieval. Topic modelling refers to unsupervised models that automatically discover the main topics of a collection of documents. In topic modelling, each topic is represented as a probability distribution over the words in the collection (the relative probabilities of the words reveal what the topic is about). In turn, each document is represented as a distribution over the topics. Such distributions can also be seen as low-dimensional representations of the documents that can be used for information retrieval, document summarization and classification. Document clustering and topic modelling are closely related and can mutually benefit from each other.
Many document clustering algorithms exist, including the classic k-means. In this thesis, we have developed three new algorithms: 1) a maximum-margin clustering approach, which was originally proposed for general data but also suits text clustering; 2) a modified global k-means algorithm for text clustering, which is able to improve on local minima and find a deeper local solution for clustering document collections in a limited amount of time; and 3) a taxonomy-augmented algorithm, which addresses two main drawbacks of the so-called “bag-of-words” (BoW) models, namely the curse of dimensionality and the dismissal of word ordering. Our main emphasis is on high accuracy and effectiveness within the bounds of limited memory consumption.
Although great effort has been devoted to topic modelling to date, a limitation of many topic models, such as latent Dirichlet allocation, is that they do not explicitly take the relations between words into account. Our contribution has been two-fold. First, we have developed a topic model which captures how words are topically related. The model is presented as a semi-supervised Markov chain topic model in which topics are assigned to individual words based on how each word is topically connected to the previous one in the collection. Second, we have combined topic modelling and clustering to propose a new algorithm that benefits from both.
This research was industry-driven, focusing on projects from the Transport Accident Commission (TAC), a major accident compensation agency of the Victorian Government in Australia. It has received full ethics approval from the UTS Human Research Ethics Committee. The results presented in this thesis do not allow the re-identification of any person involved in the services.
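
The representation described in the abstract (topics as distributions over words, documents as distributions over topics) can be illustrated with a small sketch. It is not taken from the thesis: the vocabulary, the two topic names and every probability below are made-up values chosen only to show how a document's word distribution arises as a mixture of topic distributions, p(word | doc) = Σ p(topic | doc) · p(word | topic).

```python
# Toy illustration only: the vocabulary, topics and probabilities are
# hypothetical values, not models or results from the thesis.

VOCAB = ["injury", "surgery", "claim", "payment"]

# Each topic is a probability distribution over the vocabulary.
TOPICS = {
    "medical": [0.5, 0.4, 0.05, 0.05],
    "financial": [0.05, 0.05, 0.4, 0.5],
}

# Each document is a probability distribution over the topics.
# This short vector is the low-dimensional representation of the document.
DOC_TOPICS = {
    "doc_a": {"medical": 0.9, "financial": 0.1},
    "doc_b": {"medical": 0.2, "financial": 0.8},
}


def word_distribution(doc_topics):
    """Mix the topic-word distributions using the document's topic weights.

    Computes p(word | doc) = sum over topics of p(topic | doc) * p(word | topic).
    """
    mixed = [0.0] * len(VOCAB)
    for topic, weight in doc_topics.items():
        for i, p in enumerate(TOPICS[topic]):
            mixed[i] += weight * p
    return mixed


def top_words(topic, n=2):
    """Return the n most probable words of a topic."""
    ranked = sorted(zip(VOCAB, TOPICS[topic]), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


if __name__ == "__main__":
    for name, mixture in DOC_TOPICS.items():
        probs = word_distribution(mixture)
        print(name, {w: round(p, 3) for w, p in zip(VOCAB, probs)})
    for topic in TOPICS:
        print(topic, top_words(topic))
```

In this sketch each document is summarised by two topic weights instead of one count per vocabulary word, which is the sense in which topic distributions act as low-dimensional document representations. A real topic model would estimate both sets of distributions from the collection rather than fixing them by hand.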

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/140254