Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

Seifollahi, S; Piccardi, M

Publisher: Springer Nature
Publication Type: Chapter
Citation: Computational Linguistics and Intelligent Text Processing, 2023, 13452 LNCS, pp. 575-586
Issue Date: 2023-01-01

This item is open access.

The embargo period expires on 1 Jan 2025

Full metadata record

Field	Value	Language
dc.contributor.author	Seifollahi, S
dc.contributor.author	Piccardi, M https://orcid.org/0000-0001-9250-6604
dc.date.accessioned	2023-06-15T10:34:56Z
dc.date.available	2023-06-15T10:34:56Z
dc.date.issued	2023-01-01
dc.identifier.citation	Computational Linguistics and Intelligent Text Processing, 2023, 13452 LNCS, pp. 575-586
dc.identifier.isbn	9783031243394
dc.identifier.uri	http://hdl.handle.net/10453/170766
dc.description.abstract	Extracting meaningful features from documents can prove critical for a variety of tasks such as classification, clustering and semantic analysis. However, traditional approaches to document feature extraction mainly rely on first-order word statistics that are very high-dimensional and do not capture the semantics of the documents well. For this reason, in this paper we present a novel approach that extracts document features based on a combination of a constructed word taxonomy and a word embedding in vector space. The feature extraction consists of three main steps. First, a word embedding technique is used to map all the words in the vocabulary onto a vector space. Second, the words in the vocabulary are organised into a hierarchy of clusters (word clusters) by applying k-means hierarchically. Lastly, the individual documents are projected onto the word clusters based on a predefined set of keywords, leading to a compact representation as a mixture of keywords. The extracted features can be used for a number of tasks, including document classification and clustering as well as semantic analysis of the documents generated by specific individuals over time. For the experiments, we have employed a dataset of transcripts of phone calls between claim managers and clients collected by the Transport Accident Commission of the Victorian Government. The experimental results show that the proposed approach achieves comparable or higher accuracy than conventional feature extraction approaches, with a much more compact representation.
dc.language	en
dc.publisher	Springer Nature
dc.relation.ispartof	Computational Linguistics and Intelligent Text Processing
dc.relation.ispartofseries	Lecture Notes in Computer Science
dc.relation.isbasedon	10.1007/978-3-031-24340-0_43
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis
dc.type	Chapter
utslib.citation.volume	13452 LNCS
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	open_access	*
utslib.copyright.embargo	2025-01-01T00:00:00+1000Z
dc.date.updated	2023-06-15T10:34:54Z
pubs.publication-status	Published
pubs.volume	13452 LNCS

Abstract:

Extracting meaningful features from documents can prove critical for a variety of tasks such as classification, clustering and semantic analysis. However, traditional approaches to document feature extraction mainly rely on first-order word statistics that are very high-dimensional and do not capture the semantics of the documents well. For this reason, in this paper we present a novel approach that extracts document features based on a combination of a constructed word taxonomy and a word embedding in vector space. The feature extraction consists of three main steps. First, a word embedding technique is used to map all the words in the vocabulary onto a vector space. Second, the words in the vocabulary are organised into a hierarchy of clusters (word clusters) by applying k-means hierarchically. Lastly, the individual documents are projected onto the word clusters based on a predefined set of keywords, leading to a compact representation as a mixture of keywords. The extracted features can be used for a number of tasks, including document classification and clustering as well as semantic analysis of the documents generated by specific individuals over time. For the experiments, we have employed a dataset of transcripts of phone calls between claim managers and clients collected by the Transport Accident Commission of the Victorian Government. The experimental results show that the proposed approach achieves comparable or higher accuracy than conventional feature extraction approaches, with a much more compact representation.
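
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional word vectors are hand-made stand-ins for a trained embedding, the function names (kmeans, build_taxonomy, project) and the toy vocabulary are hypothetical, and the rule that labels a leaf cluster with the first keyword it contains is an assumption about how the projection might work.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def kmeans(words, emb, k, iters=20):
    """Plain k-means over word vectors with deterministic farthest-point init."""
    centres = [emb[words[0]]]
    while len(centres) < k:
        far = max(words, key=lambda w: min(dist2(emb[w], c) for c in centres))
        centres.append(emb[far])
    for _ in range(iters):
        groups = [[] for _ in centres]
        for w in words:
            nearest = min(range(len(centres)),
                          key=lambda i: dist2(emb[w], centres[i]))
            groups[nearest].append(w)
        # Keep the old centre if a group ended up empty.
        centres = [mean([emb[w] for w in g]) if g else c
                   for g, c in zip(groups, centres)]
    return [g for g in groups if g]

def build_taxonomy(words, emb, k=2, depth=2):
    """Step 2: apply k-means recursively and return the leaf word clusters."""
    if depth == 0 or len(words) <= k:
        return [words]
    leaves = []
    for group in kmeans(words, emb, k):
        leaves.extend(build_taxonomy(group, emb, k, depth - 1))
    return leaves

def project(doc_tokens, leaves, keywords):
    """Step 3 (assumed rule): a leaf is labelled by the first keyword it
    contains; the document becomes normalised counts over those keywords."""
    word_to_kw = {}
    for leaf in leaves:
        labels = [kw for kw in keywords if kw in leaf]
        if labels:
            for w in leaf:
                word_to_kw[w] = labels[0]
    counts = Counter(word_to_kw[t] for t in doc_tokens if t in word_to_kw)
    total = sum(counts.values())
    return [counts[kw] / total if total else 0.0 for kw in keywords]

# Step 1 (stand-in): a hand-made toy embedding instead of a trained one.
emb = {
    "car": (0.0, 0.0), "vehicle": (0.1, 0.0), "crash": (0.0, 0.2),
    "pain": (5.0, 5.0), "injury": (5.1, 5.0), "doctor": (5.0, 5.2),
    "payment": (10.0, 0.0), "invoice": (10.1, 0.0),
}
keywords = ["vehicle", "injury", "payment"]

leaves = build_taxonomy(list(emb), emb, k=2, depth=2)
features = project(["the", "car", "crash", "caused", "pain", "invoice"],
                   leaves, keywords)
print(leaves)
print(features)
```

In this sketch a document is reduced to one number per keyword, however large the vocabulary is, which is the sense in which the representation is compact. A real pipeline would replace the toy vectors with a trained word embedding and would likely use a library k-means implementation.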

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/170766