Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

Publisher: Springer Nature
Publication Type: Chapter
Citation: Computational Linguistics and Intelligent Text Processing, 2023, 13452 LNCS, pp. 575-586
Issue Date: 2023-01-01
Extracting meaningful features from documents can prove critical for a variety of tasks such as classification, clustering and semantic analysis. However, traditional approaches to document feature extraction mainly rely on first-order word statistics, which are very high-dimensional and do not capture the semantics of the documents well. For this reason, in this paper we present a novel approach that extracts document features based on a combination of a constructed word taxonomy and a word embedding in vector space. The feature extraction consists of three main steps. First, a word embedding technique is used to map all the words in the vocabulary onto a vector space. Second, the words in the vocabulary are organised into a hierarchy of clusters (word clusters) by applying k-means hierarchically. Lastly, the individual documents are projected onto the word clusters based on a predefined set of keywords, leading to a compact representation as a mixture of keywords. The extracted features can be used for a number of tasks, including document classification and clustering as well as semantic analysis of the documents generated by specific individuals over time. For the experiments, we have employed a dataset of transcripts of phone calls between claim managers and clients collected by the Transport Accident Commission of the Victorian Government. The experimental results show that the proposed approach achieves comparable or higher accuracy than conventional feature extraction approaches, with a much more compact representation.
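The three steps described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation, and everything specific in it is an assumption: the toy two-dimensional vectors stand in for a trained word embedding, the helper names (kmeans, hierarchical_kmeans, document_features) are hypothetical, and the keyword-based projection is simplified to normalised counts of a document's words per leaf cluster.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def hierarchical_kmeans(X, k, depth):
    """Apply k-means recursively; returns a cluster path (tuple) per row."""
    paths = [() for _ in range(len(X))]

    def split(indices, level):
        # Stop at the requested depth or when too few points remain.
        if level == depth or len(indices) < k:
            return
        labels = kmeans(X[indices], k)
        for j in range(k):
            child = indices[labels == j]
            for i in child:
                paths[i] = paths[i] + (j,)
            split(child, level + 1)

    split(np.arange(len(X)), 0)
    return paths

def document_features(tokens, vocab, paths):
    """Project a tokenised document onto the leaf word clusters.

    Assumption: the projection is a normalised count of tokens per leaf.
    """
    leaves = sorted(set(paths))
    leaf_index = {leaf: n for n, leaf in enumerate(leaves)}
    word_to_leaf = {w: leaf_index[p] for w, p in zip(vocab, paths)}
    feats = np.zeros(len(leaves))
    for t in tokens:
        if t in word_to_leaf:
            feats[word_to_leaf[t]] += 1
    total = feats.sum()
    return feats / total if total > 0 else feats

# Step 1 (assumed): toy 2-D vectors in place of a trained word embedding.
vocab = ["car", "truck", "road", "crash", "doctor", "nurse", "injury", "pain"]
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.9, 0.0], [1.0, 0.1],
    [5.0, 5.1], [5.1, 5.0], [5.9, 5.0], [6.0, 5.1],
])

# Step 2: organise the vocabulary into a two-level hierarchy of clusters.
paths = hierarchical_kmeans(embeddings, k=2, depth=2)

# Step 3: represent a document as a mixture over the leaf clusters.
document = "the car crash caused pain and injury".split()
features = document_features(document, vocab, paths)
print(features)
```

The resulting vector has one entry per leaf cluster rather than one per vocabulary word, which is where the compactness claimed in the abstract would come from. In practice the toy vectors would be replaced by a trained embedding and the simple k-means by a library implementation.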