Taxonomy-augmented features for document clustering

Publication Type:
Conference Proceeding
Citation:
Communications in Computer and Information Science, 2019, 996 pp. 241 - 252
Issue Date:
2019-01-01
Filename: taxonomy-augmented-features 2018 08 17.pdf
Description: Accepted Manuscript version
Size: 1.09 MB
Format: Adobe PDF
© Springer Nature Singapore Pte Ltd. 2019.
In document clustering, individual documents are typically represented by feature vectors based on term-frequency or bag-of-words models. However, such feature vectors inherently disregard the order of the words in the document and suffer from very high dimensionality. For these reasons, in this paper we present novel taxonomy-augmented features that enjoy two promising characteristics: (1) they leverage semantic word embeddings to take the word order into account, and (2) they reduce the feature dimensionality to a very manageable size. Our feature extraction approach consists of three main steps. First, we apply a word embedding technique to represent the words in a word embedding space. Second, we partition the word vocabulary into a hierarchy of clusters by applying k-means hierarchically. Lastly, the individual documents are projected onto the hierarchy and a compact feature vector is extracted. We propose two methods for generating the features. The first uses all the clusters in the hierarchy and results in a feature vector whose dimensionality is equal to the number of clusters. The second uses a small set of user-defined words and results in an even smaller feature vector whose dimensionality is equal to the size of the set. Numerical experiments on document clustering show that the proposed approach can achieve comparable or even higher accuracy than conventional feature vectors with a much more compact representation.
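The three steps and the first feature-generation method described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration rather than the authors' implementation: the two-dimensional `embeddings` are hand-made stand-ins for real word embeddings, the helper names `kmeans`, `build_hierarchy` and `document_features` are hypothetical, the hierarchy uses k = 2 and depth 2, and the projection is assumed to be the fraction of a document's tokens that fall into each cluster.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def build_hierarchy(words, embeddings, k=2, depth=2):
    """Recursively split the vocabulary; return every cluster as a set of words."""
    if depth == 0 or len(words) < k:
        return []
    labels = kmeans([embeddings[w] for w in words], k)
    clusters = []
    for c in range(k):
        members = [w for w, l in zip(words, labels) if l == c]
        if members:
            clusters.append(set(members))
            clusters.extend(build_hierarchy(members, embeddings, k, depth - 1))
    return clusters

def document_features(tokens, clusters):
    """Assumed projection: fraction of the document's tokens in each cluster."""
    n = max(len(tokens), 1)
    return [sum(t in c for t in tokens) / n for c in clusters]

# Step 1 (stand-in): toy 2-D "embeddings" with two well-separated topics.
embeddings = {
    "cat": [0.0, 0.1], "dog": [0.2, 0.0], "horse": [1.0, 1.1], "bird": [1.2, 0.9],
    "bank": [10.0, 10.1], "loan": [10.2, 10.0], "stock": [11.0, 11.2], "bond": [11.1, 10.9],
}

# Step 2: hierarchy of clusters over the vocabulary.
clusters = build_hierarchy(list(embeddings), embeddings, k=2, depth=2)

# Step 3: one feature per cluster, so the vector length equals the cluster count.
features = document_features(["cat", "dog", "bank"], clusters)
print(len(clusters), features)
```

In this sketch the feature vector has one entry per cluster in the hierarchy, which mirrors the dimensionality stated for the first method. The second method, based on a user-defined word set, is not sketched because the abstract does not describe how those words are mapped to features.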