Taxonomy-augmented features for document clustering

Seifollahi, S; Piccardi, M; Borzeshi, EZ; Kruger, B

Publication Type: Conference Proceeding
Citation: Communications in Computer and Information Science, 2019, vol. 996, pp. 241-252
Issue Date: 2019-01-01

Closed Access

Filename	Description	Size	Format
taxonomy-augmented-features 2018 08 17.pdf	Accepted Manuscript version	1.09 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Seifollahi, S https://orcid.org/0000-0002-5325-9724	en_US
dc.contributor.author	Piccardi, M https://orcid.org/0000-0001-9250-6604	en_US
dc.contributor.author	Borzeshi, EZ	en_US
dc.contributor.author	Kruger, B	en_US
dc.date.issued	2019-01-01	en_US
dc.identifier.citation	Communications in Computer and Information Science, 2019, 996 pp. 241 - 252	en_US
dc.identifier.isbn	9789811366604	en_US
dc.identifier.issn	1865-0929	en_US
dc.identifier.uri	http://hdl.handle.net/10453/137037
dc.relation.ispartof	Communications in Computer and Information Science	en_US
dc.relation.isbasedon	10.1007/978-981-13-6661-1_19	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Taxonomy-augmented features for document clustering	en_US
dc.type	Conference Proceeding
utslib.citation.volume	996	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	closed_access	*
pubs.publication-status	Published	en_US
pubs.volume	996	en_US

Abstract:

© Springer Nature Singapore Pte Ltd. 2019. In document clustering, individual documents are typically represented by feature vectors based on term-frequency or bag-of-words models. However, such feature vectors inherently discard the order of the words in the document and suffer from very high dimensionality. For these reasons, in this paper we present novel taxonomy-augmented features that have two promising characteristics: (1) they leverage semantic word embeddings to take the word order into account, and (2) they reduce the feature dimensionality to a very manageable size. Our feature extraction approach consists of three main steps. First, we apply a word embedding technique to represent the words in a word embedding space. Second, we partition the word vocabulary into a hierarchy of clusters by applying k-means hierarchically. Lastly, the individual documents are projected onto the hierarchy and a compact feature vector is extracted. We propose two methods for generating the features. The first uses all the clusters in the hierarchy and results in a feature vector whose dimensionality is equal to the number of clusters. The second uses a small set of user-defined words and results in an even smaller feature vector whose dimensionality is equal to the size of the set. Numerical experiments on document clustering show that the proposed approach can achieve comparable or even higher accuracy than conventional feature vectors with a much more compact representation.
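
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy 2-D embeddings, the function names (kmeans, build_hierarchy, document_features), the branching factor, the depth, and the use of normalized per-cluster word counts as the projection are all assumptions made for illustration, since the abstract does not specify these details. Only the first feature method (one dimension per cluster in the hierarchy) is sketched.

```python
import random
from collections import Counter


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on equal-length tuples; returns one label per point."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def build_hierarchy(words, embeddings, branching=2, depth=2, seed=0):
    """Split the vocabulary recursively with k-means.

    Returns {word: [node, ...]} where each node is a tuple naming the path
    from the root to a cluster the word belongs to, one node per level.
    """
    paths = {w: [] for w in words}

    def split(group, prefix, level):
        if level == depth or len(group) < branching:
            return
        labels = kmeans([embeddings[w] for w in group], branching, seed=seed)
        for c in range(branching):
            child = [w for w, lab in zip(group, labels) if lab == c]
            node = prefix + (c,)
            for w in child:
                paths[w].append(node)
            split(child, node, level + 1)

    split(list(words), (), 0)
    return paths


def document_features(tokens, paths, nodes):
    """Assumed projection: fraction of in-vocabulary tokens falling in each cluster."""
    counts = Counter(node for t in tokens for node in paths.get(t, []))
    total = sum(1 for t in tokens if t in paths) or 1
    return [counts[n] / total for n in nodes]


# Toy 2-D "embeddings", made up for illustration; real ones would come from a trained model.
embeddings = {
    "cat": (0.0, 0.1), "dog": (0.1, 0.0), "lion": (0.2, 0.2),
    "car": (5.0, 5.1), "bus": (5.1, 5.0), "train": (5.3, 5.2),
}
paths = build_hierarchy(list(embeddings), embeddings, branching=2, depth=2)
nodes = sorted({node for p in paths.values() for node in p})
features = document_features(["cat", "dog", "car", "the"], paths, nodes)
print(len(nodes), features)
```

In this sketch the feature vector has one entry per cluster in the hierarchy, so its length depends on the branching factor and depth rather than on the vocabulary size, which is the compactness property the abstract describes. Out-of-vocabulary tokens such as "the" are simply ignored here, which is another assumption.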

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/137037