Efficient mining of distance-based subspace clusters

Liu, G; Sim, K; Li, J; Wong, L

Publication Type: Journal Article
Citation: Statistical Analysis and Data Mining, 2009, 2 (5-6), pp. 427-444
Issue Date: 2009-12-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, G	en_US
dc.contributor.author	Sim, K	en_US
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413	en_US
dc.contributor.author	Wong, L	en_US
dc.date.issued	2009-12-01	en_US
dc.identifier.citation	Statistical Analysis and Data Mining, 2009, 2 (5-6), pp. 427 - 444	en_US
dc.identifier.issn	1932-1872	en_US
dc.identifier.uri	http://hdl.handle.net/10453/14517
dc.relation.ispartof	Statistical Analysis and Data Mining	en_US
dc.relation.isbasedon	10.1002/sam.10062	en_US
dc.title	Efficient mining of distance-based subspace clusters	en_US
dc.type	Journal Article
utslib.citation.volume	5-6	en_US
utslib.citation.volume	2	en_US
utslib.for	0104 Statistics	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
utslib.copyright.status	open_access
pubs.issue	5-6	en_US
pubs.publication-status	Published	en_US
pubs.volume	2	en_US

Abstract:

Traditional similarity measures often become meaningless as the dimensionality of a dataset increases. Subspace clustering has been proposed to find clusters embedded in subspaces of high-dimensional datasets. Many existing algorithms use a grid-based approach that partitions the data space into non-overlapping rectangular cells and then identifies connected dense cells as clusters. The rigid boundaries of the grid-based approach may cause a real cluster to be divided into several small clusters. In this paper, we propose a sliding-window approach to partitioning the dimensions that preserves significant clusters. We call this model the nCluster model. The sliding-window approach generates more bins than the grid-based approach and thus incurs a higher mining cost. We develop a deterministic algorithm, called MaxnCluster, to mine nClusters efficiently. MaxnCluster uses several techniques to speed up the mining, and it produces only maximal nClusters to reduce the result size. Non-maximal nClusters are pruned without storing the discovered nClusters in memory, which is key to the efficiency of MaxnCluster. Our experimental results show that (i) the nCluster model can indeed preserve clusters that are shattered by the grid-based approach on synthetic datasets; (ii) the nCluster model produces more significant clusters than the grid-based approach on two real gene expression datasets; and (iii) MaxnCluster is efficient in mining maximal nClusters. © 2009 Wiley Periodicals, Inc.
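
The contrast between grid-based and sliding-window partitioning can be illustrated on a single dimension. The following is a minimal sketch only, not the nCluster model or the MaxnCluster algorithm: the abstract does not give their definitions, so the function names `grid_bins` and `sliding_window_bins`, the `width` parameter, and the rule "a window is a maximal run of sorted values whose spread is at most `width`" are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_bins(values, width):
    """Group values into fixed, non-overlapping cells [k*width, (k+1)*width)."""
    cells = defaultdict(list)
    for v in sorted(values):
        cells[int(v // width)].append(v)
    return [cells[k] for k in sorted(cells)]


def sliding_window_bins(values, width):
    """Return maximal runs of sorted values whose spread is at most width.

    Illustrative assumption: a window starts at a data value and extends
    as far right as the spread allows. Windows may overlap.
    """
    vals = sorted(values)
    windows = []
    end = 0        # exclusive right edge of the current window
    prev_end = 0   # right edge of the last window that was kept
    for start in range(len(vals)):
        end = max(end, start + 1)
        while end < len(vals) and vals[end] - vals[start] <= width:
            end += 1
        # A window that reaches no further right than the previous one
        # is a subset of it, so only strictly longer reaches are kept.
        if end > prev_end:
            windows.append(vals[start:end])
            prev_end = end
    return windows


if __name__ == "__main__":
    data = [8, 9, 10, 11, 12, 30]
    # The group 8..12 straddles the cell boundary at 10, so the grid splits it.
    print(grid_bins(data, 5))            # [[8, 9], [10, 11, 12], [30]]
    # A window anchored at the data keeps the group together.
    print(sliding_window_bins(data, 5))  # [[8, 9, 10, 11, 12], [30]]
```

In this toy setting the sliding windows can overlap (for example, `sliding_window_bins([1, 3, 5, 7], 4)` yields two windows that share the values 3 and 5), which is consistent with the abstract's remark that the sliding-window approach generates more bins and therefore costs more to mine.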

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/14517