Dimensionality reduction and data clustering for large high-dimensional data

Zhao, Yanchang

Publication Type: Thesis
Issue Date: 2006

Closed Access

Filename	Description	Size	Format
01Front.pdf	contents and abstract	616.17 kB	Adobe PDF
02Whole.pdf	thesis	9.15 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhao, Yanchang
dc.date.accessioned	2016-10-18T05:32:23Z
dc.date.available	2016-10-18T05:32:23Z
dc.date.issued	2006
dc.identifier.uri	http://hdl.handle.net/10453/56159
dc.description	University of Technology, Sydney. Faculty of Information Technology.	en_AU
dc.description	NO FULL TEXT AVAILABLE. Access is restricted indefinitely. The hardcopy may be available for consultation at the UTS Library.
dc.description.abstract	NO FULL TEXT AVAILABLE. Access is restricted indefinitely. ----- Large high-dimensional data is a major application area of data mining. In many applications, the data volume is in Gigabytes or even Terabytes, and the dimensionality is hundreds or even thousands. Such large high-dimensional data bring big challenges to data mining. First, to discover patterns from large high-dimensional data in an acceptable time, the algorithms should be efficient and scalable. Algorithms with linear or sub-linear complexities are preferred. Second, the difference between the farthest point and the nearest point becomes less discriminating with the increase of dimensionality, so effective similarity measures, efficient dimensionality reduction methods and indexing techniques are to be designed. Moreover, clusters may only exist in subspaces for high dimensional data. Third, new data arrives continuously in many applications, which leads to the increase of data size (e.g., transactional data) or dimensionality (e.g., time series data), so incremental data mining methods are in demand to discover time-changing patterns efficiently. To attack the above issues brought by large high-dimensional data, this thesis presents four novel algorithms: two of them are designed for recent-biased dimensionality reduction for time series data and the other two for large high-dimensional data clustering. First, time series data are usually of very high dimensionality and the dimensionality increases with the incoming of new data. Moreover, in many applications, recent data are more important than old data, but most existing techniques for dimensionality reduction treat every part of a sequence equally. To capture this trend, this thesis constructs new recent-biased measures for distance and energy and proposes recent-biased dimensionality reduction. Then a new algorithm is designed for enhancing Discrete Wavelet Transform for recent-biased dimensionality reduction, and a novel generalized recent-biased framework is devised for supporting traditional techniques for online recent-biased dimensionality reduction. Second, for large high-dimensional data clustering, a new similarity measure, minimal subspace distance, is proposed to discover clusters in subspaces, based on which k-means algorithm is successfully adapted for discovering subspace clusters. Furthermore, for efficiently clustering large high-dimensional data, a new grid-density based clustering algorithm is proposed by adopting the strengths of both density-based and grid-based approaches, and the ideas of low-order neighbors and density compensation are designed to improve its efficiency and accuracy. Extensive experiments have been conducted on both synthetic and real-world data for the above algorithms, and have demonstrated that the proposed approaches are efficient and promising.	en_AU
dc.format	Thesis (PhD)
dc.language.iso	en_AU	en_AU
dc.rights	info:eu-repo/semantics/closedAccess
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.title	Dimensionality reduction and data clustering for large high-dimensional data	en_AU
dc.type	Thesis	en_AU
utslib.copyright.status	closed_access

Abstract:

NO FULL TEXT AVAILABLE. Access is restricted indefinitely.

Large high-dimensional data is a major application area of data mining. In many applications, the data volume is in gigabytes or even terabytes, and the dimensionality is in the hundreds or even thousands. Such large high-dimensional data pose major challenges to data mining. First, to discover patterns from large high-dimensional data in an acceptable time, algorithms should be efficient and scalable; algorithms with linear or sub-linear complexity are preferred. Second, the difference between the farthest point and the nearest point becomes less discriminating as dimensionality increases, so effective similarity measures, efficient dimensionality reduction methods and indexing techniques need to be designed. Moreover, in high-dimensional data, clusters may exist only in subspaces. Third, new data arrive continuously in many applications, which increases the data size (e.g., transactional data) or the dimensionality (e.g., time series data), so incremental data mining methods are in demand to discover time-changing patterns efficiently. To address these issues, this thesis presents four novel algorithms: two for recent-biased dimensionality reduction of time series data and two for clustering large high-dimensional data. First, time series data are usually of very high dimensionality, and the dimensionality increases as new data arrive. Moreover, in many applications, recent data are more important than old data, but most existing dimensionality reduction techniques treat every part of a sequence equally. To capture this trend, this thesis constructs new recent-biased measures for distance and energy and proposes recent-biased dimensionality reduction. A new algorithm is then designed to enhance the Discrete Wavelet Transform for recent-biased dimensionality reduction, and a novel generalized recent-biased framework is devised to support traditional techniques for online recent-biased dimensionality reduction. Second, for clustering large high-dimensional data, a new similarity measure, minimal subspace distance, is proposed to discover clusters in subspaces, and on this basis the k-means algorithm is adapted to discover subspace clusters. Furthermore, to cluster large high-dimensional data efficiently, a new grid-density based clustering algorithm is proposed that combines the strengths of density-based and grid-based approaches, and the ideas of low-order neighbors and density compensation are designed to improve its efficiency and accuracy. Extensive experiments on both synthetic and real-world data have demonstrated that the proposed approaches are efficient and promising.
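
The second challenge named in the abstract, that the nearest and the farthest point become harder to tell apart as dimensionality grows, can be illustrated with a small experiment. The Python sketch below is not taken from the thesis. It assumes uniformly random points in the unit hypercube and plain Euclidean distance, and the function name relative_contrast and its parameters are hypothetical, chosen only for this illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim.

    Assumption for illustration only: data are uniform in the unit hypercube.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same value
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # math.dist needs Python 3.8+
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast between farthest and nearest neighbour per dimensionality.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the contrast is large in two dimensions and falls well below 1 at a thousand dimensions, which is the effect the abstract cites as motivation for new similarity measures and dimensionality reduction. Real data sets are rarely uniform, so the rate of decline will differ in practice.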

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/56159