Dimensionality reduction and data clustering for large high-dimensional data
- Publication Type: Thesis
- Issue Date: 2006
Closed Access
Filename | Description | Size
---|---|---
01Front.pdf | contents and abstract | 616.17 kB
02Whole.pdf | thesis | 9.15 MB
This item is closed access and not available.
Large high-dimensional data is a major application area of data mining. In many applications, the data volume reaches gigabytes or even terabytes, and the dimensionality reaches hundreds or even thousands. Such large high-dimensional data pose major challenges to data mining. First, to discover patterns from large high-dimensional data in acceptable time, algorithms must be efficient and scalable; algorithms with linear or sub-linear complexity are preferred. Second, the difference between the farthest point and the nearest point becomes less discriminating as dimensionality increases, so effective similarity measures, efficient dimensionality reduction methods, and indexing techniques need to be designed; moreover, in high-dimensional data, clusters may exist only in subspaces. Third, in many applications new data arrive continuously, increasing the data size (e.g., transactional data) or the dimensionality (e.g., time series data), so incremental data mining methods are needed to discover time-changing patterns efficiently.
To address these issues, this thesis presents four novel algorithms: two for recent-biased dimensionality reduction of time series data, and two for clustering large high-dimensional data. First, time series data are usually of very high dimensionality, and the dimensionality grows as new data arrive. Moreover, in many applications recent data are more important than old data, yet most existing dimensionality reduction techniques treat every part of a sequence equally. To capture this trend, the thesis constructs new recent-biased measures of distance and energy and proposes recent-biased dimensionality reduction. A new algorithm is then designed that enhances the Discrete Wavelet Transform for recent-biased dimensionality reduction, and a novel generalized recent-biased framework is devised that lets traditional techniques support online recent-biased dimensionality reduction. Second, for clustering large high-dimensional data, a new similarity measure, minimal subspace distance, is proposed for discovering clusters in subspaces; based on it, the k-means algorithm is successfully adapted to discover subspace clusters. Furthermore, for efficiently clustering large high-dimensional data, a new grid-density-based clustering algorithm is proposed that combines the strengths of density-based and grid-based approaches, and the ideas of low-order neighbors and density compensation are introduced to improve its efficiency and accuracy. Extensive experiments on both synthetic and real-world data demonstrate that the proposed approaches are efficient and promising.
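The recent-biased reduction idea can be illustrated in code. The abstract does not give the thesis's exact scheme, so the sketch below is a hedged illustration: it assumes a Haar wavelet and a simple policy that keeps exponentially more coefficients for more recent windows; `haar` and `recent_biased_reduce` are illustrative names, not the thesis's API, and the series length is assumed to be a multiple of a power-of-two window size.

```python
def haar(segment):
    """Full Haar wavelet decomposition of a power-of-two-length segment.
    Returns [overall average, coarsest details, ..., finest details]."""
    a = list(segment)
    details = []
    while len(a) > 1:
        avg = [(a[2 * i] + a[2 * i + 1]) / 2 for i in range(len(a) // 2)]
        det = [(a[2 * i] - a[2 * i + 1]) / 2 for i in range(len(a) // 2)]
        details = det + details  # prepend so coarser levels come first
        a = avg
    return a + details


def recent_biased_reduce(series, window=8):
    """Split the series (oldest first) into fixed-size windows, Haar-transform
    each, and keep more coefficients for more recent windows, so recent data
    are represented at finer resolution than old data.  Illustrative policy,
    not necessarily the one used in the thesis."""
    segments = [series[i:i + window] for i in range(0, len(series), window)]
    n = len(segments)
    reduced = []
    for j, seg in enumerate(segments):        # j = 0 is the oldest window
        keep = max(1, window >> (n - 1 - j))  # oldest window keeps fewest
        reduced.append(haar(seg)[:keep])      # keep coarse coefficients first
    return reduced
```

With `window=8` and three windows, the kept coefficient counts are 2, 4, and 8 from oldest to newest, so the most recent window is stored losslessly while older history is summarized coarsely.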
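The abstract does not define minimal subspace distance precisely. One natural reading, sketched below under that assumption, measures the distance between two points restricted to the m dimensions in which they agree best (the m smallest per-dimension differences); a k-means assignment step can then use it in place of Euclidean distance. The function names and the fixed subspace size `m` are illustrative, not the thesis's definitions.

```python
import math

def minimal_subspace_distance(x, y, m):
    """Distance between x and y restricted to the m dimensions in which they
    are closest -- a plausible reading of 'minimal subspace distance', not
    necessarily the thesis's exact definition."""
    diffs = sorted((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(sum(diffs[:m]))

def assign(points, centroids, m):
    """k-means assignment step using the subspace distance: each point joins
    the centroid it is closest to in its best-matching m-dimensional subspace."""
    return [min(range(len(centroids)),
                key=lambda c: minimal_subspace_distance(p, centroids[c], m))
            for p in points]
```

Note how the irrelevant dimension is ignored: a point that matches a centroid in two dimensions but differs wildly in a third can still be assigned to it, which is the behavior needed to find clusters that exist only in subspaces.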
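The grid-density idea and the low-order-neighbor optimization can likewise be sketched. Assuming "low-order neighbors" means the 2d face-adjacent cells rather than all 3^d - 1 surrounding cells (an assumption, since the abstract does not define the term), a minimal version hashes points into grid cells, marks dense cells, and merges face-adjacent dense cells; density compensation is omitted here.

```python
import math
from collections import defaultdict, deque

def grid_density_cluster(points, cell_size, min_density):
    """Grid-density clustering sketch: hash points into grid cells, call a
    cell dense if it holds at least min_density points, then merge dense
    cells that share a face into clusters.  A hedged illustration of the
    approach, not the thesis's algorithm."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / cell_size) for c in p)
        cells[key].append(p)
    dense = {k for k, pts in cells.items() if len(pts) >= min_density}

    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:  # breadth-first merge of adjacent dense cells
            k = queue.popleft()
            component.append(k)
            # low-order neighbors: only the 2d face-adjacent cells are
            # checked, instead of all 3^d - 1 surrounding cells
            for dim in range(len(k)):
                for step in (-1, 1):
                    nb = k[:dim] + (k[dim] + step,) + k[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append([p for cell in component for p in cells[cell]])
    return clusters
```

The low-order-neighbor restriction is what keeps the per-cell work linear in the dimensionality d, which matters precisely in the high-dimensional setting the thesis targets.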