Dimensionality reduction and data clustering for large high-dimensional data
- Publication Type:
- Issue Date:
NO FULL TEXT AVAILABLE. Access is restricted indefinitely. ----- Large high-dimensional data is a major application area of data mining. In many applications, the data volume is in Gigabytes or even Terabytes, and the dimensionality is hundreds or even thousands. Such large high-dimensional data bring big challenges to data mining. First, to discover patterns from large high-dimensional data in an acceptable time, the algorithms should be efficient and scalable. Algorithms with linear or sub-linear complexities are preferred. Second, the difference between the farthest point and the nearest point becomes less discriminating with the increase of dimensionality, so effective similarity measures, efficient dimensionality reduction methods and indexing techniques are to be designed. Moreover, clusters may only exist in subspaces for high dimensional data. Third, new data arrives continuously in many applications, which leads to the increase of data size (e.g., transactional data) or dimensionality (e.g., time series data), so incremental data miming methods are in demand to discover time-changing patterns efficiently. To attack the above issues brought by large high-dimensional data, this thesis presents four novel algorithms: two of them are designed for recent-biased dimensionality reduction for time series data and the other two for large high-dimensional data clustering. First, time series data are usually of very high dimensionality and the dimensionality increases with the incoming of new data. Moreover, in many applications, recent data are more important than old data, but most existing techniques for dimensionality reduction treat every part of a sequence equally. To capture this trend, this thesis constructs new recent-biased measures for distance and energy and proposes recent-biased dimensionality reduction. Then a new algorithm is designed for enhancing Discrete ·wavelet Transform for recent-biased dimensionality reduction, and a novel generalized recent-biased framework is devised for supporting traditional techniques for online recent-biased dimensionality reduction. Second, for large high-dimensional data clustering, a new similarity measure, minimal subspace distance, is proposed to discover clusters in subspaces, based on which k-means algorithm is successfully adapted for discovering subspace clusters. Furthermore, for efficiently clustering large high-dimensional data, a new grid-density based clustering algorithm is proposed by adopting the strengths of both density-based and grid-based approaches, and the ideas of low-order neighbors and density compensation are designed to improve its efficiency and accuracy. Extensive experiments have been conducted on both synthetic and real-world data for the above algorithms, and have demonstrated that the proposed approaches are efficient and promising.
Please use this identifier to cite or link to this item: