Outlier detection in large high-dimensional data and its application in stock market surveillance

Publication Type:
Thesis
Issue Date:
2011
Outlier detection techniques play an important role in stock market surveillance, which involves the analysis of large volumes of high-dimensional trading data. However, outlier detection in large high-dimensional data is very challenging and is not well addressed by existing techniques. First, it is difficult to select useful and relevant features from high-dimensional data. Second, large high-dimensional data require more efficient algorithms. To address these issues, this thesis presents two outlier detection models and one subspace clustering model. First, an outlier mining model is proposed to detect outliers in multiple complex stock market datasets. To improve the efficiency of outlier detection, a financial model is used to select the features from which the multiple datasets are constructed. This model is able to improve the precision of outlier mining on individual measurements. Experiments on real-world stock market data show that the proposed model is effective and outperforms traditional techniques. Second, to find relevant features automatically, an agent-based algorithm is proposed to discover subspace clusters in high-dimensional data. Each data object is represented by an agent, and the agents move from one local environment to another to find optimal clusters in subspaces. Heuristic rules and objective functions are defined to guide the movements of the agents, so that similar agents (data objects) gather in the same group. The experimental results show that the proposed agent-based subspace clustering algorithm performs better than existing subspace clustering methods on both the F1 measure and entropy. The running time of the algorithm is scalable with respect to the size and dimensionality of the data. Furthermore, an application of the technique to stock market surveillance demonstrates its effectiveness in real-world applications. Finally, we propose a reference-based outlier detection model built on agent-based subspace clustering.
First, agent-based subspace clustering is used to generate clusters in subspaces. The cluster centers, together with their corresponding subspaces, are then used as references, and a reference-based model is employed to find outliers in the relevant subspaces. The experimental results on real-world datasets show that the proposed model identifies outliers in subspaces effectively and efficiently. In summary, this thesis investigates outlier detection techniques for high-dimensional data and their application in stock market surveillance. The proposed models are novel and effective, and they have shown their potential in real business settings.
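
The reference-based step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual model: it assumes the references (cluster centers paired with their subspaces) have already been produced by some subspace clustering step, and it assumes a simple scoring rule, namely the distance from a point to its nearest reference, measured only in that reference's subspace and normalized by the number of dimensions. The names `subspace_distance`, `outlier_scores` and `top_outliers` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A reference is a cluster center paired with the dimensions (the subspace)
# in which that cluster was found. Both are assumed to come from a prior
# subspace clustering step that is not shown here.
Reference = Tuple[Sequence[float], Sequence[int]]


def subspace_distance(point: Sequence[float],
                      center: Sequence[float],
                      dims: Sequence[int]) -> float:
    """Root-mean-square difference between point and center over dims only.

    Dividing by len(dims) keeps distances comparable across subspaces of
    different sizes (an assumption of this sketch).
    """
    squared = sum((point[d] - center[d]) ** 2 for d in dims)
    return math.sqrt(squared / len(dims))


def outlier_scores(data: Sequence[Sequence[float]],
                   references: Sequence[Reference]) -> List[float]:
    """Score each point by its distance to the nearest reference."""
    return [
        min(subspace_distance(p, center, dims) for center, dims in references)
        for p in data
    ]


def top_outliers(data: Sequence[Sequence[float]],
                 references: Sequence[Reference],
                 k: int) -> List[int]:
    """Return the indices of the k points with the highest scores."""
    scores = outlier_scores(data, references)
    ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

A point that lies close to some cluster center within that cluster's own subspace receives a low score, even if it is far away in dimensions irrelevant to the cluster. A point that is far from every reference in every relevant subspace is ranked as an outlier.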