Concept Drift Detection for Machine Learning with Stream Data

Publication Type: Thesis
Issue Date: 2019
Machine learning on streaming data is often hindered by arbitrary changes in the data distribution. In particular, change in the classification boundary, known as concept drift, is the major cause of deteriorating machine learning performance. Detecting concept drift accurately and efficiently remains challenging because of inherent limitations of stream data: non-stationarity, velocity, and limited availability of true label data. Non-stationarity degrades the performance of pretrained models, and the high velocity of data generation requires highly efficient prediction algorithms for real-time applications. The theoretical foundations of existing drift detection methods, two-sample distribution tests and monitoring of the classification error rate, both suffer from inherent limitations such as the inability to distinguish virtual drift (changes that do not affect the classification boundary and therefore trigger unnecessary model maintenance), limited statistical power, or high computational cost. Furthermore, no existing detection method can provide information about the trend of the drift, which could be invaluable for model maintenance. To better address concept drift problems, this thesis first proposes a novel concept drift detection method based on Neighbor Search Discrepancy (NSD), a new statistic that measures the classification boundary difference between two samples. The proposed method uses true label data to detect concept drift with high accuracy while ignoring virtual drift. It can also indicate the direction of the classification boundary change by identifying the invasion or retreat of a certain class, which is also an indicator of a change in separability between classes. To improve the efficiency of concept drift adaptation, this thesis then proposes two novel NSD-based instance selection methods: the Decision Region Support Set (DRS) for concept drift detection and the Decision Region Border Set (DRB) for classification.
The unified framework yields reduced instance sets for both objectives simultaneously without additional computational overhead. The drift detection method detects concept drift efficiently without relying on resampling techniques. The reduction rule based on neighbor search better estimates decision boundaries, resulting in improved classification accuracy. For scenarios where true label data is unavailable, this thesis first proposes a novel distribution change detection method, Equal Density Estimation (EDE), based on the estimation of equal density regions. The aim is to overcome the instability and inefficiency of methods that rely on predefined space partitioning schemes. This method is general and nonparametric, and it requires no prior knowledge of the data distribution. Finally, to detect concept drift without true label data, this thesis introduces a novel categorization of drift types, maintainable and unmaintainable drift, to describe the necessity of model maintenance in different scenarios. It then develops a drift detection algorithm based on Probability Percentile Discrepancy (PPD), which detects only maintainable drift without relying on true label data. In summary, this thesis targets a critical issue in modern machine learning research. The approaches it takes to building effective and efficient concept drift detection algorithms are novel and practical. There has been no previous study on the theories of neighbor search discrepancy and maintainable concept drift. The findings of this thesis contribute to both scientific research and practical applications.
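For orientation, the two-sample distribution test that the abstract names as one foundation of existing drift detectors can be sketched as follows. This is a minimal illustration of that baseline only, not of the NSD, EDE, or PPD methods proposed in the thesis. It assumes one-dimensional numeric features, a two-sample Kolmogorov-Smirnov statistic, and the standard asymptotic critical value; the function names `ks_statistic` and `distribution_changed` are hypothetical.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    for x in ref + cur:
        # Fraction of each window that is <= x.
        f_ref = bisect.bisect_right(ref, x) / len(ref)
        f_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def distribution_changed(reference, current, alpha=0.05):
    """Flag a change when the statistic exceeds the asymptotic critical value.

    Assumption: the large-sample approximation
    c(alpha) * sqrt((n + m) / (n * m)) is adequate for the window sizes used.
    """
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Toy windows: the second is the first shifted by half its range.
reference_window = list(range(100))
shifted_window = [x + 50 for x in reference_window]
print(distribution_changed(reference_window, reference_window))
print(distribution_changed(reference_window, shifted_window))
```

A test of this kind looks only at the feature distribution, so it would flag the shifted window even if the classification boundary were unchanged. That is the virtual drift limitation the abstract describes.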