Concept Drift Detection for Machine Learning with Stream Data

Gu, Feng

Publication Type: Thesis
Issue Date: 2019

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Gu, Feng
dc.date.accessioned	2020-04-22T00:22:05Z
dc.date.available	2020-05-25T19:16:02Z
dc.date.issued	2019
dc.identifier.uri	http://hdl.handle.net/10453/140165
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_AU
dc.format	Thesis (PhD)
dc.language.iso	en_AU	en_AU
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/140165/2/02whole.pdf
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.rights	au.edu.uts.lib/ppc
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Concept Drift Detection for Machine Learning with Stream Data	en_AU
dc.type	Thesis	en_AU
utslib.copyright.status	open_access

Abstract:

Machine learning on streaming data is often hindered by arbitrary changes in the data distribution. In particular, change in the classification boundary, known as concept drift, is the major cause of deteriorating model performance. Detecting concept drift accurately and efficiently remains challenging because of three inherent characteristics of stream data: non-stationarity, velocity, and the limited availability of true label data. Non-stationarity degrades the performance of pretrained models, and the high velocity of data generation demands highly efficient prediction algorithms for real-time applications. Existing drift detection methods rest on two theoretical foundations, two-sample distribution tests and monitoring of the classification error rate, and both have inherent limitations: an inability to distinguish virtual drift (changes that do not affect the classification boundary and therefore trigger unnecessary model maintenance), limited statistical power, or high computational cost. Furthermore, no existing detection method provides information about the trend of the drift, which could be invaluable for model maintenance. To better address concept drift, this thesis first proposes a novel detection method based on Neighbor Search Discrepancy (NSD), a new statistic that measures the difference in classification boundary between two samples. The method uses true label data to detect concept drift with high accuracy while ignoring virtual drift. It can also indicate the direction of the boundary change by identifying the invasion or retreat of a given class, which in turn indicates a change in separability between classes. To improve the efficiency of concept drift adaptation, this thesis then proposes two novel NSD-based instance selection methods: Decision Region Support Set (DRS) for concept drift detection and Decision Region Border Set (DRB) for classification.
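The virtual-drift limitation described above can be made concrete with a small sketch. This is not the NSD method, whose details are not given in this abstract. It is a generic one-dimensional illustration in which the concept, the classifier, the sample sizes, and all function names are assumptions chosen for the example. A two-sample Kolmogorov-Smirnov statistic on the inputs reacts to a shift in P(x) even though the class boundary is unchanged, while the error rate of a fixed classifier reacts only when the boundary itself moves.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def model(x):
    # Hypothetical pretrained classifier: learned a class boundary at x = 0.
    return 1 if x > 0 else 0


def error_rate(xs, boundary):
    # Fraction of points the fixed model gets wrong when the true boundary is `boundary`.
    return sum(model(x) != (1 if x > boundary else 0) for x in xs) / len(xs)


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Virtual drift (assumed setup): the input distribution shifts, the boundary stays at 0.
virtual_window = [rng.gauss(1.0, 1.0) for _ in range(500)]
ks_virtual = ks_statistic(reference, virtual_window)
err_virtual = error_rate(virtual_window, boundary=0.0)

# Real drift (assumed setup): the input distribution is unchanged, the boundary moves to 0.5.
real_window = [rng.gauss(0.0, 1.0) for _ in range(500)]
ks_real = ks_statistic(reference, real_window)
err_real = error_rate(real_window, boundary=0.5)

print(f"virtual drift: KS={ks_virtual:.3f}, error rate={err_virtual:.3f}")
print(f"real drift:    KS={ks_real:.3f}, error rate={err_real:.3f}")
```

In this setup the distribution test flags the window that needs no model maintenance and stays quiet on the one that does, whereas the error-rate monitor behaves correctly but needs true labels for every window.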
The unified DRS/DRB framework yields the reduced instances for both objectives simultaneously without additional computational overhead. The drift detection method detects concept drift efficiently without relying on resampling techniques. The reduction rule, based on neighbor search, estimates decision boundaries more accurately, resulting in improved classification accuracy. For scenarios where true label data is unavailable, this thesis proposes a novel distribution change detection method, Equal Density Estimation (EDE), based on estimating equal-density regions. It aims to overcome the instability and inefficiency of methods that rely on predefined space partitioning schemes. The method is general, nonparametric, and requires no prior knowledge of the data distribution. Finally, to detect concept drift without true label data, this thesis introduces a novel categorization of drift into maintainable and unmaintainable drift, which describes whether model maintenance is necessary in different scenarios. It then develops a drift detection algorithm based on Probability Percentile Discrepancy (PPD), which detects only maintainable drift without relying on true label data. In summary, this thesis targets a critical issue in modern machine learning research. Its approaches to building effective and efficient concept drift detection algorithms are novel and practical. There has been no previous study of the theories of neighbor search discrepancy and maintainable concept drift. The findings contribute to both scientific research and practical applications.
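The contrast between predefined and data-driven partitioning can be sketched in one dimension. The code below is not the EDE method, which this abstract does not specify. It is a generic equal-frequency histogram comparison, and the function names, bin count, and sample sizes are all assumptions made for illustration. Bin edges are taken from the reference sample so that each bin holds the same number of reference points, and a new window is then scored by how far its bin counts depart from equal.

```python
import bisect
import random


def equal_frequency_edges(reference, n_bins):
    """Interior cut points so that each bin holds about the same number of reference points."""
    ordered = sorted(reference)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def bin_counts(sample, edges):
    """Count how many points of `sample` fall into each bin defined by `edges`."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect.bisect_right(edges, x)] += 1
    return counts


def chi_square_vs_equal(counts):
    """Pearson chi-square statistic against equal expected counts in every bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
edges = equal_frequency_edges(reference, n_bins=10)

# Two hypothetical new windows: one from the same distribution, one with a shifted mean.
same_window = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted_window = [rng.gauss(0.5, 1.0) for _ in range(1000)]

stat_same = chi_square_vs_equal(bin_counts(same_window, edges))
stat_shifted = chi_square_vs_equal(bin_counts(shifted_window, edges))

print(f"same distribution:    statistic={stat_same:.1f}")
print(f"shifted distribution: statistic={stat_shifted:.1f}")
```

Because the edges follow the reference data, no bin starts out empty or overloaded, which is one reason data-driven partitions tend to be more stable than a fixed grid. Turning the statistic into a decision would still need a threshold or significance test, which this sketch leaves out.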

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/140165