Accumulating regional density dissimilarity for concept drift detection in data streams

In a non-stationary environment, newly received data may exhibit different knowledge patterns from the data used to train learning models. As time passes, the performance of these models becomes increasingly unreliable. This problem, known as concept drift, is a common issue in real-world domains. Concept drift detection has attracted increasing attention in recent years; however, hardly any existing methods pay attention to small regional drifts, and their detection accuracy may vary with the choice of statistical significance test. To address these problems, this paper presents a novel concept drift detection method based on regional density estimation, named nearest neighbor-based density variation identification (NN-DVI). It consists of three components. The first is a k-nearest neighbor-based space partitioning schema (NNPS), which transforms unmeasurable discrete data instances into a set of shared subspaces for density estimation. The second is a distance function that accumulates the density discrepancies in these subspaces and quantifies the overall discrepancy. The last component is a tailored statistical significance test by which the confidence interval of a concept drift can be accurately determined. The distance applied in NN-DVI is sensitive to regional drift and has been proven to follow a normal distribution. As a result, both the accuracy and the false alarm rate of NN-DVI are statistically guaranteed. In addition, several benchmarks, including both synthetic and real-world datasets, have been used to evaluate the method. The overall results show that NN-DVI performs better on concept-drift-detection-related problems.


Introduction
As technology advances, it has become increasingly easier to collect and organize data from different sources.
Details of daily life that were previously unavailable can now be acquired and stored on mobile devices [15]. These seemingly insignificant details convey an enormous amount of valuable information and follow certain patterns. When these details arrive with timestamps, they become streaming data. In the absence of outside interference, the patterns of a person's daily routine will not change. However, this assumption usually does not hold [36,44]. We live in an interactive world. People's daily routines are easily changed by special events, as well as by the impact of companies' or countries' new policies. Learning models trained to discover knowledge patterns have to consider pattern drifts as time shifts [24,39]. A wide range of machine learning problems need proper solutions to handle such a dynamic environment, for example, personal assistance applications that deal with information filtering, macroeconomic forecasts, bankruptcy prediction and individual credit scoring [15].
The term concept drift in the machine learning field refers to a phenomenon in knowledge patterns where the data distribution continues to change over time [41]. As concepts change, new data may no longer conform to the patterns induced from historical data [45], and such conflicts exert a negative impact on subsequent analysis tasks. More importantly, in real-world scenarios, these types of changes are barely perceptible [45,46]. For this reason, instead of assuming a stationary environment, an effective learning model must always be alert to concept drift, and must track and adapt to drifts quickly [18,28,47].
The root cause of concept drift is variation in the distribution of the underlying data between different time periods. So far, the various concept drift adaptation algorithms that have been developed can be divided into two categories [15,18,44,45]: 1) active drift detection and adaptive learning; or 2) passive online learning with a forgetting mechanism. Category 1) methods actively detect concept drifts at every time step and react after confirming a drift. They can be further divided into three subcategories [46]: a) data distribution-based drift detection, b) learner output-based drift detection and c) learner parameter-based drift detection. These detection algorithms are also called drift trigger techniques. In contrast, category 2) methods attempt to learn drifts incrementally with each new piece of arriving data; unlike active detection, which attempts to detect a drift at each time step, this strategy helps learning models adapt to new environments gradually [18,44].
In this paper, we focus on addressing the weaknesses in distribution-based drift detection algorithms. Since these algorithms directly address the root cause of concept drift and are capable of representing the corresponding confidence intervals, they have been reported as the most accurate drift detection methods [14,34,45,53]. Although this type of drift detection algorithm has made remarkable achievements, it still faces the following bottlenecks: 1) Regional drifts are not taken into consideration, and drift sensitivity is increased at the cost of increasing false alarms. Existing algorithms detect drifts in terms of the entire sample set, but do not consider any regional changes in sub-sample sets. As a result, the test statistics of regional drifts may eventually be diluted by stable regions, which decreases sensitivity [24]. Even if the algorithms can successfully capture a distribution drift caused by regional density inequality, they are not able to distinguish whether this drift is caused by a serious regional drift or a moderate global drift. 2) Existing distribution-based drift detection methods lack a tailored statistical significance test. For example, Dasu, et al. [14] used bootstrapping [17] for statistical analysis, and Shao, et al. [53] used the Wilcoxon test. From their experiments, we can see that different significance tests result in different performance outcomes. Statistical analysis is critical to drift detection accuracy, and an adequate explanation is indispensable to justify the relationship between significance tests and test statistics [34]. Therefore, to improve the sensitivity of drift detection and to propose a tailored significance test, this paper proposes a novel concept drift detection algorithm, called nearest neighbor-based density variation identification (NN-DVI). NN-DVI requires no prior knowledge of the data distribution and, instead, estimates the dissimilarity between data sets in terms of instances' neighbors.
Compared to other distribution-based drift detection algorithms, the proposed NN-DVI method demonstrates the following advantages:
- It is robust to one-dimensional data, as well as high-dimensional data.
- It is sensitive to concept drift caused by regional density changes and is robust to noise.
- The distribution of the proposed distance is proven theoretically, which provides a statistical bound to guarantee the number of false alarms.
- It can describe detected changes by highlighting suspicious regions, and has been tested in real-world applications.
The rest of this paper is organized as follows. In Section 2, the problem of concept drift is formally defined and some state-of-the-art drift detection and adaptation algorithms are introduced. Section 3 introduces the preliminaries.
Section 4 formally defines the proposed nearest neighbor-based density variation identification (NN-DVI) algorithm.
Section 5 discusses how to select a proper neighborhood to construct the data model. Section 6 details the settings for the experiments and outlines the performance of NN-DVI on both synthetic and real-world datasets. Section 7 concludes this study with a discussion of future work.

Problem Description : Concept Drift and Related Research Topics
Concept drift was first proposed by [52]. It is formally defined as follows: at a time step t, a batch of observations is given, denoted as S_t = {d_1, . . . , d_m}, where m is the batch size, d_i = (X_i, y_i) is one observation (or data instance), X_i is the feature vector, y_i is the label, and S_t follows a certain distribution F_t(X, y). Concept drift occurs whenever there is a statistically significant difference between two consecutive observation sets S_t, S_{t+1}, that is, F_t(X, y) ≠ F_{t+1}(X, y), denoted as ∃t : p_t(X, y) ≠ p_{t+1}(X, y) [21,43,44,56]. If we further decompose p_t(X, y) into two parts as p_t(X, y) = p_t(X) × p_t(y|X), we can say there are two sources of concept drift: one is the drift of p_t(X) along the time variable t, which can also be written as p(X|t); the other is the drift of p_t(y|X), the conditional probability of the label given the feature vector X [46]. Concept drift has been given different names, such as dataset shift [48,54] or concept shift [42,50,56]. Other related terminologies were introduced in [49]'s work. The authors propose that concept drift or shift is only one subcategory of dataset shift, and that dataset shift consists of covariate shift, prior probability shift and concept shift.
Moreover, they formally defined that: 1) dataset shift appears when the training and testing joint distributions are different, that is, ∃t : p_t(X, y) ≠ p_{t+1}(X, y); 2) covariate shift appears only in X → y problems, and the research focus is the drift in p_t(X) while p_t(y|X) remains unchanged; 3) prior probability shift appears only in y → X problems, and the research focus is the drift in p_t(y) while p_t(X|y) remains unchanged; and 4) concept shift (drift) focuses on the drift of p_t(y|X) while p_t(X) remains unchanged in X → y problems, and on the drift of p_t(X|y) while p_t(y) remains unchanged in y → X problems. These definitions clearly state the scope of each research topic. However, since concept shift (drift) is usually associated with covariate shift and prior probability shift, an increasing number of publications [21,40,44,45] use the term concept drift for the problem that ∃t : p_t(X, y) ≠ p_{t+1}(X, y).
Therefore, we use the term concept drift, instead of dataset shift, in this paper.
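The working definition ∃t : p_t(X, y) ≠ p_{t+1}(X, y) can be checked empirically by comparing two consecutive windows of a stream. A minimal sketch (our illustration, not a method from this paper) uses a two-sample Kolmogorov-Smirnov test on a one-dimensional feature via scipy's `ks_2samp`:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(window_t, window_t1, alpha=0.05):
    """Flag a drift in p_t(X) when a two-sample KS test rejects the
    hypothesis that both windows come from the same distribution."""
    _, p_value = ks_2samp(window_t, window_t1)
    return bool(p_value < alpha)

rng = np.random.default_rng(0)
no_shift = drifted(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
mean_shift = drifted(rng.normal(0, 1, 500), rng.normal(1, 1, 500))  # p_t(X) changed
```

A global test like this illustrates the baseline that later sections criticize: it compares whole windows, so a small regional change can be diluted below the significance threshold.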
These algorithms can effectively detect and adapt to changes in newly arrived data, once the changes are considered significant. However, if a drift happens in a local region and the overall changes are not considered significant, such changes will not be addressed. For example, in a spam filtering system, if a user changes her interest only in car-sales-related emails while all other topics remain stable, then, considering that car-sales-related emails comprise only a very small percentage of all her emails, this small change may not be considered significant with respect to the user's overall interests.
To address local regional drifts, the authors of some publications [5,7] applied a decision tree model to detect changes in the online error rate at each internal tree node so as to identify drifted nodes, and then updated those nodes respectively. The experimental results showed good performance in detecting and adapting to drifts. Similar algorithms for regression problems can be found in [20,28,29]. However, these algorithms are based on decision tree models, which makes it difficult to highlight drifted data instances. In addition, for some decision tree models, constraints may be applied when constructing tree nodes. For example, CVFDT normally requires observing 200 data instances before attempting to split a node [27]. If a regional drift occurs within those 200 data instances, the entire node will be updated; as a result, regional drifts in this area will not be identified.

Drift Detection via Non-Parametric Distribution Analysis
In probability theory, the distribution of data is commonly described by a probability density function (PDF).
Admittedly, if a data stream fits a certain distribution, the best way of tracking concept drift is to monitor the change in its PDF's parameters [33,43]. However, this presumption may not always hold. In most cases, it is impossible to abstract a PDF from given data. Compared to tracking the parameter changes of PDFs, mapping data sets to nonparametric models and detecting the difference between these models is a more flexible approach [33,43].
According to the literature, the first formal treatment of drift detection in data streams was proposed by [34]. In their study, they point out that the most natural notion of distance between distributions is the total variation, defined by TV(F_1, F_2) = 2 sup_E |F_1(E) − F_2(E)|, or equivalently, when the distributions have density functions f_1 and f_2, TV(F_1, F_2) = ∫ |f_1(x) − f_2(x)| dx. This provides practical guidance for the design of a distance function for distribution discrepancy analysis. Accordingly, [34] proposed a family of distances, called relativized discrepancy, denoted as Ø_A and Ξ_A. In addition, [34] presents the significance level of the distance according to the number of data instances. The bounds on the probabilities of missed detections and false alarms are theoretically proven using Chernoff bounds and the Vapnik-Chervonenkis dimension. However, experimental results show that the Ø_A and Ξ_A statistics only outperform other statistics on uniform and normal distributions, while on exponential, binomial and Poisson distributions they perform similarly to or even worse than the others. With regard to the data modelling stage, [34] does not propose any novel high-dimension-friendly data model. Instead, the authors stress that a suitable model choice is an open problem.
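For discrete (binned) distributions, the total variation above reduces to half the L_1 distance between the probability vectors. A small illustrative sketch (ours, not code from [34]):

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions:
    TV(p, q) = 0.5 * sum_i |p_i - q_i|, which lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

tv_same = total_variation([0.5, 0.5], [0.5, 0.5])      # identical -> 0.0
tv_disjoint = total_variation([1.0, 0.0], [0.0, 1.0])  # disjoint supports -> 1.0
```

The 0.5 factor matches the sup-over-events form: the largest |F_1(E) − F_2(E)| is attained by the event collecting all bins where p exceeds q.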
Later, [14,46] published another two distribution-based algorithms for concept drift detection. In [14], a kdqTree-based space partitioning algorithm was exploited to estimate data distributions empirically. Intuitively, each cell (a leaf node of the kdqTree) can be seen as a bin of the empirical distribution. By dividing the number of instances located in a cell by the total number of instances, an estimated empirical PDF (epdf) can be acquired. The authors then adopt the Kullback-Leibler divergence as the test statistic; their statistical significance test is a form of bootstrapping. In [46], the authors proposed an enhanced competence model and applied an empirical distance as the test statistic.
Their statistical significance test is a non-parametric permutation test. According to their experiments, the competence model achieved much better results. Nevertheless, the regional concept drift problem remains unresolved.
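The kdqTree cells of [14] can be imitated with ordinary histogram bins. The following sketch (a simplification: fixed 1-D bins instead of a kdqTree, with smoothing we chose for empty cells) estimates the empirical PDFs and their Kullback-Leibler divergence:

```python
import numpy as np

def empirical_kl(sample_a, sample_b, bins=10, eps=1e-6):
    """KL divergence between two empirical PDFs over shared bins;
    eps-smoothing avoids division by zero in empty cells."""
    lo = min(sample_a.min(), sample_b.min())
    hi = max(sample_a.max(), sample_b.max())
    pa, _ = np.histogram(sample_a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(sample_b, bins=bins, range=(lo, hi))
    pa = (pa + eps) / (pa + eps).sum()   # estimated epdf of sample A
    pb = (pb + eps) / (pb + eps).sum()   # estimated epdf of sample B
    return float(np.sum(pa * np.log(pa / pb)))
```

As in [14], a bootstrap over pooled samples would then decide whether the observed divergence is significant; that step is elided here.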

Preliminary
An important element of this research is multiset theory [11]. In contrast to classic set theory, multiset theory allows an element to occur more than once. The multiplicity operations involved in this paper are listed below.

Definition 2. [11] (Multiset Indicator Function) The indicator function I_A : X → N gives the number of occurrences of each element of A, where X is the set of unique elements in A.

These multiplicity functions provide the basic set of operators for our algorithm.

Nearest Neighbor-based Density Variation Identification
Data distribution-based concept drift detection has been reported as the most sensitive and convincing detection approach [45,46] because it directly addresses the root cause of concept drift and can represent the corresponding confidence interval. According to the literature, a typical distribution-based detection method consists of three components. The first component is a data representation model through which critical information is retrieved and irrelevant details are discarded. The second component is a dissimilarity function designed to measure the discrepancies between the data models. One of the most natural notions of distance between distributions is the total variation, or the L_1 norm [33]. The third component is a statistical significance test. Statistical significance, namely the p-value, is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. In drift detection, the null hypothesis is true when the detected discrepancies are not caused by concept drift. As a non-parametric test, a permutation test is a good option for empirically estimating statistical significance. An overall framework for distribution-oriented concept drift detection is summarized in Figure 1.
In this section, we formally present the proposed drift detection method (NN-DVI). NN-DVI consists of three parts. The first part is a data modeling approach which retrieves critical information and presents data sets as an abstracted model; this is explained in detail in Section 4.1. The second part is a distance function that accumulates regional density changes to quantify the overall discrepancy between data sets; Section 4.2 gives the definition and theorems related to this distance. The third part is a tailored statistical significance test for the distance. In Section 4.3, we further discuss the properties of the distance and theoretically prove its distribution. Finally, Section 4.4 provides the implementation details of NN-DVI.

FIGURE 2: A conventional space partitioning schema maps data instances into larger bins, and uses these bins to measure the similarity between datasets. Different bin sizes give different answers. For example, the bin size in Figure 2(c) is so small that each block contains only one instance; as a result, the intersection set of the green dots and the blue dots is empty. In Figure 2(d), however, the intersection set has 2 elements.

Modelling Data as a Set of High-Resolution Partitions
Space partitioning is defined as the process of dividing a feature space into two or more disjoint subsets. Any data instance in the feature space can then be identified as residing in exactly one of the subsets. When dealing with discrete datasets, it is often infeasible to directly estimate the similarity between data instances [14]. One of the most popular solutions is to accumulate the differences in empirical probability density across the divided non-overlapping subsets [14]. For example, a histogram is one of the most popular ways to assess the probability distribution of a given variable by depicting the frequencies of observations occurring in certain ranges of values (bins), where the bins are the disjoint subsets. In concept drift problems, if two sample sets are drawn from an identical distribution, every partition should have a similar number of data points, namely the empirical probability density (the number of data points in one partition divided by the total number of data points) in each partition will be similar. Otherwise, a statistically significant difference between their empirical probability densities will be found.
Currently, in related research, space partitioning methods directly contribute to the accuracy of drift detection [45].
As shown in Figure 2, if a partitioning schema cannot explicitly map similar items into the same partitions, like the 8×8 partitioning schema, the similarity between sample sets will always be zero. As a result, the detected density change will be invalid. By the same token, if a partitioning schema mistakenly maps non-similar items into the same partitions, the dissimilarity between testing sample sets will always be zero. A drift detection algorithm will lose its sensitivity to density drifts, especially when only a few data instances are available for evaluating the partitioning schema. Optimizing space partitioning methods for concept drift detection is still an open problem [34].
Motivated by this issue, we propose a novel nearest neighbor-based partitioning schema (NNPS). The fundamental idea behind NNPS is to find the minimum shared particles between instances, instead of the shared partitions to which instances belong. Rather than mapping a data instance into a lower granularity represented by partitions or bins, expanding the data points into hyperspheres preserves much more of each instance's details. For example, as shown in Figure 3(b), in a two-dimensional feature space, the expanded data points are indicated by the blue and green circles.

FIGURE 3: The proposed instance-oriented space partitioning schema aims to find the primary elements that constitute data instances, and then uses these elements to measure the similarity between datasets. In a 2-D domain, let us consider each data instance as a circle rather than a dot, and the entire domain as an n×n-pixel square, as shown in Figure 3(b). The pixels are then the primary elements which constitute the data instances, and each data instance can be represented by the set of pixels located inside its circle. Therefore, the similarity between instances can be estimated simply by counting the shared pixels located in the overlapping regions; for example, the similarity between d_1 and d_2 can be estimated from their overlapping pixels.

This partitioning schema performs well with two-dimensional data. However, it becomes increasingly complex as the dimensionality increases, because the intersecting regions between hyperspheres are difficult to calculate explicitly in high-dimensional space. Therefore, to further optimize our partitioning schema, we use a k-nearest neighbor model to replace the hypersphere model. An intuitive explanation for how a nearest neighbor model describes these instance particles is that closely located data instances have hidden connections, and such connections can be used as the most basic elements to constitute data instances.
As shown in Figure 4(d), unlike conventional partitioning schemas, which group instances into different partitions, NNPS slices an instance into several particles and uses these particles to constitute instances. An overview of the difference between NNPS and conventional partitioning schemas is shown in Figure 4(e). To overcome the bias caused by coarse-granularity mapping, NNPS extends instances into a finer granularity. This process is also called instance discretization or instance quantization. Compared to measuring similarities under a conventional space partitioning schema, applying shared instance particles passes more instance details to the sample sets, thereby making similarity measures more sensitive to small changes. The concept of instance particles is formally defined as follows.
FIGURE 4: As long as there is a connection between two data instances, we consider them neighbors. The datasets can then be presented as a set of connections, as shown in (c). If we consider each connection as one component of a data instance, namely a slice of an instance, then data instance d_1 is a composite of one slice of itself, one slice from d_2 and one slice from d_4, as shown in (d). Finally, (e) illustrates the difference between conventional partitioning methods and the NNPS.

The proposed NNPS breaks one data instance into a set of instance particles, thereby transforming each discrete data instance into a set of shared instance particles. As a result, the differences between data instances can be preserved for measuring the distance between sample sets.

Definition 9. (Instance Particle Group) Given an instance d_k ∈ D, the particle group of d_k is defined as the set of instance particles that constitute d_k.

Example 2. Referring to Example 1, the data instances belonging to D can be represented by their instance particles.

Definition 10. (Instances Particle Group) Given a sample set S_k ⊆ D, the particle group of S_k is defined as the union of the particle groups of its instances.

Example 3. Referring to Example 1, consider two given sample sets.

Since the data instances can be represented as sets of instance particles, this offers a higher resolution for calculating the distance between the sample sets. The next step is to even out the weight of the instance particles and the weight of the instances represented by instance particles. As shown in Example 2, the number of particles in a data instance may differ from instance to instance, |P(d_1)| = 2, |P(d_2)| = 3, |P(d_3)| = 3, |P(d_4)| = 2, resulting in inconsistencies between the instances' weights. To even out the weights of instance particles and instances simultaneously, we introduce the lowest common multiple (LCM), LCM({|P(d_j)| : ∀d_j ∈ D}), as the multiplicity function. A uniformly weighted data instance can then be represented by a multiset of particles.
Definition 11. (Instance Particle Set) Given an instance d_k ∈ D, the multiset of instance particles of d_k in terms of NNPS is defined by repeating each particle in the particle group of d_k with multiplicity Q/|P(d_k)|, where Q is the lowest common multiple of {|P(d_j)| : ∀d_j ∈ D}.
Example 4. Referring to Example 1, we have the multiset-represented data instances.

Remark. The multiplicity function can be generalized as LCM({w · |P(d_j)| : ∀d_j ∈ D}), where w ∈ N is the weight of instance d_j. If a data set is uniformly weighted, w equals one.
Example 5. Referring to Examples 1 and 3, we have the multiset-represented sample sets.

In summary, NNPS aims to extend instances into a finer granularity to provide a more detailed shared subspace for measuring similarity and dissimilarity. In NNPS, each data instance is a partition, and the hidden relationships between instances are the primary elements that lie in the regions. Compared to conventional "bin"-based partitioning methods, which only consider the similarity between data instances to be 1 (located in the same bin) or 0 (located in different bins), NNPS preserves the similarity between data instances and therefore becomes more sensitive to small discrepancies.
In addition, the datasets presented by NNPS are the accumulation of instance particles from every partition. As a result, the density discrepancies in partitions will also be accumulated and reflected in the similarity or dissimilarity measurement.
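A minimal sketch of the NNPS construction under our reading of the schema (an assumption, not the paper's exact algorithm): mark, in a boolean matrix, each instance together with its k nearest neighbors; the nonzero columns of row i then play the role of the instance particles composing d_i.

```python
import numpy as np

def nnps_matrix(X, k):
    """Row i marks instance i itself plus its k nearest neighbors;
    reading down column j shows how often particle p(d_j) is shared."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)       # self is added explicitly below
    M = np.eye(n, dtype=int)             # each instance contains a slice of itself
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            M[i, j] = 1                  # d_j contributes a particle to d_i
    return M

X = np.array([[0.0], [0.1], [5.0]])
M = nnps_matrix(X, k=1)                  # the two close points share particles
```

Note how the isolated point at 5.0 shares a particle with its (distant) nearest neighbor but receives none back; this asymmetry is what lets regional density changes accumulate in the particle counts.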

The Distance Measurement of Sample Sets in terms of NNPS
Since data instances and datasets are now represented by multisets, the distance between them can be quantified by set-related metrics in terms of Definitions 1-7. The distance function applied in this paper quantifies the number of differing instance particles between two given multisets, denoted as d_nnps.
Definition 13. (NNPS Dissimilarity Measurement) Given two sample sets A, B ⊆ D, the NNPS-based distance between them is calculated by accumulating the differences in the numbers of instance particles, and it can be normalized by dividing by the number of unique instance particles.

Example 6. Referring to Example 5, the d_nnps between the two sample sets can be computed accordingly.
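Under that definition, a hedged sketch of the distance follows (assuming uniform instance weights and a 0/1 particle matrix M such as a k-NN adjacency; normalizing by the combined particle count is our simplification of the unique-particle normalization):

```python
import numpy as np

def d_nnps(M, idx_a, idx_b):
    """Accumulate per-particle count differences between two samples,
    normalized so the resulting distance lies in [0, 1]."""
    c_a = M[idx_a].sum(axis=0)   # particle multiset of sample A
    c_b = M[idx_b].sum(axis=0)   # particle multiset of sample B
    return float(np.abs(c_a - c_b).sum() / np.maximum(c_a, c_b).sum())

# two instances sharing particles 0-1, two sharing particles 2-3
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
same = d_nnps(M, [0, 1], [0, 1])    # identical samples -> 0
apart = d_nnps(M, [0, 1], [2, 3])   # disjoint particle sets -> 1
```

Because the differences are summed particle by particle, a density change confined to a few partitions still contributes its full discrepancy rather than being averaged away, which is the regional-drift sensitivity argued for above.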

A Tailored Statistical Significance Test for d nnps
Measuring the difference between two sample sets is only one aspect of detecting concept drift. Another aspect is to provide adequate evidence of how likely such a difference would be if the given sample sets were drawn from an identical distribution. Null hypothesis testing, the so-called p-value approach, is widely used to address similar problems. In our case, the null hypothesis H_0 is: "such a d_nnps is likely to be observed if the two given sample sets are independently drawn from the same distribution". The smaller the probability (the p-value), the stronger the evidence against H_0.
The most intuitive way to achieve this goal is to use a Monte Carlo permutation test [16]. The achieved significance level (ASL), also known as the p-value, can be attained by observing all possible values of the test statistic. A permutation test makes no assumptions about the test statistic; however, as the sample size increases, it becomes infeasible to calculate the exact ASL, which can then only be roughly estimated, increasing the error rate.
Alternatively, if a given test statistic fits a known distribution, then the decision regions (to accept/reject the null hypothesis) can be obtained explicitly from its distribution function. In this research, according to Definition 13, we prove that the d_nnps of two i.i.d. sample sets fits a normal distribution. As a result, we can use maximum likelihood estimation to obtain the mean and variance of its distribution and determine the critical region with less Monte Carlo error.
The normality of d_nnps holds if the following conditions can be satisfied: the indicator is only related to elements in A. Consider a random variable I_k and express its distribution for each I_k. Because Condition 2 can be satisfied and the L_2 norm is greater than the L_3 norm when |D| → ∞, Var(I_k) satisfies the Lyapunov condition; hence, based on the Lyapunov central limit theorem, the sum converges to a normal distribution as |D| → ∞.

Remark. In this paper, we assume that the instances in the time windows are independent, so the constructed random variables I_k are independent.
Proof. Based on Lemma 1, the distance is the sum of |D| independent half-normal distributions. Thus, we need to verify whether the sum of half-normal distributions can satisfy the Lyapunov condition. We denote σ_i as the parameter of the i-th half-normal distribution; using integration by parts, we arrive at the required moment equations. Because the L_2 norm is greater than the L_3 norm when |D| → ∞, the condition holds.
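Given the normality result, the critical region can be estimated from distances measured on shuffled (i.i.d.) splits. A sketch of that final step (the shuffling itself is elided; `norm.ppf` from scipy supplies the quantile of the fitted normal):

```python
import numpy as np
from scipy.stats import norm

def drift_threshold(null_distances, alpha=0.01):
    """Fit N(mu, sigma^2) to d_nnps values from shuffled splits; a
    drift is declared when an observed distance exceeds the
    (1 - alpha) quantile of the fitted normal."""
    mu = float(np.mean(null_distances))
    sigma = float(np.std(null_distances))
    return float(norm.ppf(1.0 - alpha, loc=mu, scale=sigma))

# hypothetical null distances from shuffled splits
theta = drift_threshold([0.10, 0.12, 0.08, 0.11, 0.09], alpha=0.01)
```

Compared with a raw permutation test, only two parameters are estimated here, which is the "less Monte Carlo error" argument: the tail quantile comes from the fitted distribution rather than from rare extreme permutations.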

Stream Learning with Nearest Neighbor-based Density Variation Identification
In this section, we explain the implementation of NN-DVI from a computational perspective. The overall stream learning algorithm is given in Algorithm 1, which illustrates how NN-DVI is integrated with online learning models.
Then, the implementation of the NN-DVI and the calculation of d nnps are given in Algorithm 2 and Algorithm 3, respectively.
The proposed algorithm considers data stream learning under two scenarios. One is concept drift, in which a drift adaptation process has to be initiated to correct the entire system. The other is static online learning, which requires no intervention from an extra process. In the first scenario, the proposed algorithm applies a sliding window strategy to detect concept drift. The sliding window is initialized with training data or with the first w_min elements in the data stream, where w_min is an input parameter that decides the minimum drift detection window size. This process is implemented in Algorithm 1, lines 5-13, where win_fix is the fixed time window and win_slide is the sliding time window. If a concept drift is detected, the current time window, namely win_slide, becomes the representation of the current concept, and the training buffer (buff_train) takes the rest, as shown in lines 14-17. In contrast, lines 18-23 implement the second scenario. If no concept drift is detected, the newly arrived data are kept in the training buffer. Once the training buffer reaches the maximum allowed buffer size, an input parameter denoted as w_max, the oldest training instance is removed.
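The two scenarios can be sketched as follows (a simplification of Algorithm 1: the hypothetical `detect` callback stands in for nndviDriftDetection, and the learner update is omitted):

```python
from collections import deque

def stream_learn(stream, detect, w_min=20, w_max=200):
    """Grow a training buffer while no drift is seen (evicting the
    oldest point past w_max); when `detect` fires, the sliding window
    becomes the representation of the new concept."""
    buff = deque(maxlen=w_max)            # training buffer (buff_train)
    window = deque(maxlen=w_min)          # sliding window (win_slide)
    drift_points = []
    for t, x in enumerate(stream):
        window.append(x)
        buff.append(x)
        if len(window) == w_min and detect(list(buff), list(window)):
            drift_points.append(t)
            buff = deque(window, maxlen=w_max)  # window becomes the new concept
            window.clear()
    return drift_points

# toy detector: fire when the sliding window's mean exceeds 0.5
toy_detect = lambda buff, win: sum(win) / len(win) > 0.5
drifts = stream_learn([0.0] * 100 + [1.0] * 100, toy_detect)
```

Using `deque(maxlen=...)` makes both eviction rules one-liners: appending to a full deque drops the oldest element, matching the w_max behavior described above.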
According to Theorem 4, the core idea of nndviDriftDetection is to estimate the mean (µ) and variance (σ^2) of the normal distribution of d_nnps, then use the cumulative distribution function to compute the drift threshold θ_drift at a certain confidence level, namely the significance level α. Lines 1-4 of Algorithm 2 construct the NNPS, where I_{|D|×|D|} is a |D|-by-|D| identity matrix. Line 5 computes the actual distance between sample sets S_1 and S_2.
From line 6 to line 9, we shuffle S 1 and S 2 to create two new sample sets. The shuffling process will ensure the new sample sets are i.i.d. Finally, in lines 10 to 12, we compare the actual NNPS distance with the drift threshold and determine if a significant drift is observed.
Algorithm 3 is the pseudo-code of Definition 13. Lines 2 and 3 transform the NNPS matrix into sample sets represented by normalized multisets. The discrepancies are then accumulated, as shown in lines 4-6.

Information Granularity Indicator for NNPS
The selection of the number of nearest neighbors (the k value of kNN) is critical to controlling the resolution, and it is the only parameter of the NNPS. To select the best resolution for constructing the NNPS, we propose a novel discretization controlling method, called the instance particle independence coefficient, or simply independence, to quantify the granule information contained at different granularity levels.
In the context of the NNPS, we define the granule information indicator as how likely it is that a group of instance particles will appear together, namely the joint probability of a group of instance particles occurring. Intuitively, in the proposed data model, the primary elements of a data set are the instance particles, and the NNPS is constructed from these elements. If a group of such elements always appears together, with a joint probability equal to 1, it is impossible to detect drifts inside this group, as shown in Figure 5.
In other words, the granule information indicator of the proposed data model describes the average region size assumed to have no concept drift. From a practical point of view, investigating the joint probability of every combination is neither computationally friendly nor storage efficient. Alternatively, such indicators can be acquired through measuring the independence of every instance particle.

FIGURE 5: An example of grouped instance particles. Since p(d_2) and p(d_3) are always grouped together, the difference between d_2 and d_3 is equal to zero. If such a group is large enough, it will be impossible to identify whether there is a drift within them. Consequently, the sensitivity of the NNPS to concept drift will decrease.

The higher the probability that an instance particle will appear in a certain instance particle group, the less independence that instance particle has. To estimate the overall independence of a given instance particle p(d_i), we define the indicator of its independence as the average value of the conditional probability of its connected instance particles.

Definition 14. (Independence of Instance Particle) The independence indicator of a given instance particle is defined through the average conditional probability of its connected instance particles, where p(p_{d_k}, p_{d_i}) is the probability that p_{d_k} ∈ P(d_i) and p_{d_i} ∈ P(d_i).

Definition 15. (Independence of NNPS) The independence of the NNPS is defined as the average value of the instance particles' independence.

With the information granularity indicator, we can recognize and exploit the meaningful pieces of knowledge present in the data so that different features and regularities can be revealed. This provides theoretical guidance for selecting the k-value, namely max_k{independence}.
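One simplified, hypothetical reading of Definitions 14-15 can be computed from co-occurrence counts. The representation of "connected particles" as sets of particle ids observed together is our assumption, introduced only to make the averaging concrete.

```python
from itertools import combinations

def nnps_independence(observations):
    """Average independence over instance particles (illustrative).

    observations: list of sets of particle ids that appear together.
    A particle's independence is taken as one minus the average empirical
    conditional probability p(p_k | p_i) over the particles p_k it
    co-occurs with, so particles that always appear with the same group
    score low; the NNPS independence is the mean over all particles.
    """
    count, joint = {}, {}
    for obs in observations:
        for i in obs:
            count[i] = count.get(i, 0) + 1
        for i, k in combinations(sorted(obs), 2):
            joint[(i, k)] = joint.get((i, k), 0) + 1
            joint[(k, i)] = joint.get((k, i), 0) + 1
    scores = []
    for i in count:
        conds = [joint[(i, k)] / count[i]
                 for k in count if k != i and (i, k) in joint]
        if conds:
            scores.append(1.0 - sum(conds) / len(conds))
        else:
            scores.append(1.0)   # never co-occurs: fully independent
    return sum(scores) / len(scores)
```

Scanning k and keeping the value that maximizes this score mirrors the max_k{independence} selection rule.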

Experiments and Evaluation
The evaluation of NN-DVI for concept drift detection consists of four sections, comprising twelve experiments. The experiments aim to: 1) establish how d_nnps varies according to changes between two distributions; 2) investigate the impact of the selection of k on d_nnps; and 3) evaluate the sensitivity of d_nnps to regional drifts. We ran three experiments with generated artificial data sets following 1-D normal distributions and compared the results with the test statistics of the two-sample K-S test. All the results were calculated as the mean of 100 independent tests. The independence reached its maximum at k = 39, while k = 100 and k = 400 belong to the independence-decreasing stage, which means that the sensitivity of the distance is decreasing (Fig. 7). The corresponding test statistics are outlined in Fig. 8.
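The two-sample K-S baseline used throughout these experiments is a standard construction that can be reproduced with a short empirical-CDF comparison (a generic implementation, not taken from the paper):

```python
import bisect
import random

def ks_statistic(s1, s2):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    s1, s2 = sorted(s1), sorted(s2)
    d = 0.0
    for v in sorted(set(s1) | set(s2)):
        c1 = bisect.bisect_right(s1, v) / len(s1)
        c2 = bisect.bisect_right(s2, v) / len(s2)
        d = max(d, abs(c1 - c2))
    return d
```

For two samples of the same 1-D normal distribution the statistic stays small; a mean shift drives it towards 1.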
The d_nnps has the same pattern as the two-sample K-S test. For larger k values, the test statistic values are smaller (shown in blue). This is because larger k values produce a larger number of instance particles; therefore, the ratio between the intersection set and the union set decreases. Even though the test statistic means are different, this does not affect the discrepancy measurement, as long as the peak-valley difference between the data structures is correctly reflected.
By the same token, we reduced the batch size to 50 at each time step and selected the k with the highest independence. We show the two-sample K-S test statistics in Fig. 10. In this figure, the valley values are similar, while the peak values gradually decrease. Since the valley values are determined by two identical distributions, they are steady and smooth. The peak values, by contrast, are calculated from two distributions with an increasing σ. As a result, the difference between these distributions shrinks as σ increases. Intuitively, this is because the distribution becomes less concentrated, so the relative test statistic is smaller. The test statistics of d_nnps are shown in Fig. 11. It can be seen that d_nnps has the same pattern as the two-sample K-S test, indicating that d_nnps is capable of precisely representing differences in distribution.

FIGURE: 1-D normal distribution with regional drift detection. At each time step, we changed the mean (µ + 0.06) of a group of data instances and plotted the d_nnps between the current time step and the last time step. The proposed d_nnps reported a significant difference after 19 groups drifted, whereas the K-S test cannot detect such a difference.

Experiment 6.3. (Varying the regional mean µ_j). In this experiment, we compared the distances between two data samples that comprise 30 independent 1-D normally distributed data groups. In the first data group, the normal distribution was set as µ = 0.2, σ = 0.2. The remaining 29 data groups used the same σ with successively shifted means, and we intentionally left a (20 × σ) gap between each group to ensure that a change occurring in one group would not affect the others. Each group contained 100 data instances, equating to 3000 instances for each time step. In the first time step, we compared two data sample sets without any drift in any data group. Later, we manually changed the mean of each data group to µ_j + 0.06 successively for each time step to create 30 minor regional drifts.
Fig. 12 shows the test statistics for the two-sample K-S test and the d_nnps at different time steps. At t = 1, we compared 30 data groups with another 30 groups generated by the same distribution. At t = 2, we compared 30 data groups with another 29 + 1 groups where the mean µ_1 of group 1 was changed to µ_1 + 0.06, and so on until t = 31. It can be clearly seen that d_nnps increased after each local drift occurred. In fact, the K-S test statistics also increased slightly; however, as the value of the test statistic is too small, the trend cannot be observed in this figure.
In addition, to compare the test statistics, we also applied the corresponding significance test to examine whether the drifts were statistically significant. The p-value of the two-sample K-S test remained above 95% the entire time, while the p-value of d_nnps dropped below 5% after time step 19, which means that d_nnps can detect the simulated drifts after 19 of the 30 local drifts. This experiment demonstrates that the proposed d_nnps is sensitive to regional drifts.
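The benchmark data of Experiment 6.3 can be re-created roughly as follows. This is a hypothetical re-creation: the exact per-group means are our assumption, inferred from the stated (20 × σ) gap between groups.

```python
import random

def make_groups(n_groups=30, n_per_group=100, mu0=0.2, sigma=0.2,
                drifted=0, shift=0.06, seed=0):
    """Generate the regional-drift benchmark (illustrative re-creation):
    n_groups independent 1-D normal groups whose means are separated by
    a 20*sigma gap; the first `drifted` groups have their mean shifted
    by `shift` to simulate minor regional drifts."""
    rng = random.Random(seed)
    data = []
    for j in range(n_groups):
        mu = mu0 + j * 20 * sigma + (shift if j < drifted else 0.0)
        data.extend(rng.gauss(mu, sigma) for _ in range(n_per_group))
    return data
```

Because the shift (0.06) is tiny relative to the 4.0 separation between group centers, a global two-sample test barely reacts, which is exactly the regime where a regional measure is needed.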

Evaluating the NN-DVI Drift Detection Accuracy
In the above experiments, we have shown how d_nnps varies as the underlying distribution changes. To determine whether a given measurement is statistically significant as a drift, we compute the ASL as described in Section 4.3. Given a desired significance level α, we say there is a concept drift when ASL < α. In the following experiments, we compare NN-DVI with Dasu's [14] and Lu's [46] work in terms of drift detection accuracy on 10 synthetic drifting data streams. We chose these two algorithms for comparison because all three methods (including NN-DVI) detect concept drift based on the estimated data distribution and hold no assumptions about the distribution of the data. In addition, we also compared against one of the most popular two-sample tests for multivariate data, maximum mean discrepancy (MMD) [23], as the baseline. NN-DVI(permutation) denotes using the permutation test to estimate the ASL of NN-DVI, and NN-DVI(normal) denotes using the tailored significance test, which has been proven to follow a normal distribution, to calculate the ASL of NN-DVI.
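The permutation-based ASL can be sketched generically. Here `dist` stands in for d_nnps; this is the standard permutation test, not the paper's tailored normal approximation, and the function name is illustrative.

```python
import random

def permutation_asl(s1, s2, dist, n_perm=500, seed=0):
    """Achieved significance level (ASL) by permutation: the fraction
    of shuffled splits whose distance is at least the observed one."""
    rng = random.Random(seed)
    observed = dist(s1, s2)
    pooled = list(s1) + list(s2)
    n = len(s1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)               # break any real difference
        if dist(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    return hits / n_perm
```

Drift is declared when the returned ASL falls below the chosen α (1% in these experiments).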
For a fair comparison, we used the same experimental configurations and evaluation criteria as introduced in Dasu's and Lu's papers [14,46]. The evaluation criteria are: detected, late, false, and missed. A concept exists for a period of time [t+1, t+z], where z is the length of the concept. In this experiment, the distribution parameters are updated at t+1 and t+z+1; that is, a new concept starts at t+1 and ends at t+z. These criteria were introduced by Dasu et al. [14] and adopted by Lu et al. [46].
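The four criteria can be operationalized roughly as follows. The exact boundary rules are our reading of the protocol of Dasu et al., so treat the `tolerance` window and tie-breaking as illustrative assumptions.

```python
def score_detections(change_points, detections, tolerance, length):
    """Classify detections as detected/late/false and changes as missed
    (illustrative reading of the evaluation criteria): a detection within
    `tolerance` steps of a change counts as detected; within the
    concept's lifetime but past the tolerance, as late; any other
    detection is false; a change never detected in its lifetime is
    missed."""
    detected = late = false = 0
    claimed = set()
    for d in sorted(detections):
        prior = [c for c in change_points if c <= d]
        if prior:
            c = max(prior)                # most recent change before d
            if c not in claimed and d - c <= tolerance:
                detected += 1
                claimed.add(c)
                continue
            if c not in claimed and d - c <= length:
                late += 1
                claimed.add(c)
                continue
        false += 1                        # no (unclaimed) change explains d
    missed = len(change_points) - len(claimed)
    return detected, late, false, missed
```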
For each data stream, we generated 5,000,000 data instances, and for every 50,000 instances, the parameters of the corresponding distribution were varied randomly within a certain interval. These variations produced 99 controllable drifts for each data stream. For ease of comparison, we kept all the parameters the same for both the bootstrapping test and the permutation test, including the desired significance level of α = 1% and the bootstrapping and permutation size N = 500. KL's and CM's parameters were set as per their papers. The distance function used in this section is Euclidean distance, the same as the distance used by [14,46]. Because the permutation test holds no assumptions and can be applied to KL without modification, we also ran KL with a permutation test for comparison.
To avoid any bias in NN-DVI caused by parameter selection, we first conducted an experiment with a k-value chosen based on the average number of instances within the same partitioning size as KL and CM. Then we ran our algorithm with a dynamic k-value selected based on max_k{independence}, and applied the normal distribution estimation described in Section 4.3 as the significance test. In the M(·) streams, the features x_1 and x_2 followed independent normal distributions. The results are reported in Tables 1 and 2.
From the results, we can see that our method outperformed the others in most cases. In terms of the false and missed rates, only bootstrapped KL on the C(0.2) stream is slightly better than ours. With regard to the other streams, our results are much better. If we only consider drift detection based on permutation tests, no other method surpasses ours. According to the results, the excellent performance of KL with bootstrap is more likely attributable to the bootstrap test than to their drift detection model. Nevertheless, Dasu et al. [14] did not give any explanation for this in their paper. The corresponding results are summarized in Table 3.
From Table 3, NN-DVI clearly outperforms the others. Experiment 6.6. (Higher-dimensional distributions). To test the scalability and performance of NN-DVI in high-dimensional space, we extended the C(0.2) stream to three d-dimensional multivariate normally distributed streams as per [14,46]. Only the first two dimensions had a correlation value equal to ρ, and the remaining (d-2) dimensions were configured with correlation values equal to zero. Additionally, the standard deviation of all the added dimensions was configured to be the same as in the C(·) streams. These settings retain the same marginal distribution in all dimensions, so that the distance between instances contributed by each dimension is equally important [14,46]. As a result, it is easier to calculate the distance between instances, and the drift detection results are not affected by the selection of the distance function used to compare data instances.
The experimental results for different dimensions are listed in Table 4. As expected, as more stationary dimensions are added, the missed rates of all detection algorithms increase, implying that such a change makes drift detection harder. However, NN-DVI preserves much of its power as the number of dimensions increases. This can very likely be attributed to the k-nearest neighbor-based data representation model defined in Section 4.1.
To summarize, we calculated the average detected, late, false and missed rates across all synthetic streams, as shown in Table 5. It shows that the proposed NN-DVI delivers a notable improvement in drift detection in various situations.
Although the false rate of NN-DVI is slightly higher than that of the existing methods under the permutation test, the missed rate is almost halved. The performance of NN-DVI with the normal distribution as the significance test is even better: it shows higher accuracy with a lower false rate, which indicates that the tailored significance test introduced in Section 4.3 is more reliable. In addition, the complexity of data model construction can be reduced to O(kn log n) with a kd-tree data structure, while CM is O(n² log n) [46], KL is O(dn log(1/δ)), where d is the number of data dimensions and δ is the given parameter of KL [14], and MMD is O(n²) [23].
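The O(kn log n) construction cost rests on kd-tree-backed k-NN search. A minimal kd-tree sketch (illustrative, stdlib-only, not the paper's implementation) looks like this:

```python
import heapq

def build_kdtree(points, depth=0):
    """Minimal kd-tree: O(n log^2 n) naive build; each k-NN query is
    O(log n) on average, giving roughly the O(k n log n) model
    construction cost quoted for NN-DVI."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def knn_query(node, target, k, heap=None):
    """Collect the k nearest neighbours of `target`; the heap is a
    max-heap on negated squared distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], target))
    heapq.heappush(heap, (-d2, node["point"]))
    if len(heap) > k:
        heapq.heappop(heap)               # evict the current farthest
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    knn_query(near, target, k, heap)
    # Only descend the far side if the splitting plane is close enough.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        knn_query(far, target, k, heap)
    return heap
```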

Evaluating the NN-DVI on Real-world Datasets
To demonstrate how our drift detection algorithm improves the performance of learning models in real-world scenarios, we compared our detection method with two closely related algorithms, 1) KL [14] and 2) CM [46], on five benchmark real-world concept drift data sets. These datasets are the most-referenced benchmarks and include both low- and high-dimensionality data. In addition, we also compared NN-DVI with four state-of-the-art drift adaptation algorithms that address concept drift using different strategies. These algorithms are 3) SAMkNN [44], which keeps past concepts in memory and considers all past concepts to make the final prediction; 4) AUE2 [13], which uses ensemble methods to handle concept drift; 5) the HDDM family tests (HDDM-A, HDDM-W) [19], which detect drift by monitoring learner outputs; and 6) ADWIN-Volatility (ADW-Vol) [26], which first introduced volatility shift into concept drift detection. Finally, we use HAT [7] and ADWIN [6] as the baselines. The algorithms were implemented with the authors' recommended settings. The selection of the base learner for the HDDM family tests, KL, CM and NN-DVI is independent of the drift detection algorithm; therefore, we implemented these algorithms with the Naïve Bayes, Hoeffding Tree and IBk classifiers. Because SAMkNN is only compatible with IBk and AUE2 is only compatible with Hoeffding Tree, these two algorithms were only compared with their own base learners. All algorithms were implemented on the MOA platform [8], which is written in Java. The parameters of IBk were set as k = 5 with uniformly weighted instances. The distance function used throughout this section is Euclidean distance. The parameters of the Hoeffding Tree were set the same as those recommended by AUE2. Table 6 summarizes the classification accuracy and gives the overall rank of the different algorithms.
The average processing time in seconds and average memory usage (RAM-Hours in gigabytes) are also listed, since computational resource management is also an important issue for data stream learning [10]. For the Spam dataset [31], 500 attributes were retrieved using the chi-square feature selection approach, as in Katakis's work [31]. The Weather dataset contains 31% positive (rain) and 69% negative (no-rain) classes; it is summarized in [18] and is available at http://users.rowan.edu/polikar/research/NSE. Experiment 6.10. (Usenet1 and Usenet2) The Usenet datasets were first introduced by [32] and were used to evaluate the HDDM family algorithms [19]. They are two subsets of the 20 Newsgroups collection held by the UCI Machine Learning Repository. Each dataset consists of 1500 instances and 99 features. In these datasets, each instance is a message from a different newsgroup that is sequentially presented to a user, who then labels it as interesting or not interesting. The messages involve three topics: medicine, space and baseball. The user was simulated with a drifted interest every 300 messages. A more detailed explanation of these datasets can be found in [32].
To better demonstrate the improvements of NN-DVI on real-world problems, we used a ranking method similar to [44] to summarize the overall performance. The average rank shows that NN-DVI achieves the best performance with the IBk classifier and has the best final score, which is the average of all base classifiers' ranks.
The analysis results of Table 6 can be summarized as follows:
- The performance of the base learners varies on different data sets. For example, under the same drift adaptation algorithm, the IBk classifier always outperforms the other classifiers on the Weather data set, while it performs the worst on the Usenet1 and Usenet2 data sets. In some situations, the selection of the base learner may be more important than the selection of the drift detection method. This outcome inspired us to reconsider the learning strategy of keeping the same base classifier settings after a drift. Instead, changing the base learner or updating the base learner's parameters may further improve the overall prediction accuracy.
- The proposed NN-DVI method with IBk, NB and HTree has the best performance in terms of accuracy.
Considering the overall performance, it is evident that the highlighted results of NN-DVI do not depend on the base classifiers, and NN-DVI contributes to the learning process in a dynamic environment.
- Regarding computational complexity, it can be seen that the distribution-based drift detection algorithms (NN-DVI, CM, KL) have very similar computational costs, namely Time and RAM-Hours, for all three tested base learners. The reason is that these algorithms have to keep the representative data instances in memory to detect distribution changes. This appears to be a common issue of distribution-based drift detection methods.
Nevertheless, they show no manifest disadvantages compared with the other drift detection methods when using the IBk learner. Although error-rate-based and ensemble-based drift adaptation methods can handle concept drift in an online manner, using IBk as the base learner may slow down their learning speed.

Evaluating the Stream Learning with NN-DVI with Different Parameters
In general, the algorithm for stream learning with NN-DVI has six input parameters: 1) the minimum drift detection window size; 2) the base learner; 3) the distance function for constructing the NNPS; 4) the number of nearest neighbours for the NNPS; 5) the number of sampling iterations for estimating the σ of d_nnps; and 6) the drift significance level. Among these parameters, parameter 2) is evaluated in Section 6.3; parameter 4) is discussed in Section 5; parameter 5) relates to the Monte Carlo error, which has been discussed in detail in Section 5.2 of [46]; and parameter 6) is a drift sensitivity setting chosen by users that depends heavily on the system requirements. Therefore, in this section, we focus on evaluating the impacts of the parameters 1) minimum drift detection window size and 3) distance function for constructing the NNPS.
The rest of the parameters are fixed at their default values. Experiment 6.11. (Changing window size) As a critical parameter of time window-based drift detection algorithms, the window size or chunk size [53] controls the length of the most recent concept, namely win_slide in our algorithm.
On the one hand, if the window size is too small, the most recent concept may be incomplete and, as a result, false alarms will be raised. On the other hand, if the window size is too large, multiple concepts may be included in one window, and such mixed data will lead to poor performance. To evaluate how window size affects NN-DVI, we set w_min = 50, 75, 100, 150, 200, and plotted the classification accuracy on the five real-world datasets in Figure 13. Experiment 6.12. (Changing distance functions) To evaluate how the distance function affects NN-DVI on real-world datasets, we chose four other commonly used distance functions to replace Euclidean distance: Manhattan distance, Chebyshev distance, Jaccard distance, and radial basis function kernel distance. The overall classification accuracy of these distances with different window sizes is plotted in Figure 13.
It can be seen that as the window size increases, the average accuracy decreases, especially for the Usenet1 dataset. This may be caused by mixed concepts within the time windows. In contrast, the accuracy on the Spam and Weather datasets slightly increases with the window size. We consider that NN-DVI might trigger excessive false alarms with a small window size on these datasets; increasing the window size reduces the overall false alarms, so the accuracy increases. Regarding the distance functions, Euclidean and Manhattan have very similar performance and are the top two distances for the evaluated datasets, while Jaccard distance performs the worst throughout. Therefore, we recommend using Euclidean or Manhattan distance in most situations.
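The five distance functions compared here can be written compactly. Note that Jaccard distance on real-valued vectors has several conventions; the non-negative min/max (Ruzicka) form below is our assumption, and the RBF form uses the kernel-induced distance d² = k(a,a) + k(b,b) - 2k(a,b).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    # Min/max (Ruzicka) form; assumes non-negative feature values.
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 - num / den if den else 0.0

def rbf_kernel_distance(a, b, gamma=1.0):
    # Distance induced by the RBF kernel, where k(a,a) = k(b,b) = 1.
    k_ab = math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))
    return math.sqrt(2.0 - 2.0 * k_ab)
```

Swapping these into the NNPS construction is a one-line change, which makes the window-size/distance sweep of Figure 13 straightforward to reproduce.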

Conclusions and Further Work
In this paper, we analyzed distribution-based drift detection algorithms and summarized the fundamental components of this type of drift detection method. Under the guidance of the summarized framework (Fig. 1), we proposed a novel space partitioning schema, called NNPS, to improve the sensitivity of regional drift detection. Accordingly, a novel distance d_nnps and a nearest neighbor-based density variation identification (NN-DVI) algorithm are proposed.
This research also defines a k-nearest neighbor-based data discretization-controlling method to represent the granular information in data samples. Regarding the drift significance test, this study theoretically proved that the proposed measurement, d_nnps, follows a normal distribution. Based on this finding, an efficient and reliable critical interval identification method has been integrated. The experimental results show that NN-DVI can detect concept drift accurately on the synthetic datasets and is beneficial for solving real-world concept drift problems.
The innovation and contribution of this paper lie in discovering the relationships between distribution drifts and the changes in data instances' neighbors. A regional density-oriented similarity measurement is given. This measurement can effectively detect concept drift caused by either regional or global distribution changes.
Future research endeavors will aim to improve the current sliding-window method by proposing an online version of d_nnps, so that the performance of NN-DVI will not be limited by the window size. Another improvement may be achieved by parallelizing the calculation of d_nnps, through which the overall complexity can be further reduced. Finally, this paper is part of our work on adaptive learning for data stream mining. Successive instance selection methods and adaptive learning methods that can take advantage of drift detection are desired.