A Hybrid Evolutionary Preprocessing Method for Imbalanced Datasets

Imbalanced datasets are commonly encountered in real-world classification problems. As many machine learning algorithms are originally designed for well-balanced datasets, re-sampling has become an important step in pre-processing imbalanced data. It aims at balancing a dataset by increasing the number of samples of the smaller class or decreasing the number of samples of the larger class, which are known as over-sampling and under-sampling respectively. In this paper, a sampling strategy based on both over-sampling and under-sampling is proposed, in which the new samples of the smaller class are created based on fuzzy logic. The dataset is then refined by the evolutionary computational method of Cross-generational elitist selection, Heterogeneous recombination and Cataclysmic mutation (CHC), which under-samples both the minority and majority samples. As a result, a hybrid preprocessing method is proposed to re-sample imbalanced datasets. The evaluation is done by applying the Support Vector Machine (SVM), the C4.5 decision tree and the nearest neighbor rule to train a classification model from the re-sampled training sets. The experimental results show that our proposed method improves both the F-measure and the AUC. The over-sampling rate and the complexity of the classification model are also compared. Our proposed method is found to be superior to all other methods under comparison, and is more robust across different classifiers.

Introduction
The classification of imbalanced datasets has been a popular topic in recent years [22], [27]. Most machine learning tools, such as neural networks and support vector machines, are originally designed for well-balanced datasets. If the dataset is imbalanced, the performance of the classifier can be poor. The reason is apparent: considering a dataset with 99% of the data from class A and only 1% from class B, the accuracy is 99% if the classifier ignores the data from class B and labels the whole dataset as class A. It is already very hard to achieve an accuracy above 99% with most learning algorithms.
However, the minority class is usually the more important and meaningful one. For example, in a medical problem there are far fewer samples of people with a particular disease than of healthy people. If a classifier is needed to label whether a person is infected or not, it is obvious that the minority class (people with the disease) is the class of interest.
Problems with imbalanced datasets can easily be found in the real world, such as intrusion detection [9], speech recognition [26], identification of power distribution fault causes [41], and bioinformatics problems [16]. There are two main approaches to solving the problems caused by imbalanced datasets: the data level approach and the algorithm level approach. The data level approaches [3], [8], [18], [28] balance the class distribution by over-sampling the minority class or under-sampling the majority class. The algorithm level approaches improve existing machine learning methods by adjusting the probabilistic estimate [38], modifying the cost per class [32], adding penalty constants [25], or learning from one class instead of two [35], [30].
Many experiments [12], [15], [42] show that re-sampling is a good data level approach for handling imbalanced data. Moreover, it is more flexible as it does not depend on the chosen classifier. Therefore, we focus on re-sampling in this paper. There are three main types of re-sampling strategies. The first is over-sampling, which can be done randomly or by the Synthetic Minority Over-sampling Technique (SMOTE) [8]. The second is under-sampling, which includes Tomek links [37] and the Neighborhood Cleaning Rule (NCL) [24]. The last is the hybrid method, which combines the two previous strategies.
The importance of designing sampling strategies, which may affect the successful learning of the different classes, has been discussed in [31]. Hybrid re-sampling methods reportedly have an advantage in treating datasets with a high imbalance ratio [3], [6]. Although some hybrid methods [3], [34], [40] have been proposed to reduce the over-generalization problem of over-sampling methods, most of them are based on SMOTE and their results may be limited by the synthetic samples of SMOTE. Therefore, a hybrid re-sampling method is proposed in this paper. Fuzzy logic, which is a useful tool for treating imbalanced datasets [12], is used to over-sample the minority class instead of SMOTE. A fuzzy rule base is formed from the samples of the minority class. Then, a rule is selected randomly with reference to the effectiveness of each rule, and the selected rule is used as the criterion to generate a new sample of the minority class. These steps are repeated until the sizes of the majority class and the minority class are the same.
However, a large over-sampled training dataset increases the complexity of the classification model and decreases the efficiency of the learning algorithm. It also causes over-generalization easily, especially for noisy datasets, because the decision boundary may become narrow or the overlapping area between the majority and minority classes may become large after over-sampling. Therefore, an evolutionary algorithm (EA) is applied to both the synthetic samples and the majority samples to under-sample the dataset. The chosen EA is the CHC (Cross-generational elitist selection, Heterogeneous recombination and Cataclysmic mutation) algorithm [11], since it showed the ability to select the most representative instances among the algorithms studied in [5].
Experiments are carried out to compare our proposed method with three SMOTE-extended over-sampling methods, four hybrid re-sampling methods and one under-sampling method: SMOTE, Safe-Level-SMOTE [4], Adaptive Synthetic Sampling [21], SMOTE+Tomek Links [3], SMOTE+Rough Set [34], SMOTE+CHC (sCHC) [40], agglomerative hierarchical clustering [10], and EUSCHC [14]. 44 imbalanced datasets from the UCI Repository [2] are used in the experiments. The Support Vector Machine (SVM) [7], the C4.5 decision tree [33], and the nearest neighbor rule (1NN) are used as the tools for building a classification model from each re-sampled dataset, so as to evaluate each re-sampling method. The evaluation measures are the F-measure and the area under the receiver operating characteristic curve (AUC). Although there exist many hybrid pre-processing methods, only some of them, like ours, consider and focus on the data size. In this paper, CHC is used to reduce the data size while achieving good performance. Additionally, the proposed method enhances the performance in the over-sampling stage by taking advantage of the fuzzy rule base. This paper is organized as follows: In Section 2, some preprocessing methods and CHC are reviewed. Section 3 presents the details of the proposed re-sampling strategy and the evaluation method. To show the effectiveness of our proposed approach, the comparisons with other methods and the results are discussed in Section 4. A conclusion is drawn in Section 5.

Previous Work
This section describes some previous work on re-sampling methods, which will be compared with our proposed method in the experiments later. The ideas behind CHC will also be discussed.

Re-sampling Methods
As discussed in the previous section, there are three main strategies for re-sampling data.

Over-sampling Methods
Over-sampling methods produce instances for the minority class to balance the class distribution. The simplest one is a non-heuristic method (random over-sampling) that replicates samples of the original minority class to generate the new instances. This method easily causes over-fitting, since the new instances are exact copies from the original minority class. The Synthetic Minority Over-sampling Technique (SMOTE) [8] is a well-known method which creates new instances by interpolating between nearby minority samples. For each minority class sample, synthetic samples are inserted along the line segments joining it to any/all of its k minority class nearest neighbors. An example with five nearest neighbors is shown in Fig. 1, where x_i is a selected sample of the minority class, x_i1 to x_i5 are the 5 nearest neighbors of x_i, and s_1 to s_5 are the synthetic samples created by interpolation. If the degree of over-sampling required is 300%, three synthetic examples are selected randomly from s_1 to s_5. Since the synthetic samples provide a less specific and larger decision region, the over-fitting problem can be reduced. However, this method may introduce minority synthetic samples into the area of the majority class where the minority class is very sparse with respect to the majority class. This causes the problem of over-generalization, which means the decision boundary is very narrow or there is a large overlapping area between the majority and minority classes. Therefore, some methods have been developed based on SMOTE to overcome this limitation, such as Borderline-SMOTE (sBorder) [19], Adaptive Synthetic Sampling (ADASYN) [21], Safe-Level-SMOTE (sSafe) [4], and SPIDERS [29].
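The interpolation step of SMOTE can be sketched as follows (a minimal illustration, assuming the k minority-class nearest neighbors of a sample have already been found; the function name `smote_sample` is ours, not from any reference implementation):

```python
import random

def smote_sample(x, neighbors):
    """Create one synthetic minority sample by interpolating between
    x and a randomly chosen minority-class nearest neighbor."""
    nb = random.choice(neighbors)
    gap = random.random()  # interpolation factor in [0, 1]
    return [xi + gap * (ni - xi) for xi, ni in zip(x, nb)]

# x_i with two of its minority-class nearest neighbors
x = [1.0, 2.0]
nbs = [[2.0, 2.0], [1.0, 3.0]]
synthetic = smote_sample(x, nbs)  # lies on a segment from x to one neighbor
```

Each synthetic point lies on the line segment between the chosen sample and one of its neighbors, which is why SMOTE enlarges the minority decision region rather than duplicating points.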

Under-sampling Methods
Under-sampling methods eliminate some instances of the majority class in order to balance the class distribution. The simplest is random under-sampling (RUS), which balances the dataset by randomly removing samples of the majority class. However, this method may easily remove useful data. Other representative methods include (i) the condensed nearest neighbor rule (CNN) [20], which eliminates the majority class samples that are distant from the decision border, (ii) Tomek links (TL) [37], which edits out noisy and borderline majority class samples, (iii) one-sided selection (OSS) [23], which integrates TL and CNN, and (iv) the neighborhood cleaning rule (NCL) [24], which is based on Wilson's Edited Nearest Neighbor Rule (ENN) [39] and removes the majority class samples that lead to misclassification.
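For intuition, a Tomek link is a pair of opposite-class samples that are each other's nearest neighbor; a brute-force sketch (our own illustration, not the paper's code) is:

```python
def tomek_links(samples, labels):
    """Return index pairs (i, j), i < j, that form Tomek links:
    mutual nearest neighbours carrying opposite class labels."""
    def d2(a, b):  # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    n = len(samples)
    # nearest neighbour of each sample over the whole dataset
    nn = [min((k for k in range(n) if k != i),
              key=lambda k: d2(samples[i], samples[k]))
          for i in range(n)]
    return [(i, nn[i]) for i in range(n)
            if nn[nn[i]] == i and labels[i] != labels[nn[i]] and i < nn[i]]

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [1.0, 0.0]]
lab = [0, 1, 0, 0]   # 0 = majority, 1 = minority
links = tomek_links(pts, lab)  # → [(0, 1)]
```

In TL-based cleaning, the majority member of each link (or both members, in hybrid variants such as SMOTE+TL) is then removed.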

Hybrid Methods
Although both over-sampling and under-sampling can balance the class distribution, they introduce different drawbacks, such as over-generalization and the removal of useful data. Therefore, some hybrid methods combine SMOTE with under-sampling as a data cleaning method to reduce these problems.
Example hybrid methods include SMOTE+Tomek links (sTL), which uses TL to remove samples of both classes to widen the area around the decision border, and SMOTE+ENN (sENN) [3], which uses ENN to remove the samples that are misclassified by their nearest neighbors. Rough set theory (sRST) [34] and an evolutionary algorithm (sCHC) [40] have also been applied after SMOTE to select samples so as to increase the classification accuracy. Most of the above hybrid methods make use of SMOTE to perform over-sampling. Clustering techniques have also been developed to perform under-sampling and over-sampling, such as agglomerative hierarchical clustering (AHC) [10].

CHC

CHC [11] is a kind of EA that combines a selection strategy with a highly disruptive recombination operator. To avoid premature convergence and maintain diversity, incest prevention and cataclysmic mutation are introduced. The process of CHC can be described as follows. Firstly, a population set of chromosomes P is created. Each chromosome p_i = (p_i1, p_i2, ..., p_in) is an n-dimensional vector, i.e. a set of genes, where p_ij is the jth gene value (j = 1, 2, ..., n) of the ith chromosome in the population (i = 1, 2, ..., m); m is the population size and n is the number of genes. Secondly, the chromosomes are evaluated by a defined fitness function, whose form depends on the application. Thirdly, an intermediate population set of chromosomes C, which is of the same size as P, is generated by copying all members of P in a random order.
Then, the half uniform crossover (HUX) operator is applied to C to form C′. HUX randomly exchanges half of the differing genes between each pair of chromosomes. CHC also uses an additional mechanism for incest prevention: before applying HUX to a pair of chromosomes, the Hamming distance between them is calculated. If half of that distance is larger than a difference threshold d, HUX is applied; otherwise these two chromosomes are deleted from C. Therefore, the size of C′ may be smaller than that of P or C. The initial threshold d is set to n/4. After C′ has been formed, it is evaluated by the fitness function and an elitist selection is applied: only the best chromosomes from both P and C′ are selected to form the population of the next generation. If the offspring population is the same as P, the difference threshold d is decreased by one.

CHC differs from the traditional genetic algorithm in that mutation is not performed at the recombination stage. Instead, CHC performs a partial reinitialization (divergence) when the search becomes trapped, i.e., when the difference threshold d reaches zero and no new offspring population is formed for several generations. The population is reinitialized based on the best chromosome, by changing a fraction of its elements' values randomly with a user-defined divergence rate D_rate. For example, if D_rate equals 0.35, the values of 35% of the elements are changed randomly. The search is then resumed with a new difference threshold d = D_rate × (1 − D_rate) × n. This process is called cataclysmic mutation.
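The recombination and incest-prevention steps above can be sketched as follows (a minimal illustration under the standard formulation of HUX, which swaps half of the *differing* genes; function names are ours):

```python
import random

def hamming(a, b):
    """Number of gene positions at which two chromosomes differ."""
    return sum(x != y for x, y in zip(a, b))

def hux(p1, p2):
    """HUX: exchange exactly half of the differing genes, chosen at random."""
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    random.shuffle(diff)
    c1, c2 = list(p1), list(p2)
    for i in diff[: len(diff) // 2]:
        c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def recombine(pairs, d):
    """Incest prevention: mate a pair only if half their Hamming
    distance exceeds the difference threshold d."""
    offspring = []
    for p1, p2 in pairs:
        if hamming(p1, p2) / 2 > d:
            offspring.extend(hux(p1, p2))
    return offspring

parents = [([0, 0, 0, 0], [1, 1, 1, 1])]
children = recombine(parents, d=1)   # distance 4, and 4/2 > 1, so HUX applies
```

With d = 2 the same pair would be rejected (4/2 is not greater than 2), which is exactly how the shrinking threshold gradually forbids mating of similar chromosomes.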
CHC has shown the ability to select the most representative instances among the algorithms studied in [5]. Therefore, it is chosen as the algorithm to improve the outcome of over-sampling in this paper.

Methodology
In this section, the proposed hybrid preprocessing method and the evaluation methods used in this paper are discussed. The proposed method involves two stages. First, the minority samples of the training sets are over-sampled based on fuzzy logic, using a fuzzy rule base (FRB). To improve the performance, CHC is then applied to reduce both the synthetic samples and the majority samples.

Fuzzy Rule Base (FRB)
In this paper, let the positive class be the minority class; only the λ training samples X_α of the positive class are considered, where X_α = (x_α1, ..., x_αγ) is a γ-dimensional vector, α = 1, 2, ..., λ, and x_αβ is the βth attribute value (β = 1, 2, ..., γ) of the αth training sample. The θth fuzzy if-then rule is written as follows:

Rule θ: IF z_1 is A^θ_1 AND ... AND z_γ is A^θ_γ THEN class = positive with w_θ,  (1)

where A^θ_β is the fuzzy term of the θth rule corresponding to the attribute z_β (β = 1, 2, ..., γ), z = (z_1, z_2, ..., z_γ) is a γ-dimensional attribute vector, and w_θ is the rule weight. The rule weight w_θ reflects the degree of matching of each fuzzy rule over all the positive samples, so that the importance of each rule can be evaluated.
First, the fuzzy value of each sample is calculated. The fuzzy value of X_α for the θth fuzzy rule is defined as:

μ_θ(X_α) = μ_{A^θ_1}(x_α1) × μ_{A^θ_2}(x_α2) × ... × μ_{A^θ_γ}(x_αγ),  (2)

where the product T-norm is used and μ_{A^θ_β}(·) is the membership function of the fuzzy term A^θ_β. The rule weight w_θ is calculated by adding the fuzzy values of all samples:

w_θ = Σ_{α=1}^{λ} μ_θ(X_α).  (3)
After the rule base of the positive class has been generated, rules are drawn randomly based on the rule weights: a rule with a higher weight has a higher probability of being chosen. Then, a new sample is generated within the area of the selected rule. These processes are repeated until the number of positive samples is the same as the number of negative samples.
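A simplified sketch of this over-sampling stage, as we understand it (five regular triangular terms per attribute as in the paper's setup; the function names and the clipping of the sampling region to the attribute range are our assumptions):

```python
import random

N_LABELS = 5  # five regular triangular fuzzy terms per attribute

def memberships(v, lo, hi):
    """Membership of v in each regular triangular term on [lo, hi]."""
    if hi == lo:
        return [1.0] + [0.0] * (N_LABELS - 1)
    half = (hi - lo) / (N_LABELS - 1)            # half-width = center spacing
    centers = [lo + i * half for i in range(N_LABELS)]
    return [max(0.0, 1.0 - abs(v - c) / half) for c in centers]

def build_rules(minority):
    """One candidate rule per sample: per attribute, the term with the
    highest membership.  Rule weight = sum over all minority samples of
    the product T-norm of the memberships (eqs. (2)-(3))."""
    dims = len(minority[0])
    bounds = [(min(s[b] for s in minority), max(s[b] for s in minority))
              for b in range(dims)]
    rules = {}
    for s in minority:
        key = tuple(max(range(N_LABELS),
                        key=lambda l: memberships(s[b], *bounds[b])[l])
                    for b in range(dims))
        rules.setdefault(key, 0.0)
    for s in minority:
        for key in rules:
            mu = 1.0
            for b in range(dims):
                mu *= memberships(s[b], *bounds[b])[key[b]]
            rules[key] += mu
    return rules, bounds

def generate(rules, bounds, n):
    """Draw rules with probability proportional to weight and sample
    uniformly inside each selected term's support."""
    keys, weights = list(rules), [rules[k] for k in rules]
    out = []
    for _ in range(n):
        key = random.choices(keys, weights=weights)[0]
        sample = []
        for b, l in enumerate(key):
            lo, hi = bounds[b]
            half = (hi - lo) / (N_LABELS - 1) if hi > lo else 0.0
            c = lo + l * half
            sample.append(random.uniform(max(lo, c - half), min(hi, c + half)))
        out.append(sample)
    return out
```

Because the sampling region of each rule is bounded by the minority class's own attribute ranges, the synthetic samples stay within the spread of the original positive samples.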
To illustrate the idea more clearly, Fig. 2 shows an example with two attributes. Ten rules in total can be formed in this example; their rule weights are 0.897, 1.147, 1.508, 1.230, 2.344, 1.607, 0.727, 1.319, 1.731 and 1.399 for rules 1 to 10 respectively. For generating the synthetic samples, one of these ten rules is chosen, with the probability of selection depending on the rule weight. The chosen rule then sets the criteria, i.e. the highest and lowest value of each attribute, and the new sample is generated randomly within these criteria. This process is repeated until the number of positive class samples is the same as that of the negative class. Fig. 3 shows the sample distribution after over-sampling, where the triangle dots represent the synthetic samples. It is found that the spread of the synthetic samples is similar to that of the original positive samples (shown as the square dots). The synthetic samples in Fig. 3 are dense in the area of rule 5.

Setting of CHC
After the over-sampling, the number of minority samples is the same as that of majority samples, and CHC is then applied. Two important issues need to be addressed before the algorithm is employed: the representation of each chromosome and the definition of the fitness function. Fig. 4 shows the block diagram of the process of FRB+CHC.

Chromosome Representation
CHC is used to reduce the synthetic samples as well as the majority class samples. Therefore, the chromosomes represent subsets of these samples, using a binary representation. Each chromosome is an n-dimensional vector, where n is the number of synthetic samples plus the number of majority class samples. Each vector element indicates whether the corresponding sample exists in the subset of the training set: if the value is 1, the corresponding sample is included in the subset; if the value is 0, it is not.
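In other words, a chromosome is just a 0/1 mask over the pooled synthetic and majority samples (a small illustration with names of our own choosing):

```python
import random

def decode(chromosome, candidates):
    """Return the subset of candidate samples whose gene equals 1."""
    return [s for s, g in zip(candidates, chromosome) if g == 1]

candidates = [[0.1], [0.2], [0.3], [0.4]]   # synthetic + majority samples
chromosome = [1, 0, 1, 0]
subset = decode(chromosome, candidates)     # → [[0.1], [0.3]]

# a random initial chromosome for the CHC population
init = [random.randint(0, 1) for _ in candidates]
```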

Fitness function
In this study, the k-NN classifier is used as the evaluation method of CHC to obtain the subset with the highest classification rate. Normally, accuracy (the ratio of correctly classified samples to the total number of samples) would be used as the measure of the classification rate. However, it may cause difficulty for imbalanced datasets, since the correct classification rate of the majority samples may affect the accuracy more significantly than that of the minority samples. Therefore, some other measures, commonly employed to analyze problems with imbalanced datasets, are used in this paper.
Firstly, precision and recall are introduced [17]. Their definitions are given as follows:

Precision = TP / (TP + FP),  (4)

Recall = TP / (TP + FN),  (5)

where TP is the number of true positives, FP is the number of false positives and FN is the number of false negatives. A high value of precision indicates that the predicted positive samples are most likely relevant. A high value of recall indicates that most of the positive samples can be predicted correctly.
A popular evaluation metric for imbalanced problems is the F-measure [17], which is a function of precision and recall. In principle, the F-measure represents the harmonic mean of precision and recall. A high value of the F-measure means that both the precision and recall values are high and do not differ very much.
It is an important measure for imbalanced datasets, since a high value implies that the method classifies the positive samples correctly at a high rate with few misclassified negative samples. It is defined as follows:

F-measure = 2 × Precision × Recall / (Precision + Recall).  (6)

The area under the receiver operating characteristic curve (AUC) is also commonly used to measure the performance of classification. The AUC measure [13] is the probability of correctly identifying a random sample, and it can be defined as follows:

AUC = (1 + Recall − FP_rate) / 2,  (7)

where Recall is defined in (5) and FP_rate = FP / (FP + TN) is the false positive rate, with TN the number of true negatives. When comparing two chromosomes in the fitness evaluation, the one with higher values of both F-measure and AUC ranks higher. If chromosome X has a higher F-measure but a lower AUC than chromosome Y, the differences |F_X − F_Y| and |A_X − A_Y| are compared: if the F-measure difference is larger, chromosome X will be regarded as the better one; otherwise chromosome Y will be regarded as the better one. The above setting is also applied in sCHC for the comparison in Section 4.
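The measures in (4)-(7) can all be computed from the four confusion-matrix counts; a small helper of our own for checking the arithmetic:

```python
def metrics(tp, fp, fn, tn):
    """F-measure and single-operating-point AUC from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                                    # eq. (5)
    f_measure = 2 * precision * recall / (precision + recall)  # eq. (6)
    fp_rate = fp / (fp + tn)                                   # false positive rate
    auc = (1 + recall - fp_rate) / 2                           # eq. (7)
    return f_measure, auc

f, a = metrics(tp=8, fp=2, fn=2, tn=88)   # precision = recall = 0.8
```

With precision and recall both 0.8 the F-measure is 0.8, while the AUC also benefits from the large TN count, illustrating why the two measures can rank methods differently.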

F-measure and AUC measures
To show the performance of our proposed method, the F-measure in (6) and the AUC in (7) are used. The main drawback of over-sampling and hybrid sampling methods is that the number of training samples increases greatly, which may increase the complexity of the learning model. Therefore, the over-sampling rates of the different methods are also compared. Define:

Over-sampling rate = (N_sampled − N_original) / N_original × 100%,  (8)

where N_sampled is the number of samples in the re-sampled training set and N_original is the number of samples in the original training set. The over-sampling rate in (8) shows the rate of increase of the number of training samples. When a support vector machine is used to form the classification model, the rate of increase of the number of support vectors can be used to evaluate the complexity of the learning model.
This rate is calculated from the support vectors generated:

Increase rate of SVs = (SV_sampled − SV_original) / SV_original × 100%,  (9)

where SV_sampled is the number of support vectors trained from the re-sampled training set and SV_original is the number of support vectors trained from the original training set.
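Both rate definitions (8) and (9) share the same form; a one-line helper (ours) makes the sign convention explicit:

```python
def increase_rate(n_sampled, n_original):
    """Percentage increase in size after re-sampling, as in (8) and (9);
    negative values indicate the re-sampled set shrank."""
    return (n_sampled - n_original) / n_original * 100.0

over_sampling_rate = increase_rate(900, 500)   # 80.0: the set grew by 80%
sv_increase = increase_rate(40, 50)            # -20.0: fewer support vectors
```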

Experimental Study
In this section, we present the experiments that are carried out to compare our proposed method with other hybrid sampling methods and the CHC under-sampling method. The datasets used can be found in the UCI Repository [2].
The experiments involve different kinds of hybrid methods, including SMOTE, ADASYN, sTL, sSafe, sRST, sCHC, AHC and our proposed method, which is named Fuzzy Rule Base+CHC (FRB+CHC). CHC used as an under-sampling method [14] (EUSCHC) is also compared in the experiments. To measure the performance of the preprocessing methods, the same learning tool should be applied in all the experiments. In this study, three different tools are used: the Support Vector Machine (SVM), the 1 Nearest Neighbor rule (1NN), and the C4.5 decision tree. The programs of all testing methods and the learning tools are based on KEEL, an open source software available on the Web [1].
The F-measure and AUC are used as measures to analyze the results, and the average values of these measures for each method are calculated. As the expansion of the re-sampled training datasets may increase the computational time and the complexity of the classification model, the over-sampling rate and the number of support vectors formed by SVM are also compared.

Datasets
To study the methods on different datasets, 44 datasets with different imbalance ratios (IR) are chosen. The IR is the ratio of the number of majority class samples to the number of minority class samples. Table 1 shows the details of the selected datasets: the number of samples (N samp.), the number of attributes (N attr.), the distribution of the minority and majority classes, and the IR of each dataset.

Setup of Experiment
For over-sampling, the rules of the minority samples are associated with regular triangular membership functions with five fuzzy terms. For CHC, the values of the parameters are:
• Population size: 50.
• k of the k-NN classifier used for evaluation: 1.
In this paper, SVM, 1NN, and C4.5 are used to weigh the influence of each preprocessing method. For SVM, a radial basis function (RBF) is used as the kernel, since a non-linear classification model is needed and the RBF is a common kernel for this purpose. The RBF is defined as follows:

K(x, y) = exp(−‖x − y‖² / (2σ²)),

where σ > 0 is the parameter that determines the width of the radial basis function. The AUC values of SMOTE, sTL, sSafe, sRST and sCHC are similar since they all use SMOTE to perform over-sampling. ADASYN obtains the lowest average values of the F-measure, which means its precision is low and the difference between its precision and recall is large.
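The kernel computation itself is one line; a sketch of the Gaussian RBF (the 2σ² parameterisation is our assumption, as the text only states that σ controls the width):

```python
import math

def rbf_kernel(x, y, sigma=0.01):
    """Gaussian RBF: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

k_same = rbf_kernel([0.3, 0.7], [0.3, 0.7])   # identical points → 1.0
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0])    # distant points → near 0
```

With a small σ such as 0.01 the kernel decays very quickly with distance, which is why small widths can over-fit the training data.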
In this experiment, the performance of FRB and FRB+CHC is very similar, which shows the advantage of FRB over the other hybrid or over-sampling methods. However, the data size will be very large if FRB alone is used as the preprocessing method. FRB+CHC can reduce the data size without a large effect on the performance. Therefore, only FRB+CHC is considered in the following section.
Table 4 shows the average rankings by means of the F-measure and AUC using Friedman's method [36]. The highest value on each dataset is ranked as 1. If a certain method obtains the rankings 3, 6, 2, and 1 on four datasets, its average ranking is (3 + 6 + 2 + 1)/4 = 3. Therefore, a lower average ranking indicates that the corresponding method performs better among the compared methods. FRB+CHC obtains the best ranking by AUC and sCHC obtains the best ranking by F-measure. Note that the highest average value of AUC or F-measure does not imply the best ranking, since the ranking reflects the comparison among all methods on each dataset. For example, EUSCHC has the lowest average AUC values but its ranking is better than that of ADASYN. Since EUSCHC is an under-sampling method, it easily ignores some useful samples of the majority class.
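The average-ranking computation described above can be sketched as follows (ties are ignored for simplicity; a full Friedman test would assign tied methods their mean rank):

```python
def average_rankings(scores_per_dataset):
    """scores_per_dataset: one dict {method: score} per dataset, where a
    higher score is better (rank 1).  Returns the mean rank per method."""
    methods = list(scores_per_dataset[0])
    totals = {m: 0.0 for m in methods}
    for scores in scores_per_dataset:
        ordered = sorted(methods, key=lambda m: -scores[m])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: t / len(scores_per_dataset) for m, t in totals.items()}

auc = [{'A': 0.90, 'B': 0.80, 'C': 0.10},
       {'A': 0.50, 'B': 0.80, 'C': 0.10}]
ranks = average_rankings(auc)   # A: 1.5, B: 1.5, C: 3.0
```

Note how method A can have the higher mean score yet tie with B in average rank, mirroring the remark that the highest average value does not imply the best ranking.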

Conclusion
A hybrid re-sampling method based on both over-sampling and under-sampling has been proposed. The new synthetic samples of the minority class are generated based on fuzzy logic. To minimize the size of the datasets, CHC has been employed over the new samples and the majority samples as a cleaning method for the over-sampled training set.
The proposed sampling method (FRB+CHC) is compared with SMOTE, ADASYN, sTL, sSafe, sRST, sCHC, EUSCHC, and AHC on 44 datasets. To evaluate the performance of these nine sampling methods, the same SVM classifier has been used to obtain the experimental results. It is shown that FRB and FRB+CHC outperform the other sampling methods on both the F-measure and the AUC. FRB shows its advantage as an over-sampling method; if the data size is not a concern, FRB is the better choice of pre-processing method. FRB+CHC obtains the best ranking by means of AUC. FRB+CHC and sCHC have similar performance in the F-measure, which indicates that CHC is a good choice of data cleaning method. The AUC results of SMOTE, sTL, sSafe, sRST, and sCHC are similar since all of them use SMOTE to perform over-sampling. To show the advantages of the proposed method, the over-sampling rate and the number of support vectors formed by SVM are also compared for the different methods.
In addition, the C4.5 and 1NN classifiers are used, and FRB+CHC shows robust behavior across different classifiers. FRB+CHC achieves good results under the above criteria, which reflects that it strikes a good balance between accuracy and over-sampling rate. It also has a low impact on the complexity of the learning model. The major reason is that CHC selects only the samples that increase the performance on the datasets, without considering the locations of the samples. Therefore, the most representative samples are selected to form the training sets.

Here z = (z_1, z_2, ..., z_γ) is a γ-dimensional attribute vector and w_θ is the rule weight. Regular triangular membership functions are used for the fuzzy terms. In this paper, the fuzzy terms A^θ_β are derived based on the samples of the positive class. The minimum and maximum values of each attribute are first found. The fuzzy terms are triangular membership functions within the range of each attribute, and they also depend on the number of labels. Since regular triangular membership functions are used, the fuzzy terms are distributed evenly within the range of each attribute. The fuzzy rules are generated based on the samples of the positive class: for each sample, the label with the highest membership value is selected for each attribute to form the corresponding rule. The maximum number of rules depends on the number of labels and attributes.
Fig. 2 shows the distribution of two classes with two attributes as an example of the formulation of fuzzy rules. The x-axis and y-axis give the values of the two attributes, and regular triangular membership functions with five labels are used. The circle dots correspond to the negative class and the square dots to the positive class. The dashed lines show the minimum and maximum values of the corresponding attribute of the positive samples. Only the attribute vectors of the positive class are considered in generating the fuzzy rules.

Two of the generated rules are, for example:
Rule 9: IF z_1 is A^9_1 = L1_4 AND z_2 is A^9_2 = L2_5 THEN class = positive with w_9 = 1.731.
Rule 10: IF z_1 is A^10_1 = L1_5 AND z_2 is A^10_2 = L2_4 THEN class = positive with w_10 = 1.399.
Here z_1 and z_2 represent Attribute 1 and Attribute 2 (the x-axis and y-axis in Fig. 2 respectively), L1_i is the ith label of attribute z_1, and L2_i is the ith label of attribute z_2. Rule 5 has the highest rule weight and rule 7 has the lowest in this example.

Figure 2: Example of the distribution of an imbalanced dataset. The y-axis represents the values of z_2 and the x-axis the values of z_1.

Figure 3: Distribution of the samples after over-sampling. The y-axis represents the values of z_2 and the x-axis the values of z_1.
FP_rate = FP / (FP + TN) is the false positive rate, where TN is the number of true negatives; it gives the percentage of negative cases misclassified as positive. A high value of AUC implies small values of FN and FP, meaning that the corresponding classifier is effective. Since both the F-measure and the AUC are important measures on imbalanced datasets, a multi-objective fitness function is used: a chromosome with higher values of both F-measure and AUC ranks higher, and when one chromosome is higher in F-measure but lower in AUC, the differences in the two measures are compared. It should be noted that the CHC fitness evaluation for data size reduction (by k-NN) and the training of the classification model from the re-sampled data (by SVM) are two separate processes. k-NN is used in the fitness evaluation because it is simple and needs minimal computational effort, while SVM is a commonly used method to obtain the classification model.
It controls the flexibility of the classifier: when σ decreases, the flexibility of the resulting classifier in fitting the training data increases, which can easily lead to over-fitting. The value of σ is set to 0.01 and the tradeoff between training error and margin of the SVM is set to 100; these values are chosen through experiments. For C4.5, the confidence level is set to 0.25, the minimum number of item-sets per leaf is set to 2, and pruning is used to obtain the final tree. For 1NN, the Euclidean distance metric is used. A 5-fold cross-validation model is used to compare the classification results of the different preprocessing methods. Each dataset is first divided into five parts randomly; four of them are combined to form a training set and the remaining subset forms a testing set. The process is repeated five times, so that each subset is used once as a testing set. As all the methods involve some random parameters, five runs are carried out for each 5-fold cross-validation model and the average values are taken as the results, i.e., 25 experiments in total.

Results: Tables 2 and 3 show the SVM results on the F-measure and AUC, respectively, for each re-sampling method on the 44 datasets. The results on the original datasets are shown in the second column, and the best value for each dataset is highlighted in bold. The last row shows the average value of each sampling method over the datasets. The performance of the FRB over-sampling method alone is also included (in the rightmost column) for comparison with FRB+CHC. It can be seen that the average values of the F-measure and AUC of both FRB and FRB+CHC are higher than those of the other methods. The performance of sCHC and FRB+CHC is similar, which shows that CHC performs well as a data cleaning method after over-sampling, especially for the F-measure results.

Figs. 6 and 7 show an example of the distribution of the samples after the implementation of FRB+CHC and of sCHC respectively.

Figure 5: Average AUC results obtained from the training and testing sets.

Figure 6: Distribution of the samples after the implementation of FRB+CHC.

Figure 7: Distribution of the samples after the implementation of sCHC.

Table 1: Details of the Selected Imbalanced Datasets.

Table 2: SVM: Average F-measure of the Testing Datasets among Different Sampling Methods.

Table 3: SVM: Average AUC of the Testing Datasets among Different Sampling Methods.

Table 4: Friedman Rankings of AUC and F-measure.

Another drawback of over-sampling is that the size of the training set is expanded greatly. If the IR of the dataset is large, the size of the re-sampled training set can be nearly double that of the original one. This drawback may increase the computational time and the complexity of the learning model. Table 5 shows the over-sampling rates of the different methods on each dataset and the mean rate of each method. A negative value means the size of the re-sampled training set is smaller than that of the original one; a value greater than 100% means the re-sampled training set is more than twice the size of the original set. Both sCHC and FRB+CHC shrink most of the datasets, while the over-sampling rates of the other methods are similar. This shows that both sCHC and FRB+CHC can use fewer training samples to achieve high performance.

Table 5: Over-sampling Rate (%) of the Training Sets among Different Sampling Methods.

Table 6: Details of the Re-sampled Datasets after FRB+CHC.

Table 7: The Increase Rate of the Number of Support Vectors of the Classification Model Formed by SVM.