When a classifier is built on unbalanced data, its ability to correctly predict each class is generally poor. In this notebook, we will consider the case of a binary classifier, trained to predict two classes that we will call the majority class and the minority class. Arbitrarily, the minority class will be coded 1 and the majority class will be coded 0. In theory, we speak of imbalance as soon as the distribution between the two classes is not equal (i.e., not 50% of 0 and 50% of 1). In practice, problems occur when the imbalance is strong, for example 1% of 1 and 99% of 0. It is also possible to have extremely few observations of the minority class. The typical example encountered in machine learning is that of fraud, where more than 99% of observations are non-fraudulent while less than one percent of observations involve fraud.
An imbalanced dataset with 2% of observations of the minority class
Instead of saying that there are 98% of observations from class 0 and 2% of observations from class 1, it is possible to use the imbalance ratio, i.e., the ratio between the number of observations from the majority class and the number of observations from the minority class (the smallest one when there are several):
\[\text{imbalance ratio} = \frac{\text{number of observations from the majority class}}{\text{number of observations from the minority class}}.\]
The larger the imbalance ratio, the larger the imbalance. For a perfectly balanced dataset, the imbalance ratio is equal to 1.
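As a quick illustration, the imbalance ratio can be computed directly from the class counts. The vector `y` below is a made-up target with 98% of class 0 and 2% of class 1; it is not the notebook's data.

```r
# Hypothetical binary target (assumption): 980 cases of class 0, 20 of class 1
y <- c(rep(0, 980), rep(1, 20))

# Number of observations in each class
counts <- table(y)

# Imbalance ratio: majority count divided by the (smallest) minority count
imbalance_ratio <- max(counts) / min(counts)
imbalance_ratio
```

With a perfectly balanced `y`, the same code would return 1.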
Building a classifier on an imbalanced dataset
In a first step, let us try to fit a random forest on the raw data. We will train the model on a subsample of the data, and assess its quality of fit both on the same training data and on unseen data (test set). We will consider here two different random forests, and try to select the one that gives the best results.
Let us put 80% of the data in the training set and leave the remaining 20% for the test set.
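A minimal sketch of such a split in base R, on simulated data since the notebook's dataset is not reproduced here. The data frame `df`, its columns, the seed and the name `df_test` are assumptions made for illustration; only `df_train` and the 80/20 proportions come from the text.

```r
set.seed(123)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical data: 2,500 rows, two features, about 2% of class 1
n <- 2500
df <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  y  = factor(rbinom(n, size = 1, prob = 0.02), levels = c(0, 1))
)

# Draw 80% of the row indices at random for the training set
ind_train <- sample(seq_len(n), size = round(0.8 * n))
df_train <- df[ind_train, ]
df_test  <- df[-ind_train, ]

# A simple random split preserves the class proportions only approximately
prop.table(table(df_train$y))
prop.table(table(df_test$y))
```

With very few minority cases, a stratified split (sampling 80% within each class) may be preferable so that both sets contain some observations of class 1.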
Let us grow a random forest with a first set of hyperparameters (200 trees, 2 variables as candidates to draw from when performing the splits, at least 20 observations in terminal nodes).
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1959 29
1 0 12
Accuracy : 0.9855
95% CI : (0.9792, 0.9903)
No Information Rate : 0.9795
P-Value [Acc > NIR] : 0.02988
Kappa : 0.4477
Mcnemar's Test P-Value : 1.999e-07
Sensitivity : 0.2927
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9854
Prevalence : 0.0205
Detection Rate : 0.0060
Detection Prevalence : 0.0060
Balanced Accuracy : 0.6463
'Positive' Class : 1
While the accuracy (percentage of correctly predicted individuals) is very high (0.99), we need to keep in mind that 98% of the observations are from class 0. The true negative rate (or specificity) (TN/(TN+FP)) is very high (1.00), but the true positive rate (or sensitivity) (TP/(TP+FN)), on the other hand, is low (0.29).
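To make the link between the confusion matrix and these rates explicit, the sketch below recomputes them from the four counts reported above for the training data. The variable names are ours; the counts are read from the table (rows are predictions, columns are the reference).

```r
# Counts from the training-set confusion matrix (positive class = 1)
TN <- 1959  # predicted 0, observed 0
FN <- 29    # predicted 0, observed 1
FP <- 0     # predicted 1, observed 0
TP <- 12    # predicted 1, observed 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  # true positive rate
specificity <- TN / (TN + FP)  # true negative rate
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 4)
```

The balanced accuracy, which averages the two rates, already hints that the high accuracy is driven by the majority class.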
Let us look at those same metrics based on the predictions made on unseen data.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 491 8
1 0 1
Accuracy : 0.984
95% CI : (0.9687, 0.9931)
No Information Rate : 0.982
P-Value [Acc > NIR] : 0.45445
Kappa : 0.1971
Mcnemar's Test P-Value : 0.01333
Sensitivity : 0.1111
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9840
Prevalence : 0.0180
Detection Rate : 0.0020
Detection Prevalence : 0.0020
Balanced Accuracy : 0.5556
'Positive' Class : 1
Again, the accuracy is very high (0.98). The specificity, i.e., the true negative rate is high (1.00), but the sensitivity, i.e., the true positive rate is low (0.11).
A second model
Let us fit a second model, with different specifications:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 491 8
1 0 1
Accuracy : 0.984
95% CI : (0.9687, 0.9931)
No Information Rate : 0.982
P-Value [Acc > NIR] : 0.45445
Kappa : 0.1971
Mcnemar's Test P-Value : 0.01333
Sensitivity : 0.1111
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9840
Prevalence : 0.0180
Detection Rate : 0.0020
Detection Prevalence : 0.0020
Balanced Accuracy : 0.5556
'Positive' Class : 1
Warning
Which of the two models gives the best results? To try to answer this question, let us consider the ROC curve.
ROC Curve
Let us have a look at the ROC curve. Recall that the x-axis shows the False Positive Rate (here, the fraction of majority-class observations that are misclassified) while the y-axis shows the True Positive Rate (here, the fraction of minority-class observations that are correctly predicted). The graph reports these two metrics when the threshold \(\tau\) varies. This threshold is the cut-off point above which class 1 is predicted for the observation: if \(\mathbb{P}(Y=1 \mid X) \geq \tau\), then the observation is classified as 1, and as 0 otherwise. Lowering this threshold increases the TPR at the cost of a higher FPR, and conversely.
The ROC curve is plotted on a graph with the False positive rate on the x-axis, and the True positive rate on the y-axis:
If the threshold \(\tau=0\), then every observation is classified as 1. Let us assume that the fraction of class 1 in the sample is \(\gamma\). The fraction of class 0 is then \(1-\gamma\).
Confusion table (values expressed in proportions) for a classifier that systematically assigns class 1
With a classifier that randomly assigns the classes
Now, let us consider the case where the classifier randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\). In such a case, the expected proportions are as follows:
Confusion table (values expressed in proportions) for a classifier that randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\)
| | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = \((1-p) (1-\gamma)\) | False Positive (FP) = \(p (1-\gamma)\) |
| Obs. Positive (1) | False Negative (FN) = \((1-p) \gamma\) | True Positive (TP) = \(p \gamma\) |
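In that case, the expected values for the TPR and the FPR are:
\[TPR = \frac{TP}{TP+FN} = \frac{p\gamma}{p\gamma + (1-p)\gamma} = p, \qquad FPR = \frac{FP}{FP+TN} = \frac{p(1-\gamma)}{p(1-\gamma) + (1-p)(1-\gamma)} = p.\]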
So, on average, with a classifier that randomly assigns a class to the observations, \(TPR=FPR=p\). The ROC curve, for such a classifier will thus be a line from (0,0) to (1,1), i.e., a diagonal. The classifier would have no discriminative power.
On the ROC curve, points above that diagonal will correspond to classifiers that provide results that are better than a random guess while points below will correspond to classifiers that provide results that are worse than a random guess.
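This result can be checked with a small simulation. The values of `n`, `gamma` and `p` below are arbitrary choices for the illustration, not values taken from the notebook.

```r
set.seed(1)  # arbitrary seed for this illustration

n     <- 100000  # number of simulated observations
gamma <- 0.02    # proportion of class 1
p     <- 0.3     # probability of predicting class 1, independently of the truth

y      <- rbinom(n, size = 1, prob = gamma)  # observed classes
y_pred <- rbinom(n, size = 1, prob = p)      # purely random predictions

tpr <- sum(y_pred == 1 & y == 1) / sum(y == 1)
fpr <- sum(y_pred == 1 & y == 0) / sum(y == 0)

# Both rates should be close to p, whatever the value of gamma
c(TPR = tpr, FPR = fpr)
```

Repeating this for several values of `p` between 0 and 1 traces out the diagonal of the ROC plot.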
The roc_curve() function from {yardstick} returns the TPR/sensitivity and the TNR/specificity for each threshold (the FPR is equal to 1-TNR). Let us apply it to the train data first.
# A tibble: 2 × 4
model .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 First roc_auc binary 0.899
2 Second roc_auc binary 0.900
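For readers who want to see what lies behind these numbers, here is a base-R sketch that builds the ROC points by hand and integrates them with the trapezoidal rule. It runs on simulated scores, not on the two random forests: the data-generating process, the seed and all object names are assumptions.

```r
set.seed(42)  # arbitrary seed

# Simulated data (assumption): about 2% of class 1, scores tend to be higher for class 1
n <- 5000
y <- rbinom(n, size = 1, prob = 0.02)
score <- plogis(rnorm(n, mean = ifelse(y == 1, 1.5, 0)) - 3)

# TPR and FPR when class 1 is predicted for scores >= tau
roc_point <- function(tau) {
  pred <- as.integer(score >= tau)
  c(threshold = tau,
    tpr = sum(pred == 1 & y == 1) / sum(y == 1),
    fpr = sum(pred == 1 & y == 0) / sum(y == 0))
}
thresholds <- c(-Inf, sort(unique(score)), Inf)
roc <- as.data.frame(t(sapply(thresholds, roc_point)))

# Area under the curve: trapezoidal rule on points ordered by increasing FPR
roc <- roc[order(roc$fpr, roc$tpr), ]
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
auc
```

The same area can be obtained from the ranks of the scores (Mann-Whitney statistic), which is a convenient way to cross-check the computation.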
Optimal threshold
Let us now turn to finding the best probability threshold for a model. We have seen that a perfect classifier is located on the top-left corner of the plot of the ROC curve (FPR=0, TPR=1).
There are many ways to define the optimal threshold value (see Unal (2017)).
For simplicity, let us use the Euclidean distance to find the point of the ROC curve that is closest to (0,1). The optimal threshold \(\tau^\star\) thus minimises the following distance:
\[d(\tau) = \sqrt{FPR(\tau)^2 + \left(1 - TPR(\tau)\right)^2}.\]
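A possible way to carry out this search is sketched below on simulated scores (an assumption; the notebook applies the idea to the random forest's predicted probabilities). The grid of thresholds and all object names are ours.

```r
set.seed(42)  # arbitrary seed

# Simulated data (assumption): about 2% of class 1, higher scores for class 1
n <- 5000
y <- rbinom(n, size = 1, prob = 0.02)
score <- plogis(rnorm(n, mean = ifelse(y == 1, 1.5, 0)) - 3)

# TPR and FPR on a grid of thresholds (class 1 predicted when score >= tau)
thresholds <- seq(0, 1, by = 0.001)
tpr <- sapply(thresholds, function(tau) mean(score[y == 1] >= tau))
fpr <- sapply(thresholds, function(tau) mean(score[y == 0] >= tau))

# Euclidean distance between each ROC point and the top-left corner (0, 1)
dist_to_corner <- sqrt(fpr^2 + (1 - tpr)^2)

# Threshold whose ROC point is the closest to (0, 1)
tau_star <- thresholds[which.min(dist_to_corner)]
tau_star
```

This criterion weighs a missed positive and a false alarm equally in terms of rates, which is a design choice; other criteria would lead to other thresholds.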
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 445 1
1 46 8
Accuracy : 0.906
95% CI : (0.877, 0.9301)
No Information Rate : 0.982
P-Value [Acc > NIR] : 1
Kappa : 0.2302
Mcnemar's Test P-Value : 1.38e-10
Sensitivity : 0.8889
Specificity : 0.9063
Pos Pred Value : 0.1481
Neg Pred Value : 0.9978
Prevalence : 0.0180
Detection Rate : 0.0160
Detection Prevalence : 0.1080
Balanced Accuracy : 0.8976
'Positive' Class : 1
Precision-recall curve
With both the train and test data, looking at the ROC-curve, we might think that our results are pretty good. Let us look at the precision-recall curve, to focus specifically on the positive predictions. Recall that:
Precision = \(\frac{TP}{TP+FP}\) ;
Recall = \(\frac{TP}{TP+FN}\).
These two metrics focus on the positive class, which is the minority class here. With a dataset that contains few observations from class 1, we are interested in assessing the ability of the classifier to predict class 1. The precision and recall metrics are thus helpful to get an idea of the performance of the model on the minority class. The precision-recall curve plots the precision as a function of the recall. The different values are obtained, as in the case of the ROC curve, by varying the probability threshold \(\tau\).
With classification threshold \(\tau=1\)
If the threshold \(\tau=1\) (and no predicted probability is exactly equal to 1), no observation is classified as 1, and the confusion matrix will be as follows:
Confusion table (values expressed in proportions) for a classifier, with a threshold \(\tau=1\)
| | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = \(1-\gamma\) | False Positive (FP) = 0 |
| Obs. Positive (1) | False Negative (FN) = \(\gamma\) | True Positive (TP) = 0 |
Hence:
Precision = \(\frac{TP}{TP+FP}\) : undefined.
Recall = \(\frac{TP}{TP+FN} = 0\).
In practice, as the value of \(\tau\) increases, more and more cases are classified as 0. The proportion of false positives decreases and usually becomes negligible when compared to the proportion of true positives. The precision thus increases and may be equal to 1.
With a perfect classifier
Now, let us turn to a classifier that perfectly predicts the two classes for each case.
Confusion table (values expressed in proportions) for a perfect classifier, with a threshold \(0<\tau<1\)
With a classifier that randomly assigns the classes
If the classifier randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\), regardless of \(\tau\), the expected proportions are:
Confusion table (values expressed in proportions) for a classifier that randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\)
Hence, for \(0<\tau<1\), the point corresponding to a classifier that randomly assigns class 1 with probability \(p\) will be on the dashed green line. The x-coordinate of that point will be equal to \(p\) (assuming that there is a proportion \(\gamma\) of class 1 individuals in the data).
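The coordinates of that point can be worked out from the expected proportions of the random classifier given in the ROC section (\(TP = p\gamma\), \(FP = p(1-\gamma)\), \(FN = (1-p)\gamma\)). The short derivation below is added for convenience and assumes \(p > 0\):

```latex
\[
\text{Recall} = \frac{TP}{TP+FN} = \frac{p\gamma}{p\gamma + (1-p)\gamma} = p,
\qquad
\text{Precision} = \frac{TP}{TP+FP} = \frac{p\gamma}{p\gamma + p(1-\gamma)} = \gamma .
\]
```

The recall (x-axis) is thus equal to \(p\), while the precision (y-axis) stays at the proportion \(\gamma\) of class 1, whatever the value of \(p\).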
Recall from the confusion matrix (showing proportions) for the first model, on the test sample:
| Metric | Training data | Test data |
|---|---|---|
| Precision (TP / (TP+FP)) | 0.98 | 0.98 |
| Recall/Sensitivity (TP / (TP + FN)) | 0.11 | 0.11 |
And for the second:
| Metric | Training data | Test data |
|---|---|---|
| Precision (TP / (TP+FP)) | 0.98 | 0.98 |
| Recall/Sensitivity (TP / (TP + FN)) | 0.11 | 0.11 |
When faced with an unbalanced dataset and when the model fails to correctly predict the positive class, the ROC curve will not necessarily reflect this if the imbalance is large. The ROC curve may in fact be close to the ROC curve of a perfectly discriminating model. In contrast, the Precision-Recall curve will not be as close to the one corresponding to a perfect classifier. Let us illustrate this.
The pr_curve() function from {yardstick} allows us to get the precision and recall for different values of \(\tau\).
Clearly, the performance of the model is much less impressive viewed that way…
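To show what such a curve is made of, the sketch below computes precision and recall by hand on a grid of thresholds. It uses simulated scores rather than the notebook's models, and every object name is an assumption.

```r
set.seed(42)  # arbitrary seed

# Simulated data (assumption): about 2% of class 1, higher scores for class 1
n <- 5000
y <- rbinom(n, size = 1, prob = 0.02)
score <- plogis(rnorm(n, mean = ifelse(y == 1, 1.5, 0)) - 3)

# Precision and recall when class 1 is predicted for scores >= tau
pr_point <- function(tau) {
  pred <- score >= tau
  tp <- sum(pred & y == 1)
  fp <- sum(pred & y == 0)
  fn <- sum(!pred & y == 1)
  c(threshold = tau,
    # Precision is undefined (NA) when no observation is predicted positive
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    recall    = tp / (tp + fn))
}
thresholds <- seq(0, 1, by = 0.01)
pr <- as.data.frame(t(sapply(thresholds, pr_point)))

# First row (tau = 0): every observation is predicted positive
pr[1, ]
```

At the lowest threshold the recall is 1 and the precision equals the share of class 1 in the data, which is why a strongly imbalanced sample pulls the whole curve down.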
F1 Score
With an imbalanced dataset, if we focus on the accuracy, we might select a model that gives very good results on average, but poor results when predicting one of the classes. Some metrics are built to take into consideration the predictive capabilities for both classes.
This is the case of the F1 score. It computes the harmonic mean of the precision and sensitivity: \[\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2 \times TP}{2 \times TP + FP + FN}.\] The F1 score values range from 0 (if either precision or sensitivity is equal to 0) to 1 (both precision and sensitivity equal to 1).
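As a worked example, the two expressions of the F1 score can be evaluated on the counts of the test-set confusion matrix obtained earlier with the distance-based threshold (TP = 8, FP = 46, FN = 1). The variable names are ours.

```r
# Counts read from the test-set confusion matrix at the selected threshold
TP <- 8
FP <- 46
FN <- 1

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)

# Harmonic mean of precision and sensitivity
f1_harmonic <- 2 * precision * sensitivity / (precision + sensitivity)

# Equivalent expression based on the counts only
f1_counts <- 2 * TP / (2 * TP + FP + FN)

round(c(precision = precision, sensitivity = sensitivity,
        f1_harmonic = f1_harmonic, f1_counts = f1_counts), 4)
```

Despite a sensitivity close to 0.89, the low precision keeps the F1 score at about 0.25, since the harmonic mean is dominated by the smaller of the two values.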
With an imbalanced dataset, it may be useful to rely on this metric to assess the quality of fit of a given classifier when performing model selection.
With our two models, we can use the F1-Score to select the optimal threshold:
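One way to do this is to compute the F1 score on a grid of thresholds and keep the threshold with the highest value. The sketch below does so on simulated scores (an assumption, as are the grid and the object names), not on the two random forests.

```r
set.seed(42)  # arbitrary seed

# Simulated data (assumption): about 2% of class 1, higher scores for class 1
n <- 5000
y <- rbinom(n, size = 1, prob = 0.02)
score <- plogis(rnorm(n, mean = ifelse(y == 1, 1.5, 0)) - 3)

# F1 score when class 1 is predicted for scores >= tau
f1_at <- function(tau) {
  pred <- score >= tau
  tp <- sum(pred & y == 1)
  fp <- sum(pred & y == 0)
  fn <- sum(!pred & y == 1)
  2 * tp / (2 * tp + fp + fn)  # equals 0 when there is no true positive
}

thresholds <- seq(0.01, 0.99, by = 0.01)
f1 <- sapply(thresholds, f1_at)

# Threshold that maximises the F1 score on this sample
tau_f1 <- thresholds[which.max(f1)]
c(threshold = tau_f1, f1 = max(f1))
```

To avoid an optimistic choice, the threshold would typically be selected on training or validation data and then evaluated on the test set.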
When faced with an imbalanced dataset, what should we do?
Unfortunately (or maybe, fortunately?), there is no “good” answer to this question. There are multiple ways to address this issue. And as is often the case with machine learning, we have to try different options and we have to make choices.
Broadly, there are three families of approaches to deal with imbalanced datasets:
data resampling
algorithm modifications (not in the scope of this notebook)
cost-sensitive learning (not in the scope of this notebook).
In the remainder of this notebook, we will have a look at data resampling methods.
Rebalancing the dataset
We have seen in the first part of this notebook that when a data sample contains many more observations of the negative class than of the positive class, it is difficult to obtain an algorithm with good predictive capabilities for the minority (positive) class. Yet, obtaining good performance in predicting the minority class may be the precise purpose of the classification (detection of cancer, fraud, recession, …). In this case, it is possible to rebalance the dataset to decrease the imbalance ratio.
This part of the notebook will present two broad techniques:
under-sampling: the number of observations from the majority class is decreased
over-sampling: the number of observations from the minority class is increased.
Under-sampling
One way to obtain a balanced dataset consists in sampling from the majority class to decrease the number of observations at hand from this class. There are many techniques to do so.
Random under-sampling
One of the simplest techniques is random sampling. All we need to do is to decide on the imbalance ratio we would like to obtain. For example, if we want that ratio to be equal to 1 (1 minority case for each majority case), we just need to sample as many examples of the majority class as there are examples of the minority class.
In the training set, the imbalance ratio is:
sum(df_train$y == 0) / sum(df_train$y == 1)
[1] 47.78049
The number of minority cases is:
(n_minority <- sum(df_train$y == 1))
[1] 41
We will thus randomly sample 41 cases from the majority class, and keep all cases from the minority class.
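A minimal base-R sketch of this step is given below. The data frame is simulated with the same class counts as in the text (1,959 majority and 41 minority cases); its features, the seed and the name `df_train_under_random` are assumptions.

```r
set.seed(123)  # arbitrary seed

# Hypothetical training set with the class counts reported in the text
df_train <- data.frame(
  x1 = rnorm(2000),
  x2 = rnorm(2000),
  y  = rep(c(0, 1), times = c(1959, 41))
)
n_minority <- sum(df_train$y == 1)

# Row indices of each class
ind_majority <- which(df_train$y == 0)
ind_minority <- which(df_train$y == 1)

# Keep all minority cases and a random sample of n_minority majority cases
ind_keep <- c(sample(ind_majority, size = n_minority), ind_minority)
df_train_under_random <- df_train[ind_keep, ]

table(df_train_under_random$y)
```

Sampling `k * n_minority` majority cases instead would give an imbalance ratio of `k`.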
Undersampling to obtain an imbalance ratio of 1 resulted in a final training set with 2 times the number of minority cases, i.e., 82. This number of observations may be too small, depending on the variability in the data. To have a few more observations on which to train the model, we can accept a slightly unbalanced dataset.
For example, if we accept an imbalance ratio of 2 (2 observations of the majority class for each observation of the minority class):
nrow(df_train)
[1] 2000
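df_train_under_random_2 <- df_train %>%
  # Shuffle observations
  sample_frac(1L) %>%
  group_by(y) %>%
  # Under-sampling in the majority class: keep at most 2 * n_minority rows per class
  # (the minority class has fewer rows than that, so it is kept entirely)
  slice_head(n = 2 * n_minority) %>%
  ungroup()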
When observations are randomly drawn from the majority class, the resulting sample may not be representative of the original data. Removing some individuals may result in removing important information from the training dataset.
Therefore, instead of drawing randomly among the observations of the majority class, some techniques focus on orienting the drawing (by removing redundant individuals, for example), or on creating synthetic individuals among the individuals of the majority class.
Undersampling using KNN
Instead of randomly selecting the observations to keep or remove from the majority class, Mani and Zhang (2003) suggest using the nearest neighbors. They propose 3 different versions.
Note
NearMiss-1:
for each example from the majority class, compute the average distance to the k closest examples among the minority class
select the majority examples for which the average distance computed in the previous step is the smallest
NearMiss-2:
for each example from the majority class, compute the average distance to the k farthest examples among the minority class
select the majority examples for which the average distance computed in the previous step is the smallest
NearMiss-3:
for each example from the minority class, identify and select a given number of neighbors among the majority class.
for each of the neighbors obtained in the previous step, compute the distance to each example from the minority class, then compute the average distance to their 3 closest neighbors
select the majority examples for which the average distance computed in the previous step is the largest.
As the NearMiss-1 technique seems to provide poor results, some suggest selecting instead the majority examples for which the average distance to the k closest minority examples is the highest. [ref needed]
Let us illustrate with some R code how these variants work. First, let us split the sample according to the value of the target variable: the majority class and the minority class.
Let us say that we want to use \(k=3\) for the KNN:
n_neighbors <- 3
NearMiss-1
In a first step, for each observation from the majority sample, let us compute the distance (using the Euclidean distance here) to each observation from the minority sample:
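A possible way to build such a distance matrix in base R is sketched below. The feature matrices `x_majority` and `x_minority` are simulated (two numeric features, 1,959 and 41 rows to match the class counts of the training set); they are assumptions, as is the construction of `dists` itself.

```r
set.seed(123)  # arbitrary seed

# Hypothetical numeric features for each class (assumption)
x_majority <- matrix(rnorm(1959 * 2), ncol = 2)
x_minority <- matrix(rnorm(41 * 2, mean = 1), ncol = 2)

# Euclidean distance between majority case i (row) and minority case j (column):
# for each minority case, subtract its coordinates from every majority row,
# then take the square root of the row-wise sum of squares
dists <- sapply(
  seq_len(nrow(x_minority)),
  function(j) sqrt(rowSums(sweep(x_majority, 2, x_minority[j, ])^2))
)
dim(dists)
```

Since the distance depends on the scale of the features, standardising them beforehand is usually advisable.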
Each row of the obtained matrix gives the distance to each observation from the minority sample (in column):
dim(dists)
[1] 1959 41
For a single example from the majority class, we identify the closest observations from the minority sample. The distances of the first example to each observation from the minority sample are:
Then, we can order the results by descending values of the computed average distance and select the first observations from that list. Let us assume that we want to keep only as many observations from the majority class as there are in the minority class:
n_to_keep <- n_minority
Then the index of the values we keep from the majority class can be obtained as follows:
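A self-contained sketch of this selection step is given below, on simulated features (an assumption, as are all object names other than `dists`, `n_neighbors` and `n_to_keep`). It follows the descending ordering described just above; the definition of NearMiss-1 given in the note would use the increasing order instead.

```r
set.seed(123)  # arbitrary seed

# Hypothetical numeric features for each class (assumption)
x_majority <- matrix(rnorm(1959 * 2), ncol = 2)
x_minority <- matrix(rnorm(41 * 2, mean = 1), ncol = 2)
n_neighbors <- 3
n_to_keep <- nrow(x_minority)

# Distances between majority cases (rows) and minority cases (columns)
dists <- sapply(
  seq_len(nrow(x_minority)),
  function(j) sqrt(rowSums(sweep(x_majority, 2, x_minority[j, ])^2))
)

# Average distance of each majority case to its n_neighbors closest minority cases
avg_dist <- apply(dists, 1, function(d) mean(sort(d)[seq_len(n_neighbors)]))

# Indices of the majority cases to keep, by descending average distance
# (use order(avg_dist) to keep the smallest average distances instead)
ind_keep <- order(avg_dist, decreasing = TRUE)[seq_len(n_to_keep)]

x_majority_kept <- x_majority[ind_keep, , drop = FALSE]
dim(x_majority_kept)
```

The retained majority cases can then be stacked with all minority cases to form the under-sampled training set.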