Machine Learning and Statistical Learning

Dealing with unbalanced dataset

Author

Ewen Gallic

When faced with unbalanced data on which to build a classifier, the performance of the model to correctly predict each class is generally poor or even bad. In this notebook, we will consider the case of a binary classifier, trained to predict two classes that we will call the majority class and the minority class. Arbitrarily, we will consider that the minority class will be coded 1, while the majority class will be coded 0. In theory, when the repartition between the two classes is not equal (50% of 0 and 50% of 1), we speak of imbalance. In practice, problems occur when the imbalance is strong, for example 1% of 1 and 99% of 1. It is also possible to have extremely few observations of the minority class. The typical example encountered in machine learning is that of fraud, where more than 99% of observations are non-fraudulent while less than one percent of observations involve fraud.

To illustrate the different concepts, we will use data that we generate. Let us create a dataset with two spirals. To do so, we rely on some code published by Stanislas Morbieu on R-bloggers.

Show the R codes
# Code from
# https://www.r-bloggers.com/2018/11/generate-datasets-to-understand-some-clustering-algorithms-behavior/
library(dplyr) ; library(ggplot2)
n <- 5000
set.seed(123)
library(mvtnorm)
generateSpiralData <- function(n) {
  maxRadius = 7
  xShift = 2.5
  yShift = 2.5
  angleStart = 2.5 * pi
  noiseVariance = 0.4
  
  # first spiral
  firstSpiral <- function() {
    d1 = data.frame(0:(n-1))
    colnames(d1) <- c("i")
    d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n),
                  radius = maxRadius * (n + n/5 - i)/ (n + n/5),
                  x = radius * sin(angle),
                  y = radius * cos(angle),
                  class="0")
  }
  d1 = firstSpiral()
  
  # second spiral
  d2 = d1 %>% mutate(x = -x, y = -y, class="1")
  
  # combine, add noise, and shift
  generateNoise <- function(n) {
    sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2)
    noise = rmvnorm(n, mean = c(0, 0), sigma = sigma)
    df = data.frame(noise)
    colnames(df) <- c("xNoise", "yNoise")
    df
  }
  d1 %>%
    bind_rows(d2) %>%
    bind_cols(generateNoise(2*n)) %>%
    transmute(x_1 = x + xShift + xNoise,
              x_2 = y + yShift + yNoise,
              y = factor(class, levels = c("0", "1"))) %>% 
    as_tibble()
}

df <- generateSpiralData(n)
wongBlue <- "#0072B2"
wongOrange <- "#D55E00"

ggplot(data = df, aes(x = x_1, y = x_2)) +
  geom_point(mapping = aes(colour = y)) +
  scale_colour_manual(NULL, values = c("1" = wongOrange, "0" = wongBlue))

Two spirals

Now, let us only keep a few observation from class 1. For example, let us assume that there are 98% of observation of class 0 and only 2% of class 1.

imbalance <- .98

df <- 
  df %>% filter(y == "0") %>% 
  sample_n(size = round(imbalance*n/2)) %>% 
  bind_rows(
    df %>% filter(y == "1") %>% 
      sample_n(size = round((1-imbalance)*n/2))
  )

df %>% group_by(y) %>% count() %>%
  ungroup() %>% mutate(prop = n / sum(n))
# A tibble: 2 × 3
  y         n  prop
  <fct> <int> <dbl>
1 0      2450  0.98
2 1        50  0.02
ggplot(data = df, aes(x = x_1, y = x_2)) +
  geom_point(mapping = aes(colour = y)) +
  scale_colour_manual(NULL, values = c("1" = wongOrange, "0" = wongBlue))

An imbalanced dataset with 2% of observation of the minority class

Instead of saying that there are 98% of observations from class 0 and 1% of observations from class 1, it is possible to use the imbalance ratio that gives the ratio between the number of observations from the majority class and the number of observations from the smallest minority class:

imbalance_ratio <- sum(df$y==0) / sum(df$y==1)
imbalance_ratio
[1] 49

The larger the imbalance ratio, the larger the imbalance. For a perfectly balanced dataset, the imbalance ratio is equal to 1.

Building a classifier on an imbalanced dataset

In a first step, let us try to fit a random forest on the raw data. We will train the model on a subsample of the data, and assess its quality of fit both on the same training data and on unseen data (test set). We will consider here two different random forests, and try to select the one that gives the best results.

Let us put 80% of the data in the training set and leave the remaining 20% for the test set.

ind_train <- sample(1:nrow(df), size = .8*nrow(df))
df_train <- df[ind_train, ]
df_test <- df[-ind_train, ]

A first model

Let us grow a random forest with a first set of hyperparameters (200 trees, 2 variables as candidates to draw from when performing the splits, 20 observation at minimum in terminal nodes)

library(randomForest)
mod <- 
  randomForest(
  formula = y ~ x_1+x_2,
  data = df_train,
  ntree = 200,
  mtry =2 ,
  nodesize = 20,
  maxnodes = NULL
)

Thanks to the predict function, we can obtain the estimated probabilities to be classified as 1:

pred_prob_train <- predict(mod, newdata = df_train, "prob")
head(pred_prob_train)
     0    1
1 1.00 0.00
2 0.67 0.33
3 1.00 0.00
4 0.99 0.01
5 1.00 0.00
6 1.00 0.00

Based on these probabilities, we can assign a class (0 or 1) to each observation.

pred_class_train <- ifelse(pred_prob_train[,"1"] > .5, "1", "0") %>% 
  factor(levels = c("0", "1"))

Let us have a look at the confusion table on the training set.

Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) False Positive (FP)
Obs. Positive (1) False Negative (FN) True Positive (TP)
conf_mat_train <- 
  caret::confusionMatrix(data = pred_class_train,
                         reference = df_train$y,
                         positive="1")
conf_mat_train
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1959   29
         1    0   12
                                          
               Accuracy : 0.9855          
                 95% CI : (0.9792, 0.9903)
    No Information Rate : 0.9795          
    P-Value [Acc > NIR] : 0.02988         
                                          
                  Kappa : 0.4477          
                                          
 Mcnemar's Test P-Value : 1.999e-07       
                                          
            Sensitivity : 0.2927          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9854          
             Prevalence : 0.0205          
         Detection Rate : 0.0060          
   Detection Prevalence : 0.0060          
      Balanced Accuracy : 0.6463          
                                          
       'Positive' Class : 1               
                                          

While the accuracy (percentage of correctly predicted individuals) is very high (0.99), we need to keep in mind that 98% of the observations are from class 0. The true negative rate (or specificity) (TN/(TN+FP)) is very high (1.00), but the true positive rate (or sensitivity) (TP/(TP+FN)), on the other hand, is low (0.29).

Let us look at those same metrics based on the predictions made on unseen data.

pred_prob_test <- predict(mod, newdata = df_test, type = "prob")
pred_class_test <- ifelse(pred_prob_test[,"1"] > .5, "1", "0") %>% 
  factor(levels = c("0", "1"))
conf_mat_test <- 
  caret::confusionMatrix(data = pred_class_test, reference = df_test$y,
                         positive = "1")
conf_mat_test
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 491   8
         1   0   1
                                          
               Accuracy : 0.984           
                 95% CI : (0.9687, 0.9931)
    No Information Rate : 0.982           
    P-Value [Acc > NIR] : 0.45445         
                                          
                  Kappa : 0.1971          
                                          
 Mcnemar's Test P-Value : 0.01333         
                                          
            Sensitivity : 0.1111          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9840          
             Prevalence : 0.0180          
         Detection Rate : 0.0020          
   Detection Prevalence : 0.0020          
      Balanced Accuracy : 0.5556          
                                          
       'Positive' Class : 1               
                                          

Again, the accuracy is very high (0.98). The specificity, i.e., the true negative rate is high (1.00), but the sensitivity, i.e., the true positive rate is low (0.11).

A second model

Let us fit a second model, with different specifications:

mod_2 <- 
  randomForest(
  formula = y ~ x_1+x_2,
  data = df_train,
  ntree = 200,
  mtry =2 ,
  nodesize = 50,
  maxnodes = NULL
)

The predicted probabilities with this second model:

pred_prob_train_2 <- predict(mod_2, newdata = df_train, type = "prob")

Based on these probabilities, we can assign a class (0 or 1) to each observation.

pred_class_train_2 <- ifelse(pred_prob_train_2[,"1"] > .5, "1", "0") %>% 
  factor(levels = c("0", "1"))

Let us have a look at the confusion table on the training set.

conf_mat_train_2 <- 
  caret::confusionMatrix(data = pred_class_train_2,
                         reference = df_train$y,
                         positive="1")
conf_mat_train_2
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1959   32
         1    0    9
                                         
               Accuracy : 0.984          
                 95% CI : (0.9775, 0.989)
    No Information Rate : 0.9795         
    P-Value [Acc > NIR] : 0.08616        
                                         
                  Kappa : 0.3552         
                                         
 Mcnemar's Test P-Value : 4.251e-08      
                                         
            Sensitivity : 0.2195         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.9839         
             Prevalence : 0.0205         
         Detection Rate : 0.0045         
   Detection Prevalence : 0.0045         
      Balanced Accuracy : 0.6098         
                                         
       'Positive' Class : 1              
                                         

Let us look at those same metrics based on the predictions made on unseen data.

pred_prob_test_2 <- predict(mod_2, newdata = df_test, type = "prob")
pred_class_test_2 <- ifelse(pred_prob_test_2[,"1"] > .5, "1", "0") %>% 
  factor(levels = c("0", "1"))
conf_mat_test_2 <- 
  caret::confusionMatrix(data = pred_class_test_2, reference = df_test$y,
                         positive = "1")
conf_mat_test_2
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 491   8
         1   0   1
                                          
               Accuracy : 0.984           
                 95% CI : (0.9687, 0.9931)
    No Information Rate : 0.982           
    P-Value [Acc > NIR] : 0.45445         
                                          
                  Kappa : 0.1971          
                                          
 Mcnemar's Test P-Value : 0.01333         
                                          
            Sensitivity : 0.1111          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9840          
             Prevalence : 0.0180          
         Detection Rate : 0.0020          
   Detection Prevalence : 0.0020          
      Balanced Accuracy : 0.5556          
                                          
       'Positive' Class : 1               
                                          
Warning

Which of the two models give the best results? To try to answer this question, let us consider the ROC curve.

ROC Curve

Let us have a look at the ROC curve. Recall that the x-axis shows the False Positive Rate (here, fractions of errors for the majority class) while the y-axis show the True Positive Rate (here, fraction of correct predictions for the minority class). The graph reports these two metrics when the threshold \(\tau\) varies. This threshold is the cut-off point above which class 1 is predicted for the observation: if \(\mathbb{P}(Y=1 \mid X) \geq \tau\), then the observation is classified as 1, 0 otherwise. Varying this threshold will favour the TPR at the expense of the FPR or conversely.

The ROC curve is plotted on a graph with the False positive rate on the x-axis, and the True positive rate on the y-axis:

Show the R codes
ggplot() +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity") +
  geom_blank() +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1))

With classification threshold \(\tau=0\)

If the threshold \(\tau=0\), then every observation is classified as 1. Let us assume that the fraction of class 1 in the sample is \(\gamma\). The fraction of class 0 is then \(1-\gamma\).

Confusion table (values expressed in proportions) for a classifier that systematically assigns class 1
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = 0 False Positive (FP) = \(1-\gamma\)
Obs. Positive (1) False Negative (FN) = 0 True Positive (TP) = \(\gamma\)

In such a case:

  • \(TPR = \frac{TP}{TP+FN} = 1\) ;
  • \(FPR = \frac{FP}{FP+TN} = 1\).
Show the R codes
ggplot(data = tibble(tpr = 1, fpr = 1),
       mapping = aes(x = fpr, y = tpr)) +
  geom_point() +
  geom_text(data = tibble(tpr = c(.7), fpr = c(.75), lab = c("tau == 0")),
            mapping = aes(label = lab), parse = T) +
  geom_curve(data = tibble(
    x1 = (.75),
    x2 = c(.99),
    y1 = c(.75),
    y2 = c(.99)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity") +
  coord_equal(xlim = c(0,1), ylim = c(0,1))

With classification threshold \(\tau=1\)

If the threshold \(\tau=1\), then every observation is classified as 0.

Confusion table (values expressed in proportions) for a classifier that systematically assigns class 0
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = \(1-\gamma\) False Positive (FP) = 0
Obs. Positive (1) False Negative (FN) = \(\gamma\) True Positive (TP) = 0

In such a case:

  • \(TPR = \frac{TP}{TP+FN} = 0\) ;
  • \(FPR = \frac{FP}{FP+TN} = 0\).
Show the R codes
ggplot(
  data = tibble(
    fpr = c(1, 0),
    tpr = c(1, 0)
  ),
  mapping = aes(x = fpr, y = tpr)) +
  geom_point() +
  geom_text(data = tibble(
    fpr = c(.75, .25),
    tpr = c(.7, .3),
    lab = c("tau == 0", "tau == 1")),
    mapping = aes(label = lab), parse = T) +
  geom_curve(data = tibble(
    x1 = c(.75, .25),
    x2 = c(.99, 0.01),
    y1 = c(.75, .25),
    y2 = c(.99, 0.01)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity") +
  coord_equal(xlim = c(0,1), ylim = c(0,1))

With a perfect classifier

If the classifier perfectly predicts the class of the observations:

  • If \(\tau=0\), then the classifier is no longer perfect, and we have:

    • \(TPR = 1\),
    • \(FPR = 1\) ;
  • If \(\tau=1\), the classifier is not perfect either, and we have:

    • \(TPR = 0\),
    • \(FPR = 1\) ;
  • For any other value of \(0<\tau<1\):

  • \(TPR = \frac{\gamma}{\gamma+0}=1\),

  • \(FPR = 0\).

Confusion table (values expressed in proportions) for a classifier that perfectly predicts the classes, for \(0<\tau<1\)
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = \((1-\gamma)\) False Positive (FP) =\(0\)
Obs. Positive (1) False Negative (FN) = \(0\) True Positive (TP) = \(\gamma\)

Then, the ROC curve for a perfect classifier will be made of three points: (0,0), (0,1), and (1,1).

Show the R codes
ggplot(
  data = tibble(
    fpr = c(1, 0, 0),
    tpr = c(1, 0, 1)
  ),
  mapping = aes(x = fpr, y = tpr)) +
  geom_line(colour = "#E69F00") +
  geom_point(colour = "#E69F00") +
  geom_text(data = tibble(
    fpr = c(.75, .25, .25),
    tpr = c(.7, .3, .7),
    lab = latex2exp::TeX(c("$\\tau = 0$", "$\\tau = 1$", "$0 < \\tau < 1$"))),
    mapping = aes(label = lab),
    parse=T,
    colour = c("black", "black", "#E69F00")) +
  geom_curve(data = tibble(
    x1 = c(.75, .25, .25),
    x2 = c(.99, 0.01, 0.01),
    y1 = c(.75, .25, .75),
    y2 = c(.99, 0.01, 0.99)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  geom_label(x = 0.5, y = .95, label = "Perfect classifier", fill = "#E69F00") +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity") +
  coord_equal(xlim = c(0,1), ylim = c(0,1))

With a classifier that randomly assigns the classes

Now, let us consider the case where the classifier randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\). In such a case, the expected proportions are as follows:

Confusion table (values expressed in proportions) for a classifier that randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\)
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = \((1-p) (1-\gamma)\) False Positive (FP) =\(p (1-\gamma)\)
Obs. Positive (1) False Negative (FN) = \((1-p) \gamma\) True Positive (TP) = \(p \gamma\)

In that case the expected values for the TPR and the FPR are:

  • \(TPR = \frac{TP}{TP+FN} = \frac{p\gamma}{p\gamma + (1-p)\gamma} = p\) ;
  • \(FPR = \frac{FP}{FP+TN} = \frac{p(1-\gamma)}{p(1-\gamma) + (1-p) (1-\gamma)}=p\).

So, on average, with a classifier that randomly assigns a class to the observations, \(TPR=FPR=p\). The ROC curve, for such a classifier will thus be a line from (0,0) to (1,1), i.e., a diagonal. The classifier would have no discriminative power.

On the ROC curve, points above that diagonal will correspond to classifiers that provide results that are better than a random guess while points below will correspond to classifiers that provide results that are worse than a random guess.

Show the R codes
ggplot(
  data = tibble(
    fpr = c(1, 0, 0),
    tpr = c(1, 0, 1)
  ),
  mapping = aes(x = fpr, y = tpr)) +
  geom_line(colour = "#E69F00") +
  geom_point(colour = "#E69F00") +
  geom_abline(slope = 1, intercept = 0, colour = "#009E73", linetype = "dashed") +
  geom_text(data = tibble(
    fpr = c(.6, .25, .25),
    tpr = c(.7, .15, .7),
    lab = latex2exp::TeX(c("$\\tau = 0$", "$\\tau = 1$", "$0 < \\tau < 1$"))),
    mapping = aes(label = lab),
    parse=TRUE,
    colour = c("black", "black", "#E69F00")) +
  geom_curve(data = tibble(
    x1 = c(.6, .25, .25),
    x2 = c(.99, 0.01, 0.01),
    y1 = c(.75, .1, .75),
    y2 = c(.99, 0.01, 0.99)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  geom_label(x = 0.5, y = .95, label = "Perfect classifier", fill = "#E69F00") +
  geom_label(x = 0.6, y = .35, label = "Random guess", fill = "#009E73") +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity") +
  coord_equal(xlim = c(0,1), ylim = c(0,1))

The roc_curve() function from {yardstick} allows to get the TPR/sensitivity and the TNR/specificity (the FPR is equal to 1-TNR). Let us apply it to the train data firt.

library(yardstick)
tibble(observed = df_train$y, prob_class_1 = pred_prob_train[,"1"]) %>% 
  roc_curve(truth = observed, prob_class_1, event_level="second")
# A tibble: 67 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1   -Inf           0               1
 2      0           0               1
 3      0.005       0.873           1
 4      0.01        0.910           1
 5      0.015       0.923           1
 6      0.02        0.930           1
 7      0.025       0.938           1
 8      0.03        0.947           1
 9      0.035       0.960           1
10      0.04        0.967           1
# … with 57 more rows

Let us put in a table the predicted probability for each observation, for both models.

pred_prob_train_both <- 
  tibble(observed = df_train$y, prob_class_1 = pred_prob_train[,"1"],
       model = "First") %>% 
  bind_rows(
    tibble(observed = df_train$y, prob_class_1 = pred_prob_train_2[,"1"],
           model = "Second") 
  )

Then, the autoplot() from {yardstick} can be applied to the resulting table to plot the ROC curve.

 pred_prob_train_both %>% 
  group_by(model) %>% 
  roc_curve(truth = observed, prob_class_1, event_level="second") %>% 
  autoplot() +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity")

ROC Curve for the first and the second models, training data

The Area Under the Curve (AUC) can be computed using the roc_auc() function from {yardstick}:

pred_prob_train_both %>% group_by(model) %>% 
  roc_auc(truth = observed, prob_class_1, event_level="second")
# A tibble: 2 × 4
  model  .metric .estimator .estimate
  <chr>  <chr>   <chr>          <dbl>
1 First  roc_auc binary         0.999
2 Second roc_auc binary         0.994

The AUC is really close to 1. The first model seems to provide better results than the second, with regard to the Area Under Curve.

And with the test data:

pred_prob_test_both <- 
  tibble(observed = df_test$y, prob_class_1 = pred_prob_test[,"1"],
       model = "First") %>% 
  bind_rows(
    tibble(observed = df_test$y, prob_class_1 = pred_prob_test_2[,"1"],
           model = "Second") 
  )
pred_prob_test_both %>%
  group_by(model) %>% 
  roc_curve(truth = observed, prob_class_1, event_level="second") %>% 
  autoplot() +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity")

ROC Curve for both models, test data

On the test data, the picture is less clear. Let us compute the values:

pred_prob_test_both %>% group_by(model) %>% 
  roc_auc(truth = observed, prob_class_1, event_level="second")
# A tibble: 2 × 4
  model  .metric .estimator .estimate
  <chr>  <chr>   <chr>          <dbl>
1 First  roc_auc binary         0.899
2 Second roc_auc binary         0.900

Optimal threshold

Let us now turn to finding the best probability threshold for a model. We have seen that a perfect classifier is located on the top-left corner of the plot of the ROC curve (FPR=0, TPR=1).

There are many ways to define the optimal threshold value (see Unal (2017)).

For simplicity, let us use the Euclidean distance to find the closest point to (0,1). The optimal threshold \(\tau^\star\) thus minimises the following distance:

\[d(\tau) = \sqrt{\text{Sensitivity}(\tau)^2 + \left((1-\text{Specificity}(\tau)\right)^2}\]

where \(\text{Sensitivity} = \frac{TP}{TP+FN}\) and \(\text{Specificity} = 1 - \frac{FP}{FP+TN}\).

Let us define a function that computes the Euclidean distance from two points a and b, where a and b are provided in a table:

eucl_dist_tibble <- function(a,b) sqrt(rowSums(a-b)^2)

The distance from every point (\(\text{Sensitivity}(\tau)\), \(1-\text{Specificity}(\tau)\)) to the point (0,1) can the be computed as follows:

distances <- 
  tibble(observed = df_test$y, prob_class_1 = pred_prob_test[,"1"]) %>% 
  roc_curve(truth = observed, prob_class_1, event_level="second") %>% 
  filter(`.threshold` >0, `.threshold`<1) %>% 
  mutate(dist = eucl_dist_tibble(cbind(sensitivity, 1-specificity),
                                 cbind(rep(0, n()), rep(1, n())))) %>% 
  arrange(dist)
distances
# A tibble: 34 × 4
   .threshold specificity sensitivity    dist
        <dbl>       <dbl>       <dbl>   <dbl>
 1      0.015       0.886       0.889 0.00294
 2      0.02        0.906       0.889 0.0174 
 3      0.01        0.864       0.889 0.0253 
 4      0.005       0.817       0.889 0.0722 
 5      0.025       0.916       0.778 0.139  
 6      0.03        0.923       0.778 0.145  
 7      0.035       0.927       0.667 0.260  
 8      0.04        0.935       0.667 0.268  
 9      0.045       0.939       0.556 0.383  
10      0.055       0.943       0.556 0.387  
# … with 24 more rows

The optimal threshold obtained with this method is thus:

optimal_tau <- distances %>% slice(1) %>% .$.threshold
optimal_tau
  432 
0.015 
tibble(observed = df_test$y, prob_class_1 = pred_prob_test[,"1"]) %>% 
  roc_curve(truth = observed, prob_class_1, event_level="second") %>% 
  autoplot() +
  labs(x = "False Positive Rate = FP / (FP+TN) = 1-specificity",
       y = "True Positive Rate = TP/(TP+FN) = Sensitivity") +
  geom_point(data = distances %>% slice(1),
             mapping = aes(x = 1-specificity, y = specificity),
             colour = "red") +
  geom_text(data = distances %>% slice(1),
             mapping = aes(x = 1-specificity + .25, y = specificity),
            colour = "red", label = "Optimal threshold")

ROC Curve for the logistic model, test data

The predictions made using that probability threshold for the classifier can then computed as follows:

pred_class_train_opt <- ifelse(pred_prob_train[,"1"] > optimal_tau, "1", "0") %>% 
  factor(levels = c("0", "1"))

And the confusion matrix obtained is:

caret::confusionMatrix(data = pred_class_train_opt,
                         reference = df_train$y,
                         positive="1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1821    0
         1  138   41
                                         
               Accuracy : 0.931          
                 95% CI : (0.919, 0.9417)
    No Information Rate : 0.9795         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.3511         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 1.0000         
            Specificity : 0.9296         
         Pos Pred Value : 0.2291         
         Neg Pred Value : 1.0000         
             Prevalence : 0.0205         
         Detection Rate : 0.0205         
   Detection Prevalence : 0.0895         
      Balanced Accuracy : 0.9648         
                                         
       'Positive' Class : 1              
                                         

And for the test set:

pred_class_test_opt <- ifelse(pred_prob_test[,"1"] > optimal_tau, "1", "0") %>% 
  factor(levels = c("0", "1"))

And the confusion matrix obtained is:

caret::confusionMatrix(data = pred_class_test_opt,
                         reference = df_test$y,
                         positive="1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 445   1
         1  46   8
                                         
               Accuracy : 0.906          
                 95% CI : (0.877, 0.9301)
    No Information Rate : 0.982          
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.2302         
                                         
 Mcnemar's Test P-Value : 1.38e-10       
                                         
            Sensitivity : 0.8889         
            Specificity : 0.9063         
         Pos Pred Value : 0.1481         
         Neg Pred Value : 0.9978         
             Prevalence : 0.0180         
         Detection Rate : 0.0160         
   Detection Prevalence : 0.1080         
      Balanced Accuracy : 0.8976         
                                         
       'Positive' Class : 1              
                                         

Precision-recall curve

With both the train and test data, looking at the ROC-curve, we might think that our results are pretty good. Let us look at the precision-recall curve, to focus specifically on the positive predictions. Recall that:

  • Precision = \(\frac{TP}{TP+FP}\) ;
  • Recall = \(\frac{TP}{TP+FN}\).

These two metrics focus on the positive class, which is the minority class here. With a dataset where there is a fewer number of observation from class 1, we are interested in assessing ability of the classifier to predict class 1. The Precision and recall metrics are thus helpful to have an idea of the performance of the model on the minority class. The precision-recall curve plots the precision as a function of the recall. The different values are obtained, as in the case of the ROC curve, by varying the probability threshold \(\tau\).

With classification threshold \(\tau=0\)

If the threshold \(\tau=0\), the confusion matrix will be as follows:

Confusion table (values expressed in proportions) for a classifier, with a threshold \(\tau=0\)
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = 0 False Positive (FP) =\((1-\gamma)\)
Obs. Positive (1) False Negative (FN) = 0 True Positive (TP) = \(\gamma\)
  • Precision = \(\frac{TP}{TP+FP} = \frac{\gamma}{\gamma + (1-\gamma)} = \gamma\) ;
  • Recall = \(\frac{TP}{TP+FN} = \frac{\gamma}{\gamma + 0} = 1\).

For example, if \(\gamma=20\%\) :

Show the R codes
ggplot(
  data = tibble(
    precision = c(0.2),
    recall = c(1)
  ),
  mapping = aes(x = recall, y = precision)) +
  geom_point() +
  geom_text(data = tibble(
    precision = c(0.05),
    recall = c(.75),
    lab = c("tau == 0")),
    mapping = aes(label = lab), parse = T) +
  geom_curve(data = tibble(
    x1 = c(.75),
    x2 = c(.98),
    y1 = c(.1),
    y2 = c(.2)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  labs(x = "Recall, Sensitivity, True Positive Rate",
       y = "Precision, Positive Predictive Value") +
  coord_equal(xlim = c(0,1), ylim = c(0,1)) +
  scale_y_continuous(limits=c(0, 1), breaks = c(.2, seq(0, 1, by = .25)),
                     labels = c(latex2exp::TeX("$\\gamma$"), seq(0, 1, by = .25)))

With classification threshold \(\tau=1\)

If the threshold \(\tau=1\), then:

Confusion table (values expressed in proportions) for a classifier, with a threshold \(\tau=0\)
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = \(1-\gamma\) False Positive (FP) = 0
Obs. Positive (1) False Negative (FN) = \(\gamma\) True Positive (TP) = 0

Hence:

  • Precision = \(\frac{TP}{TP+FP}\) : undefined.
  • Recall = \(\frac{TP}{TP+FN} = 0\).

In practice, while the value of \(\tau\) increases, more and more cases are classified as 0. The proportion of false positive decreases and usually becomes negligible when compared to the proportion of true positive. The precision thus increases and may be equal to 1.

With a perfect classifier

Now, let us turn to a classifier that perfectly predicts the two classes for each case.

Confusion table (values expressed in proportions) for a perfect classifier, with a threshold \(0<\tau<1\)
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = \(1-\gamma\) False Positive (FP) = 0
Obs. Positive (1) False Negative (FN) = 0 True Positive (TP) = \(\gamma\)

In such a case:

  • Precision = \(\frac{TP}{TP+FP}=1\) ;
  • Recall = \(\frac{TP}{TP+FN} = 1\).
Show the R codes
ggplot(
  data = tibble(
    precision = c(0.2, 1),
    recall = c(1, 1)
  ),
  mapping = aes(x = recall, y = precision)) +
  geom_point(colour = c("black", "#E69F00")) +
  geom_text(data = tibble(
    precision = c(0.05, .8, .75),
    recall = c(.75, .8, .8),
    lab = latex2exp::TeX(c("$\\tau = 0$","$0 < \\tau < 1$", "perfect ~~ classifier"))),
    mapping = aes(label = lab), parse = T,
    colour = c("black", "#E69F00", "#E69F00")) +
  geom_curve(data = tibble(
    x1 = c(.75, .8),
    x2 = c(.98, .98),
    y1 = c(.1, .85),
    y2 = c(.2, .98)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  labs(x = "Recall, Sensitivity, True Positive Rate",
       y = "Precision, Positive Predictive Value") +
  coord_equal(xlim = c(0,1), ylim = c(0,1)) +
  scale_y_continuous(limits=c(0, 1), breaks = c(.2, seq(0, 1, by = .25)),
                     labels = c(latex2exp::TeX("$\\gamma$"), seq(0, 1, by = .25)))

With a classifier that randomly assigns the classes

If the classifier randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\), regardless of \(\tau\), we the expected proportions are:

Confusion table (values expressed in proportions) for a perfect classifier, with a threshold \(0<\tau<1\)
Pred. Negative (0) Pred. Positive (1)
Obs. Negative (0) True Negative (TN) = \((1-p) (1-\gamma)\) False Positive (FP) = \(p(1-\gamma)\)
Obs. Positive (1) False Negative (FN) = \((1-p)\gamma\) True Positive (TP) = \(p\gamma\)

Which leads to:

  • Precision = \(\frac{TP}{TP+FP} = \frac{p\gamma}{p\gamma + p(1-\gamma)} = \gamma\) ;
  • Recall = \(\frac{TP}{TP+FN} = \frac{p\gamma}{p\gamma + (1-p)\gamma}= p\).

Hence, for \(0<\tau<1\), the point corresponding to a classifier that randomly assigns class 1 with probability \(p\) will be on the dashed green line. The x-coordinate of that point will be equal to \(p\) (assuming that there is a proportion \(\gamma\) of class 1 individuals in the data).

Show the R codes
ggplot(
  data = tibble(
    precision = c(0.2, 1),
    recall = c(1, 1)
  ),
  mapping = aes(x = recall, y = precision)) +
  geom_point(colour = c("black", "#E69F00")) +
  geom_text(data = tibble(
    precision = c(0.05, .8, .75),
    recall = c(.75, .8, .8),
    lab = latex2exp::TeX(c("$\\tau = 0$","$0 < \\tau < 1$", "perfect ~~ classifier"))),
    mapping = aes(label = lab), parse = T,
    colour = c("black", "#E69F00", "#E69F00")) +
  geom_curve(data = tibble(
    x1 = c(.75, .8),
    x2 = c(.98, .98),
    y1 = c(.1, .85),
    y2 = c(.2, .98)
  ),
  mapping = aes(x = x1, y = y1, xend = x2, yend = y2),
  arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
  color = "gray20", curvature = -0.3
  ) +
  geom_hline(yintercept = .2, colour = "#009E73", linetype = "dashed") +
  geom_label(label = "Random guess", fill = "#009E73", x = .5, y = .25) +
  labs(x = "Recall, Sensitivity, True Positive Rate",
       y = "Precision, Positive Predictive Value") +
  coord_equal(xlim = c(0,1), ylim = c(0,1)) +
  scale_y_continuous(limits=c(0, 1), breaks = c(.2, seq(0, 1, by = .25)),
                     labels = c(latex2exp::TeX("$\\gamma$"), seq(0, 1, by = .25)))

With our models

Recall from the confusion matrix (showing proportions) for the first model, on the test sample:

Metric Training data Test data
Precision (TP / (TP+FP)) 0.98 0.98
Recall/Sensitivity (TP / (TP + FN)) 0.11 0.11

And for the second:

Metric Training data Test data
Precision (TP / (TP+FP)) 0.98 0.98
Recall/Sensitivity (TP / (TP + FN)) 0.11 0.11

When faced with an unbalanced dataset and when the model fails to correctly predict the positive class, the ROC curve will not necessarily reflect this if the imbalance is large. The ROC curve will in fact be close to the ROC curve of a perfectly discriminating model. In contrast, the Precision-Recall curve will not be as close as the one corresponding to a perfect classifier. Let us illustrate this.

The pr_curve() from {yardstick} allows us to get the precision and recall for different values of \(\tau\).

tibble(observed = df_train$y, prob_class_1 = pred_prob_train[,"1"]) %>% 
  pr_curve(truth = observed, prob_class_1, event_level="second")
# A tibble: 66 × 3
   .threshold recall precision
        <dbl>  <dbl>     <dbl>
 1    Inf     0              1
 2      0.995 0.0732         1
 3      0.97  0.0976         1
 4      0.965 0.146          1
 5      0.9   0.171          1
 6      0.785 0.195          1
 7      0.73  0.220          1
 8      0.65  0.244          1
 9      0.635 0.268          1
10      0.58  0.293          1
# … with 56 more rows

And for the second model:

tibble(observed = df_train$y, prob_class_1 = pred_prob_train_2[,"1"]) %>% 
  pr_curve(truth = observed, prob_class_1, event_level="second")
# A tibble: 52 × 3
   .threshold recall precision
        <dbl>  <dbl>     <dbl>
 1    Inf     0          1    
 2      0.995 0.0732     1    
 3      0.975 0.146      1    
 4      0.915 0.171      1    
 5      0.755 0.195      1    
 6      0.705 0.220      1    
 7      0.47  0.244      1    
 8      0.43  0.268      1    
 9      0.415 0.293      1    
10      0.4   0.293      0.857
# … with 42 more rows

Putting those in a single table:

pr_curve_both_train <- 
  tibble(observed = df_train$y, prob_class_1 = pred_prob_train[,"1"],
         model = "First") %>% 
  bind_rows(
    tibble(observed = df_train$y, prob_class_1 = pred_prob_train_2[,"1"], 
           model = "Second")
  ) %>% 
  group_by(model) %>% 
  pr_curve(truth = observed, prob_class_1, event_level="second")

Again, the autoplot() function can be used to plot the curve from the obtained table.

pr_curve_both_train %>% 
  autoplot() +
  geom_line(tibble(recall = c(0,0.9999,1), precision = c(1,1,0)),
            mapping = aes(x = recall, y = precision),
            colour = "#E69F00") +
  geom_hline(yintercept = sum(df_train$y==1) / sum(df_train$y==0), linetype = "dashed", colour = "#009E73") +
  geom_label(data = tibble(x = c(.25, 0.25), y = c(sum(df_train$y==1) / sum(df_train$y==0), 0.95),
                           lab = c("Random classifier", "Perfect classifier")),
             mapping = aes(x=x, y=y, label=lab, fill = lab)) +
  scale_fill_manual(NULL, guide = "none",
                    values = c("Random classifier" = "#009E73",
                               "Perfect classifier" = "#E69F00")) +
  coord_equal() +
  labs(x = "Recall, Sensitivity, True Positive Rate",
       y = "Precision, Positive Predictive Value")

Precision-recall for both models, training data

With the test data:

pr_curve_both_test <- 
  tibble(observed = df_test$y, prob_class_1 = pred_prob_test[,"1"],
         model = "First") %>% 
  bind_rows(
    tibble(observed = df_test$y, prob_class_1 = pred_prob_test_2[,"1"], 
           model = "Second")
  ) %>% 
  group_by(model) %>% 
  pr_curve(truth = observed, prob_class_1, event_level="second")
pr_curve_both_test %>% 
  autoplot() +
  geom_line(tibble(recall = c(0,0.9999,1), precision = c(1,1,0)),
            mapping = aes(x = recall, y = precision),
            colour = "#E69F00") +
  geom_hline(yintercept = sum(df_train$y==1) / sum(df_train$y==0), linetype = "dashed", colour = "#009E73") +
  geom_label(data = tibble(x = c(.25, 0.25), y = c(sum(df_train$y==1) / sum(df_train$y==0), 0.95),
                           lab = c("Random classifier", "Perfect classifier")),
             mapping = aes(x=x, y=y, label=lab, fill = lab)) +
  scale_fill_manual(NULL, guide = "none",
                    values = c("Random classifier" = "#009E73",
                               "Perfect classifier" = "#E69F00")) +
  coord_equal() +
  labs(x = "Recall, Sensitivity, True Positive Rate",
       y = "Precision, Positive Predictive Value")

Precision-recall for the both models, test data

Clearly, the performances of the model are much less impressive viwed that way…

F1 Score

With an imbalanced dataset, if we focus on the accuracy, we might select a model that gives very good results on average, but poor results to predict one of the classes. Some metrics are built to take into consideration the predictive capabilities for both classes.

This is the case of the F1 score. It computes the harmonic mean of the precision an sensitivity: \[\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2 \times TP}{2 \times TP + FP + FN}.\] The F1 score values range from 0 (if either precision or sensitivity is equal to 0) to 1 (both precision and sensitivity equal to 1).

With an imbalanced dataset, it may be useful to rely on this metric to assess the quality of fit of a given classifier when performing model selection.

With our two models, we can use the F1-Score to select the optimal threshold:

thresholds_F1 <- 
  pr_curve_both_test %>% 
  mutate(F1_score = 2*precision*recall/ (precision+recall)) %>% 
  group_by(model) %>% 
  arrange(desc(F1_score)) %>% 
  slice(1)
thresholds_F1
# A tibble: 2 × 5
# Groups:   model [2]
  model  .threshold recall precision F1_score
  <chr>       <dbl>  <dbl>     <dbl>    <dbl>
1 First       0.295  0.333     0.375    0.353
2 Second      0.21   0.333     0.375    0.353
pr_curve_both_test %>% 
  autoplot() +
  geom_line(tibble(recall = c(0,0.9999,1), precision = c(1,1,0)),
            mapping = aes(x = recall, y = precision),
            colour = "#E69F00") +
  geom_point(data = thresholds_F1, mapping = aes(x = recall, y = precision, colour = model)) +
  geom_hline(yintercept = sum(df_train$y==1) / sum(df_train$y==0), linetype = "dashed", colour = "#009E73") +
  geom_label(data = tibble(x = c(.25, 0.25), y = c(sum(df_train$y==1) / sum(df_train$y==0), 0.95),
                           lab = c("Random classifier", "Perfect classifier")),
             mapping = aes(x=x, y=y, label=lab, fill = lab)) +
  scale_fill_manual(NULL, guide = "none",
                    values = c("Random classifier" = "#009E73",
                               "Perfect classifier" = "#E69F00")) +
  coord_equal() +
  labs(x = "Recall, Sensitivity, True Positive Rate",
       y = "Precision, Positive Predictive Value")

Precision-recall for the both models, test data

Warning

When faced with an imbalanced dataset, what should we do?

Unfortunately (or maybe, fortunately?), there is no “good” answer to this question. There are multiple ways to address this issue. And as is often the case with machine learning, we have to try different options and we have to make choices.

As of today, there are three ways to deal with imbalanced datasets:

  1. data resampling
  2. algorithm modifications (not in the scope of this notebook)
  3. cost-sensitive learning (not in the scope of this notebook).

In the remainder of this notebook, we will have a look at data resampling methods.

Rebalancing the dataset

We have seen in the first part of this notebook that in the presence of a data sample containing much more observations of the negative class than of the positive class, it is difficult to obtain an algorithm with good predictive capabilities for the minority (positive) class. Yet, obtaining good performance in predicting the minority class may be the precise purpose of the classification (detection of cancer, fraud, recession, …). In this case, it is possible to re-equillibrate the data set to decrease the imbalance ratio.

This part of the notebook will present two broad techniques:

  1. under-sampling: the number of observations from the majority class is decreased
  2. over-sampling: the number of observations from the minority class is increased.

Under-sampling

A way to obtain a balanced dataset consists in sampling from the majority class to decrease the number of observations at hand from this class.There a many techniques to do so.

Random under-sampling

One of the most simple technique is random sampling. All we need to do is to decide the desired imbalance ratio we would like to obtain. For example, if we want that ratio to be equal to 1 (1 minority case for each majority case), we just need to sample as many examples of the majority as there are examples of the minority class.

In the traiing set, the imbalance ratio is:

sum(df_train$y==0) / sum(df_train$y==1)
[1] 47.78049

The number of minority cases is:

(n_minority <- sum(df_train$y==1))
[1] 41

We will thus randomly sample 41 cases from the majority case, and keep all cases from the minority class.

df_train_under_random_1 <- 
  df_train %>% 
  group_by(y) %>% 
  slice_sample(n = n_minority, replace = FALSE) %>% 
  ungroup()
ggplot(data = df_train_under_random_1,
       aes(x = x_1, y = x_2)) +
  geom_point(mapping = aes(colour = y)) +
  scale_colour_manual(NULL, values = c("1" = wongOrange, "0" = wongBlue))

A rebalanced dataset with an imbalance ratio of 1

Undersampling to obtain an imbalance ratio of 1 resulted in a final training set with 2 times the number of minority cases, i.e., 82. This number of observations may be too small, depending on the variability in the data. To have a few more observations on which to train the model, we can accept to have a slightly unbalanced data set.

For example, if we accept to have an imbalance ratio of 2 (2 observations of the majority class for each observation of the minority class):

nrow(df_train)
[1] 2000
df_train_under_random_2 <- 
  df_train %>% 
  # Shuffle observations
  sample_frac(1L) %>% 
  group_by(y) %>% 
  # Under-sampling in the majority class
  slice_head(n = 3*n_minority) %>% 
  ungroup()
ggplot(data = df_train_under_random_2,
       aes(x = x_1, y = x_2)) +
  geom_point(mapping = aes(colour = y)) +
  scale_colour_manual(NULL, values = c("1" = wongOrange, "0" = wongBlue))

A rebalanced dataset with an imbalance ratio of 2

When observations are randomly drawn from the majority class, the resulting sample may not be representative of the original data. Removing some individuals may result in removing important information from the training dataset.

Therefore, instead of drawing randomly among the observations of the majority class, some techniques focus on orienting the drawing (by removing redundant individuals, for example), or on creating synthetic individuals among the individuals of the majority class.

Undersampling using KNN

Instead of randomly select observations to keep or remove from the majority class, Mani and Zhang (2003) suggest using the nearest neighbors. They propose 3 different versions.

Note
  • NearMiss-1:

    • for each example from the majority class, compute the average distance to k closest examples among the minority class
    • select majority examples for which the average distance computed in the previous step is the smallest
  • NearMiss-2:

    • for each example from the majority class, compute the average distance to k farthest examples among the minority class
    • select majority examples for which the average distance computed in the previous step is the smallest
  • NearMiss-3:

    • for each example from the minority class, identify and select a given number of neighbors among the majority class.
    • for each of the neighbors obtained in the previous step, compute the distance to each example from the the minority class, then, compute the average distance from their 3 closest neighbors
    • select majority examples for which the average distance computed in the previous step is the farthest.

As the NearMiss-1 technique seem to provide poor results, some suggest to select majority examples for which the average distance computed in the previous step is the highest. [ref needed]

Let us illustrate with a few R codes how these variants work. First, let us split the sample according to the value of the target variable: the majority class and the minority class.

majority_sample <- df_train %>% filter(y == 0) %>% select(x_1, x_2) %>% as.matrix()
nrow(majority_sample)
[1] 1959
minority_sample <- df_train %>% filter(y == 1) %>% select(x_1, x_2) %>% as.matrix()
nrow(minority_sample)
[1] 41

Let us say that we want to use \(k=3\) for the KNN:

n_neighbors <- 3

NearMiss-1

In a first step, for each observation from the minority sample, let us compute the distance (using the Euclidean distance here) to each observation from the minority sample:

dists <- fields::rdist(majority_sample, minority_sample)

Each row of the obtained matrix gives the distance to each observation from the minority sample (in column):

dim(dists)
[1] 1959   41

For a single example from the majority, we identify the closests observations from the minority sample. The distances of the first example to each observation from the minority sample are:

dists[1,]
 [1]  5.560756  8.386229  3.836185  5.257663  4.611472  3.375368  3.465841
 [8]  5.310080  7.891209  5.513237 10.274421  3.288169  3.620326  7.294134
[15]  7.229950  9.132841  3.784633  5.755599  4.298604  6.833199  5.378315
[22]  6.527019  6.739599  8.184762  3.146240  6.750714  3.027914  2.554185
[29]  9.280898  5.349391  4.274212  5.804782  2.894196  2.650842  6.136897
[36]  9.872539  5.302912  5.072795  8.952452  2.533850  4.851702

Using the order() function, we can get the index of the examples from the minority cases, ordered by descending values of the distance:

order(dists[1,])
 [1] 40 28 34 33 27 25 12  6  7 13 17  3 31 19  5 41 38  4 37  8 30 21 10  1 18
[26] 32 35 22 23 26 20 15 14  9 24  2 39 16 29 36 11

The distances to the three closest neighbors are:

head(sort(dists[1,]), n_neighbors)
[1] 2.533850 2.554185 2.650842

The average of these distances is:

mean(head(sort(dists[1,]), n_neighbors))
[1] 2.579626

Visually speaking:

Show the R codes
as_tibble(majority_sample) %>% 
  slice(-1) %>% 
  mutate(sample = "Majority") %>% 
bind_rows(
  as_tibble(minority_sample) %>% 
    slice(-head(order(dists[1,]), n_neighbors)) %>% 
    mutate(sample = "Minority")) %>% 
  bind_rows(
    as_tibble(minority_sample) %>% slice(head(order(dists[1,]), n_neighbors)) %>% 
      mutate(sample = "Nearest neighbors")
  ) %>% 
  bind_rows(
    as_tibble(majority_sample) %>% slice(1) %>% 
      mutate(sample = "Current obs.")
  ) %>% 
  ggplot(data = ., mapping = aes(x = x_1, y = x_2, colour = sample, alpha = sample, size = sample)) +
  geom_point() +
  scale_alpha_manual(NULL, values = c("Majority" = .2,
                                      "Minority" = .2,
                                      "Nearest neighbors" = 1,
                                      "Current obs." = 1)) +
  scale_colour_manual(NULL, values = c("Majority" = wongBlue,
                                       "Minority" = wongOrange,
                                       "Nearest neighbors" = wongOrange,
                                       "Current obs." = wongBlue
                                       )) +
  scale_size_manual(NULL, values = c("Majority" = .5,
                                       "Minority" = .5,
                                       "Nearest neighbors" = 2,
                                       "Current obs." = 2
  ))

Computing the average distance from the current observation to the 3-nearest neighbors

Let us compute the average distance to the three nearest neighbors in the minority class for each observation from the majority class:

avg_dist <- apply(dists, 1, function(x) mean(head(sort(x), n_neighbors)))

Then, we can order the results by descending values of the computed average distance and select the first observations from that list. Let us assume that we to keep only as many observations from the majority class as there are in the minority class:

n_to_keep <- n_minority

Then the index of the values we keep from the majority class can be obtained as follows:

ind_to_keep <- order(avg_dist)[1:n_to_keep]
head(ind_to_keep)
[1] 1874 1942  837  263 1083 1346

Let us have a visual representation:

Show the R codes
as_tibble(majority_sample[ind_to_keep, ]) %>% mutate(sample = "Majority (keep)") %>% 
  bind_rows(
    as_tibble(majority_sample[-ind_to_keep, ]) %>% mutate(sample = "Majority (discard)")
  ) %>% 
  bind_rows(as_tibble(minority_sample) %>% mutate(sample = "Minority")) %>% 
  ggplot(data = .,
         mapping = aes(x = x_1, y = x_2, colour = sample, alpha = sample)) +
  geom_point() +
  scale_alpha_manual(NULL, values = c("Majority (keep)" = 1, "Majority (discard)" = .1,
                                "Minority" = 1)) +
  scale_colour_manual(NULL, values = c("Majority (keep)" = wongBlue,
                                       "Majority (discard)" = wongBlue,
                                       "Minority" = wongOrange))

Observations in the training sample after NearMiss-1 is applied

NearMiss-2

This time, instead of computing the average distance to the closest neighbors, we compute the average distance to the farthest neighbors.

dists <- fields::rdist(majority_sample, minority_sample)
dim(dists)
[1] 1959   41

The distance to the farthest neighbors:

tail(sort(dists[1,]), n_neighbors)
[1]  9.280898  9.872539 10.274421

We compute the average:

mean(tail(sort(dists[1,]), n_neighbors))
[1] 9.809286
Show the R codes
as_tibble(majority_sample) %>% 
  slice(-1) %>% 
  mutate(sample = "Majority") %>% 
  bind_rows(
    as_tibble(minority_sample) %>% 
      slice(-tail(order(dists[1,]), n_neighbors)) %>% 
      mutate(sample = "Minority")) %>% 
  bind_rows(
    as_tibble(minority_sample) %>%
      slice(tail(order(dists[1,]), n_neighbors)) %>% 
      mutate(sample = "Farthest neighbors")
  ) %>% 
  bind_rows(
    as_tibble(majority_sample) %>% slice(1) %>% 
      mutate(sample = "Current obs.")
  ) %>% 
  ggplot(data = ., mapping = aes(x = x_1, y = x_2, colour = sample, alpha = sample, size = sample)) +
  geom_point() +
  scale_alpha_manual(NULL, values = c("Majority" = .2,
                                      "Minority" = .2,
                                      "Farthest neighbors" = 1,
                                      "Current obs." = 1)) +
  scale_colour_manual(NULL, values = c("Majority" = wongBlue,
                                       "Minority" = wongOrange,
                                       "Farthest neighbors" = wongOrange,
                                       "Current obs." = wongBlue
  )) +
  scale_size_manual(NULL, values = c("Majority" = .5,
                                     "Minority" = .5,
                                     "Farthest neighbors" = 2,
                                     "Current obs." = 2
  ))

Computing the average distance from the current observation to the 3-farthest neighbors

We do it for each observation from the majority class:

avg_dist_farthest <- apply(dists, 1, function(x) mean(tail(sort(x), n_neighbors)))

After ordering the results by increasing values, we can select a given number of observations with the smallest distances:

ind_to_keep_2 <- order(avg_dist_farthest)[1:n_to_keep]

Visually:

Show the R codes
as_tibble(majority_sample[ind_to_keep_2, ]) %>% mutate(sample = "Majority (keep)") %>% 
  bind_rows(
    as_tibble(majority_sample[-ind_to_keep_2, ]) %>% mutate(sample = "Majority (discard)")
  ) %>% 
  bind_rows(as_tibble(minority_sample) %>% mutate(sample = "Minority")) %>% 
  ggplot(data = .,
         mapping = aes(x = x_1, y = x_2, colour = sample, alpha = sample)) +
  geom_point() +
  scale_alpha_manual(NULL, values = c("Majority (keep)" = 1, "Majority (discard)" = .1,
                                      "Minority" = 1)) +
  scale_colour_manual(NULL, values = c("Majority (keep)" = wongBlue,
                                       "Majority (discard)" = wongBlue,
                                       "Minority" = wongOrange))

Observations in the training sample after NearMiss-2 is applied