Machine Learning and Statistical Learning

Dealing with unbalanced datasets

Author

Ewen Gallic

When a classifier is built on unbalanced data, its ability to correctly predict each class is generally poor. In this notebook, we consider a binary classifier trained to predict two classes, which we call the majority class and the minority class. By convention, the minority class is coded 1 and the majority class is coded 0. In theory, we speak of imbalance as soon as the two classes are not equally represented (that is, as soon as the split is not 50% of 0 and 50% of 1). In practice, problems arise when the imbalance is strong, for example 1% of 1 and 99% of 0. The minority class may also contain extremely few observations in absolute terms. The typical example in machine learning is fraud detection, where more than 99% of observations are non-fraudulent and fewer than 1% involve fraud.
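To make the issue concrete, here is a minimal sketch, using simulated labels rather than real data, of why a strong imbalance is problematic. It assumes a 1%/99% split and a naive rule that always predicts the majority class. The object names (`y_true`, `y_pred`, `recall_minority`) are chosen for this illustration only.

```r
# Simulated labels: 1% minority class ("1"), 99% majority class ("0")
n_obs <- 10000
y_true <- factor(c(rep("1", 100), rep("0", 9900)), levels = c("0", "1"))

# Naive rule: always predict the majority class
y_pred <- factor(rep("0", n_obs), levels = c("0", "1"))

# Confusion matrix: observed classes in rows, predicted classes in columns
confusion <- table(observed = y_true, predicted = y_pred)

# Share of correct predictions over all observations
accuracy <- mean(y_pred == y_true)

# Share of minority observations that are correctly predicted
recall_minority <- confusion["1", "1"] / sum(confusion["1", ])

confusion
accuracy
recall_minority
```

The naive rule reaches an accuracy of 0.99 while detecting none of the minority observations, which is why overall accuracy alone says little about a classifier trained on unbalanced data.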

To illustrate the different concepts, we will use data that we generate. Let us create a dataset with two spirals. To do so, we rely on some code published by Stanislas Morbieu on R-bloggers.

# Code from
# https://www.r-bloggers.com/2018/11/generate-datasets-to-understand-some-clustering-algorithms-behavior/
library(dplyr)
library(ggplot2)
library(mvtnorm)

n <- 5000 # number of observations in each spiral (i.e., in each class)
set.seed(123)
generateSpiralData <- function(n) {
  maxRadius = 7
  xShift = 2.5
  yShift = 2.5
  angleStart = 2.5 * pi
  noiseVariance = 0.4
  
  # first spiral
  firstSpiral <- function() {
    d1 = data.frame(0:(n-1))
    colnames(d1) <- c("i")
    d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n),
                  radius = maxRadius * (n + n/5 - i)/ (n + n/5),
                  x = radius * sin(angle),
                  y = radius * cos(angle),
                  class="0")
  }
  d1 = firstSpiral()
  
  # second spiral
  d2 = d1 %>% mutate(x = -x, y = -y, class="1")
  
  # combine, add noise, and shift
  generateNoise <- function(n) {
    sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2)
    noise = rmvnorm(n, mean = c(0, 0), sigma = sigma)
    df = data.frame(noise)
    colnames(df) <- c("xNoise", "yNoise")
    df
  }
  d1 %>%
    bind_rows(d2) %>%
    bind_cols(generateNoise(2*n)) %>%
    transmute(x_1 = x + xShift + xNoise,
              x_2 = y + yShift + yNoise,
              y = factor(class, levels = c("0", "1"))) %>% 
    as_tibble()
}

df <- generateSpiralData(n)
wongBlue <- "#0072B2"
wongOrange <- "#D55E00"

ggplot(data = df, aes(x = x_1, y = x_2)) +
  geom_point(mapping = aes(colour = y)) +
  scale_colour_manual(NULL, values = c("1" = wongOrange, "0" = wongBlue))

Two spirals
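Note that `generateSpiralData(n)` returns `n` observations per class, so the generated dataset is balanced. One possible way to obtain an unbalanced sample from it is to keep every observation of class 0 and only a random subset of class 1. The sketch below is an assumption about how this could be done, not necessarily the approach followed in the rest of the notebook. The helper `downsampleMinority()` and its argument `minorityShare` are hypothetical names, and `toy` is a stand-in with the same columns as `df` so that the example runs on its own with base R.

```r
# Hypothetical helper: keep all majority observations (y == "0") and a random
# subset of minority observations (y == "1"), so that the minority class
# makes up about `minorityShare` of the resulting dataset.
downsampleMinority <- function(data, minorityShare = 0.05) {
  majority <- data[data$y == "0", ]
  minority <- data[data$y == "1", ]
  # Solve minorityShare = k / (k + nrow(majority)) for k
  k <- round(minorityShare * nrow(majority) / (1 - minorityShare))
  # Cannot keep more minority rows than are available
  k <- min(k, nrow(minority))
  kept <- minority[sample(nrow(minority), k), ]
  rbind(majority, kept)
}

# Stand-in for `df`: same column names, balanced classes
set.seed(123)
toy <- data.frame(
  x_1 = rnorm(2000),
  x_2 = rnorm(2000),
  y = factor(rep(c("0", "1"), each = 1000), levels = c("0", "1"))
)

toyUnbalanced <- downsampleMinority(toy, minorityShare = 0.05)

# Class counts and proportions after subsampling
table(toyUnbalanced$y)
prop.table(table(toyUnbalanced$y))
```

The same call could be applied to `df` instead of `toy`; the target share of 5% is arbitrary and can be lowered to mimic a stronger imbalance.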