When a classifier is built on imbalanced data, its ability to correctly predict each class generally suffers. In this notebook, we will consider the case of a binary classifier, trained to predict two classes that we will call the majority class and the minority class. Arbitrarily, we will code the minority class as 1 and the majority class as 0. Strictly speaking, as soon as the split between the two classes is not equal (50% of 0s and 50% of 1s), the data are imbalanced. In practice, problems arise when the imbalance is strong, for example 1% of 1s and 99% of 0s. It is also possible to have extremely few observations of the minority class in absolute terms. The typical example encountered in machine learning is fraud detection, where more than 99% of observations are non-fraudulent while less than one percent of observations involve fraud.
Instead of saying that there are 99% of observations from class 0 and 1% of observations from class 1, it is possible to use the imbalance ratio, i.e., the ratio between the number of observations from the majority class and the number of observations from the minority class: \[\text{imbalance ratio} = \frac{n_{\text{majority}}}{n_{\text{minority}}}.\]
The larger the imbalance ratio, the larger the imbalance. For a perfectly balanced dataset, the imbalance ratio is equal to 1.
Building a classifier on an imbalanced dataset
As a first step, let us fit a random forest on the raw data. We will train the model on a subsample of the data and assess its quality of fit both on the training data itself and on unseen data (the test set). We will consider two different random forests and try to select the one that gives the best results.
Let us put 80% of the data in the training set and leave the remaining 20% for the test set.
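As a sketch, assuming the full dataset sits in a tibble named df (hypothetical name) with the features x_1, x_2 and the target y, the split could be done as follows:
library(dplyr)

set.seed(123)  # hypothetical seed, for reproducibility
# Draw 80% of the row indices for the training set; the rest go to the test set
ind_train <- sample(seq_len(nrow(df)), size = round(0.8 * nrow(df)))
df_train  <- df %>% slice(ind_train)
df_test   <- df %>% slice(-ind_train)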
Let us grow a random forest with a first set of hyperparameters (200 trees, 2 candidate variables to draw from when performing the splits, and a minimum of 20 observations in terminal nodes).
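A sketch of such a fit and of how the confusion-matrix summary shown below can be obtained, assuming the {randomForest} package and caret::confusionMatrix() (object names are hypothetical; y is assumed to be a factor with levels "0" and "1"):
library(randomForest)
library(caret)

# First random forest: 200 trees, 2 candidate variables per split,
# at least 20 observations in terminal nodes
mod_1 <- randomForest(
  formula = y ~ x_1 + x_2,
  data = df_train,
  ntree = 200,
  mtry = 2,
  nodesize = 20
)
# Predicted classes on the training data
pred_class_train_1 <- predict(mod_1, newdata = df_train, type = "response")
confusionMatrix(data = pred_class_train_1, reference = df_train$y, positive = "1")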
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1959 29
1 0 12
Accuracy : 0.9855
95% CI : (0.9792, 0.9903)
No Information Rate : 0.9795
P-Value [Acc > NIR] : 0.02988
Kappa : 0.4477
Mcnemar's Test P-Value : 1.999e-07
Sensitivity : 0.2927
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9854
Prevalence : 0.0205
Detection Rate : 0.0060
Detection Prevalence : 0.0060
Balanced Accuracy : 0.6463
'Positive' Class : 1
While the accuracy (percentage of correctly predicted individuals) is very high (0.99), we need to keep in mind that 98% of the observations are from class 0. The true negative rate (or specificity) (TN/(TN+FP)) is very high (1.00), but the true positive rate (or sensitivity) (TP/(TP+FN)), on the other hand, is low (0.29).
Let us look at those same metrics based on the predictions made on unseen data.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 491 8
1 0 1
Accuracy : 0.984
95% CI : (0.9687, 0.9931)
No Information Rate : 0.982
P-Value [Acc > NIR] : 0.45445
Kappa : 0.1971
Mcnemar's Test P-Value : 0.01333
Sensitivity : 0.1111
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9840
Prevalence : 0.0180
Detection Rate : 0.0020
Detection Prevalence : 0.0020
Balanced Accuracy : 0.5556
'Positive' Class : 1
Again, the accuracy is very high (0.98). The specificity, i.e., the true negative rate is high (1.00), but the sensitivity, i.e., the true positive rate is low (0.11).
A second model
Let us fit a second model, with different specifications:
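As a sketch, the second fit could look like the following, with purely hypothetical hyperparameters (the exact specifications are not reproduced here):
# Second random forest, with different (hypothetical) hyperparameters
mod_2 <- randomForest(
  formula = y ~ x_1 + x_2,
  data = df_train,
  ntree = 500,    # hypothetical: more trees
  mtry = 1,       # hypothetical: a single candidate variable per split
  nodesize = 5    # hypothetical: smaller terminal nodes
)
pred_class_test_2 <- predict(mod_2, newdata = df_test, type = "response")
confusionMatrix(data = pred_class_test_2, reference = df_test$y, positive = "1")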
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 491 8
1 0 1
Accuracy : 0.984
95% CI : (0.9687, 0.9931)
No Information Rate : 0.982
P-Value [Acc > NIR] : 0.45445
Kappa : 0.1971
Mcnemar's Test P-Value : 0.01333
Sensitivity : 0.1111
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9840
Prevalence : 0.0180
Detection Rate : 0.0020
Detection Prevalence : 0.0020
Balanced Accuracy : 0.5556
'Positive' Class : 1
Warning
Which of the two models gives the best results? To try to answer this question, let us consider the ROC curve.
ROC Curve
Let us have a look at the ROC curve. Recall that the x-axis shows the False Positive Rate (here, the fraction of errors for the majority class) while the y-axis shows the True Positive Rate (here, the fraction of correct predictions for the minority class). The graph reports these two metrics as the threshold \(\tau\) varies. This threshold is the cut-off point above which class 1 is predicted for an observation: if \(\mathbb{P}(Y=1 \mid X) \geq \tau\), the observation is classified as 1, and as 0 otherwise. Varying this threshold will favour the TPR at the expense of the FPR, or conversely.
The ROC curve is plotted on a graph with the False positive rate on the x-axis, and the True positive rate on the y-axis:
If the threshold \(\tau=0\), then every observation is classified as 1. Let us assume that the fraction of class 1 in the sample is \(\gamma\). The fraction of class 0 is then \(1-\gamma\).
Confusion table (values expressed in proportions) for a classifier that systematically assigns class 1

|  | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = 0 | False Positive (FP) = \(1-\gamma\) |
| Obs. Positive (1) | False Negative (FN) = 0 | True Positive (TP) = \(\gamma\) |

Hence \(TPR = \gamma/\gamma = 1\) and \(FPR = (1-\gamma)/(1-\gamma) = 1\): such a classifier corresponds to the point (1,1) of the ROC curve. Conversely, a threshold above every predicted probability classifies no observation as 1 and corresponds to the point (0,0).
With a classifier that randomly assigns the classes
Now, let us consider the case where the classifier randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\). In such a case, the expected proportions are as follows:
Confusion table (values expressed in proportions) for a classifier that randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\)
|  | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = \((1-p)(1-\gamma)\) | False Positive (FP) = \(p(1-\gamma)\) |
| Obs. Positive (1) | False Negative (FN) = \((1-p)\gamma\) | True Positive (TP) = \(p\gamma\) |
In that case, the expected values of the TPR and the FPR are: \[TPR = \frac{TP}{TP+FN} = \frac{p\gamma}{p\gamma + (1-p)\gamma} = p, \qquad FPR = \frac{FP}{FP+TN} = \frac{p(1-\gamma)}{p(1-\gamma) + (1-p)(1-\gamma)} = p.\]
So, on average, with a classifier that randomly assigns a class to the observations, \(TPR=FPR=p\). The ROC curve, for such a classifier will thus be a line from (0,0) to (1,1), i.e., a diagonal. The classifier would have no discriminative power.
On the ROC curve, points above that diagonal will correspond to classifiers that provide results that are better than a random guess while points below will correspond to classifiers that provide results that are worse than a random guess.
The roc_curve() function from {yardstick} allows us to get the TPR/sensitivity and the TNR/specificity (the FPR is equal to 1−TNR). Let us apply it to the training data first.
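A sketch, assuming the predicted probabilities of both models on the training data have been stacked in a (hypothetical) tibble pred_train with columns model, observed (the true class) and pred (the estimated \(\mathbb{P}(Y=1 \mid X)\)):
library(yardstick)
library(ggplot2)

# ROC curve for each model (the event of interest is the second factor level, "1")
roc_train <- pred_train %>%
  group_by(model) %>%
  roc_curve(truth = observed, pred, event_level = "second")
autoplot(roc_train)

# Area under the ROC curve for each model
pred_train %>%
  group_by(model) %>%
  roc_auc(truth = observed, pred, event_level = "second")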
# A tibble: 2 × 4
model .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 First roc_auc binary 0.899
2 Second roc_auc binary 0.900
Optimal threshold
Let us now turn to finding the best probability threshold for a model. We have seen that a perfect classifier is located on the top-left corner of the plot of the ROC curve (FPR=0, TPR=1).
There are many ways to define the optimal threshold value (see Unal (2017)).
For simplicity, let us use the Euclidean distance to find the point of the ROC curve closest to (0,1). The optimal threshold \(\tau^\star\) thus minimises the following distance: \[\tau^\star = \underset{\tau}{\arg\min} \; \sqrt{\left(1-TPR(\tau)\right)^2 + FPR(\tau)^2}.\]
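A sketch of how this threshold can be retrieved from the roc_curve() output (columns .threshold, specificity and sensitivity), here for the first model on the test data; the object names roc_test_1 and pred_prob_test_1 are hypothetical:
# Threshold whose (FPR, TPR) point is closest to (0, 1)
best_threshold <- roc_test_1 %>%
  mutate(distance = sqrt((1 - sensitivity)^2 + (1 - specificity)^2)) %>%
  arrange(distance) %>%
  slice(1) %>%
  pull(.threshold)

# Confusion matrix on the test data, using that threshold
pred_class_best <- factor(
  ifelse(pred_prob_test_1 >= best_threshold, "1", "0"),
  levels = c("0", "1")
)
confusionMatrix(data = pred_class_best, reference = df_test$y, positive = "1")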
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 445 1
1 46 8
Accuracy : 0.906
95% CI : (0.877, 0.9301)
No Information Rate : 0.982
P-Value [Acc > NIR] : 1
Kappa : 0.2302
Mcnemar's Test P-Value : 1.38e-10
Sensitivity : 0.8889
Specificity : 0.9063
Pos Pred Value : 0.1481
Neg Pred Value : 0.9978
Prevalence : 0.0180
Detection Rate : 0.0160
Detection Prevalence : 0.1080
Balanced Accuracy : 0.8976
'Positive' Class : 1
Precision-recall curve
With both the train and test data, looking at the ROC-curve, we might think that our results are pretty good. Let us look at the precision-recall curve, to focus specifically on the positive predictions. Recall that:
Precision = \(\frac{TP}{TP+FP}\) ;
Recall = \(\frac{TP}{TP+FN}\).
These two metrics focus on the positive class, which is the minority class here. With a dataset that contains few observations from class 1, we are interested in assessing the ability of the classifier to predict class 1. Precision and recall are thus helpful to get an idea of the performance of the model on the minority class. The precision-recall curve plots the precision as a function of the recall. As with the ROC curve, the different values are obtained by varying the probability threshold \(\tau\).
With a classification threshold \(\tau=1\)
If the threshold \(\tau=1\) (or, more generally, if \(\tau\) exceeds every predicted probability), no observation is classified as 1 and the confusion matrix will be as follows:
Confusion table (values expressed in proportions) for a classifier with a threshold \(\tau=1\)
|  | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = \(1-\gamma\) | False Positive (FP) = 0 |
| Obs. Positive (1) | False Negative (FN) = \(\gamma\) | True Positive (TP) = 0 |
Hence:
Precision = \(\frac{TP}{TP+FP}\) : undefined.
Recall = \(\frac{TP}{TP+FN} = 0\).
In practice, as the value of \(\tau\) increases, more and more cases are classified as 0. The proportion of false positives decreases and usually becomes negligible compared to the proportion of true positives. The precision thus increases and may reach 1.
With a perfect classifier
Now, let us turn to a classifier that perfectly predicts the two classes for each case.
Confusion table (values expressed in proportions) for a perfect classifier, with a threshold \(0<\tau<1\)

|  | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = \(1-\gamma\) | False Positive (FP) = 0 |
| Obs. Positive (1) | False Negative (FN) = 0 | True Positive (TP) = \(\gamma\) |

Hence Precision = 1 and Recall = 1: the perfect classifier corresponds to the point (1,1) of the precision-recall plot.
With a classifier that randomly assigns the classes
If the classifier randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\), regardless of \(\tau\), the expected proportions are:
Confusion table (values expressed in proportions) for a classifier that randomly assigns class 1 with a probability \(p\) and class 0 with probability \(1-p\)

|  | Pred. Negative (0) | Pred. Positive (1) |
|---|---|---|
| Obs. Negative (0) | True Negative (TN) = \((1-p)(1-\gamma)\) | False Positive (FP) = \(p(1-\gamma)\) |
| Obs. Positive (1) | False Negative (FN) = \((1-p)\gamma\) | True Positive (TP) = \(p\gamma\) |

Hence, for \(0<\tau<1\), Precision \(= \frac{p\gamma}{p\gamma + p(1-\gamma)} = \gamma\) and Recall \(= p\). The point corresponding to such a classifier therefore lies on the dashed green horizontal line at precision \(\gamma\) (the proportion of class 1 individuals in the data), with an x-coordinate (recall) equal to \(p\).
Recall the precision and recall of the first model, computed from its confusion matrices (expressed in proportions) on the training and test samples:
| Metric | Training data | Test data |
|---|---|---|
| Precision (TP / (TP + FP)) | 0.98 | 0.98 |
| Recall/Sensitivity (TP / (TP + FN)) | 0.11 | 0.11 |
And for the second:
| Metric | Training data | Test data |
|---|---|---|
| Precision (TP / (TP + FP)) | 0.98 | 0.98 |
| Recall/Sensitivity (TP / (TP + FN)) | 0.11 | 0.11 |
When faced with an imbalanced dataset on which the model fails to correctly predict the positive class, the ROC curve will not necessarily reveal this if the imbalance is large: it may in fact be close to the ROC curve of a perfectly discriminating model. In contrast, the precision-recall curve will not be as close to that of a perfect classifier. Let us illustrate this.
The pr_curve() function from {yardstick} allows us to get the precision and the recall for different values of \(\tau\).
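A sketch, using the same kind of tibble of predictions as before (hypothetical name pred_test, with columns model, observed and pred, for the test data):
# Precision-recall curve for each model
pr_test <- pred_test %>%
  group_by(model) %>%
  pr_curve(truth = observed, pred, event_level = "second")
autoplot(pr_test)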
Clearly, the performance of the models is much less impressive viewed this way…
F1 Score
With an imbalanced dataset, if we focus on the accuracy, we might select a model that gives very good results on average, but poor results when predicting one of the classes. Some metrics are built to take into account the predictive capabilities for both classes.
This is the case of the F1 score, the harmonic mean of the precision and the sensitivity: \[\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2 \times TP}{2 \times TP + FP + FN}.\] The F1 score ranges from 0 (if either the precision or the sensitivity is equal to 0) to 1 (both precision and sensitivity equal to 1).
With an imbalanced dataset, it may be useful to rely on this metric to assess the quality of fit of a given classifier when performing model selection.
With our two models, we can use the F1-Score to select the optimal threshold:
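A sketch, reusing the pr_curve() output (columns .threshold, recall and precision) to compute the F1 score at every threshold and keep, for each model, the threshold with the highest value:
best_f1 <- pr_test %>%
  mutate(f1 = 2 * precision * recall / (precision + recall)) %>%
  group_by(model) %>%
  slice_max(f1, n = 1)
best_f1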
When faced with an imbalanced dataset, what should we do?
Unfortunately (or maybe, fortunately?), there is no “good” answer to this question. There are multiple ways to address this issue. And as is often the case with machine learning, we have to try different options and we have to make choices.
As of today, there are three ways to deal with imbalanced datasets:
data resampling
algorithm modifications (not in the scope of this notebook)
cost-sensitive learning (not in the scope of this notebook).
In the remainder of this notebook, we will have a look at data resampling methods.
Rebalancing the dataset
We have seen in the first part of this notebook that, with a sample containing many more observations of the negative class than of the positive class, it is difficult to obtain an algorithm with good predictive capabilities for the minority (positive) class. Yet, good performance in predicting the minority class may be the precise purpose of the classification (detection of cancer, fraud, recession, …). In this case, it is possible to rebalance the dataset to decrease the imbalance ratio.
This part of the notebook will present two broad techniques:
under-sampling: the number of observations from the majority class is decreased
over-sampling: the number of observations from the minority class is increased.
Under-sampling
A way to obtain a balanced dataset consists in sampling from the majority class to decrease the number of observations at hand from this class. There are many techniques to do so.
Random under-sampling
One of the simplest techniques is random sampling. All we need to do is decide on the desired imbalance ratio. For example, if we want that ratio to be equal to 1 (one majority case for each minority case), we just need to sample as many examples from the majority class as there are examples in the minority class.
In the training set, the imbalance ratio is:
sum(df_train$y==0) /sum(df_train$y==1)
[1] 47.78049
The number of minority cases is:
(n_minority <-sum(df_train$y==1))
[1] 41
We will thus randomly sample 41 cases from the majority class, and keep all cases from the minority class.
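A sketch of this random under-sampling (the object name and the seed are hypothetical):
set.seed(123)  # hypothetical seed
df_train_under_random_1 <- df_train %>%
  # keep all the minority cases
  filter(y == 1) %>%
  # and add n_minority majority cases drawn at random
  bind_rows(
    df_train %>%
      filter(y == 0) %>%
      slice_sample(n = n_minority)
  )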
Under-sampling to obtain an imbalance ratio of 1 resulted in a final training set with twice the number of minority cases, i.e., 82 observations. This number may be too small, depending on the variability in the data. To have a few more observations on which to train the model, we can accept a slightly imbalanced dataset.
For example, if we accept to have an imbalance ratio of 2 (2 observations of the majority class for each observation of the minority class):
nrow(df_train)
[1] 2000
df_train_under_random_2 <- df_train %>%
  # Shuffle observations
  sample_frac(1L) %>%
  group_by(y) %>%
  # Under-sampling in the majority class
  slice_head(n = 2 * n_minority) %>%
  ungroup()
When observations are randomly drawn from the majority class, the resulting sample may not be representative of the original data. Removing some individuals may result in removing important information from the training dataset.
Therefore, instead of drawing randomly among the observations of the majority class, some techniques guide the selection (by removing redundant individuals, for example), or create synthetic representatives of the majority class.
Undersampling using KNN
Instead of randomly selecting the observations to keep or remove from the majority class, Mani and Zhang (2003) suggest using nearest neighbors. They propose three different versions.
Note
NearMiss-1:
for each example from the majority class, compute the average distance to k closest examples among the minority class
select majority examples for which the average distance computed in the previous step is the smallest
NearMiss-2:
for each example from the majority class, compute the average distance to k farthest examples among the minority class
select majority examples for which the average distance computed in the previous step is the smallest
NearMiss-3:
for each example from the minority class, identify and select a given number of neighbors among the majority class.
for each of the neighbors obtained in the previous step, compute the distance to each example from the minority class, then compute the average distance to their 3 closest neighbors
select the majority examples for which the average distance computed in the previous step is the largest.
As the NearMiss-1 technique seems to provide poor results, some suggest instead selecting the majority examples for which the average distance computed in the previous step is the largest. [ref needed]
Let us illustrate with a few R codes how these variants work. First, let us split the sample according to the value of the target variable: the majority class and the minority class.
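A sketch of this split, consistent with the object names used below (majority_sample and minority_sample):
majority_sample <- df_train %>%
  filter(y == 0) %>%
  select(x_1, x_2) %>%
  as.matrix()
minority_sample <- df_train %>%
  filter(y == 1) %>%
  select(x_1, x_2) %>%
  as.matrix()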
Let us say that we want to use \(k=3\) for the KNN:
n_neighbors <-3
NearMiss-1
As a first step, for each observation from the majority sample, let us compute the distance (using the Euclidean distance here) to each observation from the minority sample:
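Here is a sketch of this computation in base R; the resulting matrix is named dists, consistent with what follows.
# Euclidean distance between each majority observation (rows)
# and each minority observation (columns)
dists <- apply(minority_sample, 1, function(m) {
  sqrt(rowSums(sweep(majority_sample, 2, m)^2))
})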
Each row of the resulting matrix corresponds to an observation from the majority sample; each column gives its distance to an observation from the minority sample:
dim(dists)
[1] 1959 41
For a single example from the majority sample, we identify the closest observations from the minority sample. The distances of the first example to each observation from the minority sample are:
Then, we can order the results by descending values of the computed average distance and select the first observations from that list. Let us assume that we want to keep only as many observations from the majority class as there are in the minority class:
n_to_keep <- n_minority
Then the index of the values we keep from the majority class can be obtained as follows:
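A sketch, following the variant mentioned above (keeping the majority examples with the largest average distance to their closest minority neighbors); the names mean_dist_to_closest and ind_to_keep are hypothetical:
# Average distance of each majority observation to its n_neighbors
# closest minority observations
mean_dist_to_closest <- apply(dists, 1, function(d) mean(sort(d)[1:n_neighbors]))
# Indices of the majority observations to keep (largest average distance first)
ind_to_keep <- order(mean_dist_to_closest, decreasing = TRUE)[1:n_to_keep]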
Then, for each of the majority neighbors identified in the first step of NearMiss-3, we need to compute the distance to the examples from the minority class, to identify its three closest points from the minority class.
For the first neighbor from the previous step, the 3 closest points from the minority class are the following:
head(order(dists_2[1,]), n_closest)
[1] 8 21 1
And the Euclidean distances are:
head(sort(dists_2[1,]), n_closest)
[1] 0.1259316 0.2014562 0.3481431
The average of these 3 distances needs to be computed:
mean(head(sort(dists_2[1,]), n_closest))
[1] 0.225177
So, for the first identified neighbor (big blue dot), we compute the average of the distances to the 3 closest observations from the minority sample (big orange dot):
as_tibble(majority_sample_neighbors[ind_to_keep_3_neighbors, ]) %>%
  mutate(sample = "Majority (keep)") %>%
  bind_rows(
    # Not exactly correct but OK for the graph
    as_tibble(majority_sample) %>% mutate(sample = "Majority (discard)")
  ) %>%
  bind_rows(as_tibble(minority_sample) %>% mutate(sample = "Minority")) %>%
  ggplot(
    data = .,
    mapping = aes(x = x_1, y = x_2, colour = sample, alpha = sample)
  ) +
  geom_point() +
  scale_alpha_manual(
    NULL,
    values = c("Majority (keep)" = 1, "Majority (discard)" = .1, "Minority" = 1)
  ) +
  scale_colour_manual(
    NULL,
    values = c(
      "Majority (keep)" = wongBlue,
      "Majority (discard)" = wongBlue,
      "Minority" = wongOrange
    )
  )
Other techniques
A broader overview of the undersampling techniques that rely on the KNN algorithm can be found in Beckmann et al. (2015).
Over-sampling
Instead of reducing the number of examples from the majority class, it is possible to over-sample the minority class.
SMOTE
A way of over-sampling the data is to create synthetic observations from the minority class. SMOTE (Chawla et al. 2002) (Synthetic Minority Oversampling Technique) is a popular algorithm to do so. Let us first present how this technique works. Then, we can explore some of its limits.
In a nutshell:
we decide a proportion \(\alpha\) of minority class to reach after the oversampling method is applied
for each observation from the minority class, synthetic data are created, based on the characteristics from the nearest neighbors.
There are thus two parameters that need to be set prior to applying the technique:
the proportion of minority class to reach
the number of nearest neighbors to consider.
To create a synthetic individual for one example \(x\) from the minority class, the algorithm works as follows:
Randomly select one of the k-nearest neighbors: \(x^\prime\)
For each of the \(j=1,\ldots,p\) characteristics:
compute the distance between the characteristic \(x_j^\prime\) of the neighbor and that of the individual of interest \(x_j\), i.e., \(x_j^\prime - x_j\)
draw a number \(\gamma_j\) from a uniform distribution \(\mathcal{U}[0,1]\)
multiply \(\gamma_j\) by \(x_j^\prime - x_j\) and add the result to \(x_j\) to obtain the \(j\)-th coordinate of the synthetic individual
Repeat steps 1 and 2 as many times as necessary to create the desired number of synthetic observations for the individual of interest (a short sketch is given below).
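A sketch of steps 1 and 2 for a single (hypothetical) minority example x, a numeric vector of length p, and one of its neighbors x_prime chosen at random:
gamma     <- runif(length(x))            # one uniform draw per characteristic
synthetic <- x + gamma * (x_prime - x)   # coordinates of the synthetic individual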
Tip
An illustration of the creation of synthetic data is provided a bit further.
Let us do it with R, from scratch:
# Number of minority class sample
T <- sum(df_train$y == 1)
# Amount of SMOTE N%
perc_over <- 300
if (perc_over < 100) {
  # randomize the minority class samples as only a random percent of them will be SMOTEd
  T <- (perc_over / 100) * T
  perc_over <- 100
}
perc_over <- perc_over / 100
perc_over
[1] 3
If we want 300% more individuals from the minority class, then, for each example we need to create 3 synthetic observations.
Let us select a number of nearest neighbors:
k <-5
With our spiral data from earlier, we have \(p=2\) characteristics:
num_attrs <-2
Let us identify the k-nearest neighbors thanks to the knn.index() function from {FNN}:
# Compute the k-nearest neighbors for each minority class sample only
minority_sample <- df_train %>%
  filter(y == 1) %>%
  select(x_1, x_2) %>%
  as.matrix()
# Index of the k-nearest neighbors
k_nearest_neighbors <- FNN::knn.index(minority_sample, k = k)
head(k_nearest_neighbors)
# A tibble: 123 × 4
x_1 x_2 y colour
<dbl> <dbl> <fct> <chr>
1 0.304 3.29 1 red
2 0.0171 3.34 1 red
3 0.335 3.32 1 red
4 -2.90 7.21 1 red
5 -2.42 6.28 1 red
6 -2.17 6.80 1 red
7 1.86 3.52 1 red
8 1.87 3.66 1 red
9 1.85 3.50 1 red
10 3.17 -0.408 1 red
# … with 113 more rows
Let us put the training data and the synthetic data in a single tibble:
The green point represents the current individual from the minority class. We will generate synthetic data using its k nearest neighbors. Among them, one (circled in black) has been randomly picked. For each of the \(j=1,2\) characteristics (x_1 and x_2), a random fraction of the difference between the \(j\)-th characteristic of that neighbor and that of the current point has been added to the current point. The resulting synthetic observation is represented by a purple triangle with a black contour.
plots_1 <- plot_indiv(1)
plots_1[[1]]
Recall that we set perc_over to 300. For each observation, we thus generate 3 new points.
plots_1[[2]]
plots_1[[3]]
Then, we can turn to another point from the minority class.
In R, we can use the SMOTE() function from {smotefamily}. It generates synthetic data from the minority class so as to obtain a balanced dataset at the end of the process.
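A minimal sketch of the call (assuming df_train holds the features x_1 and x_2 and the target y; K is the number of nearest neighbors):
gen_data_train <- smotefamily::SMOTE(
  X = df_train %>% select(x_1, x_2),
  target = df_train$y,
  K = 5
)
# The synthetic observations are stored in gen_data_train$syn_data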
type
class observed synthetic
0 0.50411734 0.00000000
1 0.01055069 0.48533196
Varying the number of k-nearest neighbors
Let us vary the number of nearest neighbors, k. We can create a small function that builds a balanced dataset using SMOTE given some number k of nearest neighbors, estimates the random forest, and returns the predicted probabilities on the test sample.
#' SMOTE with `k` nearest neighbors
#' Then train a random forest on the SMOTEd data
#' And make the prediction on the test data
#' @param k number of nearest neighbors
get_pred_k <- function(k) {
  gen_data_train <- smotefamily::SMOTE(
    df_train %>% select(-y),
    target = df_train$y,
    K = k
  )
  df_train_with_synthetic_2 <- df_train %>%
    mutate(type = "observed") %>%
    bind_rows(
      gen_data_train$syn_data %>%
        rename(y = class) %>%
        mutate(type = "synthetic")
    ) %>%
    mutate(y = factor(y, levels = c("0", "1")))
  mod_tmp_synthetic <- randomForest(
    formula = y ~ x_1 + x_2,
    data = df_train_with_synthetic_2,
    ntree = 200,
    mtry = 2,
    nodesize = 5,
    maxnodes = NULL
  )
  pred_prob_with_synth <- predict(
    mod_tmp_synthetic, newdata = df_test, type = "prob"
  )[, "1"]
  tibble(observed = df_test$y, pred = pred_prob_with_synth, k = k)
}
Let us consider the following values for k: 5, 10, 15, and 20. (Here, we only have 41 observations from the minority class, so it might not be a good idea to consider too many neighbors.)
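A sketch of the loop over these values, producing the tibble predictions_with_smote used in the code below:
library(purrr)

predictions_with_smote <- map_dfr(c(5, 10, 15, 20), get_pred_k)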
For comparison, let us estimate a random forest on the unbalanced dataset:
mod_without_synthetic <- randomForest(
  formula = y ~ x_1 + x_2,
  data = df_train,
  ntree = 200,
  mtry = 2,
  nodesize = 5,
  maxnodes = NULL
)
pred_prob_without_synth <- predict(
  mod_without_synthetic, newdata = df_test, type = "prob"
)[, "1"]
For each of the different models estimated (depending on the number of neighbors, and on the imbalanced dataset), let us compute the precision and sensitivity for varying values of the probability threshold of the classifier.
precision_recall <- tibble(
  observed = df_test$y, pred = pred_prob_without_synth, k = -1
) %>%
  bind_rows(predictions_with_smote) %>%
  group_by(k) %>%
  pr_curve(truth = observed, pred, event_level = "second")
[[1]]
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 490 5
1 1 4
Accuracy : 0.988
95% CI : (0.9741, 0.9956)
No Information Rate : 0.982
P-Value [Acc > NIR] : 0.2043
Kappa : 0.5658
Mcnemar's Test P-Value : 0.2207
Sensitivity : 0.4444
Specificity : 0.9980
Pos Pred Value : 0.8000
Neg Pred Value : 0.9899
Prevalence : 0.0180
Detection Rate : 0.0080
Detection Prevalence : 0.0100
Balanced Accuracy : 0.7212
'Positive' Class : 1
We seem to be able to better predict the minority class. However, keep in mind that this is a single example. We would need to replicate the whole process a great number of times to get an idea of whether, on average, with the data at hand, SMOTE is actually helpful.
Note
Notes:
If you want to go further, you could also look at what happens when you apply SMOTE to the test set as well. In practice, you should not do so, because you would then make predictions for individuals that live in regions of the feature space where observations from the minority class might actually never be found.
When identifying the k-nearest neighbors, the results can be greatly affected if the scales of the different features differ widely. It may be best to normalize the data prior to applying the SMOTE technique.
Further comments and readings
Distances with categorical variables
If your dataset includes categorical variables, the distance between two observations needs to be computed differently. While it would be possible to transform a categorical variable into a set of binary variables (one-hot encoding or dummy encoding), this may not be the best idea.
Consider the following situation. Assume we have one feature with numerical values and a categorical variable that takes three values: A, B, and C. Let us further assume that we have created two dummy variables: one for class A, and another for class B (class C thus being the reference class). Consider the two examples:
\(x\), for which the numerical feature is .1 and which belongs to class A: \(x = \begin{bmatrix}.1 & 1 & 0\end{bmatrix}\)
\(x^\prime\), for which the numerical feature is .2 and which belongs to class B: \(x^\prime = \begin{bmatrix}.2 & 0 & 1\end{bmatrix}\)
The Euclidean distance between \(x\) and \(x^\prime\) then writes: \[\sqrt{\left(.1-.2\right)^2 + \left(1-0\right)^2 + \left(0-1\right)^2}.\]
The contribution of the categorical variable to the distance is thus \(\left(1-0\right)^2 + \left(0-1\right)^2 = 2\). If the numerical variable has been scaled to get values between 0 and 1, the maximum contribution for that variable is 1. Here, with only 3 classes for the categorical variable, the contribution of that variable to the distance is twice the maximum contribution of the numerical variable. If the number of classes for the categorical variable is higher, the contribution of that variable to the distance will be even higher if we create dummy variables to encode the categorical variable.
To overcome this issue, it is possible to use a different distance measure. Gower’s distance (Gower 1971) can be a good solution when the dataset contains categorical variables. Here is how it works, to compute the distance between two observations \(x\) and \(x^\prime\):
for each feature \(j=1,\ldots,p\), compute a score \(s_j\in [0,1]\)
the closer \(x\) and \(x^\prime\), the closer \(s_j\) to 1
the farther \(x\) and \(x^\prime\), the closer \(s_j\) to 0
and set a weight \(\delta_j\):
if \(x\) and \(x^\prime\) can be compared on their \(j\)-th feature, set a weight \(\delta_j=1\)
otherwise, in case of missing value, set \(\delta_j=0\)
Gower's similarity between \(x\) and \(x^\prime\) is then computed as follows: \[S = \frac{\sum_{j=1}^p s_j \times \delta_j}{\sum_{j=1}^p \delta_j},\] and a dissimilarity (distance-like measure) can be obtained as \(1-S\).
The score \(s_j\) depends on the type of the \(j\)-th feature:
for a quantitative variable: \[s_j = 1 - \frac{\mid x_j - x^\prime_j \mid}{\max (\boldsymbol{\mathbb{x}}_j) - \min(\boldsymbol{\mathbb{x}}_j)},\] where \(\max (\boldsymbol{\mathbb{x}}_j)\) and \(\min(\boldsymbol{\mathbb{x}}_j)\) are the maximum and the minimum values of feature \(j\) in the training sample.
for a qualitative variable: \[s_j = \mathds{1}_{\{x_j = x_j^\prime\}}.\]
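As an illustration (a sketch on a small hypothetical data frame), the daisy() function from {cluster} computes Gower dissimilarities, i.e., \(1-S\), on mixed-type data:
library(cluster)

toy <- data.frame(
  x_num = c(0.1, 0.2, 0.8),                # a numeric feature
  x_cat = factor(c("A", "B", "C"))         # a categorical feature
)
# Pairwise Gower dissimilarities (1 - S)
daisy(toy, metric = "gower")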
Note
For over-sampling, the SMOTE-NC algorithm allows us to have categorical variables within the data. When creating synthetic data, nothing changes compared to SMOTE for numerical variables. For categorical variables, the value of the synthetic observation is set to the most common among the nearest neighbors.
SMOTE with discrete numerical variables
If your dataset contains discrete numerical variables, using SMOTE will create synthetic individuals in areas where no observation can ever be found in the original data.
To overcome this issue, one could think of simply casting the variable to a categorical variable. But doing so would lose the notion of order among the values of the variable. A better solution consists in rounding the values of that variable right after applying SMOTE.
For example, if the \(j\)-th feature is a discrete numerical variable that takes the following values: {1,2,3}, if the synthetic observation has a value of, let us say, 1.2, then we can simply round it to 1.
A nice survey on resampling techniques
A nice survey of the methods available as of 2016 is available in More (2016).
References
Beckmann, Marcelo, Nelson FF Ebecken, Beatriz SL Pires de Lima, et al. 2015. "A KNN Undersampling Approach for Data Balancing." Journal of Intelligent Learning Systems and Applications 7 (04): 104.
Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. "SMOTE: Synthetic Minority Over-Sampling Technique." Journal of Artificial Intelligence Research 16: 321–57.
Gower, John C. 1971. "A General Coefficient of Similarity and Some of Its Properties." Biometrics, 857–71.
Mani, Inderjeet, and I Zhang. 2003. "kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction." In Proceedings of Workshop on Learning from Imbalanced Datasets, 126:1–7. ICML.
More, Ajinkya. 2016. "Survey of Resampling Techniques for Improving Classification Performance in Unbalanced Datasets." arXiv Preprint arXiv:1608.06048.
Unal, Ilker. 2017. "Defining an Optimal Cut-Point Value in ROC Analysis: An Alternative Approach." Computational and Mathematical Methods in Medicine 2017.