Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)

Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)

Ewen Gallic, Aix-Marseille Université (🐦 @3wen)

This hands-on session focuses on pictures. The objective is to manipulate images to perform classification tasks.

1 Parkinson’s Disease

The first application consists in detecting Parkinson’s disease from hand-drawn images of spirals and waves. This application is inspired by Adrian Rosebrock’s tutorial.

The idea is to compare spiral or wave images drawn by patients with Parkinson’s disease or by patients without it. Patients with Parkinson’s disease, as shown in Zham et al. (2017), draw less quickly and with less pressure than other test subjects. In the tutorial, Adrian Rosebrock mentions that João Paulo Folador, a Brazilian Ph.D. student explored the following idea: maybe it is possible to detect Parkinson’s disease from the sole images, without measuring speed and pressure of the pen on paper.

In this first part of the hands-on session, we will train a classifier to automatically detect whether a patient is suffering from Parkinson’s disease.

To that end, we have a set of image data organized as follows:

Spiral data:
- training set: 36 healthy and 36 Parkinson (n=72)
- testing set: 15 healthy and 15 Parkinson (n=30)
Wave data:
- training set: 36 healthy and 36 Parkinson (n=72)
- testing set: 15 healthy and 15 Parkinson (n=30).

We will adopt the following strategy:

load the images into R
convert them in greyscale
crop them
create a matrix where each column represents an example (i.e., an individual) and each line the value of a pixel
perform a principal component analysis
use a subset of the principal components to train a classifier
test the performance of the model on a test sample

1.1 Data Preparation

1.1.1 Loading images

The hand-drawings of the subjects are provided as PNG files, according to the following tree structure (we will focus on the spiral test only):

pictures
├── donnees
│   └── dataset-parkinson
│       └── spiral
│           └── testing
│           │   └── healthy
│           │   │   └── V01HE01.png
│           │   │   └── ...
│           │   │   └── V55HE15.png
│           │   └── parkinson
│           │   │   └── V01PE01.png
│           │   │   └── ...
│           │   │   └── V15PE015.png
│           └── training
│           │   └── healthy
│           │   │   └── V01HE02.png
│           │   │   └── ...
│           │   │   └── V55HE11.png
│           │   └── parkinson
│           │   │   └── V01PE02.png
│           │   │   └── ...
│           │   │   └── V15PE03.png

The file names can be used to determine whether or not the individual has Parkinson’s disease (P for Parkinson’s or H for Healthy). The data are already split into a training and a testing sample.

To load images into R, the {OpenImageR} package can be used. To easily manipulate data, we will also need {tidyverse}.

library(OpenImageR)
library(tidyverse)

To load an image into R, the imager::load.image() function just needs the path to the file. It returns an object of class 'cimg'.

Let us load an image from a healthy subject:

path <- "../donnees/dataset-parkinson/spiral/training/healthy/V01HE02.png"
image <- imager::load.image(path)
class(image)

## [1] "cimg"         "imager_array" "numeric"

image

## Image. Width: 256 pix Height: 256 pix Depth: 1 Colour channels: 3

A quick peek at the structure of the image shows that the object is made up of a four-dimensional array:

width
height
depth
colour (RGB components)

str(image)

##  'cimg' num [1:256, 1:256, 1, 1:3] 0.953 0.949 0.953 0.941 0.941 ...

The drawing of the healthy subject can be plotted using the plot() function:

plot(image, main = "Healthy subject")

Let us compare with a drawing from a subject with Parkinson’s disease.

path <- "../donnees/dataset-parkinson/spiral/training/parkinson/V03PE05.png"
image <- imager::load.image(path)
plot(image, main = "Subject with Parkinson's disease")

We will then perform a principal component analysis (PCA) on the pixel values of each image. The main components will then be used as explanatory variables (features) to train a binary classifier to detect whether the drawings come from a healthy person or a person with Parkison’s disease. But we are facing a problem at this stage. PCA is an orthogonal linear transformation, and the data consists of triplet for each pixel, indicating the red, green and blue components (RGB). A simple way to get around this problem is to convert the image to grey scale, i.e., by calculating the average of the three RGB components. This is what the imager::grayscale() function does.

image_grey <- imager::grayscale(image)

Some PNG files may include a transparency component in the fourth dimension of the array (colours). In such cases, the dimension of the vector of colours is 4 instead of 3 (Red, Green, Blue, Transparency), which is an invalid dimension for the imager::grayscale() function. Hence, in such cases, we can drop the transparency information using the imager::channel() function prior to to the greyscale transformation:

if(dim(image)[4] > 3) image <- imager::channel(image, 1:3)
dim(image_grey)

## [1] 256 256   1   1

The images in this dataset are not all exactly the same size. Most of them are squares of 256 x 256 pixels. For the few exceptions, the difference is minimal.

The imager::resize() function allows us to resize the images. The desired size for the output image is set using the two arguments size_x and size_y. The argument interpolation_type controls the type of interpolation that is used (a nearest-neighbor interpolation is used by default).

Let us look at an example, with two parrots. The size of the image is 768 x 512 pixels.

image_test <- imager::load.example("parrots")
image_test

## Image. Width: 768 pix Height: 512 pix Depth: 1 Colour channels: 3

plot(image_test)

Let us create a larger image with two different methods of interpolation (no interpolation and nearest-neighbor interpolation, respectively).

larger_image_test <- 
  imager::resize(im = image_test,
                 size_x = 1200, size_y = 512, interpolation_type = 1)
larger_image_test_no <- 
  imager::resize(im = image_test,
                 size_x = 1200, size_y = 512, interpolation_type = 0)


op <- par(mfrow = c(1,2))
plot(larger_image_test, main = "Nearest-Neighbor interpolation")
plot(larger_image_test_no, main = "Without interpolation")

Back to our drawings examples, we want to ensure that the images are 256x256 pixels in size.

image_resized <- imager::resize(image_grey, size_x = 256, size_y = 256)

As all these steps will be repeated for each image, a function can be defined to perform them. Let us name this function load_image_gs().

#' load_image_gs
#' For a given image, reads it, then return the grey-scale values
#' for each pixel
#' @param path path to the image
load_image_gs <- function(path, size_x, size_y){
  image <- imager::load.image(path)
  if(dim(image)[4] > 3) image <- imager::channel(image, 1:3)
  
  image_grey <- imager::grayscale(image)
  
  # Resize the image
  image_resized <- imager::resize(image_grey, size_x = size_x, size_y = size_y)
  
  image_resized
}

path <- "../donnees/dataset-parkinson/spiral/training/parkinson/V03PE05.png"
test <- load_image_gs(path, size_x = 256, size_y = 256)
plot(test)

The healthy training examples are located in the following path: "../donnees/dataset-parkinson/spiral/training/healthy/". The list of the files in that directory can be obtained with the function list.files():

folder_path <- "../donnees/dataset-parkinson/spiral/training/healthy/"
files <- list.files(folder_path, full.names = TRUE, pattern = "\\.png")
head(files)

## [1] "../donnees/dataset-parkinson/spiral/training/healthy//V01HE02.png"
## [2] "../donnees/dataset-parkinson/spiral/training/healthy//V01HE03.png"
## [3] "../donnees/dataset-parkinson/spiral/training/healthy//V02HE02.png"
## [4] "../donnees/dataset-parkinson/spiral/training/healthy//V02HE03.png"
## [5] "../donnees/dataset-parkinson/spiral/training/healthy//V03HE2.png" 
## [6] "../donnees/dataset-parkinson/spiral/training/healthy//V03HE3.png"

The function load_image_gs() can then be applied to each of the files:

images <- lapply(files, load_image_gs, size_x = 256, size_y = 256)

Each element of the resulting list named images is a cimg object. The greyscale value of each pixel of these objects can be extracted easily and stored as a vector. For example, for the first image, the first six pixels of the image are:

as.vector(images[[1]]) %>% head()

## [1] 0.9529412 0.9490196 0.9529412 0.9411765 0.9411765 0.9450980

We can then loop over all images to extract the greyscale values, store them in a list of vectors (each element of the list corresponding to the vector of pixels of the image). Then, these same sized vectors can be concatenated to obtain a matrix where each column corresponds to an image. Applying the transpose function (t()) allows us to obtain a matrix where each row represents an image whose pixel values are prodided in columns:

values <- 
    lapply(images, function(x) as.vector(x)) %>% 
    do.call("cbind", .) %>% 
    t()
dim(values)

## [1]    36 65536

As the different training and testing examples are contained in several directories, a function can help to have a cleaner code.

#' load_images_sample
#' Computes the histogram of oriented gradient for the images
#' of the spiral or waves, for either healthy or patients with Parkinson's disease
#' in either the training or the testing set
#' The result is a matrix whose lines correspond to pictures and columns correspond to extracted values
#' @param drawing_type type of the drawing (`spiral` or `wave`)
#' @param sample_type sample type (`training` or `testing`)
#' @param patient_condition patient condition (`healthy` or `parkinson`)
#' drawing_type <- "wave" ; sample_type <- "training" ; patient_condition <- "healthy"
load_images_sample <- function(drawing_type, sample_type, patient_condition){
  folder_path <- str_c("../donnees/dataset-parkinson/",
                       drawing_type, "/", sample_type, "/", patient_condition)
  files <- list.files(folder_path, full.names = TRUE, pattern = "\\.png")
  
  size_x <- ifelse(drawing_type == "wave", yes = 512, no = 256)
  images <- lapply(files, load_image_gs, size_x = size_x, size_y = 256)
  
  values <- 
    lapply(images, function(x) as.vector(x)) %>% 
    do.call("cbind", .) %>% 
    t()
  
  values
}

The training examples for the spiral drawings can be loaded using that function.

spiral_training_sample <- 
  load_images_sample(drawing_type = "spiral", sample_type = "training",
                     patient_condition = "healthy") %>% 
  rbind(
    load_images_sample(drawing_type = "spiral", sample_type = "training",
                       patient_condition = "parkinson")
  )
dim(spiral_training_sample)

## [1]    72 65536

The true corresponding values can be stored in a vector:

spiral_training_sample_y <- 
  c(rep("healthy", 36), rep("parkinson", 36))

And the same procedure can be applied for the testing sample:

# Test sample for the spiral images
spiral_testing_sample <- 
  load_images_sample(drawing_type = "spiral", sample_type = "testing",
                     patient_condition = "healthy") %>% 
  rbind(
    load_images_sample(drawing_type = "spiral", sample_type = "testing",
                       patient_condition = "parkinson")
  )
dim(spiral_testing_sample)

## [1]    30 65536

spiral_testing_sample_y <- c(rep("healthy", 15), rep("parkinson", 15))

1.1.2 PCA

As previously mentioned, we can perform a principal component analysis on the pixel values to reduce extract the first components that can then be used to train the classifier. To do so, we give the matrix containing our training observations to the stats::prcomp() function.

pca_spiral <- prcomp(spiral_training_sample)

The eigenvalues can be obtained as follows:

pca_spiral$sdev^2

##  [1] 2.635821e+01 1.992733e+01 1.789631e+01 1.721614e+01 1.600063e+01
##  [6] 1.500296e+01 1.337245e+01 1.303905e+01 1.278017e+01 1.181507e+01
## [11] 1.135295e+01 1.037793e+01 9.895068e+00 9.742401e+00 9.528333e+00
## [16] 9.182100e+00 9.036485e+00 8.629998e+00 8.228589e+00 8.185230e+00
## [21] 8.083826e+00 7.930054e+00 7.839348e+00 7.660239e+00 7.468011e+00
## [26] 7.197078e+00 7.147125e+00 6.918793e+00 6.814197e+00 6.651275e+00
## [31] 6.407575e+00 6.272961e+00 6.118306e+00 6.048554e+00 5.883941e+00
## [36] 5.800532e+00 5.649302e+00 5.557457e+00 5.538205e+00 5.388674e+00
## [41] 5.149271e+00 5.040583e+00 4.948619e+00 4.846885e+00 4.824131e+00
## [46] 4.739241e+00 4.599152e+00 4.479539e+00 4.394042e+00 4.336441e+00
## [51] 4.265755e+00 4.209796e+00 4.012567e+00 3.871452e+00 3.741862e+00
## [56] 3.640015e+00 3.539892e+00 3.488006e+00 3.463213e+00 3.311305e+00
## [61] 3.205570e+00 3.151047e+00 3.065447e+00 2.890960e+00 2.804314e+00
## [66] 2.770862e+00 2.470196e+00 2.373654e+00 2.276878e+00 2.190752e+00
## [71] 1.850264e+00 8.181983e-29

The eigenvectors can be obtained as:

# pca_spiral$rotation

The eigenvectors can also be obtained using the predict() function:

library(factoextra)
coordinates_pca_spiral <- predict(pca_spiral)

The function predict() can be applied to the testing sample to predict the partial components for the testing set:

coordinates_pca_spiral_test <- 
  predict(pca_spiral, newdata = spiral_testing_sample)

Let us create a tibble with the true value y for each example, as well as the first two components:

pca_spiral_results_df <- 
  tibble(
    factor_1 = coordinates_pca_spiral[,1],
    factor_2 = coordinates_pca_spiral[,2],
    y = as.factor(spiral_training_sample_y),
    type = "spiral")
pca_spiral_results_df

## # A tibble: 72 x 4
##    factor_1 factor_2 y       type  
##       <dbl>    <dbl> <fct>   <chr> 
##  1   -4.31   13.4    healthy spiral
##  2   -5.76   -3.27   healthy spiral
##  3    2.98    1.45   healthy spiral
##  4    2.34    0.0235 healthy spiral
##  5   -0.283   1.93   healthy spiral
##  6   -1.09   -0.181  healthy spiral
##  7   -1.52   -2.25   healthy spiral
##  8    0.934  -1.30   healthy spiral
##  9    2.69   -0.404  healthy spiral
## 10    1.11   -0.261  healthy spiral
## # … with 62 more rows

A scatter plot with these first two components can be created.

p_1 <- 
  ggplot(data = pca_spiral_results_df, aes(x = factor_1, y = factor_2, colour = y)) +
  geom_point() +
  scale_colour_discrete("True value") +
  labs(x = "Factor 1", y = "Factor 2") +
  stat_ellipse(level = .95, type = "t", segments = 250)

p_1

Looking at this graph suggests that using only these first two components will not result in high performance of the classifier.

1.2 Model Training

In this section we will train a classifier to recognize from the drawings of the subjects whether or not they have Parkinson’s disease. To start slowly, we will put in place an example in which we will only retain the first three main components from the PCA that was conducted previously. The strategy we will use is as follows:

create the data table that will be used to train the classifier
train the classifier (we will use a multinomial logit model, which corresponds to a single-layer hidden neural network with a logit activation function)
measure the performance of the model.

So, we first create the tibble that contains the condition to predict (y) and the features (the PCs), both for the training and test samples:

df_spiral <- 
  data.frame(
    y = as.factor(spiral_training_sample_y),
    x = coordinates_pca_spiral[,1:3]) %>% 
  tbl_df()
df_spiral

## # A tibble: 72 x 4
##    y        x.PC1   x.PC2  x.PC3
##    <fct>    <dbl>   <dbl>  <dbl>
##  1 healthy -4.31  13.4    -8.77 
##  2 healthy -5.76  -3.27   -7.60 
##  3 healthy  2.98   1.45    5.48 
##  4 healthy  2.34   0.0235  0.301
##  5 healthy -0.283  1.93   -0.324
##  6 healthy -1.09  -0.181  -2.31 
##  7 healthy -1.52  -2.25   -0.742
##  8 healthy  0.934 -1.30    0.615
##  9 healthy  2.69  -0.404   1.19 
## 10 healthy  1.11  -0.261   0.841
## # … with 62 more rows

df_test_spiral <- 
  data.frame(
    y = as.factor(spiral_testing_sample_y),
    x = coordinates_pca_spiral_test[,1:3]) %>% 
  tbl_df()
df_test_spiral

## # A tibble: 30 x 4
##    y        x.PC1   x.PC2   x.PC3
##    <fct>    <dbl>   <dbl>   <dbl>
##  1 healthy -0.204 -0.0835 -0.669 
##  2 healthy  3.91   0.180   2.38  
##  3 healthy  2.37  -0.139   0.0146
##  4 healthy -1.21  -0.481   0.252 
##  5 healthy -1.14   0.781  -2.59  
##  6 healthy  0.561 -0.493   1.12  
##  7 healthy  1.64   0.218   1.61  
##  8 healthy  2.42   0.0311  0.903 
##  9 healthy  1.52   0.137   1.48  
## 10 healthy  0.482  0.212  -0.208 
## # … with 20 more rows

Then, we can fit the model, using the nnet::multinom() function from the {nnet} package. A different model could be used at this stage.

library(nnet)
reg <- multinom(y ~ ., data = df_spiral, trace = FALSE)

The predicted values can be computed using the predict() function. The parameter type can be used to state whether we want probabilities of belonging to the reference class to be returned (type = "probs") or if we want the class associated with the highest probability to be returned (type = "class")

df_spiral$pred <- predict(reg, type = "probs")
df_spiral$pred_class <- predict(reg, type = "class")
df_spiral

## # A tibble: 72 x 6
##    y        x.PC1   x.PC2  x.PC3  pred pred_class
##    <fct>    <dbl>   <dbl>  <dbl> <dbl> <fct>     
##  1 healthy -4.31  13.4    -8.77  0.201 healthy   
##  2 healthy -5.76  -3.27   -7.60  0.550 parkinson 
##  3 healthy  2.98   1.45    5.48  0.510 parkinson 
##  4 healthy  2.34   0.0235  0.301 0.471 healthy   
##  5 healthy -0.283  1.93   -0.324 0.463 healthy   
##  6 healthy -1.09  -0.181  -2.31  0.490 healthy   
##  7 healthy -1.52  -2.25   -0.742 0.565 parkinson 
##  8 healthy  0.934 -1.30    0.615 0.527 parkinson 
##  9 healthy  2.69  -0.404   1.19  0.488 healthy   
## 10 healthy  1.11  -0.261   0.841 0.505 parkinson 
## # … with 62 more rows

With these predictions, we can create a contingency matrix, using the table() function:

confusion_matrix <- table(df_spiral$y, df_spiral$pred_class)
confusion_matrix

##            
##             healthy parkinson
##   healthy        18        18
##   parkinson      16        20

The rows give the true conditions (healthy or parkinson) while the columns give the predicted conditions.

The prediction for the testing sample can be obtained following the same steps:

df_test_spiral$pred <- predict(reg, newdata = df_test_spiral, type="probs")
df_test_spiral$pred_class <- predict(reg, newdata = df_test_spiral, type="class")

And the corresponding confusion matrix can also be computed:

confusion_matrix_test <- table(df_test_spiral$y, df_test_spiral$pred_class)
confusion_matrix_test

##            
##             healthy parkinson
##   healthy        12         3
##   parkinson       4        11

Multiple measures can be computed from the confusion matrix. The wikipedia page offers a really good summary on that topic.

Let us create a function that takes a confusion matrix as its first argument and the index of the column (and row) to consider as positive (in this example, we will consider that parkinson should be considered as a positive condition). This function will return a tibble providing the values of four measures:

accuracy: percentage of correctly classified examples
missclass_rate: missclassification rate, i.e., percentage of missclassified examples
sensitivity: (also called true positive rate) proportion of actual positives that are correctly identified as such
specificity: (also called true negative rate) proportion of actual negatives that are correctly identified as such.

#' conf_mat_metrics
#' Extract metrics from a confusion matrix
#' @param conf_mat confusion matrix (2x2 matrix)
#' @param ind_positive index of the column and row to consider as "positive"
conf_mat_metrics <- function(conf_mat, ind_positive = 2){
  ind_negative <- ifelse(ind_positive == 1, yes = 2, no = 1)
  TP <- conf_mat[ind_positive, ind_positive]
  TN <- conf_mat[ind_negative, ind_negative]
  FP <- conf_mat[ind_negative, ind_positive]
  FN <- conf_mat[ind_positive, ind_negative]
  Pos <- sum(conf_mat[ind_positive, ])
  Neg <- sum(conf_mat[ind_negative, ])
  # Accuracy
  acc <- (TP+TN)/(Pos+Neg)
  # Missclassification rate
  missclass_rate <- 1-acc
  # Sensitivity (True positive rate)
  # proportion of actual positives that are correctly identified as such
  TPR <- TP/Pos
  # Specificity (True negative rate)
  # proportion of actual negatives that are correctly identified as such
  TNR <- TN/Neg
  tibble(accuracy = acc,
         missclass_rate = missclass_rate,
         sensitivity = TPR,
         specificity = TNR)
}

This function can be applied to the confusion matrix computed with the predictions using the training and testing sample, respectively:

conf_mat_metrics(conf_mat = confusion_matrix, ind_positive = 2)

## # A tibble: 1 x 4
##   accuracy missclass_rate sensitivity specificity
##      <dbl>          <dbl>       <dbl>       <dbl>
## 1    0.528          0.472       0.556         0.5

conf_mat_metrics(conf_mat = confusion_matrix_test, ind_positive = 2)

## # A tibble: 1 x 4
##   accuracy missclass_rate sensitivity specificity
##      <dbl>          <dbl>       <dbl>       <dbl>
## 1    0.767          0.233       0.733         0.8

As can be seen in the result from the previous code snippet, the classifier only predict \(52\%\) of observation correctly in the training set and \(76\%\) in the testing set.

Now, instead of setting manually the number of principal components to use, we can create a function where it varies.

#' classif_performance_pcs
#' Train a log-linear model on the patient condition
#' using a subet of the first principal components only
#' @param nb_pcs number of principal components to keep
classif_performance_pcs <- function(nb_pcs){
  # Data frame with the true value and the first k components of the PCA
  df <- 
    data.frame(
      y = as.factor(spiral_training_sample_y),
      x = coordinates_pca_spiral[,1:nb_pcs]) %>% 
    tbl_df()
  df_test <- 
    data.frame(
      y = as.factor(spiral_testing_sample_y),
      x = coordinates_pca_spiral_test[,1:nb_pcs]) %>% 
    tbl_df()
  
  # Regression of the values of the images on the first k components of the PCA
  reg <- multinom(y ~ ., data = df, trace = FALSE)
  
  # Training sample
  # Adding the predicted value  and the predicted class to the data frame
  df$pred <- predict(reg, type="probs")
  df$pred_class <- predict(reg, type="class")
  # And then compute the contingency matrix
  confusion_matrix <- table(df$y, df$pred_class)
  
  # Testing sample
  df_test$pred <-  predict(reg, newdata = df_test, type="probs")
  df_test$pred_class <-  predict(reg, newdata = df_test, type="class")
  # Contingency table
  confusion_matrix_test <- table(df_test$y, df_test$pred_class)
  
  ind_positive <- which(colnames(confusion_matrix) == "parkinson")

  conf_mat_metrics(conf_mat = confusion_matrix, ind_positive = ind_positive) %>% 
    mutate(sample_type = "training") %>% 
    bind_rows(
      conf_mat_metrics(conf_mat = confusion_matrix_test, ind_positive = ind_positive) %>% 
        mutate(sample_type = "testing")
    ) %>% 
    mutate(nb_pcs = !!nb_pcs)
  
}# End of classif_performance_pcs()

For example, to get the performance metrics when we use the first three PCs:

classif_performance_pcs(3)

## # A tibble: 2 x 6
##   accuracy missclass_rate sensitivity specificity sample_type nb_pcs
##      <dbl>          <dbl>       <dbl>       <dbl> <chr>        <dbl>
## 1    0.528          0.472       0.556         0.5 training         3
## 2    0.767          0.233       0.733         0.8 testing          3

We can loop over a large number of PCs to use, and save the performances in a list of tibble. The tibbles can in turn be binded in a single one:

performances <- pbapply::pblapply(1:50, classif_performance_pcs) %>%
  bind_rows()
performances

## # A tibble: 100 x 6
##    accuracy missclass_rate sensitivity specificity sample_type nb_pcs
##       <dbl>          <dbl>       <dbl>       <dbl> <chr>        <int>
##  1    0.486          0.514       0.333       0.639 training         1
##  2    0.567          0.433       0.533       0.6   testing          1
##  3    0.514          0.486       0.417       0.611 training         2
##  4    0.633          0.367       0.6         0.667 testing          2
##  5    0.528          0.472       0.556       0.5   training         3
##  6    0.767          0.233       0.733       0.8   testing          3
##  7    0.569          0.431       0.528       0.611 training         4
##  8    0.767          0.233       0.733       0.8   testing          4
##  9    0.583          0.417       0.528       0.639 training         5
## 10    0.567          0.433       0.733       0.4   testing          5
## # … with 90 more rows

The missclassification rate in the training and the testing sample can be plotted according to the number of PCs used:

p_2 <- 
  performances %>% 
  ggplot(data = ., aes(x = nb_pcs, y = missclass_rate, colour = sample_type)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Factors", y = "Misclassification Rate") +
  scale_color_manual("Sample", values = c("training" = "orange", "testing" = "blue"))


cowplot::plot_grid(p_1, p_2)

The different values of each metrics computed from the confusion table from the training and testing samples can be plotted using a facet plot:

performances %>% 
  gather(metrics, value, -sample_type, -nb_pcs) %>% 
  ggplot(data = ., aes(x = nb_pcs, y = value, colour = sample_type)) +
  geom_line() +
  geom_point() +
  facet_wrap(~metrics) +
  labs(x = "Number of Factors", y = NULL) +
  scale_color_manual("Sample", values = c("training" = "orange", "testing" = "blue"))

What value should be used for the number of PCs to be kept in the model? Cross-validation could provide an answer to that question.

2 Satellite Data

The purpose of the second application is to manipulate GIS data, stored in a raster format. From satellite images, the objective is to detect and qualify burned areas. This application is strongly inspired by Leah Wasser’s excellent course (Wasser 2018).

The first part of this application provides explanations on the data source. The second provides an illustrative case, that of a major fire in California in 2018, to show what can be done with R to process satellite images.

2.1 Landsat data

2.1.1 What are Landsat data?

Since 1972, satellite images have been acquired within a program currently known as Landsat. In 2013, the 8th satellite of this program, Landsat 8, was launched. This satellite has a 16-day repeat cycle. It carries two multispectral sensors which capture the response of the Earth’s surface at discrete wavelengths:

the Operational Land Imager (OLI): operates in the visible and short wave infrared. This sensor uses 9 channels, ranging from wavelengths of 443 nm to 2,200 nm (bands 1 to 9 in the Table below)
the Thermal Infrared Sensor (TIRS): measures land surface temperature in two thermal bands, ranging from wavelengths of 10.30 µm to 11.30 µm (bands 10 to 11 in the Table below).

Band	Wavelength (nm)	Spatial Resolution (m)
Band 1 - Coastal/aerosol	430 - 453	30
Band 2 - Blue	450 - 515	30
Band 3 - Green	525 - 600	30
Band 4 - Red	630 - 680	30
Band 5 - Near Infrared (NIR)	845 - 885	30
Band 6 - Short Wavelength Infrared (SWIR)	1560 - 1660	30
Band 7 - Short Wavelength Infrared (SWIR)	2110 - 2300	30
Band 8 - Panchromatic	500 - 680	15
Band 9 - Cirrus	1360 - 1390	30
Band 10 - Long Wavelength Infrared	10,300 - 11,300	100
Band 11 - Long Wavelength Infrared	11,500 - 12,500	100

Combining bands 4, 3 and 2, i.e., red, green and blue wavelengths and then apply red, green and blue filters yields a natural looking RGB image (more details on combining Landat 8 bands can be found in Joe Peters’ post).

The data is cut into scenes of about 170 x 185 km, available in a Geographic Tagged Image File Format (GeoTIFF) format, a computer file format for storing raster graphics images and embedding georeferencing information. Each band has a corresponding raster. The value associated with each pixel of a raster corresponds to the measurement recorded by the sensor. The resolution of the raster depends, as shown in the previous table, on the band. For example, a pixel for the first band represents an area of 30m by 30m.

2.1.2 How to obtain the files?

An easy way to obtain Landsat scenes is through Earth Explorer. After registering (for free), the data can be ordered. An e-mail will be sent when it is available for download. Please note that this may take a while (a few hours).

Let us suppose that we wish to retrieve scenes near Mendocino (see next section), in California. We thus look for Mendocino in the Address/Place tab.

Then, after clicking on the link in the table that displays the results of the query, a tick mark appears on the map. Let us zoom in the corresponding area.

The red mark can be moved to define another point of interest. New maks can be added by double-clicking on the map, or by entering the coordinates manually, to define a polygon.

Then, we need to specify the date range for which we would like to recover landsat scenes. Let us use 2018-06-01 to 2018-08-31. Then, we can click on the button Data Sets to switch to a tab where we can specify which data we want to recover.

We want Landsat 8 data:

Then, we can move to the next tab, i.e., Additional Criteria. We can filter the results that will be looked for by land cloud cover, scene cloud cover, data type, and so on. Once we are ready, we can click on the button Results to display the available scenes.

By clicking on the image icon of a displayed data set, a preview of the scene is added to the map.

Then, clicking on the trolley/shopping cart icon adds the scene to the basket. To see what is in the basket, we can click on item basket, and then proceed to checkout. When the files are available for download, an e-mail with a link to the data will be sent to our account.

2.2 The Mendocino Complex Fire

In this part, we will work on Landsat scenes prior and posterior to the Mendocino Complex Fire, a huge fire which hit California in 2018.

To avoid waisting time requesting and downloading Landsat scenes from earthexplorer, four scenes have already been downloaded. Two before the fire, and two after it. The folder ../donnees/landsat contains these scenes, in four sub-folders, each corresponding to a scene. The different rasters providing the values recorded by the sensors of the satellite are in these folders.

The first part of the names of these folders indicate which scene we are facing. The Table below explains each element of the file name LC080450332018071001T1-SC20190704113649.

Desc.	Landsat	Sensor	Satellite	WRS path	WRS row	Acquisition date	Processing year	Collection number	Collection category
Code	L	C	08	045	033	20180710	20180131	01	T1
Value	Landsat	OLI	Landsat 8	path=045	row=033	2018-07-10	2018-01-31	01	Tier 1

The second part gives information regarding the processing time of the image.

2.2.1 Prior to the Fire

library(tidyverse)
library(raster)
library(rgeos)
library(rgdal)
library(RStoolbox)

The first landsat scene we have downloaded is from 2018-07-26. There are 13 files provided by the USGS:

list.files("../donnees/landsat/LC080450332018072601T1-SC20190704113634/")

##  [1] "LC08_L1TP_045033_20180726_20180731_01_T1_ANG.txt"       
##  [2] "LC08_L1TP_045033_20180726_20180731_01_T1_MTL.txt"       
##  [3] "LC08_L1TP_045033_20180726_20180731_01_T1_pixel_qa.tif"  
##  [4] "LC08_L1TP_045033_20180726_20180731_01_T1_radsat_qa.tif" 
##  [5] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_aerosol.tif"
##  [6] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band1.tif"  
##  [7] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band2.tif"  
##  [8] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band3.tif"  
##  [9] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band4.tif"  
## [10] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band5.tif"  
## [11] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band6.tif"  
## [12] "LC08_L1TP_045033_20180726_20180731_01_T1_sr_band7.tif"  
## [13] "LC08_L1TP_045033_20180726_20180731_01_T1.xml"

The 7 bands can be imported in R, using the raster::stack() function. First, a vector with the path to each of the 7 files need to be created.

files_bands_20180726 <- list.files("../donnees/landsat/LC080450332018072601T1-SC20190704113634/",
                                   pattern = "band[[:digit:]]{1}\\.tif$",
                                   full.names = TRUE)

files_bands_20180726

## [1] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band1.tif"
## [2] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band2.tif"
## [3] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band3.tif"
## [4] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band4.tif"
## [5] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band5.tif"
## [6] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band6.tif"
## [7] "../donnees/landsat/LC080450332018072601T1-SC20190704113634//LC08_L1TP_045033_20180726_20180731_01_T1_sr_band7.tif"

In the previous code, we have provided a regular expression (or regex) to the argument pattern to help identify which files to keep among those in the directory. Let us take a few moments to explain this regex. Here, we want to list the files in the directory that contain the word band, followed by a number ([[digit]]) exactly present once ({1}), followed by a dot (\\.) and the substring tif. This substring must be placed at the end of the string providing the paths to the files.

The character vector that gives the path to each of the 7 bands can then be given as the x argument of the raster::stack() function, to create a RasterStack object in R.

bands_20180726_stack <- raster::stack(files_bands_20180726)
bands_20180726_stack

## class      : RasterStack 
## dimensions : 7871, 7741, 60929411, 7  (nrow, ncol, ncell, nlayers)
## resolution : 30, 30  (x, y)
## extent     : 367185, 599415, 4187685, 4423815  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## names      : LC08_L1TP//1_sr_band1, LC08_L1TP//1_sr_band2, LC08_L1TP//1_sr_band3, LC08_L1TP//1_sr_band4, LC08_L1TP//1_sr_band5, LC08_L1TP//1_sr_band6, LC08_L1TP//1_sr_band7 
## min values :                -32768,                -32768,                -32768,                -32768,                -32768,                -32768,                -32768 
## max values :                 32767,                 32767,                 32767,                 32767,                 32767,                 32767,                 32767

The previous output shows that the RasterStack is a collection of 7 RasterLayer objects of dimension 7871 x 7741. The resolution of each cell of the grid is 30m x 30m.

The area covered by the image is larger than that of interest. Let us create a spatial object that gives the coordinates of the area of interest which corresponds roughly to the Mendocino Complex Fire area.

First, let us define the coordinates of the bouding box (in UTM coordinate system, zone 10).

x_min <- 485000
x_max <- 550000
y_min <- 4310000
y_max <- 4370000

Then, we can create a SpatialPolygons() object with the following code:

p <- 
  cbind(c(x_min, x_min, x_max, x_max),
        c(y_min, y_max, y_max, y_min)) %>% 
  Polygon() %>% 
  list() %>% 
  Polygons(ID = "1") %>% 
  list() %>% 
  SpatialPolygons()

p_df <- data.frame( ID=1:length(p) )

Then, we can create a SpatialPolygonsDataFrame object and set the CRS accordingly with that of of landsat scene.

p_sp <- SpatialPolygonsDataFrame(p, p_df) 
crs(p_sp) <- crs(bands_20180726_stack)
plot(p_sp)

The landsat scene can be plotted using the ggRGB() function from the package {RStoolbox}. We can add some customized theme to the plot to get a fancier output.

#' theme_map
#' Theme for maps created using ggplot2
theme_map <- function(...) {
  theme_minimal() +
    theme(
      text = element_text(color = "#22211d"),
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      # panel.grid.minor = element_line(color = "#ebebe5", size = 0.2),
      panel.grid.major = element_line(color = "#ebebe5", size = 0.2),
      panel.grid.minor = element_blank(),
      plot.background = element_rect(fill = "#f5f5f2", color = NA), 
      panel.background = element_rect(fill = "#f5f5f2", color = NA), 
      legend.key = element_rect(colour = "black"),
      legend.background = element_rect(fill = "#f5f5f2", color = NA),
      panel.border = element_blank(),
      ...
    )
}

# The landsat scene and the area of interest
RStoolbox::ggRGB(bands_20180726_stack,
      r = 4, g = 3, b = 2, stretch = "lin") +
  # Highlight the area of interest
  geom_polygon(data = fortify(p_sp),
               aes(x = long, y = lat), fill = "red", alpha = .3, col = "red") +
  theme_map() +
  labs(title = "Landsat scene (2018-07-26)")

To focus only on the area of interest, the landsat image can be cropped using the raster::crop() function:

bands_20180726_brick_crop <- raster::crop(bands_20180726_stack, p_sp)
bands_20180726_brick_crop

## class      : RasterBrick 
## dimensions : 2000, 2167, 4334000, 7  (nrow, ncol, ncell, nlayers)
## resolution : 30, 30  (x, y)
## extent     : 484995, 550005, 4309995, 4369995  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## source     : /private/var/folders/rn/2hxdvrhs4ps367xyntyn1cgr0000gp/T/RtmpalS7sh/raster/r_tmp_2019-07-05_185222_5364_42255.grd 
## names      : LC08_L1TP//1_sr_band1, LC08_L1TP//1_sr_band2, LC08_L1TP//1_sr_band3, LC08_L1TP//1_sr_band4, LC08_L1TP//1_sr_band5, LC08_L1TP//1_sr_band6, LC08_L1TP//1_sr_band7 
## min values :                 -2000,                 -2000,                   -92,                  -132,                  -318,                  -124,                   -59 
## max values :                  6122,                  6582,                  6728,                  7383,                  9411,                  8643,                  8786

Which gives the following collection of rasters:

ggRGB(bands_20180726_brick_crop,
      r = 4, g = 3, b = 2, stretch = "lin") +
  theme_map() +
  labs(title = "Landsat scene (2018-07-26)",
       subtitle = "Perimeter of the Mendocino Complex fire")

2.2.1.1 Dealing with Clouds and Water

There may be some clouds or water in the landsat scene. Fortunately, the USGS provides a layer pointing where the clouds and water area may be. The name of this layer ends with pixel_qa. The product guide gives the information about the different values associated with each pixel.

# Import the layer
cloud_mask_20180726 <- 
  raster("../donnees/landsat/LC080450332018072601T1-SC20190704113634/LC08_L1TP_045033_20180726_20180731_01_T1_pixel_qa.tif")
# Extract the area of interest
cloud_mask_20180726_crop <- crop(cloud_mask_20180726, p_sp)
cloud_mask_20180726_crop

## class      : RasterLayer 
## dimensions : 2000, 2167, 4334000  (nrow, ncol, ncell)
## resolution : 30, 30  (x, y)
## extent     : 484995, 550005, 4309995, 4369995  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## source     : /private/var/folders/rn/2hxdvrhs4ps367xyntyn1cgr0000gp/T/RtmpalS7sh/raster/r_tmp_2019-07-05_185244_5364_25523.grd 
## names      : LC08_L1TP_045033_20180726_20180731_01_T1_pixel_qa 
## values     : 322, 834  (min, max)

Let us plot that raster, using ggplot(). To that end, a data frame with the coordinates of the tiles and the associated values must be provided to the ggplot() function.

cloud_mask_20180726_df <- 
  as.data.frame(cloud_mask_20180726_crop, xy=T) %>%
  as_tibble() %>% 
  # Rename the columns of the table
  magrittr::set_colnames(c("long", "lat", "pixel_qa")) %>% 
  # Set the column `pixel_qa` as a factor
  mutate(pixel_qa = factor(pixel_qa))
cloud_mask_20180726_df

## # A tibble: 4,334,000 x 3
##      long     lat pixel_qa
##     <dbl>   <dbl> <fct>   
##  1 485010 4369980 322     
##  2 485040 4369980 322     
##  3 485070 4369980 322     
##  4 485100 4369980 322     
##  5 485130 4369980 322     
##  6 485160 4369980 322     
##  7 485190 4369980 322     
##  8 485220 4369980 322     
##  9 485250 4369980 322     
## 10 485280 4369980 322     
## # … with 4,333,990 more rows

The different values of the pixels are the following:

table(cloud_mask_20180726_df$pixel_qa)

## 
##     322     324     834 
## 4126322  207638      40

Table 6-3. Landsat 8 Pixel Quality Assessment (pixel_qa) Values from the product guide gives the following descriptions for these codes:

322, 834: clear
324: water

There is therefore no identified clouds in this raster. There is, however, some water.

Then the plot can be created:

ggplot(data = cloud_mask_20180726_df,
       aes(x = long, y = lat, fill = pixel_qa)) +
  geom_raster() +
  theme_map() +
  labs(title = "Landsat 2018-07-26 - Cloud mask/Water layer")

Here, we want to locate areas in which the fire burnt. We therefore will assign NA values to pixels whose value is 322:

cloud_mask_20180726_crop[ cloud_mask_20180726_crop == 324 ] <- NA
as.data.frame(cloud_mask_20180726_crop, xy=TRUE) %>% 
  as_tibble() %>% 
  magrittr::set_colnames(c("long", "lat", "pixel_qa")) %>% 
  mutate(pixel_qa = factor(pixel_qa)) %>% 
  ggplot(data = .,
         aes(x = long, y = lat, fill = pixel_qa)) +
  geom_raster() +
  theme_map() +
  labs(title = "Landsat 2018-07-26 - Cloud/Water mask layer")

Then, using the raster::mask() function, the areas from the landsat image identified as water can be removed. Every cell of the raster where the value in the mask is NA will become NA as well.

bands_20180726_brick_crop_noclouds <- raster::mask(bands_20180726_brick_crop, mask = cloud_mask_20180726_crop)

Again, the image can be plotted:

ggRGB(bands_20180726_brick_crop_noclouds,
      r = 4, g = 3, b = 2, stretch = "lin") +
  labs(title = "Landsat 2018-07-26 - Without water") +
  theme_map()

2.2.2 Posterior to the Fire

A landsat view after the fire is available on 2018-08-27. Again, we create a vector with the paths to the 7 bands:

files_bands_20180827 <- 
  list.files("../donnees/landsat/LC080450332018082701T1-SC20190704113725/",
             pattern = "band[[:digit:]]{+}\\.tif$",
             full.names = TRUE)

Then we import the layers in R using the raster::stack() function:

bands_20180827_stack <- raster::stack(files_bands_20180827)

Again, the landsat view is far too large for the area being studied and therefore cropped:

bands_20180827_brick_crop <- crop(bands_20180827_stack, p_sp)

Let us plot that cropped image, using the red, green and blue layers:

RStoolbox::ggRGB(bands_20180827_brick_crop,
      r = 4, g = 3, b = 2, stretch = "lin") +
  theme_map() +
  labs(title = "Landsat scene (2018-08-27)",
       subtitle = "Perimeter of the Mendocino Complex fire")

There seems to be some clouds in that view. The layer providing the pixel quality assessment confirms it:

cloud_mask_20180827 <- 
  raster("../donnees/landsat/LC080450332018082701T1-SC20190704113725/LC08_L1TP_045033_20180827_20180911_01_T1_pixel_qa.tif")
cloud_mask_20180827_crop <- crop(cloud_mask_20180827, p_sp)
cloud_mask_20180827_df <- 
  as.data.frame(cloud_mask_20180827_crop, xy=T) %>%
  as_tibble() %>% 
  # Rename the columns of the table
  magrittr::set_colnames(c("long", "lat", "pixel_qa"))
cloud_mask_20180827_df

## # A tibble: 4,334,000 x 3
##      long     lat pixel_qa
##     <dbl>   <dbl>    <int>
##  1 485010 4369980      322
##  2 485040 4369980      322
##  3 485070 4369980      322
##  4 485100 4369980      322
##  5 485130 4369980      322
##  6 485160 4369980      322
##  7 485190 4369980      322
##  8 485220 4369980      322
##  9 485250 4369980      322
## 10 485280 4369980      322
## # … with 4,333,990 more rows

cloud_mask_20180827_df$pixel_qa %>% table()

## .
##     322     324     328     352     386     388     392     416     480 
## 4034040  218436    2093     899   40949       1     983   20166   16428 
##     834 
##       5

The product guide gives the following descriptions for the encountered pixel values:

322, 386, 834: clear
324, 388: water
328, 392: cloud shadow
352, 416, 480: cloud

We can add a column to the data frame of the mask providing the description. To do this, a tibble containing the values of the pixels and the corresponding table can be created and joined to the mask data table.

pixel_descriptions <- 
  tibble(pixel_qa = c(
    322, 386, 834,
    324, 388,
    328, 392,
    352, 416, 480),
    description = c(
      rep("clear", 3),
      rep("water", 2),
      rep("cloud shadow", 2),
      rep("cloud", 3)
    ))

cloud_mask_20180827_df <- 
  cloud_mask_20180827_df %>% 
  left_join(
    pixel_descriptions
  )

cloud_mask_20180827_df

## # A tibble: 4,334,000 x 4
##      long     lat pixel_qa description
##     <dbl>   <dbl>    <dbl> <chr>      
##  1 485010 4369980      322 clear      
##  2 485040 4369980      322 clear      
##  3 485070 4369980      322 clear      
##  4 485100 4369980      322 clear      
##  5 485130 4369980      322 clear      
##  6 485160 4369980      322 clear      
##  7 485190 4369980      322 clear      
##  8 485220 4369980      322 clear      
##  9 485250 4369980      322 clear      
## 10 485280 4369980      322 clear      
## # … with 4,333,990 more rows

The resulting mask can be plotted:

ggplot(data = cloud_mask_20180827_df,
       aes(x = long, y = lat, fill = description)) +
  geom_raster() +
  theme_map() +
  labs(title = "Landsat 2018-08-27 - Cloud mask/Water layer")

The areas where clouds or cloud shadows are spotted can be replaced with NA values in the mask:

mask_20180827_crop_clouds <- cloud_mask_20180827_crop
mask_20180827_crop_clouds[ mask_20180827_crop_clouds %in% c(328, 392, 352, 416, 480) ] <- NA

And from the landsat view:

bands_20180827_brick_crop_noclouds <- 
  raster::mask(bands_20180827_brick_crop, mask = mask_20180827_crop_clouds)

Which gives the following view:

ggRGB(bands_20180827_brick_crop_noclouds,
      r = 4, g = 3, b = 2, stretch = "lin") +
  labs(title = "Landsat 2018-08-27 - Without water") +
  theme_map()

2.2.2.1 Replace Clouds

As for now, the pixels identified as clouds in the landsat view are filled with NA values. We can use another landsat view from another date to replace those with ones that may not be clouds. We can use the view from 2018-09-12 for example.

files_bands_20180912 <- 
  list.files("../donnees/landsat/LC080450332018091201T1-SC20190704113621/",
             pattern = "band[[:digit:]]{+}\\.tif$",
             full.names = TRUE)
bands_20180912_stack <- raster::stack(files_bands_20180912)
# Cropping the raster
bands_20180912_brick_crop <- crop(bands_20180912_stack, p_sp)

# Plotting the image
RStoolbox::ggRGB(bands_20180912_brick_crop,
                 r = 4, g = 3, b = 2, stretch = "lin") +
  theme_map() +
  labs(title = "Landsat scene (2018-09-12)",
       subtitle = "Perimeter of the Mendocino Complex fire")

The pixel quality assessment corresponding to that view may also be loaded:

cloud_mask_20180912 <-
  raster("../donnees/landsat/LC080450332018091201T1-SC20190704113621/LC08_L1TP_045033_20180912_20180927_01_T1_pixel_qa.tif")
cloud_mask_20180912_crop <- crop(cloud_mask_20180912, p_sp)

And again, transformed to a table:

cloud_mask_20180912_df <- 
  as.data.frame(cloud_mask_20180912_crop, xy=T) %>%
  as_tibble() %>% 
  # Rename the columns of the table
  magrittr::set_colnames(c("long", "lat", "pixel_qa"))
cloud_mask_20180912_df

## # A tibble: 4,334,000 x 3
##      long     lat pixel_qa
##     <dbl>   <dbl>    <int>
##  1 485010 4369980      322
##  2 485040 4369980      322
##  3 485070 4369980      322
##  4 485100 4369980      322
##  5 485130 4369980      322
##  6 485160 4369980      322
##  7 485190 4369980      322
##  8 485220 4369980      322
##  9 485250 4369980      322
## 10 485280 4369980      322
## # … with 4,333,990 more rows

cloud_mask_20180912_df$pixel_qa %>% table()

## .
##     322     324     328     352     386     392     416     480     834 
## 4037193  226383   13870    2276    1325     499    1935    1220   12409 
##     836     840     864     898     904     928     992 
##      14    1486    2787    1571     886    6072   24074

The description can be added to the data frame:

# Description table
pixel_descriptions <- 
  tibble(pixel_qa = c(
    322, 386, 834, 898,
    324, 388, 836,
    328, 392, 836, 904,
    352, 416, 480, 864, 928, 992),
    description = c(
      rep("clear", 4),
      rep("water", 3),
      rep("cloud shadow", 4),
      rep("cloud", 6)
    ))

# Then joining the description to `cloud_mask_20180827_df`
cloud_mask_20180912_df <- 
  cloud_mask_20180912_df %>% 
  left_join(
    pixel_descriptions
  )

cloud_mask_20180912_df

## # A tibble: 4,334,014 x 4
##      long     lat pixel_qa description
##     <dbl>   <dbl>    <dbl> <chr>      
##  1 485010 4369980      322 clear      
##  2 485040 4369980      322 clear      
##  3 485070 4369980      322 clear      
##  4 485100 4369980      322 clear      
##  5 485130 4369980      322 clear      
##  6 485160 4369980      322 clear      
##  7 485190 4369980      322 clear      
##  8 485220 4369980      322 clear      
##  9 485250 4369980      322 clear      
## 10 485280 4369980      322 clear      
## # … with 4,334,004 more rows

And a plot rendered:

ggplot(data = cloud_mask_20180912_df,
       aes(x = long, y = lat, fill = description)) +
  geom_raster() +
  theme_map() +
  labs(title = "Landsat 2018-09-12 - Cloud mask/Water layer")

The cells of the collection of layers from the view on 2018-09-12 not identified as clear are assigned with a NA value:

bands_20180912_brick_crop_clear <- mask(bands_20180912_brick_crop, mask = cloud_mask_20180912_crop)

ggRGB(bands_20180912_brick_crop_clear,
      r = 4, g = 3, b = 2, stretch = "lin") +
  labs(title = "Landsat 2018-09-27 - Without water and clouds") +
  theme_map()

We will use this collection of layers to replace NA values from our first view. We will use the following two collections of layers:

bands_20180827_brick_crop_noclouds: landsat view on 2018-08-27 where cells identified as clouds and shadow clouds were replaced with NA values
bands_20180912_brick_crop_clear: landsat view on 2018-09-12 where cells identified as clouds, shadow clouds and water were replaced with NA values.

The function raster::cover() will replace the NA values from the first scene (bands_20180827_brick_crop_noclouds) with non NA values from the second (bands_20180912_brick_crop_clear):

post_fire_brick <- raster::cover(bands_20180827_brick_crop_noclouds, bands_20180912_brick_crop_clear)

The resulting image can be plotted:

ggRGB(post_fire_brick,
      r = 4, g = 3, b = 2, stretch = "lin") +
  labs(title = "Landsat 2018-09-27 - Without clouds") +
  theme_map()

2.2.2.2 Remove Water

We want to identify the areas in which the fire occurred. To do this, we will build a metrics which may give false positives to water areas. It may thus be a good idea to remove pixels identified as water.

mask_20180827_crop_water <- cloud_mask_20180827_crop
mask_20180827_crop_water[ mask_20180827_crop_water %in% c(324, 388, 836) ] <- NA

And from the landsat view:

post_fire_brick <- raster::mask(post_fire_brick, mask = mask_20180827_crop_water)

Which gives the following view:

ggRGB(post_fire_brick,
      r = 4, g = 3, b = 2, stretch = "lin") +
  labs(title = "Landsat 2018-08-27 - Without water") +
  theme_map()

2.2.3 Burnt Areas

To identify burnt areas, we can use a metrics attributed to (Key et al. 2002) (unable to find the document), that is widely used in the literature: the Normalized Burn Ratio (NBR), defined as follows:

\[NBR = \frac{NIR - SWIR}{NIR + SWIR},\] where \(NIR\) is the near infrared value recorded by the thermal sensor (band 5) and \(SWIR\) is the short-wavelength infrared value recorded by the SWIR sensor (band 7).

Depending on the value of the ratio, the following situation can be assessed:

We can compute that ratio on both landsat views. To do so, a function that takes two layers as arguments can be created:

#' compute_nbr
#' @param band_nir NIR band
#' @param band_swir SWIR band
compute_nbr <- function(band_nir, band_swir){
  # this function calculates the difference between two rasters of the same CRS and extent
  # input: 2 raster layers of the same extent, crs that can be subtracted
  # output: a single different raster of the same extent, crs of the input rasters
  (band_nir - band_swir)/(band_nir + band_swir)
}

This function can then be applied to the 5th and 7th layers of our two scenes:

landsat_prefire_nbr <- overlay(bands_20180726_brick_crop_noclouds[[5]],
                               bands_20180726_brick_crop_noclouds[[7]],
                               fun = compute_nbr)

landsat_prefire_nbr

## class      : RasterLayer 
## dimensions : 2000, 2167, 4334000  (nrow, ncol, ncell)
## resolution : 30, 30  (x, y)
## extent     : 484995, 550005, 4309995, 4369995  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : layer 
## values     : -0.5500239, 0.9310345  (min, max)

landsat_postfire_nbr <- overlay(post_fire_brick[[5]],
                                post_fire_brick[[7]],
                                fun = compute_nbr)

landsat_postfire_nbr

## class      : RasterLayer 
## dimensions : 2000, 2167, 4334000  (nrow, ncol, ncell)
## resolution : 30, 30  (x, y)
## extent     : 484995, 550005, 4309995, 4369995  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : layer 
## values     : -0.8548847, 0.9088264  (min, max)

We can create tibbles from these two RasterLayer objects and bind them to plot the index values prior and posterior to the fire.

as.data.frame(landsat_prefire_nbr, xy=TRUE) %>% as_tibble() %>% 
  rename("nbr" = "layer") %>% 
  mutate(period = "pre") %>% 
  bind_rows(
    as.data.frame(landsat_postfire_nbr, xy=TRUE) %>% as_tibble() %>% 
      rename("nbr" = "layer") %>% 
      mutate(period = "post")
  ) %>% 
  mutate(period = factor(period,
                         levels = c("pre", "post"),
                         labels = c("Pre-fire NBR", "Post-fire NBR"))) %>% 
  ggplot(data = .,
         aes(x = x, y = y, fill = nbr)) +
  geom_raster() +
  scale_fill_gradient2("NBR", low = "red", mid = "white", high = "darkgreen",  limits = c(-1, 1)) +
  facet_wrap(~period) +
  theme_map() +
  coord_quickmap()

Enventually, the evolution of the NBR can be monitored, by computing the difference between the NBR value prior to the fire and the NBR value posterior to the fire. The resulting value (\(\Delta NBR\)) measures burn severity, as explained in this introduction note to remote sensing by Humboldt State University. The note warns that these values should be confronted to the field and constitute a first assessment of burn severity.

\(\Delta NBR\)	Burn Severity
< -0.25	High post-fire regrowth
-0.25 to -0.1	Low post-fire regrowth
-0.1 to +0.1	Unburned
0.1 to 0.27	Low-severity burn
0.27 to 0.44	Moderate-low severity burn
0.44 to 0.66	Moderate-high severity burn
> 0.66	High-severity burn

In R, the burn severity measure is computed as follows:

landsat_diff_nbr <- landsat_prefire_nbr - landsat_postfire_nbr

The resulting RasterLayer can be transformed in a tibble for convenient use by the ggplot() function.

landsat_postfire_dnbr_df <- 
  as.data.frame(landsat_diff_nbr, xy=TRUE) %>% 
  as_tibble() %>% 
  rename("dNBR" = "layer") %>% 
  mutate(
    dNBR_class = cut(dNBR, c(-Inf, -.1, .1, .27, .66, Inf),
                     right=TRUE,
                     labels = c("Enhanced Regrowth",
                                "Unburned",
                                "Low Severity",
                                "Moderate Severity",
                                "High Severity"))
  )
landsat_postfire_dnbr_df

## # A tibble: 4,334,000 x 4
##         x       y     dNBR dNBR_class
##     <dbl>   <dbl>    <dbl> <fct>     
##  1 485010 4369980 -0.0191  Unburned  
##  2 485040 4369980  0.00893 Unburned  
##  3 485070 4369980  0.00188 Unburned  
##  4 485100 4369980  0.0124  Unburned  
##  5 485130 4369980 -0.0435  Unburned  
##  6 485160 4369980 -0.0424  Unburned  
##  7 485190 4369980 -0.0272  Unburned  
##  8 485220 4369980 -0.0130  Unburned  
##  9 485250 4369980 -0.0167  Unburned  
## 10 485280 4369980 -0.00498 Unburned  
## # … with 4,333,990 more rows

Let us define some colours associated with the classes of \(\Delta NBR\):

colours <- 
  tibble(name = c("dark green", "green", "yellow", "orange", "red"),
         colour = c("#49934b", "#b1d778", "#fffdc6", "#f1b16e", "#c4352a"),
         dNBR_class = c("Enhanced Regrowth",
                        "Unburned",
                        "Low Severity",
                        "Moderate Severity",
                        "High Severity"))

We can plot the evolution of the NBR as follows:

ggplot(data = landsat_postfire_dnbr_df,
       aes(x = x, y = y, fill = dNBR_class)) +
  geom_raster() +
  coord_quickmap() +
  scale_fill_manual(NULL, values = colours$colour) +
  labs(x = NULL, y = NULL, title = "Landsat dNBR - Cold Spring fire site (July 2018)") +
  theme_map()

For the area of interest, we can obtain the number of cells corresponding to each of the classes:

landsat_postfire_dnbr_df$dNBR_class %>% 
  table()

## .
## Enhanced Regrowth          Unburned      Low Severity Moderate Severity 
##             45080           2314607            406613            837404 
##     High Severity 
##            501137

A histogram can of course be drawn:

landsat_postfire_dnbr_df %>% 
  ggplot(data = ., aes(x = dNBR_class, fill = dNBR_class)) +
  geom_histogram(stat = "count") +
  scale_fill_manual(NULL, values = colours$colour) +
  labs(x = NULL) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Using the fact that the resolution of a cell is 30m by 30m, we can also compute the area where the burn severity is high:

nb_cell_high_severity <- 
  landsat_postfire_dnbr_df %>% 
  filter(dNBR_class %in% c("High Severity")) %>% 
  nrow()

# Squares kilometres
nb_cell_high_severity * 30 * 30 / 1000000

## [1] 451.0233

References

Key, CH, N Benson, D Ohlen, SM Howard, and Z Zhu. 2002. “The Normalized Burn Ratio and Relationships to Burn Severity: Ecology, Remote Sensing and Implementation.” In Ninth Biennial Remote Sensing Applications Conference, Apr 8–12, San Diego, ca.

Wasser, Leah. 2018. “earthlab/earth-analytics-r-course: Earth Analytics Course in the R Programming Language.” https://doi.org/10.5281/zenodo.1326873.

Zham, Poonam, Dinesh K. Kumar, Peter Dabnichki, Sridhar Poosapadi Arjunan, and Sanjay Raghav. 2017. “Distinguishing Different Stages of Parkinson’s Disease Using Composite Index of Speed and Pen-Pressure of Sketching a Spiral.” Frontiers in Neurology 8 (September). https://doi.org/10.3389/fneur.2017.00435.

Pictures

Insurance Data Science: Use and Value of Unusual Data

August 2019