Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)

Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)

Ewen Gallic, Aix-Marseille Université (🐦 @3wen)

1 Text Processing

1.1 Regular Expressions

It is common to manipulate strings. This is the case when files need to be loaded in a loop and only some of them are targeted. This is also the case, of course, when the data used in the models is textual. A very convenient tool for searching strings for more or less complex patterns is the regular expression. Regular expressions (or regex) are sequences of characters forming a search pattern. The pattern is used to match one or several characters in a string. The help page in R (?regex) provides condensed information on regex.

The {base} package contains multiple functions related to regular expressions, but we will instead use some functions from the package {stringr}, built on top of {stringi}. The package {stringr} allows us to easily manipulate strings in R.

For most of the examples given to illustrate how regex work, we will use real Twitter data from CrisisNLP. Let us use the tweets from the Nepal Earthquake crisis, annotated by volunteers.

library(tidyverse)
tweets_earthquake <- 
  str_c("donnees/CrisisNLP_volunteers_labeled_data/2015_Nepal_Earthquake_en/",
        "2015_Nepal_Earthquake_en.csv") %>% 
  read_csv(locale = locale(encoding = "UTF-8"))
tweets_earthquake
## # A tibble: 9,471 x 10
##    tweet_id tweet_time tweet_author tweet_author_id tweet_language
##    <chr>    <chr>      <chr>                  <dbl> <chr>         
##  1 '591903… Sat Apr 2… Faali19           2387302745 en            
##  2 '591903… Sat Apr 2… STERLINGMHO…       153876973 en            
##  3 '591903… Sat Apr 2… HeenaliVP          421188281 en            
##  4 '591903… Sat Apr 2… Xennia79           176207969 en            
##  5 '591903… Sat Apr 2… Madhurita_        1058658786 en            
##  6 '591903… Sat Apr 2… MONIMISHI         1461160603 en            
##  7 '591904… Sat Apr 2… AnilBalkrus…      3034294729 en            
##  8 '591904… Sat Apr 2… haquem19          2782719416 en            
##  9 '591904… Sat Apr 2… Akshay7_           111600045 en            
## 10 '591904… Sat Apr 2… hnrwbell          1613115068 en            
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## #   tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>

To check whether a pattern is found in a string, we can use the function stringr::str_detect() (note that the package {stringr} was attached when we attached {tidyverse}).

(two_tweets <- tweets_earthquake$tweet_text[1:2])
## [1] "Dua's for all those affected by the earthquakes in India,Nepal &amp; Bhutan. Stay safe &amp; help others in any form. #Equake http://t.co/M6YG0k4FKh"
## [2] "itvnews: Witness to Nepal #earthquake tells itvnews: 'It was terrifying' http://t.co/UWMynVyzQC"
# Can we find the word "earthquake" in the tweets?
str_detect(string = two_tweets, pattern = "earthquake")
## [1] TRUE TRUE
# Can we find the word "India" in the tweets?
str_detect(string = two_tweets, pattern = "India")
## [1]  TRUE FALSE

1.1.1 Literals and metacharacters

In the above examples, the pattern is composed of literals, i.e., characters that receive a literal interpretation in the regular expression. Other characters receive a different interpretation when they are part of a regex. This is the case of the following reserved characters, called metacharacters: . \ | ( ) [ { ^ $ * + ?. If we want these characters to be interpreted literally, we need to escape them. In R, this is done using two backslashes.

# The character `.` is a metacharacter that matches any character (except a new line)
str_detect(string = c("Earthquake.", "Earthquake"), pattern = ".")
## [1] TRUE TRUE
# To look for a dot in a string:
str_detect(string = c("Earthquake.", "Earthquake"), pattern = "\\.")
## [1]  TRUE FALSE
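
As an alternative to escaping, the fixed() modifier of {stringr} treats the whole pattern as a literal string. The short sketch below is an addition to the original examples and uses a made-up vector x:

```r
library(stringr)

x <- c("Earthquake.", "Earthquake")

# fixed() switches off regex interpretation: "." is matched as a literal dot
dot_fixed <- str_detect(string = x, pattern = fixed("."))
dot_fixed
## [1]  TRUE FALSE

# Equivalent regex, with the metacharacter escaped
dot_escaped <- str_detect(string = x, pattern = "\\.")
dot_escaped
## [1]  TRUE FALSE
```

fixed() can be convenient when the text to look for comes from user input and may contain several metacharacters.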

1.1.2 Line anchors

To match the beginning and the end of a string, respectively, we can use the line anchors ^ and $.

For example, to look for the tweets which begin with a hashtag:

str_detect(string = tweets_earthquake$tweet_text,
           pattern = "^#") %>% 
  which() %>% 
  head()
## [1] 10 25 28 31 38 43

This can be useful when combined with the function dplyr::filter() to filter the rows of a two-dimensional table:

# Tweets beginning with a hashtag
tweets_earthquake %>% 
  filter(str_detect(string = tweet_text, pattern = "^#")) %>% 
  select(tweet_text)
## # A tibble: 816 x 1
##    tweet_text                                                              
##    <chr>                                                                   
##  1 #earthquake @BBCNews my uncle is travelling in Nepal but has notified u…
##  2 #Kathmandu's Tribhuvan Airport is currently closed due to #lEarthquake.…
##  3 #Nepal earthquake claims five lives in East #India http://t.co/aU4EFiKU…
##  4 #google person finder for #earthquake http://t.co/uZyXguoio2            
##  5 #This Is the #helpline for #Nepal earthquake click on this post  https:…
##  6 "#\xbe\xdc\xdd\x8c_\xc9 #\x8c\xe0\xbc\x8a__\x8b\x81\xe3\x8d_\xc8 #\x8b\…
##  7 #NepalQuake | Deep Kumar Upadhyay, Nepal's ambassador to India, says Ai…
##  8 #BeingIndian mourns the loss of the lives in the #earthquake that hit N…
##  9 #NepalEarthquake Tribhuvan Int. Arprt #Kathmandu closed 4 operations fl…
## 10 #earthquakeindia Judging by the nature of tremors in Lakhimpur,one can …
## # … with 806 more rows

1.1.3 Alternation

The pipe character | matches any one of several alternative expressions.

tweets <- 
  c("PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries",
  "Earthquake severe damage to Kathmandu. Tragic loss of life.",
  "7.9-magnitude earthquake strikes Nepal, damage reported",
  "Thoughts are with the families in #Nepal")

str_detect(string = tweets, pattern = "magnitude|damage")
## [1]  TRUE  TRUE  TRUE FALSE

This may be useful for alternative spellings.

str_detect(string = c("labor", "labour", "workforce"),
           pattern = "labor|labour")
## [1]  TRUE  TRUE FALSE

1.1.4 Character Classes

Character classes are lists of characters that belong to a group, such as alphabetic, numeric, or alphanumeric characters. It is possible to build them or to use predefined classes. They are written by placing the characters between square brackets []; the class then matches any single character from the list.

Let us assume that we face file names with a date, and that we want to match only those whose month in a given year is “January” or “February”:

str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
            pattern = c("file_2019-0[12]-01"))
## [1] "file_2019-01-01" NA                "file_2019-02-01"

In the previous code, we therefore searched each string for the occurrence of the substring file_2019-01-01 or file_2019-02-01.

Using a dash -, it is possible to define a sequence of characters. Thus, the character class [A-Z] is used to match the letters of the following set: ABCDEFGHIJKLMNOPQRSTUVWXYZ. The character class [0-9] matches the character set 0123456789.

str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
            pattern = c("file_2019-0[1-3]-01"))
## [1] "file_2019-01-01" "file_2019-03-01" "file_2019-02-01"

Unions of groups can be made:

str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
                       "file_2019-03-01.txt", "file_2019-04-01.txt",
                       "file_2019-05-01.txt", "file_2019-06-01.txt",
                       "file_2019-07-01.txt", "file_2019-08-01.txt"
                       ),
            pattern = c("file_2019-0[1-36-8]-01"))
## [1] "file_2019-01-01" "file_2019-02-01" "file_2019-03-01" NA               
## [5] NA                "file_2019-06-01" "file_2019-07-01" "file_2019-08-01"

To exclude a group of characters, a circumflex accent ^ can be placed right after the opening bracket:

str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
                       "file_2019-03-01.txt", "file_2019-04-01.txt",
                       "file_2019-05-01.txt", "file_2019-06-01.txt",
                       "file_2019-07-01.txt", "file_2019-08-01.txt"
                       ),
            pattern = c("file_2019-0[^1-36-8]-01"))
## [1] NA                NA                NA                "file_2019-04-01"
## [5] "file_2019-05-01" NA                NA                NA

If, on the other hand, the circumflex must be part of the character class, it should not be placed right after the opening bracket:

str_extract(string = c("So happy ^_^", "So happy (#^.^#)"),
            pattern = "happy [a-z^]")
## [1] "happy ^" NA

It is also possible to escape the character as follows: [\^].

Some classes are pre-built and can be referred to by their name. They are based on the POSIX family of standards. The most used (in my own experience) are listed in the table below.

Character class | Description |
[:digit:] | digits |
[:lower:] | lowercase alphabetic characters |
[:upper:] | uppercase alphabetic characters |
[:alpha:] | alphabetic characters (both lower and upper) |
[:alnum:] | alphabetic characters and numbers |
[:blank:] | space and tab |
[:punct:] | punctuation |
[:xdigit:] | hexadecimal digits |

To refer to these classes, they need to be put between the brackets defining the character classes:

tweets
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."              
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"                  
## [4] "Thoughts are with the families in #Nepal"
str_detect(tweets, "magnitude [[:digit:]]")
## [1] FALSE FALSE FALSE FALSE
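
None of the four tweets contains the word "magnitude" followed by a space and a digit, hence the FALSE values. To see the class match, here is a small sketch with a made-up vector (the strings are illustrative and not taken from the dataset):

```r
library(stringr)

x <- c("magnitude 7", "magnitude seven", "Magnitude 6.1")

# A digit must follow "magnitude " (the match is case sensitive)
has_digit <- str_detect(x, "magnitude [[:digit:]]")
has_digit
## [1]  TRUE FALSE FALSE

# First uppercase letter of each string, if any
first_upper <- str_extract(x, "[[:upper:]]")
first_upper
## [1] NA  NA  "M"
```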

Some classes also benefit from an abbreviation:

Abbreviation | Description |
\d | digits |
\D | non-digit characters |
\s | whitespace |
\w | word characters (letters, digits, underscore) |
\W | non-word characters |

str_extract_all("Magnitude 6.1", "\\d")
## [[1]]
## [1] "6" "1"
str_extract_all("Magnitude 6.1", "\\D")
## [[1]]
##  [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" " " "."
str_extract_all("Magnitude 6.1", "\\w")
## [[1]]
##  [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" "6" "1"
str_extract_all("Magnitude 6.1", "\\W")
## [[1]]
## [1] " " "."

1.1.5 Grouping

Parentheses can be used to group some part of a regular expression together. This is particularly helpful when combined with quantifiers and character classes, to manipulate file names for example.

Here is an example with the function str_extract(), which extracts matching patterns from a string:

str_extract(string = c("to analyse", "to analyze", "other"),
           pattern = "analy(s|z)e")
## [1] "analyse" "analyze" NA
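
Grouping also controls how far an alternation extends. Without parentheses, | separates everything on its left from everything on its right. The sketch below, which uses a made-up vector, contrasts the two forms:

```r
library(stringr)

x <- c("gray", "grey", "gr", "eye")

# Without grouping: matches "gra" OR "ey"
no_group <- str_detect(x, "gra|ey")
no_group
## [1]  TRUE  TRUE FALSE  TRUE

# With grouping: "gr", then "a" or "e", then "y"
with_group <- str_detect(x, "gr(a|e)y")
with_group
## [1]  TRUE  TRUE FALSE FALSE
```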

1.1.6 Quantifiers

Quantifiers are used to repeat the regular expression a given number of times. The table below lists the available quantifiers. They are placed after the regex that needs to be matched a given number of times.

Quantifier | Description |
? | the regex appears zero or one time |
* | the regex appears zero or more time(s) |
+ | the regex appears one or more time(s) |
{n} | the regex appears n times exactly |
{n,} | the regex appears n times or more |
{n,m} | the regex appears at least n times but no more than m times |

str_extract(string = c("The labour force", "The labor force"),
           pattern = "labou?r")
## [1] "labour" "labor"
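
The counted quantifiers {n} and {n,m} can be illustrated in the same way. This is a minimal sketch on a made-up vector; note that quantifiers are greedy, so {2,4} takes as many digits as it can, up to four:

```r
library(stringr)

x <- c("2019", "20191", "19", "year 2019")

# Exactly four digits, and nothing else in the string
four_digits <- str_detect(x, "^[0-9]{4}$")
four_digits
## [1]  TRUE FALSE FALSE FALSE

# Between two and four consecutive digits (first match only)
two_to_four <- str_extract(x, "[0-9]{2,4}")
two_to_four
## [1] "2019" "2019" "19"   "2019"
```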

Combining quantifiers with character classes or groups makes it possible to match more complex patterns:

tweets
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."              
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"                  
## [4] "Thoughts are with the families in #Nepal"
str_extract(tweets, "magnitude [[:digit:]\\.]{1,}")
## [1] NA NA NA NA
str_extract_all(tweets, "\\w+")
## [[1]]
##  [1] "PANIC"    "IN"       "NEPAL"    "Strong"   "quake"    "hits"    
##  [7] "capital"  "causing"  "major"    "damage"   "injuries"
## 
## [[2]]
## [1] "Earthquake" "severe"     "damage"     "to"         "Kathmandu" 
## [6] "Tragic"     "loss"       "of"         "life"      
## 
## [[3]]
## [1] "7"          "9"          "magnitude"  "earthquake" "strikes"   
## [6] "Nepal"      "damage"     "reported"  
## 
## [[4]]
## [1] "Thoughts" "are"      "with"     "the"      "families" "in"      
## [7] "Nepal"

(.*) is a useful combination of grouping and quantifiers. It matches any sequence of characters:

  • .: any character
  • *: present zero or more times

x <- c("type_1_20190101_20190131.txt", "type_2_20190101_20190131.txt",
       "type_1_20190201_20190228.txt", "type_2_20190201_20190228.txt",
       "type_1_20190101_20190131.csv", "type_2_20190101_20190131.csv",
       "type_1_20190201_20190228.csv", "type_2_20190201_20190228.csv")
str_extract(x, "^type_1(.*)\\.txt$")
## [1] "type_1_20190101_20190131.txt" NA                            
## [3] "type_1_20190201_20190228.txt" NA                            
## [5] NA                             NA                            
## [7] NA                             NA
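
Because (.*) is a group, the text it matches can also be retrieved with str_match(), presented in the next section. The following sketch uses two of the file names above; the extra groups for the type and the extension are additions made for illustration:

```r
library(stringr)

files <- c("type_1_20190101_20190131.txt", "type_2_20190201_20190228.csv")

# Three groups: the type, the dates, and the file extension
m <- str_match(files, "^type_([0-9])_(.*)\\.(txt|csv)$")

m[, 2]  # type
## [1] "1" "2"
m[, 3]  # text captured by (.*)
## [1] "20190101_20190131" "20190201_20190228"
m[, 4]  # extension
## [1] "txt" "csv"
```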

1.1.7 Some functions

To illustrate the examples of regular expressions, we used some functions of the package {stringr} which all begin with the prefix str_.

Function | Description | Type of result |
str_detect() | Detects the presence or absence of a pattern in a string | Booleans |
str_extract() | Extracts the first matched pattern | Strings |
str_extract_all() | Extracts all the matched patterns | List of character vectors. Each element of the list corresponds to an element provided to the argument string |
str_match() | Extracts the first match and its captured groups | Matrix |
str_match_all() | Extracts all the matches and their captured groups | List of matrices whose elements correspond to the elements of the vector given to the argument string |
str_locate() | Locates the first occurrence of a pattern in a string | Matrix |
str_locate_all() | Locates all the occurrences of a pattern in a string | List of matrices |
str_replace() | Replaces the first occurrence of a pattern in a string | String |
str_replace_all() | Replaces all the occurrences of a pattern in a string | String |
str_split() | Splits a string into several pieces, according to a given pattern | List of character vectors |

# French phone numbers
phone_numbers <-
  c("02 23 23 35 45", "02-23-23-35-45", 
    "Madrid", "02.23.23.35.45", "0223233545",
    "Milan", "02 23 23 35 45  ",
    " 02 23 23 35 45", "Home: 02 23 23 35 45")

pattern_phone_number <- str_c(str_dup("([0-9]{2})[- \\.]", 4),  "([0-9]{2})")
pattern_phone_number
## [1] "([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})"
# Extract phone numbers
str_extract(phone_numbers, pattern_phone_number)
## [1] "02 23 23 35 45" "02-23-23-35-45" NA               "02.23.23.35.45"
## [5] NA               NA               "02 23 23 35 45" "02 23 23 35 45"
## [9] "02 23 23 35 45"
# Extract phone numbers, then remove punctuation and white characters
str_extract(phone_numbers, pattern_phone_number) %>% 
  str_replace_all("[[:punct:]\\s]", "")
## [1] "0223233545" "0223233545" NA           "0223233545" NA          
## [6] NA           "0223233545" "0223233545" "0223233545"
# Extract matched groups from the phone numbers
str_match(phone_numbers, pattern_phone_number)
##       [,1]             [,2] [,3] [,4] [,5] [,6]
##  [1,] "02 23 23 35 45" "02" "23" "23" "35" "45"
##  [2,] "02-23-23-35-45" "02" "23" "23" "35" "45"
##  [3,] NA               NA   NA   NA   NA   NA  
##  [4,] "02.23.23.35.45" "02" "23" "23" "35" "45"
##  [5,] NA               NA   NA   NA   NA   NA  
##  [6,] NA               NA   NA   NA   NA   NA  
##  [7,] "02 23 23 35 45" "02" "23" "23" "35" "45"
##  [8,] "02 23 23 35 45" "02" "23" "23" "35" "45"
##  [9,] "02 23 23 35 45" "02" "23" "23" "35" "45"
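
The functions str_replace(), str_replace_all() and str_split() listed in the table can be tried on one of the phone numbers. This is a short sketch added for illustration:

```r
library(stringr)

x <- "02-23-23-35-45"

# Only the first dash is replaced
first_only <- str_replace(x, "-", " ")
first_only
## [1] "02 23-23-35-45"

# All dashes are replaced
all_dashes <- str_replace_all(x, "-", " ")
all_dashes
## [1] "02 23 23 35 45"

# Splitting returns a list with one character vector per input string
pieces <- str_split(x, "-")
pieces[[1]]
## [1] "02" "23" "23" "35" "45"
```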

The str_locate() and str_locate_all() functions return the start and end positions of the matched substrings.

tweets
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."              
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"                  
## [4] "Thoughts are with the families in #Nepal"
str_locate(string = tweets, pattern = "magnitude")
##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]     5  13
## [4,]    NA  NA
str_locate_all(string = tweets, pattern = "magnitude")
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]     5  13
## 
## [[4]]
##      start end

To look for a pattern in a string regardless of case, the pattern can be wrapped in the function stringr::regex() with the argument ignore_case = TRUE:

s <- c("earthquake", "Earthquake")
str_detect(string = s, pattern = "earthquake")
## [1]  TRUE FALSE
str_detect(string = s,
           pattern = stringr::regex("earthquake", ignore_case = TRUE))
## [1] TRUE TRUE

1.2 Cleaning

It is often necessary to “clean” the strings before they can be used in statistical models. A few basic operations can quickly remove spaces, punctuation, etc.

To set all alphabetical characters to lowercase or uppercase, the functions str_to_lower() or str_to_upper() can be used, respectively.

str_to_lower(tweets_earthquake$tweet_text[1])
## [1] "dua's for all those affected by the earthquakes in india,nepal &amp; bhutan. stay safe &amp; help others in any form. #equake http://t.co/m6yg0k4fkh"
str_to_upper(tweets_earthquake$tweet_text[1])
## [1] "DUA'S FOR ALL THOSE AFFECTED BY THE EARTHQUAKES IN INDIA,NEPAL &AMP; BHUTAN. STAY SAFE &AMP; HELP OTHERS IN ANY FORM. #EQUAKE HTTP://T.CO/M6YG0K4FKH"

To remove some undesired characters, such as punctuation, the function str_replace_all() can be used:

str_replace_all(tweets_earthquake$tweet_text[1], "[[:punct:]]", "")
## [1] "Duas for all those affected by the earthquakes in IndiaNepal amp Bhutan Stay safe amp help others in any form Equake httptcoM6YG0k4FKh"

Another useful function is str_trim(), which trims whitespace from the ends of a string. Such whitespace typically appears after some words have been removed from a string. The side parameter specifies whether the spaces to be removed are those on the left of the string, on the right, or both.

x <- c("   String with spaces at the beginning and end   ")
str_trim(x, side = "both")
## [1] "String with spaces at the beginning and end"
str_trim(x, side = "left")
## [1] "String with spaces at the beginning and end   "
str_trim(x, side = "right")
## [1] "   String with spaces at the beginning and end"
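
str_trim() only deals with the ends of a string. To also collapse repeated spaces inside the string, {stringr} provides str_squish(). The example below is an addition to the original text and uses a made-up string:

```r
library(stringr)

x <- "  who was gajendra singh   today   "

# Trim both ends and reduce inner runs of whitespace to a single space
squished <- str_squish(x)
squished
## [1] "who was gajendra singh today"
```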

2 Case Study: Tweet Classification

In this case study, we will analyze text data from the Twitter platform. Messages written during the Gorkha earthquake, which struck Nepal in April and May 2015, were retrieved.

This case study was inspired by the course Supervised classification with text data by Benjamin Soltoff.

2.1 Loading Data

Volunteers have labeled some of the tweets. These are available on the website CrisisNLP. We will use them to train a classifier. The objective of the classifier is to assign one of the following classes to new messages broadcast on Twitter during a similar disaster:

  • infrastructure
  • response efforts
  • urgent needs
  • sympathy and emotional support
  • other

We previously loaded the tweets into a tibble named tweets_earthquake.

tweets_earthquake
## # A tibble: 9,471 x 10
##    tweet_id tweet_time tweet_author tweet_author_id tweet_language
##    <chr>    <chr>      <chr>                  <dbl> <chr>         
##  1 '591903… Sat Apr 2… Faali19           2387302745 en            
##  2 '591903… Sat Apr 2… STERLINGMHO…       153876973 en            
##  3 '591903… Sat Apr 2… HeenaliVP          421188281 en            
##  4 '591903… Sat Apr 2… Xennia79           176207969 en            
##  5 '591903… Sat Apr 2… Madhurita_        1058658786 en            
##  6 '591903… Sat Apr 2… MONIMISHI         1461160603 en            
##  7 '591904… Sat Apr 2… AnilBalkrus…      3034294729 en            
##  8 '591904… Sat Apr 2… haquem19          2782719416 en            
##  9 '591904… Sat Apr 2… Akshay7_           111600045 en            
## 10 '591904… Sat Apr 2… hnrwbell          1613115068 en            
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## #   tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>

The pre-existing classification is given in the label column. The table() function provides an overview of each class and its associated size:

tweets_earthquake$label %>% table()
## .
##                  Animal management                 Caution and advice 
##                                  1                                  6 
##                   Displaced people       Infrastructure and utilities 
##                                  4                                 50 
##              Infrastructure damage              Infrastructure Damage 
##                                  3                                166 
##             Injured or dead people  Missing, trapped, or found people 
##                                 76                                 56 
##                              Money          Not related or irrelevant 
##                                 53                                239 
##                       Not relevant                       Not Relevant 
##                                  1                               6279 
##                     Other relevant         Other relevant information 
##                                  2                                239 
##         Other Relevant Information                   Personal updates 
##                                627                                 29 
##                   Response efforts                   Response Efforts 
##                                  2                                994 
##               Shelter and supplies     Sympathy and emotional support 
##                                 18                                458 
##                       Urgent Needs Volunteer or professional services 
##                                108                                 60

Some classes refer to the same concept, but have a different spelling. Let us fix that.

tweets_earthquake <- 
  tweets_earthquake %>% 
  mutate(label = str_to_lower(label)) %>% 
  mutate(label = ifelse(label %in% c("not related or irrelevant",
                                     "not relevant"),
                        yes = "not relevant",
                        no = label)) %>% 
  mutate(label = ifelse(label %in% c("other relevant information",
                                     "other relevant"),
                        yes = "other relevant",
                        no = label))

Then let us define the four classes we are interested in (all remaining tweets fall into the class Other):

tweets_earthquake <- 
  tweets_earthquake %>% 
  mutate(
    class = "Other",
    class = ifelse(label == "response efforts",
                   yes = "response efforts", no = class),
    class = ifelse(label %in% c("infrastructure damage", "infrastructure and utilitie"),
                   yes = "infrastructure", no = class),
    class = ifelse(label %in% c("urgent needs", "injured or dead people",
                                "missing, trapped, or found people"),
                   yes = "urgent needs", no = class),
    class = ifelse(label == "sympathy and emotional support",
                   yes = "sympathy and emotional support", no = class)
    )
tweets_earthquake$class %>% table() %>% sort()
## .
##                 infrastructure                   urgent needs 
##                            169                            240 
## sympathy and emotional support               response efforts 
##                            458                            996 
##                          Other 
##                           7608

To be properly imported into R, the identifiers contained in the tweet_id column have been previously surrounded by single quotation marks ('). If this had not been done, the data import function would have tried to convert this column to numeric, thus losing any 0 at the beginning of the string (001 would have been transformed into 1). Now that the data is loaded, these quotation marks can be removed:

tweets_earthquake <- 
  tweets_earthquake %>% 
  mutate(tweet_id = str_sub(tweet_id, 2, -2))
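
In str_sub(), a negative position is counted from the end of the string, so the call above keeps everything from the second character to the second-to-last one. A small sketch with a made-up identifier:

```r
library(stringr)

quoted_id <- "'001'"

# Drop the first and last characters (the quotation marks)
clean_id <- str_sub(quoted_id, start = 2, end = -2)
clean_id
## [1] "001"

# Converting to numeric would lose the leading zeros
as.numeric(clean_id)
## [1] 1
```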

Let’s keep only three columns of this data table:

  • tweet_id: the identifier of each tweet
  • class: the class to be predicted
  • tweet_text: the tweet full text

tweets_earthquake <- 
  tweets_earthquake %>% 
  select(tweet_id, class, tweet_text)

CrisisNLP provides a set of tuples (tweet-id, user-id) for each disaster studied. Using the Twitter API and a few lines of R code, the listed tweets were retrieved (at least those that were still available at the time of extraction). To keep this tutorial short, we will work with a database that is already prepared.

The recovered tweets are stored as tibbles in 5 files: tweets_nepal_00.rda to tweets_nepal_04.rda. Let us load them into a list, then concatenate this list to form a single tibble.

# Load tweets (extracted using Twitter API)
N <- list.files("donnees/Tweets/Nepal_2015/", pattern = "^tweets_nepal", full.names = TRUE)
tweets_df <- 
  lapply(N, function(x){
    tweets_tmp <- load(x)
    get(tweets_tmp)
  })

tweets_df <- 
  tweets_df %>% 
  bind_rows()
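
The function load() restores the saved objects and returns their names as a character vector, which is why get() is needed to retrieve the tibble itself. The sketch below reproduces the idea on two small files written to a temporary folder; the folder, file and object names are hypothetical, and loading into a dedicated environment is a variation on the code above:

```r
library(dplyr)

# Write two small demo files to a temporary folder
demo_dir <- file.path(tempdir(), "tweets_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 0:1) {
  tweets_part <- data.frame(id_str = as.character(2 * i + 1:2),
                            full_text = c("first status", "second status"),
                            stringsAsFactors = FALSE)
  save(tweets_part, file = file.path(demo_dir, sprintf("tweets_demo_%02d.rda", i)))
}

# Load each file into its own environment, then fetch the object by name
files <- list.files(demo_dir, pattern = "^tweets_demo", full.names = TRUE)
parts <- lapply(files, function(f) {
  env <- new.env()
  object_name <- load(f, envir = env)  # load() returns the object name(s)
  get(object_name, envir = env)
})

# Stack the pieces into a single table
combined <- bind_rows(parts)
nrow(combined)
## [1] 4
```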

Let us remove the retweeted statuses:

tweets_df <- tweets_df %>% 
  filter(!is_retweeted)

The number of rows is 981552.

nrow(tweets_df)
## [1] 981552

There are some statuses in the labeled tweets set that are no longer available on the social platform. Let us take them out of our analysis.

tweets_earthquake <- 
  tweets_earthquake %>% 
  filter(tweet_id %in% tweets_df$id_str)
nrow(tweets_earthquake)
## [1] 7031

Let us add to the labeled dataset the information obtained via the Twitter API:

tweets_earthquake <- 
  tweets_earthquake %>% 
  left_join(tweets_df, by = c("tweet_id" = "id_str"))

The frequency for each class shows a strong imbalance.

tweets_earthquake$class %>% table() %>% sort()
## .
##                 infrastructure                   urgent needs 
##                            135                            183 
## sympathy and emotional support               response efforts 
##                            328                            814 
##                          Other 
##                           5792
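
To quantify the imbalance, the counts can be turned into percentages with prop.table(). The sketch below simply re-enters the counts displayed above by hand:

```r
# Class counts copied from the output above
counts <- c("infrastructure" = 135,
            "urgent needs" = 183,
            "sympathy and emotional support" = 328,
            "response efforts" = 814,
            "Other" = 5792)

# Share of each class, in percent
shares <- round(100 * prop.table(counts), 1)
shares
```

The class Other accounts for roughly 80% of the labeled tweets, whereas infrastructure represents less than 2%.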

2.2 Pre-processing

The information we will use in this exercise will be extracted from the texts of the tweets. This involves extracting variables from textual data. What we will do is separate the text into tokens, after cleaning it (lowercase, punctuation removal, word root extraction, etc.).

We will use two packages to pre-process the data: {tidytext} and {SnowballC} (https://cran.r-project.org/web/packages/SnowballC/index.html). Multiple functions useful for text mining are available in {tidytext}, including sentiment analysis functions. The package {SnowballC} provides Porter's word stemming algorithm, which collapses words to a common root (note that not all languages are available).

library(tidytext)
library(SnowballC)
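
As a quick illustration of stemming, the function wordStem() from {SnowballC} can be applied to a few made-up words. This sketch is an addition to the original text; the stems are not shown here because they are not always dictionary words:

```r
library(SnowballC)

words <- c("earthquake", "earthquakes", "damage", "damaged", "damages")

# Reduce each word to its stem
stems <- wordStem(words, language = "porter")

# Side-by-side view of the words and their stems
data.frame(word = words, stem = stems)
```

Inflected forms of the same word are expected to share a stem, which reduces the number of distinct tokens the classifier has to handle.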

2.2.1 Some functions to help clean texts

To clean the tweets, we will create some functions. Let us define the function remove_url() to remove URLs from a string, using a regex:

#' remove_url
#' Removes URLs from a string
#' @param x string
remove_url <- function(x){
  pattern_url <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
  str_replace_all(string = x, pattern = pattern_url, replacement = "")
}

Let us also create the function remove_special_chars(), which removes special characters found in tweets (such as unrecognized apostrophes):

#' remove_special_chars
#' Removes the special characters from a string
#' @param x string
remove_special_chars <- function(x){
  str_replace_all(string = x,
                  pattern = "[^\x20-\x7e]", replacement = "")
}
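
The character class [^\x20-\x7e] matches anything outside the printable ASCII range, which includes line breaks. Removing a line break glues the neighbouring words together, as the sketch below shows on a made-up string. Replacing whitespace with a space beforehand is one possible variation; it is not part of the original pipeline:

```r
library(stringr)

remove_special_chars <- function(x){
  str_replace_all(string = x,
                  pattern = "[^\x20-\x7e]", replacement = "")
}

x <- "affected by the\nearthquakes"

# The line break is dropped, so two words are merged
glued <- remove_special_chars(x)
glued
## [1] "affected by theearthquakes"

# Possible variation: turn any run of whitespace into one space first
spaced <- remove_special_chars(str_replace_all(x, "\\s+", " "))
spaced
## [1] "affected by the earthquakes"
```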

Let us define the function remove_mentions(), which removes mentions from a Twitter status (a mention starts with the at sign (@), followed by the user’s screen name):

#' remove_mentions
#' Removes the mentions from tweets (`@`)
#' @param x string
remove_mentions <- function(x){
  str_replace_all(x, "@[[:alnum:]]+\\s?", "")
}
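
Screen names may contain underscores, which [[:alnum:]] does not cover. The sketch below compares the pattern used above with a possible variant based on \w, which also matches the underscore. The variant is only a suggestion and is not used in the rest of this tutorial:

```r
library(stringr)

x <- "thanks @Akshay7_ and @3wen"

# Pattern used in remove_mentions(): the trailing underscore is left behind
with_alnum <- str_replace_all(x, "@[[:alnum:]]+\\s?", "")
with_alnum
## [1] "thanks _ and "

# Variant with \w, which includes the underscore
with_word <- str_replace_all(x, "@\\w+\\s?", "")
with_word
## [1] "thanks and "
```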

We can also define the functions remove_punctuation() and remove_numbers() which remove punctuation and numbers, respectively, from a tweet.

#' remove_punctuation
#' Removes punctuation from tweets
#' @param x string
remove_punctuation <- function(x){
  str_replace_all(x, "[[:punct:]]", "")
}

#' remove_numbers
#' Removes numbers from tweets
#' @param x string
remove_numbers <- function(x){
  str_replace_all(x, "[[:digit:]]", "")
}

Finally, we can define the function remove_char_ref(), which removes character references (e.g., &amp;):

#' remove_char_ref
#' Removes character reference
#' @param x string
remove_char_ref <- function(x){
  str_replace_all(x, "&[[:alpha:]]{1,6};", "")
}
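
To see how these steps combine, here is a compact sketch that chains them on a single made-up tweet. The function clean_tweet() is hypothetical; it uses a simplified URL pattern and a final str_squish(), both of which are assumptions made for this example rather than the helpers defined above:

```r
library(stringr)

# Hypothetical all-in-one cleaner (simplified stand-in for the helpers above)
clean_tweet <- function(x){
  x <- str_to_lower(x)
  x <- str_replace_all(x, "https?://\\S+", "")       # URLs (simplified pattern)
  x <- str_replace_all(x, "@[[:alnum:]]+\\s?", "")   # mentions
  x <- str_replace_all(x, "&[[:alpha:]]{1,6};", "")  # character references
  x <- str_replace_all(x, "[[:punct:]]", "")         # punctuation
  x <- str_replace_all(x, "[[:digit:]]", "")         # digits
  str_squish(x)                                      # leftover whitespace
}

cleaned <- clean_tweet("Stay safe &amp; help @BBCNews #Equake 7.9 http://t.co/M6YG0k4FKh")
cleaned
## [1] "stay safe help equake"
```

The order of the steps matters: URLs, mentions and character references must be removed before punctuation, since their patterns rely on the characters :, /, @, & and ;.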

2.2.2 Building the Corpus

We will create a corpus of texts from the tweets. The idea is to obtain an object containing as many documents as statuses. For each document, we have to count the occurrences of each of the words encountered throughout the corpus. In the end, we obtain a matrix whose rows correspond to the tweets and whose columns indicate the number of occurrences of each word. The columns of this matrix will be the explanatory variables that can be used to train a classifier.
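
To fix ideas, the sketch below builds such a matrix for two made-up documents with the base function table(). It only illustrates the target structure and is not the method used in the rest of this section:

```r
# One row per (document, word) pair; the values are made up
tokens <- data.frame(id = c(1, 1, 1, 2, 2),
                     word = c("nepal", "earthquake", "nepal",
                              "earthquake", "damage"),
                     stringsAsFactors = FALSE)

# Rows: documents; columns: words; cells: number of occurrences
dtm <- table(tokens$id, tokens$word)
dtm
```

In this toy matrix, the word "nepal" appears twice in document 1 and never in document 2.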

First, let us create a tibble with an identifier and a tweet.

tweets_earthquake_tt <- 
  tibble(id = 1:nrow(tweets_earthquake),
         text = tweets_earthquake$full_text)
tweets_earthquake_tt
## # A tibble: 7,252 x 2
##       id text                                                              
##    <int> <chr>                                                             
##  1     1 "Dua's for all those affected by the\nearthquakes in India,Nepal …
##  2     2 Absolutely devastated by the destruction to my old home #Nepal    
##  3     3 Thoughts are with the families in #Nepal                          
##  4     4 Frightful images! Our prayers echo for everyone affected. #earthq…
##  5     5 Who was Gajendra Singh ? Today no news boz , of earthquake in Nep…
##  6     6 Live: Nepal cabinet meets to seek foreign help, 114 feared dead a…
##  7     7 When you go out for Momos this evening, ask and reassure the sell…
##  8     8 A crucial tool in a situation like #NepalQuake #NepalEarthquake..…
##  9     9 our affection from Madrid Spain, there we were this summer from N…
## 10    10 Devastating pictures of #NepalEarthQuake http://t.co/VaEOUkUTsG   
## # … with 7,242 more rows

Then, let us apply to each tweet the functions to clean the text:

tweets_earthquake_tt <- 
  tweets_earthquake_tt %>% 
  mutate(cleaned_text = str_to_lower(text),
         cleaned_text = remove_url(cleaned_text),
         cleaned_text = remove_mentions(cleaned_text),
         cleaned_text = remove_char_ref(cleaned_text),
         cleaned_text = remove_special_chars(cleaned_text),
         cleaned_text = remove_punctuation(cleaned_text),
         cleaned_text = remove_numbers(cleaned_text)
  )
tweets_earthquake_tt
## # A tibble: 7,252 x 3
##       id text                            cleaned_text                      
##    <int> <chr>                           <chr>                             
##  1     1 "Dua's for all those affected … "duas for all those affected by t…
##  2     2 Absolutely devastated by the d… absolutely devastated by the dest…
##  3     3 Thoughts are with the families… thoughts are with the families in…
##  4     4 Frightful images! Our prayers … "frightful images our prayers ech…
##  5     5 Who was Gajendra Singh ? Today… who was gajendra singh  today no …
##  6     6 Live: Nepal cabinet meets to s… "live nepal cabinet meets to seek…
##  7     7 When you go out for Momos this… when you go out for momos this ev…
##  8     8 A crucial tool in a situation … "a crucial tool in a situation li…
##  9     9 our affection from Madrid Spai… our affection from madrid spain t…
## 10    10 Devastating pictures of #Nepal… "devastating pictures of nepalear…
## # … with 7,242 more rows

Using the unnest_tokens() function of {tidytext}, let us split each tweet into words. Each row of the resulting tibble pairs a tweet identifier with one word.

tweets_earthquake_tt <- 
  tweets_earthquake_tt %>% 
  select(id, cleaned_text) %>% 
  unnest_tokens(word, cleaned_text)

tweets_earthquake_tt
## # A tibble: 82,692 x 2
##       id word          
##    <int> <chr>         
##  1     1 duas          
##  2     1 for           
##  3     1 all           
##  4     1 those         
##  5     1 affected      
##  6     1 by            
##  7     1 theearthquakes
##  8     1 in            
##  9     1 indianepal    
## 10     1 bhutan        
## # … with 82,682 more rows
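
Tokens such as theearthquakes and indianepal in the output above suggest that the cleaning step deleted a line break and a comma without leaving a space behind. The following is a minimal sketch of the difference between deleting such characters and replacing them with a space before tokenizing. The text is made up to mimic the first tweet, and the helper name tokenize() is hypothetical (it is not part of {tidytext}).

```r
library(tibble)
library(stringr)
library(tidytext)

# Made-up text mimicking the first tweet: a line break and a comma
# separate words without any surrounding space
raw_text <- "affected by the\nearthquakes in india,nepal"

# Hypothetical helper: tokenize a single cleaned string into words
tokenize <- function(txt) {
  unnest_tokens(tibble(id = 1L, cleaned_text = txt), word, cleaned_text)
}

# Deleting the characters fuses the neighbouring words
tokens_deleted <- tokenize(str_replace_all(raw_text, "[\\n,]", ""))

# Replacing them with a space keeps the words apart
tokens_spaced <- tokenize(str_replace_all(raw_text, "[\\n,]", " "))

tokens_deleted$word
tokens_spaced$word
```

With the first variant, the tokens include theearthquakes and indianepal. With the second, the, earthquakes, india and nepal are separate tokens. Whether a space is the right replacement depends on the character: for an apostrophe, as in dua's, deleting may be preferable.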

Some very frequent and potentially noisy words (stop words) can be removed. A list of English stop words is available in the tibble stop_words from {tidytext}. The function get_stopwords() can also be used, for instance to obtain stop words in another language:

# English stop words
stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # … with 1,139 more rows
# French stop words
stop_words_fr <- get_stopwords(language = "fr")
stop_words_fr
## # A tibble: 164 x 2
##    word  lexicon 
##    <chr> <chr>   
##  1 au    snowball
##  2 aux   snowball
##  3 avec  snowball
##  4 ce    snowball
##  5 ces   snowball
##  6 dans  snowball
##  7 de    snowball
##  8 des   snowball
##  9 du    snowball
## 10 elle  snowball
## # … with 154 more rows

To remove these stop words, we can use the anti_join() function from {dplyr}:

tweets_earthquake_tt <- 
  tweets_earthquake_tt %>% 
  anti_join(stop_words, by = c("word"))
tweets_earthquake_tt
## # A tibble: 44,011 x 2
##       id word          
##    <int> <chr>         
##  1     1 duas          
##  2     1 affected      
##  3     1 theearthquakes
##  4     1 indianepal    
##  5     1 bhutan        
##  6     1 staysafe      
##  7     1 form          
##  8     1 equake        
##  9     2 absolutely    
## 10     2 devastated    
## # … with 44,001 more rows
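
To make the mechanics of the anti-join explicit, here is a small sketch on made-up data: anti_join() keeps the rows of the first table that have no match in the second one. The tibbles tokens and my_stop_words are illustrative stand-ins, not objects from the analysis above.

```r
library(tibble)
library(dplyr)

# Made-up tokens, in the same (id, word) layout as above
tokens <- tibble(
  id   = c(1L, 1L, 1L, 2L),
  word = c("thoughts", "are", "with", "nepal")
)

# Tiny hand-made stop word list, standing in for stop_words
my_stop_words <- tibble(word = c("are", "with"))

# Keep only the rows of tokens whose word is absent from my_stop_words
kept <- anti_join(tokens, my_stop_words, by = "word")
kept
```

Only the rows for thoughts and nepal remain. The same approach could be used to drop domain-specific words by adding them to the stop word table before the join.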

To extract the root (stem) of each word, the wordStem() function from {SnowballC} can be applied:

tweets_earthquake_tt <- 
  tweets_earthquake_tt %>% 
  mutate(word_stem = wordStem(word))
tweets_earthquake_tt
## # A tibble: 44,011 x 3
##       id word           word_stem   
##    <int> <chr>          <chr>       
##  1     1 duas           dua         
##  2     1 affected       affect      
##  3     1 theearthquakes theearthquak
##  4     1 indianepal     indianep    
##  5     1 bhutan         bhutan      
##  6     1 staysafe       staysaf     
##  7     1 form           form        
##  8     1 equake         equak       
##  9     2 absolutely     absolut     
## 10     2 devastated     devast      
## # … with 44,001 more rows
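
As the output shows, a stem is not necessarily a dictionary word (absolut, devast). The short sketch below applies wordStem() directly to a character vector. By default the function uses the Porter stemmer for English. Other stemmers can be selected with the language argument, and getStemLanguages() lists those available.

```r
library(SnowballC)

# Three words taken from the tokens shown above
words <- c("affected", "devastated", "absolutely")

# Default: language = "porter" (English)
stems <- wordStem(words)
stems

# Stemmers available through the language argument
available <- getStemLanguages()
available
```

For the three words above, the stems are affect, devast and absolut, as in the tibble. For French text, one would presumably call wordStem(words, language = "french").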

Using the count() function of {dplyr}, the number of occurrences of each stem is easily computed:

freq_words <- 
  tweets_earthquake_tt %>% 
  dplyr::count(word_stem, sort = TRUE)
freq_words
## # A tibble: 8,375 x 2
##    word_stem          n
##    <chr>          <int>
##  1 nepal           3820
##  2 nepalearthquak  1164
##  3 earthquak        670
##  4 prayer           486
##  5 peopl            457
##  6 god              455
##  7 nepalquak        370
##  8 donat            285
##  9 new              274
## 10 india            251
## # … with 8,365 more rows
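
For readers less familiar with count(), the sketch below shows on made-up stems that count(..., sort = TRUE) amounts to grouping, counting the rows of each group, and sorting in decreasing order. The tibble stems is illustrative only.

```r
library(tibble)
library(dplyr)

# Made-up stems, standing in for the word_stem column
stems <- tibble(
  word_stem = c("nepal", "prayer", "nepal", "donat", "nepal", "prayer")
)

# Compact version
freq <- dplyr::count(stems, word_stem, sort = TRUE)

# Step-by-step equivalent
freq_manual <- stems %>%
  group_by(word_stem) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

freq
freq_manual
```

Both tibbles list nepal (3), prayer (2) and donat (1), in that order.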

We can use a bar plot to show the number of occurrences of the n most frequent stems (here, n = 10):

freq_words %>% slice(1:10) %>% 
  ggplot(data = .,
         aes(x = reorder(word_stem, n), y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Word", y = "Frequency") +
  coord_flip()

A word cloud can be drawn using the wordcloud() function from {wordcloud}. The more frequent a word, the larger it is drawn.

library(wordcloud)
wordcloud(words = freq_words %>% slice(1:100) %>% 
            magrittr::extract2("word_stem"),
          freq = freq_words %>% slice(1:100) %>%
            magrittr::extract2("n"),
          random.order = FALSE)