Jean-Philippe Boucher, Université du Québec à Montréal (🐦 @J_P_Boucher)
Arthur Charpentier, Université du Québec à Montréal (🐦 @freakonometrics)
Ewen Gallic, Aix-Marseille Université (🐦 @3wen)
It is common to manipulate strings. This is the case when files need to be loaded in a loop and only some of them are targeted. This is also the case, of course, when the data manipulated in the models is textual. There is a very convenient tool that allows one to search for more or less complex patterns in strings: regular expressions. A regular expression (or regex) is a sequence of characters forming a search pattern. The pattern is used to match one or several characters in a string. The help page in R (?regex) provides condensed information on regex.
The {base} package contains multiple functions related to regular expressions, but we will instead use some functions from the package {stringr}, built on top of {stringi}. The package {stringr} allows us to easily manipulate strings in R.
For most of the examples given to illustrate how regex work, we will use real Twitter data from CrisisNLP. Let us use the tweets from the Nepal Earthquake crisis, annotated by volunteers.
library(tidyverse)
tweets_earthquake <-
str_c("donnees/CrisisNLP_volunteers_labeled_data/2015_Nepal_Earthquake_en/",
"2015_Nepal_Earthquake_en.csv") %>%
read_csv(locale = locale(encoding = "UTF-8"))
tweets_earthquake
## # A tibble: 9,471 x 10
## tweet_id tweet_time tweet_author tweet_author_id tweet_language
## <chr> <chr> <chr> <dbl> <chr>
## 1 '591903… Sat Apr 2… Faali19 2387302745 en
## 2 '591903… Sat Apr 2… STERLINGMHO… 153876973 en
## 3 '591903… Sat Apr 2… HeenaliVP 421188281 en
## 4 '591903… Sat Apr 2… Xennia79 176207969 en
## 5 '591903… Sat Apr 2… Madhurita_ 1058658786 en
## 6 '591903… Sat Apr 2… MONIMISHI 1461160603 en
## 7 '591904… Sat Apr 2… AnilBalkrus… 3034294729 en
## 8 '591904… Sat Apr 2… haquem19 2782719416 en
## 9 '591904… Sat Apr 2… Akshay7_ 111600045 en
## 10 '591904… Sat Apr 2… hnrwbell 1613115068 en
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## # tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>
To check whether a pattern is found in a string, we can use the function stringr::str_detect() (note that the package {stringr} has been attached when we attached {tidyverse}).
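The object two_tweets used below is not defined in the excerpt; a minimal sketch, assuming it simply stores the two example tweets printed just after:
# Two example tweets (assumption: copied from the printed output below)
two_tweets <- c(
  "Dua's for all those affected by the earthquakes in India,Nepal & Bhutan. Stay safe & help others in any form. #Equake http://t.co/M6YG0k4FKh",
  "itvnews: Witness to Nepal #earthquake tells itvnews: 'It was terrifying' http://t.co/UWMynVyzQC"
)
two_tweets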
## [1] "Dua's for all those affected by the earthquakes in India,Nepal & Bhutan. Stay safe & help others in any form. #Equake http://t.co/M6YG0k4FKh"
## [2] "itvnews: Witness to Nepal #earthquake tells itvnews: 'It was terrifying' http://t.co/UWMynVyzQC"
# Can we find the word "Earthquake" in the tweets?
str_detect(string = two_tweets, pattern = "earthquake")
## [1] TRUE TRUE
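The call producing the second result is not shown; patterns are case-sensitive by default, so a sketch along these lines (the pattern "Equake" is an assumption) would match the first tweet only:
str_detect(string = two_tweets, pattern = "Equake")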
## [1] TRUE FALSE
In the above examples, the pattern is composed of literals, i.e., characters that receive a literal interpretation in the regular expression. Some other characters, on the other hand, receive a different interpretation when they are part of the regex. This is the case of the following reserved characters, called metacharacters: . \ | ( ) [ { $ * + ?. If we want these characters to be interpreted literally, we need to escape them. In R, this is done using two backslashes.
# The character `.` is a metacharacter that matches any character (except a new line)
str_detect(string = c("Earthquake.", "Earthquake"), pattern = ".")
## [1] TRUE TRUE
# To look for a dot in a string:
str_detect(string = c("Earthquake.", "Earthquake"), pattern = "\\.")
## [1] TRUE FALSE
To match the beginning and the end of a string, respectively, we can use the line anchors ^ and $.
For example, to look for the tweets which begin with a hashtag:
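The code producing the output below is not shown; a sketch using str_which(), which returns the positions of the matching elements (restricting the search to the first tweets is an assumption):
# Positions of the tweets starting with a hashtag
str_which(string = head(tweets_earthquake$tweet_text, 50), pattern = "^#")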
## [1] 10 25 28 31 38 43
This can be useful when combined with the function dplyr::filter() to filter the rows of a two-dimensional table:
# Tweets beginning with a sharp
tweets_earthquake %>%
filter(str_detect(string = tweet_text, pattern = "^#")) %>%
select(tweet_text)
## # A tibble: 816 x 1
## tweet_text
## <chr>
## 1 #earthquake @BBCNews my uncle is travelling in Nepal but has notified u…
## 2 #Kathmandu's Tribhuvan Airport is currently closed due to #lEarthquake.…
## 3 #Nepal earthquake claims five lives in East #India http://t.co/aU4EFiKU…
## 4 #google person finder for #earthquake http://t.co/uZyXguoio2
## 5 #This Is the #helpline for #Nepal earthquake click on this post https:…
## 6 "#\xbe\xdc\xdd\x8c_\xc9 #\x8c\xe0\xbc\x8a__\x8b\x81\xe3\x8d_\xc8 #\x8b\…
## 7 #NepalQuake | Deep Kumar Upadhyay, Nepal's ambassador to India, says Ai…
## 8 #BeingIndian mourns the loss of the lives in the #earthquake that hit N…
## 9 #NepalEarthquake Tribhuvan Int. Arprt #Kathmandu closed 4 operations fl…
## 10 #earthquakeindia Judging by the nature of tremors in Lakhimpur,one can …
## # … with 806 more rows
The pipe character | is the alternation operator: it matches either the expression on its left or the expression on its right.
tweets <-
c("PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries",
"Earthquake severe damage to Kathmandu. Tragic loss of life.",
"7.9-magnitude earthquake strikes Nepal, damage reported",
"Thoughts are with the families in #Nepal")
str_detect(string = tweets, pattern = "magnitude|damage")
## [1] TRUE TRUE TRUE FALSE
This may be useful for alternative spellings, as in the sketch below.
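The call producing the output below is not shown; a sketch illustrating alternation with alternative spellings (the strings are an assumption):
str_detect(string = c("grey", "gray", "green"), pattern = "grey|gray")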
## [1] TRUE TRUE FALSE
Character classes are lists of characters that belong to a group, such as alphabetic, numeric, or alphanumeric characters. It is possible to build them or to use predefined classes. A character class is written by placing the characters in square brackets []. For example, the class [12] matches strings containing the character 1 or the character 2.
Let us assume that we face file names with a date, and that we want to match only those whose month in a given year is “January” or “February”:
str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
pattern = c("file_2019-0[12]-01"))
## [1] "file_2019-01-01" NA "file_2019-02-01"
In the previous code, we therefore searched each string for the occurrence of the substring file_2019-01-01 or file_2019-02-01.
Using a dash -, it is possible to define a range of characters. Thus, the character class [A-Z] matches the letters of the following set: ABCDEFGHIJKLMNOPQRSTUVWXYZ. The character class [0-9] matches the character set 0123456789.
str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
pattern = c("file_2019-0[1-3]-01"))
## [1] "file_2019-01-01" "file_2019-03-01" "file_2019-02-01"
Unions of groups can be made:
str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
"file_2019-03-01.txt", "file_2019-04-01.txt",
"file_2019-05-01.txt", "file_2019-06-01.txt",
"file_2019-07-01.txt", "file_2019-08-01.txt"
),
pattern = c("file_2019-0[1-36-8]-01"))
## [1] "file_2019-01-01" "file_2019-02-01" "file_2019-03-01" NA
## [5] NA "file_2019-06-01" "file_2019-07-01" "file_2019-08-01"
To exclude a group of characters, a circumflex accent ^ placed right after the opening bracket can be used:
str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
"file_2019-03-01.txt", "file_2019-04-01.txt",
"file_2019-05-01.txt", "file_2019-06-01.txt",
"file_2019-07-01.txt", "file_2019-08-01.txt"
),
pattern = c("file_2019-0[^1-36-8]-01"))
## [1] NA                NA                NA                "file_2019-04-01"
## [5] "file_2019-05-01" NA                NA                NA
If, on the other hand, the circumflex must be part of the character class, it should not be placed right after the opening bracket:
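The call producing the output below is not shown; a sketch where the circumflex is listed after another character inside the class (the strings and the pattern are assumptions):
str_extract(string = c("happy ^", "happy"), pattern = "happy [.^]")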
## [1] "happy ^" NA
It is also possible to escape the character as follows: [\^].
Some classes are pre-built and can be referred to by their name. They are based on the POSIX family of standards. The most commonly used (in our experience) are listed in the Table below.
Character class | Description
---|---
[:digit:] | digits
[:lower:] | lowercase alphabetic characters
[:upper:] | uppercase alphabetic characters
[:alpha:] | alphabetic characters (both lower and upper case)
[:alnum:] | alphabetic characters and numbers
[:blank:] | space and tab
[:punct:] | punctuation
[:xdigit:] | hexadecimal digits
To refer to these classes, they need to be put between the brackets defining the character classes:
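The code producing the next two outputs is not shown; a sketch that prints the tweets vector defined earlier and then looks for a run of at least four digits using a POSIX class (the exact pattern is an assumption):
tweets
str_detect(string = tweets, pattern = "[[:digit:]]{4,}")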
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## [1] FALSE FALSE FALSE FALSE
Some classes also benefit from an abbreviation:
Character class | Description
---|---
\d | digits
\D | non-digits
\s | whitespace
\w | word characters
\W | non-word characters
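The calls producing the four list outputs below are not shown; a sketch that would yield them, assuming the input string "Magnitude 6.1", uses str_extract_all() with each abbreviation in turn:
str_extract_all(string = "Magnitude 6.1", pattern = "\\d")
str_extract_all(string = "Magnitude 6.1", pattern = "\\D")
str_extract_all(string = "Magnitude 6.1", pattern = "\\w")
str_extract_all(string = "Magnitude 6.1", pattern = "\\W")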
## [[1]]
## [1] "6" "1"
## [[1]]
## [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" " " "."
## [[1]]
## [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" "6" "1"
## [[1]]
## [1] " " "."
Parentheses can be used to group some parts of a regular expression together. This is particularly helpful when combined with quantifiers and character classes, to manipulate file names for example. Here is an example with the function str_extract(), which extracts matching patterns from a string:
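The call producing the output below is not shown; a sketch using a group with alternation (the input strings are an assumption):
str_extract(string = c("analyse", "analyze", "analysis"), pattern = "analy(s|z)e")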
## [1] "analyse" "analyze" NA
Quantifiers are used to repeat a regular expression a given number of times. The Table below lists the available quantifiers. They are placed after the part of the regex that needs to be matched a given number of times.
Quantifier | Description
---|---
? | the regex appears zero or one time
* | the regex appears zero or more times
+ | the regex appears one or more times
{n} | the regex appears exactly n times
{n,} | the regex appears n times or more
{n,m} | the regex appears at least n times but no more than m times
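For instance, the ? quantifier handles alternative spellings; a sketch (the original call is not shown, the strings are an assumption):
str_extract(string = c("labour", "labor"), pattern = "labou?r")
## [1] "labour" "labor"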
Combining quantifiers with character classes or groups makes it possible to match more complex patterns:
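The code producing the next two outputs is not shown; a sketch that prints the tweets vector and then looks for at least two consecutive digits (the exact pattern is an assumption), which no tweet contains:
tweets
str_extract(string = tweets, pattern = "[[:digit:]]{2,}")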
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## [1] NA NA NA NA
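By contrast, extracting runs of one or more alphanumeric characters splits each tweet into its words; a sketch (the original call is not shown):
str_extract_all(string = tweets, pattern = "[[:alnum:]]+")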
## [[1]]
## [1] "PANIC" "IN" "NEPAL" "Strong" "quake" "hits"
## [7] "capital" "causing" "major" "damage" "injuries"
##
## [[2]]
## [1] "Earthquake" "severe" "damage" "to" "Kathmandu"
## [6] "Tragic" "loss" "of" "life"
##
## [[3]]
## [1] "7" "9" "magnitude" "earthquake" "strikes"
## [6] "Nepal" "damage" "reported"
##
## [[4]]
## [1] "Thoughts" "are" "with" "the" "families" "in"
## [7] "Nepal"
(.*) is a useful combination of grouping and quantifiers. It allows matching any sequence of characters:

- . : any character
- * : present zero or more times

x <- c("type_1_20190101_20190131.txt", "type_2_20190101_20190131.txt",
"type_1_20190201_20190228.txt", "type_2_20190201_20190228.txt",
"type_1_20190101_20190131.csv", "type_2_20190101_20190131.csv",
"type_1_20190201_20190228.csv", "type_2_20190201_20190228.csv")
str_extract(x, "^type_1(.*)\\.txt$")
## [1] "type_1_20190101_20190131.txt" NA
## [3] "type_1_20190201_20190228.txt" NA
## [5] NA NA
## [7] NA NA
To illustrate the examples of regular expressions, we used some functions of the package {stringr}, which all begin with the prefix str_.
Function | Description | Type of result
---|---|---
str_detect() | Detects the presence or absence of a pattern in a string | Boolean vector
str_extract() | Extracts the first matched pattern | Character vector
str_extract_all() | Extracts all matched patterns | List of character vectors; each element of the list corresponds to an element provided to the argument string
str_match() | Extracts the first group found in a string | Matrix
str_match_all() | Extracts all the groups found in a string | List of matrices whose elements correspond to the elements of the vector given to the argument string
str_locate() | Locates the first occurrence of a pattern in a string | Matrix
str_locate_all() | Locates all the occurrences of a pattern in a string | List of matrices
str_replace() | Replaces the first occurrence of a pattern in a string | String
str_replace_all() | Replaces all the occurrences of a pattern in a string | String
str_split() | Splits a string into several pieces, according to a given pattern | List of character vectors
# French phone numbers
phone_numbers <-
c("02 23 23 35 45", "02-23-23-35-45",
"Madrid", "02.23.23.35.45", "0223233545",
"Milan", "02 23 23 35 45 ",
" 02 23 23 35 45", "Home: 02 23 23 35 45")
pattern_phone_number <- str_c(str_dup("([0-9]{2})[- \\.]", 4), "([0-9]{2})")
pattern_phone_number
## [1] "([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})"
## [1] "02 23 23 35 45" "02-23-23-35-45" NA "02.23.23.35.45"
## [5] NA NA "02 23 23 35 45" "02 23 23 35 45"
## [9] "02 23 23 35 45"
# Extract phone numbers, then remove punctuation and white characters
str_extract(phone_numbers, pattern_phone_number) %>%
str_replace_all("[[:punct:]\\s]", "")
## [1] "0223233545" "0223233545" NA "0223233545" NA
## [6] NA "0223233545" "0223233545" "0223233545"
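The matrix output below comes from a call that is not shown; a sketch using str_match(), which returns the whole match followed by each captured group:
str_match(phone_numbers, pattern_phone_number)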
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [2,] "02-23-23-35-45" "02" "23" "23" "35" "45"
## [3,] NA NA NA NA NA NA
## [4,] "02.23.23.35.45" "02" "23" "23" "35" "45"
## [5,] NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA
## [7,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [8,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [9,] "02 23 23 35 45" "02" "23" "23" "35" "45"
The str_locate() and str_locate_all() functions return the start and end indices of the matched substrings.
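The code producing the two outputs below is not shown; a sketch that prints the tweets vector and locates the first occurrence of a pattern (assuming the pattern "magnitude"):
tweets
str_locate(string = tweets, pattern = "magnitude")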
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] 5 13
## [4,] NA NA
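With str_locate_all(), one matrix per element is returned (same assumed pattern):
str_locate_all(string = tweets, pattern = "magnitude")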
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 5 13
##
## [[4]]
## start end
To look for a pattern in a string while ignoring case, the pattern can first be provided to the function stringr::regex():
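The calls producing the two outputs below are not shown; a sketch (the strings and the pattern are assumptions), first with the default case-sensitive matching, then wrapping the pattern in regex() with ignore_case = TRUE:
str_detect(string = c("Earthquake", "earthquake"), pattern = "Earth")
str_detect(string = c("Earthquake", "earthquake"),
           pattern = regex("Earth", ignore_case = TRUE))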
## [1] TRUE FALSE
## [1] TRUE TRUE
It is often necessary to “clean” the strings before they can be used in statistical models. A few basic operations can quickly remove spaces, punctuation, etc.
To set all alphabetical characters to lowercase or uppercase, the functions str_to_lower() or str_to_upper() can be used, respectively.
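The calls producing the two outputs below are not shown; a sketch applied to the first example tweet (stored earlier in two_tweets):
str_to_lower(two_tweets[1])
str_to_upper(two_tweets[1])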
## [1] "dua's for all those affected by the earthquakes in india,nepal & bhutan. stay safe & help others in any form. #equake http://t.co/m6yg0k4fkh"
## [1] "DUA'S FOR ALL THOSE AFFECTED BY THE EARTHQUAKES IN INDIA,NEPAL & BHUTAN. STAY SAFE & HELP OTHERS IN ANY FORM. #EQUAKE HTTP://T.CO/M6YG0K4FKH"
To remove some undesired characters, such as punctuation, the function str_replace_all() can be used:
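The call producing the output below is not shown; a sketch removing the punctuation class from the first tweet (the "amp" fragments in the output come from HTML entities such as &amp; in the raw text):
str_replace_all(string = two_tweets[1], pattern = "[[:punct:]]", replacement = "")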
## [1] "Duas for all those affected by the earthquakes in IndiaNepal amp Bhutan Stay safe amp help others in any form Equake httptcoM6YG0k4FKh"
Another useful function is str_trim(), which trims whitespace from a string. This is typically needed after removing some words from a string. The side parameter specifies whether the spaces to be removed are those on the left of the string, on the right, or both.
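The calls producing the three outputs below are not shown; a sketch (the input string is an assumption):
x <- "  String with spaces at the beginning and end  "
str_trim(x)                 # both sides (default)
str_trim(x, side = "left")  # remove only leading spaces
str_trim(x, side = "right") # remove only trailing spaces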
## [1] "String with spaces at the beginning and end"
## [1] "String with spaces at the beginning and end "
## [1] " String with spaces at the beginning and end"
In this case study, we will analyze text data from the Twitter platform. Messages written during the Gorkha earthquake, an earthquake that occurred in Nepal in April and May 2015, were retrieved.
This case study was inspired by this course (Supervised classification with text data) made by Benjamin Soltoff.
Volunteers have labeled some of the tweets. These are available on the website CrisisNLP. We will use them to train a classifier whose objective is to assign one of several classes (defined below) to new messages broadcast on Twitter during a similar disaster.
We previously loaded the tweets into a tibble named tweets_earthquake.
## # A tibble: 9,471 x 10
## tweet_id tweet_time tweet_author tweet_author_id tweet_language
## <chr> <chr> <chr> <dbl> <chr>
## 1 '591903… Sat Apr 2… Faali19 2387302745 en
## 2 '591903… Sat Apr 2… STERLINGMHO… 153876973 en
## 3 '591903… Sat Apr 2… HeenaliVP 421188281 en
## 4 '591903… Sat Apr 2… Xennia79 176207969 en
## 5 '591903… Sat Apr 2… Madhurita_ 1058658786 en
## 6 '591903… Sat Apr 2… MONIMISHI 1461160603 en
## 7 '591904… Sat Apr 2… AnilBalkrus… 3034294729 en
## 8 '591904… Sat Apr 2… haquem19 2782719416 en
## 9 '591904… Sat Apr 2… Akshay7_ 111600045 en
## 10 '591904… Sat Apr 2… hnrwbell 1613115068 en
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## # tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>
The pre-existing classification is given in the label column. The table() function provides an overview of each class and its associated size:
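The call is not shown; a sketch:
tweets_earthquake$label %>% table()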
## .
## Animal management Caution and advice
## 1 6
## Displaced people Infrastructure and utilities
## 4 50
## Infrastructure damage Infrastructure Damage
## 3 166
## Injured or dead people Missing, trapped, or found people
## 76 56
## Money Not related or irrelevant
## 53 239
## Not relevant Not Relevant
## 1 6279
## Other relevant Other relevant information
## 2 239
## Other Relevant Information Personal updates
## 627 29
## Response efforts Response Efforts
## 2 994
## Shelter and supplies Sympathy and emotional support
## 18 458
## Urgent Needs Volunteer or professional services
## 108 60
Some classes refer to the same concept, but have a different spelling. Let us fix that.
tweets_earthquake <-
tweets_earthquake %>%
mutate(label = str_to_lower(label)) %>%
mutate(label = ifelse(label %in% c("not related or irrelevant",
"not relevant"),
yes = "not relevant",
no = label)) %>%
mutate(label = ifelse(label %in% c("other relevant information",
"other relevant"),
yes = "other relevant",
no = label))
Then let us define the four classes we are interested in (everything else falls into the Other class):
tweets_earthquake <-
tweets_earthquake %>%
mutate(
class = "Other",
class = ifelse(label == "response efforts",
yes = "response efforts", no = class),
class = ifelse(label %in% c("infrastructure damage", "infrastructure and utilitie"),
yes = "infrastructure", no = class),
class = ifelse(label %in% c("urgent needs", "injured or dead people",
"missing, trapped, or found people"),
yes = "urgent needs", no = class),
class = ifelse(label == "sympathy and emotional support",
yes = "sympathy and emotional support", no = class)
)
tweets_earthquake$class %>% table() %>% sort()
## .
## infrastructure urgent needs
## 169 240
## sympathy and emotional support response efforts
## 458 996
## Other
## 7608
To be properly imported into R, the identifiers contained in the tweet_id column have previously been surrounded by single quotation marks ('). If this had not been done, the data import function would have tried to convert this column to numeric, thus losing any 0 at the beginning of the string (001 would have been transformed into 1).
Let’s keep only three columns of this data table:

- tweet_id: the identifier of each tweet
- class: the class to be predicted
- tweet_text: the full text of the tweet

CrisisNLP provides a set of tuples (tweet-id, user-id) for each disaster studied. Using the Twitter API and a few lines of R code, the listed tweets were retrieved (at least those that, at the time of extraction, were still available). To keep this tutorial short, we will work with a database that is already prepared.
The recovered tweets are stored as tibbles in 5 files: tweets_nepal_00.rda to tweets_nepal_04.rda. Let us load them into a list, then concatenate this list to form a single tibble.
# Load tweets (extracted using Twitter API)
N <- list.files("donnees/Tweets/Nepal_2015/", pattern = "^tweets_nepal", full.names = TRUE)
tweets_df <-
lapply(N, function(x){
tweets_tmp <- load(x)
get(tweets_tmp)
})
tweets_df <-
tweets_df %>%
bind_rows()
Let us remove the retweeted statuses; the number of remaining rows is 981,552:
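The code for this step is not shown; a sketch, assuming retweets can be identified by a text starting with "RT @" (the column name full_text is also an assumption):
tweets_df <-
  tweets_df %>%
  filter(!str_detect(full_text, "^RT @"))
nrow(tweets_df)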
## [1] 981552
There are some statuses in the labeled tweets set that are no longer available on the social platform. Let us take them out of our analysis.
tweets_earthquake <-
tweets_earthquake %>%
filter(tweet_id %in% tweets_df$id_str)
nrow(tweets_earthquake)
## [1] 7031
Let us add to the labelled dataset the information obtained via the Twitter API:
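The join is not shown; a sketch, matching the tweet identifiers of the labelled data with the id_str column of the API data:
tweets_earthquake <-
  tweets_earthquake %>%
  left_join(tweets_df, by = c("tweet_id" = "id_str"))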
The frequency for each class shows a strong imbalance.
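A sketch of the call producing the table below:
tweets_earthquake$class %>% table() %>% sort()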
## .
## infrastructure urgent needs
## 135 183
## sympathy and emotional support response efforts
## 328 814
## Other
## 5792
The information we will use in this exercise will be extracted from the texts of the tweets. This involves extracting variables from textual data. What we will do is separate the text into tokens, after cleaning it (lowercase, punctuation removal, word root extraction, etc.).
We will use two packages to pre-process the data: {tidytext} and [{SnowballC}](https://cran.r-project.org/web/packages/SnowballC/index.html). Multiple functions useful for text mining are available in {tidytext}, including sentiment analysis functions. The package {SnowballC} provides Porter's word stemming algorithm, which collapses words to a common root (note that not all languages are available).
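If they are not already attached, both packages can be loaded as follows:
library(tidytext)
library(SnowballC)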
To clean the tweets, we will create some functions. Let us define the function remove_url() to remove URLs from a string, using a regex:
#' remove_url
#' Removes URLs from a string
#' @param x string
remove_url <- function(x){
pattern_url <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
str_replace_all(string = x, pattern = pattern_url, replacement = "")
}
Let us also create the function remove_special_chars() to remove special characters found in tweets (e.g., unrecognized apostrophes):
#' remove_special_chars
#' Removes the special characters from a string
#' @param x string
remove_special_chars <- function(x){
str_replace_all(string = x,
pattern = "[^\x20-\x7e]", replacement = "")
}
Let us define the function remove_mentions(), which removes mentions from a Twitter status (a mention starts with the at symbol @ and is followed by the user's screen name):
#' remove_mentions
#' Removes the mentions from tweets (`@`)
#' @param x string
remove_mentions <- function(x){
str_replace_all(x, "@[[:alnum:]]+\\s?", "")
}
We can also define the functions remove_punctuation() and remove_numbers(), which remove punctuation and numbers, respectively, from a tweet.
#' remove_punctuation
#' Removes punctuation from tweets
#' @param x string
remove_punctuation <- function(x){
str_replace_all(x, "[[:punct:]]", "")
}
#' remove_numbers
#' Removes numbers from tweets
#' @param x string
remove_numbers <- function(x){
str_replace_all(x, "[[:digit:]]", "")
}
Finally, we can define the function remove_char_ref(), which removes character references (e.g., &amp;).
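The definition of this function is not shown in the excerpt; a minimal sketch, following the same pattern as the other helpers and assuming references of the form &name;:
#' remove_char_ref
#' Removes HTML character references from a string
#' @param x string
remove_char_ref <- function(x){
  str_replace_all(string = x, pattern = "&[a-z]+;", replacement = "")
}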
We will create a corpus of texts from the tweets. The idea is to obtain an object containing as many documents as statuses. For each document, we have to count the occurrence of each of the words encountered throughout the corpus. In the end, we obtain a matrix whose rows correspond to the tweets and whose columns indicate the occurrence of each word. The columns of this matrix will be the explanatory variables that can be used to train a classifier.
First, let us create a tibble with an identifier and a tweet.
tweets_earthquake_tt <-
tibble(id = 1:nrow(tweets_earthquake),
text = tweets_earthquake$full_text)
tweets_earthquake_tt
## # A tibble: 7,252 x 2
## id text
## <int> <chr>
## 1 1 "Dua's for all those affected by the\nearthquakes in India,Nepal …
## 2 2 Absolutely devastated by the destruction to my old home #Nepal
## 3 3 Thoughts are with the families in #Nepal
## 4 4 Frightful images! Our prayers echo for everyone affected. #earthq…
## 5 5 Who was Gajendra Singh ? Today no news boz , of earthquake in Nep…
## 6 6 Live: Nepal cabinet meets to seek foreign help, 114 feared dead a…
## 7 7 When you go out for Momos this evening, ask and reassure the sell…
## 8 8 A crucial tool in a situation like #NepalQuake #NepalEarthquake..…
## 9 9 our affection from Madrid Spain, there we were this summer from N…
## 10 10 Devastating pictures of #NepalEarthQuake http://t.co/VaEOUkUTsG
## # … with 7,242 more rows
Then, let us apply to each tweet the functions to clean the text:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
mutate(cleaned_text = str_to_lower(text),
cleaned_text = remove_url(cleaned_text),
cleaned_text = remove_mentions(cleaned_text),
cleaned_text = remove_char_ref(cleaned_text),
cleaned_text = remove_special_chars(cleaned_text),
cleaned_text = remove_punctuation(cleaned_text),
cleaned_text = remove_numbers(cleaned_text)
)
tweets_earthquake_tt
## # A tibble: 7,252 x 3
## id text cleaned_text
## <int> <chr> <chr>
## 1 1 "Dua's for all those affected … "duas for all those affected by t…
## 2 2 Absolutely devastated by the d… absolutely devastated by the dest…
## 3 3 Thoughts are with the families… thoughts are with the families in…
## 4 4 Frightful images! Our prayers … "frightful images our prayers ech…
## 5 5 Who was Gajendra Singh ? Today… who was gajendra singh today no …
## 6 6 Live: Nepal cabinet meets to s… "live nepal cabinet meets to seek…
## 7 7 When you go out for Momos this… when you go out for momos this ev…
## 8 8 A crucial tool in a situation … "a crucial tool in a situation li…
## 9 9 our affection from Madrid Spai… our affection from madrid spain t…
## 10 10 Devastating pictures of #Nepal… "devastating pictures of nepalear…
## # … with 7,242 more rows
Using the unnest_tokens() function of {tidytext}, let us separate each word from the tweets. Each row of the resulting tibble is a tuple indicating the identifier and the word.
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
select(id, cleaned_text) %>%
unnest_tokens(word, cleaned_text)
tweets_earthquake_tt
## # A tibble: 82,692 x 2
## id word
## <int> <chr>
## 1 1 duas
## 2 1 for
## 3 1 all
## 4 1 those
## 5 1 affected
## 6 1 by
## 7 1 theearthquakes
## 8 1 in
## 9 1 indianepal
## 10 1 bhutan
## # … with 82,682 more rows
Some very frequent and potentially noisy words can be removed. A list of such words is available in the tibble stop_words. The function get_stopwords() can also be used to retrieve stopword lists for other languages.
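A sketch printing the stop_words tibble:
stop_words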
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
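For another language, e.g. French, a sketch using get_stopwords():
get_stopwords(language = "fr")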
## # A tibble: 164 x 2
## word lexicon
## <chr> <chr>
## 1 au snowball
## 2 aux snowball
## 3 avec snowball
## 4 ce snowball
## 5 ces snowball
## 6 dans snowball
## 7 de snowball
## 8 des snowball
## 9 du snowball
## 10 elle snowball
## # … with 154 more rows
To remove these stopwords, we can use the anti_join() function from {dplyr}:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
anti_join(stop_words, by = c("word"))
tweets_earthquake_tt
## # A tibble: 44,011 x 2
## id word
## <int> <chr>
## 1 1 duas
## 2 1 affected
## 3 1 theearthquakes
## 4 1 indianepal
## 5 1 bhutan
## 6 1 staysafe
## 7 1 form
## 8 1 equake
## 9 2 absolutely
## 10 2 devastated
## # … with 44,001 more rows
To extract the root of each word, the wordStem() function from {SnowballC} can be applied:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
mutate(word_stem = wordStem(word))
tweets_earthquake_tt
## # A tibble: 44,011 x 3
## id word word_stem
## <int> <chr> <chr>
## 1 1 duas dua
## 2 1 affected affect
## 3 1 theearthquakes theearthquak
## 4 1 indianepal indianep
## 5 1 bhutan bhutan
## 6 1 staysafe staysaf
## 7 1 form form
## 8 1 equake equak
## 9 2 absolutely absolut
## 10 2 devastated devast
## # … with 44,001 more rows
Using the count() function of {dplyr}, the occurrence of each word is easily calculated:
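The call is not shown; a sketch creating the freq_words tibble used below:
freq_words <-
  tweets_earthquake_tt %>%
  count(word_stem, sort = TRUE)
freq_words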
## # A tibble: 8,375 x 2
## word_stem n
## <chr> <int>
## 1 nepal 3820
## 2 nepalearthquak 1164
## 3 earthquak 670
## 4 prayer 486
## 5 peopl 457
## 6 god 455
## 7 nepalquak 370
## 8 donat 285
## 9 new 274
## 10 india 251
## # … with 8,365 more rows
We can use a barplot to graph the occurrences of the top n words (here, we use n=10):
freq_words %>% slice(1:10) %>%
ggplot(data = .,
aes(x = reorder(word_stem, n), y = n)) +
geom_bar(stat ="identity") +
labs(x = "Word", y = "Frequency") +
coord_flip()
A word cloud can be drawn using the wordcloud() function from {wordcloud}. The size of words is positively related to their frequency of appearance.
library(wordcloud)
wordcloud(words = freq_words %>% slice(1:100) %>%
magrittr::extract2("word_stem"),
freq = freq_words %>% slice(1:100) %>%
magrittr::extract2("n"),
random.order = F)
Finally, the cast_dtm() function from {tidytext} can be used to build a document-term matrix from the words of each tweet. Each row in this matrix represents a document and each column indicates the number of occurrences of a word.
tw_eq_dtm <-
tweets_earthquake_tt %>%
count(id, word_stem) %>%
cast_dtm(document = id, term = word_stem, value = n)
tw_eq_dtm
## <<DocumentTermMatrix (documents: 7141, terms: 8375)>>
## Non-/sparse entries: 42855/59763020
## Sparsity : 100%
## Maximal term length: 74
## Weighting : term frequency (tf)
As can be seen, this matrix is very sparse.
Now that we have the document-term matrix (dtm), we will use it, together with the pre-existing labels provided by the volunteers, to estimate a supervised model that classifies tweets into the five categories mentioned above.
For the sake of the example, we will estimate a random forest using the {ranger} package. We will tune some of its parameters thanks to the {caret} package.
First, we can remove some sparse terms from the document-term matrix using the removeSparseTerms() function from {tm}. This gives us a smaller matrix (with fewer columns).
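The call is not shown; a sketch (the name tw_eq_dtm_sparse is hypothetical and is re-used in the sketches below):
library(tm)
tw_eq_dtm_sparse <- removeSparseTerms(tw_eq_dtm, sparse = .995)
tw_eq_dtm_sparse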
## <<DocumentTermMatrix (documents: 7141, terms: 160)>>
## Non-/sparse entries: 19704/1122856
## Sparsity : 98%
## Maximal term length: 16
## Weighting : term frequency (tf)
Here, by setting sparse = .995, we have kept in the dtm only the terms that appear in at least 0.5% of the tweets. This leaves 160 terms.
The attentive reader will have noticed that the number of documents has decreased compared to the initial number of tweets.
Let us keep, in a tibble, only the tweets that remain in the document-term matrix:
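The code is not shown; a sketch, keeping the tweets whose identifier appears among the documents of the reduced matrix (the names tweets_kept and tw_eq_dtm_sparse are hypothetical):
kept_ids <- as.integer(tw_eq_dtm_sparse$dimnames$Docs)
tweets_kept <- tweets_earthquake %>% slice(kept_ids)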
The document-term matrix needs to be transformed into a two-dimensional table to serve as input data for the random forest algorithm:
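A sketch of this conversion (the name dtm_df is the one used in the code below):
dtm_df <- tw_eq_dtm_sparse %>% as.matrix() %>% as_tibble()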
The target can be added as follows:
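The code is not shown; a sketch, adding the class as a factor y and replacing spaces and commas with underscores (the tweets_kept object comes from the previous sketch):
dtm_df <-
  dtm_df %>%
  mutate(y = tweets_kept$class %>%
           str_replace_all("[ ,]+", "_") %>%
           factor())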
Note that in the previous code snippet, we have replaced the spaces and commas in the target variable with underscores. Otherwise, an error would occur when estimating the model.
We can create a training and a test sample as follows:
set.seed(13006)
ind_sample <- sample(1:nrow(dtm_df), size = round(.7*nrow(dtm_df)), replace = F)
train <-
dtm_df %>%
slice(ind_sample)
test <-
dtm_df %>%
slice(-ind_sample)
We will train the model using the train() function from {caret}. We will rely on a random forest to classify the tweets. To do so, we set the method parameter to "ranger" (this will use the package {ranger}). We can further specify some other parameters:

- num.trees: the number of trees
- metric: which summary metric to use to select the optimal model
- trControl: a list of values that define how the function acts
- tuneGrid: a data frame with possible tuning values

The list of values that defines how the training will be accomplished is given below. We state here that we wish to estimate the model by repeated 5-fold cross-validation, with 5 repetitions.
library(caret)
fit_control <- trainControl(
  method = "repeatedcv",  # repeated k-fold CV
  number = 5,             # 5-fold CV
  repeats = 5,            # repeated five times
  classProbs = TRUE,      # compute class probabilities
  verboseIter = FALSE     # do not print a training log
)
The tuning parameters we want to define can be stored in a data frame using the expand.grid() function. With the {ranger} package, the following hyper-parameters can be set:

- .mtry: number of variables to possibly split at in each node
- .splitrule: split rule ("gini" or "extratrees" for classification)
- .min.node.size: the minimal node size

tuning_grid <- expand.grid(
# Number of variables to possibly split at in each node.
.mtry = c(round(sqrt(ncol(train)-1))),
# Split rule: "gini" or "extratrees" for classification
.splitrule = "gini",
# Minimal node size.
.min.node.size = c(1, 5)
)
The model can then be estimated:
fit <- caret::train(y ~ .,
data = train,
method = 'ranger',
trControl = fit_control,
tuneGrid = tuning_grid,
num.trees = 200,
metric = "Kappa")
The function predict() allows us to use the model to predict the class of tweets, both on the training and on the testing sample:
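The calls are not shown; a sketch creating the pred_train and pred_test objects used below:
pred_train <- predict(fit, newdata = train)
pred_test  <- predict(fit, newdata = test)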
The confusion matrices can easily be computed using the confusionMatrix() function from {caret}:
confusion_matrix_train <- caret::confusionMatrix(pred_train, train$y)
confusion_matrix_test <- caret::confusionMatrix(pred_test, test$y)
Some metrics on the quality of fit are stored in the resulting objects:
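For instance, the overall statistics (a sketch of the calls, first for the training then for the testing sample):
confusion_matrix_train$overall
confusion_matrix_test$overall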
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.601720e-01 4.473350e-01 8.502479e-01 8.696735e-01 7.973595e-01
## AccuracyPValue McnemarPValue
## 4.782404e-31 NaN
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.8011204 0.1120414 0.7835763 0.8178366 0.7987862
## AccuracyPValue McnemarPValue
## 0.4061509 NaN
A look at the confusion matrix shows that, overall, while the accuracy is high, the model tends to overestimate the probability of being in the most frequently observed class (Other):
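A sketch printing the training-sample confusion matrix (the call itself is not shown):
confusion_matrix_train$table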
## Reference
## Prediction infrastructure Other response_efforts
## infrastructure 11 0 0
## Other 82 3966 322
## response_efforts 0 17 241
## sympathy_and_emotional_support 0 3 0
## urgent_needs 0 0 0
## Reference
## Prediction sympathy_and_emotional_support
## infrastructure 0
## Other 175
## response_efforts 0
## sympathy_and_emotional_support 52
## urgent_needs 0
## Reference
## Prediction urgent_needs
## infrastructure 0
## Other 97
## response_efforts 3
## sympathy_and_emotional_support 0
## urgent_needs 30
More than 96% of the tweets from the testing sample are predicted as the Other category:
tibble(prop_obs_test = 100 * table(test$y) %>% prop.table(),
prop_predicted_test = table(pred_test) %>% prop.table() * 100) %>%
mutate(prop_obs_test = round(prop_obs_test, 2),
prop_predicted_test = round(prop_predicted_test, 2))
## # A tibble: 5 x 2
## prop_obs_test prop_predicted_test
## <table> <table>
## 1 1.87 0.05
## 2 79.88 96.73
## 3 11.25 2.80
## 4 4.62 0.37
## 5 2.38 0.05
To account for the unbalanced data, we can follow Shirin Elsinghorst's post. We can use oversampling to randomly duplicate samples from the classes with fewer occurrences prior to estimation, so as to obtain as many observations from those classes as in the most represented class.
This is already implemented in {caret}. All we need to do is add an item to the list of controls that states the sampling method to use (e.g., "up" for oversampling, "down" for under-sampling):
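A sketch, re-using the previous controls and adding the sampling item:
fit_control <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5,
  classProbs = TRUE,
  verboseIter = FALSE,
  sampling = "up"  # oversample the minority classes
)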
Then, the model can be estimated:
fit_2 <- caret::train(y ~ ., data = train,
method = 'ranger',
trControl = fit_control,
tuneGrid = tuning_grid,
num.trees = 200,
importance = "impurity")
The model can be used to predict the class of the tweets:
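A sketch of the calls creating the predictions used below:
pred_train_2 <- predict(fit_2, newdata = train)
pred_test_2  <- predict(fit_2, newdata = test)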
And once again, the confusion matrices can be obtained:
confusion_matrix_train_2 <- caret::confusionMatrix(pred_train_2, train$y)
confusion_matrix_test_2 <- caret::confusionMatrix(pred_test_2, test$y)
The accuracy in both the training and the testing samples is lower than before.
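A sketch of the calls printing the overall statistics of the second model (training then testing sample):
confusion_matrix_train_2$overall
confusion_matrix_test_2$overall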
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.611322e-01 4.074015e-01 6.478201e-01 6.742560e-01 7.973595e-01
## AccuracyPValue McnemarPValue
## 1.000000e+00 4.739149e-308
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 4.859944e-01 1.719084e-01 4.646265e-01 5.074007e-01 7.987862e-01
## AccuracyPValue McnemarPValue
## 1.000000e+00 5.116483e-135
But this second model performs relatively better than the first for predicting the other classes.
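The calls are not shown; a sketch printing the per-class statistics on the testing sample, first for the initial model, then for the model estimated on the re-balanced data:
confusion_matrix_test$byClass
confusion_matrix_test_2$byClass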
## Sensitivity Specificity
## Class: infrastructure 0.02500000 1.00000000
## Class: Other 0.98129749 0.08816705
## Class: response_efforts 0.14522822 0.98684903
## Class: sympathy_and_emotional_support 0.01010101 0.99657367
## Class: urgent_needs 0.00000000 0.99952176
## Pos Pred Value Neg Pred Value
## Class: infrastructure 1.0000000 0.9817842
## Class: Other 0.8103282 0.5428571
## Class: response_efforts 0.5833333 0.9010567
## Class: sympathy_and_emotional_support 0.1250000 0.9540769
## Class: urgent_needs 0.0000000 0.9761794
## Precision Recall F1
## Class: infrastructure 1.0000000 0.02500000 0.04878049
## Class: Other 0.8103282 0.98129749 0.88765530
## Class: response_efforts 0.5833333 0.14522822 0.23255814
## Class: sympathy_and_emotional_support 0.1250000 0.01010101 0.01869159
## Class: urgent_needs 0.0000000 0.00000000 NaN
## Prevalence Detection Rate
## Class: infrastructure 0.01867414 0.0004668534
## Class: Other 0.79878618 0.7838468721
## Class: response_efforts 0.11251167 0.0163398693
## Class: sympathy_and_emotional_support 0.04621849 0.0004668534
## Class: urgent_needs 0.02380952 0.0000000000
## Detection Prevalence
## Class: infrastructure 0.0004668534
## Class: Other 0.9673202614
## Class: response_efforts 0.0280112045
## Class: sympathy_and_emotional_support 0.0037348273
## Class: urgent_needs 0.0004668534
## Balanced Accuracy
## Class: infrastructure 0.5125000
## Class: Other 0.5347323
## Class: response_efforts 0.5660386
## Class: sympathy_and_emotional_support 0.5033373
## Class: urgent_needs 0.4997609
## Sensitivity Specificity
## Class: infrastructure 0.4500000 0.8810657
## Class: Other 0.4810053 0.7563805
## Class: response_efforts 0.5269710 0.8548133
## Class: sympathy_and_emotional_support 0.5050505 0.8521782
## Class: urgent_needs 0.4509804 0.9196557
## Pos Pred Value Neg Pred Value
## Class: infrastructure 0.06716418 0.9882604
## Class: Other 0.88685345 0.2685338
## Class: response_efforts 0.31513648 0.9344451
## Class: sympathy_and_emotional_support 0.14204545 0.9726257
## Class: urgent_needs 0.12041885 0.9856484
## Precision Recall F1
## Class: infrastructure 0.06716418 0.4500000 0.1168831
## Class: Other 0.88685345 0.4810053 0.6237211
## Class: response_efforts 0.31513648 0.5269710 0.3944099
## Class: sympathy_and_emotional_support 0.14204545 0.5050505 0.2217295
## Class: urgent_needs 0.12041885 0.4509804 0.1900826
## Prevalence Detection Rate
## Class: infrastructure 0.01867414 0.008403361
## Class: Other 0.79878618 0.384220355
## Class: response_efforts 0.11251167 0.059290383
## Class: sympathy_and_emotional_support 0.04621849 0.023342670
## Class: urgent_needs 0.02380952 0.010737628
## Detection Prevalence
## Class: infrastructure 0.1251167
## Class: Other 0.4332400
## Class: response_efforts 0.1881419
## Class: sympathy_and_emotional_support 0.1643324
## Class: urgent_needs 0.0891690
## Balanced Accuracy
## Class: infrastructure 0.6655328
## Class: Other 0.6186929
## Class: response_efforts 0.6908921
## Class: sympathy_and_emotional_support 0.6786143
## Class: urgent_needs 0.6853180
When calling the train() function for this second model, we also specified the argument importance, setting it to "impurity". In doing so, it is then possible to calculate variable importance using the varImp() function from {caret}.
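A sketch of the call creating the var_importance object used below:
var_importance <- varImp(fit_2)
var_importance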
## ranger variable importance
##
## only 20 most important variables shown (out of 160)
##
## Overall
## prai 100.00
## miss 69.61
## video 60.33
## nepal 54.35
## prayfornep 50.14
## donat 46.03
## nepalearthquak 42.37
## dead 39.34
## death 38.98
## sad 36.38
## kathmandu 35.11
## nepalquak 32.77
## kill 31.60
## devast 30.53
## earthquak 29.86
## everest 27.00
## peopl 24.99
## prayer 23.34
## god 22.52
## rip 21.99
The top ten variables can be plotted as follows:
data.frame(var_importance$importance) %>%
rownames_to_column(var = "variable") %>%
as_tibble() %>%
arrange(desc(Overall)) %>%
slice(1:10) %>%
ggplot(data = .,
aes(x = reorder(variable, Overall), y = Overall)) +
geom_col() +
coord_flip() +
labs(x = "Term")
Now that the model is estimated, it can be used to classify new tweets. However, the matrix of variables extracted from the texts must be created first, so that it can be provided as an argument to the predict() function.
Let us get the original tweets from our dataset and focus on their texts only:
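The code is not shown; a sketch, assuming the text of the collected tweets is stored in a full_text column:
tweets_earthquake_tt_whole <-
  tibble(id = 1:nrow(tweets_df),
         text = tweets_df$full_text)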
The tweets need to be cleaned:
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
mutate(cleaned_text = str_to_lower(text),
cleaned_text = remove_url(cleaned_text),
cleaned_text = remove_mentions(cleaned_text),
cleaned_text = remove_char_ref(cleaned_text),
cleaned_text = remove_special_chars(cleaned_text),
cleaned_text = remove_punctuation(cleaned_text),
cleaned_text = remove_numbers(cleaned_text)
)
Let us create a tibble with only one word per row:
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
select(id, cleaned_text) %>%
unnest_tokens(word, cleaned_text)
tweets_earthquake_tt_whole
## # A tibble: 12,127,423 x 2
## id word
## <int> <chr>
## 1 1 prayers
## 2 1 for
## 3 1 nepal
## 4 1 hope
## 5 1 the
## 6 1 damage
## 7 1 is
## 8 1 minimal
## 9 1 nepalearthquake
## 10 2 news
## # … with 12,127,413 more rows
From these words, we can remove the stop words:
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
anti_join(stop_words, by = c("word"))
nrow(tweets_earthquake_tt_whole)
## [1] 7040123
And then extract the root of the remaining words:
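A sketch, mirroring the stemming step applied to the labelled tweets:
tweets_earthquake_tt_whole <-
  tweets_earthquake_tt_whole %>%
  mutate(word_stem = wordStem(word))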
We will only count the occurrences of the words used as explanatory variables in the classification problem. We therefore keep only the words present in that list (word_list below):
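The word_list object is not defined in the excerpt; a sketch, taking the terms of the training document-term matrix (i.e., the columns of dtm_df other than the target):
word_list <- setdiff(colnames(dtm_df), "y")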
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
semi_join(
tibble(word_stem = word_list)
)
Using the function count(), the number of occurrences of each word in each tweet can be calculated. Then, the document-term matrix can be created using the cast_dtm() function:
dtm_whole <-
tweets_earthquake_tt_whole %>%
count(id, word_stem) %>%
cast_dtm(document = id, term = word_stem, value = n)
And converted to a matrix, then a tibble:
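A sketch (the name dtm_whole_df is hypothetical):
dtm_whole_df <- dtm_whole %>% as.matrix() %>% as_tibble()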
Lastly, the prediction can be made:
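A sketch, using the second model (names are hypothetical; in practice, the columns of dtm_whole_df must match the predictors used during training):
pred_whole <- predict(fit_2, newdata = dtm_whole_df)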
The prediction can be added to the tibble new_data:
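The construction of new_data is not shown; a sketch, joining the predictions back onto the collected tweets (all names and columns here are assumptions):
predictions <- tibble(id = as.integer(dtm_whole$dimnames$Docs),
                      predicted_label = pred_whole)
new_data <-
  tweets_df %>%
  mutate(id = row_number()) %>%
  left_join(predictions, by = "id")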
Once all the tweets have been classified, we can visualize the result. For example, we can represent the density of tweets for each topic according to the date. To do this, let us use the parse_date_time() function of the {lubridate} package to allow R to interpret the created_at column as a date.
library(lubridate)
new_data <-
  new_data %>%
  mutate(date = parse_date_time(created_at, "%a %b %d %H:%M:%S %z %Y", locale = "en_US"))
new_data %>%
filter(!is.na(predicted_label)) %>%
ggplot(data = ., aes(x = date, y = ..count..)) +
geom_density(aes(fill = predicted_label), position = "stack") +
scale_fill_discrete("Topics")
We can also visualize the different categories of tweets on a map. Let us focus on tweets located in Nepal only.
First, a base map for Nepal can be obtained using the function ne_countries() from {rnaturalearth}.
library(rnaturalearth)
map_data <- ne_countries(
country = c("nepal"),
scale = 'medium', type = 'map_units',
returnclass = 'sf')
For convenience, the tweets can be put in an sf object:
library(sf)
geo_tweets <-
new_data %>%
filter(!is.na(predicted_label)) %>%
filter(!is.na(long), !is.na(lat)) %>%
st_as_sf(x = .,
coords = c("long", "lat"),
remove = FALSE,
crs = "+proj=longlat +datum=WGS84")
To keep only the tweets located in Nepal, we can use the st_join() function from {sf}. Only the tweets that lie in the polygons defined in map_data will remain; the others will be discarded.
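A sketch of this spatial join (the name geo_tweets_nepal is hypothetical):
geo_tweets_nepal <- st_join(geo_tweets, map_data, left = FALSE)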
Lastly, a faceting of the tweets depending on their predicted class can be plotted:
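The final plot is not shown; a sketch, drawing the base map and faceting the located tweets by predicted class:
ggplot() +
  geom_sf(data = map_data) +
  geom_sf(data = geo_tweets_nepal, aes(colour = predicted_label), size = .5) +
  facet_wrap(~predicted_label) +
  labs(colour = "Topics")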