Jean-Philippe Boucher, Université du Québec à Montréal (🐦 @J_P_Boucher)
Arthur Charpentier, Université du Québec à Montréal (🐦 @freakonometrics)
Ewen Gallic, Aix-Marseille Université (🐦 @3wen)
It is common to manipulate strings. This is the case when files need to be loaded in a loop and only some of them are targeted. This is also the case, of course, when the data manipulated in the models is textual. There is a very convenient tool that allows one to search for more or less complex patterns in strings: regular expressions. A regular expression (or regex) is a sequence of characters forming a search pattern. The pattern is used to match one or several characters in a string. The help page in R (?regex) provides condensed information on regex.
The {base} package contains multiple functions related to regular expressions, but we will instead use some functions from the package {stringr}, built on top of {stringi}. The package {stringr} allows us to easily manipulate strings in R.
For most of the examples given to illustrate how regex work, we will use real Twitter data from CrisisNLP. Let us use the tweets from the Nepal Earthquake crisis, annotated by volunteers.
library(tidyverse)
tweets_earthquake <-
str_c("donnees/CrisisNLP_volunteers_labeled_data/2015_Nepal_Earthquake_en/",
"2015_Nepal_Earthquake_en.csv") %>%
read_csv(locale = locale(encoding = "UTF-8"))
tweets_earthquake
## # A tibble: 9,471 x 10
## tweet_id tweet_time tweet_author tweet_author_id tweet_language
## <chr> <chr> <chr> <dbl> <chr>
## 1 '591903… Sat Apr 2… Faali19 2387302745 en
## 2 '591903… Sat Apr 2… STERLINGMHO… 153876973 en
## 3 '591903… Sat Apr 2… HeenaliVP 421188281 en
## 4 '591903… Sat Apr 2… Xennia79 176207969 en
## 5 '591903… Sat Apr 2… Madhurita_ 1058658786 en
## 6 '591903… Sat Apr 2… MONIMISHI 1461160603 en
## 7 '591904… Sat Apr 2… AnilBalkrus… 3034294729 en
## 8 '591904… Sat Apr 2… haquem19 2782719416 en
## 9 '591904… Sat Apr 2… Akshay7_ 111600045 en
## 10 '591904… Sat Apr 2… hnrwbell 1613115068 en
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## # tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>
To check whether a pattern is found in a string, we can use the function stringr::str_detect() (note that the package {stringr} has been attached when we attached {tidyverse}).
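The object two_tweets used below is not defined in the excerpt; a minimal sketch, assuming it simply stores the two example tweets printed just after:
# Two example tweets (assumption: copied from the printed output below)
two_tweets <- c(
  "Dua's for all those affected by the earthquakes in India,Nepal & Bhutan. Stay safe & help others in any form. #Equake http://t.co/M6YG0k4FKh",
  "itvnews: Witness to Nepal #earthquake tells itvnews: 'It was terrifying' http://t.co/UWMynVyzQC"
)
two_tweets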
## [1] "Dua's for all those affected by the earthquakes in India,Nepal & Bhutan. Stay safe & help others in any form. #Equake http://t.co/M6YG0k4FKh"
## [2] "itvnews: Witness to Nepal #earthquake tells itvnews: 'It was terrifying' http://t.co/UWMynVyzQC"
# Can we find the word "Earthquake" in the tweets?
str_detect(string = two_tweets, pattern = "earthquake")
## [1] TRUE TRUE
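The call producing the second result is not shown; patterns are case-sensitive by default, so a sketch along these lines (the pattern "Equake" is an assumption) would match the first tweet only:
str_detect(string = two_tweets, pattern = "Equake")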
## [1] TRUE FALSE
In the above examples, the pattern is composed of literals, i.e., characters that receive a literal interpretation in the regular expression. Some other characters, on the other hand, receive a different interpretation when they are part of the regex. This is the case of the following reserved characters, called metacharacters: . \ | ( ) [ { $ * + ?. If we want these characters to be interpreted literally, we need to escape them. In R, this is done using two backslashes.
# The character `.` is a metacharacter that matches any character (except a new line)
str_detect(string = c("Earthquake.", "Earthquake"), pattern = ".")
## [1] TRUE TRUE
# To look for a dot in a string:
str_detect(string = c("Earthquake.", "Earthquake"), pattern = "\\.")
## [1] TRUE FALSE
To match the beginning and the end of a string, respectively, we can use the line anchors ^ and $.
For example, to look for the tweets which begin with a hashtag:
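The code producing the output below is not shown; a sketch using str_which(), which returns the positions of the matching elements (restricting the search to the first tweets is an assumption):
# Positions of the tweets starting with a hashtag
str_which(string = head(tweets_earthquake$tweet_text, 50), pattern = "^#")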
## [1] 10 25 28 31 38 43
This can be useful when combined with the function dplyr::filter() to filter the rows of a two-dimensional table:
# Tweets beginning with a sharp
tweets_earthquake %>%
filter(str_detect(string = tweet_text, pattern = "^#")) %>%
select(tweet_text)
## # A tibble: 816 x 1
## tweet_text
## <chr>
## 1 #earthquake @BBCNews my uncle is travelling in Nepal but has notified u…
## 2 #Kathmandu's Tribhuvan Airport is currently closed due to #lEarthquake.…
## 3 #Nepal earthquake claims five lives in East #India http://t.co/aU4EFiKU…
## 4 #google person finder for #earthquake http://t.co/uZyXguoio2
## 5 #This Is the #helpline for #Nepal earthquake click on this post https:…
## 6 "#\xbe\xdc\xdd\x8c_\xc9 #\x8c\xe0\xbc\x8a__\x8b\x81\xe3\x8d_\xc8 #\x8b\…
## 7 #NepalQuake | Deep Kumar Upadhyay, Nepal's ambassador to India, says Ai…
## 8 #BeingIndian mourns the loss of the lives in the #earthquake that hit N…
## 9 #NepalEarthquake Tribhuvan Int. Arprt #Kathmandu closed 4 operations fl…
## 10 #earthquakeindia Judging by the nature of tremors in Lakhimpur,one can …
## # … with 806 more rows
The pipe character | is the alternation operator: it matches either the expression on its left or the expression on its right.
tweets <-
c("PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries",
"Earthquake severe damage to Kathmandu. Tragic loss of life.",
"7.9-magnitude earthquake strikes Nepal, damage reported",
"Thoughts are with the families in #Nepal")
str_detect(string = tweets, pattern = "magnitude|damage")
## [1] TRUE TRUE TRUE FALSE
This may be useful for alternative spellings, as in the sketch below.
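The call producing the output below is not shown; a sketch illustrating alternation with alternative spellings (the strings are an assumption):
str_detect(string = c("grey", "gray", "green"), pattern = "grey|gray")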
## [1] TRUE TRUE FALSE
Character classes are lists of characters that belong to a group, such as alphabetic, numeric, or alphanumeric characters. It is possible to build them or to use predefined classes. A character class is written by placing the characters in square brackets []. For example, the class [12] matches strings containing the character 1 or the character 2.
Let us assume that we face file names with a date, and that we want to match only those whose month in a given year is “January” or “February”:
str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
pattern = c("file_2019-0[12]-01"))
## [1] "file_2019-01-01" NA "file_2019-02-01"
In the previous code, we therefore searched each string for the occurrence of the substring file_2019-01-01 or file_2019-02-01.
Using a dash -, it is possible to define a range of characters. Thus, the character class [A-Z] matches the letters of the following set: ABCDEFGHIJKLMNOPQRSTUVWXYZ. The character class [0-9] matches the character set 0123456789.
str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
pattern = c("file_2019-0[1-3]-01"))
## [1] "file_2019-01-01" "file_2019-03-01" "file_2019-02-01"
Unions of groups can be made:
str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
"file_2019-03-01.txt", "file_2019-04-01.txt",
"file_2019-05-01.txt", "file_2019-06-01.txt",
"file_2019-07-01.txt", "file_2019-08-01.txt"
),
pattern = c("file_2019-0[1-36-8]-01"))
## [1] "file_2019-01-01" "file_2019-02-01" "file_2019-03-01" NA
## [5] NA "file_2019-06-01" "file_2019-07-01" "file_2019-08-01"
To exclude a group of characters, a circumflex accent ^ placed right after the opening bracket can be used:
str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
"file_2019-03-01.txt", "file_2019-04-01.txt",
"file_2019-05-01.txt", "file_2019-06-01.txt",
"file_2019-07-01.txt", "file_2019-08-01.txt"
),
pattern = c("file_2019-0[^1-36-8]-01"))
## [1] NA                NA                NA                "file_2019-04-01"
## [5] "file_2019-05-01" NA                NA                NA
If, on the other hand, the circumflex must be part of the character class, it should not be placed right after the opening bracket:
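The call producing the output below is not shown; a sketch where the circumflex is listed after another character inside the class (the strings and the pattern are assumptions):
str_extract(string = c("happy ^", "happy"), pattern = "happy [.^]")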
## [1] "happy ^" NA
It is also possible to escape the character as follows: [\^].
Some classes are pre-built and can be referred to by their name. They are based on the POSIX family of standards. The most commonly used (in our experience) are listed in the Table below.
Character class | Description
---|---
[:digit:] | digits
[:lower:] | lowercase alphabetic characters
[:upper:] | uppercase alphabetic characters
[:alpha:] | alphabetic characters (both lower and upper case)
[:alnum:] | alphabetic characters and numbers
[:blank:] | space and tab
[:punct:] | punctuation
[:xdigit:] | hexadecimal digits
To refer to these classes, they need to be put between the brackets defining the character classes:
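The code producing the next two outputs is not shown; a sketch that prints the tweets vector defined earlier and then looks for a run of at least four digits using a POSIX class (the exact pattern is an assumption):
tweets
str_detect(string = tweets, pattern = "[[:digit:]]{4,}")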
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## [1] FALSE FALSE FALSE FALSE
Some classes also benefit from an abbreviation:
Character class | Description
---|---
\d | digits
\D | non-digits
\s | whitespace
\w | word characters
\W | non-word characters
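The calls producing the four list outputs below are not shown; a sketch that would yield them, assuming the input string "Magnitude 6.1", uses str_extract_all() with each abbreviation in turn:
str_extract_all(string = "Magnitude 6.1", pattern = "\\d")
str_extract_all(string = "Magnitude 6.1", pattern = "\\D")
str_extract_all(string = "Magnitude 6.1", pattern = "\\w")
str_extract_all(string = "Magnitude 6.1", pattern = "\\W")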
## [[1]]
## [1] "6" "1"
## [[1]]
## [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" " " "."
## [[1]]
## [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" "6" "1"
## [[1]]
## [1] " " "."
Parentheses can be used to group some parts of a regular expression together. This is particularly helpful when combined with quantifiers and character classes, to manipulate file names for example. Here is an example with the function str_extract(), which extracts matching patterns from a string:
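The call producing the output below is not shown; a sketch using a group with alternation (the input strings are an assumption):
str_extract(string = c("analyse", "analyze", "analysis"), pattern = "analy(s|z)e")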
## [1] "analyse" "analyze" NA
Quantifiers are used to repeat a regular expression a given number of times. The Table below lists the available quantifiers. They are placed after the part of the regex that needs to be matched a given number of times.
Quantifier | Description
---|---
? | the regex appears zero or one time
* | the regex appears zero or more times
+ | the regex appears one or more times
{n} | the regex appears exactly n times
{n,} | the regex appears n times or more
{n,m} | the regex appears at least n times but no more than m times
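For instance, the ? quantifier handles alternative spellings; a sketch (the original call is not shown, the strings are an assumption):
str_extract(string = c("labour", "labor"), pattern = "labou?r")
## [1] "labour" "labor"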
Combining quantifiers with character classes or groups makes it possible to match more complex patterns:
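The code producing the next two outputs is not shown; a sketch that prints the tweets vector and then looks for at least two consecutive digits (the exact pattern is an assumption), which no tweet contains:
tweets
str_extract(string = tweets, pattern = "[[:digit:]]{2,}")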
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## [1] NA NA NA NA
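By contrast, extracting runs of one or more alphanumeric characters splits each tweet into its words; a sketch (the original call is not shown):
str_extract_all(string = tweets, pattern = "[[:alnum:]]+")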
## [[1]]
## [1] "PANIC" "IN" "NEPAL" "Strong" "quake" "hits"
## [7] "capital" "causing" "major" "damage" "injuries"
##
## [[2]]
## [1] "Earthquake" "severe" "damage" "to" "Kathmandu"
## [6] "Tragic" "loss" "of" "life"
##
## [[3]]
## [1] "7" "9" "magnitude" "earthquake" "strikes"
## [6] "Nepal" "damage" "reported"
##
## [[4]]
## [1] "Thoughts" "are" "with" "the" "families" "in"
## [7] "Nepal"
(.*) is a useful combination of grouping and quantifiers. It allows matching any sequence of characters:

- . : any character
- * : present zero or more times

x <- c("type_1_20190101_20190131.txt", "type_2_20190101_20190131.txt",
"type_1_20190201_20190228.txt", "type_2_20190201_20190228.txt",
"type_1_20190101_20190131.csv", "type_2_20190101_20190131.csv",
"type_1_20190201_20190228.csv", "type_2_20190201_20190228.csv")
str_extract(x, "^type_1(.*)\\.txt$")
## [1] "type_1_20190101_20190131.txt" NA
## [3] "type_1_20190201_20190228.txt" NA
## [5] NA NA
## [7] NA NA
To illustrate the examples of regular expressions, we used some functions of the package {stringr}, which all begin with the prefix str_.
Function | Description | Type of result
---|---|---
str_detect() | Detects the presence or absence of a pattern in a string | Boolean vector
str_extract() | Extracts the first matched pattern | Character vector
str_extract_all() | Extracts all matched patterns | List of character vectors; each element of the list corresponds to an element provided to the argument string
str_match() | Extracts the first group found in a string | Matrix
str_match_all() | Extracts all the groups found in a string | List of matrices whose elements correspond to the elements of the vector given to the argument string
str_locate() | Locates the first occurrence of a pattern in a string | Matrix
str_locate_all() | Locates all the occurrences of a pattern in a string | List of matrices
str_replace() | Replaces the first occurrence of a pattern in a string | String
str_replace_all() | Replaces all the occurrences of a pattern in a string | String
str_split() | Splits a string into several pieces, according to a given pattern | List of character vectors
# French phone numbers
phone_numbers <-
c("02 23 23 35 45", "02-23-23-35-45",
"Madrid", "02.23.23.35.45", "0223233545",
"Milan", "02 23 23 35 45 ",
" 02 23 23 35 45", "Home: 02 23 23 35 45")
pattern_phone_number <- str_c(str_dup("([0-9]{2})[- \\.]", 4), "([0-9]{2})")
pattern_phone_number
## [1] "([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})"
## [1] "02 23 23 35 45" "02-23-23-35-45" NA "02.23.23.35.45"
## [5] NA NA "02 23 23 35 45" "02 23 23 35 45"
## [9] "02 23 23 35 45"
# Extract phone numbers, then remove punctuation and white characters
str_extract(phone_numbers, pattern_phone_number) %>%
str_replace_all("[[:punct:]\\s]", "")
## [1] "0223233545" "0223233545" NA "0223233545" NA
## [6] NA "0223233545" "0223233545" "0223233545"
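The matrix output below comes from a call that is not shown; a sketch using str_match(), which returns the whole match followed by each captured group:
str_match(phone_numbers, pattern_phone_number)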
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [2,] "02-23-23-35-45" "02" "23" "23" "35" "45"
## [3,] NA NA NA NA NA NA
## [4,] "02.23.23.35.45" "02" "23" "23" "35" "45"
## [5,] NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA
## [7,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [8,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [9,] "02 23 23 35 45" "02" "23" "23" "35" "45"
The str_locate() and str_locate_all() functions return the start and end indices of the matched substrings.
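The code producing the two outputs below is not shown; a sketch that prints the tweets vector and locates the first occurrence of a pattern (assuming the pattern "magnitude"):
tweets
str_locate(string = tweets, pattern = "magnitude")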
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] 5 13
## [4,] NA NA
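With str_locate_all(), one matrix per element is returned (same assumed pattern):
str_locate_all(string = tweets, pattern = "magnitude")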
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 5 13
##
## [[4]]
## start end
To look for a pattern in a string while ignoring case, the pattern can first be provided to the function stringr::regex():
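The calls producing the two outputs below are not shown; a sketch (the strings and the pattern are assumptions), first with the default case-sensitive matching, then wrapping the pattern in regex() with ignore_case = TRUE:
str_detect(string = c("Earthquake", "earthquake"), pattern = "Earth")
str_detect(string = c("Earthquake", "earthquake"),
           pattern = regex("Earth", ignore_case = TRUE))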
## [1] TRUE FALSE
## [1] TRUE TRUE
It is often necessary to “clean” the strings before they can be used in statistical models. A few basic operations can quickly remove spaces, punctuation, etc.
To set all alphabetical characters to lowercase or uppercase, the functions str_to_lower() or str_to_upper() can be used, respectively.
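The calls producing the two outputs below are not shown; a sketch applied to the first example tweet (stored earlier in two_tweets):
str_to_lower(two_tweets[1])
str_to_upper(two_tweets[1])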
## [1] "dua's for all those affected by the earthquakes in india,nepal & bhutan. stay safe & help others in any form. #equake http://t.co/m6yg0k4fkh"
## [1] "DUA'S FOR ALL THOSE AFFECTED BY THE EARTHQUAKES IN INDIA,NEPAL & BHUTAN. STAY SAFE & HELP OTHERS IN ANY FORM. #EQUAKE HTTP://T.CO/M6YG0K4FKH"
To remove some undesired characters, such as punctuation, the function str_replace_all() can be used:
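The call producing the output below is not shown; a sketch removing the punctuation class from the first tweet (the "amp" fragments in the output come from HTML entities such as &amp; in the raw text):
str_replace_all(string = two_tweets[1], pattern = "[[:punct:]]", replacement = "")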
## [1] "Duas for all those affected by the earthquakes in IndiaNepal amp Bhutan Stay safe amp help others in any form Equake httptcoM6YG0k4FKh"
Another useful function is str_trim(), which trims whitespace from a string. This is typically needed after removing some words from a string. The side parameter specifies whether the spaces to be removed are those on the left of the string, on the right, or both.
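The calls producing the three outputs below are not shown; a sketch (the input string is an assumption):
x <- "  String with spaces at the beginning and end  "
str_trim(x)                 # both sides (default)
str_trim(x, side = "left")  # remove only leading spaces
str_trim(x, side = "right") # remove only trailing spaces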
## [1] "String with spaces at the beginning and end"
## [1] "String with spaces at the beginning and end "
## [1] " String with spaces at the beginning and end"
In this case study, we will analyze text data from the Twitter platform. Messages written during the Gorkha earthquake, an earthquake that occurred in Nepal in April and May 2015, were retrieved.
This case study was inspired by this course (Supervised classification with text data) made by Benjamin Soltoff.
Volunteers have labeled some of the tweets. These are available on the website CrisisNLP. We will use them to train a classifier whose objective is to assign one of several classes (defined below) to new messages broadcast on Twitter during a similar disaster.
We previously loaded the tweets into a tibble named tweets_earthquake.
## # A tibble: 9,471 x 10
## tweet_id tweet_time tweet_author tweet_author_id tweet_language
## <chr> <chr> <chr> <dbl> <chr>
## 1 '591903… Sat Apr 2… Faali19 2387302745 en
## 2 '591903… Sat Apr 2… STERLINGMHO… 153876973 en
## 3 '591903… Sat Apr 2… HeenaliVP 421188281 en
## 4 '591903… Sat Apr 2… Xennia79 176207969 en
## 5 '591903… Sat Apr 2… Madhurita_ 1058658786 en
## 6 '591903… Sat Apr 2… MONIMISHI 1461160603 en
## 7 '591904… Sat Apr 2… AnilBalkrus… 3034294729 en
## 8 '591904… Sat Apr 2… haquem19 2782719416 en
## 9 '591904… Sat Apr 2… Akshay7_ 111600045 en
## 10 '591904… Sat Apr 2… hnrwbell 1613115068 en
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## # tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>
The pre-existing classification is given in the label column. The table() function provides an overview of each class and its associated size:
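The call is not shown; a sketch:
tweets_earthquake$label %>% table()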
## .
## Animal management Caution and advice
## 1 6
## Displaced people Infrastructure and utilities
## 4 50
## Infrastructure damage Infrastructure Damage
## 3 166
## Injured or dead people Missing, trapped, or found people
## 76 56
## Money Not related or irrelevant
## 53 239
## Not relevant Not Relevant
## 1 6279
## Other relevant Other relevant information
## 2 239
## Other Relevant Information Personal updates
## 627 29
## Response efforts Response Efforts
## 2 994
## Shelter and supplies Sympathy and emotional support
## 18 458
## Urgent Needs Volunteer or professional services
## 108 60
Some classes refer to the same concept, but have a different spelling. Let us fix that.
tweets_earthquake <-
tweets_earthquake %>%
mutate(label = str_to_lower(label)) %>%
mutate(label = ifelse(label %in% c("not related or irrelevant",
"not relevant"),
yes = "not relevant",
no = label)) %>%
mutate(label = ifelse(label %in% c("other relevant information",
"other relevant"),
yes = "other relevant",
no = label))
Then let us define the four classes we are interested in (everything else falls into the Other class):
tweets_earthquake <-
tweets_earthquake %>%
mutate(
class = "Other",
class = ifelse(label == "response efforts",
yes = "response efforts", no = class),
class = ifelse(label %in% c("infrastructure damage", "infrastructure and utilitie"),
yes = "infrastructure", no = class),
class = ifelse(label %in% c("urgent needs", "injured or dead people",
"missing, trapped, or found people"),
yes = "urgent needs", no = class),
class = ifelse(label == "sympathy and emotional support",
yes = "sympathy and emotional support", no = class)
)
tweets_earthquake$class %>% table() %>% sort()
## .
## infrastructure urgent needs
## 169 240
## sympathy and emotional support response efforts
## 458 996
## Other
## 7608
To be properly imported into R, the identifiers contained in the tweet_id column have previously been surrounded by single quotation marks ('). If this had not been done, the data import function would have tried to convert this column to numeric, thus losing any 0 at the beginning of the string (001 would have been transformed into 1).
Let’s keep only three columns of this data table:

- tweet_id: the identifier of each tweet
- class: the class to be predicted
- tweet_text: the full text of the tweet

CrisisNLP provides a set of tuples (tweet-id, user-id) for each disaster studied. Using the Twitter API and a few lines of R code, the listed tweets were retrieved (at least those that, at the time of extraction, were still available). To keep this tutorial short, we will work with a database that is already prepared.
The recovered tweets are stored as tibbles in 5 files: tweets_nepal_00.rda to tweets_nepal_04.rda. Let us load them into a list, then concatenate this list to form a single tibble.
# Load tweets (extracted using Twitter API)
N <- list.files("donnees/Tweets/Nepal_2015/", pattern = "^tweets_nepal", full.names = TRUE)
tweets_df <-
lapply(N, function(x){
tweets_tmp <- load(x)
get(tweets_tmp)
})
tweets_df <-
tweets_df %>%
bind_rows()
Let us remove the retweeted statuses; the number of remaining rows is 981,552:
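The code for this step is not shown; a sketch, assuming retweets can be identified by a text starting with "RT @" (the column name full_text is also an assumption):
tweets_df <-
  tweets_df %>%
  filter(!str_detect(full_text, "^RT @"))
nrow(tweets_df)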
## [1] 981552
There are some statuses in the labeled tweets set that are no longer available on the social platform. Let us take them out of our analysis.
tweets_earthquake <-
tweets_earthquake %>%
filter(tweet_id %in% tweets_df$id_str)
nrow(tweets_earthquake)
## [1] 7031
Let us add to the labelled dataset the information obtained via the Twitter API:
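The join is not shown; a sketch, matching the tweet identifiers of the labelled data with the id_str column of the API data:
tweets_earthquake <-
  tweets_earthquake %>%
  left_join(tweets_df, by = c("tweet_id" = "id_str"))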
The frequency for each class shows a strong imbalance.
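A sketch of the call producing the table below:
tweets_earthquake$class %>% table() %>% sort()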
## .
## infrastructure urgent needs
## 135 183
## sympathy and emotional support response efforts
## 328 814
## Other
## 5792
The information we will use in this exercise will be extracted from the texts of the tweets. This involves extracting variables from textual data. What we will do is separate the text into tokens, after cleaning it (lowercase, punctuation removal, word root extraction, etc.).
We will use two packages to pre-process the data: {tidytext} and [{SnowballC}](https://cran.r-project.org/web/packages/SnowballC/index.html). Multiple functions useful for text mining are available in {tidytext}, including sentiment analysis functions. The package {SnowballC} provides Porter's word stemming algorithm, which collapses words to a common root (note that not all languages are available).
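If they are not already attached, both packages can be loaded as follows:
library(tidytext)
library(SnowballC)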
To clean the tweets, we will create some functions. Let us define the function remove_url() to remove URLs from a string, using a regex:
#' remove_url
#' Removes URLs from a string
#' @param x string
remove_url <- function(x){
pattern_url <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
str_replace_all(string = x, pattern = pattern_url, replacement = "")
}
Let us also create the function remove_special_chars() to remove special characters found in tweets (e.g., unrecognized apostrophes):
#' remove_special_chars
#' Removes the special characters from a string
#' @param x string
remove_special_chars <- function(x){
str_replace_all(string = x,
pattern = "[^\x20-\x7e]", replacement = "")
}
Let us define the function remove_mentions(), which removes mentions from a Twitter status (a mention starts with the at symbol @ and is followed by the user's screen name):
#' remove_mentions
#' Removes the mentions from tweets (`@`)
#' @param x string
remove_mentions <- function(x){
str_replace_all(x, "@[[:alnum:]]+\\s?", "")
}
We can also define the functions remove_punctuation() and remove_numbers(), which remove punctuation and numbers, respectively, from a tweet.
#' remove_punctuation
#' Removes punctuation from tweets
#' @param x string
remove_punctuation <- function(x){
str_replace_all(x, "[[:punct:]]", "")
}
#' remove_numbers
#' Removes numbers from tweets
#' @param x string
remove_numbers <- function(x){
str_replace_all(x, "[[:digit:]]", "")
}
Finally, we can define the function remove_char_ref(), which removes character references (e.g., &amp;).
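The definition of this function is not shown in the excerpt; a minimal sketch, following the same pattern as the other helpers and assuming references of the form &name;:
#' remove_char_ref
#' Removes HTML character references from a string
#' @param x string
remove_char_ref <- function(x){
  str_replace_all(string = x, pattern = "&[a-z]+;", replacement = "")
}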
We will create a corpus of texts from the tweets. The idea is to obtain an object containing as many documents as statuses. For each document, we have to count the occurrence of each of the words encountered throughout the corpus. In the end, we obtain a matrix whose rows correspond to the tweets and whose columns indicate the occurrence of each word. The columns of this matrix will be the explanatory variables that can be used to train a classifier.
First, let us create a tibble with an identifier and a tweet.
tweets_earthquake_tt <-
tibble(id = 1:nrow(tweets_earthquake),
text = tweets_earthquake$full_text)
tweets_earthquake_tt
## # A tibble: 7,252 x 2
## id text
## <int> <chr>
## 1 1 "Dua's for all those affected by the\nearthquakes in India,Nepal …
## 2 2 Absolutely devastated by the destruction to my old home #Nepal
## 3 3 Thoughts are with the families in #Nepal
## 4 4 Frightful images! Our prayers echo for everyone affected. #earthq…
## 5 5 Who was Gajendra Singh ? Today no news boz , of earthquake in Nep…
## 6 6 Live: Nepal cabinet meets to seek foreign help, 114 feared dead a…
## 7 7 When you go out for Momos this evening, ask and reassure the sell…
## 8 8 A crucial tool in a situation like #NepalQuake #NepalEarthquake..…
## 9 9 our affection from Madrid Spain, there we were this summer from N…
## 10 10 Devastating pictures of #NepalEarthQuake http://t.co/VaEOUkUTsG
## # … with 7,242 more rows
Then, let us apply to each tweet the functions to clean the text:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
mutate(cleaned_text = str_to_lower(text),
cleaned_text = remove_url(cleaned_text),
cleaned_text = remove_mentions(cleaned_text),
cleaned_text = remove_char_ref(cleaned_text),
cleaned_text = remove_special_chars(cleaned_text),
cleaned_text = remove_punctuation(cleaned_text),
cleaned_text = remove_numbers(cleaned_text)
)
tweets_earthquake_tt
## # A tibble: 7,252 x 3
## id text cleaned_text
## <int> <chr> <chr>
## 1 1 "Dua's for all those affected … "duas for all those affected by t…
## 2 2 Absolutely devastated by the d… absolutely devastated by the dest…
## 3 3 Thoughts are with the families… thoughts are with the families in…
## 4 4 Frightful images! Our prayers … "frightful images our prayers ech…
## 5 5 Who was Gajendra Singh ? Today… who was gajendra singh today no …
## 6 6 Live: Nepal cabinet meets to s… "live nepal cabinet meets to seek…
## 7 7 When you go out for Momos this… when you go out for momos this ev…
## 8 8 A crucial tool in a situation … "a crucial tool in a situation li…
## 9 9 our affection from Madrid Spai… our affection from madrid spain t…
## 10 10 Devastating pictures of #Nepal… "devastating pictures of nepalear…
## # … with 7,242 more rows
Using the unnest_tokens() function of {tidytext}, let us separate each word from the tweets. Each row of the resulting tibble is a tuple indicating the identifier and the word.
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
select(id, cleaned_text) %>%
unnest_tokens(word, cleaned_text)
tweets_earthquake_tt
## # A tibble: 82,692 x 2
## id word
## <int> <chr>
## 1 1 duas
## 2 1 for
## 3 1 all
## 4 1 those
## 5 1 affected
## 6 1 by
## 7 1 theearthquakes
## 8 1 in
## 9 1 indianepal
## 10 1 bhutan
## # … with 82,682 more rows
Some very frequent and potentially noisy words can be removed. A list of such words is available in the tibble stop_words. The function get_stopwords() can also be used to retrieve stopword lists for other languages.
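A sketch printing the stop_words tibble:
stop_words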
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
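For another language, e.g. French, a sketch using get_stopwords():
get_stopwords(language = "fr")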
## # A tibble: 164 x 2
## word lexicon
## <chr> <chr>
## 1 au snowball
## 2 aux snowball
## 3 avec snowball
## 4 ce snowball
## 5 ces snowball
## 6 dans snowball
## 7 de snowball
## 8 des snowball
## 9 du snowball
## 10 elle snowball
## # … with 154 more rows
To remove these stopwords, we can use the anti_join() function from {dplyr}:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
anti_join(stop_words, by = c("word"))
tweets_earthquake_tt
## # A tibble: 44,011 x 2
## id word
## <int> <chr>
## 1 1 duas
## 2 1 affected
## 3 1 theearthquakes
## 4 1 indianepal
## 5 1 bhutan
## 6 1 staysafe
## 7 1 form
## 8 1 equake
## 9 2 absolutely
## 10 2 devastated
## # … with 44,001 more rows
To extract the root of each word, the wordStem() function from {SnowballC} can be applied:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
mutate(word_stem = wordStem(word))
tweets_earthquake_tt
## # A tibble: 44,011 x 3
## id word word_stem
## <int> <chr> <chr>
## 1 1 duas dua
## 2 1 affected affect
## 3 1 theearthquakes theearthquak
## 4 1 indianepal indianep
## 5 1 bhutan bhutan
## 6 1 staysafe staysaf
## 7 1 form form
## 8 1 equake equak
## 9 2 absolutely absolut
## 10 2 devastated devast
## # … with 44,001 more rows
Using the count() function of {dplyr}, the occurrence of each word is easily calculated:
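The call is not shown; a sketch creating the freq_words tibble used below:
freq_words <-
  tweets_earthquake_tt %>%
  count(word_stem, sort = TRUE)
freq_words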
## # A tibble: 8,375 x 2
## word_stem n
## <chr> <int>
## 1 nepal 3820
## 2 nepalearthquak 1164
## 3 earthquak 670
## 4 prayer 486
## 5 peopl 457
## 6 god 455
## 7 nepalquak 370
## 8 donat 285
## 9 new 274
## 10 india 251
## # … with 8,365 more rows
We can use a barplot to graph the occurrences of the top n words (here, we use n=10):
freq_words %>% slice(1:10) %>%
ggplot(data = .,
aes(x = reorder(word_stem, n), y = n)) +
geom_bar(stat ="identity") +
labs(x = "Word", y = "Frequency") +
coord_flip()
A word cloud can be drawn using the wordcloud() function from {wordcloud}. The size of words is positively related to their frequency of appearance.
library(wordcloud)
wordcloud(words = freq_words %>% slice(1:100) %>%
magrittr::extract2("word_stem"),
freq = freq_words %>% slice(1:100) %>%
magrittr::extract2("n"),
random.order = F)
Finally, the cast_dtm() function from {tidytext} can be used to build a document-term matrix from the words of each tweet. Each row in this matrix represents a document and each column indicates the number of occurrences of a word.
tw_eq_dtm <-
tweets_earthquake_tt %>%
count(id, word_stem) %>%
cast_dtm(document = id, term = word_stem, value = n)
tw_eq_dtm
## <<DocumentTermMatrix (documents: 7141, terms: 8375)>>
## Non-/sparse entries: 42855/59763020
## Sparsity : 100%
## Maximal term length: 74
## Weighting : term frequency (tf)
As can be seen, this matrix is very sparse.
Now that we have the document-term matrix (dtm), we will use it, together with the pre-existing labels provided by the volunteers, to estimate a supervised model that classifies tweets into the five categories mentioned above.
For the sake of the example, we will estimate a random forest using the {ranger} package. We will tune some of its parameters thanks to the {caret} package.
First, we can remove some sparse terms from the document-term matrix using the removeSparseTerms() function from {tm}. This gives us a smaller matrix (with fewer columns).
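The call is not shown; a sketch (the name tw_eq_dtm_sparse is hypothetical and is re-used in the sketches below):
library(tm)
tw_eq_dtm_sparse <- removeSparseTerms(tw_eq_dtm, sparse = .995)
tw_eq_dtm_sparse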
## <<DocumentTermMatrix (documents: 7141, terms: 160)>>
## Non-/sparse entries: 19704/1122856
## Sparsity : 98%
## Maximal term length: 16
## Weighting : term frequency (tf)
Here, by setting sparse = .995, we have kept in the dtm only the terms that appear in at least 0.5% of the tweets. This leaves 160 terms.
The attentive reader will have noticed that the number of documents has decreased compared to the initial number of tweets.
Let us keep, in a tibble, only the tweets that remain in the document-term matrix:
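The code is not shown; a sketch, keeping the tweets whose identifier appears among the documents of the reduced matrix (the names tweets_kept and tw_eq_dtm_sparse are hypothetical):
kept_ids <- as.integer(tw_eq_dtm_sparse$dimnames$Docs)
tweets_kept <- tweets_earthquake %>% slice(kept_ids)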
The document-term matrix needs to be transformed into a two-dimensional table to serve as input data for the random forest algorithm:
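A sketch of this conversion (the name dtm_df is the one used in the code below):
dtm_df <- tw_eq_dtm_sparse %>% as.matrix() %>% as_tibble()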
The target can be added as follows:
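The code is not shown; a sketch, adding the class as a factor y and replacing spaces and commas with underscores (the tweets_kept object comes from the previous sketch):
dtm_df <-
  dtm_df %>%
  mutate(y = tweets_kept$class %>%
           str_replace_all("[ ,]+", "_") %>%
           factor())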
Note that in the previous code snippet, we have replaced the spaces and commas in the target variable with underscores. Otherwise, an error would occur when estimating the model.
We can create a training and a test sample as follows:
set.seed(13006)
ind_sample <- sample(1:nrow(dtm_df), size = round(.7*nrow(dtm_df)), replace = F)
train <-
dtm_df %>%
slice(ind_sample)
test <-
dtm_df %>%
slice(-ind_sample)
We will train the model using the train() function from {caret}. We will rely on a random forest to classify the tweets. To do so, we set the method parameter to "ranger" (this will use the package {ranger}). We can further specify some other parameters:

- num.trees: the number of trees
- metric: which summary metric to use to select the optimal model
- trControl: a list of values that define how the function acts
- tuneGrid: a data frame with possible tuning values

The list of values that defines how the training will be accomplished is given below. We state here that we wish to estimate the model by repeated 5-fold cross-validation, with 5 repetitions.
library(caret)
fit_control <- trainControl(
  method = "repeatedcv",  # repeated k-fold CV
  number = 5,             # 5-fold CV
  repeats = 5,            # repeated five times
  classProbs = TRUE,      # compute class probabilities
  verboseIter = FALSE     # do not print a training log
)
The tuning parameters we want to define can be stored in a data frame using the expand.grid() function. With the {ranger} package, the following hyper-parameters can be set:

- .mtry: number of variables to possibly split at in each node
- .splitrule: split rule ("gini" or "extratrees" for classification)
- .min.node.size: the minimal node size

tuning_grid <- expand.grid(
# Number of variables to possibly split at in each node.
.mtry = c(round(sqrt(ncol(train)-1))),
# Split rule: "gini" or "extratrees" for classification
.splitrule = "gini",
# Minimal node size.
.min.node.size = c(1, 5)
)
The model can then be estimated:
fit <- caret::train(y ~ .,
data = train,
method = 'ranger',
trControl = fit_control,
tuneGrid = tuning_grid,
num.trees = 200,
metric = "Kappa")
The function predict() allows us to use the model to predict the class of tweets, both on the training and on the testing sample:
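The calls are not shown; a sketch creating the pred_train and pred_test objects used below:
pred_train <- predict(fit, newdata = train)
pred_test  <- predict(fit, newdata = test)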
The confusion matrices can easily be computed using the confusionMatrix() function from {caret}:
confusion_matrix_train <- caret::confusionMatrix(pred_train, train$y)
confusion_matrix_test <- caret::confusionMatrix(pred_test, test$y)
Some metrics on the quality of fit are stored in the resulting objects:
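For instance, the overall statistics (a sketch of the calls, first for the training then for the testing sample):
confusion_matrix_train$overall
confusion_matrix_test$overall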
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.601720e-01 4.473350e-01 8.502479e-01 8.696735e-01 7.973595e-01
## AccuracyPValue McnemarPValue
## 4.782404e-31 NaN
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.8011204 0.1120414 0.7835763 0.8178366 0.7987862
## AccuracyPValue McnemarPValue
## 0.4061509 NaN
A look at the confusion matrix shows that, overall, while the accuracy is high, the model tends to overestimate the probability of being in the most frequently observed class (Other):
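A sketch printing the training-sample confusion matrix (the call itself is not shown):
confusion_matrix_train$table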
## Reference
## Prediction infrastructure Other response_efforts
## infrastructure 11 0 0
## Other 82 3966 322
## response_efforts 0 17 241
## sympathy_and_emotional_support 0 3 0
## urgent_needs 0 0 0
## Reference
## Prediction sympathy_and_emotional_support
## infrastructure 0
## Other 175
## response_efforts 0
## sympathy_and_emotional_support 52
## urgent_needs 0
## Reference
## Prediction urgent_needs
## infrastructure 0
## Other 97
## response_efforts 3
## sympathy_and_emotional_support 0
## urgent_needs 30
More than 96% of the tweets from the testing sample are predicted as the Other category:
tibble(prop_obs_test = 100 * table(test$y) %>% prop.table(),
prop_predicted_test = table(pred_test) %>% prop.table() * 100) %>%
mutate(prop_obs_test = round(prop_obs_test, 2),
prop_predicted_test = round(prop_predicted_test, 2))
## # A tibble: 5 x 2
## prop_obs_test prop_predicted_test
## <table> <table>
## 1 1.87 0.05
## 2 79.88 96.73
## 3 11.25 2.80
## 4 4.62 0.37
## 5 2.38 0.05
To account for the unbalanced data, we can follow Shirin Elsinghorst's post. We can use oversampling to randomly duplicate samples from the classes with fewer occurrences prior to estimation, so as to obtain as many observations from those classes as in the most represented class.
This is already implemented in {caret}. All we need to do is add an item to the list of controls that states the sampling method to use (e.g., "up" for oversampling, "down" for under-sampling):
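A sketch, re-using the previous controls and adding the sampling item:
fit_control <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5,
  classProbs = TRUE,
  verboseIter = FALSE,
  sampling = "up"  # oversample the minority classes
)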
Then, the model can be estimated:
fit_2 <- caret::train(y ~ ., data = train,
method = 'ranger',
trControl = fit_control,
tuneGrid = tuning_grid,
num.trees = 200,
importance = "impurity")
The model can be used to predict the class of the tweets:
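A sketch of the calls creating the predictions used below:
pred_train_2 <- predict(fit_2, newdata = train)
pred_test_2  <- predict(fit_2, newdata = test)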
And once again, the confusion matrices can be obtained:
confusion_matrix_train_2 <- caret::confusionMatrix(pred_train_2, train$y)
confusion_matrix_test_2 <- caret::confusionMatrix(pred_test_2, test$y)
The accuracy in both the training and the testing samples is lower than before.
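A sketch of the calls printing the overall statistics of the second model (training then testing sample):
confusion_matrix_train_2$overall
confusion_matrix_test_2$overall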
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.611322e-01 4.074015e-01 6.478201e-01 6.742560e-01 7.973595e-01
## AccuracyPValue McnemarPValue
## 1.000000e+00 4.739149e-308
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 4.859944e-01 1.719084e-01 4.646265e-01 5.074007e-01 7.987862e-01
## AccuracyPValue McnemarPValue
## 1.000000e+00 5.116483e-135
But this second model performs relatively better than the first for predicting the other classes.
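The calls are not shown; a sketch printing the per-class statistics on the testing sample, first for the initial model, then for the model estimated on the re-balanced data:
confusion_matrix_test$byClass
confusion_matrix_test_2$byClass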
## Sensitivity Specificity
## Class: infrastructure 0.02500000 1.00000000
## Class: Other 0.98129749 0.08816705
## Class: response_efforts 0.14522822 0.98684903
## Class: sympathy_and_emotional_support 0.01010101 0.99657367
## Class: urgent_needs 0.00000000 0.99952176
## Pos Pred Value Neg Pred Value
## Class: infrastructure 1.0000000 0.9817842
## Class: Other 0.8103282 0.5428571
## Class: response_efforts 0.5833333 0.9010567
## Class: sympathy_and_emotional_support 0.1250000 0.9540769
## Class: urgent_needs 0.0000000 0.9761794
## Precision Recall F1
## Class: infrastructure 1.0000000 0.02500000 0.04878049
## Class: Other 0.8103282 0.98129749 0.88765530
## Class: response_efforts 0.5833333 0.14522822 0.23255814
## Class: sympathy_and_emotional_support 0.1250000 0.01010101 0.01869159
## Class: urgent_needs 0.0000000 0.00000000 NaN
## Prevalence Detection Rate
## Class: infrastructure 0.01867414 0.0004668534
## Class: Other 0.79878618 0.7838468721
## Class: response_efforts 0.11251167 0.0163398693
## Class: sympathy_and_emotional_support 0.04621849 0.0004668534
## Class: urgent_needs 0.02380952 0.0000000000
## Detection Prevalence
## Class: infrastructure 0.0004668534
## Class: Other 0.9673202614
## Class: response_efforts 0.0280112045
## Class: sympathy_and_emotional_support 0.0037348273
## Class: urgent_needs 0.0004668534
## Balanced Accuracy
## Class: infrastructure 0.5125000
## Class: Other 0.5347323
## Class: response_efforts 0.5660386
## Class: sympathy_and_emotional_support 0.5033373
## Class: urgent_needs 0.4997609
## Sensitivity Specificity
## Class: infrastructure 0.4500000 0.8810657
## Class: Other 0.4810053 0.7563805
## Class: response_efforts 0.5269710 0.8548133
## Class: sympathy_and_emotional_support 0.5050505 0.8521782
## Class: urgent_needs 0.4509804 0.9196557
## Pos Pred Value Neg Pred Value
## Class: infrastructure 0.06716418 0.9882604
## Class: Other 0.88685345 0.2685338
## Class: response_efforts 0.31513648 0.9344451
## Class: sympathy_and_emotional_support 0.14204545 0.9726257
## Class: urgent_needs 0.12041885 0.9856484
## Precision Recall F1
## Class: infrastructure 0.06716418 0.4500000 0.1168831
## Class: Other 0.88685345 0.4810053 0.6237211
## Class: response_efforts 0.31513648 0.5269710 0.3944099
## Class: sympathy_and_emotional_support 0.14204545 0.5050505 0.2217295
## Class: urgent_needs 0.12041885 0.4509804 0.1900826
## Prevalence Detection Rate
## Class: infrastructure 0.01867414 0.008403361
## Class: Other 0.79878618 0.384220355
## Class: response_efforts 0.11251167 0.059290383
## Class: sympathy_and_emotional_support 0.04621849 0.023342670
## Class: urgent_needs 0.02380952 0.010737628
## Detection Prevalence
## Class: infrastructure 0.1251167
## Class: Other 0.4332400
## Class: response_efforts 0.1881419
## Class: sympathy_and_emotional_support 0.1643324
## Class: urgent_needs 0.0891690
## Balanced Accuracy
## Class: infrastructure 0.6655328
## Class: Other 0.6186929
## Class: response_efforts 0.6908921
## Class: sympathy_and_emotional_support 0.6786143
## Class: urgent_needs 0.6853180
When calling the train() function for this second model, we also specified the argument importance, setting it to "impurity". In doing so, it is then possible to calculate variable importance using the varImp() function from {caret}.
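A sketch of the call creating the var_importance object used below:
var_importance <- varImp(fit_2)
var_importance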
## ranger variable importance
##
## only 20 most important variables shown (out of 160)
##
## Overall
## prai 100.00
## miss 69.61
## video 60.33
## nepal 54.35
## prayfornep 50.14
## donat 46.03
## nepalearthquak 42.37
## dead 39.34
## death 38.98
## sad 36.38
## kathmandu 35.11
## nepalquak 32.77
## kill 31.60
## devast 30.53
## earthquak 29.86
## everest 27.00
## peopl 24.99
## prayer 23.34
## god 22.52
## rip 21.99
The top ten variables can be plotted as follows:
data.frame(var_importance$importance) %>%
rownames_to_column(var = "variable") %>%
as_tibble() %>%
arrange(desc(Overall)) %>%
slice(1:10) %>%
ggplot(data = .,
aes(x = reorder(variable, Overall), y = Overall)) +
geom_col() +
coord_flip() +
labs(x = "Term")
Now that the model is estimated, it can be used to classify new tweets. However, the matrix of variables extracted from the texts must be created first, so that it can be provided as an argument to the predict() function.
Let us get the original tweets from our dataset and focus on their texts only:
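The code is not shown; a sketch, assuming the text of the collected tweets is stored in a full_text column:
tweets_earthquake_tt_whole <-
  tibble(id = 1:nrow(tweets_df),
         text = tweets_df$full_text)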
The tweets need to be cleaned:
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
mutate(cleaned_text = str_to_lower(text),
cleaned_text = remove_url(cleaned_text),
cleaned_text = remove_mentions(cleaned_text),
cleaned_text = remove_char_ref(cleaned_text),
cleaned_text = remove_special_chars(cleaned_text),
cleaned_text = remove_punctuation(cleaned_text),
cleaned_text = remove_numbers(cleaned_text)
)
Let us create a tibble with only one word per row:
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
select(id, cleaned_text) %>%
unnest_tokens(word, cleaned_text)
tweets_earthquake_tt_whole
## # A tibble: 12,127,423 x 2
## id word
## <int> <chr>
## 1 1 prayers
## 2 1 for
## 3 1 nepal
## 4 1 hope
## 5 1 the
## 6 1 damage
## 7 1 is
## 8 1 minimal
## 9 1 nepalearthquake
## 10 2 news
## # … with 12,127,413 more rows
From these words, we can remove the stop words:
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
anti_join(stop_words, by = c("word"))
nrow(tweets_earthquake_tt_whole)
## [1] 7040123
And then extract the root of the remaining words:
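A sketch, mirroring the stemming step applied to the labelled tweets:
tweets_earthquake_tt_whole <-
  tweets_earthquake_tt_whole %>%
  mutate(word_stem = wordStem(word))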
We will only count the occurrences of the words used as explanatory variables in the classification problem. We therefore keep only the words present in that list (word_list below):
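The word_list object is not defined in the excerpt; a sketch, taking the terms of the training document-term matrix (i.e., the columns of dtm_df other than the target):
word_list <- setdiff(colnames(dtm_df), "y")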
tweets_earthquake_tt_whole <-
tweets_earthquake_tt_whole %>%
semi_join(
tibble(word_stem = word_list)
)
Using the function count(), the number of occurrences of each word in each tweet can be calculated. Then, the document-term matrix can be created using the cast_dtm() function:
dtm_whole <-
tweets_earthquake_tt_whole %>%
count(id, word_stem) %>%
cast_dtm(document = id, term = word_stem, value = n)
And converted to a matrix, then a tibble:
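A sketch (the name dtm_whole_df is hypothetical):
dtm_whole_df <- dtm_whole %>% as.matrix() %>% as_tibble()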
Lastly, the prediction can be made:
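A sketch, using the second model (names are hypothetical; in practice, the columns of dtm_whole_df must match the predictors used during training):
pred_whole <- predict(fit_2, newdata = dtm_whole_df)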
The prediction can be added to the tibble new_data:
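The construction of new_data is not shown; a sketch, joining the predictions back onto the collected tweets (all names and columns here are assumptions):
predictions <- tibble(id = as.integer(dtm_whole$dimnames$Docs),
                      predicted_label = pred_whole)
new_data <-
  tweets_df %>%
  mutate(id = row_number()) %>%
  left_join(predictions, by = "id")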
Once all the tweets have been classified, we can visualize the result. For example, we can represent the density of tweets for each topic according to the date. To do this, let us use the parse_date_time() function of the {lubridate} package to allow R to interpret the created_at column as a date.
library(lubridate)
new_data <-
  new_data %>%
  mutate(date = parse_date_time(created_at, "%a %b %d %H:%M:%S %z %Y", locale = "en_US"))
new_data %>%
filter(!is.na(predicted_label)) %>%
ggplot(data = ., aes(x = date, y = ..count..)) +
geom_density(aes(fill = predicted_label), position = "stack") +
scale_fill_discrete("Topics")
We can also visualize the different categories of tweets on a map. Let us focus on tweets located in Nepal only.
First, a base map for Nepal can be obtained using the function ne_countries() from {rnaturalearth}.
library(rnaturalearth)
map_data <- ne_countries(
country = c("nepal"),
scale = 'medium', type = 'map_units',
returnclass = 'sf')
For convenience, the tweets can be put in an sf object:
library(sf)
geo_tweets <-
new_data %>%
filter(!is.na(predicted_label)) %>%
filter(!is.na(long), !is.na(lat)) %>%
st_as_sf(x = .,
coords = c("long", "lat"),
remove = FALSE,
crs = "+proj=longlat +datum=WGS84")
To keep only the tweets located in Nepal, we can use the st_join() function from {sf}. Only the tweets that lie in the polygons defined in map_data will remain; the others will be discarded.
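A sketch of this spatial join (the name geo_tweets_nepal is hypothetical):
geo_tweets_nepal <- st_join(geo_tweets, map_data, left = FALSE)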
Lastly, a faceting of the tweets depending on their predicted class can be plotted:
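The final plot is not shown; a sketch, drawing the base map and faceting the located tweets by predicted class:
ggplot() +
  geom_sf(data = map_data) +
  geom_sf(data = geo_tweets_nepal, aes(colour = predicted_label), size = .5) +
  facet_wrap(~predicted_label) +
  labs(colour = "Topics")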