Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)
Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)
Ewen Gallic, Aix-Marseille Université (🐦 @3wen)
It is common to manipulate strings. This is the case when files need to be loaded in a loop where only a part of the files are targeted. This is also the case, of course, when the data that is manipulated in the models is textual. There is a very convenient tool that allows one to search for more or less complex patterns in strings: regular expressions. Regular expression (or regex) are sequences of characters forming a search pattern. The pattern is used to match one or several characters in a string. The help page in R (?regex
) provides condensed information on regex.
The {base
} package contains multiple functions related to regular expressions, but we will use instead some functions from the package {stringr
}, built on top of {stringi}
. The package {stringr
} allows us to easily manipulate strings in R.
For most of the examples given to illustrate how regex work, we will use real Twitter data from CrisisNLP. Let us use the tweets from the Nepal Earthquake crisis, annotated by volunteers.
library(tidyverse)
tweets_earthquake <-
str_c("donnees/CrisisNLP_volunteers_labeled_data/2015_Nepal_Earthquake_en/",
"2015_Nepal_Earthquake_en.csv") %>%
read_csv(locale = locale(encoding = "UTF-8"))
tweets_earthquake
## # A tibble: 9,471 x 10
## tweet_id tweet_time tweet_author tweet_author_id tweet_language
## <chr> <chr> <chr> <dbl> <chr>
## 1 '591903… Sat Apr 2… Faali19 2387302745 en
## 2 '591903… Sat Apr 2… STERLINGMHO… 153876973 en
## 3 '591903… Sat Apr 2… HeenaliVP 421188281 en
## 4 '591903… Sat Apr 2… Xennia79 176207969 en
## 5 '591903… Sat Apr 2… Madhurita_ 1058658786 en
## 6 '591903… Sat Apr 2… MONIMISHI 1461160603 en
## 7 '591904… Sat Apr 2… AnilBalkrus… 3034294729 en
## 8 '591904… Sat Apr 2… haquem19 2782719416 en
## 9 '591904… Sat Apr 2… Akshay7_ 111600045 en
## 10 '591904… Sat Apr 2… hnrwbell 1613115068 en
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## # tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>
To check whether a pattern is found in a string, we can use the function stringr::str_detect()
(note that the package {stringr
} have been attached when we atatched {tidyverse}
).
## [1] "Dua's for all those affected by the earthquakes in India,Nepal & Bhutan. Stay safe & help others in any form. #Equake http://t.co/M6YG0k4FKh"
## [2] "itvnews: Witness to Nepal #earthquake tells itvnews: 'It was terrifying' http://t.co/UWMynVyzQC"
# Can we find the word "Earthquake" in the tweets?
str_detect(string = two_tweets, pattern = "earthquake")
## [1] TRUE TRUE
## [1] TRUE FALSE
In the above examples, the pattern is composed of litterals, i.e., characters that recieve a literal interpretation in the regular expression. Some other characters, on the other hand, receive a different interpretation when they are part of the regex. This is the case of the following reserved characters, called metacharacters: . \ | ( ) [ { $ * + ?
. If we want these characters to be literaly interpretted, we need to escape them. In R, this is done using two backslash.
# The character `.` is a metacharacter that matches any character (except a new line)
str_detect(string = c("Earthquake.", "Earthquake"), pattern = ".")
## [1] TRUE TRUE
# To look for a dot in a string:
str_detect(string = c("Earthquake.", "Earthquake"), pattern = "\\.")
## [1] TRUE FALSE
To match the beginning and the end of a string, respectively, we can use the line anchors ^
and $
.
For example, to look for the tweets which begin with a hashtag:
## [1] 10 25 28 31 38 43
This can be useful when combined with the function dplyr::filter()
to filter rows of a two dimension table:
# Tweets beginning with a sharp
tweets_earthquake %>%
filter(str_detect(string = tweet_text, pattern = "^#")) %>%
select(tweet_text)
## # A tibble: 816 x 1
## tweet_text
## <chr>
## 1 #earthquake @BBCNews my uncle is travelling in Nepal but has notified u…
## 2 #Kathmandu's Tribhuvan Airport is currently closed due to #lEarthquake.…
## 3 #Nepal earthquake claims five lives in East #India http://t.co/aU4EFiKU…
## 4 #google person finder for #earthquake http://t.co/uZyXguoio2
## 5 #This Is the #helpline for #Nepal earthquake click on this post https:…
## 6 "#\xbe\xdc\xdd\x8c_\xc9 #\x8c\xe0\xbc\x8a__\x8b\x81\xe3\x8d_\xc8 #\x8b\…
## 7 #NepalQuake | Deep Kumar Upadhyay, Nepal's ambassador to India, says Ai…
## 8 #BeingIndian mourns the loss of the lives in the #earthquake that hit N…
## 9 #NepalEarthquake Tribhuvan Int. Arprt #Kathmandu closed 4 operations fl…
## 10 #earthquakeindia Judging by the nature of tremors in Lakhimpur,one can …
## # … with 806 more rows
The pipe character |
allows to match one or more expression.
tweets <-
c("PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries",
"Earthquake severe damage to Kathmandu. Tragic loss of life.",
"7.9-magnitude earthquake strikes Nepal, damage reported",
"Thoughts are with the families in #Nepal")
str_detect(string = tweets, pattern = "magnitude|damage")
## [1] TRUE TRUE TRUE FALSE
This may be useful for alternative spellings.
## [1] TRUE TRUE FALSE
Character classes are lists of characters that belong to a group, such as alphabetic, numeric, alphanumeric characters, etc. It is possible to build them or to use predefined classes. They are written by placing them in square brackets []
. For example, if the aim is to match strings where characters
Let us assume that we face file names with a date, and that we want to match only those whose month in a given year is “January” or “February”:
str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
pattern = c("file_2019-0[12]-01"))
## [1] "file_2019-01-01" NA "file_2019-02-01"
In the previous code, we therefore searched each string for the occurrence of the substring file_2019-01-01
or file_2019-02-01
.
Using a dash -
, it is possible to define a sequence of characters. Thus, the character class [A-Z]
is used to match the letters of the following set: ABCDEFGHIJKLMNOPQRSTUVWXYZ
. The character class [0-9]
mathces the character set 0123456789
.
str_extract(string = c("file_2019-01-01.txt", "file_2019-03-01.txt", "file_2019-02-01.txt"),
pattern = c("file_2019-0[1-3]-01"))
## [1] "file_2019-01-01" "file_2019-03-01" "file_2019-02-01"
Unions of groups can be made:
str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
"file_2019-03-01.txt", "file_2019-04-01.txt",
"file_2019-05-01.txt", "file_2019-06-01.txt",
"file_2019-07-01.txt", "file_2019-08-01.txt"
),
pattern = c("file_2019-0[1-36-8]-01"))
## [1] "file_2019-01-01" "file_2019-02-01" "file_2019-03-01" NA
## [5] NA "file_2019-06-01" "file_2019-07-01" "file_2019-08-01"
To exclude a group of characters, a circumflex accent ^
can be used:
str_extract(string = c("file_2019-01-01.txt", "file_2019-02-01.txt",
"file_2019-03-01.txt", "file_2019-04-01.txt",
"file_2019-05-01.txt", "file_2019-06-01.txt",
"file_2019-07-01.txt", "file_2019-08-01.txt"
),
pattern = c("file_2019-0[1-36-8]-01"))
## [1] "file_2019-01-01" "file_2019-02-01" "file_2019-03-01" NA
## [5] NA "file_2019-06-01" "file_2019-07-01" "file_2019-08-01"
If, on the other hand, the circumflex must be part of the character class, it should not be placed right after the opening bracket:
## [1] "happy ^" NA
It is also possible to escape the character as follows: [\^]
.
Somes classes are pre-built and can be referred to by their name. They are based on the POSIX family of standards. The most used (in my own experience) are listed in the Table below.
Character class | Descriptions |
---|---|
[:digit:] |
digits |
[:lower:] |
lowercase alphabetic characters |
[:upper:] |
uppercase alphabetic characters |
[:alpha:] |
alphabetic characters (both lower and upper) |
[:alnum:] |
alphabetic characters and numbers |
[:blank:] |
space and tab |
[:punct:] |
punctuation |
[:xdigit:] |
hexadecimal digits |
To refer to these classes, they need to be put between the brackets defining the character classes:
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## [1] FALSE FALSE FALSE FALSE
Some classes also benefit from an abbreviation:
Character class | Descriptions |
---|---|
\d |
digits |
\D |
non decimal digit |
\s |
whitespace |
\w |
word |
\W |
non word |
## [[1]]
## [1] "6" "1"
## [[1]]
## [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" " " "."
## [[1]]
## [1] "M" "a" "g" "n" "i" "t" "u" "d" "e" "6" "1"
## [[1]]
## [1] " " "."
Parentheses can be used to group some part of a regular expression together. This is particularly helpful when combined with quantifiers and character classes, to manipulate file names for example.
Here is an example with the function str_extract()
, which extracts matching patterns from a string:
## [1] "analyse" "analyze" NA
Quantifiers are used to repeat the regular expression a given number of times. The Table below lists the available quantifiers. They are placed after the regex that need to be matched a given number of times.
?
| the regex appears zero or one time |*
| the regex appears zero or more time(s) |+
| the regex appears one or more time(s) |{n}
| the regex appears n
times exactly |{n,}
| the regex appears n
times or more |{n,m}
| the regex appears at least n
times but no more than m
times |## [1] "labour" "labor"
Combining quantifiers with character classes or groups allows to match more complex patterns:
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## [1] NA NA NA NA
## [[1]]
## [1] "PANIC" "IN" "NEPAL" "Strong" "quake" "hits"
## [7] "capital" "causing" "major" "damage" "injuries"
##
## [[2]]
## [1] "Earthquake" "severe" "damage" "to" "Kathmandu"
## [6] "Tragic" "loss" "of" "life"
##
## [[3]]
## [1] "7" "9" "magnitude" "earthquake" "strikes"
## [6] "Nepal" "damage" "reported"
##
## [[4]]
## [1] "Thoughts" "are" "with" "the" "families" "in"
## [7] "Nepal"
(.*)
is a useful combination of grouping and quantifiers. It allows to match any sequence of characters:
.
: any character*
: present zero or more timesx <- c("type_1_20190101_20190131.txt", "type_2_20190101_20190131.txt",
"type_1_20190201_20190228.txt", "type_2_20190201_20190228.txt",
"type_1_20190101_20190131.csv", "type_2_20190101_20190131.csv",
"type_1_20190201_20190228.csv", "type_2_20190201_20190228.csv")
str_extract(x, "^type_1(.*)\\.txt$")
## [1] "type_1_20190101_20190131.txt" NA
## [3] "type_1_20190201_20190228.txt" NA
## [5] NA NA
## [7] NA NA
To illustrate the examples of regular expressions, we used some functions of the package {stringr}
which all begin with the prefix str_
.
Function | Descriptions | Type of result |
---|---|---|
str_detect() |
Detects the presence or absence of a pattern in a string | Booleans |
str_extract() |
Extracts the first matched pattern | Strings |
str_extract_all() |
Extracts matched pattern and provides the result in a list of vectors. | List of vectors of characters. Each element of the list corresponds to an element provided to the argument string |
str_match() |
Extracts the first group found in a string | Matrix |
str_match_all() |
Extracts all the groups found in a string. | List of matrices whose elements correspond to the elements of the vector given to the argument string |
str_locate() |
Locates the first occurrence of a pattern in a string | Matrix |
str_locate_all() |
Locates all the occurrences of a pattern in a string | List of matrices |
str_replace() |
Replaces the first occurrence of a pattern in a string | String |
str_replace_all() |
Replaces all the occurrences of a pattern in a string | String |
str_split() |
Splits a string into several pieces, according to a given pattern | List of vector of characters |
# French phone numbers
phone_numbers <-
c("02 23 23 35 45", "02-23-23-35-45",
"Madrid", "02.23.23.35.45", "0223233545",
"Milan", "02 23 23 35 45 ",
" 02 23 23 35 45", "Home: 02 23 23 35 45")
pattern_phone_number <- str_c(str_dup("([0-9]{2})[- \\.]", 4), "([0-9]{2})")
pattern_phone_number
## [1] "([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})[- \\.]([0-9]{2})"
## [1] "02 23 23 35 45" "02-23-23-35-45" NA "02.23.23.35.45"
## [5] NA NA "02 23 23 35 45" "02 23 23 35 45"
## [9] "02 23 23 35 45"
# Extract phone numbers, then remove punctuation and white characters
str_extract(phone_numbers, pattern_phone_number) %>%
str_replace_all("[[:punct:]\\s]", "")
## [1] "0223233545" "0223233545" NA "0223233545" NA
## [6] NA "0223233545" "0223233545" "0223233545"
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [2,] "02-23-23-35-45" "02" "23" "23" "35" "45"
## [3,] NA NA NA NA NA NA
## [4,] "02.23.23.35.45" "02" "23" "23" "35" "45"
## [5,] NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA
## [7,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [8,] "02 23 23 35 45" "02" "23" "23" "35" "45"
## [9,] "02 23 23 35 45" "02" "23" "23" "35" "45"
The str_locate()
and str_locate_all()
functions return the start and end indices of the matched subchains.
## [1] "PANIC IN NEPAL: Strong quake hits capital, causing major damage, injuries"
## [2] "Earthquake severe damage to Kathmandu. Tragic loss of life."
## [3] "7.9-magnitude earthquake strikes Nepal, damage reported"
## [4] "Thoughts are with the families in #Nepal"
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] 5 13
## [4,] NA NA
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 5 13
##
## [[4]]
## start end
To look for a pattern in a string ignoring case sensitivity, the pattern can be previously provided to the function stringr::regex()
:
## [1] TRUE FALSE
## [1] TRUE TRUE
It is often necessary to “clean” the strings before they can be used in statistical models. A few basic operations can quickly remove spaces, punctuation, etc.
To set all alphabetical characters to lowercase or uppercase, the functions str_to_lower()
or str_to_upper()
can be used, respectively.
## [1] "dua's for all those affected by the earthquakes in india,nepal & bhutan. stay safe & help others in any form. #equake http://t.co/m6yg0k4fkh"
## [1] "DUA'S FOR ALL THOSE AFFECTED BY THE EARTHQUAKES IN INDIA,NEPAL & BHUTAN. STAY SAFE & HELP OTHERS IN ANY FORM. #EQUAKE HTTP://T.CO/M6YG0K4FKH"
To remove some undesired characters, such as punctuation, the function str_replace_all()
can be used:
## [1] "Duas for all those affected by the earthquakes in IndiaNepal amp Bhutan Stay safe amp help others in any form Equake httptcoM6YG0k4FKh"
Another useful function is str_trim()
. It allows to trim whitespace from a string. This typically occurs after removing some words of a string. The side
parameter allows to specify whether the spaces to be removed should only be those on the left of the string, on the right, or both.
## [1] "String with spaces at the beginning and end"
## [1] "String with spaces at the beginning and end "
## [1] " String with spaces at the beginning and end"
In this case study, we will analyze text data from the Twitter platform. Messages written during the Gorkha earthquake, an earthquake that occurred in Nepal in April and May 2015 were retrieved.
This case study was inspired by this course (Supervised classification with text data) made by Benjamin Soltoff.
Volunteers have labeled some of the tweets. These are available on the website CrisisNLP. We will use them to train a classifier. The objective of the latter is to assign one of the following different classes based on new messages broadcast on Twitter during a similar disaster:
We previously loaded the tweets into a tibble named tweets_earthquake
.
## # A tibble: 9,471 x 10
## tweet_id tweet_time tweet_author tweet_author_id tweet_language
## <chr> <chr> <chr> <dbl> <chr>
## 1 '591903… Sat Apr 2… Faali19 2387302745 en
## 2 '591903… Sat Apr 2… STERLINGMHO… 153876973 en
## 3 '591903… Sat Apr 2… HeenaliVP 421188281 en
## 4 '591903… Sat Apr 2… Xennia79 176207969 en
## 5 '591903… Sat Apr 2… Madhurita_ 1058658786 en
## 6 '591903… Sat Apr 2… MONIMISHI 1461160603 en
## 7 '591904… Sat Apr 2… AnilBalkrus… 3034294729 en
## 8 '591904… Sat Apr 2… haquem19 2782719416 en
## 9 '591904… Sat Apr 2… Akshay7_ 111600045 en
## 10 '591904… Sat Apr 2… hnrwbell 1613115068 en
## # … with 9,461 more rows, and 5 more variables: tweet_lon <dbl>,
## # tweet_lat <dbl>, tweet_text <chr>, tweet_url <chr>, label <chr>
The pre-existing classification is given in the label
column. The table()
function provides an overview of each class and its associated size:
## .
## Animal management Caution and advice
## 1 6
## Displaced people Infrastructure and utilities
## 4 50
## Infrastructure damage Infrastructure Damage
## 3 166
## Injured or dead people Missing, trapped, or found people
## 76 56
## Money Not related or irrelevant
## 53 239
## Not relevant Not Relevant
## 1 6279
## Other relevant Other relevant information
## 2 239
## Other Relevant Information Personal updates
## 627 29
## Response efforts Response Efforts
## 2 994
## Shelter and supplies Sympathy and emotional support
## 18 458
## Urgent Needs Volunteer or professional services
## 108 60
Some classes refer to the same concept, but have a different spelling. Let us fix that.
tweets_earthquake <-
tweets_earthquake %>%
mutate(label = str_to_lower(label)) %>%
mutate(label = ifelse(label %in% c("not related or irrelevant",
"not relevant"),
yes = "not relevant",
no = label)) %>%
mutate(label = ifelse(label %in% c("other relevant information",
"other relevant"),
yes = "other relevant",
no = label))
Then let us define the 4 classes we are interested in:
tweets_earthquake <-
tweets_earthquake %>%
mutate(
class = "Other",
class = ifelse(label == "response efforts",
yes = "response efforts", no = class),
class = ifelse(label %in% c("infrastructure damage", "infrastructure and utilitie"),
yes = "infrastructure", no = class),
class = ifelse(label %in% c("urgent needs", "injured or dead people",
"missing, trapped, or found people"),
yes = "urgent needs", no = class),
class = ifelse(label == "sympathy and emotional support",
yes = "sympathy and emotional support", no = class)
)
tweets_earthquake$class %>% table() %>% sort()
## .
## infrastructure urgent needs
## 169 240
## sympathy and emotional support response efforts
## 458 996
## Other
## 7608
To be properly imported into R, the identifiers contained in the tweet_id
column have been previously surrounded by single quotation marks ('
). If this had not been done, the data import function would have tried to convert this column to numeric, thus losing any 0
at the beginning of the chain (001
would have been transformed into 1
).
Let’s keep only three columns of this data table:
tweet_id
: the identifier of each tweetclass
: the class to be predictedtweet_text
: the tweet full textCrisisNLP provides a set of tuples (tweet-id, user-id) for each disaster studied. Using the Twitter API and a few lines of R code, the listed tweets were retrieved (at least those that, at the time of extraction, were still available). To avoid adding to this tutorial, we will work with a database that is already prepared.
The recovered tweets are stored as tibbles in 5 files: tweets_nepal_00.rda
to tweets_nepal_04.rda
. Let us load them into a list, then concatenate this list to form a single tibble.
# Load tweets (extracted using Twitter API)
N <- list.files("donnees/Tweets/Nepal_2015/", pattern = "^tweets_nepal", full.names = TRUE)
tweets_df <-
lapply(N, function(x){
tweets_tmp <- load(x)
get(tweets_tmp)
})
tweets_df <-
tweets_df %>%
bind_rows()
Let us remove the retweeted statutes:
The number of lines is 981552.
## [1] 981552
There are some statuses in the labeled tweets set that are no longer available on the social platform. Let us take them out of our analysis.
tweets_earthquake <-
tweets_earthquake %>%
filter(tweet_id %in% tweets_df$id_str)
nrow(tweets_earthquake)
## [1] 7031
Let us add to the labelled dataset the information obtained via the Twitter API:
The frequency for each class shows a strong imbalance.
## .
## infrastructure urgent needs
## 135 183
## sympathy and emotional support response efforts
## 328 814
## Other
## 5792
The information we will use in this exercise will be extracted from the texts of the tweets. This involves extracting variables from textual data. What we will do is separate the text into tokens, after cleaning it (lowercase, punctuation removal, word root extraction, etc.).
We will use two packages to pre-process the data: {tidytext
} and [{SnowballC
}]https://cran.r-project.org/web/packages/SnowballC/index.html. Multiple functions useful for text mining are available in {tidytext
}, including sentiment analysis functions. The package {SnowballC}
allows to use Porter’s word stemming algorithm which collapses words to a common root (note that not all languages are available).
To clean the tweets, we will create some functions. Let us define the function remove_url()
to remove URLs from a string, using a regex :
#' remove_url
#' Removes URLs from a string
#' @param x string
remove_url <- function(x){
pattern_url <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
str_replace_all(string = x, pattern = pattern_url, replacement = "")
}
Let us also create the function remove_special_chars()
allowing to remove special characters found in tweets (unrecognized apostrophes) :
#' remove_special_chars
#' Removes the special characters from a string
#' @param x string
remove_special_chars <- function(x){
str_replace_all(string = x,
pattern = "[^\x20-\x7e]", replacement = "")
}
Let us define the function remove_mentions()
which identifies a mention in a Twitter status (starting with the arobase symbol (@
) and followed by the user’s screen name) :
#' remove_mentions
#' Removes the mentions from tweets (`@`)
#' @param x string
remove_mentions <- function(x){
str_replace_all(x, "@[[:alnum:]]+\\s?", "")
}
We can also define the functions remove_punctuation()
and remove_numbers()
which remove punctuation and numbers, respectively, from a tweet.
#' remove_punctuation
#' Removes punctuation from tweets
#' @param x string
remove_punctuation <- function(x){
str_replace_all(x, "[[:punct:]]", "")
}
#' remove_numbers
#' Removes numbers from tweets
#' @param x string
remove_numbers <- function(x){
str_replace_all(x, "[[:digit:]]", "")
}
Eventuallt, we can define the function remove_char_ref()
which removes character references (e.g., &
)
We will create a corpus of texts from the tweets. The idea is to obtain an object containing as many documents as statuses. For each document, we have to count the occurrence of each of the words encountered throughout the corpus. At the end of the day, we obtain a matrix whose rows correspond to the tweets and whose columns indicate the occurrence of each word. The columns of this matrix will be the explanatory variables that can be uses to train a classifier.
First, let us create a tibble with an identifier and a tweet.
tweets_earthquake_tt <-
tibble(id = 1:nrow(tweets_earthquake),
text = tweets_earthquake$full_text)
tweets_earthquake_tt
## # A tibble: 7,252 x 2
## id text
## <int> <chr>
## 1 1 "Dua's for all those affected by the\nearthquakes in India,Nepal …
## 2 2 Absolutely devastated by the destruction to my old home #Nepal
## 3 3 Thoughts are with the families in #Nepal
## 4 4 Frightful images! Our prayers echo for everyone affected. #earthq…
## 5 5 Who was Gajendra Singh ? Today no news boz , of earthquake in Nep…
## 6 6 Live: Nepal cabinet meets to seek foreign help, 114 feared dead a…
## 7 7 When you go out for Momos this evening, ask and reassure the sell…
## 8 8 A crucial tool in a situation like #NepalQuake #NepalEarthquake..…
## 9 9 our affection from Madrid Spain, there we were this summer from N…
## 10 10 Devastating pictures of #NepalEarthQuake http://t.co/VaEOUkUTsG
## # … with 7,242 more rows
Then, let us apply to each tweet the functions to clean the text:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
mutate(cleaned_text = str_to_lower(text),
cleaned_text = remove_url(cleaned_text),
cleaned_text = remove_mentions(cleaned_text),
cleaned_text = remove_char_ref(cleaned_text),
cleaned_text = remove_special_chars(cleaned_text),
cleaned_text = remove_punctuation(cleaned_text),
cleaned_text = remove_numbers(cleaned_text)
)
tweets_earthquake_tt
## # A tibble: 7,252 x 3
## id text cleaned_text
## <int> <chr> <chr>
## 1 1 "Dua's for all those affected … "duas for all those affected by t…
## 2 2 Absolutely devastated by the d… absolutely devastated by the dest…
## 3 3 Thoughts are with the families… thoughts are with the families in…
## 4 4 Frightful images! Our prayers … "frightful images our prayers ech…
## 5 5 Who was Gajendra Singh ? Today… who was gajendra singh today no …
## 6 6 Live: Nepal cabinet meets to s… "live nepal cabinet meets to seek…
## 7 7 When you go out for Momos this… when you go out for momos this ev…
## 8 8 A crucial tool in a situation … "a crucial tool in a situation li…
## 9 9 our affection from Madrid Spai… our affection from madrid spain t…
## 10 10 Devastating pictures of #Nepal… "devastating pictures of nepalear…
## # … with 7,242 more rows
Using the unnest_tokens()
function of {tidytext
}, let us separate each word from the tweets. Each line of the tibble obtained is a tuple indicating the identifier and the word.
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
select(id, cleaned_text) %>%
unnest_tokens(word, cleaned_text)
tweets_earthquake_tt
## # A tibble: 82,692 x 2
## id word
## <int> <chr>
## 1 1 duas
## 2 1 for
## 3 1 all
## 4 1 those
## 5 1 affected
## 6 1 by
## 7 1 theearthquakes
## 8 1 in
## 9 1 indianepal
## 10 1 bhutan
## # … with 82,682 more rows
Some very frequent and potentially noisy words can be removed. A list of such words is available in the tibble stop_words
. The function get_stopwords()
can also be used
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
## # A tibble: 164 x 2
## word lexicon
## <chr> <chr>
## 1 au snowball
## 2 aux snowball
## 3 avec snowball
## 4 ce snowball
## 5 ces snowball
## 6 dans snowball
## 7 de snowball
## 8 des snowball
## 9 du snowball
## 10 elle snowball
## # … with 154 more rows
To remove these stopwords, we can use the anti_joint()
function from {dplyr
}:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
anti_join(stop_words, by = c("word"))
tweets_earthquake_tt
## # A tibble: 44,011 x 2
## id word
## <int> <chr>
## 1 1 duas
## 2 1 affected
## 3 1 theearthquakes
## 4 1 indianepal
## 5 1 bhutan
## 6 1 staysafe
## 7 1 form
## 8 1 equake
## 9 2 absolutely
## 10 2 devastated
## # … with 44,001 more rows
To extract the root of each word the wordStem()
function from {SnowballC
} can be applied:
tweets_earthquake_tt <-
tweets_earthquake_tt %>%
mutate(word_stem = wordStem(word))
tweets_earthquake_tt
## # A tibble: 44,011 x 3
## id word word_stem
## <int> <chr> <chr>
## 1 1 duas dua
## 2 1 affected affect
## 3 1 theearthquakes theearthquak
## 4 1 indianepal indianep
## 5 1 bhutan bhutan
## 6 1 staysafe staysaf
## 7 1 form form
## 8 1 equake equak
## 9 2 absolutely absolut
## 10 2 devastated devast
## # … with 44,001 more rows
Using the count()
function of {dplyr
}, the occurrence of each word is easily calculated:
## # A tibble: 8,375 x 2
## word_stem n
## <chr> <int>
## 1 nepal 3820
## 2 nepalearthquak 1164
## 3 earthquak 670
## 4 prayer 486
## 5 peopl 457
## 6 god 455
## 7 nepalquak 370
## 8 donat 285
## 9 new 274
## 10 india 251
## # … with 8,365 more rows
We can use a barplot to graph the occurrences of the top n
words (here, we use n=10
):
freq_words %>% slice(1:10) %>%
ggplot(data = .,
aes(x = reorder(word_stem, n), y = n)) +
geom_bar(stat ="identity") +
labs(x = "Word", y = "Frequency") +
coord_flip()
A word cloud can be drawn using the wordcloud()
function from {wordcloud
}. The size of words is positively related to the frequency of appearance.
library(wordcloud)
wordcloud(words = freq_words %>% slice(1:100) %>%
magrittr::extract2("word_stem"),
freq = freq_words %>% slice(1:100) %>%
magrittr::extract2("n"),
random.order = F)