To begin this new year, I had the pleasure of having some fun with R, Gephi and Twitter. For those who do not know what Gephi is, it is « an open source graph visualization and manipulation software ». I had already played with this software during my internship with Arthur (Freakonometrics) when I was in Montreal. At that time, we wanted to graph the relations between French « députés » on Twitter (in French, see also an interactive version). On that topic, Polit’bistro did a nice visualization using ggplot2 (in French). Still while I was in Montreal, we made other network graphs (also in French), but this time with many more nodes. We wanted to watch how tweets containing the hashtag #ggi (grève générale illimitée) evolved before and after a radio station asked its listeners to troll people on Twitter in order to break the strike.

Here, we are still dealing with Twitter data, but we focus on a particular tweet and its retweets. On the first of January, @freakonometrics posted a tweet that kind of went viral. So Arthur asked me to help him visualize how things unfolded.

Freakonometrics’s Tweet

In about 6 hours, there were more than 2000 retweets. If you pay attention to the picture of the tweet, you can notice that the RT count is smaller than 2000. Don’t worry, we did not cheat or modify the data (what would be the point?). Instead of focusing on the tweet only, we also took into account tweets mentioning the picture, and « fake » RTs (i.e. when someone copies/pastes the tweet instead of using Twitter’s retweet button).

In this post, I will try my best to explain how we collected the data and how we handled it.

Let’s start from the beginning. First, we need to provide our OAuth access tokens to our Twitter session. This R code works on my laptop running Ubuntu 13.10. If you are using Windows, I think you have to go through a few more steps before you can use the Twitter API. For Mac OS X users, I think the following code works as well.

library("twitteR")

reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "http://api.twitter.com/oauth/access_token"
authURL <- "http://api.twitter.com/oauth/authorize"
consumerKey <- "" # Put your consumer key
consumerSecret <- "" # Put you consumer secret
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
consumerSecret=consumerSecret,
requestURL=reqURL,
accessURL=accessURL,
authURL=authURL)
twitCred$handshake()
registerTwitterOAuth(twitCred)

If you are using RStudio, I recommend running the lines above in a terminal R session, saving the object created, and then loading it in RStudio:

# In a terminal R session
save(twitCred, file = "cred.RData")
# In RStudio
load("cred.RData") # the object loaded into memory is named twitCred
registerTwitterOAuth(twitCred)

As I said previously, we look for tweets containing a link to the picture embedded in @freakonometrics’ tweet. Since some Twitter clients offer the possibility to shorten links automatically, we may not have referenced every link.

# The list of character strings to pass to the Twitter API
aParcourir <- c("pic.twitter.com/jAcorEFzdk",
                "http://t.co/jAcorEFzdk",
                "pic.twitter.com/f7mbJeuoeu",
                "pic.twitter.com/p6U5rhwSxn",
                "pic.twitter.com/2onCmDZGZD",
                "pic.twitter.com/yO3arFU0dW",
                "pic.twitter.com/6XzK4FPpaY",
                "pic.twitter.com/BYcWVqA73Y",
                "pic.twitter.com/z3h9YRd5Aj",
                "pic.twitter.com/WuSTOw80Qv",
                "pic.twitter.com/jHeYm13mZc",
                "pic.twitter.com/LnP3NJCxjf",
                "pic.twitter.com/rhpLFRxUdd",
                "pic.twitter.com/wlO7vgTmB4",
                "http://t.co/1hk3KXQcJM",
                "http://t.co/WwYXAKRNcm",
                "http://t.co/dWFX6IkXV8",
                "http://t.co/aDpjZCZhku",
                "https://t.co/nG91SfVEys",
                "http://t.co/17csv29Gm9",
                "http://t.co/O9Ds1icAQp",
                "http://t.co/pxJ83Ahu0F",
                "http://t.co/7zuoNqxs1V",
                "http://t.co/p6U5rhwSxn",
                "Europe vs. the United States. Sunlight in hours per year",
                "Sunlight in hours per year")

For each element of this list, a request is made to the Twitter API to retrieve some tweets. Thanks to the « twitteR » R package, it is extremely simple to do!

search.res <- lapply(aParcourir, function(x) searchTwitter(x, n = 1500))
res <- do.call("c", search.res)

# For each element of the list, rearrange as a data frame
df_freak <- do.call("rbind", lapply(res, function(x) x$toDataFrame()))
df_freak <- df_freak[order(df_freak$created),]
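
Before dealing with duplicates, a quick glance at what came back never hurts (the exact values obviously depend on when the search is run):

# Quick sanity checks on the combined results
nrow(df_freak)
head(df_freak[, c("screenName", "created", "text")])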

The problem here is that there are some duplicated values. I don’t really know how to get rid of them (the duplicated function does not work with the status class of the elements returned by searchTwitter), so I take care of that issue only after formatting the data as a data frame.

df_freak <- unique(df_freak)
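
Another option, which I did not use here, would be to drop duplicates before building the data frame, by comparing the tweet IDs of the raw status objects. A small sketch, assuming each status object exposes an id field (as the twitteR classes do):

# Sketch: deduplicate the raw status objects by tweet ID before conversion
ids <- sapply(res, function(x) x$id)
res <- res[!duplicated(ids)]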

For convenience, let’s put the screen names to lowercase.

df_freak$screenName <- tolower(df_freak$screenName)

Now, remember that everything we are doing here revolves around one tweet:

leTweet <- df_freak[which(df_freak$screenName == "freakonometrics"),]

As we would like to know who was the source of every retweet, we should isolate the name of the twittos (I’ll use this term to refer to Twitter users henceforth) using some regular expressions.

# Return the first capture group of `motif` found in `entree`, or NA when there is no match
extraire <- function(entree, motif){
  res <- regexec(motif, entree)
  if(length(res[[1]]) == 2){
    debut <- (res[[1]])[2]
    fin <- debut + (attr(res[[1]], "match.length"))[2] - 1
    return(substr(entree, debut, fin))
  } else return(NA)
}
# Grab the "@source:" part of each RT, then keep only the screen name
df_freak$RT <- do.call("c", lapply(df_freak$text, function(x) regmatches(x, gregexpr("@(.*?):", x))[[1]][1]))
df_freak$RT <- do.call("c", lapply(df_freak$RT, function(x) gsub(":", "", extraire(x, "@(.*?):"))))
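
To see what extraire returns, here is a toy example (the string is made up, it is not taken from the data set):

# extraire() returns the first capture group of the pattern
extraire("RT @freakonometrics: Europe vs. the United States", "@(.*?):")
# [1] "freakonometrics"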

As mentioned previously, we also have to deal with « fake » retweets.

tmp <- do.call("c", lapply(df_freak$RT, function(x) extraire(x, "^(.*?) ")))
df_freak[which(!is.na(tmp)),"RT"] <- tmp[!is.na(tmp)]

tmp <- do.call("c", lapply(df_freak[which(is.na(df_freak$RT)),"text"], function(x) regmatches(x, gregexpr("@(.*?) \"", x))[[1]][1]))
df_freak[which(is.na(df_freak$RT)),]
df_freak$RT[which(is.na(df_freak$RT))] <- do.call("c", lapply(tmp, function(x) extraire(x, "@(.*?) \"")))
rm(tmp)

# There are also some people saying "via @"
tmp <- do.call("c", lapply(df_freak[which(is.na(df_freak$RT)),"text"], function(x) regmatches(x, gregexpr("via @(.*?)$", x))[[1]][1]))
tmp
df_freak[which(is.na(df_freak$RT)),]

lesVias <- do.call("c", lapply(tmp, function(x) extraire(x, "@(.*?)$")))
lesVias <- gsub("[^[:alpha:]]", "", lesVias)
df_freak$RT[which(is.na(df_freak$RT))] <- lesVias
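
Once more, a toy example to make the « via » case concrete (made-up text):

# Sketch: extracting the source from a made-up "via @" mention
exemple <- "Europe vs. the United States, sunlight in hours per year via @freakonometrics"
exemple <- regmatches(exemple, gregexpr("via @(.*?)$", exemple))[[1]][1]  # "via @freakonometrics"
gsub("[^[:alpha:]]", "", extraire(exemple, "@(.*?)$"))                    # "freakonometrics"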

We are left with some twittos who did not RT anyone, but posted a link to the photo:

df_freak[which(is.na(df_freak$RT)),]

Don’t forget to convert these screen names to lowercase as well:

df_freak$RT <- tolower(df_freak$RT)

So we now have a data frame in which sources (RT) and the persons who retweeted (screenName) can be paired.
There is one more step before worrying about exporting the data so that Gephi can import it. Indeed, we want to create a view of the network every hour, until 24 hours after @freakonometrics’ first tweet. Conveniently, searchTwitter gives the creation time of every tweet recovered, so it is a breeze to find the number of hours elapsed between any tweet and a given date.

# Number of hours elapsed (rounded up) since @freakonometrics' tweet
trouverIndiceProche <- function(date){
  laDiff <- difftime(date, leTweet$created, units = "hours")
  ceiling(as.numeric(laDiff))
}

df_freak$delai <- do.call("c", lapply(df_freak$created, trouverIndiceProche))
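
For instance, a tweet posted 90 minutes after the original one falls into the second hourly bucket (a toy illustration):

# 90 minutes = 1.5 hours, rounded up to 2
trouverIndiceProche(leTweet$created + 90 * 60)
# [1] 2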

Finally, we can export our data to a graphml file, using the following lines.

output <- '<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<!-- Inspired by igraph -->
<key id="name" for="node" attr.name="name" attr.type="string"/>
<key id="delai" for="node" attr.name="delai" attr.type="string"/>
<key id="created" for="node" attr.name="created" attr.type="string"/>
<graph id="G" edgedefault="directed">\n'

fin_output <- '\t</graph>\n</graphml>'

obtenir_noeud <- function(uneLigne){
  paste("<node id=\"",
        as.character(uneLigne["screenName"]),
        "\">\n\t<data key=\"name\">",
        as.character(uneLigne["screenName"]),
        "</data>\n\t<data key=\"delai\">",
        as.numeric(uneLigne["delai"]),
        "</data>\n\t<data key=\"created\">",
        as.character(uneLigne["created"]),
        "</data>\n</node>",
        sep = "")
}


obtenir_edge <- function(x){
  paste('<edge source="',x["RT"],'" target="',x["screenName"],'"/>',sep="")
}


lesNoeuds <- paste((apply(df_freak, 1, obtenir_noeud)), collapse = "\n")
# lesNoeuds <- gsub('id=" ', 'id="', lesNoeuds)
lesLiens <- paste((apply(df_freak[-which(is.na(df_freak$RT)), ], 1, obtenir_edge)), collapse = "\n")
fichier <- paste(output, lesNoeuds, lesLiens, fin_output, sep = "\n")

write.table(fichier,"tweet_freak.graphml",row.names=FALSE, quote=FALSE,col.names=FALSE)
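
To check that the generated string looks like valid GraphML, one can simply print its beginning:

# Eyeball the first few hundred characters of the GraphML string
cat(substr(fichier, 1, 400))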

Just before jumping to the network visualization, let us have a look at some descriptive statistics.
First, we can see that the main twittos being retweeted is @freakonometrics, followed by @dynarski. Other twittos are quite far behind.

Number of RTs for sources retweeted more than five times (top) and cumulative number of tweets over time (bottom).
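
For the record, such counts can be obtained directly from a frequency table of the RT column (a sketch, not necessarily how the figure was produced; the threshold of five matches the caption):

# Sources retweeted more than five times
compteurs <- sort(table(df_freak$RT), decreasing = TRUE)
compteurs[compteurs > 5]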

It is interesting to look at the empirical distribution of RT over time. We may try to model it with a Poisson process.

Empirical distribution of RT over time
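
The distribution above is essentially the number of tweets falling into each hourly bucket defined by delai. A minimal sketch of that figure:

# Number of tweets per hour elapsed since the original tweet
rt_par_heure <- table(df_freak$delai)
barplot(rt_par_heure,
        xlab = "Hours since @freakonometrics' tweet",
        ylab = "Number of tweets")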

Finally, as we can notice on the graph of the number of tweets over time, the growth seems to be exponential. We can add vertical lines at the times when influential twittos retweeted the picture.

Number of tweets over time
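
A minimal way to reproduce this kind of plot, with @dynarski taken as an example of an influential account (which accounts to highlight is up to you):

# Cumulative number of tweets over time, with markers when @dynarski tweeted
ord <- order(df_freak$created)
plot(df_freak$created[ord], seq_along(ord), type = "s",
     xlab = "Time", ylab = "Cumulative number of tweets")
abline(v = as.numeric(df_freak$created[which(df_freak$screenName == "dynarski")]), lty = 2)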

To finish this post, let’s move on to the visualization of the network. I don’t know much about graph theory; I only tried to have some fun here. I used Gephi as best I could, but I must admit I don’t understand everything that was done.

We have some nodes, for which we know a name, a creation date and the time elapsed (in hours) since @freakonometrics’ tweet. Besides, since we know who retweeted whom, we have a set of relations. We are then able to create a directed graph. Here, we used the Fruchterman-Reingold algorithm to draw the graph.
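
For readers who would rather stay in R than open Gephi, here is a rough equivalent, only a sketch and not what we actually did; it relies on igraph’s implementation of the Fruchterman-Reingold layout:

# Build the directed graph from the (source, retweeter) pairs and lay it out
library("igraph")
liens <- df_freak[!is.na(df_freak$RT), c("RT", "screenName")]
g <- graph.data.frame(liens, directed = TRUE)
plot(g, layout = layout.fruchterman.reingold(g),
     vertex.size = 2, vertex.label = NA, edge.arrow.size = 0.2)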

The figure below shows the network 47 hours after @freakonometrics’ tweet.

RT of @freakonometrics’ tweet 47 hours after the creation of the tweet

Arthur suggested showing the hour-by-hour evolution of the network, and I am quite pleased with the result, aesthetically. We only took the first 24 hours.

Evolution of RTs every hour

5 thoughts on “Tweet goes viral”

  1. Nice, nice… Thanks Ewen 🙂

    However, I have a couple of small questions:
    How do you find the elements of your aParcourir vector? (I am a bit of a Twitter novice 😐 )

    Also, the last time I used the searchTwitter function (on Windows, in June/July), whenever the n argument was set to a number greater than 100, I always got the same first 100 tweets repeated
    (I am not sure that is very clear).
    Do you think this has changed since then (I have not run a new test since July), or is it because I am on Windows? (I may try it on a Linux virtual machine if I have time!)

    Have a nice day 🙂

    1. Hi Fanny,
      As for the elements of aParcourir, it is a bit of a cheat… we found them by hand…
      Regarding the searchTwitter function, it actually depends on the API. There is, I believe, a 7-day limit for retrieving tweets through the API. You can, however, have a look at the sinceID or since arguments (http://cran.r-project.org/web/packages/twitteR/twitteR.pdf).

      In July you were probably already using version 1.1 of the API (version 1 stopped working in March 2013, I believe). I would be surprised if being on Windows changed much for this function. You might just run into a few differences with character encoding…

      See you!

  2. Yes, I know about the API change (it actually caused me a few problems, back in May/June I think).
    But then, I had read (I do not remember where…) that there was a limit of 100 tweets per page and 10 pages.

    So, when you use searchTwitter(x, n = 1500), don’t you just get the last 100 posted tweets 150 times?
    I did try the since argument, which made R crash,
    so I had set that solution aside for accessing Twitter data.

    Thanks for the info 🙂

    1. Yes, with the GET search request to the API you are limited to a maximum of 100 results per displayed page, and you will retrieve at most 1500 tweets with the same query (https://dev.twitter.com/docs/api/1/get/search). In our little analysis here, we did not run into that issue: no single query returned 1500 tweets.
      If you want to retrieve more than 1500 tweets for a given keyword, you have to go through the streaming version… and so goodbye to the older tweets…
      Otherwise, there are websites that offer to retrieve older tweets, aren’t there?

      1. Yes, we were going through the streaming version (but via RSS), so that solution no longer worked with the new API. Then we ran some tests in Java, by running as a batch job an R script that fetched the tweets every 2-3 hours and removed the duplicates, and also with a tool from your friend SAS…

        In short, we eventually found several solutions that worked. I just wanted to be sure I had not missed a simple one!

        🙂
