Jean-Philippe Boucher, Université du Québec à Montréal (🐦 @J_P_Boucher)

Arthur Charpentier, Université du Québec à Montréal (🐦 @freakonometrics)

Ewen Gallic, Aix-Marseille Université (🐦 @3wen)

This hands-on session focuses on telematics data. The objective is to manipulate telematics data in order to extract information from it.

In April 2014, the mayors of Québec and Lévis, two cities in the province of Québec (Canada), and the president and CEO of CAA-Quebec (the affiliated club of the Canadian Automobile Association in Québec) launched a mobile application called “Mon trajet” (literally, “my journey”). This application aimed to measure traffic congestion in Québec and Lévis. Users were asked to launch the app whenever they were using their car, to help collect data in real time (anonymously). In total, from April 28th 2014 to May 18th 2014, 18,988 car rides were recorded. The data are freely available online, on data.world.

A really nice use case of this dataset, by Simon Coulombe, is available online: “Dessine moi un tramway” (literally, “draw me a tramway”). In this example, anyone can draw a tram line layout and observe the estimated consequences on the relative use of trams and cars.

This hands-on tutorial is inspired by the work of Simon Coulombe. The first part consists of loading the data into R and producing a first succinct graphical representation. The second part focuses on producing statistics (distances travelled, travel time, average speed on the journey, etc.). Finally, the third part proposes to create dynamic and/or interactive graphical visualizations of the data.

1 Data handling

This section loads the data that are used in the tutorial: the telematics data and some geographic data.

1.1 Telematics data from “Mon trajet”

As previously stated, the dataset containing the almost 19,000 car trips is freely available at data.world.

The file montrajet 09.50.22.zip contains three files in geojson format:

  • montrajet_24au8.geojson: trips from 2014-04-28 to 2014-05-04
  • montrajet_5au11.geojson: trips from 2014-05-05 to 2014-05-11
  • montrajet_12au18.geojson: trips from 2014-05-12 to 2014-05-18

Each file contains geographical points that correspond to the records provided by the users of the mobile app. Some information is given for each point, as described in the table below.

Variable       Description
PRECISION_VER  ? (presumably the vertical accuracy of the GPS record; not documented)
TRIP_ID        ID of the trip
ALTITUDE       altitude of the vehicle (in meters)
DATETRAJET     date of the trip (YYYYMMDDHHMMSS)
GEOM           type of geometry and coordinates (JSON table)
COORD_ID       ID of the geometry
VITESSE        speed of the car (km/h)
PRECISION_HOR  ? (presumably the horizontal accuracy of the GPS record; not documented)

After downloading the data and unzipping the file, the three files can be placed in the following directory: data/montrajet. Then, using the read_sf() function from the {sf} package, the points of each file can be loaded into the R session.
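A minimal sketch of this loading step (the object names trips_1, trips_2 and trips_3 are ours):

```r
library(tidyverse)
library(sf)

# One object per weekly batch of GPS points
trips_1 <- read_sf("data/montrajet/montrajet_24au8.geojson")
trips_2 <- read_sf("data/montrajet/montrajet_5au11.geojson")
trips_3 <- read_sf("data/montrajet/montrajet_12au18.geojson")
```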

A quick glance at the content of the first batch of data can be made:

## Simple feature collection with 762741 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -72.19939 ymin: 46.05592 xmax: -70.48567 ymax: 47.60292
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 762,741 x 9
##    PRECISION_VER TRIP_ID ALTITUDE DATETRAJET GEOM  COORD_ID VITESSE
##            <dbl>   <int>    <int> <chr>      <chr>    <int>   <dbl>
##  1             6    4916      101 201404290… "{\"…  3666663   19.3 
##  2             4    4916      100 201404290… "{\"…  3667923    7.82
##  3             3    4916       98 201404290… "{\"…  3667168   10.9 
##  4             6    4916       96 201404290… "{\"…  3667664   13.9 
##  5             8    4916       96 201404290… "{\"…  3668149   14.0 
##  6             6    4916       96 201404290… "{\"…  3667703   10.1 
##  7             4    4916       96 201404290… "{\"…  3667793    4.48
##  8             4    4916       95 201404290… "{\"…  3667102    3.12
##  9             6    4916       98 201404290… "{\"…  3667552    4.32
## 10             4    4916       96 201404290… "{\"…  3667261    1.16
## # … with 762,731 more rows, and 2 more variables: PRECISION_HOR <dbl>,
## #   geometry <POINT [°]>

It can be seen that this first dataset contains 762,741 observations, for which geographical coordinates in decimal degrees are available. The identifier of the spatial projection used is EPSG:4326. If necessary, it can be changed using the function sf::st_transform() (an example of how to use this function is provided in the cartography section).

A car trip is identified by an ID (TRIP_ID) and consists of at least 2 observations (origin and destination). However, some IDs are repeated across the three files. A new ID can therefore be created by concatenating the old ID with a value that depends on the file in which the records are stored.

These three batches of data can then be merged into a single object:
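A sketch of both operations, assuming the three objects created above (the prefix 1, 2 or 3 identifies the source file):

```r
# Prefix each TRIP_ID with the number of the batch it comes from,
# then stack the three batches into a single sf object
trips <- rbind(
  trips_1 %>% mutate(TRIP_ID = str_c(1, TRIP_ID, sep = "_")),
  trips_2 %>% mutate(TRIP_ID = str_c(2, TRIP_ID, sep = "_")),
  trips_3 %>% mutate(TRIP_ID = str_c(3, TRIP_ID, sep = "_"))
)
trips
```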

## Simple feature collection with 1927117 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -72.19968 ymin: 45.80168 xmax: -70.20025 ymax: 47.69149
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,927,117 x 9
##    PRECISION_VER TRIP_ID ALTITUDE DATETRAJET GEOM  COORD_ID VITESSE
##            <dbl> <chr>      <dbl> <chr>      <chr>    <int>   <dbl>
##  1             6 1_4916       101 201404290… "{\"…  3666663   19.3 
##  2             4 1_4916       100 201404290… "{\"…  3667923    7.82
##  3             3 1_4916        98 201404290… "{\"…  3667168   10.9 
##  4             6 1_4916        96 201404290… "{\"…  3667664   13.9 
##  5             8 1_4916        96 201404290… "{\"…  3668149   14.0 
##  6             6 1_4916        96 201404290… "{\"…  3667703   10.1 
##  7             4 1_4916        96 201404290… "{\"…  3667793    4.48
##  8             4 1_4916        95 201404290… "{\"…  3667102    3.12
##  9             6 1_4916        98 201404290… "{\"…  3667552    4.32
## 10             4 1_4916        96 201404290… "{\"…  3667261    1.16
## # … with 1,927,107 more rows, and 2 more variables: PRECISION_HOR <dbl>,
## #   geometry <POINT [°]>

In the telematics dataset, the attribute giving the date of the record, DATETRAJET, is a string of the form YYYYMMDDHHMMSS: it is constructed by concatenating the year, the month and the day, followed by the hour, the minute and the second. The function ymd_hms() from the {lubridate} package easily parses such strings, converting them into POSIXct objects. Once this is done, a few functions from the same package can be used to extract the elements of the date (year(), month(), day(), etc.).

We need to make sure that the points are ordered chronologically within each trip. The records can be ordered by increasing dates:
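A sketch covering both the parsing of DATETRAJET and the chronological ordering (the column names are ours, chosen to match the outputs shown below):

```r
library(lubridate)

trips <- trips %>%
  mutate(date    = ymd_hms(DATETRAJET),
         year    = year(date),
         month   = month(date),
         day     = day(date),
         hour    = hour(date),
         minute  = minute(date),
         second  = second(date),
         weekday = wday(date, label = TRUE)) %>%
  arrange(date)  # chronological order, hence chronological within each trip
```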

In the raw dataset, there seem to be errors in the trip identifiers. A trip is expected to consist of a succession of recordings spaced a few seconds apart; for some trips, however, leaps of several days are observed. In such cases, we will assume that the records belong to different trips, to which different identifiers will be assigned.

First, let us look at the time between each observation and the previous one, within the same trip identified by an ID. To do this, the dplyr::lag() function is used to put the date of the previous observation in the tibble. For the first observation of a trip, the value NA is returned by dplyr::lag(). Then, using the lubridate::interval() function, a time interval can be calculated between the two dates. By dividing the returned value by a duration of 1 minute, obtained with the function lubridate::minutes(), the length of the interval is expressed in minutes.

Some descriptive statistics can then be obtained:
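A sketch of the computation just described (the column names previous_date and time_elapsed are ours):

```r
trips <- trips %>%
  group_by(TRIP_ID) %>%
  mutate(previous_date = dplyr::lag(date),
         # length of the interval between two consecutive records, in minutes
         time_elapsed  = interval(previous_date, date) / minutes(1)) %>%
  ungroup()

summary(trips$time_elapsed)
```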

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.167    0.200    4.113    0.267 8643.217    18988

As can be seen, the time between two observations is less than or equal to about 16 seconds for 75% of the observations. However, there seem to be significant differences for some observations within the same trip.

A threshold defining the maximum time that may separate two observations before they are considered to belong to two different trips needs to be chosen. It is likely that a loss of signal sometimes prevents the phone from sending the user's position. Hence, the chosen threshold should not be too low.

A plot can help us visualize how many trips would be created according to the value of the threshold:
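One way to build such a plot, counting for each candidate threshold the number of within-trip gaps that exceed it:

```r
thresholds <- 1:60  # candidate thresholds, in minutes
n_new_trips <- map_int(thresholds,
                       ~ sum(trips$time_elapsed > .x, na.rm = TRUE))

ggplot(tibble(threshold = thresholds, n_new_trips = n_new_trips),
       aes(x = threshold, y = n_new_trips)) +
  geom_line() +
  labs(x = "Threshold (minutes)", y = "Number of new trips")
```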

Arbitrarily, let us choose a 10-minute threshold, which leads to the creation of 2,907 new trips:
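For instance, with a sketch like the following:

```r
# Gaps longer than 10 minutes, i.e. breaks that define new trips
summary(trips$time_elapsed[which(trips$time_elapsed > 10)])
sum(trips$time_elapsed > 10, na.rm = TRUE)
```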

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.02   23.18 2275.87 2521.99 4805.32 8643.22
## [1] 2907

By manipulating the table, it is easy to create a new identifier for the trips delimited by a time break with respect to the previous observation. Within each data group identified by the column TRIP_ID, a numerical variable equal to 1 is created whenever a break is observed. Then, using a cumulative sum, a distinct value for each new trip is obtained. This value is then concatenated to the TRIP_ID column.
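A sketch of this step (the intermediate column names are ours):

```r
trips <- trips %>%
  group_by(TRIP_ID) %>%
  mutate(break_obs  = if_else(!is.na(time_elapsed) & time_elapsed > 10, 1, 0),
         trip_index = cumsum(break_obs)) %>%  # 0 for the original trip, then 1, 2, ...
  ungroup() %>%
  mutate(TRIP_ID = str_c(TRIP_ID, trip_index, sep = "_")) %>%
  select(-break_obs, -trip_index)
```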

Using the ID of each trip, it is possible to isolate a trip from the data set. To that end, a variable which contains all distinct values of trip IDs can be created:
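For instance:

```r
trip_ids <- unique(trips$TRIP_ID)
length(trip_ids)
```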

## [1] 21895

For example, let us extract the car trip that corresponds to the first ID (1_3312_0):
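A possible sketch (the object name a_trip is ours):

```r
a_trip <- trips %>% filter(TRIP_ID == trip_ids[1])  # i.e. "1_3312_0"
a_trip
```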

## Simple feature collection with 32 features and 16 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -71.30231 ymin: 46.79629 xmax: -71.2673 ymax: 46.8701
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 32 x 17
##    PRECISION_VER TRIP_ID ALTITUDE DATETRAJET GEOM  COORD_ID VITESSE
##  *         <dbl> <fct>      <dbl> <chr>      <chr>    <int>   <dbl>
##  1             3 1_3312…       89 201404280… "{\"…  2303492    32.8
##  2             4 1_3312…       89 201404280… "{\"…  2303502    32.2
##  3             4 1_3312…       96 201404280… "{\"…  2303512    33  
##  4             5 1_3312…       93 201404280… "{\"…  2303522    31.2
##  5             4 1_3312…       76 201404280… "{\"…  2303532    31.8
##  6             3 1_3312…       48 201404280… "{\"…  2303544    32  
##  7             3 1_3312…       18 201404280… "{\"…  2303554    31.8
##  8             3 1_3312…       -4 201404280… "{\"…  2303564    31.2
##  9             4 1_3312…       -6 201404280… "{\"…  2303574    24.8
## 10             3 1_3312…      -10 201404280… "{\"…  2303584    17.8
## # … with 22 more rows, and 10 more variables: PRECISION_HOR <dbl>,
## #   geometry <POINT [°]>, date <dttm>, year <dbl>, month <dbl>, day <int>,
## #   hour <int>, minute <int>, second <dbl>, weekday <ord>

As can be seen from the output, the trip contains 32 points. The user drove from 04:19:08 to 04:27:15 on 2014-04-28:

## [1] "2014-04-28 04:19:08 UTC" "2014-04-28 04:27:15 UTC"

The information relative to the coordinates is contained in the column geometry of the tibble. For example, for the first point of the dataset:

## Geometry set for 1 feature 
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -71.29658 ymin: 46.8701 xmax: -71.29658 ymax: 46.8701
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

These coordinates can be used by the geom_sf() function from the {ggplot2} package to create a plot:
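A minimal sketch:

```r
ggplot(a_trip) + geom_sf()
```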

If one wants to display multiple trips, the procedure does not change much. For convenience, an object which contains a few trips can be created. Let us create an object that contains 3 car rides.

Note that, in the code below, the variable TRIP_ID is turned into a factor. This step is helpful, as we will tell ggplot2 to group the points by TRIP_ID, so that the colour of each trip can vary according to the ID of the car trip.
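A sketch of both steps (the object some_trips is ours):

```r
some_trips <- trips %>%
  filter(TRIP_ID %in% trip_ids[1:3]) %>%
  mutate(TRIP_ID = factor(TRIP_ID))

ggplot(some_trips) + geom_sf(aes(colour = TRIP_ID))
```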

1.2 Cartography

The visualization of the car trip may be enhanced by adding some geographical context. To that end, the boundary files of the 2016 Canadian Census can be used.

1.2.1 Census subdivisions

The census subdivision boundaries (in English, in ArcGIS format) can be downloaded and then loaded into R.

To download the data directly from R, one can use the download.file() function:
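A sketch of the download step; the URL below is indicative and should be checked against Statistics Canada's boundary-files page:

```r
dir.create("data/maps", recursive = TRUE, showWarnings = FALSE)
# 2016 census subdivision boundary file (check the current link before use)
download.file(
  "http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/lcsd000b16a_e.zip",
  destfile = "data/maps/lcsd000b16a_e.zip"
)
```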

Then, using the unzip() function, the content of the zipped file can be extracted:
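For instance:

```r
unzip("data/maps/lcsd000b16a_e.zip", exdir = "data/maps/quebec")
```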

The folder data/maps/quebec now contains, in addition to a Reference Guide, four other files in the shapefile format:

  • lcsd000b16a_e.dbf: attributes of each shape
  • lcsd000b16a_e.prj: a description of the projection
  • lcsd000b16a_e.shp: containing the feature geometry
  • lcsd000b16a_e.shx: shape index format, containing an index of the geometry

Once again, using the sf::read_sf() it is straightforward to load the geometric data:
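A sketch (the object name census_subdiv is ours):

```r
census_subdiv <- read_sf("data/maps/quebec/lcsd000b16a_e.shp")
```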

A quick glance at the content:

## Simple feature collection with 5162 features and 18 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 3689439 ymin: 659338.9 xmax: 9015737 ymax: 5242179
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_1=49 +lat_2=77 +lat_0=63.390675 +lon_0=-91.86666666666666 +x_0=6200000 +y_0=3000000 +datum=NAD83 +units=m +no_defs
## # A tibble: 5,162 x 19
##    CSDUID CSDNAME CSDTYPE PRUID PRNAME CDUID CDNAME CDTYPE CCSUID CCSNAME
##    <chr>  <chr>   <chr>   <chr> <chr>  <chr> <chr>  <chr>  <chr>  <chr>  
##  1 10011… Riverh… T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  2 10011… Admira… T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  3 10011… St. Jo… T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  4 10011… Mount … T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  5 10012… Branch  T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  6 10012… Point … T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  7 10012… St. Br… T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  8 10012… Divisi… SNO     10    Newfo… 1001  Divis… CDR    10012… Divisi…
##  9 10012… Placen… T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
## 10 10012… Fox Ha… T       10    Newfo… 1001  Divis… CDR    10012… Divisi…
## # … with 5,152 more rows, and 9 more variables: ERUID <chr>, ERNAME <chr>,
## #   SACCODE <chr>, SACTYPE <chr>, CMAUID <chr>, CMAPUID <chr>,
## #   CMANAME <chr>, CMATYPE <chr>, geometry <MULTIPOLYGON [m]>

The attributes of each geometry are described in the table below.

Variable  Description
CSDUID    Uniquely identifies a census subdivision (composed of the 2-digit province/territory unique identifier followed by the 2-digit census division code and the 3-digit census subdivision code)
CSDNAME   Census subdivision name
CSDTYPE   Census subdivisions are classified according to designations adopted by provincial/territorial or federal authorities
PRUID     Uniquely identifies a province or territory
PRNAME    Province or territory name
CDUID     Uniquely identifies a census division (composed of the 2-digit province/territory unique identifier followed by the 2-digit census division code)
CDNAME    Census division name
CDTYPE    Census division type
CCSUID    Uniquely identifies a census consolidated subdivision (composed of the 2-digit province/territory unique identifier followed by the 2-digit census division code and the 3-digit census consolidated subdivision code)
CCSNAME   Census consolidated subdivision name
ERUID     Uniquely identifies an economic region (composed of the 2-digit province/territory unique identifier followed by the 2-digit economic region code)
ERNAME    Economic region name
SACCODE   The 3-digit Statistical Area Classification code
SACTYPE   The Statistical Area Classification groups census subdivisions according to whether they are a component of a census metropolitan area, a census agglomeration, a census metropolitan influenced zone or the territories
CMAUID    Uniquely identifies a census metropolitan area/census agglomeration
CMAPUID   Uniquely identifies the provincial/territorial part of a census metropolitan area/census agglomeration (composed of the 2-digit province/territory unique identifier followed by the 3-digit census metropolitan area/census agglomeration unique identifier)
CMANAME   Census metropolitan area or census agglomeration name
CMATYPE   A one-character field identifying whether the unit is a census metropolitan area, a tracted census agglomeration or a non-tracted census agglomeration

The column PRNAME can be used to filter the geometries according to the province they belong to. Let us focus on the Province of Québec.

A focus on the surrounding area of the city of Québec can furthermore be made, using the attribute CMANAME.
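A sketch of both filters (PRNAME contains bilingual labels such as “Quebec / Québec”, hence the use of str_detect(); the object names are ours):

```r
quebec_subdiv <- census_subdiv %>%
  filter(str_detect(PRNAME, "Quebec"))  # keep the Province of Québec only

quebec_city <- quebec_subdiv %>%
  filter(CMANAME == "Québec")           # census metropolitan area of Québec
quebec_city
```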

## Simple feature collection with 29 features and 18 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 7708390 ymin: 1409112 xmax: 7791569 ymax: 1488392
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_1=49 +lat_2=77 +lat_0=63.390675 +lon_0=-91.86666666666666 +x_0=6200000 +y_0=3000000 +datum=NAD83 +units=m +no_defs
## # A tibble: 29 x 19
##    CSDUID CSDNAME CSDTYPE PRUID PRNAME CDUID CDNAME CDTYPE CCSUID CCSNAME
##  * <chr>  <chr>   <chr>   <chr> <chr>  <chr> <chr>  <chr>  <chr>  <chr>  
##  1 24190… Saint-… MÉ      24    Quebe… 2419  Belle… MRC    24190… Saint-…
##  2 24191… Beaumo… MÉ      24    Quebe… 2419  Belle… MRC    24191… Beaumo…
##  3 24200… Saint-… MÉ      24    Quebe… 2420  L'Île… MRC    24200… Saint-…
##  4 24210… Boisch… MÉ      24    Quebe… 2421  La Cô… MRC    24210… L'Ange…
##  5 24220… Shannon MÉ      24    Quebe… 2422  La Ja… MRC    24220… Sainte…
##  6 24220… Sainte… V       24    Quebe… 2422  La Ja… MRC    24220… Saint-…
##  7 24230… Notre-… PE      24    Quebe… 2423  Québec TÉ     24230… Québec 
##  8 24230… Québec  V       24    Quebe… 2423  Québec TÉ     24230… Québec 
##  9 24200… Saint-… MÉ      24    Quebe… 2420  L'Île… MRC    24200… Saint-…
## 10 24200… Saint-… MÉ      24    Quebe… 2420  L'Île… MRC    24200… Saint-…
## # … with 19 more rows, and 9 more variables: ERUID <chr>, ERNAME <chr>,
## #   SACCODE <chr>, SACTYPE <chr>, CMAUID <chr>, CMAPUID <chr>,
## #   CMANAME <chr>, CMATYPE <chr>, geometry <MULTIPOLYGON [m]>

This subset of the data contains 29 census subdivisions that can (easily) be plotted:
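A minimal sketch:

```r
ggplot(quebec_city) + geom_sf()
```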

1.2.2 Rivers

In order to provide a different example of file type and projection, let us download some data on the coastal waters from Statistics Canada.

Instead of downloading the data in ArcGIS format, let us pick the MapInfo format this time.

The data can be downloaded and unzipped as follows:
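A sketch of these two steps; the URL is a placeholder to be replaced with the actual link to the coastal waters boundary file (MapInfo format) found on Statistics Canada's website:

```r
# Hypothetical placeholder: use the link found on the boundary-files page
url_coastal <- "http://www12.statcan.gc.ca/.../coastal_waters_mapinfo.zip"
download.file(url_coastal, destfile = "data/maps/coastal.zip")
unzip("data/maps/coastal.zip", exdir = "data/maps/coastal")
```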

Again, the sf::read_sf() function can be used to load the data into R.
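read_sf() reads MapInfo files through GDAL; the .tab file name below stands for whichever file the unzipped folder actually contains (hypothetical here):

```r
coastal <- read_sf("data/maps/coastal/coastal_waters.tab")
```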

Not all the shapes from the coastal data are useful in this example, as the focus is on the Province of Québec. Unlike the census subdivisions, the coastal data do not provide information on the province in which they appear. So, instead of filtering the shapes according to the value of an attribute, we can filter them with a bounding box; we will use the bounding box of the telematics data, obtained with sf::st_bbox():

##      xmin      ymin      xmax      ymax 
## -72.19939  46.05592 -70.48567  47.60292

However, one can note, using the function sf::st_crs(), that the coordinate reference systems (CRS) of the coastal data and of the car trips do not match:
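For instance:

```r
st_crs(trips)
st_crs(coastal)
```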

## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=lcc +lat_1=49 +lat_2=77 +lat_0=63.390675 +lon_0=-91.8666666667 +x_0=6200000 +y_0=3000000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Using the sf::st_transform() function, the CRS of an object of class sf can be transformed:
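A minimal sketch, reprojecting the coastal data to the CRS of the trips:

```r
coastal <- st_transform(coastal, crs = st_crs(trips))
```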

Then, the function sf::st_crop() can be applied to crop the sf object to a rectangle:
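For instance, using the bounding box of the telematics data:

```r
coastal_cropped <- st_crop(coastal, st_bbox(trips))
```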

Plotting the filtered shapes on the map of the census subdivisions gives the following map:

The projection of the map can be changed when creating the graph, using ggplot2::coord_sf(). A list of different CRS can be consulted on spatialreference.org.

For example, if one wants the Mercator projection:
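A sketch, using EPSG:3857 (the Mercator variant used by most web maps) as an example. The same two geom_sf() layers without coord_sf() give the map mentioned above, and displaying car trips only requires one more geom_sf() layer:

```r
ggplot() +
  geom_sf(data = quebec_city) +
  geom_sf(data = coastal_cropped, fill = "lightblue") +
  coord_sf(crs = st_crs(3857))
```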

With some car trips displayed:

1.3 From points to lines

For now, the trips are defined by a series of points, which are not joined. While this makes sense in practice, we may be tempted to join the dots, to display lines rather than points. To do so, the function sf::st_cast() can be used: it casts a geometry from one type to another; here, from points to linestrings.

The procedure works as follows. First, the data are grouped by their trip ID. Then, the points of each group are combined by means of the summarise() function (more precisely, its {sf} method). The parameter do_union is set to FALSE so that the order of the points remains unchanged.
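A sketch of the whole pipeline for the three example trips (the object some_lines is ours):

```r
some_lines <- trips %>%
  filter(TRIP_ID %in% trip_ids[1:3]) %>%
  group_by(TRIP_ID) %>%
  summarise(do_union = FALSE) %>%  # keep the points in their current order
  st_cast("LINESTRING")
some_lines
```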

## Simple feature collection with 3 features and 1 field
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: -71.42125 ymin: 46.67924 xmax: -71.19778 ymax: 46.90693
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 3 x 2
##   TRIP_ID                                                          geometry
## * <chr>                                                    <LINESTRING [°]>
## 1 1_2690_0 (-71.42125 46.85641, -71.42047 46.85686, -71.41961 46.85692, -7…
## 2 1_2692_0 (-71.20576 46.90693, -71.20565 46.90663, -71.20513 46.90566, -7…
## 3 1_3312_0 (-71.29658 46.8701, -71.29324 46.86668, -71.29 46.86284, -71.28…

Eventually, the plot of these three trips can be obtained with the following lines:
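A minimal sketch:

```r
ggplot() +
  geom_sf(data = quebec_city) +
  geom_sf(data = some_lines, aes(colour = TRIP_ID))
```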

2 Some descriptive statistics

This section provides descriptive statistics for telematics data. In particular, information on travel time and distance is provided.

First, since the information on the different geometries associated with each observation is not needed in this part, let us keep only the attributes of each observation.

However, the coordinates of the points will be necessary later on, to compute the distances covered; let us therefore store them in regular columns before dropping the geometry.
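A sketch of both steps (trips_tibble is the name used later in this section):

```r
coords <- st_coordinates(trips)

trips_tibble <- trips %>%
  st_drop_geometry() %>%         # keep the attributes only
  mutate(longitude = coords[, "X"],
         latitude  = coords[, "Y"])
```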

2.1 Trip duration

Let us compute the duration of each trip. To do so, the data are grouped together using the ID column, and then, for each group, the departure and end dates of the journey are kept.
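A sketch of this step; the records being in chronological order, the first and last dates of each group give the departure and arrival times (the average of the recorded speeds is computed at the same time, see below):

```r
trip_durations <- trips_tibble %>%
  group_by(TRIP_ID) %>%
  summarise(leave  = first(date),
            arrive = last(date),
            leave_hour    = first(hour),
            leave_weekday = first(weekday),
            average_recorded_speed = mean(VITESSE, na.rm = TRUE))
trip_durations
```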

## # A tibble: 21,895 x 6
##    TRIP_ID leave               arrive              leave_hour leave_weekday
##    <chr>   <dttm>              <dttm>                   <int> <ord>        
##  1 1_100_0 2014-05-01 16:28:39 2014-05-01 17:06:19         16 Thu          
##  2 1_1000… 2014-05-01 17:12:01 2014-05-01 17:41:20         17 Thu          
##  3 1_1000… 2014-05-01 07:42:10 2014-05-01 08:12:09          7 Thu          
##  4 1_1000… 2014-04-30 17:22:07 2014-04-30 17:41:03         17 Wed          
##  5 1_1000… 2014-05-01 17:13:39 2014-05-01 17:40:21         17 Thu          
##  6 1_1001… 2014-05-01 17:43:37 2014-05-01 17:46:30         17 Thu          
##  7 1_1001… 2014-05-01 17:19:17 2014-05-01 17:47:30         17 Thu          
##  8 1_1001… 2014-05-01 08:08:55 2014-05-01 08:34:38          8 Thu          
##  9 1_1002… 2014-05-01 17:43:01 2014-05-01 17:47:25         17 Thu          
## 10 1_1002… 2014-05-01 17:25:37 2014-05-01 17:48:10         17 Thu          
## # … with 21,885 more rows, and 1 more variable:
## #   average_recorded_speed <dbl>

In the previous code snippet, it can be noted that an average of the recorded speeds is also computed. In doing so, we will be able to compare these values with those obtained by dividing the distance travelled by the travel time (in the following subsection).

As in the previous section, the functions of the {lubridate} package are called to calculate a duration between two dates.

The average journey takes 23.6 minutes, with a standard deviation of 13.7 minutes.
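A sketch of the computation and of the summary statistics shown below:

```r
trip_durations <- trip_durations %>%
  mutate(trip_interval = interval(leave, arrive),
         trip_duration = trip_interval / minutes(1))  # duration in minutes

summary(trip_durations$trip_duration)
sd(trip_durations$trip_duration)
```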

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   14.55   21.63   23.60   30.46  307.43
## [1] 13.68929

A quick peek at the distribution of trip durations can be obtained using a histogram.

One may suspect some seasonality in the data, based on the day of the week. A histogram of the durations for each day of the week allows us to explore this hypothesis.
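A sketch for the faceted version (removing the facet_wrap() layer gives the overall histogram mentioned above):

```r
ggplot(trip_durations, aes(x = trip_duration)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ leave_weekday) +
  labs(x = "Trip duration (minutes)")
```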

Visually, there does not seem to be any major difference in the shape of the distribution from one day to another. On the other hand, the number of trips made is much lower on weekends.

2.2 Distance

Each point in the data has geographic coordinates that can be used to calculate distances traveled by a driver. For a given trip, by calculating the distance between each point and the next and adding the distances obtained, the total distance of the trip can be estimated.

There are several functions available in R to calculate distances between different points. For example, it is possible to use the function spDistsN1() from {sp}. This function calculates, for points defined by their coordinates (longitude and latitude), the distance in kilometers (when the parameter longlat = TRUE) separating these points from another point for which the coordinates are also available.

For example, let us create a table that contains the coordinates of two Swiss towns, Lausanne and Zurich, and assume that we wish to know the distance from each of these cities to Basel.
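A sketch, with approximate coordinates for the three cities:

```r
library(sp)

cities <- tibble(longitude = c(6.63, 8.54),
                 latitude  = c(46.52, 47.37))  # Lausanne, Zurich
cities

basel <- matrix(c(7.5833, 47.5144), ncol = 1,
                dimnames = list(c("longitude", "latitude"), NULL))
basel
```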

## # A tibble: 2 x 2
##   longitude latitude
##       <dbl>    <dbl>
## 1      6.63     46.5
## 2      8.54     47.3
##              [,1]
## longitude  7.5833
## latitude  47.5144

The sp::spDistsN1() function expects a matrix with two columns (longitude and latitude) for the parameter pts, and a single 2D point for the parameter pt (note that the function also accepts SpatialPoints or SpatialPointsDataFrame objects).
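For instance:

```r
spDistsN1(pts = as.matrix(cities), pt = as.numeric(basel), longlat = TRUE)
```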

## [1] 137.14619  75.33957

Basel is 137 km from Lausanne and 75 km from Zurich as the crow flies.

We can use this function to estimate the distances travelled by drivers. One possible strategy is to consider a table with the coordinates of each observation point defining the path travelled, and to calculate the distances from one point to the next, using a loop. To do this, we can create a function, which will then be applied to each chunk of data corresponding to a trip. The function can return a tibble containing the trip ID in one column and the total calculated distance for that trip in another.
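A sketch of such a function, assuming the longitude and latitude columns created earlier (the NA guard covers the degenerate case of a single-point trip):

```r
compute_distance <- function(df) {
  coords <- as.matrix(df[, c("longitude", "latitude")])
  if (nrow(coords) < 2) {
    return(tibble(TRIP_ID = unique(df$TRIP_ID), distance_tot = NA_real_))
  }
  total <- 0
  for (i in 2:nrow(coords)) {
    # distance (in km) between point i-1 and point i
    total <- total + spDistsN1(pts = coords[i - 1, , drop = FALSE],
                               pt  = coords[i, ], longlat = TRUE)
  }
  tibble(TRIP_ID = unique(df$TRIP_ID), distance_tot = total)
}
```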

For example, for the trip identified by the value 1_3312_0, the function compute_distance() returns the following distance travelled:

## # A tibble: 1 x 2
##   TRIP_ID  distance_tot
##   <chr>           <dbl>
## 1 1_3312_0         11.9

As this function performs many loops, applying it to all the trips takes a long time. To speed up the computation, it is possible to parallelize the operations, using some functions from the package {parallel}.

First, the parallel::detectCores() function identifies the number of available cores. When working on our own machine, it is possible, for example, to use all the available cores but one:
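For instance:

```r
library(parallel)
n_cores <- detectCores() - 1
n_cores
```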

## [1] 7

Then, the function parallel::makeCluster() creates the clusters:
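For instance:

```r
cl <- makeCluster(n_cores)
cl
```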

## socket cluster with 7 nodes on host 'localhost'

If the code that will be evaluated in the clusters depends on packages other than the base ones, these packages need to be loaded in each of the clusters, using the function parallel::clusterEvalQ():
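A sketch, loading the packages used by compute_distance() on each cluster:

```r
invisible(clusterEvalQ(cl, {
  library(dplyr)
  library(sp)
}))
```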

Any objects from the global environment that are used by the function to be evaluated must be sent to each of the clusters, using the function parallel::clusterExport():
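For instance:

```r
clusterExport(cl, varlist = c("trips_tibble", "compute_distance"))
```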

Now, all we have to do is use one of the apply-like functions, for example from the {pbapply} package, specifying the clusters through the cl parameter of the function:
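A sketch with pbapply::pblapply(), which also displays a progress bar:

```r
library(pbapply)

distances <- pblapply(unique(trips_tibble$TRIP_ID), function(id) {
  compute_distance(trips_tibble %>% filter(TRIP_ID == id))
}, cl = cl)
```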

In the previous code snippet, the function compute_distance() is applied to subsets of the tibble trips_tibble. Each subset corresponds to the data of a single trip, whose ID is provided by the variable id, which varies at each iteration according to the values contained in unique(trips_tibble$TRIP_ID).

At the end of the iterations, the clusters can be stopped, as we no longer need them.
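For instance:

```r
stopCluster(cl)
```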

The pblapply() function returns a list. Here are the first two elements of the result:

## [[1]]
## # A tibble: 1 x 2
##   TRIP_ID  distance_tot
##   <chr>           <dbl>
## 1 1_3312_0         11.9
## 
## [[2]]
## # A tibble: 1 x 2
##   TRIP_ID  distance_tot
##   <chr>           <dbl>
## 1 1_2690_0         27.9

The function bind_rows() can be applied to that list to bind the rows of each tibble in the list.
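For instance:

```r
distances <- bind_rows(distances)
distances
```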

## # A tibble: 21,895 x 2
##    TRIP_ID  distance_tot
##    <chr>           <dbl>
##  1 1_3312_0        11.9 
##  2 1_2690_0        27.9 
##  3 1_2692_0        22.5 
##  4 1_2691_0         4.07
##  5 1_2933_0        14.0 
##  6 1_2693_0        26.7 
##  7 1_2699_0        45.7 
##  8 1_3865_0        30.3 
##  9 1_2709_0         1.09
## 10 1_2695_0        30.1 
## # … with 21,885 more rows

The average trip is 18 kilometers long in the dataset.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    0.0013    9.9586   15.2397   18.2961   22.3771 1458.3902       262

There are a few long trips in the dataset:

## # A tibble: 12 x 2
##    TRIP_ID  distance_tot
##    <chr>           <dbl>
##  1 2_2340_0        1458.
##  2 2_515_0         1195.
##  3 2_650_0         1057.
##  4 3_3888_0        1000.
##  5 2_473_0          995.
##  6 2_92_0           952.
##  7 2_476_0          929.
##  8 2_319_0          786.
##  9 1_5510_0         636.
## 10 2_404_0          628.
## 11 3_5112_0         560.
## 12 2_1265_0         516.

A histogram of the distances can be plotted:

Separating the data according to the distance can provide a different view. To do this, let us use the function dplyr::mutate() to create a column in the tibble, called type, indicating whether the trip distance is more or less than 100 km. Then, we can plot the distribution of the distances for each subset, using a facet plot.
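A sketch (NA distances are dropped beforehand):

```r
distances %>%
  filter(!is.na(distance_tot)) %>%
  mutate(type = if_else(distance_tot > 100,
                        "More than 100 km", "100 km or less")) %>%
  ggplot(aes(x = distance_tot)) +
  geom_histogram() +
  facet_wrap(~ type, scales = "free") +
  labs(x = "Distance (km)")
```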

2.3 Speed

Drivers’ average speed can be estimated from the distance and travel time estimates. A comparison with the recorded values lets us check that the distance and time estimates appear to be correct.

Let us add the information from the distance table to the table containing the durations, by making a join with the function dplyr::left_join(). The join is performed using the TRIP_ID column of both tables:
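A sketch of the join (the object name trip_stats is ours):

```r
trip_stats <- trip_durations %>%
  left_join(distances, by = "TRIP_ID")
trip_stats
```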

## # A tibble: 21,895 x 9
##    TRIP_ID leave               arrive              leave_hour leave_weekday
##    <chr>   <dttm>              <dttm>                   <int> <ord>        
##  1 1_100_0 2014-05-01 16:28:39 2014-05-01 17:06:19         16 Thu          
##  2 1_1000… 2014-05-01 17:12:01 2014-05-01 17:41:20         17 Thu          
##  3 1_1000… 2014-05-01 07:42:10 2014-05-01 08:12:09          7 Thu          
##  4 1_1000… 2014-04-30 17:22:07 2014-04-30 17:41:03         17 Wed          
##  5 1_1000… 2014-05-01 17:13:39 2014-05-01 17:40:21         17 Thu          
##  6 1_1001… 2014-05-01 17:43:37 2014-05-01 17:46:30         17 Thu          
##  7 1_1001… 2014-05-01 17:19:17 2014-05-01 17:47:30         17 Thu          
##  8 1_1001… 2014-05-01 08:08:55 2014-05-01 08:34:38          8 Thu          
##  9 1_1002… 2014-05-01 17:43:01 2014-05-01 17:47:25         17 Thu          
## 10 1_1002… 2014-05-01 17:25:37 2014-05-01 17:48:10         17 Thu          
## # … with 21,885 more rows, and 4 more variables:
## #   average_recorded_speed <dbl>, trip_interval <Interval>,
## #   trip_duration <dbl>, distance_tot <dbl>

The average speed can be estimated as follows:
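A sketch, dividing the distance (in km) by the duration converted to hours:

```r
trip_stats <- trip_stats %>%
  mutate(average_speed = distance_tot / (trip_duration / 60))  # in km/h
trip_stats
```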

## # A tibble: 21,895 x 10
##    TRIP_ID leave               arrive              leave_hour leave_weekday
##    <chr>   <dttm>              <dttm>                   <int> <ord>        
##  1 1_100_0 2014-05-01 16:28:39 2014-05-01 17:06:19         16 Thu          
##  2 1_1000… 2014-05-01 17:12:01 2014-05-01 17:41:20         17 Thu          
##  3 1_1000… 2014-05-01 07:42:10 2014-05-01 08:12:09          7 Thu          
##  4 1_1000… 2014-04-30 17:22:07 2014-04-30 17:41:03         17 Wed          
##  5 1_1000… 2014-05-01 17:13:39 2014-05-01 17:40:21         17 Thu          
##  6 1_1001… 2014-05-01 17:43:37 2014-05-01 17:46:30         17 Thu          
##  7 1_1001… 2014-05-01 17:19:17 2014-05-01 17:47:30         17 Thu          
##  8 1_1001… 2014-05-01 08:08:55 2014-05-01 08:34:38          8 Thu          
##  9 1_1002… 2014-05-01 17:43:01 2014-05-01 17:47:25         17 Thu          
## 10 1_1002… 2014-05-01 17:25:37 2014-05-01 17:48:10         17 Thu          
## # … with 21,885 more rows, and 5 more variables:
## #   average_recorded_speed <dbl>, trip_interval <Interval>,
## #   trip_duration <dbl>, distance_tot <dbl>, average_speed <dbl>

The average recorded speeds can be plotted against the computed ones.

A log scale can be used:
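A sketch for the log-scale version (dropping the two scale_* layers gives the plot on the original scale):

```r
ggplot(trip_stats, aes(x = average_recorded_speed, y = average_speed)) +
  geom_point(alpha = .1) +
  geom_abline(slope = 1, intercept = 0, colour = "red") +  # y = x reference
  scale_x_log10() +
  scale_y_log10()
```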

These two graphs suggest that the calculated average speed is underestimated relative to the recorded average, and that the greater the distance travelled, the greater the difference.

3 Visualization

3.1 Visualizing traffic on a single day, at a specific hour

We will display on a map the intensity of road traffic at a specific time on a specific day. We will start with a preliminary example, then generalize by creating a function that can display the traffic status for any desired date and time.

First, let us extract the points that were observed between 9:00 and 9:59 on the 29th April 2014:
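A possible sketch, using the date components created earlier:

```r
points_subset <- trips %>%
  filter(year == 2014, month == 4, day == 29, hour == 9)
```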

Then, let us cast the points to linestrings.

Finally, using ggplot2::geom_sf(), the car trips can be plotted on the map:
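A sketch covering the casting and plotting steps (single-point trips are dropped, since a linestring needs at least two points):

```r
lines_subset <- points_subset %>%
  group_by(TRIP_ID) %>%
  filter(n() >= 2) %>%
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")

ggplot() +
  geom_sf(data = quebec_city) +
  geom_sf(data = lines_subset, colour = "red", alpha = .5)
```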

Based on that example, a function that filters the points, casts them into linestrings and then plots the car trips for a given date and hour can be created.
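A sketch of such a function; its exact signature is an assumption (only the hour parameter is documented in the text below):

```r
plot_traffic <- function(year, month, day, hour) {
  # .env$ disambiguates the arguments from the columns of the same name
  lines_h <- trips %>%
    filter(year  == .env$year, month == .env$month,
           day   == .env$day,  hour  == .env$hour) %>%
    group_by(TRIP_ID) %>%
    filter(n() >= 2) %>%
    summarise(do_union = FALSE) %>%
    st_cast("LINESTRING")

  ggplot() +
    geom_sf(data = quebec_city) +
    geom_sf(data = lines_h, colour = "red", alpha = .5) +
    labs(title = str_c(year, "-", month, "-", day, ", ", hour, "h"))
}
```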

We can obtain the map for traffic intensity between 9:00 and 9:59 on the 29th April 2014 as follows:
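For instance (with the hypothetical signature above):

```r
plot_traffic(year = 2014, month = 4, day = 29, hour = 9)
```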

And it takes only a few seconds to check what happens between 7:00 and 7:59 on that same day:
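For instance:

```r
plot_traffic(year = 2014, month = 4, day = 29, hour = 7)
```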

3.2 Create an animated graph to explore the data

It is possible to loop over the hours of the day to create an animated representation of the data for a specific day. At each iteration, the function plot_traffic() can be evaluated with the parameter hour set according to the iteration. The loop is wrapped in the saveGIF() function from {animation} (https://cran.r-project.org/web/packages/animation/index.html), which saves the frames obtained at each iteration in a gif file.
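A sketch of the loop (the try() mechanism is detailed right after the figure):

```r
library(animation)

saveGIF({
  for (h in 5:10) {
    p <- try(plot_traffic(year = 2014, month = 4, day = 28, hour = h),
             silent = TRUE)
    if (inherits(p, "try-error")) next  # no record for that hour: skip it
    print(p)
  }
}, movie.name = "traffic_2014-04-28.gif", interval = 1)
```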

Gif of car trips on 2014-04-28 between 5am and 10am

The previous code contains some specific elements that should be explained. There might be no record of car trips for some hours. If we try to evaluate the function plot_traffic() for a time at which no car trip can be found in the data, R will throw an error and break the loop. To deal with these cases, the function try() is very useful: if the object created with the function plot_traffic() inherits the class "try-error", we can tell R how to behave. Here, we tell R to jump directly to the next iteration, using the reserved word next, and not to evaluate the rest of the code inside the loop.

It is also possible to use {gganimate}, but for now the execution time seems very long (there may be a better way to achieve this, however):
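A heavily simplified sketch of what a {gganimate} version might look like (one frame per hour, via transition_manual(); the object lines_day is ours):

```r
library(gganimate)

lines_day <- trips %>%
  filter(year == 2014, month == 4, day == 28) %>%
  group_by(TRIP_ID, hour) %>%
  filter(n() >= 2) %>%
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")

anim <- ggplot() +
  geom_sf(data = quebec_city) +
  geom_sf(data = lines_day, colour = "red") +
  transition_manual(hour)

animate(anim)
```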

3.3 Interactive map

Interactive maps using OpenStreetMap data can be created with {leaflet}, as nicely explained in Sébastien Rochette's tutorial (Introduction to mapping with {sf} & Co.).

As a first step, the coordinates of the boundaries need to be converted to the World Geodetic System 1984 (EPSG:4326), using st_transform().

Then we can create a leaflet object using the function leaflet() and add a layer containing the coordinates of the polygons that define the boundaries of the census subdivisions, with the function addPolygons():
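A sketch covering the reprojection and the first layer (the object names are ours):

```r
library(leaflet)

quebec_city_wgs84 <- st_transform(quebec_city, crs = 4326)

m <- leaflet(quebec_city_wgs84) %>%
  addPolygons(weight = 1, fillOpacity = .2)
m
```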

The function addTiles() allows us to add a tile layer to the map, with data downloaded from OpenStreetMap:
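For instance:

```r
m <- m %>% addTiles()
m
```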

Note that the order of the layers matters. In the previous example, the polygons are drawn over the OpenStreetMap data.

Eventually, the lines corresponding to car trips can be added as an additional layer, using the addPolylines() function:
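A sketch, reusing the three example trips cast to linestrings earlier (already in EPSG:4326):

```r
m <- m %>%
  addPolylines(data = some_lines, color = "red", weight = 2)
m
```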

To save the leaflet map as an HTML file, the saveWidget() function from {htmlwidgets} can be used as follows:
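For instance:

```r
library(htmlwidgets)
saveWidget(m, file = "car_trips_map.html")
```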

It may be convenient to display only certain trips, for example those corresponding to a specific time. One way to do this is to add several layers to the map, each corresponding to the trips of a given time. For the purposes of the example, let us create a new object containing the trip information for April 29th 2014, at 8am:
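A possible sketch (the object name is ours):

```r
lines_29_8 <- trips %>%
  filter(year == 2014, month == 4, day == 29, hour == 8) %>%
  group_by(TRIP_ID) %>%
  filter(n() >= 2) %>%
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")
```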

When adding the different layers, the parameter group is used to specify the group membership of each layer. The addLayersControl() function then adds a control that lets the user choose the groups to be displayed on the map:
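A sketch combining the two functions (the group labels are arbitrary):

```r
leaflet() %>%
  addTiles() %>%
  addPolygons(data = quebec_city_wgs84, weight = 1, fillOpacity = .2,
              group = "Census subdivisions") %>%
  addPolylines(data = lines_29_8, color = "red", weight = 2,
               group = "2014-04-29, 8am") %>%
  addLayersControl(overlayGroups = c("Census subdivisions", "2014-04-29, 8am"))
```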