10 Data manipulation with pandas

pandas is an open-source library built on top of NumPy that provides easy-to-use data structures and data analysis tools. Readers familiar with the basic functions of the R language will find many similar features in pandas.

To access the features of pandas, it is common practice to load the library by assigning it the alias pd:

import pandas as pd

We will also use NumPy functions (cf. Section 9). Let’s make sure this library is loaded, if it hasn’t already been done:

import numpy as np

10.1 Structures

We will look at two types of structures: series (Series) and dataframes (DataFrame).

10.1.1 Series

Series are one-dimensional tables of indexed data.

10.1.1.1 Creating Series from a List

To create one, we can define a list, then apply the Series() function of pandas:

s = pd.Series([1, 4, -1, np.nan, .5, 1])
print(s)
## 0    1.0
## 1    4.0
## 2   -1.0
## 3    NaN
## 4    0.5
## 5    1.0
## dtype: float64

The previous example shows that the series s contains both the data and an associated index. The values attribute is used to access the values, which are stored in a NumPy array:

print("values of s: ", s.values)
## values of s:  [ 1.   4.  -1.   nan  0.5  1. ]
print("type of values of s: ", type(s.values))
## type of values of s:  <class 'numpy.ndarray'>

The index is stored in a specific structure of pandas:

print("index of s: ", s.index)
## index of s:  RangeIndex(start=0, stop=6, step=1)
print("type of the index of s: ", type(s.index))
## type of the index of s:  <class 'pandas.core.indexes.range.RangeIndex'>

It is possible to assign a name to the series as well as to the index:

s.name = "my_serie"
s.index.name = "name_index"
print("name of the Series: {}, name of the index: {}".format(s.name, s.index.name))
## name of the Series: my_serie, name of the index: name_index
print("Series s: \n", s)
## Series s: 
##  name_index
## 0    1.0
## 1    4.0
## 2   -1.0
## 3    NaN
## 4    0.5
## 5    1.0
## Name: my_serie, dtype: float64

10.1.1.2 Definition of the Index

The index can be defined by the user, at the time of creating the series:

s = pd.Series([1, 4, -1, np.nan],
             index = ["o", "d", "i", "l"])
print(s)
## o    1.0
## d    4.0
## i   -1.0
## l    NaN
## dtype: float64

The index can also be defined with numerical values, without being forced to follow a specific order:

s = pd.Series([1, 4, -1, np.nan],
             index = [4, 40, 2, 3])
print(s)
## 4     1.0
## 40    4.0
## 2    -1.0
## 3     NaN
## dtype: float64

The index can be modified later, by overwriting the attribute index:

s.index = ["o", "d", "i", "l"]
print("Series s: \n", s)
## Series s: 
##  o    1.0
## d    4.0
## i   -1.0
## l    NaN
## dtype: float64

10.1.1.3 Creation of Special Series

A simple trick for creating a series with a repeated value consists in providing a scalar to the Series() function of pandas, together with an index whose length corresponds to the number of times the scalar is repeated:

s = pd.Series(5, index = np.arange(4))
print(s)
## 0    5
## 1    5
## 2    5
## 3    5
## dtype: int64

A series can be created from a dictionary:

dictionary = {"King": "Arthur",
                "Knight_Round_Table": "Percival",
                "Druid": "Merlin"}
s = pd.Series(dictionary)
print(s)
## King                    Arthur
## Knight_Round_Table    Percival
## Druid                   Merlin
## dtype: object

As noted in the previous output, the dictionary keys were used for the index. When creating the series, specific keys can be listed in the index argument: this results in retrieving only the observations corresponding to those keys:

dictionary = {"King": "Arthur",
                "Knight_Round_Table": "Percival",
                "Druid": "Merlin"}
s = pd.Series(dictionary, index = ["King", "Druid"])
print(s)
## King     Arthur
## Druid    Merlin
## dtype: object

10.1.2 Dataframes

Dataframes correspond to the data format traditionally found in economics, two-dimensional tables, with column variables and observations in rows. The columns and rows of the dataframes are indexed.

10.1.2.1 Creating Dataframes from a Dictionary

To create a dataframe, the DataFrame() function of pandas can be provided with a dictionary whose values can be transformed into series. This is the case for a dictionary in which the values associated with the keys are lists of the same length:

dict = {"height" :
               [58, 59, 60, 61, 62,
                63, 64, 65, 66, 67,
                68, 69, 70, 71, 72],
        "weight":
               [115, 117, 120, 123, 126,
                129, 132, 135, 139, 142,
                146, 150, 154, 159, 164]
       }
df = pd.DataFrame(dict)
print(df)
##     height  weight
## 0       58     115
## 1       59     117
## 2       60     120
## 3       61     123
## 4       62     126
## 5       63     129
## 6       64     132
## 7       65     135
## 8       66     139
## 9       67     142
## 10      68     146
## 11      69     150
## 12      70     154
## 13      71     159
## 14      72     164

The position of the elements in the dataframe serves as an index. As for the series, the values are accessible in the values attribute and the index in the index attribute. The columns are also indexed:

print(df.columns)
## Index(['height', 'weight'], dtype='object')

The head() method displays the first few rows (the first 5, by default). Its n argument can be modified to indicate the number of rows to be returned:

df.head(2)
##    height  weight
## 0      58     115
## 1      59     117
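Symmetrically, the tail() method returns the last rows (5 by default). A small self-contained example:

```python
import pandas as pd

df = pd.DataFrame({"height": [58, 59, 60, 61, 62],
                   "weight": [115, 117, 120, 123, 126]})
# tail() returns the last rows; here, the last 2
print(df.tail(2))
##    height  weight
## 3      61     123
## 4      62     126
```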

When creating a dataframe from a dictionary, if the name of the columns to be imported is specified by a list of strings provided in the columns argument of the DataFrame function, it is possible to define not only the columns to be filled but also their order of appearance.

For example, to import only the weight column:

df = pd.DataFrame(dict, columns = ["weight"])
print(df.head(2))
##    weight
## 0     115
## 1     117

And to define the order in which the columns will appear:

df = pd.DataFrame(dict, columns = ["weight", "height"])
print(df.head(2))
##    weight  height
## 0     115      58
## 1     117      59

If a column name that is missing from the dictionary keys is specified, the resulting dataframe will contain a column with this name but filled with the values NaN:

df = pd.DataFrame(dict, columns = ["weight", "height", "age"])
print(df.head(2))
##    weight  height  age
## 0     115      58  NaN
## 1     117      59  NaN

10.1.2.2 Creating Dataframes from Series

A dataframe can be created from a series:

s = pd.Series([1, 4, -1, np.nan], index = ["o", "d", "i", "l"])
s.name = "name_variable"
df = pd.DataFrame(s, columns = ["name_variable"])
print(df)
##    name_variable
## o            1.0
## d            4.0
## i           -1.0
## l            NaN

If the series is not named, the columns argument of the DataFrame() function should be left unfilled. In that case, the column will not have a name, just a numerical index (0).

s = pd.Series([1, 4, -1, np.nan], index = ["o", "d", "i", "l"])
df = pd.DataFrame(s)
print(df)
##      0
## o  1.0
## d  4.0
## i -1.0
## l  NaN
print(df.columns.name)
## None

10.1.2.3 Creating Dataframes from a Dictionary List

A dataframe can be created from a list of dictionaries:

dico_1 = {
    "Name": "Pendragon",
    "Surname": "Arthur",
    "Role": "King of Britain"
}
dico_2 = {
    "Name": "",
    "Surname": "Perceval",
    "Role": "Knight of the Round Table"
}

df = pd.DataFrame([dico_1, dico_2])
print(df)
##         Name                       Role   Surname
## 0  Pendragon            King of Britain    Arthur
## 1             Knight of the Round Table  Perceval

If some keys are missing in one or more of the dictionaries in the list, the corresponding values in the dataframe will be NaN:

dico_3 = {
    "Surname": "Guinevere",
    "Role": "Queen of Britain"
}
df = pd.DataFrame([dico_1, dico_2, dico_3])
print(df)
##         Name                       Role    Surname
## 0  Pendragon            King of Britain     Arthur
## 1             Knight of the Round Table   Perceval
## 2        NaN           Queen of Britain  Guinevere

10.1.2.4 Creating Dataframes from a Dictionary of Series

A dataframe can also be created from a series dictionary. To illustrate the method, let’s create two dictionaries:

# Annual 2017 GDP
# In millions of current dollars
dict_gdp_current = {
    "France": 2582501.31,
    "USA": 19390604.00,
    "UK": 2622433.96
}
# Annual consumer price index
dict_cpi = {
    "France": 0.2,
    "UK": 0.6,
    "USA": 1.3,
    "Germany": 0.5
}

From these two dictionaries, let’s create two corresponding series:

s_gdp_current = pd.Series(dict_gdp_current)
s_cpi = pd.Series(dict_cpi)

print("s_gdp_current : \n", s_gdp_current)
## s_gdp_current : 
##  France     2582501.31
## USA       19390604.00
## UK         2622433.96
## dtype: float64
print("\ns_cpi : \n", s_cpi)
## 
## s_cpi : 
##  France     0.2
## UK         0.6
## USA        1.3
## Germany    0.5
## dtype: float64

Then, let’s create a dictionary of series:

dict_from_series = {
    "gdp": s_gdp_current,
    "cpi": s_cpi
}
print(dict_from_series)
## {'gdp': France     2582501.31
## USA       19390604.00
## UK         2622433.96
## dtype: float64, 'cpi': France     0.2
## UK         0.6
## USA        1.3
## Germany    0.5
## dtype: float64}

Finally, let’s create our dataframe:

s = pd.DataFrame(dict_from_series)
print(s)
##                  gdp  cpi
## France    2582501.31  0.2
## Germany          NaN  0.5
## UK        2622433.96  0.6
## USA      19390604.00  1.3

The dict_gdp_current dictionary does not contain a Germany key, unlike the dict_cpi dictionary. When the dataframe was created, the GDP value for Germany was therefore assigned as NaN.

10.1.2.5 Creation of Dataframes from a Two-dimensional NumPy Array

A dataframe can also be created from a NumPy array. When creating it with the DataFrame() function of pandas, it is possible to specify the name of the columns (otherwise, the column labels will be numerical):

listing = [
    [1, 2, 3],
    [11, 22, 33],
    [111, 222, 333],
    [1111, 2222, 3333]
]
array_np = np.array(listing)
df = pd.DataFrame(array_np,
                  columns = ["a", "b", "c"])
print(df)
##       a     b     c
## 0     1     2     3
## 1    11    22    33
## 2   111   222   333
## 3  1111  2222  3333

10.1.2.6 Dimensions

The dimensions of a dataframe are accessed with the attribute shape.

print("shape : ", df.shape)
## shape :  (4, 3)

The number of rows can also be displayed as follows:

print("shape : ", len(df))
## shape :  4

And the number of columns:

print("shape : ", len(df.columns))
## shape :  3

10.1.2.7 Modification of the Index

As for series, the index can be modified once the dataframe has been created, by overwriting the values of the attributes index and columns, for the row and column indexes, respectively:

df.index = ["o", "d", "i", "l"]
df.columns = ["aa", "bb", "cc"]
print(df)
##      aa    bb    cc
## o     1     2     3
## d    11    22    33
## i   111   222   333
## l  1111  2222  3333

10.2 Selection

In this section, we look at different ways to select data in series and dataframes. There are two distinct ways to do this:

  • the first based on the use of brackets directly on the object from which we want to select certain parts;
  • the second based on indexers, accessible as attributes of pandas objects (loc, iloc, at, iat, etc.).

The second method avoids some confusion that may appear in the case of numerical indexes.

10.2.1 For Series

First, let’s look at ways to extract values from series.

10.2.1.1 With brackets

The index can be used to extract the data:

s = pd.Series([1, 4, -1, np.nan, .5, 1])
s[0] # 1st element of s
## 1.0
s[1:3] # From the 2nd (included) to the 4th (not included)
## 1    4.0
## 2   -1.0
## dtype: float64
s[[0,4]] # 1st and 5th elements
## 0    1.0
## 4    0.5
## dtype: float64

Note that unlike NumPy arrays or lists, negative values of the index cannot be used with brackets to retrieve data by counting from the end:

s[-2]
## KeyError: -2
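Negative positions do work, however, with the iloc indexer presented below, which always selects by position:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 4, -1, np.nan, .5, 1])
# iloc selects by position and accepts negative values, counting from the end
print(s.iloc[-2])
## 0.5
```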

In the case of an index composed of strings, we can refer either to the index labels or to the positions when extracting data from the series:

s = pd.Series([1, 4, -1, np.nan],
             index = ["o", "d", "i", "l"])
print("Series s : \n", s)
## Series s : 
##  o    1.0
## d    4.0
## i   -1.0
## l    NaN
## dtype: float64
print('s["d"] : \n', s["d"])
## s["d"] : 
##  4.0
print('s[1] : \n', s[1])
## s[1] : 
##  4.0
print("elements o and i : \n", s[["o", "i"]])
## elements o and i : 
##  o    1.0
## i   -1.0
## dtype: float64

On the other hand, when the index is defined with numerical values, extraction with brackets is done by the value of the index, not by position:

s = pd.Series([1, 4, -1, np.nan],
             index = [4, 40, 2, 3])
print(s[40])
## 4.0

10.2.1.2 With Indexers

pandas offers two main indexers for multi-axis selection: loc and iloc. The first relies on axis labels, while the second relies on integer positions.

For the purposes of this section, let’s create two series; one with a textual index, the other with a numerical index:

s_num = pd.Series([1, 4, -1, np.nan],
             index = [5, 0, 4, 1])
s_text = pd.Series([1, 4, -1, np.nan],
             index = ["c", "a", "b", "d"])

10.2.1.2.1 Extraction of a Single Element

To extract a single element with loc, we use the name (label) of the index:

print(s_num.loc[5], s_text.loc["c"])
## 1.0 1.0

To extract a single element with iloc, simply indicate its position:

(s_num.iloc[1], s_text.iloc[1])
## (4.0, 4.0)

10.2.1.2.2 Extraction of Several Elements

To extract several elements with loc, we use the names (labels) of the indices, which we provide in a list:

print("elements with labels 5 and 4:\n", s_num.loc[[5,4]])
## elements with labels 5 and 4:
##  5    1.0
## 4   -1.0
## dtype: float64
print("elements with labels c and b: \n", s_text.loc[["c", "b"]])
## elements with labels c and b: 
##  c    1.0
## b   -1.0
## dtype: float64

To extract multiple elements with iloc:

print("elements at positions 0 and 2:\n", s_num.iloc[[0,2]])
## elements at positions 0 and 2:
##  5    1.0
## 4   -1.0
## dtype: float64
print("elements at positions 0 and 2: \n", s_text.iloc[[0,2]])
## elements at positions 0 and 2: 
##  c    1.0
## b   -1.0
## dtype: float64

10.2.1.2.3 Slicing

It is possible to perform series slicing, to recover consecutive elements:

print("elements with label 5 to 4:\n", s_num.loc[5:4])
## elements with label 5 to 4:
##  5    1.0
## 0    4.0
## 4   -1.0
## dtype: float64
print("elements with label c to b: \n", s_text.loc["c":"b"])
## elements with label c to b: 
##  c    1.0
## a    4.0
## b   -1.0
## dtype: float64

Slicing also works with iloc, by position (here, the upper bound is excluded):

print("elements at positions 0 to 2 (excluded):\n", s_num.iloc[0:2])
## elements at positions 0 to 2 (excluded):
##  5    1.0
## 0    4.0
## dtype: float64
print("elements at positions 0 to 2 (excluded): \n", s_text.iloc[0:2])
## elements at positions 0 to 2 (excluded): 
##  c    1.0
## a    4.0
## dtype: float64

Note a subtlety here: when slicing with loc, both bounds are included, whereas with iloc the upper bound is excluded, as with Python lists.

10.2.1.2.4 Mask

A mask can also be used to extract elements, either using loc or iloc:

print("\n",s_num.loc[[True, False, False, True]])
## 
##  5    1.0
## 1    NaN
## dtype: float64
print("\n", s_text.loc[[True, False, False, True]])
## 
##  c    1.0
## d    NaN
## dtype: float64
print("\n", s_num.iloc[[True, False, False, True]])
## 
##  5    1.0
## 1    NaN
## dtype: float64
print("\n", s_text.iloc[[True, False, False, True]])
## 
##  c    1.0
## d    NaN
## dtype: float64

10.2.1.2.5 What’s the Point?

Why introduce such ways of extracting data rather than just using brackets on the objects? Let’s look at a simple example. Suppose we have the s_num series, whose integer index is not the sequence from 0 to the number of elements. In this case, we cannot retrieve the 2nd element by asking for s_num[1]: because the index is numerical, the brackets select by index value. To extract the 2nd element this way, we must know that its index label is 0 and ask:

print("The element whose index is 0: ", s_num[0])
## The element whose index is 0:  4.0

To extract by position instead, the iloc attribute is very useful:

print("The element in 2nd position:", s_num.iloc[1])
## The element in 2nd position: 4.0

10.2.2 For dataframes

Now let’s look at different ways to extract data from a dataframe. As examples, let’s create two dataframes: one with a numerical index, the other with a textual index:

dict = {"height" : [58, 59, 60, 61, 62],
        "weight": [115, 117, 120, 123, 126],
        "age": [28, 33, 31, 31, 29],
        "taille": [162, 156, 172, 160, 158],
       }
df_num = pd.DataFrame(dict)
df_text = pd.DataFrame(dict, index=["a", "e", "c", "b", "d"])
print("df_num : \n", df_num)
## df_num : 
##     height  weight  age  taille
## 0      58     115   28     162
## 1      59     117   33     156
## 2      60     120   31     172
## 3      61     123   31     160
## 4      62     126   29     158
print("df_text : \n", df_text)
## df_text : 
##     height  weight  age  taille
## a      58     115   28     162
## e      59     117   33     156
## c      60     120   31     172
## b      61     123   31     160
## d      62     126   29     158

To put it simply, when we want to perform an extraction with the iloc attribute, the syntax is as follows:

df.iloc[line_selection, column_selection]

with line_selection:

  • a single value: 1 (second row);
  • a list of values: [2, 1, 3] (3rd, 2nd and 4th rows);
  • a slice: 2:4 (3rd and 4th rows; the upper bound, position 4, is excluded).

for column_selection:

  • a single value: 1 (second column);
  • a list of values: [2, 1, 3] (3rd, 2nd and 4th columns);
  • a slice: 2:4 (3rd and 4th columns; the upper bound is excluded).

With loc, the syntax is as follows:

df.loc[line_selection, column_selection]

with line_selection:

  • a single value: "a" (the row named "a");
  • a list of names: ["a", "c", "b"] (the rows named "a", "c" and "b");
  • a mask: df["a"] < 10 (the rows for which the mask values are True).

with column_selection:

  • a single value: "a" (the column named "a");
  • a list of names: ["a", "c", "b"] (the columns named "a", "c" and "b");
  • a slice: "a":"c" (from the column named "a" to the column named "c", both included).

10.2.2.1 Extraction of a Row

To extract a row from a dataframe, its name (label) can be used with loc:

print("Row labeled 'e':\n", df_text.loc["e"])
## Row labeled 'e':
##  height     59
## weight    117
## age        33
## taille    156
## Name: e, dtype: int64
print("\nRow labeled 1:\n", df_num.loc[1])
## 
## Row labeled 1:
##  height     59
## weight    117
## age        33
## taille    156
## Name: 1, dtype: int64

Or, its position with iloc:

print("Row at position 0:\n", df_text.iloc[0])
## Row at position 0:
##  height     58
## weight    115
## age        28
## taille    162
## Name: a, dtype: int64
print("\nRow at position 0:\n", df_num.iloc[0])
## 
## Row at position 0:
##  height     58
## weight    115
## age        28
## taille    162
## Name: 0, dtype: int64

10.2.2.2 Extraction of Several Rows

To extract multiple rows from a dataframe, their names can be used with loc (in a list):

print("Rows named a and c:\n", df_text.loc[["a", "c"]])
## Rows named a and c:
##     height  weight  age  taille
## a      58     115   28     162
## c      60     120   31     172
print("\nRows named 0 and 2:\n", df_num.loc[[0, 2]])
## 
## Rows named 0 and 2:
##     height  weight  age  taille
## 0      58     115   28     162
## 2      60     120   31     172

Or, their position with iloc:

print("Rows at positions 0 and 3:\n", df_text.iloc[[0, 3]])
## Rows at positions 0 and 3:
##     height  weight  age  taille
## a      58     115   28     162
## b      61     123   31     160
print("\nRows at positions 0 and 3:\n", df_num.iloc[[0, 3]])
## 
## Rows at positions 0 and 3:
##     height  weight  age  taille
## 0      58     115   28     162
## 3      61     123   31     160

10.2.2.3 Slicing of Several Rows

A sequence of rows can be retrieved with loc by delimiting the first and last rows to be retrieved by their names:

print("Rows from label a to c:\n", df_text.loc["a":"c"])
## Rows from label a to c:
##     height  weight  age  taille
## a      58     115   28     162
## e      59     117   33     156
## c      60     120   31     172
print("\nRows from label 0 to 2:\n", df_num.loc[0:2])
## 
## Rows from label 0 to 2:
##     height  weight  age  taille
## 0      58     115   28     162
## 1      59     117   33     156
## 2      60     120   31     172

The attribute iloc also allows this (here, the upper bound is not included):

print("Rows at position 0 to 3 (not included):\n", df_text.iloc[0:3])
## Rows at position 0 to 3 (not included):
##     height  weight  age  taille
## a      58     115   28     162
## e      59     117   33     156
## c      60     120   31     172
print("\nRows at position 0 to 3 (not included):\n", df_num.iloc[0:3])
## 
## Rows at position 0 to 3 (not included):
##     height  weight  age  taille
## 0      58     115   28     162
## 1      59     117   33     156
## 2      60     120   31     172

10.2.2.4 Mask

A mask can also be used to select certain rows. For example, to retrieve the rows for which the variable height has a value strictly greater than 60, we use the following mask:

mask = df_text["height"] > 60
print(mask)
## a    False
## e    False
## c    False
## b     True
## d     True
## Name: height, dtype: bool

To filter:

print(df_text.loc[mask])
##    height  weight  age  taille
## b      61     123   31     160
## d      62     126   29     158

10.2.2.5 Extraction of a Single Column

To extract a column from a dataframe, we can use square brackets with the name of the column (columns are indexed by their names):

print(df_text['weight'].head(2))
## a    115
## e    117
## Name: weight, dtype: int64

By selecting a single column, we obtain a series (the dataframe index is kept for the series):

print(type(df_text['weight']))
## <class 'pandas.core.series.Series'>

A column can also be extracted by referring to the attribute of the dataframe named after this column:

print(df_text.weight.head(2))
## a    115
## e    117
## Name: weight, dtype: int64
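Attribute access is convenient but has limits: it only works when the column name is a valid Python identifier that does not collide with an existing attribute or method of the dataframe; bracket access always works. A small illustration (the column names here are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"weight": [115, 117], "my size": [58, 59]})
# df.my size would be a syntax error; bracket notation handles any column name
print(df["my size"])
## 0    58
## 1    59
## Name: my size, dtype: int64
```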

As for the series, we can rely on the attributes loc and iloc:

print("Column 2 (loc):\n", df_text.loc[:,"weight"])
## Column 2 (loc):
##  a    115
## e    117
## c    120
## b    123
## d    126
## Name: weight, dtype: int64
print("Column 2 (iloc):\n", df_text.iloc[:,1])
## Column 2 (iloc):
##  a    115
## e    117
## c    120
## b    123
## d    126
## Name: weight, dtype: int64

10.2.2.6 Extraction of Several Columns

To extract several columns, the names of the columns can be placed in a table:

print(df_text[["weight", "height"]])
##    weight  height
## a     115      58
## e     117      59
## c     120      60
## b     123      61
## d     126      62

The order in which these columns are called corresponds to the order in which they will be returned.

Again, we can use the loc attribute (the colon here specifies that we want all the rows):

print("Columns weight and height:\n", df_text.loc[:,["weight", "height"]])
## Columns weight and height:
##     weight  height
## a     115      58
## e     117      59
## c     120      60
## b     123      61
## d     126      62

And the iloc attribute:

print("Columns 2 and 1 :\n", df_num.iloc[:,[1,0]])
## Columns 2 and 1 :
##     weight  height
## 0     115      58
## 1     117      59
## 2     120      60
## 3     123      61
## 4     126      62

10.2.2.7 Slicing Several Columns

To perform a slice, the attributes loc and iloc can again be used. Note that the column names used for the slice are not placed in a list here:

With loc:

print("Columns from height to age:\n", df_text.loc[:, "height":"age"])
## Columns from height to age:
##     height  weight  age
## a      58     115   28
## e      59     117   33
## c      60     120   31
## b      61     123   31
## d      62     126   29

And with the iloc attribute:

print("Columns from position 0 to 2 (not included) :\n",
      df_text.iloc[:, 0:2])
## Columns from position 0 to 2 (not included) :
##     height  weight
## a      58     115
## e      59     117
## c      60     120
## b      61     123
## d      62     126

10.2.2.8 Extraction of Rows and Columns

Now that we have reviewed multiple ways to select one or more rows or columns, we can also mention that it is possible to make selections of columns and rows in the same instruction.

For example, with iloc, let’s select the rows from position 0 to position 2 (not included) and the columns from position 1 to 3 (not included):

print(df_text.iloc[0:2, 1:3])
##    weight  age
## a     115   28
## e     117   33

With loc, let’s select the rows named a and c, and the columns from the one named weight to the one named age:

df_text.loc[["a", "c"], "weight":"age"]
##    weight  age
## a     115   28
## c     120   31

10.3 Renaming Columns in a Dataframe

To rename a column in a dataframe, pandas offers the method rename(). Let’s take an example with our df dataframe:

dict = {"height" : [58, 59, 60, 61, 62],
        "weight": [115, 117, 120, 123, 126],
        "age": [28, 33, 31, 31, 29],
        "taille": [162, 156, 172, 160, 158],
       }
df = pd.DataFrame(dict)
print(df)
##    height  weight  age  taille
## 0      58     115   28     162
## 1      59     117   33     156
## 2      60     120   31     172
## 3      61     123   31     160
## 4      62     126   29     158

Let’s rename the column taille to height, using a dictionary specified in the columns argument, with the current name of the column as key and the new name as value:

df.rename(index=str, columns={"taille": "height"}, inplace=True)
print(df)
##    height  weight  age  height
## 0      58     115   28     162
## 1      59     117   33     156
## 2      60     120   31     172
## 3      61     123   31     160
## 4      62     126   29     158

For the change to be applied to df itself, the argument inplace is set to True; otherwise rename() leaves the dataframe unchanged.
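Without inplace=True, rename() returns a modified copy and leaves the original dataframe intact, which can be assigned to a new name:

```python
import pandas as pd

df = pd.DataFrame({"taille": [162, 156], "age": [28, 33]})
# rename() returns a new dataframe; df itself is unchanged
df2 = df.rename(columns={"taille": "height"})
print(list(df.columns))
## ['taille', 'age']
print(list(df2.columns))
## ['height', 'age']
```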

To rename several columns at the same time, we can provide several key-value pairs in the dictionary:

df.rename(index=str,
          columns={"weight": "weight_pounds", "age" : "years"},
          inplace=True)
print(df)
##    height  weight_pounds  years  height
## 0      58            115     28     162
## 1      59            117     33     156
## 2      60            120     31     172
## 3      61            123     31     160
## 4      62            126     29     158

10.4 Filtering

To filter the data in a table, depending on the values encountered for some variables, masks are used, as indicated in Section 10.2.2.4.

Let’s look at some examples here, by redefining our dataframe:

dict = {"height" : [58, 59, 60, 61, 62],
        "weight": [115, 117, 120, 123, 126],
        "age": [28, 33, 31, 31, 29],
        "height_cm": [162, 156, 172, 160, 158],
       }
df = pd.DataFrame(dict)
print(df)
##    height  weight  age  height_cm
## 0      58     115   28        162
## 1      59     117   33        156
## 2      60     120   31        172
## 3      61     123   31        160
## 4      62     126   29        158

The idea is to create a mask, i.e., a series of Boolean values, one per row. When a row of the mask contains True, the corresponding row of the dataframe to which the mask is applied is retained; when it contains False, the row is discarded.

Let’s look at a simple example, in which we want to keep observations only for which the value of the age variable is less than 30:

mask = df["age"] < 30
print(mask)
## 0     True
## 1    False
## 2    False
## 3    False
## 4     True
## Name: age, dtype: bool

Then, we simply have to apply this mask, with loc. We want all the columns, but only a few rows:

print(df.loc[mask])
##    height  weight  age  height_cm
## 0      58     115   28        162
## 4      62     126   29        158

Note: it also works without loc:

print(df[mask])
##    height  weight  age  height_cm
## 0      58     115   28        162
## 4      62     126   29        158

More simply, we can use the query() method of pandas. A Boolean expression to be evaluated is provided for this method to filter the data:

print(df.query("age<30"))
##    height  weight  age  height_cm
## 0      58     115   28        162
## 4      62     126   29        158

The query can be a little more complex, by combining comparison operators (see Section 4.2) and logical operators (see Section 4.3). For example, suppose that we want to filter the dataframe values to retain only those observations for which the height is strictly less than 62 and the weight is strictly greater than 120. The query would then be:

print(df.query("weight > 120 and height < 62"))
##    height  weight  age  height_cm
## 3      61     123   31        160

It can be noted that the following instruction gives the same result:

print(df.query("weight > 120").query("height < 62"))
##    height  weight  age  height_cm
## 3      61     123   31        160
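Logical operators can also express alternatives; for example, to keep the observations where age is below 30 or height equals 60 (the dataframe is redefined here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({"height": [58, 59, 60, 61, 62],
                   "weight": [115, 117, 120, 123, 126],
                   "age": [28, 33, 31, 31, 29]})
# 'or' keeps the rows satisfying at least one of the two conditions
print(df.query("age < 30 or height == 60"))
##    height  weight  age
## 0      58     115   28
## 2      60     120   31
## 4      62     126   29
```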

10.4.1 Checking Whether Values Belong to a Set

To create a mask indicating whether the values of a series or dataframe belong to a given set, the isin() method can be used. For example, let’s return a mask indicating whether the values in the height column of df are in the set \(\{59, 60\}\):

df.height.isin(np.arange(59,61))
## 0    False
## 1     True
## 2     True
## 3    False
## 4    False
## Name: height, dtype: bool
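The resulting mask can be used directly to filter the rows of the dataframe (a reduced df is rebuilt here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({"height": [58, 59, 60, 61, 62],
                   "weight": [115, 117, 120, 123, 126]})
# Keep only the rows whose height belongs to the set {59, 60}
print(df[df["height"].isin([59, 60])])
##    height  weight
## 1      59     117
## 2      60     120
```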

10.5 Missing Values

In economics, it is quite common to face incomplete data. pandas handles missing data using two special values: None and NaN.

The value None can be used in NumPy arrays only when their type is object.

table_none = np.array([1, 4, -1, None])
print(table_none)
## [1 4 -1 None]
print(table_none.dtype)
## object

With an array of type object, operations performed on the data are less efficient than with a numerical array (VanderPlas 2016, p. 121).

The value NaN is a floating-point value (see Section 9.1.10). NumPy handles it differently from None, and does not cast the array to the object type when NaN is present:

table_nan = np.array([1, 4, -1, np.nan])
print(table_nan)
## [ 1.  4. -1. nan]
print(table_nan.dtype)
## float64

With pandas, these two values, None and NaN can be present:

s = pd.Series([1, None, -1, np.nan])
print(s)
## 0    1.0
## 1    NaN
## 2   -1.0
## 3    NaN
## dtype: float64
print(type(s))
## <class 'pandas.core.series.Series'>

This also applies to dataframes:

dict = {"height" : [58, 59, 60, 61, np.nan],
        "weight": [115, 117, 120, 123, 126],
        "age": [28, 33, 31, np.nan, 29],
        "height_cm": [162, 156, 172, 160, 158],
       }
df = pd.DataFrame(dict)
print(df)
##    height  weight   age  height_cm
## 0    58.0     115  28.0        162
## 1    59.0     117  33.0        156
## 2    60.0     120  31.0        172
## 3    61.0     123   NaN        160
## 4     NaN     126  29.0        158

However, it should be noted that only the columns containing missing values are cast to float64:

print(df.dtypes)
## height       float64
## weight         int64
## age          float64
## height_cm      int64
## dtype: object

Note that the data with missing values is recorded with the float64 type. When working on a column with no missing values, whose type is int or bool, if a missing value is introduced, pandas will change the data type to float64 or object, respectively.

pandas offers different ways to handle missing values.

10.5.1 Identify Missing Values

With the isnull() method, a boolean mask is returned, indicating True for observations with the value NaN or None:

print(s.isnull())
## 0    False
## 1     True
## 2    False
## 3     True
## dtype: bool

To know whether a value is not missing, the notnull() method can be used:

print(s.notnull())
## 0     True
## 1    False
## 2     True
## 3    False
## dtype: bool

10.5.2 Remove Observations with Missing Values

The dropna() method removes the observations containing missing values:

print(df.dropna())
##    height  weight   age  height_cm
## 0    58.0     115  28.0        162
## 1    59.0     117  33.0        156
## 2    60.0     120  31.0        172
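
dropna() also accepts optional parameters that make the deletion more selective, among them how, subset, and axis (these parameter names come from the pandas API). A quick sketch on a reconstruction of df:

```python
import numpy as np
import pandas as pd

# reconstruction of the dataframe used in this section
df = pd.DataFrame({"height": [58, 59, 60, 61, np.nan],
                   "weight": [115, 117, 120, 123, 126],
                   "age": [28, 33, 31, np.nan, 29]})

# how="all": drop a row only when ALL of its values are missing
print(df.dropna(how="all"))

# subset: look for missing values in the listed columns only
print(df.dropna(subset=["age"]))

# axis=1: drop the COLUMNS containing missing values instead of the rows
print(df.dropna(axis=1))
```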

10.5.3 Replacing Missing Values with Other Values

To replace missing values with other values, pandas provides the fillna() method:

print(df.fillna(-9999))
##    height  weight     age  height_cm
## 0    58.0     115    28.0        162
## 1    59.0     117    33.0        156
## 2    60.0     120    31.0        172
## 3    61.0     123 -9999.0        160
## 4 -9999.0     126    29.0        158
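
The replacement value need not be a single constant: fillna() also accepts a dict giving one value per column, and the ffill() method propagates the last valid observation forward. A sketch on a reduced dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [58, 59, 60, 61, np.nan],
                   "age": [28, 33, 31, np.nan, 29]})

# a dict gives a different replacement value per column:
# the column mean for height, -1 for age
filled = df.fillna({"height": df.height.mean(), "age": -1})
print(filled)

# ffill() fills each gap with the last valid observation above it
print(df.ffill())
```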

10.6 Deletions

To delete a value on one of the axes of a series or dataframe, pandas offers the drop() method.

10.6.1 Deleting Elements in a Series

To illustrate how the drop() method works, let’s create a series with a numerical index, another with a textual index :

s_num = pd.Series([1, 4, -1, np.nan],
             index = [5, 0, 4, 1])
s_text = pd.Series([1, 4, -1, np.nan],
             index = ["c", "a", "b", "d"])

An element can be deleted from a series by using its name:

print("for s_num: \n", s_num.drop(5))
## for s_num: 
##  0    4.0
## 4   -1.0
## 1    NaN
## dtype: float64
print("\nfor s_text: \n", s_text.drop("c"))
## 
## for s_text: 
##  a    4.0
## b   -1.0
## d    NaN
## dtype: float64

We can also retrieve the name according to the position, through the index attribute of the series:

print("s_num.index[0]: ", s_num.index[0])
## s_num.index[0]:  5
print("s_text.index[0]: ", s_text.index[0])
## s_text.index[0]:  c
print("for s_num: \n", s_num.drop(s_num.index[0]))
## for s_num: 
##  0    4.0
## 4   -1.0
## 1    NaN
## dtype: float64
print("\nfor s_text: \n", s_text.drop(s_text.index[0]))
## 
## for s_text: 
##  a    4.0
## b   -1.0
## d    NaN
## dtype: float64

To delete several elements, we can provide several index names in a list using the drop() method:

print("for s_num: \n", s_num.drop([5, 4]))
## for s_num: 
##  0    4.0
## 1    NaN
## dtype: float64
print("\nfor s_text: \n", s_text.drop(["c", "b"]))
## 
## for s_text: 
##  a    4.0
## d    NaN
## dtype: float64

Again, we can retrieve the names according to the positions, through the index attribute:

print("s_num.index[[0,2]]: ", s_num.index[[0,2]])
## s_num.index[[0,2]]:  Int64Index([5, 4], dtype='int64')
print("s_text.index[[0,2]]: ", s_text.index[[0,2]])
## s_text.index[[0,2]]:  Index(['c', 'b'], dtype='object')
print("for s_num: \n", s_num.drop(s_num.index[[0,2]]))
## for s_num: 
##  0    4.0
## 1    NaN
## dtype: float64
print("\nfor s_text: \n", s_text.drop(s_text.index[[0,2]]))
## 
## for s_text: 
##  a    4.0
## d    NaN
## dtype: float64

It is also possible to use slicing to obtain the series without the element(s) (see Section 10.2.1.2.3).
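
For instance, a sketch of this slicing approach, using either positions or a mask on the index:

```python
import numpy as np
import pandas as pd

s_text = pd.Series([1, 4, -1, np.nan], index=["c", "a", "b", "d"])

# positional slicing: everything except the first element
print(s_text.iloc[1:])

# boolean mask on the index: everything except the labels "c" and "b"
print(s_text[~s_text.index.isin(["c", "b"])])
```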

10.6.2 Deleting Elements in a Dataframe

To illustrate how the drop() method works on a dataframe, let’s create one:

s_num = pd.Series([1, 4, -1, np.nan],
             index = [5, 0, 4, 1])
s_text = pd.Series([1, 4, -1, np.nan],
             index = ["c", "a", "b", "d"])
dict = {"height" : [58, 59, 60, 61, np.nan],
        "weight": [115, 117, 120, 123, 126],
        "age": [28, 33, 31, np.nan, 29],
        "height_cm": [162, 156, 172, 160, 158],
       }
df = pd.DataFrame(dict)

10.6.2.1 Deleting Rows

To delete a row from a dataframe, we can refer to its name (here, the names are numbers, but they are labels):

print("Delete the first row:  \n", df.drop(0))
## Delete the first row:  
##     height  weight   age  height_cm
## 1    59.0     117  33.0        156
## 2    60.0     120  31.0        172
## 3    61.0     123   NaN        160
## 4     NaN     126  29.0        158

If the rows have text labels, they can first be retrieved according to position through the index attribute:

label_pos_0 = df.index[0]
print("Delete the first row:  \n", df.drop(label_pos_0))
## Delete the first row:  
##     height  weight   age  height_cm
## 1    59.0     117  33.0        156
## 2    60.0     120  31.0        172
## 3    61.0     123   NaN        160
## 4     NaN     126  29.0        158

To delete several rows, the name of these rows in a list is given to the drop() method:

print("Delete the 1st and 4th rows:  \n", df.drop([0,3]))
## Delete the 1st and 4th rows:  
##     height  weight   age  height_cm
## 1    59.0     117  33.0        156
## 2    60.0     120  31.0        172
## 4     NaN     126  29.0        158

Or, by indicating the positions of the lines:

label_pos = df.index[[0, 3]]
print("Delete the 1st and 4th rows:  \n", df.drop(label_pos))
## Delete the 1st and 4th rows:  
##     height  weight   age  height_cm
## 1    59.0     117  33.0        156
## 2    60.0     120  31.0        172
## 4     NaN     126  29.0        158

It is also possible to use slicing to obtain the dataframe without the element(s) (see Sections 10.2.2.3 and 10.2.2.7).

10.6.2.2 Deleting Columns

To delete a column from a dataframe, we proceed in the same way as for rows, but by adding the parameter axis=1 to the method drop() to specify that we are interested in the columns :

print("Delete the first column:  \n", df.drop("height", axis=1))
## Delete the first column:  
##     weight   age  height_cm
## 0     115  28.0        162
## 1     117  33.0        156
## 2     120  31.0        172
## 3     123   NaN        160
## 4     126  29.0        158

We can first retrieve the labels of the columns according to their position using the columns attribute:

label_pos = df.columns[0]
print("label_pos : ", label_pos)
## label_pos :  height
print("Delete the first column:  \n", df.drop(label_pos, axis=1))
## Delete the first column:  
##     weight   age  height_cm
## 0     115  28.0        162
## 1     117  33.0        156
## 2     120  31.0        172
## 3     123   NaN        160
## 4     126  29.0        158

To delete several columns, the names of these columns are given in a list in the drop() method:

print("Delete the 1st and 4th columns:  \n",
 df.drop(["height", "height_cm"], axis = 1))
## Delete the 1st and 4th columns:  
##     weight   age
## 0     115  28.0
## 1     117  33.0
## 2     120  31.0
## 3     123   NaN
## 4     126  29.0

Or, by indicating the positions of the columns:

label_pos = df.columns[[0, 3]]
print("Delete the 1st and 4th columns:  \n", df.drop(label_pos, axis=1))
## Delete the 1st and 4th columns:  
##     weight   age
## 0     115  28.0
## 1     117  33.0
## 2     120  31.0
## 3     123   NaN
## 4     126  29.0

It is also possible to use slicing to obtain the dataframe without the element(s) (c.f. Sections 10.2.2.3 and 10.2.2.7).

10.7 Replacing Values

We will now look at how to modify one or more values, in the case of a series and then a dataframe.

10.7.1 For a Series

To modify a particular value in a series or dataframe, the equal symbol (=) can be used, having previously targeted the location of the value to be modified, using the extraction techniques explained in Section 10.2.

For example, let’s consider the following series:

s_num = pd.Series([1, 4, -1, np.nan],
             index = [5, 0, 4, 1])
print("s_num: ", s_num)
## s_num:  5    1.0
## 0    4.0
## 4   -1.0
## 1    NaN
## dtype: float64

Let’s modify the second element of s_num, to give it the value -3 :

s_num.iloc[1] = -3
print("s_num: ", s_num)
## s_num:  5    1.0
## 0   -3.0
## 4   -1.0
## 1    NaN
## dtype: float64

It is of course possible to modify several values at the same time.

Again, all we need to do is target the positions (there are many ways to do this) and provide an object of equivalent dimensions to replace the targeted values. For example, in our s_num series, we will replace the values in position 1 and 3 (2nd and 4th values) with -10 and -9 :

s_num.iloc[[1,3]] = [-10, -9]
print(s_num)
## 5     1.0
## 0   -10.0
## 4    -1.0
## 1    -9.0
## dtype: float64
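
Targeting by position is not the only option: a boolean mask can replace all values meeting a condition at once. A small sketch:

```python
import numpy as np
import pandas as pd

s_num = pd.Series([1, 4, -1, np.nan], index=[5, 0, 4, 1])

# the mask s_num < 0 targets every strictly negative value;
# all of them are replaced by 0 in a single assignment
s_num[s_num < 0] = 0
print(s_num)
```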

10.7.2 For a Dataframe

Let’s consider the following dataframe:

dict = {"city" : ["Marseille", "Aix",
                   "Marseille", "Aix", "Paris", "Paris"],
        "year": [2019, 2019, 2018, 2018,2019, 2019],
        "x": [1, 2, 2, 2, 0, 0],
        "y": [3, 3, 2, 1, 4, 4],
       }
df = pd.DataFrame(dict)
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4
## 5      Paris  2019  0  4

10.7.2.1 Changes of a Particular Value

Let’s change the value of the first line of df for the column year, so that it is 2020. First, let’s retrieve the position of the year column in the dataframe, using the get_loc() method applied to the colnames attribute of the dataframe:

pos_year = df.columns.get_loc("year")
print("pos_year: ", pos_year)
## pos_year:  1

Then, let’s make the modification:

df.iloc[0,pos_year] = 2020
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4
## 5      Paris  2019  0  4

10.7.2.2 Modifications on One or More Columns

To modify all the values in a column to place a particular value, for example a 2 in the x column of df:

df.x = 2
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  2  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  2  4
## 5      Paris  2019  2  4

It is also possible to modify the values in the column by providing a list of values:

df.x = [2, 3, 4, 2, 1, 0]
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  2  3
## 1        Aix  2019  3  3
## 2  Marseille  2018  4  2
## 3        Aix  2018  2  1
## 4      Paris  2019  1  4
## 5      Paris  2019  0  4

We can therefore imagine modifying the values of a column according to the values read in another column. For example, let’s apply the following rule: if the value of y is 2, then x is 'a'; if the value of y is 1, then x is 'b'; otherwise x is NaN. First, let’s build a list containing the values to insert (which we will name new_val), using a loop. We will go through all the elements of the y column, and at each iteration append to new_val the value obtained from our comparisons:

new_val = []
for i in np.arange(len(df.index)):
        if df.y[i] == 2:
            new_val.append("a")
        elif df.y[i] == 1:
            new_val.append("b")
        else:
            new_val.append(np.nan)
print("new_val: ", new_val)
## new_val:  [nan, nan, 'a', 'b', nan, nan]

We are ready to modify the content of the x column of df to replace it with new_val:

df.x = new_val
print("df: \n", df)
## df: 
##          city  year    x  y
## 0  Marseille  2020  NaN  3
## 1        Aix  2019  NaN  3
## 2  Marseille  2018    a  2
## 3        Aix  2018    b  1
## 4      Paris  2019  NaN  4
## 5      Paris  2019  NaN  4
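
The explicit loop works, but the same recoding can also be written without a loop, for example with the Series.map() method (an alternative to the approach above, not the one used in this book). A sketch on a reduced version of the dataframe:

```python
import pandas as pd

df = pd.DataFrame({"y": [3, 3, 2, 1, 4, 4]})

# map() looks each value of y up in the dict;
# values absent from the dict become NaN
df["x"] = df.y.map({2: "a", 1: "b"})
print(df)
```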

To replace several columns at the same time:

df[["x", "y"]] = [[2, 3, 4, 2, 1, 0], 1]
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  2  1
## 1        Aix  2019  3  1
## 2  Marseille  2018  4  1
## 3        Aix  2018  2  1
## 4      Paris  2019  1  1
## 5      Paris  2019  0  1

In the previous instruction, we replaced the contents of the x and y columns: with a list of hand-written values for x, and with the value 1 for all observations for y.

10.7.2.3 Modifications on One or More Rows

To replace a row with a constant value (of little interest in the current example):

df.iloc[1,:] = 1
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  2  1
## 1          1     1  1  1
## 2  Marseille  2018  4  1
## 3        Aix  2018  2  1
## 4      Paris  2019  1  1
## 5      Paris  2019  0  1

It may be more interesting to replace an observation as follows:

df.iloc[1,:] = ["Aix", 2018, 1, 2]
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  2  1
## 1        Aix  2018  1  2
## 2  Marseille  2018  4  1
## 3        Aix  2018  2  1
## 4      Paris  2019  1  1
## 5      Paris  2019  0  1

To replace several rows, the method is identical:

df.iloc[[1,3],:] = [
    ["Aix", 2018, 1, 2],
    ["Aix", 2018, -1, -1]
]
print("df: \n", df)
## df: 
##          city  year  x  y
## 0  Marseille  2020  2  1
## 1        Aix  2018  1  2
## 2  Marseille  2018  4  1
## 3        Aix  2018 -1 -1
## 4      Paris  2019  1  1
## 5      Paris  2019  0  1

10.8 Adding Values

Now let’s look at how to add values, first in a series, then in a dataframe.

10.8.1 For a Series

Let’s consider the following series:

s_num = pd.Series([1, 4, -1, np.nan],
             index = [5, 0, 4, 1])
print("s_num: ", s_num)
## s_num:  5    1.0
## 0    4.0
## 4   -1.0
## 1    NaN
## dtype: float64

10.8.1.1 Adding a Single Value in a Series

To add a value, we use the append() method. Here, with s_num, as the index is manual, we are compelled to provide a series with a value for the index as well:

s_num_2 = pd.Series([1], index = [2])
print("s_num_2: \n", s_num_2)
## s_num_2: 
##  2    1
## dtype: int64
s_num = s_num.append(s_num_2)
print("s_num: \n", s_num)
## s_num: 
##  5    1.0
## 0    4.0
## 4   -1.0
## 1    NaN
## 2    1.0
## dtype: float64

Note that the append() method returns a new object and leaves the original series unchanged. To effectively add values, we must make a new assignment.

By having a series with an automatically generated numerical index, we can specify the value True for the ignore_index argument of the append() method to indicate that we do not take into account the value of the index of the object we add :

s = pd.Series([10, 2, 4])
s = s.append(pd.Series([2]), ignore_index=True)
print("s: \n", s)
## s: 
##  0    10
## 1     2
## 2     4
## 3     2
## dtype: int64
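
Note that in recent pandas versions Series.append() is deprecated (since 1.4) and has been removed in pandas 2.0; the pd.concat() function produces the same result:

```python
import pandas as pd

s = pd.Series([10, 2, 4])

# pd.concat is the replacement recommended by pandas for append()
s = pd.concat([s, pd.Series([2])], ignore_index=True)
print(s)
```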

10.8.1.2 Adding Several Values in a Series

To add several values, we again use the append() method, this time providing a series containing several elements (and, since the index of s_num is manual, values for their index as well). Without a new assignment, s_num itself remains unchanged:

s_num_2 = pd.Series([1, -2], index = [2, 3])
s_num.append(s_num_2)
## 5    1.0
## 0    4.0
## 4   -1.0
## 1    NaN
## 2    1.0
## 2    1.0
## 3   -2.0
## dtype: float64
print("s_num: ", s_num)
## s_num:  5    1.0
## 0    4.0
## 4   -1.0
## 1    NaN
## 2    1.0
## dtype: float64

By having a series with an automatically generated numerical index:

s = pd.Series([10, 2, 4])
s.append(pd.Series([2]), ignore_index=True)
## 0    10
## 1     2
## 2     4
## 3     2
## dtype: int64

10.8.2 For a dataframe

Let’s go back to our dataframe:

dict = {"city" : ["Marseille", "Aix",
                   "Marseille", "Aix", "Paris", "Paris"],
        "year": [2019, 2019, 2018, 2018,2019, 2019],
        "x": [1, 2, 2, 2, 0, 0],
        "y": [3, 3, 2, 1, 4, 4],
       }
df = pd.DataFrame(dict)
print("df : \n", df)
## df : 
##          city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4
## 5      Paris  2019  0  4

10.8.2.1 Adding a Row in a Dataframe

As for a series, to add a row, we use the append() method. First, let’s create a new dataframe with the line to add:

new_row = pd.DataFrame([["Marseille", "2021", 2, 4]],
                       columns = df.columns)
print("new_row: \n", new_row)
## new_row: 
##          city  year  x  y
## 0  Marseille  2021  2  4

We made sure to have the same column names here, by indicating in the columns parameter of the pd.DataFrame function the names of the df columns, i.e., df.columns.

Let’s add the new row to df:

df = df.append(new_row, ignore_index=True)

Again, the append() method applied to a dataframe returns a new object and does not modify the original; a new assignment is required to keep the result.

It can be noted that when adding a row whose column names are not given in the same order as in the destination dataframe, the resulting column order is controlled by the sort parameter of the append() method:

  • if sort=True, the columns of the result are sorted alphabetically;
  • if sort=False, the column order of the destination dataframe is kept, and any new columns are appended at the end.

new_row = pd.DataFrame([["2021", "Marseille", 2, 4]],
                       columns = ["year", "city", "x", "y"])
print("new_row: \n", new_row)
## new_row: 
##     year       city  x  y
## 0  2021  Marseille  2  4
print("with sort=True: \n",
  df.append(new_row, ignore_index=True, sort = True))
## with sort=True: 
##         city  x  y  year
## 0  Marseille  1  3  2019
## 1        Aix  2  3  2019
## 2  Marseille  2  2  2018
## 3        Aix  2  1  2018
## 4      Paris  0  4  2019
## 5      Paris  0  4  2019
## 6  Marseille  2  4  2021
## 7  Marseille  2  4  2021

10.8.2.2 Adding Multiple Rows in a Dataframe

To add several rows, the principle is exactly the same: we append a dataframe containing several rows, again with matching column names.

The rows to be inserted:

new_rows = pd.DataFrame([
    ["Marseille", "2022", 2, 4],
    ["Aix", "2022", 3, 3]],
    columns = df.columns)
print("new_rows: \n", new_rows)
## new_rows: 
##          city  year  x  y
## 0  Marseille  2022  2  4
## 1        Aix  2022  3  3

Then the insertion:

df = df.append(new_rows, ignore_index=True)
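
As with series, DataFrame.append() is deprecated in recent pandas versions (and removed in 2.0); pd.concat() does the same job. A sketch on a reduced dataframe:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Marseille", "Aix"], "year": [2019, 2019]})
new_rows = pd.DataFrame({"city": ["Paris"], "year": [2021]})

# pd.concat stacks the dataframes vertically (axis=0 is the default)
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
```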

10.8.2.3 Adding a Column to a Dataframe

To add a column in a dataframe:

from numpy import random
df["z"] = random.rand(len(df.index))
print("df: \n", df)
## df: 
##          city  year  x  y         z
## 0  Marseille  2019  1  3  0.117443
## 1        Aix  2019  2  3  0.393782
## 2  Marseille  2018  2  2  0.452730
## 3        Aix  2018  2  1  0.538148
## 4      Paris  2019  0  4  0.790622
## 5      Paris  2019  0  4  0.465836
## 6  Marseille  2021  2  4  0.435332
## 7  Marseille  2022  2  4  0.569479
## 8        Aix  2022  3  3  0.969259

10.8.2.4 Adding Multiple Columns to a Dataframe

To add several columns:

df["a"] = random.rand(len(df.index))
df["b"] = random.rand(len(df.index))
print("df: \n", df)
## df: 
##          city  year  x  y         z         a         b
## 0  Marseille  2019  1  3  0.117443  0.040556  0.689236
## 1        Aix  2019  2  3  0.393782  0.548120  0.929546
## 2  Marseille  2018  2  2  0.452730  0.462577  0.918117
## 3        Aix  2018  2  1  0.538148  0.376472  0.975302
## 4      Paris  2019  0  4  0.790622  0.327912  0.397002
## 5      Paris  2019  0  4  0.465836  0.813529  0.262626
## 6  Marseille  2021  2  4  0.435332  0.646552  0.430151
## 7  Marseille  2022  2  4  0.569479  0.047426  0.764531
## 8        Aix  2022  3  3  0.969259  0.994958  0.599731
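
Several columns can also be added in a single expression with the assign() method, which returns a copy rather than modifying the dataframe in place. A quick sketch (the column names z and ratio are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 0], "y": [3, 3, 4]})

# assign() returns a COPY with the new columns added; df itself is unchanged
df2 = df.assign(z=df.x + df.y, ratio=lambda d: d.x / d.y)
print(df2)
```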

10.9 Removing Duplicate Values

To remove duplicate values from a dataframe, pandas offers the drop_duplicates() method, which takes several optional arguments:

  • subset: by indicating one or more column names, the search for duplicates is only done on these columns;

  • keep: allows to indicate which observation to keep in case of identified duplicates:

    • if keep='first', all duplicates are removed except the first occurrence,
    • if keep='last', all duplicates are removed except the last occurrence,
    • if keep=False, all duplicates are removed;
  • inplace: boolean (default: False) to indicate if duplicates should be removed from the dataframe or if a copy should be returned (default).

Let’s give some examples using the following dataframe, which contains one duplicated row (rows 4 and 5) when considered in its entirety. If we focus only on years or cities, or both, other duplicates can be identified.

dict = {"city" : ["Marseille", "Aix",
                   "Marseille", "Aix", "Paris", "Paris"],
        "year": [2019, 2019, 2018, 2018,2019, 2019],
        "x": [1, 2, 2, 2, 0, 0],
        "y": [3, 3, 2, 1, 4, 4],
       }
df = pd.DataFrame(dict)
print(df)
##         city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4
## 5      Paris  2019  0  4

To remove the duplicates:

print(df.drop_duplicates())
##         city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4

Remove duplicates by keeping the last value of the identified duplicates:

df.drop_duplicates(keep='last')
##         city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 5      Paris  2019  0  4

To remove identified duplicates when focusing on city names, and keeping only the first value :

print(df.drop_duplicates(subset = ["city"], keep = 'first'))
##         city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 4      Paris  2019  0  4

Same as above but with a focus on couples (city, year)

print(df.drop_duplicates(subset = ["city", "year"], keep = 'first'))
##         city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4

Note that the original dataframe was not impacted, since we did not touch the inplace argument. If we now ask that the changes be made on the dataframe instead of getting a copy:

df.drop_duplicates(subset = ["city", "year"], keep = 'first', inplace = True)
print(df)
##         city  year  x  y
## 0  Marseille  2019  1  3
## 1        Aix  2019  2  3
## 2  Marseille  2018  2  2
## 3        Aix  2018  2  1
## 4      Paris  2019  0  4

To find out whether a value is duplicated in a dataframe, pandas offers the duplicated() method, which returns a mask indicating for each observation whether it is duplicated or not. Its parameters are similar to those of drop_duplicates(), except that inplace is not present.

print(df.duplicated(subset = ["city"], keep = 'first'))
## 0    False
## 1    False
## 2     True
## 3     True
## 4    False
## dtype: bool

We can use the any() method to find out if there are duplicates:

print(df.duplicated(subset = ["city"], keep = 'first').any())
## True
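
Since True counts as 1, the sum() method on the same mask gives the number of duplicated observations. A sketch, keeping only the city column of the original dataframe (before the inplace deduplication above):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Marseille", "Aix", "Marseille",
                            "Aix", "Paris", "Paris"]})

# sum() on the boolean mask counts the duplicated observations
n_dupes = df.duplicated(subset=["city"], keep="first").sum()
print(n_dupes)
## 3
```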

10.10 Operations

It is often necessary to perform operations on the columns of a dataframe, especially when it comes to creating a new variable.

By using the principles of column modification (see Section 10.7), it is quite easy to imagine that it is possible to apply the functions and methods of NumPy (see Section 9.1) to the values of the columns.

For example, let’s consider the following dataframe:

dict = {"height" :
               [58, 59, 60, 61, 62,
                63, 64, 65, 66, 67,
                68, 69, 70, 71, 72],
        "weight":
               [115, 117, 120, 123, 126,
                129, 132, 135, 139, 142,
                146, 150, 154, 159, 164]
       }
df = pd.DataFrame(dict)
print(df)
##     height  weight
## 0       58     115
## 1       59     117
## 2       60     120
## 3       61     123
## 4       62     126
## 5       63     129
## 6       64     132
## 7       65     135
## 8       66     139
## 9       67     142
## 10      68     146
## 11      69     150
## 12      70     154
## 13      71     159
## 14      72     164

Let’s add the column height_2, containing the values of the height column raised to the square:

df["height_2"] = df.height**2
print(df.head(3))
##    height  weight  height_2
## 0      58     115      3364
## 1      59     117      3481
## 2      60     120      3600

Now, let’s add the column bmi, providing the body mass indicator values for individuals in the dataframe (\(\text{bmi} = \frac{\text{weight}}{\text{height}^2}\)) :

df["bmi"] = df.weight / df.height_2
print(df.head(3))
##    height  weight  height_2       bmi
## 0      58     115      3364  0.034185
## 1      59     117      3481  0.033611
## 2      60     120      3600  0.033333
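
When the computation cannot be written with vectorized operators, the apply() method can call an arbitrary function on each row (axis=1). A sketch on a reduced dataframe; the vectorized version above remains preferable whenever it is possible, as it is much faster:

```python
import pandas as pd

df = pd.DataFrame({"height": [58, 59, 60], "weight": [115, 117, 120]})

# apply() with axis=1 calls the function once per row (slower, but flexible)
df["bmi"] = df.apply(lambda row: row.weight / row.height**2, axis=1)
print(df)
```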

10.10.1 Statistics

pandas offers some methods for performing descriptive statistics for each column or row. To do this, the syntax is as follows (all arguments have a default value, the list is simplified here):

dataframe.stat_method(axis, skipna)
  • axis: 0 to compute the statistic for each column (default), 1 for each row
  • skipna: if True, excludes missing values from the calculations

Among the available methods:

  • mean(): mean
  • mode(): mode
  • median(): median
  • std(): standard deviation
  • min(): minimum
  • max(): maximum
  • mad(): mean absolute deviation
  • sum(): sum
  • prod(): product
  • count(): count of the elements

For example, to calculate the average of the values for each column:

dico = {"height" : [58, 59, 60, 61, 62],
        "weight": [115, 117, 120, 123, 126],
        "age": [28, 33, 31, 31, 29],
        "height_cm": [162, 156, 172, 160, 158],
        "married": [True, True, False, False, True],
        "city": ["A", "B", "B", "B", "A"]
       }
df = pd.DataFrame(dico)
print(df.mean())
## height        60.0
## weight       120.2
## age           30.4
## height_cm    161.6
## married        0.6
## dtype: float64

If desired, we can also compute the mean of each row (without any real meaning here):

print(df.mean(axis=1))
## 0    72.8
## 1    73.2
## 2    76.6
## 3    75.0
## 4    75.2
## dtype: float64

These functions can be applied to a single column. For example, to display the minimum value:

print("min: ", df.height.min())
## min:  58

It is also useful to be able to obtain the position of the min and max values; this can be obtained with the methods idxmin() and idxmax(), respectively.

print("pos min: ", df.height.idxmin())
## pos min:  0
print("pos max: ", df.height.idxmax())
## pos max:  4

A very useful method is describe(), which returns descriptive statistics on all numerical columns:

print(df.describe())
##           height      weight        age   height_cm
## count   5.000000    5.000000   5.000000    5.000000
## mean   60.000000  120.200000  30.400000  161.600000
## std     1.581139    4.438468   1.949359    6.228965
## min    58.000000  115.000000  28.000000  156.000000
## 25%    59.000000  117.000000  29.000000  158.000000
## 50%    60.000000  120.000000  31.000000  160.000000
## 75%    61.000000  123.000000  31.000000  162.000000
## max    62.000000  126.000000  33.000000  172.000000
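
By default describe() ignores non-numerical columns; its include argument extends the summary to them. A sketch on a reduced dataframe:

```python
import pandas as pd

df = pd.DataFrame({"height": [58, 59, 60, 61, 62],
                   "city": ["A", "B", "B", "B", "A"]})

# include="all" adds rows summarizing non-numerical columns:
# count, number of unique values, most frequent value (top) and its frequency
desc = df.describe(include="all")
print(desc)
```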

10.11 Sorting

It is easy to sort a dataframe in ascending or descending order by one or more of its columns. To do this, we use the method sort_values(). The syntax is as follows:

DataFrame.sort_values(by, axis=0, ascending=True,
                      inplace=False, kind="quicksort",
                      na_position="last")
  • by: name or list of names of the column(s) used to sort
  • axis: 0 for the index (default), 1 for the columns
  • ascending: boolean or boolean list, when True the sorting is done by increasing values (default), when False it is done by decreasing values
  • inplace: if True, the sorting affects the dataframe, otherwise it returns a view
  • kind: choice of sorting algorithm (quicksort (default), mergesort, heapsort)
  • na_position: if first, the missing values are placed at the beginning; if last (default), at the end.

Let us give some examples:

dict = {"height" : [58, 59, 60, 61, 62],
        "weight": [115, np.nan, 120, 123, 126],
        "age": [28, 33, 31, 31, 29],
        "height_cm": [162, 156, 172, 160, 158],
        "married": [True, True, np.nan, False, True],
        "city": ["A", "B", "B", "B", "A"]
       }
df = pd.DataFrame(dict)

If we sort the values in descending order of the values in the height column:

df.sort_values(by="height", ascending=False)
##    height  weight  age  height_cm married city
## 4      62   126.0   29        158    True    A
## 3      61   123.0   31        160   False    B
## 2      60   120.0   31        172     NaN    B
## 1      59     NaN   33        156    True    B
## 0      58   115.0   28        162    True    A

To sort in ascending order of the married values (recall that True is interpreted as 1 and False as 0), then in descending order of weight, placing the NaN values first:

df.sort_values(by=["married", "weight"],
               ascending=[True, False],
               na_position="first")
##    height  weight  age  height_cm married city
## 2      60   120.0   31        172     NaN    B
## 3      61   123.0   31        160   False    B
## 1      59     NaN   33        156    True    B
## 4      62   126.0   29        158    True    A
## 0      58   115.0   28        162    True    A

Note that the NaN values are placed at the beginning within each subgroup formed according to the married values.

10.12 Concatenation

It is frequent to obtain data from multiple sources when conducting an analysis. It is then important to be able to combine the different sources into one. In this section, we will limit ourselves to concatenating different dataframes, in simple cases where we know a priori that all we have to do is put two dataframes side by side or one below the other. The case of more elaborate joins, with matching on one or more columns, is discussed in Section 10.13.

First, let’s create two dataframes with the same number of rows:

x_1 = pd.DataFrame(np.random.randn(5, 4),
                   columns=["a", "b", "c", "d"])
x_2 = pd.DataFrame(np.random.randn(5, 2),
                   columns = ["e", "f"])
print("x_1: \n", x_1)
## x_1: 
##            a         b         c         d
## 0  0.231711 -0.474710 -0.309147 -2.032396
## 1 -0.174468 -0.642475 -0.625023  1.325887
## 2  0.531255  1.275284 -0.682826 -0.948186
## 3  0.777362  0.325113 -1.203486  1.209543
## 4  0.157622 -0.293555  0.111560  0.597679
print("\nx_2: \n", x_2)
## 
## x_2: 
##            e         f
## 0 -1.270093  0.120949
## 1 -0.193898  1.804172
## 2 -0.234694  0.939908
## 3 -0.171520 -0.153055
## 4 -0.363095 -0.067318

To “paste” the dataframe x_2 to the right of x_1, we can use the concat() function of pandas. To indicate that the concatenation is performed on the columns, we specify the value 1 for the axis parameter:

print(pd.concat([x_1, x_2], axis = 1))
##           a         b         c         d         e         f
## 0  0.231711 -0.474710 -0.309147 -2.032396 -1.270093  0.120949
## 1 -0.174468 -0.642475 -0.625023  1.325887 -0.193898  1.804172
## 2  0.531255  1.275284 -0.682826 -0.948186 -0.234694  0.939908
## 3  0.777362  0.325113 -1.203486  1.209543 -0.171520 -0.153055
## 4  0.157622 -0.293555  0.111560  0.597679 -0.363095 -0.067318

To paste the dataframes below each other, the append() method can be used, as described in Section 10.8.2.1, or the concat() function can also be used.

x_3 = pd.DataFrame(np.random.randn(5, 2),
                   columns = ["e", "f"])
print("x_3: \n", x_3)
## x_3: 
##            e         f
## 0  1.444721  0.325771
## 1 -0.855732 -0.697595
## 2 -0.276134 -1.258759
## 3  0.478094 -0.859764
## 4  0.571988 -0.173965

Let’s add the observations of x_3 below those of x_2:

print(pd.concat([x_2, x_3], axis = 0))
##           e         f
## 0 -1.270093  0.120949
## 1 -0.193898  1.804172
## 2 -0.234694  0.939908
## 3 -0.171520 -0.153055
## 4 -0.363095 -0.067318
## 0  1.444721  0.325771
## 1 -0.855732 -0.697595
## 2 -0.276134 -1.258759
## 3  0.478094 -0.859764
## 4  0.571988 -0.173965

As can be seen, the line indices of x_2 and x_3 have been kept unchanged. If we want the result to be renumbered, we can specify it via the ignore_index argument:

print(pd.concat([x_2, x_3], axis = 0, ignore_index=True))
##           e         f
## 0 -1.270093  0.120949
## 1 -0.193898  1.804172
## 2 -0.234694  0.939908
## 3 -0.171520 -0.153055
## 4 -0.363095 -0.067318
## 5  1.444721  0.325771
## 6 -0.855732 -0.697595
## 7 -0.276134 -1.258759
## 8  0.478094 -0.859764
## 9  0.571988 -0.173965

If the column names are not identical, NaN values will be inserted:

x_4 = pd.DataFrame(np.random.randn(5, 2),
                   columns = ["e", "g"])
print("x_4: \n", x_4)
## x_4: 
##            e         g
## 0  1.534900  0.872856
## 1  1.856835  0.025914
## 2  0.171984 -0.191163
## 3 -0.292936  1.655677
## 4 -0.207182 -0.686884
pd.concat([x_2, x_4], axis = 0, sort=False, ignore_index=True)
##           e         f         g
## 0 -1.270093  0.120949       NaN
## 1 -0.193898  1.804172       NaN
## 2 -0.234694  0.939908       NaN
## 3 -0.171520 -0.153055       NaN
## 4 -0.363095 -0.067318       NaN
## 5  1.534900       NaN  0.872856
## 6  1.856835       NaN  0.025914
## 7  0.171984       NaN -0.191163
## 8 -0.292936       NaN  1.655677
## 9 -0.207182       NaN -0.686884
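Conversely, if we only want to keep the columns shared by all the dataframes, concat() accepts the argument join="inner". A small sketch with illustrative values:

```python
import pandas as pd

x_2 = pd.DataFrame({"e": [1.0, 2.0], "f": [3.0, 4.0]})
x_4 = pd.DataFrame({"e": [5.0, 6.0], "g": [7.0, 8.0]})

# join="inner" keeps only the columns common to all the dataframes,
# so the non-shared columns f and g are dropped instead of filled with NaN
res = pd.concat([x_2, x_4], axis=0, join="inner", ignore_index=True)
print(list(res.columns))  # ['e']
```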

10.13 Joins

In practice, it is more common to use slightly more elaborate joins to bring the different data sources together into one. pandas offers a powerful way to do so: the merge() function.

To illustrate the different joins in this section, let’s create some dataframes:

exports_fr = pd.DataFrame(
    {"country" : "France",
     "year" : np.arange(2014, 2017),
     "exports" : [816.8192172, 851.6632573, 867.4014253]
    })

imports_fr = pd.DataFrame(
    {"country" : "France",
     "year" : np.arange(2015, 2018),
     "imports" : [898.5242962, 936.3691166, 973.8762149]
    })

exports_us = pd.DataFrame(
    {"country" : "USA",
     "year" : np.arange(2014, 2017),
     "exports" : [2208.678084, 2217.733347, 2210.442218]
    })

imports_us = pd.DataFrame(
    {"country" : "USA",
     "year" : np.arange(2015, 2018),
     "imports" : [2827.336251, 2863.264745, np.nan]
    })

imports_morocco = pd.DataFrame(
    {"pays" : "Morocco",
     "annee" : np.arange(2015, 2018),
     "imports" : [46.39884177, 53.52375588, 56.68165748]
    })
exports_morocco = pd.DataFrame(
    {"country" : "Morocco",
     "year" : np.arange(2014, 2017),
     "exports" : [35.50207915, 37.45996653, 39.38228396]
    })

exports = pd.concat([exports_fr, exports_us], ignore_index=True)
imports = pd.concat([imports_fr, imports_us], ignore_index=True)

print("exports: \n", exports)
## exports: 
##    country  year      exports
## 0  France  2014   816.819217
## 1  France  2015   851.663257
## 2  France  2016   867.401425
## 3     USA  2014  2208.678084
## 4     USA  2015  2217.733347
## 5     USA  2016  2210.442218
print("\nimports: \n", imports)
## 
## imports: 
##    country  year      imports
## 0  France  2015   898.524296
## 1  France  2016   936.369117
## 2  France  2017   973.876215
## 3     USA  2015  2827.336251
## 4     USA  2016  2863.264745
## 5     USA  2017          NaN

The merge() function of pandas requires specifying the left table (which we will call here x) via the left argument, and the right table (which we will call here y), joined onto it, via the right argument.

By default, the merge() function performs an inner join, i.e., it keeps all rows of x that match rows of y, and all columns of x and y appear in the result of the join:

print(pd.merge(left = imports, right = exports))
##   country  year      imports      exports
## 0  France  2015   898.524296   851.663257
## 1  France  2016   936.369117   867.401425
## 2     USA  2015  2827.336251  2217.733347
## 3     USA  2016  2863.264745  2210.442218

If we want to change the type of join, we can modify the value of the how parameter of the merge() function, to give it one of the following values:

  • left: all rows of x, and all columns of x and y. Rows of x for which there is no match in y will have NaN values in the new columns. If there are several matches in the keys between x and y, all combinations are returned
  • inner: all rows of x for which there are corresponding values in y, and all columns of x and y. If there are several matches in the keys between x and y, all possible combinations are returned
  • right: all rows of y, and all columns of y and x. Rows of y for which there is no match in x will have NaN values in the new columns. If there are several matches in the keys between y and x, all combinations are returned
  • outer: all rows of x and y, and all columns of x and y. Rows of x for which there is no match in y, and rows of y for which there is no match in x, will have NaN values.

print("left: \n", pd.merge(left = imports, right = exports, how="left"))
## left: 
##    country  year      imports      exports
## 0  France  2015   898.524296   851.663257
## 1  France  2016   936.369117   867.401425
## 2  France  2017   973.876215          NaN
## 3     USA  2015  2827.336251  2217.733347
## 4     USA  2016  2863.264745  2210.442218
## 5     USA  2017          NaN          NaN
print("\nright: \n", pd.merge(left = imports, right = exports, how="right"))
## 
## right: 
##    country  year      imports      exports
## 0  France  2015   898.524296   851.663257
## 1  France  2016   936.369117   867.401425
## 2     USA  2015  2827.336251  2217.733347
## 3     USA  2016  2863.264745  2210.442218
## 4  France  2014          NaN   816.819217
## 5     USA  2014          NaN  2208.678084
print("\nouter: \n", pd.merge(left = imports, right = exports, how="outer"))
## 
## outer: 
##    country  year      imports      exports
## 0  France  2015   898.524296   851.663257
## 1  France  2016   936.369117   867.401425
## 2  France  2017   973.876215          NaN
## 3     USA  2015  2827.336251  2217.733347
## 4     USA  2016  2863.264745  2210.442218
## 5     USA  2017          NaN          NaN
## 6  France  2014          NaN   816.819217
## 7     USA  2014          NaN  2208.678084
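When checking a join, the indicator argument of merge() is handy: it adds a _merge column recording the origin of each row. A minimal sketch with made-up figures (no key matches here, so every row comes from one side only):

```python
import pandas as pd

imports = pd.DataFrame({"country": ["France", "USA"],
                        "year": [2017, 2017],
                        "imports": [973.9, None]})
exports = pd.DataFrame({"country": ["France", "USA"],
                        "year": [2016, 2016],
                        "exports": [867.4, 2210.4]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only" or "both"
res = pd.merge(left=imports, right=exports, how="outer", indicator=True)
print(res["_merge"].value_counts().to_dict())
```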

The on argument, which expects a column name or list of names, is used to designate the columns used to make the join. The column names must be identical in both dataframes.

print(pd.merge(left = imports, right = exports, on = "country"))
##    country  year_x      imports  year_y      exports
## 0   France    2015   898.524296    2014   816.819217
## 1   France    2015   898.524296    2015   851.663257
## 2   France    2015   898.524296    2016   867.401425
## 3   France    2016   936.369117    2014   816.819217
## 4   France    2016   936.369117    2015   851.663257
## 5   France    2016   936.369117    2016   867.401425
## 6   France    2017   973.876215    2014   816.819217
## 7   France    2017   973.876215    2015   851.663257
## 8   France    2017   973.876215    2016   867.401425
## 9      USA    2015  2827.336251    2014  2208.678084
## 10     USA    2015  2827.336251    2015  2217.733347
## 11     USA    2015  2827.336251    2016  2210.442218
## 12     USA    2016  2863.264745    2014  2208.678084
## 13     USA    2016  2863.264745    2015  2217.733347
## 14     USA    2016  2863.264745    2016  2210.442218
## 15     USA    2017          NaN    2014  2208.678084
## 16     USA    2017          NaN    2015  2217.733347
## 17     USA    2017          NaN    2016  2210.442218

If the names of the columns used to make the join are different between the left and right dataframe, the argument left_on indicates the column name(s) of the left dataframe to be used for the join; and the argument right_on indicates the corresponding name(s) in the right dataframe:

pd.merge(left = imports_morocco, right = exports_morocco,
         left_on= ["pays", "annee"], right_on = ["country", "year"] )
##       pays  annee    imports  country  year    exports
## 0  Morocco   2015  46.398842  Morocco  2015  37.459967
## 1  Morocco   2016  53.523756  Morocco  2016  39.382284

The suffixes argument defines the suffixes to add to the names of columns that appear in both x and y but are not used for the join. By default, the suffixes _x and _y are used.

print(pd.merge(left = imports, right = exports,
               on = "country",
               suffixes=("_left", "_right")).head(3))
##   country  year_left     imports  year_right     exports
## 0  France       2015  898.524296        2014  816.819217
## 1  France       2015  898.524296        2015  851.663257
## 2  France       2015  898.524296        2016  867.401425

10.14 Aggregation

Sometimes we want to aggregate the values of a variable, for example from a quarterly to an annual frequency. With spatial observations, this can also be the case, for example, when data are available at the county level and we want aggregate values at the state level.

To illustrate the different aggregation operations, let’s create a dataframe with unemployment data for different French regions, departments and years:

unemployment = pd.DataFrame(
    {"region" : (["Bretagne"]*4 + ["Corse"]*2)*2,
     "departement" : ["Cotes-d'Armor", "Finistere",
                      "Ille-et-Vilaine", "Morbihan",
                      "Corse-du-Sud", "Haute-Corse"]*2,
     "year" : np.repeat([2011, 2010], 6),
     "workers" : [8738, 12701, 11390, 10228, 975, 1297,
                   8113, 12258, 10897, 9617, 936, 1220],
     "engineers" : [1420, 2530, 3986, 2025, 259, 254,
                     1334, 2401, 3776, 1979, 253, 241]
    })
print(unemployment)
##       region      departement  year  workers  engineers
## 0   Bretagne    Cotes-d'Armor  2011     8738       1420
## 1   Bretagne        Finistere  2011    12701       2530
## 2   Bretagne  Ille-et-Vilaine  2011    11390       3986
## 3   Bretagne         Morbihan  2011    10228       2025
## 4      Corse     Corse-du-Sud  2011      975        259
## 5      Corse      Haute-Corse  2011     1297        254
## 6   Bretagne    Cotes-d'Armor  2010     8113       1334
## 7   Bretagne        Finistere  2010    12258       2401
## 8   Bretagne  Ille-et-Vilaine  2010    10897       3776
## 9   Bretagne         Morbihan  2010     9617       1979
## 10     Corse     Corse-du-Sud  2010      936        253
## 11     Corse      Haute-Corse  2010     1220        241

As previously discussed (see Section 10.10.1), methods can be used to calculate simple statistics on the data set. For example, to display the average of each of the numerical columns:

print(unemployment.mean(numeric_only=True))
## year         2010.500000
## workers      7364.166667
## engineers    1704.833333
## dtype: float64

What we are interested in in this section is performing calculations on subgroups of data. The principle is simple: first, the data are separated according to identified groups (split), then an operation is applied to each group (apply), and finally the results are collected (combine). To perform grouping according to one or more factors before performing aggregation calculations, pandas offers the groupby() method. The argument provided is the name (or list of names) of the column(s) used to form the groups.

10.14.1 Aggregation According to the Values of a Single Column

For example, let us assume that we want to obtain the total number of unemployed workers per year. First, we use the groupby() method on our dataframe, indicating that the groups must be created according to the values in the year column:

print(unemployment.groupby("year"))
## <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x131bae2b0>

Then, we extract the variable workers:

print(unemployment.groupby("year").workers)
## <pandas.core.groupby.groupby.SeriesGroupBy object at 0x131c28be0>
# Or
print(unemployment.groupby("year")["workers"])
## <pandas.core.groupby.groupby.SeriesGroupBy object at 0x131c282b0>

And finally, we can perform the calculation on each subgroup and display the result:

print(unemployment.groupby("year")["workers"].sum())
## year
## 2010    43041
## 2011    45329
## Name: workers, dtype: int64

If we want to perform this calculation for several columns, for example workers and engineers, we just have to select beforehand the grouping variable and the variables on which we want to perform the calculation:

unemployment.loc[:,["year", "workers", "engineers"]].groupby("year").sum()
##       workers  engineers
## year                    
## 2010    43041       9984
## 2011    45329      10474
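Rather than a single statistic, several can be computed per group in one pass with the agg() method. A minimal sketch reusing a few of the figures above:

```python
import pandas as pd

unemployment = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "workers": [8113, 936, 8738, 975],
})

# agg() applies several statistics to each group at once;
# the result has one column per requested statistic
res = unemployment.groupby("year")["workers"].agg(["sum", "mean", "max"])
print(res)
```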

10.14.2 Aggregation According to the Values of Several Columns

Now, let’s assume that we want to aggregate by year and region. The only thing we need to do is to give a list containing the names of the columns used to create the different groups:

unemployment.loc[:,["year", "region",
               "workers", "engineers"]].groupby(["year",
                                                   "region"]).sum()
##                workers  engineers
## year region                      
## 2010 Bretagne    40885       9490
##      Corse        2156        494
## 2011 Bretagne    43057       9961
##      Corse        2272        513
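Note that the grouping keys end up in the index of the result. If we prefer to keep them as ordinary columns, we can pass as_index=False to groupby() (or call reset_index() afterwards). A small sketch reusing some of the aggregated figures above:

```python
import pandas as pd

unemployment = pd.DataFrame({
    "year": [2010, 2010, 2011],
    "region": ["Bretagne", "Corse", "Bretagne"],
    "workers": [40885, 2156, 43057],
})

# as_index=False keeps the grouping keys as ordinary columns
# instead of turning them into a hierarchical index
res = unemployment.groupby(["year", "region"], as_index=False)["workers"].sum()
print(list(res.columns))  # ['year', 'region', 'workers']
```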

10.15 Data Export and Import

pandas offers many functions for importing and exporting data in different formats.

10.15.1 Data Export

10.15.1.1 Exporting Tabular Data

10.15.1.1.1 To a CSV File

To export tabular data, such as those contained in a dataframe, pandas offers the to_csv() method, which accepts many arguments. Let’s look at some of the most common ones:

Table 10.1: Main arguments of the to_csv() method

Argument         Description
path_or_buf      path to the file
sep              field separator character
decimal          character to use as the decimal separator
na_rep           representation to use for missing values
header           whether to export the column names (True by default)
index            whether to export the row names (True by default)
mode             Python writing mode (see Table 5.1; w by default)
encoding         character encoding (utf-8 by default)
compression      compression to use for the destination file (gzip, bz2, zip, xz)
line_terminator  end-of-line character
quotechar        character used to put fields between quotes
chunksize        (integer) number of rows to write at a time
date_format      date format for datetime objects

Let’s assume that we want to export the contents of the unemployment dataframe to a CSV file whose fields are separated by semicolons, without exporting the index:


For export:

path = "./fichiers_exemples/unemployment.csv"
unemployment.to_csv(path, sep=";", index=False)
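As a sanity check, an export can be read back to verify the round trip. A minimal sketch writing to a temporary file with made-up data:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"region": ["Bretagne", "Corse"],
                   "workers": [8738, 975]})

# write with a semicolon separator and no index, then read it back
path = os.path.join(tempfile.mkdtemp(), "unemployment.csv")
df.to_csv(path, sep=";", index=False)
back = pd.read_csv(path, sep=";")
print(back.equals(df))  # True
```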

If you want the CSV file to be compressed into a gzip file, you name it with the extension .csv.gz and add the value gzip to the compression parameter:

path = "./fichiers_exemples/chomage.csv.gz"
unemployment.to_csv(path, sep=";", index=False, compression="gzip")

10.15.1.1.2 To an HDF5 File

To save the data of a dataframe in an HDF5 file using HDFStore, pandas offers the to_hdf() method, which works in a similar way to the to_csv() method (see Section 10.15.1.1.1).

The argument path_or_buf must be specified to indicate the path and the argument key to identify the object to be saved in the file.

The syntax is as follows:

path = "./fichiers_exemples/chomage.h5"
unemployment.to_hdf(path, key="base_chomage")

10.16 Data Import

pandas offers many functions for importing data. In this version of the course notes, we will discuss three of them: read_csv(), to read CSV files; read_excel(), to read Excel files; and read_hdf(), to read HDF5 files.

10.16.1 CSV Files

To import data from a CSV file, pandas offers the function read_csv():

path = "./fichiers_exemples/unemployment.csv"
unemployment = pd.read_csv(path, sep=";")

It is also possible to provide the read_csv() function with a URL pointing to a CSV file as the path.

Among the parameters that are frequently used:

  • sep, delimiter: field separator
  • decimal: decimal separator
  • header: line number(s) to be used as data header
  • skiprows: line number(s) to skip at the beginning
  • skipfooter: number of lines to skip at the end
  • nrows: number of lines to read
  • na_values: additional character strings to be considered as missing values (in addition to #N/A, #N/A N/A, #NA, -1.#IND, -1.#QNAN, -NaN, -nan, 1.#IND, 1.#QNAN, N/A, NA, NULL, NaN, n/a, nan, null)
  • quotechar: quote character
  • encoding: character encoding (utf-8 by default).
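Several of these parameters can be tried without any file at all, by reading from an in-memory string. A small sketch with made-up data, combining sep, skiprows and na_values:

```python
import io

import pandas as pd

raw = """# a comment line that should be skipped
region;workers
Bretagne;8738
Corse;n.d.
"""

# sep sets the field separator, skiprows drops the first line,
# and na_values declares "n.d." as an extra missing-value marker
df = pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, na_values=["n.d."])
print(df["workers"].isna().tolist())  # [False, True]
```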

10.16.2 Excel Files

To import Excel files, pandas offers the function read_excel().

path = "./fichiers_exemples/chomage.xlsx"
unemployment = pd.read_excel(path, skiprows=2, header=1, sheet_name=1)
print(unemployment)
##     Bretagne    Cotes-d'Armor  2011   8738  1420
## 0   Bretagne        Finistere  2011  12701  2530
## 1   Bretagne  Ille-et-Vilaine  2011  11390  3986
## 2   Bretagne         Morbihan  2011  10228  2025
## 3      Corse     Corse-du-Sud  2011    975   259
## 4      Corse      Haute-Corse  2011   1297   254
## 5   Bretagne    Cotes-d'Armor  2010   8113  1334
## 6   Bretagne        Finistere  2010  12258  2401
## 7   Bretagne  Ille-et-Vilaine  2010  10897  3776
## 8   Bretagne         Morbihan  2010   9617  1979
## 9      Corse     Corse-du-Sud  2010    936   253
## 10     Corse      Haute-Corse  2010   1220   241

Among the frequently used arguments:

  • header: row number to be used as header
  • sheet_name: name or number of the sheet
  • skiprows: number of lines to be skipped at the beginning
  • thousands: thousands separator.

10.16.3 HDF5 Files

To import data from an HDF5 file, pandas offers the read_hdf() function, which takes the path to the file and the key of the object to read:

path = "./fichiers_exemples/chomage.h5"
print(pd.read_hdf(path, "base_chomage"))

10.17 Exercise

Exercise 1: Import and export

  1. Download the csv file by hand at the following address: http://egallic.fr/Enseignement/Python/Exercices/donnees/notes.csv and place it in the current directory. Import its content into Python.
  2. Import the data back into Python, but this time by providing the url directly to the import function.
  3. Now, import the contents of the file available at http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_decim.csv. The field separator is a semicolon and the decimal separator is a comma.
  4. Import the contents of the file http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_h.csv. The column names are not present.
  5. Import the contents of the file http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_h_s.csv. The first line is not to be imported.
  6. Import the contents of the first sheet of the Excel file http://egallic.fr/Enseignement/Python/Exercices/donnees/notes.xlsx.
  7. Import the content of the second sheet (notes_h_s) from the Excel file available here: http://egallic.fr/Enseignement/Python/Exercices/donnees/notes.xlsx. The first line is a comment not to be considered during the import.
  8. Export the content of the object containing the notes of the previous question in csv format (comma as field separator, dot as decimal separator, do not keep the line numbers).

Exercise 2: Handling Dataframes

  1. Using the read_excel() function of the pandas library, import the contents of the sheet entitled notes_2012 from the Excel file available at the following address: http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_etudiants.xlsx and store it in a variable called notes_2012.
  2. Display the first 6 lines of the dataset, then the dimensions of the table.
  3. Keep only the column note_stat of the data table notes_2012 in an object called tmp.
  4. Keep only the columns number_student, note_stat and note_macro in an object named tmp.
  5. Replace the content of tmp with observations of notes_2012 for which the individual obtained a stat score greater (strictly) than 10.
  6. Replace the content of tmp with observations of notes_2012 for which the individual obtained a stats score within the interval (10, 15).
  7. Look for duplicates in the notes_2012 data table; if there are any, remove them from the table.
  8. Display the type of the data in the column num_etudiant, then display the types of all the columns of notes_2012.
  9. Add the following columns to the notes_2012 table:
     1. note_stat_maj: the stat score (note_stat) increased by one point,
     2. note_macro_maj: the macro score (note_macro) increased by three points (do it in two steps: first two more points, then one more point).
  10. Rename the column annee to year.
  11. From the file notes_etudiants.xlsx (see question 1), import the contents of the sheets notes_2013, notes_2014 and prenoms and store them in the objects notes_2013, notes_2014 and first_names, respectively.
  12. Stack the contents of the data tables notes_2012, notes_2013 and notes_2014 in an object that will be called notes.
  13. Merge the notes and first_names tables using a left join, so as to add the information contained in the first_names table to the grade observations. The join must be done on the student number and the year; the final object will replace the content of notes.
  14. Sort the notes table by increasing year and decreasing macro score.
  15. Create a column after_2012 which takes the value True if the observation concerns a score assigned after 2012.
  16. By grouping on the notes dataframe, compute:
     1. the annual mean and standard deviation of the scores for each of the two classes,
     2. the annual mean and standard deviation of the scores, by gender, for each of the two classes.
