10 Data manipulation with pandas
pandas is an open-source library based on NumPy providing easy-to-use data structures and data analysis tools. The reader familiar with the basic functions of the R language will find many similar features in pandas.
To access the features of pandas, it is common practice to load the library by assigning it the alias pd:
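That is:
import pandas as pd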
We will also use numpy functions (see Section 9). Let's make sure to load this library, if it hasn't already been done:
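Namely:
import numpy as np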
10.1 Structures
We will look at two types of structures: series (Series) and dataframes (DataFrame).
10.1.1 Series
Series are one-dimensional tables of indexed data.
10.1.1.1 Creating Series from a List
To create one, we can define a list, then apply the Series function of pandas:
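The creation instruction is not reproduced here; given the output below, it was presumably:
s = pd.Series([1, 4, -1, np.nan, 0.5, 1])
print(s)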
## 0 1.0
## 1 4.0
## 2 -1.0
## 3 NaN
## 4 0.5
## 5 1.0
## dtype: float64
The previous example shows that the series s we created contains both the data and an associated index. The values attribute displays the values, which are stored in a numpy array:
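Presumably:
print("values of s:", s.values)
print("type of values of s:", type(s.values))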
## values of s: [ 1. 4. -1. nan 0.5 1. ]
## type of values of s: <class 'numpy.ndarray'>
The index is stored in a specific pandas structure:
## index of s: RangeIndex(start=0, stop=6, step=1)
## type of the index of s: <class 'pandas.core.indexes.range.RangeIndex'>
It is possible to assign a name to the series as well as to the index:
s.name = "my_serie"
s.name = "name_index"
print("name of the Serie: {} , name of the index: {}".format(s.name, s.index.name))
## name of the Serie: name_index , name of the index: None
## Serie s:
## 0 1.0
## 1 4.0
## 2 -1.0
## 3 NaN
## 4 0.5
## 5 1.0
## Name: my_serie, dtype: float64
10.1.1.2 Definition of the Index
The index can be defined by the user, at the time of creating the series:
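For instance:
s = pd.Series([1, 4, -1, np.nan], index = ["o", "d", "i", "l"])
print(s)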
## o 1.0
## d 4.0
## i -1.0
## l NaN
## dtype: float64
The index can also be defined with numerical values, without being forced to follow a specific order:
## 4 1.0
## 40 4.0
## 2 -1.0
## 3 NaN
## dtype: float64
The index can be modified later, by overwriting the index attribute:
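Presumably, with s the numerically indexed series described above:
s.index = ["o", "d", "i", "l"]
print("Series s:\n", s)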
## Series s:
## o 1.0
## d 4.0
## i -1.0
## l NaN
## dtype: float64
10.1.1.3 Creation of Special Series
A simple trick for creating a series with a repeated value consists in providing a scalar to the Series function of pandas, together with an index whose length corresponds to the number of times the scalar should be repeated:
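For instance:
s = pd.Series(5, index = [0, 1, 2, 3])
print(s)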
## 0 5
## 1 5
## 2 5
## 3 5
## dtype: int64
A series can be created from a dictionary:
dictionary = {"King": "Arthur",
"Knight_Round_Table": "Percival",
"Druid": "Merlin"}
s = pd.Series(dictionary)
print(s)
## King Arthur
## Knight_Round_Table Percival
## Druid Merlin
## dtype: object
As noted in the previous output, the dictionary keys were used for the index. When creating the series, a subset of keys can be passed to the index argument: this results in retrieving only the observations corresponding to these keys:
dictionary = {"King": "Arthur",
"Knight_Round_Table": "Percival",
"Druid": "Merlin"}
s = pd.Series(dictionary, index = ["King", "Druid"])
print(s)
## King Arthur
## Druid Merlin
## dtype: object
10.1.2 Dataframes
Dataframes correspond to the data format traditionally found in economics: two-dimensional tables, with variables in columns and observations in rows. The columns and rows of a dataframe are indexed.
10.1.2.1 Creating Dataframes from a Dictionary
To create a dataframe, the DataFrame() function of pandas can be provided with a dictionary that can be transformed into series. This is the case of a dictionary whose values are lists of the same length:
dict = {"height" :
[58, 59, 60, 61, 62,
63, 64, 65, 66, 67,
68, 69, 70, 71, 72],
"weight":
[115, 117, 120, 123, 126,
129, 132, 135, 139, 142,
146, 150, 154, 159, 164]
}
df = pd.DataFrame(dict)
print(df)
## height weight
## 0 58 115
## 1 59 117
## 2 60 120
## 3 61 123
## 4 62 126
## 5 63 129
## 6 64 132
## 7 65 135
## 8 66 139
## 9 67 142
## 10 68 146
## 11 69 150
## 12 70 154
## 13 71 159
## 14 72 164
The position of the elements in the dataframe serves as an index. As for series, the values are accessible through the values attribute and the index through the index attribute. The columns are also indexed:
## Index(['height', 'weight'], dtype='object')
The head() method displays the first few rows (the first 5, by default). Its n argument indicates the number of rows to be returned:
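Presumably:
print(df.head(2))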
## height weight
## 0 58 115
## 1 59 117
When creating a dataframe from a dictionary, if the names of the columns to be imported are specified as a list of strings in the columns argument of the DataFrame function, it is possible to define not only the columns to be filled but also their order of appearance.
For example, to import only the weight column:
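Presumably:
print(pd.DataFrame(dict, columns = ["weight"]).head(2))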
## weight
## 0 115
## 1 117
And to define the order in which the columns will appear:
## weight height
## 0 115 58
## 1 117 59
If a column name that is absent from the dictionary keys is specified, the resulting dataframe will contain a column with this name, filled with NaN values:
## weight height age
## 0 115 58 NaN
## 1 117 59 NaN
10.1.2.2 Creating Dataframes from Series
A dataframe can be created from a series:
s = pd.Series([1, 4, -1, np.nan], index = ["o", "d", "i", "l"])
s.name = "name_variable"
df = pd.DataFrame(s, columns = ["name_variable"])
print(df)
## name_variable
## o 1.0
## d 4.0
## i -1.0
## l NaN
If the series is not named, the columns argument of the DataFrame function must not be filled in. In that case, the column will not have a name, just a numerical index.
## 0
## o 1.0
## d 4.0
## i -1.0
## l NaN
## None
10.1.2.3 Creating Dataframes from a Dictionary List
A dataframe can be created from a list of dictionaries:
dico_1 = {
"Name": "Pendragon",
"Surname": "Arthur",
"Role": "King of Britain"
}
dico_2 = {
"Name": "",
"Surname": "Perceval",
"Role": "Knight of the Round Table"
}
df = pd.DataFrame([dico_1, dico_2])
print(df)
## Name Role Surname
## 0 Pendragon King of Britain Arthur
## 1 Knight of the Round Table Perceval
If some keys are missing in one or more of the dictionaries in the list, the corresponding values in the dataframe will be NaN:
dico_3 = {
"Surname": "Guinevere",
"Role": "Queen of Britain"
}
df = pd.DataFrame([dico_1, dico_2, dico_3])
print(df)
## Name Role Surname
## 0 Pendragon King of Britain Arthur
## 1 Knight of the Round Table Perceval
## 2 NaN Queen of Britain Guinevere
10.1.2.4 Creating Dataframes from a Dictionary of Series
A dataframe can also be created from a series dictionary. To illustrate the method, let’s create two dictionaries:
# Annual 2017 GDP
# In millions of current dollars
dict_gdp_current = {
"France": 2582501.31,
"USA": 19390604.00,
"UK": 2622433.96
}
# Annual consumer price index
dict_cpi = {
"France": 0.2,
"UK": 0.6,
"USA": 1.3,
"Germany": 0.5
}
From these two dictionaries, let’s create two corresponding series:
s_gdp_current = pd.Series(dict_gdp_current)
s_cpi = pd.Series(dict_cpi)
print("s_gdp_current : \n", s_gdp_current)
## s_gdp_current :
## France 2582501.31
## USA 19390604.00
## UK 2622433.96
## dtype: float64
##
## s_cpi :
## France 0.2
## UK 0.6
## USA 1.3
## Germany 0.5
## dtype: float64
Then, let’s create a dictionary of series:
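A plausible reconstruction (the name dict_series is assumed):
dict_series = {"gdp": s_gdp_current, "cpi": s_cpi}
print(dict_series)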
## {'gdp': France 2582501.31
## USA 19390604.00
## UK 2622433.96
## dtype: float64, 'cpi': France 0.2
## UK 0.6
## USA 1.3
## Germany 0.5
## dtype: float64}
Finally, let’s create our dataframe:
## gdp cpi
## France 2582501.31 0.2
## Germany NaN 0.5
## UK 2622433.96 0.6
## USA 19390604.00 1.3
The dict_gdp_current dictionary does not contain a Germany key, unlike the dict_cpi dictionary. When the dataframe was created, the GDP value for Germany was therefore set to NaN.
10.1.2.5 Creation of Dataframes from a Two-dimensional NumPy Array
A dataframe can also be created from a NumPy array. When creating it with the DataFrame() function of pandas, the names of the columns can be specified (otherwise, the columns are simply numbered):
listing = [
[1, 2, 3],
[11, 22, 33],
[111, 222, 333],
[1111, 2222, 3333]
]
array_np = np.array(listing)
df = pd.DataFrame(array_np,
columns = ["a", "b", "c"])
print(df)
## a b c
## 0 1 2 3
## 1 11 22 33
## 2 111 222 333
## 3 1111 2222 3333
10.1.2.6 Dimensions
The dimensions of a dataframe are accessed with the shape attribute.
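Presumably:
print("shape :", df.shape)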
## shape : (4, 3)
The number of rows can also be displayed as follows:
## shape : 4
And the number of columns:
## shape : 3
10.1.2.7 Modification of the Index
As for series, the index can be modified once the dataframe has been created, by overwriting the values of the index and columns attributes, for the row and column indexes, respectively:
## aa bb cc
## o 1 2 3
## d 11 22 33
## i 111 222 333
## l 1111 2222 3333
10.2 Selection
In this section, we look at different ways to select data in series and dataframes. There are two distinct ways to do this:
- a first one based on the use of brackets directly on the object from which we want to select certain parts;
- a second one based on indexers, accessible as attributes of pandas objects (loc, iloc, at, iat, etc.).
The second method avoids some confusion that may appear in the case of numerical indexes.
10.2.1 For Series
First, let’s look at ways to extract values from series.
10.2.1.1 With brackets
The index can be used to extract the data:
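Assuming s is the series created in Section 10.1.1.1, the instructions were presumably:
s = pd.Series([1, 4, -1, np.nan, 0.5, 1])
print(s[0])
print(s[1:3])
print(s[[0, 4]])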
## 1.0
## 1 4.0
## 2 -1.0
## dtype: float64
## 0 1.0
## 4 0.5
## dtype: float64
Note that unlike NumPy arrays or lists, negative values for the index cannot be used to retrieve data by counting positions from the end:
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: -2
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 766, in __getitem__
## result = self.index.get_value(self, key)
## File "/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3103, in get_value
## tz=getattr(series.dtype, 'tz', None))
## File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
## File "pandas/_libs/index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
## File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
## File "pandas/_libs/hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
## File "pandas/_libs/hashtable_class_helper.pxi", line 964, in pandas._libs.hashtable.Int64HashTable.get_item
In the case of an index composed of strings, it is possible to refer either to the content of the index (simply put, its name) or to its position when extracting data from the series:
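A plausible reconstruction:
s = pd.Series([1, 4, -1, np.nan], index = ["o", "d", "i", "l"])
print("Series s :\n", s)
print('s["d"] :\n', s["d"])
print("s[1] :\n", s[1])
print("elements o and i :\n", s[["o", "i"]])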
## Series s :
## o 1.0
## d 4.0
## i -1.0
## l NaN
## dtype: float64
## s["d"] :
## 4.0
## s[1] :
## 4.0
## elements o and i :
## o 1.0
## i -1.0
## dtype: float64
On the other hand, when the index is defined with numerical values, extraction with the brackets is done by the value of the index, not by the position:
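Presumably, with the numerically indexed series defined earlier (the exact series used is an assumption):
s = pd.Series([1, 4, -1, np.nan], index = [4, 40, 2, 3])
print(s[40])  # 40 is an index label here, not a position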
## 4.0
10.2.1.2 With Indexers
pandas offers two types of multi-axis indexers: loc and iloc. The first is mainly based on the use of axis labels, while the second mainly relies on integer positions.
For the purposes of this section, let’s create two series; one with a textual index, the other with a numerical index:
s_num = pd.Series([1, 4, -1, np.nan],
index = [5, 0, 4, 1])
s_text = pd.Series([1, 4, -1, np.nan],
index = ["c", "a", "b", "d"])
10.2.1.2.1 Extraction of a Single Element
To extract an element with loc, we use the name (label) of the index:
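Presumably:
print(s_num.loc[5], s_text.loc["c"])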
## 1.0 1.0
To extract a single element with iloc
, simply indicate its position:
## (4.0, 4.0)
10.2.1.2.2 Extraction of Several Elements
To extract several elements with loc, we use the names (labels) of the indices, provided in a list:
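Likely:
print("elements with labels 5 and 4:\n", s_num.loc[[5, 4]])
print("elements with labels c and b:\n", s_text.loc[["c", "b"]])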
## elements with labels 5 and 4:
## 5 1.0
## 4 -1.0
## dtype: float64
## elements with labels c and b:
## c 1.0
## b -1.0
## dtype: float64
To extract multiple elements with iloc
:
## elements at positions 0 and 2:
## 5 1.0
## 4 -1.0
## dtype: float64
## elements at positions 0 and 2:
## c 1.0
## b -1.0
## dtype: float64
10.2.1.2.3 Slicing
It is possible to perform series slicing, to recover consecutive elements:
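For instance, with loc:
print("elements with label 5 to 4:\n", s_num.loc[5:4])
print("elements with label c to b:\n", s_text.loc["c":"b"])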
## elements with label 5 to 4:
## 5 1.0
## 0 4.0
## 4 -1.0
## dtype: float64
## elements with label c to b:
## c 1.0
## a 4.0
## b -1.0
## dtype: float64
Slicing also works with iloc, using positions:
## elements at positions 0 and 2:
## 5 1.0
## 0 4.0
## dtype: float64
## elements at positions 0 and 2:
## c 1.0
## a 4.0
## dtype: float64
Note the difference: with loc, both slice bounds are included, whereas with iloc, the upper bound is excluded.
10.2.1.2.4 Mask
A mask can also be used to extract elements, either with loc or iloc:
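A plausible reconstruction (the mask itself is not shown; a literal Boolean list is assumed here):
mask = [True, False, False, True]
print("\n", s_num.loc[mask])
print("\n", s_text.loc[mask])
print("\n", s_num.iloc[mask])
print("\n", s_text.iloc[mask])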
##
## 5 1.0
## 1 NaN
## dtype: float64
##
## c 1.0
## d NaN
## dtype: float64
##
## 5 1.0
## 1 NaN
## dtype: float64
##
## c 1.0
## d NaN
## dtype: float64
10.2.1.2.5 What’s the Point?
Why introduce such ways of extracting data instead of simply using brackets on the objects? Let's look at a simple example with the s_num series, whose integer index is not a sequence ranging from 0 to the number of elements. In this case, because of the numerical index, we cannot recover the 2nd element by asking s_num[1]. To extract the 2nd element of the series, we must know that its index label is 0 and thus ask:
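Presumably:
print("The element whose index is 0:", s_num[0])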
## The element whose index is 0: 4.0
To be able to perform the extraction according to the position, it is very useful to have this attribute iloc
:
## The element in 2nd position: 4.0
10.2.2 For dataframes
Now let's look at different ways to extract data from a dataframe. As an example, let's create two dataframes: one with a numerical index, another with a textual index:
dict = {"height" : [58, 59, 60, 61, 62],
"weight": [115, 117, 120, 123, 126],
"age": [28, 33, 31, 31, 29],
"taille": [162, 156, 172, 160, 158],
}
df_num = pd.DataFrame(dict)
df_text = pd.DataFrame(dict, index=["a", "e", "c", "b", "d"])
print("df_num : \n", df_num)
## df_num :
## height weight age taille
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
## 3 61 123 31 160
## 4 62 126 29 158
## df_text :
## height weight age taille
## a 58 115 28 162
## e 59 117 33 156
## c 60 120 31 172
## b 61 123 31 160
## d 62 126 29 158
To make it simple, when we want to perform an extraction with the iloc attribute, the syntax is df.iloc[row_selection, column_selection],
with row_selection:
- a single value: 1 (second row);
- a list of values: [2, 1, 3] (3rd, 2nd and 4th rows);
- a slice: 2:4 (from the 3rd row to the 4th row (not included));
and column_selection:
- a single value: 1 (second column);
- a list of values: [2, 1, 3] (3rd, 2nd and 4th columns);
- a slice: 2:4 (from the 3rd column to the 4th column (not included)).
With loc, the syntax is df.loc[row_selection, column_selection],
with row_selection:
- a single value: "a" (row named "a");
- a list of names: ["a", "c", "b"] (rows named "a", "c" and "b");
- a mask: df["a"] < 10 (rows for which the mask values are True);
and column_selection:
- a single value: "a" (column named "a");
- a list of names: ["a", "c", "b"] (columns named "a", "c" and "b");
- a slice: "a":"c" (from the column named "a" to the column named "c").
10.2.2.1 Extraction of a Row
To extract a row from a dataframe, the name (label) of the row can be used with loc:
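Presumably:
print("Rows named 'e': \n", df_text.loc["e"])
print("\nRows named 'e': \n", df_num.loc[1])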
## Rows named 'e':
## height 59
## weight 117
## age 33
## taille 156
## Name: e, dtype: int64
##
## Rows named 'e':
## height 59
## weight 117
## age 33
## taille 156
## Name: 1, dtype: int64
Or, its position with iloc
:
## Rows in position 0:
## height 58
## weight 115
## age 28
## taille 162
## Name: a, dtype: int64
##
## Rows in position 0:
## height 58
## weight 115
## age 28
## taille 162
## Name: 0, dtype: int64
10.2.2.2 Extraction of Several Rows
To extract multiple rows from a dataframe, their names can be used with loc (in a list):
## Rows named a and c:
## height weight age taille
## a 58 115 28 162
## c 60 120 31 172
##
## Rows named 0 and 2:
## height weight age taille
## 0 58 115 28 162
## 2 60 120 31 172
Or, their position with iloc
:
## Rows at positions 0 and 3:
## height weight age taille
## a 58 115 28 162
## b 61 123 31 160
##
## Rows at positions 0 and 3:
## height weight age taille
## 0 58 115 28 162
## 3 61 123 31 160
10.2.2.3 Slicing of Several Rows
A sequence of rows can be retrieved by delimiting, by name, the first and last rows to be retrieved, using loc:
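Likely:
print("Rows from label a to c: \n", df_text.loc["a":"c"])
print("\nRows from label 0 to 2: \n", df_num.loc[0:2])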
## Rows from label a to c:
## height weight age taille
## a 58 115 28 162
## e 59 117 33 156
## c 60 120 31 172
##
## Rows from label 0 to 2:
## height weight age taille
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
With the attribute iloc
, this is also possible (again, the upper bound is not included):
## Rows at position 0 to 3 (not included):
## height weight age taille
## a 58 115 28 162
## e 59 117 33 156
## c 60 120 31 172
##
## Rows at position 0 to 3 (not included):
## height weight age taille
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
10.2.2.4 Mask
A mask can also be used to select certain rows. For example, to retrieve the rows for which the variable height has a value greater than 60, we use the following mask:
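Presumably:
mask = df_text["height"] > 60
print(mask)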
## a False
## e False
## c False
## b True
## d True
## Name: height, dtype: bool
To filter:
## height weight age taille
## b 61 123 31 160
## d 62 126 29 158
10.2.2.5 Extraction of a Single Column
To extract a column from a dataframe, we can use square brackets with the name of the column (columns are indexed by name):
## a 115
## e 117
## Name: weight, dtype: int64
By selecting a single column, we obtain a series (the dataframe index is kept for the series):
## <class 'pandas.core.series.Series'>
A column can also be extracted by referring to the attribute of the dataframe named after this column:
## a 115
## e 117
## Name: weight, dtype: int64
As for series, we can rely on the loc and iloc attributes:
## Column 2 (loc):
## a 115
## e 117
## c 120
## b 123
## d 126
## Name: weight, dtype: int64
## Column 2 (iloc):
## a 115
## e 117
## c 120
## b 123
## d 126
## Name: weight, dtype: int64
10.2.2.6 Extraction of Several Columns
To extract several columns, the names of the columns can be placed in a list:
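Presumably:
print(df_text[["weight", "height"]])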
## weight height
## a 115 58
## e 117 59
## c 120 60
## b 123 61
## d 126 62
The order in which these columns are called corresponds to the order in which they will be returned.
Again, we can use the loc attribute (we use the colon here to specify that we want all the rows):
## Columns from weight to height:
## weight height
## a 115 58
## e 117 59
## c 120 60
## b 123 61
## d 126 62
And the iloc
attribute:
## Columns 2 and 1 :
## weight height
## 0 115 58
## 1 117 59
## 2 120 60
## 3 123 61
## 4 126 62
10.2.2.7 Slicing Several Columns
To perform a slice, the loc and iloc attributes can be used. Be careful: the column names used for the slice are not placed in a list here.
With loc:
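A plausible reconstruction:
print("Columns from height to age:\n", df_text.loc[:, "height":"age"])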
## Columns from height to age:
## height weight age
## a 58 115 28
## e 59 117 33
## c 60 120 31
## b 61 123 31
## d 62 126 29
And with the iloc
attribute:
## Columns from position 0 to 2 (not included) :
## height weight
## a 58 115
## e 59 117
## c 60 120
## b 61 123
## d 62 126
10.2.2.8 Extraction of Rows and Columns
Now that we have reviewed multiple ways to select one or more rows or columns, note that rows and columns can be selected in the same instruction. For example, with iloc, let's select the rows from position 0 to position 2 (not included) and the columns from position 1 to 3 (not included):
## weight age
## a 115 28
## e 117 33
With loc, let's select the rows named a and c and the columns from the one named weight to the one named age:
## weight age
## a 115 28
## c 120 31
10.3 Renaming Columns in a Dataframe
To rename a column in a dataframe, pandas offers the rename() method. Let's take an example with our df dataframe:
dict = {"height" : [58, 59, 60, 61, 62],
"weight": [115, 117, 120, 123, 126],
"age": [28, 33, 31, 31, 29],
"taille": [162, 156, 172, 160, 158],
}
df = pd.DataFrame(dict)
print(df)
## height weight age taille
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
## 3 61 123 31 160
## 4 62 126 29 158
Let's rename the column taille to height, using a dictionary passed to the columns argument, with the current name of the column as key and the new name as value:
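Presumably:
print(df.rename(columns = {"taille": "height"}))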
## height weight age height
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
## 3 61 123 31 160
## 4 62 126 29 158
For the change to be effective, the inplace argument must be set to True; otherwise the change is not applied to the dataframe.
To rename several columns at the same time, we can provide several key-value pairs in the dictionary:
## height weight_pounds years height
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
## 3 61 123 31 160
## 4 62 126 29 158
10.4 Filtering
To filter the data in a table, depending on the values encountered for some variables, masks are used, as indicated in Section 10.2.2.4.
Let’s look at some examples here, by redefining our dataframe:
dict = {"height" : [58, 59, 60, 61, 62],
"weight": [115, 117, 120, 123, 126],
"age": [28, 33, 31, 31, 29],
"height_cm": [162, 156, 172, 160, 158],
}
df = pd.DataFrame(dict)
print(df)
## height weight age height_cm
## 0 58 115 28 162
## 1 59 117 33 156
## 2 60 120 31 172
## 3 61 123 31 160
## 4 62 126 29 158
The idea is to create a mask returning a series of Boolean values, one per row. Rows of the dataframe for which the mask is True are retained; those for which it is False are discarded.
Let's look at a simple example, in which we want to keep only the observations for which the value of the age variable is less than 30:
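Presumably:
mask = df["age"] < 30
print(mask)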
## 0 True
## 1 False
## 2 False
## 3 False
## 4 True
## Name: age, dtype: bool
Then, we simply apply this mask with loc. We want all the columns, but only some of the rows:
## height weight age height_cm
## 0 58 115 28 162
## 4 62 126 29 158
Note: it also works without loc
:
## height weight age height_cm
## 0 58 115 28 162
## 4 62 126 29 158
More simply, we can use the query() method of pandas. A Boolean expression to be evaluated is provided to this method to filter the data:
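Likely:
print(df.query("age < 30"))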
## height weight age height_cm
## 0 58 115 28 162
## 4 62 126 29 158
The query can be a little more complex, by combining comparison operators (see Section 4.2) and logical operators (see Section 4.3). For example, suppose that we want to keep only the observations for which the height is strictly less than 62 and the weight strictly greater than 120. The request would then be:
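Presumably:
print(df.query("(height < 62) and (weight > 120)"))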
## height weight age height_cm
## 3 61 123 31 160
It can be noted that the following instruction gives the same result:
## height weight age height_cm
## 3 61 123 31 160
10.4.1 Checking Whether Values Belong to a Set
To create a mask indicating whether the values of a series or dataframe belong to a given set, the isin() method can be used. For example, let's return a mask indicating whether the values in the height column of df are in the set \(\{59, 60\}\):
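Presumably:
print(df["height"].isin([59, 60]))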
## 0 False
## 1 True
## 2 True
## 3 False
## 4 False
## Name: height, dtype: bool
10.5 Missing Values
In economics, it is quite common to face incomplete data. pandas manages missing data through two special values: None and NaN.
The value None can be used in NumPy arrays only when their type is object.
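For instance:
arr = np.array([1, 4, -1, None])
print(arr)
print(type(arr))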
## [1 4 -1 None]
## <class 'numpy.ndarray'>
With an array of type object, operations performed on the data are less efficient than with a numerical array (VanderPlas 2016, p. 121).
The value NaN is a floating-point value (see Section 9.1.10). NumPy handles it differently from None, and does not switch to the object type when NaN is present:
## [ 1. 4. -1. nan]
## <class 'numpy.ndarray'>
With pandas, these two values, None and NaN, can both be present:
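For instance:
s = pd.Series([1, None, -1, np.nan])
print(s)
print(type(s))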
## 0 1.0
## 1 NaN
## 2 -1.0
## 3 NaN
## dtype: float64
## <class 'pandas.core.series.Series'>
This also applies to dataframes:
dict = {"height" : [58, 59, 60, 61, np.nan],
"weight": [115, 117, 120, 123, 126],
"age": [28, 33, 31, np.nan, 29],
"height_cm": [162, 156, 172, 160, 158],
}
df = pd.DataFrame(dict)
print(df)
## height weight age height_cm
## 0 58.0 115 28.0 162
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
## 3 61.0 123 NaN 160
## 4 NaN 126 29.0 158
However, it should be noted that only the columns containing missing values are cast to float64:
## height float64
## weight int64
## age float64
## height_cm int64
## dtype: object
When working on a column with no missing values, whose type is int or bool, pandas will change the type to float64 or object, respectively, as soon as a missing value is introduced.
pandas offers different ways to handle missing values.
10.5.1 Identify Missing Values
With the isnull() method, a Boolean mask is returned, indicating True for observations whose value is NaN or None:
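For instance:
s = pd.Series([1, np.nan, -1, None])
print(s.isnull())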
## 0 False
## 1 True
## 2 False
## 3 True
## dtype: bool
To know whether a value is not missing, the notnull() method can be used:
## 0 True
## 1 False
## 2 True
## 3 False
## dtype: bool
10.5.2 Remove Observations with Missing Values
The dropna() method removes observations containing missing values:
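Presumably:
print(df.dropna())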
## height weight age height_cm
## 0 58.0 115 28.0 162
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
10.5.3 Replacing Missing Values with Other Values
To replace missing values with other values, pandas proposes the fillna() method:
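Presumably (with -9999 as the replacement value shown below):
print(df.fillna(-9999))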
## height weight age height_cm
## 0 58.0 115 28.0 162
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
## 3 61.0 123 -9999.0 160
## 4 -9999.0 126 29.0 158
10.6 Deletions
To delete a value on one of the axes of a series or dataframe, pandas offers the drop() method.
10.6.1 Deleting Elements in a Series
To illustrate how the drop() method works, let's create a series with a numerical index, and another with a textual index:
s_num = pd.Series([1, 4, -1, np.nan],
index = [5, 0, 4, 1])
s_text = pd.Series([1, 4, -1, np.nan],
index = ["c", "a", "b", "d"])
An element can be deleted from a series by using its name:
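Likely:
print("for s_num: \n", s_num.drop(5))
print("\nfor s_text: \n", s_text.drop("c"))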
## for s_num:
## 0 4.0
## 4 -1.0
## 1 NaN
## dtype: float64
##
## for s_text:
## a 4.0
## b -1.0
## d NaN
## dtype: float64
We can also retrieve the name according to the position, by taking a detour through the index attribute:
## s_num.index[0]: 5
## s_text.index[0]: c
## for s_num:
## 0 4.0
## 4 -1.0
## 1 NaN
## dtype: float64
##
## for s_text:
## a 4.0
## b -1.0
## d NaN
## dtype: float64
To delete several elements, we can provide several index names in a list to the drop() method:
## for s_num:
## 0 4.0
## 1 NaN
## dtype: float64
##
## for s_text:
## a 4.0
## d NaN
## dtype: float64
Again, we can retrieve the names according to the positions, through the index attribute:
## s_num.index[[0,2]]: Int64Index([5, 4], dtype='int64')
## s_text.index[[0,2]]: Index(['c', 'b'], dtype='object')
## for s_num:
## 0 4.0
## 1 NaN
## dtype: float64
##
## for s_text:
## a 4.0
## d NaN
## dtype: float64
It is also possible to use slicing to obtain the series without the element(s) (see Section 10.2.1.2.3).
10.6.2 Deleting Elements in a Dataframe
To illustrate how the drop() method works on a dataframe, let's create one:
s_num = pd.Series([1, 4, -1, np.nan],
index = [5, 0, 4, 1])
s_text = pd.Series([1, 4, -1, np.nan],
index = ["c", "a", "b", "d"])
dict = {"height" : [58, 59, 60, 61, np.nan],
"weight": [115, 117, 120, 123, 126],
"age": [28, 33, 31, np.nan, 29],
"height_cm": [162, 156, 172, 160, 158],
}
df = pd.DataFrame(dict)
10.6.2.1 Deleting Rows
To delete a row from a dataframe, we can refer to its name (here, the names are numbers, but they are labels):
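Presumably:
print("Delete the first row: \n", df.drop(0))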
## Delete the first row:
## height weight age height_cm
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
## 3 61.0 123 NaN 160
## 4 NaN 126 29.0 158
The row labels can also first be retrieved from their position, using the index attribute:
## Delete the first row:
## height weight age height_cm
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
## 3 61.0 123 NaN 160
## 4 NaN 126 29.0 158
To delete several rows, their names are given in a list to the drop() method:
## Delete the 1st and 4th rows:
## height weight age height_cm
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
## 4 NaN 126 29.0 158
Or, by indicating the positions of the rows:
## Delete the 1st and 4th rows:
## height weight age height_cm
## 1 59.0 117 33.0 156
## 2 60.0 120 31.0 172
## 4 NaN 126 29.0 158
It is also possible to use slicing to obtain the dataframe without the element(s) (see Sections 10.2.2.3 and 10.2.2.7).
10.6.2.2 Deleting Columns
To delete a column from a dataframe, we proceed in the same way as for rows, but add the parameter axis=1 to the drop() method to specify that we are interested in the columns:
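Presumably:
print("Delete the first column: \n", df.drop("height", axis=1))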
## Delete the first column:
## weight age height_cm
## 0 115 28.0 162
## 1 117 33.0 156
## 2 120 31.0 172
## 3 123 NaN 160
## 4 126 29.0 158
We can first retrieve the labels of the columns according to their position, using the columns attribute:
## label_pos : height
## Delete the first column:
## weight age height_cm
## 0 115 28.0 162
## 1 117 33.0 156
## 2 120 31.0 172
## 3 123 NaN 160
## 4 126 29.0 158
To delete several columns, their names are given in a list to the drop() method:
## Delete the 1st and 4th columns:
## weight age
## 0 115 28.0
## 1 117 33.0
## 2 120 31.0
## 3 123 NaN
## 4 126 29.0
Or, by indicating the positions of the columns:
label_pos = df.columns[[0, 3]]
print("Delete the 1st and 4th columns: \n", df.drop(label_pos, axis=1))
## Delete the 1st and 4th columns:
## weight age
## 0 115 28.0
## 1 117 33.0
## 2 120 31.0
## 3 123 NaN
## 4 126 29.0
It is also possible to use slicing to obtain the dataframe without the element(s) (see Sections 10.2.2.3 and 10.2.2.7).
10.7 Replacing Values
We will now look at how to modify one or more values, in the case of a series and then a dataframe.
10.7.1 For a Series
To modify a particular value in a series or dataframe, the equals symbol (=) can be used, after having targeted the location of the value to be modified using the extraction techniques explained in Section 10.2.
For example, let's consider the following series:
## s_num: 5 1.0
## 0 4.0
## 4 -1.0
## 1 NaN
## dtype: float64
Let's modify the second element of s_num, to give it the value -3:
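Presumably:
s_num.iloc[1] = -3
print("s_num:", s_num)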
## s_num: 5 1.0
## 0 -3.0
## 4 -1.0
## 1 NaN
## dtype: float64
It is of course possible to modify several values at the same time. Again, all we need to do is target the positions (there are many ways to do this) and provide an object of equivalent dimensions to replace the targeted values. For example, in our s_num series, let's replace the values in positions 1 and 3 (2nd and 4th values) with -10 and -9:
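Likely:
s_num.iloc[[1, 3]] = [-10, -9]
print(s_num)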
## 5 1.0
## 0 -10.0
## 4 -1.0
## 1 -9.0
## dtype: float64
10.7.2 For a Dataframe
Let’s consider the following dataframe:
dict = {"city" : ["Marseille", "Aix",
"Marseille", "Aix", "Paris", "Paris"],
"year": [2019, 2019, 2018, 2018,2019, 2019],
"x": [1, 2, 2, 2, 0, 0],
"y": [3, 3, 2, 1, 4, 4],
}
df = pd.DataFrame(dict)
print("df: \n", df)
## df:
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
## 5 Paris 2019 0 4
10.7.2.1 Changes of a Particular Value
Let's change the value in the first row of df for the column year, so that it becomes 2020. First, let's retrieve the position of the year column in the dataframe, using the get_loc() method applied to the columns attribute of the dataframe:
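Presumably:
pos_year = df.columns.get_loc("year")
print("pos_year:", pos_year)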
## pos_year: 1
Then, let’s make the modification:
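Presumably:
df.iloc[0, pos_year] = 2020
print("df: \n", df)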
## df:
## city year x y
## 0 Marseille 2020 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
## 5 Paris 2019 0 4
10.7.2.2 Modifications on One or More Columns
To set all the values of a column to a particular value, for example a 2 in the x column of df:
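Presumably:
df["x"] = 2
print("df: \n", df)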
## df:
## city year x y
## 0 Marseille 2020 2 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 2 4
## 5 Paris 2019 2 4
It is also possible to modify the values in the column by providing a list of values:
## df:
## city year x y
## 0 Marseille 2020 2 3
## 1 Aix 2019 3 3
## 2 Marseille 2018 4 2
## 3 Aix 2018 2 1
## 4 Paris 2019 1 4
## 5 Paris 2019 0 4
We can therefore imagine modifying the values of a column according to the values read in another column. For example, let's apply the following rule: if the value of y is 2, then x is "a"; if the value of y is 1, then x is "b"; otherwise, x is NaN. First, let's build a list containing the values to insert (which we will name new_val), using a loop. We go through all the elements of the y column and, at each iteration, append to new_val the value obtained by making our comparisons:
new_val = []
for i in np.arange(len(df.index)):
if df.y[i] == 2:
new_val.append("a")
elif df.y[i] == 1:
new_val.append("b")
else:
new_val.append(np.nan)
print("new_val: ", new_val)
## new_val: [nan, nan, 'a', 'b', nan, nan]
We are ready to modify the content of the x column of df, replacing it with new_val:
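Presumably:
df["x"] = new_val
print("df: \n", df)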
## df:
## city year x y
## 0 Marseille 2020 NaN 3
## 1 Aix 2019 NaN 3
## 2 Marseille 2018 a 2
## 3 Aix 2018 b 1
## 4 Paris 2019 NaN 4
## 5 Paris 2019 NaN 4
To replace several columns at the same time:
## df:
## city year x y
## 0 Marseille 2020 2 1
## 1 Aix 2019 3 1
## 2 Marseille 2018 4 1
## 3 Aix 2018 2 1
## 4 Paris 2019 1 1
## 5 Paris 2019 0 1
In the previous instruction, we replaced the contents of the x and y columns: with a vector of handwritten values for x, and with the value 1 for all observations for y.
10.7.2.3 Modifications on One or More Rows
To replace a row with a constant value (of little interest in the current example):
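Presumably:
df.iloc[1, :] = 1
print("df: \n", df)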
## df:
## city year x y
## 0 Marseille 2020 2 1
## 1 1 1 1 1
## 2 Marseille 2018 4 1
## 3 Aix 2018 2 1
## 4 Paris 2019 1 1
## 5 Paris 2019 0 1
It may be more interesting to replace an observation as follows:
## df:
## city year x y
## 0 Marseille 2020 2 1
## 1 Aix 2018 1 2
## 2 Marseille 2018 4 1
## 3 Aix 2018 2 1
## 4 Paris 2019 1 1
## 5 Paris 2019 0 1
To replace several rows, the method is identical:
## df:
## city year x y
## 0 Marseille 2020 2 1
## 1 Aix 2018 1 2
## 2 Marseille 2018 4 1
## 3 Aix 2018 -1 -1
## 4 Paris 2019 1 1
## 5 Paris 2019 0 1
10.8 Adding Values
Now let’s look at how to add values, first in a series, then in a dataframe.
10.8.1 For a Series
Let’s consider the following series:
## s_num: 5 1.0
## 0 4.0
## 4 -1.0
## 1 NaN
## dtype: float64
10.8.1.1 Adding a Single Value in a Series
To add a value, we use the append() method. Here, with s_num, as the index is manual, we must provide a series with a value for the index as well:
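A plausible reconstruction (append() was the pandas idiom when this was written):
s_num_2 = pd.Series([1], index = [2])
print("s_num_2: \n", s_num_2)
s_num = s_num.append(s_num_2)
print("s_num: \n", s_num)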
## s_num_2:
## 2 1
## dtype: int64
## s_num:
## 5 1.0
## 0 4.0
## 4 -1.0
## 1 NaN
## 2 1.0
## dtype: float64
Note that the append() method returns a copy: to actually add the values, we must assign the result.
With a series whose numerical index is automatically generated, we can set the ignore_index argument of append() to True, to indicate that the index of the object being added should be ignored:
## s:
## 0 10
## 1 2
## 2 4
## 3 2
## dtype: int64
10.8.1.2 Adding Several Values in a Series
To add several values, we proceed in the same way with the append() method; note again that, without assignment, the original series is left unchanged, as the second display below shows:
## 5 1.0
## 0 4.0
## 4 -1.0
## 1 NaN
## 2 1.0
## 2 1.0
## dtype: float64
## s_num: 5 1.0
## 0 4.0
## 4 -1.0
## 1 NaN
## 2 1.0
## dtype: float64
By having a series with an automatically generated numerical index:
## 0 10
## 1 2
## 2 4
## 3 2
## dtype: int64
10.8.2 For a dataframe
Let’s go back to our dataframe:
dict = {"city" : ["Marseille", "Aix",
"Marseille", "Aix", "Paris", "Paris"],
"year": [2019, 2019, 2018, 2018,2019, 2019],
"x": [1, 2, 2, 2, 0, 0],
"y": [3, 3, 2, 1, 4, 4],
}
df = pd.DataFrame(dict)
print("df : \n", df)
## df :
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
## 5 Paris 2019 0 4
10.8.2.1 Adding a Row in a Dataframe
As for a series, to add a row, we use the append()
method. First, let’s create a new dataframe with the line to add:
new_row = pd.DataFrame([["Marseille", "2021", 2, 4]],
columns = df.columns)
print("new_row: \n", new_row)
## new_row:
## city year x y
## 0 Marseille 2021 2 4
We made sure to have the same column names here, by passing the names of the df columns, i.e., df.columns, to the columns parameter of the pd.DataFrame() function.
Let's add the new row to df:
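Presumably:
df = df.append(new_row, ignore_index = True)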
Again, the append() method applied to a dataframe returns a copy and does not modify the original object.
It can be noted that when adding a row, if the column names are not given in the same order as in the destination dataframe, the sort parameter of the append() method controls the result:
- if sort=True, the columns of the resulting dataframe are sorted alphabetically;
- if sort=False, the column order of the destination dataframe is not modified.
new_row = pd.DataFrame([["2021", "Marseille", 2, 4]],
columns = ["year", "coty", "x", "y"])
print("new_row: \n", new_row)
## new_row:
## year coty x y
## 0 2021 Marseille 2 4
## with sort=True :
## city coty x y year
## 0 Marseille NaN 1 3 2019
## 1 Aix NaN 2 3 2019
## 2 Marseille NaN 2 2 2018
## 3 Aix NaN 2 1 2018
## 4 Paris NaN 0 4 2019
## 5 Paris NaN 0 4 2019
## 6 Marseille NaN 2 4 2021
## 7 NaN Marseille 2 4 2021
10.8.2.2 Adding Multiple Rows in a Dataframe
To add several rows, it is exactly the same principle: we append a dataframe of several rows, again with the same column names.
The rows to be inserted:
new_rows = pd.DataFrame([
["Marseille", "2022", 2, 4],
["Aix", "2022", 3, 3]],
columns = df.columns)
print("new_rows: \n", new_rows)
## new_rows:
## city year x y
## 0 Marseille 2022 2 4
## 1 Aix 2022 3 3
Then the insertion:
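Presumably:
df = df.append(new_rows, ignore_index = True)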
10.8.2.3 Adding a Column to a Dataframe
To add a column in a dataframe:
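A plausible reconstruction (the z values are random, so the display differs at each run):
df["z"] = np.random.rand(len(df.index))  # random values
print("df: \n", df)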
## df:
## city year x y z
## 0 Marseille 2019 1 3 0.117443
## 1 Aix 2019 2 3 0.393782
## 2 Marseille 2018 2 2 0.452730
## 3 Aix 2018 2 1 0.538148
## 4 Paris 2019 0 4 0.790622
## 5 Paris 2019 0 4 0.465836
## 6 Marseille 2021 2 4 0.435332
## 7 Marseille 2022 2 4 0.569479
## 8 Aix 2022 3 3 0.969259
10.8.2.4 Adding Multiple Columns to a Dataframe
To add several columns:
## df:
## city year x y z a b
## 0 Marseille 2019 1 3 0.117443 0.040556 0.689236
## 1 Aix 2019 2 3 0.393782 0.548120 0.929546
## 2 Marseille 2018 2 2 0.452730 0.462577 0.918117
## 3 Aix 2018 2 1 0.538148 0.376472 0.975302
## 4 Paris 2019 0 4 0.790622 0.327912 0.397002
## 5 Paris 2019 0 4 0.465836 0.813529 0.262626
## 6 Marseille 2021 2 4 0.435332 0.646552 0.430151
## 7 Marseille 2022 2 4 0.569479 0.047426 0.764531
## 8 Aix 2022 3 3 0.969259 0.994958 0.599731
10.9 Removing Duplicate Values
To remove duplicate values from a dataframe, pandas offers the drop_duplicates() method, which takes several optional arguments:
- subset: by indicating one or more column names, the search for duplicates is only done on these columns;
- keep: indicates which observation to keep in case of identified duplicates:
  - if keep='first', all duplicates are removed except the first occurrence,
  - if keep='last', all duplicates are removed except the last occurrence,
  - if keep=False, all duplicates are removed;
- inplace: Boolean (False by default) indicating whether the duplicates should be removed from the dataframe itself or whether a copy should be returned (the default).
Let's give some examples using this dataframe, which contains a duplicated row when considered in its entirety. If we focus only on years or cities, or both, other duplicates can be identified.
dict = {"city" : ["Marseille", "Aix",
"Marseille", "Aix", "Paris", "Paris"],
"year": [2019, 2019, 2018, 2018,2019, 2019],
"x": [1, 2, 2, 2, 0, 0],
"y": [3, 3, 2, 1, 4, 4],
}
df = pd.DataFrame(dict)
print(df)
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
## 5 Paris 2019 0 4
To remove the duplicates:
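Presumably:
print(df.drop_duplicates())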
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
Remove duplicates by keeping the last value of the identified duplicates:
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 5 Paris 2019 0 4
To remove identified duplicates when focusing on city names, and keeping only the first value:
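Presumably:
print(df.drop_duplicates(subset = "city", keep = "first"))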
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 4 Paris 2019 0 4
Same as above, but with a focus on (city, year) pairs:
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
Note that the original dataframe was not impacted, since we did not touch the inplace argument. If we now ask for the changes to be made to the dataframe itself instead of getting a copy:
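Likely:
df.drop_duplicates(inplace = True)
print(df)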
## city year x y
## 0 Marseille 2019 1 3
## 1 Aix 2019 2 3
## 2 Marseille 2018 2 2
## 3 Aix 2018 2 1
## 4 Paris 2019 0 4
To find out whether a value is duplicated in a dataframe, pandas offers the duplicated() method, which returns a mask indicating, for each observation, whether it is duplicated or not. It works like df.drop_duplicates(), except that the inplace parameter is not present.
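Presumably (focusing on city names):
print(df.duplicated(subset = "city"))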
## 0 False
## 1 False
## 2 True
## 3 True
## 4 False
## dtype: bool
We can use the any() method to find out if there are any duplicates:
## True
10.10 Operations
It is often necessary to perform operations on the columns of a dataframe, especially when it comes to creating a new variable.
By using the principles of column modification (see Section 10.7), it is quite easy to see that the functions and methods of NumPy (see Section 9.1) can be applied to the values of the columns.
For example, let’s consider the following dataframe:
dict = {"height" :
[58, 59, 60, 61, 62,
63, 64, 65, 66, 67,
68, 69, 70, 71, 72],
"weight":
[115, 117, 120, 123, 126,
129, 132, 135, 139, 142,
146, 150, 154, 159, 164]
}
df = pd.DataFrame(dict)
print(df)
## height weight
## 0 58 115
## 1 59 117
## 2 60 120
## 3 61 123
## 4 62 126
## 5 63 129
## 6 64 132
## 7 65 135
## 8 66 139
## 9 67 142
## 10 68 146
## 11 69 150
## 12 70 154
## 13 71 159
## 14 72 164
Let's add the column height_2, containing the squared values of the height column:
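Presumably:
df["height_2"] = df["height"] ** 2
print(df.head(3))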
## height weight height_2
## 0 58 115 3364
## 1 59 117 3481
## 2 60 120 3600
Now, let's add the column bmi, providing the body mass index values for the individuals in the dataframe (\(\text{bmi} = \frac{\text{weight}}{\text{height}^2}\)):
## height weight height_2 bmi
## 0 58 115 3364 0.034185
## 1 59 117 3481 0.033611
## 2 60 120 3600 0.033333
10.10.1 Statistics
pandas offers some methods for performing descriptive statistics for each column or row. These methods share the following arguments (all have a default value; the list is simplified here):
- axis: 0 for rows, 1 for columns
- skipna: if True, excludes missing values from the calculations
Among the available methods:
- mean(): mean
- mode(): mode
- median(): median
- std(): standard deviation
- min(): minimum
- max(): maximum
- mad(): mean absolute deviation
- sum(): sum
- prod(): product
- count(): number of elements
For example, to calculate the average of the values for each column:
dico = {"height" : [58, 59, 60, 61, 62],
"weight": [115, 117, 120, 123, 126],
"age": [28, 33, 31, 31, 29],
"height_cm": [162, 156, 172, 160, 158],
"married": [True, True, False, False, True],
"city": ["A", "B", "B", "B", "A"]
}
df = pd.DataFrame(dico)
print(df.mean())
## height 60.0
## weight 120.2
## age 30.4
## height_cm 161.6
## married 0.6
## dtype: float64
If desired, we can also compute the mean across the columns, for each row (meaningless here):
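Presumably:
print(df.mean(axis = 1))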
## 0 72.8
## 1 73.2
## 2 76.6
## 3 75.0
## 4 75.2
## dtype: float64
These functions can be applied to a single column. For example, to display the minimum value:
## min: 58
It is also useful to obtain the position of the min and max values; this is given by the idxmin() and idxmax() methods, respectively.
## pos min: 0
## pos max: 4
A very useful method is describe(); it returns descriptive statistics for all the numerical columns:
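Presumably:
print(df.describe())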
## height weight age height_cm
## count 5.000000 5.000000 5.000000 5.000000
## mean 60.000000 120.200000 30.400000 161.600000
## std 1.581139 4.438468 1.949359 6.228965
## min 58.000000 115.000000 28.000000 156.000000
## 25% 59.000000 117.000000 29.000000 158.000000
## 50% 60.000000 120.000000 31.000000 160.000000
## 75% 61.000000 123.000000 31.000000 162.000000
## max 62.000000 126.000000 33.000000 172.000000
10.11 Sorting
It is easy to sort a dataframe in ascending or descending order by one or more of its columns. To do this, we use the method sort_values()
. The syntax is as follows:
DataFrame.sort_values(by, axis=0, ascending=True,
inplace=False, kind="quicksort",
na_position="last")
- by: name or list of names of the column(s) used to sort
- axis: 0 for the index (default), 1 for the columns
- ascending: Boolean or list of Booleans; when True, the sorting is done by increasing values (default), when False, by decreasing values
- inplace: if True, the sorting affects the dataframe; otherwise a sorted copy is returned
- kind: choice of sorting algorithm (quicksort (default), mergesort, heapsort)
- na_position: if first, the missing values are placed at the beginning; if last (default), at the end.
Let us give some examples:
dict = {"height" : [58, 59, 60, 61, 62],
"weight": [115, np.nan, 120, 123, 126],
"age": [28, 33, 31, 31, 29],
"height_cm": [162, 156, 172, 160, 158],
"married": [True, True, np.nan, False, True],
"city": ["A", "B", "B", "B", "A"]
}
df = pd.DataFrame(dict)
To sort in descending order of the values in the height column:
## height weight age height_cm married city
## 4 62 126.0 29 158 True A
## 3 61 123.0 31 160 False B
## 2 60 120.0 31 172 NaN B
## 1 59 NaN 33 156 True B
## 0 58 115.0 28 162 True A
To sort in ascending order of the married values (recall that True is interpreted as 1 and False as 0), then in descending order of weight, placing the NaN values first:
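Presumably:
print(df.sort_values(by = ["married", "weight"],
                      ascending = [True, False],
                      na_position = "first"))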
## height weight age height_cm married city
## 2 60 120.0 31 172 NaN B
## 3 61 123.0 31 160 False B
## 1 59 NaN 33 156 True B
## 4 62 126.0 29 158 True A
## 0 58 115.0 28 162 True A
Note that the NaN values were placed at the top of each subgroup formed according to the married values.
10.12 Concatenation
It is frequent to obtain data from multiple sources when conducting an analysis. It is then important to be able to combine the different sources into one. In this section, we limit ourselves to concatenating dataframes in simple cases, in which we know a priori that all we have to do is put two dataframes side by side or one below the other. The case of slightly more elaborate joins, with matching on one or more columns, is discussed in Section 10.13.
First, let's create two dataframes with the same number of rows:
x_1 = pd.DataFrame(np.random.randn(5, 4),
columns=["a", "b", "c", "d"])
x_2 = pd.DataFrame(np.random.randn(5, 2),
columns = ["e", "f"])
print("x_1: \n", x_1)
## x_1:
## a b c d
## 0 0.231711 -0.474710 -0.309147 -2.032396
## 1 -0.174468 -0.642475 -0.625023 1.325887
## 2 0.531255 1.275284 -0.682826 -0.948186
## 3 0.777362 0.325113 -1.203486 1.209543
## 4 0.157622 -0.293555 0.111560 0.597679
##
## x_2:
## e f
## 0 -1.270093 0.120949
## 1 -0.193898 1.804172
## 2 -0.234694 0.939908
## 3 -0.171520 -0.153055
## 4 -0.363095 -0.067318
To "paste" the dataframe x_2 to the right of x_1, we can use the concat() function of pandas. To indicate that the concatenation is performed on the columns, the parameter axis is set to 1:
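Presumably:
print(pd.concat([x_1, x_2], axis = 1))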
## a b c d e f
## 0 0.231711 -0.474710 -0.309147 -2.032396 -1.270093 0.120949
## 1 -0.174468 -0.642475 -0.625023 1.325887 -0.193898 1.804172
## 2 0.531255 1.275284 -0.682826 -0.948186 -0.234694 0.939908
## 3 0.777362 0.325113 -1.203486 1.209543 -0.171520 -0.153055
## 4 0.157622 -0.293555 0.111560 0.597679 -0.363095 -0.067318
To paste dataframes one below the other, the append() method can be used, as described in Section 10.8.2.1, or the concat() function. Let's create a third dataframe, x_3, with the same column names as x_2:
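A plausible reconstruction (the values are random, so the display differs at each run):
x_3 = pd.DataFrame(np.random.randn(5, 2), columns = ["e", "f"])  # random values
print("x_3: \n", x_3)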
## x_3:
## e f
## 0 1.444721 0.325771
## 1 -0.855732 -0.697595
## 2 -0.276134 -1.258759
## 3 0.478094 -0.859764
## 4 0.571988 -0.173965
Let's add the observations of x_3 below those of x_2:
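Presumably:
print(pd.concat([x_2, x_3]))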
## e f
## 0 -1.270093 0.120949
## 1 -0.193898 1.804172
## 2 -0.234694 0.939908
## 3 -0.171520 -0.153055
## 4 -0.363095 -0.067318
## 0 1.444721 0.325771
## 1 -0.855732 -0.697595
## 2 -0.276134 -1.258759
## 3 0.478094 -0.859764
## 4 0.571988 -0.173965
As can be seen, the row index of x_2 was not modified. If we want it to be, we can say so via the ignore_index argument:
## e f
## 0 -1.270093 0.120949
## 1 -0.193898 1.804172
## 2 -0.234694 0.939908
## 3 -0.171520 -0.153055
## 4 -0.363095 -0.067318
## 5 1.444721 0.325771
## 6 -0.855732 -0.697595
## 7 -0.276134 -1.258759
## 8 0.478094 -0.859764
## 9 0.571988 -0.173965
If the column names are not identical, NaN values will be inserted:
## x_4:
## e g
## 0 1.534900 0.872856
## 1 1.856835 0.025914
## 2 0.171984 -0.191163
## 3 -0.292936 1.655677
## 4 -0.207182 -0.686884
## e f g
## 0 -1.270093 0.120949 NaN
## 1 -0.193898 1.804172 NaN
## 2 -0.234694 0.939908 NaN
## 3 -0.171520 -0.153055 NaN
## 4 -0.363095 -0.067318 NaN
## 5 1.534900 NaN 0.872856
## 6 1.856835 NaN 0.025914
## 7 0.171984 NaN -0.191163
## 8 -0.292936 NaN 1.655677
## 9 -0.207182 NaN -0.686884
10.13 Joins
Slightly more elaborate joins are often needed to bring the different data sources together into one. pandas offers a powerful way to do this: the merge() function.
To illustrate the different joins in this section, let’s create some dataframes:
exports_fr = pd.DataFrame(
{"country" : "France",
"year" : np.arange(2014, 2017),
"exports" : [816.8192172, 851.6632573, 867.4014253]
})
imports_fr = pd.DataFrame(
{"country" : "France",
"year" : np.arange(2015, 2018),
"imports" : [898.5242962, 936.3691166, 973.8762149]
})
exports_us = pd.DataFrame(
{"country" : "USA",
"year" : np.arange(2014, 2017),
"exports" : [2208.678084, 2217.733347, 2210.442218]
})
imports_us = pd.DataFrame(
{"country" : "USA",
"year" : np.arange(2015, 2018),
"imports" : [2827.336251, 2863.264745, np.nan]
})
imports_morocco = pd.DataFrame(
{"pays" : "Morocco",
"annee" : np.arange(2015, 2018),
"imports" : [46.39884177, 53.52375588, 56.68165748]
})
exports_morocco = pd.DataFrame(
{"country" : "Morocco",
"year" : np.arange(2014, 2017),
"exports" : [35.50207915, 37.45996653, 39.38228396]
})
exports = pd.concat([exports_fr, exports_us], ignore_index=True)
imports = pd.concat([imports_fr, imports_us], ignore_index=True)
print("exports: \n", exports)
## exports:
## country year exports
## 0 France 2014 816.819217
## 1 France 2015 851.663257
## 2 France 2016 867.401425
## 3 USA 2014 2208.678084
## 4 USA 2015 2217.733347
## 5 USA 2016 2210.442218
##
## imports:
## country year imports
## 0 France 2015 898.524296
## 1 France 2016 936.369117
## 2 France 2017 973.876215
## 3 USA 2015 2827.336251
## 4 USA 2016 2863.264745
## 5 USA 2017 NaN
The merge() function of pandas requires specifying the left table (which we will call x here) via the left argument, onto which the right table (which we will call y here), given via the right argument, is joined.
By default, the merge() function performs an inner join: all rows of x that have a match in y, and all columns of x and y, will be in the result of the join:
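Presumably:
print(pd.merge(left = imports, right = exports))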
## country year imports exports
## 0 France 2015 898.524296 851.663257
## 1 France 2016 936.369117 867.401425
## 2 USA 2015 2827.336251 2217.733347
## 3 USA 2016 2863.264745 2210.442218
If we want to change the type of join, we can modify the value of the how parameter of the merge() function, giving it one of the following values (illustrated right after the list):
- left: all rows of x, and all columns of x and y. Rows of x for which there is no match in y will have NaN values in the new columns. If there are several matches between x and y, all combinations are returned;
- inner: all rows of x for which there are corresponding values in y, and all columns of x and y. If there are several matches between x and y, all possible combinations are returned;
- right: all rows of y, and all columns of y and x. Rows of y for which there is no match in x will have NaN values in the new columns. If there are several matches between y and x, all combinations are returned;
- outer: all rows of x and y, and all columns of x and y. Rows of x for which there is no match in y, and rows of y for which there is no match in x, will have NaN values.
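Presumably:
print("left: \n", pd.merge(left = imports, right = exports, how = "left"))
print("\nright: \n", pd.merge(left = imports, right = exports, how = "right"))
print("\nouter: \n", pd.merge(left = imports, right = exports, how = "outer"))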
## left:
## country year imports exports
## 0 France 2015 898.524296 851.663257
## 1 France 2016 936.369117 867.401425
## 2 France 2017 973.876215 NaN
## 3 USA 2015 2827.336251 2217.733347
## 4 USA 2016 2863.264745 2210.442218
## 5 USA 2017 NaN NaN
##
## right:
## country year imports exports
## 0 France 2015 898.524296 851.663257
## 1 France 2016 936.369117 867.401425
## 2 USA 2015 2827.336251 2217.733347
## 3 USA 2016 2863.264745 2210.442218
## 4 France 2014 NaN 816.819217
## 5 USA 2014 NaN 2208.678084
##
## outer:
## country year imports exports
## 0 France 2015 898.524296 851.663257
## 1 France 2016 936.369117 867.401425
## 2 France 2017 973.876215 NaN
## 3 USA 2015 2827.336251 2217.733347
## 4 USA 2016 2863.264745 2210.442218
## 5 USA 2017 NaN NaN
## 6 France 2014 NaN 816.819217
## 7 USA 2014 NaN 2208.678084
The on argument, which expects a column name or a list of names, designates the columns used to make the join. The column names must be identical in both dataframes.
## country year_x imports year_y exports
## 0 France 2015 898.524296 2014 816.819217
## 1 France 2015 898.524296 2015 851.663257
## 2 France 2015 898.524296 2016 867.401425
## 3 France 2016 936.369117 2014 816.819217
## 4 France 2016 936.369117 2015 851.663257
## 5 France 2016 936.369117 2016 867.401425
## 6 France 2017 973.876215 2014 816.819217
## 7 France 2017 973.876215 2015 851.663257
## 8 France 2017 973.876215 2016 867.401425
## 9 USA 2015 2827.336251 2014 2208.678084
## 10 USA 2015 2827.336251 2015 2217.733347
## 11 USA 2015 2827.336251 2016 2210.442218
## 12 USA 2016 2863.264745 2014 2208.678084
## 13 USA 2016 2863.264745 2015 2217.733347
## 14 USA 2016 2863.264745 2016 2210.442218
## 15 USA 2017 NaN 2014 2208.678084
## 16 USA 2017 NaN 2015 2217.733347
## 17 USA 2017 NaN 2016 2210.442218
If the names of the columns used to make the join are different between the left and right dataframes, the left_on argument indicates the column name(s) of the left dataframe to be used for the join, and the right_on argument indicates the corresponding name(s) in the right dataframe:
pd.merge(left = imports_morocco, right = exports_morocco,
left_on= ["pays", "annee"], right_on = ["country", "year"] )
## pays annee imports country year exports
## 0 Morocco 2015 46.398842 Morocco 2015 37.459967
## 1 Morocco 2016 53.523756 Morocco 2016 39.382284
With the suffixes argument, suffixes can be defined to be appended to column names that exist in both x and y but are not used for the join. By default, the suffixes _x and _y are added.
print(pd.merge(left = imports, right = exports,
on = "country",
suffixes=("_left", "_right")).head(3))
## country year_left imports year_right exports
## 0 France 2015 898.524296 2014 816.819217
## 1 France 2015 898.524296 2015 851.663257
## 2 France 2015 898.524296 2016 867.401425
10.14 Aggregation
Sometimes we want to aggregate the values of a variable, for example from a quarterly to an annual frequency. With spatial observations, this can also be the case, for example, when data are available at the county level and aggregate values are needed at the state level.
To illustrate the different aggregation operations, let’s create a dataframe with unemployment data for different French regions, departments and years:
unemployment = pd.DataFrame(
{"region" : (["Bretagne"]*4 + ["Corse"]*2)*2,
"departement" : ["Cotes-d'Armor", "Finistere",
"Ille-et-Vilaine", "Morbihan",
"Corse-du-Sud", "Haute-Corse"]*2,
"year" : np.repeat([2011, 2010], 6),
"workers" : [8738, 12701, 11390, 10228, 975, 1297,
8113, 12258, 10897, 9617, 936, 1220],
"engineers" : [1420, 2530, 3986, 2025, 259, 254,
1334, 2401, 3776, 1979, 253, 241]
})
print(unemployment)
## region departement year workers engineers
## 0 Bretagne Cotes-d'Armor 2011 8738 1420
## 1 Bretagne Finistere 2011 12701 2530
## 2 Bretagne Ille-et-Vilaine 2011 11390 3986
## 3 Bretagne Morbihan 2011 10228 2025
## 4 Corse Corse-du-Sud 2011 975 259
## 5 Corse Haute-Corse 2011 1297 254
## 6 Bretagne Cotes-d'Armor 2010 8113 1334
## 7 Bretagne Finistere 2010 12258 2401
## 8 Bretagne Ille-et-Vilaine 2010 10897 3776
## 9 Bretagne Morbihan 2010 9617 1979
## 10 Corse Corse-du-Sud 2010 936 253
## 11 Corse Haute-Corse 2010 1220 241
As previously discussed (see Section 10.10.1), methods can be used to calculate simple statistics on the data set. For example, to display the average of each of the numerical columns:
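A sketch of such a call (in recent versions of pandas, numeric_only=True must be passed so that only the numerical columns are averaged):
# Average of each numerical column
print(unemployment.mean(numeric_only = True))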
## year 2010.500000
## workers 7364.166667
## engineers 1704.833333
## dtype: float64
What interests us in this section is performing calculations on subgroups of the data. The principle is simple: first, the data are split into groups (split), then an operation is applied to each group (apply), and finally the results are combined (combine). To form groups according to one or more factors before performing aggregation calculations, pandas offers the groupby() method. The argument provided is the name(s) of the column(s) used to form the groups.
10.14.1 Aggregation According to the Values of a Single Column
For example, let us assume that we want to obtain the total number of unemployed workers per year. First, we use the groupby() method on our dataframe, indicating that the groups must be created according to the values of the year column:
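A minimal sketch of this first step:
# Split the observations into one group per year
print(unemployment.groupby("year"))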
## <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x131bae2b0>
Then, we extract the variable workers:
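The two outputs below suggest that both the attribute syntax and the bracket syntax were shown; both return a SeriesGroupBy object (a sketch):
# Attribute access and bracket access are equivalent here
print(unemployment.groupby("year").workers)
print(unemployment.groupby("year")["workers"])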
## <pandas.core.groupby.groupby.SeriesGroupBy object at 0x131c28be0>
## <pandas.core.groupby.groupby.SeriesGroupBy object at 0x131c282b0>
And finally, we can perform the calculation on each subgroup and display the result:
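A sketch of this final step:
# Apply the sum to each group and combine the results
print(unemployment.groupby("year")["workers"].sum())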
## year
## 2010 43041
## 2011 45329
## Name: workers, dtype: int64
If we want to perform this calculation for several columns, for example workers and engineers, we just have to select beforehand the grouping variable and the variables on which the calculation is to be performed:
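A sketch of the selection followed by the aggregation:
# Keep the grouping column and the columns of interest, then sum by year
print(unemployment[["year", "workers", "engineers"]].groupby("year").sum())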
## workers engineers
## year
## 2010 43041 9984
## 2011 45329 10474
10.14.2 Aggregation According to the Values of Several Columns
Now, let’s assume that we want to aggregate by year and region. The only thing we need to do is to give a list containing the names of the columns used to create the different groups:
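A sketch with the two grouping columns:
# The result is indexed by the pair (year, region)
subset = unemployment[["year", "region", "workers", "engineers"]]
print(subset.groupby(["year", "region"]).sum())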
## workers engineers
## year region
## 2010 Bretagne 40885 9490
## Corse 2156 494
## 2011 Bretagne 43057 9961
## Corse 2272 513
10.15 Data Export and Import
pandas offers many functions for importing and exporting data in different formats.
10.15.1 Data Export
10.15.1.1 Exporting Tabular Data
10.15.1.1.1 To a CSV File
To export tabular data, such as the contents of a dataframe, pandas offers the to_csv() method, which accepts many specifications. Let’s look at some of the ones that seem to me to be the most common:
Argument | Description
---|---
path_or_buf | path to the file
sep | field separator character
decimal | character to use as the decimal separator
na_rep | representation to use for missing values
header | whether the column names should be exported (True by default)
index | whether the row names should be exported (True by default)
mode | Python writing mode (see Table 5.1; w by default)
encoding | character encoding (utf-8 by default)
compression | compression to use for the destination file (gzip, bz2, zip, xz)
line_terminator | end-of-line character
quotechar | character used to quote fields
chunksize | (integer) number of rows to write at a time
date_format | date format for datetime objects
Let’s assume that we want to export the contents of the unemployment dataframe, created above, to a CSV file whose fields are separated by semicolons, without exporting the index.
For export:
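A minimal sketch (the file name unemployment.csv is my choice, not from the source):
# Export without the index, using a semicolon as the field separator
unemployment.to_csv("unemployment.csv", sep = ";", index = False)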
If you want the CSV file to be compressed with gzip, name it with the extension .csv.gz and provide the value gzip to the compression parameter:
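For example (the file name is again an assumption):
# The compression argument matches the .csv.gz extension
unemployment.to_csv("unemployment.csv.gz", sep = ";",
                    index = False, compression = "gzip")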
10.15.1.1.2 To an HDF5 File
To save the data of a dataframe in an HDF5 file using HDFStore, pandas offers the to_hdf() method, which works in the same way as to_csv() (see Section 10.15.1.1.1). The path_or_buf argument must be specified to indicate the path, and the key argument identifies the object to be saved in the file.
The syntax is as follows:
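A sketch (the file name and the key are my choices; writing HDF5 files also requires the tables package to be installed):
# Save the dataframe under the key "unemployment" in an HDF5 file
unemployment.to_hdf("unemployment.h5", key = "unemployment")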
10.16 Data Import
pandas offers many functions for importing data. In this version of the course notes, we will discuss three of them: read_csv(), to read CSV files; read_excel(), to read Excel files; and read_hdf(), to read HDF5 files.
10.16.1 CSV Files
To import data from a CSV file, pandas offers the read_csv() function:
It is also possible to provide the read_csv() function with a URL pointing to a CSV file as the path.
Among the parameters that are frequently used:
- sep, delimiter: field separator
- decimal: decimal separator
- header: line number(s) to use as the header
- skiprows: line number(s) to skip at the beginning
- skipfooter: number of lines to skip at the end
- nrows: number of lines to read
- na_values: additional character strings to be considered as missing values (in addition to #N/A, #N/A N/A, #NA, -1.#IND, -1.#QNAN, -NaN, -nan, 1.#IND, 1.#QNAN, N/A, NA, NULL, NaN, n/a, nan, null)
- quotechar: quote character
- encoding: character encoding (utf-8 by default).
10.16.2 Excel Files
To import Excel files, pandas offers the read_excel() function.
path = "./fichiers_exemples/chomage.xlsx"
unemployment = pd.read_excel(path, skiprows = 2, header = 1, sheet_name = 1)
print(unemployment)
## Bretagne Cotes-d'Armor 2011 8738 1420
## 0 Bretagne Finistere 2011 12701 2530
## 1 Bretagne Ille-et-Vilaine 2011 11390 3986
## 2 Bretagne Morbihan 2011 10228 2025
## 3 Corse Corse-du-Sud 2011 975 259
## 4 Corse Haute-Corse 2011 1297 254
## 5 Bretagne Cotes-d'Armor 2010 8113 1334
## 6 Bretagne Finistere 2010 12258 2401
## 7 Bretagne Ille-et-Vilaine 2010 10897 3776
## 8 Bretagne Morbihan 2010 9617 1979
## 9 Corse Corse-du-Sud 2010 936 253
## 10 Corse Haute-Corse 2010 1220 241
Among the frequently used arguments:
- header: row number to use as the header
- sheet_name: sheet name or number
- skiprows: number of lines to skip at the beginning
- thousands: thousands separator.
10.17 Exercise
Exercise 1: Import and export
- Download the CSV file manually from the following address: http://egallic.fr/Enseignement/Python/Exercices/donnees/notes.csv and place it in the current directory. Import its contents into Python.
- Import the data into Python again, but this time by providing the URL directly to the import function.
- Now, import the contents of the file available at http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_decim.csv. The field separator is a semicolon and the decimal separator is a comma.
- Import the contents of the file http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_h.csv. The column names are not present.
- Import the contents of the file http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_h_s.csv. The first line is not to be imported.
- Import the contents of the first sheet of the Excel file http://egallic.fr/Enseignement/Python/Exercices/donnees/notes.xlsx.
- Import the contents of the second sheet (notes_h_s) from the Excel file available here: http://egallic.fr/Enseignement/Python/Exercices/donnees/notes.xlsx. The first line is a comment not to be considered during the import.
- Export the contents of the object containing the grades from the previous question in CSV format (comma as field separator, dot as decimal separator, do not keep the row numbers).
Exercise 2: Handling Dataframes
- Using the read_excel() function of the pandas library, import the contents of the sheet entitled notes_2012 from the Excel file available at the following address: http://egallic.fr/Enseignement/Python/Exercices/donnees/notes_etudiants.xlsx and store it in a variable called notes_2012.
- Display the first 6 rows of the dataset, then the dimensions of the table.
- Keep only the note_stat column of the notes_2012 data table in an object called tmp.
- Keep only the columns number_student, note_stat and note_macro in an object named tmp.
- Replace the content of tmp with the observations of notes_2012 for which the individual obtained a stat grade strictly greater than 10.
- Replace the content of tmp with the observations of notes_2012 for which the individual obtained a stat grade within the interval (10, 15).
- Look for duplicates in the notes_2012 data table; if there are any, remove them from the table.
- Display the type of the data in the num_etudiant column, then display the types of all the columns of notes_2012.
- Add the following columns to the notes_2012 table:
  - note_stat_maj: the stat grade (note_stat) increased by one point,
  - note_macro_maj: the macro grade (note_macro) increased by three points (do it in two steps: first add two points, then one more point).
- Rename the column annee to year.
- From the file notes_etudiants.xlsx (see question 1), import the contents of the sheets notes_2013, notes_2014 and prenoms and store them in the objects grades_2013, grades_2014 and first_names, respectively.
- Stack the contents of the data tables notes_2012, grades_2013 and grades_2014 in an object that will be called grades.
- Merge the grades and first_names tables using a left join, so as to add the information contained in the first-names table to the grade observations. The join must be done on the student number and the year; the resulting object will replace the content of grades.
- Sort the grades table by increasing year and decreasing macro grade.
- Create a column after_2012 which takes the value True if the observation concerns a grade assigned after 2012.
- By grouping on the grades dataframe, compute:
  - the annual mean and standard deviation of the grades for each of the two subjects,
  - the mean and standard deviation of the grades, by year and gender, for each of the two subjects.