5 Loading and Saving Data
To explore data and/or perform statistical or econometric analyses, it is important to know how to import and export data.
First of all, it is important to mention the notion of a working directory. In computer science, the current directory of a process refers to a directory of the file system associated with that process.
When we launch Jupyter Notebook, a tree structure is displayed, and we navigate inside it to create or open a notebook. The directory containing the notebook is the current directory. When Python is told to import data (or export objects), the origin (or destination) is interpreted relative to the current directory, unless an absolute path (i.e., a path starting from the root /) is used.
If a Python program is started from a terminal, the current directory is the directory in which the terminal is located at the time the program is started.
To display the current directory in Python, the following code can be used:
## /Users/ewengallic/Dropbox/Universite_Aix_Marseille/Magistere_2_Programming_for_big_data/Cours/chapters/python/Python_for_economists
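The call producing this output is not shown above. A minimal sketch, assuming the standard os module is used (the printed path depends on the machine on which the code is run):

```python
import os

# Path of the current working directory, returned as a string
current_dir = os.getcwd()
print(current_dir)
```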
The listdir() function of the os library is very useful: it lists all the files and directories contained in the current directory, or in any other directory if its path (absolute or relative) is given through the path argument. After importing the module (import os), the function can be called: os.listdir().
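As an illustration, the sketch below lists the current directory and then another directory given by a relative path. The choice of the parent directory ("..") is only an example.

```python
import os

# Entries (files and directories) of the current directory
entries_here = os.listdir()
print(entries_here)

# Entries of another directory, here the parent directory (relative path)
entries_parent = os.listdir("..")
print(entries_parent)
```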
5.1 Load Data
Depending on the data format, data import techniques differ. Other techniques, based on the pandas library, are presented in Section 10.16.2.
5.1.1 Text Files
When the data is stored in a text file (ASCII), Python offers the open() function.
The (simplified) syntax of the open() function is as follows:
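As a sketch, the signature documented for the built-in open() is recalled in the comment below (the less common closefd and opener parameters are left out), followed by a call spelling out each argument. The temporary file and its content are assumptions made for the example.

```python
import os
import tempfile

# Documented signature (simplified):
# open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None)

# Create a small temporary text file to have something to open
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, mode="w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# Same function, with every argument written explicitly
f = open(file=path, mode="r", buffering=-1, encoding="utf-8",
         errors="strict", newline=None)
content = f.read()
f.close()
print(content)
```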
Here is what the arguments correspond to (there are others):
- file: a string indicating the path and name of the file to be opened;
- mode: specifies the way the file is opened (see the table below for possible values);
- buffering: specifies, using an integer, the behavior to be adopted for buffering (1 for line buffering; an integer \(>1\) to indicate the size in bytes of the chunks to be buffered);
- encoding: specifies the encoding of the file;
- errors: specifies how to handle encoding and decoding errors (e.g., strict raises an exception, ignore ignores errors, replace replaces them, backslashreplace replaces malformed data with escape sequences);
- newline: controls how line endings are handled (\n, \r, etc.).
Value | Description |
---|---|
r | Opening to read (default) |
w | Opening to write |
x | Opening to create a document, fails if the file already exists |
a | Opening to write, adding at the end of the file if it already exists |
+ | Opening for update (read and write) |
b | To be added to an opening mode for binary files (rb or wb) |
t | Text mode (automatic decoding of bytes in Unicode). Default if not specified (adds to the mode, like b) |
It is important to remember to close the file once we have finished using it. To do this, we use the close() method.
In the fichiers_exemples folder is a file called fichier_texte.txt which contains three lines of text. Let’s open this file, and use the .read() method to display its content:
path = "./fichiers_exemples/fichier_texte.txt"
# Opening in read-only mode (default)
my_file = open(path, mode = "r")
print(my_file.read())
# Closing the file once we are done with it
my_file.close()
## Bonjour, je suis un fichier au format txt.
## Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.
## Trois lignes devraient suffir.
A common practice in Python is to open a file in a with block. The reason for this choice is that a file opened in such a block is automatically closed at the end of the block.
The syntax is as follows:
# Opening in read-only mode (default)
with open(path, "r") as mon_fichier:
    # Placeholder for the code that reads from mon_fichier
    data = function_to_get_data_from_my_file()
For example, to retrieve each line as an element of a list, a loop running through each line of the file can be used. At each iteration, the line is retrieved:
# Opening in read-only mode (default)
with open(path, "r") as my_file:
    data = [x for x in my_file]
print(data)
## ['Bonjour, je suis un fichier au format txt.\n', "Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.\n", 'Trois lignes devraient suffir.']
Note: at each iteration, the strip() method can be applied. It returns the character string of the line, after removing any whitespace characters (including the end-of-line character) at the beginning and at the end of the string:
# Opening in read-only mode (default)
with open(path, "r") as my_file:
    data = [x.strip() for x in my_file]
print(data)
## ['Bonjour, je suis un fichier au format txt.', "Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.", 'Trois lignes devraient suffir.']
The readlines() method can also be used to import lines into a list:
## ['Bonjour, je suis un fichier au format txt.\n', "Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.\n", 'Trois lignes devraient suffir.']
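The call to readlines() is not shown above. A minimal sketch, using a temporary file with hypothetical content in place of fichier_texte.txt:

```python
import os
import tempfile

# Hypothetical three-line file standing in for the example file
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as my_file:
    my_file.write("line 1\nline 2\nline 3")

# readlines() returns a list with one string per line (line endings kept)
with open(path, "r", encoding="utf-8") as my_file:
    data = my_file.readlines()
print(data)
```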
Character encoding may be a problem during import. In this case, it may be a good idea to set the encoding argument of the open() function explicitly. When this argument is omitted, the default encoding depends on the platform (locale). The encodings known to Python can be listed using the following method (code not executed in these notes):
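One possible way to inspect this, sketched with the standard locale and encodings modules. This is an assumption about how it may be done, and the printed values depend on the machine and on the Python version.

```python
import encodings.aliases
import locale

# Encoding typically used by open() when the encoding argument is omitted
print(locale.getpreferredencoding(False))

# Names of the codecs listed in Python's alias table
available = sorted(set(encodings.aliases.aliases.values()))
print(available)
```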
5.1.1.1 Import from the Internet
To import a text file from the Internet, functions from the urllib library can be used:
import urllib.request
url = "http://egallic.fr/Enseignement/Python/fichiers_exemples/fichier_texte.txt"
# urlopen() returns a response object whose read() method gives bytes
with urllib.request.urlopen(url) as my_file:
    data = my_file.read()
print(data)
## b"Bonjour, je suis un fichier au format txt.\nJe contiens plusieurs lignes, l'id\xc3\xa9e \xc3\xa9tant de montrer comment fonctionne l'importation d'un tel fichier dans Python.\nTrois lignes devraient suffir."
As can be seen, the content is returned as bytes (note the b prefix), so that accented characters appear as escape sequences. We can apply the decode() method to obtain a character string:
## Bonjour, je suis un fichier au format txt.
## Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.
## Trois lignes devraient suffir.
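The decoding step can be sketched on a short byte string (a hypothetical excerpt of the content shown above), without querying the server:

```python
# Bytes as they might be received from the server (UTF-8 encoded)
raw = b"l'id\xc3\xa9e \xc3\xa9tant de montrer"

# Convert the bytes into a character string
text = raw.decode("utf-8")
print(text)
```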
5.1.2 CSV Files
CSV (comma-separated values) files are very common. Many databases export their data to CSV (e.g., World Bank, FAO, Eurostat, etc.). To import them into Python, you can use the csv module.
Again, we use the open() function, with the arguments described in Section 5.1.1. Then, we use the reader() function of the csv module:
import csv
with open('./fichiers_exemples/fichier_csv.csv') as my_file:
    my_file_reader = csv.reader(my_file, delimiter=',', quotechar='"')
    data = [x for x in my_file_reader]
print(data)
## [['nom', 'prénom', 'équipe'], ['Irving', ' "Kyrie"', ' "Celtics"'], ['James', ' "Lebron"', ' "Lakers"', ''], ['Curry', ' "Stephen"', ' "Golden State Warriors"']]
The reader() function can take several arguments, described in Table 5.2.
Argument | Description |
---|---|
csvfile | The object opened with open() |
dialect | Argument specifying the “dialect” of the CSV file (e.g., excel, excel-tab, unix) |
delimiter | The character delimiting the fields (i.e., the values of the variables) |
quotechar | Character used to surround fields containing special characters |
escapechar | Escape character |
doublequote | Controls how the quotechar appears inside a field: when True, the character is doubled, when False, the escape character is used as a prefix to the quotechar |
lineterminator | String of characters used to end a line |
skipinitialspace | When True, the whitespace character located just after the field separation character is ignored |
strict | When True, raises an exception if the CSV input is malformed |
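To illustrate one of these arguments, the sketch below reads hypothetical CSV content in which each comma is followed by a space, as in the example file. With skipinitialspace=True, the space is dropped and the quotes are then interpreted as field delimiters instead of being kept in the values.

```python
import csv
import io

# Hypothetical content, with a space after each comma
content = 'nom, "prénom", "équipe"\nIrving, "Kyrie", "Celtics"\n'

# io.StringIO makes the string behave like an opened text file
with io.StringIO(content) as my_file:
    reader = csv.reader(my_file, delimiter=",", quotechar='"',
                        skipinitialspace=True)
    data = [row for row in reader]
print(data)
```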
A CSV file can also be imported as a dictionary, using the csv.DictReader() class of the csv module:
import csv
path = "./fichiers_exemples/fichier_csv.csv"
with open(path) as my_file:
    my_file_csv = csv.DictReader(my_file)
    data = [ligne for ligne in my_file_csv]
print(data)
## [OrderedDict([('nom', 'Irving'), ('prénom', ' "Kyrie"'), ('équipe', ' "Celtics"')]), OrderedDict([('nom', 'James'), ('prénom', ' "Lebron"'), ('équipe', ' "Lakers"'), (None, [''])]), OrderedDict([('nom', 'Curry'), ('prénom', ' "Stephen"'), ('équipe', ' "Golden State Warriors"')])]
5.1.2.1 Import From the Internet
As with txt files, a CSV file hosted on the Internet can be loaded:
import csv
import urllib.request
import codecs
url = "http://egallic.fr/Enseignement/Python/fichiers_exemples/fichier_csv.csv"
with urllib.request.urlopen(url) as my_file:
    # The response yields bytes: decode each line before parsing it
    my_file_csv = csv.reader(codecs.iterdecode(my_file, 'utf-8'))
    data = [ligne for ligne in my_file_csv]
print(data)
## [['nom', 'prénom', 'équipe'], ['Irving', ' "Kyrie"', ' "Celtics"'], ['James', ' "Lebron"', ' "Lakers"', ''], ['Curry', ' "Stephen"', ' "Golden State Warriors"']]
5.1.3 JSON Files
To import files in JSON format (JavaScript Object Notation), which are widely used when communicating with an API, you can use the json library, and its load() function:
import json
url = './fichiers_exemples/tweets.json'
with open(url) as my_file_json:
    data = json.load(my_file_json)
Then, you can display the imported content using the pprint() function:
## {'created_at': 'Wed Sep 26 07:38:05 +0000 2018',
## 'id': 11,
## 'loc': [{'long': 5.3698}, {'lat': 43.2965}],
## 'text': 'Un tweet !',
## 'user_mentions': [{'id': 111, 'screen_name': 'nom_twittos1'},
## {'id': 112, 'screen_name': 'nom_twittos2'}]}
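The display step can be sketched as follows, with pprint() imported from the standard pprint module and a hypothetical JSON string standing in for the file:

```python
import json
from pprint import pprint

# Hypothetical JSON text, shaped like the example file
raw = ('{"id": 11, "text": "Un tweet !", '
       '"user_mentions": [{"id": 111, "screen_name": "nom_twittos1"}]}')
data = json.loads(raw)

# pprint() sorts dictionary keys and wraps long structures over several lines
pprint(data)
```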
5.1.3.1 Import from the Internet
Once again, it is possible to import JSON files from the Internet:
import json
import urllib.request
from pprint import pprint
url = "http://egallic.fr/Enseignement/Python/fichiers_exemples/tweets.json"
with urllib.request.urlopen(url) as my_file:
    data = json.load(my_file)
pprint(data)
## {'created_at': 'Wed Sep 26 07:38:05 +0000 2018',
## 'id': 11,
## 'loc': [{'long': 5.3698}, {'lat': 43.2965}],
## 'text': 'Un tweet !',
## 'user_mentions': [{'id': 111, 'screen_name': 'nom_twittos1'},
## {'id': 112, 'screen_name': 'nom_twittos2'}]}
5.1.4 Excel Files
Excel files (xls or xlsx) are also widely used in economics. The reader is referred to Section 10.16.2 for a method of importing Excel data with the pandas library.
5.2 Exporting data
It is not uncommon to have to export data, for instance to share it. Again, the open() function is used, this time by changing the value of the mode argument (see Table 5.1).
5.2.1 Text Files
Let’s say we need to export lines of text to a file. Before giving an example with the open() function, let’s look at two important functions to convert the contents of some objects to text.
The first, str(), returns a string version of an object. We have already applied it to numbers that we wanted to concatenate in Section 2.1.4.
## "['pomme', 1, 3]"
This instruction returns the list as a string: "['pomme', 1, 3]".
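The instruction itself is not shown above. A minimal sketch, assuming the list is the one appearing in the output:

```python
# A list mixing a string and numbers
x = ["pomme", 1, 3]

# str() returns the text version of the whole list
x_str = str(x)
print(x_str)
```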
The second function that seems important to address is repr(). This function returns a string containing a printable representation of an object. In addition, this string can usually be read back by the interpreter.
## "'Fromage, tu veux du fromage ?\\n'"
The result is: "'Fromage, tu veux du fromage ?\\n'".
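As a sketch, assuming the string is the one appearing in the output, repr() can be applied as follows. The last line checks that the representation can be evaluated back into the original string.

```python
s = "Fromage, tu veux du fromage ?\n"

# repr() keeps the quotes and shows the line break as an escape sequence
s_repr = repr(s)
print(s_repr)

# The representation is valid Python code for the original string
print(eval(s_repr) == s)
```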
Let’s say we want to export two lines:
- the first, a text that indicates a title (“Kyrie Irving Characteristics”);
- the second, a dictionary containing information about Kyrie Irving (see below).
Let’s define this dictionary:
z = { "name": "Kyrie",
"surname": "Irving",
"date_of_birth": 1992,
"teams": ["Cleveland", "Boston", "Nets"]}
One of the syntaxes for exporting data in txt format is:
# Opening in write mode
path = "path/to/file.txt"
with open(path, "w") as my_file:
    # Placeholder for the code that writes to my_file
    function_to_export()
We create a variable indicating the path to the file. Then we open the file in writing mode by specifying the argument mode = "w". Then, we still have to write our lines in the file.
path = "./fichiers_exemples/Irving.txt"
with open(path, mode = "w") as my_file:
    my_file.write("Characteristics of Kyrie Irving\n")
    my_file.writelines(repr(z))
## 32
If the file already exists, opening it with mode="w" overwrites the old file with the new one. If we want to add lines to the existing file instead, we will use mode="a", for example:
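A minimal sketch of appending, using a temporary file (a hypothetical stand-in for Irving.txt) and an arbitrary second line:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "Irving.txt")
with open(path, mode="w", encoding="utf-8") as my_file:
    my_file.write("Characteristics of Kyrie Irving\n")

# mode="a" keeps the existing content and writes at the end of the file
with open(path, mode="a", encoding="utf-8") as my_file:
    my_file.write("A line added afterwards\n")

with open(path, mode="r", encoding="utf-8") as my_file:
    content = my_file.read()
print(content)
```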
If we want to be warned if the file already exists, and to make the writing fail if this is the case, we can use mode="x":
## Error in py_call_impl(callable, dots$args, dots$keywords): FileExistsError: [Errno 17] File exists: './fichiers_exemples/Irving.txt'
##
## Detailed traceback:
## File "<string>", line 1, in <module>
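The behavior of mode="x" can be sketched as follows, again with a temporary file as a hypothetical stand-in. The first creation succeeds, and the second attempt raises FileExistsError, which is caught here instead of stopping the program.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "Irving.txt")

# First creation succeeds because the file does not exist yet
with open(path, mode="x", encoding="utf-8") as my_file:
    my_file.write("Characteristics of Kyrie Irving\n")

# Second attempt fails: mode="x" refuses to overwrite an existing file
try:
    with open(path, mode="x", encoding="utf-8") as my_file:
        my_file.write("This line is never written\n")
    already_exists = False
except FileExistsError as err:
    already_exists = True
    print("File not overwritten:", err)
```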
5.2.2 CSV Files
As economists, we are more likely to have to export data in CSV format rather than text, due to the rectangular structure of the data we are handling. As for the import of CSV (cf. Section 5.1.2), we use the csv module. To write to the file, we use the writer() function. The formatting arguments of this function are the same as those of the reader() function (see Table 5.2).
Example of creating a CSV file:
import csv
path = "./fichiers_exemples/ffile_export.csv"
# newline='' lets the csv module handle line endings itself
with open(path, mode='w', newline='') as my_file:
    my_file_write = csv.writer(my_file, delimiter=',',
                               quotechar='"',
                               quoting=csv.QUOTE_MINIMAL)
    my_file_write.writerow(['Country', 'Year', 'Quarter', 'GR_PIB'])
    my_file_write.writerow(['France', '2017', 'Q4', 0.7])
    my_file_write.writerow(['France', '2018', 'Q1', 0.2])
## 29
## 20
## 20
Of course, most of the time, we do not write each entry by hand. We export the data contained in a structure. Section 10.16.2 provides examples of this type of export, when the data are contained in two-dimensional tables created with the pandas library.
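Without pandas, a structure such as a list of rows can also be exported in one call with the writerows() method of the writer object. The sketch below assumes hypothetical data and a temporary file, then reads the file back to check its content.

```python
import csv
import os
import tempfile

# Hypothetical data held in a list of rows (header first)
rows = [["Country", "Year", "Quarter", "GR_PIB"],
        ["France", "2017", "Q4", 0.7],
        ["France", "2018", "Q1", 0.2]]

path = os.path.join(tempfile.mkdtemp(), "export.csv")
# newline="" lets the csv module control line endings itself
with open(path, mode="w", newline="", encoding="utf-8") as my_file:
    writer = csv.writer(my_file, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)

# Reading the file back: every value comes back as a string
with open(path, newline="", encoding="utf-8") as my_file:
    exported = list(csv.reader(my_file))
print(exported)
```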
5.2.3 JSON Files
It may be necessary to save structured data in JSON format, for example when an API (e.g., the Twitter API) has been used that returns objects in JSON format.
To do this, we will use the json library. Its dumps() function serializes an object (for example a list, like what you get with the Twitter API queried with the twitter-python library) into a JSON string, while its dump() function writes the serialized object directly to an opened file.
import json
x = [1, "apple", ["seed", "red"]]
y = { "name": "Kyrie",
      "surname": "John",
      "year_of_birth": 1992,
      "teams": ["Cleveland", "Boston", "Nets"]}
# dumps() returns the JSON text as a string
x_json = json.dumps(x)
y_json = json.dumps(y)
print("x_json: ", x_json)
print("y_json: ", y_json)
## x_json: [1, "apple", ["seed", "red"]]
## y_json: {"name": "Kyrie", "surname": "John", "year_of_birth": 1992, "teams": ["Cleveland", "Boston", "Nets"]}
By default, non-ASCII characters (e.g., accented letters) are escaped as sequences of type \uXXXX, which can make the output harder to read. Setting the argument ensure_ascii to False keeps these characters as they are.
x_json = json.dumps(x, ensure_ascii=False)
y_json = json.dumps(y, ensure_ascii=False)
print("x_json: ", x_json)
print("y_json: ", y_json)
## x_json: [1, "apple", ["seed", "red"]]
## y_json: {"name": "Kyrie", "surname": "John", "year_of_birth": 1992, "teams": ["Cleveland", "Boston", "Nets"]}
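To see the effect of ensure_ascii, the sketch below uses a list containing an accented character (the values are chosen for illustration only):

```python
import json

x_accent = [1, "pomme", ["pépin", "rouge"]]

# Default behavior: non-ASCII characters are escaped
escaped = json.dumps(x_accent)
# With ensure_ascii=False, they are written as they are
readable = json.dumps(x_accent, ensure_ascii=False)
print(escaped)
print(readable)
```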
To write x and y to a file, the dump() function takes the object to serialize and the opened file:
path = "./fichiers_exemples/export_json.json"
with open(path, 'w', encoding="utf-8") as f:
    # One JSON document per line
    json.dump(x, f, ensure_ascii=False)
    f.write('\n')
    json.dump(y, f, ensure_ascii=False)
## 1
If we want to re-import in Python the content of the file export_json.json:
path = "./fichiers_exemples/export_json.json"
with open(path, "r", encoding="utf-8") as f:
    data = []
    for line in f:
        # Each line holds one JSON document
        data.append(json.loads(line))
print(data)
## [[1, 'apple', ['seed', 'red']], {'name': 'Kyrie', 'surname': 'John', 'year_of_birth': 1992, 'teams': ['Cleveland', 'Boston', 'Nets']}]
5.2.4 Exercise
- Create a list named a containing information on the unemployment rate in France in the second quarter of 2018. This list must contain three elements:
  - the year;
  - the quarter;
  - the value of the unemployment rate (\(9.1\%\)).
- Export the contents of the list a in CSV format, preceded by a line specifying the names of the fields. Use the semicolon (;) as a field separator.
- Import the file created in the previous question into Python.