5 Loading and Saving Data

To explore data and/or perform statistical or econometric analyses, it is important to know how to import and export data.

First of all, it is important to mention the notion of a working directory. In computer science, the current directory of a process refers to a directory of the file system associated with that process.

When we launch Jupyter Notebook, a tree structure is displayed, and we navigate inside it to create or open a notebook. The directory containing the notebook is the current directory. When Python is told to import data (or export objects), the origin (or destination) is given relative to the current directory, unless absolute paths (i.e., paths starting from the root /) are used.

If a Python program is started from a terminal, the current directory is the directory in which the terminal is located at the time the program is started.

To display the current directory in Python, the following code can be used:

import os
cwd = os.getcwd()
print(cwd)
## /Users/ewengallic/Dropbox/Universite_Aix_Marseille/Magistere_2_Programming_for_big_data/Cours/chapters/python/Python_for_economists
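As a minimal sketch of how a relative path is resolved against the current directory, the snippet below turns a relative path into an absolute one. The folder and file names are those used later in this chapter; the file does not need to exist for the path to be computed.

```python
import os

# Relative path: interpreted from the current directory
relative_path = os.path.join("fichiers_exemples", "fichier_texte.txt")

# Absolute path: the current directory is prepended
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))
## False
print(os.path.isabs(absolute_path))
## True
```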

The listdir() function of the os library is very useful: it lists all the files and directories contained in the current directory, or in any other directory if the path argument gives its path (absolute or relative). After importing the library (import os), the function can be called: os.listdir().
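The following sketch illustrates listdir() on a temporary directory, so that it does not depend on files already present on the machine. The file and folder names (a.txt, b.csv, sub) are made up for the illustration.

```python
import os
import tempfile

# Work in a temporary directory so the sketch does not depend on existing files
with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two empty files and one sub-directory
    for name in ["a.txt", "b.csv"]:
        open(os.path.join(tmp_dir, name), "w").close()
    os.mkdir(os.path.join(tmp_dir, "sub"))

    # listdir() returns names in arbitrary order, hence the sorting
    content = sorted(os.listdir(tmp_dir))

print(content)
## ['a.txt', 'b.csv', 'sub']
```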

5.1 Load Data

Depending on the data format, data import techniques differ.

Chapter 10 provides other ways to import data, with the pandas library.

5.1.1 Text Files

When the data is present in a text file (ASCII), Python offers the open() function.

The (simplified) syntax of the open() function is as follows:

open(file, mode='r', buffering=-1,
  encoding=None, errors=None, newline=None)

Here is what the arguments correspond to (there are others):

  • file: a string indicating the path and name of the file to be opened;
  • mode: specifies the way the file is opened (see the lines below for possible values);
  • buffering: an integer specifying the buffering policy (0 to disable buffering, in binary mode only; 1 for line buffering, in text mode only; an integer \(>1\) to indicate the size in bytes of the chunks to be buffered);
  • encoding: specifies the encoding of the file;
  • errors: specifies how to handle encoding and decoding errors (e.g., strict raises an exception, ignore ignores errors, replace substitutes a replacement marker for malformed data, backslashreplace replaces malformed data with backslash escape sequences);
  • newline: controls how line endings are handled (\n, \r, etc.).

Table 5.1: Main Values for How to Open Files.
Value   Description
r       Open for reading (default)
w       Open for writing
x       Open to create a file; fails if the file already exists
a       Open for writing, appending to the end of the file if it already exists
+       Open for updating (reading and writing)
b       Binary mode; to be added to an opening mode (rb or wb)
t       Text mode (automatic decoding of bytes to Unicode); default if not specified (to be added to an opening mode, like b)
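To make the difference between text mode and binary mode concrete, here is a small sketch that writes a string and reads it back in both modes. The file name demo.txt and the variable names are chosen for the illustration only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path_demo = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # "w": text mode, the string is encoded to bytes when written
    with open(path_demo, mode="w", encoding="utf-8") as f:
        f.write("idée")

    # "r" (same as "rt"): bytes are decoded, a str is returned
    with open(path_demo, mode="r", encoding="utf-8") as f:
        as_text = f.read()

    # "rb": no decoding, a bytes object is returned
    with open(path_demo, mode="rb") as f:
        as_bytes = f.read()

print(as_text)
## idée
print(as_bytes)
## b'id\xc3\xa9e'
```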

It is important to remember to close the file once we have finished using it. To do this, we use the close() method.

In the fichiers_exemples folder is a file called fichier_texte.txt which contains three lines of text. Let’s open this file, and use the .read() method to display its content:

path = "./fichiers_exemples/fichier_texte.txt"
# Opening in read-only mode (default)
my_file = open(path, mode = "r")
print(my_file.read())
## Bonjour, je suis un fichier au format txt.
## Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.
## Trois lignes devraient suffir.
my_file.close()

A common practice in Python is to open a file in a with block. The reason for this choice is that a file opened in such a block is automatically closed at the end of the block.

The syntax is as follows:

# Opening in read-only mode (default)
with open(path, "r") as my_file:
  data = function_to_get_data_from_my_file(my_file)

For example, to retrieve each line as an element of a list, a loop running through each line of the file can be used. At each iteration, the line is retrieved:

# Opening in read-only mode (default)
with open(path, "r") as my_file:
  data = [x for x in my_file]
print(data)
## ['Bonjour, je suis un fichier au format txt.\n', "Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.\n", 'Trois lignes devraient suffir.']

Note: at each iteration, the strip() method can be applied. It returns the character string of the line, with any whitespace characters (including the newline character) removed from both the beginning and the end of the string:

# Opening in read-only mode (default)
with open(path, "r") as my_file:
  data = [x.strip() for x in my_file]
print(data)
## ['Bonjour, je suis un fichier au format txt.', "Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.", 'Trois lignes devraient suffir.']
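As a quick illustration of what is removed and where, the sketch below compares strip() with its one-sided variants lstrip() and rstrip(). The padded string is made up for the example.

```python
line = "  Trois lignes devraient suffir.  \n"  # illustrative string with padding

# strip(): whitespace removed at both ends
print(repr(line.strip()))
## 'Trois lignes devraient suffir.'

# lstrip(): whitespace removed at the beginning only
print(repr(line.lstrip()))
## 'Trois lignes devraient suffir.  \n'

# rstrip("\n"): only trailing newline characters removed
print(repr(line.rstrip("\n")))
## '  Trois lignes devraient suffir.  '
```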

The readlines() method can also be used to import lines into a list:

with open(path, "r") as my_file:
    data = my_file.readlines()
print(data)
## ['Bonjour, je suis un fichier au format txt.\n', "Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.\n", 'Trois lignes devraient suffir.']

Character encoding may be a problem during import. In this case, it may be a good idea to change the value of the encoding argument of the open() function. When this argument is not specified, the default encoding depends on the platform and its locale. The locale aliases known to Python, and the encodings associated with them, are obtained as follows (code not executed in these notes):

import locale
locale.locale_alias
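The sketch below shows, under simple assumptions, what happens when the encoding used for reading does not match the one used for writing, and how the encoding and errors arguments change the outcome. The file name latin1.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path_latin = os.path.join(tmp_dir, "latin1.txt")  # hypothetical file name

    # Write a file encoded in Latin-1 rather than UTF-8
    with open(path_latin, mode="w", encoding="latin-1") as f:
        f.write("l'idée")

    # Reading with the matching encoding gives back the original text
    with open(path_latin, encoding="latin-1") as f:
        good = f.read()

    # Reading as UTF-8 fails by default (errors="strict")
    try:
        with open(path_latin, encoding="utf-8") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        failed = True

    # errors="replace" substitutes the undecodable byte with U+FFFD
    with open(path_latin, encoding="utf-8", errors="replace") as f:
        replaced = f.read()

print(good)
## l'idée
print(failed)
## True
print(replaced == "l'id\ufffde")
## True
```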

5.1.1.1 Import from the Internet

To import a text file from the Internet, methods from the urllib library can be used:

import urllib.request
url = "http://egallic.fr/Enseignement/Python/fichiers_exemples/fichier_texte.txt"
with urllib.request.urlopen(url) as my_file:
   data = my_file.read()
print(data)
## b"Bonjour, je suis un fichier au format txt.\nJe contiens plusieurs lignes, l'id\xc3\xa9e \xc3\xa9tant de montrer comment fonctionne l'importation d'un tel fichier dans Python.\nTrois lignes devraient suffir."

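As can be seen, the result is a bytes object (note the b prefix), in which accented characters appear as escaped bytes. We can apply the decode() method, which uses UTF-8 by default, to obtain a character string: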

print(data.decode())
## Bonjour, je suis un fichier au format txt.
## Je contiens plusieurs lignes, l'idée étant de montrer comment fonctionne l'importation d'un tel fichier dans Python.
## Trois lignes devraient suffir.
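The encoding passed to decode() must match the one used by the server. As a sketch that needs no network access, the snippet below decodes an extract of the bytes shown above with two different encodings; only UTF-8 gives the expected text.

```python
# Extract of the bytes returned above (no network access needed here)
raw = b"l'id\xc3\xa9e \xc3\xa9tant"

# Correct encoding: each two-byte sequence becomes one accented character
print(raw.decode("utf-8"))
## l'idée étant

# Wrong encoding: each byte is read as a separate character
print(raw.decode("latin-1"))
## l'idÃ©e Ã©tant
```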

5.1.2 CSV Files

CSV files (comma-separated values) are very common. Many databases export their data to CSV (e.g., World Bank, FAO, Eurostat, etc.). To import them into Python, you can use the csv module.

Again, we use the open() function, with the arguments described in Section 5.1.1. Then, we use the reader() function of the csv module:

import csv
with open('./fichiers_exemples/fichier_csv.csv') as my_file:
  my_file_reader = csv.reader(my_file, delimiter=',', quotechar='"')
  data = [x for x in my_file_reader]

print(data)
## [['nom', 'prénom', 'équipe'], ['Irving', ' "Kyrie"', ' "Celtics"'], ['James', ' "Lebron"', ' "Lakers"', ''], ['Curry', ' "Stephen"', ' "Golden State Warriors"']]

The reader() function can take several arguments, described in Table 5.2.

Table 5.2: Arguments of the reader() Function
Argument           Description
csvfile            The object opened with open()
dialect            Argument specifying the “dialect” of the CSV file (e.g., excel, excel-tab, unix)
delimiter          The character delimiting the fields (i.e., the values of the variables)
quotechar          Character used to surround fields containing special characters
escapechar         Escape character
doublequote        Controls how the quotechar appears inside a field: when True, the character is doubled; when False, the escape character is used as a prefix to the quotechar
lineterminator     String of characters used to end a line
skipinitialspace   When True, whitespace located just after the field separation character is ignored
strict             When True, raises an exception on bad CSV input
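In the output above, the first names and team names keep a leading space and their quotation marks, because each delimiter in the file is followed by a space. As a sketch, the skipinitialspace argument handles this case. The file content is reproduced in memory here, as inferred from the output shown earlier, so that the example runs without the sample file.

```python
import csv
import io

# In-memory stand-in for the sample file (content assumed from the output above)
content = 'nom,prénom,équipe\nIrving, "Kyrie", "Celtics"\n'

rows_default = list(csv.reader(io.StringIO(content)))
rows_skip = list(csv.reader(io.StringIO(content), skipinitialspace=True))

# Default: the space and the quotation marks are kept in the field
print(rows_default[1])
## ['Irving', ' "Kyrie"', ' "Celtics"']

# skipinitialspace=True: the space is skipped, so the quotes are interpreted
print(rows_skip[1])
## ['Irving', 'Kyrie', 'Celtics']
```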

A CSV file can also be imported as a list of dictionaries, using the csv.DictReader class of the csv module:

import csv
path = "./fichiers_exemples/fichier_csv.csv"
with open(path) as my_file:
    my_file_csv = csv.DictReader(my_file)
    data = [ligne for ligne in my_file_csv]
print(data)
## [OrderedDict([('nom', 'Irving'), ('prénom', ' "Kyrie"'), ('équipe', ' "Celtics"')]), OrderedDict([('nom', 'James'), ('prénom', ' "Lebron"'), ('équipe', ' "Lakers"'), (None, [''])]), OrderedDict([('nom', 'Curry'), ('prénom', ' "Stephen"'), ('équipe', ' "Golden State Warriors"')])]
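In the output above, the row for James contains a None key: that row has one more field than the header, and the extra values are stored as a list under the key given by the restkey argument (None by default). A minimal sketch, with a made-up two-column file and the illustrative key name "extra":

```python
import csv
import io

# The second data row has one more field than the header
content = "nom,prénom\nIrving,Kyrie\nJames,Lebron,\n"

reader = csv.DictReader(io.StringIO(content), restkey="extra")
rows = [dict(row) for row in reader]

print(rows[0])
## {'nom': 'Irving', 'prénom': 'Kyrie'}
print(rows[1])
## {'nom': 'James', 'prénom': 'Lebron', 'extra': ['']}
```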

5.1.2.1 Import From the Internet

As with txt files, a CSV file hosted on the Internet can be loaded:

import csv
import urllib.request
import codecs

url = "http://egallic.fr/Enseignement/Python/fichiers_exemples/fichier_csv.csv"
with urllib.request.urlopen(url) as my_file:
    my_file_csv = csv.reader(codecs.iterdecode(my_file, 'utf-8'))
    data = [ligne for ligne in my_file_csv]
print(data)
## [['nom', 'prénom', 'équipe'], ['Irving', ' "Kyrie"', ' "Celtics"'], ['James', ' "Lebron"', ' "Lakers"', ''], ['Curry', ' "Stephen"', ' "Golden State Warriors"']]

5.1.3 JSON Files

To import files in JSON format (JavaScript Object Notation), which are widely used when communicating with an API, you can use the json library, and its load() function:

import json
url = './fichiers_exemples/tweets.json'

with open(url) as my_file_json:
    data = json.load(my_file_json)

Then, you can display the imported content using the pprint() function:

from pprint import pprint
pprint(data)
## {'created_at': 'Wed Sep 26 07:38:05 +0000 2018',
##  'id': 11,
##  'loc': [{'long': 5.3698}, {'lat': 43.2965}],
##  'text': 'Un tweet !',
##  'user_mentions': [{'id': 111, 'screen_name': 'nom_twittos1'},
##                    {'id': 112, 'screen_name': 'nom_twittos2'}]}
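The imported object is made of ordinary Python dictionaries and lists, so its elements are reached with the usual indexing. The sketch below rebuilds the same content from a string with json.loads(), so that it runs without the sample file; the variable names are chosen for the illustration.

```python
import json

# Same content as the sample file, given here as a string
raw_json = '''
{"created_at": "Wed Sep 26 07:38:05 +0000 2018",
 "id": 11,
 "loc": [{"long": 5.3698}, {"lat": 43.2965}],
 "text": "Un tweet !",
 "user_mentions": [{"id": 111, "screen_name": "nom_twittos1"},
                   {"id": 112, "screen_name": "nom_twittos2"}]}
'''
tweet = json.loads(raw_json)

# A JSON object becomes a dict, a JSON array becomes a list
print(type(tweet).__name__)
## dict
print(tweet["loc"][0]["long"])
## 5.3698

# Collect the screen name of each mentioned user
names = [m["screen_name"] for m in tweet["user_mentions"]]
print(names)
## ['nom_twittos1', 'nom_twittos2']
```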

5.1.3.1 Import from the Internet

Once again, it is possible to import JSON files from the Internet:

import urllib.request
url = "http://egallic.fr/Enseignement/Python/fichiers_exemples/tweets.json"
with urllib.request.urlopen(url) as my_file:
   data = json.load(my_file)
pprint(data)
## {'created_at': 'Wed Sep 26 07:38:05 +0000 2018',
##  'id': 11,
##  'loc': [{'long': 5.3698}, {'lat': 43.2965}],
##  'text': 'Un tweet !',
##  'user_mentions': [{'id': 111, 'screen_name': 'nom_twittos1'},
##                    {'id': 112, 'screen_name': 'nom_twittos2'}]}

5.1.4 Excel Files

Excel files (xls or xlsx) are also widely used in economics. The reader is referred to Section 10.16.2 for a method of importing Excel data with the pandas library.

5.2 Exporting Data

It is not uncommon to have to export data, for instance to share it. Again, the open() function is used, this time changing the value of the mode argument (see Table 5.1).

5.2.1 Text Files

Let’s say we need to export lines of text to a file. Before giving an example with the open() function, let’s look at two important functions to convert the contents of some objects to text.

The first, str(), returns a string version of an object. We have already applied it to numbers that we wanted to concatenate in Section 2.1.4.

x = ["pomme", 1, 3]
str(x)
## "['pomme', 1, 3]"

This instruction returns the list as a string: "['pomme', 1, 3]".

The second function that seems important to address is repr(). This function returns a string containing a printable representation of an object. In addition, for many objects, this string can be read back by the interpreter.

y = "Fromage, tu veux du fromage ?\n"
repr(y)
## "'Fromage, tu veux du fromage ?\\n'"

The result is: "'Fromage, tu veux du fromage ?\\n'".
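To illustrate that the string produced by repr() can be read back, the sketch below uses ast.literal_eval(), a standard-library function that evaluates Python literals only. The dictionary z_demo is made up for the illustration; this round trip works for basic types (strings, numbers, lists, dictionaries), not for arbitrary objects.

```python
import ast

y = "Fromage, tu veux du fromage ?\n"
z_demo = {"name": "Kyrie", "teams": ["Cleveland", "Boston"]}  # illustrative dict

# For a string, str() returns it unchanged while repr() adds quotes and escapes
print(str(y) == y)
## True
print(repr(y) == y)
## False

# literal_eval() rebuilds the object from the literal produced by repr()
print(ast.literal_eval(repr(y)) == y)
## True
print(ast.literal_eval(repr(z_demo)) == z_demo)
## True
```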

Let’s say we want to export two lines:

  • the first, a text that indicates a title (“Characteristics of Kyrie Irving”);
  • the second, a dictionary containing information about Kyrie Irving (see below).

Let’s define this dictionary:

z = { "name": "Kyrie",
  "surname": "Irving",
  "date_of_birth": 1992,
  "teams": ["Cleveland", "Boston", "Nets"]}

One of the syntaxes for exporting data in txt format is:

# Opening in write mode
path = "path/to/file.txt"
with open(path, "w") as my_file:
  function_to_export()

We create a variable indicating the path to the file. Then we open the file in writing mode by specifying the argument mode = "w". Then, we still have to write our lines in the file.

path = "./fichiers_exemples/Irving.txt"
with open(path, mode = "w") as my_file:
  my_file.write("Characteristics of Kyrie Irving\n")
  my_file.writelines(repr(z))
## 32

If the file already exists and mode="w" is used, the old file is overwritten by the new one. If we want to add lines to the existing file instead, we can use mode="a", for example:

with open(path, mode = "a") as my_file:
  my_file.writelines("\nAnother line\n")

If we want to be warned if the file already exists, and to make the writing fail if this is the case, we can use mode="x":

with open(path, mode = "x") as my_file:
  my_file.writelines("A new line that will not be added\n")
## Error in py_call_impl(callable, dots$args, dots$keywords): FileExistsError: [Errno 17] File exists: './fichiers_exemples/Irving.txt'
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
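In a script, this error can be caught so that the program continues. The following sketch, which works in a temporary directory with a hypothetical file, shows one way to do so with a try/except block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path_x = os.path.join(tmp_dir, "Irving.txt")

    # First creation succeeds because the file does not exist yet
    with open(path_x, mode="x") as f:
        f.write("first version\n")

    # Second attempt raises FileExistsError, which we catch
    try:
        with open(path_x, mode="x") as f:
            f.write("second version\n")
        message = "file written"
    except FileExistsError:
        message = "file already exists, nothing written"

    # The original content is untouched
    with open(path_x) as f:
        content = f.read()

print(message)
## file already exists, nothing written
print(repr(content))
## 'first version\n'
```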

5.2.2 CSV Files

As economists, we are more likely to have to export data in CSV format rather than text, due to the rectangular structure of the data we are handling. As for the import of CSV files (cf. Section 5.1.2), we use the csv module. To write to the file, we use the writer() function. The formatting arguments of this function are the same as those of the reader() function (see Table 5.2).

Example of creating a CSV file:

import csv
path = "./fichiers_exemples/ffile_export.csv"

# newline='' lets the csv module handle line endings itself
with open(path, mode='w', newline='') as my_file:
    my_file_write = csv.writer(my_file, delimiter=',',
                                    quotechar='"',
                                    quoting=csv.QUOTE_MINIMAL)

    my_file_write.writerow(['Country', 'Year', 'Quarter', 'GR_PIB'])
    my_file_write.writerow(['France', '2017', 'Q4', 0.7])
    my_file_write.writerow(['France', '2018', 'Q1', 0.2])
## 29
## 20
## 20

Of course, most of the time, we do not write each entry by hand. We export the data contained in a structure. Section 10.16.2 provides examples of this type of export, when the data are contained in two-dimensional tables created with the pandas library.
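Without going as far as pandas, the csv module can already export a whole structure in one call. The sketch below assumes the data are held in a list of dictionaries and uses csv.DictWriter with writerows(); an in-memory buffer replaces the file so that the result can be inspected directly.

```python
import csv
import io

# Data held in a structure: one dictionary per observation
rows = [
    {"Country": "France", "Year": 2017, "Quarter": "Q4", "GR_PIB": 0.7},
    {"Country": "France", "Year": 2018, "Quarter": "Q1", "GR_PIB": 0.2},
]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Country", "Year", "Quarter", "GR_PIB"])
writer.writeheader()    # first line: the field names
writer.writerows(rows)  # one line per dictionary

print(buffer.getvalue().splitlines())
## ['Country,Year,Quarter,GR_PIB', 'France,2017,Q4,0.7', 'France,2018,Q1,0.2']
```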

5.2.3 JSON Files

It may be necessary to save structured data in JSON format, for example when an API (e.g., the Twitter API) has been used that returns objects in JSON format.

To do this, we will use the json library, and its dump() function, which serializes an object (for example a list, like what you get with the Twitter API queried with the twitter-python library) to a file in JSON format. The related dumps() function returns the JSON as a character string instead:

import json
x = [1, "apple", ["seed", "red"]]
y = { "name": "Kyrie",
  "surname": "John",
  "year_of_birth": 1992,
  "teams": ["Cleveland", "Boston", "Nets"]}
x_json = json.dumps(x)
y_json = json.dumps(y)

print("x_json: ", x_json)
## x_json:  [1, "apple", ["seed", "red"]]
print("y_json: ", y_json)
## y_json:  {"name": "Kyrie", "surname": "John", "year_of_birth": 1992, "teams": ["Cleveland", "Boston", "Nets"]}

By default, non-ASCII characters (accented characters, for instance) are escaped by sequences of type \uXXXX, which can make the output harder to read. We can specify, by setting the argument ensure_ascii to False, that such characters should be written as they are. The objects used here contain only ASCII characters, so the output is unchanged:

x_json = json.dumps(x, ensure_ascii=False)
y_json = json.dumps(y, ensure_ascii=False)

print("x_json: ", x_json)
## x_json:  [1, "apple", ["seed", "red"]]
print("y_json: ", y_json)
## y_json:  {"name": "Kyrie", "surname": "John", "year_of_birth": 1992, "teams": ["Cleveland", "Boston", "Nets"]}
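To see the effect of ensure_ascii on a value that does contain an accented character, here is a short sketch; the string "idée" is chosen for the illustration.

```python
import json

word = "idée"  # illustrative string containing an accented character

# Default: the accented character is escaped
print(json.dumps(word))
## "id\u00e9e"

# ensure_ascii=False: the character is written as is
print(json.dumps(word, ensure_ascii=False))
## "idée"

# Both forms decode to the same Python string
print(json.loads(json.dumps(word)) == word)
## True
```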
path = "./fichiers_exemples/export_json.json"

with open(path, 'w') as f:
    # One JSON document per line; json.dump() serializes directly to the file
    json.dump(x, f, ensure_ascii=False)
    f.write('\n')
    json.dump(y, f, ensure_ascii=False)
## 1

To re-import the content of the file export_json.json into Python:

path = "./fichiers_exemples/export_json.json"
with open(path, "r") as f:
    data = []
    for line in f:
        # Each line holds one JSON document
        data.append(json.loads(line))

print(data)
## [[1, 'apple', ['seed', 'red']], {'name': 'Kyrie', 'surname': 'John', 'year_of_birth': 1992, 'teams': ['Cleveland', 'Boston', 'Nets']}]

5.2.4 Exercise

  1. Create a list named a containing information on the unemployment rate in France in the second quarter of 2018. This list must contain three elements:
    • the year;
    • the quarter;
    • the value of the unemployment rate (\(9.1\%\)).
  2. Export the contents of the list a in CSV format, preceded by a line specifying the names of the fields. Use the semicolon (;) as a field separator.
  3. Import the file created in the previous question into Python.