# Python training UGA 2017¶

A training to acquire strong basis in Python to use it efficiently

Pierre Augier (LEGI), Cyrille Bonamy (LEGI), Eric Maldonado (Irstea), Franck Thollard (ISTerre), Oliver Henriot (GRICAD), Christophe Picard (LJK), Loïc Huder (ISTerre)

# A short introduction to Numpy, Scipy and Pandas¶

## Python scientific ecosystem¶

There are a lot of very good Python packages for sciences. The fundamental packages are in particular:

• numpy: numerical computing with powerful numerical arrays objects, and routines to manipulate them.
• scipy: high-level numerical routines. Optimization, regression, interpolation, etc.
• matplotlib: 2D-3D visualization, “publication-ready” plots.

With IPython and Spyder, Python plus these fundamental scientific packages constitutes a very good alternative to Matlab, that is technically very similar (using the libraries Blas and Lapack).

## A short introduction on NumPy¶

Code using numpy usually starts with the import statement

In [1]:
import numpy as np


NumPy provides the type np.ndarray. Such arrays are multidimensionnal sequences of homogeneous elements (numbers) to represent vectors, matrices, tensors...

### Array creation¶

NumPy arrays can be created in several ways:

In [2]:
# from a list
l = [10.0, 12.5, 15.0, 17.5, 20.0]
np.array(l)

Out[2]:
array([10. , 12.5, 15. , 17.5, 20. ])
In [3]:
# fast but the values can be anything
np.empty(4)

Out[3]:
array([1.75274491e-316, 6.94225492e-310, 6.94224527e-310, 6.94225376e-310])
In [4]:
# Filled with zeros (slower than np.empty)
np.zeros([2, 6])

Out[4]:
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
In [5]:
# Multidimensional array filled with ones
a = np.ones([2, 3, 4])
print(a.shape, a.size, a.dtype)
a

(2, 3, 4) 24 float64

Out[5]:
array([[[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]],

[[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]]])

### Generate sequences¶

In [6]:
# Like range but produces a 1D numpy array
np.arange(4)

Out[6]:
array([0, 1, 2, 3])
In [7]:
# Start and step can be changed
np.arange(2., 4., 0.1)

Out[7]:
array([2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2,
3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9])
In [8]:
# Equally-spaced elements between start and end (included)
np.linspace(10, 20, 5)

Out[8]:
array([10. , 12.5, 15. , 17.5, 20. ])

A NumPy array can be easily converted to a Python list.

In [9]:
a = np.linspace(10, 20 ,5)
a.tolist()

Out[9]:
[10.0, 12.5, 15.0, 17.5, 20.0]

## Manipulating NumPy arrays¶

### Access elements¶

Elements in a numpy array can be accessed using indexing and slicing in any dimension. It also offers the same functionalities available in Fortan or Matlab.

### Indexes and slices¶

For example, we can create an array A and perform any kind of selection operations on it.

In [10]:
A = np.random.random([4, 5])
A

Out[10]:
array([[0.74931905, 0.4399789 , 0.96017188, 0.88886798, 0.28382067],
[0.4532329 , 0.99181478, 0.07017858, 0.4993961 , 0.1678844 ],
[0.59791893, 0.50793759, 0.77954852, 0.05390075, 0.984206  ],
[0.93149267, 0.02959492, 0.60720976, 0.92916837, 0.24923606]])
In [11]:
# Get the element from second line, first column
A[1, 0]

Out[11]:
0.45323290450951004
In [12]:
# Get the first two lines
A[:2]

Out[12]:
array([[0.74931905, 0.4399789 , 0.96017188, 0.88886798, 0.28382067],
[0.4532329 , 0.99181478, 0.07017858, 0.4993961 , 0.1678844 ]])
In [13]:
# Get the last column
A[:, -1]

Out[13]:
array([0.28382067, 0.1678844 , 0.984206  , 0.24923606])
In [14]:
# Get the first two lines and the columns with an even index
A[:2, ::2]

Out[14]:
array([[0.74931905, 0.96017188, 0.28382067],
[0.4532329 , 0.07017858, 0.1678844 ]])

### Using a mask to select elements validating a condition:¶

In [15]:
cond = A > 0.5
print(cond)
print(A[cond])

[[ True False  True  True False]
[False  True False False False]
[ True  True  True False  True]
[ True False  True  True False]]
[0.74931905 0.96017188 0.88886798 0.99181478 0.59791893 0.50793759
0.77954852 0.984206   0.93149267 0.60720976 0.92916837]


The mask is in fact a particular case of the advanced indexing capabilities provided by NumPy. For example, it is even possible to use lists for indexing:

In [16]:
# Selecting only particular columns
print(A)
A[:, [0, 1, 4]]

[[0.74931905 0.4399789  0.96017188 0.88886798 0.28382067]
[0.4532329  0.99181478 0.07017858 0.4993961  0.1678844 ]
[0.59791893 0.50793759 0.77954852 0.05390075 0.984206  ]
[0.93149267 0.02959492 0.60720976 0.92916837 0.24923606]]

Out[16]:
array([[0.74931905, 0.4399789 , 0.28382067],
[0.4532329 , 0.99181478, 0.1678844 ],
[0.59791893, 0.50793759, 0.984206  ],
[0.93149267, 0.02959492, 0.24923606]])

## Perform array manipulations¶

### Apply arithmetic operations to whole arrays (element-wise):¶

In [17]:
(A+5)**2

Out[17]:
array([[33.05466955, 29.59337041, 35.52364888, 34.67876614, 27.91876089],
[29.73774911, 35.90184432, 25.7067108 , 30.24335742, 26.70702917],
[31.3366963 , 30.33737648, 33.4031811 , 25.54191284, 35.81072142],
[35.18260526, 25.29682501, 31.44080124, 35.15503757, 27.55447916]])

### Apply functions element-wise:¶

In [18]:
np.exp(A) # With numpy arrays, use the functions from numpy !

Out[18]:
array([[2.11555894, 1.55267445, 2.61214542, 2.43237461, 1.32819473],
[1.57339059, 2.6961229 , 1.07269972, 1.6477259 , 1.18279987],
[1.81833078, 1.66186022, 2.1804876 , 1.05537986, 2.67568654],
[2.53829518, 1.0300372 , 1.8353033 , 2.53240228, 1.28304487]])

## NumPy efficiency¶

In addition of being extremely convenient to manipulate arrays of arbitrary dimensions, numpy is also much more efficient than pure Python.

### Array creation:¶

In [19]:
n = 1000

In [20]:
%%capture timeit_python
# to capture the result of the command timeit in the variable timeit_python
# Pure Python
%timeit list(range(n))

In [21]:
%%capture timeit_numpy
# numpy
%timeit np.arange(n)

In [23]:
compare_times('Creation of object', timeit_python, timeit_numpy)

14.1 us +- 1.04 us per loop (mean +- std. dev. of 7 runs, 100000 loops each)

1.63 us +- 94.5 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)

Creation of object: ratio times (Python / NumPy):  8.650306748466258


### Array operations:¶

In [24]:
n = 200000
python_r_1 = range(n)
python_r_2 = range(n)

numpy_a_1 = np.arange(n)
numpy_a_2 = np.arange(n)

In [25]:
%%capture timeit_python
%%timeit
# Regular Python
[(x + y) for x, y in zip(python_r_1, python_r_2)]

In [26]:
%%capture timeit_numpy
%%timeit
#Numpy
numpy_a_1 + numpy_a_2

In [27]:
compare_times('Additions', timeit_python, timeit_numpy)

20.1 ms +- 885 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

276 us +- 25.5 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)

Additions: ratio times (Python / NumPy):  72.82608695652175


This shows that when you need to perform mathematical operations on a lot of homogeneous numbers, it is more efficient to use numpy arrays.

### Setting parts of arrays¶

In [28]:
A[:, 0] = 0.
print(A)

[[0.         0.4399789  0.96017188 0.88886798 0.28382067]
[0.         0.99181478 0.07017858 0.4993961  0.1678844 ]
[0.         0.50793759 0.77954852 0.05390075 0.984206  ]
[0.         0.02959492 0.60720976 0.92916837 0.24923606]]

In [29]:
# BONUS: Safe element-wise inverse with masks
cond = (A != 0)
A[cond] = 1./A[cond]
print(A)

[[ 0.          2.27283627  1.04148019  1.12502646  3.52335153]
[ 0.          1.00825277 14.24936289  2.00241854  5.95647958]
[ 0.          1.96874581  1.28279379 18.55261602  1.01604746]
[ 0.         33.78958815  1.64687736  1.07623121  4.01226058]]


### Attributes and methods of np.ndarray (see the doc)¶

In [30]:
print([s for s in dir(A) if not s.startswith('__')])

['T', 'all', 'any', 'argmax', 'argmin', 'argpartition', 'argsort', 'astype', 'base', 'byteswap', 'choose', 'clip', 'compress', 'conj', 'conjugate', 'copy', 'ctypes', 'cumprod', 'cumsum', 'data', 'diagonal', 'dot', 'dtype', 'dump', 'dumps', 'fill', 'flags', 'flat', 'flatten', 'getfield', 'imag', 'item', 'itemset', 'itemsize', 'max', 'mean', 'min', 'nbytes', 'ndim', 'newbyteorder', 'nonzero', 'partition', 'prod', 'ptp', 'put', 'ravel', 'real', 'repeat', 'reshape', 'resize', 'round', 'searchsorted', 'setfield', 'setflags', 'shape', 'size', 'sort', 'squeeze', 'std', 'strides', 'sum', 'swapaxes', 'take', 'tobytes', 'tofile', 'tolist', 'tostring', 'trace', 'transpose', 'var', 'view']


#### Example 1: Get the mean through different dimensions¶

In [31]:
print(A)
print('Mean value', A.mean())
print('Mean line', A.mean(axis=0))
print('Mean column', A.mean(axis=1))

[[ 0.          2.27283627  1.04148019  1.12502646  3.52335153]
[ 0.          1.00825277 14.24936289  2.00241854  5.95647958]
[ 0.          1.96874581  1.28279379 18.55261602  1.01604746]
[ 0.         33.78958815  1.64687736  1.07623121  4.01226058]]
Mean value 4.7262184313320255
Mean line [0.         9.75985575 4.55512856 5.68907306 3.62703479]
Mean column [1.59253889 4.64330276 4.56404062 8.10499146]


#### Example 2: Manipulate the shape of arrays while keeping all elements¶

In [32]:
print(A, A.shape)
A_flat = A.flatten()
print(A_flat, A_flat.shape)

[[ 0.          2.27283627  1.04148019  1.12502646  3.52335153]
[ 0.          1.00825277 14.24936289  2.00241854  5.95647958]
[ 0.          1.96874581  1.28279379 18.55261602  1.01604746]
[ 0.         33.78958815  1.64687736  1.07623121  4.01226058]] (4, 5)
[ 0.          2.27283627  1.04148019  1.12502646  3.52335153  0.
1.00825277 14.24936289  2.00241854  5.95647958  0.          1.96874581
1.28279379 18.55261602  1.01604746  0.         33.78958815  1.64687736
1.07623121  4.01226058] (20,)

In [33]:
new_A = A_flat.reshape((4, 5))
print(new_A, new_A.shape)

[[ 0.          2.27283627  1.04148019  1.12502646  3.52335153]
[ 0.          1.00825277 14.24936289  2.00241854  5.95647958]
[ 0.          1.96874581  1.28279379 18.55261602  1.01604746]
[ 0.         33.78958815  1.64687736  1.07623121  4.01226058]] (4, 5)


### Remark: matrix/dot product¶

Since Python 3.5, it exists the operator @ that performs matrix and dot products (previously only available through np.matmul and np.dot)

In [34]:
b = np.linspace(0, 10, 11)
c = b @ b

print(b)
print(c)

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
385.0

In [35]:
I = np.identity(11)
I[5:, :] = 0.
print(I, b)

I @ b

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]] [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]

Out[35]:
array([0., 1., 2., 3., 4., 0., 0., 0., 0., 0., 0.])

#### For Matlab users¶

Matlab Numpy
element wise .* *
dot product * @

### To finish: dtypes and sub-packages¶

numpy arrays can also be sorted, even when they are composed of complex data if the type of the columns are explicitly stated with dtypes.

In [36]:
dtypes = np.dtype([('country', 'S20'), ('density', 'i4'),
('area', 'i4'), ('population', 'i4')])
x = np.array([('Netherlands', 393, 41526, 16928800),
('Belgium', 337, 30510, 11007020),
('United Kingdom', 256, 243610, 62262000),
('Germany', 233, 357021, 81799600)],
dtype=dtypes)
arr = np.array(x, dtype=dtypes)
arr.sort(order='density')
print(arr)

[(b'Germany', 233, 357021, 81799600)
(b'United Kingdom', 256, 243610, 62262000)
(b'Belgium', 337,  30510, 11007020)
(b'Netherlands', 393,  41526, 16928800)]


Dtypes are particularly useful when loading data of different types from a file with np.genfromtxt:

In [37]:
meteo_data = np.genfromtxt('../TP/TP1_MeteoData/data/synop-2016.csv', names=True, delimiter=',',
dtype=('f8', 'S25', 'f8', 'f8', 'i4', 'f8', 'S20'))
meteo_data

Out[37]:
array([(7761., b'2016-01-01T01:00:00+01:00', 2. , 283.75, 94, 0.2, b'AJACCIO'),
(7761., b'2016-01-01T04:00:00+01:00', 2.2, 283.95, 91, 0.2, b'AJACCIO'),
(7761., b'2016-01-01T07:00:00+01:00', 1.7, 284.05, 88, 0.2, b'AJACCIO'),
...,
(7761., b'2016-12-31T16:00:00+01:00', 2.2, 287.75, 61, 0. , b'AJACCIO'),
(7761., b'2016-12-31T19:00:00+01:00', 1.9, 284.05, 79, 0. , b'AJACCIO'),
(7761., b'2016-12-31T22:00:00+01:00', 2.5, 283.05, 79, 0. , b'AJACCIO')],
dtype=[('ID_OMM_station', '<f8'), ('Date', 'S25'), ('Average_wind_10_mn', '<f8'), ('Temperature', '<f8'), ('Humidity', '<i4'), ('Rainfall_3_last_hours', '<f8'), ('Station', 'S20')])

#### NumPy and SciPy sub-packages:¶

We already saw numpy.random to generate numpy arrays filled with random values. This submodule also provides functions related to distributions (Poisson, gaussian, etc.) and permutations.

To perform linear algebra with dense matrices, we can use the submodule numpy.linalg. For instance, in order to compute the determinant of a random matrix, we use the method det

In [38]:
A = np.random.random([5,5])
print(A)
np.linalg.det(A)

[[0.97686707 0.19570122 0.36140422 0.06750466 0.18070765]
[0.53024102 0.68681885 0.67799432 0.14903517 0.96101573]
[0.41638762 0.08302575 0.37110877 0.04414894 0.49869079]
[0.86026592 0.16876345 0.23845197 0.60002328 0.68178478]
[0.9029329  0.68171618 0.35792988 0.09063473 0.78865979]]

Out[38]:
0.046431202254328265
##### Matrix inversion (only on square matrices !)¶
In [39]:
square_subA = A[1:3, 1:3]
print(square_subA)
np.linalg.inv(square_subA)

[[0.68681885 0.67799432]
[0.08302575 0.37110877]]

Out[39]:
array([[ 1.8686853 , -3.41398029],
[-0.41806881,  3.4584154 ]])

If the data are sparse matrices, instead of using numpy, it is recommended to use the sparse subpackage of scipy.

In [40]:
from scipy.sparse import csr_matrix
print(csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]]))

  (0, 0)	1
(0, 1)	2
(1, 2)	3
(2, 0)	4
(2, 2)	5


#### SciPy or NumPy ?¶

scipy also provides a submodule for linear algebra scipy.linalg. It provides an extension of numpy.linalg.

## Introduction to Pandas: Python Data Analysis Library¶

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.

## Some Solutions of Practical 1 with Pandas¶

In [41]:
import pandas as pd

filename = "../TP/TP1_MeteoData/data/synop-2016.csv"

"""
max temperature
"""

print(df['Temperature'].max() - 273.15)

"""
mean temperature
"""
print(df['Temperature'].mean() - 273.15)

"""
total rainfall
"""
print(df['Rainfall 3 last hours'].sum())

"""
August max temperature

"""
print(df[df['Date'].str.startswith('2016-08')]['Temperature'].max()-273.15)

32.60000000000002
16.268433652530803
2334.7
30.5