Python training UGA 2017

A training to acquire strong basis in Python to use it efficiently

Pierre Augier (LEGI), Cyrille Bonamy (LEGI), Eric Maldonado (Irstea), Franck Thollard (ISTerre), Christophe Picard (LJK), Loïc Huder (ISTerre)

Data structures

4 built-in containers: list, tuple, set and dict.

For more containers: see collections...

list: mutable sequence

Lists are mutable ordered tables of inhomogeneous objects. They can be viewed as an array of references (nearly pointers) to objects.

In [1]:
# 2 equivalent ways to define an empty list
l0 = []
l1 = list()
assert l0 == l1

# not empty lists
l2 = ['a', 2]
l3 = list(range(3))
print(l2, l3, l2 + l3)
print(3 * l2)
['a', 2] [0, 1, 2] ['a', 2, 0, 1, 2]
['a', 2, 'a', 2, 'a', 2]

The itertools module provide other ways of iterating over lists or set of lists (e.g. cartesian product, permutation, filter, ... ).

list: mutable sequence

The built-in function dir returns a list of name of the attributes. For a list, these attributes are python system attributes (with double-underscores) and 11 public methods:

In [2]:
print(dir(l3))
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
In [3]:
l3.append(10)
print(l3)
l3.reverse()
print(l3)
[0, 1, 2, 10]
[10, 2, 1, 0]
In [4]:
# Built-in functions applied on lists
# return lower value
print(min(l3))
# return higher value
print(max(l3))
# return sorted list
print(sorted([5, 2, 10, 0]))
0
10
[0, 2, 5, 10]
In [5]:
# "pasting" two lists can be done using zip
l1 = [1, 2, 3]
s = 'abc'
print(list(zip(l1, s)))
print(list(zip('abc', 'defg')))
[(1, 'a'), (2, 'b'), (3, 'c')]
[('a', 'd'), ('b', 'e'), ('c', 'f')]

list: list comprehension

They are iterable so they are often used to make loops. We have already seen how to use the keyword for. For example to build a new list (side note: x**2 computes x^2):

In [6]:
l0 = [1, 4, 10]
l1 = []
for number in l0:
    l1.append(number**2)
    
print(l1)
[1, 16, 100]

There is a more readable (and slightly more efficient) method to do such things, the "list comprehension":

In [7]:
l1 = [number**2 for number in l0]
print(l1)
[1, 16, 100]
In [8]:
# list comprehension with a condition
[s for s in ['a', 'bbb', 'e'] if len(s) == 1]
Out[8]:
['a', 'e']
In [9]:
# lists comprehensions can be cascaded
[(x,y) for x in [1,2] for y in ['a','b'] ]
Out[9]:
[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

Do it yourself (advanced)

  • Write a function extract_patterns(text, n=3) extracting the list of patterns of size n=3 from a long string (e.g. if text = "basically", patterns would be the list ['bas', 'asi', 'sic', ..., 'lly']). Use list comprehension, range, slicing. Use a sliding window.

  • You can apply your function to a long "ipsum lorem" string (ask to your favorite web search engine).

A possible solution

In [10]:
text = "basically"

def extract_patterns(text, n=3):
    pat = [text[i:i+n] for i in range(len(text)-n+1)]
    return pat

print("patterns=", extract_patterns(text))
print("patterns=", extract_patterns(text, n=5))
patterns= ['bas', 'asi', 'sic', 'ica', 'cal', 'all', 'lly']
patterns= ['basic', 'asica', 'sical', 'icall', 'cally']

tuple: immutable sequence

Tuples are very similar to lists but they are immutable (they can not be modified).

In [11]:
# 2 equivalent notations to define an empty tuple (not very useful...)
t0 = ()
t1 = tuple()
assert t0 == t1

# not empty tuple
t2 = (1, 2, 'a')  # with the parentheses
t2 = 1, 2, 'a'    # it also works without parentheses
t3 = tuple(l3)  # from a list
In [12]:
# tuples only have 2 public methods (with a list comprehension)
[name for name in dir(t3) if not name.startswith('__')]
Out[12]:
['count', 'index']
In [13]:
# assigment of multiple variables in 1 line
a, b = 1, 2
print(a, b)
# exchange of values
b, a = a, b
print(a, b)
1 2
2 1

tuple: immutable sequence

Tuples are used a lot with the keyword return in functions:

In [14]:
def myfunc():
    return 1, 2, 3

t = myfunc()
print(type(t), t)
# Directly unpacking the tuple
a, b, c = myfunc()
print(a, b, c)
<class 'tuple'> (1, 2, 3)
1 2 3

set: a hashtable

Unordered collections of unique elements (a hashtable). Sets are mutable. The elements of a set must be hashable.

In [15]:
s0 = set()
In [16]:
{1, 1, 1, 3}
Out[16]:
{1, 3}
In [17]:
set([1, 1, 1, 3])
Out[17]:
{1, 3}
In [18]:
s1 = {1, 2}
s2 = {2, 3}
print(s1.intersection(s2))
print(s1.union(s2))
{2}
{1, 2, 3}

set: lookup

Hashtable lookup (for example 1 in s1) is algorithmically efficient (complexity O(1)), i.e. theoretically faster than a look up in a list or a tuple (complexity O(size iterable)).

In [19]:
print(1 in s1, 1 in s2)
True False
In [20]:
from random import shuffle, randint

n = 20
i = randint(0, n-1)
print('integer remove from the list:', i)
l = list(range(n))
l.remove(i)
shuffle(l)
print('shuffled list: ', l)
integer remove from the list: 11
shuffled list:  [19, 15, 0, 18, 4, 2, 6, 12, 13, 8, 1, 7, 9, 3, 14, 5, 10, 16, 17]

DIY: back to the "find the removed element" problem

  • Could the problem be solved using set ?
  • What is the complexity of this solution ?

 A possible solution :

In [21]:
full_set = set(range(n))
changed_set = set(l)
ns = full_set - changed_set
ns.pop()
Out[21]:
11

Complexity :

  • line 1: n insertions --> O(n)
  • line 2: n insertions --> O(n)
  • line 3: one traversal O(n), with one lookup at each time (O(1) -> O(n)

    -> Complexity of the whole algorithm : O(n)

Complexity of the "sum" solution :

  • One traversal for the computation of the sum O(n) with sum at each step O(1) -> O(n)

dict: unordered set of key: value pairs

The dictionary (dict) is a very important data structure in Python. All namespaces are (nearly) dictionaries and "Namespaces are one honking great idea -- let's do more of those!" (The zen of Python).

A dict is a hashtable (a set of keys) with associated values.

Dictionary creation

Syntax: {key1: value1, key2: value2...} Keys can be anything as long as they are "hashable" (ex: Python standard types).

In [22]:
d = {'a': 1, 'b': 2, 0: False, 1: True}
print(d)
{'a': 1, 'b': 2, 0: False, 1: True}

Adding/replacing an entry to an exisiting dict:

d[key] = value will create a new entry or replace the value associated with key.

In [23]:
date = {}
date['Year'] = 2020
date['Month'] = 1
print(date)
date['Year'] = 2015
date
{'Year': 2020, 'Month': 1}
Out[23]:
{'Year': 2015, 'Month': 1}

Get an existing entry

d[key] will return the value associated with key. If key does not exist, it will throw a KeyError.

In [24]:
print(date['Year'])
try:
    current_day = date['Day']
except KeyError as e:
    print(f'{e} is not an existing key !')
2015
'Day' is not an existing key !

Tip: parallel between dict and list

You can first think about dict as a super list which can be indexed with other objects than integers (and in particular with str).

In [25]:
l = ["value0", "value1"]
l.append("value2")
print(l)
['value0', 'value1', 'value2']
In [26]:
l[1]
Out[26]:
'value1'
In [27]:
d = {"key0": "value0", "key1": "value1"}
d["key2"] = "value2"
print(d)
{'key0': 'value0', 'key1': 'value1', 'key2': 'value2'}
In [28]:
d["key1"]
Out[28]:
'value1'

Warning: In general, dict are not ordered (since they are based on a hashtable)!

However, since Python 3.6, the implementation changed so that dict now keep the order of insertion of keys.

dict: public methods

In [29]:
# dict have 11 public methods (with a list comprehension)
[name for name in dir(d) if not name.startswith('__')]
Out[29]:
['clear',
 'copy',
 'fromkeys',
 'get',
 'items',
 'keys',
 'pop',
 'popitem',
 'setdefault',
 'update',
 'values']

dict: different ways to loop over a dictionary

In [30]:
# loop on items
for key, value in d.items():
    print(key, value)
key0 value0
key1 value1
key2 value2
In [31]:
# loop on values
for value in d.values():
    print(value)
value0
value1
value2
In [32]:
# loop on keys
for key in d.keys():
    print(key)
key0
key1
key2
In [33]:
# dict comprehension (here to change the case of values)
print(d)
d1 = {k: v.upper() for k, v in d.items()}
print(d1)
{'key0': 'value0', 'key1': 'value1', 'key2': 'value2'}
{'key0': 'VALUE0', 'key1': 'VALUE1', 'key2': 'VALUE2'}

Do it yourself:

Write a function that returns a dictionary containing the number of occurrences of letters in a text.

In [34]:
text = 'abbbcc'

A possible solution:

In [35]:
def count_elem(sequence):
    d = {}

    for letter in sequence:
        if letter not in d:
            d[letter] = 1
        else:
            d[letter] += 1
    return d

print("text=", text, "counts=", count_elem(text))
text= abbbcc counts= {'a': 1, 'b': 3, 'c': 2}

Do it yourself : smart completion (advanced)

We will reuse our function extract_patterns.

  • For a text, count the appearance of each pattern (using a dictionary).
  • Given a query pattern of size 2, propose the pattern of size 3 with the same prefix that has the highest frequency. Filter the keys of the previous dictionary so that they starts with the query pattern.
In [36]:
def build_count_base(t):                 
    d = {}                   
    for s in t:
        if s in d:
            d[s] += 1
        else:  
            d[s] = 1
    return d

def build_count_set(t):                   
    d = {k:0 for k in set(t)}
    for s in t:
        d[s] += 1
    return d

def build_count_count(t):
    d = {k:t.count(k) for k in set(t)}
    return d

def build_count_except(t):             
    d = {}                   
    for s in t:
        try:     
            d[s] += 1
        except KeyError:
            d[s] = 1
    return d

import collections

def build_count_counter(t):
    return collections.Counter(t)

def build_count_defaultdict(t):
    d = collections.defaultdict(int)
    for k in s:
        d[k] += 1
    return d

s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam tristique at velit in varius. Cras ut ultricies orci. Fusce vel consequat ante, vitae luctus tortor. Sed condimentum faucibus enim, sit amet pulvinar ligula feugiat ac. Sed interdum id risus id rhoncus. Nullam nisi justo, ultrices eu est nec, hendrerit maximus lorem. Nam urna eros, accumsan nec magna eu, elementum semper diam. Nulla tempus, nibh id elementum dapibus, ex diam lacinia est, sit amet suscipit nulla nibh eu sapien. Aliquam orci enim, malesuada in facilisis vitae, pharetra sit amet mi. Pellentesque mi tortor, sagittis quis odio quis, fermentum faucibus ex. Aenean sagittis nisl orci. Maecenas tristique velit sed leo facilisis porttitor. "
s = s*10000
print(f"len(s) = {len(s)}, nbkeys {len(set(s))} base, count, count_count, except, collections.Counter")
%timeit build_count_base(s)
%timeit build_count_set(s)
%timeit build_count_count(s)
%timeit build_count_except(s)
%timeit build_count_counter(s)
%timeit build_count_defaultdict(s)

print("with split")
s2 = s.split()
print(f"len(s) = {len(s2)}, nbkeys {len(set(s2))} base, count, count_count, except, collections.Counter")
%timeit build_count_base(s2)
%timeit build_count_set(s2)
%timeit build_count_count(s2)
%timeit build_count_except(s2)
%timeit build_count_counter(s2)
%timeit build_count_defaultdict(s2)
len(s) = 7160000, nbkeys 33 base, count, count_count, except, collections.Counter
781 ms ± 88.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
793 ms ± 66.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
266 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
668 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
356 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
607 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with split
len(s) = 1100000, nbkeys 90 base, count, count_count, except, collections.Counter
165 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
175 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.83 s ± 81.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
130 ms ± 5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
85.4 ms ± 4.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
747 ms ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A possible solution

In [37]:
text="Las Vegas Overlook Loop is a 6.3 mile loop trail located near Las Vegas"

def extract_patterns(text, n=3):
    "extracts the patterns of size n from text and return it"
    pat = [text[i:i+n] for i in range(len(text)-n+1)]
    return pat

def guess(prefix, count):
    "complete the prefix with the most probable pattern (according to count)"
    # get all the pattern in keys of count that starts with prefix
    # compat_prefix = DIY
    
    # find among the compatible prefixes the one wich score best according to count
    best_prefix = "?"
    return best_prefix

patterns = extract_patterns(text)
print("patterns = ", patterns)
patterns_count = count_elem(patterns)
print("patterns_counts = ", patterns_count)

print("guess for oo = ", guess("oo", patterns_count))
print("guess for eg = ", guess("eg", patterns_count))
patterns =  ['Las', 'as ', 's V', ' Ve', 'Veg', 'ega', 'gas', 'as ', 's O', ' Ov', 'Ove', 'ver', 'erl', 'rlo', 'loo', 'ook', 'ok ', 'k L', ' Lo', 'Loo', 'oop', 'op ', 'p i', ' is', 'is ', 's a', ' a ', 'a 6', ' 6.', '6.3', '.3 ', '3 m', ' mi', 'mil', 'ile', 'le ', 'e l', ' lo', 'loo', 'oop', 'op ', 'p t', ' tr', 'tra', 'rai', 'ail', 'il ', 'l l', ' lo', 'loc', 'oca', 'cat', 'ate', 'ted', 'ed ', 'd n', ' ne', 'nea', 'ear', 'ar ', 'r L', ' La', 'Las', 'as ', 's V', ' Ve', 'Veg', 'ega', 'gas']
patterns_counts =  {'Las': 2, 'as ': 3, 's V': 2, ' Ve': 2, 'Veg': 2, 'ega': 2, 'gas': 2, 's O': 1, ' Ov': 1, 'Ove': 1, 'ver': 1, 'erl': 1, 'rlo': 1, 'loo': 2, 'ook': 1, 'ok ': 1, 'k L': 1, ' Lo': 1, 'Loo': 1, 'oop': 2, 'op ': 2, 'p i': 1, ' is': 1, 'is ': 1, 's a': 1, ' a ': 1, 'a 6': 1, ' 6.': 1, '6.3': 1, '.3 ': 1, '3 m': 1, ' mi': 1, 'mil': 1, 'ile': 1, 'le ': 1, 'e l': 1, ' lo': 2, 'p t': 1, ' tr': 1, 'tra': 1, 'rai': 1, 'ail': 1, 'il ': 1, 'l l': 1, 'loc': 1, 'oca': 1, 'cat': 1, 'ate': 1, 'ted': 1, 'ed ': 1, 'd n': 1, ' ne': 1, 'nea': 1, 'ear': 1, 'ar ': 1, 'r L': 1, ' La': 1}
guess for oo =  ?
guess for eg =  ?
In [38]:
text="Las Vegas Overlook Loop is a 6.3 mile loop trail located near Las Vegas"

def extract_patterns(text, n=3):
    pat = [text[i:i+n] for i in range(len(text)-n+1)]
    return pat

def guess(prefix, count):
    "complete the prefix with the most probable pattern (according to count)"
    # get all the pattern in keys of count that starts with prefix
    compatibles_prefixes = [x for x in count.keys() if x.startswith(prefix)]
    if len(compatibles_prefixes) == 0:
        return None
    best_prefix = compatibles_prefixes[0]
    best_score = count[best_prefix]
    for pref in compatibles_prefixes[1:]:
        if best_score < count[best_prefix]:
            best_score = count[pref]
            best_prefix = pref
    return best_prefix

patterns = extract_patterns(text)
print("patterns = ", patterns)
patterns_count = count_elem(patterns)
print("patterns_counts = ", patterns_count)

print("guess for oo = ", guess("oo", patterns_count))
print("guess for eg = ", guess("eg", patterns_count))
patterns =  ['Las', 'as ', 's V', ' Ve', 'Veg', 'ega', 'gas', 'as ', 's O', ' Ov', 'Ove', 'ver', 'erl', 'rlo', 'loo', 'ook', 'ok ', 'k L', ' Lo', 'Loo', 'oop', 'op ', 'p i', ' is', 'is ', 's a', ' a ', 'a 6', ' 6.', '6.3', '.3 ', '3 m', ' mi', 'mil', 'ile', 'le ', 'e l', ' lo', 'loo', 'oop', 'op ', 'p t', ' tr', 'tra', 'rai', 'ail', 'il ', 'l l', ' lo', 'loc', 'oca', 'cat', 'ate', 'ted', 'ed ', 'd n', ' ne', 'nea', 'ear', 'ar ', 'r L', ' La', 'Las', 'as ', 's V', ' Ve', 'Veg', 'ega', 'gas']
patterns_counts =  {'Las': 2, 'as ': 3, 's V': 2, ' Ve': 2, 'Veg': 2, 'ega': 2, 'gas': 2, 's O': 1, ' Ov': 1, 'Ove': 1, 'ver': 1, 'erl': 1, 'rlo': 1, 'loo': 2, 'ook': 1, 'ok ': 1, 'k L': 1, ' Lo': 1, 'Loo': 1, 'oop': 2, 'op ': 2, 'p i': 1, ' is': 1, 'is ': 1, 's a': 1, ' a ': 1, 'a 6': 1, ' 6.': 1, '6.3': 1, '.3 ': 1, '3 m': 1, ' mi': 1, 'mil': 1, 'ile': 1, 'le ': 1, 'e l': 1, ' lo': 2, 'p t': 1, ' tr': 1, 'tra': 1, 'rai': 1, 'ail': 1, 'il ': 1, 'l l': 1, 'loc': 1, 'oca': 1, 'cat': 1, 'ate': 1, 'ted': 1, 'ed ': 1, 'd n': 1, ' ne': 1, 'nea': 1, 'ear': 1, 'ar ': 1, 'r L': 1, ' La': 1}
guess for oo =  ook
guess for eg =  ega
In [39]:
text="Las Vegas Overlook Loop is a 6.3 mile loop trail located near Las Vegas"

def extract_patterns(text, n=3):
    pat = [text[i:i+n] for i in range(len(text)-n+1)]
    return pat

def guess(prefix, count):
    "complete the prefix with the most probable pattern (according to count)"
    # get all the pattern in keys of count that starts with prefix
    compat_prefix = [(x, count[x]) for x in count.keys() if x.startswith(prefix)]
    ordered_compat_pref = sorted(compat_prefix, key=lambda x: x[1], 
                                 reverse=True)
    return ordered_compat_pref

patterns = extract_patterns(text)
print("patterns = ", patterns)
patterns_count = count_elem(patterns)
print("patterns_counts = ", patterns_count)

print("guess for oo = ", guess("oo", patterns_count))
print("guess for eg = ", guess("eg", patterns_count))
patterns =  ['Las', 'as ', 's V', ' Ve', 'Veg', 'ega', 'gas', 'as ', 's O', ' Ov', 'Ove', 'ver', 'erl', 'rlo', 'loo', 'ook', 'ok ', 'k L', ' Lo', 'Loo', 'oop', 'op ', 'p i', ' is', 'is ', 's a', ' a ', 'a 6', ' 6.', '6.3', '.3 ', '3 m', ' mi', 'mil', 'ile', 'le ', 'e l', ' lo', 'loo', 'oop', 'op ', 'p t', ' tr', 'tra', 'rai', 'ail', 'il ', 'l l', ' lo', 'loc', 'oca', 'cat', 'ate', 'ted', 'ed ', 'd n', ' ne', 'nea', 'ear', 'ar ', 'r L', ' La', 'Las', 'as ', 's V', ' Ve', 'Veg', 'ega', 'gas']
patterns_counts =  {'Las': 2, 'as ': 3, 's V': 2, ' Ve': 2, 'Veg': 2, 'ega': 2, 'gas': 2, 's O': 1, ' Ov': 1, 'Ove': 1, 'ver': 1, 'erl': 1, 'rlo': 1, 'loo': 2, 'ook': 1, 'ok ': 1, 'k L': 1, ' Lo': 1, 'Loo': 1, 'oop': 2, 'op ': 2, 'p i': 1, ' is': 1, 'is ': 1, 's a': 1, ' a ': 1, 'a 6': 1, ' 6.': 1, '6.3': 1, '.3 ': 1, '3 m': 1, ' mi': 1, 'mil': 1, 'ile': 1, 'le ': 1, 'e l': 1, ' lo': 2, 'p t': 1, ' tr': 1, 'tra': 1, 'rai': 1, 'ail': 1, 'il ': 1, 'l l': 1, 'loc': 1, 'oca': 1, 'cat': 1, 'ate': 1, 'ted': 1, 'ed ': 1, 'd n': 1, ' ne': 1, 'nea': 1, 'ear': 1, 'ar ': 1, 'r L': 1, ' La': 1}
guess for oo =  [('oop', 2), ('ook', 1)]
guess for eg =  [('ega', 2)]