Fast prototyping (Numpy!)
Popular:
Well-known
Several great libraries
Share ideas between developers / scientists
Popularity counts
Readability counts
Expressivity counts
In any case, one needs a good and well-known scripting language, so yes!
(even considering Julia)
Designed for fast prototyping & gluing codes together
Generalist + easy to learn ⇒ huge and diverse community 👨🏿🎓🕵🏼 👩🏼🎓 👩🏽🏫👨🏽💻👩🏾🔬 🎅🏼 🌎 🌍 🌏
Expressivity and readability
Not oriented towards high performance
(fast and easy dev, easy debug, correctness)
Highly dynamic 🐒 + introspection (inspect.stack())
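A minimal sketch of this introspection (the function names are illustrative): a function can look at its own call stack at runtime.

```python
import inspect

def who_called_me():
    # introspection: look one frame up the call stack, at the caller's frame record
    return inspect.stack()[1].function

def caller():
    return who_called_me()

print(caller())  # -> "caller"
```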
Automatic memory management 💾
All objects encapsulated 🥥 (PyObject, C struct)
Objects accessible through "references" ➡️
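A quick illustration of these references: assignment never copies an object, it only binds another name to the same object.

```python
a = [1, 2, 3]
b = a              # b is a second reference to the same list object
b.append(4)
print(a)           # [1, 2, 3, 4]: both names see the mutation
print(a is b)      # True: one object, two references
```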
Usually interpreted
Interpreted (nearly) instruction by instruction, with (nearly) no code optimization
The numerical stack (Numpy, Scipy, Scikits, ...) based on the CPython C API (CPython implementation details)!
Optimized implementation with tracing Just-In-Time compilation
The CPython C API is an issue! PyPy can't accelerate Numpy code!
For microcontrollers
mylist = [1, 3, 5]
list: an array of references to PyObjects
arr = 2 * np.arange(10)
print(arr[2])
4
Pure Python is terrible 🐢 (except with PyPy)...
from math import sqrt
my_const = 10.
result = [elem * sqrt(my_const * 2 * elem**2) for elem in range(1000)]
but even this is not very efficient (temporary objects)...
import numpy as np
a = np.arange(1000)
result = a * np.sqrt(my_const * 2 * a**2)
Even slightly worse with PyPy 🙁
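A hedged sketch of how one could compare the two versions with timeit (absolute timings will vary by machine):

```python
from math import sqrt
from timeit import timeit

import numpy as np

my_const = 10.

def pure_python():
    return [elem * sqrt(my_const * 2 * elem**2) for elem in range(1000)]

def with_numpy():
    a = np.arange(1000)
    return a * np.sqrt(my_const * 2 * a**2)

# both compute the same values; only the cost per element differs
print("pure Python:", timeit(pure_python, number=1000))
print("Numpy      :", timeit(with_numpy, number=1000))
```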
cProfile (pstats, SnakeViz), line-profiler, perf, perf_events
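A minimal cProfile session (the profiled function is just a placeholder workload):

```python
import cProfile
import io
import pstats

def work():
    # placeholder workload: interpreter-heavy pure Python
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # shows where the time is spent, function by function
```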
"Premature optimization is the root of all evil" (Donald Knuth)
80/20 rule: efficiency is important for the expensive parts and NOT for the small ones
For example, using Numpy arrays instead of Python lists...
unittest
, pytest
pipelining, hyper-threading, vectorization, advanced instructions (simd), ...
important to get data aligned in memory (arrays)
What does CPython do? It compiles to "byte code", with nearly no optimization (see the dis module)
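One can look at this bytecode directly with the dis module (exact opcode names vary between CPython versions):

```python
import dis

def add(a, b):
    return a + b

# prints the bytecode: load the two locals, one add opcode, return
dis.dis(add)
```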
Just-in-time
Has to be fast (warm up), can be hardware specific
Ahead-of-time
Can be slow, hardware specific or more general to distribute binaries
Compilers are usually good for optimizations! Better than most humans...
From one language to another language (for example Python to C++)
handled by the OS
share memory and can use different CPU cores at the same time
How?
OpenMP (Natively in C / C++ / Fortran. For Python: Pythran, Cython, ...)
In Python: threading
and concurrent.futures
⚠️ in Python, one interpreter per process (~) and the Global Interpreter Lock (GIL)...
In a Python program, different threads can run at the same time (and take advantage of multicore)
But... the Python interpreter runs the Python bytecode sequentially!
Terrible 🐌 for CPU-bound tasks if the Python interpreter is used a lot!
No problem for IO-bound tasks!
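A sketch of why IO-bound code is fine: time.sleep (like real IO waits) releases the GIL, so the threads overlap instead of running one after the other.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task):
    # sleeping releases the GIL, exactly as real IO waits do
    time.sleep(0.1)
    return task

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f} s")  # ~0.1 s in total, not 4 x 0.1 s
```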
Many tools to interact with static languages:
ctypes, cffi, cython, cppyy, pybind11, f2py, pyo3, ...
Glue together pieces of native code (C, Fortran, C++, Rust, ...) with a nice syntax
⇒ Numpy, Scipy, ...
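As a tiny ctypes sketch (assuming a Unix-like system where the C math library can be located), calling a native C function from Python:

```python
import ctypes
import ctypes.util

# load the C math library and declare sqrt's C signature
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # calls the native C function directly
```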
Remarks:
Numpy: great syntax for expressing algorithms, (nearly) as much information as in Fortran
Performance of a @ b
(Numpy) versus a * b
(Julia)?
Same! The same library is called! (often OpenBlas or MKL)
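A minimal illustration of `a @ b` in Numpy; the operator dispatches the matrix product to the underlying BLAS library.

```python
import numpy as np

a = np.arange(6.).reshape(2, 3)
b = np.arange(6.).reshape(3, 2)

# `@` calls the BLAS matrix product (often OpenBLAS or MKL under the hood)
c = a @ b
print(c)  # [[10. 13.] [28. 40.]]
```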
Don't use the Python interpreter (and small Python objects) too much for computationally demanding tasks.
Pure Python
→ Numpy
→ Numpy without too many loops (vectorized)
→ C extensions
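The "Numpy without too many loops" step above can be sketched as follows: an explicit loop goes back through the interpreter for every element, while the vectorized form runs the loop in compiled code inside Numpy.

```python
import numpy as np

a = np.arange(10_000)

# explicit loop: one interpreter round-trip (and one small object) per element
slow = 0
for x in a:
    slow += int(x) ** 2

# vectorized: the whole loop runs in compiled C inside Numpy
fast = int(np.sum(a * a))
print(slow == fast)  # True: same result, very different speed
```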
But ⚠️ ⚠️ ⚠️ writing a C extension by hand is not a good idea! ⚠️ ⚠️ ⚠️
compile Python
write C extensions without writing C
Cython, Numba, Pythran, Transonic, PyTorch, ...
Performance issues, especially for crunching numbers 🔢
⇒ need to accelerate the "numerical kernels"
Many good accelerators and compilers for Python-Numpy code
⇒ We shouldn't have to write specialized code for one accelerator!
Other languages don't replace Python for sciences
Modern C++ is great and very complementary 💑 with Python
Julia is interesting but not the heaven on earth