The growing popularity of Jupyter notebooks shows that interactive code is becoming a priority in our data-driven world. Still, many applications can’t tolerate the overhead of interpreted languages: programs written in pure Python often run around 100 times slower than their C counterparts.

But read-eval-print loops are not new: languages like Lisp have offered them since the 1960s. This raises the question: why haven’t we bridged this performance gap yet?

In this post, we will dissect Python’s inner workings and discuss how to deal with the performance issues of interpreted languages, to get the best of both worlds: the speed of compiled languages when we need it, without losing the abstraction and flexibility Python’s runtime offers.

Disclaimer

The reader should:

  • have a good understanding of C (pointers, types, memory allocation, call stack)
  • know the basics of assembly (registers, instructions)
  • be familiar with Python syntax and vocabulary

Let me know if you find any errors in this post. Since I tried to cover many topics, I rushed through most of them. Take the time you need to read up on them, or simply skim through while trying to grasp the main concepts.

Interpretation

The reference implementation of the Python language is called CPython and is written in C. The term Python often loosely refers to it.

Similarly to compiled languages, Python code goes through several transformations before getting executed. It is successively represented as

  • Python syntax: a string of characters
  • parse tree: a top-down reading of the code, reflecting the structure of each line
  • abstract syntax tree (AST): a model of the program’s structure, regardless of its textual form
  • Python bytecode: stack instructions for the CPython virtual machine

[Illustration: parse tree and abstract syntax tree]

To construct the parse tree, a lexer splits the code into tokens (symbols, strings, numbers, names) which are then fed to a parser. The parser performs a hierarchical grouping of these tokens, according to rules defined by Python’s grammar. 1

By transforming the parse tree into an AST, we get one step closer to the instructions. Internal nodes of the AST are language constructs (assignment, function declaration, list comprehension, etc) and their edges have labels describing their relation to their children. From this tree, it is much easier to know in which order the computations have to be performed.

Finally, the code gets compiled to Python “bytecode”. This term can be misleading, as the CPython VM instructions obtained, although written as stack code, are still far from ASM machine code. They are an intermediate representation of the program, between the source Python code and the target ASM instructions, breaking the translation into simpler steps.

Below is a short example using the built-in ast and dis modules:

import ast, dis

def f(x):
  x += 1
  x.test()
  len(x)

ast.dump( ast.parse("x += 1; x.test(); len(x)") )
"""
Module(body=[
    AugAssign(target=Name(id='x', ctx=Store()), op=Add(), value=Constant(value=1, kind=None)),
    Expr(value=Call(func=Attribute(value=Name(id='x', ctx=Load()), attr='test', ctx=Load()), args=[], keywords=[])),
    Expr(value=Call(func=Name(id='len', ctx=Load()), args=[Name(id='x', ctx=Load())], keywords=[]))
], type_ignores=[])
"""
dis.dis(f)
"""
  2           0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 (1)
              4 INPLACE_ADD
              6 STORE_FAST               0 (x)

  3           8 LOAD_FAST                0 (x)
             10 LOAD_METHOD              0 (test)
             12 CALL_METHOD              0
             14 POP_TOP

  4          16 LOAD_GLOBAL              1 (len)
             18 LOAD_FAST                0 (x)
             20 CALL_FUNCTION            1
             22 POP_TOP
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE
"""

As you can see, this bytecode is type-agnostic: the VM only deals with objects. For example, the test method has to be resolved at runtime: a dictionary lookup is needed to determine which function this instruction refers to. At each instruction, CPython’s virtual machine loads PyObject pointers from and writes them to the value stack of the current frame, simulating the basic operations of a more abstract computer.

Victor Skvortsov’s blog is a great resource for learning more about the internals of CPython.

Objects

Python’s data model treats everything as an object, with an identity (similar to a memory address), a type and a value. Values of types like int, str and tuple are immutable; others, like list and dict, are mutable. The type of an object determines which operations it supports. What makes an object callable (like methods and functions) is simply having a __call__ attribute.
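To make this concrete, here is a minimal sketch (class and names are hypothetical) showing that defining __call__ on a class makes its instances callable:

```python
# An instance is callable as soon as its class defines __call__.
class Adder:
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        # calling the instance dispatches here
        return x + self.n

add3 = Adder(3)
print(add3(4))         # 7
print(callable(add3))  # True
```

This is exactly how functions and methods behave: they are ordinary objects whose type provides __call__.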

From the flexibility of these objects and Python’s extensive use of iterators arise powerful functional-style patterns (or evil trickery, depending on who you ask). Functional-style programming is often more expressive and reusable than imperative code. Although many things can be done with lambda functions, it is convenient to know about higher-order tools like map, filter, functools.reduce and the itertools module.
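A quick sketch of these standard-library tools in action, composing lazy iterators instead of writing explicit loops:

```python
from functools import reduce
from itertools import islice, count

# sum of squares of even numbers, functional style
nums = range(10)
evens = filter(lambda x: x % 2 == 0, nums)       # lazy: nothing computed yet
squares = map(lambda x: x * x, evens)            # still lazy
total = reduce(lambda a, b: a + b, squares)      # 0 + 4 + 16 + 36 + 64
print(total)  # 120

# itertools handles lazy, even infinite, iterators
first_five_odds = list(islice((n for n in count() if n % 2), 5))
print(first_five_odds)  # [1, 3, 5, 7, 9]
```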

However, functions are not always as pure as they seem. Since arguments are passed as references to objects, passing an object does not copy its value, and a mutable argument can be modified in the function’s body. This can be leveraged to create side effects. Constructors notably rely on this: their first argument, self, is the object you initialise by adding or modifying attributes.
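A minimal illustration of this (function names are hypothetical): mutating an argument is visible to the caller, while rebinding the parameter name is not.

```python
def append_one(lst):
    # mutates the caller's list in place: a side effect
    lst.append(1)

def rebind(lst):
    # rebinds the local name to a new list: the caller is unaffected
    lst = lst + [1]

data = []
append_one(data)
print(data)  # [1]
rebind(data)
print(data)  # still [1]
```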

Remark: immutability can cause some confusion:

# immutable type (int)
x = 3; y = 1 + 2 # x == y and x is y (CPython caches small ints)
x += 1 # x == 4, y == 3, x is not y

# mutable type (list)
x = [0]; y = [0] # x == y but x is not y
x = y # x is y
x += [1] # y == [0,1]
x = x + [2] # x == [0,1,2], y == [0,1]

# to make a grid, you want a different list for each row
grid = [ [0]*n ] * n # bad: grid[0] is grid[1]
grid = [ [0]*n for i in range(n) ] # OK: grid[0] is not grid[1]
# grid[0][0] is grid[0][1] but int immutable -> OK

Python is garbage collected. To free up memory, CPython keeps a reference counter on every object to detect when it becomes unreachable. The counter is incremented when a variable is assigned to the object and decremented when that variable is deleted (explicitly or when it goes out of scope). When the counter reaches 0, the memory can be freed (a separate cyclic garbage collector handles reference cycles). This reference counting adds some overhead to all assignments, and having all values on the heap adds a level of indirection compared to having them on the stack.
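You can observe the counter with sys.getrefcount. Note that it reports one extra reference for its own temporary argument, and exact counts are a CPython implementation detail, so the sketch below only looks at relative changes:

```python
import sys

x = object()
before = sys.getrefcount(x)    # x itself + the temporary argument reference

y = x                          # assignment: counter incremented
after_assign = sys.getrefcount(x)

del y                          # deletion: counter decremented
after_del = sys.getrefcount(x)

print(before, after_assign, after_del)  # e.g. 2 3 2 on CPython
```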

Performance

To understand what causes a program to be slow, it is always a good idea to profile it. In most cases, timeit suffices to identify which segments are taking too long; cProfile offers a more detailed breakdown of where the time is spent during execution.
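A quick sketch of both tools on a hypothetical slow_sum function:

```python
import timeit
import cProfile

def slow_sum(n):
    # deliberately naive: a pure-Python loop
    total = 0
    for i in range(n):
        total += i
    return total

# timeit: total seconds for `number` calls of the snippet
t = timeit.timeit(lambda: slow_sum(10_000), number=100)
print(f"{t:.4f}s for 100 calls")

# cProfile: per-function breakdown of where the time went
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()
profiler.print_stats(sort="cumulative")
```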

For I/O bound tasks, like making HTTP requests, concurrency can reduce execution time by a couple orders of magnitude. Threading and coroutines work very well here. However, when it comes to intensive computational work, Python’s Global Interpreter Lock, which helps with memory management and avoiding race conditions, prevents threads from running in parallel.
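A sketch of the I/O-bound case, using time.sleep as a stand-in for network latency (a blocked thread releases the GIL, so the waits overlap):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    time.sleep(0.1)  # simulates waiting on the network; releases the GIL
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fake_request, range(10)))
elapsed = time.perf_counter() - start

# ten 0.1s "requests" finish in roughly 0.1s instead of 1s
print(f"{elapsed:.2f}s", results)
```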

For compute-bound tasks, like scientific calculations, multiprocessing is an option. But spawning subprocesses is costly, and a 4-core / 8-thread CPU can at most get you an 8x speedup. In many cases, you can leverage efficient vectorised computations on pandas.Series or numpy.array. One of the best ways of performing matrix and tensor operations is numpy.einsum: once you get used to the Einstein notation and broadcasting rules, you can avoid slow loops and transpose gymnastics. Many operations on NumPy arrays are efficiently implemented in C and exposed through a Python interface; such interfaces are called bindings.
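A few einsum one-liners to illustrate the notation (repeated indices are summed over, output indices follow the arrow):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# matrix product: sum over the shared index j
c = np.einsum("ij,jk->ik", a, b)
assert np.array_equal(c, a @ b)

# the same notation covers traces, transposes, batched products, ...
m = np.arange(9).reshape(3, 3)
tr = np.einsum("ii->", m)      # trace: 0 + 4 + 8
print(tr)                      # 12
print(np.einsum("ij->ji", m))  # transpose
```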

Programs involving branching logic cannot be expressed as algebraic operations. For these, it is necessary to step away from CPython’s interpreter to achieve C-like performance.

Bindings and Cython

If you already have a good C library that does what you want, you are probably looking for a foreign function interface (FFI), or bindings for short. Through shared libraries (.so binary files), CPython provides two ways of interfacing with programs written in C:

  • the ctypes module lets you call functions from shared libraries and convert between C and Python data types. From Python, you convert your arguments and specify the return type of the function you want to call. This is useful for shared libraries you don’t want to modify and recompile.
  • the Python C API: a header file for accessing PyObjects from C. In your C code, you receive pointers to Python objects and interact with the runtime to read from and write to them.

ctypes works at the ABI (application binary interface) level: it interacts with the library directly in its compiled binary form. The other methods involve writing or generating C code that understands Python arguments.
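As a sketch, here is ctypes calling cos from the C math library; library name resolution is platform-dependent, and this assumes a Unix-like system:

```python
import ctypes
import ctypes.util

# locate and load the C math library; fall back to the current
# process's symbols if find_library cannot resolve a name
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0
```

Without the argtypes/restype declarations, ctypes would default to int conversions and silently return garbage, which is the main pitfall of working at the binary level.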

In this Real Python article, four methods of creating bindings are discussed: ctypes, cffi, pybind11 and Cython. I think cffi is the simplest approach to writing bindings, and it works well with distutils and setuptools for packaging your code.

Cython is not only for bindings: you can write your performance-sensitive code in Pyrex (files with the .pyx extension), which looks like type-annotated Python. These files are first compiled to C and then built into their own .so binaries. Thanks to the portable Pyrex code, Cython became a popular choice for libraries: the C code is generated and compiled at setup time. Cython is also well integrated with Jupyter notebooks, and produces annotations showing where your code translates to simple machine instructions and where you might need to remove calls into the Python runtime.

In your notebook, simply run %load_ext Cython and write your Cython code in a cell, as shown here. You will then be able to call the declared functions from other cells.

%%cython --annotate
import numpy as np
cimport numpy as np
cimport cython

cpdef void f(np.ndarray[double, ndim=2] a):
    # typing the loop variables lets them compile to plain C integers
    cdef Py_ssize_t i, j
    cdef Py_ssize_t h = a.shape[0]
    cdef Py_ssize_t w = a.shape[1]
    for i in range(h):
        for j in range(w):
            a[i,j] *= 2

Just In Time compilation

You may have heard of PyPy. PyPy differs from CPython in that it uses a just-in-time (JIT) compiler instead of a plain interpreter, which makes most Python code run faster. The PyPy team claims to support most Python features, but doesn’t guarantee 100% coverage.

At the first call of a function, a JIT compiler can analyse the argument types and generate machine code for the entire function, avoiding a round trip through the Python VM at each instruction. The compiled function is then cached for later calls.

This is all great, but since PyPy doesn’t impose restrictions on your program, it is given the hard task of optimizing for everything, and unfortunately it is still no match for C. This is where Numba comes in. It specializes in scientific computations and is used as a decorator on the functions you choose, which keeps the rest of your code safe from compatibility issues. With Numba, it is recommended to use NumPy arrays when possible and to avoid mixing types inside other containers. Simple loops should be preferred over generator expressions and transformations on iterators. As in Cython, you should adopt a C style, even if the typing is implicit.

It takes virtually no effort2 to turn existing Python code into low-level instructions: simply put the @jit decorator before your function. To make sure you are not calling back into Python, change it to @njit or @jit(nopython=True). By leveraging static code analysis, Numba infers the variables’ types and runs its compiler on the typed program.

from numba import njit

@njit
def func(arr):
    h,w = arr.shape
    for i in range(h):
        for j in range(w):
            arr[i,j] *= 2

Its tight integration with NumPy arrays makes it ideal for scientific computations. It works very well with SIMD and vectorizable operations, and can even target GPUs. The Numba project is part of PyData, like NumPy and pandas, and supported by NVIDIA, Intel and AMD.
See also: the 5-minute guide to Numba.

Conclusion

To summarize: we have seen how Python’s interpreter works under the hood and why it makes the language less suitable for performance-critical computations, we explored some ways Python can interface with compiled code, and we studied an alternative approach with JIT compilers.

I expect the popularity of tools like Numba and some JITted languages to grow over the next few years. This doesn’t mean that bindings are going to die. They help abstracting the heavy machinery and combine the strengths of different languages.

This post was mainly about imperative programming, but we should not neglect functional languages and other representations of programs, like computational graphs, which are very common in machine learning for example (cool stuff: MediaPipe).

I hope you learned as much as I did researching this, and got a clearer picture of some of the fundamentals of computing. The next step for me is to learn about the LLVM project, a toolchain for compilation, and I may or may not write something on distributed and cloud computing.

  1. Python also has an abstract grammar

  2. unless you are heavily relying on Python specific features