Konrad Hinsen's Blog

Why Python does so well in scientific computing

A few days ago, I noticed this tweet in my timeline:

That sounded like a good read for the weekend, which it was. The main argument the author makes is that C remains unsurpassed as a system integration language, because it permits interfacing with “alien” code, i.e. code written independently and perhaps even in different languages, down to assembly. In fact, C is one of the few programming languages that lets you deal with whatever data at the byte level. Most more “modern” languages prohibit such interfacing in the name of safety - the only memory you can access is memory allocated through your safe language’s runtime system. As a consequence, you are stuck in the closed universe of your language.

System integration is indeed an important and often overlooked aspect of working with software. And this is particularly true for scientific computing, where application software with a fixed set of functionality is rare. Solving a scientific problem typically involves combining many pieces of software into a very problem-specific whole, which may well be run only a few times (see also my earlier post on this topic). This is exactly the task of system integration: assembling pieces into a whole using glue code where necessary. In computational science, this glue code takes the form of scripts, workflows, or more recently notebooks. This is technically quite different from the OS-level system integration that Stephen Kell refers to, but functionally it is the same.

Stephen’s post reminded me of my long-standing plan to write a blog post about why Python has been so successful in scientific computing, in spite of having a reputation for bad performance. So… here it is.

There are of course many reasons for Python’s success, but one of them is that it does a pretty good job at system integration. There are two Python features that I consider important for this, which are not shared by many other languages. One is data types explicitly designed for interfacing, the other is duck typing in combination with a small but versatile set of standard interfaces.

The first Python data type designed for interfacing in a scientific computing context is the good old NumPy array - which is in fact older than NumPy, having been introduced in 1995 by NumPy’s predecessor, Numeric. Arrays are one of the bread-and-butter data types in scientific computing, to the point of being the only one available in languages like Fortran 77 or APL. The implementation of arrays in Numeric was designed to use the same data layout as Fortran and C, in order to allow interfacing to the Fortran and C libraries that dominated scientific computing in 1995 (and still do, though to a somewhat lesser extent). The idea behind Numeric and later NumPy was always to use Python as a glue language for Fortran and C libraries, and achieve speed by delegating time-critical operations to code written in these languages.

The second Python data type designed for interfacing is memoryview, related to the buffer protocol. This is as close as Python gets to C-style memory access. The buffer protocol lets different Python data types access each other’s internal memory at the byte level. A typical use case would be an image data type (e.g. from Pillow) allowing access to the in-memory representation of an image through an array type (e.g. from NumPy), permitting the implementation of image manipulation algorithms in terms of array operations.

The third and least known Python data type for interfacing is the capsule that replaces the earlier CObject. Capsules exist solely for the benefit of Python modules written in C, which can exchange opaque data with one another via glue code written in Python, even though the glue code itself cannot inspect or manipulate the data in any way. A typical use is to wrap C function pointers in a Python object such that Python glue code, e.g. a script, can pass a C function from one module to a to C code from another module.

All these interfacing data types mediate between Python and C code, although quite often the Python system integrator is hardly aware of using C code at all. The other Python feature for system integration, duck typing with standard interfaces, is what facilitates glueing together independently written Python modules. By “standard interfaces”, I mean the sequence and dictionary interfaces, but also the standard method names for operator overloading.

To see why this is an important feature, let us look at statically typed languages that by design do not have it. As a concrete example, consider multidimensional arrays in Java. They are not part of the language or its standard library, but they can be implemented on top of it with reasonable effort. In fact, there are several Java implementations you can choose from. And that’s the problem. Suppose you want to use an FFT library based on array implementation A together with a linear algebra library based on array implementation B. Bad luck - the arrays from A and B have different types, so you cannot use the output of an FFT as the input to a linear equation solver. It doesn’t matter that the underlying abstraction is the same, and that even the implementations have much in common. For a Java compiler, tje types don’t match, period.

Python is not completely immune to this problem. It is perfectly possible to write Python code, or C code in a C module, that expects a precise type of data as input, and will raise an exception otherwise. But in Python code that would be considered bad style, and in C modules for Python as well except where required for performance or for compatibility with the C code. Wherever possible, Python programmers are expected to use the standard interfaces for working with data. Iteration and indexing work the same way for arrays as for the built-in lists, for example. For operations that are not covered by the standard interfaces, Python programmers are supposed to use Python methods, which are subject to duck typing as well. In practice, independently implemented Python types are much more interoperable than independently implemented Java types. For the specific case of n-dimensional arrays, Python has had the chance of overwhelming acceptance of a single implementation, which is due more to social and historical than to technical issues.

Finally, even though Python is a pretty good choice for system integration in scientific computing, there are of course limits, which are exactly of the kind that Stephen Kell explains in his essay: combining Python code with code in other managed languages, say R or Julia, requires a lot of work and even then is fragile, because the required hacks depend on undocumented implementation details. I suspect that the only solution would be to have language-neutral garbage-collected data objects proposed as an OS-level service that maintains an option for non-managed byte-level access à la C. The closest existing technology I am aware of is Microsoft’s CLR, better known by its commercial name .NET. Its implementation is now Open Source and runs on multiple platforms, but its Windows-only origins and strong ties to a huge Microsoft-y library have been an obstacle to adoption by the traditionally Unix-centric scientific computing communty.