Why Python does so well in scientific computing
A few days ago, I noticed this tweet in my timeline:
I 'still' program in C. Why? Hint: it's not about performance. I wrote an essay to elaborate... appearing at Onward! https://t.co/pzxjfvUs5B
— Stephen Kell (@stephenrkell) September 5, 2017
That sounded like a good read for the weekend, which it was. The main argument the author makes is that C remains unsurpassed as a system integration language, because it permits interfacing with "alien" code, i.e. code written independently, perhaps even in different languages, down to assembly. In fact, C is one of the few programming languages that lets you deal with arbitrary data at the byte level. Most more "modern" languages prohibit such interfacing in the name of safety: the only memory you can access is memory allocated through your safe language's runtime system. As a consequence, you are stuck in the closed universe of your language.
System integration is indeed an important and often overlooked aspect of working with software. And this is particularly true for scientific computing, where application software with a fixed set of functionality is rare. Solving a scientific problem typically involves combining many pieces of software into a very problem-specific whole, which may well be run only a few times (see also my earlier post on this topic). This is exactly the task of system integration: assembling pieces into a whole using glue code where necessary. In computational science, this glue code takes the form of scripts, workflows, or more recently notebooks. This is technically quite different from the OS-level system integration that Stephen Kell refers to, but functionally it is the same.
Stephen's post reminded me of my long-standing plan to write a blog post about why Python has been so successful in scientific computing, in spite of having a reputation for bad performance. So... here it is.
There are of course many reasons for Python's success, but one of them is that it does a pretty good job at system integration. There are two Python features that I consider important for this, which are not shared by many other languages. One is data types explicitly designed for interfacing, the other is duck typing in combination with a small but versatile set of standard interfaces.
The first Python data type designed for interfacing in a scientific computing context is the good old NumPy array - which is in fact older than NumPy, having been introduced in 1995 by NumPy's predecessor, Numeric. Arrays are one of the bread-and-butter data types in scientific computing, to the point of being the only one available in languages like Fortran 77 or APL. The implementation of arrays in Numeric was designed to use the same data layout as Fortran and C, in order to allow interfacing to the Fortran and C libraries that dominated scientific computing in 1995 (and still do, though to a somewhat lesser extent). The idea behind Numeric and later NumPy was always to use Python as a glue language for Fortran and C libraries, and achieve speed by delegating time-critical operations to code written in these languages.
The second Python data type designed for interfacing is memoryview, related to the buffer protocol. This is as close as Python gets to C-style memory access. The buffer protocol lets different Python data types access each other's internal memory at the byte level. A typical use case would be an image data type (e.g. from Pillow) allowing access to the in-memory representation of an image through an array type (e.g. from NumPy), permitting the implementation of image manipulation algorithms in terms of array operations.
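A small sketch of the buffer protocol in action (using a bytearray in place of an image object, and assuming NumPy): `np.frombuffer` wraps existing memory rather than copying it, so writes through one view are visible through the other.

```python
import numpy as np

# np.frombuffer uses the buffer protocol to wrap existing memory;
# the bytearray and the array share the same underlying bytes.
buf = bytearray(b'\x00\x01\x02\x03')
a = np.frombuffer(buf, dtype=np.uint8)

a[0] = 255            # modify through the array...
assert buf[0] == 255  # ...and the change is visible in the bytearray

# A memoryview gives the same byte-level access from plain Python:
m = memoryview(a)
m[1] = 128
assert buf[1] == 128
```

An image library exposing its pixel buffer this way lets NumPy operate on the pixels in place, which is exactly the Pillow/NumPy use case described above.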
The third and least known Python data type for interfacing is the capsule, which replaces the earlier CObject. Capsules exist solely for the benefit of Python modules written in C, which can exchange opaque data with one another via glue code written in Python, even though the glue code itself cannot inspect or manipulate the data in any way. A typical use is to wrap C function pointers in a Python object, so that Python glue code, e.g. a script, can pass a C function from one module to C code in another module.
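Capsules are normally created and consumed in C, but the mechanism can be sketched from Python itself by calling CPython's C API through ctypes. This is a CPython-specific illustration only; the wrapped pointer here is an arbitrary integer standing in for a real C function pointer, and the capsule name is made up:

```python
import ctypes

# Bind CPython's capsule API via ctypes (CPython-specific sketch).
new = ctypes.pythonapi.PyCapsule_New
new.restype = ctypes.py_object
new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

get = ctypes.pythonapi.PyCapsule_GetPointer
get.restype = ctypes.c_void_p
get.argtypes = [ctypes.py_object, ctypes.c_char_p]

name = b"example.ptr"  # hypothetical capsule name
# Wrap a raw pointer value in a capsule. Python code can pass `cap`
# around freely, but cannot look inside it:
cap = new(ctypes.c_void_p(0x1234), name, None)

# Only code that knows the name can extract the pointer again,
# which in real use would be C code in another module:
assert get(cap, name) == 0x1234
```

In real use, module A's C code creates the capsule, a Python script passes it to module B, and module B's C code extracts the pointer; the script in the middle treats it as an opaque token.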
All these interfacing data types mediate between Python and C code, although quite often the Python system integrator is hardly aware of using C code at all. The other Python feature for system integration, duck typing with standard interfaces, is what facilitates glueing together independently written Python modules. By "standard interfaces", I mean the sequence and dictionary interfaces, but also the standard method names for operator overloading.
To see why this is an important feature, let us look at statically typed languages that by design do not have it. As a concrete example, consider multidimensional arrays in Java. They are not part of the language or its standard library, but they can be implemented on top of it with reasonable effort. In fact, there are several Java implementations you can choose from. And that's the problem. Suppose you want to use an FFT library based on array implementation A together with a linear algebra library based on array implementation B. Bad luck - the arrays from A and B have different types, so you cannot use the output of an FFT as the input to a linear equation solver. It doesn't matter that the underlying abstraction is the same, and that even the implementations have much in common. For a Java compiler, the types don't match, period.
Python is not completely immune to this problem. It is perfectly possible to write Python code, or C code in a C module, that expects a precise type of data as input and raises an exception otherwise. But in Python code that would be considered bad style, and in C modules for Python as well, except where required for performance or for compatibility with the C code. Wherever possible, Python programmers are expected to use the standard interfaces for working with data. Iteration and indexing work the same way for arrays as for the built-in lists, for example. For operations not covered by the standard interfaces, Python programmers are supposed to use Python methods, which are subject to duck typing as well. In practice, independently implemented Python types are much more interoperable than independently implemented Java types. For the specific case of n-dimensional arrays, Python has had the good fortune of overwhelming acceptance of a single implementation, which is due more to social and historical factors than to technical ones.
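The interoperability that duck typing buys can be shown in a few lines (assuming NumPy for the last case): a function written only against the sequence interface works unchanged for every type that implements it.

```python
import array

import numpy as np

# A generic function written against the sequence interface only:
def mean(seq):
    total = 0.0
    for x in seq:            # iteration protocol
        total += x
    return total / len(seq)  # __len__

# It works unchanged for any type that quacks like a sequence:
assert mean([1.0, 2.0, 3.0]) == 2.0                     # built-in list
assert mean((1.0, 2.0, 3.0)) == 2.0                     # tuple
assert mean(array.array('d', [1.0, 2.0, 3.0])) == 2.0   # stdlib array
assert mean(np.array([1.0, 2.0, 3.0])) == 2.0           # NumPy array
```

The Java FFT-plus-solver scenario above fails at compile time for exactly the operations that succeed here: no common declared type exists, even though the behavior is identical.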
Finally, even though Python is a pretty good choice for system integration in scientific computing, there are of course limits, which are exactly of the kind that Stephen Kell explains in his essay: combining Python code with code in other managed languages, say R or Julia, requires a lot of work and even then is fragile, because the required hacks depend on undocumented implementation details. I suspect that the only solution would be language-neutral garbage-collected data objects, provided as an OS-level service, that maintain an option for non-managed byte-level access à la C. The closest existing technology I am aware of is Microsoft's CLR, better known by its commercial name .NET. Its implementation is now Open Source and runs on multiple platforms, but its Windows-only origins and strong ties to a huge Microsoft-y library have been an obstacle to adoption by the traditionally Unix-centric scientific computing community.
Comments retrieved from Disqus
- Chris Barker:
You write:
"The idea behind Numeric and later NumPy was always to use Python as a glue language for Fortran and C libraries"
I have often wondered about this -- I started using Numeric in 1999, and followed the development through numarray, and then numpy, and onward :-)
I've often said that an ndarray is two things:
1) a nice featureful n-dimensional array object for Python, and
2) a wrapper around a C array (or really, a pointer to a data block).
(2) allows enormous power in communicating with Fortran and C codes -- as you mention.
The question is -- was this an intentional design decision? or a happy accident?
Has anyone ever asked Jim Hugunin or David Ascher?
(though I see your name on my historical copy of the docs from 2000 -- so maybe you were in on that decision at the time!)
- Konrad Hinsen:
Yes, I was part of the initial Numerical Python development team, so I can confirm that interfacing to C and Fortran code was an important goal at the time. There is actually some evidence for this in the code and the API. For example, the separation of the array object storage into a data space and a small Python object with just the bookkeeping information. Plus the possibility to create an array using an externally allocated and managed data space.
- Chris Barker:
Thanks! good to know it wasn't just a happy accident.
I haven't followed it recently, but at one point the folks working on the NumPyPy project really didn't "get" the importance of this aspect of numpy.
- Stephen Kell:
Hi Konrad. Thanks for the "citation" and the kind words. :-)
One question: would you say it's the design of CPython and/or the Python language that have enabled this, or just the happenstance that somebody wrote those modules (NumPy, memoryview, capsule) and got them adopted? Could it have happened as easily in another dynamic language, say? I'm not familiar enough with the Python library ecosystem to distinguish these cases.
Your closing paragraph's idea of a "language-neutral garbage-collected data objects proposed as an OS-level service", is very close to what I've been working on with liballocs (https://github.com/stephenr.... I believe the trick is to tolerate as much diversity as possible, rather than fixing "one true way" to implement higher-level languages and cutting loose the non-conformers (the CLR approach, more-or-less). In particular, I'm (slowly) working towards a treatment of garbage collection that allows a considerable degree of pluralism -- think multiple somewhat-cooperating allocators/collectors, rather than a single shared one.
- Konrad Hinsen:
Hi Stephen,
You will probably find Rich Hickey's talk on the design of Clojure interesting: https://www.youtube.com/wat...
He insists very much on a systems point of view and points out the dangers of language lock-in. His context is very different from yours and mine, but the overall message is the same.
- Konrad Hinsen:
Hi Stephen,
thanks for your comments!
To answer your question, I'd say it is a bit of both. CPython had C modules right from the start, in fact it used them in its own implementation. Those C modules are a bit more than the FFI that any modern language has. It is bidirectional in that it gives C modules access to Python data types, and lets them define new ones. That was a perfect basis for the later developments (NumPy, memoryview, capsule), which wouldn't have made much sense otherwise. It didn't have to happen, but it definitely wouldn't have happened without the existing support.
Your liballocs looks interesting, although the list of build dependencies is a bit discouraging. I'll start by reading the paper :-) The idea of a lightweight and minimalistic storage management, not tied to a language or even to a bytecode interpreter/compiler, looks very useful. In scientific computing, it could solve many problems of interfacing languages operating at different levels of storage abstraction, e.g. C, C++, Fortran 90, and dynamic languages such as Python.
- Stephen Kell:
Thanks for the clarification!
Certainly I am interested in finding users of my work within scientific computing... I'm currently scratching my head about how best to achieve this. One possible blocker is that even small run-time overheads are often considered intolerable.
In case I can offer encouragement, most of the build dependencies are standard (and do not transfer to runtime)... the build instructions "should" "just work", at least on Debian-based machines (and with close equivalents on RPM distros). If not, do file a GitHub issue... but yes, I am working on packaging the library and tools more nicely in various ways. :-) Some of the dependencies will be eliminated once I have integrated more closely with gcc/clang... again, some work is ongoing, though not going as fast as I'd like.
- Konrad Hinsen:
After a quick look at your 2015 paper, I confirm that this looks very interesting. But it seems that all language implementers must build on liballocs for this to work. This might take some time to happen.
As for run-time overheads, it all depends on where they occur. Much scientific code works on large uniform datasets, typically arrays. An overhead for the first access to an array is usually not a problem. An overhead for every element access would be prohibitive.
- Stephen Kell:
My hypothesis is that existing implementations can be retrofitted, rather than building new ones from scratch. But yes, this work needs to be done. And I admit the hypothesis is not tested yet, but is rather a case of "seems to be true" based on my current knowledge of the internals of various language implementations. The V8 modification mentioned in the paper started in this direction... but V8 is a particularly complex case. I hope to do some more work on this fairly soon, using some simpler VM.
Yes, I try to confine overheads to rare operations, such as malloc-style allocation. So I think the core run-time services should be supportable on scientific code... just not every possible use of them (e.g. bounds-checking array accesses may have to be skipped).
- Konrad Hinsen:
The main problem I see is not so much the amount of work that must be done but the number of people that need to contribute to make it happen. That's perhaps more of a marketing question than a technical one.
Are people in your corner of computing at least aware of the importance of the problem you are trying to solve? In my corner (computational science), they are not, although in my opinion it's one of our biggest problems in daily life. Most people don't see it as a problem because they don't envisage a solution. Languages being isolated universes is just normal, there is nothing to be done about this. I tried to explain the issues in an earlier blog post (http://blog.khinsen.net/pos..., but apparently with little success.
As for array bounds checking, I am still hoping some PL designer will come up with a good solution. Mistakes in array index expressions are very frequent, but everyone turns off array bounds checking at compile time because of the huge runtime cost. Static array access validation would be very nice to have.