Computational reproducibility has become a topic of much debate in recent years. Often that debate is fueled by misunderstandings between scientists from different disciplines, each having different needs and priorities. Moreover, the debate is often framed in terms of specific tools and techniques, in spite of the fact that tools and techniques in computing are often short-lived. In the following, I propose to approach the question from the scientists’ point of view rather than from the engineering point of view. My hope is that this point of view will lead to a more constructive discussion, and ultimately to better computational reproducibility.
The plea for stability in the SciPy ecosystem that I posted last week on this blog has generated a lot of feedback, both as comments and in a lengthy Twitter thread. For the benefit of people discovering it late, here is a summary of the main arguments and my reply to them.
Two NumPy-related news items appeared on my Twitter feed yesterday, just a few days after I had accidentally started a somewhat heated debate myself concerning the poor reproducibility of Python-based computer-aided research. The first was the announcement of a plan for dropping support for Python 2. The second was a pointer to a recent presentation by Nathaniel Smith entitled “Inside NumPy” and dealing mainly with the NumPy team’s plans for the near future. Lots of material to think about… and comment on.
A few days ago, I noticed this tweet in my timeline:
I 'still' program in C. Why? Hint: it's not about performance. I wrote an essay to elaborate... appearing at Onward! https://t.co/pzxjfvUs5B— Stephen Kell (@stephenrkell) September 5, 2017
That sounded like a good read for the weekend, which it was. The main argument the author makes is that C remains unsurpassed as a system integration language, because it permits interfacing with “alien” code, i.e. code written independently and perhaps even in different languages, down to assembly. In fact, C is one of the few programming languages that lets you deal with whatever data at the byte level. Most more “modern” languages prohibit such interfacing in the name of safety - the only memory you can access is memory allocated through your safe language’s runtime system. As a consequence, you are stuck in the closed universe of your language.
Think of all the things you hate about using computers in doing research. Software installation. Getting your colleagues’ scripts to work on your machine. System updates that break your computational code. The multitude of file formats and the eternal need for conversion. That great library that’s unfortunately written in the wrong language for you. Dependency and provenance tracking. Irreproducible computations. They all have something in common: they are consequences of the difficulty of composing digital information. In the following, I will explain the root causes of these problem. That won’t make them go away, but understanding the issues will perhaps help you to deal with them more efficiently, and to avoid them as much as possible in the future.
Yesterday a blog post by Cyrille Rossant entitled “Moving away from HDF5” caught my eye. My own tendency at the moment is to use HDF5 more and more, so I was interested in why someone else would want to do the opposite. Here is my conclusion after reading his post, plus some ideas about where scientific data management is or should be heading in my opinion.
We all know that software deployment in a research environment can be a pain, but knowing this as a fact is not quite the same as experiencing it in reality. Over the last days, I spent way more time that I would have imagined on what sounds like a simple task: installing a scientific application written in Python on a Linux machine for use by a group of students in a training session. Here is an outline of the difficulties, in the hope that it will (1) help others who face similar problems and (2) contributes a little bit to improving the situation.