Reproducible research in the Python ecosystem: a reality check

python, reproducible research

A few years ago, I decided to adopt the practices of reproducible research as far as possible within the technical and social constraints I have to live with. So how reproducible is my published code over time?

The example I have chosen for this reproducibility study is a 2013 paper about computing diffusion coefficients from molecular simulations. All code and data have been published as an ActivePaper on figshare. To save space, intermediate results had been removed from the archive before publication. This makes my reproducibility check very straightforward: a simple aptool update will recompute everything that was removed, from the intermediate results up to the plots that went into the paper.

One nice aspect of ActivePapers is that it stores the version numbers of all dependencies, so I can quickly verify that in 2013, I had used Python 2.7.3, NumPy 1.6.2, h5py 2.1.3, and matplotlib 1.2.x (yes, the x is part of the reported version number).
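
ActivePapers collects these version numbers automatically when the code runs. The manual equivalent is a few lines of Python; this snippet is illustrative and not part of the published archive:

import sys
import numpy, h5py, matplotlib

# Record the interpreter and library versions alongside the results.
print(sys.version)
print(numpy.__version__)
print(h5py.__version__)
print(matplotlib.__version__)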

First try: use my current Python environment

The environment in which I do most of my current research has Python 3.5.2, NumPy 1.11.1, h5py 2.6, and matplotlib 1.5.1. I set it up about a year ago when I got a new laptop, and I haven't had a good reason to update it since. Back in 2013 I had made some effort to make my code compatible with Python 3, so why not check now whether that investment paid off?

Outcome: running the computations works just fine, with results that are not identical at the bit level but close enough for my application. However, I get some warnings from matplotlib when generating the plots. Here is the first one; the others are similar:

UserWarning: Legend does not support 'x' instances.
A proxy artist may be used instead.
See: http://matplotlib.org/users/legend_guide.html#using-proxy-artist
  "#using-proxy-artist".format(orig_handle)

A quick inspection of the plots shows that the legends have almost disappeared; all that's left is a small white box. That makes many of the plots unintelligible.

Just out of curiosity, I made a quick attempt to decipher the error message. What's an 'x' instance? The following messages also refer to 'yz' instances and a few others. A look at my script reveals that 'x', 'yz', etc. are in fact the strings that I supplied as legend labels. It sounds strange to call them 'x' instances, as if 'x' were a class. And what's that cryptic reference to a proxy artist?
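
The warning is easy to reproduce, which hints at what is going on: legend treats anything that is not a matplotlib Artist as an unsupported handle, so my label strings are apparently being interpreted as handles. Here is a minimal sketch, not taken from the paper, that triggers the same message under matplotlib 1.5:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# Two positional arguments are parsed as (handles, labels). A str is
# not an Artist, so each string triggers "Legend does not support 'x'
# instances" and is dropped, leaving an empty legend box.
ax.legend(('x', 'yz'), ('first curve', 'second curve'))
fig.savefig('legend-test.png')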

Better stop here: my goal was to see if I can reproduce my data and figures from 2013 in a Python environment from 2016, and the answer is no. The plots are mutilated to the point of no longer being useful.

Second try: use my current Python 2.7 environment

Some of my research code still lives in the Python 2.7 universe, so I also have a Python environment based on Python 2.7.11 on my laptop, with NumPy 1.8.2, h5py 2.5, and matplotlib 1.4.3. That's much closer to the original one, so let's see how well it does in my reproducibility evaluation.

Outcome: much better. The computations work fine as before, and generating the plots produces only a single warning:

MatplotlibDeprecationWarning: The "loc" positional argument to legend is deprecated. Please use the "loc" keyword instead.

The legends still look OK, so the warning is just a minor nuisance, as one would expect from a deprecation message. Interestingly, this warning is also about legends, so it looks like there was a serious backwards-incompatible change to matplotlib's legend function between 1.2 and 1.5, announced by a deprecation warning in 1.4.
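
Putting the two observations together, my guess is that matplotlib 1.2 accepted a positional loc argument after the labels, 1.4 deprecated that form, and 1.5 apparently removed it, after which two positional arguments are parsed as (handles, labels). If I were maintaining the scripts, the migration the warning asks for would be a one-line change, sketched here with made-up labels:

# Old form: valid in matplotlib 1.2, deprecated in 1.4, and apparently
# misparsed as (handles, labels) from 1.5 on.
ax.legend(('x', 'yz'), 'upper right')

# Stable form: labels as the only positional argument, loc as a keyword.
ax.legend(('x', 'yz'), loc='upper right')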

Third try: reconstructing the original environment

Since I have the version numbers of everything, why not try to reconstruct the original environment exactly? Let's go for the same major and minor version numbers, which should be sufficient. That's a job for Anaconda:

conda create -n python2013 python=2.7 numpy=1.6 h5py=2.1 matplotlib=1.2 anaconda
source activate python2013
pip install tempdir
pip install ActivePapers.Py

Outcome: no warnings, no errors. Identical results. Reproducibility bliss at its best.

Conclusions

In summary, my little experiment has shown that reproducibility of Python scripts requires preserving the original environment, which fortunately is not so difficult over a time span of four years, at least if everything you need is part of the Anaconda distribution. I am not sure I would have had the patience to reinstall everything from source, given an earlier bad experience.

The purely computational part of my code turned out to be surprisingly robust under updates to its dependencies. The plotting code wasn't, because matplotlib introduced backwards-incompatible changes in a widely used function. Clearly the matplotlib team handled the transition carefully, issuing a deprecation warning before making the breaking change. For properly maintained client code, this can probably be dealt with.

The problem is that I do not intend to maintain the plotting scripts for all the papers I publish. And that's not only out of laziness, but because doing so would violate the spirit of reproducible research. The code I publish is exactly the code that I used for the original work, without any modification. If I started maintaining it, I could easily change the results by accident. I'd thus have to introduce regression tests as a safeguard against such changes. But... how do I test for visual equivalence of plots? Bitwise reproducibility is about as realistic to expect for image files as for floating-point numbers: I don't even get bitwise identical image files running the same Python code with identical matplotlib versions on different machines.
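
The most credible answer I know of is the one matplotlib uses for its own test suite: compare images with a root-mean-square tolerance rather than bit by bit. A sketch of such a regression test, with made-up file names and an arbitrarily chosen tolerance:

from matplotlib.testing.compare import compare_images

# compare_images returns None if the two images agree within the RMS
# tolerance, and a diagnostic message otherwise.
mismatch = compare_images('figure3-2013.png', 'figure3-now.png', tol=10)
if mismatch is not None:
    print(mismatch)

But that merely shifts the problem to choosing a tolerance that accepts harmless rendering differences while still rejecting real changes in the results.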

For my next paper, I will look for alternatives to matplotlib. My plotting needs are rather basic, so perhaps there is some other library with a more stable API that is good enough for me. Suggestions are welcome!
