Reproducible research in the Python ecosystem: a reality check

python, reproducible research

A few years ago, I decided to adopt the practices of reproducible research as far as possible within the technical and social constraints I have to live with. So how reproducible is my published code over time?

The example I have chosen for this reproducibility study is a 2013 paper about computing diffusion coefficients from molecular simulations. All code and data have been published as an ActivePaper on figshare. To save space, intermediate results had been removed from the archive before publication. This makes my reproducibility check very straightforward: a simple aptool update will recompute everything that was removed, from the intermediate results up to the plots that went into the paper.

One nice aspect of ActivePapers is that it stores the version numbers of all dependencies, so I can quickly verify that in 2013, I had used Python 2.7.3, NumPy 1.6.2, h5py 2.1.3, and matplotlib 1.2.x (yes, the x is part of the reported version number).
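
ActivePapers collects these version numbers automatically when the code runs. The manual equivalent is a few lines of Python; this snippet is illustrative and not part of the published archive:

import sys
import numpy, h5py, matplotlib

# Record the interpreter and library versions alongside the results.
print(sys.version)
print(numpy.__version__)
print(h5py.__version__)
print(matplotlib.__version__)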

First try: use my current Python environment

The environment in which I do most of my current research has Python 3.5.2, NumPy 1.11.1, h5py 2.6, and matplotlib 1.5.1. I set it up about a year ago when I got a new laptop, and I haven't had a good reason to update it since. Back in 2013 I had made some effort to make my code compatible with Python 3, so why not check now whether that investment paid off?

Outcome: running the computations works just fine, with results that are not identical at the bit level but close enough for my application. However, I get some warnings from matplotlib when generating the plots. Here is the first one; the others are similar:

UserWarning: Legend does not support 'x' instances.
A proxy artist may be used instead.
See: http://matplotlib.org/users/legend_guide.html#using-proxy-artist
  "#using-proxy-artist".format(orig_handle)

A quick inspection of the plots shows that the legends have almost disappeared; all that's left is a small white box. That makes many of the plots unintelligible.

Just out of curiosity, I made a quick attempt to decipher the error message. What's an 'x' instance? The following messages also refer to 'yz' instances and a few others. A look at my script reveals that 'x', 'yz', etc. are in fact the strings that I supplied as legend labels. It sounds strange to call them 'x' instances, as if 'x' were a class. And what's that cryptic reference to a proxy artist?
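
The warning is easy to reproduce, which hints at what is going on: legend treats anything that is not a matplotlib Artist as an unsupported handle, so my label strings are apparently being interpreted as handles. Here is a minimal sketch, not taken from the paper, that triggers the same message under matplotlib 1.5:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# Two positional arguments are parsed as (handles, labels). A str is
# not an Artist, so each string triggers "Legend does not support 'x'
# instances" and is dropped, leaving an empty legend box.
ax.legend(('x', 'yz'), ('first curve', 'second curve'))
fig.savefig('legend-test.png')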

Better stop here: my goal was to see if I can reproduce my data and figures from 2013 in a Python environment from 2016, and the answer is no. The plots are mutilated to the point of no longer being useful.

Second try: use my current Python 2.7 environment

Some of my research code still lives in the Python 2.7 universe, so I also have a Python environment based on Python 2.7.11 on my laptop, with NumPy 1.8.2, h5py 2.5, and matplotlib 1.4.3. That's much closer to the original one, so let's see how well it does in my reproducibility evaluation.

Outcome: much better. The computations work fine as before, and generating the plots produces only a single warning:

MatplotlibDeprecationWarning: The "loc" positional argument to legend is deprecated. Please use the "loc" keyword instead.

The legends still look OK, so the warning is just a minor nuisance, as one would expect from a deprecation message. Interestingly, this warning is also about legends, so it looks like there was a serious backwards-incompatible change to matplotlib's legend function between 1.2 and 1.5, announced by a deprecation warning in 1.4.
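
Putting the two observations together, my guess is that matplotlib 1.2 accepted a positional loc argument after the labels, 1.4 deprecated that form, and 1.5 apparently removed it, after which two positional arguments are parsed as (handles, labels). If I were maintaining the scripts, the migration the warning asks for would be a one-line change, sketched here with made-up labels:

# Old form: valid in matplotlib 1.2, deprecated in 1.4, and apparently
# misparsed as (handles, labels) from 1.5 on.
ax.legend(('x', 'yz'), 'upper right')

# Stable form: labels as the only positional argument, loc as a keyword.
ax.legend(('x', 'yz'), loc='upper right')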

Third try: reconstructing the original environment

Since I have the version numbers of everything, why not try to reconstruct the original environment exactly? Let's go for the same major and minor version numbers, which should be sufficient. That's a job for Anaconda:

conda create -n python2013 python=2.7 numpy=1.6 h5py=2.1 matplotlib=1.2 anaconda
source activate python2013
pip install tempdir
pip install ActivePapers.Py

Outcome: no warnings, no errors. Identical results. Reproducibility bliss at its best.

Conclusions

In summary, my little experiment has shown that reproducibility of Python scripts requires preserving the original environment, which fortunately is not so difficult over a time span of four years, at least if everything you need is part of the Anaconda distribution. I am not sure I would have had the patience to reinstall everything from source, given an earlier bad experience.

The purely computational part of my code turned out to be surprisingly robust under updates to its dependencies. The plotting code wasn't, because matplotlib introduced backwards-incompatible changes in a widely used function. Clearly the matplotlib team handled the transition carefully, issuing a deprecation warning before making the breaking change. For properly maintained client code, this can probably be dealt with.

The problem is that I do not intend to maintain the plotting scripts for all the papers I publish. And that's not only out of laziness, but because doing so would violate the spirit of reproducible research. The code I publish is exactly the code that I used for the original work, without any modification. If I started maintaining it, I could easily change the results by accident. I'd thus have to introduce regression tests as a safeguard against such changes. But... how do I test for visual equivalence of plots? Bitwise reproducibility is about as realistic to expect for image files as for floating-point numbers: I don't even get bitwise identical image files running the same Python code with identical matplotlib versions on different machines.
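
The most credible answer I know of is the one matplotlib uses for its own test suite: compare images with a root-mean-square tolerance rather than bit by bit. A sketch of such a regression test, with made-up file names and an arbitrarily chosen tolerance:

from matplotlib.testing.compare import compare_images

# compare_images returns None if the two images agree within the RMS
# tolerance, and a diagnostic message otherwise.
mismatch = compare_images('figure3-2013.png', 'figure3-now.png', tol=10)
if mismatch is not None:
    print(mismatch)

But that merely shifts the problem to choosing a tolerance that accepts harmless rendering differences while still rejecting real changes in the results.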

For my next paper, I will look for alternatives to matplotlib. My plotting needs are rather basic, so perhaps there is some other library with a more stable API that is good enough for me. Suggestions are welcome!
