Konrad Hinsen's Blog

A plea for stability in the SciPy ecosystem

Two NumPy-related news items appeared on my Twitter feed yesterday, just a few days after I had accidentally started a somewhat heated debate myself concerning the poor reproducibility of Python-based computer-aided research. The first was the announcement of a plan for dropping support for Python 2. The second was a pointer to a recent presentation by Nathaniel Smith entitled “Inside NumPy” and dealing mainly with the NumPy team’s plans for the near future. Lots of material to think about… and comment on.

The end of Python 2 support for NumPy didn’t come as a surprise to anyone in the Python community. With Python 2 itself not being supported after 2020, it doesn’t make any sense for Python-dependent software to continue support beyond that date. The detailed plan for the transition of NumPy to a Python–3-only package looks quite reasonable. Which doesn’t mean that everything is fine. The disappearance of Python 2 will leave much scientific software orphaned, and many published results irreproducible. Yes, the big well-known packages of the SciPy ecosystem all work with Python 3 by now, but the same cannot be said for many domain-specific libraries that have a much smaller user and developer base, and much more limited resources. As an example, my own Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread) for which resources (funding plus competent staff) are very difficult to find.

Speaking purely from a computational science point of view, the Python 2->3 transition was a big mistake. While Python 3 does have some interesting new features for scientists, most of them could have been implemented in Python 2 as well, without breaking backward compatibility. There are, of course, good reasons for the modernization of the language. I am not saying that Guido van Rossum is an idiot - far from it. As popular as Python may be in today’s scientific research, scientific users make up for a very small part of the total Python user base. Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it’s mostly a calamity for computational science.

Apart from the major earthquake caused by this change in the Python language itself, whose victims we will be able to count starting from 2020, the SciPy ecosystem has been subject to regular minor seismic activities by breaking changes in its foundational libraries, such as NumPy or matplotlib. I am not aware of any systematic study of their impact, but my personal anecdotal evidence (see e.g. this report) suggests that a Python script can be expected to work for two to three years, but not for five or more. Older scripts will either crash, which is a nuisance, or produce different results, which is much worse because the problem may well go unnoticed.

In my corner of science, biomolecular simulation, the time scale of methodological progress is decades. This doesn’t mean that nothing exciting happens in shorter time spans. It just means that methods and techniques, including software, remain relevant for one to three decades. It isn’t even uncommon for a single research project to extend over several years. As an example, I just edited a script whose last modification date was December 2015. It’s part of collaborative project involving methodological development and application work in both experiment and theory. The back-and-forth exchanges between experimentalists and theoreticians take a lot of time. In the course of such projects, I update software and even change computers. If infrastructure updates break my code in progress, that’s a major productivity loss.

Beyond personal productivity considerations, breaking changes are a threat to the reproducibility of scientific studies, an aspect that has been gaining more and more attention recently because so many published results were found to be non-reproducible or erroneous (note that these are very different things, but that’s not my topic for today), with software taking a big share of the responsibility. The two main issues are: (1) non-reproducible results cannot be trusted, because nobody really knows how they were obtained and (2) code whose results are non-reproducible is not a reliable basis for further work (Newton’s famous “standing on the shoulders of giants”). Many researchers, myself included, are advocating better practices to ensure computational reproducibility. In view of the seismic activities outlined above, I have been wondering for a while whether I should add “don’t use Python” to my list of recommendations. What’s holding me back is mainly the lack of any decent alternative to today’s SciPy ecosystem.

Watching Nathaniel’s BIDS talk, I was rather disappointed that these issues were not treated at all. There is a general discussion of “change”, including a short reference to breaking changes and their impact on downstream projects, which suggests that there has been some debate of these questions in the NumPy community (note that I am no longer following the NumPy discussion mailing list for lack of time). However, assuming that Nathaniel’s summary is representative of that debate, neither reproducibility nor the requirements of the different software layers in scientific computing seem to have received the attention they deserve.

I have written before about software layers and the lifecycle of digital scientific knowledge, so I will just give a summary here. A scientific software stack looks like this:

  • Layer 4: project-specific code
  • Layer 3: domain-specific libraries
  • Layer 2: scientific infrastructure
  • Layer 1: non-scientific infrastructure

In the SciPy universe, we have Python in layer 1, NumPy and friends in layer 2, lots of lesser-known libraries (including my MMTK mentioned above) in layer 3, and application scripts and notebooks in layer 4.

A breaking change in any layer affects everything in the layers above. The authors of the affected higher-level code have three options:

  1. adapt their code (maintenance)
  2. freeze their code (describe the stack they actually used)
  3. do nothing

The first choice is of course the ideal case but it requires serious development resources. With the second one, archival reproducibility is guaranteed, i.e. a reader knows under which conditions the code can be used and trusted, and how these conditions can be recreated. But frozen code is not a good basis for further work. Using it requires much work for re-creating an outdated environment. Worse, using two or more of such packages together is in general impossible because each one has different dependency version requirements. Finally, the third option leaves the code in a limbo state where it isn’t even clear under which conditions it can be expected to work. In a research context, this ought to be considered unacceptable.

Let’s consider now how these three choices are applied in practice, for each layer in the software stack. Software in layers 1 and 2 must obviously be maintained, otherwise people would quickly abandon it. Fortunately these layers also suffer the least from collapse, because there is less code below them. Layer 3 code gets more or less well maintained, depending on the size of the communities supporting it, and on the development resources available. Quite often, maintenance is sub-optimal for lack of resources, with the maintainers aware of the problem but unable to do a better job. That’s my situation with MMTK.

Layer 4 code is the focus of the reproducible research movement. Today, most of this code is still not published, and of the small part that does get out, a large part is neither maintained nor frozen but simply dumped to a repository. In fact, the best practices recommended for reproducible research can be summarized as “freeze and publish layer 4 code”. Maintaining layer 4 code has been proposed (see e.g. continuous analysis ), but it is unclear if the idea will find acceptance. The obvious open question is who should do the maintenance. Considering that most research is done by people who spend a few years in a lab and then move on, it’s difficult to assign the responsibility for maintenance to the original authors of the code. But anyone else is less competent, less motivated, and would likely expect to be payed for doing a service job.

An argument I hear frequently in the SciPy community (and elsewhere) is that scientific code that is not actively used and maintained isn’t worth bothering with (see e.g. this tweet by Titus Brown). The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don’t agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter.

I would like to see the SciPy community define its point of view on these issues openly and clearly. We all know that development resources are scarce, that not everything that’s desirable can be done. The real world requires compromises and priorities. But these compromises and priorities need to be discussed and communicated openly. It’s OK to say that the community’s priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don’t have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community’s business, and that those who have such needs should look elsewhere. Or else, decide that SciPy is inclusive and caters for all computer-aided research - and draw the conclusion that stability must take a larger weight in future development decisions.

What is not OK is what I perceive as the dominant attitude today: sell SciPy as a great easy-to-use tool for all scientists, and then, when people get bitten by breaking changes, tell them that it’s their fault for not having a solid maintenance plan for their code.

Finally, in anticipation of an argument that I expect to see, let me stress that this is not a technical issue. Computing technology moves at a fast pace, but that doesn’t mean that lack of stability is a fatality. My last Fortran code, published in 1994, still works without changing a single line. Banks have been running Cobol code unchanged for decades. Today’s Java implementations will run the very first Java code from 1995 without changes, and even much faster thanks to JIT technology. This last example also shows that stability is not in contradiction with progress. You can have both if that’s a design goal. It’s all a matter of policy, not technology.

Note added 2017–11–22: see also my summary of the discussion in reaction to this post.