A plea for stability in the SciPy ecosystem
Two NumPy-related news items appeared on my Twitter feed yesterday, just a few days after I had accidentally started a somewhat heated debate myself concerning the poor reproducibility of Python-based computer-aided research. The first was the announcement of a plan for dropping support for Python 2. The second was a pointer to a recent presentation by Nathaniel Smith entitled "Inside NumPy" and dealing mainly with the NumPy team's plans for the near future. Lots of material to think about... and comment on.
The end of Python 2 support for NumPy didn't come as a surprise to anyone in the Python community. With Python 2 itself not being supported after 2020, it doesn't make any sense for Python-dependent software to continue support beyond that date. The detailed plan for the transition of NumPy to a Python-3-only package looks quite reasonable. Which doesn't mean that everything is fine. The disappearance of Python 2 will leave much scientific software orphaned, and many published results irreproducible. Yes, the big well-known packages of the SciPy ecosystem all work with Python 3 by now, but the same cannot be said for many domain-specific libraries that have a much smaller user and developer base, and much more limited resources. As an example, my own Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread) for which resources (funding plus competent staff) are very difficult to find.
Speaking purely from a computational science point of view, the Python 2->3 transition was a big mistake. While Python 3 does have some interesting new features for scientists, most of them could have been implemented in Python 2 as well, without breaking backward compatibility. There are, of course, good reasons for the modernization of the language. I am not saying that Guido van Rossum is an idiot - far from it. As popular as Python may be in today's scientific research, scientific users make up a very small part of the total Python user base. Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it's mostly a calamity for computational science.
Apart from the major earthquake caused by this change in the Python language itself, whose victims we will be able to count starting from 2020, the SciPy ecosystem has been subject to regular minor seismic activity caused by breaking changes in its foundational libraries, such as NumPy or matplotlib. I am not aware of any systematic study of their impact, but my personal anecdotal evidence (see e.g. this report) suggests that a Python script can be expected to work for two to three years, but not for five or more. Older scripts will either crash, which is a nuisance, or produce different results, which is much worse because the problem may well go unnoticed.
In my corner of science, biomolecular simulation, the time scale of methodological progress is decades. This doesn't mean that nothing exciting happens in shorter time spans. It just means that methods and techniques, including software, remain relevant for one to three decades. It isn't even uncommon for a single research project to extend over several years. As an example, I just edited a script whose last modification date was December 2015. It's part of a collaborative project involving methodological development and application work in both experiment and theory. The back-and-forth exchanges between experimentalists and theoreticians take a lot of time. In the course of such projects, I update software and even change computers. If infrastructure updates break my code in progress, that's a major productivity loss.
Beyond personal productivity considerations, breaking changes are a threat to the reproducibility of scientific studies, an aspect that has been gaining more and more attention recently because so many published results were found to be non-reproducible or erroneous (note that these are very different things, but that's not my topic for today), with software taking a big share of the responsibility. The two main issues are: (1) non-reproducible results cannot be trusted, because nobody really knows how they were obtained and (2) code whose results are non-reproducible is not a reliable basis for further work (Newton's famous "standing on the shoulders of giants"). Many researchers, myself included, are advocating better practices to ensure computational reproducibility. In view of the seismic activity outlined above, I have been wondering for a while whether I should add "don't use Python" to my list of recommendations. What's holding me back is mainly the lack of any decent alternative to today's SciPy ecosystem.
Watching Nathaniel's BIDS talk, I was rather disappointed that these issues were not addressed at all. There is a general discussion of "change", including a short reference to breaking changes and their impact on downstream projects, which suggests that there has been some debate about these questions in the NumPy community (note that I am no longer following the NumPy discussion mailing list for lack of time). However, assuming that Nathaniel's summary is representative of that debate, neither reproducibility nor the requirements of the different software layers in scientific computing seem to have received the attention they deserve.
I have written before about software layers and the lifecycle of digital scientific knowledge, so I will just give a summary here. A scientific software stack looks like this:
- Layer 4: project-specific code
- Layer 3: domain-specific libraries
- Layer 2: scientific infrastructure
- Layer 1: non-scientific infrastructure
In the SciPy universe, we have Python in layer 1, NumPy and friends in layer 2, lots of lesser-known libraries (including my MMTK mentioned above) in layer 3, and application scripts and notebooks in layer 4.
A breaking change in any layer affects everything in the layers above. The authors of the affected higher-level code have three options:
- adapt their code (maintenance)
- freeze their code (describe the stack they actually used)
- do nothing
The first choice is of course the ideal case, but it requires serious development resources. With the second one, archival reproducibility is guaranteed, i.e. a reader knows under which conditions the code can be used and trusted, and how these conditions can be recreated. But frozen code is not a good basis for further work. Using it requires considerable work to re-create an outdated environment. Worse, using two or more such packages together is in general impossible because each one has different dependency version requirements. Finally, the third option leaves the code in a limbo state where it isn't even clear under which conditions it can be expected to work. In a research context, this ought to be considered unacceptable.
Let's consider now how these three choices are applied in practice, for each layer in the software stack. Software in layers 1 and 2 must obviously be maintained, otherwise people would quickly abandon it. Fortunately, these layers also suffer the least from collapse, because there is less code below them whose changes could break them. Layer 3 code gets more or less well maintained, depending on the size of the communities supporting it, and on the development resources available. Quite often, maintenance is sub-optimal for lack of resources, with the maintainers aware of the problem but unable to do a better job. That's my situation with MMTK.
Layer 4 code is the focus of the reproducible research movement. Today, most of this code is still not published, and of the small part that does get out, a large part is neither maintained nor frozen but simply dumped to a repository. In fact, the best practices recommended for reproducible research can be summarized as "freeze and publish layer 4 code". Maintaining layer 4 code has been proposed (see e.g. continuous analysis), but it is unclear if the idea will find acceptance. The obvious open question is who should do the maintenance. Considering that most research is done by people who spend a few years in a lab and then move on, it's difficult to assign the responsibility for maintenance to the original authors of the code. But anyone else is less competent, less motivated, and would likely expect to be paid for doing a service job.
An argument I hear frequently in the SciPy community (and elsewhere) is that scientific code that is not actively used and maintained isn't worth bothering with (see e.g. this tweet by Titus Brown). The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don't agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn't matter.
I would like to see the SciPy community define its point of view on these issues openly and clearly. We all know that development resources are scarce, that not everything that's desirable can be done. The real world requires compromises and priorities. But these compromises and priorities need to be discussed and communicated openly. It's OK to say that the community's priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don't have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community's business, and that those who have such needs should look elsewhere. Or else, decide that SciPy is inclusive and caters for all computer-aided research - and draw the conclusion that stability must take a larger weight in future development decisions.
What is not OK is what I perceive as the dominant attitude today: sell SciPy as a great easy-to-use tool for all scientists, and then, when people get bitten by breaking changes, tell them that it's their fault for not having a solid maintenance plan for their code.
Finally, in anticipation of an argument that I expect to see, let me stress that this is not a technical issue. Computing technology moves at a fast pace, but that doesn't mean that lack of stability is inevitable. My last Fortran code, published in 1994, still works without changing a single line. Banks have been running COBOL code unchanged for decades. Today's Java implementations will run the very first Java code from 1995 without changes, and even much faster thanks to JIT technology. This last example also shows that stability is not in contradiction with progress. You can have both if that's a design goal. It's all a matter of policy, not technology.
Note added 2017-11-22: see also my summary of the discussion in reaction to this post.
Comments retrieved from Disqus
- xoviat:
Honestly if you actually want MMTK to be ported to Python 3, the least you can do is sign up for a GitHub account and upload the code to a repository. Right now, it's definitely not going to be ported because no one can look at the code.
- Konrad Hinsen:
It has been on Bitbucket for a couple of years:
https://bitbucket.org/khins...
Releases have been on SourceSup, where they have always been among the top-ten downloads:
- Luis Pedro Coelho:
Long form follow-up: https://metarabbit.wordpres...
- bastibe:
You can always install old versions of Python and packages using "pip install scipy==0.9.0". Old versions are not going away. If you need stability, this seems to be an easy option. Am I missing something?
- Konrad Hinsen:
Many people have made this suggestion. In theory it works, as long as all dependencies are on PyPI. C library dependencies are often a problem. But the main issue is that you cannot assume that everyone (all program authors and users) knows exactly what to do and does it correctly. In practice, the approach you describe almost never works because some information is missing. To make it practical, we'd need easy-to-use tooling for all phases: producing a complete list of versioned dependencies (including C libraries), verifying the completeness of this list, and restoring the environment on a different machine (a minimal sketch of the first phase follows below). All that with simple tools that everybody can figure out how to use on all platforms.
People are working on this, and I am optimistic that we will get there, but for a few more years we will have to live with the current state. Which is why stability still matters for reproducibility.
In addition, stability will always matter for slow-moving science, where you need to combine ten-year-old and two-year-old libraries in a single program.
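To illustrate just the first of those phases, here is a minimal sketch; it assumes that setuptools' pkg_resources module is available (it usually is wherever pip is), and it only sees packages known to the Python packaging machinery, so C-level libraries, which are exactly the hard part, are not captured:

```python
# Minimal sketch: record the versions of all Python packages installed in
# the current environment, in a pip-compatible "pinned" format.
# Assumes setuptools' pkg_resources; C libraries are not covered.
import pkg_resources

with open("requirements-frozen.txt", "w") as req:
    for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
        req.write("{0}=={1}\n".format(dist.project_name, dist.version))
```

Such a file can then be handed to "pip install -r requirements-frozen.txt" on another machine; the missing pieces (C dependencies, platform differences, packages that have disappeared from PyPI) show up at that point.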
- Syndafloden:
If you want a completely reproducible case, you'll likely need to package it with the specific runtimes or dependencies, which shouldn't be very hard at all with, say, a Nanobox solution or something similar.
You usually want that either way, regardless of use-case, language or environment.
- Robert Jamie Munro:
Python is really terrible here compared to, for example, node/npm or even Java / Maven. There's even an XKCD comic about it: https://m.xkcd.com/1987/
- Justin Black:
So this is an operating system specific solution, but one could use a docker image with versioned binaries, and pinned python packages using requirements.txt
That way, the image has everything you need in it.
- NPoisson:
Hmm. I tend to consider that numerical work should be distributed as a git repository with frozen source code. Even better, new technology allows freezing the software stack if it's not too hardware-dependent.
In other words, a pip requirements file with proper versioning + a Dockerfile should be able to provide a frozen ecosystem and allow good reproducibility. Of course, these technologies are new and not known for their stability... for now. But I think they will become an important part of the scientific stack: you define your OS and software needs precisely, you provide your source in a well-documented way, and you distribute both with your publication.
- Konrad Hinsen:
Many people are working on various solutions for freezing, mostly at a level below Python/SciPy and thus generic. I am rather optimistic that this will work out fine ultimately, although my personal bet is not specifically on Docker. However, it will take a long time to come up with a reliable and stable solution and then develop good tooling to make it easy to use.
This is in fact what I call "archival reproducibility" in my post. It's an important step, but not a replacement for stable infrastructure.
- gerritholl:
Scipy moved to version 1.0 three weeks ago (https://github.com/scipy/sc... ), after 16 years of development. Within those 16 years, much of what you call layer-3 and layer-4 code has been built on top of scipy, in the full knowledge that the API was not yet stable, as the 0.x version number indicated. The bump to version 1.0 suggests the API should be more stable from now on, which hopefully will be the case.
I agree that communication is key. If you want to build code that will run unchanged for 20 years, relying on a library that is in version `0.x` is probably not a good idea, unless you freeze the version and bundle it along. When scientific software is in beta, as scipy effectively was until three weeks ago, the API *should* be able to change. But 16 years to go from initial release to initial stable release, as scipy did, is very long.
- Konrad Hinsen:
I fully agree, though I'd recommend more explicit communication than just a version number. Non-developers are often not familiar with version number conventions.
I have no personal experience with scipy stability because I have always avoided scipy except for ephemeral experimentation. The reason is the difficult installation procedure, for which I didn't want to provide technical support to the users of my own code.
- stefanvdwalt:
With the arrival of binary wheels, hopefully this is now a non-issue.
- Konrad Hinsen:
It's indeed much less of an issue. The remaining difficult situation is HPC systems (clusters, supercomputers) with severe Internet access restrictions that render pip non-operational. While downloading wheels on a different machine is possible in principle, few people know it's possible and fewer know how to do it. In practice, people install from source code on those machines.
- Nathaniel J. Smith:
Surely if you can get the source code onto the machine, then you can also get a wheel onto it? It's literally exactly the same process, except you click on the '.whl' link instead of the '.tar.gz' link. Actually, downloading wheels is easier, because you can type 'pip wheel <package name or source tree>' and it will automatically download the whole transitive dependency tree as wheels, which you can then rsync over or copy onto a USB stick or whatever the magic transfer system is.
I understand that not everyone may realize this, but rewriting every scipy feature inside every package seems like a lot more work than explaining how to download wheels :-).
- Konrad Hinsen:
You are right that all the technology is there. As so often, the remaining big issue is making sure that everybody who has the problem can find the solution in a reasonable amount of time.
BTW, the alternative to using scipy is not rewriting all its features, but rewriting, or finding in a smaller dependency, the one or two features that a given application needs. And that is sometimes easier than dealing with your users' installation questions, in my experience.
- Luis Pedro Coelho:
+1 on this.
I find that often this discussion devolves into a binary "let's be like the kernel: stable APIs forever" vs "let's move fast and break things", but I would be happy with "let's break things if we must, but try hard to avoid breaking other people's code when there is an obvious alternative".
The python2/3 transition is annoying (and py3 was an avoidable mistake), but I think that numpy/scipy changing their interfaces without any regard for backwards compatibility is much worse. For example, scipy.stats.mannwhitneyu has had at least 3 different behaviours in as many years without a lot of discussion of the possible effects on people's code. I almost published wrong results because of this particular change.
Histogram() changes have also caused me problems (for a while, people would email me every few months about not being able to reproduce my paper because numpy broke the code [https://metarabbit.wordpres...]).
I once filed what I thought was an obvious bugfix (make the code follow the documented API instead of changing it for one high profile project) and had to argue for it: https://github.com/numpy/nu... Again, they broke my code for absolutely no good reason.
- Nathaniel J. Smith:
I clicked through your links because I sympathize with your frustration, and wanted to see what we did wrong in case it's something we can handle better in the future. I'm still not sure what your issue with histogram was -- the link in that paragraph leads to a blog post that doesn't have any more details either. But I did read through PR #2780, which is linked both from that blog post and the bottom of your comment.
I have to say, I found this extremely frustrating. The change that broke your code wasn't for "no good reason" or "aesthetic grounds" (as you describe it in the linked blog post) -- it was made because the 1.7 release broke Theano, and they submitted a fix to un-break it. I.e., your evidence that we don't care about backwards compatibility is that we *made a backwards compatibility fix*. In the process, we did accidentally break your code -- sorry about that. The patch was reviewed, but at the time no-one realized that it could cause compatibility breakage. (I'm still not entirely clear on why that happened -- I think it has to do with ways in which C++ is stricter than C? Nonetheless, it obviously did. Again, I apologize for this part.) Once you submitted your PR and alerted us of the problem, we confirmed with Theano that your fix wasn't going to break their code again, and then we merged it and backported it to the stable release branch. This all happened within 12 hours, and I posted the first reply – which linked to the previous context explaining why the change was made, and started the process of checking with Theano – 6 minutes after your original submission, at 3am my time.
It's true that everyone mostly ignored your argument about the documentation. This is for two reasons: first, when documentation and code disagree, the default is to change the documentation. This is mandatory if you care about backwards compatibility -- in fact it follows directly from the rule that you cited at the beginning of your post. Changing the code might break users, and changing the documentation is an "obvious alternative" that doesn't risk breaking users. So everyone was focused on the breakage, not the documentation. And second, it didn't even matter anyway – we were already in the process of fixing the problem, so we focused on that instead of getting into a tangential discussion about engineering principles.
All in all, I'm shocked that *this* is your example you use to go around sneering about how we're a bunch of terrible engineers who don't care about our users. You should feel ashamed of yourself.
We've certainly made mistakes, and doubtless will continue to do so in the future. NumPy's a complex project, maintained by a small handful of volunteers, who are trying to support millions of users with contradictory requirements – inevitably we do mess up. When we do, we know it causes real harm to our users, and we try to do better. But at least acknowledge that we're trying. Geez.
- stefanvdwalt:
While there may be isolated cases that have been badly handled, the general approach is to be conservative with API changes unless there is a significant benefit (e.g., clarity, or additional usage possibilities). Many libraries in the SciPy ecosystem follow a three-release deprecation cycle, which means in practice that if you run your code once a year, you will at least see warnings that indicate what needs to be changed. The expectation that libraries should *never* change APIs is unreasonable; for papers you should consider either specifying the version of NumPy, or publishing the code in a location where you have the ability to change it later. Your comment seems to suggest that the NumPy and SciPy developers do not care about backward compatibility, which I don't think is an accurate reflection.
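As a small illustration (a sketch, not an official recommendation): Python silences DeprecationWarning in most contexts by default, so a script that is only re-run occasionally may never display those warnings unless the filter is relaxed explicitly, for example:

```python
# Sketch: make deprecation warnings from libraries visible (or fatal) when
# re-running old analysis code, so pending API removals are not missed.
# DeprecationWarning is silenced by default in most contexts.
import warnings

warnings.simplefilter("always", DeprecationWarning)   # report every occurrence
# warnings.simplefilter("error", DeprecationWarning)  # or turn them into errors in a test run

# ... the old analysis code follows here, e.g. imports of numpy/scipy and the
# calls whose APIs may have been deprecated since the script was written.
```

Running a test suite once per release cycle with the "error" filter is one cheap way to catch removals before they silently change or break results.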
- Luis Pedro Coelho:
As I wrote, I don't think that the choice is a binary one between "never change the API" like the kernel and changing it at will.
"in practice that if you run your code once a year, you will at least see warnings that indicate what needs to be changed"
This is only true if I run my code once a year with the (at the time) most up to date version; not true otherwise. Also, sometimes I want to retrieve code that I used 2 years ago in another project and I would rather have an expectation that it works.
"While there may be isolated cases that have been badly handled, the general approach is to be conservative with API changes unless there is a significant benefit (e.g., clarity, or additional usage possibilities)."
This is exactly our disagreement. I don't think that "clarity or additional usage possibilities" is anywhere close to something that would justify breaking backwards compatibility for a foundational project like numpy or scipy.
Add new functions while deprecating the older ones (a sketch follows at the end of this comment). Most new functionality can be done with new functions or even just new arguments. This way, you improve the API and evolve it. After a few years, remove old functions. But changing the behaviour of working code in 3 release cycles (18 months) is not what I'd consider conservative; it's rather on the "move fast and break things" side of the scale. For more cutting-edge projects, that could be OK, even expected, but numpy/scipy should be more like infrastructure.
I won't even ask for something like semantic versioning (where there would be a commitment to supporting the APIs for the duration of a major release), but 18 months is way too short for a project like numpy, especially for changes that silently change results. And if I report a change to a documented API that caused code to stop compiling, it should be treated as a bona fide bug (and not a discussion of which API is best).
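For concreteness, here is a sketch of that "add new, deprecate old" pattern; the function names are hypothetical and nothing below comes from the actual NumPy or SciPy code base:

```python
# Sketch of the "add new, deprecate old" evolution pattern, with
# hypothetical function names (not taken from NumPy or SciPy).
import warnings
import numpy as np

def column_means(data, skip_nan=False):
    """New function with the improved, clearer interface."""
    return np.nanmean(data, axis=0) if skip_nan else np.mean(data, axis=0)

def colmeans(data):
    """Old function: documented behaviour is preserved, but a warning is emitted."""
    warnings.warn(
        "colmeans() is deprecated; use column_means() instead. "
        "It will be removed after several major releases.",
        DeprecationWarning,
        stacklevel=2,
    )
    return column_means(data)  # delegate, keeping the old results identical
```

Existing code keeps producing the same numbers while the warning points users to the replacement; the old name is only removed years later.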
- Pierre de Buyl:
Hi Konrad, interesting read!
In the direction of "mitigation" of these issues, there is your other idea that data is more important than code (Hinsen 2012, CISE). Whether you maintain, freeze, or ignore, the availability of reference data allows future "you" or future "someone else" to perform at least a comparison test.
- Konrad Hinsen:
Yes, data in open, documented, and software-independent formats is a big plus for longevity. My own MMTK is a bad example there, because it uses a trajectory format that includes executable Python code, making it very hard to process from other languages. I have repented and defined a more open and language-neutral format (MOSAIC, https://mosaic-data-model.g...).
Unfortunately, data supremacy is almost as hard to sell as stable software!
- Nathan Goldbaum:
NumPy LTS will continue to be available on Python 2, and MMTK will continue to be buildable with it.
- Konrad Hinsen:
Indeed, but very soon Python 2 will have to be banished to some sandbox because security bugs will no longer be fixed. It's good that NumPy LTS will remain available for frozen code, but it's not sufficient to keep code alive and useful.
- jsierles:
Freezing the stack may end up being the only real solution as dependency trees grow in complexity. If this were easy to do, and long-term reproducibility could be guaranteed, would you accept it as a solution?
- Konrad Hinsen:
As I wrote in my post, it's a partial solution, OK from a reproducibility point of view, but insufficient for long-running projects, or for taking up old projects again. For that, I need to be able to use ten-year-old and two-year-old libraries together from the same script.
- jsierles:
I won't argue that long-term support is important at a library level. However, it seems unrealistic in modern software environments to expect it. Rather, I think we need to look towards new ways of WRITING code, and of defining dependencies. For example, if you could split your script into sections, each using a different dependency tree, but passing values between them outside the runtime, you could avoid a lot of typical problems with dependency hell. Also, tools like Guix (which you've written about) help solve the underlying dependency graph problem in a manageable way. I've seen some success with this approach.
I agree this is not a 'technical' issue, but also think there are more solutions available than are made obvious at this level of discussion. Would love to see some actual code and see how we could solve specific problems!
- Konrad Hinsen:
There are various *possible* technical solutions, and more are being worked on. But today, we have no solution that works in practice, meaning that it is sufficiently simple on all major platforms that the majority of scientists can work with it. Which is why for now, and a few more years to come, breaking changes in infrastructure are a danger for reproducibility.
BTW, labelling a potential solution as "unrealistic" is a major contribution to the problem itself. As I pointed out with the examples of the Fortran, COBOL, and Java ecosystems, stability is possible not only in theory but also in practice, under the condition that everyone keeps it in mind during design and development. In a community where most people consider stability unrealistic, it cannot happen.
- jsierles:
I completely agree that the label contributes to the problem. And that some call for stability is justified in any heavily used software project. However, in the case of Python, and other languages like JavaScript, the issues run deeper than the label. Down to how packaging systems work and the language designers' goals when making changes. Stability? Simplicity? Programmer happiness? It's truly hard to reconcile these, and less and less so as more languages enter the space. So I don't see adopting stability as something necessarily easier or faster to do than exploring other solutions that can apply to a wider range of problems.
Furthermore, I see that technical solutions are equally unfairly labeled as unrealistic because of an unquantifiable cost of adoption. The result is that we see a lot of talk about reproducibility that boils down to a lengthy laundry list of best practices, e.g. (http://journals.plos.org/pl...).
Instead, as technologists, I think we are responsible to build better tools and more creative solutions to the problem.
- Konrad Hinsen:
I pretty much agree with all that. And I would definitely encourage technologists to continue looking for better solutions. The one mistake not to make is to declare victory when a proof of concept has been achieved. That's just the beginning of the next episode: convincing enough early adopters so that communities like Software Carpentry will add the new technology to their courses.