Konrad Hinsen's Blog

Industrialization of scientific software: a case study

A coffee break conversion at a scientific conference last week provided an excellent illustration for the industrialization of scientific research that I wrote about in a recent blog post. It has provoked some discussion on Twitter that deserves being recorded and commented on a more permanent medium. Which is here.

I was chatting with a colleague who I have been meeting at such occasions for about 15 years. He asked me if I was still developing my Molecular Modelling Toolkit. I replied that I had stopped working on it because the end of support for Python 2 in 2020 would quickly make it too hard to use for most of its intended audience, and that I didn’t have the means nor the motivation to port it to Python 3. He was quite surprised by my explanations, since he had never heard of the end of support for Python 2, though he did know that there was also a version 3 that was a bit different. His own data analysis scripts were still Python 2 because he had never seen a good reason to even look at Python 3 - never break a working system! But he was alarmed by my prediction that Python 2 would soon disappear from Linux distributions, as he relied on Ubuntu (regularly updated by his lab’s systems administrator) to provide him with Python 2 and the few libraries he used.

I was not surprised, as I have had similar conversations with various colleagues over the last years. In particular when someone contacts me with a Python question, which happens quite frequently as I have the reputation of being a Python expert in my little corner of science. The typical profile of these people is experimentalists who write and use small data analysis scripts, but for whom computation is not the central part of their research. They picked up Python from a colleague or a student, or perhaps through attending a short introductory course (such as a Software Carpentry workshop). They have a Python installation on their machine, which is managed by someone else. For them, Python is “just there”, exactly like other Unix basics such as sh, or grep. Moreover, Python has been part of their computing life for many years, often for their entire scientific career, and it has never caused them any trouble.

When I mentioned my coffee break conversation on Twitter, Greg Landrum commented that he would expect every Python user to make an effort to stay informed about important Python news, so everyone should by now have heard of the end-of-life decision for Python 2. This reminded me of an earlier Twitter conversation with Stefano Zacchiroli, who expressed similar views. As did other actors of the FOSS universe in various real-life discussions. There seems to be a widely shared expectation among FOSS developers that users should follow news about the software they use and take the required steps to adapt to “mandatory changes”, as Stefano put it. My story illustrates that this is not happening. There is a category of users who (1) don’t follow development news and (2) expect the software they use to stay around forever without major breaking changes.

This is exactly the phenomenon that I call the industrialization of scientific software. Some software packages, such as the core of the Scientific Python ecosystem, become so popular beyond their core community that for an important part of their users they are industrial products, something they obtain once and then use without thinking much about its origins or possible evolution. One sign of a piece software becoming an industrial product is its inclusion in standard Linux distributions, where it is just one package out of many that users can choose from. Linux distributions take the role that department stores have for material goods, providing a platform for window-shopping and acquisition via a standardized procedure. For users who get their software from a Linux distribution, all software looks a bit alike. They have no reason to be more careful about Python than about sh or grep.

Just like material goods industries, the developers of industrial software, FOSS or not, have no easy way to communicate with their clients. If such communication becomes inevitable, as for example in the case of a product recall for safety reasons, an enormous effort must be deployed to ensure that the message reaches most of its audience. Pierre de Buyl made a suggestion along these lines, proposing to put up posters with an explanation of the Python 2->3 transition in every research lab. Asking research funders to support such an action would be an interesting experiment.

Is there anything that FOSS communities can do to prevent such miscommunication in the future? A look at industrial material goods may provide inspiration. Every non-trivial technical product comes with a user manual, which typically starts with pointing out safety precautions that users are expected to be aware of. Do this, don’t do that, watch out for exceptional situations. The documentation of software packages could do the same, and tutorials could then emphasize the message when explaining the product to potential future customers. Here is what such a warning could look like:

This software package is developed for cutting-edge scientific
research. Our priority in development is to improve the software
and to adapt it for the needs of future applications. As a consequence,
we cannot maintain client code compatibility indefinitely.
Users of this package are expected to check the release notes
(available at http://...) at least once per year, and to adapt
their code to changes in the interfaces explained there.

I would expect such a notice in the introduction to the SciPy Lecture Notes, for example. It describes the SciPy ecosystem, comparing it to alternative choices, but says no word about what users need to do to safely use this ecosystem in their research work. As I said in my previous post, the FOSS community has largely been blind to the consequences of software industrialization, maintaining the outdated view that developers and users form a single community. It’s time for an upgrade.

Note added after the initial publication: Dan Katz commented on Twitter with a reference to this very clear statement on the development priorities for Matlab. It would be very helpful if FOSS communities published similar statements about their products.