The industrialization of scientific research
Over the last few years, I have spent a lot of time thinking about, speaking about, and discussing the reproducibility crisis in scientific research. An obvious but hard-to-answer question is: Why has reproducibility become such a major problem, in so many disciplines? And why now? In this post, I will attempt to formulate a hypothesis: the underlying cause of the reproducibility crisis is the ongoing industrialization of scientific research.
First of all, let me explain what I mean by industrialization. In the production of material goods, this term stands for a transition to high-volume production in large sites (factories), profiting from economies of scale. This doesn't directly carry over to immaterial goods such as information and knowledge, which can be copied at near-zero cost. There are, however, aspects of industrialization that do make sense for immaterial goods. The main one is a clear separation of producers, who design and make products for an anonymous group of potential clients, and consumers who choose from pre-existing products on the market. This stands in contrast to 1) producing for one's own consumption, and 2) commissioning someone else (e.g. a craftsman) to make a personalized product. Both of these approaches lead to products optimized for a specific consumer's need, whereas industrial products are made for a large and anonymous market.
In scientific research, immaterial industrial products are a recent phenomenon. The ones that I will concentrate on are software and datasets that are publicly available and used by scientists outside of any collaboration with their authors. Twenty years ago, this would have been a rare event. Most software was written for in-lab use, and not even made available to others. Only a small number of basic, standardized, and widely used tools, such as compilers, were already industrial products. Most data were likewise not shared outside the research group that collected them. The resulting non-verifiability of scientific findings was an obvious problem, and led ultimately to today's growing Open Science movement. However, the Open Science movement goes well beyond asking for the transparency that is fundamentally required by the scientific method. It wants software and data to be reusable by other scientists and for different purposes. This is stated most explicitly by the FAIR data label, in which the R stands for reusability. Open Science thus turns software and datasets into industrial commodities.
The knowledge gap
A characteristic feature of industrial products is that consumers know much less about them than producers do. Consumers cannot ask for personalized explanations either, unlike in the case of a product tailor-made by a craftsman. For material goods, this has led to a wide range of professions, institutions, and regulations designed to help consumers choose suitable products and to protect them against producers' abuse of their superior knowledge. Examples are consumer protection agencies, independent experts, technical norms, quality labels, etc. For the industrial products of scientific research, we have no established equivalents yet, and it is not even clear whether we can ever have them. And that is, in my opinion, a major cause of the reproducibility crisis.
One piece of evidence is the nature of the cases discussed in the context of the crisis. Reproducibility has been an issue with experiments since the dawn of science, and yet experimental non-reproducibility never shows up in the examples cited. This is not because it is unimportant, but because it is well understood. Experimentalists of all disciplines know what ought to be reproducible in their field, and to what degree, and even the most theoretically minded theoreticians understand that experiments necessarily come with uncertainties. The issues that do show up in the catalogs of non-reproducible results are related to two specific research tools: statistics and computers. Both are recent, and both are routinely used by scientists who do not fully understand them. In other words, their users are consumers of industrial products who lack guidance in their choice of tools and methods.
Side note: I can almost hear some readers complain that statistics are nothing recent, going back to Arab mathematicians who lived 1000 years ago. You are right. What is recent is the widespread use of statistics in science. Before computers, statistical methods had to be applied manually, keeping them simple and the datasets small. The kind of statistical inference whose results turn out to be non-reproducible, e.g. in psychology, would not have been possible without computers.
As an illustration, consider the common use of p-value thresholds for deciding on significance. Anyone who understands the statistical framework to which p-values belong (hypothesis testing) agrees that most uses of such thresholds in the scientific literature make no sense. The fact that they are nevertheless widely used shows that most people who deal with them, as authors or as reviewers, do not understand statistical hypothesis testing sufficiently well. And since the abuse of p-values has been going on for a while, it has by now become a de-facto accepted practice, to the point that the people who do understand its absurdity have a hard time being heard. The same can be said about the abuse of journal impact factors for judging the authors of scientific articles, which is a sign of CVs and publication lists becoming industrial products as well.
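To make the p-value issue concrete, here is a minimal sketch (in Python, using NumPy and SciPy; the specific numbers are arbitrary) of how mechanical use of a p < 0.05 threshold manufactures "significant" findings out of pure noise:

```python
# When many hypotheses are tested, some "significant" results appear by
# chance alone, even though every null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 100

false_positives = 0
for _ in range(n_tests):
    # Two samples drawn from the *same* distribution: no real effect exists.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests crossed p < 0.05 despite no effect")
# Roughly 5 such "discoveries" are expected; a naive reading of the threshold
# would report each of them as a significant finding.
```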
The root cause of computational non-reproducibility is an even better illustration of software becoming an industrial product. I have noticed that many scientists who have never experienced reproducibility issues themselves find it hard to imagine that they can exist. After all, 2 + 2 is 4, today and tomorrow. What happens when two people obtain different results from "the same" computation is that they in fact performed different computations (using different software) without being aware of the difference. Software has become ever more complex over the last decades, but software developers have also made an effort to hide this complexity from users - with great success. Most scientists are surprised to learn that when they run that little script sent by a colleague, they are really using hundreds of software packages written (and frequently modified) by hundreds of people over many years with only loose coordination. It's not only those hundreds of packages that are industrial commodities, but also the assembly of all those pieces, for example a Linux distribution.
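As a small illustration (a sketch using NumPy; any large collection of floating-point numbers would do), here is how two routines that a user would call "the same computation" quietly disagree:

```python
# Floating-point addition is not associative, so two summation algorithms
# that are mathematically equivalent produce slightly different results.
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(size=1_000_000)

python_sum = sum(values)           # naive left-to-right summation
numpy_sum = float(np.sum(values))  # NumPy uses pairwise summation internally

print(python_sum == numpy_sum)     # typically False
print(abs(python_sum - numpy_sum)) # small but nonzero discrepancy
# Neither result is "wrong": they come from two different computations that
# the user believes to be identical. Library updates, compiler flags, and
# hardware differences can shift results in exactly the same way.
```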
What can we do?
We can look to the much better understood industrial production of material goods for inspiration on possible solutions. A complex industrial product, such as a car or a television set, comes with a user manual and perhaps an obligation for user training, such as obtaining a driver's license. Moreover, technical norms impose precautions on producers to make their products safe to use by non-experts. Independent experts evaluate products and publish reports that guide consumers in their choice. These approaches can be adapted to scientific software and statistical methods, but that work remains to be done.
I expect reproducibility to play a major role in this, as a quality label. A reproducible result can still be wrong, but reproducibility nevertheless guarantees the absence of some common kinds of problems. We need additional, complementary quality labels of course, and in fact we have a few, such as the presence of test suites for scientific software, or the existence of provenance metadata for datasets. But this is only the beginning. We do not yet know how to make data and code into industrial products that are safe to use by others, nor do we know how to prepare scientists for working in such an ecosystem. Best practices, even good-enough practices, remain to be established.
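As a sketch of what provenance metadata for a dataset might contain, here is a hypothetical record; the field names are illustrative rather than taken from any established standard (real vocabularies such as W3C PROV are far richer):

```python
# A hypothetical provenance record attached to a published dataset.
# All field names, file names, and version numbers are illustrative.
import json

provenance = {
    "dataset": "measurements.csv",
    "sha256": "<checksum of the published file>",
    "generated_by": "process_raw.py, git commit abc1234",  # hypothetical script
    "inputs": ["run-2021-03-14.dat"],                       # upstream raw data
    "environment": {"python": "3.9.2", "numpy": "1.20.1"},  # software versions
    "date": "2021-03-15",
}

print(json.dumps(provenance, indent=2))
```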
Experts will likely be another ingredient of a solution. I suspect that most statistics-related problems could be solved by requiring that every publication making a claim based on statistical significance be validated by a trained statistician. We will have to figure out how to organize this validation. One possibility is to create independent certification agencies, similar to cascad for computational reproducibility, that employ qualified statisticians and deliver validation certificates that will figure prominently in a paper.
It's not just software and data
As I said above, I have focused on data and code because the computational aspects of science are what I am most familiar with. But industrialization isn't limited to computing. Even the good old journal article is slowly turning into an industrial product. With approaches such as meta-analyses or content mining, scientific papers are being used by people who are not part of the community that their authors belong to, and who may thus lack the tacit knowledge shared by that community, knowledge that might well be necessary to fully appreciate the published results. Interdisciplinary research is also a source of potential misunderstandings due to unshared tacit knowledge.
We can also see industrialization in the management of science. In fact, the term "management" in itself implies some form of industrialization. Unfortunately, management principles from the material goods and service industries are being applied uncritically to scientific research, leading to phenomena such as the abuse of the journal impact factor to measure an individual's productivity, or the allocation of budgets based on multi-year predictions of research outcomes (called "grant proposals") that lack any credibility. This suggests that the people who design these management practices consider science itself a commodity, an industry that can be run just like any other. There is, however, a crucial difference: whereas the production of material goods is by necessity based on well-known technologies and processes (otherwise their deployment at scale would be bound to fail), research is all about the unknown. Scientists can describe directions they want to take, but cannot promise to reach specific goals in the future. Science is intrinsically a bottom-up process, whereas management is about top-down organization.
Open Source and Open Science
Back to software, there is one aspect that deserves further discussion: the role of the FOSS (free/open source software) approach that has been gaining traction in research over the last decade, and that has furthermore inspired much of the Open Science movement. The origin of the FOSS movement can be seen as a rebellion against the industrialization of software, which made it difficult or even impossible for users to adapt it to their needs. The widely shared story of Richard Stallman's fight against a proprietary printer driver (see here for example) is a nice illustration. Initially, the FOSS movement focused on establishing legal means (licenses) to protect software from becoming proprietary. More slowly, and less explicitly, it worked towards a view of software development as something a community does for its own needs, with the ideal that anyone sufficiently motivated should be able to join such a community and participate in the development process. This was a reasonable proposal in the 1980s, when software was simpler and most computer users had by necessity some programming experience.
Today's situation is very different. Most software has the status of an industrial product for most of its users, whether it's FOSS or not. In theory, anyone can learn anything about FOSS and participate in its evolution at all levels. In practice, the effort is prohibitive for most, and nobody today can envisage understanding all the software they depend on, let alone contributing to its development. As I explained above, it has even become close to impossible to just keep track of which software one depends on. From a user's perspective, the development communities of FOSS projects are industrial software producers just like commercial companies. In a way, FOSS users even have less power because the developer communities have no legal or moral obligations toward their users at all. There are a few cases of institutions that permit users to influence and support the development of FOSS, for example the Pharo consortium or the Inria foundation, but they are the exception rather than the rule.
In science, the FOSS ideal of communities producing software for their own use works very well for domain-specific software packages, whose developers are a representative subset of a well-defined scientific community. But infrastructure software that is used across many scientific disciplines will invariably end up being an industrial product for most of its users. This is true for most of the Scientific Python ecosystem, for example, and also for the statistical software universe that has grown around the R language. Note that I am not saying that the FOSS approach has no advantages there. Open source code is very important to ensure the transparency required for making science verifiable. What I am saying is that openness is not enough to ensure that software is a safe-to-use industrial product, nor does it provide a mechanism for keeping a product's evolution in sync with the needs of its user base.
Whereas the FOSS community has largely remained blind to this issue, the Open Science movement seems to be more aware of the pitfalls of "just" being open, at least for data. The I and R (interoperability, reusability) in FAIR are the best evidence for this. For now, they remain ideals whose practically usable implementations have yet to be defined. Perhaps this will lead to a more careful consideration of reusability for software as well. As with the material goods industries, the key is to recognize users and educators as stakeholders and to ensure that their needs are taken into account by producers. Open source communities working on widely used infrastructure software could, for example, adopt a governance model that includes representative non-developing users. Funders of such communities could make such a governance model a condition for funding. But the very first step is creating an awareness of the problem. Development communities should openly state their ambitions. It's OK to develop software for use inside a delimited community, but then don't advertise it as easy to use for everyone. It's also OK to aim high and work on general-purpose infrastructure software, but then explain how users can make themselves heard without having to become contributors themselves. Being "open" is not enough.
Comments retrieved from Disqus
- asmeurer:
Software, like all systems, does not just continue to work so long as you don't break it. It only works because people continuously work to keep it from breaking. Imagine if your city builds a bridge. Some years later, there is a bond election to pay for costs for the bridge. Now consider a voter who votes against the bond, saying, "they already built the bridge, why do they need more money? As long as they don't tear it down, it should continue to work." This is of course ridiculous. Bridges and roads require maintenance, or they will degrade. They do not just have a one time cost. Software is the same way. Even though the bits that make up the source code of software are just as immutable as the atoms of concrete in the bridge, it still requires ongoing maintenance or it will rot, just as the bridge will start to develop potholes, and eventually start to crumble if it is not maintained. The ecosystem of software and hardware that a piece of code runs on and alongside must be considered as part of the system, just as the cars should be considered as part of the system of a bridge.
The other thing to understand is that for open source software, this maintenance is provided almost exclusively by unpaid volunteers. I wonder how much your colleague has given to NumFOCUS, since he expects the software to be supported indefinitely. I would encourage you to show him this https://www.fordfoundation.....
Maintaining Python 2 support means splitting this development effort away from the development of new features, the fixing of bugs, and so on. It also means keeping a large amount of technical debt (I've written about this here https://www.asmeurer.com/bl...). Actually, if you want to continue to use Python 2, you can. What you can't do is expect the volunteers who work on CPython to continue to work on it, or the volunteers who work on libraries to continue to support it in addition to Python 3, or the volunteers who work on Linux distributions to continue to support it. These things would all require ongoing development efforts (see my first paragraph). You are of course free to pay a vendor to continue to provide Python 2 support for you (I'm sure some will pop up if the market demand is there), or attempt to fix any holes in the support yourself.
- Konrad Hinsen:
Hi Aaron,
thanks for your comments!
Before giving my point of view on your first paragraph, let me reply to your second one, which is really the topic of my post. My colleague doesn't expect software to be maintained indefinitely by someone else for free. His expectation of Python being just there forever is nothing but an extrapolation of past experience. He has no idea about how software maintenance works, nor any opinion on how it should work. And even after our coffee break conversation, he probably has no more than a foggy notion of all that. Coffee breaks are way too short, as we all know.
As for your statement that "software only works because people continuously work to keep it from breaking", that's a self-fulfilling prophecy in my opinion. What breaks software package A is a breaking change in its dependency, software package B. If everybody introduces breaking changes all the time, all software will break all the time, and your statement becomes true. In a world where everyone avoids breaking changes, software can work for a very long time without any maintenance. I have 25 year old Fortran programs that still work, as do the shell scripts that coordinate them. Software is as stable as its developers want it to be. What you can't have, given today's state of the art, is stable software *and* rapid improvement in functionality. That's a choice that developers must make. And they should then make a clear public statement about their choice.
- asmeurer:
Right, I don't think there is any malice. For the most part, it is just ignorance of how open source maintenance works. Usually once you explain this to someone, they get it, but by default people don't think about it and they assume that things that just work will continue to work, and don't really consider that they only work because there are people out there who dedicate time or money to making them work.
Even for Fortran there is a maintenance cost. Every Fortran compiler has to support multiple versions of the language, and any compiler that works on a modern machine is necessarily being actively developed, because the architectures of 30 years ago aren't the same as the ones today. So ultimately, "breaking changes" will always happen *somewhere* in the stack, unless you are exclusively using 30-year-old software on 30-year-old hardware. Your colleague can continue to use his Python code by not updating Python from Python 2, except it won't be available on the latest Linux distro. He can avoid updating Linux, except old versions of Linux won't work on newer hardware. He can avoid updating his hardware, except hardware eventually dies.
- Konrad Hinsen:
Yes, Fortran compilers are being maintained. Fortran (in any of its standardized versions) is what I call a stable platform. Compiler developers work on avoiding collapse from below, in order to ensure that programmers in the software stack above needn't worry about it. And they work on improvements that any particular user might care about or not (speed, new versions of the standard, new hardware...).
But saying that Fortran requires maintenance hides enormous differences in degree. I am pretty sure that the first release of GNU Fortran for Linux would still work on a modern Linux, though you may have to install support for 32-bit code first. All of the software stack in the PC world has been very stable. People upgrade because they want new features or other improvements, not because they face software collapse.
An interesting historical side note: all software platforms that go back to the 2000s or earlier are stable. All the ANSI standard languages, but also the JVM or the Linux ecosystem as a whole. Rapidly changing platforms are a recent phenomenon. What happened? One hypothesis: the advertising business, with its extreme short-term focus, became an important driving force for technology.
What really bothers my experimentalist colleague is the risk of Python 2 dropping out of Linux distributions, because that's what makes Python easily accessible. You can't afford not to update Linux these days, for security reasons. Maybe a conservative distribution such as CentOS will keep Python 2 for some years to come.
- Stephen Kell:
Thanks Konrad... another very thought-provoking post. I agree with your basic premise that the FOSS-style culture of "fix it yourself", or more generally of conveniently identifying users with developers, doesn't match today's large-scale patterns of software distribution and co-evolution.
However, there's an elephant in the room: who is right? Why *shouldn't* Python 2 be around forever? Why is there any category difference between Python and (say) awk, or sh?
I lean towards the view that there shouldn't be, that this is another instance of the language-implementer tail wagging the working-programmer dog. (The mess known as "FFIs" is another massive example of this.)
This is cultural... stereotyping wildly, "PL" people often see dictating change to users as their prerogative; "systems" people often don't share this (e.g. witness Linus Torvalds's strong insistence on backwards compatibility).
The analogy with instruction manuals is also problematic. My toaster's instruction manual literally says "Caution: do not insert any objects into the toast slot". More generally, these sorts of things often contain advice that is practically unfollowable, but exists to cover the backside of the manufacturer. It may or may not be legally enforceable, but my point is that this isn't necessarily the right culture to be inspired by. Putting signs and disclaimers everywhere seems like an "ambulance at the bottom of the cliff" solution. It's ducking the big question: how can we structure the "material" of software so that the things people quite reasonably want to do are the things that actually work?
I have some thoughts on that question, but would love to hear yours first. :-)
- Konrad Hinsen:
Hi Stephen,
thanks for your comments!
I have been thinking about that elephant for a while, but I deliberately left it out of this blog post in order to concentrate on what I hope to be more consensual: that mutual miscomprehension between developers and users is a problem we collectively need to work on. But I'll happily come back to the elephant :-)
For me, there is no category difference between Python 2 and sh or awk. They are just on almost opposite ends of a spectrum. As you say, the root of the difference is cultural. In the Unix/"system" approach, there is an ideally small set of infrastructure software that defines the rules of the system and which ought to be a stable basis that nobody perturbs without a very good reason. I'd say that sh belongs to this infrastructure, but awk probably not. Another principle more specifically for Unix is the famous "tools that do one thing but do it well". Small tools are easy to keep stable as well. Integration of these tools for solving a specific problem is someone else's job, meaning that Unix is designed for power users.
Python, on the other hand, started out as a "batteries included" supertool, and needs to evolve constantly in order to remain the top supertool for many tasks. Unlike Unix tools, the different parts of the Python standard library cannot evolve independently at their own rhythm, which encourages an attitude of embracing change and pursuing it as a goal in itself. Compatibility is then not only seen as a waste of effort, but also as a sign of attachment to the past. Moreover, working on a supertool puts developers in a god-like position. They are not creating a humble part of a system, but a system on its own that transcends mere operating systems. And as you suggest, programming languages are probably the extreme case of god-like power.
Another aspect is the size and structure of the communities. Unix is anarchy: everybody does their tool, period. No annual conferences, no governance, no code of conduct, no bureaucracy managing formal enhancement proposals. Python started out the same way, and was very stable in its early years. Today's Python community is big and organized. Such communities require shared beliefs, and "software changes" is one of these beliefs in the Python community. It probably helps that society at large is obsessed by innovation, even without any associated goal of improvement.
Concerning your comment on instruction manuals, I agree that today's legalistic attitude has pushed alerts and warnings beyond the limit of the reasonable. Maybe I should have written "instruction manuals as they were 30 years ago".
Finally, the big question. I doubt there is one general answer to it, so I will stick to what I know best: research in the natural sciences. I have tried to be neutral in my description of the ongoing industrialization, but I consider it mostly a bad development, with few but important exceptions. For software, the main exceptions are well-understood compute-intensive procedures in simulation and data analysis. For everything else, I believe we need more anarchy and more control over software in the hands of each individual scientist. Meaning small understandable building blocks, rather than the monolithic libraries of today's SciPy ecosystem, however convenient that may be for getting a job done quickly.
Am I now entitled to learn about your thoughts on the big question ? ;-)
- Stephen Kell:
Thanks for the thoughtful reply. And sorry for the delay... I thought I had posted this, but I had merely written it.
Firstly, I agree that keeping to consensus-inducing topics is often a good tactic... apologies for charging off a revolutionary direction. :-)
"Batteries included" versus "one thing well" does identify a difference. Then again, Unix itself is in some sense a "batteries included" system and has a community process of sorts in the form of POSIX... its rate of change is tempered by both standardisation (slow) and plurality (many implementations). Perhaps if Python had several widely used implementations, the 2-vs-3 issue would have gone very differently... as you say, culture is a major factor.
To answer your question... I see the whole "supported versions" issue (not just in Python) and the burden of porting software to "keep up", as most immediately a consequence of two big but very concrete problems that our operating systems and programming tools set us up with. Solving them is already currently "possible" but uneconomical.
The first problem is that software packaging has no notion of isolation. Without extra effort, I can't have version X of some library/program installed and also version Y, because they collide/interfere with each other (e.g. they may want to install things at the same path, but also more directly that A links with B, say). The notion of "install" doesn't distinguish "coexistence" (both are available to me) and communication (both intentionally interact, including by presence in a shared namespace). This is a fairly direct consequence of Unix-style linking and sharing of the filesystem namespace. The right redesign of those could solve it, and I believe it needn't be very invasive. Some package managers do attempt something like this, but I've yet to see one that really goes deep enough.
The second problem is that critical fixes are not isolated from general development. In order to get implementation fixes for a given piece of software, you have also to get interface "fixes". For example, if a security bug is discovered that dates back to a library version N, it will probably only be fixed in version N+k. The interface of that version is probably different, so you have to port your code. I've not seen much focus on black-box approaches to security defence ("block, not patch"). Again, I'd argue this can be traced to Unix -- if all you have is opaque byte streams, recognising bad input is a tall order because it must be coded from scratch each time -- but by evolving Unix we can fix it. A memory-safe C will also help here (am working on it!).
Of course the reality of both of these is more complicated than I've made out. But in a world without both of these problems, I think widespread (cultural) expectations around software's "continued workingness" would be very different, because the "support" of large institutions wouldn't be necessary to keep a given codebase running acceptably. (And I did write even more about all this, but I think that rather than rambling away here, I should save the details for a blog post of my own....)
- Konrad Hinsen:
Thanks for taking the time to write and the initiative to actually post this reply :-)
Python actually has multiple implementations. I don't know how widely used Jython, IronPython, and PyPy are these days, but I'd say it doesn't matter. The consensus in the Python community is that CPython is the reference implementation and everyone else has to follow.
The first problem you describe looks like an early case of "convention over configuration". Recent package managers (I am thinking of Nix and Guix) are switching to explicit configuration, which does indeed solve most of the problems of what Windowsians call "DLL hell", except for one: the case where A depends on B and C, with B depending on D v1 and C on D v2. The only attempt I am aware of to solve that problem is Unison (https://www.unisonweb.org/), which refers to dependencies by hash code rather than by name.
Your second problem looks much harder to me because it involves so many different aspects: culture, economics, power relations, etc. I am not convinced that enough people actually want to solve the problem, whose continued existence provides market dominance to some, and employment to others.
So... I am looking forward to your blog post!
- asmeurer:
It's curious that you consider the SciPy ecosystem to be monolithic. It's generally considered to be built out of building blocks. A typical scientific workflow will require several libraries, which work together but are developed separately. If you want to do plots, you will use matplotlib or some other plotting library. If you need basic scientific functions you will use numpy or scipy, and for something more domain specific you will use a domain specific library, and so on. Contrast this to something like MATLAB or Mathematica where there is a single application package that does everything.
- Konrad Hinsen:
The SciPy stack is monolithic from the end user's point of view: you can't pick individual versions of each library and expect them to work together. You can only combine versions from close points in time. The developers' perspective is certainly very different. But the requirement of co-evolution in a context of rapid change in interfaces leads to a similar end result as centrally coordinated development.