In my last post, I have discussed the two main types of scientific models: empirical models, also called descriptive models, and explanatory models. I have also emphasized the crucial role of equations and specifications in the formulation of explanatory models. But my description of scientific models in that post left aside a very important aspect: on a more fundamental level, all models are stories.
It is often said that science rests on two pillars, experiment and theory. Which has lead some to propose one or two additional pillars for the computing age: simulation and data analysis. However, the real two pillars of science are observations and models. Observations are the input to science, in the form of numerous but incomplete and imperfect views on reality. Models are the inner state of science. They represent our current understanding of reality, which is necessarily incomplete and imperfect, but understandable and applicable. Simulation and data analysis are tools for interfacing and thus comparing observations and models. They don’t add new pillars, but transforms both of them. In the following, I will look at how computing is transforming scientific models.
Many people are asking for my opinion on the recent impressive success of AlphaFold at CASP14, perhaps incorrectly assuming that I am an expert on protein folding. I have actually never done any research in that field, but it’s close enough to my research interests that I have closely followed the progress that has been made over the years. Rather than reply to everyone individually, here is a public version of my comments. They are based on the limited information on AlphaFold that is available today. I may come back to this post later and expand it.
Computational reproducibility has become a topic of much debate in recent years. Often that debate is fueled by misunderstandings between scientists from different disciplines, each having different needs and priorities. Moreover, the debate is often framed in terms of specific tools and techniques, in spite of the fact that tools and techniques in computing are often short-lived. In the following, I propose to approach the question from the scientists’ point of view rather than from the engineering point of view. My hope is that this point of view will lead to a more constructive discussion, and ultimately to better computational reproducibility.
Over the last years, an interesting metaphor for information and knowledge curation is beginning to take root. It compares knowledge to a landscape in which it identifies in particular two key elements: streams and gardens. The first use of this metaphor that I am aware of is this essay by Mike Caulfield, which I strongly recommend you to read first. In the following, I will apply this metaphor specifically to scientific knowledge and its possible evolution in the digital era.
Dear software engineers,
Many of you were horrified at the sight of the C++ code that Neil Ferguson and his team wrote to simulate the spread of epidemics. I feel with you. The only reason why I am less horrified than you is that I have seen a lot of similar-looking code before. It is in fact quite common in scientific computing, in particular in research projects that have been running for many years. But like you, I don’t have much trust in that code being a faithful and trustworthy implementation of the epidemiological models that it is supposed to implement, and I don’t want to defend bad code in science.
In his 1962 classic “The Architecture of Complexity”, Herbert Simon described the hierarchical structure found in many complex systems, both natural and human-made. But even though complexity is recognized as a major issue in software development today, the architecture described by Simon is not common in software, and in fact seems unsupported by today’s software development and deployment tools.
Malleable systems are software systems that are designed to be modified and extended by their users, eliminating the usually strict borderline between developers and users. Making scientific software more malleable is a goal that I have been pursuing for 25 years, starting with a shift from Fortran to Python as my main programming language, and a simultaneous shift from writing programs to writing toolkits, such as my Molecular Modelling Toolkit first published in 1997. Therefore I was pleased to discover the Malleable Systems Collective, which has just published a post in which I examine what is probably the most successful malleable system in the history of software: Emacs. If you care about users having more influence on their software, check out their site!
One question I have been thinking about in the context of reproducible research is this: Why is all stable software technology old, and all recent technology fragile? Why is it easier to run 40-year-old Fortran code than ten-year-old Python code? A hypothesis that comes to mind immediately is growing code complexity, but I’d expect this to be an amplifier rather than a cause. In this pose, I will look at another candidate: the dominance of Open Source communities in the development of scientific software.