The structure and interpretation of scientific models

computational science

It is often said that science rests on two pillars, experiment and theory. This has led some to propose one or two additional pillars for the computing age: simulation and data analysis. However, the real two pillars of science are observations and models. Observations are the input to science, in the form of numerous but incomplete and imperfect views on reality. Models are the inner state of science. They represent our current understanding of reality, which is necessarily incomplete and imperfect, but understandable and applicable. Simulation and data analysis are tools for interfacing, and thus comparing, observations and models. They don't add new pillars, but they transform both existing ones. In the following, I will look at how computing is transforming scientific models.

Empirical models

The first type of scientific model that people construct when figuring out a new phenomenon is the empirical or descriptive model. Its role is to capture observed regularities, and to separate them from noise, the latter being small deviations from the regular behavior that are, at least provisionally, attributed to imprecisions in the observations, or to perturbations to be left for later study. Whenever you fit a straight line to a set of points, for example, you are constructing an empirical model that captures the linear relation between two observables. Empirical models almost always have parameters that must be fitted to observations. Once the parameters have been fitted, the model can be used to predict future observations, which is a great way to test its generality. Usually, empirical models are constructed from generic building blocks: polynomials and sine waves for constructing mathematical functions, circles, spheres, and triangles for geometric figures, etc.
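To make the line-fitting example concrete, here is a minimal sketch in Python using NumPy: a straight line (a generic building block) is fitted to noisy synthetic observations, and the fitted parameters are then used to predict a value that was never observed. The data and parameter values are invented purely for illustration.

```python
# A minimal sketch of an empirical model: fit a straight line (a generic
# building block) to noisy observations, then use the fitted parameters
# to predict an unobserved value. Synthetic data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)                  # observed inputs
y = 2.5 * x + 1.0 + rng.normal(0.0, 0.5, 20)    # noisy linear observations

slope, intercept = np.polyfit(x, y, deg=1)      # fit the model parameters
prediction = slope * 12.0 + intercept           # predict a point never observed

print(f"fitted slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted y at x=12: {prediction:.2f}")
```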

The use of empirical models goes back a few thousand years. As I have described in an earlier post, the astronomers of antiquity who constructed a model for the observed motion of the Sun and the planets used the same principles that we still use today. Their generic building blocks were circles, combined in the form of epicycles. The very latest variant of empirical models is machine learning models, where the generic building blocks are, for example, artificial neurons. Impressive success stories of machine learning models have led some enthusiasts to proclaim the end of theory, but I hope to be able to convince you in the following that empirical models of any kind are the beginning, not the end, of constructing scientific theories.

The main problem with empirical models is that they are not very powerful. They can predict future observations from past observations, but that's all. In particular, they cannot answer what-if questions, i.e. make predictions for systems that have never been observed in the past. The epicycles of Ptolemy's model describing the motion of celestial bodies cannot answer the question of how the orbit of Mars would be changed by the impact of a huge asteroid, for example. Today's machine learning models are no better. Their latest major success story, as I am writing this, is AlphaFold, which predicts protein structures from their sequences. This is indeed a huge step forward, as it opens the door to completely new ways of studying the folding mechanisms of proteins. It is also likely to become a powerful tool in structural biology, if it is actually made available to biologists. But it is not, as DeepMind's blog post claims, "a solution to a 50-year-old grand challenge in biology". We still do not know what the fundamental mechanisms of protein folding are, nor how they play together for each specific protein structure. And that means that we cannot answer what-if questions such as "How do changes in a protein's environment influence its fold?"

Explanatory models

The really big success stories of science are models of a very different kind. Explanatory models describe the underlying mechanisms that determine the values of observed quantities, rather than extrapolating the quantities themselves. They describe the systems being studied at a more fundamental level, allowing for a wide range of generalizations.

A simple explanatory model is given by the Lotka-Volterra equations, also called predator-prey equations. This is a model for the time evolution of the populations of two species in a predator-prey relation. An example is shown in this plot (Lamiot, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons):

[Plot: predator and prey populations oscillating over time]

An empirical model would capture the oscillations of the two curves and their correlations, for example by describing the populations as superpositions of sine waves. The Lotka-Volterra equations instead describe the interactions between the two populations: predators and prey are born and die, but in addition predators eat prey, which reduces the number of prey in proportion to the number of predators, and contributes to a future increase in the number of predators because they can better feed their young. With that type of description, one can ask what-if questions: What if hunters shoot lots of predators? What if prey are hit by a famine, i.e. a decrease in their own source of food? In fact, the significant deviations from regular periodic change in the above plot suggest that such "outside" events are quite important in practice.
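As a concrete illustration, here is a minimal sketch of the Lotka-Volterra equations integrated numerically with SciPy. The rate constants and initial populations are arbitrary placeholder values, not fitted to the plot above; the point is that a what-if question (here: hunters shooting half the predators) is answered by running the same equations with changed inputs.

```python
# A minimal sketch of the Lotka-Volterra model, integrated with SciPy.
# Rate constants and initial populations are illustrative, not fitted data.
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5   # illustrative rates

def lotka_volterra(t, state):
    prey, predators = state
    d_prey = alpha * prey - beta * prey * predators
    d_predators = delta * prey * predators - gamma * predators
    return [d_prey, d_predators]

t_eval = np.linspace(0.0, 30.0, 600)

# Baseline run.
baseline = solve_ivp(lotka_volterra, (0.0, 30.0), [10.0, 5.0], t_eval=t_eval)

# What-if question: what if hunters shoot half the predators at t=0?
# The same equations answer it; only the initial state changes.
what_if = solve_ivp(lotka_volterra, (0.0, 30.0), [10.0, 2.5], t_eval=t_eval)

print("baseline final state:", baseline.y[:, -1])
print("what-if final state: ", what_if.y[:, -1])
```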

Back to celestial mechanics. The decisive step towards an explanatory model was made by Isaac Newton, after two important preparatory steps: Copernicus put the Sun at the center, removing the need for epicycles, and Kepler described the planets' orbits more accurately as ellipses. Newton's laws of motion and gravitation fully explained these elliptical orbits and improved on them. More importantly, they showed that the fundamental laws of physics are the same on Earth and in space, a fact that may seem obvious to us today but wasn't in the 17th century. Finally, Newton's laws have permitted the elaboration of a rich theory, today called "classical mechanics", that provides several alternative forms of the basic equations (in particular Lagrangian and Hamiltonian mechanics), plus derived principles such as the conservation of energy. As for what-if questions, Newton's laws have made it possible to send artefacts to the Moon and to the other planets of the solar system, something which would have been unimaginable on the basis of Ptolemy's epicycles.

So far I have cited two explanatory models that take the form of differential equations, but that is not a requirement. An example from the digital age is given by agent-based models. There is, however, a formal characteristic that is shared by all explanatory models that I know, and that distinguishes them from empirical models: they take the form of specifications.

Specifications and equations vs. algorithms and functions

Let's look at a simple problem for illustration: sorting a list of numbers (or anything else with a well-defined order). I have a list L, with elements L[i], i=1..N where N is the length of the list L. What I want is a sorted version which I will call sorted(L). The specification for sorted(L) is quite simple:

  1. sorted(L) is a list of length N.
  2. For all elements of L, their multiplicities in L and sorted(L) are the same.
  3. For all i=1..N-1, sorted(L)[i] ≤ sorted(L)[i+1].

Less formally: sorted(L) is a list with the same elements as L, but in the right order.

This specification of sorted(L) is complete in that there is exactly one list that satisfies it. However, it does not provide much help for actually constructing that list. That is what a sorting algorithm provides. There are many known algorithms for sorting, and you can learn about them from Wikipedia, for example. What matters for my point is that (1) given the specification, it is not a trivial task to construct an algorithm, and (2) given a few algorithms, it is not a trivial task to write down a common specification that they satisfy (assuming of course that it exists). And that means that specifications and algorithms provide complementary pieces of knowledge about the problem.
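The distinction can be made concrete in a few lines of Python. In this sketch, the specification is expressed as a predicate that checks a candidate result against the three conditions above, while an algorithm (insertion sort, chosen arbitrarily among many) actually constructs the result.

```python
# A minimal sketch of the specification/algorithm distinction for sorting.
# The specification checks a candidate result; the algorithm constructs it.
from collections import Counter

def satisfies_spec(original, candidate):
    """The three conditions of the specification for sorted(L)."""
    same_length = len(candidate) == len(original)                  # condition 1
    same_multiplicities = Counter(candidate) == Counter(original)  # condition 2
    ordered = all(candidate[i] <= candidate[i + 1]
                  for i in range(len(candidate) - 1))              # condition 3
    return same_length and same_multiplicities and ordered

def insertion_sort(items):
    """One algorithm (of many) that satisfies the specification."""
    result = []
    for item in items:
        i = len(result)
        while i > 0 and result[i - 1] > item:
            i -= 1
        result.insert(i, item)
    return result

L = [3, 1, 4, 1, 5, 9, 2, 6]
assert satisfies_spec(L, insertion_sort(L))
print(insertion_sort(L))
```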

In terms of levels of abstraction, specifications are more abstract than algorithms, which in turn are more abstract than implementations. In the example of sorting, the move from specification to algorithm requires technical details to be filled in, in particular the choice of a sorting algorithm. Moving on from the algorithm to a concrete implementation involves even more technical details: the choice of a programming language, the data structures for the list and its elements, etc.

In the universe of continuous mathematics, the relation between equations (e.g. differential equations) and the functions that satisfy them is exactly the same as the relation between specifications and algorithms in computation. Newton's equations can thus be seen as a specification for the elliptical orbits that Kepler had described a bit earlier. As in the case of sorting, it is not a trivial task to derive Kepler's elliptical orbits from Newton's equations, nor is it a trivial task to write down Newton's equations as the common specification of all the (approximately) elliptical orbits in the solar system. The two views of the problem are complementary, one being closer to the observations, the other providing more insight.
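To illustrate the equations-as-specification view, here is a minimal sketch that integrates Newton's laws numerically for a single planet around a fixed sun, in arbitrary units with GM = 1. The initial conditions are invented for illustration; the trajectory that comes out is a numerical approximation of the ellipse that satisfies the equations.

```python
# A minimal sketch of equations-as-specification: Newton's law of motion
# plus his law of gravitation specify the orbit; numerical integration
# constructs an approximation of the ellipse that satisfies them.
# Arbitrary units (GM = 1), sun fixed at the origin, for illustration only.
import numpy as np

GM = 1.0                                  # gravitational parameter, illustrative

def acceleration(position):
    r = np.linalg.norm(position)
    return -GM * position / r**3          # law of gravitation

def integrate_orbit(position, velocity, dt=1e-3, steps=20000):
    """Velocity-Verlet integration of Newton's law of motion."""
    trajectory = np.empty((steps, 2))
    a = acceleration(position)
    for i in range(steps):
        position = position + velocity * dt + 0.5 * a * dt**2
        a_new = acceleration(position)
        velocity = velocity + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory[i] = position
    return trajectory

orbit = integrate_orbit(np.array([1.0, 0.0]), np.array([0.0, 0.8]))
# The distance from the focus varies periodically: an ellipse, not a circle.
r = np.linalg.norm(orbit, axis=1)
print(f"perihelion ≈ {r.min():.3f}, aphelion ≈ {r.max():.3f}")
```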

One reason why specifications and equations are more powerful is that they are modular. Two specifications combined make up another, more detailed, specification. Two equations make up a system of equations. An example is given by Newton's very general law of motion, which is extended by his law of gravitation to make a model for celestial mechanics. The same law of motion can be combined with different laws defining forces for different situations, for example the motion of an airplane. In contrast, there is no way to deduce anything about airplanes from Kepler's elliptical planetary orbits. Functions and algorithms satisfy complete specifications, and preserve little information about the components from which the complete specification was constructed.
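As a small sketch of this modularity (again in illustrative units, with made-up parameter values), the same integration step for Newton's law of motion can be combined with whatever force law is plugged in, gravitation or something entirely different, whereas a formula for a Kepler ellipse could not be reused in this way.

```python
# A small sketch of modularity: Newton's law of motion is one component,
# the force law is another. The same integration step accepts any force
# function; the force laws and parameters below are purely illustrative.
import numpy as np

def newton_step(position, velocity, force, mass=1.0, dt=1e-3):
    """One explicit Euler step of Newton's second law, m*a = F."""
    acceleration = force(position) / mass
    return position + velocity * dt, velocity + acceleration * dt

def gravity(position, GM=1.0):
    r = np.linalg.norm(position)
    return -GM * position / r**3          # celestial mechanics

def spring(position, k=2.0):
    return -k * position                  # a completely different situation

# The same law of motion, combined with two different force laws.
for force in (gravity, spring):
    x, v = np.array([1.0, 0.0]), np.array([0.0, 0.5])
    for _ in range(1000):
        x, v = newton_step(x, v, force)
    print(force.__name__, "position after 1000 steps:", x)
```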

A challenge for computational science

Computational science initially used computers as a tool for applying structurally simple but laborious computational algorithms. The focus was on efficient implementations of known algorithms, later also on developing efficient algorithms for solving well-understood equations. The steps from specification to algorithm to implementation were done by hand, with little use of computational tools.

That was 60 years ago. Today, we have computational models that are completely unrelated to the mathematical models that go back to the 19th century. And when we do use the foundational mathematical models of physics and chemistry, we combine them with concrete system specifications whose size and complexity require the use of computational tools. And yet, we still focus on implementations and, to a lesser degree, on algorithms, neglecting specifications almost completely. For many routinely used computational tools, the implementation is the only publicly accessible artefact. The algorithms they implement are often undocumented or not referenced, and the specifications from which the algorithms were derived are not written down at all. Given how crucial the specification level of scientific models has been in the past, we can expect to gain a lot by introducing it into computational science as well.

To do so, we first need to develop a new appreciation for scientific models as distinct from the computational tools that implement them. We then need to think about how we can actually introduce specification-based models into the workflows of computational science. This requires designing computational tools that let us move freely between the three levels of specification, algorithm, and implementation. This is in my opinion the main challenge for computational science in the 21st century.

Finally...

Some readers may have recognized that the title of this post is a reference to two books, Structure and Interpretation of Computer Programs (with a nice though unofficial online version) and Structure and Interpretation of Classical Mechanics (also online). The second one is actually somewhat related to the topic of this post: it is a textbook on classical mechanics that uses computational techniques for clarity of exposition. More importantly, both books focus on inducing a deep understanding of their topics, rather than on teaching superficial technical details. This humble blog post cannot pretend to reach that level, of course, but its goal is to spark developments that will culminate in textbooks of the same quality as its two inspirations.
