The structure and interpretation of scientific models, part 2

2021-01-08 computational science

In my last post, I discussed the two main types of scientific models: empirical models, also called descriptive models, and explanatory models. I also emphasized the crucial role of equations and specifications in the formulation of explanatory models. But my description of scientific models in that post left aside a very important aspect: on a more fundamental level, all models are stories.

To illustrate my point, I will take up my running example from part 1: celestial mechanics. Newton's model for our solar system is, as I said, composed of several equations, the most famous of which, F = m ⋅ a, many readers will probably remember from a high-school physics class. But that equation means nothing on its own. It just says that there are three quantities, one of which is the product of the other two.

The minimal story required to make sense of this equation provides a definition of the three quantities involved. For acceleration (the a), this may look superficially simple: it's the second derivative of an object's position with respect to time. The concepts of position and time are part of our everyday intuition, so that's the easy part. Velocity is an intuitive everyday concept as well, but its precise relation to position as a time derivative is not. For acceleration, nothing short of calculus will do. In fact, Newton invented calculus along with his physical theory! Defining mass (the m) and force (the F) is not a trivial task either. Both concepts are rooted in our everyday intuition about the world, but their role in Newton's law of motion requires a much more precise understanding. If you have doubts about this, try explaining the difference between mass and weight to someone who doesn't have a scientific education.
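To make the relation between position, velocity, and acceleration concrete, here is a minimal numerical sketch. It is only an illustration: the free-fall trajectory, the constant G, and the function names are assumptions made for this example, and finite differences merely approximate the derivatives that calculus defines exactly.

```python
# Illustrative sketch: velocity and acceleration as time derivatives of position.
# The trajectory and all names below are assumptions made for this example.

G = 9.81  # assumed constant gravitational acceleration, in m/s^2


def position(t):
    """Height (m) at time t (s) of an object dropped from 100 m."""
    return 100.0 - 0.5 * G * t**2


def velocity(x, t, dt=1e-3):
    """Approximate the first time derivative of x at t (central difference)."""
    return (x(t + dt) - x(t - dt)) / (2 * dt)


def acceleration(x, t, dt=1e-3):
    """Approximate the second time derivative of x at t (central difference)."""
    return (x(t + dt) - 2 * x(t) + x(t - dt)) / dt**2


v = velocity(position, 2.0)
a = acceleration(position, 2.0)
print(f"velocity at t = 2 s: {v:.3f} m/s")
print(f"acceleration at t = 2 s: {a:.3f} m/s^2")
```

Note that the code says nothing about what a position or a time is, or why the second derivative deserves a name of its own. That part remains in the story.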

From this big-picture point of view, equations such as F = m ⋅ a are tiny pieces of our scientific models. They are the tips of icebergs whose massive underwater parts are the stories defining the underlying concepts and linking them to our intuition about the world, often through multiple and increasingly abstract layers. We tend to forget about these stories, because once we have understood them well enough, what we actually work with are the equations. But this works only for the well-established models whose stories are now found in textbooks. New research continuously introduces new models, often as small variants or extensions of existing ones. Their stories are told in scientific publications.

Historically, mathematical notation was introduced as a convenient shorthand for use in plain-language stories. The lengthy phrase "force equals mass times acceleration" thus became F = m ⋅ a. The transition to symbolic equations encouraged the development of formal methods in mathematics, starting with algebraic transformations of simple equations. This approach was so successful that equations became the main focus of interest in science. Later, other formal representations were added for the non-numerical aspects of models, graphs being the prime example. The most recent addition to the collection of formal notations for scientific models is software. Today, scientists spend most of their time working with the formalized parts of scientific models, such as equations or algorithms, to the point of neglecting the stories that give them meaning.

What happens when people use the equations of scientific models without a proper understanding of their stories is nicely illustrated by the joke about the physics student who combines Einstein's E = m ⋅ c² with Pythagoras' a² + b² = c² to deduce E = m ⋅ (a² + b²). It works as a joke among physicists because in their community, everybody knows the two inputs and the contexts from which they are taken. For other people, there is nothing funny about this reasoning, and it can even look convincing. Such superficial use of scientific models without understanding their context is actually quite common in today's research: the inappropriate use of statistical inference methods is a major cause of the reproducibility crisis.

Computing technology has played a big role in alienating scientists from their models. Most obviously, computers have made it possible to apply scientific models and methods as black-box tools: in an automated fashion, without understanding them. But the attitudes of the software industry, whose development tools computational science has inherited, have also contributed to this tendency. The focus of the software industry is on professional developers making tools for others that almost magically solve some of their problems. Users then get a manual, or hands-on training, for learning how to use the tool, but the inner workings of the tool are something they shouldn't even have to think about. A good tool is one that minimizes learning requirements. Applied to science, this implies that users shouldn't have to know the stories behind the models. Everyone with a dataset should be able to do statistical inference with a few mouse clicks and get a nice visualization. But without the stories, we can easily draw wrong conclusions from nice graphics.

After a long period of separation of tools and stories, computational notebooks are now bringing some of the stories back. The enthusiastic adoption of notebooks by computational scientists is perhaps the best evidence for the importance of stories in science. But today's notebooks capture only the surface stories of a research project. These are the tips of icebergs again. The typical notebook makes use of a large number of code libraries that are based on non-trivial scientific models, but the reader of the notebook remains completely unaware of them. Ideally, these models, with their stories, should be only a few clicks away.

So what would an electronic representation of scientific models look like, ideally? It would be a collection of cross-referencing stories. In the celestial mechanics example, there's a story about positions, velocities, and accelerations, which refers to a story about time and to a story about derivatives. There is another story that explains mass. The story of Newton's law of motion, which also introduces the concept of force, can then refer to these more fundamental stories. If this description reminds you of Wikipedia, or in fact of any Wiki, you are right. Wikis are also collections of cross-referencing stories. What is missing in Wikis is a machine-readable version of the formalized parts of our models, which, as I explained in part 1, needs to allow at least equations, specifications, and algorithms as its ingredients. Another feature that is missing in today's Wikis, although some people are working on it, is the possibility to integrate computational tools in the form of code snippets. Their role would be to give access to visualizations, simulations, and other exploration tools.
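As a rough sketch of such a collection, the celestial mechanics stories could be represented along the following lines. Everything here is hypothetical: the Story class, its fields, and the prerequisites function are names invented for this illustration, the story texts are placeholders, and the formal parts are reduced to plain strings. It does not describe how any existing Wiki or notation works.

```python
from dataclasses import dataclass, field


@dataclass
class Story:
    """A hypothetical story: plain text, formal parts, and cross-references."""
    title: str
    text: str
    equations: list = field(default_factory=list)  # formal parts, here plain strings
    refers_to: list = field(default_factory=list)  # titles of more fundamental stories


def prerequisites(collection, title):
    """List every story that `title` builds on, most fundamental first."""
    ordered = []
    visited = {title}  # guards against circular references

    def visit(name):
        for ref in collection[name].refers_to:
            if ref not in visited:
                visited.add(ref)
                visit(ref)           # first collect what `ref` itself builds on
                ordered.append(ref)  # then `ref`

    visit(title)
    return ordered


stories = [
    Story("time", "What we mean by time."),
    Story("derivatives", "Rates of change, from calculus."),
    Story("motion", "Positions, velocities, and accelerations.",
          equations=["v = dx/dt", "a = dv/dt"],
          refers_to=["time", "derivatives"]),
    Story("mass", "What mass is, and how it differs from weight."),
    Story("law of motion", "Newton's law of motion, introducing force.",
          equations=["F = m * a"],
          refers_to=["motion", "mass"]),
]
collection = {story.title: story for story in stories}

reading_list = prerequisites(collection, "law of motion")
print(reading_list)
```

A reader who lands on the law of motion can thus be given the list of stories to read first. Making the equations themselves machine-readable, rather than opaque strings as in this sketch, is the harder part.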

My own experiments in this domain are Leibniz, a digital scientific notation for embedding machine-readable formal models into human-readable stories, and the Pharo edition of ActivePapers, which integrates datasets and computational tools into a Wiki-like collection of stories. Both ingredients require more work, and then need to be combined, so a lot remains to be done.