From facts to narratives


A recurrent theme in computational science (and elsewhere) is the need to combine machine-readable information (which in the following I will call "facts" for simplicity) with a narrative for the benefit of human readers. The most obvious situation is a scientific publication, which is essentially a narrative explaining the context and motivation for a study, the work that was undertaken, the results that were observed, and conclusions drawn from these results. For a scientific study that made use of computation (which is almost all of today's research work), the narrative refers to various computational facts, in particular machine-readable input data, program code, and computed results.

A computational notebook, as pioneered by Mathematica and recently popularized by Jupyter (formerly known as the IPython notebook), is another document that mixes facts and narratives. Compared to a scientific article, program code takes a much more prominent role, and the narrative is focused on the computation. In software development tools, we find the fact-narrative mixture in version control, where the commits are a stream of facts to which the commit messages attach a narrative. At a more basic level, comments in program code can be thought of as narratives embedded into the code. Literate programming inverts this relation by embedding the code into a narrative.

All these situations share a common problem: the tools we have today force us to choose between treating the facts as first-class citizens while accepting a low-quality narrative, or optimizing the narrative while compromising on the quality of fact management. In the following, I will argue that this is due to a poorly thought-out relation between facts and narratives, and outline possible improvements.

Comments in source code are an example where priority is given to the facts, i.e. the program. The reader is supposed to read the code; the comments are there only to provide non-obvious background information, and sometimes to outline an overall structure. Reading commented code takes a lot of time and effort, because the reader has to deal with all the details of the program code. A pure narrative would explain software at a more abstract level, leaving out details or relegating them to an appendix. As an example of the opposite extreme, a scientific article is primarily a narrative, including only small pieces of the facts for illustration. A complete description of the facts would require all of the program code and input data. This is why replicability and reproducibility are currently big issues in computational science.
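To make the asymmetry concrete, here is a small, purely hypothetical Python function: the code is the authoritative statement of the facts, and the comment merely supplies background that the reader could not infer from the code alone.

```python
def minimum_image_distance(x1, x2, box_length):
    """Distance between two particle coordinates in a periodic simulation box."""
    # Background (the narrative part): the simulation box repeats periodically,
    # so the physically relevant distance is the one to the nearest periodic
    # image. The code states only *what* is computed, not why.
    delta = x2 - x1
    delta -= box_length * round(delta / box_length)  # wrap into [-L/2, L/2]
    return abs(delta)

print(minimum_image_distance(0.5, 9.8, 10.0))  # 0.7, not 9.3
```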

Facts and narratives live in two different universes. Facts belong to the computational universe, in which all information is encoded in formal languages with (ideally) well-defined syntax and semantics. Computation processes input data (which includes the program code) and produces output data in a process that is perfectly well-defined and deterministic. A real-life computation depends on a lot of input data, because many details matter. That means a lot of facts, but computers are very good at handling a lot of facts.

Narratives belong to the universe of human thought and communication. They rely on a rich context that human readers are expected to have acquired through prior study. This context contains in particular the appropriate abstractions that allow the narrative to remain at a manageable level of detail, because humans can only keep a limited amount of detail in their heads. To see the importance of this point, imagine a narrative that explains how to "open a door" in terms of the detailed eye movements and muscle contractions required to perform this task - such a narrative would be completely incomprehensible. On the other hand, narratives do not need to be very precise in many respects, because humans excel at "making sense" of information even if it contains mistakes and incongruities.

Computers are good at handling facts but not narratives. Humans are good at handling narratives but not facts in the quantities that typically define a computation. Letting computers intervene in the processing of narratives leads to funny results - try Google Translate on a non-trivial text for an illustration. Letting humans intervene in the execution of a computation is a major source of mistakes. That is why a key ingredient to improving replicability is the automation of all computational steps. In an ideal world, no part of a computation would be defined by a narrative providing instructions for a human operator. Anyone who has ever had to install software knows that we are still far away from that ideal world.

Note that I only said that humans should not intervene in the execution of a computation. They do of course intervene in its definition. Program source code, after all, is written by humans. More generally, humans intervene quite often in computational science by using interactive tools. In that case, the stream of user interactions becomes part of the definition of a computation. If it is recorded, the computation can later be executed again without human intervention. This is of course well known: replicability requires that all user interaction be recorded.
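As an illustration of this point, here is a minimal sketch (my own invention, not a description of any existing tool) of an interactive session that records every command it executes, so that the same computation can be replayed later without human intervention:

```python
import json

class RecordedSession:
    """Evaluate commands interactively while keeping a replayable log of them."""

    def __init__(self, log_file="session_log.json"):
        self.log_file = log_file
        self.commands = []
        self.namespace = {}

    def run(self, command):
        # The interaction becomes part of the definition of the computation:
        # each command is recorded before it is executed.
        self.commands.append(command)
        exec(command, self.namespace)

    def save(self):
        with open(self.log_file, "w") as f:
            json.dump(self.commands, f, indent=2)

    def replay(self):
        # Re-execution without human intervention, from the recorded stream alone.
        namespace = {}
        for command in self.commands:
            exec(command, namespace)
        return namespace

session = RecordedSession()
session.run("x = 2 + 2")
session.run("y = x * 10")
session.save()
print(session.replay()["y"])  # 40
```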

Since facts and narratives live in different universes, we should avoid mixing them carelessly. Crossing the boundary between the two universes should always be explicit. A narrative should not include copies of pieces of facts, but references to locations in a fact universe. And facts should not refer to narratives at all. The relation between the two universes is not symmetric: computers are tools made by humans for their benefit, so the computational universe is subordinate to the human universe.
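One way to make the boundary crossing explicit, sketched here only as an illustration, is to refer to a fact not by pasting a copy of it into the narrative, but by a content-based identifier such as a cryptographic hash:

```python
import hashlib

def fact_reference(path):
    """Return an unambiguous, content-based reference to a data file."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return f"sha256:{digest}"

# Create a small fact for the sake of the example, then reference it.
with open("input_parameters.csv", "w") as f:
    f.write("temperature,300\npressure,1.0\n")

# A narrative would then cite this reference instead of copying the data,
# e.g. "The input parameters (sha256:...) were processed by ..."
print(fact_reference("input_parameters.csv"))
```

The reference points unambiguously into the fact universe, and the narrative never has to carry a possibly outdated copy of the fact itself.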

Now let us look at the examples cited in the beginning from this new point of view. In scientific communication, the separation of facts and narratives was actually well respected initially. The lab notebook recorded facts, and the published paper contained a narrative quoting facts from the lab notebook. No scientist would ever have contemplated writing a paper by modifying the contents of his or her lab notebook! Unfortunately, this basic wisdom was lost with the adoption of computers. Computers make it very easy to modify information, to the point that version control had to be invented to prevent massive information loss by careless editing. Moreover, the distinction between a lab notebook and a paper became blurred once both were files processed using a computer. Finally, computational scientists, coming mostly from a theoretical rather than an experimental background, did not adopt the habit of keeping lab notebooks until very recently.

Today there is a lot of discussion about "electronic lab notebooks", but the fundamental characteristic of a lab notebook being a record of facts is not often mentioned in this context. Very frequently, computational notebooks as implemented by Jupyter or Mathematica are claimed to be lab notebooks for computational science. It is probably clear at this point that I do not agree. Computational notebooks are designed for writing narratives that include computations and their results. They are best considered specialized word processors that encourage refining a document through many iterations of modification involving the code, its results, and the textual elements. The computational side of notebooks is limited to efficient interactive code evaluation. There is no logging of interactions, and no description of the computational infrastructure (libraries, ...) on which the interactive computations rely. As a consequence, computations in a notebook are in general not replicable. I believe this can be fixed, and I have made a concrete proposal for doing so, but unfortunately I do not have the means to actually implement this idea.
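To give an idea of what a description of the computational infrastructure could look like, here is a sketch (an assumption about what would be needed, not a feature of Jupyter or Mathematica) that records the interpreter and the library versions a session actually relies on:

```python
import json
import platform
from importlib import metadata

def describe_environment(packages):
    """Record the computational infrastructure that a session depends on."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: metadata.version(name) for name in packages},
    }

# The package names are placeholders for whatever the notebook imports.
environment = describe_environment(["numpy", "matplotlib"])
with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```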

In version control as it was originally designed, a repository is a fact database that contains sequences of versions of file sets. Commit messages, like comments in a program, are small narratives that provide a high-level overview and often a motivation for each change. The role of a repository is similar to the role of a lab notebook: it is a permanent record of what happened, with narratives written close in time to the recorded events. As commits and commit messages accumulate over time, following along becomes an arduous task for a human reader: the narrative contains too much irrelevant detail. This became a serious practical issue as version control was adopted as a tool for collaboration, with members of a team communicating through commit messages. Git therefore introduced the approach of "rewriting history". The idea is to "clean up" a stream of commits by re-ordering and merging them and by writing new commit messages, with the goal of creating a better narrative. Rewriting history remains a hot topic of debate. Most people realize the utility of cleaning up the narrative, but it also feels wrong to destroy the original historical record in the process. Moreover, there is a clear risk of introducing mistakes when rewriting history. In view of what I said above, the basic mistake is the failure to separate cleanly facts from narratives. The cleaned-up narrative should be separate from the original commented stream of commits and refer to it. In git terminology, rewriting history should create a new branch, and the rebasing operations done in deriving the new branch from the initial one should be recorded. Moreover, the editing tools should ensure that the final file contents are the same in the two branches.
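The last requirement is straightforward to verify with git itself. The following sketch (an illustration of the principle; the branch names are placeholders) checks that a cleaned-up branch ends up with exactly the same file contents as the original stream of commits:

```python
import subprocess

def tree_hash(branch):
    """Hash of the complete file contents at the tip of a branch."""
    result = subprocess.run(
        ["git", "rev-parse", f"{branch}^{{tree}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# 'original' is the raw stream of commits, 'cleaned' the rewritten narrative.
if tree_hash("original") == tree_hash("cleaned"):
    print("The rewritten narrative describes exactly the same facts.")
else:
    print("Warning: rewriting history changed the final file contents.")
```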

I hope that these two examples have illustrated why it is desirable to keep facts and narratives distinct, with well-defined references from narratives to facts. Unfortunately, today's computational technology doesn't help much with reaching this goal when the facts are part of a complex computation. We cannot define such a computation while remaining completely in the computational universe. And we cannot define unambiguous references to arbitrary facts inside a computational universe either. Most of the data formats and tools we use for preparing narratives do not even try to respect the separation of universes. Finally, the formal languages we use to encode computational facts (programming languages, file formats, etc.) are mostly not designed for being embedded into narratives. There's still a lot to do.
