The lifecycle of digital scientific knowledge

computational science

Like all information with a complex structure, scientific knowledge evolves over time. New ideas turn into validated models, and are ultimately integrated into a coherent body of knowledge defined by the consensus of a scientific community. In this essay, I explore how this process is affected by the ever-increasing use of computers in scientific research. More precisely, I look at "digital scientific knowledge", by which I mean scientific knowledge that is processed using computers. This includes both software and digital datasets. For simplicity, I will concentrate on software, but much of the reasoning applies to datasets as well, if only because the precise meaning of non-trivial datasets is often defined by the software that processes them.

Before looking at the "digital" aspects, I will summarize the traditional lifecycle of scientific knowledge from the "printed page" era. It has been going on for centuries and follows well-established procedures and habits. I will then argue that these procedures should serve as a guideline for the management of digital scientific knowledge as well, and that computing technology for science should be designed to support this lifecycle.

New observations, instruments, models, methods, and ideas are first published in journal articles. Such an article explains the background and motivation for the work, summarizes the state of the art, and then exposes the new elements that the authors wish to contribute to the scientific record. Other scientists from the field read the article and draw conclusions for their own work, which translate into citations of the article in their own publications. After some time, if the original publication creates enough interest, it will become a subject of discussion in its research community, and it will be mentioned in review articles, which place it in the context of other recent work in the field.

Being cited in review articles is typically the last step in the lifecycle of an individual contribution. Its ideas and conclusions are then merged with related ideas and conclusions and reformulated to become part of the state of the art of the field, recorded in reference works, monographs, and textbooks. These works represent a kind of community consensus. New research, in the same or in other domains, builds on such consensus knowledge, often implicitly by assuming that every reader of a journal article is familiar with the contents of reference works, monographs, and textbooks.

The introduction of computers into scientific research has led to many changes in this process. Some of them, such as the transition from paper to computer files as a support medium for scientific articles and reference works, are relatively minor. The most profound change is that an important part of digital scientific knowledge exists only in the form of software. This is true in particular for complex scientific models, for which we have no other convenient form of representation. An example where this situation is very explicit is the Community Earth System Model for climate research, which takes the form of a software package. Most often, the status of computational models is fuzzier. As an example, consider force fields for proteins such as AMBER or CHARMM. People refer to these force fields by citing scientific articles, but these articles contain only outlines of the models. Their only complete recorded expressions are implementations as part of simulation software packages, but unlike for the Community Earth System Model, there is no software package designed to function as a reference implementation defining the model.

The fundamental difference between software and other media for storing scientific knowledge is that software has two sides: a human-facing side and a machine-facing side. As a medium for expressing scientific knowledge, software fulfills the same role as prose or mathematical formulas. But the necessity of specifying a computation so precisely that a machine can execute it imposes severe constraints (software must be expressed using formal languages), and the desire to perform computations efficiently in a world of finite resources adds a set of priorities to software development that often conflict with those attached to its role as a medium for expressing ideas. As an illustration, the source code of a simulation program that has been heavily optimized for parallel execution combines 10% scientific model with 90% resource management and bookkeeping, making the scientific model not only hard to understand but even hard to find in the source code. For a more detailed discussion, see my article in F1000Research.
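To make the 10%/90% contrast a little more concrete, here is a minimal sketch in Python of the "scientific model" part of a pair-potential simulation; the function and its parameters are invented for illustration. In a production code optimized for parallel execution, these few lines would be surrounded by domain decomposition, neighbour lists, inter-process communication, and memory-layout optimizations that make up most of the source.

    import numpy as np

    def lennard_jones_energy(positions, epsilon=1.0, sigma=1.0):
        """Total Lennard-Jones energy of a configuration (no cutoff, no periodic boundaries)."""
        # Pairwise distance vectors and distances between all particles
        deltas = positions[:, None, :] - positions[None, :, :]
        distances = np.sqrt((deltas ** 2).sum(axis=-1))
        # Consider each pair once (upper triangle, diagonal excluded)
        i, j = np.triu_indices(len(positions), k=1)
        sr6 = (sigma / distances[i, j]) ** 6
        return float(np.sum(4.0 * epsilon * (sr6 ** 2 - sr6)))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        print(lennard_jones_energy(rng.uniform(0.0, 5.0, size=(20, 3))))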

Many of the problems that computational science is facing today (reliability, reproducibility, black-box mentality, etc.) can be traced back to insufficient support for the lifecycle of scientific knowledge in today's software development tools. Practically all of these tools (programming languages, compilers, packaging and deployment tools, version control systems, etc.) were developed by and for software development communities outside of scientific research, and as a consequence they do not take into account the specificities of scientific computing. Worse, computational scientists do nothing to improve the situation. The dominant attitude today is "scientists have to adopt best practices from software engineering and acquire the skills required to apply them". What I advocate is a somewhat different point of view: scientists should adapt these practices and the tools that implement them to their specific needs.

To see where the problems are, let's look at the lifecycle of scientific knowledge expressed as software. New models and methods are developed by a mixture of thinking, tinkering, and exploring the consequences. This requires a representation that humans can understand and manipulate easily. Executability by a computer is a requirement, but other machine-related criteria hardly matter at this stage. Once some useful contribution to the field has been identified, it is communicated to the research community, in a form that is easily understandable but also easy to deploy on other people's computers. This step is the equivalent of publishing a scientific paper. Next, other scientists start to play with the new stuff. This includes comparisons with other models and methods, analysis of model properties, application to different scenarios, etc. The conclusions from this work should take a form similar to a review article: a toolkit in which different models and methods are made available for execution, with added annotations about their relative strengths and weaknesses. Finally, a synthesis of different ideas leads to a consensus implementation supported and maintained by a wider community of scientists, both as a basis for their own future work and as an infrastructure tool for other communities. This last step corresponds to reference works, and should be accompanied by tutorials that take the role of textbooks. At this stage, usability and performance become major criteria, whereas it is acceptable that not everyone can easily understand the implementation. Those who do wish to understand the method can go back to the "review paper" stage.

Most of the discussion about scientific software today is focused on the last stage. It's about community-supported software packages, whose sustained development requires significant effort and investment. Most of this effort is required to keep the software useful in a world of rapidly changing computational environments, and to improve its human interfaces. A smaller part is dedicated to implementing new scientific models and methods. This effort has no equally important counterpart in the traditional lifecycle of scientific knowledge, and therefore the people who work on it find it hard to get recognition for their work. It is "not science" by the standards of the generation that occupies most leadership positions in research today. Fortunately, this attitude is starting to change.

This focus on the last stage is perhaps also the reason for the dominant attitude that scientists should simply adopt best practices from software engineering. In fact, the development and maintenance of community software packages implementing consensus models and methods is technically close enough to software development in business and industry that the same tools and procedures can be applied. This is not true, however, for the earlier stages in the lifecycle of digital scientific knowledge. As we will see, they are not well supported by today's software development tools and practices. What's worse is that most computational scientists accept this situation as inevitable.

At the first stage, a scientist's activity is better described by "manipulating and exploring models and methods" than by "software development". Computational models are of course algorithms, and thus software, but this is almost a technical detail. What is more important is a clear view of the hypotheses and approximations that have led to a specific model, and a trace of the scientific validation that has been performed (comparison with experimental data and with other models). Programming languages are not at all a good match for this kind of work, nor are software engineering approaches such as testing. In terms of software technology, a computational model is much closer to a specification than to a piece of software.
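As a rough illustration of what "closer to a specification" could mean in practice, here is a hypothetical sketch in Python that records a model's hypotheses, approximations, and validation trace as explicit data next to its executable definition. The ModelSpec structure and the example entries are invented for illustration, not taken from any existing tool.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class ModelSpec:
        """A computational model treated as a specification, not just as code."""
        name: str
        hypotheses: List[str]        # physical assumptions behind the model
        approximations: List[str]    # simplifications made for tractability
        definition: Callable         # the executable part of the model
        validations: List[str] = field(default_factory=list)  # record of checks performed

    harmonic_bond = ModelSpec(
        name="harmonic bond potential",
        hypotheses=["small displacements around the equilibrium bond length"],
        approximations=["second-order Taylor expansion of the true bond potential"],
        definition=lambda r, k=1.0, r0=1.0: 0.5 * k * (r - r0) ** 2,
        validations=["compared against a Morse potential for r within 10% of r0"],
    )

    print(harmonic_bond.name, harmonic_bond.definition(1.1))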

For the next stage, the evaluation of a new idea in a narrow community of specialists, the technical requirements are somewhere in between those of the two neighboring stages. The manipulation of computational models loses some importance, whereas evaluation and comparison become more relevant. Interoperability matters a lot: even if the authors of two models chose different languages (corresponding to different scientific notations in the traditional scenario), a comparative evaluation should be a straightforward task. With programming languages, it clearly isn't. The technical difficulties of making programs written in different languages talk to each other effectively discourage scientists from even trying. We would need tools such as "notational adapters" and, even more importantly, some low-level conventions for code and data that everybody can agree on and build on. As a guideline for developing such technology, keep the analogy with review articles in mind. What would an executable review article about similar but independently developed computational methods look like? Which authoring tools are available to support such work?
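To make the idea of an executable review a little more tangible, here is a minimal sketch, assuming a deliberately simple common interface: two independently written methods for the same task are wrapped behind that interface and compared on a reference problem. The interface and the example methods are illustrative only, not a proposal for an actual convention.

    import math

    # Two independently developed methods for the same task (one step of an ODE solver)
    def euler_step(f, t, y, h):
        return y + h * f(t, y)

    def rk4_step(f, t, y, h):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

    # The common interface used for the comparison
    def integrate(step, f, y0, t_end, n_steps):
        t, y, h = 0.0, y0, t_end / n_steps
        for _ in range(n_steps):
            y = step(f, t, y, h)
            t += h
        return y

    # Reference problem: dy/dt = -y, y(0) = 1, exact solution exp(-t)
    f = lambda t, y: -y
    exact = math.exp(-1.0)
    for name, step in [("Euler", euler_step), ("RK4", rk4_step)]:
        error = abs(integrate(step, f, 1.0, 1.0, 100) - exact)
        print(f"{name}: error = {error:.2e}")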

Finally, the transition from the first two stages to the last one is not as smooth as it ought to be. Quite often, an implementation written for convenient manipulation by humans must be completely rewritten in order to fit into a collection of optimized subroutines. What we should have is compiler-like tools that translate code from the first two stages into standard programming languages, using annotations added by expert programmers for guidance. The idea is to have a toolchain that (1) guarantees the equivalence of the initial and the optimized versions, and (2) keeps track of additional approximations that were made for performance reasons. Moreover, community-supported optimized software libraries should be usable as infrastructure tools in the next cycle of model and method development, and thus be interoperable with the tools appropriate for the first stage, which do not yet exist.
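A very small part of point (1) can already be approximated today by nothing more than systematic comparison of the two versions. The sketch below, with invented function names, checks that an optimized implementation stays within a stated tolerance of the readable reference version; a real toolchain would of course have to do much more, and ideally prove equivalence rather than sample it.

    import numpy as np

    def mean_reference(values):
        """Readable reference: the definition as one would write it on paper."""
        total = 0.0
        for v in values:
            total += v
        return total / len(values)

    def mean_optimized(values):
        """Optimized variant: vectorized, with possibly different rounding behaviour."""
        return float(np.mean(values))

    def check_equivalence(reference, optimized, cases, tolerance=1e-12):
        """Assert that both implementations agree on all test cases within the tolerance."""
        for case in cases:
            assert abs(reference(case) - optimized(case)) <= tolerance, case

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        check_equivalence(mean_reference, mean_optimized,
                          [rng.normal(size=1000) for _ in range(10)])
        print("reference and optimized versions agree within tolerance")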

Another way to describe this specificity of scientific computing, compared to other application domains, is the absence of a clear borderline between software developers and software users. Most scientists are users of tried and trusted computational methods while working on the development or validation of methods at another level. The only clear separation we have, conceptually, is the one between scientific models and methods on one hand and computing technology (in particular resource management) on the other hand. Unfortunately, that is exactly the separation that current software technology does not allow us to make.
