From reproducible to verifiable computer-aided research
The importance of reproducibility in computer-aided research (and elsewhere) is by now widely recognized in the scientific community. Of course, a lot of work remains to be done before reproducibility can be considered the default. Doing computational research reproducibly must become easier, which requires in particular better support in computational tools. Incentives for working and publishing reproducibly must also be improved. But I believe that the Reproducible Research movement has made enough progress that it's worth considering the next step towards doing trustworthy research with the help of computers: verifiable research.
Verifiable research is research that you can verify for yourself. Not in the sense of verifying the scientific conclusions, which often can only be done many years later. The more modest goal is to verify that a publication contains no mistakes of the kind that every human being tends to make: mistakes in manual computations, mistakes in transcribing observations from a lab notebook, etc.
Ideally, all research should be verifiable. A paper is supposed to provide sufficient details about the work that was done to enable competent peers to verify the reasoning and repeat any experiments. Peer review is supposed to certify that a paper is verifiable, and reviewers are even encouraged to do the verification if that is possible with reasonable effort.
In the pre-computing era, much published research was indeed verifiable. Given the high cost of verifying experimental work, it is safe to assume that actual verification was the exception. But theoretical work of any importance was commonly verified by many readers who repeated the (manual) computations.
With the increasing use of computers, papers slowly turned into mere summaries of research work. Providing all the details was simply impossible - software was too complex to be fully described in a journal article. It also became common to use software written by other people, and even commercial software whose detailed workings are secret. This development was nicely summarized by Buckheit and Donoho in 1995 in what became a famous quote in the Reproducible Research movement:
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
Today this statement applies not only to computational science, but to all of computer-aided research, as many experimental and theoretical studies involve computers and software as well. The publication of all software and all input datasets in a form that other scientists can actually process on their own computers has become the main objective for making computer-aided research reproducible.
Unfortunately, having all the software and input data that go with a journal article is still not sufficient to make the work verifiable. With the exception of particularly simple computations, it is practically impossible to figure out what the software really computes, and in particular to verify that it computes what the paper claims it computes. Assuming, of course, that the paper actually does provide a detailed description of its claims, which is often not the case. Much computer-aided research is thus "not even wrong".
It is the complexity of much modern scientific software that makes verification practically impossible, and for that reason software is rarely subjected to peer review. After all, who would accept the Herculean task of verifying the correct functioning of a piece of software? Even "software papers", i.e. papers that merely exist to provide a citable reference for some software, are reviewed without any serious validation of the software itself. At best, reviewers check that best practices of software engineering have been respected, for example by writing a test suite with good code coverage. But no amount of testing can verify that the software computes what it is supposed to compute. If some numerical constant in the source code is off by 10% due to a typo, there's a good chance that nobody will ever notice. Such mistakes have happened (see this article for a few stories), and there are good reasons to believe they are actually frequent (see this article for arguments). The most convincing argument should be our daily experience with computers that crash or ask us to install "critical updates". If systems software is so clearly full of mistakes, is it reasonable to assume that scientific software has none at all?
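To make this concrete, here is a minimal, hypothetical sketch in Python (the constant, function, and tests are invented for illustration, not taken from any real project): a transposed digit in a physical constant slips past a test suite with full coverage, because the tests check properties of the formula rather than the value of the constant.

```python
import math

# Standard gravitational acceleration in m/s^2.
# Transposed digits: this should be 9.81, not 8.91 -- an error of about 9%.
G_EARTH = 8.91

def fall_time(height_m):
    """Time in seconds for an object dropped from rest to fall height_m (no drag)."""
    return math.sqrt(2.0 * height_m / G_EARTH)

def test_fall_time_scales_with_sqrt_of_height():
    # Checks a structural property of the formula, not the constant itself.
    assert math.isclose(fall_time(4.0), 2.0 * fall_time(1.0))

def test_fall_time_is_positive_and_finite():
    assert 0.0 < fall_time(10.0) < math.inf

# Both tests pass with 100% coverage, yet every computed fall time
# is about 5% too long.
```

Tests of this kind verify the code's internal consistency, not its agreement with the science; that is the argument above in miniature.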
The difficulty of verifying computational results, combined with the obvious importance of computational techniques in science, has led to a change of attitude that in my opinion is detrimental to science in the long run. Most importantly, the burden of proof has been shifted from the proponents of a new hypothesis to its opponents. If you cannot show that a computational study is wrong, then it is silently assumed to be correct. If you want to publish results that contradict earlier published work, it's your obligation to explain why, even though you cannot possibly verify the earlier work. This is why protein structures that contradicted the (later retracted) ones from Geoffrey Chang's group were rejected for publication for a long time. Contradictory results should be handled by a critical inspection of all of them, but this is possible only for verifiable research.
Another detrimental change of attitude is that "correct" has been replaced by "community-accepted" as a quality criterion in many fields. Recently, I have started to ask a simple question after seminars on computational work: "Why should I believe your results? What did you do to verify them?" Most often, the answer is "We used software and protocols that are widely applied in our community". Unfortunately, popularity can be taken as an indicator of correctness only if it is safe to assume that many users have actually verified those tools and methods. Which again assumes verifiability as a minimum criterion.
So... what can we do?
Verifiable computer-aided research is a tiny subset of today's published research. It's even a small subset of today's reproducible research. Can we do something about this? I believe we can, and I will summarize some possible approaches.
The most obvious approach to making a computation verifiable is to document all code and data well enough that a competent reader is convinced of its correctness. Literate programming (for algorithms) and computational notebooks (for computations) are good techniques for this. As with any scientific proofreading, verification by inspection requires much care and a critical attitude. For example, people are easily fooled into believing something simply because it is well presented. But the most important obstacle to this approach is the modularity of much of today's scientific software. If you reuse existing libraries - and there are of course good reasons to do so - then you probably won't rewrite them in literate programming style to explain their algorithms to your critical reader. A computation is only as verifiable as its least verifiable ingredient.
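As a rough illustration of what verification by inspection can build on, here is a short, hypothetical notebook-style fragment (Python; the function and the check are invented for this post) in which the code is written to mirror the formula stated in its documentation, so a reader can compare the two line by line:

```python
import numpy as np

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule on n equal subintervals of [a, b]:

        h * (f(x_0)/2 + f(x_1) + ... + f(x_{n-1}) + f(x_n)/2),   h = (b - a) / n

    The code below follows this formula term by term, so it can be
    checked against the stated rule rather than against test output.
    """
    x = np.linspace(a, b, n + 1)   # x_0 = a, ..., x_n = b
    y = f(x)
    h = (b - a) / n
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

# A sanity check the reader can redo by hand: the integral of x over [0, 1]
# is 1/2, and the trapezoidal rule is exact for linear functions.
assert abs(trapezoid(lambda x: x, 0.0, 1.0, 10) - 0.5) < 1e-12
```

This works for small, self-contained algorithms; as noted above, it does not extend to the libraries such code builds on.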
Another way to make computer-aided research verifiable is to make the computations reimplementable. This means that the published journal article, or some supplementary material to that article, contains a precise enough human-readable description of the algorithms that a scientist competent in the field can write a new implementation from scratch, and verify that it produces the same (or close enough) results. This is not a fool-proof approach, of course, and again modularity is a major risk factor. If the computation uses some complex library and the reimplementor chooses to use the same library, then the library code is not verified by the reimplementation. The more the reimplementation differs from the original authors' code, the better it is as a verification aid. This is by the way also a strong argument for diversity in scientific software. In terms of development efficiency, a single community-supported software package per field is great, but for verifiability, it is better to have multiple packages that can do the same job.
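As a toy example of what such a cross-check might look like (Python; both functions and the tolerance are invented for illustration), here are two algorithmically independent implementations of the sample variance compared on the same data:

```python
import math
import random

def variance_two_pass(xs):
    """Textbook two-pass formula: sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_welford(xs):
    """Welford's one-pass update: a different algorithmic route to the same quantity."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)

# Cross-check on synthetic data: agreement within a tolerance chosen for the
# data's scale. A mistake shared by both implementations would not be caught.
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(10_000)]
assert math.isclose(variance_two_pass(data), variance_welford(data), rel_tol=1e-9)
```

The more the two routes differ - different algorithm, different language, different author - the more their agreement means.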
Both approaches I have outlined fail for complex software. A million-line simulation code developed over many years by an entire research group can neither be studied nor reimplemented by a single person wishing to verify it. Even a small team working in close collaboration wouldn't be up to the task. The solution I propose for this situation is to introduce an intermediate layer between the software and the human-readable documents (papers, software documentation) that describe what it computes. A layer that contains all the science but none of the technicalities of the software, such as parallelism, platform dependence, or resource management. The idea is to "factor out" the accidental complexity and retain only the essential complexity, the part that comes from the models and methods the software implements. This idea is very similar to the use of formal specifications in software development. The specification would be verified by human scientists, whereas the conformity of the software to the specification would be checked by automated methods, of which randomized unit testing is probably the most immediately useful one.
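To make "checking conformity to a specification by randomized unit testing" a bit more tangible, here is a small sketch using the hypothesis library for Python, a property-based testing tool (the specification, function, and tolerance are invented for this example). The property is taken from the specification, not from the code under test:

```python
import math
from hypothesis import given, strategies as st

def center_of_mass(positions):
    """Implementation under test; in practice this could wrap a large, opaque code."""
    return sum(positions) / len(positions)

# Specification-level property: for equal masses, translating every position
# by d translates the center of mass by d. The statement mentions no
# implementation detail -- it is the kind of fact a scientist can verify.
@given(st.lists(st.floats(-1e3, 1e3), min_size=1, max_size=100),
       st.floats(-1e3, 1e3))
def test_translation_invariance(positions, shift):
    shifted = [x + shift for x in positions]
    # The tolerance allows for floating-point rounding in the sums.
    assert abs(center_of_mass(shifted) - (center_of_mass(positions) + shift)) < 1e-6
```

Each such property is small, but a specification is essentially a collection of such statements: scientists verify the statements, and a test generator hammers the software against them.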
An intermediate layer that factors out accidental complexity is also of interest for other uses in scientific research. That new layer would be the closest we can get to a digital representation of a model or a method. Rather than using it only in the specification of a single piece of software, we could use it for all kinds of analyses and comparisons, and cite it as the main scientific reference in work based on it, in addition to citing the software as the technical tool for doing the computations. For this reason, I call this layer "digital scientific knowledge" and the languages for expressing it "digital scientific notation". None of this exists today, but many results from computer science can serve as a basis for its development. For the details, see this article.