The importance of reproducibility in computer-aided research (and elsewhere) is by now widely recognized in the scientific community. Of course, a lot of work remains to be done before reproducibility can be considered the default. Doing computational research reproducibly must become easier, which requires in particular better support in computational tools. Incentives for working and publishing reproducibly must also be improved. But I believe that the Reproducible Research movement has made enough progress that it’s worth considering the next step towards doing trustworthy research with the help of computers: verifiable research.
Think of all the things you hate about using computers in doing research. Software installation. Getting your colleagues’ scripts to work on your machine. System updates that break your computational code. The multitude of file formats and the eternal need for conversion. That great library that’s unfortunately written in the wrong language for you. Dependency and provenance tracking. Irreproducible computations. They all have something in common: they are consequences of the difficulty of composing digital information. In the following, I will explain the root causes of these problem. That won’t make them go away, but understanding the issues will perhaps help you to deal with them more efficiently, and to avoid them as much as possible in the future.
Yesterday a blog post by Cyrille Rossant entitled “Moving away from HDF5” caught my eye. My own tendency at the moment is to use HDF5 more and more, so I was interested in why someone else would want to do the opposite. Here is my conclusion after reading his post, plus some ideas about where scientific data management is or should be heading in my opinion.
A recurrent theme in computational science (and elsewhere) is the need to combine machine-readable information (which in the following I will call “facts” for simplicity) with a narrative for the benefit of human readers. The most obvious situation is a scientific publication, which is essentially a narrative explaining the context and motivation for a study, the work that was undertaken, the results that were observed, and conclusions drawn from these results. For a scientific study that made use of computation (which is almost all of today’s research work), the narrative refers to various computational facts, in particular machine-readable input data, program code, and computed results.
Like all information with a complex structure, scientific knowledge evolves over time. New ideas turn into validated models, and are ultimately integrated into a coherent body of knowledge defined by the concensus of a scientific community. In this essay, I explore how this process is affected by the ever increasing use of computers in scientific research. More precisely, I look at “digital scientific knowledge”, by which I mean scientific knowledge that is processed using computers. This includes both software and digital datasets. For simplicity, I will concentrate on software, but much of the reasoning applies to datasets as well, if only because the precise meaning of non-trivial datasets is often defined by the software that treats them.
We all know that software deployment in a research environment can be a pain, but knowing this as a fact is not quite the same as experiencing it in reality. Over the last days, I spent way more time that I would have imagined on what sounds like a simple task: installing a scientific application written in Python on a Linux machine for use by a group of students in a training session. Here is an outline of the difficulties, in the hope that it will (1) help others who face similar problems and (2) contributes a little bit to improving the situation.