Unifying version control and dependency management for reproducible research

2012-04-10 programming, reproducible research, software, source code repositories

When the Greek philosopher Heraclitus pronounced his famous "πάντα ῥεῖ" (everything flows), he most probably was not thinking about software. But it applies to software as much as to other aspects of life: software is in perpetual change, being modified to remove bugs, add features, and adapt it to changing environments. The management of change is now a well-established part of software engineering, with the most emblematic tool being version control. If you are developing software without using version control, stop reading this immediately and learn about Mercurial or Git, the two best version control systems available today. That's way more important than reading the rest of this post.

Software developers use version control to keep track of the evolution of their software, to coordinate team development, and to manage experimental features. But version control is also of interest for software users: it permits them to refer to a specific version of a piece of software they use in a unique and reproducible way, even if that version is not the current one, nor perhaps even an official numbered release. In fact, official numbered releases are becoming a relict of the past. They make little sense in an Open Source universe where everyone has access to source code repositories under version control. In that situation, an official release is nothing but a bookmark pointing to a specific commit number. There is no need for a release number.

Why would you want to refer to a specific version of a piece of software, rather than always use the latest one? There are many reasons. As software evolves, some bugs get fixed but others sneak in. You may prefer the bugs you know to the ones that could surprise you. Sometimes later versions of some software are not fully compatible with their predecessors, be it by design or by mistake. And even if you want to use the very latest version at any time, you might still want to note which version you used for a specific application. In scientific computing, this is one of the fundamental principles of reproducible research: note carefully, and publish, the exact versions of all pieces of software that were used for obtaining any published research result. It's the only way for you and others to be able to understand exactly what happened when you look at your work many years later.

Another undeniable reality of modern software, in particular in the Open Source universe, is that it's modular. Developers use other people's software, especially if it's well written and has the reputation of being reliable, rather than reinventing the wheel. The typical installation instructions of a piece of Open Source software start with a list of dependencies, i.e. packages you have to install before you can install the current one. And of course the packages in the dependency list have their own dependency list. The number of packages to install can be overwhelming. The difficulties of dependency management are so widespread that the term "dependency hell" has been coined to refer to them.

Systems programmers have come up with a solution to that problem as well: dependency management tools, better known as package managers. Such tools keep a database of what is installed and which package depends on which other ones. The well-known Linux distributions are based on such package managers, of which the ones developed by Debian and RedHat are the most popular ones and are now used by other distributions as well. For MacOS X, MacPorts and Fink are the two most popular package managers, and I suspect that the Windows world has its own ones.

One of the major headaches that many computer users face is that version management and dependency management don't cooperate. While most package managers permit to state a minimal version number for a dependency, they don't permit to prescribe a precise version number. There is a good reason for this: the way software installation is managed traditionally on Unix systems makes it impossible to install multiple versions of the same package in parallel. If packages A and B both depend on C, but require different versions of it, there is simply no simple solution. Today's package managers sweep this problem under the rug and pretend that higher version numbers are always as least as good as their predecessors. They will therefore install the higher of the two version numbers required by A and B, forcing one of them to use a version different from its preference.

Anyone who has been using computers intensively for a few years has probably run into such a problem, which manifests itself by some program not working correctly any more after another one, seemingly unrelated, has been installed. Another variant is that an installation fails because some dependency is available in a wrong version. Such problems are part of "dependency hell".

This situation is particularly problematic for the computational scientist who cares about the reproducibility of computed results. At worst, verifying results from 2005 by comparing to results from 2009 can require two completely separate operating system installations running in separate virtual machines. Under such conditions, it is difficult to convince one's colleagues to adopt reproducible research practices.

While I can't propose a ready-to-use solution, I can point out some work that shows that there is hope for the future. One interesting tool is the Nix package manager, which works much like the package managers by Debian or RedHat, but permits installing multiple versions of the same package in parallel, and registers dependencies with precise versions. It could be used as a starting point for managing software for reproducible research, the main advantage being that it should work with all existing software. The next step would be to make each result dataset or figure a separate "package" whose complete dependency list (software and datasets) is managed by Nix with references to precise version numbers. I am currently exploring this approach; watch this space for news about my progress.

For a system even better suited to the needs of reproducible computational science, I refer to my own ActivePapers framework, which combines dependency management and version control for code and data with mechanisms for publishing code+data+documentation packages and re-use code from other publications in a secure way. I have to admit that it has a major drawback as well: it requires all code to run on the Java Virtual Machine (in order to guarantee portability and secure execution), which unfortunately means that most of today's scientific programs cannot be used. Time will tell if scientific computing will adopt some virtual machine in the future that will make such a system feasible in real life. Reproducible research might actually become a strong argument in favour of such a development.