Recent posts

Reproducibility does not imply reproduction

In discussions about computational reproducibility (or replicability, or repeatability, according to the preference of each author), I often see the argument that reproducing computations may not be worth the investment in terms of human effort and computational resources. I think this argument misses the point of computational reproducibility.

Sustainable software and reproducible research: dealing with software collapse

Two currently much discussed issues in scientific computing are the sustainability of research software and the reproducibility of computer-aided research. I believe that the communities behind these two ideals should work together on taming their common enemy: software collapse. As a starting point, I propose an analysis of how the risk of collapse affects sustainability and reproducibility.

From reproducible to verifiable computer-aided research

The importance of reproducibility in computer-aided research (and elsewhere) is by now widely recognized in the scientific community. Of course, a lot of work remains to be done before reproducibility can be considered the default. Doing computational research reproducibly must become easier, which requires in particular better support in computational tools. Incentives for working and publishing reproducibly must also be improved. But I believe that the Reproducible Research movement has made enough progress that it's worth considering the next step towards doing trustworthy research with the help of computers: verifiable research.

Composition is the root of all evil

Think of all the things you hate about using computers in doing research. Software installation. Getting your colleagues' scripts to work on your machine. System updates that break your computational code. The multitude of file formats and the eternal need for conversion. That great library that's unfortunately written in the wrong language for you. Dependency and provenance tracking. Irreproducible computations. They all have something in common: they are consequences of the difficulty of composing digital information. In the following, I will explain the root causes of these problems. That won't make them go away, but understanding the issues will perhaps help you to deal with them more efficiently, and to avoid them as much as possible in the future.

On HDF5 and the future of data management

Yesterday a blog post by Cyrille Rossant entitled "Moving away from HDF5" caught my eye. My own tendency at the moment is to use HDF5 more and more, so I was interested in why someone else would want to do the opposite. Here is my conclusion after reading his post, plus some ideas about where scientific data management is or should be heading in my opinion.

From facts to narratives

A recurrent theme in computational science (and elsewhere) is the need to combine machine-readable information (which in the following I will call "facts" for simplicity) with a narrative for the benefit of human readers. The most obvious situation is a scientific publication, which is essentially a narrative explaining the context and motivation for a study, the work that was undertaken, the results that were observed, and conclusions drawn from these results. For a scientific study that made use of computation (which is almost all of today's research work), the narrative refers to various computational facts, in particular machine-readable input data, program code, and computed results.

This blog is moving!

Welcome to the last post on this WordPress blog. I have set up a new blog for all my future writing.

The reason for the move is that the user interface at WordPress is changing all the time without ever getting better. I like to write my posts on my own computer using Emacs, rather than typing into a rudimentary editing window on a Web site. This is not completely impossible with WordPress, but more hassle than it's worth.

My new blog is hosted on GitHub and powered by Frog, a static Web site generator that mixes my posts written as plain Markdown files with HTML templates based on the Bootstrap framework to produce the pages you can read. This setup gives me much more control over my blog, while at the same time making it easier for me to publish new posts.

The one feature that will disappear is the possibility to subscribe to my blog in order to be informed about new posts by e-mail. If you have a GitHub account, you can get the same effect by following updates to the repository that contains my blog. But the easiest way to learn about new posts is to follow me on Twitter.

The lifecycle of digital scientific knowledge

Like all information with a complex structure, scientific knowledge evolves over time. New ideas turn into validated models, and are ultimately integrated into a coherent body of knowledge defined by the consensus of a scientific community. In this essay, I explore how this process is affected by the ever increasing use of computers in scientific research. More precisely, I look at "digital scientific knowledge", by which I mean scientific knowledge that is processed using computers. This includes both software and digital datasets. For simplicity, I will concentrate on software, but much of the reasoning applies to datasets as well, if only because the precise meaning of non-trivial datasets is often defined by the software that treats them.

A rant about software deployment in 2015

We all know that software deployment in a research environment can be a pain, but knowing this as a fact is not quite the same as experiencing it in reality. Over the last few days, I spent way more time than I would have imagined on what sounds like a simple task: installing a scientific application written in Python on a Linux machine for use by a group of students in a training session. Here is an outline of the difficulties, in the hope that it will (1) help others who face similar problems and (2) contribute a little bit to improving the situation.

Beyond Jupyter: what's in a notebook?

Yesterday I participated (as a visitor) in the kickoff meeting for OpenDreamKit, where one recurrent topic of discussion was notebooks, both Jupyter and Sage, including the question of whether they could be brought together. This reminded me of a recent blog post by Kirill Pomogajko entitled "Why I don't like Jupyter". And it reminded me of my own long-term project of integrating Jupyter with my ActivePapers system for reproducible research. That's three reasons for writing down my thoughts about notebooks and their role(s) in computational research, so here we go.

One key observation is in Gaël Varoquaux's comment on Kirill's blog post: using Jupyter for doing science creates a lock-in, because all collaborators on a project must agree on using Jupyter. There is no other tool that can be used productively for working with notebooks. It's a case of "wordization": digital content is taken hostage by a tool that defines a storage format for its own convenience without much consideration for other tools, be they competing or complementary. Wordization not only restricts the users' freedom to work with their data, but also creates headaches for the future. A data format defined by a tool can easily become unusable as the tool evolves and introduces incompatibilities, or of course if it disappears. In the case of Jupyter, its developers have always provided upgrade paths for notebooks between versions, but at some point this is bound to create trouble. Bugs are a fact of life, and I don't expect that the version-2-compatibility-feature will get much testing in Jupyter version 23. To make it worse, a Jupyter notebook can depend on third-party code that implements embedded widgets. This is one of the reasons why I don't use Jupyter for my research, although I am a big fan of using it for teaching. The other reason is that I cannot usefully link a notebook to other relevant information, such as code and data dependencies. Jupyter doesn't provide any functionality for this, and such links are hard to implement externally precisely because of wordization.

Wordization is often associated with evil intentions of market dominance, which are regularly attributed to a company like Microsoft. But I believe that the fundamental cause is the obsession with tools over content that has driven the computing industry for many years. The tool aspects of a piece of software, such as its feature list and its user interface, are immediately visible. In contrast, its data model attracts the attention of only a few specialists, if at all. Users feel the consequences of bad (or absent) data model design through the symptoms of wordization, in particular lock-in, but rarely understand where they come from. Interestingly, this problem was also mentioned yesterday at the OpenDreamKit meeting, by Michael Kohlhase who discussed the digital representation of mathematical knowledge and the difficulty of exchanging it between different software tools. I have written earlier about another aspect, the representation of scientific models in computational science, which illustrates the extreme case of tools having absorbed scientific content to the point that their users don't even realize that something is missing.

Back to notebooks. Let's forget about tools for the moment and consider the question of what a notebook actually is, as a digital document. I think that notebooks are trying to be two different things, and that many of the problems we have with them come from this ambiguity. One role of notebooks is the documentation of computational work as a narrative with direct access to the data. This is why people publish notebooks. The other role is as a protocol of interactive explorative work, i.e. the computational scientist's equivalent of a lab notebook. The two roles are not completely unrelated, but they are still significantly different.

To see the difference, look at how experimental scientists worked in the good old days of pencil, paper, and the printing press. As experiments were done, all the relevant information (preparation, results, …) was written down, immediately, with a time stamp, in the lab notebook. Like a bank ledger, a lab notebook is an immutable protocol of what happened. You don't go back and change earlier entries; that would even be considered fraud. You just add information at the end. Of course, the resulting protocol is not a good way to communicate one's findings. Therefore they are distilled and written up in a separate narrative, which surrounds a description of the work and its most important results with a motivating introduction and summarizing conclusions. This is the classic scientific article.

Today's computational notebooks are trying to be both protocol and narrative, and pretend that there is a fluent transition between them. One unfortunate consequence is that computational protocols disappear as they are edited to become narratives. This could be alleviated by keeping notebooks under version control, but I have yet to see good versioning support in any notebook-type tool. But, fundamentally, today's notebook tools don't encourage keeping a protocol. They encourage frequent changes to the code and the results, keeping only the latest version. As editors for narratives, notebook tools are also far from ideal because they encourage interactive execution of small code snippets, making it easy to lose track of what was actually executed and in what order. In Jupyter, the only way to ensure a coherent narrative is to (1) restart the kernel and (2) re-execute all cells. There is not even a single menu entry for this operation. Actually, I wonder how many Jupyter users are aware that they must restart the kernel before re-executing all the cells if they want to ensure reproducibility.
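The hidden-state problem can be illustrated without Jupyter at all. The following is a minimal sketch, assuming that a kernel can be modeled as a plain Python namespace in which cells are executed one after the other; the helper `run_cells` and the example cells are made up for illustration and are not part of any Jupyter API.

```python
def run_cells(cells, namespace=None):
    """Execute code cells in order in one namespace, much like a kernel would."""
    namespace = {} if namespace is None else namespace
    for source in cells:
        exec(source, namespace)
    return namespace


# Interactive session: a cell is executed and later deleted from the notebook.
kernel = {}
run_cells(["scale = 10"], kernel)

# What remains visible in the notebook after editing.
notebook = ["values = [1, 2, 3]", "result = [scale * v for v in values]"]

# Re-executing all cells WITHOUT a restart works, thanks to leftover state.
run_cells(notebook, kernel)
stale_result = kernel["result"]

# Restart (fresh namespace) and re-execute all cells: the notebook fails.
try:
    run_cells(notebook)
    reproducible = True
except NameError:
    reproducible = False
```

In this sketch the notebook appears to work as long as the old namespace is alive, and only the restart reveals that the visible cells do not define `scale`.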

With all that said, here is my current idea of what a notebook should look like at the bit level. A notebook data model should have two distinct entries, one for a protocol and one for a narrative. The protocol entry is a sequence of code cells and results, as they were executed since the start of the computation (for Jupyter, that means the last kernel restart). The narrative is a user-edited sequence of code cells, documentation cells, and results. The actual cell contents could well be shared between the two views: store each cell with a unique ID, and make the protocol and the narrative simple lists of IDs. The representation of code and documentation cells in such a data model is straightforward, though there's a huge potential for bikeshedding in defining the details. The representation of results is much more difficult if you want to support more than plain text output. In the long run, it will be inevitable to define clear data models for every type of display widget, which is a lot of work.
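To make the idea more concrete, here is a minimal sketch of such a data model in Python. It is one possible reading of the description above, not a worked-out format: all class, field, and method names are invented for illustration, and results are reduced to plain text, which sidesteps the hard part mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import uuid


@dataclass(frozen=True)
class Cell:
    """One stored cell; immutable once created."""
    cell_id: str
    kind: str                     # "code" or "doc"
    source: str
    result: Optional[str] = None  # plain-text output only in this sketch


@dataclass
class Notebook:
    cells: Dict[str, Cell] = field(default_factory=dict)  # shared cell store
    protocol: List[str] = field(default_factory=list)     # append-only list of IDs
    narrative: List[str] = field(default_factory=list)    # user-edited list of IDs

    def _store(self, kind: str, source: str, result: Optional[str] = None) -> str:
        cell_id = uuid.uuid4().hex
        self.cells[cell_id] = Cell(cell_id, kind, source, result)
        return cell_id

    def record_execution(self, source: str, result: str) -> str:
        """Log an executed code cell at the end of the protocol."""
        cell_id = self._store("code", source, result)
        self.protocol.append(cell_id)
        return cell_id

    def add_documentation(self, text: str) -> str:
        """Add a documentation cell to the end of the narrative."""
        cell_id = self._store("doc", text)
        self.narrative.append(cell_id)
        return cell_id

    def cite_in_narrative(self, cell_id: str) -> None:
        """Reference an already stored cell from the narrative."""
        if cell_id not in self.cells:
            raise KeyError(cell_id)
        self.narrative.append(cell_id)


nb = Notebook()
first = nb.record_execution("x = 2", "")
second = nb.record_execution("x ** 10", "1024")
nb.add_documentation("The tenth power of two:")
nb.cite_in_narrative(second)
```

The protocol keeps everything that was executed, in order, while the narrative picks only the cell that matters and surrounds it with documentation. Both refer to the same stored cell by its ID, so nothing is duplicated.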

From the tool point of view, the current Jupyter interface could be complemented by a non-editable protocol view. I'd also like to see a single command (menu/keyboard) for the "clean slate" operation: save the current state as a snapshot (or commit it directly to version control), restart the kernel, and re-initialize the protocol to an empty list. But what really matters to me is the data model. Contrary to the current one implemented in Jupyter, the one outlined above could be integrated into workflow management and archiving tools, such as my own ActivePapers. We'd probably see an Emacs mode for working with it as well. Plus pretty-printing tools, analysis tools, etc. We'd see an ecosystem of tools working with notebooks. A Dream of Openness.
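As a rough illustration of the "clean slate" operation, here is a small sketch under strong simplifying assumptions: the kernel is modeled as a Python dictionary, the snapshot is a JSON string kept in memory rather than a file or a version-control commit, and the `Session` class with its methods is hypothetical.

```python
import json


class Session:
    def __init__(self):
        self.namespace = {}   # stands in for the kernel's state
        self.protocol = []    # code executed since the last restart
        self.snapshots = []   # stands in for files or version-control commits

    def execute(self, source):
        """Run a code cell and log it in the protocol."""
        exec(source, self.namespace)
        self.protocol.append(source)

    def clean_slate(self):
        """Snapshot the protocol, 'restart the kernel', start an empty protocol."""
        self.snapshots.append(json.dumps({"protocol": self.protocol}))
        self.namespace = {}
        self.protocol = []


session = Session()
session.execute("a = 1")
session.execute("b = a + 1")
session.clean_slate()
```

After `clean_slate()`, the earlier protocol survives only in the snapshot, and the next executed cell starts a new protocol in a fresh namespace.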
