Regular readers of this blog may have noticed that I am not very happy with today’s state of computational notebooks, such as they were pioneered by Mathematica and popularized by more recent free incarnations such as Jupyter, R markdown, or Emacs/OrgMode. In this post and the accompanying screencast (my first one!), I will explain what I dislike about today’s notebooks, and how I think we can do better.
One of the more interesting things I have been playing with recently is Pharo, a modern descendent of Smalltalk. This is a summary of my first impressions after using it on a small (and unfinished) project, for which it might actually turn out to be very helpful.
There is an important and ubiquitous process in scientific research that scientists never seem to talk about. There isn’t even a word for it, as far as I now, so I’ll introduce my own: I’ll call it knowledge distillation.
In today’s scientific practice, there are two main variants of this process, one for individual research studies and one for managing the collective knowledge of a discipline. I’ll briefly present both of them, before coming to the main point of this post, which is the integration of digital knowledge, and in particular software, into the knowledge distillation process.
Since the dawn of computer programming, software developers have been aware of the rapidly growing complexity of code as its size increases. Keeping in mind all the details in a few hundred lines of code is not trivial, and understanding someone else’s code is even more difficult because many higher-level decisions about algorithms and data structures are not visible unless the authors have carefully documented them and keep those comments up to date.
My most recent paper submission (preprint available) is about improving the verifiability of computer-aided research, and contains many references to the related subject of reproducibility. A reviewer asked the same question about all these references: isn’t this the same as for experiments done with lab equipment? Is software worse? I think the answers are of general interest, so here they are.
A recent article in “The Atlantic” has been the subject of many comments in my Twittersphere. It’s about scientific communication in the age of computer-aided research, which requires communicating computations (i.e. code, data, and results) in addition to the traditional narrative of a paper. The article focuses on computational notebooks, a technology introduced in the late 1980s by Mathematica but which has become accessible to most researchers only since Project Jupyter (formerly known as the IPython notebook) started to offer an open-source implementation supporting a wide range of programming languages. The gist of the article is that today’s practice of publishing science in PDF files is obsolete, and that notebooks are the future.
It is widely recognized by now that software is an important ingredient to modern scientific research. If we want to check that published results are valid, and if we want to build on our colleagues’ published work, we must have access to the software and data that were used in the computations. The latest high-impact statement along these lines is a Nature editorial that argues that with any manuscript submission, authors should also submit the data and the software for review. I am all for that, and I hope that more journals will follow.
Data science is usually considered a very recent invention, made possible by electronic computing and communication technologies. Some consider it the fourth paradigm of science, suggesting that it came after three other paradigms, though the whole idea of distinct paradigms remains controversial. What I want to point out in this post is that the principles of data science are much older than most of today’s practitioners imagine. Let me introduce you to Apollonius, Hipparchus, and Ptolemy, who applied these principles about 2000 years ago.
The plea for stability in the SciPy ecosystem that I posted last week on this blog has generated a lot of feedback, both as comments and in a lengthy Twitter thread. For the benefit of people discovering it late, here is a summary of the main arguments and my reply to them.
Two NumPy-related news items appeared on my Twitter feed yesterday, just a few days after I had accidentally started a somewhat heated debate myself concerning the poor reproducibility of Python-based computer-aided research. The first was the announcement of a plan for dropping support for Python 2. The second was a pointer to a recent presentation by Nathaniel Smith entitled “Inside NumPy” and dealing mainly with the NumPy team’s plans for the near future. Lots of material to think about… and comment on.