Konrad Hinsen's blog

The low-hanging fruit in computational reproducibility

Konrad Hinsen — 2023-11-30

Yesterday I participated in the International workshop “Software, Pillar of Open Science”, organized by the French Committee for Open Science. In the course of the various presentations and discussions (both in public and during coffee breaks), I realized that something has been absent from such events all along: the vast majority of scientists.

What prompted this insight was the juxtaposition of two observations: during the introduction, the importance of software in research ("92% of all researchers say they rely on software"), and during the panel on reproducibility, the difficulties resulting from the complexities of today's software stacks.

Here's a provocative proposition: we can solve computational reproducibility for a big majority of those 92% of researchers by buying them a license for Mathematica.

It's not Open Source, and that's bad for Open Science. I agree. But it does everything that most of those researchers need, it's very easy to install and run, and it's stable. You can run 20-year-old Mathematica code in today's version, and get the same results in the vast majority of cases. No reproducibility issues.

It's worth asking the question how a commercial company can solve a problem that highly qualified academic researchers have been discussing for a decade and continue to declare difficult. My answer to this question is threefold: (1) commercial licenses provide the resources for ensuring the floor of the sustainability doughnut, (2) the contractual producer-client relation provides the information necessary for ensuring the ceiling of the sustainability doughnut, and (3) their audience is very different from the participants at software-for-open-science events.

The last aspect is my key message here. All the activities around software in Open Science are organized by and for people who work in computational science, meaning that computation is their principal tool of scientific inquiry. A large proportion of them have a degree in computer science. On the other hand, most of the 92% of researchers who depend on software do computer-aided research but not computational science. Their main tools are instruments or mathematical theories. They use computers as auxiliary tools, mostly for routine data analysis tasks.

The people who contribute to Open Source projects for scientific software have overall the same profile as the participants of software-for-open-science events. They develop and document their software for this kind of profile as well. The invisible others can and do use this software as well, but it's far too complex for them. It's above the sustainability ceiling. But since the invisible others are invisible to the developers, they have no way to make their needs heard. A commercial company, in contrast, knows all of its clients (they are paying for their license every year), cares about them (they are paying for their license every year), and regularly asks them about their needs and their degree of satisfaction.

I brought up this issue during the panel on sustainability, and discovered that there are others who have been thinking about it, for example panel member Josh Greenberg from the Sloan Foundation (whom I'd also like to thank for an insightful discussion after the event). That's very promising. And here's my proposal for a first step in this direction: let's work on diversity and inclusion in Open Science. Make sure that all of the 92% of software-using researchers are represented.

This blog gets a facelift

Konrad Hinsen — 2023-11-16

Regular visitors to my blog have probably noticed that it looks different now. However, the visual changes are only a side effect of a more profound change: I now use a different static site generator, Coleslaw.

I had wanted for a while to replace Disqus with a less invasive commenting system, and the recent announcement by Disqus that it would insert ads into the comments on my blog was what finally motivated me to invest some time to get this done.

The first task was to find a replacement for Disqus. One of my criteria was to allow commenting from the Fediverse, to remove the need for creating yet another account on yet another site just in order to be able to comment. The other criterion was not to depend on some third party service that might disappear or turn evil one day. In reply to a question on Mastodon, Marcel Stimberg pointed me to a post by Carl Schwan explaining how to use replies to a post-related toot as a channel for commenting. That looked just fine: no need for anyone to set up new accounts, just a one-time investment for updating my blog-generation code.

Next, I explored how to implement this technique in the static site generator I was using before, Frog. It turned out to be more complicated than I expected, because Frog allows only a fixed set of metadata fields on a post. Adding a field is certainly not impossible, but I'd have had to make changes to many places in the code to add parsing code for the new field and then pass its optional value around from function to function until its final destination in HTML rendering.

Before attacking such a major code surgery, I checked out other static site generators on a few-hour train ride, looking for one that supports arbitrary metadata or, better yet, is more hackable than Frog. After all, I might want to make other changes in the future, so having a codebase that I feel comfortable hacking on is likely to be valuable. Given my recently renewed interest in Common Lisp (see this post for the reasons), I quickly settled on Coleslaw as a candidate to take a closer look at.

Coleslaw has a fixed set of metadata fields as well, but that set is defined by the slots of a class. Just add a slot, and you have a new metadata field. Very hackable! Moreover, the codebase is reasonably small, and while it's not a model of clarity, the ability to explore the code in a live programming environment makes it rather easy to get into, contrary to the more static and debug-hostile Racket code of Frog.

So that's why you are now looking at a Coleslaw-generated blog. It's my personal modified fork for now. I may look into factoring out my add-ons as plugins and submit them upstream, but this is absolutely not a high-priority project. Many people have their own fork of Coleslaw with similar personalizations, and that looks just fine. The forks are even very discoverable via GitHub. I'd prefer having discoverability beyond a single forge, but I don't think that's doable today.

Even though the blog looks very different, the contents of the posts have not changed, and the URLs remain identical as well. That took another ten minutes of hacking on Coleslaw. The URLs of the RSS and Atom feeds have also remained the same. I have exported the comments from Disqus and added them as static HTML on the posts. You can no longer add comments on the old posts, but you can at least read the existing ones. As a bonus, I also imported the posts from my very first blog at wordpress.com, because Coleslaw comes with a WordPress importer that makes this a very straightforward operation.

The visual presentation of the pages isn't really to my taste, but I am not sure I'll be able to come up with something significantly better with my current rudimentary knowledge of CSS. I'll leave that for a future facelift session, which may of course never happen.

Following branching conversations on Mastodon

Konrad Hinsen — 2023-11-05

This post is a follow-up to my previous one, Deconstructing the Mastodon client. My topic is a scenario that traditional Mastodon clients handle rather badly, whereas my home-grown solution handles it very well: lengthy and branching conversations.

Such conversations happen all the time on social networks. Someone posts an interesting question or observation, which is commented on by many others. Then comments are added to comments, and soon the replies form a branching tree that grows over a few days, sometimes even weeks. Keeping up to date with such a conversation is not supported by any Mastodon client I know of. Worse, due to the way Mastodon implements federation, some replies may never arrive on your instance.

What I did in the past is put a bookmark on the initial toot, and then check for new replies once per day or so. Once you get to dozens of toots, checking for new ones is already a minor effort. And although I know how to check for replies outside of my own instance, in practice I hardly ever do it because it's too laborious.

A simple script that I run once per day makes this a lot easier. I still mark interesting conversations as bookmarks. But now it's my script that copies the whole tree into a mail folder, skipping toots that are already present. New additions to the tree thus show up as unread mails in my inbox, just like replies in a mailing list. Better yet, my script retrieves the whole tree twice: once from my own instance, and once by retrieving each toot from the instance it was posted to, checking on that instance for replies. Neither approach is sufficient on its own: my instance doesn't see all replies, but the foreign instances from which I retrieve toots won't show me non-public toots.
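The merge-and-skip logic described above is simple enough to sketch in a few lines. The following Python is an illustration only, not the actual script: the `Toot` record, the function names, and the example URLs are all hypothetical, and the sketch assumes that a toot's public URL is the same whichever instance serves it, so that it can be used as a deduplication key. Fetching the two views over the network and writing to the mail folder are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Toot:
    # Hypothetical minimal record; the URL is assumed to be a stable identity.
    url: str
    in_reply_to: Optional[str]
    text: str


def merge_views(local_view, remote_view):
    """Union of two views of one conversation tree, keyed by toot URL."""
    merged = {}
    for toot in list(local_view) + list(remote_view):
        merged.setdefault(toot.url, toot)  # keep the first copy seen
    return merged


def new_toots(merged, already_filed):
    """Toots that are not yet in the mail folder, sorted by URL."""
    return [toot for url, toot in sorted(merged.items()) if url not in already_filed]


# View from my own instance: includes a non-public reply, misses a remote one.
local_view = [
    Toot("https://a.example/1", None, "question"),
    Toot("https://a.example/2", "https://a.example/1", "followers-only reply"),
]
# View collected from the originating instances: public toots only.
remote_view = [
    Toot("https://a.example/1", None, "question"),
    Toot("https://b.example/7", "https://a.example/1", "reply my instance never saw"),
]
already_filed = {"https://a.example/1"}

merged = merge_views(local_view, remote_view)
unread = new_toots(merged, already_filed)
```

Here `unread` contains one toot that only the local view knew about and one that only the remote view knew about, which is why neither retrieval is sufficient on its own.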

None of this is rocket science, but it's a nice illustration of the possibilities that open up once you take control over your personal information environment. I wish this were easier, and thus accessible to more people. But it won't get easier as long as most computer users find it perfectly normal that a small technophile elite defines what everyone else is able to do in their digital lives. So if you are reading this and think "nice, but that's above my level of competence", the very least you should do is express your desire to be able to do such things on your own. On Mastodon, for example.

Deconstructing the Mastodon client

Konrad Hinsen — 2023-10-09

Ever since I joined Twitter in 2011, and then moved to Mastodon in 2022, I have been unhappy with the timeline view proposed by both of these communication platforms as their main interface. Now I have finally done something about it: I wrote my own Mastodon client. Or perhaps rather a non-client, because the concept of "the client" is a big part of what I disliked.

My use of social networks can be broken down into three categories:

1. conversations, mostly public but sometimes private
2. keeping up to date with the work of a small number of people or institutions
3. staying in touch with communities I consider myself a part of, and following topics I find interesting

These are not clearly separated categories. It's often messages from category 2 that start conversations, and occasionally messages from category 3. But most of my daily use of Mastodon consists of

1. participating in ongoing conversations
2. reading the feeds of accounts I care about specifically
3. scanning all the other news feeds sporadically and often superficially, depending on how much time and interest I have at the moment

A timeline view mixing all messages from all accounts I follow is somewhat acceptable for (3), but no good for (1) and (2). Mastodon proposes lists for (2), and notifications to help with (1), but neither mechanism is satisfying for me. Lists in particular suffer from an awkward user interface. Moreover, I do (3) exclusively on mobile devices (on the bus etc.), (1) almost exclusively on the desktop (as I don't like typing on on-screen keyboards), and (2) alternating between multiple devices.

There are, of course, many Mastodon clients, so I tried out a few of them. For a while, I used Fedilab on Android (for me: phone and e-ink tablet) for activity (3), and the default Web client and/or Elk, mainly on the desktop, for (1) and (2). It was a workable setup, but not a satisfying one. In addition to the cumbersome list interface, what I found missing was synchronization between my usage of multiple devices. For (2), I'd need to be able to efficiently access all messages I hadn't seen before, on any of my devices (two mobile, two desktop). As a long-time Emacs user, I also tried mastodon.el, which is nice but, like Emacs, it is desktop only, and thus doesn't help with my multi-device issues.

At some point I realized that what I wanted is not a better Mastodon client, but a better Mastodon workflow. What I care about is a data structure, a stream of toots, that is accessible via an HTTP API. I want to split this stream into several streams according to various criteria. For some substreams, I want to make sure I don't miss any message. For others, I need an interface to scan all messages when I feel like it, or search for specific keywords when I don't have time for scanning everything.

Can I get such interfaces to Mastodon streams without writing my own client? Yes, by repurposing existing software. Small streams of which I don't want to miss anything are much like e-mail (after spam filtering of course!). High-volume streams that I scan or search are much like RSS feeds. There is a lot of good software for managing e-mail and RSS feeds, for all platforms I use and even exotic platforms that I don't use (yet?). There are also good infrastructure tools in this space, in particular for e-mail. isync, for example, takes care of IMAP(S), letting me work with local files (Maildir) and not worry about networks, certificates, and their various modes of failure.

It actually takes surprisingly little software to transform Mastodon streams into e-mail and RSS feeds, if you can resist temptations of overengineering. A toot is a snippet of HTML with optional attachments (images, video, audio). That's also what a MIME message happens to be. A near-perfect match. RSS items are HTML snippets as well. No attachments, but you can include the same preview images that Mastodon clients display with toots. If you can find support libraries for mail, RSS, and the Mastodon API in a programming language that you know well enough, this becomes a very manageable side project.
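To make the toot-to-MIME match concrete, here is a minimal sketch using Python's standard `email` package. It is not the code behind this setup (that is Common Lisp, as described next); the function name, the choice of headers, the `X-Toot-URL` header, and the example values are assumptions made for illustration.

```python
from email.message import EmailMessage


def toot_to_mail(author, html, url, attachments=()):
    """Wrap one toot in a MIME message. All field choices are illustrative."""
    msg = EmailMessage()
    msg["From"] = author
    msg["Subject"] = f"Toot by {author}"
    # Hypothetical custom header that keeps the link back to the toot.
    msg["X-Toot-URL"] = url
    # The toot body is already an HTML snippet, so it becomes a text/html part.
    msg.set_content(html, subtype="html")
    # Each attachment is (filename, raw bytes, MIME main type, MIME subtype).
    for filename, data, maintype, subtype in attachments:
        msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg


msg = toot_to_mail(
    "alice@example.social",
    "<p>Hello from the Fediverse</p>",
    "https://example.social/@alice/1",
    attachments=[("cat.png", b"\x89PNG placeholder bytes", "image", "png")],
)
```

Without attachments the result is a plain `text/html` message; adding the first attachment turns it into a multipart message with the HTML snippet as its first part.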

If your preferences match mine, meaning you are happy to use Common Lisp for such a job, you can use my code as a starting point for your own Mastodon experiments. Its main support libraries are tooter for the Mastodon API, and mel-base for e-mail. RSS is trivial if you have XML support, for which I use plump. My RSS aggregator is Newsblur, which has a reasonable Web interface for the desktop and a very nice Android app. For e-mail, I use K9 on Android, and Emacs on the desktop, but I am pretty sure any other e-mail client would work fine as well. The most time-consuming aspect turned out to be mel-base, a library that's insufficiently documented and not quite up to date, lacking support in particular for subject lines and account names containing Unicode characters.

If you have followed so far, you have probably noticed that my non-client supports nothing but reading toots. Each of my transformed toots ends with a link that opens it in the default Web client, where I can reply, boost, or like. The Web client is also what I use for administrative tasks. Bonus: I add another link to each toot that opens it in the instance of its author, where I have access to the full reply chain, of which my own instance often captures only a subset. A very simple solution to one of Mastodon's unfortunate limitations that are due to federation.

The hopefully generalizable lesson from this project is that it is possible to improve one's personal computing environment with reasonable effort, under the condition of accepting an initial learning curve for some technologies. The important question then is how to identify technologies that are worth learning, which I interpret as technologies that are likely to be useful again for other software personalization efforts. A first draft of a list of criteria:

Choose boring technology. You want well-known, well-documented, and stable infrastructure to build on. No surprises, no tech churn. Your learning effort should be a good investment.
Choose small-scale rather than enterprise-grade technology. Your problems and challenges are very different from Microsoft's. Prefer small software stacks.
Corollary 1: choose carefully who you turn to for advice. Most conference talks, blog posts, StackOverflow discussions, etc. come from software professionals. Better listen to people like yourself (but no, I have no advice on where to find them, nor how to judge their competence).
Corollary 2: consider old technology. Most modern software development tools are designed for software professionals. Tools for small-scale development were common in the 1980s and 1990s, before computers became commodities. Technology from that era that's still supported today may well be your best bet. I am a happy user of Emacs, Smalltalk (more precisely Pharo with Glamorous Toolkit as my preferred user interface), and Common Lisp (more precisely SBCL). Python is from the 1990s as well, but since it was widely adopted by software professionals in the 2000s, its ecosystem suffers too much from tech churn for my taste.
Build on general protocols and file formats rather than specialized ones. Hierarchical filesystems rather than the Dropbox API. E-mail rather than Matrix. HTML, XML, and JSON files rather than JavaScript libraries or Web APIs.
Consider debuggability. Delegate hard-to-debug stuff (e.g. networking, in particular with encryption) to other software. Choose tools that support debuggability. Debugging is a lot easier if you can build your own problem-specific debugging tools, which in turn is best supported by development tools that are extensible and focus on rapid feedback. Smalltalk systems are best in class in this respect, and Glamorous Toolkit even turned this into a design principle, called "Moldable Development".

Unfortunately, there is one more aspect to making good choices that is hard to generalize: you need some expertise in figuring out which problems you can solve yourself with reasonable effort and which are so hard that your efforts are better spent on delegating or circumventing them. Data synchronization is in this second category, but like most people I learned this the hard way (years ago), while trying to do it myself and losing both time and data in the process.

After a few weeks of using my setup, I am fully satisfied with it. I also note that my original ideas about defining my personal algorithmic feeds have evolved substantially with practical experience. Once I have taken care of conversations (they go to e-mail) and the small set of accounts I follow closely (a low-volume RSS feed), I ended up splitting the remaining toots (i.e. most of my timeline) by topics in the crudest imaginable way: substring search. It's not perfect but definitely good enough. There's always room for improvement. My main failure so far is in removing all the cat-related toots from my feeds. That may actually require AI-based image recognition. Some problems are hard!
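For the record, "the crudest imaginable way" can look like the following Python sketch. The topic names, the keyword lists, the first-match rule, and the case-insensitive comparison are assumptions made for this illustration, not a description of the actual code.

```python
def route(text, topics, default="misc"):
    """Return the first topic that has a keyword occurring in the text."""
    lowered = text.lower()
    for topic, keywords in topics.items():
        if any(keyword in lowered for keyword in keywords):
            return topic
    return default


# Hypothetical topics; keywords are lowercase substrings, not whole words.
TOPICS = {
    "lisp": ["common lisp", "sbcl"],
    "reproducibility": ["reproducib", "guix"],
}

examples = [
    "New SBCL release announced",
    "Reproducible builds with Guix",
    "Look at my cat",
]
routed = [route(text, TOPICS) for text in examples]
```

The imperfection is easy to see: a keyword such as "cat" would also match "category", and no keyword at all catches a toot whose only cat is in the attached picture.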

I'd love to hear about similar projects in this space (tell me on Mastodon!). The only one I am aware of is Jon Udell's Steampipe-based client. Steampipe provides an SQL/database view on many Web services, which is perfect for doing non-trivial queries. That's something my own setup doesn't address at all. It's not something I feel a need for right now, but I may well add Jon's client to my toolbox one day.

Welcome to my digital garden!

Konrad Hinsen — 2022-08-31

A few years ago, I discovered Mike Caulfield's The Garden and the Stream: A Technopastoral and understood why I wasn't happy with my blog.

Blogs are streams, timelines of posts. Each post has a timestamp, and is considered "finished". Later changes are technically possible, but culturally limited to corrections. A blog post is considered a published essay, and therefore comes with a date of publication. I am much more interested in gardens, which are collections of essays that are revised and improved over long time periods.

It took me a while to actually set up a digital garden and populate it with some content, but I eventually did it. I won't say much about it because it speaks for itself. It's just one click away: https://science-in-the-digital-era.khinsen.net/

Does this mean the end of this blog? No, but posts will become even rarer. A blog is still the best place to make announcements, or to comment on events. But I am a researcher, not a journalist. The fundamental job of a researcher is to curate and extend knowledge collections. That's what I will do from now on in my own little garden.

The dependency hubs in Open Source software

Konrad Hinsen — 2021-06-10

A few days ago, Google announced its experimental project Open Source Insights, which permits the exploration of the dependency graph of Open Source software. My first look at it ended with a disappointment: in its initial stage, the site considers only the package universes of Java, JavaScript, Go, and Rust. That excludes most of the software I know and use, which tends to be written mainly in C, C++, Fortran, and Python. But I do have a package manager that has all the dependency information for most of the software that I care about: Guix. So I set out to do my own exploration of the Guix dependency graph, with a particular focus: identifying the hubs of the Open Source dependency network.

This was also a good opportunity to test the practical utility of a new GUI for Guix that I have been working on recently as a side project. In fact, I added this dependency hub analysis to that GUI, so now you can access it with a simple click.

Software being the complex beast that it is, I have to start by properly defining the subjects of my inquiry. What exactly do I mean by "package", "dependency", and "dependency hub"?

The term package is widely used to describe a unit of development and distribution in software systems, but every package manager has a slightly different notion of what a package actually is. A package could be "Python", or "Python 3.8.2", or "Python 3.8.2 built with gcc 7.5, version X of dependency Y, ...". Guix adopts the last, most fine-grained, definition. This is a good choice when you want to do reproducible software builds, but it is not very useful for analyzing dependency graphs. So I chose the level of name + version number, meaning that I consider "Python 3.8.2" a different package from "Python 3.8.1". That's of course debatable as well. But in Guix, it is rare to have multiple versions of a piece of software coexist at the same time. When it does happen, there is a good reason, typically a significant evolution in the software that makes different dependents prefer different versions. An example is Python 2 vs. Python 3, or the different major versions of gcc. In those cases, looking at their dependencies and dependents separately does make sense.

The term dependency is also widely used with different meanings. The two most common ones are runtime dependency and build dependency. A runtime dependency of package X is a package that must be installed on the computer to use package X. In contrast, a build dependency is a package that is required in order to build package X, where building means anything required to turn source code into something executable. Think of it as a generalization of compiling. Usually the build dependencies are roughly a superset of the runtime dependencies: there are packages you need to build package X, e.g. a compiler, but which are then no longer required for using package X. It's the build dependencies that matter for the evolution of software systems, so that's the definition I used in my analysis.

Unfortunately, the complexity of defining dependencies doesn't end there. Many packages have optional dependencies. When they are available, some additional functionality is enabled. Do you count them or not? My pragmatic take is that I trust the Guix developers to have made good choices. So for me, a dependency is whatever it takes to build a package in Guix.

This leaves the notion of a dependency hub to be defined. In network science, a hub is a node that has an exceptionally high number of connections to other nodes, such that a large share of the information propagating through the network passes through the hubs. A software dependency graph differs from most networks in that its edges have a direction: A depending on B is not the same as B depending on A. This leads to several a priori reasonable definitions for hubs: 1. packages that have many dependencies, 2. packages that have many dependents, and 3. packages for which the sum of dependencies plus dependents is high. Let's immediately eliminate the last definition, as I see no interest in it. Definition 1 identifies the packages that are particularly vulnerable to software collapse, definition 2 the packages that can most easily cause software collapse.

The latter characteristic corresponds best to the capture of information flow as the defining feature of network hubs, and it also happens to be what I am most interested in. The information that flows in the network is requests for change. Nodes receive such requests from dependents, who are in fact the software's clients or users. They typically ask for improved or extended functionality. Nodes also receive requests from dependencies, when they implement changes that break backward compatibility and then ask their dependents to adapt to these changes. The nodes that potentially receive and send many requests for change are thus the nodes who have the most dependents. They are the hubs in the dependency network. Note, however, that the asymmetry in the dependency relation still matters. Nodes can ignore requests for change coming from their dependents, but they cannot ignore requests coming from their dependencies. It's called "dependency" for a reason!
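Definition 2 is straightforward to express in code. The Python sketch below is not the analysis itself (that lives in the Guix GUI mentioned above). It assumes that indirect dependents count as well as direct ones, and it runs on an invented toy graph whose package names and edges are purely illustrative.

```python
from collections import defaultdict


def dependents_count(dependencies):
    """Map each package to the number of packages that depend on it,
    directly or indirectly. Input: package -> list of direct dependencies."""
    # Invert the edges: package -> packages listing it as a direct dependency.
    direct_dependents = defaultdict(set)
    for package, deps in dependencies.items():
        for dep in deps:
            direct_dependents[dep].add(package)

    counts = {}
    for package in set(dependencies) | set(direct_dependents):
        seen = set()
        stack = [package]
        while stack:  # depth-first walk along the inverted edges
            for dependent in direct_dependents.get(stack.pop(), ()):
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        seen.discard(package)  # a package is not its own dependent
        counts[package] = len(seen)
    return counts


# Invented toy graph; packages are identified by name + version.
toy = {
    "app 1.0": ["parser 2.1", "interpreter 3.8"],
    "parser 2.1": ["compression 1.2"],
    "interpreter 3.8": ["compression 1.2"],
    "compression 1.2": [],
}
counts = dependents_count(toy)
# Hubs first: decreasing number of dependents, ties broken by name.
hubs = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

In this toy graph the low-level compression package ends up at the top of the list, while the application that everything else exists to support has no dependents at all.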

At this point, I can take a break from theory and show you the results of my analysis. The top twenty hubs in the Guix dependency graph are:

Package	Number of dependents
perl 5.30.2	7964
pkg-config 0.29.2	7938
zlib 1.2.11	7414
ncurses 6.2	7337
libffi 3.3	6687
xz 5.2.4	6535
readline 8.0	6503
libxml2 2.9.10	6302
expat 2.2.9	6170
libunistring 0.9.10	6150
bzip2 1.0.8	6070
tzdata 2019c	6068
python 3.8.2	6061
bash 5.0	6042
gettext 0.20.1	5768
m4 1.4.18	5621
libgpg-error 1.37	5518
libgcrypt 1.8.5	5514
libxslt 1.1.34	5479
gmp 6.2.0	5363

If you want more, here is the full list as a JSON file, sorted by decreasing number of dependents.

If you have thought a bit about what to expect before looking at this table, you have probably included programming languages such as perl or python in this list. But perhaps you did not expect to see utilities such as pkg-config or bzip2. Remember these are build dependencies. The very first step in building a package, any package, is unpacking its source code. Many of the packages in my top-twenty list represent boring but essential infrastructure software. The software equivalent of the power grid and the road network: stuff that everybody just takes for granted. Such packages rarely get into the news, except when something goes seriously wrong, as in the case of the Heartbleed bug affecting OpenSSL. Which, by the way, is at position 634 in my list. It would be much higher up in a network defined by different criteria, of course. There's more to software than build dependencies.

One motivation for writing this post was to point out a common fallacy in reasoning about Open Source software. A popular argument is that Open Source gives you the freedom to change software to fit your needs, by creating and maintaining your own fork. Or paying someone else to do it for you, if you are not an accomplished hacker yourself. The source code is there for anyone to grab, after all, and the license allows modification and redistribution.

This argument was valid in the 1980s. There were few packages, few dependencies, and a much higher percentage of computer users had programming experience. Today, you can perhaps maintain your own fork of Perl, but you cannot fork its hub position in the network, nor can you reasonably maintain forks of its 7964 dependents. If the Perl maintainers introduce a breaking change, those 7964 dependents will either adapt or disappear. Hypothetically, a large number of them could together envisage maintaining their own fork. But there are no good coordination mechanisms among developers of unrelated Open Source projects, and therefore this doesn't happen in practice.

In an earlier post, I wrote about community-owned monopolies in the Open Source universe, arguing that for software users, there is no practical difference between Microsoft killing Windows 7 and the Python community killing Python 2, even though the former is proprietary and commercial, whereas the latter is Open Source. The reason is that both pieces of software are hubs in dependency networks. Microsoft and the Python developer community are two very different institutions, with very different goals, values, policies, legal status, etc. But that hardly matters for the average software user, whose work depends on a complex web of interacting pieces of software. At the level of that web, it's the information flow patterns that determine evolution. Requests for change, or non-change. Average software users have practically no way to make their needs heard by the people who manage the hubs. Even the best-intentioned altruistic Open Source hub maintainer cannot possibly keep every user's interests in mind, because there is no way to even be aware of them. A web of software is a very different beast from a single project. More is different.

In the almost 40 years since the beginnings of the Open Source movement, the mode of governance of Open Source projects has evolved significantly. Most importantly, all the people involved have realized that governance matters and must be consciously organized, rather than evolve through cumulative random accidents of history, which almost inevitably leads to a tyranny of structurelessness in the long run. Now we must develop an awareness of similar issues at the level of the web of Open Source projects, followed by the development and implementation of better information flow and decision structures.

I will conclude this post with a technical remark. I did my dependency hub analysis using a relatively new tool in the software world, called the Glamorous Toolkit, to which I added an interface to Guix. This toolbox significantly lowers the cost of developing new tools. In the screenshot below, you see on the left the user interface of my analysis. It's an additional view on the Guix package catalog, complementing various other views that are already in place. On the right, you see the complete code for this analysis, including the user interface (which also gives access to the list of dependents, not just the number). In contrast to traditional scripts, there is no overhead for reading data or writing out the results. My code works on data structures that are already in place. What is not obvious from the screenshot is that you get the right-hand panel via alt-click from the left-hand one, meaning that users of my little analysis tool always have direct access to the code. It isn't obvious either that modifying the code on the right will immediately update the view on the left, making development highly interactive. If you think notebooks are great, try Glamorous Toolkit. But be warned that you might then realize that notebooks are no longer the state of the art.

The structure and interpretation of scientific models, part 2

Konrad Hinsen — 2021-01-08

In my last post, I discussed the two main types of scientific models: empirical models, also called descriptive models, and explanatory models. I also emphasized the crucial role of equations and specifications in the formulation of explanatory models. But my description of scientific models in that post left aside a very important aspect: on a more fundamental level, all models are stories.

To illustrate my point, I will take up my running example from part 1: celestial mechanics. Newton's model for our solar system is, as I said, composed of several equations, the most famous of which, F = m ⋅ a, many readers will probably remember from a high-school physics class. But that equation means nothing on its own. It just says that there are three quantities, one of which is the product of the other two.

The minimal story required to make sense of this equation provides a definition of the three quantities involved. For acceleration (the a), this may look superficially simple: it's the second derivative of an object's position in time. The concepts of position and time are part of our everyday intuition, so that's the easy part. Velocity is an intuitive everyday concept as well, but its precise relation to position as a time derivative is not. For acceleration, nothing short of calculus will do. In fact, Newton invented calculus along with his physical theory! Defining mass (the m) and force (the F) is not a trivial task either. Both concepts are rooted in our everyday intuition about the world, but their role in Newton's law of motion requires a much more precise understanding. If you have doubts about this, try explaining the difference between mass and weight to someone who doesn't have a scientific education.

From this big-picture point of view, equations such as F = m ⋅ a are tiny pieces of our scientific models. They are the tips of icebergs whose massive underwater parts are the stories defining the underlying concepts and linking them to our intuition about the world, often through multiple and increasingly abstract layers. We tend to forget about these stories, because once we have understood them well enough, what we actually work with are the equations. But this works only for the well-established models whose stories are now found in textbooks. New research continuously introduces new models, often as small variants or extensions of existing ones. Their stories are told in scientific publications.

Historically, mathematical notation was introduced as a convenient shorthand for use in plain-language stories. The lengthy phrase "force equals mass times acceleration" thus became F = m ⋅ a. The transition to symbolic equations encouraged the development of formal methods in mathematics, starting with algebraic transformations of simple equations. This approach was so successful that equations became the main focus of interest in science. Later, other formal representations were added for the non-numerical aspects of models, graphs being the prime example. The most recent addition to the collection of formal notations for scientific models is software. Today, scientists spend most of their time working with the formalized parts of scientific models, such as equations or algorithms, to the point of neglecting the stories that give them meaning.

What happens when people use the equations of scientific models without a proper understanding of their stories is nicely illustrated by the joke about the physics student who combines Einstein's E = m ⋅ c² with Pythagoras' a² + b² = c² to deduce E = m ⋅ (a² + b²). It works as a joke among physicists because in their community, everybody knows the two inputs and the contexts from which they are taken. For other people, there is nothing funny about this reasoning, and it can even look convincing. Such superficial use of scientific models without understanding their context is actually quite common in today's research: the inappropriate use of statistical inference methods is a major cause of the reproducibility crisis.

Computing technology has played a big role in alienating scientists from their models. Most obviously, computers have made it possible to apply scientific models and methods as black-box tools: in an automated fashion, without understanding them. But the attitudes of the software industry, whose development tools computational science has inherited, have also contributed to this tendency. The focus of the software industry is on professional developers making tools for others that almost magically solve some of their problems. Users then get a manual, or hands-on training, for learning how to use the tool, but the inner workings of the tool are something they shouldn't even have to think about. A good tool is one that minimizes learning requirements. Applied to science, this implies that users shouldn't have to know the stories behind the models. Everyone with a dataset should be able to do statistical inference with a few mouse clicks and get a nice visualization. But without the stories, we can easily draw wrong conclusions from nice graphics.

After a long period of separation of tools and stories, computational notebooks are now bringing some of the stories back. The enthusiastic adoption of notebooks by computational scientists is perhaps the best evidence for the importance of stories in science. But today's notebooks capture only the surface stories of a research project. It's tips of icebergs again. The typical notebook makes use of a large number of code libraries that are based on non-trivial scientific models, but the reader of the notebook remains completely unaware of them. Ideally, these models, with their stories, should be only a few clicks away.

So what would an electronic representation of scientific models look like, ideally? It's a collection of cross-referencing stories. In the celestial mechanics example, there's a story about positions, velocities, and accelerations, which refers to a story about time and to a story about derivatives. There is another story that explains mass. The story of Newton's law of motion, which also introduces the concept of force, can then refer to these more fundamental stories. If this description reminds you of Wikipedia, or in fact of any Wiki, you are right. Wikis are also collections of cross-referencing stories. What is missing in Wikis is a machine-readable version of the formalized parts of our models. Which, as I explained in part 1, needs to allow at least equations, specifications, and algorithms for its ingredients. Another feature that is missing in today's Wikis, although some people are working on it, is the possibility to integrate computational tools in the form of code snippets. Their role would be to give access to visualizations, simulations, and other exploration tools.

My own experiments in this domain are Leibniz, a digital scientific notation for embedding machine-readable formal models into human-readable stories, and the Pharo edition of ActivePapers, which integrates datasets and computational tools into a Wiki-like collection of stories. Both ingredients require more work, and then need to be combined. There remains a lot of work to do.

The structure and interpretation of scientific models

Konrad Hinsen — 2020-12-10

It is often said that science rests on two pillars, experiment and theory. This has led some to propose one or two additional pillars for the computing age: simulation and data analysis. However, the real two pillars of science are observations and models. Observations are the input to science, in the form of numerous but incomplete and imperfect views on reality. Models are the inner state of science. They represent our current understanding of reality, which is necessarily incomplete and imperfect, but understandable and applicable. Simulation and data analysis are tools for interfacing and thus comparing observations and models. They don't add new pillars, but transform both of them. In the following, I will look at how computing is transforming scientific models.

Empirical models

The first type of scientific model that people construct when figuring out a new phenomenon is the empirical or descriptive model. Its role is to capture observed regularities, and to separate them from noise, the latter being small deviations from the regular behavior that are, at least provisionally, attributed to imprecisions in the observations, or to perturbations to be left for later study. Whenever you fit a straight line to a set of points, for example, you are constructing an empirical model that captures the linear relation between two observables. Empirical models almost always have parameters that must be fitted to observations. Once the parameters have been fitted, the model can be used to predict future observations, which is a great way to test its generality. Usually, empirical models are constructed from generic building blocks: polynomials and sine waves for constructing mathematical functions, circles, spheres, and triangles for geometric figures, etc.
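As a minimal illustration of the straight-line example, here is a Python sketch of an empirical model with two fitted parameters. The observations are made up for the purpose of the example, and the names `fit_line` and `predict` are hypothetical helpers, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of the empirical model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted parameters to predict a future observation."""
    slope, intercept = model
    return slope * x + intercept


# Made-up observations: a linear trend plus small deviations ("noise")
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

model = fit_line(xs, ys)
prediction = predict(model, 5.0)  # to be compared with a new observation
```

The model captures the regularity (the line) and leaves the deviations aside as noise. Comparing `prediction` with an actual observation at x = 5 would be a test of its generality.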

The use of empirical models goes back a few thousand years. As I have described in an earlier post, the astronomers of antiquity who constructed a model for the observed motion of the Sun and the planets used the same principles that we still use today. Their generic building blocks were circles, combined in the form of epicycles. The very latest variant of empirical models is machine learning models, where the generic building blocks are, for example, artificial neurons. Impressive success stories of machine learning models have led some enthusiasts to proclaim the end of theory, but I hope to be able to convince you in the following that empirical models of any kind are the beginning, not the end, of constructing scientific theories.

The main problem with empirical models is that they are not that powerful. They can predict future observations from past observations, but that's all. In particular, they cannot answer what-if questions, i.e. make predictions for systems that have never been observed in the past. The epicycles of Ptolemy's model describing the motion of celestial bodies cannot answer the question of how the orbit of Mars would be changed by the impact of a huge asteroid, for example. Today's machine learning models are no better. Their latest major success story as I am writing this is AlphaFold predicting protein structures from their sequences. This is indeed a huge step forward, as it opens the door to completely new ways of studying the folding mechanisms of proteins. It is also likely to become a powerful tool in structural biology, if it is actually made available to biologists. But it is not, as DeepMind's blog post claims, "a solution to a 50-year-old grand challenge in biology". We still do not know what the fundamental mechanisms of protein folding are, nor how they play together for each specific protein structure. And that means that we cannot answer what-if questions such as "How do changes in a protein's environment influence its fold?"

Explanatory models

The really big success stories of science are models of a very different kind. Explanatory models describe the underlying mechanisms that determine the values of observed quantities, rather than extrapolating the quantities themselves. They describe the systems being studied at a more fundamental level, allowing for a wide range of generalizations.

A simple explanatory model is given by the Lotka-Volterra equations, also called predator-prey equations. This is a model for the time evolution of the populations of two species in a predator-prey relation. An example is shown in this plot (Lamiot, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons):

An empirical model would capture the oscillations of the two curves and their correlations, for example by describing the populations as superpositions of sine waves. The Lotka-Volterra equations instead describe the interactions between the population numbers: predators and prey are born and die, but in addition predators eat prey, which reduces the number of prey in proportion to the number of predators, and contributes to a future increase in the number of predators because they can better feed their young. With that type of description, one can ask what-if questions: What if hunters shoot lots of predators? What if prey are hit by a famine, i.e. a decrease in their own source of food? In fact, the significant deviations from regular periodic change in the above plot suggest that such "outside" events are quite important in practice.
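The hunter question can be explored with a short Python sketch. It uses the standard textbook form of the Lotka-Volterra equations and integrates them with a fourth-order Runge-Kutta scheme. All parameter values are made up for illustration, and modelling hunting as a doubling of the predator death rate `gamma` is an assumption of this sketch, not something taken from the data in the plot.

```python
def derivatives(prey, predators, alpha, beta, delta, gamma):
    """Standard Lotka-Volterra equations:
    d(prey)/dt      = alpha * prey - beta * prey * predators
    d(predators)/dt = delta * prey * predators - gamma * predators
    """
    d_prey = alpha * prey - beta * prey * predators
    d_predators = delta * prey * predators - gamma * predators
    return d_prey, d_predators


def simulate(prey, predators, params, dt=0.001, steps=20000):
    """Integrate the equations with the classical Runge-Kutta method."""
    trajectory = [(prey, predators)]
    for _ in range(steps):
        k1 = derivatives(prey, predators, **params)
        k2 = derivatives(prey + 0.5 * dt * k1[0], predators + 0.5 * dt * k1[1], **params)
        k3 = derivatives(prey + 0.5 * dt * k2[0], predators + 0.5 * dt * k2[1], **params)
        k4 = derivatives(prey + dt * k3[0], predators + dt * k3[1], **params)
        prey += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        predators += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
        trajectory.append((prey, predators))
    return trajectory


def mean_prey(trajectory):
    return sum(prey for prey, _ in trajectory) / len(trajectory)


# Made-up parameters; the equilibrium is prey = gamma/delta, predators = alpha/beta
baseline = dict(alpha=1.0, beta=0.1, delta=0.075, gamma=1.5)
# What if hunters shoot lots of predators? Assumption: gamma doubles.
hunted = dict(baseline, gamma=3.0)

start = (20.0, 10.0)  # the equilibrium of the baseline system
undisturbed = simulate(*start, baseline)
with_hunting = simulate(*start, hunted)
```

In this sketch the undisturbed system stays at its equilibrium, whereas the hunted system starts to oscillate with a prey population that never falls below its initial level. A sine-wave fit to past observations offers no comparable knob to turn.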

Back to celestial mechanics. The decisive step towards an explanatory model was made by Isaac Newton, after two important preparatory steps by Copernicus and Kepler, who put the Sun at the center, removing the need for epicycles, and described the planets' orbits more accurately as ellipses. Newton's laws of motion and gravitation fully explained these elliptical orbits and improved on them. More importantly, they showed that the fundamental laws of physics are the same on Earth and in space, a fact that may seem obvious to us today but wasn't in the 17th century. Finally, Newton's laws have permitted the elaboration of a rich theory, today called "classical mechanics", that provides several alternative forms of the basic equations (in particular Lagrangian and Hamiltonian mechanics), plus derived principles such as the conservation of energy. As for what-if questions, Newton's laws have made it possible to send artefacts to the moon and to the other planets of the solar system, something which would have been unimaginable on the basis of Ptolemy's epicycles.

So far I have cited two explanatory models that take the form of differential equations, but that is not a requirement. An example from the digital age is given by agent-based models. There is, however, a formal characteristic that is shared by all explanatory models that I know, and that distinguishes them from empirical models: they take the form of specifications.

Specifications and equations vs. algorithms and functions

Let's look at a simple problem for illustration: sorting a list of numbers (or anything else with a well-defined order). I have a list L, with elements L[i], i=1..N where N is the length of the list L. What I want is a sorted version which I will call sorted(L). The specification for sorted(L) is quite simple:

sorted(L) is a list of length N.
For all elements of L, their multiplicities in L and sorted(L) are the same.
For all i=1..N-1, sorted(L)[i] ≤ sorted(L)[i+1].

Less formally: sorted(L) is a list with the same elements as L, but in the right order.

This specification of sorted(L) is complete in that there is one unique list that satisfies it. However, it does not provide much help for actually constructing that list. That is what a sorting algorithm provides. There are many known algorithms for sorting, and you can learn about them from Wikipedia, for example. What matters for my point is that (1) given the specification, it is not a trivial task to construct an algorithm, (2) given a few algorithms, it is not a trivial task to write down a common specification that they satisfy (assuming of course that it exists). And that means that specifications and algorithms provide complementary pieces of knowledge about the problem.
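The complementarity can be made concrete with a small Python sketch. The first function is a direct transcription of the three-line specification: it can check a candidate result but says nothing about how to construct one. The second function is one algorithm among many that satisfy it. Insertion sort is chosen here only for brevity, and both function names are hypothetical.

```python
from collections import Counter


def satisfies_sorted_spec(original, candidate):
    """Check the three conditions of the specification for sorted(L)."""
    same_length = len(candidate) == len(original)
    # Same multiplicities: every element occurs equally often in both lists
    same_multiplicities = Counter(candidate) == Counter(original)
    ordered = all(candidate[i] <= candidate[i + 1]
                  for i in range(len(candidate) - 1))
    return same_length and same_multiplicities and ordered


def insertion_sort(items):
    """One possible algorithm: insert each element at its place in a new list."""
    result = []
    for item in items:
        position = len(result)
        while position > 0 and result[position - 1] > item:
            position -= 1
        result.insert(position, item)
    return result


numbers = [3, 1, 2, 1]
sorted_numbers = insertion_sort(numbers)
```

Note that nothing in `satisfies_sorted_spec` hints at insertion sort, and nothing in `insertion_sort` states the three conditions explicitly.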

In terms of levels of abstraction, specifications are more abstract than algorithms, which in turn are more abstract than implementations. In the example of sorting, the move from specification to algorithm requires technical details to be filled in, in particular the choice of a sorting algorithm. Moving on from the algorithm to a concrete implementation involves even more technical details: the choice of a programming language, the data structures for the list and its elements, etc.

In the universe of continuous mathematics, the relation between equations (e.g. differential equations) and the functions that satisfy them is exactly the same as the relation between specifications and algorithms in computation. Newton's equations can thus be seen as a specification for the elliptical orbits that Kepler had described a bit earlier. Like in the case of sorting, it is not a trivial task to derive Kepler's elliptical orbits from Newton's equations, nor is it a trivial task to write down Newton's equations as the common specification of all the (approximately) elliptical orbits in the solar system. The two views of the problem are complementary, one being closer to the observations, the other providing more insight.

One reason why specifications and equations are more powerful is that they are modular. Two specifications combined make up another, more detailed, specification. Two equations make up a system of equations. An example is given by Newton's very general law of motion, which is extended by his law of gravitation to make a model for celestial mechanics. The same law of motion can be combined with different laws defining forces for different situations, for example the motion of an airplane. In contrast, there is no way to deduce anything about airplanes from Kepler's elliptical planetary orbits. Functions and algorithms satisfy complete specifications, and conserve little information about the components from which this complete specification was constructed.
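This modularity can be sketched in a few lines of Python. The function `simulate_motion` encodes only the law of motion, and the force law is supplied separately. The two force laws used here, uniform gravity and a linear spring, are simple one-dimensional stand-ins chosen for brevity rather than the celestial and airplane examples, and the velocity Verlet integration scheme is likewise an assumption of this sketch.

```python
import math


def simulate_motion(position, velocity, mass, force, dt, steps):
    """Law of motion a = F/m in one dimension, integrated with velocity Verlet.

    `force` is any function of (position, mass); the law of motion itself
    knows nothing about where the force comes from.
    """
    acceleration = force(position, mass) / mass
    for _ in range(steps):
        position += velocity * dt + 0.5 * acceleration * dt * dt
        new_acceleration = force(position, mass) / mass
        velocity += 0.5 * (acceleration + new_acceleration) * dt
        acceleration = new_acceleration
    return position, velocity


def uniform_gravity(position, mass, g=9.81):
    """Force law 1: constant gravitational field near the Earth's surface."""
    return -mass * g


def linear_spring(position, mass, k=4.0):
    """Force law 2: a spring pulling back towards position 0."""
    return -k * position


# Same law of motion, two different force laws
fall = simulate_motion(0.0, 0.0, 1.0, uniform_gravity, dt=0.01, steps=100)
period = 2 * math.pi * math.sqrt(1.0 / 4.0)  # oscillation period for m=1, k=4
swing = simulate_motion(1.0, 0.0, 1.0, linear_spring, dt=period / 1000, steps=1000)
```

The resulting trajectories (a parabola in time and a sine wave) have nothing visibly in common, just as nothing about airplanes can be read off Kepler's ellipses. The shared component is visible only at the level of the equations.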

A challenge for computational science

Computational science initially used computers as a tool for applying structurally simple but laborious computational algorithms. The focus was on efficient implementations of known algorithms, later also on developing efficient algorithms for solving well-understood equations. The steps from specification to algorithm to implementation were done by hand, with little use of computational tools.

That was 60 years ago. Today, we have computational models that are completely unrelated to the mathematical models that go back to the 19th century. And when we do use the foundational mathematical models of physics and chemistry, we combine them with concrete system specifications whose size and complexity require the use of computational tools. And yet, we still focus on implementations and to a lesser degree on algorithms, neglecting specifications almost completely. For many routinely used computational tools, the implementation is the only publicly accessible artefact. The algorithms they implement are often undocumented or not referenced, and the specifications from which the algorithms were derived are not written down at all. Given how crucial the specification level of scientific models has been in the past, we can expect to gain a lot by introducing it into computational science as well.

To do so, we first need to develop a new appreciation for scientific models as distinct from the computational tools that implement them. We then need to think about how we can actually introduce specification-based models into the workflows of computational science. This requires designing computational tools that let us move freely between the three levels of specification, algorithm, and implementation. This is in my opinion the main challenge for computational science in the 21st century.

Finally...

Some readers may have recognized that the title of this post is a reference to two books, Structure and Interpretation of Computer Programs (with a nice though unofficial online version) and Structure and Interpretation of Classical Mechanics (also online). The second one is actually somewhat related to the topic of this post: it is a textbook on classical mechanics that uses computational techniques for clarity of exposition. More importantly, both books focus on inducing a deep understanding of their topics, rather than on teaching superficial technical details. This humble blog post cannot pretend to reach that level, of course, but its goal is to spark developments that will culminate in textbooks of the same quality as its two inspirations.

Comments retrieved from Disqus

Konrad Hinsen:
A recommended follow-up read: What is declarative programming? by Alan Kay. His "what" and "how" are almost the same distinction as "specification" vs "algorithm".

Some comments on AlphaFold

Konrad Hinsen — 2020-12-02

Many people are asking for my opinion on the recent impressive success of AlphaFold at CASP14, perhaps incorrectly assuming that I am an expert on protein folding. I have actually never done any research in that field, but it's close enough to my research interests that I have closely followed the progress that has been made over the years. Rather than reply to everyone individually, here is a public version of my comments. They are based on the limited information on AlphaFold that is available today. I may come back to this post later and expand it.

First of all, the GDT scores obtained by AlphaFold are impressive, which is of course the reason for all the buzz at the moment. The GDT score measures how close a predicted structure is to the experimentally determined one. It is defined on a scale from 0 to 100 and can roughly be interpreted as the percentage of amino acid residues that were placed correctly. For about 2/3 of the proteins in this year's competition, AlphaFold achieved a GDT score in the 90s, whereas in the not so distant past, a score in the 70s was already considered very good. Which exact techniques were used to obtain the predicted structures is not something I can comment on: as far as I know, no technical details have been made public so far. Nor is AlphaFold a publicly available program or service that scientists could explore or apply to their own work. So all we know for now is that DeepMind, the company behind AlphaFold, has figured out a way to obtain good scores at CASP14. In the following I will assume that this is not just good luck, and that the method is applicable to a much larger class of proteins than the CASP candidates.

The scores obtained by AlphaFold are clearly a sign of significant progress. But does it mean that we have "a solution to a 50-year-old grand challenge in biology", as the press release claims? That depends on what exactly one considers that challenge to be.

If the challenge of protein folding is taken to be a purely pragmatic one, i.e. being able to predict structure from sequence, then AlphaFold is a candidate for a solution. How much of a solution will depend on further evaluations that remain to be done, on a larger range of proteins. CASP is limited to proteins for which experimental structures are (just) available. But some proteins resist experimental structure determination, for example because they have no well-defined structure at all. A robust structure prediction tool would have to identify such cases, rather than predict bogus structures. Allosteric proteins, which are proteins that can take more than one stable structure, provide another set of interesting test cases. A third case of interest is protein pairs that differ minimally in their sequence but importantly in structure. The goal of evaluating the robustness of a tool is to understand how it behaves at best, at worst, and for important edge cases, such that its users can judge the trustworthiness of its results.

For many scientists, including myself, having a black-box structure prediction tool is not sufficient to declare the protein folding problem solved. A solution requires an in-depth understanding of the mechanisms that determine protein structure. Whether or not AlphaFold can contribute to identifying these mechanisms is a question that scientists can only start to examine, and only if AlphaFold becomes sufficiently accessible and inspectable for critical examination by outside experts. I hope this will happen, and in fact I am optimistic that it will happen: the problem is important enough to deserve a serious effort by everyone involved. AlphaFold is not the end of the quest for a solution of the protein folding problem, but it could well turn out to be the beginning of a new chapter in the story.

The four possibilities of reproducible scientific computations

Konrad Hinsen — 2020-11-20

Computational reproducibility has become a topic of much debate in recent years. Often that debate is fueled by misunderstandings between scientists from different disciplines, each having different needs and priorities. Moreover, the debate is often framed in terms of specific tools and techniques, in spite of the fact that tools and techniques in computing are often short-lived. In the following, I propose to approach the question from the scientists' point of view rather than from the engineering point of view. My hope is that this point of view will lead to a more constructive discussion, and ultimately to better computational reproducibility.

The format of my proposal is inspired by the well-known "four freedoms" that define Free Software. The focus of reproducibility is not on legal aspects, but on technical ones, and therefore my proposal is framed in terms of possibilities rather than freedoms.

The four essential possibilities

A computation is reproducible if it offers the four essential possibilities:

The possibility to inspect all the input data and all the source code that can possibly have an impact on the results.
The possibility to run the code on a suitable computer of one's own choice in order to verify that it indeed produces the claimed results.
The possibility to explore the behavior of the code, by inspecting intermediate results, by running the code with small modifications, or by subjecting it to code analysis tools.
The possibility to verify that published executable versions of the computation, proposed as binary files or as services, do indeed correspond to the available source code.

All of these possibilities come in degrees, measured in terms of the effort required to actually do what is supposed to be possible. For example, inspecting the source code of a computation is much easier for a notebook containing the top-level code, with links to repositories of all dependencies, than for a script available from the authors on request. Moreover, the degree to which each possibility exists can strongly vary over time. A piece of software made available on an institutional Web site is easily inspectable while that site exists, but inspectability drops to zero if the Web site closes down.

The reproducibility profile of a computation therefore consists of four time series, each representing one of the possibilities expressed on a suitable scale with its estimated time evolution. The minimum requirement for the label "reproducible" is a non-zero degree for all four possibilities for an estimated duration of a few months, the time it takes for new work to be carefully examined by peers.
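A reproducibility profile could be represented along the following lines. This is only a sketch under several assumptions: the degree scale from 0.0 to 1.0, the monthly sampling, the short labels for the four possibilities, and the default of three months for "a few months" are all illustrative choices, not part of the proposal.

```python
from dataclasses import dataclass

# Short hypothetical labels for the four essential possibilities
POSSIBILITIES = ("inspect", "run", "explore", "verify")


@dataclass
class ReproducibilityProfile:
    """Four time series, one per possibility.

    `series` maps each label to a list of estimated degrees, one value per
    month after publication. Assumed scale: 0.0 means impossible, 1.0 means
    possible with negligible effort.
    """
    series: dict

    def is_reproducible(self, months=3):
        """Minimum requirement: a non-zero degree for all four possibilities
        in each of the first `months` months."""
        for name in POSSIBILITIES:
            degrees = self.series.get(name, [])
            if len(degrees) < months:
                return False
            if any(degree <= 0.0 for degree in degrees[:months]):
                return False
        return True


# Made-up example: everything archived, but exploring the code takes effort
archived = ReproducibilityProfile({
    "inspect": [0.9, 0.9, 0.9, 0.9],
    "run": [0.8, 0.8, 0.7, 0.7],
    "explore": [0.3, 0.3, 0.3, 0.3],
    "verify": [0.2, 0.2, 0.2, 0.2],
})

# Made-up example: code on an institutional Web site that closes after two months
lab_website = ReproducibilityProfile({
    "inspect": [0.6, 0.6, 0.0, 0.0],
    "run": [0.5, 0.5, 0.0, 0.0],
    "explore": [0.3, 0.3, 0.0, 0.0],
    "verify": [0.1, 0.1, 0.0, 0.0],
})
```

With these numbers, `archived.is_reproducible()` holds whereas `lab_website.is_reproducible()` does not. The actual work lies of course in estimating the degrees and their time evolution, which this sketch takes as given.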

Rationale

The possibility to inspect all the source code is required to allow independent verification of the software's correctness, and in particular to check that it does what its documentation claims it does.

The possibility to run the code is required to allow independent verification of the results.

The possibility to explore the behavior of the code is a de facto requirement to fully accomplish the goals of the first possibility. For all but the most trivial pieces of software, inspection of the source code is not enough to convince oneself that it does what it is claimed to do.

The possibility of verifying the correspondence of source code and executable versions is motivated by the complexity of today's software build procedures. Mistakes can as easily be introduced in the build process as in the source code itself. This point is well made by Ken Thompson's Turing Award speech Reflections on Trusting Trust, if you replace mischief by mistake in his arguments.

Discussion in the context of the state of the art

The possibility to inspect all the source code is a criterion that is in principle widely accepted, although many people fail to realize its wide-ranging consequences. "All the source code that can possibly have an impact on the results" actually means a lot of software. It includes many libraries, but also language implementations such as compilers and interpreters. Moreover, inspecting a dependency first of all requires precisely identifying it. This remains a difficult task today, and therefore most published computations today do not offer the first essential possibility, no matter how much effort a reader is willing to invest.

It is tempting to introduce another degree of compliance by requiring that only the most relevant parts of the total source code be inspectable. However, that defeats the whole purpose of independent verification. Who decides what is relevant? Usually the author of the computation. But if the code declared to be irrelevant by the author is not inspectable, we have to take the author's word for its irrelevance.

The possibility to run the code is also a widely accepted criterion, though not everyone accepts the additional requirement of executability "on a suitable computer of one's own choice". Software made available as a service (e.g. in the cloud) is considered sufficient for reproducibility by some researchers. Executability is much more susceptible to decay over time than inspectability of the source code, and this is one of the main topics of debate today. Is long-term reproducibility needed? Is it achievable? The answers vary across disciplines. There is unfortunately a strong tendency to self-censorship here: many scientists believe that long-term reproducibility is not realistic and therefore should not be asked for. This is definitely not true and it is better to frame the question as a trade-off: what is a reasonable price to pay for long-term reproducibility, in a given discipline?

The possibility to explore the behavior of the code is rarely mentioned in discussions of reproducibility. And in fact, exploring the behavior of non-trivial code written by someone else is such a difficult task that many scientists prefer not to require anyone to do it. I am not aware of any scientific journal that expects reviewers of submitted work to check the code of any computation for correctness or at least plausible correctness, which in practice requires examining its behavior. And yet, the scientific method requires everything to be inquirable. It may not be a realistic expectation today, but it should at least be a goal for the future.

Since code explorability is rarely required or even discussed, there is no clear profile of practical implementations either. It's a criterion that requires expert judgement, the expert being a fellow researcher from the same discipline as the author of a computation. It is the software analog of a "well-written" paper, which is a paper that a reader can easily "get into".

The possibility of verifying the correspondence of source code and executable versions is also rarely mentioned. It is also the least fundamental one of the four essential possibilities, because in principle it can be abandoned if a computation is fully reproducible from source code. In practice, however, that is rarely a realistic option. The size and complexity of today's software assemblies make it impractical to re-build everything from source code, a process that can take many hours. Nearly all software assemblies we run in scientific computing contain some components obtained in pre-built binary form. While it is perfectly OK for most people, most of the time, to use such pre-built binaries, inquirability requires the possibility to check that these binaries really correspond to the source code that the authors of a computation claim to have used. This is a possibility where a low degree can be quite acceptable.

Please comment!

As I said, the goal of this blog post is to start a discussion. Your comments are valuable, possibly more so than the post itself. How important are the four possibilities in your own discipline? How well can they be realized within the current state of the art? Are there additional possibilities you consider important for reproducibility?

Check also the comments on Twitter by exploring the replies to this tweet.

Notes added after publication

2020-11-22

Jeremy Leipzig points out the 2012 ICERM workshop document, whose appendix A discusses several levels of reproducibility. Its last level ("open or reproducible research") covers in a general way the four possibilities I discuss above. The lower levels describe research output in which at least one of the four possibilities is not provided.

2020-11-23

Ivo Jimenez refers to ongoing work at NISO (National Information Standards Organization, USA) to define recommended practices, and Neil Chue Hong says they will be out soon.

Ivo Jimenez also mentions an interesting collection of resources on artifact evaluation for computer systems conferences.

Comments retrieved from Disqus

Roberto Di Cosmo:
Thanks for this nice post: I like the classification, and I love the acknowledgment of the difficulty to have a "one size fits all" solution when it comes to reproducibility, as the dimension of the problem and the resources available to address it really vary a lot across disciplines, and even inside a discipline. A nice example of a "scientific journal that expects reviewers of submitted work to check the code of any computation for correctness or at least plausible correctness, which in practice requires examining its behavior" is Image Processing OnLine (https://ipol.im) that goes a long way along the road to reproducibility.
- Konrad Hinsen:
  Thanks for mentioning IPOL! I haven't been able to find reviewing guidelines on their Web site, but I will contact the team to find out what exactly their reviewing process evaluates.
Nicolas Rougier:
In terms of code interactivity, I find the https://distill.pub/ journal to be really good even though I imagine it's a lot of work for authors. But it's really nice to be able to play with the model. In my own domain (computational neuroscience) I dream of having really interactive models where you can test what happens if you modify this or that parameter or simply change the random seed. I suspect this won't come anytime soon since most journals do not even really care about the code, but who knows.
- Konrad Hinsen:
  Thanks for that nice example, which illustrates possibility #3: the possibility to explore how a computation works. Much of the work by Bret Victor (http://worrydream.com/) is similar to the distill.pub you cite. But as you say, these are very much examples of hand-crafted presentation software, and thus require a huge investment by the authors. Making such presentations more accessible should be one priority in method and tool development. Jupyter widgets are one step in that direction.

The landscapes of digital scientific knowledge

Konrad Hinsen — 2020-07-08

Over the last few years, an interesting metaphor for information and knowledge curation has begun to take root. It compares knowledge to a landscape in which it identifies in particular two key elements: streams and gardens. The first use of this metaphor that I am aware of is this essay by Mike Caulfield, which I strongly recommend reading first. In the following, I will apply this metaphor specifically to scientific knowledge and its possible evolution in the digital era.

In the landscape metaphor, streams are timelines of information parcels. News, RSS feeds, Twitter, Facebook, but also scientific journals, are stream media. Gardens are continuously evolving information assemblies that are actively curated by their authors. Encyclopedias and dictionaries are perhaps the oldest examples. In the printed paper era, updating an information collection was expensive because everything had to be reprinted and redistributed. As a consequence, garden-type resources were rare. Digital gardens have no such overhead, and almost no cost other than the work of their curators. More and more people are setting up their own digital gardens as an alternative or complement to the personal stream, better known as a blog. Click here, here, and here to see a few examples of personal digital gardens. Like blogs, digital gardens can also be collective efforts, run by a company, a research group, or a larger community. The most widespread tool for digital gardening is the Wiki, but there are also more recent developments in this space, such as Notion or Roam.

One distinction that I haven't seen mentioned yet in this context is the one between a garden and a park. Both are curated and thus continuously evolving. But whereas gardens are set up and maintained for the benefit and enjoyment of their owners, parks are created and maintained for the benefit and enjoyment of the public. The difference can be subtle, as digital gardens are often visible to the public as well. But they are more like the unwalled garden on the roadside that you can admire passing by than like the park in which you can take a walk and sit down reading a book. A good example of a digital park is Wikipedia.

Science is all about acquiring information about our world and distilling it into knowledge, and therefore requires a fair bit of gardening. In its early days, it was managed as a garden by and for a small community of people who were motivated by curiosity and relied on personal wealth or on sponsors for doing their work. Universities employed scientists more for teaching than for doing research. Research was done by individuals or small teams, and presented at conferences or in journal articles, much like today. Unlike today, most scientists were up to date on everything that was happening in their field, and had personal exchanges with almost everyone else, in face-to-face meetings or by correspondence. Conferences were events in which conflicting results and different points of views were actively debated, enabling the formation of consensus. The streams of papers and conference contributions thus watered the garden of scientific knowledge.

All that changed after World War II, when science underwent rapid growth as states injected a lot of money while at the same time expecting the scientific community to cultivate a park rather than a garden, contributing to the common good. Keeping up to date with everybody else's work became more and more difficult, slowly eroding the possibility of consensus formation through live debate at conferences. Productivity metrics focusing on what is easiest to quantify ended up rewarding scientists for contributing to the stream of journal articles, but not for contributing to the cultivation of the park of scientific knowledge. Today, the streams of journal articles have become torrents whose distillation into knowledge is becoming ever more difficult. A good illustration is the (serious) proposal to use machine learning tools to make sense of the "tsunami" of articles resulting from the intense research on the Covid-19 pandemic.

The design and implementation of new mechanisms for knowledge distillation and consensus formation is thus a major challenge for science today, and even though machine learning techniques may prove to be helpful, I expect this to remain a fundamentally human task for a long time to come. These new mechanisms must combine technological aspects (good tools for working towards these goals) and social aspects (incentives for scientists to participate in this work). As always, the social aspects are the harder problem. As a first step and as a source of inspiration, let's look at similar existing mechanisms in science and elsewhere. Which digital parks exist? How do they work? Can their mechanisms be adapted to other applications?

I have already cited Wikipedia as a prime example of a digital park. I had expected to see Wikis more widely used as a platform for collective information curation in science, be it as gardens or parks, but when I searched for examples I found surprisingly few, e.g. Tricki (for mathematical problem-solving techniques) or the Complexity Zoo (on classes of computational complexity). One problematic aspect of Wikis is that they present only a single view to the outside world. They are better suited for presenting an established consensus than for supporting the process of consensus formation in rapidly evolving fields. One of the rare cases of a Wiki used for coordinating collaborative research, rather than for summarizing the state of the art, is the Polymath project. It is probably not a coincidence that this has happened in mathematics, a domain whose working habits remain close to those of the early scientific community, with individuals having more agency than in disciplines that are more dependent on material resources.

Federated Wiki is an interesting evolution of the Wiki concept (initiated by the original inventor of the Wiki, Ward Cunningham) that allows individual contributors to maintain and publish their own view while at the same time encouraging reciprocal borrowing of content. This video illustrates the process nicely. Whereas federated Wiki looks like a promising approach to consensus formation, the technical obstacles to setting up a federated Wiki are significant (contributors must manage personal Web servers and domains) and make it difficult to evaluate it in practice.

Perhaps the most frequent kind of digital park in science today is the collaborative software development project, hosted on platforms such as GitHub, GitLab, or similar platforms operated by research institutions. Ignoring the differences resulting from the focus on code rather than prose, the main differences between these platforms and Wikis are (1) a stronger emphasis on discussion ("issues") and (2) the co-existence of multiple branches representing different public or private views of a common project, with one branch (conventionally named "master" or "main") representing the current consensus.

Collaborative software projects are an interesting case study also for the question of incentives. The lack of recognition of software development as a research activity has been deplored for a long time. It is usually attributed to the relative novelty of software as a form of research output. But I suspect that the park nature of software, as opposed to the stream nature of journals, is also an important factor, because it makes it more difficult to evaluate an individual's contributions based on purely formal (and thus easily measurable) criteria. On the other hand, today's collaborative platforms make such an evaluation technically feasible, by counting for example the number of commits made by an individual, or the number of lines changed by those commits. Everybody involved in software development will probably agree that this is a stupid metric, but it's no more stupid than counting publications weighted by journal impact factor.
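To make concrete how little effort such a metric requires, here is a minimal sketch that counts commits and changed lines per author. It assumes the text printed by `git log --numstat --pretty=format:"@%an"`, in which each commit starts with an "@author" line followed by tab-separated "added, deleted, path" lines. The function name and the "@" marker are my own choices for illustration.

```python
from collections import defaultdict

def contribution_counts(log_text):
    """Count commits and changed lines per author from git log output.

    Assumed input format: an "@author" line per commit, followed by
    "added<TAB>deleted<TAB>path" lines as printed by --numstat.
    """
    commits = defaultdict(int)
    lines_changed = defaultdict(int)
    author = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Start of a new commit by this author.
            author = line[1:].strip()
            commits[author] += 1
            continue
        parts = line.split("\t")
        if author is None or len(parts) < 3:
            continue  # blank line or unexpected content
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" and are skipped here.
        if added.isdigit() and deleted.isdigit():
            lines_changed[author] += int(added) + int(deleted)
    return dict(commits), dict(lines_changed)
```

The brevity of the sketch is the point: the numbers are trivial to obtain, and they say nothing about whether a commit fixed a subtle bug or merely reformatted a file.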

Another social aspect that is well illustrated by software is the difficulty of the transition from gardens to parks. Projects usually start out as gardens, with a small team developing software for its own use. Then early users start to join, who by necessity have to figure out for themselves how to adapt the software to their needs, and are thus likely to become contributors. With an increasing user base, developers have an interest in working on more robust code and better documentation, in order to reduce the effort of technical support. At that stage, the software becomes attractive to less technically minded users who see no need to ever get in touch with the development community. These users consider the software a park, even if its developers still consider it a garden, leading to contradictory tacit expectations on both sides about the priorities for future maintenance, which I have described in an earlier post. Developers tend to contribute to this confusion by advertising their project as a park while maintaining it as a garden.

The above examples illustrate that the technical challenges of digital gardens and parks are reasonably well understood and partially solved. Collaborative software development platforms in particular have proven very effective. Adapting their concepts to different use cases and different users definitely looks possible, although the effort required should not be underestimated, in particular for developing appropriate user interfaces. But the real challenge is creating incentives for collaboration, in a universe currently dominated by competition for limited resources.

Comments retrieved from Disqus

Ericson2314:
> One of the rare cases of a Wiki used for coordinating collaborative research, rather than for summarizing the state of the art, is the Polymath project. It is probably not a coincidence that this has happened in mathematics, a domain whose working habits remain close to those of the early scientific community, with individuals having more agency than in disciplines that are more dependent on material resources.
I think you are on to something, because the only other example I know of is https://ncatlab.org/nlab/, which is also Math related. (And by mathematicians with a computer scientist slant that only makes familiarity with the broader world of Wikis beyond Wikipedia more likely.)

An open letter to software engineers criticizing Neil Ferguson's epidemics simulation code

Konrad Hinsen — 2020-05-18

Dear software engineers,

Many of you were horrified at the sight of the C++ code that Neil Ferguson and his team wrote to simulate the spread of epidemics. I feel for you. The only reason why I am less horrified than you is that I have seen a lot of similar-looking code before. It is in fact quite common in scientific computing, in particular in research projects that have been running for many years. But like you, I don't have much trust in that code being a faithful and trustworthy implementation of the epidemiological models that it is supposed to implement, and I don't want to defend bad code in science.

However, many of your specific criticisms show a lack of familiarity with today's academic research. This code is not the sole result of 13 years of tax-payer-funded research. The core of that research is building and applying the model implemented by the code; the code itself is merely a means to this end. The scientists who wrote this horrible code most probably had no training in software engineering, and no funding to hire software engineers. And the senior or former scientists who decided to give tax-payer money to this research group are probably even more ignorant of the importance of code for science. Otherwise they would surely have allocated money for software development, and verified the application of best practices.

But the main message of this letter is something different: it's about your role in this story. That's of course a collective you, not you the individual reading this letter. It's you, the software engineering community, that is responsible for tools like C++ that look as if they were designed for shooting yourself in the foot. It's also you, the software engineering community, that has made no effort to warn the non-expert public of the dangers of these tools. Sure, you have been discussing these dangers internally, even a lot. But to outsiders, such as computational scientists looking for implementation tools for their models, these discussions are hard to find and hard to understand. There are lots of tutorials teaching C++ to novices, but I have yet to see a single one that starts with a clear warning about the dangers. You know, the kind of warning that every instruction manual for a microwave oven starts with: don't use this to dry your dog after a bath. A clear message saying "Unless you are willing to train for many years to become a software engineer yourself, this tool is not for you."

As a famous member of your community famously said, software is eating the world. That gives you, dear software engineers, a lot of power in modern society. But power comes with responsibility. If you want scientists to construct reliable implementations of models that matter for public health decisions, the best you can do is make good tools for that task, but the very least you must do is put clear warning signs on tools that you do not want scientists to use - always keeping in mind that scientists are not software engineers, and have neither the time nor the motivation to become software engineers.

Consider what you, as a client, expect from engineers in other domains. You expect cars to be safe to use by anyone with a driver's license. You expect household appliances to be safe to use for anyone after a cursory glance at the instruction manuals. Is it reasonable then to expect your clients to become proficient in your work just to be able to use your products responsibly? Worse, is it reasonable to make that expectation tacitly?

Some of you have helped with a first round of code cleanup, which I think is the most constructive attitude you can adopt in the short term. But this is not a sustainable approach for the future. We can't ask software experts for a code review every time we do something important. We computational scientists need you software engineers to help us build a better future for computer-aided research. Which means pretty much all research, because software has been eating science as well for a while. Can we count on your help?

PS added 2020-05-19T10:30: This post has provoked a lively discussion not only in the comments below but also on Twitter. There are way too many comments for me to reply to each one individually, so I decided to address recurrent topics in this follow-up.

Many people seem to have read my post as putting the main responsibility for the problems related to the cited simulation code on software engineers. This was most certainly not my intention. Scientists, policy makers, and journalists have all contributed to a less than satisfactory outcome. My open letter is clearly addressed at a particular group of people (software engineers criticizing the Imperial College Covid-19 simulations on the basis of code quality) and clearly states its focus on the role of software technology, which is what the target audience seems to overlook. A focus is always an arbitrary choice of an author for the sake of brevity or clarity. A glance at the rest of my blog should suffice to show that I do consider computational scientists responsible for their technological choices and their consequences. However, my main intention was not assigning blame for events in the past, but outlining what needs to change to prevent similar events in the future.

The car analogy was another frequent target of critical comments. Cars are a mature technology, in which many professions (engineers, workers, mechanics, driving instructors, drivers, etc.) have well-defined roles and everyone involved has a general understanding of the role of everyone else. Software is an immature technology in which roles remain fuzzy and everyone has an even fuzzier view of which other roles exist and who fills them. The discussion of my open letter has provided ample evidence for this all-encompassing fuzziness. What we collectively need to work on is turning software into a mature technology. That requires all stakeholders to make their own role views explicit and then negotiate shared role definitions with everyone else. Several commenters have pointed out the emergence of research software engineers (RSEs) as a sign of progress, and I completely agree. But even the role of RSEs remains fuzzy at this time. Should they work as collaborators on research projects, with a particular specialization? Or as occasional consultants or service providers to researchers? Their interaction with the software engineering universe is even less clear. For now it is mostly one-way in that RSEs bring software technology from the outside into research labs. What my letter argues for is an action in the opposite direction: make software technology evolve to adapt to the specific needs of scientists. A big problem is culture clash. In academia, scientists are traditionally on top of the power pyramid and are used to everyone else working for them (even though the top position is now held by managers, but that's a different story). In the tech world, it's software engineers who are kings and used to everyone else, including their clients, obeying their directives. In the worst case, RSEs might find themselves trapped in the valley between two power pyramids. In the ideal case (from my point of view), they will be diplomats working towards a merger of the two kingdoms, with a simultaneous transformation into a democracy.

Comments retrieved from Disqus

cd:
I have been involved in both sides of this. And my code for academic research purposes was shit. It was written to get the job done. I gave no thought to performance, maintainability or anything else for that matter; it wasn't even structured.
When I got a job as a professional I got a real culture shock. The standards that are required are orders of magnitude higher.
You might say, well, scientists have to do other bits of research too, as well as write the code. And that is true. But it also pains me to say that before becoming a professional software engineer I also worked as a scientist in a commercial company. And again the standard of research and development was much, much higher.
Academia is sloppy and peer review is sloppy.
Brian L. McMichael:
This is like trying to build a house without any previous experience and then blaming professional homebuilders for not making it easier for commonfolk to nail 2x4's together.
David Sarma:
For the type of software that's under discussion (a concrete realization of a mathematical model), what the scientist cares about is the mathematical model, not the realization of it. This is why "software quality" is shunned as a concern: ideally, it should NOT be something that one has to be concerned about. The ideal scenario would be an algorithmic translation of the mathematical model into computer instructions, with no human there to provide inconsistency and bugs into the process.
The direction in which things are headed is indicated by projects like CVXPY / CVXR. We want a compiler for mathematical language, whose output we for the most part don't have to look at or care about, in the same sense that programmers do not for the most part inspect the assembly language output of their programs, and criticize them for being poorly organized, verbose and unreadable monstrosities. The *solvers* that the model uses of course should be under the most intense scrutiny by the most skilled software engineers... but this goes beyond the scope of the scientific part of the project, in the same sense that we depend on linear algebra libraries working correctly, but modeling greenhouse gases is NOT linear algebra.
In other scenarios, flipping to the dual marks the maturation of a field (ex. "classical" renderers transitioning to physically-based rendering), the end of certain classes of conflict and stress (caustic situations and antagonistic relationships), and the ability to focus on content rather than technology (telling good stories vs attaining photorealism). (Other side effects are, deprecations and job loss, industry-wide collapse in some cases, or transition into other business models.) The injection of constraint solvers into mainstream software engineering (in the manner that Rust does) will likely lead to similar outcomes: the end of certain classes of free-for-all improvisation, and better ability to focus on the content under discussion.
- Konrad Hinsen:
  Thanks for pointing out that there are indeed some developments pointing in the right direction!
Undercover modeller:
I think there is a point that is being missed. This software is essentially repurposed software. It's software built for academic purposes being repurposed as business/nation critical software for making decisions that affect life and death for thousands of people and affect the livelihoods of many millions of people.
I write business critical models for a living using software engineering processes that I've been taught over the years. However, if I was asked to write, say, safety critical software for a plane, I would not apply the same processes, nor know what processes should be applied.
The issue lies in those who commissioned the software, and to a certain and lesser extent, the academics who built it, who should have known that using academic software development techniques was inappropriate for business critical software that might have such a major impact on people's lives.
- Konrad Hinsen:
  That's an interesting remark. Yes, the software has been repurposed. But: that happens all the time with research software. The small function written for the exploration of a dataset ends up in community-managed software and then maybe in industrial applications. Nobody ever commissions software in academia. It's very much bottom-up.
  - Undercover modeller:
    Alas that is true. That's why we never incorporate open source components into our models without either rewriting it or subjecting it to our own testing program.
Colin Gillespie:
So while I sort of agree with your argument, I do think that academics hold much of the blame.
An analogous situation is statistics. Ask any statistician, and they will tell you that performing a vaguely complex analysis requires training and experience, yet many scientists are happy to just copy and paste code/analysis from random parts of the internet.
In building software, the REF (run by academics), actively snubs contributions to software. Instead, they are encouraged to have the "Facebook" type model, publish often and fast. How often are papers retracted if the software is wrong or has a bug?
Michael Höhle:
First page of the OpenBugs Manual - http://www.openbugs.net/Man... https://uploads.disquscdn.c...
- Konrad Hinsen:
  Excellent - thanks for this example!
Brian Sides:
There is a department of Computing at Imperial College London
https://www.imperial.ac.uk/...
Where they teach computer programming
"Welcome to the Department of Computing
Computers are the most significant and exciting technological innovations of the last hundred years. In the future, they will play an even more considerable role in medicine, the sciences, industry, communication and the arts. It's safe to say that the science of Computing will remain a vitally important part of modern civilisation and will be responsible for many of the most important changes in the world in which we live.
Career prospects
Our graduates have the highest average salary for a computing degree in the UK and have gone into a range of careers including Media, Software, Finance and Research with employers such as Google, Microsoft, Facebook, Amazon and Bloomberg. A career in Computing opens the door to a wide range of careers."
Yet over a period of more than 20 years a pandemic computer model was developed.
This is the same pandemic model used for Neil Ferguson's previous predictions
that were so far off. 2001 mad cow disease leading to six million cattle and sheep slaughtered.
Millions were spent buying vaccines against swine flu in 2009.
If this was some internal test program put together quickly, then you might expect this quality of code. But even then some bad practices have been employed.
I have emailed some in charge of the computing department asking for comment on the code. But no reply.
Obviously the code was not developed by the computer programming department.
Those developing the pandemic model thought they were so clever they did not need to bother with things like documentation and testing or checking with those who know how to programme.
The crude method of projecting numbers forward is as questionable as the code.
Suppose I write a computer program to calculate how many chip shops there will be in my small town. If, as Neil Ferguson's Report 9 says, growth is exponential, doubling every 5 days,
the program will predict that in six months there will be over 8 billion chip shops in my small town.
Computers do not have brains. They cannot know that 8 billion chip shops in a small town is impossible.
elgato:
As a physicist who develops software for my own research and for fun, I couldn't agree more with this letter. I code in C++ and HPC, mainly GPUs. In academia there is a certain lack of appreciation for developing good code. Scientists who, like me, try to make things better are seen as people who lose time instead of producing results. Arrogance is also a problem. Learning the idioms of a given programming language is easy, but writing maintainable code is not. A long time ago, after many years of programming, I learnt design patterns, which changed my way of coding. From there I moved on to learning more in depth about software and how to build it. But the truth is that in school nobody bothered to tell us about it. We learn simply by doing, with no formal education whatsoever, which is bad, really bad. This piece of code is only one example of the many codes out in the wild used by people on a day-to-day basis; this one just happens to be one that might affect the decisions of policy makers. Also, for the software developers here: not all scientists produce the piece of crap that is being discussed here, so please don't put all people in the same bag.
Brian Sides:
The original code was written in C, not C++. There are some Fortran functions that are supported by the C library (some think the code was originally written in Fortran and ported to C). The code has now been ported to C++ and split into multiple files, but without using the object-oriented features. The code is still mostly just C. Bugs have been found during the conversion process.
The code was written over a period of more than 20 years. Many thousands of man hours.
At the end they had produced one single file of 15,000 lines of code.
That is less than 2 lines of code a day.
The code is undocumented with a host of single letter variables.
Data is read and written without error checking; data is not verified, and there is no file signature or checksum.
It is very simplistic simulation code.
All code needs to be tested. Important code needs to be independently tested.
These are highly qualified, highly paid people. They had a team working on this.
There has been a large investment. Where was the management oversight?
It is clear from comments by Neil Ferguson that he thought that thousands of lines of undocumented code that he kept in his head were not a problem; he was kind of proud of it.
As well as the code, the method is very questionable: taking some data of questionable choice, then making many assumptions and applying these to a limited simulation with a small set of real-world statistics, which in no way takes into account the way these and other factors interact.
There is no excuse. SAGE has failed to check that the model had been properly tested.
The predictions from this faulty model have misinformed the Government
and led to this ill-informed lockdown.
sde-2243:
I think, it consists of few different topics, hardly mixable.
1) When I buy a car, I might get (and might not get) some warnings. However nobody expects the car manufacturer to teach me how to drive. This is a skill, and I spent literally months and thousands of dollars honing this skill. Still, even if I got to the level where I can participate in racing events, I would not be arrogant enough to try to drive an 18-wheeler. Or a bus. And if I tried, I would *not* blame others for a collision.
Somehow we think that because we have a computer, we possess the skills necessary to develop software. Or, if we learn a language, we learn software engineering. This is wrong: this is an acquired skill. Junior engineers coming from college spend years learning how to develop robust systems quickly. Using an analogy, I have a chef knife, so why can't I cook like a Chef? Oh, and by the way -- there was not a single word of warning on the knife when I bought it. There was no video on how to hold it properly, or where I should use a chef knife, a peeling knife, and so on. Not a squeak on how to maintain it, how to wash, how to store, how to sharpen.
2) The world is changing quickly. What used to be a highly professional activity is quickly becoming a side skill for professionals in different areas, be it biologists, physicists, or computational scientists. Apparently, there is *an emerging market* for development tools for these non-specialists.
However it is hardly reasonable to expect these languages and tools to come from industry *evolution.* [At some point Niklaus Wirth was asked why he does not participate in language standardization. He answered that he is teaching. To teach students, he needs a modern language. So he creates one. Standardization is needed by industry -- so let industry do standardization.] Industry does not know what academia needs. And does not care -- justly. But the market means that some company, some group of people might start to work on a product that is needed for this market, and start to sell it. [Stephen Wolfram's Mathematica is a great example of such a product.]
3) Why can't this solution emerge from academia? There are computer science / software engineering departments. So, why do you want somebody else to solve your problems, instead of stopping in your colleague's office?
Again, we have examples. Some quite interesting facts in mathematics are proved by software. It was academia that developed tools for proving, and created validation means for these tools. In fact, academia is more interested in formal validation of programs than the software industry (as a whole). So, if it is possible for mathematicians, why not for others?
David Frenk:
C++ isn't the problem. Academic researchers write terrible code in every language they use. Python is pretty much the most user-friendly language imaginable, and most academic Python code is spaghetti too. Peer review (especially in an open source context wherever possible), and better software engineering training for academics who need to write code are the best solutions here.
- Jef “Credible Hulk” Spaleta:
  When the peer review of the code itself becomes as important to career advancement as the publication of scientific results... things will get better. Otherwise, they won't. Academic researchers by and large are not incentivized to write maintainable code. For projects with a large enough budget, you start seeing staff engineers hired to maintain critical codebases, but if the researcher is writing it, it's really not expected to be maintainable.
  And while there is an effort put into peer review of the published articles that appear in scientific journals, the same effort is not usually required for the digital artifacts (the software) that were used to produce the results expressed in the articles. As it stands, career advancement is not predicated on being proficient at producing readable, reusable, robust code. Publish or perish doesn't generally apply to the software... it is what it is.
  - Konrad Hinsen:
    That is indeed an important point, but it's also important to realize that improving the situation is not easy. Reviewing scientific code upon publication requires (1) accepted standards for code quality and (2) reviewers compensated in some way for the significant effort that code review represents. Which is one of the reasons why I ask for better tools: to reduce the effort in code reviews.
Michael:
Coming from a non-CS academic background, I disagree with you. This sentence in particular:
It’s also you, the software engineering community, that has made no effort to warn the non-expert public of the dangers of these [C++] tools.

does not match my experience at all. Every conversation with a scientist that touched upon C++ I can remember has amounted to "yeah, but C++ is very hard, let me use MATLAB/Python/R instead." If you want to convince yourself, go around university departments and poll graduate students on whether or not they think C++ is easy. This will render the fact that you cannot identify "clear warnings" irrelevant, since if everyone agrees that C++ is not easy, that provides evidence those warnings from the software community are coming through to non-experts.
That does not mean Ferguson's code doesn't shed light on a serious problem. The code would not have looked much different in MATLAB/Python/R. The problem is not the language, but the inherent practices that should be involved in developing scientific code: version control, unit tests, documentation, reproducibility. Most academic groups, however, provide little to no training or emphasis on the importance of these tools. The value is instead placed on publishing papers. The purpose of most academic code is usually to be just good enough to plot the graphs that are used for a paper. This is the case even in fields that should be CS-minded, such as applied mathematics. You will never get tenure for writing good code, and your graduate students have no incentive to write good code -- unless, of course, they want to solve a real problem. But then they often go to industry.
If you want to avoid this, educate the older academics. This is already happening. Software engineering, in my view, is extremely transparent about your main gripe: C++ is not for beginners.
Ariel Fogel:
Thanks for writing this. As someone who left software development to go back to academia in a field not directly related to CS, one thing I notice is that there's often not a ton of time to implement best practices unless that's part of the culture of the lab (read: unless your lab is led by a computer scientist who is well trained in software engineering).
Even if there was enough time, more often than not you're writing one-off scripts or things that are what I referred to as a spike in a developer context. And unfortunately, sometimes those spikes continue to get developed. But a lot of times they aren't b/c research is inherently about taking well-informed stabs at the unknown and seeking to uncover something new. It's hard to know when it's worthwhile to start with best practices or the tech debt is high enough that it necessitates a refactor. And even more difficult if you're trying to get funding for that. I'm not sure my lab would be able to write grants that also ensure we TDD every, or even some, pieces of research we produce.
And that's with me having been exposed to software development practices and having access to some professional programmers who help with our research, which most of my peers haven't and don't. Speaking of which, I'm going to go back to writing crappy code now :)
David Hicks:
I've come here from Hacker News where there's a little outrage going on right now ...
I think computational scientists are increasingly going to need to get code reviewed by experts, particularly in areas where that code affects public policy. There are a bunch of ways this might be achieved, and publishing of source code openly under a FOSS license could help here. But it may be that you need to pay people to build models (or pay people to design some sort of extensible model framework) for you.
To look at your analogy here - "You expect cars to be safe to use by anyone with a driver’s license."
Yes, I do. But I don't expect to be able to go to the Ford factory, pick up some tools and make a car that meets road use regulations without some training. By using C++ you've wandered in and had a go with an arc welder, and now you're annoyed with us at the result?
- Konrad Hinsen:
  "I think computational scientists are increasingly going to need to get code reviewed by experts"...
  Let me translate: "We, the software industry, set the rules by which everyone has to play for using computers. If scientists want to do computations, they will have to consult with us and pay us for that."
  That's "software is eating the world" at its best. And that's exactly what my open letter is arguing against. You may of course disagree, as this is a question of policy, but then there is no need for further discussion: you and I have conflicting interests.
  - David Hicks:
    I'd also like to ask, respectfully, if you would resent the suggestion of getting an Architect to look over the plans you drew up for a house you were building? And paying them to do so?
    Software Engineering is a skilled profession, we spend a lifetime learning, practising and perfecting it, but it's somehow wrong to suggest that you might want to consult with someone to help get it right?
    - Konrad Hinsen:
      Houses are like cars: mature technologies where all the roles are well defined. There are no "house planning for dummies" books that lure people into designing their own house without help from an architect.
      I am perfectly fine with software engineering becoming as mature as architecture, and being left to qualified professionals. But computational scientists need to be able to do their job autonomously. Which is not the case as long as badly designed systems programming languages are almost inevitable for implementing scientific models.
      - David Hicks:
        So now we should set the rules, and be gatekeepers of knowledge? I'm getting very mixed messages here.
        I'm not a C++ coder by trade, very often, so I'll leave them to answer your criticism that it's poorly designed.
        You're asking that computational scientists be able to produce work as well as experienced software engineers can, with no training and with no oversight, without engaging with experienced people to help build out your models, and certainly without paying for any of their insight. Why do you think that should even be possible? Are you haranguing chemical engineers because anyone should be able to build an oil distillation column and it's their fault yours blew up?
        Our discipline is almost uniquely open, you can learn, you can build, we give access to tools and platforms, we share amongst ourselves and with anyone that wants to learn. But that doesn't mean that after reading a couple of intros to C++ you're going to make flawless programs and frankly I find it arrogant that you think you should be able to just bypass the training and achieve comparable results. There's a reason your university has a whole department for computer science.
        
        Konrad Hinsen:
        I am not asking that computational scientists should be able to do zero-effort software engineering. They should be able to develop and evaluate scientific models on their own, using tools designed by software engineers. Much like ordinary people write letters using word processing software.
        To give an example for how this could work (I am not saying this particular approach will work, but I think it's worth investigating): design a stack of ever more specialized DSLs, with a general-purpose programming language at the bottom and each successive layer on top of it specializing towards a scientific application domain. Most scientists could then work most of the time at a level they can manage on their own. When they hit the limits of their DSL, they'd work with RSEs on a more appropriate DSL for their specific problems.
        However, what I outlined above is not a technology fix. Those DSLs should each correspond to a role and a competence profile. It's not just a software stack with layers of abstractions introduced to facilitate maintenance by teams of people who have basically all the same profile. Another important point is interoperability. Lots of specialized DSLs can only work in practice if the epidemiology DSL can interoperate with the statistics DSL and the ODE DSL.
  - David Hicks:
    I strongly disagree with your (mis)characterisation there, particularly as I suggested publishing open source as a way to get more eyes on the code.
    We don't set the rules, clearly, as you can find and use a good many of our tools for free, in whichever way you want, as demonstrated here. But if you're not trained or experienced you're not always going to get the best results, and perhaps you should be looking for outside help.
    Me, I don't expect to be any good at arc welding without some help and training either.
    (edit - I don't even expect to be any good at writing software without getting other people to review it!)
    - Ondřej Čertík:
      I think Konrad is arguing for domain scientists to be able to write software by themselves, without needing CS experts (whether paid or open source) to help fix up their code. I agree with that 100%.
      - David Hicks:
        I would argue that anyone producing software that is going to be relied upon for published scientific results, particularly scientific results that are used to inform public policy, should have such software reviewed by peers, and probably a wider audience than that if the peers are similarly non-expert.
        You might not wish to involve CS 'experts', (and this isn't really CS, but Software Eng) but perhaps some of the habits of such people should be explored. I wouldn't dream of deploying something that hadn't had other eyes on it.
        I agree in the abstract that it's a good thing to create tools for scientists to need as little assistance as possible, and it looks like you're working towards that end - good stuff :)
        But I also think that fundamentally, to produce good software, you need more than one person and you need experienced eyes. It's in the nature of the game.
        
        Ondřej Čertík:
        David, thanks for the comment -- I agree that one should not work in isolation and the more reviews the better. At the same time I like what Konrad said below that computational scientists need to be able to do their job autonomously. It's not mutually exclusive, we should strive for both.
        
        Ondřej Čertík:
        Yes I agree that it's always good to have more than just one person to look over any code.
Ondřej Čertík:
Thanks for the post Konrad. I have a couple of thoughts on this. One is that Fortran would be a great fit for this kind of code, and one thing I am planning to do with the LFortran (https://lfortran.org/) compiler once it is more mature is to give "pedantic" (so to say) warnings or even errors on code constructs that should not be used, even though they are perfectly legal Fortran. This starts with little things like enforcing "implicit none", not allowing "implied save", or flagging floating point declarations without an explicit precision, and other typical pitfalls. In the long run, I am hoping the compiler can detect a lot more constructs that should be discouraged, such as using pointers instead of allocatable arrays. It could even give a warning every time a subroutine has a side effect or a global variable is declared, and you would have to put in some kind of comment documenting / acknowledging that that's what you really want. That way I believe the compiler with excellent warning and error messages can greatly help teach non-expert programmers how to write higher quality code. Part of this is also that in Debug mode, it should check absolutely everything, from integers wrapping around, to any kind of memory issues such as dangling pointers. I think all of this can technically be done.
However, ultimately this goes much beyond just better compilers, and that is the main point of your blog post I think. I personally like C++ for things like writing compilers, but for scientific computing I think it's not great, because every big C++ code that I have seen requires CS experts on the team to keep fixing up issues that the domain scientists introduce. As you also imply in your post.
Fortran is much better suited, but currently it is falling short on its mission: it's lacking tooling, the compiler quality is not great, it does not run on modern hardware such as GPUs, etc. I am trying to fix all that, see e.g., some of our recent efforts:
https://fortran-lang.org/
https://ondrejcertik.com/bl...
But this is something that should have been done 20 years ago, because even if we are 100% successful in our vision, it will still take 5 to 10 years before Fortran achieves it.
But I think it goes even beyond that. Even with a language that is better suited for numerical programming, and an excellent compiler that can guide the user to write using the "best practices", I think one also needs to adopt "modern social practices", which is to post the code as open source at GitHub or GitLab, and build a community around it.
Summary: I think there is a huge opportunity to provide high quality tools for domain scientists to use and we have a long way to go.
- Themos:
  The NAG Fortran compiler can check array bounds, integer overflow, undefined variables, dangling pointers, memory leaks and more. But getting unreliable numbers faster and cheaper has been a siren's call few can resist.
  https://wg5-fortran.org/N19... addresses Fortran vulnerabilities. Documents exist for other languages.
  In my view, the fundamental problem is that (non-CS) research codes are not derived from specifications. Huge parameter spaces abound and they are not explored adequately.
  Without careful tuning of incentives, I can't see how we will end up in a better place.
  - Ondřej Čertík:
    Thanks for the comment. Indeed we use the NAG compiler, it's great and the number of things it can catch is awesome. In my comment above I suggest we explore ways to go even beyond what the NAG compiler can currently catch. Thanks for the link to the N1965 document.
- Konrad Hinsen:
  Thanks for your comments Ondřej. Your work on improving Fortran is very much in line with what I think we (computational science) need. And I certainly agree about developing best practices, which is fortunately already going on.
orca:
this is quite off the mark. the author sets the bar too low for himself by criticizing the most easily (and to be fair legitimately) dismissed criticisms of the Imperial College model by software engineers. Here's a better laid-out critique that the OP doesn't speak to:
The Imperial College modelers released the source code a couple of days ago to the model that shut down the world economy. It's not the original model code but was rather original source code turned over to volunteer programmers who re-wrote it so that is more readable. I have done some model review of financial models in the past but without the source code I would not be able to do a full review of the Imperial College model. Now that we have the source code (sort of), I can.

Any such model ought to have been independently reviewed before it is ever used for real policy decisions. Policy analysis is awash in models but no one ever really checks them. Going forward, health policy makers should ask for and disclose independent validation of any model before using its results to make recommendations of any consequence.

Normally, model reviews are long technical documents but there would also be a summary section. Here's what I think a summary should have looked like.
...

Overall conclusion: this model cannot be relied on to guide coronavirus policy. Even if the documentation, coding, and testing problems were fixed, the model logic is fatally flawed, which is evidenced by its poor forecasting performance.
https://www.facebook.com/sc...
- Konrad Hinsen:
  This is a very different critique that I actually mostly agree with. Policy decisions should indeed be based not just on "science", but on trustworthy scientific findings. How to do that in an emergency is of course a different question again.
MaxSchumacher:
The analogy to cars is flawed, because C++ isn't an end product for untrained users, if you want to stick to the car industry, then C++ is a blowtorch, a tool used by professionals. The scientists shouldn't have used tools they don't understand and base policy recommendations on the output of a blackbox they cannot reason about; admitting ignorance is vastly better than pretending to understand.
I don't believe in the perfect separation of model and implementation: you learn about the world once the code is running and results are produced. One can argue that if you cannot build it, you don't understand it.
- Konrad Hinsen:
  We seem to agree that C++ is not an end user product. But show me a single C++ tutorial aimed at novices that clearly says so! How are scientists supposed to realize that they don't understand a product if all the descriptions of that product tell them "don't worry, it's easy"?
  - MaxSchumacher:
    nobody in the history of the world has ever uttered the phrase "don't worry, it's easy" to refer to C++. It is a famously complex and large language.
    Plenty of C++ books talk about how to write good code and how to use the language, violating those recommendations is akin to putting your dog in the microwave.
    The basics of software quality aren't arcane knowledge uniquely accessible to greybeards, you'll find them in countless entry-level books and blog posts:
    - use descriptive names for variables and functions
    - try to keep functions small
    - use comments for difficult spots
    - document your work
    - test your code vigorously
    - get at least one review on the code
    - use a version control system
    I wouldn't conduct brain surgery and, after failing miserably, complain to the people making the scalpel: "Hey! You should have put a warning label on this!"
    - Konrad Hinsen:
      Me neither. The people I'd complain to are the authors of "Brain surgery for dummies", as well as brain surgeons performing live on television, explaining their techniques. The problem is not proposing power tools, but advertising them to non-specialists.
    - boromict cumbordor:
      first hit for "c++" "don't worry" "it's easy": https://books.google.com/bo...

Wanted: a hierarchically modular software architecture

Konrad Hinsen — 2020-05-05

In his 1962 classic "The Architecture of Complexity", Herbert Simon described the hierarchical structure found in many complex systems, both natural and human-made. But even though complexity is recognized as a major issue in software development today, the architecture described by Simon is not common in software, and in fact seems unsupported by today's software development and deployment tools.

The prime characteristic that Simon identifies in most complex systems is a hierarchical structure. Systems consist of subsystems, which consist of sub-sub-systems, etc. Simon describes the subsystems at each level as "nearly decomposable", meaning that the interactions between subsystems are much less important than the interactions between the parts inside a subsystem. I prefer the shorter term "modular" for this feature, and thus end up with "hierarchically modular" as my label for the architecture that Simon describes in much detail. I won't repeat his arguments for the ubiquity of such systems, so please read the paper - it's definitely worth it, and it's very clearly written.

It may seem as if many of today's programming languages propose exactly this kind of architecture for designing software systems, but a critical inspection shows that they don't. To explain where the problem is, I will use Python as an example because it is widely known, but the arguments apply with some modifications to most other languages as well.

Python's module system is basically a hierarchy of namespaces, with namespaces containing mainly function and class definitions, but also variables referring to arbitrary data objects. Since namespaces are independent, and can contain sub-namespaces, this looks like a perfect match for a hierarchically modular architecture.

One obstacle is that there is no way to combine independently designed modules into a larger hierarchy. Suppose I want to create a software component called ode_solver that uses the popular packages NumPy and SciPy. In a hierarchically modular architecture, implementation details of a component, such as the names of the packages it uses, would be hidden from outside view. The packages would become ode_solver.numpy and ode_solver.scipy. In real Python, they can only remain numpy and scipy, as their authors decided to call them. Independently written software components in Python always live in the globally shared top-level namespace. And since developers are free to modify their packages as they like, this makes the top-level namespace an instance of shared mutable state, universally recognized as problematic in software engineering.
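The shared top-level namespace can be observed directly. The following is a minimal sketch, not a description of any real package: the component names `ode_solver` and `visualizer` are borrowed from the text, the helper `load_component` is hypothetical, and the standard library module `json` stands in for NumPy so that the example runs without any installation.

```python
import sys
import types


def load_component(name: str, dependency: str) -> types.ModuleType:
    """Create a throwaway module that imports `dependency` by name,
    the way an independently written package would."""
    component = types.ModuleType(name)
    exec(f"import {dependency}", component.__dict__)
    return component


# Two hypothetical components; `json` stands in for NumPy here.
ode_solver = load_component("ode_solver", "json")
visualizer = load_component("visualizer", "json")

# Both components were handed the very same module object,
# taken from the interpreter-wide registry sys.modules.
same_object = ode_solver.json is visualizer.json is sys.modules["json"]
print(same_object)

# Shared mutable state: a change made through one component
# is visible through the other.
ode_solver.json.patched_by = "ode_solver"
leaked = visualizer.json.patched_by
print(leaked)

# Undo the change so the sketch leaves no trace behind.
del sys.modules["json"].patched_by
```

Neither component can ask for a private copy of its dependency under a name such as `ode_solver.json`: the import statement always resolves through the one global registry.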

The shared top-level namespace creates a strong interaction between all components at all levels. Suppose I have another component called visualizer that also uses NumPy and SciPy, but requires different versions. That component becomes impossible to combine with my ode_solver because of conflicting version requirements - the well known dependency hell. Another way to look at this is to consider each package's detailed dependency list, with version requirements, as part of its interface.
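To make the conflict concrete, here is a small sketch that treats a version requirement as a half-open range and checks whether two requirements on the same package can be satisfied together. The `Requirement` class, the `compatible` function, and all version numbers are invented for illustration; real package managers use richer version specifiers.

```python
from dataclasses import dataclass

Version = tuple[int, int]  # (major, minor), a deliberate simplification


@dataclass(frozen=True)
class Requirement:
    package: str
    minimum: Version  # inclusive lower bound
    maximum: Version  # exclusive upper bound


def compatible(a: Requirement, b: Requirement) -> bool:
    """Two requirements can coexist in one global namespace only if
    they name different packages or their version ranges overlap."""
    if a.package != b.package:
        return True
    return max(a.minimum, b.minimum) < min(a.maximum, b.maximum)


# Invented version numbers, for illustration only.
ode_solver_needs = Requirement("numpy", (1, 20), (1, 24))
visualizer_needs = Requirement("numpy", (2, 0), (3, 0))
plotter_needs = Requirement("numpy", (1, 22), (2, 0))

print(compatible(ode_solver_needs, visualizer_needs))
print(compatible(ode_solver_needs, plotter_needs))
```

With a single shared namespace, the first pair has no solution at all, even though each component works fine on its own.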

The second obstacle is that the full specification of a module's interface (something that's never ever written down in Python) in general includes classes defined by its dependencies. My ode_solver could, for example, return some value as a NumPy array. That would make NumPy not only a run-time dependency of the code, but also a specification dependency for the interface. If visualizer expects a NumPy array as the input to one of its functions, I'd be in trouble again as the class definition in the two different versions of NumPy might not be the same. And that trouble would not go away if I could migrate NumPy and SciPy inside my component's namespace as suggested above.

Some readers' first reaction is likely to be "that's a symptom of bad specifications" or "that's the trouble you deserve for using a dynamically typed language". However, static typing doesn't solve the problem, it merely shifts it from run time to compile time. It's the types introduced by dependencies that end up in the static interface of a component. The impact on component compatibility is the same. And if that's a symptom of bad design, then good design is not only rare but also actively discouraged by today's software development tools. The only way out I can see is to create wrapper types and wrapper functions in the component that hide the implementation in terms of dependencies. Hands up if you find that idea appealing!
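The wrapper approach might look like the following sketch. Everything in it is hypothetical: `Trajectory` and `solve_decay` are made-up names, the explicit Euler scheme is just a placeholder for a real solver, and the standard library `array` type stands in for a NumPy array so that the example is self-contained.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Wrapper type owned by the component. Its public fields use only
    built-in types, so no dependency's class leaks into the interface."""
    times: tuple[float, ...]
    values: tuple[float, ...]


def solve_decay(rate: float, y0: float, dt: float, steps: int) -> Trajectory:
    """Integrate dy/dt = -rate * y with the explicit Euler method."""
    # Implementation detail: `array` stands in for NumPy here and could
    # be swapped for another library without changing the interface.
    times = array("d", (i * dt for i in range(steps + 1)))
    values = array("d", [y0])
    for _ in range(steps):
        values.append(values[-1] * (1.0 - rate * dt))
    # Convert to the wrapper type before anything leaves the component.
    return Trajectory(tuple(times), tuple(values))


result = solve_decay(rate=1.0, y0=1.0, dt=0.5, steps=2)
print(result.values)
```

The cost is visible even in this toy: every value crossing the component boundary has to be copied into the wrapper, and every consumer has to convert it back into whatever its own dependencies expect.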

The only programming language I know of that does not suffer from this problem is Unison, which refers to functions and data types via hashes rather than names. It's a very young language, so it's too early to say how this feature will change software architecture on a larger scale.
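The basic idea of referring to code by hash rather than by name can be imitated in a few lines of Python. This is a toy illustration only and makes no claim about how Unison is implemented: the `store`, `publish`, and `call` names are invented, and the sketch simply hashes source text.

```python
import hashlib

# Content-addressed store: the key is derived from the definition itself,
# so there is no shared name that two versions could fight over.
store: dict[str, str] = {}


def publish(source: str) -> str:
    """Store a function definition and return its hash."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    store[digest] = source
    return digest


def call(digest: str, *args):
    """Look up a definition by hash, evaluate it, and call its function `f`."""
    namespace: dict = {}
    exec(store[digest], namespace)
    return namespace["f"](*args)


# Two "versions" of the same function coexist without conflict.
v1 = publish("def f(x):\n    return x + 1\n")
v2 = publish("def f(x):\n    return x + 2\n")

print(call(v1, 1), call(v2, 1))
```

A component that depends on `v1` keeps getting exactly that definition, no matter what is published later, which is the property that a name-based top-level namespace cannot offer.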

Programming languages are not the only realm in which we can try to construct hierarchically modular software. It would in fact be preferable to do so at a language-neutral level, to escape from the silos that languages tend to represent. I'd love to be able to combine a component written in Python with a component written in R! So maybe we should try to make hierarchically modular assemblies at the level of compiled binaries.

One candidate would then be Linux' Executable and Linkable Format (ELF), which covers several types of binary files: executables, object files, shared libraries, and more. But there is no kind of ELF file that could represent hierarchically composable modules, as far as I can see. There's no way to combine two shared libraries into a bigger shared library, nor two executables into a larger executable, and moreover every executable has a global namespace that would create the same issues that I outlined above for Python. You can't have an executable that includes or refers to two different versions of the zlib library, for example.

The only approach that looks doable in the Unix world is working at the process level. A software component is then a process based on an executable, and data between processes is exchanged via files or sockets. Choosing a clever hash-based naming scheme (as done by Nix and Guix) makes it possible to keep any combination of versions accessible in parallel. Several processes could be managed as child processes by a superprocess, which would thus represent a component one level up in the hierarchy. In the Web world, a very similar setup could be constructed by making each component a Web service. There isn't much tool support for such techniques, but perhaps the most important obstacle is efficiency issues in the communication between components, which would require serialization and either file storage or network communication.
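A minimal sketch of the process-level approach follows: a parent process runs a child component and exchanges data with it as JSON over standard input and output. The child's code, the message format, and the name `run_component` are assumptions made for illustration, and a realistic setup would launch a separately installed executable instead of an inline script.

```python
import json
import subprocess
import sys

# Source of a hypothetical child component. It reads a JSON request
# from stdin and writes a JSON reply to stdout.
CHILD_SOURCE = """
import json, sys
request = json.load(sys.stdin)
json.dump({"sum": sum(request["values"])}, sys.stdout)
"""


def run_component(values: list[float]) -> dict:
    """Run the child in its own process. Data crosses the boundary only
    in serialized form, which is where the efficiency cost comes from."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps({"values": values}),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


result = run_component([1, 2, 3])
print(result)
```

Because the child has its own process, it could in principle use a different interpreter or different library versions than the parent, at the price of serializing every piece of data that is exchanged.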

The main merit of the two approaches I have outlined in the last paragraph is that they can accommodate legacy code and systems, unlike the starting-from-scratch approach of Unison. With a bit of luck, improved tooling and optimization could turn the process/service-based approach into a viable technique for some types of real-life application, while Unison and perhaps others introduce the same basic idea at the programming language end of the scale of software component technologies. And then, if the concept turns out to be successful for taming software complexity, it might become the norm after a few decades. So much for my daily dose of wishful thinking!

Finally, let me reveal my motivation for writing this post: I hope that someone will prove me wrong. I'd love to see a comment pointing out that I am simply not aware of the right tools and techniques. And you get bonus points for references to actual hierarchically modular software systems that work!

Comments retrieved from Disqus

Konrad Hinsen:
A Twitter comment says that Rust's package management system satisfies my requirements.

Emacs as a malleable system

Konrad Hinsen — 2020-04-03

Malleable systems are software systems that are designed to be modified and extended by their users, eliminating the usually strict borderline between developers and users. Making scientific software more malleable is a goal that I have been pursuing for 25 years, starting with a shift from Fortran to Python as my main programming language, and a simultaneous shift from writing programs to writing toolkits, such as my Molecular Modelling Toolkit first published in 1997. Therefore I was pleased to discover the Malleable Systems Collective, which has just published a post in which I examine what is probably the most successful malleable system in the history of software: Emacs. If you care about users having more influence on their software, check out their site!

The rise of community-owned monopolies

Konrad Hinsen — 2020-02-26

One question I have been thinking about in the context of reproducible research is this: Why is all stable software technology old, and all recent technology fragile? Why is it easier to run 40-year-old Fortran code than ten-year-old Python code? A hypothesis that comes to mind immediately is growing code complexity, but I'd expect this to be an amplifier rather than a cause. In this post, I will look at another candidate: the dominance of Open Source communities in the development of scientific software.

From markets to monopolies

In the 1990s, when I was working on my thesis, the world of scientific computing was very different from what it is now. Innovation was driven by hardware. Processor speeds kept increasing, and new processor architectures appeared on the market in rapid succession. In the course of the 1990s, I did most of my work on Unix workstations based on a variety of architectures: PA-RISC, MIPS, PowerPC, DEC Alpha. I also worked on mainframe computers made by IBM, Fujitsu, and Cray, all using proprietary processors. Each manufacturer sold a package of hardware, operating system, and development tools such as compilers. Compilers implemented standardized programming languages, mainly Fortran and C, with manufacturer-specific extensions that most people stayed away from because they expected to be using different machines a few years later. The computing platforms that everybody was developing for were neither processors nor operating systems, but Fortran 77 and ANSI C, each of which had developed its ecosystem of scientific libraries. For an interactive development platform, add Unix and X11. Mixing Fortran and C was somewhat platform-specific, but very doable as well. Every time I changed labs and computers during my postdoc years, I had to spend a day or two to reinstall everything I needed, but I never suffered software collapse.

Today, hardware innovation in mainstream computing has almost come to a halt. All the processor architectures listed above are gone. The x86 architecture, implemented in chips from Intel and AMD, dominates scientific computing, and in fact all of computing except for mobile devices. Hardware manufacturers therefore no longer supply compilers. For everyday work, most people use the free GNU Compiler Collection or the equally free Clang compiler. For performance-critical work, commercial compilers from companies such as NAG, PGI, or Intel offer better performance and libraries fine-tuned for high-performance computing. The standards defining Fortran and C have evolved, but have maintained strict backwards compatibility.

However, in the everyday life of computational scientists, these traditional platforms have lost importance. A new breed of languages and scientific ecosystems, such as Python, R, and Julia, have become the dominant support for scientific software in many (though not all) domains of research. Their rise has gone hand in hand with software collapse becoming so common that many consider it normal or even inevitable. Scientists are starting to adopt heavy technology with large overheads in terms of complexity and invested effort to work around the problem (if you didn't guess yet, I am referring to containers). I waste a lot more time today with configuration and setup work (including configuration debugging) than I did in the 1990s. How did we get into this sad state of affairs? Is there any hope for getting out of it again?

One reason that immediately comes to mind is increasing software complexity. But that's more of a symptom than a cause. A better explanation would be an increased problem complexity that would then require more complex software. Problem complexity is much harder to measure, but I don't see much evidence supporting this hypothesis. We certainly do bigger computations, on larger datasets, but if I look at today's commonly used models and methods in computational science, they don't look more complex than what I saw in the 1990s. What has increased, however, is variety. Today's science relies on more computational models than it did 30 years ago, and I believe that this contributes to the fragility issue, as I will explain later.

There is another reason that I haven't heard anyone mention so far: the disappearance of technology markets in favor of monopolist players who can count on customer lock-in. This description will probably make you think of Microsoft's grip on the Windows user base, or the "walled gardens" that Google and Apple have created around their mobile platforms. But there is another category of monopoly owner in the tech world that is hardly recognized as such: Open Source communities.

Open Source monopolists

Consider two recent events: Microsoft killing Windows 7, and the Python community killing Python 2. The story is essentially the same in both cases: the creator of a piece of infrastructure software ends support for an old but still widely used version, forcing its users to move on to a later but not fully backwards compatible version. In both cases, a significant part of the user community would have preferred to stick to the older version, as has been nicely illustrated by xkcd. In both cases, the end-of-support decision is a rational one for the producer because supporting old versions is costly. And in both cases, the abandoned users have no other supplier they can turn to, because the producer holds a monopoly on the technology.

Compare this to the diverse market of the 1990s. Producers of infrastructure software could add new functionality and try to win new clients with such improvements, but they could not afford to cause damage to their existing user base because users would simply turn to a competitor. There are many sources for standards-conforming Fortran compilers, but there is only one source for Windows or Python.

I suspect some readers will feel anger at this point. How dare you compare a monopolist business to a community of unpaid volunteers offering their work to the world for free? The crucial point is that I am comparing them as seen from the outside. There is a wide gap between the self-image that Open Source communities have of themselves and the image that they present to the outside world, and I believe that this is a big part of the problem.

Open Source communities tend to see themselves as communities of like-minded people that get organized to work together towards shared objectives. They see themselves much like a sports club that organizes practice sessions for its members, or like a village community that collectively plans its road infrastructure. But this is not at all how Open Source communities present themselves to the outside world. The Web site of a sports club says something like "We are a bunch of people enthusiastic about playing football. If you are as well, come and join us." Now look at the Python Web site. Its first statement, in big letters, is "Python is a programming language that lets you work quickly and integrate systems more effectively." The site is about a product. Its goal is to convince people to use Python, not to join a community. It is more similar to Microsoft's Windows site than to the site of a sports club.

"But..." I hear you say. Open Source. Free Software, as in "free beer" and in "free speech". And everybody can join in, the community is so welcoming! Fine, but that's again the insiders' view, just slightly enlarged to the circle of people whose engagement with the technology is sufficiently deep that they consider joining the community. I suspect that most people who download and install Python the product will never know anything about the community, and many will even use Python without being aware of it at all. What they are aware of is an application or utility written in Python, e.g. Calibre for managing their e-books, or offlineimap for downloading e-mail. In contrast, a true community-oriented piece of software would have a splash screen saying "Welcome to the Python community! Before using this software, please become familiar with how our community works".

Sports clubs and village communities focus on their members' needs, interacting with the outside world by necessity, but only as a side effect. Most Open Source communities are more like political parties or non-government organizations in that they want to have an impact on the outside world. They care about the popularity of their products, and make efforts to increase their mind share. The reward they get in return is not money, but that's the only difference from how a company works. Both Open Source communities and software companies have an interest in attracting new clients and keeping existing ones. Both can retain clients more efficiently by generating lock-in, and so they do.

Note that I am not saying that either one creates lock-in intentionally. For Open Source communities such as Python, which I know sufficiently well, I am convinced there is no such intention. For companies such as Microsoft or Google, I can't know for sure. But from the clients' perspective, it doesn't matter if lock-in is intentional or a side effect.

One particularity about computing technology is that lock-in happens by default. It takes a conscious effort (and thus an incentive) to avoid lock-in. The reason is the fine-grained complexity of software interfaces coupled with the near-zero cost of modifying them. There are so many details that re-implementing an existing interface exactly requires a precise documentation of that interface, a perfectionist attitude, and a lot of time. The markets of the 1990s were made possible only by lengthy and costly standardization processes. Which in turn the participants accepted only because without the markets defined by those standards, none of them could continue to innovate in the field of processor architectures.

Lock-in favors software collapse

So much for communities as monopoly holders. Back to my original question: how did software collapse become normal? I believe that this is a near-automatic consequence of infrastructure software being managed by monopoly holders. The monopoly situation prevents existing users from moving elsewhere, significantly reducing the effort that needs to be made to keep them. All effort can thus be concentrated on gaining new users, which leads to the paradoxical situation that the needs of non-users have a larger weight in strategic decisions than the needs of the user base. With backwards compatibility being costly, boring, and irrelevant to the non-users that matter for the future, why care about it? That is, in my opinion, what happened to the Scientific Python ecosystem starting in the 2010s: adoption by the explosively growing data science community drowned the existing user base. The best strategy for SciPy was then to focus on the needs of the data science people, which also became the primary source for recruiting developers and maintainers.

Which brings me back to what I said earlier: the diversification of techniques in computational science is part of the problem. While the various subdomains of computational science have overlapping requirements, they also have divergent needs. The longevity of code is one aspect whose importance varies a lot, but there are others: the size of a typical computational task, the size of the datasets being processed, the nature of the algorithms being applied, the hardware platforms that matter most, and many more. While in theory Open Source is good for supporting diversity ("just fork the code and adapt it to your needs"), the reality of today's major Open Source communities is exactly the opposite: a focus on "let's all work together". Combine this with the chronic lack of funding, and thus also a lack of incentives for developing the structured governance that would administer funding and create activity reports, and you end up with a large number of users depending on the work of a small number of inexperienced developers in precarious positions who cannot reasonably be expected to make an effort to even understand the needs of the user base at large. In a way, software collapse is a consequence of Conway's law applied to Open Source communities.

Can we do better?

Given that today's tech world is dominated by software and Open Source communities, rather than by hardware-producing companies, is it possible to return to a market situation with no or weak lock-in? I don't think so. Standards-based markets can only form when there are multiple competing producers right from the start. In contrast, Open Source communities start out small and adventurous, with a few growing big and becoming infrastructure suppliers. In the beginning, they have no competition, and when they are big, new communities cannot possibly start to compete with them in the mindshare market. Which leaves two possibilities: Open Source communities could become more user-oriented, or the maintenance of infrastructure software could be ensured by other types of organizations. Let's start by looking at the first possibility.

An important first step would be Open Source communities recognizing that they are developing and selling products to a user base that extends far beyond the circle of potential community members. A good time for that would be just now. Many Open Source communities have recently realized that the shared idealistic goal of an Open Source world is not sufficient for ensuring respectful collaboration, and have reacted by introducing codes of conduct. What I am suggesting here is a similar approach for making the relation with the user base more explicit. The absence of a legal contract between developers and users is one of the core principles of Open Source, but that doesn't imply the absence of moral obligations. Any organization that wants to have an impact on the outside world must consider how this impact affects the life and work of other people. It should then define moral commitments, in writing, even if the license prevents them from being legally enforced. A nice example is the Big Data Biology Lab Software Tool Commitments.

Open Source communities could also more actively solicit feedback from the outside. Getting useful feedback from low-engagement users is difficult, but there are proxies, for example the people who package software for various distributions.

But perhaps Open Source communities are just not the right form of organization for infrastructure software. There are other entities that create Open Source software, such as the Mozilla and Apache foundations, or hybrids such as the Pharo community with the Pharo consortium and the Pharo User Association providing channels for users to influence development. It seems probable that more useful organizational forms are waiting to be discovered. In fact, a good guess is that software should best be managed much like other scientific infrastructure: by specific institutions that ensure long-term funding and provide software as a service to research communities.

Comments retrieved from Disqus

Konrad Hinsen:
An interesting related blog post: In my culture: the responsibilities of open source maintainers.
Luis Pedro Coelho:
Thanks for the shout out!
One factor that has impressed me is how shallow some of these "communities" are. Even Python, there are only a handful of big committers to the core (I think barely more than 20 over the whole lifetime of the project! which is barely more than 1 or 2 at any given time).
I think the Linux kernel may have some deeper community, but many of these central projects are a handful of individuals. (The Linux kernel is also known for keeping backwards compatibility, but I think that's Linus' personal values rather than just a function of the size of the community: most of his most famous angry rants are about this very topic: do not break other people's code).

Pharo year one

Konrad Hinsen — 2019-12-31

It's the season when everyone writes about the past year, or even the past decade for a year number ending in 9. I'll make a modest contribution by summarizing my experience with Pharo after one year of using it for projects of my own.

My first contact with Pharo happened a bit more than one year ago, when I signed up for the Pharo MOOC in October 2018. But following a MOOC means working on exercise problems defined by someone else. Getting a real feeling for a programming system requires moving on to problems you actually care about. That's why I started three Pharo-based projects in 2019. The main one is the Pharo edition of ActivePapers, the other ones are an exploration of the Interplanetary File System (IPFS) and a second implementation of my digital scientific notation Leibniz. In all these projects, the user interface is an important aspect, because that's one of my major motivations for using Pharo. However, instead of the standard Pharo user interface framework, which is an evolution of the original Smalltalk user interface of the 1980s, I used the Glamorous Toolkit, a complete redesign with many interesting new ideas. Perhaps the most significant innovation in the Glamorous Toolkit from my perspective is the introduction of a computational document. It resembles the fashionable computational notebooks in many ways, but differs in being an integral part of a live programming system.

As I wrote in my initial blog post on Pharo, I started out by exploring the system using the tools it provides for that purpose. In retrospect, this is clearly the strongest aspect of Pharo. The combination of code browsers, code search, object inspection, and execution inspection (via a tool misleadingly called a debugger) is an extremely powerful way to understand complex software systems. The best evidence is that I was able to write useful and non-trivial extensions to the Glamorous Toolkit, which still is rapidly evolving alpha-stage software and, judged by standard metrics such as lines of documentation per line of code, badly documented. But such metrics make no sense in a system in which searching the code base is faster than documentation lookup in standard environments. Going back to such environments after working with Pharo is a very frustrating experience.

Note that I am not saying that the Pharo environment is perfect. For my taste it requires way too much mouse use. I am still much more productive in Emacs than in Pharo for tasks supported by both, mainly because I can keep my hands on the keyboard. I also find the standard code browser in Pharo too limiting in only showing one method at a time. The Glamorous Toolkit is a clear improvement in that respect. But all the criticism I can come up with is about details that can be fixed, whereas the main defects that I now see in almost every other software development environment are much more fundamental: they suffer from a barrier that separates development tools on one side from the code under development on the other side.

Similar remarks apply to the Smalltalk language on which Pharo is built. It's a minimal programming language that puts its object system in center stage and pushes as many features as possible into its libraries. That's certainly an interesting point in design space to explore, but I'd personally prefer to have a couple of important concepts (for example immutable objects) as language features, rather than as implementation details of class hierarchies. But then, no language is perfect, and Smalltalk is certainly good enough for my needs.

The most serious problem that I have with Pharo is that I don't see how I could use it productively for my own research in computational biophysics in the near future. There is a small computational science community around Pharo (see e.g. this list of scientific libraries), but most of the infrastructure code that I'd need is missing. Moreover, Pharo evolves too rapidly for the kind of computational research that I do (see my critique of the SciPy ecosystem for some background information). Finally, reproducible computations remain a challenge because there isn't much of a support infrastructure for reproducibility in Pharo so far, although the recent work on bootstrapping is an important first step.

On a longer time scale, I can imagine Pharo replacing Emacs as my main user interface to computing, with the hard-core science written in different languages but interfaced to Pharo. I expect IPFS to play an important role at the cross-language interface, for various reasons that deserve an entire blog post on their own. However, it takes a lot of not-yet-written code to get there. Too much to define this as a realistic goal for myself. This means that my future use of Pharo mainly depends on the directions taken by the Pharo community over the coming years. I am pretty sure that Pharo will remain an important tool in my toolbox, I just don't know what its exact role will be.

Industrialization of scientific software: a case study

Konrad Hinsen — 2019-11-12

A coffee break conversation at a scientific conference last week provided an excellent illustration for the industrialization of scientific research that I wrote about in a recent blog post. It has provoked some discussion on Twitter that deserves being recorded and commented on in a more permanent medium. Which is here.

I was chatting with a colleague whom I have been meeting on such occasions for about 15 years. He asked me if I was still developing my Molecular Modelling Toolkit. I replied that I had stopped working on it because the end of support for Python 2 in 2020 would quickly make it too hard to use for most of its intended audience, and that I had neither the means nor the motivation to port it to Python 3. He was quite surprised by my explanations, since he had never heard of the end of support for Python 2, though he did know that there was also a version 3 that was a bit different. His own data analysis scripts were still Python 2 because he had never seen a good reason to even look at Python 3 - never break a working system! But he was alarmed by my prediction that Python 2 would soon disappear from Linux distributions, as he relied on Ubuntu (regularly updated by his lab's systems administrator) to provide him with Python 2 and the few libraries he used.

I was not surprised, as I have had similar conversations with various colleagues over the last years. In particular when someone contacts me with a Python question, which happens quite frequently as I have the reputation of being a Python expert in my little corner of science. The typical profile of these people is experimentalists who write and use small data analysis scripts, but for whom computation is not the central part of their research. They picked up Python from a colleague or a student, or perhaps through attending a short introductory course (such as a Software Carpentry workshop). They have a Python installation on their machine, which is managed by someone else. For them, Python is "just there", exactly like other Unix basics such as sh, or grep. Moreover, Python has been part of their computing life for many years, often for their entire scientific career, and it has never caused them any trouble.

When I mentioned my coffee break conversation on Twitter, Greg Landrum commented that he would expect every Python user to make an effort to stay informed about important Python news, so everyone should by now have heard of the end-of-life decision for Python 2. This reminded me of an earlier Twitter conversation with Stefano Zacchiroli, who expressed similar views. As did other actors of the FOSS universe in various real-life discussions. There seems to be a widely shared expectation among FOSS developers that users should follow news about the software they use and take the required steps to adapt to "mandatory changes", as Stefano put it. My story illustrates that this is not happening. There is a category of users who (1) don't follow development news and (2) expect the software they use to stay around forever without major breaking changes.

This is exactly the phenomenon that I call the industrialization of scientific software. Some software packages, such as the core of the Scientific Python ecosystem, become so popular beyond their core community that for an important part of their users they are industrial products, something they obtain once and then use without thinking much about its origins or possible evolution. One sign of a piece of software becoming an industrial product is its inclusion in standard Linux distributions, where it is just one package out of many that users can choose from. Linux distributions take the role that department stores have for material goods, providing a platform for window-shopping and acquisition via a standardized procedure. For users who get their software from a Linux distribution, all software looks a bit alike. They have no reason to be more careful about Python than about sh or grep.

Just like material goods industries, the developers of industrial software, FOSS or not, have no easy way to communicate with their clients. If such communication becomes inevitable, as for example in the case of a product recall for safety reasons, an enormous effort must be deployed to ensure that the message reaches most of its audience. Pierre de Buyl made a suggestion along these lines, proposing to put up posters with an explanation of the Python 2->3 transition in every research lab. Asking research funders to support such an action would be an interesting experiment.

Is there anything that FOSS communities can do to prevent such miscommunication in the future? A look at industrial material goods may provide inspiration. Every non-trivial technical product comes with a user manual, which typically starts with pointing out safety precautions that users are expected to be aware of. Do this, don't do that, watch out for exceptional situations. The documentation of software packages could do the same, and tutorials could then emphasize the message when explaining the product to potential future customers. Here is what such a warning could look like:

This software package is developed for cutting-edge scientific
research. Our priority in development is to improve the software
and to adapt it for the needs of future applications. As a consequence,
we cannot maintain client code compatibility indefinitely.
Users of this package are expected to check the release notes
(available at http://...) at least once per year, and to adapt
their code to changes in the interfaces explained there.

I would expect such a notice in the introduction to the SciPy Lecture Notes, for example. It describes the SciPy ecosystem, comparing it to alternative choices, but says not a word about what users need to do to safely use this ecosystem in their research work. As I said in my previous post, the FOSS community has largely been blind to the consequences of software industrialization, maintaining the outdated view that developers and users form a single community. It's time for an upgrade.

Note added after the initial publication: Dan Katz commented on Twitter with a reference to this very clear statement on the development priorities for Matlab. It would be very helpful if FOSS communities published similar statements about their products.

The industrialization of scientific research

Konrad Hinsen — 2019-10-29

Over the last few years, I have spent a lot of time thinking about, speaking about, and discussing the reproducibility crisis in scientific research. An obvious but hard to answer question is: Why has reproducibility become such a major problem, in so many disciplines? And why now? In this post, I will make an attempt at formulating a hypothesis: the underlying cause for the reproducibility crisis is the ongoing industrialization of scientific research.

First of all, let me explain what I mean by industrialization. In the production of material goods, this term stands for a transition to high-volume production in large sites (factories), profiting from economies of scale. This doesn't directly carry over to immaterial goods such as information and knowledge, which can be copied at near-zero cost. There are, however, aspects of industrialization that do make sense for immaterial goods. The main one is a clear separation of producers, who design and make products for an anonymous group of potential clients, and consumers who choose from pre-existing products on the market. This stands in contrast to 1) producing for one's own consumption, and 2) commissioning someone else (e.g. a craftsman) to make a personalized product. Both of these approaches lead to products optimized for a specific consumer's need, whereas industrial products are made for a large and anonymous market.

In scientific research, immaterial industrial products are a recent phenomenon. The ones that I will concentrate on are software and datasets that are publicly available and used by scientists outside of any collaboration with their authors. Twenty years ago, this would have been a rare event. Most software was written for in-lab use, and not even made available to others. Only a small number of basic, standardized, and widely used tools, such as compilers, were already industrial products. Most data were likewise not shared outside the research group that collected them. The resulting non-verifiability of scientific findings was an obvious problem, and led ultimately to today's growing Open Science movement. However, the Open Science movement goes well beyond asking for the transparency that is fundamentally required by the scientific method. It wants software and data to be reusable by other scientists and for different purposes. This is stated most explicitly by the FAIR data label, in which the R stands for reusability. Open Science thus turns software and datasets into industrial commodities.

The knowledge gap

A characteristic feature of industrial products is that consumers know much less about them than producers. Consumers cannot ask for personalized explanations either, unlike in the case of a product tailor-made by a craftsman. For material goods, this has led to a wide range of professions, institutions, and regulations designed to help consumers choose suitable products and to protect them against producers' abuse of their superior knowledge. Examples are consumer protection agencies, independent experts, technical norms, quality labels, etc. For the industrial products in scientific research, we have no established equivalents yet, and it is not even clear if we can ever have them. And that is, in my opinion, a major cause of the reproducibility crisis.

One piece of evidence is the nature of the cases discussed in the context of the crisis. Reproducibility has been an issue with experiments since the dawn of science, and yet experimental non-reproducibility never shows up in the examples cited. This is not because it is unimportant, but because it is well understood. Experimentalists of all disciplines know what ought to be reproducible in their field, and to which degree, and even the most theoretically minded theoreticians understand that experiments necessarily come with uncertainties. The issues that do show up in the catalogs of non-reproducible results are related to two specific research tools: statistics and computers. Both are recent, and both are routinely used by scientists who do not fully understand them. In other words, their users are consumers of industrial products who lack guidance in their choice of tools and methods.

Side note: I can almost hear some readers complain that statistics are nothing recent, going back to Arab mathematicians who lived 1000 years ago. You are right. What is recent is the widespread use of statistics in science. Before computers, statistical methods had to be applied manually, keeping them simple and the datasets small. The kind of statistical inference whose results turn out to be non-reproducible, e.g. in psychology, would not have been possible without computers.

As an illustration, consider the common use of p-value thresholds for deciding on significance. Anyone who understands the statistical framework to which p-values belong (hypothesis testing) agrees that most uses of such thresholds in the scientific literature make no sense. The fact that they are widely used nevertheless thus shows that most people who deal with them, as authors or as reviewers, do not understand statistical hypothesis testing sufficiently well. And since the abuse of p-values has been going on for a while, it has now become a de-facto accepted practice, to the point that the people who do understand its absurdity have a hard time being heard. The same can be said about the abuse of journal impact factors for judging the authors of scientific articles, which is a sign of CVs and publication lists becoming industrial products as well.
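One frequently discussed way in which a fixed p-value threshold can mislead is multiple testing. It is not necessarily the problem the paragraph above has in mind, and the sketch below is only an illustration under stated assumptions: all null hypotheses are true, the tests are independent, and p-values are therefore uniformly distributed between 0 and 1. The function names are hypothetical.

```python
import random


def family_wise_error(alpha, n_tests):
    """Probability that at least one of n_tests independent tests yields
    p < alpha although every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_false_positive_rate(alpha, n_tests, n_studies, seed=0):
    """Fraction of simulated studies reporting at least one 'significant'
    result when there is no real effect anywhere."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Assumption: under a true null hypothesis, the p-value of a
        # continuous test statistic is uniform on [0, 1].
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for n_tests in (1, 5, 20):
        exact = family_wise_error(0.05, n_tests)
        simulated = simulate_false_positive_rate(0.05, n_tests, 10_000)
        print(f"{n_tests:2d} tests: exact {exact:.3f}, simulated {simulated:.3f}")
```

With a threshold of 0.05 and 20 independent tests, the exact formula gives a probability of roughly 64% of finding at least one "significant" result in pure noise, which is why a threshold applied without regard to the number of tests says little.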

The root cause of computational non-reproducibility is an even better illustration of software becoming an industrial product. I noticed that many scientists who have never experienced reproducibility issues themselves find it hard to imagine that they can exist. After all, 2 + 2 is 4, today and tomorrow. What happens when two people obtain different results from "the same" computation is that they performed in fact different computations (using different software) without being aware of the difference. Software has become ever more complex over the last decades, but software developers have also made an effort to hide this complexity from users - with great success. Most scientists are surprised to learn that when they run that little script sent by a colleague, they are really using hundreds of software packages written (and modified frequently) by hundreds of people over many years with only loose coordination. It's not only those hundreds of packages that are industrial commodities, but even the assembly of all those pieces, for example a Linux distribution.
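To make the hidden software stack a bit more tangible, here is a minimal sketch in Python that records which interpreter and which installed packages a script actually runs on, and compares two such records. It uses only the standard library (`importlib.metadata`, `platform`); the function names `environment_snapshot` and `differing_packages` are hypothetical, and the snapshot covers only Python-level packages, not compilers, system libraries, or the operating system.

```python
import platform
from importlib import metadata


def environment_snapshot():
    """Return the Python version, the platform string, and the versions
    of all installed Python distributions visible to this interpreter."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def differing_packages(snapshot_a, snapshot_b):
    """Map each package name to its (version_a, version_b) pair for all
    packages whose versions differ; None marks a missing package."""
    pkgs_a = snapshot_a["packages"]
    pkgs_b = snapshot_b["packages"]
    differences = {}
    for name in sorted(set(pkgs_a) | set(pkgs_b)):
        version_a = pkgs_a.get(name)
        version_b = pkgs_b.get(name)
        if version_a != version_b:
            differences[name] = (version_a, version_b)
    return differences


if __name__ == "__main__":
    snapshot = environment_snapshot()
    print(snapshot["python"], snapshot["platform"])
    print(len(snapshot["packages"]), "installed Python packages")
```

Two colleagues who each save such a snapshot next to their results can at least check whether "the same" computation ran on the same Python packages, although an identical snapshot does not guarantee identical results.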

What can we do?

We can look at the much better understood industrial production of material goods for inspiration for possible solutions. A complex industrial product, such as a car or a television set, comes with a user manual and perhaps an obligation for user training, such as obtaining a driver's license. Moreover, technical norms impose precautions on producers to make their products safe to use by non-experts. Independent experts evaluate products and publish reports that guide consumers in their choice. These approaches can be adapted to scientific software and statistical methods, but that work remains to be done.

I expect reproducibility to play a major role in this, as a quality label. A reproducible result can still be wrong, but nevertheless reproducibility guarantees the absence of some kinds of common problems. We need additional, complementary quality labels of course, and in fact we have a few, such as the presence of test suites for scientific software, or the existence of provenance metadata for datasets. But this is only the beginning. We do not yet know how to make data and code an industrial product that is safe to use by others, nor do we know how to prepare scientists for working in such an ecosystem. Best practices, even good enough practices, remain to be established.

Experts will likely be another ingredient of a solution. I suspect that most statistics-related problems could be solved by requiring that every publication making a claim based on statistical significance be validated by a trained statistician. We will have to figure out how to organize this validation. One possibility is to create independent certification agencies, similar to cascad for computational reproducibility, that employ qualified statisticians and deliver validation certificates that will figure prominently in a paper.

It's not just software and data

As I said above, I have focused on data and code because the computational aspects of science are what I am most familiar with. But industrialization isn't limited to computing. Even the good old journal article is slowly turning into an industrial product. With approaches such as meta-analyses or content mining, scientific papers are being used by people who are not part of the community that their authors belong to, and may thus not have the tacit knowledge shared by that community which might well be necessary to fully appreciate the published results. Interdisciplinary research is also a source of potential misunderstandings due to unshared tacit knowledge.

We can also see industrialization in the management of science. In fact, the term "management" in itself implies some form of industrialization. Unfortunately, management principles from the material goods and service industries are being applied uncritically to scientific research, leading to phenomena such as the abuse of the journal impact factor to measure an individual's productivity, or the attribution of budgets based on multiple-year predictions of research outcomes (called "grant proposals") that lack any credibility. This suggests that the people who design these management practices consider science itself a commodity, as an industry that can be run just like any other industry. There is, however, a crucial difference: whereas the production of material goods is by necessity based on well-known technologies and processes (otherwise their deployment at scale would be bound to fail), research is all about the unknown. Scientists can describe directions they want to take, but not promise to reach specific goals in the future. Science is intrinsically a bottom-up process, whereas management is about top-down organization.

Open Source and Open Science

Back to software, there is one aspect that deserves further discussion: the role of the FOSS (free/open source software) approach that has been gaining traction in research over the last decade, and that has furthermore inspired much of the Open Science movement. The origin of the FOSS movement can be seen as a rebellion against the industrialization of software, which made it difficult to impossible for users to adapt it to their needs. The widely shared story of Richard Stallman's fight against a proprietary printer driver (see here for example) is a nice illustration. Initially, the FOSS movement focused on establishing legal means (licenses) to protect software from becoming proprietary. More slowly, and less explicitly, it worked towards a view of software development as something a community does for its own needs, with the ideal that anyone sufficiently motivated should be able to join such a community and participate in the development process. This was a reasonable proposal in the 1980s, when software was simpler and most computer users had by necessity some programming experience.

Today's situation is very different. Most software has the status of an industrial product for most of its users, whether it's FOSS or not. In theory, anyone can learn anything about FOSS and participate in its evolution at all levels. In practice, the effort is prohibitive for most, and nobody today can envisage understanding all the software they depend on, let alone contributing to its development. As I explained above, it has even become close to impossible to just keep track of which software one depends on. From a user's perspective, the development communities of FOSS projects are industrial software producers just like commercial companies. In a way, FOSS users even have less power because the developer communities have no legal or moral obligations toward their users at all. There are a few cases of institutions that permit users to influence and support the development of FOSS, for example the Pharo consortium or the Inria foundation, but they are the exception rather than the rule.

In science, the FOSS ideal of communities producing software for their own use works very well for domain-specific software packages, whose developers are a representative subset of a well-defined scientific community. But infrastructure software that is used across many scientific disciplines will invariably end up being an industrial product for most of its users. This is true for most of the Scientific Python ecosystem, for example, and also for the statistical software universe that has grown around the R language. Note that I am not saying that the FOSS approach has no advantages there. Open source code is very important to ensure the transparency required for making science verifiable. What I am saying is that openness is not enough to ensure that software is a safe-to-use industrial product, nor does it provide a mechanism for keeping a product's evolution in sync with the needs of its user base.

Whereas the FOSS community has largely remained blind to this issue, the Open Science movement seems to be more aware of the pitfalls of "just" being open, at least for data. The I and R (interoperability, reusability) in FAIR are the best evidence for this. For now, they remain ideals for which practically usable implementations remain to be defined. Perhaps this will lead to a more careful consideration of reusability for software as well. As with the material goods industries, the key is to recognize users and educators as stakeholders and ensure that their needs are taken into account by producers. Open source communities working on widely used infrastructure software could, for example, adopt a governance model that includes representative non-developing users. Funders of such communities could make such a governance model a condition for funding. But the very first step is creating an awareness of the problem. Development communities should openly state their ambition. It's OK to develop software for use inside a delimited community, but then don't advertise it as easy to use for everyone. It's also OK to aim high and work on general-purpose infrastructure software, but then explain how users can make themselves heard without having to become contributors themselves. Being "open" is not enough.

Comments retrieved from Disqus

asmeurer:
Software, like all systems, does not just continue to work so long as you don't break it. It only works because people continuously work to keep it from breaking. Imagine if your city builds a bridge. Some years later, there is a bond election to pay for costs for the bridge. Now consider a voter who votes against the bond, saying, "they already built the bridge, why do they need more money? As long as they don't tear it down, it should continue to work." This is of course ridiculous. Bridges and roads require maintenance, or they will degrade. They do not just have a one time cost. Software is the same way. Even though the bits that make up the source code of software are just as immutable as the atoms of concrete in the bridge, it still requires ongoing maintenance or it will rot, just as the bridge will start to develop potholes, and eventually start to crumble if it is not maintained. The ecosystem of software and hardware that a piece of code runs on and alongside must be considered as part of the system, just as the cars should be considered as part of the system of a bridge.
The other thing to understand is that for open source software, this maintenance is provided almost exclusively by unpaid volunteers. I wonder how much your colleague has given to NumFOCUS, since he expects the software to be supported indefinitely. I would encourage you to show him this https://www.fordfoundation.....
Maintaining Python 2 support means splitting this development effort away from the development of new features, the fixing of bugs, and so on. It also means keeping a large amount of technical debt (I've written about this here https://www.asmeurer.com/bl.... Actually, if you want to continue to use Python 2, you can. What you can't do is expect the volunteers who work on CPython to continue to work on it, or the volunteers to work on libraries to continue to support it in addition to Python 3, or the volunteers who work on Linux distributions to continue to support it. These things would all require ongoing development efforts (see my first paragraph). You are of course free to pay a vendor to continue to provide Python 2 support for you (I'm sure some will pop up if the market demand is there), or attempt to fix any holes in the support yourself.
- Konrad Hinsen:
  Hi Aaron,
  thanks for your comments!
  Before giving my point of view on your first paragraph, let me reply to your second one, which is really the topic of my post. My colleague doesn't expect software to be maintained indefinitely by someone else for free. His expectation of Python being just there forever is nothing but an extrapolation of past experience. He has no idea about how software maintenance works, nor any opinion on how it should work. And even after our coffee break conversation, he probably has no more than a foggy notion of all that. Coffee breaks are way too short, as we all know.
  As for your statement that "software only works because people continuously work to keep it from breaking", that's a self-fulfilling prophecy in my opinion. What breaks software package A is a breaking change in its dependency, software package B. If everybody introduces breaking changes all the time, all software will break all the time, and your statement becomes true. In a world where everyone avoids breaking changes, software can work for a very long time without any maintenance. I have 25 year old Fortran programs that still work, as do the shell scripts that coordinate them. Software is as stable as its developers want it to be. What you can't have, given today's state of the art, is stable software *and* rapid improvement in functionality. That's a choice that developers must make. And then they should make a clear public statement about their choice.
  - asmeurer:
    Right, I don't think there is any malice. For the most part, it is just ignorance of how open source maintenance works. Usually once you explain this to someone, they get it, but by default people don't think about it and they assume that things that just work will continue to work, and don't really consider that they only work because there are people out there who dedicate time or money to making them work.
    Even for Fortran there is a maintenance cost. Every Fortran compiler has to support multiple versions of the language, and any compiler that works on a modern machine is necessarily being actively developed, because the architectures of 30 years ago aren't the same as the ones today. So ultimately, "breaking changes" will always happen *somewhere* in the stack, unless you are exclusively using 30-year old software on 30-year old hardware. Your colleague can continue to use his Python code by not updating Python from Python 2, except it won't be available on the latest Linux distro. He can avoid updating Linux, except old versions of Linux won't work on newer hardware. He can avoid updating his hardware, except hardware eventually dies.
    - Konrad Hinsen:
      Yes, Fortran compilers are being maintained. Fortran (in any of its standardized versions) is what I call a stable platform. Compiler developers work on avoiding collapse from below, in order to ensure that programmers in the software stack above needn't worry about it. And they work on improvements that any particular user might care about or not (speed, new versions of the standard, new hardware...).
      But saying that Fortran requires maintenance hides enormous differences in degree. I am pretty sure that the first release of GNU Fortran for Linux would still work on a modern Linux, though you may have to install support for 32-bit code first. All of the software stack in the PC world has been very stable. People upgrade because they want new features or other improvements, not because they face software collapse.
      An interesting historical side note: all software platforms that go back to the 2000s or earlier are stable. All the ANSI standard languages, but also the JVM or the Linux ecosystem as a whole. Rapidly changing platforms are a recent phenomenon. What happened? One hypothesis: the advertising business, with its extreme short-term focus, became an important driving force for technology.
      What really bothers my experimentalist colleague is the risk of Python 2 dropping out of Linux distributions, because that's what makes Python easily accessible. You can't afford not to update Linux these days, for security reasons. Maybe a conservative distribution such as CentOS will keep Python 2 for some years to come.
Stephen Kell:
Thanks Konrad... another very thought-provoking post. I agree with your basic premise that FOSS-style culture of "fix it yourself", or more generally of conveniently identifying users with developers, doesn't match today's large-scale patterns of software distribution and co-evolution.
However, there's an elephant in the room: who is right? Why *shouldn't* Python 2 be around forever? Why is there any category difference between Python and (say) awk, or sh?
I lean towards the view that there shouldn't be, that this is another instance of the language-implementer tail wagging the working-programmer dog. (The mess known as "FFIs" is another massive example of this.)
This is cultural... stereotyping wildly, "PL" people often see dictating change to users as their prerogative; "systems" people often don't share this (e.g. witness Linus Torvalds's strong insistence on backwards compatibility).
The analogy with instruction manuals is also problematic. My toaster's instruction manual literally says "Caution: do not insert any objects into the toast slot". More generally, these sorts of things often contain advice that is practically unfollowable, but exists to cover the backside of the manufacturer. It may or may not be legally enforceable, but my point is that this isn't necessarily the right culture to be inspired by. Putting signs and disclaimers everywhere seems like an "ambulance at the bottom of the cliff" solution. It's ducking the big question: how can we structure the "material" of software so that the things people quite reasonably want to do are the things that actually work?
I have some thoughts on that question, but would love to hear yours first. :-)
- Konrad Hinsen:
  Hi Stephen,
  thanks for your comments!
  I have been thinking about that elephant for a while, but I deliberately left it out of this blog post in order to concentrate on what I hope to be more consensual: that mutual miscomprehension between developers and users is a problem we collectively need to work on. But I'll happily come back to the elephant :-)
  For me, there is no category difference between Python 2 and sh or awk. They are just on almost opposite ends of a spectrum. As you say, the root of the difference is cultural. In the Unix/"system" approach, there is an ideally small set of infrastructure software that defines the rules of the system and which ought to be a stable basis that nobody perturbs without a very good reason. I'd say that sh belongs to this infrastructure, but awk probably not. Another principle more specifically for Unix is the famous "tools that do one thing but do it well". Small tools are easy to keep stable as well. Integration of these tools for solving a specific problem is someone else's job, meaning that Unix is designed for power users.
  Python, on the other hand, started out as a "batteries included" supertool, and needs to evolve constantly in order to remain the top supertool for many tasks. Unlike Unix tools, the different parts of the Python standard library cannot evolve independently at their own rhythm, which encourages an attitude of embracing change and pursuing it as a goal in itself. Compatibility is then not only seen as a waste of effort, but also as a sign of attachment to the past. Moreover, working on a supertool puts developers in a god-like position. They are not creating a humble part of a system, but a system on its own that transcends mere operating systems. And as you suggest, programming languages are probably the extreme case of god-like power.
  Another aspect is the size and structure of the communities. Unix is anarchy: everybody does their tool, period. No annual conferences, no governance, no code of conduct, no bureaucracy managing formal enhancement proposals. Python started out the same way, and was very stable in its early years. Today's Python community is big and organized. Such communities require shared beliefs, and "software changes" is one of these beliefs in the Python community. It probably helps that society at large is obsessed by innovation, even without any associated goal of improvement.
  Concerning your comment on instruction manuals, I agree that today's legalistic attitude has pushed alerts and warnings beyond the limit of the reasonable. Maybe I should have written "instruction manuals as they were 30 years ago".
  Finally, the big question. I doubt there is one general answer to it, so I will stick to what I know best: research in the natural sciences. I have tried to be neutral in my description of the ongoing industrialization, but I consider it mostly a bad development, with few but important exceptions. For software, the main exceptions are well-understood compute-intensive procedures in simulation and data analysis. For everything else, I believe we need more anarchy and more control over software in the hands of each individual scientist. Meaning small understandable building blocks, rather than the monolithic libraries of today's SciPy ecosystem, however convenient that may be for getting a job done quickly.
  Am I now entitled to learn about your thoughts on the big question ? ;-)
  - Stephen Kell:
    Thanks for the thoughtful reply. And sorry for the delay... I thought I had posted this, but I had merely written it.
    Firstly, I agree that keeping to consensus-inducing topics is often a good tactic... apologies for charging off in a revolutionary direction. :-)
    "Batteries included" versus "one thing well" does identify a difference. Then again, Unix itself is in some sense a "batteries included" system and has a community process of sorts in the form of POSIX... its rate of change is tempered by both standardisation (slow) and plurality (many implementations). Perhaps if Python had several widely used implementations, the 2-vs-3 issue would have gone very differently... as you say, culture is a major factor.
    To answer your question... I see the whole "supported versions" issue (not just in Python) and the burden of porting software to "keep up", as most immediately a consequence of two big but very concrete problems that our operating systems and programming tools set us up with. Solving them is already currently "possible" but uneconomical.
    The first problem is that software packaging has no notion of isolation. Without extra effort, I can't have version X of some library/program installed and also version Y, because they collide/interfere with each other (e.g. they may want to install things at the same path, but also more directly that A links with B, say). The notion of "install" doesn't distinguish "coexistence" (both are available to me) and communication (both intentionally interact, including by presence in a shared namespace). This is a fairly direct consequence of Unix-style linking and sharing of the filesystem namespace. The right redesign of those could solve it, and I believe it needn't be very invasive. Some package managers do attempt something like this, but I've yet to see one that really goes deep enough.
    The second problem is that critical fixes are not isolated from general development. In order to get implementation fixes for a given piece of software, you have also to get interface "fixes". For example, if a security bug is discovered that dates back to a library version N, it will probably only be fixed in version N+k. The interface of that version is probably different, so you have to port your code. I've not seen much focus on black-box approaches to security defence ("block, not patch"). Again, I'd argue this can be traced to Unix -- if all you have is opaque byte streams, recognising bad input is a tall order because it must be coded from scratch each time -- but by evolving Unix we can fix it. A memory-safe C will also help here (am working on it!).
    Of course the reality of both of these is more complicated than I've made out. But in a world without both of these problems, I think widespread (cultural) expectations around software's "continued workingness" would be very different, because the "support" of large institutions wouldn't be necessary to keep a given codebase running acceptably. (And I did write even more about all this, but I think that rather than rambling away here, I should save the details for a blog post of my own....)
    - Konrad Hinsen:
      Thanks for taking the time to write and the initiative to actually post this reply :-)
      Python actually has multiple implementations. I don't know how widely used Jython, IronPython, and PyPy are these days, but I'd say it doesn't matter. The consensus in the Python community is that CPython is the reference implementation and everyone else has to follow.
      The first problem you describe looks like an early case of "convention over configuration". Recent package managers (I am thinking of Nix and Guix) are switching to explicit configuration, which does indeed solve most of the problems of what Windowsians call "DLL hell", except for one: the case where A depends on B and C, with B depending on D v1 and C on D v2. The only attempt I am aware of to solve that problem is Unison (https://www.unisonweb.org/), which refers to dependencies by hash code rather than by name.
      Your second problem looks much harder to me because it involves so many different aspects: culture, economics, power relations, etc. I am not convinced that enough people actually want to solve the problem, whose continued existence provides market dominance to some, and employment to others.
      So... I am looking forward to your blog post!
  - asmeurer:
    It's curious that you consider the SciPy ecosystem to be monolithic. It's generally considered to be built out of building blocks. A typical scientific workflow will require several libraries, which work together but are developed separately. If you want to do plots, you will use matplotlib or some other plotting library. If you need basic scientific functions you will use numpy or scipy, and for something more domain specific you will use a domain specific library, and so on. Contrast this to something like MATLAB or Mathematica where there is a single application package that does everything.
    - Konrad Hinsen:
      The SciPy stack is monolithic from the end user's point of view: you can't pick individual versions of each library and expect them to work together. You can only combine versions from close points in time. The developers' perspective is certainly very different. But the requirement of co-evolution in a context of rapid change in interfaces leads to a similar end result as centrally coordinated development.

The computational notebook of the future (part 2)

Konrad Hinsen — 2019-05-09

A while ago I wrote about my ideas for a successor of today's computational notebooks. Since then I have made some progress on a prototype implementation, which is the topic of this post. Again I have made a companion screencast so that you can get a better idea of how all this works in practice.

As a reminder, the two aspects of today's notebooks (Mathematica, Jupyter, R markdown, Emacs/OrgMode) that I consider harmful for scientific communication are:

The linear structure of a notebook that forces the narrative to follow the order of the computation.
The impossibility to refer to data and code in a notebook from the outside, and in particular from another notebook, making reuse of code and data impossible.

Like the demo that I made last time, and which is best qualified as a quick hack, the computational document that I am presenting today is implemented in Pharo and builds on the Glamorous Toolkit, which is an innovative development environment designed around the notion of "moldable development", which means that developers should be able to adapt their tools to their specific needs with little effort. This is precisely what I have done. The code is on GitHub and includes the example document from the demo.

Contrary to today's notebooks, my computational documents consist of two distinct layers, which I show for an example in the screencast. A workflow layer consists of scripts (short pieces of code) that compute datasets and keep track of the data dependencies. The workflow layer can be visualized as a graph. Scripts and datasets make up a standard Pharo object that can be used as a building block in subsequent work, unlike the code and data in today's notebooks. For example, the Pharo expression InfluenzaLikeIllnessInFrance data absoluteIncidence yields one of the data frames from my example document and can be used in any type of Pharo code, including code in another document.
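
For readers who do not know Pharo, the idea of a workflow layer can be sketched in a few lines of Python. This is only an analogy under my own assumptions, not the prototype's implementation: the Workflow class, its methods, and the numbers are all made up, and there is no cycle detection.

```python
class Workflow:
    """Scripts that compute named datasets and record what they depend on."""

    def __init__(self):
        self._scripts = {}   # dataset name -> (function, names of its inputs)
        self._cache = {}     # dataset name -> computed value

    def script(self, name, inputs=()):
        """Decorator: register a function as the script producing `name`."""
        def register(func):
            self._scripts[name] = (func, tuple(inputs))
            return func
        return register

    def dependencies(self):
        """The dependency graph as {dataset: [datasets it is computed from]}."""
        return {name: list(inputs) for name, (_, inputs) in self._scripts.items()}

    def data(self, name):
        """Compute a dataset on first request, then serve it from the cache."""
        if name not in self._cache:
            func, inputs = self._scripts[name]
            self._cache[name] = func(*(self.data(i) for i in inputs))
        return self._cache[name]


# Made-up example data; not the dataset used in the screencast.
workflow = Workflow()

@workflow.script("weekly_cases")
def weekly_cases():
    return [120, 340, 560, 410]

@workflow.script("population")
def population():
    return 100_000

@workflow.script("incidence_per_100k", inputs=("weekly_cases", "population"))
def incidence_per_100k(cases, pop):
    return [c * 100_000 / pop for c in cases]

if __name__ == "__main__":
    print(workflow.dependencies())
    print(workflow.data("incidence_per_100k"))
```

Because workflow is an ordinary object, any other piece of code (including another document) can import it and ask for workflow.data("incidence_per_100k"), which is the property that linear notebooks lack.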

On top of that workflow layer, there is a documentation layer consisting of a Wiki-style multi-page document in which each page can contain code snippets. These code snippets are intended for data presentation (plotting etc.) and for demonstrations (examples, verifications, etc.). They are not accessible from outside their pages, and they cannot change the datasets computed by the workflow. The documentation pages can refer to and include the datasets, the scripts, but also arbitrary other Pharo code. In particular, this allows including library code used by the workflow scripts in the documentation layer, as opposed to today's notebooks for which library code is undocumentable black-box code.

A third essential element is the playground attached to the workflow. This is where interactive exploration takes place. Code snippets in the playground can access datasets just like scripts, but they cannot modify them. The playground is meant both for authors and for readers. Authors develop code snippets incrementally in the playground, and turn them into scripts (at the click of a button) when they are satisfied. Readers can write code snippets for exploring the data in more detail.
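
The "access but not modify" rule can be illustrated with a small Python sketch. It is an assumption about how one might enforce such a rule, not a description of how the Pharo prototype does it; playground_view and the sample numbers are hypothetical.

```python
from types import MappingProxyType

def playground_view(datasets):
    """Return a read-only view of the workflow's datasets for exploratory code."""
    # Tuples cannot be changed in place; the proxy blocks adding or replacing entries.
    frozen = {name: tuple(values) for name, values in datasets.items()}
    return MappingProxyType(frozen)

# Made-up numbers standing in for datasets computed by workflow scripts.
datasets = {"weekly_cases": [120, 340, 560, 410]}
view = playground_view(datasets)

if __name__ == "__main__":
    print(max(view["weekly_cases"]))      # reading works as usual
    try:
        view["weekly_cases"] = (0, 0, 0, 0)
    except TypeError as error:
        print("modification rejected:", error)
```

Exploratory code can compute anything it likes from the view, but the datasets that the documentation refers to stay exactly as the workflow produced them.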

The code is currently "demo quality", so please don't rely on it for your own research. Even the underlying GToolkit library is still advertised as alpha level. There is a reason for calling this the future rather than the present! However, there are a few conclusions that I am already willing to draw from this work:

An authoring environment for computational documents should also be a more general software development environment. If you have to change tools for switching from library code to a computational document or back, you have a technological barrier to overcome that creates a mental separation between "inside" and "outside", whereas the science that you want to communicate is on both sides of your barrier.
The emphasis on making all code and data explorable that has been part of Smalltalk culture from the start is highly beneficial for computational science as well. Notebook environments such as Jupyter or RStudio feel extremely limited compared to the standard Pharo environment, let alone the more advanced GToolkit.
Decomposing the computation into smaller independent scripts with well-defined interfaces makes it more understandable. In the traditional linear notebooks, you never know how far further down a temporary variable will be used. You must read the code from top to bottom to be sure not to miss something. Likewise, separating "essential" computations on the data from "superficial" computations such as plotting makes the overall scientific logic stand out better.
A good authoring environment must support the full lifecycle of computer-aided research, starting with interactive exploration and iterating towards a computational document optimized for the reader rather than the author. Today's notebooks do not provide this support because they stick to a linear structure that is satisfactory only in the initial stages of the lifecycle.

Is reproducibility good for scientific progress? (a paper review)

Konrad Hinsen — 2019-04-23

A few days ago, a discussion in my Twitter timeline caught my attention. It was about a very high-level model for the process of scientific research whose conclusions included the affirmation that reproducibility does not improve the convergence of the research process towards truth. The Twitter discussion set off some alarm bells for me, in particular the use of the term "reproducibility" in the abstract, without specifying which of its many interpretations and application contexts everybody referred to. But that's just the Twitter discussion; let's turn to the more relevant question of what to think of the paper itself (preprint on arXiv).

The core of the work presented in that paper is a stochastic model for the process of scientific research. There is some phenomenon described by a "true" mathematical model. Scientists do not know this model, but can obtain data points from it. This is how experiments are described. Scientists do have full access to their own models for reality. At each time step, a scientist generates a new model according to some strategy and evaluates the quality of that model to see if it is "better" (in a well-defined sense) than the current consensus model of the community. One of the strategies is replication of prior work.

Such highly simplified high-level models are easy to criticize because of the huge number of simplifying assumptions. And yet, in other branches of science (such as physics), simple toy models have proven to be very useful. In particular, they can help identify mechanisms that are also present in more realistic (and thus more complex) descriptions of the same phenomena. However, toy models require reality checks as well, in the form of validation, even if validation is qualitative rather than quantitative. This is in my opinion one of the weak spots of this paper: validation is limited to a few basic sanity checks. Given the scarcity of empirical data on the scientific process, this isn't really surprising.

As for the specific issue of reproducibility, the model presented in the paper has a major weakness in that it completely ignores the issues that motivate reproducibility checks and replication studies in real life. Scientists, like all humans, are prone to mistakes and biases. The collective process of scientific research therefore includes verification steps that reduce the impact of mistakes and bias. Peer review is probably the best known one, but reproducibility checks and replication studies fall into this category as well. It is then not surprising that a model without mistakes and bias predicts little utility for verification measures.

However, this is merely a criticism of the current proposed model. It should be possible to include mistakes and bias without profound changes to the basic idea of modelling scientific research by a stochastic process. Confirmation bias is perhaps the simplest case: Let authors of original research overestimate the benefit of their work (as part of the evaluation criterion S in the paper) and replicators underestimate it. As for mistakes, a crude technique would be to let some percentage of scientists generate two new models, evaluate the first one, but report the second one as having been tested. Mistakes detected in a replication study would then lead to erasure of the replicated study from the process of consensus formation.
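
To make these two suggestions a bit more concrete, here is a minimal Python sketch. It is not the model from the paper and not a proposal for how its authors should extend it: the function names, the additive form of the bias, and the use of integers as stand-ins for models are all assumptions made for illustration.

```python
import random
from typing import Tuple


def reported_score(true_score: float, role: str, bias: float) -> float:
    """Score as perceived by a biased scientist (hypothetical additive rule).

    Original authors overestimate the benefit of their work by `bias`,
    replicators underestimate it by the same amount.
    """
    if role == "author":
        return true_score + bias
    if role == "replicator":
        return true_score - bias
    raise ValueError(f"unknown role: {role}")


def propose(rng: random.Random, mistake_rate: float) -> Tuple[int, int]:
    """Return (evaluated_model, reported_model) for one research step.

    Models are stand-in integers. With probability `mistake_rate`, the
    scientist evaluates one model but reports a different one.
    """
    evaluated = rng.randrange(1000)
    if rng.random() < mistake_rate:
        reported = rng.randrange(1000, 2000)  # guaranteed to differ
    else:
        reported = evaluated
    return evaluated, reported


def replication_detects_mistake(evaluated: int, reported: int) -> bool:
    """A replication re-examines the reported model and notices a mismatch."""
    return evaluated != reported


rng = random.Random(0)
studies = [propose(rng, mistake_rate=0.2) for _ in range(1000)]
erased = sum(replication_detects_mistake(e, r) for e, r in studies)
print(f"{erased} of {len(studies)} studies would be erased after replication")
```

In a full simulation, reported_score would replace the unbiased evaluation when a new model is compared to the consensus, and the erased studies would be removed from consensus formation.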

The computational notebook of the future

Konrad Hinsen — 2019-02-11

Regular readers of this blog may have noticed that I am not very happy with today's state of computational notebooks, such as they were pioneered by Mathematica and popularized by more recent free incarnations such as Jupyter, R markdown, or Emacs/OrgMode. In this post and the accompanying screencast (my first one!), I will explain what I dislike about today's notebooks, and how I think we can do better.

There are two aspects of notebooks that I consider harmful for scientific communication:

The linear structure of a notebook that forces the narrative to follow the order of the computation.
The impossibility to refer to data and code in a notebook from the outside, and in particular from another notebook, making reuse of code and data impossible.

If you look at a traditional scientific article, or technical report, you will notice that its narrative is structured according to a high-level view of the work. It starts by describing the context of the work, then its goals and a very brief summary of the methods, and right after that it presents results and discusses them. Technical details are only discussed afterwards, once the reader understands why they actually matter. With today's notebooks, the technical details come first: a typical data analysis starts with cleanup and preprocessing steps, and therefore they also come first in the narrative.

An unpleasant side effect of the "narrative follows computation" principle is that some technical details actually cannot be discussed adequately. Scientific methods implemented in software libraries can be summarized in plain English, but the code is elsewhere, managed by a different toolset, and cannot be shown to the reader.

This makes the transition to the second problematic aspect: there is no way to refer to or reuse any specific part of a notebook. Neither the code nor the computed results are accessible from the outside. And that also makes it impossible to build up useful libraries from notebooks.

So much for the criticism - now let's make it constructive. At this point, you should watch the screencast before reading on. In the screencast, I show a simple data analysis both as a Jupyter notebook and as a demo prototype for what I consider the notebook of the future. This prototype is built using the Glamorous Toolkit, a very innovative software development environment for Pharo, which is a modern descendant of Smalltalk. If you want to play with this yourself, the code is on GitHub. It's really just a demo, because the simplistic approach to organizing the computation that I have used there would not scale to real-life computations (it does a lot of needless recomputation). My plan is to implement the ActivePapers approach for managing the computations. GToolkit is alpha software as well. So none of this is ready for prime time, but it does show that better notebooks are possible.

Unlike today's notebooks, which are a sequence of code snippets and documentation paragraphs, the computational documents of my demo are objects in the sense of object-oriented programming. Each document contains code, input data, and computed data, which can be accessed from the outside and thus reused in client code. The narrative is merely an additional view into these items, which can present and discuss them in any order that seems suitable for explaining the work. Like with scientific articles, the narrative is typically written in the final stages of the work, once the basic code skeleton is working. In the case of my demo, I started out writing the two Pharo classes, before even installing GToolkit which was a bit unstable at the time.

Note that this "one job, one object, one narrative" approach has a beneficial side effect in encouraging people to do each job well, rather than just well enough for going on with the next job. My Jupyter/Python version of the data analysis only extracts the minimum information required from the input dataset, without even mentioning what else is in there. The GToolkit/Pharo version provides a complete description of the dataset, including the data that is not used at all in the second document that describes the analysis.

Finally, there are other interesting aspects of GToolkit (and Pharo) for computational science, but I will leave them for future posts. I will just mention that the "inspectors" (a term familiar to every Smalltalk developer but probably unknown to anyone else) are easily extensible. Adding a pane that provides yet another view of the document is a matter of writing a couple of lines of Pharo code. It's as if you could implement a new widget for Jupyter in a few lines of Python code right in your notebook.

Update: There's a workaround for embedding figures (thanks to Tudor Gîrba for the hint!), which you can find in the current code version on GitHub.

Comments retrieved from Disqus

Tomas:
Hi Konrad,
Is the screencast still available somewhere? The link won't load for me.
- Konrad Hinsen:
  Unfortunately I didn't keep a copy, and peervideo.net seems to have disappeared. So far for my very first screencast... I did better the second time, so the screencast for [part 2](http://blog.khinsen.net/pos... is still around, and also more interesting in the long run.
  - relbus:
    Getting hit by linkrot really drives home all the points you raise about stability and reproducibility.
  - Tomas Fiers:
    Just watched that one. It's fantastic. I have been thinking about a new interface for computational science/play too, and this demo suddenly connected different loose threads (dependency graph, transclusion, intermediate value inspection)
    - Konrad Hinsen:
      Thanks for your feedback! Another line of work I recommend in this space is Sam Ritchie's dynamic notebooks: https://roadtoreality.subst...

Exploring Pharo

Konrad Hinsen — 2018-12-19

One of the more interesting things I have been playing with recently is Pharo, a modern descendant of Smalltalk. This is a summary of my first impressions after using it on a small (and unfinished) project, for which it might actually turn out to be very helpful.

The first time I read about Smalltalk was in the August 1981 issue of Byte magazine. Back then, I was a high school student and I had just invested my savings into my first home computer with characteristics typical for the time: Z80 processor, 16 KB of memory, Microsoft Basic, data storage on cassette tapes. From that perspective, Smalltalk was a utopia. The revolutionary aspect of Smalltalk was its design as an integrated computing system that combined a language, a huge standard library, a development environment, and perhaps most of all a graphical user interface (GUI), which in fact was the ancestor of all of today's desktop-style GUIs. As a consequence, it required a high-quality graphics display, a mouse, and plenty of CPU power. None of that was available in commodity hardware.

In 1995, a friend passed me a floppy disk with Smalltalk-80 for the Atari ST family, and I could finally lay my hands on a working Smalltalk system. By then I had an Atari TT with the awesome big high-resolution black-and-white screen that was available for it. Just perfect for Smalltalk. I was very impressed by the system, which in many respects was superior to the Atari's native TOS/GEM combo, and even to the Unix workstations I had in the lab. But I couldn't actually use it for anything productive, because Smalltalk lived in a separate universe, unable to access any file on my hard disk. It wasn't more than an impressive demo of what computing could be like.

I have faint memories of playing with Squeak a couple of years later, but I found its flashy colors and toy-inspired aesthetics so unpleasant that I didn't go very far. Pharo is actually a fork of Squeak that evolved into a different direction, with a more sober design that is much more to my liking. More importantly, some of the on-going developments in the Pharo community (in particular the Glamorous Toolkit) are much in line with my recent interest in the human-computer interface of computational science. The 2018 session of the Pharo MOOC was thus a good occasion to take a more serious look at this up-to-date incarnation of Smalltalk. The MOOC does a pretty good job at introducing Pharo to people with various interests, and it even includes some explanations of the internal workings of Pharo (look for the "black magic" label).

As a language, Smalltalk was revolutionary in the 1980s, but no longer today because many now better known languages have drawn on it for inspiration. If you know Python, for example, then Pharo won't surprise you much beyond the obvious and important syntactical differences. On the plus side, that means it is not much effort to do a first project in Pharo when coming from a Python background. But it also means that there isn't much to be gained from learning Pharo if you look at it as just another programming language. The really interesting part is not the language, but the user interface of Pharo the computing platform.

Pharo belongs to a rare species of computing environments that I think is best described by the label "explorable". All of Pharo is implemented in Pharo itself, and all the source code is there for you to inspect and modify. But it's not just the code that is inspectable, it's all the objects that exist in memory. You can, for example, evaluate Array instanceCount to find out how many arrays exist at the moment (213464 when I tried). You can then obtain an arbitrarily chosen instance with Array someInstance and open a graphical inspector using Array someInstance inspect. You can also modify that array, without any idea of where it is used and for what, and thus wreak havoc with your system. For a more thorough approach to breaking Pharo, one of my favorites is true become: false, which replaces true by false and vice versa everywhere in the system. Pharo reacts much like I'd expect a human logician to react: it freezes instantly.
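
For comparison, here is a rough Python analogue of instanceCount and someInstance. The helper names are made up, and the analogy is limited: Python's gc module only knows about the objects tracked by the garbage collector, so instances of simple types such as integers or strings cannot be found this way, and there is nothing resembling become:.

```python
import gc


def instance_count(cls: type) -> int:
    """Count the live instances of `cls` that the garbage collector tracks."""
    return sum(1 for obj in gc.get_objects() if type(obj) is cls)


def some_instance(cls: type):
    """Return an arbitrary live instance of `cls`, or None if there is none."""
    for obj in gc.get_objects():
        if type(obj) is cls:
            return obj
    return None


class Marker:
    """A throwaway class so that the example has something to find."""


markers = [Marker() for _ in range(3)]
print(instance_count(Marker))
print(some_instance(Marker) in markers)
print(instance_count(list))  # varies from one session to the next
```

What Python does not offer out of the box is the next step, the graphical inspector that opens on whatever object you have found.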

The complete state of a Pharo system, including all code and all objects, and thus even GUI elements such as open windows, can be saved with a click in what is called an image. This is obviously very convenient, but should not be used as the only strategy for storing code because images are fragile, as my example above illustrates. Consider an image your development environment rather than your code repository. In fact, Pharo supports and encourages storing code in Git repositories.

It is important to understand that explorability is not an accidental feature of Pharo (and other Smalltalk derivatives), but has been a design goal from the start. Those interested in the history of this idea should look at Alan Kay's Dynabook concept and then take another step back in history to Doug Engelbart's "Mother of all Demos". The motivation behind all these developments is to make computing a tool not for performing tasks, but for augmenting human intellectual abilities. That goal is, unfortunately, very rare. In fact, the only other system I know of that was designed to be explorable is Emacs, also with the goal of maximally empowering users. Once you look beyond superficialities, Pharo and Emacs are actually quite similar. Both are built around a high-level programming language with a rich library, a user-interface framework, and development tools with inspection capabilities. Emacs then comes with a text editor as the default application at startup. Pharo has no such default application, meaning that it is pretty useless before you write some code of your own. That is probably the main reason why Emacs became so much more popular - people use it as a text editor and only later, if ever, discover its empowering features.

Explorability is what interests me most in Pharo, because I believe that computational science sorely needs it, and that existing interactive interfaces such as REPLs or notebooks are far from sufficient. They impose a linear thread of exploration, whereas I want to be able to go off on a tangent, dig in deeper into a model, compare two datasets side-by-side, etc. Notebooks are also rigid exploration environments which can be extended only with major effort, if at all. Pharo offers a much richer exploration environment, and makes it easy to adapt to problem-specific needs (another reference to the Glamorous Toolkit is compulsory here). The snag is that Pharo doesn't offer much support for working with scientific data or scientific models (though I must admit that I haven't checked out PolyMath yet). There are people who use Pharo for computational science (see e.g. this epidemiology simulation platform), so I suppose that there are useful tools I simply haven't looked at yet.

One power tool that I have already discovered (and explored interactively in Pharo) is the visualization library Roassal. It may superficially resemble various visualization libraries for JavaScript, but the big difference is that it integrates with the Pharo development and exploration tools. It is very easy to add a visualization pane to Pharo's object inspector and get a graphical view on your objects in addition to the standard browser-type interface for accessing an object's internals. And that means that you can easily use visualization as a tool in designing, implementing, and debugging code. It also helps a lot that the visualizations are themselves interactive. You can make them react to clicks, drags, and other events, and thus turn them into a user interface to your classes. For those familiar with Jupyter notebooks, it's as if you could implement interactive widgets in a few lines of Python code stored in your notebook.

I should perhaps say something about Pharo as a software development environment, but that aspect has been covered before by others in much more depth than I would do it myself. The demos in the Pharo MOOC are a good introduction, but for an overview of the possibilities, nothing beats Aditya Siram's recent demo aimed at adepts of functional programming languages.

After all that praise, I have to add some caveats. First of all, the Pharo community is tiny compared to, say, Python's, and therefore the choice in domain-specific libraries is rather small. Next, Pharo development moves on at a rapid pace, with the main consequence that nearly all available documentation is outdated, and what's left is often an update for insiders rather than an introduction for newcomers. No matter how explorable a system is, you need some higher-level information to use it productively, if only to know the jargon that permits you to start searching for stuff. As an example, when I tried to figure out how package dependency management works, I had to ask on the Pharo user mailing list to learn that the keyword to look for is "baseline". The three books Pharo by Example, Deep into Pharo, and Enterprise Pharo are probably the best place to start looking for introductory essays, but even they are two versions behind the current one.

Finally, let me anticipate a reaction that I expect regular readers of this blog to have. How is it possible for someone who underlines the importance of reproducibility in every second post to say something positive about a system that relies on persistent state to the point that it cannot even be bootstrapped from its own source code? There are a couple of replies. Most importantly, reproducibility is not what I am looking for in Pharo. Every system has its good and bad sides, and I am turning to Pharo for its good sides, explorability and user interfaces. Second, the Pharo developers are working on this. And finally, decades of dealing with persistent yet fragile system images have led the Smalltalk community to figure out ways to cope with the resulting problems (e.g. changesets) that may be worth studying for inspiration. Computational science suffers from a fundamental tension between the short-term need for interactivity and the long-term need for reproducibility. So far, no one has found a satisfying answer, so it's worth looking for inspiration in unusual places.

Comments retrieved from Disqus

Richard Eng:
> But it also means that there isn’t much to be gained from learning Pharo if you look at it as just another programming language.
I disagree. Syntactically, Pharo is a much nicer language to use than Python, for example. It's incredibly elegant. Conceptually, there's hardly anything to Pharo's syntax. By comparison, Python has much more syntax, and some of it is decidedly unnatural or unintuitive.
Python's OOP feels "bolted on," like an afterthought. It hides instance variables and methods "in plain sight" by prefixing their names with underscores. Yuck!
Instance method definitions must include "self" as the first argument. Their excuse? "Explicit is better than implicit." Give me a f*cking break.
Python's lambdas can only accept single expressions. What other language does this???
Python prefers half-open intervals. For example, range(1,6) gives you 1, 2, 3, 4, 5. This just doesn't feel right.
Python's local variable scoping rules are peculiar.
Python's Off-side rule syntax makes many developers uncomfortable, myself included.
This is all to say that using Python imposes a greater cognitive load on the programmer. With Pharo, there is no such load. Pharo is simplicity incarnate.
- Konrad Hinsen:
  I agree with your criticisms of Python syntax, but in my personal experience of 25 years of Python coding, most of these are not serious issues in practice.
  Everything can be improved, but only problems that its users perceive as serious have a chance of actually being addressed, and syntax is overall not perceived as a serious problem in the Python community. The only point you raise that I have seen discussed in the Python community is the limitations of lambda. Also the local variable scoping rules, but that's not really syntax.
  There is of course the big issue of indentation that you mention, which probably deters some people to the point that they never become Python programmers. But that's in the realm of personal preferences, as many others just love it.
Ben Coman:
btw, Pharo's version of Jupyter is Grafoscopio
http://mutabit.com/grafosco...
Torsten Bergmann:
You seem to have missed one primary point: Pharo since Pharo 7 IS NOW bootstrapped - so we can build an image from our own source code.
Even a more minimal one than the default download.
Check the folder "bootstrap" in https://github.com/pharo-pr...
and also checkout https://github.com/guillep/...
This is possible since 2016 already - see https://pharoweekly.wordpre...
I followup on this on https://astares.blogspot.co...
- Konrad Hinsen:
  Thanks for pointing this out. I had at some time seen a comment about bootstrapping being planned as a feature for Pharo 7, but never an announcement of it being done. Since Pharo 7 hasn't been officially released yet, I had assumed it was still on the todo list.
  - Ben Coman:
    Pharo 7 is now released...
    http://forum.world.st/ANN-P...

Knowledge distillation in computer-aided research

Konrad Hinsen — 2018-10-21

There is an important and ubiquitous process in scientific research that scientists never seem to talk about. There isn't even a word for it, as far as I know, so I'll introduce my own: I'll call it knowledge distillation.

In today's scientific practice, there are two main variants of this process, one for individual research studies and one for managing the collective knowledge of a discipline. I'll briefly present both of them, before coming to the main point of this post, which is the integration of digital knowledge, and in particular software, into the knowledge distillation process.

The first variant is performed by individual researchers or closely collaborating teams who, starting from the raw information of their lab notebooks, describing methods applied and results obtained, write a journal article summarizing all of this information into an illustrated narrative that is much easier to digest for their fellow scientists. This narrative contains what the authors consider the essence of their work, leaving out what they consider technical details. Moreover, the narrative places the work into its wider scientific context. In a second step, the authors condense the article into an even smaller abstract, supposed to tell readers at a glance if the article is of interest to them without going into any details. This process can be illustrated as a pyramid:

At the bottom we have all the gory details, one level up the distilled version for communication, and at the top the minimal summary for first contact with a potential reader. It is not uncommon to have an additional layer between the bottom two, often published as "supplementary material".

Whereas authors work from the bottom to the top of this pyramid, readers work down from the top, gaining a more detailed understanding at each step. Until not so long ago, this was a two-step process: after the abstract, they could move on to the paper, but after that they had to contact the authors for obtaining more details, and the authors might well not care to reply. The Open Science movement has made some progress in pushing for more transparency by making deeper information layers available for critical inspection, in particular raw datasets and the source code for the software used to process them. The situation is very much in flux as various scientific disciplines are working out which information can and should be shared, and how. The maximal level of openness is known as Open Notebook science, which basically means making the whole pyramid public. Note, however, that giving access to the base of the pyramid does not make the knowledge distillation steps superfluous. Readers would succumb to information overload if exposed to all the details without a proper introduction in the form of distilled knowledge. In fact, most readers don't want anything other than the distilled version.

The second variant of knowledge distillation is performed collectively by domain experts who summarize the literature of their field into review articles and then into monographs or textbooks for students. The pyramid diagram is very similar to the first variant's:

It's really just the same process at another scale: knowledge transfer about a discipline, rather than about a specific study.

So much for good old science - let's move to the digital age. The base of our first pyramid now contains code and digital datasets. Some of the code was written by the authors of the study for this specific project and typically takes the form of scripts, workflows, or notebooks. This is complemented by the dependencies of this project-specific code - see my post on software collapse for an analysis of the full software stack. Full openness requires making all of this public, with computational reproducibility serving as a success indicator. If other researchers can re-run the software and get the same results, they possess all the information one could possibly ask for, from a computational point of view.

But as with Open Notebook science, making all the details open is not sufficient. Readers will again succumb to information overload when exposed to a complex software stack and digital datasets whose precise role in the study is not clear. Information overload is even a much more serious problem with software because the amount of detail that software source code contains is orders of magnitude bigger than what can be written down in a lab notebook.

So how do we distill the scientific knowledge embedded in software? The bad news is that we don't yet have any good techniques. What we find in journal articles when it comes to describing computational methods is very brief summaries in plain English, closer to the abstract level than to the journal article level. As a consequence, computational methods remain impenetrable to the reader who does not have prior experience with the software that has been applied. There is no way to work down the pyramid, readers have to acquire the base level skills on their own. Worse, there is no way to stop at the middle level of the pyramid and yet have a clear understanding of what is going on.

The recent years have seen a flurry of research and development concerning the publication of software and computations. One main focus has been the reproducibility of results, another the sustainability of scientific software development, and a third one the readability of computational analyses. This last focus has most notably led to the development of computational notebooks (such as Jupyter, Rmarkdown, Emacs/Org-mode and many more), which embed code and results in a narrative providing context and explanations. Notebooks are occasionally put forward as "the paper of the future", but in view of the knowledge pyramid, that's not what they are. They are closer to the digital age equivalent of lab notebooks, especially when combined with version control to capture the time evolution of their contents. The real paper of the future must contain a distilled version of the source code.

It is interesting to examine why notebooks have been so successful in some scientific domains. First of all, they are a much better human-readable presentation of source code than anything we had before, with the exception of the related idea of literate programming, for which I expect a comeback as well. Next, in domains where computational studies tend to be linear sequences of well-known standard operations, such as statistical analyses, the notebook is very similar to a distilled computational protocol, because the technical details are mostly hidden in libraries. These libraries also contain significant scientific knowledge, but because these methods are well-known, they have in a way been distilled in the form of textbooks.

More generally, though, notebooks contain both too little and too much information to qualify as distilled descriptions of computational studies. Too little because much scientific knowledge is hidden in the notebook's dependencies, which are not documented at the same level of readability (which is why I believe that literate programming has a future). Too much because they still expose technical details to the reader that are more a hindrance than a help for understanding.

How, then, should the paper of the future present distilled computational knowledge? I see three main requirements:

It must be possible to explain and discuss individual models, approximations, or algorithms without the constraints of an efficient working implementation.
These models, approximations, and algorithms must be presented in a sufficiently precise form that automatic verification procedures can ensure that the source code at the base level of the pyramid actually implements them.
Suitable user interfaces must allow a reader to explore these models, approximations, and algorithms through concrete examples.

The first requirement says that clarity of exposition must take absolute precedence over any technical considerations of software technology. The intrinsic complexity of computational methods makes understanding hard enough, so everything possible must be done to keep accidental complexity out of the way.

The second requirement ensures that the conformity between the distilled and the detailed representations of a computational protocol can be verified by computers rather than by humans. Humans aren't very good at checking that two complex artifacts are equivalent.

The third requirement is motivated by the observation that a real understanding of a computational method, which is usually too lengthy to be actually performed manually, requires both reading code and observing how it processes simple test cases. Observation is not limited to the final outcome; it may well be necessary to provide access to intermediate results.

To get an idea of what "suitable user interfaces" might look like, it's worth looking at the explorable explanations and the Complexity Explorables Web sites. Note, however, that none of these exploration user interfaces provides easy access to a precise formulation of the underlying models or algorithms. They exist in the form of JavaScript source code embedded in the Web site, but that's not exactly a reader-friendly medium of expression. Another interesting line of development is happening in the Pharo community (Pharo being a modern descendant of Smalltalk), e.g. the idea of moldable inspectors, which are user interfaces specifically designed to explore a particular kind of object, which in the O-O tradition combines code and data.

Back to requirements 1 and 2: we want a precise and easily inspectable description that can be embedded into an explanatory narrative. We also want to be sure that it actually corresponds to what the user interface lets us explore, and to what the software implementation applies efficiently to real-world problems. I am not aware of any existing technology that can fulfill this role, although there are many that were designed with somewhat different goals in mind and that can serve as guidelines, in particular the various modeling and specification languages.

My own research into this problem has led to the concept of digital scientific notations, and I am currently designing such a notation for physics and chemistry, called Leibniz. A first report on this research has been published earlier this year. Leibniz is mainly inspired by traditional mathematical notation concerning the way it is embedded into a narrative, and by specification languages in terms of semantics. Some relevant features of Leibniz for expressing distilled knowledge are:

Its highly declarative nature. Leibniz code consists of short declarations that can be written down in (nearly) arbitrary order, making them easy to embed into a narrative, much like mathematical expressions and equations.
Its foundation in term rewriting (the same foundation adopted by most computer algebra systems). Among other advantages, this allows Leibniz code to concentrate on one aspect of a model or algorithm while leaving other aspects unspecified.
Its restriction to a single universal (but often inefficient) data structure.
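The term-rewriting foundation mentioned in the list above can be illustrated with a minimal sketch in Python. This is not Leibniz and does not use its syntax: the tuple representation of terms, the two rules for unary addition, and all function names are assumptions chosen for the illustration. It shows how short, independent rules define a computation, and how a symbol that no rule mentions simply stays in the result.

```python
# Terms are nested tuples such as ("add", t1, t2); constants are 1-tuples
# such as ("zero",). In rules, plain strings are pattern variables.
# All names here are hypothetical; none of this is Leibniz syntax.

RULES = [
    # add(zero, Y) -> Y
    (("add", ("zero",), "Y"), "Y"),
    # add(succ(X), Y) -> succ(add(X, Y))
    (("add", ("succ", "X"), "Y"), ("succ", ("add", "X", "Y"))),
]


def match(pattern, term, env):
    """Return the variable bindings that make pattern equal to term, or None."""
    if isinstance(pattern, str):  # a pattern variable
        if pattern in env:
            return env if env[pattern] == term else None
        return {**env, pattern: term}
    if len(pattern) != len(term) or pattern[0] != term[0]:
        return None
    for sub_pattern, sub_term in zip(pattern[1:], term[1:]):
        env = match(sub_pattern, sub_term, env)
        if env is None:
            return None
    return env


def substitute(template, env):
    """Replace the variables in template by the terms bound in env."""
    if isinstance(template, str):
        return env[template]
    return (template[0],) + tuple(substitute(t, env) for t in template[1:])


def normalize(term):
    """Rewrite the arguments first, then the term itself, until no rule applies."""
    term = (term[0],) + tuple(normalize(t) for t in term[1:])
    for lhs, rhs in RULES:
        env = match(lhs, term, {})
        if env is not None:
            return normalize(substitute(rhs, env))
    return term


zero = ("zero",)
one = ("succ", zero)
two = ("succ", one)
x = ("x",)  # a constant about which no rule says anything

print(normalize(("add", two, one)))  # 2 + 1 in unary notation
print(normalize(("add", one, x)))    # x stays in the result, unevaluated
```

The second call is the interesting one: the rules say nothing about x, yet rewriting still simplifies everything around it. The order of the entries in RULES does not matter here because their left-hand sides do not overlap.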

These features mainly address requirement 1. As for requirement 2, Leibniz uses XML for its syntax and has very simple semantics, making it easy to write libraries that read and execute Leibniz code, which in turn makes it easy to integrate Leibniz into scientific software of all kinds. Only Leibniz development environments have to deal with the more complex user-facing syntax requiring a specific parser.

Leibniz does not try to address requirement 3, but since it meets requirement 2, it doesn't get in the way of people wishing to build exploration and inspection user interfaces for Leibniz-based models and algorithms.

Leibniz is still very much experimental, and I am not at all sure that it will turn out to be useful in its current form. In fact, I am almost certain that it will require modification to be of practical use. If that doesn't scare you off, have a look at the example collection to get an idea of what Leibniz can do and what it looks like. Feedback of any kind is more than welcome!

Literate computational science

Konrad Hinsen — 2018-07-26

Since the dawn of computer programming, software developers have been aware of the rapidly growing complexity of code as its size increases. Keeping in mind all the details in a few hundred lines of code is not trivial, and understanding someone else's code is even more difficult because many higher-level decisions about algorithms and data structures are not visible unless the authors have carefully documented them and keep those comments up to date.

The main angle of attack to keep software source code manageable has been the development of ever more sophisticated programming languages and development paradigms, but it is not the only one. Another approach was initiated by Donald Knuth's invention of literate programming. Its basic idea is to invert the roles of code and documentation. Rather than adding documentation as annotations to the code, literate programming puts an explanatory narrative about the software at the center of the software author's attention. Code snippets are embedded into this narrative, much like mathematical formulas are embedded into scientific articles and textbooks.

Literate programming never gained much popularity, for reasons that, to the best of my knowledge, have never been explored systematically. Insufficient tool support is often cited as an obstacle, but I suspect that the mismatch between the structure of the narrative and the language-imposed structure of the code is equally problematic. Programmers need to name code blocks and then assemble them into valid source code by hand. My own experience is that it's usually easier to write and test the code first and then re-create it as a literate program, but this doesn't lead to code that naturally fits the narrative.

The main argument in support of this suspicion is the much higher popularity of a variant of literate programming that both adds and removes features compared to Knuth's original system. Computational notebooks (implemented e.g. by Jupyter) document a computation rather than a piece of software. In addition to code, they embed input data and results into the narrative, but they also restrict code to a linear assembly of code cells executed in sequence. This limitation removes the need to name and assemble code blocks.

An idea I have been exploring recently is to take another step towards letting the explanatory narrative take center stage, by designing a formal language specifically for embedding into such a narrative. However, my language called Leibniz is not a programming language. I call it a digital scientific notation to emphasize its intended use in the documentation of scientific models and methods, but in terms of computer science terminology it is a specification language designed for models expressed in terms of equations and algorithms. Leibniz code must be embedded into a narrative, although the Leibniz authoring environment also extracts a machine-readable version as an XML file for easy processing by scientific software.

To get an overview of Leibniz, I suggest looking first at a simple example, and then reading my paper describing Leibniz and the problems it is designed to solve, which just appeared in PeerJ CompSci (Open Access like all of PeerJ). The explanations in the paper should prepare you for a look at the currently most extensive example, which documents, for a toy problem, the full path of assumptions and approximations that lead from a theoretical framework (Newton's equations of motion) to a numerical algorithm, with all models along the way being machine-readable.

As the paper explains, Leibniz is best described as a research prototype at the current stage. It has known limitations that make its application to complex real-world problems a bit challenging. However, I am confident that these limitations can be overcome, and that Leibniz will be suitable for a wide range of scientific models and methods, starting with mathematical equations and ending with literate workflows. As Silicon Valley startups would say, make sure you won't be left behind by the Leibniz revolution!

Scientific software is different from lab equipment

Konrad Hinsen — 2018-05-07

My most recent paper submission (preprint available) is about improving the verifiability of computer-aided research, and contains many references to the related subject of reproducibility. A reviewer asked the same question about all these references: isn't this the same as for experiments done with lab equipment? Is software worse? I think the answers are of general interest, so here they are.

First of all, an inevitable remark about terminology, which is still far from standardized (see this preprint and this article for two recent contributions to the controversy). I will use the term "computational reproducibility" in its historically first sense introduced by Claerbout in 1992, because it seems to me that this is currently the dominant usage. Reproducing a computation thus means running the same software on the same data, though it's usually done by a different person using a different computer. In contrast, replication refers to solving the same problem using different software. This terminological subtlety matters for the following discussion, because experimental reproducibility is actually more similar to replicability, rather than reproducibility, in the computational case.

There are two aspects in which I think scientific software differs significantly from lab equipment:

Its characteristics as a human-made artifact
Its role in the process of doing science.

Software is more complex and less robust than lab equipment

The first point I raised in my paper is the epistemic opacity of automated computation. Quote:

The overarching issue is that performing a computation by hand, step by step, on concrete data, yields a level of understanding and awareness of potential pitfalls that cannot be achieved by reasoning more abstractly about algorithms. As one moves up the ladder of abstraction from manual computation via writing code from scratch, writing code that relies on libraries, and running code written by others, to having code run by a graduate student, more and more aspects of the computation fade from a researcher's attention. While a certain level of epistemic opacity is inevitable if we want to delegate computations to a machine, there are also many sources of accidental epistemic opacity that can and should be eliminated in order to make scientific results as understandable as possible.

The reviewer asks: isn't this the same as when doing experiments using lab equipment constructed by somebody else? My answer is no.

Let's do a little thought experiment, introducing Alice and Bob as virtual guinea pigs. Alice is an experienced microscopist, Bob is an experienced computational scientist. We give Alice a microscope she hasn't seen before, and ask her to evaluate if it is suitable for her research. We give Bob a simulation program (with source code and documentation) that he hasn't seen before, and ask him the same question.

My expectation is that Alice will go off and do some tests with samples that she knows well, and perhaps do some measurements on the microscope. After that, she will tell us for which aspects of her work she can use this new microscope. Meanwhile, Bob will be scratching his head while trying to figure out how to deal with our question.

One reason for the difference is that a microscope is a much simpler artifact than a simulation program. While it is certainly difficult to design and produce a good microscope, from a user's perspective its characteristics can be described by a handful of parameters, and its quality can be evaluated by a series of test observations. Software, on the contrary, can do almost anything. A typical simulation program has lots of options, whose precise meaning isn't always obvious from its documentation. More importantly, no two simulation programs have identical options. Even the most experienced user of simulation software A falls back to near-novice status when given simulation software B.

A more subtle difference is that microscopes, and lab equipment in general, are designed to be robust against small production defects and small variations of environmental conditions. Such small variations cause only small changes in the generated images. With software, on the other hand, all bets are off. A one-character mistake in the source code can cause the program to crash, but also to produce arbitrarily different numbers. In fact, there is no notion of similarity and thus of small variations for software. For a more detailed discussion, see my CiSE article on this topic. This is why you can evaluate the quality of a microscope using a few judiciously chosen samples, whereas no amount of test runs can assure you that a piece of software is free of bugs. Unless you can afford to test all possible inputs, of course, but then you don't really need the software.
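A toy sketch, not taken from the CiSE article, makes the one-character point concrete. The function name, the integration scheme (semi-implicit Euler), and all parameter values are assumptions chosen for the illustration. The two calls differ by a single character, the sign of the force, and the program neither crashes nor warns in either case.

```python
def simulate(sign, steps=1000, dt=0.01):
    """Integrate x'' = sign * x with semi-implicit Euler, from x = 1, v = 0."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        v += sign * x * dt   # update the velocity from the force
        x += v * dt          # then the position from the new velocity
    return x


intended = simulate(-1.0)  # harmonic oscillator: x stays close to [-1, 1]
mistyped = simulate(+1.0)  # one character changed: x grows exponentially

print(intended, mistyped)
```

After the same 1000 steps, the first result is a number of magnitude below one and the second is in the thousands. Nothing about the second number looks wrong unless you already know what the answer should be.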

These two differences explain why Alice knows how to evaluate the microscope, whereas Bob doesn't know where to start. He might look at the documentation and the test cases to see if the program is meant to be used for the kind of work he does. But the documentation almost certainly lacks some important details of the approximations that are made in the code and that matter for Bob's work. Moreover, he would still have to check that the software has no serious bugs related to the functionality he plans to use. Without knowing the implemented algorithms in detail, he cannot even anticipate what bugs to watch out for.

Bob could also choose a very different approach and judge the software by quality standards from software engineering. Is the code well structured? Does it have unit and integration tests? These are the criteria that software journals ask their reviewers to evaluate (e.g. the Journal of Open Research Software or the Journal of Open Source Software). Statistically, they are probably related to the risk of encountering bugs (if anyone knows about research into this question, please leave a comment!). But even the most meticulous developers make mistakes, and, more importantly, may have different applications in mind than those that Bob cares about.

Finally, Bob could do what in my experience (and also according to this study) most scientists do in choosing research software: they use what their colleagues use. Bob would then send a few emails asking if anyone he knows uses this software and is happy with it. This is a reasonable approach if you can assume that your colleagues, or at least a sizable fraction of them, are in a better position to judge the suitability of a piece of software than yourself. But if everyone adopts this approach, it becomes a popularity contest with little intrinsic value (see this paper for a detailed example). In any case, it is not a way to actually answer our question.

In the end, if you really want to know if your software does what you expect it to do, you have to go through every line of the source code until you understand what it does. You are then at the minimal level of epistemic opacity that you can attain without actually doing the computations by hand. Unfortunately, in the case of complex wide-spectrum software, this is likely to be much more effort than writing your own special-purpose software.

The solution I propose in my paper is to use human-readable formal specifications as a form of documentation that is rigorous and complete, and can be used as a reference to verify the software against. The idea is to have a statement of the implemented algorithms that is precise and complete but as simple as possible, without being encumbered by considerations such as performance. Note that I don't know if this will turn out to be possible - my work is merely a first step into that direction that, to the best of my knowledge, has not been explored until now.

Software is about models, lab equipment is about observations

A popular meme in explaining science describes it as founded on two pillars, experiment and theory. Some people propose to add computation and/or simulation as a third pillar, and data mining as a fourth, although these additions remain controversial. In my opinion, they are misguided by a bad identification of the initial pillars. They are not experiment and theory, but observations and models. We often speak of computational experiments when doing simulations, and there are good reasons for the analogy, but it is important to keep in mind that these are experiments on models, not on natural phenomena.

Observations provide us with information about nature, and models allow us to organize and generalize this information. In this picture, computation has two roles: evaluating the consequences of a model, and comparing them to observations. Simulation is an example for the first role, data mining for the second. Both of these roles predate electronic computers, they simply received more modest labels such as "solving differential equations" or "fitting parameters" in the past.

In the context of reproducibility and verifiability, it is important to realize that there is no symmetry between these two pillars. Nature is the big unknown that we probe through observations. To do this, we use lab equipment that can never be perfect, for two reasons: first, it is constructed on the basis of our imperfect understanding of nature, and second, our control of matter is limited, so we cannot produce equipment that behaves precisely as we imagine it. Models, on the other hand, are symbolic artifacts that are under our precise control. We can formulate and communicate them without any ambiguity, if only we are careful enough.

Because of these very different roles of observations and models, computational reproducibility has no analogue in the universe of observations. It is almost exclusively a communication issue, the one exception being the non-determinism in parallel computing that we accept in exchange for getting results faster. Non-determinism aside, if Alice cannot reproduce Bob's computations, that simply means that Bob has not been able or willing to describe his work in enough detail for Alice to re-do it identically. There is no fundamental obstacle to such a description, because models and software are symbolic artifacts. We actually know how to achieve computational reproducibility, but we still need to make it straightforward in practice.
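One common source of this non-determinism, stated here as an assumption about what is meant, is that floating-point addition is not associative. A parallel reduction adds partial sums in whatever order the workers finish, so the same data can yield different totals. The sketch below uses made-up values and imitates two possible orders in plain sequential Python.

```python
values = [1e16, 1.0, -1e16, 1.0]

# Order 1: strictly left to right, as a sequential loop would do it.
left_to_right = ((values[0] + values[1]) + values[2]) + values[3]

# Order 2: two workers each add a pair, then the partial sums are combined.
pairwise = (values[0] + values[2]) + (values[1] + values[3])

print(left_to_right, pairwise)
```

In the first order, adding 1.0 to 1e16 is lost to rounding, so the total comes out as 1.0 instead of the exact 2.0 that the second order delivers. Both orders are legitimate implementations of "the sum", which is why this case is an exception to the rule that reproducibility is purely a matter of communication.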

Similarly, if Alice cannot verify that Bob's computations solve the problem he claims they solve, this means that Bob has not succeeded in explaining his work clearly enough for Alice to understand what is going on. An unverifiable computation is thus very similar to a badly written article. The big difference in practice is that centuries of experience with writing have led to accepted and documented standards of good writing style, whereas after a few decades of scientific computing, we still do not know how to expose complex algorithms to human readers in the most understandable way. My paper is a first small step towards developing appropriate techniques.

Experimental reproducibility, on the other hand, is an ideal that can never be achieved perfectly, because no two setups are strictly the same. Verifiability is equally limited because observations can never be repeated identically, even when done with the same equipment. Reproducibility is a quality attribute much like accuracy, precision, or cost. Tradeoffs between these attributes are inevitable, and have to be made by each scientific discipline as a function of what its main obstacles to progress are.

Science has been adjusting to the inevitable limits of observations since its beginnings, whereas the issue of incomplete model descriptions has come up only with the introduction of computers, which made it possible to work with complex models. We don't know yet if non-verifiable models are a real problem or not. However, as a theoretician I am not comfortable with the current situation. Models can be simple or complex, good or bad, grounded in solid theory or ad-hoc, but they should not be fuzzy. In particular not for complex systems, where it is very hard to foresee the consequences of minor changes.

Scientific communication is a research problem

Konrad Hinsen — 2018-04-09

A recent article in "The Atlantic" has been the subject of many comments in my Twittersphere. It's about scientific communication in the age of computer-aided research, which requires communicating computations (i.e. code, data, and results) in addition to the traditional narrative of a paper. The article focuses on computational notebooks, a technology introduced in the late 1980s by Mathematica but which has become accessible to most researchers only since Project Jupyter (formerly known as the IPython notebook) started to offer an open-source implementation supporting a wide range of programming languages. The gist of the article is that today's practice of publishing science in PDF files is obsolete, and that notebooks are the future.

One interesting follow-up thread on Twitter explored if any scientific papers had actually been published in the form of Jupyter notebooks. It seems that the answer is no. Notebooks are published as supplementary material to standard papers, or as informal communication outside of the official scientific record, in particular for teaching purposes, but no one could point to a paper indexed in any article database that was written as a Jupyter notebook. As to the question of why this hasn't happened, all answers remain speculative in the absence of research into the subject. Publishers' format requirements are certainly a part of the problem, but limitations of today's notebook format also matter. In particular, notebooks lack support for bibliographies and for cross referencing.

Another interesting follow-up is a blog post by Luis Pedro Coelho who predicts that PDFs will stay with us for many years to come, because none of the proposed successors is actually mature enough for use in real life. In particular, he points out the complexity and lack of longevity and stability of most of today's computational tools. My personal experience is very similar to his. He also asks the very relevant question if a notebook-style presentation of results and computations is actually a good idea in the context of a scientific paper. I suspect nobody can provide an evidence-based answer at this time.

As these discussions illustrate, scientific communication about computer-aided research remains a research problem. As a community, we do not know how to explain, share, or review computer-aided research in a satisfactory way. Most of us agree that PDFs are no longer sufficient, and that we need to share code and data. However, we do not yet have good enough practices for doing so, at least not for all practically relevant situations. We do not know either if sharing code and data will actually be sufficient to enable effective communication. It is well possible that we will also need to develop practices for better explaining computations to each other, and have them peer reviewed in some form.

From this point of view, all of today's technology, be it Jupyter, Org mode, knitr or similar tools, should best be seen as support tools for performing experiments in scientific communication. What is still largely missing is systematic research that evaluates these experiments with the goal of summarizing the collective experience and drawing conclusions. There are promising starts, such as this study on the actual use of Jupyter notebooks, but their number is negligible compared to the number of articles proclaiming that this or that technology is going to revolutionize scientific communication without providing any tangible evidence.

I think it is time for the scientific community to acknowledge that it doesn't really know how to communicate computer-aided research effectively, and encourage research into the question. Experimenting with the various proposed approaches is essential, but analyzing the outcomes of these experiments is essential as well. In my opinion, we currently over-emphasize tool development, community building, and teaching, which are all directed at implementing new practices, but neglect research into what these practices actually should be. Future generations of scientists may well remember today's hot developments as sources of technical debt.

A personal anecdote provides an illustration of the dominating attitude. My ActivePapers project is clearly labeled as research. Its goal is to explore how non-trivial computations (long run times, big data sets) can be performed, archived, and published reproducibly. For first results, see this paper. Whenever I present this project, I know there is one question someone in the audience will ask: What are your plans for increasing your user base? I answer that I am doing research and not product development, and that I am not recruiting users but at best collaborators. This always causes surprise and sometimes animated discussions. It almost seems that doing research on doing research is a strange idea for professional scientists. On the other hand, my other research project on scientific communication, the digital scientific notation Leibniz, does not generate this kind of reaction, but then it hasn't seen that much exposure yet. It explores the question of how we can explain a complex computation in a way that allows readers to verify its scientific assumptions. For a first account, see this preprint.

Finally, readers might be interested in two of my earlier blog posts that are related to notebooks:

"Beyond Jupyter: what’s in a notebook?" looks at notebooks as digital documents, focusing on the information content rather than on the tool for doing computations.
"From facts to narratives" explores various approaches, one of them being notebooks, to combining formal elements of a computation (code, date) with a explanatory narrative.

What can we do to check scientific computation more effectively?

Konrad Hinsen — 2018-03-07

It is widely recognized by now that software is an important ingredient to modern scientific research. If we want to check that published results are valid, and if we want to build on our colleagues' published work, we must have access to the software and data that were used in the computations. The latest high-impact statement along these lines is a Nature editorial that argues that with any manuscript submission, authors should also submit the data and the software for review. I am all for that, and I hope that more journals will follow.

However, we must also be aware of the inherent limitations of simply including software in peer review. With the exception of small and focused software, of the kind we typically have in replications submitted to ReScience (one of the very few scientific journals that actually does code review), the task of evaluating scientific software is so enormous that asking a single person to do it within two weeks is simply unreasonable. For that reason, journals specialized in software papers, such as the Journal of Open Research Software or the Journal of Open Source Software, limit the reviewing process to more easily verifiable formal aspects, such as the presence of documentation and the use of appropriate software engineering techniques. Which is, of course, much better than nothing, but it isn't enough.

A few months ago I wrote about the kinds of mistakes that we tend to make in scientific computing. In my experience (I'd love to see a systematic study on this), most mistakes are due to discrepancies between what a paper describes and what is actually computed. This covers simple mistakes such as a wrong sign in a computed formula (such as in the widely publicized case of protein structure retractions), or a typo in the input parameter file for a simulation program, but also more complex situations such as the inflated false-positive rates in fMRI studies that also made it into the headlines of science news. In this case, the fundamental issue was a mismatch between the methods implemented in the software and the methods that would have been appropriate for many typical use cases of the software. Put differently, the users of the software did not fully understand what exactly the software did. They trusted the software authors blindly to do "the right thing", whatever that was. And they were probably reinforced in their blind trust by the fact that many of their colleagues used the same software. It's the research version of "nobody ever got fired for buying IBM equipment".

Code review is an important step to a better verification of scientific computations, but in the cases I just described its utility is very limited. Neither the wrong sign in the protein crystallography code nor the not-quite-universally-applicable statistical analysis method used by the fMRI software would be detectable by software engineering methods. In the first case, the code would have to be compared to the set of mathematical formulas on which it was based, a task requiring expert knowledge in both crystallography and programming, plus a lot of time - much more than what a reviewer can typically invest. In the second case, code review cannot do anything at all. Only the reviewers of the application papers could have spotted the inappropriateness of the methods - but why should they be expected to be more knowledgeable about the pitfalls than the authors?

An important but not yet widely recognized aspect of these situations is that today's scientific software incorporates a significant amount of scientific knowledge that is very difficult to access and verify by users and reviewers. The translation of mathematical equations in a paper into efficient computer code is almost a form of encryption from the point of view of scientific knowledge transformation. Extracting equations from software source code is not much easier than extracting source code from compiled binaries.

But can we do anything about this? I believe we can, but it will require a serious rethinking of the way we use computers to do research. My first explorations in this direction are described in a paper that is now available as a PeerJ preprint. Please have a look, and don't hesitate to ask a question or leave other feedback of any kind!

Data science in ancient Greece

Konrad Hinsen — 2017-12-19

Data science is usually considered a very recent invention, made possible by electronic computing and communication technologies. Some consider it the fourth paradigm of science, suggesting that it came after three other paradigms, though the whole idea of distinct paradigms remains controversial. What I want to point out in this post is that the principles of data science are much older than most of today's practitioners imagine. Let me introduce you to Apollonius, Hipparchus, and Ptolemy, who applied these principles about 2000 years ago.

The focus of interest of these early researchers was a topic that had kept humanity busy for quite a while already, all over the world: the motion of heavenly bodies. The main motivation was making predictions for the near future. The configuration of the stars and planets was widely believed to have an impact on human affairs (a belief we call astrology today), so knowing them in advance was of obvious interest. They had astronomical observations at their disposal, but numbers alone are not sufficient to make predictions. You also need a model for extrapolating the numbers to the future.

The tool that Apollonius, Hipparchus, Ptolemy, and probably many others, developed and improved to near perfection was epicycles: a model for the orbit of a heavenly body consisting of a superposition of circles, with each circle's center moving along a bigger circle's circumference. Epicycles are similar in spirit to Fourier series. Any periodic orbit can be described as a superposition of circular motions. Given enough data, one can fit an epicycle model and make predictions. But since the epicycle model does not contain any physics, it doesn't come with any safeguards against mistakes. Epicycles can equally well describe real and completely unrealistic orbits, and therefore the quality of the data is very important.
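The analogy with Fourier series can be made concrete with a small sketch. It is a modern reformulation, not the ancient procedure: positions are complex numbers, each circle is one Fourier coefficient, and the fit is a plain discrete Fourier transform. The function names and the made-up "observations" are assumptions for the illustration.

```python
import cmath


def fit_epicycles(points):
    """Fit one circle per frequency k to equally spaced samples of one period.

    The complex coefficient c_k gives the radius (abs) and starting angle
    (phase) of a circle traversed k times per period.
    """
    n = len(points)
    return {
        k: sum(z * cmath.exp(-2j * cmath.pi * k * m / n)
               for m, z in enumerate(points)) / n
        for k in range(-(n // 2), n - n // 2)
    }


def position(coeffs, t):
    """Position at time t (in units of the period) as a superposition of circles."""
    return sum(c * cmath.exp(2j * cmath.pi * k * t) for k, c in coeffs.items())


def observed_orbit(t):
    """Made-up 'observations': a big circle carrying a smaller, faster one."""
    return cmath.exp(2j * cmath.pi * t) + 0.3 * cmath.exp(2j * cmath.pi * 5 * t)


samples = [observed_orbit(m / 16) for m in range(16)]
model = fit_epicycles(samples)

# Circles with non-negligible radius: the two that generated the data.
print({k: round(abs(c), 3) for k, c in model.items() if abs(c) > 1e-9})

# Prediction for a time outside the sampled period.
print(abs(position(model, 1.25) - observed_orbit(1.25)))
```

The fit recovers the two circles and predicts a later position essentially exactly, because the data really are periodic and noise-free. The same code would fit any other 16 points just as readily, physically plausible or not, which is the lack of safeguards described above.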

Today's data science works much the same. Very general models, such as neural networks, are fitted to large datasets and then used to make predictions. Again the models contain very few assumptions about underlying laws of nature. They are by design very general (see e.g. this visual proof that neural networks can compute any function) in order to capture any kind of regularity in the input datasets. As for epicycles, data quality is important, which is why data scientists invest a significant effort into cleaning up the raw data they work on.

Aside from the obvious technological aspects and the associated change of scale in the size of datasets, the main improvement of today's data science on epicycle models for orbits is even more generality. Early astronomers had periodicity baked into their models from the start. Neural networks (and other models used in data science) could predict the motion of heavenly bodies with even less theoretical input. However, it is important to realize that every model imposes some a priori assumptions, even if, as in the case of neural networks, these assumptions are not fully understood and therefore not formalized. Seen in this light, the improvement of modern data science over epicycles is gradual rather than fundamental.

Adopting an historical perspective, data science turns out to mark the beginning of scientific disciplines rather than their refinement. It permits the very first step from raw observations to a description of regularities. Connecting these regularities to known more fundamental principles, or even discovering new fundamental principles as in the case of Newton's laws for celestial mechanics, can only happen afterwards.

Perhaps a more fundamental distinction than the one between experiment and theory (plus, according to some, simulation and data science) is the one between data-driven and model-driven science. Data-driven science starts from observations and searches for regularities using generic models. Model-driven science takes more advanced problem-specific models and aims at evaluating and improving their quality on the one hand, and exploring their consequences on the other hand. In terms of day-to-day research activities, data-driven science collects observations that promise to be interesting and uses statistical methods to interpret them. Model-driven science has theoreticians exploring models and experimentalists asking Nature specific questions arising from this exploration. The oldest and best-known scientific disciplines, i.e. physics and chemistry, are primarily model-driven today, which may contribute to the impression that data-driven science is new. As the epicycle example shows, this is really just a lack of historical perspective.

Stability in the SciPy ecosystem: a summary of the discussion

Konrad Hinsen — 2017-11-22

The plea for stability in the SciPy ecosystem that I posted last week on this blog has generated a lot of feedback, both as comments and in a lengthy Twitter thread. For the benefit of people discovering it late, here is a summary of the main arguments and my reply to them.

Just freeze your code and it will be reproducible forever

By far the most frequent argument against my claim that we need more stability in the SciPy ecosystem was that people can simply archive their code with all the dependencies (down to the Python language itself) in a way that lets others re-run it later for reproducibility. The most frequently proposed technical approaches were the conda package manager and Docker containers.
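
To give an idea of what such an archive has to record, here is a minimal sketch that lists the Python version, the platform, and the versions of all installed Python distributions, using only the standard library. It is an illustration, not what conda or Docker do internally. The function name `describe_environment` is an assumption made for this example, and the record deliberately stops at the Python level: compiled libraries, compilers and the operating system are not captured.

```python
import json
import platform
import sys
from importlib import metadata

def describe_environment():
    """Collect the Python version, the platform and the installed distribution versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installations that lack a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }

if __name__ == "__main__":
    # Print the record as JSON so that it can be archived next to the code.
    print(json.dumps(describe_environment(), indent=2))
```

Such a record says which versions were used, but restoring them years later still depends on the packages remaining available and installable.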

There are three main reasons why this is not a sufficient solution:

1. Freezing code is fine for archival reproducibility, as I mentioned in my original post. It is not sufficient for living code bases that people work on over decades. Computational biologist Luis Pedro Coelho has explained this very well and I recommend everyone to read his short writeup. My situation is very much the same as his. On Twitter, astronomer Tuan Do has chimed in with a similar comment.
2. The technical solutions proposed all depend on yet more infrastructure whose longevity is uncertain. For how long will a Docker container image produced in 2017 remain usable? For how long will conda and its repositories be supported, and for how long will the binaries in these repositories function on current platforms?
3. None of today's code freezing approaches comes with easy-to-use tooling and clear documentation that make it accessible to the average computational scientist. The technologies are today in a "good for early adopters" state. This means we cannot rely on them to preserve today's research even though they may well take on this role in the future.

To illustrate point 3, let me introduce Alice and Bob, who are real scientists I know, except that I have changed the names. Alice is a chemist with a decent knowledge of Python and basic software engineering techniques (Software Carpentry level), which she eagerly applies because she cares about the quality of her work. Alice considers herself an experimentalist. She develops and maintains a Python codebase for interpreting certain types of experimental data, but software development is not the focus of her work. The code she writes is not public, because her boss doesn't want it to be. Worse, her code depends on a small library developed by a collaborator who doesn't even hand out source code. What Alice gets is pre-compiled shared libraries for the three platforms that matter to herself and to her users.

Bob is an experimental biologist who uses the same instruments as Alice and is happy that Alice has written nice software for interpreting the results. He gets that software, including the binary-only dependencies, by personal arrangements with the various people involved. Bob doesn't know much about Python, nor does he care. His software installation was mostly done by Alice during a one-afternoon meeting in which they worked together to reach a state he could work with. Ideally, he would like to never touch it again, but he also wants the new features that Alice adds from time to time.

To all those who replied "just use conda" or "just use Docker", I recommend considering the situation of Alice and Bob. Do you really believe that conda or Docker are the right solution for them today? Could you point them to suitable documentation written at the right level? Both for building and for re-using frozen environments?

To prevent another round of misunderstanding, I am not saying that the situation of Alice and Bob would be perfect if only they could have a stable Python infrastructure. Research code should be open, for example, for many reasons including the possibility to upload it to various repositories. Fortunately, the attitudes towards software use in science are changing in the right direction, but this will take a lot of time, like all social change.

I also fully understand the point of view that the SciPy ecosystem is for advanced users who value methodological innovation, and that it cannot cater for the needs of Alice and Bob because of conflicting requirements and insufficient resources to deal with them. But then, as I said in my original post, please have the courage to say so openly and clearly. Every beginner-level tutorial for scientists should state during the first five minutes that you cannot expect stability and that you should either use Python only for throw-away code or else be sure you can assume maintenance. In other words, make sure that people like Alice have no false expectations. They can then look for other technology, or team up with like-minded people to maintain long-time-stable branches of SciPy, or try whatever else.

Stability is an unrealistic expectation

Another frequently expressed opinion was that it is unrealistic to expect the kind of stability I advocated in a modern software environment. This is a self-fulfilling prophecy: if you consider the goal impossible, you won't even try to achieve it. As I have pointed out, long-time stability is a reality in other ecosystems, built around languages such as Fortran or Java. A few people said that Fortran or Java are unfair comparisons, because they encourage very different approaches to dependency management. This is actually my point: you can have stability, but only if it's an explicit goal and if some effort is made to reach that goal. This includes finding suitable approaches to dependency management.

David Cournapeau made the interesting observation that no technology less than 20 years old is better than Python in terms of stability. That rings true, in the sense that I cannot find a counterexample. But I see this as a statement about dominant attitudes in software engineering (way beyond scientific computing), not as a statement about technological constraints that would make stability fundamentally incompatible with other requirements. Software development today is dominated by short-lived technologies but also by short-lived applications. The application domains where stability is valued probably represent a much smaller part of the pie than 20 years ago. But then, this is just another illustration of what I wrote about recently: There is no such thing as software development in the abstract, there is only domain-specific software development. The needs of scientific computing are clearly different from the needs of Silicon Valley startups. The conclusion is that the software development tools and practices should be different as well.

Finally, even within the somewhat tumultuous SciPy ecosystem, stability is not impossible. My own MMTK library has been around for 20 years, but in spite of continuous extensions and one API redesign (from version 1.x to version 2.x), I have never knowingly broken anyone's application code. With the end of support for Python 2, I can unfortunately no longer maintain that policy.

Everybody lacks resources for maintenance

Many comments addressed the lack of human resources for developing and maintaining scientific software, and in particular infrastructure software like the core of the SciPy ecosystem. In combination with the fact that new developments are more attractive to most people than boring maintenance, and also more valued by the community, this leads to a culture favoring innovation over stability when most of the work is done by volunteers. This was best expressed by Peter Wang in a short sequence of tweets.

This is indeed an important factor, and one whose importance transcends scientific computing and even science itself. If you look back at the history of civilization, or even at the history of life on earth, you can't fail to notice that all living organisms have invested the lion's share of their efforts into maintaining the status quo: staying alive, staying safe, maintaining an environment that ensures a certain quality of life, etc. In modern societies whose very survival depends on technology, infrastructure maintenance (roads, power grid, ...) has always been a priority of state administrations - until recently, that is. Today, we hear politicians and even intellectuals proclaim the importance of innovation and disruption, while basic infrastructure starts to rot for lack of maintenance.

I can only hope that the innovation and disruption fashion will die out before the societies that have fallen victim to it do so by natural selection. In the meantime, I propose that scientists try to resist as best they can. The fact that infrastructure software such as NumPy does get funding is a good sign in my opinion. I believe we can also get funding for stability, if only we clearly state that we need it.

Data supremacy

Pierre de Buyl reminded me of an article I wrote five years ago, in which I proposed that data rather than software tools should be the focus of scientific computing because data is of longer-lasting scientific importance. As I have pointed out two years later, that data includes scientific models (equations etc.), even though for technical reasons they are mostly embedded into software tools today (see here for an idea for doing things differently).

In a world where all scientifically relevant information is stored in stable and well-defined open file formats, software tools can evolve much more freely without disturbing ongoing work or harming reproducibility. New versions of software tools would merely have to maintain the functionality of their predecessors, but not their implementation details. However, this is at best a promise for the future. We don't even have the basic technology to make this happen, nor a consensus that it would be a good idea, which would open up the possibility of getting funding towards that goal. We will therefore need stable software environments for many more years to come.

A plea for stability in the SciPy ecosystem

Konrad Hinsen — 2017-11-16

Two NumPy-related news items appeared on my Twitter feed yesterday, just a few days after I had accidentally started a somewhat heated debate myself concerning the poor reproducibility of Python-based computer-aided research. The first was the announcement of a plan for dropping support for Python 2. The second was a pointer to a recent presentation by Nathaniel Smith entitled "Inside NumPy" and dealing mainly with the NumPy team's plans for the near future. Lots of material to think about... and comment on.

The end of Python 2 support for NumPy didn't come as a surprise to anyone in the Python community. With Python 2 itself not being supported after 2020, it doesn't make any sense for Python-dependent software to continue support beyond that date. The detailed plan for the transition of NumPy to a Python-3-only package looks quite reasonable. Which doesn't mean that everything is fine. The disappearance of Python 2 will leave much scientific software orphaned, and many published results irreproducible. Yes, the big well-known packages of the SciPy ecosystem all work with Python 3 by now, but the same cannot be said for many domain-specific libraries that have a much smaller user and developer base, and much more limited resources. As an example, my own Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread) for which resources (funding plus competent staff) are very difficult to find.

Speaking purely from a computational science point of view, the Python 2->3 transition was a big mistake. While Python 3 does have some interesting new features for scientists, most of them could have been implemented in Python 2 as well, without breaking backward compatibility. There are, of course, good reasons for the modernization of the language. I am not saying that Guido van Rossum is an idiot - far from it. As popular as Python may be in today's scientific research, scientific users make up a very small part of the total Python user base. Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it's mostly a calamity for computational science.

Apart from the major earthquake caused by this change in the Python language itself, whose victims we will be able to count starting from 2020, the SciPy ecosystem has been subject to regular minor seismic activities by breaking changes in its foundational libraries, such as NumPy or matplotlib. I am not aware of any systematic study of their impact, but my personal anecdotal evidence (see e.g. this report) suggests that a Python script can be expected to work for two to three years, but not for five or more. Older scripts will either crash, which is a nuisance, or produce different results, which is much worse because the problem may well go unnoticed.
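
One way to turn silently changed results into a visible failure is to store validated reference values next to a script and compare against them on every run. The following is a minimal sketch of that idea, not a description of any existing tool. The names `check_against_reference` and `analysis`, the data, and the tolerance are assumptions chosen for illustration.

```python
import math

def check_against_reference(results, reference, rel_tol=1e-9):
    """Return the names of results that are missing or differ from the stored reference."""
    mismatches = []
    for name, expected in reference.items():
        actual = results.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            mismatches.append(name)
    return mismatches

def analysis(data):
    """Stand-in for a real analysis step (hypothetical): mean and population variance."""
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return {"mean": mean, "variance": variance}

# Reference values recorded when the results were first validated.
reference = {"mean": 2.5, "variance": 1.25}

results = analysis([1.0, 2.0, 3.0, 4.0])
mismatches = check_against_reference(results, reference)
if mismatches:
    # A loud failure is a nuisance, but it is better than a changed result going unnoticed.
    raise RuntimeError(f"results changed for: {mismatches}")
```

This only detects a change after an update. It does nothing to prevent it, and it requires that someone recorded reference values in the first place.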

In my corner of science, biomolecular simulation, the time scale of methodological progress is decades. This doesn't mean that nothing exciting happens in shorter time spans. It just means that methods and techniques, including software, remain relevant for one to three decades. It isn't even uncommon for a single research project to extend over several years. As an example, I just edited a script whose last modification date was December 2015. It's part of a collaborative project involving methodological development and application work in both experiment and theory. The back-and-forth exchanges between experimentalists and theoreticians take a lot of time. In the course of such projects, I update software and even change computers. If infrastructure updates break my code in progress, that's a major productivity loss.

Beyond personal productivity considerations, breaking changes are a threat to the reproducibility of scientific studies, an aspect that has been gaining more and more attention recently because so many published results were found to be non-reproducible or erroneous (note that these are very different things, but that's not my topic for today), with software taking a big share of the responsibility. The two main issues are: (1) non-reproducible results cannot be trusted, because nobody really knows how they were obtained and (2) code whose results are non-reproducible is not a reliable basis for further work (Newton's famous "standing on the shoulders of giants"). Many researchers, myself included, are advocating better practices to ensure computational reproducibility. In view of the seismic activities outlined above, I have been wondering for a while whether I should add "don't use Python" to my list of recommendations. What's holding me back is mainly the lack of any decent alternative to today's SciPy ecosystem.

Watching Nathaniel's BIDS talk, I was rather disappointed that these issues were not treated at all. There is a general discussion of "change", including a short reference to breaking changes and their impact on downstream projects, which suggests that there has been some debate of these questions in the NumPy community (note that I am no longer following the NumPy discussion mailing list for lack of time). However, assuming that Nathaniel's summary is representative of that debate, neither reproducibility nor the requirements of the different software layers in scientific computing seem to have received the attention they deserve.

I have written before about software layers and the lifecycle of digital scientific knowledge, so I will just give a summary here. A scientific software stack looks like this:

Layer 4: project-specific code
Layer 3: domain-specific libraries
Layer 2: scientific infrastructure
Layer 1: non-scientific infrastructure

In the SciPy universe, we have Python in layer 1, NumPy and friends in layer 2, lots of lesser-known libraries (including my MMTK mentioned above) in layer 3, and application scripts and notebooks in layer 4.
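
The layering can be written down as an ordered type, which makes the direction of the dependencies explicit. This is only an illustrative sketch: the names `Layer`, `depends_on` and `built_on` are made up for this example, and real dependency graphs are of course finer-grained than four levels.

```python
from enum import IntEnum

class Layer(IntEnum):
    """The four layers of the stack, numbered from the bottom up."""
    NON_SCIENTIFIC_INFRASTRUCTURE = 1  # e.g. Python itself
    SCIENTIFIC_INFRASTRUCTURE = 2      # e.g. NumPy and friends
    DOMAIN_SPECIFIC_LIBRARIES = 3      # e.g. MMTK
    PROJECT_SPECIFIC_CODE = 4          # application scripts and notebooks

def depends_on(layer):
    """Layers that code in `layer` is built on."""
    return [other for other in Layer if other < layer]

def built_on(layer):
    """Layers whose code is built on `layer`."""
    return [other for other in Layer if other > layer]

for layer in Layer:
    below = ", ".join(str(int(other)) for other in depends_on(layer)) or "none"
    print(f"layer {int(layer)} ({layer.name.lower()}) builds on layers: {below}")
```

For example, `built_on(Layer.SCIENTIFIC_INFRASTRUCTURE)` returns layers 3 and 4, the code that is exposed when a layer 2 library changes its behavior.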

A breaking change in any layer affects everything in the layers above. The authors of the affected higher-level code have three options:

adapt their code (maintenance)
freeze their code (describe the stack they actually used)
do nothing

The first choice is of course the ideal case but it requires serious development resources. With the second one, archival reproducibility is guaranteed, i.e. a reader knows under which conditions the code can be used and trusted, and how these conditions can be recreated. But frozen code is not a good basis for further work. Using it requires much work for re-creating an outdated environment. Worse, using two or more of such packages together is in general impossible because each one has different dependency version requirements. Finally, the third option leaves the code in a limbo state where it isn't even clear under which conditions it can be expected to work. In a research context, this ought to be considered unacceptable.
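
The problem of combining frozen packages can be illustrated with a small sketch. Each frozen package pins exact versions of its dependencies, and two packages can share a Python process only if they agree on every dependency they have in common. The function `find_conflicts` and the two environments below are hypothetical examples, not taken from real projects.

```python
def find_conflicts(frozen_a, frozen_b):
    """Return the dependencies that two frozen environments pin to different versions."""
    return {
        name: (frozen_a[name], frozen_b[name])
        for name in sorted(frozen_a.keys() & frozen_b.keys())
        if frozen_a[name] != frozen_b[name]
    }

# Hypothetical frozen environments of two independently published packages.
package_2013 = {"python": "2.7.5", "numpy": "1.7.1", "matplotlib": "1.3.0"}
package_2017 = {"python": "3.6.3", "numpy": "1.13.3", "scipy": "1.0.0"}

conflicts = find_conflicts(package_2013, package_2017)
for name, (old, new) in conflicts.items():
    print(f"{name}: {old} vs {new}")
```

Any non-empty result means that the two packages cannot be used together without porting at least one of them, which is maintenance work and thus the first option again.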

Let's consider now how these three choices are applied in practice, for each layer in the software stack. Software in layers 1 and 2 must obviously be maintained, otherwise people would quickly abandon it. Fortunately these layers also suffer the least from collapse, because there is less code below them. Layer 3 code gets more or less well maintained, depending on the size of the communities supporting it, and on the development resources available. Quite often, maintenance is sub-optimal for lack of resources, with the maintainers aware of the problem but unable to do a better job. That's my situation with MMTK.

Layer 4 code is the focus of the reproducible research movement. Today, most of this code is still not published, and of the small part that does get out, a large part is neither maintained nor frozen but simply dumped to a repository. In fact, the best practices recommended for reproducible research can be summarized as "freeze and publish layer 4 code". Maintaining layer 4 code has been proposed (see e.g. continuous analysis), but it is unclear if the idea will find acceptance. The obvious open question is who should do the maintenance. Considering that most research is done by people who spend a few years in a lab and then move on, it's difficult to assign the responsibility for maintenance to the original authors of the code. But anyone else is less competent, less motivated, and would likely expect to be paid for doing a service job.

An argument I hear frequently in the SciPy community (and elsewhere) is that scientific code that is not actively used and maintained isn't worth bothering with (see e.g. this tweet by Titus Brown). The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don't agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn't matter.

I would like to see the SciPy community define its point of view on these issues openly and clearly. We all know that development resources are scarce, that not everything that's desirable can be done. The real world requires compromises and priorities. But these compromises and priorities need to be discussed and communicated openly. It's OK to say that the community's priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don't have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community's business, and that those who have such needs should look elsewhere. Or else, decide that SciPy is inclusive and caters for all computer-aided research - and draw the conclusion that stability must take a larger weight in future development decisions.

What is not OK is what I perceive as the dominant attitude today: sell SciPy as a great easy-to-use tool for all scientists, and then, when people get bitten by breaking changes, tell them that it's their fault for not having a solid maintenance plan for their code.

Finally, in anticipation of an argument that I expect to see, let me stress that this is not a technical issue. Computing technology moves at a fast pace, but that doesn't mean that lack of stability is inevitable. My last Fortran code, published in 1994, still works without changing a single line. Banks have been running Cobol code unchanged for decades. Today's Java implementations will run the very first Java code from 1995 without changes, and even much faster thanks to JIT technology. This last example also shows that stability is not in contradiction with progress. You can have both if that's a design goal. It's all a matter of policy, not technology.

Note added 2017-11-22: see also my summary of the discussion in reaction to this post.

Comments retrieved from Disqus

xoviat:
Honestly if you actually want MMTK to be ported to Python 3, the least you can do is sign up for a GitHub account and upload the code to a repository. Right now, it's definitely not going to be ported because no one can look at the code.
- Konrad Hinsen:
  It has been on Bitbucket for a couple of years:
  https://bitbucket.org/khins...
  Releases have been on SourceSup, where they have always been among the top-ten downloads:
  https://sourcesup.renater.f...
Luis Pedro Coelho:
Long form follow-up: https://metarabbit.wordpres...
bastibe:
You can always install old versions of Python and packages using "pip install scipy==0.9.0". Old versions are not going away. If you need stability, this seems to be an easy option. Am I missing something?
- Konrad Hinsen:
  Many people have made this suggestion. In theory it works, as long as all dependencies are in PyPI. C library dependencies are often a problem. But the main issue is that you cannot suppose that everyone (all program authors and users) know exactly what to do and do it correctly. In practice, the approach you describe almost never works because some information is missing. To make it practical, we'd need easy-to-use tooling for all phases: producing a complete list of versioned dependencies (including C libraries), verifying the completeness of this list, and restoring the environment on a different machine. All that with simple tools that everybody can figure out how to use on all platforms.
  People are working on this, and I am optimistic that we will get there, but for a few more years we will have to live with the current state. Which is why stability still matters for reproducibility.
  In addition, stability will always matter for slow-moving science, where you need to combine ten-year-old and two-year-old libraries in a single program.
  - Syndafloden:
    If you want a completely reproducible case, you'll likely need to package it with the specific runtimes or dependencies -- Which shouldn't be very hard at all, with, say, a Nanobox solution or something similar.
    You usually want that either way, regardless of use-case, language or environment.
    - Robert Jamie Munro:
      Python is really terrible here compared to, for example, node/npm or even Java / Maven. There's even an XKCD comic about it: https://m.xkcd.com/1987/
      - Justin Black:
        So this is an operating system specific solution, but one could use a docker image with versioned binaries, and pinned python packages using requirements.txt
        That way, the image has everything you need in it.
NPoisson:
Hmm. I will tend to consider that a numerical work should be distributed as a git repository with frozen source code. Even better, new tech allows freezing the software stack if it's not too hardware dependent.
Aka, a pip requirements with proper versioning + a Dockerfile should be able to provide a frozen ecosystem and allow good reproducibility. Of course, these tech are new and not known for their stability... for now. But I think that it will be an important part of the scientific stack : you define well your OS and software needs, you provide your source in a well documented way and you distribute both with your publication.
- Konrad Hinsen:
  Many people are working on various solutions for freezing, mostly at a level below Python/SciPy and thus generic. I am rather optimistic that this will work out fine ultimately, although my personal bet is not specifically on Docker. However, it will take a long time to come up with a reliable and stable solution and then develop good tooling to make it easy to use.
  This is in fact what I call "archival reproducibility" in my post. It's an important step, but not a replacement for stable infrastructure.
gerritholl:
Scipy moved to version 1.0 three weeks ago (https://github.com/scipy/sc... ), after 16 years of development. Within those 16 years, many what you call layer-3 and layer-4 code has been built on top of scipy, in the full knowledge the API was not stable yet, as the 0.x version number indicated. The bump to version 1.0 suggests the API should be more stable from now on, which hopefully will be the case.
I agree that communication is key. If you want to build code that will run unchanged for 20 years, relying on a library that is in version `0.x` is probably not a good idea, unless you freeze the version and bundle it along. When scientific software is in beta, as scipy effectively was until three weeks ago, the API *should* be able to change. But 16 years to go from initial release to initial stable release, as scipy did, is very long.
- Konrad Hinsen:
  I fully agree, though I'd recommend more explicit communication than just a version number. Non-developers are often not familiar with version number conventions.
  I have no personal experience with scipy stability because I have always avoided scipy except for ephemeral experimentations. The reason is the difficult installation procedure, for which I didn't want to do technical support to the users of my own code.
  - stefanvdwalt:
    With the arrival of binary wheels, hopefully this is now a non-issue.
    - Konrad Hinsen:
      It's indeed much less of an issue. The remaining difficult situation is HPC systems (clusters, supercomputers) with severe Internet access restrictions that render pip non-operational. While downloading wheels on a different machine is possible in principle, few people know it's possible and fewer know how to do it. In practice, people install from source code on those machines.
      - Nathaniel J. Smith:
        Surely if you can get the source code onto the machine, then you can also get a wheel onto it? It's literally exactly the same process, except you click on the '.whl' link instead of the '.tar.gz' link. Actually, downloading wheels is easier, because you can type 'pip wheel ' and it will automatically download the whole transitive dependency tree as wheels, which you can then rsync over or copy onto a USB stick or whatever the magic transfer system is.
        I understand that not everyone may not realize this, but rewriting every scipy feature inside every package seems like a lot more work than explaining how to download wheels :-).
        - Konrad Hinsen:
          You are right that all the technology is there. As so often, the remaining big issue is making sure that everybody who has the problem can find the solution in a reasonable amount of time.
          BTW, the alternative to using scipy is not rewriting all its features, but rewriting, or finding in a smaller dependency, the one or two features that a given application needs. And that is sometimes easier than dealing with your users' installation questions, in my experience.
Luis Pedro Coelho:
+1 on this.
I find that often this discussion devolves into a binary "let's be like the kernel: stable APIs forever" vs "let's move fast and break things", but I would be happy with "let's break things if we must, but try hard to avoid breaking other people's code when there is an obvious alternative".
The python2/3 transition is annoying (and py3 was an avoidable mistake), but I think that numpy/scipy changing their interfaces without any regard for backwards compatibility is much worse. For example, scipy.stats.mannwhitneyu has had at least 3 different behaviours in as many years without a lot of discussion of the possible effects on people's code. I almost published wrong results because of this particular change.
Histogram() changes have also caused me problems (for a while, people would email me every few months about not being able to reproduce my paper because numpy broke the code) [https://metarabbit.wordpres...].
I once filed what I thought was an obvious bugfix (make the code follow the documented API instead of changing it for one high profile project) and had to argue for it: https://github.com/numpy/nu... Again, they broke my code for absolutely no good reason.
- Nathaniel J. Smith:
  I clicked through your links because I sympathize with your frustration, and wanted to see what we did wrong in case it's something we can handle better in the future. I'm still not sure what your issue with histogram was -- the link in that paragraph leads to a blog post that doesn't have any more details either. But I did read through PR #2780, which is linked both from that blog post and the bottom of your comment.
  I have to say, I found this extremely frustrating. The change that broke your code wasn't for "no good reason" or "aesthetic grounds" (as you describe it in the linked blog post) -- it was made because the 1.7 release broke Theano, and they submitted a fix to un-break it. I.e., your evidence that we don't care about backwards compatibility is that we *made a backwards compatibility fix*. In the process, we did accidentally break your code -- sorry about that. The patch was reviewed, but at the time no-one realized that it could cause compatibility breakage. (I'm still not entirely clear on why that happened -- I think it has to do with ways in which C++ is stricter than C? Nonetheless, it obviously did. Again, I apologize for this part.) Once you submitted your PR and alerted us of the problem, we confirmed with Theano that your fix wasn't going to break their code again, and then we merged it and backported it to the stable release branch. This all happened within 12 hours, and I posted the first reply – which linked to the previous context explaining why the change was made, and started the process of checking with Theano – 6 minutes after your original submission, at 3am my time.
  It's true that everyone mostly ignored your argument about the documentation. This is for two reasons: first, when documentation and code disagree, the default is to change the documentation. This is mandatory if you care about backwards compatibility -- in fact it follows directly from the rule that you cited at the beginning of your post. Changing the code might break users, and changing the documentation is an "obvious alternative" that doesn't risk breaking users. So everyone was focused on the breakage, not the documentation. And second, it didn't even matter anyway – we were already in the process of fixing the problem, so we focused on that instead of getting into a tangential discussion about engineering principles.
  All in all, I'm shocked that *this* is the example you use to go around sneering about how we're a bunch of terrible engineers who don't care about our users. You should feel ashamed of yourself.
  We've certainly made mistakes, and doubtless will continue to do so in the future. NumPy's a complex project, maintained by a small handful of volunteers, who are trying to support millions of users with contradictory requirements – inevitably we do mess up. When we do, we know it causes real harm to our users, and we try to do better. But at least acknowledge that we're trying. Geez.
- stefanvdwalt:
  While there may be isolated cases that have been badly handled, the general approach is to be conservative with API changes unless there is a significant benefit (e.g., clarity, or additional usage possibilities). Many libraries in the SciPy ecosystem follow a three-release deprecation cycle, which means in practice that if you run your code once a year, you will at least see warnings that indicate what needs to be changed. The expectation that libraries should *never* change APIs is unreasonable; for papers you should consider either specifying the version of NumPy, or publishing the code in a location where you have the ability to change it later. Your comment seems to suggest that the NumPy and SciPy developers do not care about backward compatibility, which I don't think is an accurate reflection.
  - Luis Pedro Coelho:
    As I wrote, I don't think that the choice is a binary one between "never change the API" like the kernel and changing it at will.
    "in practice that if you run your code once a year, you will at least see warnings that indicate what needs to be changed"
    This is only true if I run my code once a year with the (at the time) most up to date version; not true otherwise. Also, sometimes I want to retrieve code that I used 2 years ago in another project and I would rather have an expectation that it works.
    "While there may be isolated cases that have been badly handled, the general approach is to be conservative with API changes unless there is a significant benefit (e.g., clarity, or additional usage possibilities)."
    This is exactly our disagreement. I don't think that "clarity or additional usage possibilities" is anywhere close to something that would justify breaking backwards compatibility for a foundational project like numpy or scipy.
    Add new functions while deprecating the older ones. Most new functionality can be done with new functions or even just new arguments. This way, you improve the API and evolve it. After a few years, remove old functions. But changing the behaviour of working code in 3 release cycles (18 months) is not what I'd consider conservative, it's rather on the "move fast and break things" side of the scale. For more cutting edge projects, that could be OK, even expected, but numpy/scipy should be more like infrastructure.
    I won't even ask for something like semantic versioning (where there would be a commitment to supporting the APIs for the duration of a major release), but 18 months is way too short for a project like numpy, especially for changes that silently change results. And if I report a change to a documented API that caused code to stop compiling, it should be treated as a bona fide bug (and not as a discussion of which API is best).
Pierre de Buyl:
Hi Konrad, interesting read!
In the direction of "mitigation" of these issues, there is your other idea that data is more important than code (Hinsen 2012, CISE). Whether you maintain, freeze, or ignore, the availability of reference data allows future "you" or future "someone else" to perform at least a comparison test.
- Konrad Hinsen:
  Yes, data in open, documented, and software-independent formats is a big plus for longevity. My own MMTK is a bad example there, because it uses a trajectory format that includes executable Python code, making it very hard to process from other languages. I have repented and defined a more open and language-neutral format (MOSAIC, https://mosaic-data-model.g...).
  Unfortunately, data supremacy is almost as hard to sell as stable software!
Nathan Goldbaum:
NumPy LTS will continue to be available on Python2 and MMTK will continue to be able to be built with it.
- Konrad Hinsen:
  Indeed, but very soon Python 2 will have to be banned into some sandbox because security bugs are no longer fixed. It's good that NumPy LTS will remain available for frozen code, but it's not sufficient to keep code alive and useful.
jsierles:
Freezing the stack may end up being the only real solution as dependencies trees grow in complexity. If this were easy to do, and long term reproducibility could be guaranteed, would you accept it as a solution?
- Konrad Hinsen:
  As I wrote in my post, it's a partial solution, OK from a reproducibility point of view, but insufficient for long-running projects, or for taking up old projects again. For that, I need to be able to use ten-year-old and two-year-old libraries together from the same script.
  - jsierles:
    I won't argue that long term support is important at a library level. However, it seems unrealistic in modern software environments to expect it. Rather, I think we need to look towards new ways of WRITING code, and of defining dependencies. For example, if you could split your script into sections, each using a different dependency tree, but passing values between them outside the runtime, you could avoid a lot of typical problems with dependency hell. Also, tools like Guix (which you've written about) help solve the underlying dependency graph problem in a manageable way. I've seen some success with this approach.
    I agree this is not a 'technical' issue, but also think there are more solutions available than are made obvious at this level of discussion. Would love to see some actual code and see how we could solve specific problems!
    - Konrad Hinsen:
      There are various *possible* technical solutions, and more are being worked on. But today, we have no solution that works in practice, meaning that it is sufficiently simple on all major platforms that the majority of scientists can work with it. Which is why for now, and a few more years to come, breaking changes in infrastructure are a danger for reproducibility.
      BTW, labelling a potential solution as "unrealistic" is a major contribution to the problem itself. As I pointed out with the examples of the Fortran, COBOL, and Java ecosystems, stability is possible not only in theory but also in practice, under the condition that everyone keeps it in mind during design and development. In a community where most people consider stability unrealistic, it cannot happen.
      - jsierles:
        I completely agree that the label contributes to the problem. And that some call for stability is justified in any heavily used software project. However, in the case of Python, and other languages like Javascript, the issues run deeper than the label, down to how packaging systems work and the language designers' goals when making changes. Stability? Simplicity? Programmer happiness? It's truly hard to reconcile these, and increasingly so as more languages enter the space. So I don't see adopting stability as something necessarily easier or faster to do than exploring other solutions that can apply to a wider range of problems.
        Furthermore, I see that technical solutions are equally unfairly labeled as unrealistic because of an unquantifiable cost of adoption. The result is that we see a lot of talk about reproducibility that boils down to a lengthy laundry list of best practices, e.g. (http://journals.plos.org/pl...).
        Instead, as technologists, I think we are responsible to build better tools and more creative solutions to the problem.
        
        Konrad Hinsen:
        I pretty much agree with all that. And I would definitely encourage technologists to continue looking for better solutions. The one mistake not to make is to declare victory when a proof of concept has been achieved. That's just the beginning of the next episode: convince enough early adopters that communities like Software Carpentry will add the new technology to their courses.

There is no such thing as software development

Konrad Hinsen — 2017-11-09

It's hard to find an aspect of modern life that is not influenced in some way by software. Some of it is very visible, for example the Web browser I start on my computer. Other software is completely invisible, such as the software controlling my car's diesel engine. Some software is safety critical, for example flight control software in airplanes. Other software is used in a much more futile way, such as playing games. I could go on listing characteristics in which different software packages differ, but I will leave it at that - I don't really expect anyone to disagree about the ubiquity and diversity of software in our increasingly digital world.

Given this diversity, it is surprising how many seem to consider "software development", and related terms such as "software engineering", as general concepts requiring no further qualification. In particular, plenty of people are happy to discuss in an abstract way how software should best be developed, without any reference to a concrete application domain, project size, expected longevity, etc. Imagine we did the same for the world of atoms, lumping together activities as distinct as chemical synthesis, carpentry, and dental surgery under the label "matter manipulation", and starting a discussion about best practices for matter manipulation. I doubt anyone would take such a debate seriously.

A good example of such an overly abstract discussion is the one about the benefits of static typing. There is a large camp of static typing enthusiasts who claim that static typing is Right with a capital R. They argue that it's always better to have correctness guarantees than not to have them. The implicit assumption is that static typing comes at no cost, which is manifestly false. The main contributions to this cost are 1) additional cognitive load, 2) the need to work around the limitations of a type checker, and 3) additional barriers to the combination of independently developed libraries. As soon as one admits the necessity of a cost-benefit analysis for static typing, it quickly becomes obvious that this can only be done for 1) some specific category of software and 2) a specific type system. The question then becomes: is type system A useful for improving the quality of software in application domain X? A nice example of this point of view is given by Rich Hickey in his keynote on "Effective Programs", where he explains why none of the well-known type systems are useful for the kind of software he writes, leading to his decision to design Clojure as a dynamically typed language.

Focusing software development questions on specific software categories has many potential benefits. Perhaps most importantly, it permits formulating questions in a precise enough way to make them amenable to empirical verification (aka "the scientific method"), acting at the same time as a safeguard against overly generalizing the conclusions from empirical studies. Moreover, the study of specific use cases is likely to lead to improvements in the methodology. In my example of static typing, it can be expected that once type system designers adopt the habit of thinking about specific software categories, they will design and evaluate type systems for various important application domains, taking into account both the kind of data being processed and the kinds of mistakes one would like to protect oneself against. Even better, once type system designers recognize that there is no single type system to rule them all, they might start to think about how to combine pieces of software written using different type systems. In the end, the three cost factors I mentioned might all end up heavily reduced.

Since there is a chance that some type system designers are reading this, I'll profit from having their attention and suggest developing a type system for numerical computations, which by some strange coincidence is what I do in my own work. In this application domain, most data represents physical quantities and its low-level representation is "float" or "array of floats". Properties that one could usefully monitor in the course of type checking are dimensions and units, but also positivity or non-zeroness. For array operations, the compatibility of array dimensions is worth a check as well. A static proof of complete absence of such mistakes is probably not doable, but detecting as many mistakes as possible while inserting run-time checks for the rest is probably a very useful compromise. It is also worth considering some important sub-categories of numerical software, in particular the different layers of the scientific software stack that I have described before. The required guarantees are much higher for infrastructure software (layer 2) than for scripts and workflows (layer 4), and infrastructure developers can be expected to invest more effort to ensure correctness. However, this does raise the question of type-checking at the interface between layers, a possible solution being gradual typing.
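
As a rough illustration of the run-time half of that compromise, here is a minimal Python sketch. The `Quantity` class, its `dims` tuple of (length, time) exponents, and the `require_positive` helper are all hypothetical names invented for this example; they do not come from any existing library, and a real system would need far more care (units, conversions, arrays).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A float tagged with dimension exponents (length, time). Hypothetical sketch."""
    value: float
    dims: tuple  # e.g. (1, -1) for a velocity: length^1 * time^-1

    def __add__(self, other):
        # Addition only makes sense for quantities of identical dimension.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __truediv__(self, other):
        # Division subtracts exponents; a zero divisor is rejected explicitly.
        if other.value == 0:
            raise ZeroDivisionError("divisor must be non-zero")
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)


def require_positive(q):
    """Run-time check for a property a static checker might not be able to prove."""
    if q.value <= 0:
        raise ValueError("expected a positive quantity")
    return q


distance = Quantity(100.0, (1, 0))   # a length
duration = Quantity(20.0, (0, 1))    # a time
speed = distance / duration          # carries dimension (1, -1)
```

With these definitions, `distance + duration` raises a `TypeError` at run time; the point of a dedicated type system would be to report as many such mistakes as possible before the program runs.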

Static typing is merely one example of the importance of looking at specific software application domains; there are many others. The utility of paradigms such as object-oriented or functional programming is also mostly discussed in the abstract, as are the relative merits of development strategies like test-driven or agile development. Finally, some less discussed but practically important questions could get more limelight exposure if formulated more concretely in the context of specific applications. I am thinking for example of the choice between using external libraries and writing one's own code, involving the trade-off between development effort and the long-term risk of uncontrollable dependencies.

Comments retrieved from Disqus

Thomas Arildsen:
I think you raise some very important points here. It is similar in spirit to what I usually spend a substantial amount of time trying to convince students of in my courses: the choice between fast, compiled, "low-level" languages (such as C) and slower, interpreted, "high-level" languages (such as Python) is not a matter of one language to rule them all. It depends highly on how much time/cost you are willing to spend on developing the program vs how much it is actually going to be used after completion. In the case of custom scientific computing software, I find that Python or similar languages are what make sense.
Also, I find it very relevant that you point out how data types in numerical computing applications are not simply a case of int vs float. In fact, this is what two PhD students in a recent research project I was involved in tried to solve in this way: http://vbn.aau.dk/en/public... & http://magni.readthedocs.io.... The idea is to do run-time detailed numerical type-checking of function arguments using decorators in Python.
- Konrad Hinsen:
  Thanks for your comment!
  You point out another tradeoff, language choice, that very much depends on what your software is actually supposed to do. I didn't mention this example because I rarely see language choice discussed abstractly, although it certainly happens.
  It's good to see we agree on the importance of unit checking :-) If it's so rarely done in practice, that's because it is not well supported. For Python, your approach of run-time checking is very appropriate, but people who turn to Fortran or C for speed would expect compile-time checks with no run-time overhead. There is actually a tool (not so well known for now) that does static unit checking for Fortran (https://camfort.github.io/) and for C++ it can be done via template metaprogramming (http://www.boost.org/doc/li...). Microsoft's F# language has dimensional analysis as a built-in feature, as does Frink (https://frinklang.org/). But I am not aware of any language with a general-purpose type system that would allow the implementation of dimensional analysis. If anyone is, I'd appreciate a pointer.
  - Franklin Chen:
    General-purpose languages like Haskell have type systems that enable building your own dimensional analysis system if you want. One example mature library contributed to the community is https://hackage.haskell.org...
    - Konrad Hinsen:
      Thanks for the pointer! That library looks interesting, though I don't see how exactly it works, given that I have never heard of data kinds and type families before. But I can see from the source code that it does standard dimensional analysis and that it does the checking at compile time, which is the basic list of requirements. What I don't see is how it handles the well-known tricky cases such as making both Hz and Bq compatible with 1/s but not with each other.
      - Franklin Chen:
        Unfortunately, in `dimensional`, currently Hz and Bq are not kept different at all, actually. I see that although the types look different
```
hertz :: Num a => Unit Metric DFrequency a
becquerel :: Num a => Unit Metric DActivity a
```
        in fact DActivity is just an alias to DFrequency rather than a different type. I've submitted an issue at https://github.com/bjornbm/...
        
        Konrad Hinsen:
        And I have added a comment to prevent the authors from believing that there is a simple fix. Doing this correctly is probably a research project. But I hope somebody will go for it!

Why Python does so well in scientific computing

Konrad Hinsen — 2017-09-12

A few days ago, I noticed this tweet in my timeline:

I 'still' program in C. Why? Hint: it's not about performance. I wrote an essay to elaborate... appearing at Onward! https://t.co/pzxjfvUs5B
— Stephen Kell (@stephenrkell) September 5, 2017

That sounded like a good read for the weekend, which it was. The main argument the author makes is that C remains unsurpassed as a system integration language, because it permits interfacing with "alien" code, i.e. code written independently and perhaps even in different languages, down to assembly. In fact, C is one of the few programming languages that lets you deal with whatever data at the byte level. Most more "modern" languages prohibit such interfacing in the name of safety - the only memory you can access is memory allocated through your safe language's runtime system. As a consequence, you are stuck in the closed universe of your language.

System integration is indeed an important and often overlooked aspect of working with software. And this is particularly true for scientific computing, where application software with a fixed set of functionality is rare. Solving a scientific problem typically involves combining many pieces of software into a very problem-specific whole, which may well be run only a few times (see also my earlier post on this topic). This is exactly the task of system integration: assembling pieces into a whole using glue code where necessary. In computational science, this glue code takes the form of scripts, workflows, or more recently notebooks. This is technically quite different from the OS-level system integration that Stephen Kell refers to, but functionally it is the same.

Stephen's post reminded me of my long-standing plan to write a blog post about why Python has been so successful in scientific computing, in spite of having a reputation for bad performance. So... here it is.

There are of course many reasons for Python's success, but one of them is that it does a pretty good job at system integration. There are two Python features that I consider important for this, which are not shared by many other languages. One is data types explicitly designed for interfacing, the other is duck typing in combination with a small but versatile set of standard interfaces.

The first Python data type designed for interfacing in a scientific computing context is the good old NumPy array - which is in fact older than NumPy, having been introduced in 1995 by NumPy's predecessor, Numeric. Arrays are one of the bread-and-butter data types in scientific computing, to the point of being the only one available in languages like Fortran 77 or APL. The implementation of arrays in Numeric was designed to use the same data layout as Fortran and C, in order to allow interfacing to the Fortran and C libraries that dominated scientific computing in 1995 (and still do, though to a somewhat lesser extent). The idea behind Numeric and later NumPy was always to use Python as a glue language for Fortran and C libraries, and achieve speed by delegating time-critical operations to code written in these languages.
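
The shared data layout is what makes this delegation cheap. Here is a minimal sketch using only the standard library: `array.array` stands in for a Numeric/NumPy array (an assumption made to keep the example free of dependencies), and a `ctypes` array stands in for a C library's view of the same memory.

```python
import array
import ctypes
import struct

# A Python array of C doubles, stored contiguously as a C compiler would store them.
values = array.array("d", [1.0, 2.0, 3.0])

# The raw bytes match what a C `double[3]` would contain on this machine.
same_layout = values.tobytes() == struct.pack("3d", 1.0, 2.0, 3.0)

# A C-typed array that shares the Python array's memory (no copy is made).
c_view = (ctypes.c_double * len(values)).from_buffer(values)

# A write on the "C side" is immediately visible on the Python side.
c_view[1] = 20.0
```

After the last line, `values[1]` is `20.0`. A compiled routine handed the same address would see, and could modify, exactly the same numbers.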

The second Python data type designed for interfacing is memoryview, related to the buffer protocol. This is as close as Python gets to C-style memory access. The buffer protocol lets different Python data types access each other's internal memory at the byte level. A typical use case would be an image data type (e.g. from Pillow) allowing access to the in-memory representation of an image through an array type (e.g. from NumPy), permitting the implementation of image manipulation algorithms in terms of array operations.
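
A minimal sketch of this kind of sharing, using only built-in types: a `bytearray` plays the role of the image object and a `memoryview` plays the role of the array-like consumer. The six-pixel "image" and the `invert` function are made up for illustration; neither Pillow nor NumPy is used here.

```python
# Six 8-bit grayscale pixels standing in for a tiny 2x3 image (hypothetical data).
image = bytearray([0, 50, 100, 150, 200, 255])


def invert(pixels):
    """Invert pixel values in place through any writable byte buffer."""
    view = memoryview(pixels)      # shares memory with `pixels`, no copy
    for i in range(len(view)):
        view[i] = 255 - view[i]


# Slicing a memoryview yields another view, so only the second row is touched.
second_row = memoryview(image)[3:]
invert(second_row)
```

The function never learns what kind of object owns the bytes; it only relies on the buffer protocol, which is why the same code would work on any other type that exports a writable byte buffer.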

The third and least known Python data type for interfacing is the capsule that replaces the earlier CObject. Capsules exist solely for the benefit of Python modules written in C, which can exchange opaque data with one another via glue code written in Python, even though the glue code itself cannot inspect or manipulate the data in any way. A typical use is to wrap C function pointers in a Python object such that Python glue code, e.g. a script, can pass a C function from one module to C code from another module.

All these interfacing data types mediate between Python and C code, although quite often the Python system integrator is hardly aware of using C code at all. The other Python feature for system integration, duck typing with standard interfaces, is what facilitates glueing together independently written Python modules. By "standard interfaces", I mean the sequence and dictionary interfaces, but also the standard method names for operator overloading.

To see why this is an important feature, let us look at statically typed languages that by design do not have it. As a concrete example, consider multidimensional arrays in Java. They are not part of the language or its standard library, but they can be implemented on top of it with reasonable effort. In fact, there are several Java implementations you can choose from. And that's the problem. Suppose you want to use an FFT library based on array implementation A together with a linear algebra library based on array implementation B. Bad luck - the arrays from A and B have different types, so you cannot use the output of an FFT as the input to a linear equation solver. It doesn't matter that the underlying abstraction is the same, and that even the implementations have much in common. For a Java compiler, the types don't match, period.

Python is not completely immune to this problem. It is perfectly possible to write Python code, or C code in a C module, that expects a precise type of data as input, and will raise an exception otherwise. But in Python code that would be considered bad style, and in C modules for Python as well except where required for performance or for compatibility with the C code. Wherever possible, Python programmers are expected to use the standard interfaces for working with data. Iteration and indexing work the same way for arrays as for the built-in lists, for example. For operations that are not covered by the standard interfaces, Python programmers are supposed to use Python methods, which are subject to duck typing as well. In practice, independently implemented Python types are much more interoperable than independently implemented Java types. For the specific case of n-dimensional arrays, Python has had the good fortune of overwhelming acceptance of a single implementation, which is due more to social and historical than to technical factors.
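
As a small illustration of what relying on the standard interfaces buys you, the following sketch defines one function that uses only iteration and `len()`, and applies it to three unrelated types. `mean` and `Temperatures` are hypothetical names made up for this example.

```python
import array


def mean(data):
    """Average of any object that supports len() and iteration."""
    return sum(data) / len(data)


class Temperatures:
    """An independently written container implementing the sequence interface."""

    def __init__(self, *readings):
        self._readings = list(readings)

    def __len__(self):
        return len(self._readings)

    def __getitem__(self, index):
        # Integer indexing that raises IndexError past the end is enough
        # to make the object iterable via the sequence protocol.
        return self._readings[index]


results = [
    mean([1.0, 2.0, 3.0]),                    # built-in list
    mean(array.array("d", [1.0, 2.0, 3.0])),  # typed array from the standard library
    mean(Temperatures(1.0, 2.0, 3.0)),        # user-defined type
]
```

None of the three types knows about the others, and `mean` knows about none of them. In a language where the function had to name its argument type, each combination would need an adapter or a shared interface agreed upon in advance.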

Finally, even though Python is a pretty good choice for system integration in scientific computing, there are of course limits, which are exactly of the kind that Stephen Kell explains in his essay: combining Python code with code in other managed languages, say R or Julia, requires a lot of work and even then is fragile, because the required hacks depend on undocumented implementation details. I suspect that the only solution would be to have language-neutral garbage-collected data objects proposed as an OS-level service that maintains an option for non-managed byte-level access à la C. The closest existing technology I am aware of is Microsoft's CLR, better known by its commercial name .NET. Its implementation is now Open Source and runs on multiple platforms, but its Windows-only origins and strong ties to a huge Microsoft-y library have been an obstacle to adoption by the traditionally Unix-centric scientific computing community.

Comments retrieved from Disqus

Chris Barker:
You write:
"The idea behind Numeric and later NumPy was always to use Python as a glue language for Fortran and C libraries"
I have often wondered about this -- I started using Numeric in 1999, and followed the development through numarray, and then numpy, and onward :-)
I've often said that an ndarray is two things:
1) a nice featureful n-dimensional array object for Python, and
2) a wrapper around a C array (or really, a pointer to a data block).
(2) allows enormous power in communicating with Fortran and C codes -- as you mention.
The question is -- was this an intentional design decision? or a happy accident?
Has anyone ever asked Jim Hugunin or David Ascher?
(though I see your name on my historical copy of the docs from 2000 -- so maybe you were in on that decision at the time!)
- Konrad Hinsen:
  Yes, I was part of the initial Numerical Python development team, so I can confirm that interfacing to C and Fortran code was an important goal at the time. There is actually some evidence for this in the code and the API. For example, the separation of the array object storage into a data space and a small Python object with just the bookkeeping information. Plus the possibility to create an array using an externally allocated and managed data space.
  - Chris Barker:
    Thanks! good to know it wasn't just a happy accident.
    I haven't followed it recently, but at one point the folks working on the NumPyPy project really didn't "get" the importance of this aspect of numpy.
Stephen Kell:
Hi Konrad. Thanks for the "citation" and the kind words. :-)
One question: would you say it's the design of CPython and/or the Python language that have enabled this, or just the happenstance that somebody wrote those modules (NumPy, memoryview, capsule) and got them adopted? Could it have happened as easily in another dynamic language, say? I'm not familiar enough with the Python library ecosystem to distinguish these cases.
Your closing paragraph's idea of "language-neutral garbage-collected data objects proposed as an OS-level service" is very close to what I've been working on with liballocs (https://github.com/stephenr...). I believe the trick is to tolerate as much diversity as possible, rather than fixing "one true way" to implement higher-level languages and cutting loose the non-conformers (the CLR approach, more-or-less). In particular, I'm (slowly) working towards a treatment of garbage collection that allows a considerable degree of pluralism -- think multiple somewhat-cooperating allocators/collectors, rather than a single shared one.
- Konrad Hinsen:
  Hi Stephen,
  You will probably find Rich Hickey's talk on the design of Clojure interesting: https://www.youtube.com/wat...
  He insists very much on a systems point of view and points out the dangers of language lock-in. His context is very different from yours and mine, but the overall message is the same.
- Konrad Hinsen:
  Hi Stephen,
  thanks for your comments!
  To answer your question, I'd say it is a bit of both. CPython had C modules right from the start, in fact it used them in its own implementation. Those C modules are a bit more than the FFI that any modern language has. It is bidirectional in that it gives C modules access to Python data types, and lets them define new ones. That was a perfect basis for the later developments (NumPy, memoryview, capsule), which wouldn't have made much sense otherwise. It didn't have to happen, but it definitely wouldn't have happened without the existing support.
  Your liballocs looks interesting, although the list of build dependencies is a bit discouraging. I'll start by reading the paper :-) The idea of a lightweight and minimalistic storage management, not tied to a language or even to a bytecode interpreter/compiler, looks very useful. In scientific computing, it could solve many problems of interfacing languages operating at different levels of storage abstraction, e.g. C, C++, Fortran 90, and dynamic languages such as Python.
  - Stephen Kell:
    Thanks for the clarification!
    Certainly I am interested in finding users of my work within scientific computing... I'm currently scratching my head about how best to achieve this. One possible blocker is that even small run-time overheads are often considered intolerable.
    In case I can offer encouragement, most of the build dependencies are standard (and do not transfer to runtime)... the build instructions "should" "just work", at least on Debian-based machines (and with close equivalents on RPM distros). If not, do file a GitHub issue... but yes, I am working on packaging the library and tools more nicely in various ways. :-) Some of the dependencies will be eliminated once I have integrated more closely with gcc/clang... again, some work is ongoing, though not going as fast as I'd like.
    - Konrad Hinsen:
      After a quick look at your 2015 paper, I confirm that this looks very interesting. But it seems that all language implementers must build on liballocs for this to work. This might take some time to happen.
      As for run-time overheads, it all depends on where they occur. Much scientific code works on large uniform datasets, typically arrays. An overhead for the first access to an array is usually not a problem. An overhead for every element access would be prohibitive.
      - Stephen Kell:
        My hypothesis is that existing implementations can be retrofitted, rather than building new ones from scratch. But yes, this work needs to be done. And I admit the hypothesis is not tested yet, but is rather a case of "seems to be true" based on my current knowledge of the internals of various language implementations. The V8 modification mentioned in the paper started in this direction... but V8 is a particularly complex case. I hope to do some more work on this fairly soon, using some simpler VM.
        Yes, I try to confine overheads to rare operations, such as malloc-style allocation. So I think the core run-time services should be supportable on scientific code... just not every possible use of them (e.g. bounds-checking array accesses may have to be skipped).
        
        - Konrad Hinsen:
          The main problem I see is not so much the amount of work that must be done but the number of people that need to contribute to make it happen. That's perhaps more of a marketing question than a technical one.
          Are people in your corner of computing at least aware of the importance of the problem you are trying to solve? In my corner (computational science), they are not, although in my opinion it's one of our biggest problems in daily life. Most people don't see it as a problem because they don't envisage a solution. Languages being isolated universes is just normal, there is nothing to be done about this. I tried to explain the issues in an earlier blog post (http://blog.khinsen.net/pos...), but apparently with little success.
          As for array bounds checking, I am still hoping some PL designer will come up with a good solution. Mistakes in array index expressions are very frequent, but everyone turns off array bounds checking at compile time because of the huge runtime cost. Static array access validation would be very nice to have.

Which mistakes do we actually make in scientific code?

Konrad Hinsen — 2017-05-04

Over the last few years, I have repeated a little experiment: Have two scientists, or two teams of scientists, write code for the same task, described in plain English as it would appear in a paper, and then compare the results produced by the two programs. Each person/team was asked to do a maximum amount of verification and testing before comparing to the other person's/team's work.

Let me state the most disturbing outcome of this experiment first: we never found complete agreement between the two programs. Not once. And when we explored to find the cause of the discrepancies, we most often found bugs in both programs, plus missing details in the description written initially for human readers.

The two most practically significant experiments of this kind were actual research projects that have since been published:

A comparison of reduced coordinate sets for describing protein structure. For this work, Shuangwei Hu wrote Matlab code, and I wrote the Python code that was ultimately published.
Model-free simulation approach to molecular diffusion tensors. In this case, Gerald Kneller wrote Mathematica code, and I wrote the Python version again.

Later on, I did a series of similar experiments with PhD students participating in what can be summarized as advanced Python programming courses. PhD students with limited programming experience are exactly the kind of scientists who write much of the software for research projects. But the setting was "exercises in a course", with programming tasks being much simpler, and much better specified, than what the typical research project requires.

The results of these experiments that I will summarize here are no more than anecdotal evidence. In fact, the initial goal was not to perform an experiment in scientific computing, but to perform better checks on the code for a research project. It would be interesting to do a larger-scale proper study, but that's beyond my means and competence.

As I already mentioned, there was never complete agreement between the two programs supposed to solve the same problem. In many cases the differences were small, and I suspect many would have brushed them away as caused by uncontrollable round-off, given that all problems were numerical in nature. But upon closer scrutiny, we always found different issues, and got much better agreement after fixing them. This is why I still believe that bitwise reproducibility matters. When small numerical differences are inevitable, as they are with today's scientific programming languages, it becomes much more difficult to search for and eliminate mistakes.
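The round-off effect itself is easy to demonstrate. The following minimal Python sketch (the values are chosen purely for illustration and are not taken from any of the experiments) sums the same numbers in two different orders and gets two different results. A discrepancy of this kind looks exactly like a small mistake, so telling the two apart requires extra work.

```python
import math

# Illustrative values only: one large term followed by ten small ones.
values = [1.0e16] + [1.0] * 10

# Sum from left to right: each 1.0 is lost when added to 1e16.
forward = 0.0
for v in values:
    forward += v

# Sum from right to left: the small terms accumulate to 10.0 first.
backward = 0.0
for v in reversed(values):
    backward += v

# math.fsum returns the correctly rounded sum, independent of order.
reference = math.fsum(values)

print(forward == backward)    # False: same numbers, different order
print(backward == reference)  # True
```

Both loops are "correct" implementations of the same mathematical sum, yet only one of them agrees with the correctly rounded result.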

So which are the mistakes that were uncovered by comparing two independent implementations of the same method?

Number one, by far, is discrepancies between the informal description for human readers and the executable implementation. Put simply, the programs did not compute what the informal description said they should compute, or the informal description was incomplete, admitting more than one interpretation.

Number two is typos in numerical constants and in variable names. Since I can almost hear proponents of static typing saying "that's what you deserve for using Python", let me add that most typos in variable names would not have been caught by static type checking. If you have two integer loop indices i and j, no type checker will complain when you interchange them by mistake.
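A made-up Python illustration of this point, with type annotations that a static checker would accept for both functions (the function names and the test matrix are hypothetical):

```python
from typing import List


def row_sums(matrix: List[List[float]]) -> List[float]:
    """Sum of each row of a square matrix (the intended computation)."""
    n = len(matrix)
    sums = [0.0] * n
    for i in range(n):
        for j in range(n):
            sums[i] += matrix[i][j]
    return sums


def row_sums_with_typo(matrix: List[List[float]]) -> List[float]:
    """Same signature, same types, but i and j are interchanged."""
    n = len(matrix)
    sums = [0.0] * n
    for i in range(n):
        for j in range(n):
            sums[i] += matrix[j][i]  # typo: computes column sums instead
    return sums


m = [[1.0, 2.0],
     [3.0, 4.0]]
print(row_sums(m))            # [3.0, 7.0]
print(row_sums_with_typo(m))  # [4.0, 6.0]
```

For a square matrix, the second version does not even raise an index error. Only a comparison with an independent result reveals the mistake.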

Number three is off-by-one-or-two errors in loops and in array indices. If you have a complex formula involving lots of x[i], x[i+1], and x[i-1], it's hard to avoid getting an index wrong occasionally. Unfortunately, array bounds checking does not catch all of these mistakes. Another interesting observation is that this type of mistake is just as likely in the informal description as in the code. Humans are apparently not very good at handling this kind of "detail".

Is there anything we can do to reduce the risk of these types of mistakes? I'd say yes, but it's not going to be easy.

Let's start with what software engineering techniques could do to improve the situation. The main opportunity I see is for mistakes of the third kind. Index arithmetic could be eliminated altogether by abstracting it away. Most situations correspond to one of a handful of patterns, often called stencils, which could become functions or macros in a suitable domain-specific language. Another idea, applicable to legacy code, is to have code checking tools recognize stencils and small deviations from common stencils and point out potential mistakes - see this presentation at the recent 2nd Meeting on Testing and Verification for Computational Science.
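As a rough sketch of what abstracting away index arithmetic could look like, here is a small Python function that applies a one-dimensional stencil. The name apply_stencil and the dictionary representation of a stencil are assumptions made for this illustration, not an existing library API.

```python
def apply_stencil(x, stencil):
    """Apply a 1D stencil to all interior points of the sequence x.

    stencil maps integer offsets to weights, for example
    {-1: 1.0, 0: -2.0, 1: 1.0} for x[i-1] - 2*x[i] + x[i+1].
    Points too close to either end are left out of the result.
    """
    left = max(0, -min(stencil))   # neighbours needed on the left
    right = max(0, max(stencil))   # neighbours needed on the right
    return [sum(weight * x[i + offset] for offset, weight in stencil.items())
            for i in range(left, len(x) - right)]


# The index arithmetic is written once, inside apply_stencil.
# User code only names the pattern.
SECOND_DIFFERENCE = {-1: 1.0, 0: -2.0, 1: 1.0}

x = [float(i * i) for i in range(6)]        # 0, 1, 4, 9, 16, 25
print(apply_stencil(x, SECOND_DIFFERENCE))  # [2.0, 2.0, 2.0, 2.0]
```

With this design, an expression such as x[i+1] never appears in the scientific part of the code, so there is no opportunity to type x[i-1] by mistake. The price is that every formula has to be expressed in terms of the available patterns.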

Similar heuristic searches for potential mistakes could be applied to typos in variable names, though it is not clear that such reports would ultimately be useful. The real issue is the widespread use of short and similar variable names. A radical approach would be to ban them as part of a programming style guide, and have source code checkers flag violations of such a rule.

For the main source of mistakes, discrepancies between informal specification and implementation, software engineering approaches are totally hopeless in my opinion. After all, the programs are perfectly reasonable and consistent, they merely solve a problem that is different from the one they were written to solve. Given the current state of technology, the comparison between the two problem descriptions can only be done by human proofreading, as long as at least one problem description is informal. I suspect the best approach we have today is exactly what I described above - develop two independent implementations and compare.
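A toy example of how such a comparison exposes an incomplete description: suppose the paper says only "compute the variance of the data". The two Python functions below are hypothetical and stand for two implementers who each chose one of the two common readings. Each function is internally consistent and would pass its author's own tests.

```python
import math


def variance_reading_a(data):
    """Reading A: divide the sum of squared deviations by N."""
    mean = sum(data) / len(data)
    return sum((v - mean) ** 2 for v in data) / len(data)


def variance_reading_b(data):
    """Reading B: divide the sum of squared deviations by N - 1."""
    mean = sum(data) / len(data)
    return sum((v - mean) ** 2 for v in data) / (len(data) - 1)


def relative_difference(a, b):
    """Relative difference between two results, 0.0 if both are zero."""
    scale = max(abs(a), abs(b))
    return abs(a - b) / scale if scale > 0.0 else 0.0


data = [1.0, 2.0, 3.0, 4.0]
a = variance_reading_a(data)
b = variance_reading_b(data)
print(a, b, relative_difference(a, b))
print(math.isclose(a, b, rel_tol=1e-12))  # False: the description was ambiguous
```

Neither program contains a bug in the usual sense. The disagreement points back to the informal description, which is where the fix has to be made.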

In the long run, we can work on reducing the gap between informal descriptions (papers, software documentation) and executable implementations. I vaguely remember hearing about people exploring the possibility of turning informal descriptions into formal specifications by natural language processing - if anyone has a reference, please leave a comment! But I am rather skeptical of this approach, and therefore I prefer to let humans make the move towards formal specifications. The human-computer interface for such specifications is what I call digital scientific notations, and I am currently working on developing such a notation for my corner of science, which is computational physics and chemistry.

Finally, let me point out that my experiments and their conclusion apply only to research code in the strict sense, i.e. code that was written to compute a result that is a priori unknown. Referring to my earlier post on software collapse, this is the fourth and project-specific layer of scientific software. When writing libraries and software tools that implement established methods for wider use, the situation is different because testing can be used much more effectively.

Comments retrieved from Disqus

alqualond:
Interesting post, thank you. Do you think that software engineering practices for extracting requirements and describing systems (eg UML diagrams) could help with the first problem (mismatch between specification and implementation), or is research software too different than "production code" for them to be useful?
- Konrad Hinsen:
  The particularity of project-level research code is that its specification evolves as the research is done. In the beginning, the specification is no more than a list of ideas to explore, with references to earlier, more mature work. I have never seen anything like this done with UML or other software engineering notations.
asmeurer:
Do you think 0-based indexing vs. 1-based indexing makes any difference for the index related bugs?
- Konrad Hinsen:
  All my experiments were done in Python and therefore using 0-based indexing, so they cannot help answer this question. There wasn't any complex index arithmetic in any of the programs, as far as I remember, so I wouldn't expect 1-based indexing to make any difference, but that's theory, not practice.

Reproducible research in the Python ecosystem: a reality check

Konrad Hinsen — 2017-04-06

A few years ago, I decided to adopt the practices of reproducible research as far as possible within the technical and social constraints I have to live with. So how reproducible is my published code over time?

The example I have chosen for this reproducibility study is a 2013 paper about computing diffusion coefficients from molecular simulations. All code and data has been published as an ActivePaper on figshare. To save space, intermediate results had been removed from the published archive. This makes my reproducibility check very straightforward: a simple aptool update will recompute everything starting from these intermediate results up to the plots that went into the paper.

One nice aspect of ActivePapers is that it stores the version numbers of all dependencies, so I can quickly verify that in 2013, I had used Python 2.7.3, NumPy 1.6.2, h5py 2.1.3, and matplotlib 1.2.x (yes, the x is part of the reported version number).
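For code that is not packaged as an ActivePaper, a similar record can be produced by hand. The sketch below is not how ActivePapers does it. It is a minimal illustration using only the Python standard library (importlib.metadata, available from Python 3.8), and the function name dependency_versions is made up for this example.

```python
import sys
from importlib import metadata


def dependency_versions(packages):
    """Return a dict mapping package names to installed version strings.

    Packages that are not installed are recorded as None, so that the
    record also documents what was missing.
    """
    versions = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Store this next to the results, e.g. as JSON in the output directory.
print(dependency_versions(["numpy", "h5py", "matplotlib"]))
```

Such a record only lists the packages you think of asking about. It does not capture indirect dependencies, compilers, or system libraries.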

First try: use my current Python environment

The environment in which I do most of my current research has Python 3.5.2, NumPy 1.11.1, h5py 2.6, and Matplotlib 1.5.1. I set it up about a year ago when I got a new laptop, and haven't had a good reason to update it since then. I had made some effort back in 2013 to make my code compatible with Python 3, so why not try now if this was a worthy investment?

Outcome: running the computations works just fine, with results that are not identical at the bit level but close enough for my application. However, I get some warnings from matplotlib when generating the plots. Here is the first one, the others are similar:

UserWarning: Legend does not support 'x' instances.
A proxy artist may be used instead.
See: http://matplotlib.org/users/legend_guide.html#using-proxy-artist
  "#using-proxy-artist".format(orig_handle)

A quick inspection of the plots shows that the legends have almost disappeared, all that's left is a small white box. That makes many of the plots unintelligible.

Just out of curiosity, I made a quick attempt to figure out the error message. What's that 'x' instance? The following messages also refer to 'yz' instances and a few others. A look at my script reveals that 'x', 'yz' etc. are in fact the strings that I supplied as legends. Sounds strange to call them 'x' instances, as if 'x' were a class. And what's that cryptic reference to a proxy artist?

Better stop here: my goal was to see if I can reproduce my data and figures from 2013 in a Python environment from 2016, and the answer is no. The plots are mutilated to the point of no longer being useful.

Second try: use my current Python 2.7 environment

Some of my research code still lives in the Python 2.7 universe, so I also have a Python environment based on Python 2.7.11 on my laptop, with NumPy 1.8.2, h5py 2.5, and matplotlib 1.4.3. That's much closer to the original one, so let's see how well it does in my reproducibility evaluation.

Outcome: Much better. The computations work fine as before, and the plots generate a single warning:

MatplotlibDeprecationWarning: The "loc" positional argument to legend is deprecated. Please use the "loc" keyword instead.

The legends still look OK, so the warning is just a minor nuisance, as one would expect from a deprecation-related message. Interestingly, this warning is also about legends, so it looks like there was a serious backwards-incompatible change in matplotlib's legend function between 1.2 and 1.5, which was prepared by a deprecation warning in 1.4.

Third try: reconstructing the original environment

Since I have the version numbers of everything, why not try to reconstruct the original environment exactly? Let's go for the same major and minor version numbers, which should be sufficient. That's a job for Anaconda:

conda create -n python2013 python=2.7 numpy=1.6 h5py=2.1 matplotlib=1.2 anaconda
source activate python2013
pip install tempdir
pip install ActivePapers.Py

Outcome: no warnings, no errors. Identical results. Reproducibility bliss at its best.

Conclusions

In summary, my little experiment has shown that reproducibility of Python scripts requires preserving the original environment, which fortunately is not so difficult over a time span of four years, at least if everything you need is part of the Anaconda distribution. I am not sure I would have had the patience to reinstall everything from source, given an earlier bad experience.

The purely computational part of my code was even surprisingly robust under updates in its dependencies. But the plotting code wasn't, as matplotlib has introduced backwards-incompatible changes in a widely used function. Clearly the matplotlib team prepared this carefully, introducing a deprecation warning before introducing the breaking change. For properly maintained client code, this can probably be dealt with.

The problem is that I do not intend to maintain the plotting scripts for all the papers I publish. And that's not only out of laziness, but because doing so would violate the spirit of reproducible research. The code I publish is exactly the code that I used for the original work, without any modification. If I started maintaining it, I could easily change the results by accident. I'd thus have to introduce regression tests as a safeguard against such changes. But... how do I test for visual equivalence of plots? Bitwise reproducibility is about as realistic to expect for image files as for floating-point numbers: I don't even get bitwise identical image files running the same Python code with identical matplotlib versions on different machines.
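A partial workaround, which leaves the question of visual equivalence open, is to regression-test the numbers that go into a plot rather than the image file. The following Python sketch assumes that each plot can be reduced to a dictionary mapping curve labels to lists of values. The function name curves_match and the sample data are invented for this illustration.

```python
import math


def curves_match(reference, current, rel_tol=1e-9, abs_tol=0.0):
    """Compare two {label: [values]} dictionaries within a tolerance.

    Returns True only if both contain the same labels, each curve has
    the same length, and all values agree within the given tolerances.
    """
    if reference.keys() != current.keys():
        return False
    for label, ref_values in reference.items():
        cur_values = current[label]
        if len(ref_values) != len(cur_values):
            return False
        if not all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
                   for r, c in zip(ref_values, cur_values)):
            return False
    return True


# Invented numbers standing in for the data behind two curves of a plot.
reference = {"x": [0.10, 0.20, 0.30], "yz": [1.0, 2.0, 3.0]}
recomputed = {"x": [0.10, 0.20, 0.30 * (1 + 1e-12)], "yz": [1.0, 2.0, 3.0]}
print(curves_match(reference, recomputed))  # True: differs only at round-off level
```

A check like this would have passed in the first experiment above, where the computed numbers were fine and only the legends were lost. It protects the data, not the figure.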

For my next paper, I will look for alternatives to matplotlib. My plotting needs are rather basic, so perhaps there is some other library with a more stable API that is good enough for me. Suggestions are welcome!

Comments retrieved from Disqus

Vicky Steeves:
Hi! I'd also recommend checking out ReproZip, which is designed to capture the computational environment of research for reproducibility: https://reprozip.org && https://examples.reprozip.org
To create a completely reproducible package (a .rpz file), you just prepend "reprozip trace" to your current command -- so it would look like "reprozip trace python funScript.py" Then to create the package, you just type "reprozip pack "
You can send that to someone else (a reviewer, a collaborator, yourself in 5 years) who can then reproduce your work across different operating systems/configs using our unpacker plugins. You can use a graphical interface or the command line to unpack.
The point of ReproZip is to create computationally reproducible work -- that is, capture research at the environment level, like your blog post captured so accurately, as easily as possible. 2 commands to pack, 2 to unpack (unless you use the GUI, then it's only a few clicks).
This is a low-barrier way to create nice little reproducible packages of your research, to either share with others or share with yourself. Anyway, you should check it out!
- Konrad Hinsen:
  Thanks for mentioning ReproZip! It does sound familiar - I discovered this a few months ago but at least back then it was Linux only, so I couldn't use it for my own work. Judging from a quick look at the documentation, it seems there is now support for re-executing archives under MacOS and Windows, but making archives still requires Linux.
  Not that this is meant as a criticism - a good tool for Linux is significant progress. And I understand that capturing environments at the executable level requires low-level systems operations that are hardly portable.
Peter Amstutz:
Docker helps a lot. Best practice seems to be to store the Dockerfile and everything that goes into it (including sw packages) in git so you can rebuild the same environment later. But even then, I've seen changes in kernel version or VM configuration break even Dockerized workflows.
- Konrad Hinsen:
  In the spirit of my reality check, I would appreciate hearing of experiments with the long-term stability of Docker images. Did anyone try to revive a four-year-old Docker image? Yes, I know that means going back to the very first Docker release, so it's perhaps asking for too much. But two years should be reasonable. Any takers?
  - F. Pina Martins:
    I think there's a paper somewhere in this idea. =-)
    That being said, I'm genuinely interested in how well this works, since I'm currently using docker containers to make my own research reproducible.
Damien Irving:
Have you seen this post from Titus Brown?
http://ivory.idyll.org/blog...
It explores the concept of a half life for the repeatability of your research:
"... it is at least plausible to argue that we don't really care about our ability to exactly re-run a decade old computational analysis. What we do care about is our ability to figure out what was run and what the important decisions were -- something that Yolanda Gil refers to as "inspectability." But exact repeatability has a short shelf-life."
- Konrad Hinsen:
  I do remember that post, since I participated quite a bit in the discussion about it. And I think that discussion deserves to continue.
  Most of the current reflections about reproducibility, including my post here, start from the technical end: What is the state of the art? What is feasible in principle, what is feasible with reasonable effort? How can we do a bit better than the current state of the art? An aspect that has been neglected in comparison is the scientific end: What do we need to be able to do with published computational work in order to consider it a part of the scientific record?
  Inspectability is an interesting concept in this context, but it remains vague for now. What makes a work inspectable, what makes it verifiable? How can we ensure/check inspectability and/or verifiability at publication time? What are the time scales over which we need to ensure them?
  One key problem with the inspectability concept is that it is not obvious that reading program source code without being able to run it is useful in real life. Once a program reaches a modest level of complexity, looking at the source code is not sufficient to understand what it does, in my experience. A related issue is potential bugs in dependencies, which can only be detected if the precise versions of all dependencies are there - which you can only be sure of in practice if you can actually run the code.
F. Pina Martins:
Great Post!
I'd recommend continuing to use matplotlib, regardless of how the package evolves. Focus instead on what you have shown here - reproducing the environment.
Keep in mind that *for now* matplotlib broke, but in 5 years, other components may break, since software is constantly evolving.
Having a way to "fixate" the environment seems to me like the way to go.
Regarding the plot "comparison", I wouldn't worry too much about image comparisons, as long as the data used to generate it can be regression tested.
- Konrad Hinsen:
  Focusing on reproducing the environment is fine with me in principle, but we don't have any approach to this that has been around for long enough to be considered reliable. So for now, I prefer to do my best at both ends - preserving the environment AND avoiding unstable dependencies.
Pierre de Buyl:
Looking at old plotting packages then? Would gnuplot somehow fit your needs?
- Konrad Hinsen:
  I'd prefer something Python-based for two reasons:
  - avoid the configuration/installation/version-checking issues related to external commands
  - integration with ActivePapers

Reproducibility does not imply reproduction

Konrad Hinsen — 2017-01-24

In discussions about computational reproducibility (or replicability, or repeatability, according to the preference of each author), I often see the argument that reproducing computations may not be worth the investment in terms of human effort and computational resources. I think this argument misses the point of computational reproducibility.

Obviously, there is no point in repeating a computation identically. The results will be the same. So the only reason to re-run a computation is when there are doubts about the exact software or data that were used in the original work, or doubts about the reliability of the hardware.

The point of computational reproducibility is to dispel those doubts. The holy grail of computational reproducibility is not a world in which every computation is run five times, but a world in which a straightforward and cheap analysis of the published material verifies that it is reproducible, so that there is no need to run it again. Actual reproduction attempts would be rare and reserved for situations such as suspicion of hardware failure or suspicion of fraud.

So how can we make reproducibility credible without actually doing reproduction? By using toolchains that have been proven in practice to make computations reproducible. Of course we do need to attempt some reproductions in order to validate these toolchains, but it's sufficient to do this for short computations. And if the toolchain is any good, the human effort should be close to zero as well.

The mere fact that we discuss computational reproducibility at all shows that we do have doubts. Most of us doing computational science have at some point had doubts about our own work. How did I make this figure? Was it made with the latest version of this script, or an earlier one? Did I run that simulation before or after installing the recent important bug fix? And when it comes to examining work by others described in a journal article, our ignorance usually reaches a level that the word "doubt" cannot convey - we don't really know anything. All we have is someone else's incomplete story. If we have doubts about our own work whose full story we know, why should we trust someone else's story blindly?

So the question about "how much" reproducibility we need comes down to a more basic question: What would it take to make you trust a computational result beyond a reasonable doubt? Here is my personal list of acceptable evidence as of today:

I can repeat the computation on my computer and get close enough results.
The results are published as an ActivePaper.
The results come with a Nix or Guix recipe for reproducing them.

The last two cases point to toolchains that I personally consider trustworthy, given the experience I have with them. Both toolchains generate a detailed trace of what happened, with references to all the software and data. And both toolchains make mistakes improbable enough that the remaining risk is acceptable for me. Neither toolchain provides protection from fraud, so if I had a reason to suspect fraud, I'd still attempt a reproduction.
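To give an idea of what an unambiguous reference to software and data can look like, here is a minimal Python sketch based on content hashes. It does not describe the actual record format of ActivePapers, Nix, or Guix. The function names and the sample byte strings are invented for this illustration.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 hash of a piece of code or data, as a hex string."""
    return hashlib.sha256(content).hexdigest()


def make_trace(code: bytes, inputs: dict) -> dict:
    """Record which code and which input data went into a computation."""
    return {
        "code": fingerprint(code),
        "inputs": {name: fingerprint(blob)
                   for name, blob in sorted(inputs.items())},
    }


script = b"print(sum([1, 2, 3]))\n"
trace_2013 = make_trace(script, {"trajectory": b"0.0 0.1 0.2\n"})
trace_today = make_trace(script, {"trajectory": b"0.0 0.1 0.2\n"})
trace_edited = make_trace(script + b"# bug fix\n", {"trajectory": b"0.0 0.1 0.2\n"})

print(trace_2013 == trace_today)   # True: same code, same data
print(trace_2013 == trace_edited)  # False: the script was changed
```

Comparing two such records is cheap and answers questions like "was this figure made before or after the bug fix?" without re-running anything. As with the toolchains discussed above, it offers no protection against deliberate fraud.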

Note that I am not saying that everybody should use one of those toolchains. In their current state, they are neither universal nor sufficiently easy to use. But they do show the toolchain approach to reproducibility is viable.

Sustainable software and reproducible research: dealing with software collapse

Konrad Hinsen — 2017-01-13

Two currently much discussed issues in scientific computing are the sustainability of research software and the reproducibility of computer-aided research. I believe that the communities behind these two ideals should work together on taming their common enemy: software collapse. As a starting point, I propose an analysis of how the risk of collapse affects sustainability and reproducibility.

What I call software collapse is what is more commonly referred to as software rot: the fact that software stops working eventually if it is not actively maintained. The rot/maintenance metaphor is not appropriate in my opinion because it blames the phenomenon on the wrong part. Software does not disintegrate with time. It stops working because the foundations on which it was built start to move. This is more like an earthquake destroying a house than like fungi or bacteria transforming food, which is why I am trying out the term collapse.

The software stacks used in computational science have a multi-layer structure that seems to be nearly universal. At the bottom, there is non-scientific infrastructure, such as operating systems, compilers, and support code for I/O, user interfaces, etc. All of this software is used by scientists in the same way as by other computer users. The predominant view is that this software is external to scientific computing, much like computer hardware. One exception is infrastructure software for high-performance computing, which like the hardware it runs on is often designed specifically for use in science and engineering.

The second layer is scientific infrastructure. Here we find libraries and utilities used for research in many different disciplines, such as LAPACK, NumPy, or Gnuplot. The people developing this software tend to be researchers or research software engineers, i.e. people with a scientific background. The methods (algorithms, data structures) implemented in these packages are typically well-known and stable. This does not exclude ongoing research on improving the implementations, but from the users' point of view, the job done by the software remains the same, often for several decades.

The third layer contains discipline-specific research software. These are tools and libraries that implement models and methods which are developed and used by research communities. Often the developers are simply a subset of the user community, but even if they aren't, they work in very close contact with their users, who provide essential feedback not only on the quality of the software, but also on the directions that future development should take.

The fourth and final layer is project-specific software, which is whatever it takes to do a computation using software building blocks from the lower three levels: scripts, workflows, computational notebooks, small special-purpose libraries and utilities. At the end of a project, such software may become the starting point for software specific to another project, but it is rarely reused without modification, and rarely used by anyone except the members of the project that developed it.

Computational models and methods often move down the stack in the course of time. They are developed initially within a specific project, then the more widely useful ones become part of discipline-specific software, and some of them may find adoption in other fields of research and become a part of the scientific infrastructure layer.

Software in each layer builds on and depends on software in all layers below it, meaning that changes in any lower layer can cause it to collapse.

The reproducible research community focuses on the fourth layer, the project-specific software. Traditionally, the main obstacle to reproducibility was that this layer was not published, and sometimes even deleted by its authors at the end of a project. This layer also contains algorithms executed by a human user, e.g. by entering commands one by one into the computer. This ephemeral software is typically not even recorded. Fixing these problems is mainly a matter of creating an awareness of their importance, and much progress has been made in this respect. But the problem of layer-4 software collapsing due to changes in the lower levels remains largely unsolved. Project-specific software is particularly vulnerable to collapse because it is almost never maintained, since its active days are over.

The sustainable software community is mainly interested in layer 3, the discipline-specific community software. Its development is fragile because the importance of this software is not yet recognized by institutions and funders, unlike the scientific infrastructure software one layer below. Moreover, this software is often developed by scientists with insufficient training in software engineering techniques. There are essentially two tasks that need to be organized and financed: preventing collapse due to changes in layers 1 and 2, and implementing new models and methods as the scientific state of the art advances. These two tasks go on in parallel and are often executed by the same people, but in principle they are separate and one could concentrate on just one or the other.

The common problem of both communities is collapse, and the common enemy is changes in the foundations that scientists and developers build on. The options they have for dealing with this are about the same as for house owners facing the risk of earthquakes:

1. Accept that your house or software is short-lived. In case of collapse, start from scratch.
2. Whenever shaking foundations cause damage, do repair work before more serious collapse happens.
3. Make your house or software robust against perturbations from below.
4. Choose stable foundations.

House owners generally opt for strategies 3 or 4, or a mixture of them. Strategies 1 and 2 are unattractive because house owners might well be injured or killed during a collapse.

Most software developers, in science or elsewhere, prefer strategies 1 or 2. In many business settings, this makes sense because software is short-lived or rapidly evolving anyway, due to changing requirements and newly appearing possibilities. In science, these motivations exist as well, but must be weighed against the need for preservation of the scientific knowledge embodied by scientific software. You may not care about losing the Web browser you used long ago, given that there's a better one now. But if ten years from now, doubts come up about the analysis of LIGO data, you want to be able to go back to the analysis code and check what exactly was done at the time.

A difference between the sustainable software and the reproducible research communities is that the former privileges strategy 2, continuous repair, whereas the latter dreams of strategy 4, stable foundations. Strategy 2 is in fact easier to adopt, given that most of the software industry is applying it. Strategy 4 is seen as unrealistic by many, because stable foundations are hard to find, and the few we have impose unpleasant restrictions. But if developers in layer 3 adopt the continuous-repair strategy, this leaves only one option for the code in layer 4 - accept that it is short-lived. This is more or less what we see happening at the moment. For a recent discussion, see this blog post by C. Titus Brown and the discussion following it.

In one of the comments there, Daniel S. Katz proposes a cost-benefit analysis, which to the best of my knowledge has not been attempted until now. However, I think it should be done globally, rather than for an individual research project. A move towards stable foundations (strategy 4) is likely to require a large up-front investment, but lower development costs later on, for scientific code in all layers. It might well be worthwhile for no other reason than reducing global development costs, not even counting the hard-to-evaluate benefit of long-term reproducibility.

It's also worth looking at why software foundations are shaking all the time. Why can't we just keep on using the same software forever, if we are happy with the way it works?

One reason is the bottom layer of our software stack, which we share with non-scientific software. There are market incentives for shaking up the foundations of commercial software, which then cause collateral damage elsewhere, such as in science. For example, some markets rely on planned obsolescence and never-ending change to create continuous customer demand. Smartphones are a good example. Also, a company controlling a software platform might benefit from changing it a bit all the time in order to retain control and customer attention. Finally, security problems in systems software are discovered regularly, and their fixes can send ripples up the software stack. All this makes it difficult to find stable foundations to build on. However, it is clearly not impossible. After all, banks have been keeping their COBOL software alive for decades. At worst, we could build our own bottom layer instead of sharing it with other application domains. One advantage of scientific software in that respect is that it has few if any security concerns to deal with.

Unfortunately, we also have home-made quakes in our software stack, due to changes in layers 2 and 3. In the fast-paced development of layer 3, collateral damage sometimes leads to collapse in layer 4. I suspect much of this could be avoided with some more attention on stability, plus extensive testing. What's worse is a widespread attitude that considers stability impossible anyway and concludes that one more breaking change is not such a big problem after all. This is particularly harmful for the scientific infrastructure of layer 2. I'll just mention my two-year-old rant about NumPy as an example. In view of the systematic non-maintenance of layer-4 software, this is an inappropriate attitude in the world of scientific computing in my opinion.

As a final remark, strategy 3 does not seem to exist in the software world. There are no proven techniques for making a program robust against changes in its foundations. Software interfaces are much too rigid for that. I vaguely remember Alan Kay speaking about more lenient interface mechanisms - if anyone has a reference to share, please leave a comment! A recent presentation by Rich Hickey, the creator of the Clojure language, also contains useful ideas for dealing with change in interfaces (executive summary: add new features, but don't remove or change existing ones), but it's more of a move towards strategy 4 than strategy 3. More generally, I would like to see more research and development along these lines. Robustness is a major design principle in other engineering domains, and software would benefit from a larger dose as well.
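The "add new features, but don't remove or change existing ones" rule can be made concrete with a minimal Python sketch. The function names and the periodic-box use case are hypothetical, chosen only for illustration; they do not come from Hickey's presentation or from any real library.

```python
import math

# Version 1 of a hypothetical library function. Existing callers rely
# on this exact signature and behavior.
def distance(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Version 2 grows by accretion: the new capability gets a new name,
# and the old function is left exactly as it was.
def distance_periodic(p, q, box):
    """Distance under periodic boundary conditions with box edge lengths `box`."""
    total = 0.0
    for a, b, length in zip(p, q, box):
        d = abs(a - b) % length
        d = min(d, length - d)   # take the shorter way around the box
        total += d * d
    return math.sqrt(total)
```

Code written against version 1 keeps working unchanged with version 2, which is the point of the rule: layer-4 scripts that nobody maintains are not affected by the new feature.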

Note added 2019-09-04: I have written a more detailed article about Dealing with Software Collapse for the May 2019 issue of Computing in Science and Engineering magazine. A preprint is available as well.

From reproducible to verifiable computer-aided research

Konrad Hinsen — 2016-05-11

The importance of reproducibility in computer-aided research (and elsewhere) is by now widely recognized in the scientific community. Of course, a lot of work remains to be done before reproducibility can be considered the default. Doing computational research reproducibly must become easier, which requires in particular better support in computational tools. Incentives for working and publishing reproducibly must also be improved. But I believe that the Reproducible Research movement has made enough progress that it's worth considering the next step towards doing trustworthy research with the help of computers: verifiable research.

Verifiable research is research that you can verify for yourself. Not in the sense of verifying the scientific conclusions, which often can only be done many years later. The more modest goal is to verify that a publication contains no mistakes of the kind that every human being tends to make: mistakes in manual computations, mistakes in transcribing observations from a lab notebook, etc.

Ideally, all research should be verifiable. A paper is supposed to provide sufficient details about the work that was done to enable competent peers to verify the reasoning and repeat any experiments. Peer review is supposed to certify that a paper is verifiable, and reviewers are even encouraged to do the verification if that is possible with reasonable effort.

In the pre-computing era, much published research was indeed verifiable. Given the high cost of verifying experimental work, it is safe to assume that actual verification was the exception. But theoretical work of any importance was commonly verified by many readers who repeated the (manual) computations.

With the increasing use of computers, papers slowly turned into mere summaries of research work. Providing all the details was simply impossible - software was too complex to be fully described in a journal article. It also became common to use software written by other people, and even commercial software whose detailed workings are secret. This development was nicely summarized by Buckheit and Donoho in 1995 in what became a famous quote in the Reproducible Research movement:

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

Today this statement applies not only to computational science, but to all of computer-aided research, as many experimental and theoretical studies involve computers and software as well. The publication of all software and all input datasets in a form that other scientists can actually process on their own computers has become the main objective for making computer-aided research reproducible.

Unfortunately, having all the software and input data that go with a journal article is still not sufficient to make the work verifiable. With the exception of particularly simple computations, it is practically impossible to figure out what the software really computes, and in particular to verify that it computes what the paper claims it computes. Assuming, of course, that the paper actually does provide a detailed description of its claims, which is often not the case. Much computer-aided research is thus "not even wrong".

It is the complexity of much modern scientific software that makes verification practically impossible, and for that reason software is rarely subjected to peer review. After all, who would accept the Herculean task to verify the correct functioning of a piece of software? Even "software papers", i.e. papers that merely exist to provide a citable reference for some software, are reviewed without any serious validation of the software itself. At best, reviewers check that best practices of software engineering have been respected, for example by writing a test suite with good code coverage. But no amount of testing can verify that the software computes what it is supposed to compute. If some numerical constant in the source code is off by 10% due to a typo, there's a good chance that nobody will ever notice. Such mistakes have happened (see this article for a few stories), and there are good reasons to believe they are actually frequent (see this article for arguments). The most convincing argument should be our daily experience with computers that crash or ask us to install "critical updates". If systems software is so clearly full of mistakes, is it reasonable to assume that scientific software has none at all?

The difficulty of verifying computational results in combination with the obvious importance of computational techniques in science has led to a change of attitude that in my opinion is detrimental to science in the long run. Most importantly, the burden of proof has been shifted from the proponents of a new hypothesis to its opponents. If you cannot show that a computational study is wrong, then it is silently assumed correct. If you want to publish results that are contradictory to work published earlier, it's your obligation to explain why, even though you cannot possibly verify the earlier work. This is why protein structures in contradiction with the later retracted ones from Geoffrey Chang's group were rejected for publication for a long time. Contradictory results should be handled by a critical inspection of all of them, but this is possible only for verifiable research.

Another detrimental change of attitude is that "correct" has been replaced by "community-accepted" as a quality criterion in many fields. Recently, I have started to ask a simple question after seminars on computational work: "Why should I believe your results? What did you do to verify them?" Most often, the answer is "We used software and protocols that are widely applied in our community". Unfortunately, popularity can be taken as an indicator of correctness only if it is safe to assume that many users have actually verified those tools and methods. Which again assumes verifiability as a minimum criterion.

So... what can we do?

Verifiable computer-aided research is a tiny subset of today's published research. It's even a small subset of today's reproducible research. Can we do something about this? I believe we can, and I will summarize some possible approaches.

The most obvious approach to make a computation verifiable is to document all code and data well enough that a competent reader is convinced of its correctness. Literate programming (for algorithms) and computational notebooks (for computations) are good techniques for this. As with any scientific proofreading, verification by inspection requires much care and a critical attitude. People are easily fooled into believing something because it is well presented, for example. But the most important obstacle to this approach is the modularity of much of today's scientific software. If you reuse existing libraries - and there are of course good reasons to do so - then you probably won't rewrite them in literate programming style for explaining their algorithms to your critical reader. A computation is only as verifiable as its least verifiable ingredient.

Another way to make computer-aided research verifiable is to make the computations reimplementable. This means that the published journal article, or some supplementary material to that article, contains a precise enough human-readable description of the algorithms that a scientist competent in the field can write a new implementation from scratch, and verify that it produces the same (or close enough) results. This is not a fool-proof approach, of course, and again modularity is a major risk factor. If the computation uses some complex library and the reimplementor chooses to use the same library, then the library code is not verified by the reimplementation. The more the reimplementation differs from the original authors' code, the better it is as a verification aid. This is by the way also a strong argument for diversity in scientific software. In terms of development efficiency, a single community-supported software package per field is great, but for verifiability, it is better to have multiple packages that can do the same job.

Both approaches I have outlined fail for complex software. A million-line simulation code developed over many years by an entire research group can neither be studied nor reimplemented by a single person wishing to verify it. Even a small team working in close collaboration wouldn't be up to the task. The solution I propose for this situation is to introduce an intermediate layer between the software and the human-readable documents (papers, software documentation) that describe what it computes. A layer that contains all the science but none of the technicalities of the software, such as parallelism, platform-dependence, or resource management. The idea is to "factor out" the accidental complexity and retain only the essential complexity, the one due to the complexity of the models and methods that the software implements. This idea is very similar to the use of formal specifications in software development. The specification would be verified by human scientists, whereas the conformity of the software to the specification would be checked by automated methods, of which randomized unit testing is probably the most immediately useful one.
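As a rough illustration of checking software against a specification by randomized testing, here is a minimal Python sketch. The quantity being computed and all function names are assumptions made for the example. The "specification" is a direct transcription of a definition that a scientist can check by reading it, and the "implementation" is a faster but less obvious formula for the same quantity.

```python
import math
import random

def spec_pairwise_sq(xs):
    """Specification: sum of (x_i - x_j)^2 over all pairs i < j, written for readability."""
    return sum((xs[i] - xs[j]) ** 2
               for i in range(len(xs)) for j in range(i + 1, len(xs)))

def fast_pairwise_sq(xs):
    """Implementation under test: the same quantity in linear time,
    using n * sum(x^2) - (sum x)^2."""
    n = len(xs)
    return n * sum(x * x for x in xs) - sum(xs) ** 2

def check_conformity(trials=1000, seed=0):
    """Compare implementation and specification on random inputs.

    Returns the first counterexample found, or None if all trials agree
    within a floating-point tolerance."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(rng.randint(0, 20))]
        if not math.isclose(fast_pairwise_sq(xs), spec_pairwise_sq(xs),
                            rel_tol=1e-9, abs_tol=1e-9):
            return xs
    return None
```

Agreement on random inputs is evidence of conformity, not a proof. The human effort goes into reading the short specification, while the machine compares the two versions on many more inputs than anyone would try by hand.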

An intermediate layer that factors out accidental complexity is also of interest for other uses in scientific research. That new layer would be the closest we can get to a digital representation of a model or a method. Rather than use it just in the specification of a single piece of software, we can use it for all kinds of analyses and comparisons, and cite it as the main scientific reference in work based on it, in addition to the citation to the software as the technical tool for doing the computations. For this reason, I call this layer "digital scientific knowledge" and the languages for expressing it "digital scientific notation". None of this exists today, but many developments in computer science can be used as a basis for its development. For the details, see this article.

Composition is the root of all evil

Konrad Hinsen — 2016-03-04

Think of all the things you hate about using computers in doing research. Software installation. Getting your colleagues' scripts to work on your machine. System updates that break your computational code. The multitude of file formats and the eternal need for conversion. That great library that's unfortunately written in the wrong language for you. Dependency and provenance tracking. Irreproducible computations. They all have something in common: they are consequences of the difficulty of composing digital information. In the following, I will explain the root causes of these problems. That won't make them go away, but understanding the issues will perhaps help you to deal with them more efficiently, and to avoid them as much as possible in the future.

Composing information is something we all do every day, mostly without thinking of it. A shopping list is the composition of names of things you need to buy. An e-mail message is the composition of the recipients' addresses, a subject line, and the body of the message. An address book is a composition of addresses, which in turn are compositions of various pieces of information related to some person.

Science has its own information items and associated compositions. Measurements are composed into tables. Mathematical equations are composed into more complex equations. Datasets are composed to make a database. Hypotheses are composed to make a model.

Writing computer programs means composing expressions and statements into procedures or functions, composing procedures to make modules, and composing modules to make programs. Reading data from a file means composing your algorithms with the data they work on into a complete computation. Configuring a new computer and installing software are about composing an operating system, various libraries, and application software into a functioning whole.

When you look at these examples more closely, you might notice that some of these acts of composition are so trivial that we don't even think about them, whereas others are a real pain. In that second category, we find most of the composition work related to computers. So what is the difference?

Human and computational information processing

Humans process information in terms of concepts. We all have accumulated a vast amount of conceptual knowledge over our lifetime, starting with the most basic concepts that we learned in infancy. This knowledge includes the definitions of all the concepts, but also the relations between them. Our knowledge of concepts helps us to "make sense" of information, which includes the detection of probable mistakes and sometimes even their correction. Humans are very tolerant to mistakes and variations in how some piece of information is expressed. We don't care if the items in a shopping list are arranged vertically or horizontally, for example.

When composing information, we read the individual items, translate them into concepts, and then write out the composition. I use the vocabulary of processing written language here, but the same holds for oral or visual communication. Variations in notation may be an inconvenience, but not a real problem. As long as the information refers to familiar concepts, we can deal with it.

Computers process information by applying precise mechanical rules. They don't care about concepts, nor about context. If you ask a computer to do something stupid, it will happily do so. This may look like a criticism of how computers work, but it's also exactly why they are so useful in research: they have different strengths and weaknesses compared to humans, and are therefore complementary "partners" in solving problems.

Formal languages

At the hardware level of a digital computer, a computation is a multi-step process that transforms an input bit sequence into an output bit sequence under the control of a program that is stored as a bit sequence as well. Information processing by computers thus requires all data to be expressed as bit sequences. Dealing with bit sequences is, however, very inconvenient for humans. We therefore use data representations that are more suitable for human brains, but still exactly convertible from and to the bit sequences that are stored in a computer's memory. These representations are called formal languages. The definition of a formal language specifies precisely how some piece of information is encoded as sequences of bits. Many formal languages are defined in terms of sequences of text characters instead of sequences of bits, for another level of human convenience. Since the mapping from text characters to bits is straightforward, this makes little difference in practice. The term "formal language" is commonly used in computer science, but in computational science we usually speak of "data formats", "file formats", and "programming languages", all of which are specific kinds of formal languages. The use of formal languages, rather than the informal languages of human communication, is the defining characteristic of digital information.

The definition of a formal language consists of two parts, syntax and semantics. Syntax defines which bit patterns or text strings are valid data items in the language. Syntax rules can be verified by a suitable program called a parser. Semantics define the meaning of syntactically correct data items. With one important exception, semantics are mere conventions for the interpretation of digital data. Meaning refers to conceptual knowledge that a computer neither has nor needs: all it does is process bit sequences. The exception concerns formal languages for expressing algorithms, i.e. rules for the transformation of data. The semantics of an algorithmic language defines how each operation transforms input data into output data. Writing down such transformation rules obviously requires a notation for the data being worked on. For that reason, a formal language that can express algorithms also defines the syntax and semantics of the input and output data for these algorithms. Your favorite programming language, whichever it is, provides a good illustration.
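The distinction between syntax and semantics can be illustrated with a short Python sketch. The choice of JSON and of the particular byte patterns is mine, made only for the example. The first part uses a parser to check syntax. The second part reads the same four bytes under two different conventions, which shows that the meaning lies in the convention and not in the bits.

```python
import json
import struct

def is_valid_json(text):
    """Syntax check: does the JSON parser accept this string?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Semantics as convention: the same four bytes read under two
# different interpretations.
raw = struct.pack("<f", 1.0)             # IEEE 754 single precision, little-endian
as_float = struct.unpack("<f", raw)[0]   # read as a floating-point number
as_uint = struct.unpack("<I", raw)[0]    # read as an unsigned 32-bit integer
```

Here `is_valid_json('{"mass": 1.5}')` is true and `is_valid_json('{"mass": 1.5')` is false, but the parser has no opinion on whether 1.5 is a mass in kilograms or in atomic mass units. Likewise, `as_float` is 1.0 while `as_uint` is 1065353216, although both come from the same bit sequence.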

There is a huge number of formal languages today, which can be organized into a hierarchy of abstraction layers, such that languages at a higher level incorporate languages from lower levels. As a simple example, a programming language such as Fortran incorporates formal languages defining individual data elements - integers, floating-point numbers, etc. At the lowest level of this hierarchy, close to the bit level at which computing hardware operates, we have formal languages such as Unicode for text characters or the floating-point number formats of IEEE standard 754. One level up we find the memory layout of Fortran arrays, the layout of UTF-8 encoded text files, and many other basic data structures and file formats. Structured file formats such as XML or HDF5 are defined on the next higher level, as they incorporate basic data structures such as arrays or text strings. Programming languages such as Python or C reside on that level as well.

Different formal languages that encode the same information at the semantic level can be converted into each other. The two best-known translations of this kind in the daily life of a computational scientist are file-format conversion and the compilation of software source code into processor instructions. However, if you take into account that the in-memory data layout of any program is a formal language as well, all I/O operations can be considered conversions between two formal languages.

Composition of digital information

Digital information is, by definition, information expressed in a formal language. Composition of digital information produces a new, more complex, digital information item, which is of course expressed in a formal language as well. And since the ingredients remain accessible as parts of the whole, everything must be expressed in one and the same formal language. And that's where all our trouble comes from.

If we start from ingredients expressed in different languages, we have basically two options: translate everything to a common language, or define a new formal superlanguage that incorporates all the languages used for expressing the various ingredients. We can of course choose a mixture of these two extreme approaches. But both of them imply a lot of overhead and add considerable complexity to the composed assembly. Translation requires either tedious and error-prone manual labor, or writing a program to do the job. Defining a superlanguage requires implementing software tools for processing this new superlanguage.

As an illustration, consider a frequent situation in computational science: a data processing program that reads a specific file format, and a dataset stored in a different format. The translation option means writing a file format converter. The superlanguage option means extending the data processing program to read a second file format. In both cases, the use of multiple formal languages adds complexity to the composition that is unrelated to the real problem to be solved, which is the data analysis. In software engineering, this is known as "accidental complexity", as opposed to the "essential complexity" inherent in the problem.
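The two options can be put side by side in a small Python sketch. The formats (JSON as the program's native format, CSV as the format of the dataset), the field name `value`, and all function names are assumptions made for the example.

```python
import csv
import io
import json

def analyze(records):
    """The 'real problem': the mean of the 'value' field over all records."""
    values = [float(r["value"]) for r in records]
    return sum(values) / len(values)

def read_json(text):
    """The one format the hypothetical analysis program understands natively."""
    return json.loads(text)

# Translation option: a standalone converter from CSV to JSON.
def csv_to_json(text):
    return json.dumps(list(csv.DictReader(io.StringIO(text))))

# Superlanguage option: extend the reader to accept both formats.
def read_any(text, fmt):
    if fmt == "json":
        return read_json(text)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")
```

With a dataset such as `"name,value\na,1.0\nb,3.0\n"`, both `analyze(read_json(csv_to_json(data)))` and `analyze(read_any(data, "csv"))` produce the same mean. In either case, everything except `analyze` is accidental complexity.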

As a second example, consider writing a program that is supposed to call a procedure written in language A and another procedure written in language B. The translation option means writing a compiler from A to B or vice-versa. The superlanguage option means writing an interpreter or compiler that accepts both languages A and B. A mixed approach could use two compilers, one for A and one for B, that share a common target language. The latter solution seems easy at first sight, because compilers from A and B to processor instructions probably already exist. However, the target language of a compiler is not "processor instruction set" but "processor instruction set plus specific representations of data structures and conventions for memory management". It is unlikely that two unrelated compilers for A and B are compatible at that level. Practice has shown that combining code written in different programming languages is always a source of trouble, except when using tools that were explicitly designed for implementing the superlanguage from the start.

In the last paragraph, I have adopted a somewhat unusual point of view which I will continue to use in the following. We usually think of a language as something named and documented, such as C or Unicode. The point of view I adopt here is that the language in which a piece of digital information is expressed consists of all the rules and constraints that must be satisfied, including the rules and constraints due to composition. To illustrate the difference, consider the Python language and the Python language with the NumPy extension. According to the standard point of view, Python is the language and NumPy is a library written in Python. In my point of view, Python+NumPy is a language different from plain Python. To see that libraries modify their underlying languages, consider the Python statement import numpy. It fails in plain Python, so it is not a valid statement in the Python language, whereas it is valid in the Python+NumPy language. Moreover, in the Python+NumPy language you are not allowed to write a module called numpy. The addition of NumPy to plain Python makes some formerly invalid programs valid and vice versa, which justifies speaking of different, though certainly similar, languages.
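The effect of a library on the set of valid programs can be observed directly. The following Python sketch uses a throwaway module with a made-up name in place of NumPy, so that it behaves the same whether or not NumPy is installed. The module name and the helper functions are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE = "toylib_for_illustration"   # hypothetical name, standing in for numpy

def can_import(name):
    """Return True if 'import name' succeeds in the current environment."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

def demo():
    before = can_import(MODULE)          # plain environment: the import is invalid
    with tempfile.TemporaryDirectory() as d:
        Path(d, MODULE + ".py").write_text("answer = 42\n")
        sys.path.insert(0, d)            # "install" the library
        importlib.invalidate_caches()
        try:
            after = can_import(MODULE)   # extended environment: the same import is valid
        finally:
            sys.path.remove(d)           # undo the installation
            sys.modules.pop(MODULE, None)
    return before, after
```

The same import is rejected before the library is added and accepted afterwards, although the Python interpreter itself has not changed. That is the sense in which the two environments define different languages.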

Lots of languages, lots of problems

The above discussion suggests that to keep our lives simple, we should use as few different formal languages as possible. Unfortunately, an inventory of what we have to deal with shows that we are very far from that optimum.

Data formats are the easiest part. Even the number of "standard" formats is enormous, and many of them aren't that well standardized, leading to different dialects. Worse, many scientific programs make up their own ad-hoc data formats that are scarcely documented. That's why file conversion takes up so much of our time. Moreover, we usually have different on-disk and in-memory data formats for the same data, which is why we need to write I/O routines for our software.

But the complexity of formal languages used to define programs completely dwarfs the complexity of data formats. Let's start at the bottom level: the processor's instruction set. If you write an operating system (OS), that's the level you work at. Otherwise, your program is a plug-in to be composed with an operating system, and the operating system defines the formal language in which you need to provide your program. The "OS language" includes the processor's instruction set, but also adds constraints (memory use, relocatability, ...) and access to OS functions. The OS language can be as simple as the COM file format from CP/M and DOS days but also as complex as Linux's ELF format.

The ELF format introduces the next level of composition: object files and dynamic libraries, in addition to executable files. In a modern OS, a program is composed from several ingredients immediately before execution. The motivation for introducing this last-minute composition was the possibility to share frequently used program building blocks among the hundreds of processes running in parallel, thus reducing their memory footprint. But this comes at the price of considerable accidental complexity. The OS language that your program must be written in now includes not only the processor instruction set and the ELF format specification, but also conventions about where certain shared libraries are stored in the file system. That's why it is no longer possible to prepare a generic program for the Linux platform. Different Linux distributions have different conventions for arranging the shared libraries in the file system, and moreover these conventions change over time. They have different OS languages.

Upon closer inspection, the situation is actually even worse. The OS language for a given piece of software includes all the software packages that have been installed on the same computer before. Obviously, only one software package can occupy a given filename. Once you have installed a package that uses the file /usr/lib/libm.so, no other package can occupy the same slot. That makes it impossible to wrap up "my program and all the files it requires" for installation on some other machine. If package A contains /usr/lib/libm.so and package B another /usr/lib/libm.so, even if it is only a slightly older version of the same library, the two packages could not coexist. The only solution is to distribute programs and libraries as building blocks to be added to a growing assembly, whose composition - now called "software installation" - is left to the system administrator. Each block comes with a list of "required dependencies", whose presence the system administrator must ensure. Moreover, each block occupies certain slots that must be available in the system. In the terminology of formal languages, each new block must conform to a language that its author cannot know in advance, and cannot even fully describe. I have described this error-prone approach in an earlier blog post as the Tetris model of software installation, because of its obvious similarities with the well-known video game. It's the most widely used model in scientific computing today.
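The slot conflict can be modeled in a few lines of Python. This is a toy model under my own assumptions, with hypothetical class and package names. It does not describe how any real package manager works; it only captures the rule that each filename can have a single owner.

```python
class FileConflict(Exception):
    """Raised when a package claims a path that another package already owns."""

class ToySystem:
    """A toy file system in which each path is a slot owned by one package."""

    def __init__(self):
        self.owner = {}   # path -> name of the package that installed it

    def install(self, package, paths):
        # Refuse the whole package if any of its slots is already taken.
        taken = [p for p in paths if p in self.owner]
        if taken:
            raise FileConflict(
                f"{package} needs {taken[0]}, already owned by {self.owner[taken[0]]}")
        for p in paths:
            self.owner[p] = package
```

After `install("A", ["/usr/lib/libm.so"])`, a call to `install("B", ["/usr/lib/libm.so", "/usr/bin/b"])` raises `FileConflict`. Whether B can be installed depends on what was installed before it, which the author of B cannot know.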

The obvious problems caused by this approach have motivated the development of various tools for the management of software installations. Some are specific to some OS platform (the package managers of Debian, RedHat, BSD, etc.). Others are specific to a programming language, e.g. Python's distutils system and its derivatives. The multitude of software installation managers has created a secondary composition problem: to install a Python package on a Debian system, you must negotiate a compromise between Python's and Debian's views on how software installation should be managed.

Another approach is to give up on sharing common resources, and provide some way to package programs with all the files they need into a single unit, even if this leads to duplication of data on disk and in memory. This is the idea behind MacOS X application bundles (which go back to NeXTSTEP) and also Docker containers. Tools such as Python's virtualenv proceed in a similar way, by isolating a specific composition of building blocks from other potentially conflicting compositions of building blocks on the same computer.

An ingenious construction that combines the best of both worlds is the approach taken by the Nix package manager and its offshoot Guix. Instead of having building blocks refer to each other through filenames, they use a hash code computed from the actual contents of the files. This allows the composition of arbitrary building blocks, including pairs that would claim the same filenames in a standard Linux system, but also prevents multiple identical copies of any building block. This idea is known as content-addressable storage, and is also used in the popular version control system git.
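
The idea of naming a building block by a hash of its contents can be illustrated with a toy in-memory store. This is only a sketch of the principle as described above, using SHA-256 from Python's standard library. It is not how Nix or Guix are implemented, and the class name and the byte strings are made up for the example.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key of a block is the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]

    def __len__(self):
        return len(self._blocks)


store = ContentStore()
# Two versions of "the same" library get different keys and can coexist.
old_libm = store.put(b"libm, slightly older version")
new_libm = store.put(b"libm, current version")
# A second copy of identical content maps to the existing key.
duplicate = store.put(b"libm, current version")

print(old_libm != new_libm, duplicate == new_libm, len(store))
```

Both properties mentioned in the text show up here: blocks that would compete for one filename live side by side under different keys, and identical blocks are stored only once.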

Up to here, I have described the composition of specific programs with an operating system. But the program that is prepared as a plug-in to an OS is itself already a composition. How it is composed and from which constituents depends on the programming language(s) being used and on the tools that implement them. In Python, for example, a program consists in general of packages which consist of modules which consist of name-value pairs. A C program consists of source code files and header files, which each contain value and function definitions and interact via macro definitions. Like in the case of the "OS language", the precise formal language in which each piece is written is not just Python or C. It also includes constraints and extensions coming from other building blocks -- libraries -- that the program refers to, as I have illustrated above for the example of Python plus NumPy.

Comparing these two situations, we can identify the common culprit: the use of a global namespace for composing building blocks. In the "OS language" of a typical Linux system, the global namespace is the filesystem. In Python, it's the namespace of top-level module names. In C, it's the namespace of non-static top-level definitions. Composition requires one building block to refer to another building block through a name in that namespace. And that in turn requires each building block to occupy a specific name in that namespace, so that others can refer to it.

One way to alleviate this problem is encouraging the use of very specific names. That's the approach taken by Java, whose global namespace for packages is supposed to contain "reversed domain names" such as org.apache.commons.lang3.math. While such a rule, if respected, indeed reduces the risk of name collisions between unrelated packages to almost zero, the most frequent source of name collisions remains: different versions of a package have the same name and can therefore not be used together in a composition. When composing building blocks into a program, one can argue that mixing different versions is bad practice anyway. But in the Tetris model of a single global software collection per computer, not being able to have several versions of a building block is often a serious restriction.

A final kind of formal language worth mentioning in this context is languages for defining compositions. This category includes Makefiles, Dockerfiles, autoconf configuration files, and of course the package specification files of the various package managers. Their multitude shows the importance of the composition problem, but it also contributes to it. It is not rare to see a specification file for one package manager refer to another package manager. Conversion from one such language to another is nearly impossible, because the precise language for defining a composition depends not only on the package manager, but also on the other existing packages. It's exactly the same situation as with the "OS language" and programming languages extended by libraries.

Is there a way out?

I believe that there is, and I have some ideas about this, but I will leave them for another time as this post is already quite long. I hope that the above analysis contributes to a better understanding of the problems that computational scientists are facing in their daily work, which is the prerequisite to improving the situation.

As a first step, I encourage everyone to prefer solutions to workarounds when faced with composition-related issues. Solutions identify a cause and eliminate it, whereas workarounds merely alleviate the impact of the problem, often re-creating the same problem at another level later on. In the approaches I have discussed above, an example of a solution is content-addressable storage, as used in Nix. In contrast, the traditional Linux package managers are workarounds, because they re-create a composition issue at the package level. Linux distribution authors have done a lot of hard and useful work with these package managers, which I don't want to play down in any way. But the fruit of that work can be carried over to better foundations. The Tetris model of software installation is not sustainable in my opinion. We have to move on.

Comments retrieved from Disqus

Shalabh:
I strongly agree with much of what you have written above. The composition problem is widespread and eating most of our resources. Unfortunately it is mostly *invisible* in the sense that there is no deep understanding or study of 'composition models' like there is of 'programming languages'. Bit formats and PLs are also something we keep getting better at, so the tendency is "let's do more of those". We must first put on the 'composition oriented glasses' to look at our systems to even start seeing the problem.
The solution vs workaround distinction indeed seems useful. Perhaps it also depends on perspective? E.g. we could consider Nix to be a workaround to the root problem that the file system substrate does not provide content addressable storage.
This problem affects not just scientists, but almost all computer users, and even developers, ironically. Did you write more about the ideas you have around this? I skimmed this blog but didn't find anything specific on this theme.
- Konrad Hinsen:
  I didn't write much else on this topic, but I have been working a lot on and with various reproducibility tools. One of them is [Guix](https://guix.gnu.org/), which can be summarized as an alternative implementation of the Nix idea. You can certainly consider it a workaround for the lack of content-addressable storage at the OS level, but then content-addressable storage is only one ingredient to reproducible software systems. Most of the effort of the Guix community goes into the package definitions. The build procedures of many software packages must be heavily modified to replace the standard paths by configurable ones, and then there is the work shared by all packagers of figuring out which versions of packages actually work together.
  One idea I have been pursuing is to start from the integration end of software building. Given Guix as a system integration tool, how should software be written and distributed to fit into a Guix system without prior torture? Some measures would be simple to define and apply, others more complicated, but all would hit the obstacle of rendering the software more difficult to manage without Guix, so all of that is unlikely to happen.
Stephen Kell:
Hello again Konrad! Commenting here this time....
I really enjoyed this article, and it parallels my own thinking to an uncanny degree.
I particularly appreciate the wide treatment of different kinds of "language", extending to file formats and memory layouts, and seeing the resulting problems as language composition problems. And I completely agree that focus on programming languages, instruction sets and the like, distracts us from the fact that the conventions we layer on top of PLs, or indeed underneath them at the implementation level, are as important to composition as the languages themselves -- often more so, in fact. My own thinking with liballocs has definitely been working towards better support for automating (or even semi-automating) solutions to these problems, roughly limiting my scope to "within one process" for now.
I believe that the proliferation of "[generalised] languages" is something we need to address primarily by mitigation, and only secondarily by minimisation. Computers can help us mitigate a lot more than at present. Minimisation is to be preferred as far as it is feasible, but diversity of requirements and decentralised working will guarantee the existence, survival and (re-)creation of more languages than are in theory necessary.
Both translation and supersetting can become tasks that the machine assists us in, even though at present they are mostly manual. We lack the kind of metaprogramming infrastructure for recovering commonality among distinct "languages". In fact, for my PhD (http://www.cl.cam.ac.uk/~sr... I more directly attempted a version of this problem, without huge success; in any case, a better infrastructure for doing these things is exactly my medium-term vision for liballocs. Although I have started with type information for in-memory objects, binary file formats are an obvious next step, and there's nothing to prevent an extended descriptive framework from covering other languages, filesystem objects, etc.. It feels particularly important to capture the layering of encodings (somehow!).
I agree that the problem extends to namespace management in general and software installation/deployment in particular... I have some ideas in that space too, again having much in common with Nix/Guix... though in general I am not fitting this into liballocs yet... I would rather come at these problems "from the other end" in the hope of converging later.
- Konrad Hinsen:
  Thanks for your extensive comments! It's good to see that at least one reader has understood this post. Most feedback I got at the time (privately) was along the lines of "I have no idea what you are talking about".
  Mitigation is certainly the right approach for progressing without having to start from scratch. Mediation (as in your Cake language) sounds good as well. As a start, I propose that every PL design or implementation team should include an experienced diplomat. A lot would be gained if people would stop designing closed universes.
  Ubiquitous metadata for introspection, which is my hopefully not too wrong summary of your liballocs project, looks like another approach to mediation. I have some experience with this on the file format level (via HDF5, which stores data structure definitions along with datasets, see https://support.hdfgroup.or... for details), where it works very well, in part because the HDF5 library includes data conversion utilities for the most frequent situations.
  Now we just have to convince the rest of the world that this is an important problem to solve...

On HDF5 and the future of data management

Konrad Hinsen — 2016-01-07

Yesterday a blog post by Cyrille Rossant entitled "Moving away from HDF5" caught my eye. My own tendency at the moment is to use HDF5 more and more, so I was interested in why someone else would want to do the opposite. Here is my conclusion after reading his post, plus some ideas about where scientific data management is or should be heading in my opinion.

Any evaluation of some technology happens in the context of a specific application's requirements, and this is where Cyrille's and my own experience differ in an important point: I have never run into performance problems with HDF5, probably because my jobs do much more computation (relative to I/O) than his. This also makes parallel access less of a problem for me, although I agree that HDF5's parallel support could be better.

Otherwise, I agree with much of his criticism of HDF5, but I still conclude that its problems are the smallest evil compared to any other technology I know of. The big problem with HDF5 from my point of view is what Cyrille calls "opacity": the complexity of the file format which in practice means that the only way to use HDF5 files is via the HDF5 library. Which is, indeed, far from perfect. However, given my requirements, there is pretty much no competition to HDF5. The only alternative would be to roll my own system, which isn't a pleasant idea either.

The peculiar combination of requirements that to the best of my knowledge only HDF5 fulfills is:

the hierarchical management of multiple datasets with associated metadata as a single unit for archiving and publishing
efficient access to the individual datasets

The first requirement rules out the approach of using a directory with a lot of individual files. The second requirement rules out container formats such as zip - having to unpack a dataset for processing is too much overhead.

My first requirement is exactly what Cyrille describes as the "HDF5 philosophy", so it's no wonder that HDF5 fits my needs rather well. His question "One can wonder why not just use a hierarchy of files within a directory." thus deserves a few comments. I have done that for a while, and many of my colleagues still do it. My experience is that, after copying the data between different machines a few times, I always ended up losing files or having mismatched versions. Which, of course, raises the question of why I copy the data around.

Cyrille says that "today's datasets are so big that they don't tend to move a lot." Well, first of all, mine are not that big. My HDF5 files are a few MB to a few GB in size. Individual datasets range from a few hundred bytes to a few GB, and the number of datasets in a HDF5 file ranges from ten to a few thousand. And I copy them around because I handle different tasks in my workflow on different machines. Most data transfers happen between my desktop/laptop and the computing cluster that I use for number crunching. I couldn't do the number crunching on my desk, nor the data inspection and visualization on the cluster in batch mode. Since the two machines have no shared file storage, I can't avoid copying the data back and forth. Moreover, collaborators' desktop machines participate in the overall workflow as well.

For jobs that handle much bigger datasets, copying is indeed not an option, and the usual way to work is to keep the data on a single server-type machine that also handles the computation. I cannot use that kind of setup because neither my software nor my computers are made for it. All my software was written with local disk storage in mind - just like HDF5.

Taking a step back from the technical details, my analysis of the situation is that we are living in a transition period from local to distributed storage of scientific data. Local storage was the only option in the past, before fast networks came along. Distributed storage is what fits today's working patterns best: large data, geographically widespread collaborations, etc. But distributed storage still lacks good infrastructure, and is therefore badly supported by much scientific software.

The future of scientific data management is, in my opinion, something like IPFS: a single logical view of data spread out over a vast network of machines. Software accesses the data using a mixture of references (like filenames, URLs, etc.) and content-based addressing (e.g. through hashes). If performance demands local storage, the data is cached by the middleware. The middleware also ensures availability with decent performance and redundant storage to prevent data loss. No data would ever be copied explicitly, but simply retrieved "from the cloud".

In such a world, my HDF5 files would become small datasets containing references to other, potentially big datasets, plus metadata. Content-based addressing plus transparent data movements performed by the middleware would ensure coherence - nothing would be messed up by me shoveling data around with manually typed scp commands. I suspect Cyrille would be happy with this as well. The only problem is that we do not have this infrastructure. Worse, given the cost of building and maintaining such infrastructure, we are not likely to have it for many years to come. So... after this short dream, it's back to HDF5 for me.

Comments retrieved from Disqus

Daan van Vugt:
Hi Konrad,
For that kind of analysis tasks you might look at sshfs with a generous cache as a kind of distributed file system middleware, I use it very often when analysing data.
Ipfs looks very interesting and loads better!
Daan

From facts to narratives

Konrad Hinsen — 2015-12-08

A recurrent theme in computational science (and elsewhere) is the need to combine machine-readable information (which in the following I will call "facts" for simplicity) with a narrative for the benefit of human readers. The most obvious situation is a scientific publication, which is essentially a narrative explaining the context and motivation for a study, the work that was undertaken, the results that were observed, and conclusions drawn from these results. For a scientific study that made use of computation (which is almost all of today's research work), the narrative refers to various computational facts, in particular machine-readable input data, program code, and computed results.

A computational notebook, as pioneered by Mathematica and recently popularized by Jupyter (formerly known as the IPython notebook), is another document that mixes facts and narratives. Compared to a scientific article, program code takes a much more prominent role, and the narrative is focused on the computation. In software development tools, we find the fact-narrative mixture in version control, where the commits are a stream of facts to which the commit messages attach a narrative. At a more basic level, comments in program code can be thought of as narratives embedded into the code. Literate programming inverts this relation by embedding the code into a narrative.

All these situations share a common problem: the tools we have today force us to choose. Either we treat the facts as first-class and accept a low-quality narrative, or we optimize the narrative and compromise on the quality of fact management. In the following, I will argue that this is due to a poorly thought-out relation between facts and narratives, and outline possible improvements.

Comments in source code are an example where priority is given to the facts, i.e. the program. The reader is supposed to read the code, the comments are there only to provide non-obvious background information, and sometimes to outline an overall structure. Reading commented code takes a lot of time and effort, because the reader has to deal with all the details of the program code. A pure narrative would explain software at a more abstract level, leaving out details or relegating them to an appendix. As an example for the opposite extreme, a scientific article is primarily a narrative, including only small pieces of the facts for illustration. A complete description of the facts would require all of the program code and input data. This is why replicability and reproducibility are currently big issues in computational science.

Facts and narratives live in two different universes. Facts belong to the computational universe, in which all information is encoded in formal languages with (ideally) well-defined syntax and semantics. Computation processes input data (which includes the program code) and produces output data in a process that is perfectly well-defined and deterministic. A real-life computation depends on a lot of input data due to the many details that matter. That means a lot of facts, but computers are very good at handling a lot of facts.

Narratives belong to the universe of human thought and communication. They rely on a rich context that human readers are expected to have acquired through prior study. This context contains in particular the appropriate abstractions that allow the narrative to remain at a manageable level, because humans can only keep a limited amount of details in their heads. To see the importance of this point, imagine a narrative that explains how to "open a door" in terms of the detailed eye movements and muscle contractions required to perform this task - such a narrative would be completely incomprehensible. On the other hand, narratives do not need to be very precise in many aspects because humans excel at "making sense" of information even if it contains mistakes and incongruences.

Computers are good at handling facts but not narratives. Humans are good at handling narratives but not facts in the quantities that typically define a computation. Letting computers intervene in the processing of narratives leads to funny results - try Google Translate on a non-trivial text for an illustration. Letting humans intervene in the execution of a computation is a major source of mistakes. That is why a key ingredient to improving replicability is the automation of all computational steps. In the ideal world, no part of a computation would be defined by a narrative providing instructions for a human operator. Anyone who has ever had to install software knows that we are still far away from that ideal world.

Note that I only said that humans should not intervene in the execution of a computation. They do of course intervene in its definition. Program source code, after all, is written by humans. More generally, humans intervene quite often in computational science by using interactive tools. In that case, the stream of user interactions becomes part of the definition of a computation. If it is recorded, the computation can later be executed again without human intervention. This is of course well known: replicability requires that all user interaction must be recorded.
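
As a minimal sketch of what recording the stream of user interactions could look like, consider the following Python example. The operations `scale` and `drop_below`, the class `RecordingSession`, and the log format are all hypothetical, chosen only to show that a logged session can be re-executed without a human in the loop.

```python
import json


class RecordingSession:
    """Applies user commands to a state and logs each one for later replay."""

    def __init__(self, operations):
        self.operations = operations  # name -> function(state, argument) -> new state
        self.log = []

    def apply(self, state, name, argument):
        self.log.append({"op": name, "arg": argument})
        return self.operations[name](state, argument)


def replay(operations, initial_state, log):
    """Re-execute a recorded session without any human intervention."""
    state = initial_state
    for entry in log:
        state = operations[entry["op"]](state, entry["arg"])
    return state


# Hypothetical interactive operations on a list of numbers.
operations = {
    "scale": lambda data, factor: [x * factor for x in data],
    "drop_below": lambda data, threshold: [x for x in data if x >= threshold],
}

session = RecordingSession(operations)
data = [1.0, 2.0, 3.0]
data = session.apply(data, "scale", 2.0)
data = session.apply(data, "drop_below", 3.0)

# The log is plain data, so it can be stored alongside the results.
saved_log = json.dumps(session.log)
replayed = replay(operations, [1.0, 2.0, 3.0], json.loads(saved_log))
print(replayed == data)
```

The log itself is a fact in the sense used here: it is machine-readable and, together with the initial data and the code of the operations, it defines the computation.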

Since facts and narratives live in different universes, we should avoid mixing them carelessly. Crossing the boundary between the two universes should always be explicit. A narrative should not include copies of pieces of facts, but references to locations in a fact universe. And facts should not refer to narratives at all. The relation between the two universes is not symmetric: computers are tools made by humans for their benefit, so the computational universe is subordinate to the human universe.

Now let us look at the examples cited in the beginning from this new point of view. In scientific communication, the separation of facts and narratives was actually well respected initially. The lab notebook recorded facts, and the published paper contained a narrative quoting facts from the lab notebook. No scientist would ever have contemplated writing a paper by modifying the contents of his or her lab notebook! Unfortunately, this basic wisdom was lost with the adoption of computers. Computers make it very easy to modify information, to the point that version control had to be invented to prevent massive information loss by careless editing. Moreover, the distinction between a lab notebook and a paper became blurred by both being files processed using a computer. Finally, computational scientists never adopted the habit of keeping lab notebooks until very recently, coming mostly from a theoretical rather than an experimental background.

Today there is a lot of discussion about "electronic lab notebooks", but the fundamental characteristic of a lab notebook being a record of facts is not often mentioned in this context. Very frequently, computational notebooks as implemented by Jupyter or Mathematica are claimed to be lab notebooks for computational science. It is probably clear at this point that I do not agree. Computational notebooks are designed for writing narratives that include computations and their results. They are best considered specialized word processors that encourage refining a document through many iterations of modification involving the code, its results, and the textual elements. The computational side of notebooks is limited to efficient interactive code evaluation. There is no logging of interactions, and no description of the computational infrastructure (libraries, ...) on which the interactive computations rely. As a consequence, computations in a notebook are in general not replicable. I believe this can be fixed, and I have made a concrete proposal for doing so, but unfortunately I do not have the means to actually implement this idea.

In version control as it was originally designed, a repository is a fact database that contains sequences of versions of file sets. Commit messages, like comments in a program, are small narratives that provide a high-level overview and often a motivation for each change. The role of a repository is similar to the role of a lab notebook: it is a permanent record of what happened, with narratives written close in time to the recorded events. As commits and commit messages accumulate over time, following along becomes an arduous task for a human reader: the narrative contains too much irrelevant detail. This became a serious practical issue as version control was adopted as a tool for collaboration, with members of a team communicating through commit messages. Git therefore introduced the approach of "rewriting history". The idea is to "clean up" a stream of commits by re-ordering and merging them and by writing new commit messages, with the goal of creating a better narrative. Rewriting history remains a hot topic of debate. Most people realize the utility of cleaning up the narrative, but it also feels wrong to destroy the original historical record in the process. Moreover, there is a clear risk of introducing mistakes when rewriting history. In view of what I said above, the basic mistake is the failure to separate cleanly facts from narratives. The cleaned-up narrative should be separate from the original commented stream of commits and refer to it. In git terminology, rewriting history should create a new branch, and the rebasing operations done in deriving the new branch from the initial one should be recorded. Moreover, the editing tools should ensure that the final file contents are the same in the two branches.
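
The last requirement, that the final file contents be the same in the two branches, is easy to check mechanically. The following Python sketch models a history as a list of (message, snapshot) pairs and compares a hash of the last snapshot of each. The data model and function names are invented for illustration and are not git's own. With real git, an empty `git diff` between the two branch tips would play a similar role.

```python
import hashlib


def tree_hash(files):
    """Hash a snapshot given as {path: bytes}; equal snapshots give equal hashes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        content = files[path]
        digest.update(path.encode("utf-8") + b"\0")
        digest.update(len(content).to_bytes(8, "big"))
        digest.update(content)
    return digest.hexdigest()


def same_final_contents(original_history, rewritten_history):
    """Each history is a list of (message, snapshot); compare only the last snapshots."""
    return tree_hash(original_history[-1][1]) == tree_hash(rewritten_history[-1][1])


# Hypothetical original record: three small commits with terse messages.
original = [
    ("wip", {"model.py": b"x = 1\n"}),
    ("fix typo", {"model.py": b"x = 2\n"}),
    ("add readme", {"model.py": b"x = 2\n", "README": b"A model.\n"}),
]
# Hypothetical cleaned-up narrative: one commit with a better message.
rewritten = [
    ("Add model with documentation", {"model.py": b"x = 2\n", "README": b"A model.\n"}),
]

print(same_final_contents(original, rewritten))
```

Keeping both histories and checking them against each other in this way preserves the original record while still offering readers the cleaned-up narrative.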

I hope that these two examples have illustrated why it is desirable to keep facts and narratives distinct, with well-defined references from narratives to facts. Unfortunately, today's computational technology doesn't help much with reaching this goal when the facts are parts of a complex computation. We cannot define such a computation while remaining completely in the computational universe. And we cannot define unambiguous references to arbitrary facts inside a computational universe either. Most of the data formats and tools we use for preparing narratives do not even try to respect the separation of universes. Finally, the formal languages we use to encode computational facts (programming languages, file formats, etc.) are mostly not designed for being embedded into narratives. There's still a lot to do.

This blog is moving!

Konrad Hinsen — 2015-11-12

Welcome to the last post on this WordPress blog. I have set up a new blog for all my future writing.

The reason for the move is that the user interface at WordPress is changing all the time without ever getting better. I like to write my posts on my own computer using Emacs, rather than typing into a rudimentary editing window on a Web site. This is not completely impossible with WordPress, but more hassle than it's worth.

My new blog is hosted on GitHub and powered by Frog, a static Web site generator that mixes my posts written as plain Markdown files with HTML templates based on the Bootstrap framework to produce the pages you can read. This setup gives me much more control over my blog, while at the same time making it easier for me to publish new posts.

The one feature that will disappear is the possibility to subscribe to my blog in order to be informed about new posts by e-mail. If you have a GitHub account, you can get the same effect by following updates to the repository that contains my blog. But the easiest way to learn about new posts is to follow me on Twitter.

The lifecycle of digital scientific knowledge

Konrad Hinsen — 2015-11-09

Like all information with a complex structure, scientific knowledge evolves over time. New ideas turn into validated models, and are ultimately integrated into a coherent body of knowledge defined by the consensus of a scientific community. In this essay, I explore how this process is affected by the ever increasing use of computers in scientific research. More precisely, I look at "digital scientific knowledge", by which I mean scientific knowledge that is processed using computers. This includes both software and digital datasets. For simplicity, I will concentrate on software, but much of the reasoning applies to datasets as well, if only because the precise meaning of non-trivial datasets is often defined by the software that treats them.

Before looking at the "digital" aspects, I will summarize the traditional lifecycle of scientific knowledge from the "printed page" era. It has been going on for centuries and follows well-established procedures and habits. I will then argue that these procedures should serve as a guideline for the management of digital scientific knowledge as well, and that computing technology for science should be designed to support this lifecycle.

New observations, instruments, models, methods, and ideas are first published in journal articles. Such an article explains the background and motivation for the work, summarizes the state of the art, and then exposes the new elements that the authors wish to contribute to the scientific record. Other scientists from the field read the article, and draw conclusions for their own work, which are translated to citations to the article in their own publications. After some time, if the original publication creates enough interest, it will become a subject of discussion in its research community, and it will be mentioned in review articles, which place it in the context of other recent work in the field.

Being cited in review articles is typically the last step in the lifecycle of an individual contribution. Its ideas and conclusions are then merged with related ideas and conclusions and reformulated to become part of the state of the art of the field, recorded in reference works, monographs, and textbooks. These works represent a kind of community consensus. New research, in the same or in other domains, builds on such consensus knowledge, often implicitly by assuming that every reader of a journal article is familiar with the contents of reference works, monographs, and textbooks.

The introduction of computers into scientific research has led to many changes to this process. Some of them, such as the transition from paper to computer files as a support medium for scientific articles and reference works, are relatively minor. The most profound change is that an important part of digital scientific knowledge exists only in the form of software. This is true in particular for complex scientific models, for which we have no other convenient form of representation. An example where this situation is very explicit is the Community Earth System Model for climate research, which takes the form of a software package. Most often, the status of computational models is more fuzzy. As an example, consider force fields for proteins such as AMBER or CHARMM. People refer to these force fields by citing scientific articles, but these articles contain only outlines of the models. Their only complete recorded expressions are implementations as part of simulation software packages, but unlike for the Community Earth System Model, there is no software package designed to function as a reference implementation defining the model.

The fundamental difference between software and other media for storing scientific knowledge is that software has two sides: a human-facing side, and a machine-facing side. As a medium for expressing scientific knowledge, software fulfills the same role as prose or mathematical formulas. But the necessity of specifying a computation so precisely that a machine can execute it imposes severe constraints (software must be expressed using formal languages), and the desire to perform computations efficiently in a world of finite resources adds a different set of priorities in software development that are often in conflict with the criteria attached to the role of a medium for expressing ideas. As an illustration, the source code of a simulation program that has been heavily optimized for parallel execution combines 10% of scientific model with 90% of resource management and bookkeeping, making the scientific model not only hard to understand but even hard to find in the source code. For a more detailed discussion, see my article in F1000Research.

Many of the problems that computational science is facing today (reliability, reproducibility, black-box mentality, etc.) can be traced back to an insufficient support for the lifecycle of scientific knowledge by today's software development tools. Practically all of them were developed by and for software development communities outside of scientific research. As a consequence, these tools (programming languages, compilers, packaging and deployment tools, version control systems, etc.) do not take into account the specificities of scientific computing. Worse, computational scientists do nothing to improve the situation. The dominant attitude today is "scientists have to adopt best practices from software engineering and acquire the skills required to apply them". What I advocate is a somewhat different point of view: scientists should adapt these practices and the tools that implement them to their specific needs.

To see where the problems are, let's look at the lifecycle of scientific knowledge expressed as software. New models and methods are developed by a mixture of thinking, tinkering, and exploring the consequences. This requires a representation that humans can understand and manipulate easily. Executability by a computer is a condition, but other machine-related criteria hardly matter at this stage. Once some useful contribution to the field has been identified, it is communicated to the research community, in a form that is easily understandable, but also easy to deploy on other people's computers. This step is the equivalent of publishing a scientific paper. Next, other scientists start to play with the new stuff. This includes comparisons with other models and methods, analysis of model properties, application to different scenarios, etc. The conclusions from this work should take a form similar to a review article. This would be a toolkit in which different models and methods are made available for execution, with added annotations about their relative strengths and weaknesses. Finally, a synthesis of different ideas leads to a consensus implementation supported and maintained by a wider community of scientists, both as a basis for their own future work and as an infrastructure tool for other communities. This last step corresponds to reference works, and should be accompanied by tutorials that take the role of textbooks. At this stage, usability and performance become major criteria, whereas it is acceptable that not everyone can easily understand the implementation. Those who do wish to understand the method can go back to the "review paper" stage.

Most of the discussion about scientific software today is focused on the last stage. It's about community-supported software packages, whose sustained development requires significant efforts and investments. Most of this effort is required to keep the software useful in a world of rapidly changing computational environments, and to improve its human interfaces. A smaller part is dedicated to implementing new scientific models and methods. This effort has no equally important counterpart in the traditional lifecycle of scientific knowledge, and therefore the people who work on it find it hard to get recognition for their work. It is "not science" by the standards of the generation that occupies most leadership positions in research today. Fortunately, this attitude is starting to change.

This focus on the last stage is perhaps also the reason for the dominating attitude that scientists should simply adopt best practices from software engineering. In fact, the development and maintenance of community software packages implementing consensus models and methods is technically close enough to software development in business and industry that the same tools and procedures can be applied. This is not true, however, for the earlier stages in the lifecycle of digital scientific knowledge. As we will see, they are not well supported by today's software development tools and practices. What's worse is that most computational scientists accept this situation as inevitable.

At the first stage, a scientist's activity is better described by "manipulating and exploring models and methods" than by "software development". Computational models are of course algorithms, and thus software, but this is almost a technical detail. What is more important is a clear view of the hypotheses and approximations that have led to a specific model, and a trace of the scientific validation that has been performed (comparison with experimental data and with other models). Programming languages are not at all a good match for this kind of work, nor are software engineering approaches such as testing. In terms of software technology, a computational model is much closer to a specification than to a piece of software.

For the next stage, the evaluation of a new idea in a narrow community of specialists, the technical requirements are somewhere in between the two neighboring stages. The manipulation of computational models loses some importance, whereas evaluation and comparison become more relevant. Interoperability matters a lot: even if the authors of two models chose different languages (corresponding to different scientific notations in the traditional scenario), a comparative evaluation should be a straightforward task. With programming languages, it clearly isn't. The technical difficulties of making programs written in different languages talk to each other are effectively discouraging scientists from even trying. We would need tools such as "notational adapters" and, even more importantly, some low-level conventions for code and data that everybody can agree and build on. As a guideline for developing such technology, keep the analogy with review articles in mind. What would an executable review article about similar but independently developed computational methods look like? Which authoring tools are available to support such work?

Finally, the transition from the first two stages to the last one is not as smooth as it ought to be. Quite often, an implementation written for convenient manipulation by humans must be completely rewritten in order to fit into a collection of optimized subroutines. What we should have is compiler-like tools that translate code from the first two stages into standard programming languages, using annotations added by expert programmers for guidance. The idea is to have a toolchain (1) guarantee the equivalence of the initial and the optimized level, and (2) keep track of additional approximations that were made for performance reasons. Moreover, community-supported optimized software libraries should be usable as infrastructure tools in the next level of model and method development, and thus be interoperable with the tools appropriate for the first stage, which are nonexistent for now.

Another way to describe this specificity of scientific computing, compared to other application domains, is the absence of a clear borderline between software developers and software users. Most scientists are users of tried and trusted computational methods while working on the development or validation of methods at another level. The only clear separation we have, conceptually, is the one between scientific models and methods on one hand and computing technology (in particular resource management) on the other hand. Unfortunately, that is exactly the separation that current software technology does not allow us to make.

Comments retrieved from Disqus

:
- Konrad Hinsen:
  I kind of agree with much of what you say, but it's about publishing, not about knowledge representation. In that respect, the transition to digital has opened up many new options and I am all for exploring these - in fact, I am participating actively in doing so.
  The specific topic of this post is not how information is shared and archived, but how knowledge is encoded in the form of symbols. What I want to preserve from the printed paper era is the flexibility of adapting notation to the task. It is programming languages that are rigid and constraining when seen as a medium for expressing thoughts. There are good (but also bad) reasons for that in the context of software development, but they don't carry over to doing science.
  Finally, the analogy with statically linked executables is not very useful in my opinion: an executable is not at all useful for communicating scientific knowledge.

A rant about software deployment in 2015

Konrad Hinsen — 2015-11-06

We all know that software deployment in a research environment can be a pain, but knowing this as a fact is not quite the same as experiencing it in reality. Over the last days, I spent way more time than I would have imagined on what sounds like a simple task: installing a scientific application written in Python on a Linux machine for use by a group of students in a training session. Here is an outline of the difficulties, in the hope that it will (1) help others who face similar problems and (2) contribute a little bit to improving the situation.

The software that I installed is nMOLDYN, an analysis tool for Molecular Dynamics trajectories. From a software engineering point of view, this is a rather standard Python program building on NumPy and MMTK for its computations and on Tkinter and matplotlib for the graphical user interface. There is no need for anything on the bleeding edge: a decent three-year-old installation of the scientific Python stack would support this perfectly well.

The machine that was set up for the training session is configured much like a typical node in a compute cluster: stable and trusted software installed once and never updated. More specifically, the machine runs CentOS 6.7. Another feature rather typical of compute nodes is the very restricted network connectivity: users can log in via ssh, and copy data in and out using scp. Everything else is blocked, in particular all outgoing network traffic. The idea is that students will work on desktop or laptop machines, from where they have full network access to search for information, and connect to the compute server only for running scientific software. For my own software installation I had to limit myself to a user account, i.e. no administrator rights, although I could ask the systems administrator to install additional RPMs from CentOS.

A first exploration of the system's Python installation showed a collection of oldies: Python 2.6.6, NumPy 1.4.1, matplotlib 0.99.1.1. That was the state of the art five years ago. I quickly decided not to use it at all, for two reasons. First, I wasn't sure how much of what I had to add would still work with such old versions. All the software was already around five years ago, but I would have had to track down the versions that were current back then. Second, adding modules in a user account to a Python installation at the system level can easily lead to a fragile whole. Following Murphy's law, such problems would show up during the student sessions. So I decided to start with a fresh install of Python 2.7.

First surprise: no C compiler. An e-mail to the administrator, and I had gcc. Trying to install Python showed that the Tcl/Tk setup was incomplete: the header files were missing. Another e-mail asking for tcl-devel and tk-devel, and that was settled as well. Python, NumPy, netCDF, ScientificPython, and MMTK were up and running half an hour later. An attempt to install nMOLDYN resulted in the information that I still needed to install Pyro and matplotlib. That can't be so hard, right?

Pyro was no problem indeed, but matplotlib kept me busy for a few more hours. All I had done in the past was pip install matplotlib, but pip is useless without outgoing network connections. I had to track down source tarballs for matplotlib and all its dependencies. There's a list of dependencies on the matplotlib Web site, but it's incomplete in two ways: some dependencies are missing (setuptools and six), and others are given by name but without a link. Try googling for "cycler" - you will learn a lot about celestial mechanics before you find a package with this name on PyPI. Of all the matplotlib dependencies, only freetype was already available on my machine, so I had some searching and downloading to do.

The installation instructions for setuptools clearly do not consider the possibility of not having a network connection. They tell me to download a Python script and execute it to download the real software. Fortunately, there is the "advanced" installation option via a tarball. Which ends rather quickly with an error message complaining about the absence of the zlib module.

That module is part of the Python standard library, but it is compiled only if zlib (the C library) is installed on the machine. It wasn't on mine. This is not particularly difficult to fix, but rather annoying: I had to install zlib, and then run the Python installation once more. Not to forget: I knew zlib was in the standard library, and I immediately saw why it was missing on my machine, because I have been installing Pythons in lots of environments over twenty years. Someone else might well have spent a few hours figuring out what to do about zlib.
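
The check that I did in my head can be sketched in a few lines. This is a minimal illustration, written for a current Python 3 rather than the Python 2.7 of this story, and the selection of module names is an assumption chosen for the example: each is a standard-library extension module that is only built when the corresponding C library and its headers are present at compile time.

```python
import importlib.util

# Extension modules from the standard library that a source build of
# CPython skips when the matching C library is absent (illustrative
# selection, not a complete list).
OPTIONAL_EXTENSION_MODULES = ("zlib", "_bz2", "_ssl", "_sqlite3", "_tkinter")


def missing_optional_modules(names=OPTIONAL_EXTENSION_MODULES):
    """Return the names from `names` that this interpreter cannot find."""
    missing = []
    for name in names:
        # find_spec returns None when no importable module of that name exists.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_optional_modules()
    if missing:
        print("Not available in this Python build:", ", ".join(missing))
    else:
        print("All checked optional modules are available.")
```

Running this right after building Python, and before installing anything on top of it, would flag a missing zlib module before setuptools does.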

From then on everything went smoothly, so this is the end of my story. In order to provide something constructive, here is the complete list of matplotlib dependencies with links, and in the order of installation:

Finally, I will pass on a hint that came in this morning via Twitter:

@khinsen @MrTheodor pip install --download . --no-use-wheel Won’t work if there are Linux specific dependencies though.
— Donald Stufft (@dstufft) 6. November 2015

Using pip install --download . --no-use-wheel matplotlib, run of course on a machine that has a network connection, you get tarballs for matplotlib and all its dependencies that pip knows about. You still have to add setuptools (which pip doesn't download because it depends on it itself), the C libraries libpng and zlib, and of course Python's standard but not-always-there zlib module.
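
Later pip versions replaced `pip install --download` with a separate `pip download` command. The following sketch builds the two command lines for the same two-step procedure with such a pip: fetch source archives on a connected machine, then install from the copied directory on the isolated one. The helper names and the `tarballs` directory are hypothetical, and the caveats above still apply: C libraries and anything that pip itself needs for building are not covered.

```python
import sys


def download_command(packages, dest="."):
    """Build the pip command that fetches source archives for `packages`
    and their dependencies into `dest` (for a machine with network access)."""
    return [sys.executable, "-m", "pip", "download",
            "--no-binary", ":all:",   # source distributions only, no wheels
            "--dest", dest,
            *packages]


def offline_install_command(packages, archive_dir="."):
    """Build the pip command that installs `packages` using only the
    archives in `archive_dir` (for the machine without network access)."""
    return [sys.executable, "-m", "pip", "install",
            "--no-index",             # never contact a package index
            "--find-links", archive_dir,
            *packages]


if __name__ == "__main__":
    print(" ".join(download_command(["matplotlib"], "tarballs")))
    print(" ".join(offline_install_command(["matplotlib"], "tarballs")))
```

Each list can be passed to `subprocess.run(..., check=True)` on the appropriate machine, with the `tarballs` directory copied over via scp in between.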

Looking back at my twenty years using Python, I come to the unfortunate conclusion that software installation is much more of a problem today than it was back in 1995. The main reason is of course that Python software has become more feature-rich and complex - in 1995 something like matplotlib was only a dream. But the state of Python packaging tools is also to blame, with three overlapping and partially compatible tools (distutils, setuptools, and distribute) creating a lot of confusion and various distribution formats (tarballs, eggs, wheels) adding another layer of complexity. What is also sorely missing is a straightforward way to package an application program with all its dependencies in such a way that it can be installed with reasonable effort on all common platforms.

Comments retrieved from Disqus

ostrokach:
Check out the Anaconda python distribution (https://anaconda.org). It addresses most of the issues that you list in this blog.
- Konrad Hinsen:
  Anaconda is great if all the packages you need are in it. If you have to add pure Python packages by hand afterwards, that's OK as well, but if you have to add packages with C extension modules, Anaconda is more of a pain than a help in my experience. And that's what I needed in this specific case (Scientific Python + MMTK + nMOLDYN, all with extension modules).
  Moreover, I am not sure that Anaconda is usable in an environment without Internet connection, though I haven't tried.
  - ostrokach:
    You could set up a local Anaconda repository, and copy into it all the packages (*.tar.bz2 files) that you require. Anaconda comes with `zlib` and all the other binaries which were giving you a hard time, compiled on old CentOS 5 and thus working on most Linux distros. Furthermore, you could create conda packages for MMTK and nMOLDYN on a local machine, and so you wouldn't even need `gcc` on the server.
    Edit:
    > If you have to add packages with C extension modules, Anaconda is more of a pain than a help in my experience
    If you don't have root access on your machine, you will end up compiling many of the required C libraries in your home directory anyway. You might as well make a conda package out of them so you can use them anywhere. It also helps if you want to install several python packages that have different dependencies (NumPy, Boost, etc.).
    - Konrad Hinsen:
      That all sounds nice in theory, but practice is different. I did try to build conda packages for ScientificPython and MMTK but gave up. Linking extension modules to the shared libraries already provided by Anaconda proved to be impossible in a portable way.
      - Chris Barker:
        It's been another year, and Conda / Anaconda has grown more features and support, particularly with conda-forge.
        This would probably be very easy today. Though a non-connected environment is a lot harder.
        You'd need to download all your conda packages by hand, but that's not too hard to do, if annoying.
        And there is the Constructor project:
        https://github.com/conda/co...
        that might not even be necessary
        So it HAS gotten better!
Emmanuel V.:
Mmm. You're right about Python packaging tools.
For your setup, an approach could have been:
1) replicate the "hostile" CentOS environment in a local VM (eg VirtualBox on your laptop, same base OS version, but with Internet connectivity)
2) Install all needed software as an unprivileged user
3) copy this user account on the target CentOS
Emmanuel
- matthew scholz:
  This is a solution that should not be required.
  - Chris Barker:
    well, it's not, but really, one of the sources of the problem here is a locked down environment -- having a not-isolated "copy" of the exact same environment should be a standard part of such a system.
- Konrad Hinsen:
  Yes, that's another possible strategy. But setting up a virtual machine is also a lot of work, so it comes down to estimating how close one will be to the break-even point. I really didn't expect to spend much time on this before I started.

Beyond Jupyter: what's in a notebook?

Konrad Hinsen — 2015-09-03

Yesterday I participated (as a visitor) in the kickoff meeting for OpenDreamKit, where one recurrent topic of discussion was notebooks, both Jupyter and Sage, including the question if they could be brought together. This reminded me of a recent blog post by Kirill Pomogajko entitled "Why I don't like Jupyter". And it reminded me of my own long-term project of integrating Jupyter with my ActivePapers system for reproducible research. That's three reasons for writing down my thoughts about notebooks and their role(s) in computational research, so here we go.

One key observation is in Gaël Varoquaux's comment on Kirill's blog post: using Jupyter for doing science creates a lock-in, because all collaborators on a project must agree on using Jupyter. There is no other tool that can be used productively for working with notebooks. It's a case of "wordization": digital content is taken hostage by a tool that defines a storage format for its own convenience without much consideration for other tools, be they competing or complementary. Wordization not only restricts the users' freedom to work with their data, but also creates headaches for the future. A data format defined by a tool can easily become unusable as the tool evolves and introduces incompatibilities, or of course if it disappears. In the case of Jupyter, its developers have always provided upgrade paths for notebooks between versions, but at some time this is bound to create trouble. Bugs are a fact of life, and I don't expect that the version-2-compatibility-feature will get much testing in Jupyter version 23. To make it worse, a Jupyter notebook can depend on third-party code that implements embedded widgets. This is one of the reasons why I don't use Jupyter for my research, although I am a big fan of using it for teaching. The other reason is that I cannot usefully link a notebook to other relevant information, such as code and data dependencies. Jupyter doesn't provide any functionality for this, and they are hard to implement externally exactly because of wordization.

Wordization is often associated with evil intentions of market dominance, as they are regularly assumed for a company like Microsoft. But I believe that the fundamental cause is the obsession with tools over content that has driven the computing industry for many years. The tool aspects of a piece of software, such as its feature list and its user interface, are immediately visible. In contrast, its data model attracts attention only from a few specialists, if at all. Users feel the consequences of bad (or absent) data model design through the symptoms of wordization, in particular lock-in, but rarely understand where it comes from. Interestingly, this problem was also mentioned yesterday at the OpenDreamKit meeting, by Michael Kohlhase who discussed the digital representation of mathematical knowledge and the difficulty of exchanging it between different software tools. I have written earlier about another aspect, the representation of scientific models in computational science, which illustrates the extreme case of tools having absorbed scientific content to the point that its users don't even realize that something is missing.

Back to notebooks. Let's forget about tools for the moment and consider the question of what a notebook actually is, as a digital document. I think that notebooks are trying to be two different things, and that many of the problems we have with them come from this ambiguity. One role of notebooks is the documentation of computational work as a narrative with direct access to the data. This is why people publish notebooks. The other role is as a protocol of interactive explorative work, i.e. the computational scientist's equivalent of a lab notebook. The two roles are not completely unrelated, but they are still significantly different.

To see the difference, look at how experimental scientists worked in the good old days of pencil, paper, and the printing press. As experiments were done, all the relevant information (preparation, results, …) was written down, immediately, with a time stamp, in the lab notebook. Like a bank ledger, a lab notebook is an immutable protocol of what happened. You don't go back and change earlier entries, that would even be considered fraud. You just add information at the end. Of course, the resulting protocol is not a good way to communicate one's findings. Therefore they are distilled and written up in a separate narrative, which surrounds a description of the work and its most important results by a motivating introduction and summarizing conclusions. This is the classic scientific article.

Today's computational notebooks are trying to be both protocol and narrative, and pretend that there is a fluent transition between them. One unfortunate consequence is that computational protocols disappear as they are edited to become narratives. This could be alleviated by keeping notebooks under version control, but I have yet to see good versioning support in any notebook-type tool. But, fundamentally, today's notebook tools don't encourage keeping a protocol. They encourage frequent changes to the code and the results, keeping only the latest version. As editors for narratives, notebook tools are also far from ideal because they encourage interactive execution of small code snippets, making it easy to lose track of what was actually executed and in what order. In Jupyter, the only way to ensure a coherent narrative is to (1) restart the kernel and (2) re-execute all cells. There is not even a single menu entry for this operation. Actually, I wonder how many Jupyter users are aware that they must restart the kernel before re-executing all the cells if they want to ensure reproducibility.

With all that said, here is my current idea of what a notebook should look like at the bit level. A notebook data model should have two distinct entries, one for a protocol and one for a narrative. The protocol entry is a sequence of code cells and results, as they were executed since the start of the computation (for Jupyter, that means the last kernel restart). The narrative is a user-edited sequence of code cells, documentation cells, and results. The actual cell contents could well be shared between the two views: store each cell with a unique ID, and make the protocol and the narrative simple lists of IDs. The representation of code and documentation cells in such a data model is straightforward, though there's a huge potential for bikeshedding in defining the details. The representation of results is much more difficult if you want to support more than plain text output. In the long run, it will be inevitable to define clear data models for every type of display widget, which is a lot of work.
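
The shape of such a data model can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: all class, field, and method names are hypothetical, results are restricted to plain text, and nothing here corresponds to an existing Jupyter format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cell:
    """One stored cell: code with its result, or documentation text."""
    cell_id: int
    kind: str                      # "code" or "doc"
    source: str
    result: Optional[str] = None   # plain-text output only in this sketch


@dataclass
class Notebook:
    cells: Dict[int, Cell] = field(default_factory=dict)
    protocol: List[int] = field(default_factory=list)    # append-only log
    narrative: List[int] = field(default_factory=list)   # user-edited

    def _store(self, kind, source, result=None):
        # Cells are never deleted, so the current count is a fresh ID.
        cell = Cell(len(self.cells), kind, source, result)
        self.cells[cell.cell_id] = cell
        return cell.cell_id

    def record_execution(self, source, result):
        """Append an executed code cell to the protocol and return its ID."""
        cell_id = self._store("code", source, result)
        self.protocol.append(cell_id)
        return cell_id

    def add_doc(self, text):
        """Append a documentation cell to the narrative and return its ID."""
        cell_id = self._store("doc", text)
        self.narrative.append(cell_id)
        return cell_id

    def cite(self, cell_id):
        """Reference an already stored cell from the narrative."""
        if cell_id not in self.cells:
            raise KeyError(cell_id)
        self.narrative.append(cell_id)


nb = Notebook()
nb.record_execution("x = 2", None)
answer = nb.record_execution("x * 21", "42")
nb.add_doc("The final result is computed as follows.")
nb.cite(answer)
```

After these calls, the protocol lists both executed cells in execution order, whereas the narrative contains the documentation cell followed by a reference to the second code cell. Editing the narrative never touches the protocol.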

From the tool point of view, the current Jupyter interface could be complemented by a non-editable protocol view. I'd also like to see a single command (menu/keyboard) for the "clean slate" operation: save the current state as a snapshot (or commit it directly to version control), restart the kernel, and re-initialize the protocol to an empty list. But what really matters to me is the data model. Contrary to the current one implemented in Jupyter, the one outlined above could be integrated into workflow management and archiving tools, such as my own ActivePapers. We'd probably see an Emacs mode for working with it as well. Plus pretty-printing tools, analysis tools, etc. We'd see an ecosystem of tools working with notebooks. A Dream of Openness.

The future of the Scientific Python ecosystem

Konrad Hinsen — 2015-07-16

SciPy 2015 is over, meaning that many non-participants like myself are now busy catching up with what happened by watching the videos. Today's dose for me was Jake VanderPlas' keynote entitled "State of the Tools". It's about the history, current state, and potential future of what is now generally known as the Scientific Python ecosystem: the large number of libraries and tools written in or for Python that scientists from many disciplines use to get their day-to-day computational work done.

History is done, the present status is a fact, but the future is open to both speculation and planning, so that's what I find most interesting in Jake's keynote. What struck me is that everything he discussed was about paying back technical debt: refactoring the core libraries, fixing compatibility problems, removing technical obstacles to installation and use of various tools. In fact, 20 years after Python showed up in scientific computing, the ecosystem is in a state that is typical for software projects of that age: a bit of a mess. The future work outlined by Jake would help to make it less of a mess, and I hope that something like this will actually happen. The big question mark for me is how this can be funded, given that it is "only" maintenance work, producing nothing fundamentally new. Fortunately there are people much better than me at thinking about funding, for example everyone involved in the NumFOCUS foundation.

Jake's approach to outlining the future is basically "how can we fix known problems and introduce some obvious improvements" (but please do watch the video to get the full story!). What I'd like to present here is an alternate approach: imagine an ideal scientific computing environment in 2015, and try to approximate it by an evolution of the current SciPy ecosystem while retaining a sane level of backwards compatibility. Think of it as the equivalent of Python 3 at the level of the core of the scientific ecosystem.

One aspect that has changed quite a bit over 20 years is the interaction between Python and low-level code. Back then, Python had an excellent C interface, which also worked well for Fortran 77 code, and the ease of wrapping C and Fortran libraries was one of the major reasons for Python's success in scientific computing. We have seen a few generations of wrapper code generators, starting with SWIG, and the idea of a hybrid language called Pyrex that was the ancestor of today's Cython. LLVM has been a major game changer, because it permits low-level code to be generated and compiled on-the-fly, without explicitly generating wrappers and compiling code. While wrapping C/C++/Fortran libraries still remains important, the equally important task of writing low-level code for performance can be handled much better with such tools. Numba is perhaps the best-known LLVM-based code generator in the Python world, providing JIT compilation for a language that is very similar to a subset of Python. But Numba is also an example of the mindset that has led to the current mess: take the existing ecosystem as given, and add a piece to it that solves a specific problem.

So how would one approach the high-/low-level interface today, having gained experience with LLVM and PyPy? Some claim that the distinction doesn't make sense any more. The authors of the Julia language, for example, claim that it "avoids the two-language problem". However, as I have pointed out on this blog, Julia is fundamentally a performance-oriented low-level language, in spite of having two features, interactivity and automatic memory management, that are traditionally associated with high-level languages. By the way, I don't believe the idea of a both-high-and-low-level language is worth pursuing for scientific computing. The closest realization of that idea is Common Lisp, which is as high-level as Python, perhaps more so, and also as low-level as Julia, but at the cost of being a very complex language with a very steep learning curve, especially for mastering the low-level aspects. Having two clearly distinct language levels makes it possible to keep both of them manageable, and the separation line serves as a clear warning sign to scientists, who should not attempt to cross it without first acquiring some serious knowledge about software development.

The model to follow, in my opinion, is the one of Lush and Terra. They embed a low-level language into a high-level language in such a way that the low-level code is a data structure at the high level. You can use literals for this data structure and get the equivalent of Numba. But you can also write code generators that specialize low-level code for a given problem. Specialization allows both optimization and simplification, both of which are desirable. The low-level language would have arrays as a primitive data structure, and both NumPy and Pandas, or evolutions such as xray, would become shallow Python APIs to such low-level array functionality. I think this is much more powerful than today's Numba building on NumPy. Moreover, wrapper generators become simple plain Python code, making the construction of interfaces to complex libraries (think of h5py) much easier than it is today. Think of it as ctypes on steroids. For more examples of what one could do with such a system, look at metaprogramming in Julia, which is exactly the same idea.
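
The idea of code as a data structure that can be specialized for a given problem can be illustrated in plain Python. The sketch below is only an analogy: Python's own `ast` module stands in for the embedded low-level language, so the result is ordinary Python bytecode rather than machine code as in Lush or Terra, and the function name is hypothetical. It generates an unrolled Horner evaluation for one fixed polynomial and simplifies it by dropping additions of zero coefficients.

```python
import ast


def specialize_polynomial(coefficients):
    """Return a function of x evaluating the polynomial with the given
    coefficients (highest degree first), unrolled via Horner's scheme."""
    x = ast.Name(id="x", ctx=ast.Load())
    body = ast.Constant(value=coefficients[0])
    for c in coefficients[1:]:
        # One Horner step: multiply the accumulated expression by x ...
        body = ast.BinOp(left=body, op=ast.Mult(), right=x)
        # ... and add the next coefficient unless it is zero (simplification).
        if c != 0:
            body = ast.BinOp(left=body, op=ast.Add(),
                             right=ast.Constant(value=c))
    # Wrap the expression in "lambda x: <body>" and compile it.
    tree = ast.Expression(
        body=ast.Lambda(
            args=ast.arguments(posonlyargs=[], args=[ast.arg(arg="x")],
                               kwonlyargs=[], kw_defaults=[], defaults=[]),
            body=body))
    ast.fix_missing_locations(tree)
    return eval(compile(tree, "<generated>", "eval"))


p = specialize_polynomial([2, 0, 1])   # 2*x**2 + 1
value = p(3)                           # 2*9 + 1 = 19
```

The generated function contains no loop and no coefficient list, only the arithmetic needed for this particular polynomial. With a real low-level target language, the same generator structure would yield specialized compiled code.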

Another aspect that Jake talks about in some detail is visualization. There again, two decades of code written by people scratching their own itches has led to a mess of different libraries with a lot of overlap and no clear distinctive features. For cleaning it up, I propose the same approach: what are the needs and the available technologies for scientific visualization in 2015? We clearly want to profit from all the Web-based technologies, both for portability (think of mobile platforms) and for integration with Jupyter notebooks. But we also need to be able to integrate visualization into GUI applications. From the API point of view, we need something simple for simple plots (Toyplot looks promising), but also more sophisticated APIs for high-volume data visualization. The main barrier to overcome, in my opinion, is the current dominance of Matplotlib, which isn't particularly good in any of the categories I have outlined. Personally, I don't believe that any evolution of Matplotlib can lead to something pleasant to use, but I'd of course be happy to be proven wrong.

Perhaps the nastiest problem that Jake addresses is packaging. He seems to believe that conda is the solution, but I don't quite agree with that. Unless I missed some recent evolutions, a Python package prepared for installation through conda can only be used easily with a Python distribution built on conda as well. And that means Anaconda, because it's the only one. Since Anaconda is not Open Source, there is no way one can build a Python installation from scratch using conda. Of course, Anaconda is perfectly fine for many users. But if you need something that Anaconda does not provide, you may not be able to add it yourself. On the Mac, for example, I cannot compile C extensions compatible with Anaconda, because Mac Anaconda is built for compatibility with ancient OSX versions that are not supported by a standard XCode installation. Presumably that can be fixed, but I suspect that would be a major headache. And then, how about platforms unsupported by Anaconda?

Unfortunately I will have to leave this at the rant level, because I have no better proposition to make. Packaging has always been a mess, and will likely remain a mess, because the underlying platforms on which Python builds are already a mess. Unfortunately, it's becoming more and more of a problem as scientific Python packages grow in size and features. It's gotten to the point where I am not motivated to figure out how to install the latest version of nMOLDYN on my Mac, although I am a co-author of that program. The previous version is good enough for my own needs, and much simpler to install though already a bit tricky. That's how you get to love the command line… in 2015.

Another look at Julia

Konrad Hinsen — 2015-06-18

Three years ago, I first looked at the then-very-new language Julia. Back then, I concluded that there were many interesting features, but also regretted too much bad Matlab influence in the array handling.

A hands-on Julia tutorial in my neighborhood was a good occasion to take another look at this language, which has evolved quite a bit since 2012, and continues to evolve rapidly. The tutorial taught by David Sanders was an excellent introduction, and his notebooks should even be good for self-teaching. If you already have some experience in computational science, and are interested in trying Julia out on small practical applications, have a look at them.

The good news is that Julia has much improved over the years, not only by being more complete (in particular in terms of libraries), but also through changes in the language itself. More changes are about to happen with version 0.4, which is currently under development. The changes being discussed include the array behavior that I criticized three years ago. It's good to see references to APL in this discussion. I still believe that when it comes to arrays, APL and its successors are an excellent reference. It's also good to see that the Julia developers take the time to improve their language, rather than rushing towards a 1.0 release.

Due to David's tutorial, this time my contact with Julia was much more practical, working on realistic problems. This was a good occasion to appreciate many nice features of the language. Julia has taken many good features from both Lisp and APL, and combined them seamlessly into a language that, in spite of some warts, is overall a pleasure to use. A major aspect of Julia's Lisp heritage is the built-in metaprogramming support. Metaprogramming has always been difficult to grasp, which was clear as well during the tutorial. It isn't obvious at all what kind of problem it helps to solve. But everyone who has used a language with good metaprogramming support doesn't want to go back.

A distinctive feature of Julia is that it occupies a corner of the programming language universe that was almost empty until now. In scientific computing, we have traditionally had two major categories of languages. "Low-level" languages such as Fortran, C, and C++, are close to the machine level: data types reflect those directly handled by today's processors, memory management is explicit and thus left to the programmer. "High-level" languages such as Python or Mathematica present a more abstract view of computing in which resources are managed automatically and the data types and their operations are as close as possible to the mathematical concepts of arithmetic. High-level languages are typically interpreted or JIT-compiled, whereas low-level languages require an explicit compilation step, but this is not so much a feature of the language as of their age and implementation.

Julia is resolutely modern in opting for modern code transformation techniques, in particular under-the-hood JIT compilation, making it both fully compiled and fully interactive. In terms of the more fundamental differences between "low-level" and "high-level", Julia chooses an unconventional approach: automatic memory management, but data types at the machine level.

As an illustration, consider integer handling. Julia's default integers are the same as C's: optimal machine-size signed integers with no overflow checks on arithmetic. The result of 10^50 is -5376172055173529600, for example. This is the best choice for performance, but it should be clear that it can easily create bugs. Traditional high-level languages use unlimited integers by default, eventually offering machine-size integers as an optimization option for experienced programmers. Julia does have a BigInt type, but using it requires a careful insertion of big(...) in many places. It's there if you absolutely need it, but you are expected to use machine-sized integers most of the time.
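To make the overflow concrete, here is a minimal Python sketch that emulates wrap-around arithmetic on signed 64-bit integers. Python's own integers are unlimited, so the helper `wrap_signed` (a name introduced just for this illustration) reduces the exact result modulo 2^64, which is what machine arithmetic without overflow checks amounts to.

```python
def wrap_signed(value, bits=64):
    """Reduce an exact integer to the signed two's-complement range of `bits` bits."""
    modulus = 1 << bits
    value %= modulus              # keep only the low `bits` bits
    if value >= modulus >> 1:     # top bit set: the machine sees a negative number
        value -= modulus
    return value

exact = 10**50                    # Python integers never overflow
wrapped = wrap_signed(exact)      # what unchecked 64-bit arithmetic yields

print(exact)
print(wrapped)                    # -5376172055173529600, the value quoted above
```

Nothing signals that the second number is wrong, which is exactly why this default is a trap for the unwary.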

As a consequence, Julia is a power tool for experienced scientific programmers who are aware of the traps and the techniques to avoid falling into them. Julia is not a language suitable for beginners or occasional users of scientific programming, because such inexperienced scientists need more of a safety net than Julia provides. Neither is Julia a prototyping language for trying out new ideas, because when concentrating on the science you also need a safety net that protects you from the traps of machine-level abstractions. In Julia, you have to design your own safety net, and you also have to verify that it is strong enough for your needs.

Perhaps the biggest problem with Julia is that this is not obvious at first glance. Julia comes with all the nice interactive tools for rapid development and interactive data analysis, in particular the IJulia notebook which is basically the same as the now-famous IPython/Jupyter notebook. At first glance, Julia looks like a traditional high-level language. A strong point of David's Julia tutorial is that it points out right from the start that Julia is different. Whenever a choice must be made between run-time efficiency and simplicity, clarity, or correctness, Julia always chooses efficiency. The least important consequence is surprising error messages that make sense only with a basic understanding of how the compiler works. The worst consequence is that inexperienced users are easily induced to write unsafe code. There are nice testing tools, in particular FactCheck which looks very nice, but scientists are notoriously unaware of the need for testing.

The worst design decision I see in Julia is the explicit platform dependence of the language: the default integer size is either 32 or 64 bits, depending on the underlying platform. This default size is used in particular for integer constants. As a consequence, a Julia program does in general not have a single well-defined result, but two distinct results. This means that programs must be tested on two different architectures, which is hard to do even for experienced programmers. Given the ongoing very visible debate about the (non-)reproducibility of computational research, I cannot understand how anyone can make such a decision today. Of course I do understand the performance advantage that results from this choice, but this clearly goes too far for my taste. If I ever use Julia for my research, I'll start each source code file with @assert WORD_SIZE==64 just to make sure that everyone knows what kind of machine I tested my code on.
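The two-results problem can be imitated in Python by evaluating the same expression under both word sizes. This is a sketch under the assumption that arithmetic simply wraps around at the word size; `wrap_signed` and `product` are names made up for the illustration, and the emulation does not attempt to model how Julia handles integer literals that are too large for 32 bits.

```python
def wrap_signed(value, bits):
    """Reduce an exact integer to the signed range of a `bits`-bit word."""
    modulus = 1 << bits
    value %= modulus
    return value - modulus if value >= modulus >> 1 else value

def product(a, b, bits):
    """Multiply two integers as a machine with `bits`-bit words would."""
    return wrap_signed(a * b, bits)

# Same source expression, two platforms, two answers.
for bits in (32, 64):
    print(bits, product(100_000, 100_000, bits))
# 32 1410065408
# 64 10000000000
```

Both operands fit comfortably into 32 bits, so nothing in the source code hints that the result depends on the machine.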

As for the surprising but not dangerous features that can probably only be explained by convenience for the compiler, there is first of all the impossibility to redefine a data type without clearing the workspace first - and that means losing your whole session. It's a bit of a pain for interactive development, in particular in IJulia notebooks. Another oddity is the const declaration, which makes a variable to which you can assign new values as often as you like, as long as the type remains the same. It's more a typed variable declaration than the constant suggested by the name.

Finally, there is another point where I think the design for speed has gone too far. The choice of machine-size integers turns into something completely useless (in my opinion) when it comes to rational arithmetic. Julia lets you create fractions by writing 3//2 etc., but the result is a fraction whose numerator and denominator are machine-size integers. Rational arithmetic has the well-known performance and memory problem of denominators growing with each additional operation. With machine-size integers, rational arithmetic rapidly crashes or returns wrong results. Given that the primary application of rationals is unlimited precision arithmetic, I don't see a practical use for anything but Rational{BigInt}.
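How quickly the growth becomes a problem can be checked with Python's `fractions.Fraction`, which is built on unlimited integers and thus plays the role of Rational{BigInt}. The sketch below sums the harmonic series exactly and reports the first term at which numerator or denominator no longer fits into a signed 64-bit integer. The harmonic series is merely a convenient example, and `first_overflowing_term` is a name invented for it.

```python
from fractions import Fraction

INT64_MAX = 2**63 - 1

def first_overflowing_term(limit=1000):
    """Sum 1/1 + 1/2 + ... exactly with unlimited integers.

    Returns (n, total) for the first n at which the numerator or the
    denominator of the partial sum exceeds the signed 64-bit range,
    or (None, total) if that does not happen up to `limit`.
    """
    total = Fraction(0)
    for n in range(1, limit + 1):
        total += Fraction(1, n)
        if max(abs(total.numerator), total.denominator) > INT64_MAX:
            return n, total
    return None, total

n, total = first_overflowing_term()
print("first term that needs more than 64 bits:", n)
print("bits in numerator:  ", total.numerator.bit_length())
print("bits in denominator:", total.denominator.bit_length())
```

A few dozen additions of small fractions are enough to leave the machine-integer range, which is a very modest computation by the standards of exact arithmetic.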

In the end, Julia leaves me with a feeling of a lost opportunity. My ideal software development environment for computational science would support the whole life cycle of computational methods, starting from prototyping and ending with platform-specific optimizations. As code is progressively optimized based on profiling information, each version would be used as a reference to test the next optimization level. In terms of fundamental language design, Julia seems to have everything required for such an approach. However, the default choice of fast-and-unsafe operations almost forces programmers into premature optimization. Like in the traditional high-/low-level language world, computational science will require two distinct languages, a safe and a fast one.

The compartmentalization of knowledge

Konrad Hinsen — 2015-06-05

Now that the birch pollen season is definitely over, I can draw some conclusions from a two-year experiment with the impressive sample size of one - myself. As you will see, my topic is not so much the experiment itself, but the circumstances in which it happened.

I have been allergic to birch pollen for more than thirty years. My allergy is strong enough to make normal life impossible when the birch pollen concentration is high, which happens for about three to four weeks every year. For those who have no experience with allergies, consider how sneezing five times in five minutes a few times per hour would impact your daily activities. Like most victims of pollen allergy, I consulted medical doctors in search of relief. In the course of thirty years spent in various places, even different countries, I have seen many of them, from three categories: general practitioners, otorhinolaryngologists, and allergologists. All these doctors agreed that the only reasonable treatment is antihistamines, arguing that the only other option, immunosuppressive treatments such as cortisone, has side effects that are too severe compared to the benefit obtained.

Unfortunately, antihistamines also have a frequent side effect: drowsiness. Its degree varies between people and across different antihistamines. But in spite of undeniable progress over the years, I have yet to try an antihistamine that I could live with comfortably. I was always faced with the choice of the lesser evil: sneezing or drowsiness. I usually tried to take antihistamines as little as possible, based on birch pollen concentration forecasts, but I found that strategy hard to apply in practice.

So much for the motivation for my recent experiment. Last year I discovered, somewhat by accident, a herbalist in Paris offering a mixture of eight plant extracts for treating allergy symptoms. I asked if they considered their product sufficient as the sole treatment for a rather severe case of birch pollen allergy. They said it's worth a try, though they didn't want to make a clear promise. I tried, and it worked. Perfectly. No sneezing, no side effects. Spring 2014 was the first one I fully enjoyed in ages. Spring 2015 was the second. I haven't taken any antihistamines since then, nor any other allergy treatment recognized by official medicine. Of course, my new treatment has its drawbacks as well. First, it's rather expensive, about 40€ for one birch pollen season. Second, you can't take a single daily dose, you have to distribute it over the day. I followed the recommendation to dilute the daily dose in a bottle of water, which I carried with me and drank over the day.

My sample-size-one study doesn't of course permit any conclusions about the efficacy of this treatment for allergies in general, but that's not my point anyway. What I find remarkable about this story is that a small herbalist shop in Paris offers something that according to all medical doctors I ever consulted doesn't exist. Herbal remedies have been used by people all over the world for all of known history. All the eight plants in my new treatment (Plantago lanceolata, artichoke, arctium, boldo, desmodium, dandelion, horsetail, thyme) have been used by herbalists for centuries. Combining them into an effective treatment certainly requires some solid knowledge about medicinal plants, but probably not a stroke of genius. How is it possible then that not even specialized allergologists are aware of such treatments? Even if it works only for 10% of pollen victims (a number I just made up), it's worth knowing about.

This compartmentalization of knowledge between traditional herbalists and 21st century medical doctors, which I suspect to be due to pure snobbery, is also a lost opportunity for medical research. According to the description of my plant mixture on the Web site, its mode of action is completely different from that of antihistamines. Studying these mechanisms might well lead to new insight into the causes of pollen allergies and their treatments.

Software in scientific research

Konrad Hinsen — 2015-04-23

In a recent blog post, Titus Brown asks if software is a primary product of science, and basically says "no" (but do read the post for the details). A blog-post length reply by Daniel Katz comes to the opposite conclusion (again, please read the post before continuing here). I left a short comment on Titus' blog but also felt compelled to expand this into a blog post of its own - so here it is.

Titus introduces a useful criterion for what "primary product of science" is: could you get a Nobel prize for it? As Dan comments, Nobel prizes in science are awarded for discoveries and inventions. There were no computers when Alfred Nobel set up his foundation, so we have to extrapolate this definition a bit to today's situation. Is software like a discovery? Clearly not. Like an invention? Perhaps, but it doesn't fit very well. Dan makes a comparison with scientific writing, i.e. papers, textbooks, etc. Scientific writing is the traditional way to communicate discoveries and inventions. But what scientists get Nobel prizes for is not the papers, but the work described therein. Papers are not primary products of science either, they are just a means of communication. There is a fairly good analogy between papers and their contents on one hand, and software and algorithms on the other hand. And algorithms are very well comparable to discoveries and inventions. Moreover, many of today's scientific models are in fact expressed as algorithms. My conclusion is that algorithms clearly count as a primary product of science, but software doesn't. Software is a means of communication, just like papers or textbooks.

The analogy isn't perfect, however. The big difference between a paper and a piece of software is that you can feed the latter into a computer to make it do something. Software is thus a scientific tool as well as a means of communication. In fact, today's computational science gives more importance to the tool aspect than to the communication aspect. The main questions asked about scientific software are "What does it do?" and "How efficient is it?" When considering software as a means of communication, we would ask questions such as "Is it well-written, clear, elegant?", "How general is the formulation?", or "Can I use it as the basis for developing new science?". These questions are beginning to be heard, in the context of the scientific software crisis and the need for reproducible research. But they are still second thoughts. We actually accept as normal that the scientific contents of software, i.e. the models implemented by it, are understandable only to software specialists, meaning that for the majority of users, the software is just a black box. Could you imagine this for a paper? "This paper is very obscure, but the people who wrote it are very smart, so let's trust them and base our research on their conclusions." Did you ever hear such a claim? Not me.

Scientists haven't yet fully grasped the particular status of software as both an information carrier and a tool. That may be one of the few characteristics they share with lawyers. The latter make a difference between "data" (including written text), which is covered by copyright, and "software", which is covered by both copyright and licenses, and in some countries also by patents. Superficially, this makes sense, as it reflects the dual nature of software. It suffers, however, from two problems. First of all, the distinction exists only in the intention of the author, which is hard to pin down. Software is just data that can be interpreted as instructions for a computer. One could conceivably write some interpreter that turns previously generated data into software by executing it. Second, and that's a problem for science, the licensing aspect of software is much more restrictive than the copyright aspect. If you describe an algorithm informally in a paper, you have to deal only with copyright. If you communicate it in executable form, you have to worry about licensing and patents as well, even if your main intention is more precise communication.

I have written a detailed article about the problems resulting from the badly understood dual nature of scientific software, which I won't repeat here. I have also proposed a solution, the development of formal languages for expressing complex scientific models, and I am experimenting with a concrete approach to get there. I mention this here mainly to motivate my conclusion:

Q: Is software a primary product of science?

A: No. But neither is a paper or a textbook.

Q: Is software a means of communication for primary products of science?

A: Yes, but it's a bad one. We need something better.

Why bitwise reproducibility matters

Konrad Hinsen — 2015-01-07

While reading the final report of the reproducibility workshop at XSEDE14, I noticed a statement that I encounter frequently in discussions about reproducible research:

"One general consensus was that bitwise reproducibility is often an unrealistic expectation"

In the interest of clarity, let me start by pointing out that within the systematic terminology that I am trying to adopt (see this post for an explanation), I will write "bitwise replicability" from now on, as the problem falls into the technical domain (getting the same result from running the same program on the same data) rather than into the scientific one (verifying a result with similar but not identical methods and tools).

The particularity of bitwise replicability is that it is almost always brushed aside as "unrealistic", which prevents any discussion about its possible importance in computational science. The main point of this post is to explain why I consider bitwise replicability important, but first of all I need to get the label "unrealistic" out of the way.

"Unrealistic" means more or less "possible in principle but impossible given various real-life constraints", and therefore the term should always be qualified by listing the constraints that make something impossible. In the context of bitwise replicability, which always refers to floating-point computations, the main constraint is that floating-point arithmetic is incompletely specified in most of today's programming languages, and that whatever specification there is is incompletely implemented in many of today's compilers. This is a valid reason for proclaiming bitwise replicability unrealistic for a short-term research project, but it is not an insurmountable barrier on a longer time scale. All we need are tighter specifications and implementations that respect them. That's a lot of work, but not a technical challenge. We know how to do it, but we are not (yet) willing to invest the effort to make it happen.

The main reason why I consider bitwise replicability important is software testing. No matter what precise approach is used for testing, it always involves comparing results of computations, either to a known good result, or to the result of another, presumably more reliable, computation. For any application of computing other than number crunching, comparing results means testing for equality, at the bit level. The results are equal or they aren't. If they aren't, there's a reason. You have to figure out what that reason is, and fix the problem.

If you accept the idea that floating-point operations are only approximate, the notion of a computation having one and only one result disappears, and testing becomes impossible. If two computations lead to similar but slightly different results, how do you decide if this is due to a bug or to some "inevitable" fuzziness of floating-point arithmetic? The answer is that you can't. If you accept that bitwise replicability is not possible, you also accept that rigorous software testing is not possible. For some illustrations of this problem, and some interesting discussion around them, see this post on the Software Carpentry blog.
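A small Python sketch may illustrate the dilemma. It compares a reference computation of a mean with a variant containing a deliberate off-by-one bug, once with a tolerance-based test and once bit by bit. The function names, the data, and the tolerance of 1e-5 are arbitrary choices made for this illustration.

```python
import math
import struct

def mean_reference(values):
    """Straightforward mean: sum all elements, divide by their number."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def mean_buggy(values):
    """Same algorithm with an off-by-one bug: the last element is skipped."""
    total = 0.0
    for i in range(len(values) - 1):
        total += values[i]
    return total / len(values)

def same_bits(a, b):
    """True only if both floats have identical IEEE 754 binary64 encodings."""
    return struct.pack("<d", a) == struct.pack("<d", b)

data = [1.0] * 1_000_000
expected = mean_reference(data)
result = mean_buggy(data)

print(math.isclose(result, expected, rel_tol=1e-5))  # True: the bug goes unnoticed
print(same_bits(result, expected))                    # False: the bug is caught
```

The tolerance that was meant to absorb floating-point fuzziness absorbs the bug as well. Tightening it only moves the problem, because there is no principled way to pick the threshold.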

The most common counterargument is that numerical methods are only approximate, that floating-point arithmetic is approximate as well, and that the main source of error comes from these two sources. That may or may not be true in any specific situation, as it really depends on what you are computing. But my point is that this statement can only be true if you assume that the implementation of your method contains no mistakes. The amount of error introduced by a bug in the code is completely unbounded. And even if it's small for some particular test run, it can be very large elsewhere. There is not much point in worrying about the error in an approximate numerical method unless you have some confidence in your code actually implementing this method correctly.

In fact, the common counterargument discussed above conflates several sources of error, which can and should be discussed and analyzed separately. A typical numerical computation is the result of several steps, starting from a mathematical model that takes the form of algebraic or differential equations:

1. Construct a computable approximation¹ to the original equations, using techniques such as discretization of continuous quantities.

2. Replace real numbers by floating-point numbers.

3. Implement the floating-point version in software.

The errors introduced in the first step are the subject of numerical analysis, a well-established domain of applied mathematics. They are well understood for most commonly employed numerical methods. The errors introduced in the second step are rarely discussed explicitly, outside of a small circle of researchers interested in the peculiarities of floating-point arithmetic. The third step should not introduce any errors, and that should be verified by testing. But uncoupling steps 2 and 3 is possible only if our software tools guarantee bitwise replicability.

So why don't today's tools permit this? The reason is a mixture of widespread ignorance about floating-point arithmetic and the desire to get maximum performance. Both come into play in step 2, which is approximating discrete equations for real numbers by discrete equations for floating-point numbers. Most scientific programmers are unaware that this is an approximation that they should understand and control. They just type their real-number equation into a program and expect the computer to handle it somehow. Compiler writers and language specification authors take advantage of this ignorance and declare this step their business, profiting from the many optimization possibilities it offers.

The optimization opportunities come from the fact that a typical real-number equation has a large number of a priori equally plausible floating-point number approximations. Many of the identities for real numbers do not apply to floating-point numbers, for example associativity of addition and multiplication. Where the real-number equation says a+b+c, there are three floating-point approximations: (a+b)+c, a+(b+c), and (a+c)+b. For more complex equations, the number of variants quickly becomes large. The results of these variants are not the same, but which one to choose? The choice should be made after a careful analysis of the relative precision and performance of each variant. There should be tool support to help with this. But what happens in practice, most of the time, is that the choice is made by the compiler, which goes exclusively for performance. Since every compiler optimizes differently, the same program source code yields different results on different platforms. And that's why we don't have bitwise replicability.
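The three variants of a+b+c are easy to try out in Python, whose floats are IEEE 754 double-precision numbers on all common platforms. The values below are chosen to make the difference dramatic; with less extreme inputs the variants typically differ only in the last bits, which the hexadecimal representation makes visible.

```python
a, b, c = 1e16, -1e16, 1.0

# The three floating-point readings of the real-number expression a+b+c.
variants = {
    "(a+b)+c": (a + b) + c,
    "a+(b+c)": a + (b + c),
    "(a+c)+b": (a + c) + b,
}

for label, value in variants.items():
    print(label, value, value.hex())
```

Only the first variant yields 1.0, the exact real-number result. In the other two, adding 1.0 to a number of magnitude 1e16 is lost to rounding, so the final result is 0.0.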

To prevent any misunderstanding: I am not saying that production-level compiled code needs to ensure bitwise replicability across machines. It's OK to have compiler optimization options that introduce platform-specific approximations. But it should be possible to reproduce one unique result identically on all platforms. This result is then the reference against which additional "lossy" optimizations can be tested.

Footnotes:

1 I am using the term "computable approximation" somewhat vaguely here. While the original continuous-variable equations are almost always non-computable, and the numerical approximations are mostly computable, there are exceptions on both sides. The main focus of numerical analysis is not computability in the strict sense of computability theory, but "practical" computability that has the subsequent transformation to floating-point operations in mind.

Drawing conclusions from empirical science

Konrad Hinsen — 2014-12-29

A recent paper in PLOS One made some noise in my twittersphere over the Christmas days. It compares the productivity of writing scientific documents using Microsoft Word and using LaTeX, and concludes that Microsoft Word is so clearly superior that, in the interest of saving taxpayers' money, scientific publishers should abandon LaTeX to allow authors to become more productive.

The noise in my twittersphere is about the technical shortcomings of the study, whose findings are in clear contradiction to the personal experience of everyone who has used both LaTeX and Microsoft Word in preparing real-life scientific articles for publication. This is well discussed in the comments on the paper. In short, the situations explored in the study are limited to the reproduction of a given piece of text with some typical "scientific" elements such as tables or formulas, but without the complexity of real-life documents: references, citations, revisions, collaborative editing, etc.

The topic of this post is a more fundamental problem illustrated by the study cited above, and which is shared by a large number of scientific explorations of much more important subjects, in particular concerning health and medicine. It is the problem of drawing practical conclusions from the results of a scientific study, such as the conclusion cited above that abandoning LaTeX would lead to significant savings in the field of scientific publishing. In the following, I will concentrate on this issue and leave aside everything else: let's assume for a few minutes that published scientific studies are 100% reliable and described clearly enough that no misunderstandings or erroneous interpretations ever occur.

The feature that the Word vs. LaTeX study shares with much of modern research is that it is purely empirical. It starts from the question if science writers are more productive using Word or using LaTeX, taking into account a few obvious parameters such as prior experience with one or the other system. To answer that question, a specific experiment is designed, performed, and analyzed. Importantly, there is no underlying model that is used to interpret the results, which is what makes the study purely empirical.

Empirical studies are characteristic of relatively young domains of scientific exploration. It's what every new field starts out with: the search for systematic relations between observable facts and quantities. As our understanding of some aspect of nature improves, we move on to the next level of scientific inquiry: the construction of models. A model makes assumptions about the mechanisms underlying the observed behavior, and allows the prediction of results that some not-yet-performed experiment should produce. The introduction of models is an enormous boost to the power and efficiency of scientific research. First of all, predictions can be tested, and therefore the models can be tested. Of course, an isolated hypothesis ("Word makes scientists more productive than LaTeX") can also be tested, but a model produces a whole family of related hypotheses that can be tested as a whole. In particular, one can search for corner cases that may be untypical from a real-world point of view, but provide a particularly precise way to test a model. Second, a model allows scientists to develop an intuitive understanding of the phenomena they are looking at, which again makes their work more efficient and more reliable. But perhaps most importantly, a model that has been exposed to several rounds of serious testing comes with a list of scenarios in which it works or doesn't work, which is a very important element in generating trust in its predictions.

As an example of a successful model, consider Newtonian mechanics as taught in high-school physics classes. It has been around for a few centuries, and its strengths and limitations are well known. Contrary to what people believed initially, it is not universally true. It breaks down for objects moving at extremely high speed, and for objects of atomic size. But it works very well for many practically relevant situations. Thanks to this and other well-tested models, engineers and architects can design engines and buildings that work as expected.

In contrast, purely empirical science provides only provisional answers to the questions asked, because it is impossible to know, or even test, that all relevant aspects of the situation have been taken into account. In the Word vs. LaTeX study, prior knowledge of either system was taken into account as a parameter, but many other factors weren't. It is conceivable, for example, that a person's native language may make them "better tuned" to one or the other system. Or their work experience, or their education. And why not genetic factors or dietary habits - this sounds far-fetched, but it can't be excluded. As long as there is no model explaining where productivity differences come from, it is not even clear what one would have to study in order to improve our understanding of the situation.

This uncertainty stemming from the existence of many unexplored potential factors makes it very risky to draw practical conclusions from purely empirical studies, no matter how well they were designed and executed. And this is a very real problem in many aspects of today's life. Suppose you are determined to adopt the "healthiest" dietary regime possible, and turn to the scientific literature for guidance. You will find a bewildering collection of partially contradicting findings. Does eating eggs expose you to a higher risk of cardiovascular diseases? Do oranges protect you against the flu? You will find studies that claim to provide the answers to such questions, but they are purely empirical and based on a small number of observations. They may even be based on experiments on mice that were extrapolated to humans. And they definitely have not explored all imaginable aspects of the question. What if vitamin C is beneficial to everyone except people with some rare blood group? What if a specific gene variant decides how your body reacts to high sugar intake? Most probably no one has ever looked into these possibilities. Not to mention the much more fundamental question of whether a "healthiest" diet exists at all. Perhaps the best you can do is choose between a higher risk of a stroke and a higher risk of cancer.

To end with some practical advice: the next time you see some recommendation made on a "scientific basis", check what that basis is. If it's a single recent study, it's safe to assume that the recommendation is premature. But even if it's a larger body of scientific evidence, check if there is a model behind it, and if it has been tested. If not, be prepared to get a contradictory recommendation in a few years.

The state of NumPy

Konrad Hinsen — 2014-09-12

The release of NumPy 1.9 a few days ago was a bit of a revelation for me. For the first time in the combined history of NumPy and its predecessor Numeric, a new release broke my own code so severely that I don't see any obvious way to fix it, given the limited means I can dedicate to software maintenance. And that makes me wonder for which scientific uses today's Python ecosystem can still be recommended, since the lack of means for code maintenance is a chronic and endemic problem in science.

I'll start with a historical review, for which I am particularly well placed as one of the oldtimers in the community: I was a founding member of the Matrix-SIG, a small group of scientists who in 1995 set out to use the still young Python language for computational science, starting with the design and implementation of a module called Numeric. Back then Python was a minority language in a field dominated by Fortran. The number of users started to grow seriously from 2000, to the point of now being a well-recognized and respected community that spans all domains of scientific research and holds several conferences per year across the globe. The combination of technological change and the needs of new users has caused regular changes in the code base, which has grown as significantly as the user base: the first releases were small packages written and maintained by a single person (Jim Hugunin, who later became famous for Jython and IronPython), whereas today's NumPy is a complex beast maintained by a team.

My oldest published Python packages, ScientificPython and MMTK, go back to 1997 and are still widely used. They underwent a single major code reorganization, from module collections to packages when Python 1.5 introduced the package system. Other than that, most of the changes to the code base were implementations of new features and the inevitable bug fixes. The two main dependencies of my code, NumPy and Python itself, did sometimes introduce incompatible changes (by design or as consequences of bug fixes) that required changes to my own code base, but they were surprisingly minor and never required more than about a day of work.

However, I now realize that I have simply been lucky. While Python and its standard library have indeed been very stable (not counting the transition to Python 3), NumPy has introduced incompatible changes with almost every new version over the last years. None of them ever touched functionalities that I was using, so I barely noticed them when looking at each new version's release notes. That changed with release 1.9, which removes the compatibility layer with the old Numeric package, on which all of my code relies because of its early origins.

Backwards-incompatible changes are of course nothing exceptional in the computing world. User needs change, new ideas permit improvements, but existing APIs often prevent a clean or efficient implementation of new features or fundamental code redesigns. This is particularly true for APIs that are not the result of careful design, but of organic growth, which is the case for almost all scientific software. As a result, there is always a tension between improving a piece of software and keeping it compatible with code that depends on it. Several strategies have emerged to deal with this tension, depending on the priorities of each community. The point I want to make in this post is that NumPy has made a bad choice, for several reasons.

The NumPy attitude can be summarized as "introduce incompatible changes slowly but continuously". Every change goes through several stages. First, the intention of an upcoming change is announced. Next, deprecation warnings are added in the code, which are printed when code relying on the soon-to-disappear feature is executed. Finally, the change becomes effective. Sometimes changes are made in several steps to ease the transition. A good example from the 1.9 release notes is this:

In NumPy 1.8, the diagonal and diag functions returned readonly copies, in NumPy 1.9 they return readonly views, and in 1.10 they will return writeable views.

The idea behind this approach to change is that client code that depends on NumPy is expected to be adapted continuously. The early warnings and the slow but regular rhythm of change help developers of client code to keep up with NumPy.
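
The deprecation stage of this cycle can be sketched with Python's standard warnings module. The function diagonal_legacy below is hypothetical and only stands in for a library function scheduled to change. It is not NumPy code. Note that Python hides many deprecation warnings by default, so client code has to opt in, as the second function does, to find out about upcoming changes.

```python
import warnings


def diagonal_legacy(matrix):
    """Hypothetical library function whose behavior is scheduled to change."""
    warnings.warn(
        "diagonal_legacy will return a view instead of a copy in a future release",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this function
    )
    return [matrix[i][i] for i in range(len(matrix))]


# Record the warnings emitted while client code runs.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = diagonal_legacy([[1, 2], [3, 4]])

messages = [str(w.message) for w in caught
            if issubclass(w.category, DeprecationWarning)]


def fails_in_strict_mode():
    """Turn deprecation warnings into errors, e.g. inside a test suite."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            diagonal_legacy([[1]])
        except DeprecationWarning:
            return True
    return False
```

Running a test suite in such a strict mode is one way for maintained client code to keep up. It does nothing for code that nobody runs between releases.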

The main problem with this attitude is that it works only under the assumption that client code is actively maintained. In scientific computing, that's not a reasonable assumption to make. Anyone who has followed the discussions about the scientific software crisis and the lack of reproducibility in computational science should be well aware of this point that is frequently made. Much if not most scientific code is written by individuals or small teams for a specific study and then modified only as much as strictly required. One step up on the maintenance ladder, there is scientific code that is published and maintained by computational scientists as a side activity, without any significant means attributed to software development, usually because the work is not sufficiently valued by funding agencies. This is the category that my own libraries belong to. Of course the most visible software packages are those that are actively maintained by a sufficiently strong community, but I doubt they are representative of computational science as a whole.

A secondary problem with the "slow continuous change" philosophy is that client code becomes hard to read and understand. If you get a Python script, say as a reviewer for a submitted article, and see "import numpy", you don't know which version of numpy the authors had in mind. If that script calls array.diagonal() and modifies the return value, does it expect to modify a copy or a view? The result is very different, but there is no way to tell. It is possible, even quite probable, that the code would execute fine with both NumPy 1.8 and the upcoming NumPy 1.10, but yield different results.
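
To make the copy-versus-view difference concrete, here is a minimal sketch that does not depend on what diagonal() returns in any particular NumPy release. It builds the two cases explicitly: a copy of the diagonal, and a strided view of it (which assumes a C-contiguous square array). The function names are made up for this illustration.

```python
import numpy as np


def bump_diagonal_copy(a):
    """Add 1 to an explicit copy of the diagonal; `a` is left untouched."""
    d = a.diagonal().copy()
    d += 1
    return d


def bump_diagonal_view(a):
    """Add 1 to a view of the diagonal; `a` itself is modified.

    Assumes `a` is a C-contiguous square array, so that reshape(-1)
    returns a view and every (n+1)-th element lies on the diagonal.
    """
    n = a.shape[0]
    d = a.reshape(-1)[::n + 1]
    d += 1
    return d


a = np.zeros((3, 3))
bump_diagonal_copy(a)
trace_after_copy = a.trace()   # 0.0, the original array is unchanged

bump_diagonal_view(a)
trace_after_view = a.trace()   # 3.0, the original array was modified
```

The two functions look almost identical from the caller's side, which is exactly why a reader of the script cannot tell which behavior the authors relied on.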

Given the importance of NumPy in the scientific Python ecosystem - the majority of scientific libraries and applications depends on it -, I consider its lack of stability alarming. I would much prefer the NumPy developers to adopt the attitude to change taken by the Python language itself: accumulate ideas for incompatible changes, and apply them in a new version that is clearly labelled and announced as incompatible. Everyone in the Python community knows that there are important differences between Python 2 and Python 3. There's a good chance that a scientist publishing a Python script will clearly say if it's for Python 2 or Python 3, but even if not, the answer is often evident from looking at the code, because at least some of the many differences will be visible.
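
As long as NumPy versions are not labelled in this way, a script can at least state the version it was written for. The following is only a sketch: the expected version (1, 8) is an assumption chosen for illustration, and the helper names are made up.

```python
import numpy as np

# Assumption for illustration: the script was developed against NumPy 1.8.
EXPECTED_NUMPY = (1, 8)


def major_minor(version):
    """Extract (major, minor) from a version string such as '1.9.2'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_numpy_version(expected=EXPECTED_NUMPY):
    """Return (matches, found) for the NumPy version that is installed."""
    found = major_minor(np.__version__)
    return found == expected, found


matches, found = check_numpy_version()
if not matches:
    print("Written for NumPy %d.%d, running with %d.%d: results may differ"
          % (EXPECTED_NUMPY + found))
```

Such a check does not make the code work with another version, but it tells a reader or reviewer which behavior the authors had in mind.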

As for my initial question for which scientific uses today's Python ecosystem can still be recommended, I hesitate to provide an answer. Today's scientific Python ecosystem is not stable enough for use in small-scale science, in my opinion, although it remains an excellent choice for big communities that can somehow find the resources to maintain their code. What makes me hesitate to recommend not using Python is that there is no better alternative. The only widely used scientific programming language that can be considered stable is Fortran, but anyone who has used Python is unlikely to be willing to switch to an environment with tedious edit-compile-run cycles.

One possible solution would be a long-term-support version of the core libraries of the Python ecosystem, maintained without any functional change by a separate development team. But that development team has to be created and funded. Any volunteers?

Reproducibility, replicability, and the two layers of computational science

Konrad Hinsen — 2014-08-27

The importance of reproducibility in computational science is being more and more recognized, which I think is a good sign. However, I also notice a lot of confusion about what reproducibility means exactly, and also confusion about the difference (if any) between reproducibility and replicability. I don't see a consensus yet about the exact meaning of these terms, but I would like to give my own definitions and justify them by putting them into the general context of computational science.

I'll start with the concept of reproducibility as it was used in science long before computers even existed. It refers to the reproducibility of the conclusions of a scientific study. These conclusions can take very different forms depending on the question that was being explored. It can be a simple "yes" or "no", e.g. in answering questions such as "Is the gravitational force acting on this stone the same everywhere on the Earth's surface?" or "Does ligand A bind more strongly to protein X than ligand B?" It can also be a number, as in "What is the lattice energy of NaCl?", or a mathematical function, as in "How does a spring's restoring force vary with elongation?" Any such result should come with an estimation of its precision, such as an error bar on numbers, or a reliability estimate for a yes/no answer. Reproducing a scientific conclusion means finding a "close enough" answer by performing "similar" experiments and analyses. As the terms "close enough" and "similar" show, reproducibility involves human judgement, which may well evolve over time. Reproducibility is thus not an absolute feature of a specific result, but the evaluation of a result in the context of the current state of knowledge and technology in a scientific domain. Every attempt to reproduce a given result independently (different people, tools, methods, …) augments scientific knowledge: If the reproduction leads to a "close enough" result, it provides information about the precision with which the results can be obtained, and if it doesn't, it points to some previously unrecognized crucial difference between the two experiments, which can then be explored.

Replication refers to something much more specific: repeating the exact steps in an experiment using the same (or equivalent) equipment, and comparing the outcomes. Replication is part of testing an experimental setup, or a form of quality assurance. If I measure the same quantity ten times using the same equipment and experimental samples, and get ten slightly different values, then I can use these numbers to estimate the precision of my equipment. If that precision is not sufficient for the purposes of my planned scientific study, then the equipment is not suitable.
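
The ten-measurements example can be written down in a few lines. The numbers below are made up for illustration and are not data from any real instrument. The sample standard deviation serves as a simple estimate of the precision of the equipment.

```python
import statistics

# Made-up values standing in for ten repeated measurements of one quantity.
measurements = [9.79, 9.82, 9.81, 9.80, 9.83, 9.78, 9.81, 9.80, 9.82, 9.79]


def precision_estimate(values):
    """Return the mean and the sample standard deviation of the values."""
    return statistics.mean(values), statistics.stdev(values)


def is_suitable(values, required_precision):
    """Equipment is suitable if its scatter is below the required precision."""
    _, spread = precision_estimate(values)
    return spread <= required_precision


mean, spread = precision_estimate(measurements)
```

Whether the spread is acceptable depends on the planned study, which is why the required precision is a parameter here and not a fixed constant.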

It is useful to describe the process of doing research by a two-layer model. The fundamental layer is the technology layer: equipment and procedures that are well understood and whose precision is known from many replication attempts. On top of this, there is the research layer: the well-understood equipment is used in order to obtain new scientific information and draw conclusions from it. Any scientific project aims at improving one or the other layer, but not both at the same time. When you want to get new scientific knowledge, you use trusted equipment and procedures. When you want to improve the equipment or the procedures, you do so by doing test measurements on well-known systems. Reproducibility is a concept of the research layer, replicability belongs to the technology layer.

All this carries over identically to computational science, in principle. There is the technology layer, consisting of computers and the software that runs on them, and the research layer, which uses this technology to explore theoretical models or to interpret experimental data. Replicability belongs to the technology level. It increases trust in a computation and thus its components (hardware, software, overall workflow, provenance tracking, …). If a computation cannot be replicated, then this points to some kind of problem:

1. different input data that was not recorded in the workflow (interactive user input, a random number stream initialized from the current time, …)

2. a bug in the software (uninitialized variables, compiler bugs, …)

3. a fault in the hardware (an unreliable memory chip, a design flaw in the processor, …)

4. an ambiguous specification of the result of the computation
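
The first cause in this list is the easiest one to remove. As a minimal sketch, a computation that uses random numbers can take its seed as an explicit, recorded input instead of deriving it from the current time. The function name and the seed value are arbitrary choices for this illustration.

```python
import random


def simulate(seed, n=5):
    """Toy 'simulation': n random numbers drawn from an explicitly seeded stream."""
    rng = random.Random(seed)  # private generator, no hidden global state
    return [rng.random() for _ in range(n)]


# The seed is part of the recorded input, so the run can be replicated.
recorded_seed = 20140827
first_run = simulate(recorded_seed)
second_run = simulate(recorded_seed)
```

With the seed recorded in the workflow, a failure to replicate can no longer be attributed to unrecorded input of this kind.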

Ideally, the non-replicability should be eliminated, but at the very least its cause should be understood. This turns out to be very difficult in practice, in today's computing environments, essentially because case 4 is frequent and hard to avoid (today's popular programming languages are ambiguous), and because case 4 makes it impossible to identify cases 2 and 3 with certainty. I see this as a symptom of the immaturity of today's computing environments, which the computational science community should aim to improve on. The technology for removing case 4 exists. The keyword is "formal methods", and there are first attempts to apply them to scientific computing, but this remains an exotic approach for now.
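
One common source of such ambiguity, given here as an illustration and not as a complete account, is floating-point arithmetic combined with an unspecified order of evaluation. Python itself fixes the order, so the sketch below makes the two orders explicit. In languages or libraries that are free to reorder a sum (for example in a parallel reduction), both results are legitimate outcomes of the same source code.

```python
# Floating-point addition is not associative: the grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6

results_differ = left_to_right != right_to_left


def sum_in_order(values, order):
    """Sum the values in the order given by a list of indices."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values, [0, 1, 2])
backward = sum_in_order(values, [2, 1, 0])
```

If the specification of a computation says only "the sum of these numbers", it does not determine which of these values is the correct result.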

As in experimental science, reproducibility belongs to the research layer and cannot be guaranteed or verified by any technology. In fact, the "reproducible research" movement is really about replicability - which is perhaps one reason for the above-mentioned confusion.

There is at the moment significant disagreement about the importance of replicability. At one end of the spectrum, there is for example Ian Gent's recomputation manifesto, which stresses the importance of replicability (which in the context of computational science he calls recomputability) because building on past work is possible only if it can be replicated as a first step. At the other end, Chris Drummond argues that replicability is "not worth having" because it doesn't contribute much to the real goal, which is reproducibility. It is worth reading both of these papers, because they both do a very good job at explaining their arguments. There is actually no contradiction between the two lines of argument; the different conclusions are due to different criteria being applied: Chris Drummond sees replicability as valuable only if it improves reproducibility (which indeed it doesn't), whereas Ian Gent sees value in it for a completely different reason: it makes future research more efficient. Neither one mentions the main point in favor of replicability that I have made above: that replicability is a form of quality assurance and thus increases trust in published results.

It is probably a coincidence that both of the papers cited above use the term "computational experiment", which I think should best be avoided in this context. In the natural sciences, the term "experiment" traditionally refers to constructing a setup to observe nature, which makes experiments the ultimate source of truth in science. Computations do not have this status at all: they are applications of theoretical models, which are always imperfect. In fact, there is an interesting duality between the two: experiments are imperfect observations of the ultimate truth, whereas computations are, in the absence of buggy or ambiguous software, perfect observations of the consequences of imperfect models. Using the same term for these two concepts is a source of confusion, as I have pointed out earlier.

This fundamental difference between experiments and computations also means that replicability has a different status in experimental and computational science. When doing imperfect observations of nature, evaluating replicability is one aspect of evaluating the imperfection of the observation. Perfect observation is impossible, both due to technological limitations and for fundamental reasons (any observation modifies what is being observed). On the other hand, when computing the consequences of imperfect models, replicability does not measure the imperfections of the model, but the imperfections of the computation, which can theoretically be eliminated.

The main source of imperfections in computations is the complexity of computer software (considering the whole software stack, from the operating system to the scientific software). At this time, it is not clear if we will ever succeed in taming this complexity. Our current digital computers are chaotic systems, in which even the tiniest change (flipping a bit in memory, or replacing a single character in a program source code file) can change the result of a computation beyond any bounds. Chaotic behavior is clearly an undesirable feature in any scientific equipment (I can't think of any experimental apparatus suffering from it), but for computation we currently have no other choice. This makes quality assurance techniques, including replicability but also more standard software engineering practices such as unit testing, all the more important if we want computational results to be trustworthy.
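
The effect of a single flipped bit is easy to demonstrate on one floating-point number. The sketch below uses only the standard struct module. The helper name flip_bit is made up for this illustration.

```python
import struct


def flip_bit(x, bit):
    """Return the float obtained by flipping one bit of the 64-bit encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


tiny_change = flip_bit(1.0, 0)    # lowest mantissa bit: 1.0000000000000002
huge_change = flip_bit(1.0, 62)   # highest exponent bit: inf
```

Depending on which bit is hit, the same kind of fault is either invisible in practice or turns a perfectly ordinary number into infinity.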

A first experience with Open Access publishing

Konrad Hinsen — 2014-07-04

Most scientists have found out by now that a lot has been going wrong with scientific publishing over the years. In many fields, scientific journals are no longer fulfilling what used to be their primary role: disseminating and archiving the results of scientific studies. One of the new approaches that were developed to fix the publishing system is Open Access: the principle that published articles should be freely accessible to everyone (under conditions that vary according to which "dialect" of Open Access is used) and that the cost of the publishing procedure should be paid in some other way than subscription fees. The universe of Open Access publishing has become quite complex in itself. For those who want to know more about it, a good starting point is this book, whose electronic form is, of course, Open Access.

While I have been following the developments in Open Access publishing for a few years, I had never published any Open Access article myself. I work at the borderline of theoretical physics and biophysics, which sound like closely related fields but nevertheless have very different publishing traditions. In theoretical physics, the most well-known journals are produced by non-commercial publishers, in particular scientific societies. Their prices have not exploded, nor do these publishers put pressure on libraries to subscribe to more than they want to. There is also a strong tradition of making preprints freely available, e.g. on arXiv.org. This combined model continues to work well for theoretical physics, meaning that there is little incentive to look at Open Access publishing models. However, as soon as the "bio" prefix comes into play, the main journals are commercial. Some offer a per-article Open Access option, in exchange for the authors paying a few hundred to a few thousand dollars per article. There are also pure Open Access journals covering this field (e.g. PLOS Computational Biology), whose price range is similar. On the scale of the working budget of a theoretician working in France, these publishing fees are way too high, which is why I never considered Open Access for my "applied" research.

The fact that I have recently published my first Open Access article, in the pure Open Access journal F1000Research, is almost a bit accidental. The topic of the article is the role of computation in science, with a particular emphasis on the necessity to keep scientific models distinct from software tools. I had planned to write such an article for a while, but it didn't really fit into any of the journals I knew. The subject is computational science, but more its philosophical foundations than the technicalities that journals on computational science specialize in. The audience is scientists applying computations, which is a much larger group than the methodology specialists who subscribe to and read computational science journals. Even if some computational science journal might have accepted my article, it wouldn't have reached most of its intended audience. A journal on the philosophy of science would have been worse, as almost no practitioner of computational science looks at this literature. Since there was no clear venue where the intended audience would have a chance of finding my article, the best option was some Open Access journal where at least the article would be accessible to everyone. Publicity through social networks could then help potentially interested readers discover it. Two obstacles remained: finding an Open Access journal with a suitable subject domain, and getting around the money problem.

At the January 2014 Community Call of the Mozilla Science Lab, I learned that F1000Research was starting a new section on "science communication", and was waiving article processing charges for that section in 2014. This was confirmed shortly thereafter on the journal's blog. Science communication was in fact a very good label for what I wanted to write about. And F1000Research looked like an interesting journal to test because its attitude to openness goes beyond Open Access: the review process is open as well, meaning that reviews are published with the reviewers' names, and get their own DOI for reference. So there was my opportunity.

For those new to the Open Access world, I will give a quick overview of the submission and publishing process. Everything is handled online, through the journal's Web site and by e-mail. Since I very much prefer writing LaTeX to using Word, I chose the option of submitting through the writeLaTeX service. The idea of writeLaTeX is that you edit your article using their Web tools, but nothing stops you from downloading the template provided by F1000Research, writing locally, and uploading the final text in the end. I thus wrote my article using my preferred tool (Emacs) and on my laptop even when I didn't have a network connection. Once you submit your article, it is revised by the editorial staff (concerning language, style, and layout, they don't touch the contents). Once you approve the revision, the article is published almost instantaneously on the journal Web site. You are then asked to suggest reviewers, and the journal asks some of them (I don't know how they make their choice) to review the article. Reviews are published as they come in, and you get an e-mail alert. In addition to providing detailed comments, reviewers judge the article as "approved", "approved with reservations" or "not approved". As soon as two reviewers "approve", the article status changes to "indexed", meaning that it gets a DOI and it is listed in databases such as PubMed or Scopus. Authors can reply to reviewers (again in public), and they are encouraged to revise their article based on the reviewers' suggestions. All versions of an article remain accessible indefinitely on the journal's Web site, so the full history of the article is preserved.

Overall I would judge my experience with F1000Research as very positive. The editorial staff replies rapidly and gets problems solved (in my case, technical problems with the Web site). Open review is much more reasonable than the traditional secret peer review process. No more guessing who the reviewers are in order to please them with citations with the hope of getting your revision accepted rapidly. No more lengthy letters to the editor trying to explain diplomatically that the reviewer is incompetent. With open reviewing, authors and reviewers act as equals, as it should always have been.

The only criticism I have concerns a technical point that I hope will be improved in the future. Even if you submit your original article through writeLaTeX, you have to prepare revisions using Microsoft Word: you download a Word file for the initially published version, activate "track changes" mode, make your changes, and send the file back. For someone who doesn't have Microsoft Word, or is not familiar with its operation, this is an enormous barrier. A journal that encourages authors to revise their articles should also allow them to do so using tools that they have and are familiar with.

Will I publish in F1000Research again? I don't expect to do so in the near future. With the exception of the science communication section, F1000Research is heavily oriented towards the life sciences, so most of my research doesn't fit in. And then there is the money problem. Without the waiver mentioned above, I'd have had to pay 500 USD for my manuscript classified as an "opinion article". Regular research articles are twice as much. Compared to a theoretician's budget, which needs to cover mostly travel, these amounts are significant. Moreover, in France's heavily bureaucratized public research, every euro comes with strings attached that define when, where, and on what you are allowed to spend it. Project-specific research grants often do allow paying publication costs, but research outside of such projects, which is still common in the theoretical sciences, doesn't have any specific budget to turn to. The idea of the Open Access movement is to re-orient the money currently spent on subscriptions towards paying publishing costs directly, but such decisions are made on a political and administrative level very remote from my daily work. Until they happen, it is rather unlikely that I will publish in Open Access mode again.

Exploring Racket

Konrad Hinsen — 2014-05-10

Over the last few months I have been exploring the Racket language for its potential as a language for computational science, and it's time to summarize my first impressions.

Why Racket?

There are essentially two reasons for learning a programming language: (1) getting acquainted with a new tool that promises to get some job done better than with other tools, and (2) learning about other approaches to computing and programming. My interest in Racket was driven by a combination of these two aspects. My background is in computational science (physics, chemistry, and structural biology), so I use computation extensively in my work. Like most computational scientists of my generation, I started working in Fortran, but quickly found this unsatisfactory. Looking for a better way to do computational science, I discovered Python in 1994 and joined the Matrix-SIG that developed what is now known as NumPy. Since then, Python has become my main programming language, and the ecosystem for scientific computing in Python has flourished to a degree unimaginable twenty years ago. For doing computational science, Python is one of the top choices today.

However, we shouldn't forget that we are still living in the stone age of computational science. Fortran was the Paleolithic, Python is the Neolithic, but we have to move on. I am convinced that computing will become as much an integral part of doing science as mathematics, but we are not there yet. One important aspect has not evolved since the beginnings of scientific computing in the 1950s: the work of a computational scientist is dominated by the technicalities of computing, rather than by the scientific concerns. We write, debug, optimize, and extend software, port it to new machines and operating systems, install messy software stacks, convert file formats, etc. These technical aspects, which are mostly unrelated to doing science, take so much of our time and attention that we think less and less about why we do a specific computation, how it fits into more general theoretical frameworks, how we can verify its soundness, and how we can improve the scientific models that underlie our computations. Compare this to how theoreticians in a field like physics or chemistry use mathematics: they have acquired most of their knowledge and expertise in mathematics during their studies, and spend much more time applying mathematics to do science than worrying about the intrinsic problems of mathematics. Computing should one day have the same role. For a more detailed description of what I am aiming at, see my recent article.

This lengthy foreword was necessary to explain what I am looking for in Racket: not so much another language for doing today's computational science (Python is a better choice for that, if only for its well-developed ecosystem) as an environment for developing tomorrow's computational science. The Racket Web site opens with the title "A programmable programming language", and that is exactly the aspect of Racket that I am most interested in.

There are two more features of Racket that I found particularly attractive. First, it is one of the few languages that have good support for immutable data structures without being extremist about it. Mutable state is the most important cause of bugs in my experience (see my article on "Managing State" for details), and I fully agree with Clojure's Rich Hickey who says that "immutability is the right default". Racket has all the basic data structures in a mutable and an immutable variant, which provides a nice environment to try "going immutable" in practice. Second, there is a statically typed dialect called Typed Racket which promises a straightforward transition from fast prototyping in plain Racket to type-safe and more efficient production code in Typed Racket. I haven't looked at this yet, so I won't say any more about it.
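
The kind of bug that mutable state invites can be shown in a few lines. The sketch is in Python, not Racket, with a list playing the role of the mutable variant and a tuple that of the immutable one. The function names are made up for this illustration.

```python
def normalize_in_place(values):
    """Scale the values so that they sum to 1, modifying the caller's list."""
    total = sum(values)
    for i in range(len(values)):
        values[i] /= total
    return values


def normalized(values):
    """Return a new tuple of scaled values; the input is never modified."""
    total = sum(values)
    return tuple(v / total for v in values)


weights = [1.0, 3.0]
normalize_in_place(weights)
# weights is now [0.25, 0.75]: every other user of this list sees the change.

frozen_weights = (1.0, 3.0)
scaled = normalized(frozen_weights)
# frozen_weights is still (1.0, 3.0); only `scaled` holds the new values.


def try_to_modify(t):
    """Attempting to assign to a tuple element raises TypeError."""
    try:
        t[0] = 0.0
    except TypeError:
        return False
    return True
```

With immutable data, a function cannot change its arguments behind the caller's back, which removes a whole class of surprises at the price of creating new objects.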

Racket characteristics

For readers unfamiliar with Racket, I'll give a quick overview of the language. It's part of the Lisp family, more precisely a derivative of Scheme. In fact, Racket was formerly known as "PLT Scheme", but its authors decided that it had diverged sufficiently from Scheme to give it a different name. People familiar with Scheme will still recognize much of the language, but some changes are quite profound, such as the fact that lists are immutable. There are also many extensions not found in standard Scheme implementations.

The hallmark of the Lisp family is that programs are defined in terms of data structures rather than in terms of a text-based syntax. The most visible consequence is a rather peculiar visual aspect, which is dominated by parentheses. The more profound implication, and in fact the motivation for this uncommon choice, is the equivalence of code and data. Program execution in Lisp is nothing but interpretation of a data structure. It is possible, and common practice, to construct data structures programmatically and then evaluate them. The most frequent use of this characteristic is writing macros (which can be seen as code preprocessors) to effectively extend the language with new features. In that sense, all members of the Lisp family are "programmable programming languages".
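The idea of building a program as a data structure and then evaluating it can be sketched outside the Lisp family as well. The following is only a rough Python analogue, using the standard `ast` module; in Racket the same thing would be done far more directly with quoted lists. The function name `make_sum_expression` is made up for this illustration.

```python
import ast

def make_sum_expression(numbers):
    """Build the expression n0 + n1 + ... as a syntax tree, i.e. as plain data."""
    node = ast.Constant(value=numbers[0])
    for n in numbers[1:]:
        # Each step wraps the tree built so far in a new addition node.
        node = ast.BinOp(left=node, op=ast.Add(), right=ast.Constant(value=n))
    # ast.Expression is the root node type that compile() expects in "eval" mode.
    tree = ast.Expression(body=node)
    # compile() requires line and column information on every node.
    return ast.fix_missing_locations(tree)

tree = make_sum_expression([1, 2, 3])      # the program, as a data structure
code = compile(tree, filename="<generated>", mode="eval")
result = eval(code)                        # the program, executed
print(ast.dump(tree.body))
print(result)
```

The tree can be inspected and transformed like any other data before it is compiled, which is the essence of what a macro does. Python makes this considerably more verbose than a Lisp does, because its code and its data do not share a single notation.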

However, Racket takes this approach to another level. Whereas traditional Lisp macros are small code preprocessors, Racket's macro system feels more like a programming API for the compiler. In fact, much of Racket is implemented in terms of Racket macros. Racket also provides a way to define a complete new language in terms of existing bits and pieces (see the paper "Languages as libraries" for an in-depth discussion of this philosophy). Racket can be seen as a construction kit for languages that are by design interoperable, making it feasible to define highly specific languages for some application domain and yet use it in combination with a general-purpose language.

Another particularity of Racket is its origin: it is developed by a network of academic research groups, who use it as a tool for their own research (much of which is related to programming languages), and as a medium for teaching. However, contrary to most programming languages developed in the academic world, Racket is developed for use in the "real world" as well. There are documentation, learning aids, and development tools, and the members of the core development team are always ready to answer questions on the Racket user mailing list. This mixed academic-application strategy is of interest for both sides: researchers get feedback on the utility of their ideas and developments, and application programmers get quick access to new technology. I am aware of only three other languages developed in a similar context: OCaml, Haskell, and Scala.

Learning and using Racket

A first look at the Racket Guide (an extended tutorial) and the Racket Reference shows that Racket is not a small language: there is a bewildering variety of data types, control structures, abstraction techniques, program structuration methods, and so on. Racket is a very comprehensive language that allows both fine-tuning and large-scale composition. It definitely doesn't fit into the popular "low-level" vs. "high-level" dichotomy. For the experienced programmer, this is good news: whatever technique you know to be good for the task at hand is probably supported by Racket. For students of software development, it's probably easy to get lost. Racket comes with several subsets developed for pedagogical purposes, which are used in courses and textbooks, but I didn't look at those. What I describe here is the "standard" Racket language.

Racket comes with its own development environment called "DrRacket". It looks quite powerful, but I won't say more about it because I haven't used it much. I use too many languages to be interested in any language-specific environment. Instead, I use Emacs for everything, with Geiser for Racket development.

The documentation is complete, precise, and well presented, including a pleasant visual layout. But it is not always an easy read. Be prepared to read through some background material before understanding all the details in the reference documentation of some function you are interested in. It can be frustrating sometimes, but I have never been disappointed: you do find everything you need to know if you just keep on following links.

My personal project for learning Racket is an implementation of the MOSAIC data model for molecular simulations. While my implementation is not yet complete (it supports only two kinds of data items, universes and configurations), it has data structure definitions, I/O to and from XML, data validation code, and contains a test suite for everything. It uses some advanced Racket features such as generators and interfaces, not so much out of necessity but because I wanted to play with them.

Overall I had few surprises during my first Racket project. As I already said, finding what you need in the documentation takes a lot of time initially, mostly because there is so much to look at. But once you find the construct you are looking for, it does what you expect and often more. I remember only one ongoing source of frustration: the multitude of specialized data structures, which force you to make choices you often don't really care about, and to insert conversion functions when function A returns a data structure that isn't exactly the one that function B expects to get. As an illustration, consider the Racket equivalent of Python dictionaries, hash tables. They come in a mutable and an immutable variant, each of which can use one of three different equality tests. It's certainly nice to have that flexibility when you need it, but when you don't, you don't want to have to read about all those details either.

As for Racket's warts, I ran into two of them. First, the worst supported data structure in Racket must be the immutable vector, which is so frustrating to work with (every operation on an immutable vector returns a mutable vector, which has to be manually converted back to an immutable vector) that I ended up switching to lists instead, which are immutable by default. Second, the distinction (and obligatory conversion) between lists, streams, generators and a somewhat unclear sequence abstraction makes you long for the simplicity of a single sequence interface as found in Python or Clojure. In Racket, you can decompose a list into head and tail using first and rest. The same operations on a stream are stream-first and stream-rest. The sequence abstraction, which covers both lists and streams and more, has sequence-tail for the tail, but to the best of my knowledge nothing for getting the first element, other than the somewhat heavy (for/first ([element sequence]) element).
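For comparison, here is a minimal sketch of what a single sequence interface looks like in Python. The helper names `first` and `rest` are chosen here to mirror the Racket functions; they are not part of Python's standard library. Both work unchanged on lists, ranges, and generators because they rely only on the iterator protocol.

```python
from itertools import islice

def first(iterable):
    """Return the first element of any iterable (raises StopIteration if empty)."""
    return next(iter(iterable))

def rest(iterable):
    """Return a lazy iterator over everything after the first element."""
    return islice(iterable, 1, None)

def squares():
    """An infinite generator, playing the role of a lazy stream."""
    n = 0
    while True:
        yield n * n
        n += 1

a_list = [10, 20, 30]
print(first(a_list), list(rest(a_list)))
# A fresh generator is created for each call, because iterating consumes it.
print(first(squares()), list(islice(rest(squares()), 3)))
```

The uniformity has a price: a generator is consumed as it is traversed, so calling `first` and then `rest` on the same generator object skips an element. Racket's distinct stream type avoids that particular trap.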

The macro requirements of my first project were modest, not exceeding what any competent Lisp programmer would easily do using defmacro (which, BTW, exists in Racket for compatibility even though its use is discouraged). Nevertheless, in the spirit of my exploration, I tried all three levels of Racket's hygienic macro definitions: syntax-rules, syntax-case, and syntax-parse, in order of increasing power and complexity. The first, syntax-rules, is straightforward but limited. The last one, syntax-parse, is the one you want for implementing industrial-strength compiler extensions. I don't quite see the need for the middle one, syntax-case, so I suppose it's there for historical reasons, being older than syntax-parse. Macros are the one aspect of Racket for which I recommend starting with something other than the Racket documentation: Greg Hendershott's Fear of Macros is a much more accessible introduction.

Scientific computing

As I said in the beginning of this post, my goal in exploring Racket was not to use it for my day-to-day work in computational science, but nevertheless I had a look at the support for scientific computing that Racket offers. In summary, there isn't much, but what there is looks very good.

The basic Racket language has good support for numerical computation, much of which is inherited from Scheme. There are integers of arbitrary size, rational numbers, and floating-point numbers (single and double precision), all with the usual operations. There are also complex numbers whose real/imaginary parts can be exact (integer or rational) or inexact (floats). Unlimited-precision floats are provided by an interface to MPFR in the Racket math library.
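To give readers who know Python a point of comparison, the following sketch shows the closest standard-library counterparts of these number types. It is an analogy, not a description of Racket: in particular, Python's built-in complex numbers always have floating-point parts, so the exact complex numbers mentioned above have no direct equivalent here.

```python
from fractions import Fraction

# Integers of arbitrary size: no overflow, no separate "bignum" type.
big = 2 ** 100

# Exact rational arithmetic: three thirds are exactly one.
third = Fraction(1, 3)
exact_sum = third + third + third

# Inexact (double-precision) arithmetic: 0.1 + 0.2 is not exactly 0.3.
inexact_sum = 0.1 + 0.2

# Complex numbers, with float components only.
z = complex(1.5, -2.0)

print(big)
print(exact_sum == 1)
print(inexact_sum == 0.3)
print(z * z.conjugate())
```

The practical difference is that Racket reads a literal such as 1/3 as an exact rational, whereas Python requires an explicit `Fraction` and falls back to floats as soon as one appears in an expression.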

The math library (which is part of every standard Racket installation) offers many more goodies: multidimensional arrays, linear algebra, Fourier transforms, special functions, probability distributions, statistics, etc. The plot library, also in the standard Racket installation, adds one of the nicest collections of plotting and visualization routines that I have seen in any language. If you use DrRacket, you can even rotate 3D scenes interactively, a feature that I found quite useful when I used (abused?) plots for molecular visualization.

Outside of the Racket distribution, the only library I could find for scientific applications is Doug Williams' "science collection", which predates the Racket math library. It looks quite good as well, but I didn't find an occasion yet for using it.

Could I do my current day-to-day computations with Racket? A better way to put it is, how much support code would I have to write that is readily available for more mature scientific languages such as Python? What I miss most is access to my data in HDF5 and netCDF formats. And the domain-specific code for molecular simulation, i.e. the equivalent of my own Molecular Modeling Toolkit. Porting the latter to Racket would be doable (I wrote it myself, so I am familiar with all the algorithms and its pitfalls), and would in fact be an opportunity to improve many details. But interfacing HDF5 or netCDF sounds like a lot of work with no intrinsic interest, at least to me.

The community

Racket has an apparently small but active, competent, and friendly community. I say "apparently" because all I have to base my judgement on is the Racket user mailing list. Given Racket's academic and teaching background, it is quite possible that there are lots of students using Racket who find sufficient support locally that they never manifest themselves on the mailing list. Asking a question on the mailing list almost certainly leads to a competent answer, sometimes from one of the core developers, many of whom are very present. There are clearly many Racket beginners (and also programming newbies) on the list, but compared to other programming language users' lists, there are very few naive questions and comments. It seems like people who get into Racket are serious about programming and are aware that problems they encounter are most probably due to their lack of experience rather than caused by bugs or bad design in Racket.

I also noticed that the Racket community is mostly localized in North America, judging from the peak posting times on the mailing list. This looks strange in today's Internet-dominated world, but perhaps real-life ties still matter more than we think.

Even though the Racket community looks small compared to other languages I have used, it is big and healthy enough to ensure its existence for many years to come. Racket is not the kind of experimental language that is likely to disappear when its inventor moves on to the next project.

Conclusion

Overall I am quite happy with Racket as a development language, though I have to add that I haven't used it for anything mission-critical yet. I plan to continue improving and completing my Racket implementation of Mosaic, and move it to Typed Racket as much as possible. But I am not ready to abandon Python as my workhorse for computational science: there are simply too many good libraries in the scientific Python ecosystem that are important for working efficiently.

The roles of computer programs in science

Konrad Hinsen — 2014-01-21

Why do people write computer programs? The answer seems obvious: in order to produce useful tools that help them (or their clients) do whatever they want to do. That answer is clearly an oversimplification. Some people write programs just for the fun of it, for example. But when we replace "people" by "scientists", and limit ourselves to the scientists' professional activities, we get a statement that rings true: Scientists write programs because these programs do useful work for them. Lengthy computations, for example, or visualization of complex data.

This perspective of "software as a tool for doing research" is so pervasive in computational science that it is hardly ever expressed. Many scientists even see software, or perhaps the combination of computer hardware plus software as just another piece of lab equipment. A nice illustration is this TEDx lecture by Klaus Schulten about his "computational microscope", which is in fact Molecular Dynamics simulation software for studying biological macromolecules such as proteins or DNA.

To see the fallacy behind equating computer programs with lab equipment, let's take a step back and look at the basic principles of science. The ultimate goal of science is to develop an understanding of the universe that we inhabit. The specificity of science (compared to other approaches such as philosophy or religion) is that it constructs precise models for natural phenomena that it validates and improves by repeated confrontation with observations made on the real thing:

An experiment is just an optimization: it's a setup designed for making a very specific kind of observation that might be difficult or impossible to make by just looking at the world around us. The process of doing science is an eternal cycle: the model is used to make predictions of yet-to-be-made observations, whereas the real observations are compared to these predictions in order to validate the model and, in case of significant discrepancies, to correct it.

In this cycle of prediction and observation, the role of a traditional microscope is to help make observations of what happens in nature. In contrast, the role of Schulten's computational microscope is to make predictions from a theoretical model. Once you think about this for a while, it seems obvious. To make observations on a protein, you need to have that protein. A real one, made of real atoms. There is no protein anywhere in a computer, so a computer cannot do observations on proteins, no matter which software is being run on it. What you look at with the computational microscope is not a protein, but a model of a protein. If you actually watch Klaus Schulten's video to the end, you will see that this distinction is made at some point, although not as clearly as I think it should be.

So it seems that the term "a tool for exploring a theoretical model" is a good description of a simulation program. And in fact that's what early simulation programs were. The direct ancestors of Schulten's computational microscope are the first Molecular Dynamics simulation programs made for atomic liquids. A classic reference is Rahman's 1964 paper on the simulation of liquid argon. The papers of that time specify the model in terms of a few mathematical equations plus some numerical parameters. Molecular Dynamics is basically Newton's equations of motion, discretized for numerical integration, plus a simple model for the interactions between the atoms, known as the Lennard-Jones potential. A simulation program of the time was a rather straightforward translation of the equations into FORTRAN, plus some bookkeeping and I/O code. It was indeed a tool for exploring a theoretical model.
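To make concrete how small such a model is, here is a minimal sketch of the whole thing for two atoms moving along a line. It is not a reconstruction of any historical program. The reduced units (epsilon, sigma, and mass all equal to 1), the time step, and the choice of the velocity Verlet scheme as the discretization are all assumptions made for this illustration.

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones potential V(r) = 4 eps [(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lj_force(r, epsilon=1.0, sigma=1.0):
    """Radial force -dV/dr; a positive value pushes the atoms apart."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r

def simulate(x1, x2, v1, v2, dt=0.001, steps=10000, mass=1.0):
    """Integrate Newton's equations for two atoms (x1 < x2) with velocity Verlet."""
    f = lj_force(x2 - x1)          # force on atom 2; atom 1 feels -f
    energies = []
    for _ in range(steps):
        # Half-step velocity update with the current force.
        v1 -= 0.5 * dt * f / mass
        v2 += 0.5 * dt * f / mass
        # Full-step position update.
        x1 += dt * v1
        x2 += dt * v2
        # Recompute the force at the new positions, then finish the velocity update.
        f = lj_force(x2 - x1)
        v1 -= 0.5 * dt * f / mass
        v2 += 0.5 * dt * f / mass
        kinetic = 0.5 * mass * (v1 * v1 + v2 * v2)
        energies.append(kinetic + lj_potential(x2 - x1))
    return x1, x2, v1, v2, energies

# Start at rest, slightly beyond the potential minimum at r = 2**(1/6).
x1, x2, v1, v2, energies = simulate(0.0, 1.5, 0.0, 0.0)
drift = max(abs(e - lj_potential(1.5)) for e in energies)
print(f"final separation: {x2 - x1:.4f}, max energy deviation: {drift:.2e}")
```

Everything that defines the model fits in the first two functions and the update rule. The rest is the bookkeeping and I/O mentioned above, and checking that the total energy stays nearly constant is the kind of soundness test that is easy here and hard in a large program.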

Since then, computer simulation has been applied to ever bigger and ever more complex systems. The examples shown by Klaus Schulten in his video represent the state of the art: assemblies of biological macromolecules, consisting of millions of atoms. The theoretical model for these systems is still a discretized version of Newton's equations plus a model for the interactions. But this model for the interactions has become extremely complex. So complex in fact that nobody bothers to write it down any more. It's not even clear how you would write it down, since standard mathematical notation is no longer adequate for the task. A full specification requires some algorithms and a database of chemical information. Specific aspects of model construction have been discussed at length in the scientific literature (for example how best to describe electrostatic interactions), but a complete and precise specification of the model used in a simulation-based study is never provided.

The evolution from simple simulations (liquid argon) to complex ones (assemblies of macromolecules) looks superficially like a quantitative change, but there is in fact a qualitative difference: for today's complex simulations, the computer program is the model. Questions such as "Does program X correctly implement model A?", a question that made perfect sense in the 1960s, have become meaningless. Instead, we can only ask "Does program X implement the same model as program Y?", but that question is impossible to answer in practice. The reason is that the programs are even more complex than the models, because they also deal with purely practical issues such as optimization, parallelization, I/O, etc. This phenomenon is not limited to Molecular Dynamics simulations. The transition from mathematical models to computational models, which can only be expressed in the form of computer programs, is happening in many branches of science. However, scientists are slow to recognize what is happening, and I think that is one reason for the frequent misidentification of software as experimental equipment. Once a theoretical model is complex and drowned in even more complex software, it acquires many of the characteristics of experiments. Like a sample in an experiment, it cannot be known exactly, it can only be studied by observing its behavior. Moreover, these observations are associated with systematic and statistical errors resulting from numerical issues that frequently even the program authors don't fully understand.

From my point of view (I am a theoretical physicist), this situation is not acceptable. Models play a central role in science, in particular in theoretical science. Anyone claiming to be a theoretician should be able to state precisely which models he/she is using. Differences between models, and approximations to them, must be discussed in scientific studies. A prerequisite is that the models can be written down in a human-readable form. Computational models are here to stay, meaning that computer programs as models will become part of the daily bread of theoreticians. What we will have to develop are notations and techniques that permit a separation of the model aspect of a program from all the other aspects, such as optimization, parallelization, and I/O handling. I have presented some ideas for reaching this goal in this article (click here for a free copy of the issue containing it, it's on page 77), but a lot of details remain to be worked out.

The idea of programs as a notation for models is not new. It has been discussed in the context of education, for example in this paper by Gerald Sussman and Jack Wisdom, as well as in their book that presents classical mechanics in a form directly executable on a computer. The constraint of executability imposed by computer programs forces scientists to remove any ambiguities from their models. The idea is that if you can run it on your computer, it's completely specified. Sussman and Wisdom actually designed a specialized programming language for this purpose. They say it's Scheme, which is technically correct, but Scheme is a member of the Lisp family of extensible programming languages, and the extensions written by Sussman and Wisdom are highly non-trivial, to the point of including a special-purpose computer algebra system.

For the specific example that I have used above, Molecular Dynamics simulations of proteins, the model is based on classical mechanics and it should thus be possible to use the language of Sussman and Wisdom to write down a complete specification. Deriving an efficient simulation program from such a model should also be possible, but requires significant research and development effort.

However, any progress in this direction can happen only when the computational science community takes a step back from its everyday occupations (producing ever more efficient tools for running ever bigger simulations on ever bigger computers) and starts thinking about the place that it occupies in the pursuit of scientific research.

Update (2014-5-26) I have also written a more detailed article on this subject.

Python as a platform for reproducible research

Konrad Hinsen — 2013-11-19

The other day I was looking at the release notes for the recently published release 1.8 of NumPy, the library that is the basis for most of the Scientific Python ecosystem. As usual, it contains a list of new features and improvements, but also sections such as "dropped support" (for Python 2.4 and 2.5) and "future changes", to be understood as "incompatible changes that you should start to prepare for". Dropping support for old Python releases is understandable: maintaining compatibility and testing it is work that needs to be done by someone, and manpower is notoriously scarce for projects such as NumPy. Many of the announced changes are in the same category: they permit removing old code and thus reduce maintenance effort. Other announced changes have the goal of improving the API, and I suppose they were more controversial than the others, as it is rarely obvious that one API is better than another one.

From the point of view of reproducible research, all these changes are bad news. They mean that libraries and scripts that work today will fail to work with future NumPy releases, in ways that their users, who are usually not the authors, cannot easily understand or fix. Actively maintained libraries will of course be adapted to changes in NumPy, but much, perhaps most, scientific software is not actively maintained. A PhD student doing computational research might well publish his/her software along with the thesis, but then switch subjects, or leave research altogether, and never look at the old code again. There are also specialized libraries developed by small teams who don't have the resources to do as much maintenance as they would like.

Of course NumPy is not the only source of instability in the Python platform. The most visible change in the Python ecosystem is the evolution of Python itself, whose 3.x series is not compatible with the initial Python language. It is difficult to say at this time for how long Python 2.x will be maintained, but it is well possible that much of today's scientific software written in Python will become difficult to run ten years from now.

The problem of scientific publications becoming more and more difficult to use is not specific to computational science. A theoretical physicist trying to read Isaac Newton's works would have a hard time, because the mathematical language of physics has changed considerably over time. Similarly, an experimentalist trying to reproduce Galileo Galilei's experiments would find it hard to follow his descriptions. Neither is a problem in practice, because the insights obtained by Newton and Galilei have been reformulated many times since then and are available in today's language in the form of textbooks. Reading the original works is required only for studying the history of science. However, it typically takes a few decades before specific results are universally recognized as important and enter the perpetually maintained canon of science.

The crucial difference with computations is that computing platforms evolve much faster than scientific research. Researchers in fields such as physics and chemistry routinely consult original research works that are up to thirty years old. But scientific software from thirty years ago is almost certainly unusable today without changes. The state of today's software thirty years from now is likely to be worse, since software complexity has increased significantly. Thirty years ago, the only dependencies a scientific program would have were a compiler and perhaps one of a few widely known numerical libraries. Today, even a simple ten-line Python script has lots of dependencies, most of them indirectly through the Python interpreter.

One popular attitude is to say: Just run old Python packages with old versions of Python, NumPy, etc. This is an option as long as the versions you need are recent enough that they can still be built and installed on a modern computer system. And even then, the practical difficulties of working with parallel installations of multiple versions of several packages are considerable, in spite of tools designed to help with this task (have a look at EasyBuild, hashdist, conda, and Nix or its offshoot Guix).

An additional difficulty is that the installation instructions for a library or script at best mention a minimum version number for dependencies, but not the last version with which they were tested. There is a tacit assumption in the computing world that later versions of a package are compatible with earlier ones, although this is not true in practice, as the example of NumPy shows. The Python platform would be a nicer place if any backwards-incompatible change were accompanied by a change in package name. Dependencies would then be evident, and the different incompatible versions could easily be installed in parallel. Unfortunately this approach is rarely taken, a laudable exception being Pyro, whose latest incarnation is called Pyro4 to distinguish it from its not fully compatible predecessors.

I have been thinking a lot about this issue recently, because it directly impacts my ActivePapers project. ActivePapers solves the dependency versioning problem for all code that lives within the ActivePaper universe, by abandoning the notion of a single collection of "installed packages" and replacing it by explicit references to a specific published version. However, the problem persists for packages that cannot be moved inside the ActivePaper universe, typically because of extension modules written in a compiled language. The most fundamental dependencies of this kind are NumPy and h5py, which are guaranteed to be available in an ActivePapers installation. ActivePapers does record the version numbers of NumPy and h5py (and also HDF5) that were used for each individual computation, but it has currently no way to reproduce that exact environment at a later time. If anyone has a good idea for solving this problem, in a way that the average scientist can handle without becoming a professional systems administrator, please leave a comment!
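Recording version numbers is the easy half of the problem, and it can be sketched with the standard library alone. The function below is a hypothetical illustration and not the ActivePapers API. It assumes Python 3.8 or later for `importlib.metadata`, and it reports `None` for packages that are not installed.

```python
import json
import platform
from importlib import metadata

def record_environment(package_names):
    """Describe the interpreter and the installed versions of the given packages."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment: record that fact explicitly.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }

env = record_environment(["numpy", "h5py"])
# A JSON string is easy to store next to the results of a computation.
print(json.dumps(env, indent=2, sort_keys=True))
```

Such a record tells a later reader which environment produced a result, but nothing in it helps to rebuild that environment, which is exactly the gap described above. It also misses libraries that are not Python packages, such as HDF5 itself.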

As I have pointed out in an earlier post, long-term reproducibility in computational science will become possible only if the community adopts a stable code representation, which needs to be situated somewhere in between processor instruction sets and programming languages, since both ends of this spectrum are moving targets. In the meantime, we will have to live with workarounds.

ActivePapers for Python

Konrad Hinsen — 2013-09-27

Today I have published the first release of ActivePapers for Python, available on PyPI or directly from the Mercurial repository on Bitbucket. The release coincides with the publication of my first scientific paper for which the complete code and data is in the supplementary material, available through the J. Chem. Phys. Web site or from Figshare. There is a good chance that this is the first fully reproducible paper in the field of biomolecular simulation, but it is of course difficult to verify such a claim.

ActivePapers is a framework for doing and publishing reproducible research. An ActivePaper is a file that contains code (Python modules and scripts) and data (HDF5 datasets), plus the dependency information between all these pieces. You can change a script and re-run all the computations that depend on it, for example. Once your project is finished, you can publish the ActivePaper as supplementary material to your standard paper. You can also re-use code and data from a published ActivePaper by using DOI-based links, although for the moment this works only for ActivePapers stored on Figshare.
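The role of the dependency information can be illustrated with a small sketch. The item names and the function `stale_items` are invented for this example and do not reflect how ActivePapers stores or processes its data. The sketch uses `graphlib` from the standard library, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical contents of a paper: each item maps to the items it depends on.
dependencies = {
    "data/raw": set(),
    "code/analysis.py": set(),
    "code/plot.py": set(),
    "data/results": {"data/raw", "code/analysis.py"},
    "data/figure": {"data/results", "code/plot.py"},
}

def stale_items(dependencies, changed):
    """Return, in a valid recomputation order, every item that depends
    directly or indirectly on `changed`."""
    stale = set()
    order = []
    # static_order() yields each item only after all of its dependencies.
    for item in TopologicalSorter(dependencies).static_order():
        if dependencies.get(item, set()) & (stale | {changed}):
            stale.add(item)
            order.append(item)
    return order

print(stale_items(dependencies, "code/analysis.py"))
print(stale_items(dependencies, "code/plot.py"))
```

Changing the analysis script invalidates both the results and the figure, whereas changing the plotting script invalidates only the figure. Because the items come out in dependency order, they can be recomputed one after the other.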

I consider this first release of ActivePapers quite usable (I use it, after all), but it's definitely for "early adopters". You should be comfortable working with command-line tools, for example, and of course you need some experience with writing Python scripts if you want to create your own ActivePaper. For inspecting data, you can use any HDF5-based tool, such as HDFView, though this makes sense only for data that generic tools can handle. My first published ActivePaper contains lots of protein structures, which HDFView doesn't understand at all. I expect tool support for ActivePapers to improve significantly in the near future.

Platforms for reproducible research

Konrad Hinsen — 2013-08-14

This post was motivated by Ian Gent's recomputation manifesto and his blog post about it. While I agree with pretty much everything said there, there is one point that I strongly disagree with, and here I'd like to explain the reasons in some detail. The point in question is "The only way to ensure recomputability is to provide virtual machines". To be fair, the manifesto specifies that it's the only way "at least for now", so perhaps our disagreement is not as pronounced as it may seem.

I'll start with a quote from the manifesto that shows that we have similar ideas of the time scales over which computational research should be reproducible:
"It may be true that code you make available today can be built with only minor pain by many people on current computers. That is unlikely to be true in 5 years, and hardly credible in 20."

So the question is: how can we best ensure that the software used in our computational studies can still be run, with reasonable effort, 20 years from now? To answer that question, we have to look at the possible platforms for computational research.

By a "platform", I mean the combination of hardware and software that is required to use a given piece of digital information. For example, Flash video requires a Flash player and a computer plus operating system that the Flash player can run on. That's what defines the "Flash platform". Likewise, today's "Web platform" (a description that requires a date stamp to be precise, because Web standards evolve so quickly) consists of HTML5, JavaScript, and a couple of related standards. If you want to watch a Flash video in 20 years, you will need a working Flash platform, and if you want to use an archived copy of a 2013 Web site, you need the 2013 Web platform.

If you plan to distribute some piece of digital information with the hope that it will make sense 20 years from now, you must either have confidence in the longevity of the platform, or be willing and able to ensure its long-term maintenance yourself. For the Flash platform, that means confidence in Adobe and its willingness to keep Flash alive (I wouldn't bet on that). For the 2013 Web platform, you may hope that its sheer popularity will motivate someone to keep it alive, but I wouldn't bet on it either. The Web platform is too complex and too ill-defined to be kept alive reliably when no one uses it in daily life any more.

Back to computational science. 20 years ago, most scientific software was written in Fortran 77, often with extensions specific to a machine or compiler. Much software from that era relied on libraries as well, but they were usually written in the same language, so as long as their source code remains available, the platform for all that is a Fortran compiler compatible with the one from back then. For standard Fortran 77, that's not much of a problem, whereas most of the vendor-specific extensions have disappeared since. Much of that 20-year-old software can in fact still be used today. However, reproducing a computational study based on that software is a very different problem: it also requires all the input data and an executable description of the computational protocol. Even in the rare case that all that information is available, it is likely to depend on lots of other software pieces that may not be easy to get hold of any more. The total computational platform for a given research project is in fact as ill-defined as the 2013 Web platform.

Today's situation is worse, because we use more diverse software written in a wider range of languages, and also use more interactive software whose use is notoriously non-reproducible. The only aspect where we have gained in standardization is the underlying hardware and OS layer: pretty much all computational science is done today on x86 processors running Linux. Hence the idea of conserving the full operating environment in the form of a virtual machine. Just fire up VirtualBox (or one of the other virtual machine managers) and run an exact copy of the original study's work environment.

But what is the platform required to run today's virtual machines? It's VirtualBox, or one of its peers. Note however that it's not "any of today's virtual machine managers" because compatibility between their virtual machine formats is not perfect. It may work, or it may not. For simplicity I will use VirtualBox in the following, but you can substitute another name and the basic arguments still hold.

VirtualBox is a highly non-trivial piece of software, and it has very stringent hardware requirements. Those hardware requirements are met by the vast majority of today's computing equipment used in computational science, but the x86 platform is losing market share rapidly on the wider computing device market. VirtualBox doesn't run on an iPad, for example, and probably it never will. Is VirtualBox likely to be around in 20 years? I won't dare a prediction. If x86 survives for another 20 years AND if Oracle sees a continuing interest in this product, then it will. I won't bet on it though.

What we really need for long-term recomputability is a simple platform. A platform that is simple enough that the scientific community alone can afford to keep it alive for its own needs, even if no one else in the world cares about it.

Unfortunately there is no suitable platform today, to the best of my knowledge. Which is why virtual machines are perhaps the best option right now, for lack of a satisfactory one. But if we care about recomputability, we should design and develop a good supporting platform, starting as soon as possible.

For a more detailed discussion of this issue, see this paper written by yours truly. It comes to the conclusion that the closest existing approximation to a good platform is the Java virtual machine. What we'd want ideally is something similar to the JVM, but designed and optimized for scientific applications. A basic JVM implementation is quite simple (the complex JIT stuff is not a requirement), a few orders of magnitude simpler than VirtualBox, and it has no specific hardware dependencies. It's even simpler than many of today's scientific software packages, so the scientific community can definitely afford to keep it alive. The tough part is... no, it's not designing or writing the required software, it's agreeing on a specification. Perhaps it will never happen. Perhaps virtual machines will remain the best choice for lack of a satisfactory one. Or perhaps we will end up compiling our software to asm.js and running it in the browser, just because someone else will keep that platform alive for us, no matter how ill-adapted it is to our needs. But don't say you haven't been warned.

Bye bye Address Book, welcome BBDB

Konrad Hinsen — 2013-06-03

About two years ago I wrote a post about why and how I abandoned Apple's iCal for my agenda management and moved to Emacs org-mode instead. Now I am in the process of making the second step in the same direction: I am abandoning Apple's Address Book and starting to use the "Big Brother DataBase", the most popular contact management system from the Emacs universe.

What started to annoy me seriously about Address Book is a bug that makes the database and its backups grow over time, even if no contacts are added, because the images for the contacts keep getting copied and never deleted under certain circumstances. I ended up having address book backups of 200 MB for just 500 contacts, which is ridiculous. A quick Web search shows that the problem has been known for years but has not yet been fixed.

When I upgraded from MacOS 10.6 to 10.7 about a year ago (I am certainly not an early adopter of new MacOS versions), I had a second reason to dislike Address Book: the user interface had been completely re-designed and become a mess in the process. Every time I use it I have to figure out again how to navigate groups and contacts.

I had been considering moving to BBDB for a while, but I hadn't found any good solution for synchronizing contacts with my Android phone. That changed when I discovered ASynK, which does a bi-directional synchronization between a BBDB database and a Google Contacts account. That setup actually works better than anything I ever tried to synchronize Address Book with Google Contacts, so I gained more than I expected in the transition.

At first glance, it may seem weird to move from technology of the 2000's to technology of the 1970's. But the progress over that period in managing rather simple data such as contact information has been negligible. The big advantage of the Emacs platform over the MacOS platform is that it doesn't try to take control over my data. A BBDB database is just a plain text file whose structure is apparent after five minutes of study, whereas an Address Book database is stored in a proprietary format. A second advantage is that the Emacs developer community fixes bugs a lot faster than Apple does. A less shiny (but perfectly usable) user interface is a small price to pay.

A critical view of altmetrics

Konrad Hinsen — 2013-05-08

Altmetrics is one of the hotly debated topics in the Open Science movement today. In summary, the idea is that traditional bibliometric measures (citation counts, impact factors, h factors, ...) are too limited because they miss all the scientific activity that happens outside of the traditional journals. That includes the production of scientific contributions that are not traditional papers (e.g. datasets, software, blog posts) and the references to scientific contributions that are not in the citation list of a traditional paper (blogs, social networks, etc.). Note that the altmetrics manifesto describes altmetrics as a tool to help scientists find publications worth reading. I find it hard to believe that its authors have not thought of applications in evaluation of researchers and institutions, which will inevitably happen if altmetrics ever takes off.

At first sight, altmetrics appear as an evident "update" to traditional bibliometry. It sounds pretty obvious that, as scientific communication moves on to new media and finds new forms of expression, bibliometry should adapt. On the other hand, bibliometry is considered a more or less necessary evil by most scientists. Many deplore today's "publish or perish" culture and correctly observe that it is harmful to science in the long term, giving more importance to the marketing of research studies than to their careful design and meticulous execution. I haven't yet seen any discussion of this aspect in the context of altmetrics, so I'd like to start such a discussion with this post.

First of all, why is bibliometry so popular, and why is it harmful in the long run? Second, how will this change if and when altmetrics are adopted by the scientific community?

Bibliometry provides measures of scientific activity that have two important advantages: they are objective, based on data that anyone can check in principle, and they can be evaluated by anyone, even by a computer, without any need to understand the contents of scientific papers. On the downside, those measures can only indirectly represent scientific quality precisely because they ignore the contents. Bibliometry makes the fundamental assumption that the way specific articles are received by the scientific community can be used as a proxy for quality. That assumption is, of course, wrong, and that's how bibliometry ultimately harms the progress of science.

The techniques that people use to improve their bibliometrical scores without contributing to scientific progress are well known: dilution of content (more articles with less content per article), dilution of authorship (agreements between scientists to add each others' names to their works), marketing campaigns for getting more citations, application of a single technique to lots of very similar applications even if that adds no insight whatsoever. Altmetrics will cause the same techniques to be applied to datasets and software. For example, I expect scientific software developers to take Open Source libraries and re-publish them with small modifications under a new name, in order to have their name attached to them. Unless we come up with better techniques for software installation and deployment, this will probably make the management of scientific software a bit more complicated because we will have to deal with lots of small libraries. That's a technical problem that can and should be solved with a technical solution.

However, these most direct and most discussed negative consequences of bibliometry are not the only ones and perhaps not the worst. The replacement of expert judgement by majority vote, which is the basis of bibliometry, also in its altmetrics incarnation, leads to a phenomenon which I will call "scientific bubbles" in analogy to market bubbles in economics. A market bubble occurs if the price of a good is determined not by the people who buy it to satisfy some need, but by traders and speculators who try to estimate the future price of the good and make a profit from a rise or fall relative to the current price. In science, the "client" whose "need" is fulfilled by a scientific study is mainly future science, plus, in the case of applied research, engineering and product development. The role of traders and speculators is taken by referees and journal editors. A scientific bubble is a fashionable topic that many people work on not because of its scientific interest but because of the chance it provides to get a highly visible publication. Like market bubbles, scientific bubbles eventually explode when people realize that the once fashionable topic was a dead end. But before exploding, a bubble has wasted much money and intellectual energy. It may also have blocked alternative and ultimately more fruitful research projects that were refused funding because they were in contradiction with the dominating fashionable point of view.

My prediction is that altmetrics will make bubbles more numerous and more severe. One reason is the wider basis of sources from which references are counted. In today's citation-based bibliometry, citations come from articles that went through some journal's peer-reviewing process. No matter how imperfect peer review is, it does sort out most of the unfounded and obviously wrong contributions. To get a paper published in a journal whose citations count, you need a minimum of scientific competence. In contrast, anyone can publish an opinion on Twitter or Facebook. Since for any given topic the number of experts is much smaller than the number of people with just some interest, a wider basis for judgement automatically means less competence on average. As a consequence, high altmetrics scores are best obtained by writing articles that appeal to the masses who can understand what the work is about but not judge if it is well-founded. Another reason why altmetrics will contribute to bubbles is the positive feedback loop created by people reading and citing publications because they are already widely read and cited. That effect is dampened in traditional bibliometry because of the slowness of the publishing and citation mechanism.

My main argument ends here, but I will try to anticipate some criticisms and reply to them immediately.

One objection I expect is that the analysis of citation graphs can be used to assign a kind of reputation to each source and weight references by this reputation. That is the principle of Google's famous PageRank algorithm. However, any analysis of the citation graph suffers from the same fundamental problem as bibliometry itself: a method that only looks at relations between publications but not at their contents can't distinguish a gem from a shiny bubble. There will be reputation bubbles just like there are topic bubbles. No purely quantitative analysis can ever make a statement about quality. The situation is similar to mathematical formalisms, with citation graph analysis taking the role of formal proof and scientific quality the role of truth in Gödel's incompleteness theorem.
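To make the reputation-weighting idea concrete, here is a minimal sketch of a simplified PageRank computed by power iteration on a tiny, made-up citation graph. The graph, the damping factor of 0.85, and the function name are assumptions chosen for illustration; this is not the algorithm as any search engine or citation database actually runs it. Note that the computation only ever sees who cites whom, never what the publications say.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links maps each node to the list of nodes it cites.
    Every node must appear as a key (possibly with an empty list).
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of reputation.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, cited in links.items():
            if cited:
                # A node passes its reputation on to the nodes it cites.
                share = damping * rank[node] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A node citing nothing spreads its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: three papers cite "a", which cites only "b".
citations = {"a": ["b"], "b": [], "c": ["a"], "d": ["a"], "e": ["a"]}
ranks = pagerank(citations)
```

In this toy graph, "a" ends up with a higher score than "c", "d", or "e" simply because more nodes point to it, and "b" inherits reputation from "a" without the method knowing anything about either paper's content.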

Another likely criticism is that the concept of the scientific bubble is dubious. Many paths of scientific explorations have turned out to be failures, but no one could possibly have predicted this in the beginning. In fact, many ultimately successful strategies have initially been criticized as hopeless. Moreover, exploration of a wrong path can still lead to scientific progress, once the mistake has been understood. How can one distinguish promising but ultimately wrong ideas from bubbles? The borderline is indeed fuzzy, but that doesn't mean that the concept of a bubble is useless. It's the same for market bubbles, which exist but are less severe when a good is traded both for consumption and for speculation. My point is that the bubble phenomenon exists and is detrimental to scientific progress.

Lessons from sixteen years of molecular simulation in Python

Konrad Hinsen — 2013-04-10

A while ago I was chatting with two users of my Molecular Modelling Toolkit (MMTK), a library for molecular simulations written in Python. One of them asked me what I would do differently if I were to write MMTK today. That's an interesting question, but not the kind of question I can answer in a sentence or two, so I promised to write a blog post about this. Here it is.

First, a bit of history. The first version of MMTK was released about 16 years ago. I don't have the exact data, but the first message on the MMTK mailing list, announcing MMTK release 1.0b2, is dated 29 May 1997. Back then Python 1.4 was the state of the art and Numerical Python was a young project that was just beginning to stabilize. MMTK was one of the first domain-specific scientific libraries written in Python, at a time when the scientific Python community was very small and its members were mostly considered cranks by their peers. MMTK was designed from the start as a Python library, with relatively small bits of C code for the time-critical stuff (mainly energy evaluation and MD integration), with NumPy arrays at the Python-C interface. This has since become one of the two main approaches to using Python in scientific computing, the other one being wrapper code around libraries written in C/C++ or Fortran.

So what would I do differently if I were to start writing MMTK today? Many things, for different reasons. Let's first get the obvious stuff out of the way: the Python ecosystem has evolved significantly since 1997, and of course I would use Python 3, and Cython instead of C for the time-critical parts. I would also adopt many of the conventions that the community has developed but which weren't around in 1997. I might even be tempted to use bleeding-edge tools like Numba, although with hesitation: Numba is not only a moving target at this time, but also requires dependencies (I am thinking mostly of LLVM) which are big and non-trivial to install. One lesson I have learned in 16 years of scientific Python is that dependencies can cause more trouble than they are worth. It's nice in theory to re-use existing tested code, but it also makes installation and deployment more cumbersome.

So far for changes in the Python ecosystem. What has changed as well, though at a slower pace, is the role of computation in science and in particular in molecular simulation. Back in 1997, there were a few molecular simulation ecosystems that operated almost in isolation. The big players were the CHARMM, AMBER, and GROMOS/GROMACS communities. Each of them had their own software, their own file formats, and their own force fields. Members of these communities would of course talk about science to each other, but not share any software or data. Developing new computational methods required a serious investment into one of these ecosystems. That was in fact my main motivation for developing MMTK: I figured that I would be more efficient (not to mention more satisfied) writing a new system from scratch using modern development tools than trying to get familiar with crufty Fortran code. But I adopted basically the same approach with MMTK: I created a new ecosystem without much regard to sharing code or data with the rest of the world. As an illustration, MMTK defines its own trajectory format which I still consider superior to what the rest of the world is doing, but which is undeniably hard to use without MMTK, given that the definition of a universe is stored as an executable Python expression. MMTK also encourages storing data as Python pickle files, which are even harder to deal with for other programs.

Today we are seeing a change in attitude in computational science that I am sure will soon reach the molecular simulation community as well. People are starting to realize that computational results have serious reliability problems. The most publicized case in the structural biology community was the retraction of a few important published protein structures following the discovery of a bug in the data processing software that led to completely wrong final structures. This and similar events point to the urgent need for better validation of computational results. One aspect of validation is re-running the same computation with different tools. Another aspect is publishing both software and raw data, enabling other scientists to inspect them and check their validity. Technology for sharing scientific code and data exists today (have a look at Github, Bitbucket, and figshare, for example). But in molecular simulation, there are still important practical barriers to such validation attempts, in particular the use of program-specific and badly documented file formats. While MMTK's file formats are documented, they are still program-specific and thus incompatible with the requirements of the future.

The sentence that I would like to write now is "If I were to rewrite MMTK today, I would use the exchange data formats accepted by the molecular simulation community". But those formats don't exist yet, although there are a few initiatives to develop them. My own contribution to this effort is the Mosaic data model and data formats - if you are interested in this subject, please have a look at it and send me your feedback. Mosaic will of course find its way into future versions of MMTK.

Finally, there are things I would do differently because the experience with MMTK has shown that a few initial design decisions were not the best ones. Number one is the absence of stable atom numbers. In MMTK, each atom and molecule is represented by a unique Python object, and there are ways to refer uniquely to everything by using Python expressions. But there is no such thing as a unique order of atoms that would assign a number to each one. Atoms do have numbers by which the low-level C code refers to them, but these numbers can be different every time you run a Python script. My original design goal was to discourage the use of numbers to refer to atoms, because this is an important source of mistakes if the simulated system undergoes changes. But every other molecular simulation program out there uses numbers to refer to atoms, so people are used to them. For interoperability with other programs, atom numbers are fundamental. There are ways to handle such situations, of course, but it's a constant source of headaches.

The other design aspect that I would change if I were to rewrite MMTK today is the hierarchy of chemical objects. MMTK has Atoms, Groups, Molecules, and Complexes, plus specializations such as AminoAcidResidue (a special Group), PeptideChain (a special Molecule), and Protein (a special Complex). While all of these correspond to some chemical reality, the system is more complex than required for molecular simulation, leading in some situations to code that is bloated by irrelevant special cases. Today I'd go for just Atoms and Groups, with special features of specific kinds of groups indicated by attributes rather than specific classes.
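A minimal sketch of what such a flattened hierarchy could look like is given below. The class names, the attribute keys, and the `all_atoms` helper are hypothetical and chosen only for illustration; none of this is MMTK's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    name: str


@dataclass
class Group:
    """A named collection of atoms and subgroups.

    What kind of group this is (residue, chain, protein, ...) is recorded
    in the attributes dictionary rather than in a dedicated subclass.
    """
    name: str
    atoms: list = field(default_factory=list)
    subgroups: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)

    def all_atoms(self):
        """Yield the atoms of this group and of all nested subgroups."""
        yield from self.atoms
        for group in self.subgroups:
            yield from group.all_atoms()


# A glycine residue inside a one-residue peptide chain (heavy atoms only).
glycine = Group(
    name="GLY1",
    atoms=[Atom("N", "N"), Atom("C", "CA"), Atom("C", "C"), Atom("O", "O")],
    attributes={"kind": "amino_acid_residue"},
)
chain = Group(name="chain_A", subgroups=[glycine],
              attributes={"kind": "peptide_chain"})

atom_names = [atom.name for atom in chain.all_atoms()]
```

Code that only needs atoms can call `all_atoms()` on any group without caring what kind it is, while code that does care can inspect `attributes["kind"]` instead of testing for a class.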

Integrating scientific software and datasets into the citation record

Konrad Hinsen — 2012-11-14

This morning I read C. Titus Brown's blog post on how science could be so much better if scientific data and the software used to work with it were openly available for reuse. One problem he mentions, like many others have done before, is the lack of incentive for publishing anything else but standard scientific papers. What matters for a scientist's career and for grant applications is papers, papers, papers. Any contribution that's not in a scientific journal with a reputation and an impact factor is usually ignored, even if its real impact exceeds that of many papers that nobody really wants to read.

Ideally, published scientific data and software should be treated just like a paper: it should be citeable and it should appear in the citation databases that are used to calculate impact factors, h factors, and whatever other metrics bibliometrists come up with and evaluation committees appreciate for their ease of use.

Treating text (i.e. papers), data, and code identically also happens to be useful for making scientific publications more useful to the reader, by adding interactive visualization and exploration of procedures (such as varying parameters) to the static presentation of results in a standard paper. This idea of "executable papers" has generated a lot of interest recently, as shown by Elsevier's Executable Paper Challenge and the Beyond the PDF workshop. For a technical description of how this can be achieved, see my ActivePapers project and/or the paper describing it. In the ActivePapers framework, a reference to code being called, or to a dataset being reused, is exactly identical to a reference to a published paper. It would then be much easier for citation databases to include all references rather than filter out the ones that are "classical" citations. And that's a good motivation to finally treat all scientific contributions equally.

Since the executable papers idea is much easier to sell than the idea of an updated incentive system, a seemingly innocent choice in technology could end up helping to change the way scientists and research projects are evaluated.

The ultimate calculator for Android and iOS

Konrad Hinsen — 2012-09-07

Calculators are among the most popular applications for smartphones, and therefore it is not surprising that the Google Play Store has more than 1000 calculators for the Android platform. Having used HP's scientific calculators for more than 20 years, I picked RealCalc when I got my Android phone and set it to RPN mode. It works fine, I have no complaints about it. But I no longer use it because I found something much more powerful.

It's called "J", which isn't exactly a very descriptive name. And that's probably a good idea because describing it is not so easy. J is much more than a calculator, but it does the calculator job very well. It's actually a full programming language, but one that differs substantially from everything else that goes by that label. The best description for J I can come up with is "executable mathematical notation". You type an expression, and you get the result. That's in fact not very different from working interactively with Python or Matlab, except that the expressions are very different. You can write traditional programs in J, using loops, conditionals, etc., but you can get a lot of work done without ever using these features.

The basic data structure in J is the array, which can have any number of dimensions. Array elements can be numbers, characters, or other arrays. Numbers (zero-dimensional arrays) and text strings (one-dimensional arrays of characters) are just special cases. In J jargon, which takes its inspiration from linguistics, data items are called "nouns". Standard mathematical operators (such as + or -) are called "verbs" and can have one or two arguments (one left, one right). An expression is called a "sentence". There are no precedence rules, the right argument of any verb being everything to its right. Given the large number of verbs in J, this initially unfamiliar rule makes a lot of sense. A simple example (also showing the use of arrays) is

   2 * 3 + 10 20 30
26 46 66
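For readers more at home in Python, the evaluation rule can be sketched as follows. This is a toy evaluator written for illustration only: it handles integer nouns and the three dyadic verbs listed in `VERBS`, and it is in no way a J implementation.

```python
import operator

# The dyadic verbs this sketch understands.
VERBS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def apply_verb(verb, left, right):
    """Apply a two-argument verb element by element.

    A single number is paired with every element of the other argument.
    """
    if len(left) == 1:
        left = left * len(right)
    if len(right) == 1:
        right = right * len(left)
    if len(left) != len(right):
        raise ValueError("length error")
    return [VERBS[verb](a, b) for a, b in zip(left, right)]


def evaluate(sentence):
    """Evaluate 'noun verb noun verb noun ...' without precedence rules.

    The right argument of each verb is everything to its right, so the
    verbs are applied starting from the rightmost one.
    """
    # Group the tokens into alternating nouns (lists of numbers) and verbs.
    items = []
    for token in sentence.split():
        if token in VERBS:
            items.append(token)
        elif items and isinstance(items[-1], list):
            items[-1].append(int(token))
        else:
            items.append([int(token)])
    result = items.pop()
    while items:
        verb = items.pop()
        left = items.pop()
        result = apply_verb(verb, left, result)
    return result


print(evaluate("2 * 3 + 10 20 30"))   # [26, 46, 66]
print(evaluate("2 * 3 - 1"))          # [4], not [5] as with the usual precedence
```

The second call shows the consequence of the rule: the subtraction is carried out first because it is the rightmost verb.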

Up to here, J expressions are not very different from Python or Matlab expressions. What J doesn't have is functions with the familiar f(x, y, z) syntax, accepting any number of arguments. There are only verbs, with one or two arguments. But what makes J really different from the well-known languages for scientific computing are the "parts of speech" that have no simple equivalent elsewhere: adverbs and conjunctions.

An adverb takes a verb argument and produces a derived verb from it. For example, the adverb ~ takes a two-argument verb (a dyad in J jargon) and turns it into a one-argument verb (a monad) that's equivalent to using the dyad with two equal arguments. With + standing for plain addition, +~ thus doubles its argument:

   +~ 1 5 10 20
2 10 20 40

meaning it is the same as

   1 5 10 20 + 1 5 10 20
2 10 20 40

A conjunction combines a verb with a noun or another verb to produce a derived verb. An example is ^:, the power conjunction, which applies a verb several times:

   +~(^:2) 1 5 10 20
4 20 40 80
   +~(^:3) 1 5 10 20
8 40 80 160

The parentheses are required to separate the argument of the power conjunction (2 or 3) from the array that is the argument to the resulting derived verb. To see the real power of the power conjunction, consider that it accepts negative arguments as well:

   +~(^:_1) 1 5 10 20
0.5 2.5 5 10

You have seen right: J can figure out that the inverse of adding a number to itself is dividing that number by two!
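A rough Python analogue of the power conjunction is sketched below; the function names are made up for the example. The important difference is that this sketch cannot work out an inverse by itself: for a negative power the caller has to supply the inverse function explicitly, which is precisely the step that J performs on its own.

```python
def power(verb, n, inverse=None):
    """Return a function that applies verb n times to each element.

    For negative n the inverse is applied -n times. Unlike J, this sketch
    cannot derive the inverse on its own; the caller has to supply it.
    """
    if n < 0:
        if inverse is None:
            raise ValueError("negative power requires an explicit inverse")
        verb, n = inverse, -n

    def derived(values):
        for _ in range(n):
            values = [verb(x) for x in values]
        return values

    return derived


def double(x):
    return x + x


def halve(x):
    return x / 2


data = [1, 5, 10, 20]
print(power(double, 2)(data))                   # [4, 20, 40, 80]
print(power(double, 3)(data))                   # [8, 40, 80, 160]
print(power(double, -1, inverse=halve)(data))   # [0.5, 2.5, 5.0, 10.0]
```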

Pretty much any programming language permits you to assign values to names for re-use in later expressions. J is no exception:

   data =. 1 5 10 20
   double =. +~
   double data
2 10 20 40
   inv =. ^:_1
   halve =. double inv
   halve data
0.5 2.5 5 10

As you can see, names can be given not just to nouns (i.e. data), but also to verbs, adverbs, and conjunctions. Most J programs are just pieces of expressions that are assigned to names. Which means that the short summary of J that I have given here could well be all you ever need to know about the language - apart from the fact that you will have to acquire a working knowledge of many more verbs, adverbs, and conjunctions.

Before you rush off to the Play Store looking for J, let me add that J is not yet there, although it's supposed to arrive soon. For now, you have to download the APK and install it yourself, using your preferred Android file manager. I should also point out that J is not just for Android. It's been around for more than 20 years, and you can get J for all the common computing platforms from Jsoftware. There's also an iOS version for the iPhone and iPad. J's extreme terseness is a perfect fit for smartphones, where screen space is a scarce resource and where every character you don't have to type saves you a lot of time.

The Nix package manager in computational science

Konrad Hinsen — 2012-05-14

In an earlier post, I mentioned the Nix package management system as a candidate for ensuring reproducibility in computational science. What distinguishes Nix from the better known package managers (Debian, RPM, ...) is that it permits the installation of different versions of the same package in parallel, with a dependency tracking system that refers to a precise version of everything, including the versions of the development tools (compilers, ...) that were used to build the libraries and executables. Nix thus remembers for each package the complete details of how it can be reconstructed, which is what we would like to see for ensuring reproducibility.

There are, however, two caveats. First of all, Nix was designed for software installation management and not for computation. While in principle one could define the results (figures, tables, datasets) of some computation as a Nix package and perform the computation by installing the package, such an approach is quite cumbersome with the Nix support tools designed with a different task in mind. However, computation-specific support tools would probably suffice to fix this. Second, while the design of Nix looks quite sound, it is a young project with much less manpower behind it than the big package managers of the Linux world. This means there are fewer package definitions and they are overall less reliable. For example, I haven't yet managed to install my research computing environment (Python, NumPy, matplotlib, plus a few more packages) using Nix under MacOS X, because some packages simply fail to build. Again this is not an insurmountable problem, but it requires some serious effort to fix.

The Nix documentation is pretty good at describing how to use the package manager and the collection of package definitions for Linux and MacOS X named Nixpkgs. It is not so good at giving a basic understanding of how Nix works, which becomes important when you want to use it for something else than traditional package management. The following overview is the result of my own explorations of Nix. I am not a Nix authority, so be warned that there may be mistakes or misunderstandings.

At the heart of Nix is the "Nix store", a central database where everything managed by Nix is kept. Its default location is /nix/store and if you look at it you see an overwhelmingly long list of cryptic filenames. Let's zoom in on something to see what's going on. Here is what ls -l /nix/store/*zlib* shows on my machine:


-r--r--r-- 1 hinsen staff 1000 Jan  1  1970
 /nix/store/12vkkhs36xffzpqjaaa3vqhqv2yc97vs-zlib-1.2.6.drv
-r--r--r-- 1 hinsen staff 1181 Jan  1  1970
 /nix/store/gymcn145ihhmymm6yk2wxqfd49s5dzdq-zlib-1.2.6.drv
dr-xr-xr-x 5 hinsen staff  170 Jan  1  1970
 /nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
-r--r--r-- 1 hinsen staff 1000 Jan  1  1970
 /nix/store/sj8l48kfc40wh8adb5pa843lwy38hskb-zlib-1.2.6.drv
-r--r--r-- 1 hinsen staff 1686 Jan  1  1970
 /nix/store/xpm2xja2zv5agmdzgi362jqd5xx9ny10-zlib-1.2.6.tar.gz.drv

The single directory in that list actually contains the zlib installation in the familiar Unix file layout that you find under /usr or /usr/local:


~> ls -R /nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6:
include  lib  share

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/include:
zconf.h  zlib.h

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib:
libz.1.2.6.dylib  libz.1.dylib	libz.a	libz.dylib  pkgconfig

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib/pkgconfig:
zlib.pc

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share:
man

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share/man:
man3

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share/man/man3:
zlib.3.gz

Note that it contains just zlib, and nothing else, in particular not zlib's dependencies. Each library or application has its own directory in the Nix store.
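The cryptic prefix of each store entry is a hash that depends on how the entry is built. The following sketch illustrates the general idea of naming a directory after a hash of the complete build recipe, so that the same zlib version built with two different compilers ends up in two different directories. It is an illustration under stated assumptions, not Nix's actual scheme: the function name, the JSON recipe, the truncated SHA-256 digest, and the gcc version numbers are all made up for the example.

```python
import hashlib
import json


def store_path(name, version, build_inputs, store="/nix/store"):
    """Derive a store directory name from a package's complete build recipe.

    build_inputs is a list of store paths of everything used in the build
    (dependencies, compilers, ...). Illustration only: the real Nix hashing
    scheme and hash encoding differ from this.
    """
    recipe = json.dumps(
        {"name": name, "version": version, "inputs": sorted(build_inputs)},
        sort_keys=True,
    )
    digest = hashlib.sha256(recipe.encode("utf-8")).hexdigest()[:32]
    return f"{store}/{digest}-{name}-{version}"


# Hypothetical compiler versions, chosen only for the example.
gcc_old = store_path("gcc", "4.6.3", [])
gcc_new = store_path("gcc", "4.7.0", [])

# Same zlib version, different compiler: two distinct store entries.
zlib_built_with_old = store_path("zlib", "1.2.6", [gcc_old])
zlib_built_with_new = store_path("zlib", "1.2.6", [gcc_new])
```

Because the hash of each input is part of the recipe of everything built from it, a change anywhere in the chain of tools propagates to all dependent entries, and several variants of the same package version can coexist in the store.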

Next, let's look at all the other files, those with the extension .drv (for "derivation", a Nix term for any artefact derived from human-provided input). There are three files that end in zlib-1.2.6.drv and one that ends in zlib-1.2.6.tar.gz.drv. Let's look at the contents of the last one first. I have made it more readable by adding whitespace:


Derive(
   [("out",
     "/nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz",
     "sha256",
     "21235e08552e6feba09ea5e8d750805b3391c62fb81c71a235c0044dc7a8a61b")],
   [("/nix/store/lhc0qhfdrw32rj1z7s5p90nbjfnkydhb-stdenv.drv",
     ["out"]),
    ("/nix/store/pawry9l3415kwfbfh4zrhgnynwfb10bs-mirrors-list.drv",
     ["out"])],

   ["/nix/store/01w11lngp8s4lxllyr6xbmjfyrfkrn43-builder.sh"],

   "x86_64-darwin",
   "/bin/bash",
   ["-e",
    "/nix/store/01w11lngp8s4lxllyr6xbmjfyrfkrn43-builder.sh"],

   [("buildInputs",""),
    ("buildNativeInputs",""),
    ("builder","/bin/bash"),
    ("id",""),
    ("impureEnvVars","http_proxy https_proxy ftp_proxy all_proxy no_proxy NIX_CURL_FLAGS NIX_HASHED_MIRRORS NIX_MIRRORS_apache NIX_MIRRORS_bitlbee NIX_MIRRORS_cpan NIX_MIRRORS_debian NIX_MIRRORS_fedora NIX_MIRRORS_gcc NIX_MIRRORS_gentoo NIX_MIRRORS_gnome NIX_MIRRORS_gnu NIX_MIRRORS_gnupg NIX_MIRRORS_hashedMirrors NIX_MIRRORS_imagemagick NIX_MIRRORS_kde NIX_MIRRORS_kernel NIX_MIRRORS_metalab NIX_MIRRORS_oldsuse NIX_MIRRORS_opensuse NIX_MIRRORS_postgresql NIX_MIRRORS_savannah NIX_MIRRORS_sf NIX_MIRRORS_sourceforge NIX_MIRRORS_ubuntu NIX_MIRRORS_xorg"),
    ("mirrorsFile","/nix/store/mmk41rbja1fvclbr7ghirzcigxlzl6f0-mirrors-list"),
    ("name","zlib-1.2.6.tar.gz"),
    ("out","/nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz"),
    ("outputHash","06x6m33ls1606ni7275q5z392csvh18dgs55kshfnvrfal45w8r1"),
    ("outputHashAlgo","sha256"),
    ("preferHashedMirrors","1"),
    ("preferLocalBuild","1"),
    ("propagatedBuildInputs",""),
    ("propagatedBuildNativeInputs",""),
    ("showURLs",""),
    ("stdenv","/nix/store/9fnvs0bvhrszazham5cnl13h52hvm1rk-stdenv"),
    ("system","x86_64-darwin"),
    ("urls","http://www.zlib.net/zlib-1.2.6.tar.gz mirror://sourceforge/libpng/zlib/1.2.6/zlib-1.2.6.tar.gz")])

If that looks like a computational expression in a programming language, that's because it is. Don't worry, it's not something you are expected to write yourself: these expressions are created from the package definitions written in a more user-friendly syntax called "Nix expressions", which is very well described in the Nix documentation. The expression shown above defines how to make (or "realise" in Nix jargon) the derivation /nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz, which is a rather simple one because the file is simply downloaded and verified against a known checksum. But even such a simple derivation has dependencies: the "standard environment" stdenv and the list of download mirror sites, mirrors-list.

It's time to say something about those funny 32-character prefixes in all the file names in the Nix store. You may have noticed that the zlib file list above contains two entries for zlib-1.2.6.drv that are identical except for this prefix. It looks as if the prefix is there to distinguish things that would otherwise be identical. This is true, and the information encoded in the prefix (which is a hash code) is the complete set of dependencies. The two zlib derivations differ in the version of the standard environment they were built with. I have both of these in my Nix store because I have played around with different releases of Nixpkgs. Nix really tries to keep track of every single dependency, including the exact versions of the various tools (mainly compilers) that were used in building a binary installation. That means you can keep lots of different versions of every single item on your system at the same time, and trace back exactly how they were built. You can also send a copy of the relevant derivation files (those with the .drv extension) to someone else, who can reproduce the exact same environment by "realising" those derivations again.
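The principle behind those prefixes can be illustrated with a few lines of Python. This is only a toy sketch: the function store_path, the /toy/store location, and the compiler names are made up for illustration, and the hashing scheme is not the one Nix actually uses. What it shows is that a prefix computed from all build inputs changes as soon as any dependency changes, even indirectly.

```python
import hashlib

def store_path(name, builder, inputs):
    """Return a toy store path whose prefix depends on every build input.

    `inputs` is a list of store paths of the dependencies.
    NOT Nix's real hashing scheme, only an illustration of the idea.
    """
    h = hashlib.sha256()
    for item in [name, builder] + sorted(inputs):
        h.update(item.encode("utf-8"))
        h.update(b"\0")  # separator, so ("ab", "c") differs from ("a", "bc")
    return "/toy/store/%s-%s" % (h.hexdigest()[:32], name)

# Two hypothetical standard environments built with different compilers.
stdenv_old = store_path("stdenv", "setup.sh", ["gcc-4.5"])
stdenv_new = store_path("stdenv", "setup.sh", ["gcc-4.6"])

# The same zlib version, built with each of the two environments.
zlib_old = store_path("zlib-1.2.6", "builder.sh", [stdenv_old])
zlib_new = store_path("zlib-1.2.6", "builder.sh", [stdenv_new])

print(zlib_old)
print(zlib_new)
```

The two printed paths end in the same name but carry different prefixes, which is the situation of the zlib-1.2.6.drv entries in the listing above.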

With so many zlibs floating around, which one does Nix use when you ask it to install some application that uses zlib? The one you specify. When some application requires zlib as a dependency, you have to tell Nix exactly which zlib derivation you want to be used. You don't normally do this manually for every single build (though you could); you'd rather use a coherent set of package definitions (such as Nixpkgs) that specifies all the interdependencies among hundreds of packages. The package definitions take the form of "Nix expressions", which are written in a language specifically designed for this purpose. Files containing Nix expressions have the extension .nix. Since the language is rather well documented in the Nix manual, I won't say any more about it here. A good starting point is to explore Nixpkgs. It helps to know that the central file is pkgs/top-level/all-packages.nix. This file imports the definitions of individual packages from their respective files and makes a consistent package collection from them. When you build a particular derivation from Nixpkgs, only the packages listed explicitly as its dependencies are available in the build environment that is set up specifically for this build operation. No "default library" (such as /usr/lib) is used at all.

There is one more layer to Nix, whose role is twofold: making it convenient for users to work with programs installed through Nix, and permitting the removal of packages that were installed but are no longer needed.
Let's start with the second aspect because it is the simpler one: packages can be removed as soon as nobody needs them any more. This requires a way to figure out which packages are still needed. Obviously the packages that some user on the system wants to access are "needed", and that's why cleanup is related to user profiles which I will cover in a minute. The remaining needed packages are the dependencies of other needed packages. So once we know the packages that all users put together request to use, we can figure out which packages can safely be deleted. This clean-up operation is called "garbage collection" and handled by the command nix-store --gc.
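The reasoning in this paragraph is a plain reachability computation. Here is a minimal sketch in Python, with made-up package names and a hypothetical dependency table; it says nothing about how nix-store --gc is actually implemented.

```python
def live_packages(roots, dependencies):
    """Return all packages reachable from `roots` via `dependencies`."""
    live = set()
    stack = list(roots)
    while stack:
        pkg = stack.pop()
        if pkg in live:
            continue
        live.add(pkg)
        # The dependencies of a needed package are needed as well.
        stack.extend(dependencies.get(pkg, ()))
    return live

def garbage(store, roots, dependencies):
    """Return the packages in `store` that nobody needs any more."""
    return set(store) - live_packages(roots, dependencies)

# Hypothetical store contents and dependency table.
dependencies = {
    "python": ["zlib-new", "openssl"],
    "openssl": ["zlib-new"],
    "old-tool": ["zlib-old"],
}
store = ["python", "openssl", "zlib-new", "zlib-old", "old-tool"]
roots = ["python"]  # what the user profiles refer to

print(sorted(garbage(store, roots, dependencies)))
# prints ['old-tool', 'zlib-old']
```

The roots play the role of the user profiles: whatever they refer to, plus everything those packages depend on, stays in the store.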

Nix user environments are managed using the command nix-env, and if you don't care about how Nix works, that command is the only one you may ever need. Each user has his/her own environment, of course, which consists mainly of a directory named $HOME/.nix-profile. That directory contains subdirectories called bin, lib, man etc. whose names should sound familiar. They contain nothing but symbolic links into the Nix store. These links define which package the user actually accesses, by putting $HOME/.nix-profile/bin on the PATH environment variable. When you use nix-env to install a package, Nix builds it and puts it into the Nix store (unless it's already there), and then creates symbolic links in your Nix profile, which may replace links to some different version of a package. It is important to understand that your user profile never enters into the build process of any Nix derivation. Your profile is exclusively for your own use and has no impact on Nix package management other than protecting the packages you use from being removed during garbage collection.

So much for a first report on my exploration of Nix. I will continue trying to get my computational environment built with Nix, so that I can start to explore how to use it for reproducible computations. Watch this space for news.

PS: After I published this post initially, the friendly people on the Nix mailing list pointed out some additional material for learning about Nix. First of all, there is Eelco Dolstra's thesis entitled "The Purely Functional Software Deployment Model", which is what you should read if you really want to know everything about Nix. There's also Sander van der Burg's blog which has some very detailed posts about Nix and what it can be used for. You could start with this introduction.

Unifying version control and dependency management for reproducible research

Konrad Hinsen — 2012-04-10

When the Greek philosopher Heraclitus pronounced his famous "πάντα ῥεῖ" (everything flows), he most probably was not thinking about software. But it applies to software as much as to other aspects of life: software is in perpetual change, being modified to remove bugs, add features, and adapt it to changing environments. The management of change is now a well-established part of software engineering, with the most emblematic tool being version control. If you are developing software without using version control, stop reading this immediately and learn about Mercurial or Git, the two best version control systems available today. That's way more important than reading the rest of this post.

Software developers use version control to keep track of the evolution of their software, to coordinate team development, and to manage experimental features. But version control is also of interest for software users: it permits them to refer to a specific version of a piece of software they use in a unique and reproducible way, even if that version is not the current one, nor perhaps even an official numbered release. In fact, official numbered releases are becoming a relic of the past. They make little sense in an Open Source universe where everyone has access to source code repositories under version control. In that situation, an official release is nothing but a bookmark pointing to a specific commit number. There is no need for a release number.

Why would you want to refer to a specific version of a piece of software, rather than always use the latest one? There are many reasons. As software evolves, some bugs get fixed but others sneak in. You may prefer the bugs you know to the ones that could surprise you. Sometimes later versions of some software are not fully compatible with their predecessors, be it by design or by mistake. And even if you want to use the very latest version at any time, you might still want to note which version you used for a specific application. In scientific computing, this is one of the fundamental principles of reproducible research: note carefully, and publish, the exact versions of all pieces of software that were used for obtaining any published research result. It's the only way for you and others to be able to understand exactly what happened when you look at your work many years later.
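For Python-based work, a first step towards noting the versions in use could look like the sketch below. It assumes a Python 3 interpreter recent enough to have importlib.metadata; the function name environment_record and the package names in the example call are arbitrary choices. Note that release numbers of installed packages are a much coarser description than a commit identifier or a full dependency list.

```python
import json
import platform
import sys
from importlib import metadata

def environment_record(packages):
    """Collect interpreter and package versions for later publication."""
    record = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
        "packages": {},
    }
    for name in packages:
        try:
            record["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record["packages"][name] = None  # not installed here
    return record

# Example call; the package names are only placeholders.
record = environment_record(["numpy", "pip"])
print(json.dumps(record, indent=2, sort_keys=True))
```

The resulting JSON text can be stored next to the computed results and published with them.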

Another undeniable reality of modern software, in particular in the Open Source universe, is that it's modular. Developers use other people's software, especially if it's well written and has the reputation of being reliable, rather than reinventing the wheel. The typical installation instructions of a piece of Open Source software start with a list of dependencies, i.e. packages you have to install before you can install the current one. And of course the packages in the dependency list have their own dependency list. The number of packages to install can be overwhelming. The difficulties of dependency management are so widespread that the term "dependency hell" has been coined to refer to them.

Systems programmers have come up with a solution to that problem as well: dependency management tools, better known as package managers. Such tools keep a database of what is installed and which package depends on which other ones. The well-known Linux distributions are based on such package managers, of which the ones developed by Debian and RedHat are the most popular ones and are now used by other distributions as well. For MacOS X, MacPorts and Fink are the two most popular package managers, and I suspect that the Windows world has its own ones.

One of the major headaches that many computer users face is that version management and dependency management don't cooperate. While most package managers permit stating a minimal version number for a dependency, they don't permit prescribing a precise version number. There is a good reason for this: the way software installation is managed traditionally on Unix systems makes it impossible to install multiple versions of the same package in parallel. If packages A and B both depend on C, but require different versions of it, there is simply no simple solution. Today's package managers sweep this problem under the rug and pretend that higher version numbers are always at least as good as their predecessors. They will therefore install the higher of the two version numbers required by A and B, forcing one of them to use a version different from its preference.
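The A, B, C situation can be spelled out in a few lines of Python. This is a deliberately simplified sketch: the function names and the version numbers are invented, and real package managers are far more elaborate. It only illustrates the "highest version wins" rule when there is a single installation slot per package.

```python
def resolve_single_slot(requirements):
    """Pick one version per dependency: the highest one anyone asks for."""
    chosen = {}
    for package, deps in requirements.items():
        for dep, version in deps.items():
            if dep not in chosen or version > chosen[dep]:
                chosen[dep] = version
    return chosen

def mismatches(requirements, chosen):
    """List (package, dependency) pairs that did not get what they asked for."""
    return sorted(
        (package, dep)
        for package, deps in requirements.items()
        for dep, version in deps.items()
        if chosen[dep] != version
    )

# Hypothetical requirements: A wants C 1.2, B wants C 2.0.
requirements = {
    "A": {"C": (1, 2)},
    "B": {"C": (2, 0)},
}
chosen = resolve_single_slot(requirements)
print(chosen)                            # prints {'C': (2, 0)}
print(mismatches(requirements, chosen))  # prints [('A', 'C')]
```

Package A ends up running against a version of C that its authors did not ask for, which is exactly the compromise described above.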

Anyone who has been using computers intensively for a few years has probably run into such a problem, which manifests itself by some program not working correctly any more after another one, seemingly unrelated, has been installed. Another variant is that an installation fails because some dependency is available in a wrong version. Such problems are part of "dependency hell".

This situation is particularly problematic for the computational scientist who cares about the reproducibility of computed results. At worst, verifying results from 2005 by comparing to results from 2009 can require two completely separate operating system installations running in separate virtual machines. Under such conditions, it is difficult to convince one's colleagues to adopt reproducible research practices.

While I can't propose a ready-to-use solution, I can point out some work that shows that there is hope for the future. One interesting tool is the Nix package manager, which works much like the package managers by Debian or RedHat, but permits installing multiple versions of the same package in parallel, and registers dependencies with precise versions. It could be used as a starting point for managing software for reproducible research, the main advantage being that it should work with all existing software. The next step would be to make each result dataset or figure a separate "package" whose complete dependency list (software and datasets) is managed by Nix with references to precise version numbers. I am currently exploring this approach; watch this space for news about my progress.

For a system even better suited to the needs of reproducible computational science, I refer to my own ActivePapers framework, which combines dependency management and version control for code and data with mechanisms for publishing code+data+documentation packages and re-use code from other publications in a secure way. I have to admit that it has a major drawback as well: it requires all code to run on the Java Virtual Machine (in order to guarantee portability and secure execution), which unfortunately means that most of today's scientific programs cannot be used. Time will tell if scientific computing will adopt some virtual machine in the future that will make such a system feasible in real life. Reproducible research might actually become a strong argument in favour of such a development.

Julia: a new language for scientific computing

Konrad Hinsen — 2012-04-04

New programming languages are probably invented every day, and even those that get developed and published are too numerous to mention. New programming languages developed specifically for science and engineering are very rare, however, and that's why such a rare event deserves some publicity. A while ago, I saw an announcement for Julia, which announces itself as "a fresh approach to technical computing". I couldn't resist the temptation to download, install, and test-drive it. Here are my first impressions.

The languages used today for scientific computing can be grouped into four categories:

Traditional compiled languages optimized for number crunching. The big player in this category is of course Fortran, but some recent languages such as X10, Chapel, or Fortress are trying to challenge it.

Rapid-development domain-specific languages, usually interpreted. Well-known examples are Matlab and R.

General-purpose statically compiled languages with libraries for scientific computing. C and C++ come to mind immediately.

General-purpose dynamic languages with libraries for scientific computing. The number one here is Python with its vast library ecosystem.

What sets Julia apart is that it sits somewhere between the first two categories. It's compiled, but fully interactive: there is no separate compilation phase. It is statically typed, allowing for efficient compilation, but also has the default type "Any" that makes it work just like dynamically typed languages in the absence of type declarations. Type inference makes the mix even better. If that sounds like the best of both worlds, it actually is. It has been made possible by modern code transformation techniques that don't really fit into the traditional categories of "compilers" and "interpreters". Like many other recent languages and language implementations, Julia uses LLVM as its infrastructure for these code transformations.

Julia has a well-designed type system with a clear orientation towards maths and number crunching: there is support for complex numbers, and first-class array support. What may seem surprising is that Julia is not object-oriented. This is neither an oversight nor a nostalgic return to the days of Fortran 77, but a clear design decision. Julia has type hierarchies and function polymorphism with dispatch on the types of all arguments. For scientific applications (and arguably for some others), this is more useful than OO style method dispatch on a single value.

Another unusual feature of Julia is a metaprogramming system that is very similar to Lisp macros, although it is slightly more complicated by the fact that Julia has a traditional syntax layer, whereas Lisp represents code by data structures.

So much for a summary of the language. The real question is: does it live up to its promises? Before I try to answer that question, I would like to point out that Julia is a young language that is still in flux and for now has almost no development tool support. For many real-life problems, there is no really good solution at the moment, but it is clear that a good solution can be provided; it just needs to be done. What I am trying to evaluate is not if Julia is ready for real-life use (it is not), but whether there are any fundamental design problems.

The first question I asked myself is how well Julia can handle non-scientific applications. I just happened to see a blog post by John D. Cook explaining why it's preferable to write math in a general-purpose language than to write non-math in a math language. My experience is exactly the same, and that's why I have adopted Python for most of my scientific programming. The point is that any non-trivial program sooner or later requires solving non-math problems (I/O, Web publishing, GUIs, ...). If you use a general-purpose language, you can usually just pick a suitable library and go ahead. With math-only languages such as Matlab, your options are limited, with interfacing to C code sometimes being the only way out.

So is it feasible to write Web servers or GUI libraries in Julia? I would say yes. All the features of general-purpose languages are there or under consideration (I am thinking in particular of namespaces there). With the exception of systems programming (device drivers and the like), pretty much every programming problem can be solved in Julia with no more effort than in most other languages. The real question is if it will happen. Julia is clearly aimed at scientists and engineers. It is probably good enough for doing Web development, but it has nothing to offer for Web developers compared to well-established languages. Will scientists and engineers develop their own Web servers in Julia? Will Web developers adopt Julia? I don't know.

A somewhat related question is that of interfacing to other languages. That's a quick way to make lots of existing code available. Julia has a C interface (which clearly needs better tool support, but I am confident that it will come), which can be used for other sufficiently C-like languages. It is not clear what effort will be required to interface Julia with languages like Python or Ruby. I don't see why it couldn't be done, but I can't say yet whether the result will be pleasant to work with.

The second question I explored is how well Julia is suited to my application domain, which is molecular simulations and the analysis of experimental data. Doing molecular simulation in Julia looks perfectly feasible, although I didn't really implement any non-trivial algorithm yet. What I concentrated on first is data analysis, because that's where I could profit most from Julia's advantages. The kinds of data I mainly deal with are (1) time series and frequency spectra and (2) volumetric data. For time series, Julia works just fine. My biggest stumbling block so far has been volumetric data.

Volumetric data is usually stored in a 3-dimensional array where each axis corresponds to one spatial dimension. Typical operations on such data are interpolation, selection of a plane (2-d subarray) or line (1-d subarray), element-wise multiplication of volume, plane, or line arrays, and sums over selected regions of the data. Using the general-purpose array systems I am familiar with (languages such as APL, libraries such as NumPy for Python), all of this is easy to handle.

Julia's arrays are different, however. Apparently the developers' priority was to make the transition to Julia easy for people coming from Matlab. Matlab is based on the principle that "everything is a matrix", i.e. a two-dimensional array-like data structure. Matlab vectors come in two flavors, row and column vectors, which are actually matrices with a single row or column, respectively. Matlab scalars are considered 1x1 matrices. Julia is different because it has arrays of arbitrary dimension. However, array literals are made to resemble Matlab literals, and array operations are designed to behave as similarly as possible to Matlab operations, in particular for linear algebra functions. In Julia, as in Matlab, matrix multiplication is considered more fundamental than elementwise multiplication of two arrays.

For someone used to arrays that are nothing more than data structures, the result looks a bit messy. Here are some examples:


julia> a = [1; 2]
[1, 2]

julia> size(a)
(2,)

julia> size(transpose(a))
(1,2)

julia> size(transpose(transpose(a)))
(2,1)

I'd expect that the transpose of the transpose is equal to the original array, but that's not the case. But what does transpose do to a 3d array? Let's see:


julia> a = [x+y+z | x=1:4, y=1:2, z = 1:3]
4x2x3 Int64 Array:
...

julia> transpose(a)
no method transpose(Array{Int64,3},)
 in method_missing at base.jl:60

OK, so it seems this was not considered important enough, but of course that can be fixed.

Next comes indexing:


julia> a = [1 2; 3 4]
2x2 Int64 Array:
 1  2
 3  4

julia> size(a)
(2,2)

julia> size(a[1, :])
(1,2)

julia> size(a[:, 1])
(2,1)

julia> size(a[1, 1])
()

Indexing a 2-d array with a single number (all other indices being the all-inclusive range :) yields a 2-d array. Indexing with two number indices yields a scalar. So how do I extract a 1-d array? This generalizes to higher dimensions: if the number of number indices is equal to the rank of the array, the result is a scalar, otherwise it's an array of the same rank as the original.

Array literals aren't that frequent in practice, but they are used a lot in development, for quickly testing functions. Here are some experiments:


julia> size([1 2])
(1,2)

julia> size([1; 2])
(2,)

julia> size([[1;2] ; [3;4]])
(4,)

julia> size([[1;2] [3;4]])
(2,2)

julia> size([[1 2] [3 4]])
(1,4)

julia> size([[[1 2] [3 4]] [[5 6] [7 8]]])
(1,8)

Can you guess the rules? Once you have them (or looked them up in the Julia manual), can you figure out how to write a 3-d array literal? I suspect it's not possible.

Next, summing up array elements:


julia> sum([1; 2])
3

julia> sum([1 2; 3 4])
10

Apparently sum doesn't care about the shape of my array, it always sums the individual elements. Then how do I do a sum over all the rows?

I have tried to convert some of my basic data manipulation code from Python/NumPy to Julia, but found that I always spent most of the time fighting against the built-in array operations, which are clearly not made for my kind of application. In some cases a change of attitude may be sufficient. It seems natural to me that a plane extracted from volumetric data should be a 2-d array, but maybe if I decide that it should be a 3-d array of "thickness" 1, everything will be easy.

I haven't tried yet, because I know there are cases that cannot be dealt with in that way. Suppose I have a time series of volumetric data that I store in a 4-d array. Obviously I want to be able to apply functions written for static volumetric data (i.e. 3-d arrays) to an element of such a time series. Which means I do need a way to extract a 3-d array out of a 4-d array.

I hope that what I need is there and I just didn't find it yet. Any suggestions are welcome. For now, I must conclude that test-driving Julia is a frustrating experience: the language holds so many promises, but fails for my needs due to superficial but practically very important problems.

Binary operators in Python

Konrad Hinsen — 2012-03-29

A two-hour train journey provided the opportunity to watch the video recording of the Panel with Guido van Rossum at the recent PyData Workshop. The lengthy discussion about PEP 225 (which proposes to add additional operators to Python that would make it possible to have both elementwise and aggregate operations on the same objects, in particular for providing both matrix and elementwise multiplication on arrays with a nice syntax) motivated me to write up my own thoughts about what's wrong with operators in Python from my computational scientist's point of view.

The real problem I see is that operators map to methods. In Python, a*b is just syntactic sugar for a.__mul__(b). This means that it's the type of a that decides how to do the multiplication. The method implementing this operation can of course check the type of b, and it can even decide to give up and let b handle everything, in which case Python does b.__rmul__(a). But this is just a kludge to work around the real weakness of the operators-map-to-methods approach. Binary operators fundamentally require a dispatch on both types, the type of a and the type of b. What a*b should map to is __builtins__.__mul__(a, b), a global function that would then implement a binary dispatch operation. Implementing that dispatch would in fact be the real problem to solve, as Python currently has no multiple dispatch mechanisms at all.
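To make the difference concrete, here is a small sketch. The first part uses Python's actual protocol (__mul__, then __rmul__ when the left operand returns NotImplemented). The second part is a toy table-based dispatch on both argument types; the names mul, register_mul and the class Meters are invented for this illustration, and the exact-type lookup ignores inheritance, which a real multiple dispatch system would have to handle.

```python
class Meters:
    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        # Called for Meters * something: the left operand decides.
        if isinstance(other, (int, float)):
            return Meters(self.value * other)
        return NotImplemented  # give up, Python then tries other.__rmul__

    def __rmul__(self, other):
        # Called for something * Meters after the left operand gave up.
        if isinstance(other, (int, float)):
            return Meters(other * self.value)
        return NotImplemented

# A toy dispatch table keyed on the types of BOTH operands.
_mul_table = {}

def register_mul(left_type, right_type):
    def decorator(func):
        _mul_table[(left_type, right_type)] = func
        return func
    return decorator

def mul(a, b):
    try:
        func = _mul_table[(type(a), type(b))]
    except KeyError:
        raise TypeError("no mul for %s and %s"
                        % (type(a).__name__, type(b).__name__))
    return func(a, b)

@register_mul(Meters, int)
def _meters_times_int(a, b):
    return Meters(a.value * b)

@register_mul(int, Meters)
def _int_times_meters(a, b):
    return Meters(a * b.value)

print((Meters(2) * 3).value, (3 * Meters(2)).value)      # prints 6 6
print(mul(Meters(2), 3).value, mul(3, Meters(2)).value)  # prints 6 6
```

With the table, both orders of the operands are handled symmetrically, and the rules for a pair of types can live in whichever library knows about both of them.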

But would multiple dispatch solve the issue addressed by PEP 225? Not directly. But it would make some of the alternatives mentioned there feasible. A proper multiple dispatch system would allow NumPy (or any other library) to decide what multiplication of its own objects by a number means, no matter if the number is the first or the second factor.

More importantly, multiple dispatch would allow a major cleanup of many scientific packages, including NumPy, and even clean up the basic Python language by getting rid of __rmul__ and friends. NumPy's current aggressive handling of binary operations is actually more of a problem for me than the lack of a nice syntax for matrix multiplication.

There are many details that would need to be discussed before binary dispatch could be proposed as a PEP. Of course the old method-based approach would need to remain in place as a fallback, to ensure compatibility with existing code. But the real work is defining a good multiple dispatch system that integrates well with Python's dynamical type system and allows the right kind of extensibility. That same multiple dispatch method could then also be made available for use in plain functions.

Python becomes a platform

Konrad Hinsen — 2012-03-15

The recent announcement of clojure-py made some noise in the Clojure community, but not, as far as I can tell, in the Python community. For those who haven't heard of it before, clojure-py is an implementation of the Clojure language in Python, compiling Clojure code to bytecode for Python's virtual machine. It's still incomplete, but already usable if you can live with the subset of Clojure that has been implemented.

I think that this is an important event for the Python community, because it means that Python is no longer just a language, but is becoming a platform. One of the stated motivations of the clojure-py developers is to tap into the rich set of libraries that the Python ecosystem provides, in particular for scientific applications. Python is thus following the path that Java already went in the past: the Java virtual machine, initially designed only to support the Java language, became the target of many different language implementations which all provide interoperation with Java itself.

It will of course be interesting to see if more languages will follow once people realize it can be done. The prospect of speed through PyPy's JIT, another stated motivation for the clojure-py community, could also get more language developers interested in Python as a platform.

Should Python programmers care about clojure-py? I'd say yes. Clojure is strong in two areas in which Python isn't. One of them is metaprogramming, a feature absent from Python which Clojure had from the start through its Lisp heritage. The other feature is persistent immutable data structures, for which clojure-py provides an implementation in Python. Immutable data structures make for more robust code, in particular but not exclusively for concurrent applications.

Teaching parallel computing in Python

Konrad Hinsen — 2012-02-06

Every time I teach a class on parallel computing with Python using the multiprocessing module, I wonder if multiprocessing is really mature enough that I should recommend using it. I end up deciding for it, mostly because of the lack of better alternatives. But I am not happy at all with some features of multiprocessing, which are particularly nasty for non-experts in Python. That category typically includes everyone in my classes.

To illustrate the problem, I'll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:

from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))

Do you see the two bugs in this example? Look again. No, it's nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn't.

Imagine your typical student typing this script and running it. Here's what happens:


Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new

Python experts will immediately see what's wrong: numpy.sqrt is not picklable. This is mostly a historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, so it's not possible. Implementing it today within the framework of Python's existing pickle protocol is unfortunately not trivial, and that's why it hasn't been implemented.

Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn't hurt of course if they learn a bit about pickling, since it also has a performance impact on parallel programs. But due to restrictions such as this one, you have to explain this right at the start, although it would be better to leave this for the "advanced topics" part.

OK, you have passed the message, and your students fix the script:


from multiprocessing import Pool
import numpy

pool = Pool()

def square_root(x):
    return numpy.sqrt(x)

print pool.map(square_root, range(100))

And then run it:


Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 self.run()
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'

At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its process pools. And since the answer (on Unix systems) is "fork", you have to have a pretty good idea of Unix process creation to see the cause of the error: the worker processes are forked at the moment Pool() is called, so they only know what was defined before that line, and square_root was not. This then allows you to find a trivial fix:


from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

pool = Pool()

print pool.map(square_root, range(100))

Success! It works! But... how do you explain this to your students?

To make matters worse, this script works but is still not correct: it has a portability bug because it doesn't work under Windows. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions in multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.
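For reference, here is a minimal sketch of a variant that should also run under Windows. It rests on several assumptions that are not in the scripts above: it is written for Python 3, it uses math.sqrt instead of numpy.sqrt so that it needs nothing beyond the standard library, and the helper name parallel_square_roots is made up for the illustration. The two points that matter are that the worker function is defined at module level before the pool is created, and that the pool is only started behind the `if __name__ == "__main__":` guard.

```python
import math
from multiprocessing import Pool


def square_root(x):
    # Defined at module level, so worker processes can look it up by name.
    return math.sqrt(x)


def parallel_square_roots(values, processes=2):
    # Hypothetical helper: the pool is created only after square_root exists.
    with Pool(processes=processes) as pool:
        return pool.map(square_root, values)


if __name__ == "__main__":
    # Worker processes that re-import this module (as happens under Windows)
    # skip this block, so they do not try to start pools of their own.
    print(parallel_square_roots(range(10)))
```

Note that this does not make the abstraction any less leaky: the guard and the ordering of definitions are still implementation details that the user has to know about.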

These issues with multiprocessing are a classical case of a leaky abstraction: multiprocessing provides a "pool of worker processes" abstraction to the programmer, but in order to use it, the programmer has to understand the implementation. In my opinion, it would be preferable to have a less shiny API, but one which reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until this really happens, I'd prefer an API that does not suggest possibilities that don't exist.

I have actually thought about this myself a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions that implement tasks executed in parallel to be methods of a single class, or functions of a single module. My API also contains an explicit "run parallel job now" call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.

A rant about mail clients

Konrad Hinsen — 2011-11-04

A while ago I described why I migrated my agendas from iCal to orgmode. To sum it up, my main motivation was to gain more freedom in managing my information: where iCal imposes a rigid format for events and insists on storing them in its own database, inaccessible to other programs, orgmode lets me mix agenda information with whatever else I like in plain text files. Today's story is a similar one, but without the happy end. I am as much fed up with mail clients as I was with iCal, and for much the same reasons, but I haven't yet found anything I could migrate to.

From an information processing point of view, an e-mail message is not very different from lots of other pieces of data. It's a sequence of bytes respecting a specific format (defined by a handful of standards) to allow its unambiguous interpretation by various programs in the processing chain. An e-mail message can perfectly well be stored in a file, and in fact most e-mail clients permit saving a message to a file. Unfortunately, the number of e-mail clients able to open and correctly display such a file is already much smaller. But when it comes to collections of messages, information processing freedom ends completely.

Pretty much every mail client's point of view is that all of a user's mail is stored in some database, and that it (the client) is free to handle this database in whatever way it likes. The user's only access to the messages is the mail client. The one and only. The only exception is server-based mail databases handled via the IMAP protocol, where multiple clients can work with a common database. If you don't use IMAP, you have no control over how and where your mail is stored, who has access to it, etc.

What I'd like to do is manage mail just like I manage other files. A mailbox should just be a directory containing messages, one per file. Mailboxes could be stored anywhere in the file system. Mailboxes could be shared through the file system, and backed up via the file system. They could be grouped with whatever other information in whatever way that suits me. I would double-click on a message to view it, or double-click on a mailbox directory to view a summary, sorted in the way I like it. Or I would use command-line tools to work on a message or a mailbox. I'd pick the best tool for each job, just like I do when working with any other kind of file.

Why all that isn't possible remains a mystery to me. The technology has been around for decades. The good old Maildir format would be just fine for storing mailboxes anywhere in the file system, as would the even more venerable mbox format. But even mail clients that use mbox or Maildir internally insist that all such mailboxes must reside in a single master directory. Moreover, they won't let me open a mailbox from outside; I have to run the mail client and work through its hierarchical presentation of mailboxes to get to my destination.
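To illustrate how little is needed on the storage side, here is a minimal sketch in Python 3 that treats an arbitrary directory as a Maildir mailbox and lists the subjects of its messages, using only the standard library's mailbox module. The function names, the addresses, and the temporary directory are made up for the demonstration; this is not a substitute for a mail client, just a sign that the file-system approach is workable.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def list_subjects(maildir_path):
    """Return the sorted subjects of all messages in a Maildir directory."""
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return sorted(message["Subject"] or "(no subject)" for message in box)
    finally:
        box.close()


def demo():
    # Build a throw-away mailbox somewhere in the file system, then read it
    # back the way any independent tool could.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "project-mail")
        box = mailbox.Maildir(path, create=True)
        for subject in ("Meeting notes", "Draft paper"):
            message = EmailMessage()
            message["From"] = "alice@example.org"  # illustrative address
            message["To"] = "bob@example.org"      # illustrative address
            message["Subject"] = subject
            message.set_content("Plain-text body.")
            box.add(message)
        box.close()
        return list_subjects(path)


if __name__ == "__main__":
    print(demo())
```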

Before I get inundated by comments pointing out that mail client X has feature Y from the list above: Yes, I know, there are small exceptions here and there. But unless I have the complete freedom to put my mail where I want it, the isolated feature won't do me much good. If someone knows of a mail client that has all the features I am asking for, plus the features we all expect from a modern mail client, then please do leave a comment!

EuroSciPy 2011

Konrad Hinsen — 2011-08-30

Another EuroSciPy conference is over, and like last year it was very interesting. Here is my personal list of highlights and comments.

The two keynote talks were particularly inspiring. On Saturday, Marian Petre reported on her studies of how people in general and scientists in particular develop software. The first part of her presentation was about how "experts" design and implement software, the definition of an expert being someone who produces software that actually works, is finished on time, and doesn't exceed the planned budget. The second part was about the particularities of software development in science. But perhaps the most memorable quote of the keynote was Marian's reply to a question from the audience about how to deal with unreasonable decisions coming from technically less competent managers. She recommended learning how to manage management, a phrase that I heard repeated several times in discussions throughout the conference.

The Sunday keynote was given by Fernando Perez. As was to be expected, IPython was his number one topic and there was a lot of new stuff to show off. I won't mention all the new features in the recently released version 0.11 because they are already discussed in detail elsewhere. What I find even more exciting is the new Web notebook interface, available only directly from the development site at github. A notebook is a trace of an interactive session that can be edited, saved, stored in a repository, or shared with others. It contains inputs and outputs of all commands. Inputs are cells that can consist of more than one line. Outputs are by default what Python prints to the terminal, but IPython provides a mechanism for displaying specific types of objects in a special way. This makes it possible to show images (in particular plots) inline, but also to turn SymPy expressions into mathematical formulas typeset in LaTeX.

A more alarming aspect of Fernando's keynote was his statistical analysis of contributions to the major scientific libraries of the Python universe. In summary, the central packages are maintained by a grand total of about 25 people in their spare time. This observation caused a lot of debate, centered around how to encourage more people to contribute to this fundamental work.

Among the other presentations, as usual mostly of high quality, the ones that impressed me most were Andrew Straw's presentation of ROS, the Robot Operating System, Chris Myers' presentation about SloppyCell, and Yann Le Du's talk about large-scale machine learning running on a home-made GPU cluster. Not to forget the numerous posters with lots more interesting stuff.

For the first time, EuroSciPy was complemented by domain-specific satellite meetings. I attended PyPhy, the Python in Physics meeting. Physicists are traditionally rather slow in accepting new technology, but the meeting showed that a lot of high-quality research is based on Python tools today, and that Python has also found its way into physics education at various universities.

Finally, conferences are good also because of what you learn during discussions with other participants. During EuroSciPy, I discovered a new scientific journal called Open Research Computation, which is all about software for scientific research. Scientific software developers regularly complain about the lack of visibility and recognition that their work receives from the scientific community and in particular from evaluation and grant attribution committees. A dedicated journal might just be what we need to improve the situation. I hope this will be a success.

Executable Papers

Konrad Hinsen — 2011-06-03

The last two days I participated in the "Executable Papers workshop" at this year's ICCS conference. It was not just another workshop among the many ICCS workshops. The participants had all submitted a proposal to the "Executable Paper Grand Challenge" run by Elsevier, one of the biggest scientific publishers. On the first day, the nine finalists presented their work, and on the second day, the remaining accepted proposals were presented.

The term "executable papers" stands for the expected next revolution in scientific publishing. The move from printed journals to electronic on-line journals (or a combination of both) has changed little for authors and readers. It is the libraries that have seen the largest impact because they now do little more than pay subscription fees. Readers obtain papers as PDF files directly from the publishers' Web sites. The one change that does matter to scientists is that most journals now offer to distribute "supplementary material" in addition to the main paper. This can in principle be any kind of file; in practice it is mostly used for additional explanations, images, and tables, i.e. to keep the main paper shorter. Occasionally there are also videos, a first step towards exploring the new possibilities opened up by electronic distribution. The step to executable papers is a much bigger one: the goal is to integrate computer-readable data and executable program code together with the text part of a paper. The goals are a richer reader experience (e.g. interactive visualizations), verifiability of results by both referees and readers (by re-running part of the computations described in the paper), and re-use of data and code in later work by the same or other authors. There is some overlap in these goals with the "reproducible research" movement, whose goal is to make computational research reproducible by providing tools and methods that make it possible to store a trace of everything that entered into some computational procedure (input data, program code, description of the computing environment) such that someone else (or even the original author a month later) can re-run everything and obtain the same results. The new aspect in executable papers is the packaging and distribution of everything, as well as the handling of bibliographic references.

The proposals' variety mostly reflected the different backgrounds of the presenters. A mathematician documenting proofs obviously has different needs than an astrophysicist simulating a supernova on a supercomputer. Unfortunately this important aspect was never explicitly discussed. Most presenters did not even mention their field of work, much less what it implies in terms of data handling. This was probably due to the enormous time pressure; 15 to 20 minutes for a presentation plus demonstration of a complex tool was clearly not enough.

The proposals could roughly be grouped into three categories:

Web-based tools that permit the author to compose his executable paper by supplying data, code, and text, and permit the reviewer and reader to consult this material and re-run computations.

Systems for preserving the author's computational environment in order to permit reviewers and readers to use the author's software with little effort and without any security risks.

Semantic markup systems that make parts of the written text interpretable by a computer for various kinds of processing.

Some proposals covered two of these categories but with a clear emphasis on one of them. For the details of each proposal, see the ICCS proceedings, which are freely available.

While it was interesting to see all the different ideas presented, my main impression of the Executable Paper Workshop is that of a missed opportunity. Having all those people who had thought long and hard about the various issues in one room for two days would have been a unique occasion to make progress towards better tools for the future. In fact, none of the solutions presented covers the needs of all the domains of computational science. They make assumptions about the nature of the data and the code that are not universally valid. One or two hours of discussion might have helped a lot to improve everyone's tools.

The implementation of my own proposal, which addresses the questions of how to store code and data in a flexible, efficient, and future-proof way, is available here. It contains a multi-platform binary (MacOS, Linux, Windows, all on the x86 platform) and requires version 6 of the Java Runtime Environment. The source code is also included, but there is no build system at the moment (I use a collection of scripts that have my home-directory hard-coded in lots of places). There is, however, a tutorial. Feedback is welcome!

Text input on mobile devices

Konrad Hinsen — 2011-05-02

I have been using mobile (pocket size) computers for about 15 years, starting with the Palm Pilot. Currently I use an Android smartphone (Samsung Galaxy S). While mobile devices are mostly used for consulting rather than for entering information, text entry has always been a hot topic of debate.

Apple's Newton Messagepad, probably the first mobile computing device in the modern sense, pursued the ambitious goal of handwriting recognition. It was both an impressive technical achievement and a practical failure. I don't think anyone ever managed to use the Newton's handwriting recognition satisfactorily in daily life.

The Palm Pilot had a more modest but also more achievable goal: its Graffiti technology was based on single letter recognition with simplified letter shapes. It took a while to become fluent with Graffiti, but many people managed and I don't remember anyone complaining about the learning curve.

I don't remember when I first saw a miniature QWERTY keyboard on the screen of a mobile device, but it may well have been on one of the first iPhones. I was definitely not enthusiastic about it. The keys are much too small for touch-typing, and the layout was already a bad choice for desktop computer keyboards. The only argument in its favor is familiarity, but is that a good enough reason to cripple oneself for a long time to come?

When I got my Android phone, I was rapidly confronted with this issue in practice. Samsung left me the choice between the standard keyboard and Swype. Both had the same problem: too small keys for my fingers. I turned to the Android market and found many more QWERTY keyboards. And... Graffiti, my old friend from my Palm days. What a relief!

Of course, my phone is not a Palm. The biggest difference is that the Palm had a stylus whereas today's smartphones are meant to be manipulated with the fingers. But Graffiti works surprisingly well without a stylus. I find that I can write about equally well with the index or the thumb. Graffiti definitely is a good choice for Android, especially for Palm veterans.

Recently I discovered another alternative input method, and I like it enough that I might end up preferring it over Graffiti. It's called MessagEase and it consists of a 3x3 grid of comfortably large keys that display the 9 most frequent characters. The remaining characters, plus punctuation etc., are available by drawing lines outward from the center of a key. The technique doesn't require much time to master, but writing fluently requires a lot of practice because the layout needs to be memorized.

I started using MessagEase about two weeks ago and have reached about the same speed I get with Graffiti. I wrote this whole article with MessagEase as a real-life exercise. Time will tell if I actually get faster than with Graffiti, but MessagEase definitely is a serious candidate for mobile texting in the post-QWERTY era. If you have an Android phone or an iPhone, give it a try.

Bye bye iCal, welcome org-mode

Konrad Hinsen — 2011-01-04

I have been using Macintosh computers since 2003, and overall I have been happy with the personal information management (PIM) tools provided by Apple: AddressBook, Mail, Safari (for bookmark management). The one tool I have never liked is iCal. Its user interface is fine for consulting my agenda, but entering information is too complicated and the todo-list management is particularly clumsy. But more importantly, I regularly found myself wanting to add information for which no entry field was provided. I ended up putting it into the "notes" section, or leaving it out. Another unpleasant feature of iCal is that all the information is stored in a complex proprietary database, making synchronization between several computers impossible except through cloud-based server solutions such as Apple's MobileMe (quite expensive) or fruux (much nicer in my opinion, but it still requires trusting your data to a cloud service).

Being unhappy with a tool for an important task implies looking for better options, but I didn't find anything that I liked. Until one day I discovered, mostly by accident, the org-mode package that has been distributed with Emacs for a while. org-mode is one of those pieces of software that is so powerful that it is difficult to describe to someone who has never used it. Basically, org-mode uses plain text files with a special lightweight markup syntax for things like todo items or time stamps (but there is a lot more), and then provides sophisticated and very configurable functions for working with this data. It can be used for keeping agendas, todo lists, journals, simple databases such as bookmark lists, spreadsheets, and much more. Most importantly, all of these can coexist in a single text file if you want, and the contents of this file can be structured in any way you like. You can even add pieces of executable code and thus use org-mode for literate programming, but that's a topic for another post.

To be more concrete, my personal information database in org-mode consists of several files at the top level: work.org for organizing my workday, home.org for tasks and appointments related to private life, research.org for notes about research projects, programming.org for notes (mostly bookmarks) about software development, etc. Inside my work.org, there is a section on research projects, one on teaching, one on my editorial work for CiSE, one for refereeing, etc. Inside each of these sections, there are agenda entries (seminars, meetings, courses etc.) and todo entries with three priority levels and optional deadlines. Any of them can be accompanied by notes of any kind, including links, references to files on my disk, and even executable shell commands. There is no limit to what you store there.

In October 2010 I started the transition from iCal to org-mode. Initially I entered all data twice, to make sure I could continue to rely on iCal. After a week I was confident enough to enter everything just once, using org-mode. I then transferred all agenda items for 2011 to org-mode and decided to stop using iCal on January 1, 2011. That day has arrived, and the iCal icon has disappeared from my dock. Without any regrets.

Conclusion: If you need a powerful PIM system and you don't fear Emacs, have a look at org-mode.

The future of Python

Konrad Hinsen — 2010-07-19

I have received a number of questions and remarks about my keynote talk at EuroSciPy 2010, ranging from questions about technical details to an inquiry about the release date of Python 4.0! Rather than writing lengthy replies to everyone, I will try to address all these issues here.

First of all, my intentions behind the keynote were:

Encourage scientists to look at new tools and developments that I believe to be important in the near future (Python 3, Cython) and at others that might become important to scientific applications (JIT compilers, alternative implementations).

Make computational scientists think about future commodity hardware (which is what we use most of the time) and its implications for programming, in particular the generalization of massively parallel computing.

Show that easy-to-use parallel programming paradigms, in particular deterministic ones, exist today. Computational scientists need to realize that MPI and OpenMP are not the last word on parallel programming.

Make my ideas concrete by showing how they could be implemented in Python.

My "Python 4.0" is completely fictitious and will probably never exist in exactly that form. However, it is important to realize that it could be implemented right now. With the GIL-free Python implementations (Jython, IronPython), it would even be rather straightforward to implement. For CPython, any implementation not removing the GIL would probably be too inefficient to be of practical interest.

Most of the ingredients for implementing my "Python 4.0" are well-known and have already been used in other languages or libraries:

The "declarative concurrency" programming paradigm has been used in Oz and FlowJava, but to the best of my knowledge not in any mainstream programming language. It is explained very well in the book Concepts, Techniques, and Models of Computer Programming, by Peter van Roy and Seif Haridi, and also in the freely downloadable essay Programming paradigms for dummies. Basically, it is the functional paradigm extended with annotations that identify computations to be done in parallel. Remove those annotations, and you get plain functional programs that yield the same results. Declarative concurrency is free of deadlocks and race conditions, which I think is a critical property for any parallel programming paradigm to be considered for a high-level language such as Python. Another nice feature of declarative concurrency is that data parallelism, including nested data parallelism, is a special case that can be implemented on top of it. Data parallelism is a useful paradigm for many scientific applications.

Futures are asynchronous tasks provided as library functions for various languages. A Python library for futures is the subject of PEP 3148; an implementation is already available.

The importance of effect-free functions for all kinds of code transformations (automatic or not) is widely recognized. It is equally recognized that a useful program needs to have effects. The two basic approaches to dealing with this contradiction are (a) allow effects, but make it as easy as possible not to use them to encourage a mostly effect-free style and (b) design a language without effects (pure functional language) and provide a mechanism to put in effects with special annotation or syntax but clearly as an exceptional feature. The best-known language in the second category is Haskell with its use of monads for controlling effects. Most functional languages are in the first category.

Efficient data structures for functional programs have been a subject of research for quite a while and quite a few good ones are known. It would be straightforward to replace Python's tuple implementation by something more efficient in typical functional settings, or to add an efficient immutable dictionary implementation. The standard reference is Chris Okasaki's book Purely Functional Data Structures.

Futures may seem to provide most of what declarative concurrency promises, but this is not quite true. Futures are objects representing computations. They have a method that client code must call to wait for the result and retrieve it. Since waiting is an explicit operation on a standard object, it is easy to create a situation in which two futures wait for each other: a deadlock. This can only be avoided by not having futures accessible as standard objects. The language implementation must recognize futures as special and insert a wait call before any access to the value of the result. For this reason, declarative concurrency cannot be implemented as a library.
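The deadlock scenario can be sketched with the futures library that grew out of PEP 3148, which is available as concurrent.futures in Python 3. The example is an illustration under stated assumptions: the function name mutual_wait_demo and the use of a thread pool are my choices, and the timeout exists only so that the demonstration terminates. Without it, both tasks would wait for each other forever.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def mutual_wait_demo(timeout=0.5):
    """Start two futures that each wait explicitly for the other's result."""
    futures = {}
    both_submitted = threading.Event()

    def wait_for(other):
        # Make sure both futures exist before looking one of them up.
        both_submitted.wait()
        # Explicit, blocking wait on a standard future object.
        return futures[other].result(timeout=timeout)

    outcomes = {}
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures["a"] = executor.submit(wait_for, "b")
        futures["b"] = executor.submit(wait_for, "a")
        both_submitted.set()
        for name, future in futures.items():
            try:
                outcomes[name] = future.result()
            except TimeoutError:
                # Neither task could ever finish: each needs the other's value.
                outcomes[name] = "timed out"
    return outcomes


if __name__ == "__main__":
    print(mutual_wait_demo())
```

Because the future is an ordinary object that any code can hold and wait on, nothing in the library can rule out this cycle. That is the gap that language-level support would have to close.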

Another important condition for implementing declarative concurrency with futures is that code inside a future must be effect-free. Otherwise multiple concurrently running futures can modify the same object and create a race condition.

Probably the only truly original contribution in my "Python 4.0" scenario is the dynamically verified effect-free subset of the Python language. Most languages, even functional ones, provide no way for a compiler or a run-time system to verify that a given function is effect-free. Haskell is perhaps the only exception in having a static type system that can identify effect-free code. In Python, that is not a viable approach because everything is dynamic. But why not provide at least a run-time check for effect-free code where useful? It's still better to have a program crash with an exception saying "you did I/O in what should have been an effect-free function" than get wrong results silently.

Here is an outline of how such an approach could be implemented. Each function and method would have a flag saying "I am supposed to be effect-free." In my examples, this flag is set by the decorator @noeffects, but other ways are possible. Built-in functions would of course have that flag set correctly as well. As soon as the interpreter enters a function marked as effect-free, it goes into "functional mode" until it returns from that function again. In functional mode, it raises an exception whenever an unflagged function or method is called.

Some details to consider:

Effect-free functions may not contain global or nonlocal statements. Probably the only way to enforce this is to have special syntax for defining effect-free functions and methods (rather than a decorator) and make those statements syntactically illegal inside.

It would be much more useful to have a "referentially transparent" subset rather than an "effect-free" subset, but this is also much harder to implement. A referentially transparent function guarantees to return the same result for the same input arguments, but may modify mutable objects that it has created internally. For example, a typical matrix inversion function allocates an array for its result and then uses an imperative algorithm that modifies the elements of that array repeatedly before returning it. Such a function can be used as an asynchronous task without risk, but its building blocks cannot be safely run concurrently.

Finally, a comment on a minor issue. I have been asked if the "async" keyword is strictly necessary. The answer is no, but it makes the code much more readable. The main role of async is to write a function call without having it executed immediately. The same problem occurs in callbacks in GUI programming: you have to specify a function call to be executed at a later time. The usual solution is a parameter-free lambda expression, and that same trick could be used to make async a function rather than a keyword. But readability suffers a lot.
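As a small illustration of the lambda trick, here is a sketch in Python 3 in which async is an ordinary function. It is called async_call because async has since become a reserved word; that name, the helper power, and the use of a thread pool from concurrent.futures are assumptions made for the example, not part of my "Python 4.0" proposal.

```python
from concurrent.futures import ThreadPoolExecutor


def async_call(executor, thunk):
    """Schedule a parameter-free callable and return a future for its result."""
    return executor.submit(thunk)


def power(base, exponent):
    return base ** exponent


def demo():
    with ThreadPoolExecutor(max_workers=2) as executor:
        # The lambda delays the call: power(2, 10) is not evaluated here,
        # but later, inside the task run by the executor.
        future = async_call(executor, lambda: power(2, 10))
        return future.result()


if __name__ == "__main__":
    print(demo())
```

Compare `async_call(executor, lambda: power(2, 10))` with a hypothetical `async power(2, 10)` to see what the keyword buys in readability.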

EuroSciPy 2010

Konrad Hinsen — 2010-07-12

This weekend I attended the EuroSciPy 2010 conference in Paris, dedicated to scientific applications of the programming language Python. This was the third EuroSciPy conference, but the US-based SciPy conference has been a regular event for many years already, and recently SciPy India joined the crowd. It looks like Python is becoming ever more popular in scientific computing. Next year, EuroSciPy will take place in Paris again.

There were lots of interesting presentations and announcements, and the breaks provided a much appreciated opportunity for exchanges between the participants. I won't try to provide an exhaustive summary, but rather list my personal highlights. Obviously this choice reflects my personal interests more than the quality of the presentations, and I will even list things that were not presented but that I learned about from other participants during the breaks.

Teaching

The opening keynote was given by Hans-Petter Langtangen, who is best known for his books about Python for scientific computing. His latest book is a textbook for a course on scientific programming for beginning science students, and the first part of his keynote was about this same course that he is teaching at the University of Oslo. As others have noted as well, he observed that the students have no problem at all with picking up Python and using it productively in science. The difficulties with using Python are elsewhere: it is hard to convince the university professors that Python is a good choice of programming language for such a course!

Another important aspect of his presentation was the observation that teaching scientific programming to beginning science students provides more than just training in some useful technique. Converting equations into programs and running them also provides a much better insight into the structure and applicability of the equations. Computational science thus helps to better educate future scientists.

Reproducible research

The reproducible research movement has the goal of improving the standards in computational science. At the moment, it is almost always impossible to reproduce published computational results from the information provided by the authors. Making these results reproducible requires a careful recording of what was calculated using which version of which software running on which machine, and of course making this information available along with the publication.

At EuroSciPy, Andrew Davison presented Sumatra, a Python library for tracking this information (and more) for computational procedures written in Python. The library is in an early stage, with more functionality to come, but those interested in reproducible research should check it out now and contribute to its development.
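To give an idea of the kind of information involved, here is a minimal sketch using only the Python standard library. It is not Sumatra's interface; the function name `provenance_record`, the script name, and the parameters are assumptions made for the example.

```python
import json
import platform
from datetime import datetime, timezone


def provenance_record(script, parameters):
    """Collect a minimal description of one computational run."""
    return {
        "script": script,
        "parameters": parameters,
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "machine": platform.machine(),
        "system": platform.system(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical run: script name and parameters are placeholders.
record = provenance_record("simulate.py", {"temperature": 300.0, "steps": 1000})
print(json.dumps(record, indent=2, sort_keys=True))
```

A real record would also need the versions of all libraries used and of the code itself, which is where a dedicated tool becomes useful.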

Jarrod Millman addressed the same topic in his presentation of the plans for creating a Foundation for Mathematical and Scientific Computing, whose goal is to fund development of tools and techniques that improve computational science.

NumPy and Python 3

As a couple of active contributors to the NumPy project were attending the conference, I asked about the state of the porting effort to Python 3. The good news is that the port is done and will soon be released. Those who have been waiting for NumPy to be ported before starting to port their own libraries can go to work right now: check out the NumPy Subversion repository, install, and use!

Useful maths libraries

Three new maths libraries that were presented caught my attention: Sebastian Walter's talk about algorithmic differentiation contained demos of algopy, a rather complete library for algorithmic differentiation in Python. During the Lightning talks on the last day, two apparently similar libraries for working with uncertain numbers (numbers with error bars) were shown: uncertainties, by Eric Lebigot, and upy, by Friedrich Romstedt. Both do error propagation and take correlations into account. Those of us working with experimental data or simulation results will appreciate this.
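Why correlations matter can be shown with a small sketch of first-order (linear) error propagation. This is not the interface of uncertainties or upy; the class `Uncertain` and its method `measured` are hypothetical, and only `+`, `-` and `*` between two such objects are handled.

```python
import itertools
import math

_ids = itertools.count()  # one id per independent measured quantity


class Uncertain:
    """Number with linear error propagation that keeps correlations."""

    def __init__(self, value, terms):
        self.value = value
        # Maps the id of an independent input to
        # (derivative of this value w.r.t. that input) * (sigma of the input).
        self.terms = terms

    @classmethod
    def measured(cls, value, sigma):
        return cls(value, {next(_ids): sigma})

    @property
    def sigma(self):
        return math.sqrt(sum(t * t for t in self.terms.values()))

    def _combine(self, other, value, d_self, d_other):
        terms = {k: d_self * t for k, t in self.terms.items()}
        for k, t in other.terms.items():
            terms[k] = terms.get(k, 0.0) + d_other * t
        return Uncertain(value, terms)

    def __add__(self, other):
        return self._combine(other, self.value + other.value, 1.0, 1.0)

    def __sub__(self, other):
        return self._combine(other, self.value - other.value, 1.0, -1.0)

    def __mul__(self, other):
        return self._combine(other, self.value * other.value,
                             other.value, self.value)


x = Uncertain.measured(10.0, 0.3)
y = Uncertain.measured(10.0, 0.4)

same = x - x         # fully correlated: the uncertainty cancels
independent = x - y  # independent inputs: uncertainties add in quadrature
```

A naive implementation that only stores a value and an error bar would assign the same non-zero uncertainty to `x - x` as to the difference of two independent quantities.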

There was a lot more interesting stuff, of course, and I hope others will write more about it. I'll just point out that the slides for my own keynote about the future of Python in science are available from my Web site. And of course express my thanks to the organizing committee who invested a lot of effort to make this conference a big success!

Science and free will

Konrad Hinsen — 2010-07-01

The question of whether living beings, in particular those of our own species, possess "free will", and how it works if it exists, has recently become fashionable again. The new idea that brought the topic back into discussion was that our sense of free will might just be an illusion. According to this idea, we would be machines whose fate is entirely determined by the laws of physics (which might themselves be deterministic or not), even though we perceive ourselves as actors who pursue goals and take decisions that are not even in principle predictable by a physical analysis of our bodies, no matter at what level of detail.

The topic itself is an old one, perhaps as old as humanity. I won't go into its philosophical and religious aspects, but limit myself to the scientist's point of view: is free will compatible with scientific descriptions of our world? Perhaps even necessary for such descriptions? Or, on the contrary, in contradiction to the scientific approach? Can the scientific method be used to understand free will or show that it's a useless concept from the past?

What prompted me to write this post is a recent article by Anthony Cashmore in PNAS. In summary, Cashmore says that the majority of scientists do not believe in the existence of free will any more, and that society should draw conclusions from this, in particular concerning the judicial system, whose concept of responsibility for one's acts is based on a view of free will that the author no longer considers defendable. But don't take my word for it, read the article yourself. It's well written and covers many interesting points.

First of all, let me say that I don't agree at all with Cashmore's view that the judicial system should be reformed based on the prevailing view of today's scientists. I do believe that a modern society should take into account scientific findings, i.e. scientific hypotheses that have withstood a number of attempts at falsification. But mere beliefs of a small subpopulation, even if they are scientists, are not sufficient to justify a radical change of anything. As I will explain below, the question "do human beings possess free will" does not even deserve the label "scientific hypothesis" at this moment, because we have no idea of how we could answer it based on observation and experiment. We cannot claim either to be able to fully understand human behavior in terms of the laws of physics, which would allow us to call free will an unnecessary concept and invoke Occam's razor to get rid of it. Therefore, at this time, the existence of free will remains the subject of beliefs and scientists' beliefs are worth no more than anyone else's.

There is also a peculiar circularity to any argument about what "should" be done as a consequence of the non-existence of free will: if that hypothesis is true, nobody can decide anything! If humans have no free will, then societies don't have it either, and our judicial system is just as much a consequence of the laws of nature as my perceived decision to take coffee rather than tea for breakfast this morning.

Back to the main topic of this post: the relation between science and free will. It starts with the observation of a clear conflict. Science is about identifying regularities in the world that surrounds us, which permit the construction of detailed and testable theories. The first scientific theories were all about deterministic phenomena: given the initial state of some well-defined physical system (think of a clockwork, for example), the state of the system at any time in the future can be predicted with certainty. Later, stochastic phenomena entered the scientific world view. With stochasticity, the detailed behavior of a system is no longer predictable, but certain average properties still are. For example, we can predict how the temperature and pressure of water will change when we heat it, even though we cannot predict how each individual molecule will move. It is still a subject of debate whether stochastic elements exist in the fundamental laws of nature (quantum physics being the most popular candidate), or if they are merely a way of describing complex systems whose state we cannot analyze in detail due to insufficient resources. But scientists agree that a scientific theory may contain two forms of causality: determinism and stochasticity.

Free will, if it exists, would have to be added as a third form of causality. But it is hard to see how this could be done. The scientific method is based on identifying conditions from which exact predictions can be made. The decisions of an agent that possesses free will are by definition unpredictable, and therefore any theory about a system containing such an agent would be impossible to verify. Therefore the scientific method as we know it today cannot possibly take into consideration the existence of free will. Obviously this makes it impossible to examine the existence of free will as a scientific hypothesis. It also means that a hard-core scientist, who considers the scientific method as the only way to establish truth, has to deny the existence of free will, or else accept that some important aspects of our universe are forever inaccessible to scientific investigation.

However, there is another aspect to the relation between science and free will, which I haven't seen discussed yet anywhere: the existence of free will is in fact a requirement for the scientific method! Not as part of a system under scientific scrutiny, but as part of the scientist who runs an investigation. Testing a scientific hypothesis requires at the very least observing a specific phenomenon, but in most cases also preparing a well-defined initial state for some system that will then become the subject of observation. A scientist decides to create an experimental setup to verify some hypothesis. If the scientist were just a complex machine whose behavior is governed by the very same laws that he believes to be studying, then his carefully thought-out experiment is nothing but a particularly probable outcome of the laws of nature. We could still draw conclusions from observing it, of course, but these observations then only provide anecdotal evidence that is no more relevant than what we get from passively watching things happen around us.

In summary, our current scientific method supposes the existence of free will as an attribute of scientists, but also its absence from any system subjected to scientific scrutiny. This poses limits to what scientific investigation can yield when applied to humans.

Eclipse experiences

Konrad Hinsen — 2010-01-19

A few months ago I decided to take a closer look at Eclipse, since several people I know seemed to be quite fond of it. I had tried it earlier on my old iBook G4, but quickly abandoned it because it was much too slow. But my new MacBook Pro should be able to handle it.

Last week I finally decided to retire my Eclipse installation. I didn't remove it yet, since it might be useful for some specific tasks that I rarely have to deal with (such as analyzing someone else's big C++ code). But I don't use it any more for my own work. Here's a summary of my impressions of Eclipse, the good and the bad.

In terms of features, Eclipse is as impressive as it looks. Anything you might wish for in an IDE is there, either in the base distribution or in the form of a plugin - there are hundreds if not thousands of those. And contrary to what one might expect, all those features are relatively easy to get used to. The user interface is very systematic and the most frequent functions are easy to spot. In terms of user interface design, I would call Eclipse a success.

However, in terms of usability it turned out to be a disappointment. Basically there are two major issues: Eclipse is a resource hog, and it isn't as stable as I expect an IDE to be.

The two resources that Eclipse can't get enough of are CPU time and disk space. Even on a brand-new machine (and not a low-end one at that), starting Eclipse takes a good ten seconds and I get to see the Macintosh's spinning colour wheel quite often. What's worse is that the spinning wheel prevents me from typing at unpredictable moments. This is not acceptable for an IDE. I don't care if it takes a break in background compilation now and then, but I want to be able to type when I want. Execution times for various commands can also vary unpredictably. Rebuilding all my projects typically took about a minute, but once I waited 15 minutes for no apparent reason.

In terms of disk space, Eclipse is less of a resource hog, but it creates and updates impressive amounts of data, again for no clear reason. I noticed this because I make incremental backups regularly. Just starting and quitting Eclipse, with no action in between, resulted in a few MB of files to back up again. It's not that I can't live with that, but is this really necessary?

Finally, stability. I had only a single crash in which I lost data (the most recently entered code), which is not so bad for a big application (unfortunately...). But I had Eclipse hanging very often, and displaying verbose yet unintelligible error messages almost daily. All this is not reassuring, and together with the spinning-wheel issue this is what made me abandon Eclipse in the end.

Now I am a 100% Emacs user again, with no regrets. Emacs may look old-fashioned, and have fewer high-powered features, but it is reliable and fast.

Scientific computing needs deterministic programming paradigms

Konrad Hinsen — 2009-09-09

Programmers, scientific and otherwise, spend a lot of time discussing which programming languages, libraries, and development tools to use. In such discussions, the notion of a programming paradigm is rarely mentioned, and yet it is a very fundamental one. It becomes particularly important for parallel and concurrent programming, where the most popular languages and libraries do not necessarily provide the best programming paradigm. In this post, I will explain what a programming paradigm is and why its choice matters more than the choice of a language.

A programming paradigm defines a general approach to writing programs. It defines the concepts and abstractions in terms of which a program is expressed. For example, a programming paradigm defines how data is described, how control flow is handled, how program elements are composed, etc. Well-known programming paradigms are structured programming, object-oriented programming, and functional programming.

The implementation of a programming paradigm consists of a programming language, its runtime system, libraries, and sometimes coding conventions. Some programming languages are optimized for a specific paradigm, whereas others are explicitly designed to support multiple paradigms. Paradigms that the language designer did not have in mind can sometimes be implemented by additional conventions, libraries, or preprocessors.

The list of programming paradigms that have been proposed and/or used is already quite long (see the Wikipedia entry, for example), but the ones that are practically important and significantly distinct are much less numerous. A good overview and comparison is given in the book chapter "Programming paradigms for dummies" by Peter van Roy. I will concentrate on one aspect discussed in van Roy's text (look at section 6 in particular), which I consider of particular relevance for scientific computing: determinism.

A deterministic programming paradigm is one in which every possible program has a fully deterministic behaviour: given the same input, it executes its steps in the same order and produces the same output. This is in fact what most of us would intuitively expect from a computer program. However, there are useful programs that could not be written with this restriction. A Web server, for example, has to react to external requests which are outside of its control, and take into account resource usage (e.g. database access) and possible network errors in deciding when and in which order to process requests. This shows that there is a need for non-deterministic programming paradigms. For the vast majority of scientific applications, however, determinism is a requirement, and a programming paradigm that enforces determinism is a big help in avoiding bugs. Most scientific applications that run serially have been written using a deterministic programming paradigm, as implemented by most of the popular programming languages.

Parallel computing has changed the situation significantly. When several independent processors work together on the execution of an algorithm, fully deterministic behavior is no longer desirable, as it would imply frequent synchronizations of all processors. The precise order in which independent operations are executed is typically left unspecified by a program. What matters is that the output of the program is determined only by the input. As long as this is guaranteed, it is acceptable and even desirable to let compilers and run-time systems optimize the scheduling of individual subtasks. In Peter van Roy's classification, this distinction is called "observable" vs. "non-observable" non-determinism. A programming paradigm for scientific computing should permit non-determinism, but should exclude observable non-determinism. While observable non-determinism makes the implementation of certain programs (such as Web servers) possible, it also opens the way to bugs that are particularly nasty to track down: deadlocks, race conditions, results that change with the number of processors or when moving from one parallel machine to another one.
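The distinction can be illustrated with a small Python sketch (the function names are made up for this example). The order in which the tasks finish changes from run to run, but the result does not, so the non-determinism is not observable.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n):
    # The random sleep makes the completion order vary from run to run.
    time.sleep(random.uniform(0.0, 0.01))
    return n * n


def sum_of_squares(numbers, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(square, n) for n in numbers]
        # as_completed yields the futures in whatever order they finish.
        # Integer addition is associative and commutative, so the total
        # does not depend on that order.
        return sum(f.result() for f in as_completed(futures))


total = sum_of_squares(range(10))
```

With floating-point numbers the same pattern would no longer be safe: floating-point addition is not associative, so the total could change with the completion order, which is exactly the kind of result that varies with the number of processors.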

Unfortunately, two of the most popular programming paradigms for parallel scientific applications do allow observable non-determinism: message passing, as implemented by the MPI library, and multi-threading. Those who have used either one have probably suffered the consequences. The problem is thus well known, but the solutions aren't. Fortunately, they do exist: there are several programming paradigms that encapsulate non-determinism in such a way that it cannot influence the results of a program. One of them is widely known and used: OpenMP, which is a layer above multi-threading that guarantees deterministic results. However, OpenMP is limited to shared-memory multiprocessor machines.

For the at least as important category of distributed-memory parallel machines, there are also programming paradigms that don't have the non-deterministic features of message passing, and they are typically implemented as a layer above MPI. One example is the BSP model, which I have presented in an article in the magazine Computing in Science and Engineering. Another example is the parallel skeletons model, presented by Joël Falcou in the same magazine. Unfortunately, these paradigms are little known and not well supported by programming tools. As a consequence, most scientific applications for distributed-memory machines are written using the message passing paradigm.

Finally, a pair of programming paradigms discussed by van Roy deserves special mention, because it might well become important in scientific computing in the near future: functional programming and declarative concurrency. I have written about functional programming earlier; its main advantage is the possibility to apply mathematical reasoning and automatic transformations to program code, leading to better code (in the sense of correctness) and to better optimization techniques. Declarative concurrency is functional programming plus annotations for parallelization. The nice feature is that these annotations (not very different in principle from OpenMP pragmas) don't change the result of the program; they only influence its performance on a parallel machine. Starting from a correct functional program, it is thus possible to obtain, by automatic or manual (but automatically verified) transformations, an equivalent parallel program that is guaranteed to behave identically except for performance. Correctness and performance can thus be separated, which should be a big help in writing correct and efficient parallel programs. I say "should" because this approach hasn't been used very much, and isn't supported yet by any mainstream programming tools. This may change in a couple of years, so you might want to watch out for developments in this area.
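A rough imitation of the idea in Python, with a function argument standing in for the annotation. The names `energy` and `energies` are hypothetical, and Python does nothing to verify that `energy` is really a pure function; that is an assumption the programmer has to guarantee here, whereas a declarative-concurrency language would enforce it.

```python
from concurrent.futures import ThreadPoolExecutor


def energy(x):
    """Pure function: the result depends only on the argument."""
    return 0.5 * x * x


def energies(positions, parallel=False):
    # The 'parallel' flag plays the role of the annotation: it selects
    # how the map is executed, not what it computes.
    if parallel:
        with ThreadPoolExecutor() as pool:
            # Executor.map returns results in the order of its inputs.
            return list(pool.map(energy, positions))
    return list(map(energy, positions))


positions = [0.0, 1.0, 2.0, 3.0]
serial_result = energies(positions)
parallel_result = energies(positions, parallel=True)
```

Both calls return the same list; only the execution strategy differs.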

Sheldrake's New Science of Life

Konrad Hinsen — 2009-08-25

One of the books I read during my summer vacation is the recently published second edition of Rupert Sheldrake's "A New Science of Life". It is one of the most controversial books in science, having been both praised and condemned; a review of the first edition in the renowned science journal Nature concluded that this book should be burnt!

The question that Sheldrake addresses in this book is where form comes from. What defines the arrangements of atoms in a molecule? Or in a crystal? Why do proteins fold into their characteristic structures? How do biological molecules assemble into cells? And how do cells divide and specialize to form an embryo?

The standard reply to these questions that you can find in science textbooks is that all these forms come from the fundamental interactions of physics. Molecules are simply energetically favorable arrangements of atoms. Proteins fold in such a way that free energy is minimized. Cells assemble as a result of complex attractive interactions between their constituents, which ultimately can be reduced to fundamental physics. Embryos develop according to a "genetic program" stored in the fertilized egg's DNA.

As Sheldrake rightly emphasizes, even though it may come as a surprise to most non-experts, these affirmations cannot be verified. They express a common belief among practicing scientists, and they are compatible with everything we know about nature, but they may well be wrong. We simply cannot verify them because the fundamental equations of physics can be solved only for very simple systems. Even for one of the simplest molecules, water, we cannot predict the arrangement of its atoms directly from the basic principles of physics. What we use in practice are approximations, but these approximations have been selected because they permit predicting the known molecular structures. We cannot use such approximations to verify more fundamental claims.

Sheldrake proposes an alternative theory, based on what he calls "morphogenetic fields". From my point of view as a physicist, the name is not very well chosen because these entities do not correspond to what a physicist would call a field, but of course this term may be perfectly clear to biologists. It's a minor point because Sheldrake explains this concept very clearly in his book. In summary, his theory says that forms exist because they have existed before; atoms, molecules, and cells arrange themselves into patterns that they "remember" from the past. His morphogenetic fields are a giant database of forms that the universe keeps around forever.

The main "problem" with this theory is that if it is right, even just approximately, then standard science, from physics to biology, is very much wrong. It is probably for this reason that his book has attracted so much criticism from the science establishment. Otherwise, there is little one could criticize: Sheldrake explains his theory and its consequences for chemistry and biology, and he proposes a large number of experimental verifications that would permit testing it. This is science at its best. Of course his theory may turn out to require modifications, or even be completely wrong, but that is true of any scientific theory when it is first formulated.

In fact, I recommend this book to anyone interested in the scientific process because of its detailed discussion of how scientific discovery works. I haven't seen many books accessible to non-specialists that explain the limits of verifiability of a scientific theory, for example. Nor have I seen any other book that makes the distinction between verified theories and widely accepted but untested beliefs as clear as Sheldrake does. Even if you don't care about his theory, you can gain a lot from reading this book.

Functional programming for scientific computing

Konrad Hinsen — 2009-06-23

With the increasing importance of parallel computers, ranging from multi-core desktop machines to massively parallel machines such as IBM's BlueGene, functional programming could well become an important technique for scientific software development, as it facilitates program transformations (including those for automatic or semi-automatic parallelization) considerably. It also appeals to the mathematical bent of many scientists in making it possible to apply mathematical reasoning to computer programs. The downside: there is a steep learning curve for those familiar with traditional programming (called "imperative").

I have written an introduction to functional programming for scientists for the July issue of Computing in Science and Engineering. It is also available (free access) via IEEE's Computing Now portal: http://www2.computer.org/portal/web/computingnow/0609/whatsnew/cise

While I don't expect functional programming to be adopted rapidly by computational scientists, I am convinced that ten years from now, it will be an essential item in everyone's toolbox. Better start preparing yourself now!

Static typing and code clutter

Konrad Hinsen — 2009-05-12

Among the many characteristics that distinguish programming languages, static vs. dynamic typing is one of the most debated ones. The main advantages claimed by the advocates of static typing are that compile-time type checks make code more robust and that static typing allows a compiler to do better optimizations. The dynamic typing camp points out the simplicity and flexibility of a language that requires no type declaration and that permits a piece of code to handle data objects defined well after it was written. Both sides are right and the choice is ultimately one of personal preference.

I have used various programming languages over the years, including both statically typed and dynamically typed ones. But when given a choice, I have always preferred dynamic typing. Since 1995, my main programming language has been Python, and more recently I have started to use Clojure. One of the reasons for this preference is something that I have never seen expressed before: static typing often adds visual clutter to the code that makes it harder to read.

An important property of any non-trivial computer program is its clarity to human readers. Both verification of a program's correctness and the overall utility of a piece of code in a context of changing requirements depend on this. Well-written specifications and unit tests help as well, but if you want my advice on the quality of a piece of code, or if you want my help with modifying it, my judgement will mostly be based on its clarity. If it's an effort to understand what's going on, I wouldn't want to work with it.

This criterion for code quality immediately translates into a criterion for programming languages: they should be able to express as many concepts of software engineering as possible in a direct, explicit way and without imposing any clutter or obfuscation. Static type systems often get in the way, either by imposing clutter or by encouraging a less clear programming style.

In my examples, I will use the languages Haskell (static typing) and Clojure (dynamic typing) for illustration. Haskell has one of the best type systems available at the moment, so if Haskell can't avoid the problems that I point out, it is likely that no other current language will do a better job. Clojure is a good comparison because like Haskell it is designed for a functional programming style. Of course, it also helps that I am reasonably familiar with both languages.

Example 1: abstract data types

The idea behind abstract data types is that the concrete representation of some data structure should be hidden from client code, which accesses the data structure only through a set of interface functions. Let's look at how this is typically implemented in Haskell, using the PFP library for probabilistic programming as the example (just because I happen to know it; many other libraries could serve the same purpose). In PFP, a probability distribution is represented by an abstract data type Dist a defined as

newtype Dist a = D {unD :: [(a,ProbRep)]}

This says that internally, Dist is a list of (a, ProbRep) pairs. The single constructor D converts such a list to the abstract data type Dist, whereas unD does the inverse: it makes the contents of a Dist value accessible for inspection.

The problem with this is that all of the implementation code for PFP is littered with D and unD, although they don't do anything and add nothing to the clarity of the code. They are there only to make sure that the signature of the functions contains the abstract type Dist a instead of the internal representation [(a,ProbRep)]. For the reader of the PFP code trying to understand how it works, this is clutter. There are also a couple of functions that exist only for dealing with the artificial distinction between Dist a and [(a,ProbRep)], for example

sizeD :: Dist a -> Int
sizeD = length . unD

which replaces the list function length (familiar to every Haskell programmer) by a special version whose purpose the reader has to remember.

A Clojure library that is essentially equivalent to PFP (look at the source code) is much shorter, and in my opinion much clearer. It represents a probability distribution by a map (known in other languages as a dictionary or an associative array) and directly uses Clojure's map operations to work on it. No visual overhead, no clutter. Of course, as static typing advocates would be quick to point out, no protection of the internal representation either: client code could directly manipulate the maps used to represent distributions, potentially creating maps that are not valid probability distributions. I have never run into such a problem in 15 years of using dynamically typed languages, but in principle it is possible.
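To give an idea of what working directly on a map looks like, here is a sketch in Python, with a dict in place of the Clojure map. It is not a translation of the Clojure library; the function names `uniform` and `join_with` are chosen for this illustration only.

```python
from fractions import Fraction


def uniform(values):
    """Uniform distribution over the given values, as a plain dict."""
    values = list(values)
    p = Fraction(1, len(values))
    return {v: p for v in values}


def join_with(f, dist1, dist2):
    """Distribution of f(x, y) for independent x and y."""
    result = {}
    for x, px in dist1.items():
        for y, py in dist2.items():
            z = f(x, y)
            result[z] = result.get(z, 0) + px * py
    return result


die = uniform(range(1, 7))
two_dice = join_with(lambda a, b: a + b, die, die)

# Ordinary dict operations work on the distribution without unwrapping.
number_of_outcomes = len(two_dice)
probability_of_seven = two_dice[7]
```

Nothing prevents client code from writing `two_dice[7] = 5`, which is the lack of protection mentioned above.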

It would be possible to avoid the code obfuscation due to abstract data types by recognizing that abstract data types are an interface issue and not a type issue. A language could provide an explicit declaration of an interface for a module where the function signatures would be given with the abstract data type, even though the concrete representation is used in the implementations. The compiler could verify the coherence of everything. But I haven't seen anything like this in any statically typed language.

Note also that something very similar could be implemented in Clojure: A couple of macros would provide wrappers around the exported functions that add type verification at runtime. However, this says more about the advantage of having a powerful macro system than about the advantages of dynamic typing.
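Python has no macros, but a decorator gives a rough idea of such a wrapper. This is only an analogy under that assumption; the names `is_distribution`, `returns_distribution` and `biased_coin` are hypothetical.

```python
import functools
import math


def is_distribution(obj):
    """Check that obj is a dict of non-negative probabilities summing to 1."""
    return (isinstance(obj, dict)
            and all(p >= 0 for p in obj.values())
            and math.isclose(sum(obj.values()), 1.0))


def returns_distribution(func):
    """Wrap an exported function so that its result is verified at runtime."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if not is_distribution(result):
            raise TypeError(f"{func.__name__} returned an invalid distribution")
        return result
    return wrapper


@returns_distribution
def biased_coin(p_heads):
    # The implementation works on the plain dict, with no wrapping code.
    return {"heads": p_heads, "tails": 1.0 - p_heads}


coin = biased_coin(0.3)
```

The check lives at the module boundary, and the implementation itself stays free of wrapping and unwrapping.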

Example 2: monads

A monad is a package consisting of a data structure (or, more precisely, certain properties that a data structure must have) and two functions bind and result. A subclass of monads also has a special value called zero and a subclass of this subclass has one more function called plus. All these definitions must obey certain rules to make a valid monad.
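As a concrete case, here is a minimal Python sketch of the list monad with all four ingredients. The helper `neighbours` is made up for the example, and the rules are only spot-checked in comments, not proven.

```python
def result(v):
    """Put a plain value into the monad: a one-element list."""
    return [v]


def bind(mv, f):
    """Apply f (value -> list) to every element and concatenate the lists."""
    return [y for x in mv for y in f(x)]


zero = []  # the "no result" value


def plus(*mvs):
    """Combine several monadic values by concatenation."""
    return [x for mv in mvs for x in mv]


def neighbours(n):
    # Hypothetical example function returning a monadic value.
    return [n - 1, n + 1]


two_steps = bind(bind(result(0), neighbours), neighbours)

# Some of the rules, checked for particular values:
#   bind(result(v), f) == f(v)
#   bind(mv, result)   == mv
#   bind(zero, f)      == zero
```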

In Haskell, there is a typeclass Monad that defines bind (called >>=) and result (called return), and another typeclass MonadPlus that defines mzero and mplus. A monad is defined by providing instances for concrete data types. When monadic operations are used, the type inference system identifies the data type and selects the corresponding operations. From the client's point of view, a monad thus is defined by a type.

There are a few drawbacks of this setup:

It is not possible to define a monad with a zero but no plus. This is a technical detail (MonadPlus could well be split into two typeclasses), but it's still a limitation in practical Haskell programming that is due to the rigidity of a type system.

It is impossible to have two monads for the same data type, although sometimes this would make sense. For example, there are (at least) two practically relevant ways to define plus for the list monad.

It is cumbersome to use the same monad operations for two different concrete data structures that are similar enough in behaviour to be used with the same monad definition.

In Clojure, monads are values, not types. In client code, a monad is selected explicitly by the programmer by surrounding the monadic code by a with-monad form that specifies the monad to be used. Usually the monad is named explicitly, but since monads are values, they can also be represented by a variable. A Clojure monad can be used with any data type that the definitions of the monadic operations accept, and any number of monads can be defined for a data type.
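The "monads are values" idea can be imitated in Python by passing a record of functions around. This sketch does not reproduce with-monad; an explicit argument takes its place. The two definitions of plus shown here (concatenation and first non-empty list) are my choice for illustration, and all names are hypothetical.

```python
from collections import namedtuple

Monad = namedtuple("Monad", ["result", "bind", "zero", "plus"])


def list_bind(mv, f):
    return [y for x in mv for y in f(x)]


def first_non_empty(*mvs):
    for mv in mvs:
        if mv:
            return mv
    return []


# Two monads for the same data type (lists), differing only in plus.
all_results = Monad(result=lambda v: [v], bind=list_bind, zero=[],
                    plus=lambda *mvs: [x for mv in mvs for x in mv])
first_result = Monad(result=lambda v: [v], bind=list_bind, zero=[],
                     plus=first_non_empty)


def candidates(monad, n):
    # The monad is an ordinary argument, so the caller selects it explicitly.
    keep_if_even = lambda x: monad.result(x) if x % 2 == 0 else monad.zero
    times_ten = lambda x: monad.result(x * 10)
    return monad.plus(monad.bind(monad.result(n), keep_if_even),
                      monad.bind(monad.result(n), times_ten))


everything = candidates(all_results, 4)
first_only = candidates(first_result, 4)
```

The same client function yields `[4, 40]` with one monad and `[4]` with the other, without any change to the data type.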

In Clojure it is the data structure that almost disappears from monad handling; the constraints on the monadic values are given only as documentation for the human reader. As with other aspects of dynamic typing, this provides more flexibility and less protection.

For standard monads, I'd say that the clarity of code is roughly equivalent in Haskell and Clojure. Haskell gains a bit in making the data structure explicit, Clojure gains a bit in making the monad explicit at the point of use. Both work pretty well.

This changes when monad transformers come into play. In the Haskell world, this is perhaps the most frightening concept for newcomers. Monad transformers are surrounded by an aura of mystery: supposedly they are only for the real experts. I think that this is at least in part due to the complexity of defining monad transformers in a type system.

Here's how Haskell defines the list monad transformer; only the relevant parts are shown:

newtype ListT m a = ListT { runListT :: m [a] }

instance (Monad m) => Monad (ListT m) where
    return a = ListT $ return [a]
    m >>= k  = ListT $ do
        a <- runListT m
        b <- mapM (runListT . k) a
        return (concat b)

As in the case of abstract data types, there is a data type definition for ListT that does nothing but introduce a new notation for the type m [a]. ListT and runListT are just notation converters that don't actually do anything useful. But unlike in the case of abstract data types, they are indispensable here. Monads are types, and therefore there has to be a new type to make a new monad. It doesn't help that the name runListT is a particularly bad choice: it suggests an action where there is none.

The definitions of return and >>= aren't masterpieces of clarity either. It takes a careful analysis of each function and its (inferred, and thus unwritten) type to understand which monad is being used where.
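To make that analysis easier to follow, here is a self-contained variant of the definition with comments saying which monad each operation belongs to. It is a sketch, not the library source: the Functor and Applicative instances are additions that current GHC versions require, and the example values pairs and failing are invented for illustration.

```haskell
import Control.Monad (ap, liftM)

newtype ListT m a = ListT { runListT :: m [a] }

-- Boilerplate required by current GHC, derived from the Monad instance below.
instance Monad m => Functor (ListT m) where
  fmap = liftM

instance Monad m => Applicative (ListT m) where
  pure a = ListT (return [a])        -- this return belongs to the inner monad m
  (<*>)  = ap

instance Monad m => Monad (ListT m) where
  m >>= k = ListT $ do               -- the whole do block runs in the inner monad m
    a <- runListT m                  -- a is a plain list
    b <- mapM (runListT . k) a       -- mapM in m; b is a list of lists
    return (concat b)                -- return of m again

-- Usage with Maybe as the inner monad; this do block runs in ListT Maybe.
pairs :: ListT Maybe (Int, Char)
pairs = do
  n <- ListT (Just [1, 2])
  c <- ListT (Just "ab")
  return (n, c)

-- A failure in the inner monad aborts the whole computation.
failing :: ListT Maybe Int
failing = ListT (Just [1, 2 :: Int]) >> ListT Nothing

main :: IO ()
main = do
  print (runListT pairs)    -- Just [(1,'a'),(1,'b'),(2,'a'),(2,'b')]
  print (runListT failing)  -- Nothing
```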

For comparison, here is the corresponding monad transformer in Clojure, again reduced to the basics:

(defn sequence-t [m]
  (monad
    [m-result (with-monad m
                (fn [v]
                  (m-result (list v))))
     m-bind   (with-monad m
                (fn [mv f]
                  (domonad
                    [a mv
                     b (m-map f a)]
                    (flatten b))))]))

Since monads are values, monad transformers are simply functions that take a monad argument and return another monad. It is also clear at a glance that inside the definition of each monad operation, all references to monad operations are to be interpreted in the inner monad. This doesn't make monad transformers trivial to understand, of course, but it is a lot clearer.
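The "function from monad to monad" idea can be approximated in Haskell by representing a monad as a record of its operations. The following is a sketch under that assumption, not a translation of the Clojure library: MonadV, mResult, mBind, mapV, sequenceT, maybeV and pairs are invented names. Haskell still needs a name for the combined type m [a], for which the sketch borrows Compose from Data.Functor.Compose; Clojure needs no counterpart.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Functor.Compose (Compose (..))

-- A monad as a plain value: a record holding its two operations.
data MonadV m = MonadV
  { mResult :: forall a. a -> m a
  , mBind   :: forall a b. m a -> (a -> m b) -> m b
  }

maybeV :: MonadV Maybe
maybeV = MonadV { mResult = Just, mBind = \mv f -> maybe Nothing f mv }

-- Counterpart of m-map: apply f to each element inside monad m.
mapV :: MonadV m -> (a -> m b) -> [a] -> m [b]
mapV m f = foldr step (mResult m [])
  where
    step x acc = mBind m (f x) (\y -> mBind m acc (\ys -> mResult m (y : ys)))

-- The transformer is an ordinary function from a monad value to a monad value.
-- Every use of mResult and mBind in its body refers to the inner monad.
sequenceT :: MonadV m -> MonadV (Compose m [])
sequenceT inner = MonadV
  { mResult = \v -> Compose (mResult inner [v])
  , mBind   = \mv f ->
      Compose (mBind inner (getCompose mv) (\xs ->
        mBind inner (mapV inner (getCompose . f) xs) (\yss ->
          mResult inner (concat yss))))
  }

-- Lists layered over Maybe, built by applying the transformer.
pairs :: Compose Maybe [] (Int, Char)
pairs =
  mBind m (Compose (Just [1, 2])) (\n ->
    mBind m (Compose (Just "ab")) (\c ->
      mResult m (n, c)))
  where
    m = sequenceT maybeV

main :: IO ()
main = print (getCompose pairs)  -- Just [(1,'a'),(1,'b'),(2,'a'),(2,'b')]
```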

Monads in Clojure

Konrad Hinsen — 2009-04-22

One of my hobby projects over the last few months has been the exploration of monads. Monads are packages consisting of a data structure and associated control structures that are used as abstractions in functional programming. They were popularized by the Haskell language, where they play a central role in introducing side effects (such as I/O) in a controlled way into a language that is otherwise purely functional.

Since I was also exploring Clojure, an interesting new dialect of Lisp that strongly encourages a purely functional programming style (but doesn't enforce it), I decided to explore monads by writing a monad library for Clojure. My experience is that monads are quite useful in Clojure as well, and that once you get used to monads, you see occasions for using them almost everywhere. If you have been hesitating to tackle monads seriously, I can only encourage you to go ahead!

I have also written a monad tutorial for Clojure programmers, which I published on the OnClojure blog. It consists of four parts:

Part 1 introduces the concept of monads and illustrates it with the identity and maybe monads.

Part 2 explains the importance of m-result using the sequence monad as an example. It also covers the monad laws.

Part 3 is about m-zero and m-plus, and explains the state monad.

Part 4 covers the probability monad and monad transformers.

I hope that this tutorial facilitates a first contact with monads for those who are more familiar with Lisp syntax than with Haskell syntax.