
The roles of computer programs in science


Why do people write computer programs? The answer seems obvious: in order to produce useful tools that help them (or their clients) do whatever they want to do. That answer is clearly an oversimplification. Some people write programs just for the fun of it, for example. But when we replace "people" by "scientists", and limit ourselves to the scientists' professional activities, we get a
statement that rings true: Scientists write programs because these programs do useful work for them. Lengthy computations, for example, or visualization of complex data.

This perspective of "software as a tool for doing research" is so pervasive in computational science that it is hardly ever made explicit. Many scientists even see software, or perhaps the combination of computer hardware plus software, as just another piece of lab equipment. A nice illustration is this TEDx lecture by Klaus Schulten about his "computational microscope", which is in fact Molecular Dynamics simulation software for studying biological macromolecules such as proteins or DNA.

To see the fallacy behind equating computer programs with lab equipment, let's take a step back and look at the basic principles of science. The ultimate goal of science is to develop an understanding of the universe that we inhabit. The distinguishing feature of science (compared to other approaches such as philosophy or religion) is that it constructs precise models for natural phenomena that it validates and improves by repeated confrontation with observations made on the real thing:
[Figure: the scientific method as a cycle of model-based prediction and observation]

An experiment is just an optimized form of observation: a setup designed for making a very specific kind of observation that might be difficult or impossible to make by just looking at the world around us. The process of doing science is an eternal cycle: the model is used to make predictions of yet-to-be-made observations, and the real observations are then compared to these predictions in order to validate the model and, in case of significant discrepancies, to correct it.

In this cycle of prediction and observation, the role of a traditional microscope is to help make observations of what happens in nature. In contrast, the role of Schulten's computational microscope is to make predictions from a theoretical model. Once you think about this for a while, it seems obvious. To make observations on a protein, you need to have that protein. A real one, made of real atoms. There is no protein anywhere in a computer, so a computer cannot do observations on proteins, no matter which software is being run on it. What you look at with the computational microscope is not a protein, but a model of a protein. If you actually watch Klaus Schulten's video to the end, you will see that this distinction is made at some point, although not as clearly as I think it should be.

So it seems that the term "a tool for exploring a theoretical model" is a good description of a simulation program. And in fact that's what early simulation programs were. The direct ancestors of Schulten's computational microscope are the first Molecular Dynamics simulation programs made for atomic liquids. A classic reference is Rahman's 1964 paper on the simulation of liquid argon. The papers of that time specify the model in terms of a few mathematical equations plus some numerical parameters. Molecular Dynamics is basically Newton's equations of motion, discretized for numerical integration, plus a simple model for the interactions between the atoms, known as the Lennard-Jones potential. A simulation program of the time was a rather straightforward translation of the equations into FORTRAN, plus some bookkeeping and I/O code. It was indeed a tool for exploring a theoretical model.
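
To make this concrete, here is a minimal sketch of such a model written in present-day scientific Python, for illustration only: a pairwise Lennard-Jones force routine and one velocity-Verlet integration step. The parameter names and the absence of a cutoff or periodic boundaries are my simplifications, not a reconstruction of Rahman's actual program.

import numpy as np

def lennard_jones_forces(positions, epsilon=1.0, sigma=1.0):
    # Pairwise Lennard-Jones forces, O(N^2), no cutoff, no periodic box.
    n = len(positions)
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(i + 1, n):
            rij = positions[i] - positions[j]
            d2 = np.dot(rij, rij)
            sr6 = (sigma * sigma / d2) ** 3
            # force = -dU/dr for U(r) = 4*epsilon*((sigma/r)**12 - (sigma/r)**6)
            f = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / d2 * rij
            forces[i] += f
            forces[j] -= f
    return forces

def velocity_verlet_step(positions, velocities, masses, dt):
    # One step of Newton's equations of motion, discretized (velocity Verlet).
    # positions, velocities: arrays of shape (n, 3); masses: array of shape (n,).
    f = lennard_jones_forces(positions)
    velocities = velocities + 0.5 * dt * f / masses[:, np.newaxis]
    positions = positions + dt * velocities
    f_new = lennard_jones_forces(positions)
    velocities = velocities + 0.5 * dt * f_new / masses[:, np.newaxis]
    return positions, velocities

Everything that defines the model fits into these few lines plus a handful of parameter values; that is the situation the argon papers of the 1960s were in.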

Since then, computer simulation has been applied to ever bigger and ever more complex systems. The examples shown by Klaus Schulten in his video represent the state of the art: assemblies of biological macromolecules, consisting of millions of atoms. The theoretical model for these systems is still a discretized version of Newton's equations plus a model for the interactions. But this model for the interactions has become extremely complex. So complex in fact that nobody bothers to write it down any more. It's not even clear how you would write it down, since standard mathematical notation is no longer adequate for the task. A full specification requires some algorithms and a database of chemical information. Specific aspects of model construction have been discussed at length in the scientific literature (for example how best to describe electrostatic interactions), but a complete and precise specification of the model used in a simulation-based study is never provided.

The evolution from simple simulations (liquid argon) to complex ones (assemblies of macromolecules) looks superficially like a quantitative change, but there is in fact a qualitative difference: for today's complex simulations, the computer program is the model. A question such as "Does program X correctly implement model A?", which made perfect sense in the 1960s, has become meaningless. Instead, we can only ask "Does program X implement the same model as program Y?", but that question is impossible to answer in practice. The reason is that the programs are even more complex than the models, because they also deal with purely practical issues such as optimization, parallelization, I/O, etc. This phenomenon is not limited to Molecular Dynamics simulations. The transition from mathematical models to computational models, which can only be expressed in the form of computer programs, is happening in many branches of science. However, scientists are slow to recognize what is happening, and I think that is one reason for the frequent misidentification of software as experimental equipment. Once a theoretical model is complex and drowned in even more complex software, it acquires many of the characteristics of experiments. Like a sample in an experiment, it cannot be known exactly; it can only be studied by observing its behavior. Moreover, these observations are associated with systematic and statistical errors resulting from numerical issues that frequently even the program authors don't fully understand.

From my point of view (I am a theoretical physicist), this situation is not acceptable. Models play a central role in science, in particular in theoretical science. Anyone claiming to be a theoretician should be able to state precisely which models he/she is using. Differences between models, and approximations to them, must be discussed in scientific studies. A prerequisite is that the models can be written down in a human-readable form. Computational models are here to stay, meaning that computer programs as models will become part of the daily bread of theoreticians. What we will have to develop are notations and techniques that permit a separation of the model aspect of a program from all the other aspects, such as optimization, parallelization, and I/O handling. I have presented some ideas for reaching this goal in this article (click here for a free copy of the issue containing it; it's on page 77), but a lot of details remain to be worked out.

The idea of programs as a notation for models is not new. It has been discussed in the context of education, for example in this paper by Gerald Sussman and Jack Wisdom, as well as in their book that presents classical mechanics in a form directly executable on a computer. The constraint of executability imposed by computer programs forces scientists to remove any ambiguities from their models. The idea is that if you can run it on your computer, it's completely specified. Sussman and Wisdom actually designed a specialized programming language for this purpose. They say it's Scheme, which is technically correct, but Scheme is a member of the Lisp family of extensible programming languages, and the extensions written by Sussman and Wisdom are highly non-trivial, to the point of including a special-purpose computer algebra system.

For the specific example that I have used above, Molecular Dynamics simulations of proteins, the model is based on classical mechanics and it should thus be possible to use the language of Sussman and Wisdom to write down a complete specification. Deriving an efficient simulation program from such a model should also be possible, but requires significant research and development effort.

However, any progress in this direction can happen only when the computational science community takes a step back from its everyday occupations (producing ever more efficient tools for running ever bigger simulations on ever bigger computers) and starts thinking about the place that it occupies in the pursuit of scientific research.

Update (2014-5-26) I have also written a more detailed article on this subject.

Python as a platform for reproducible research

The other day I was looking at the release notes for the recently published release 1.8 of NumPy, the library that is the basis for most of the Scientific Python ecosystem. As usual, it contains a list of new features and improvements, but also sections such as "dropped support" (for Python 2.4 and 2.5) and "future changes", to be understood as "incompatible changes that you should start to prepare for". Dropping support for old Python releases is understandable: maintaining compatibility and testing it is work that needs to be done by someone, and manpower is notoriously scarce for projects such as NumPy. Many of the announced changes are in the same category: they permit removing old code and thus reduce maintenance effort. Other announced changes have the goal of improving the API, and I suppose they were more controversial than the others, as it is rarely obvious that one API is better than another one.



From the point of view of reproducible research, all these changes are bad news. They mean that libraries and scripts that work today will fail to work with future NumPy releases, in ways that their users, who are usually not the authors, cannot easily understand or fix. Actively maintained libraries will of course be adapted to changes in NumPy, but much, perhaps most, scientific software is not actively maintained. A PhD student doing computational research might well publish his/her software along with the thesis, but then switch subjects, or leave research altogether, and never look at the old code again. There are also specialized libraries developed by small teams who don't have the resources to do as much maintenance as they would like.



Of course NumPy is not the only source of instability in the Python platform. The most visible change in the Python ecosystem is the evolution of Python itself, whose 3.x series is not compatible with the earlier 2.x language. It is difficult to say at this time for how long Python 2.x will be maintained, but it is quite possible that much of today's scientific software written in Python will become difficult to run ten years from now.



The problem of scientific publications becoming more and more difficult to use is not specific to computational science. A theoretical physicist trying to read Isaac Newton's works would have a hard time, because the mathematical language of physics has changed considerably over time. Similarly, an experimentalist trying to reproduce Galileo Galilei's experiments would find it hard to follow his descriptions. Neither is a problem in practice, because the insights obtained by Newton and Galilei have been reformulated many times since then and are available in today's language in the form of textbooks. Reading the original works is required only for studying the history of science. However, it typically takes a few decades before specific results are universally recognized as important and enter the perpetually maintained canon of science.



The crucial difference with computations is that computing platforms evolve much faster than scientific research. Researchers in fields such as physics and chemistry routinely consult original research works that are up to thirty years old. But scientific software from thirty years ago is almost certainly unusable today without changes. The state of today's software thirty years from now is likely to be worse, since software complexity has increased significantly. Thirty years ago, the only dependencies a scientific program would have were a compiler and perhaps one of a few widely known numerical libraries. Today, even a simple ten-line Python script has lots of dependencies, most of them indirectly through the Python interpreter.
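
A quick way to convince yourself of this is to count the modules that a single import pulls in behind the scenes. The snippet below is purely illustrative; the exact number depends on the Python and NumPy versions installed.

import sys

modules_before = set(sys.modules)
import numpy
modules_after = set(sys.modules)

# Every one of these is a de facto dependency of any script that imports NumPy.
print(len(modules_after - modules_before), "modules loaded by a single 'import numpy'")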




One popular attitude is to say: Just run old Python packages with old versions of Python, NumPy, etc. This is an option as long as the versions you need are recent enough that they can still be built and installed on a modern computer system. And even then, the practical difficulties of working with parallel installations of multiple versions of several packages are considerable, in spite of tools designed to help with this task (have a look at EasyBuild, hashdist, conda, and Nix or its offshoot Guix).



An additional difficulty is that the installation instructions for a library or script at best mention a minimum version number for dependencies, but not the last version with which they were tested. There is a tacit assumption in the computing world that later versions of a package are compatible with earlier ones, although this is not true in practice, as the example of NumPy shows. The Python platform would be a nicer place if any backwards-incompatible change were accompanied by a change in package name. Dependencies would then be evident, and the different incompatible versions could easily be installed in parallel. Unfortunately this approach is rarely taken, a laudable exception being Pyro, whose latest incarnation is called Pyro4 to distinguish it from its not fully compatible predecessors.
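
One small defensive measure that authors of libraries and scripts can take is to record not only a minimum version of each dependency but also the highest version they actually tested with, and to warn when the installed version falls outside that range. The following is a sketch of that idea; the version bounds and the warning text are made up for illustration (distutils' LooseVersion was the standard tool for such comparisons at the time of writing).

import numpy
from distutils.version import LooseVersion

# The NumPy versions this script was actually tested with (illustrative values).
TESTED_WITH = ("1.6.0", "1.8.0")

installed = LooseVersion(numpy.__version__)
if not (LooseVersion(TESTED_WITH[0]) <= installed <= LooseVersion(TESTED_WITH[1])):
    print("Warning: NumPy %s is outside the tested range %s to %s"
          % (numpy.__version__, TESTED_WITH[0], TESTED_WITH[1]))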



I have been thinking a lot about this issue recently, because it directly impacts my ActivePapers project. ActivePapers solves the dependency versioning problem for all code that lives within the ActivePaper universe, by abandoning the notion of a single collection of "installed packages" and replacing it by explicit references to a specific published version. However, the problem persists for packages that cannot be moved inside the ActivePaper universe, typically because of extension modules written in a compiled language. The most fundamental dependencies of this kind are NumPy and h5py, which are guaranteed to be available in an ActivePapers installation. ActivePapers does record the version numbers of NumPy and h5py (and also HDF5) that were used for each individual computation, but it currently has no way to reproduce that exact environment at a later time. If anyone has a good idea for solving this problem, in a way that the average scientist can handle without becoming a professional systems administrator, please leave a comment!
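
To make the version bookkeeping a bit more concrete, here is a minimal sketch of the recording part: storing the versions of NumPy, h5py, and HDF5 as attributes in an HDF5 file. This only illustrates the principle; it is not ActivePapers' actual internal format, and the group, attribute, and file names are made up.

import numpy
import h5py

def record_environment(filename):
    # Store the library versions used for this computation as HDF5 attributes.
    with h5py.File(filename, "a") as f:
        env = f.require_group("environment")
        env.attrs["numpy"] = numpy.__version__
        env.attrs["h5py"] = h5py.version.version
        env.attrs["hdf5"] = h5py.version.hdf5_version

record_environment("computation.h5")

Recording the versions is the easy part; reconstructing that exact environment later is the hard part discussed above.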



As I have pointed out in an earlier post, long-term reproducibility in computational science will become possible only if the community adopts a stable code representation, which needs to be situated somewhere in between processor instruction sets and programming languages, since both ends of this spectrum are moving targets. In the meantime, we will have to live with workarounds.


ActivePapers for Python

Today I have published the first release of ActivePapers for Python, available on PyPI or directly from the Mercurial repository on Bitbucket. The release coincides with the publication of my first scientific paper for which the complete code and data are in the supplementary material, available through the J. Chem. Phys. Web site or from Figshare. There is a good chance that this is the first fully reproducible paper in the field of biomolecular simulation, but it is of course difficult to verify such a claim.

ActivePapers is a framework for doing and publishing reproducible research. An ActivePaper is a file that contains code (Python modules and scripts) and data (HDF5 datasets), plus the dependency information between all these pieces. You can change a script and re-run all the computations that depend on it, for example. Once your project is finished, you can publish the ActivePaper as supplementary material to your standard paper. You can also re-use code and data from a published ActivePaper by using DOI-based links, although for the moment this works only for ActivePapers stored on Figshare.

I consider this first release of ActivePapers quite usable (I use it, after all), but it's definitely for "early adopters". You should be comfortable working with command-line tools, for example, and of course you need some experience with writing Python scripts if you want to create your own ActivePaper. For inspecting data, you can use any HDF5-based tool, such as HDFView, though this makes sense only for data that generic tools can handle. My first published ActivePaper contains lots of protein structures, which HDFView doesn't understand at all. I expect tool support for ActivePapers to improve significantly in the near future.

Platforms for reproducible research

This post was motivated by Ian Gent's recomputation manifesto and his blog post about it. While I agree with pretty much everything said there, there is one point that I strongly disagree with, and here I'd like to explain the reasons in some detail. The point in question is "The only way to ensure recomputability is to provide virtual machines". To be fair, the manifesto specifies that it's the only way "at least for now", so perhaps our disagreement is not as pronounced as it may seem.

I'll start with a quote from the manifesto that shows that we have similar ideas of the time scales over which computational research should be reproducible:
"It may be true that code you make available today can be built with only minor pain by many people on current computers. That is unlikely to be true in 5 years, and hardly credible in 20."

So the question is: how can we best ensure that the software used in our computational studies can still be run, with reasonable effort, 20 years from now? To answer that question, we have to look at the possible platforms for computational research.

By a "platform", I mean the combination of hardware and software that is required to use a given piece of digital information. For example, Flash video requires a Flash player and a computer plus operating system that the Flash player can run on. That's what defines the "Flash platform". Likewise, today's "Web platform" (a description that requires a date stamp to be precise, because Web standards evolve so quickly) consists of HTML5, JavaScript, and a couple of related standards. If you want to watch a Flash video in 20 years, you will need a working Flash platform, and if you want to use an archived copy of a 2013 Web site, you need the 2013 Web platform.

If you plan to distribute some piece of digital information with the hope that it will make sense 20 years from now, you must either have confidence in the longevity of the platform, or be willing and able to ensure its long-term maintenance yourself. For the Flash platform, that means confidence in Adobe and its willingness to keep Flash alive (I wouldn't bet on that). For the 2013 Web platform, you may hope that its sheer popularity will motivate someone to keep it alive, but I wouldn't bet on it either. The Web platform is too complex and too ill-defined to be kept alive reliably when no one uses it in daily life any more.

Back to computational science. 20 years ago, most scientific software was written in Fortran 77, often with extensions specific to a machine or compiler. Much software from that era relied on libraries as well, but they were usually written in the same language, so as long as their source code remains available, the platform for all that is a Fortran compiler compatible with the one from back then. For standard Fortran 77, that's not much of a problem, whereas most of the vendor-specific extensions have disappeared since. Much of that 20-year-old software can in fact still be used today. However, reproducing a computational study based on that software is a very different problem: it also requires all the input data and an executable description of the computational protocol. Even in the rare case that all that information is available, it is likely to depend on lots of other software pieces that may not be easy to get hold of any more. The total computational platform for a given research project is in fact as ill-defined as the 2013 Web platform.

Today's situation is worse, because we use more diverse software written in more different languages, and also use more interactive software whose use is notoriously non-reproducible. The only aspect where we have gained in standardization is the underlying hardware and OS layer: pretty much all computational science is done today on x86 processors running Linux. Hence the idea of conserving the full operating environment in the form of a virtual machine. Just fire up VirtualBox (or one of the other virtual machine managers) and run an exact copy of the original study's work environment.

But what is the platform required to run today's virtual machines? It's VirtualBox, or one of its peers. Note however that it's not "any of today's virtual machine managers" because compatibility between their virtual machine formats is not perfect. It may work, or it may not. For simplicity I will use VirtualBox in the following, but you can substitute another name and the basic arguments still hold.

VirtualBox is a highly non-trivial piece of software, and it has very stringent hardware requirements. Those hardware requirements are met by the vast majority of today's computing equipment used in computational science, but the x86 platform is losing market share rapidly on the wider computing device market. VirtualBox doesn't run on an iPad, for example, and probably it never will. Is VirtualBox likely to be around in 20 years? I won't dare a prediction. If x86 survives for another 20 years AND if Oracle sees a continuing interest in this product, then it will. I won't bet on it though.

What we really need for long-term recomputability is a simple platform. A platform that is simple enough that the scientific community alone can afford to keep it alive for its own needs, even if no one else in the world cares about it.

Unfortunately there is no suitable platform today, to the best of my knowledge. Which is why virtual machines are perhaps the best option right now, for lack of a satisfactory one. But if we care about recomputability, we should design and develop a good supporting platform, starting as soon as possible.

For a more detailed discussion of this issue, see this paper written by yours truly. It comes to the conclusion that the closest existing approximation to a good platform is the Java virtual machine. What we'd want ideally is something similar to the JVM, but designed and optimized for scientific applications. A basic JVM implementation is quite simple (the complex JIT stuff is not a requirement), a few orders of magnitude simpler than VirtualBox, and it has no specific hardware dependencies. It's even simpler than many of today's scientific software packages, so the scientific community can definitely afford to keep it alive. The tough part is... no, it's not designing or writing the required software, it's agreeing on a specification. Perhaps it will never happen. Perhaps virtual machines will remain the best choice for lack of a satisfactory one. Or perhaps we will end up compiling our software to asm.js and running it in the browser, just because someone else will keep that platform alive for us, no matter how ill-adapted it is to our needs. But don't say you haven't been warned.

Bye bye Address Book, welcome BBDB

About two years ago I wrote a post about why and how I abandoned Apple's iCal for my agenda management and moved to Emacs org-mode instead. Now I am in the process of making the second step in the same direction: I am abandoning Apple's Address Book and starting to use the "Big Brother DataBase", the most popular contact management system from the Emacs universe.

What started to annoy me seriously about Address Book is a bug that makes the database and its backups grow over time, even if no contacts are added, because the images for the contacts keep getting copied and never deleted under certain circumstances. I ended up having address book backups of 200 MB for just 500 contacts, which is ridiculous. A quick Web search shows that the problem has been known for years but has not yet been fixed.

When I upgraded from MacOS 10.6 to 10.7 about a year ago (I am certainly not an early adopter of new MacOS versions), I had a second reason to dislike Address Book: the user interface had been completely re-designed and become a mess in the process. Every time I use it I have to figure out again how to navigate groups and contacts.

I had been considering moving to BBDB for a while, but I hadn't found any good solution for synchronizing contacts with my Android phone. That changed when I discovered ASynK, which does a bi-directional synchronization between a BBDB database and a Google Contacts account. That setup actually works better than anything I ever tried to synchronize Address Book with Google Contacts, so I gained more than I expected in the transition.

At first glance, it may seem weird to move from technology of the 2000's to technology of the 1970's. But the progress over that period in managing rather simple data such as contact information has been negligible. The big advantage of the Emacs platform over the MacOS platform is that it doesn't try to take control over my data. A BBDB database is just a plain text file whose structure is apparent after five minutes of study, whereas an Address Book database is stored in a proprietary format. A second advantage is that the Emacs developer community fixes bugs a lot faster than Apple does. A less shiny (but perfectly usable) user interface is a small price to pay.

A critical view of altmetrics

Altmetrics is one of the hotly debated topics in the Open Science movement today. In summary, the idea is that traditional bibliometric measures (citation counts, impact factors, h factors, ...) are too limited because they miss all the scientific activity that happens outside of the traditional journals. That includes the production of scientific contributions that are not traditional papers (e.g. datasets, software, blog posts) and the references to scientific contributions that are not in the citation list of a traditional paper (blogs, social networks, etc.). Note that the altmetrics manifesto describes altmetrics as a tool to help scientists find publications worth reading. I find it hard to believe that its authors have not thought of applications in evaluation of researchers and institutions, which will inevitably happen if altmetrics ever takes off.

At first sight, altmetrics appear to be an obvious "update" to traditional bibliometry. It sounds pretty obvious that, as scientific communication moves on to new media and finds new forms of expression, bibliometry should adapt. On the other hand, bibliometry is considered a more or less necessary evil by most scientists. Many deplore today's "publish or perish" culture and correctly observe that it is harmful to science in the long term, giving more importance to the marketing of research studies than to their careful design and meticulous execution. I haven't yet seen any discussion of this aspect in the context of altmetrics, so I'd like to start such a discussion with this post.

First of all, why is bibliometry so popular, and why is it harmful in the long run? Second, how will this change if and when altmetrics are adopted by the scientific community?

Bibliometry provides measures of scientific activity that have two important advantages: they are objective, based on data that anyone can check in principle, and they can be evaluated by anyone, even by a computer, without any need to understand the contents of scientific papers. On the downside, those measures can only indirectly represent scientific quality precisely because they ignore the contents. Bibliometry makes the fundamental assumption that the way specific articles are received by the scientific community can be used as a proxy for quality. That assumption is, of course, wrong, and that's how bibliometry ultimately harms the progress of science.

The techniques that people use to improve their bibliometrical scores without contributing to scientific progress are well known: dilution of content (more articles with less content per article), dilution of authorship (agreements between scientists to add each other's names to their works), marketing campaigns for getting more citations, application of a single technique to lots of very similar problems even if that adds no insight whatsoever. Altmetrics will cause the same techniques to be applied to datasets and software. For example, I expect scientific software developers to take Open Source libraries and re-publish them with small modifications under a new name, in order to have their name attached to them. Unless we come up with better techniques for software installation and deployment, this will probably make the management of scientific software a bit more complicated because we will have to deal with lots of small libraries. That's a technical problem that can and should be solved with a technical solution.

However, these most direct and most discussed negative consequences of bibliometry are not the only ones and perhaps not the worst. The replacement of expert judgement by majority vote, which is the basis of bibliometry, also in its altmetrics incarnation, leads to a phenomenon which I will call "scientific bubbles" in analogy to market bubbles in economics. A market bubble occurs if the price of a good is determined not by the people who buy it to satisfy some need, but by traders and speculators who try to estimate the future price of the good and make a profit from a rise or fall relative to the current price. In science, the "client" whose "need" is fulfilled by a scientific study is mainly future science, plus, in the case of applied research, engineering and product development. The role of traders and speculators is taken by referees and journal editors. A scientific bubble is a fashionable topic that many people work on not because of its scientific interest but because of the chance it provides to get a highly visible publication. Like market bubbles, scientific bubbles eventually explode when people realize that the once fashionable topic was a dead end. But before exploding, a bubble has wasted much money and intellectual energy. It may also have blocked alternative and ultimately more fruitful research projects that were refused funding because they were in contradiction with the dominating fashionable point of view.

My prediction is that altmetrics will make bubbles more numerous and more severe. One reason is the wider basis of sources from which references are counted. In today's citation-based bibliometry, citations come from articles that went through some journal's peer-reviewing process. No matter how imperfect peer review is, it does sort out most of the unfounded and obviously wrong contributions.  To get a paper published in a journal whose citations count, you need a minimum of scientific competence. In contrast, anyone can publish an opinion on Twitter or Facebook. Since for any given topic the number of experts is much smaller than the number of people with just some interest, a wider basis for judgement automatically means less competence on average. As a consequence, high altmetrics scores are best obtained by writing articles that appeal to the masses who can understand what the work is about but not judge if it is well-founded. Another reason why altmetrics will contribute to bubbles is the positive feedback loop created by people reading and citing publications because they are already widely read and cited. That effect is dampened in traditional bibliometry because of the slowness of the publishing and citation mechanism.

My main argument ends here, but I will try to anticipate some criticisms and reply to them immediately.

One objection I expect is that the analysis of citation graphs can be used to assign a kind of reputation to each source and weight references by this reputation. That is the principle of Google's famous PageRank algorithm. However, any analysis of the citation graph suffers from the same fundamental problem as bibliometry itself: a method that only looks at relations between publications but not at their contents can't distinguish a gem from a shiny bubble. There will be reputation bubbles just like there are topic bubbles. No purely quantitative analysis can ever make a statement about quality. The situation is similar to mathematical formalisms, with citation graph analysis taking the role of formal proof and scientific quality the role of truth in Gödel's incompleteness theorem.

Another likely criticism is that the concept of the scientific bubble is dubious. Many paths of scientific explorations have turned out to be failures, but no one could possibly have predicted this in the beginning. In fact, many ultimately successful strategies have initially been criticized as hopeless. Moreover, exploration of a wrong path can still lead to scientific progress, once the mistake has been understood. How can one distinguish promising but ultimately wrong ideas from bubbles? The borderline is indeed fuzzy, but that doesn't mean that the concept of a bubble is useless. It's the same for market bubbles, which exist but are less severe when a good is traded both for consumption and for speculation. My point is that the bubble phenomenon exists and is detrimental to scientific progress.

Lessons from sixteen years of molecular simulation in Python

A while ago I was chatting with two users of my Molecular Modelling Toolkit (MMTK), a library for molecular simulations written in Python. One of them asked me what I would do differently if I were to write MMTK today. That's an interesting question, but not the kind of question I can answer in a sentence or two, so I promised to write a blog post about this. Here it is.

First, a bit of history. The first version of MMTK was released about 16 years ago. I don't have the exact date, but the first message on the MMTK mailing list, announcing MMTK release 1.0b2, is dated 29 May 1997. Back then Python 1.4 was the state of the art and Numerical Python was a young project that was just beginning to stabilize. MMTK was one of the first domain-specific scientific libraries written in Python, at a time when the scientific Python community was very small and its members were mostly considered cranks by their peers. MMTK was designed from the start as a Python library, with relatively small bits of C code for the time-critical stuff (mainly energy evaluation and MD integration), with NumPy arrays at the Python-C interface. This has since become one of the two main approaches to using Python in scientific computing, the other one being wrapper code around libraries written in C/C++ or Fortran.

So what would I do differently if I were to start writing MMTK today? Many things, for different reasons. Let's first get the obvious stuff out of the way: the Python ecosystem has evolved significantly since 1997, and of course I would use Python 3, and Cython instead of C for the time-critical parts. I would also adopt many of the conventions that the community has developed but which weren't around in 1997. I might even be tempted to use bleeding-edge tools like Numba, although with hesitation: Numba is not only a moving target at this time, but also requires dependencies (I am thinking mostly of LLVM) which are big and non-trivial to install. One lesson I have learned in 16 years of scientific Python is that dependencies can cause more trouble than they are worth. It's nice in theory to re-use existing tested code, but it also makes installation and deployment more cumbersome.

So much for changes in the Python ecosystem. What has changed as well, though at a slower pace, is the role of computation in science and in particular in molecular simulation. Back in 1997, there were a few molecular simulation ecosystems that operated almost in isolation. The big players were the CHARMM, AMBER, and GROMOS/GROMACS communities. Each of them had their own software, their own file formats, and their own force fields. Members of these communities would of course talk about science to each other, but not share any software or data. Developing new computational methods required a serious investment in one of these ecosystems. That was in fact my main motivation for developing MMTK: I figured that I would be more efficient (not to mention more satisfied) writing a new system from scratch using modern development tools than trying to get familiar with crufty Fortran code. But I adopted basically the same approach with MMTK: I created a new ecosystem without much regard to sharing code or data with the rest of the world. As an illustration, MMTK defines its own trajectory format which I still consider superior to what the rest of the world is doing, but which is undeniably hard to use without MMTK, given that the definition of a universe is stored as an executable Python expression. MMTK also encourages storing data as Python pickle files, which are even harder to deal with for other programs.

Today we are seeing a change in attitude in computational science that I am sure will soon reach the molecular simulation community as well. People are starting to realize that computational results have serious reliability problems. The most publicized case in the structural biology community was the retraction of a few important published protein structures following the discovery of a bug in the data processing software that led to completely wrong final structures. This and similar events point to the urgent need for better validation of computational results. One aspect of validation is re-running the same computation with different tools. Another aspect is publishing both software and raw data, enabling other scientists to inspect them and check their validity. Technology for sharing scientific code and data exists today (have a look at Github, Bitbucket, and figshare, for example). But in molecular simulation, there are still important practical barriers to such validation attempts, in particular the use of program-specific and badly documented file formats. While MMTK's file formats are documented, they are still program-specific and thus incompatible with the requirements of the future.

The sentence that I would like to write now is "If I were to rewrite MMTK today, I would use the exchange data formats accepted by the molecular simulation community". But those formats don't exist yet, although there are a few initiatives to develop them. My own contribution to this effort is the Mosaic data model and data formats - if you are interested in this subject, please have a look at it and send me your feedback. Mosaic will of course find its way into future versions of MMTK.

Finally, there are things I would do differently because the experience with MMTK has shown that a few initial design decisions were not the best ones. Number one is the absence of stable atom numbers. In MMTK, each atom and molecule is represented by a unique Python object, and there are ways to refer uniquely to everything by using Python expressions. But there is no such thing as a unique order of atoms that would assign a number to each one. Atoms do have numbers by which the low-level C code refers to them, but these numbers can be different every time you run a Python script. My original design goal was to discourage the use of numbers to refer to atoms, because this is an important source of mistakes if the simulated system undergoes changes. But every other molecular simulation program out there uses numbers to refer to atoms, so people are used to them. For interoperability with other programs, atom numbers are fundamental. There are ways to handle such situations, of course, but it's a constant source of headaches.

The other design aspect that I would change if I were to rewrite MMTK today is the hierarchy of chemical objects. MMTK has Atoms, Groups, Molecules, and Complexes, plus specializations such as AminoAcidResidue (a special Group), PeptideChain (a special Molecule), and Protein (a special Complex). While all of these correspond to some chemical reality, the system is more complex than required for molecular simulation, leading in some situations to code that is bloated by irrelevant special cases. Today I'd go for just Atoms and Groups, with special features of specific kinds of groups indicated by attributes rather than specific classes.
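
As an illustration of what such a flattened hierarchy could look like, here is a small sketch in present-day Python. This is my illustration of the design idea, not MMTK code; all class, attribute, and kind names are made up.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Atom:
    symbol: str    # chemical element, e.g. "C"
    number: int    # a stable atom number, addressing the first design issue above

@dataclass
class Group:
    kind: str      # e.g. "amino_acid_residue", "peptide_chain", "protein"
    name: str
    children: List[Union["Group", Atom]] = field(default_factory=list)

    def atoms(self):
        # Traverse the tree in a fixed order, yielding atoms with stable numbers.
        for child in self.children:
            if isinstance(child, Atom):
                yield child
            else:
                yield from child.atoms()

A residue, a chain, or a whole protein would then just be a Group with a different kind attribute, keeping the class hierarchy flat.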

Integrating scientific software and datasets into the citation record

This morning I read C. Titus Brown's blog post on how science could be so much better if scientific data and the software used to work with it were openly available for reuse. One problem he mentions, like many others have done before, is the lack of incentive for publishing anything other than standard scientific papers. What matters for a scientist's career and for grant applications is papers, papers, papers. Any contribution that's not in a scientific journal with a reputation and an impact factor is usually ignored, even if its real impact exceeds that of many papers that nobody really wants to read.

Ideally, published scientific data and software should be treated just like a paper: it should be citeable and it should appear in the citation databases that are used to calculate impact factors, h factors, and whatever other metrics bibliometrists come up with and evaluation committees appreciate for their ease of use.

Treating text (i.e. papers), data, and code identically also turns out to be helpful for making scientific publications more useful to the reader, by adding interactive visualization and exploration of procedures (such as varying parameters) to the static presentation of results in a standard paper. This idea of "executable papers" has generated a lot of interest recently, as shown by Elsevier's Executable Paper Challenge and the Beyond the PDF workshop. For a technical description of how this can be achieved, see my ActivePapers project and/or the paper describing it. In the ActivePapers framework, a reference to code being called, or to a dataset being reused, is exactly identical to a reference to a published paper. It would then be much easier for citation databases to include all references rather than keep only the ones that are "classical" citations. And that's a good motivation to finally treat all scientific contributions equally.

Since the executable papers idea is much easier to sell than the idea of an updated incentive system, a seemingly innocent choice in technology could end up helping to change the way scientists and research projects are evaluated.

The ultimate calculator for Android and iOS

Calculators are among the most popular applications for smartphones, and therefore it is not surprising that the Google Play Store has more than 1000 calculators for the Android platform. Having used HP's scientific calculators for more than 20 years, I picked RealCalc when I got my Android phone and set it to RPN mode. It works fine, I have no complaints about it. But I no longer use it because I found something much more powerful.

It's called "J", which isn't exactly a very descriptive name. And that's probably a good idea because describing it is not so easy. J is much more than a calculator, but it does the calculator job very well. It's actually a full programming language, but one that differs substantially from everything else that goes by that label. The best description for J I can come up with is "executable mathematical notation". You type an expression, and you get the result. That's in fact not very different from working interactively with Python or Matlab, except that the expressions are very different. You can write traditional programs in J, using loops, conditionals, etc., but you can get a lot of work done without ever using these features.

The basic data structure in J is the array, which can have any number of dimensions. Array elements can be numbers, characters, or other arrays. Numbers (zero-dimensional arrays) and text strings (one-dimensional arrays of characters) are just special cases. In J jargon, which takes its inspiration from linguistics, data items are called "nouns". Standard mathematical operators (such as + or -) are called "verbs" and can have one or two arguments (one left, one right). An expression is called a "sentence". There are no precedence rules, the right argument of any verb being everything to its right. Given the large number of verbs in J, this initially unfamiliar rule makes a lot of sense. A simple example (also showing the use of arrays) is
   2 * 3 + 10 20 30
26 46 66

Up to here, J expressions are not very different from Python or Matlab expressions. What J doesn't have is functions with the familiar f(x, y, z) syntax, accepting any number of arguments. There are only verbs, with one or two arguments. But what makes J really different from the well-known languages for scientific computing are the "parts of speech" that have no simple equivalent elsewhere: adverbs and conjunctions.

An adverb takes a verb argument and produces a derived verb from it. For example, the adverb ~ takes a two-argument verb (a dyad in J jargon) and turns it into a one-argument verb (a monad) that's equivalent to using the dyad with two equal arguments. With + standing for plain addition, +~ thus doubles its argument:
   +~ 1 5 10 20
2 10 20 40

meaning it is the same as
   1 5 10 20 + 1 5 10 20
2 10 20 40

A conjunction combines a verb with a noun or another verb to produce a derived verb. An example is ^:, the power conjunction, which applies a verb several times:
   +~(^:2) 1 5 10 20
4 20 40 80
   +~(^:3) 1 5 10 20
8 40 80 160

The parentheses are required to separate the argument of the power conjunction (2 or 3) from the array that is the argument to the resulting derived verb. To see the real power of the power conjunction, consider that it accepts negative arguments as well:
   +~(^:_1) 1 5 10 20
0.5 2.5 5 10

You read that correctly: J can figure out that the inverse of adding a number to itself is dividing that number by two!

Pretty much any programming language permits you to assign values to names for re-use in later expressions. J is no exception:
   data =. 1 5 10 20
   double =. +~
   double data
2 10 20 40
   inv =. ^:_1
   halve =. double inv
   halve data
0.5 2.5 5 10

As you can see, names can be given not just to nouns (i.e. data), but also to verbs, adverbs, and conjunctions. Most J programs are just pieces of expressions that are assigned to names. Which means that the short summary of J that I have given here could well be all you ever need to know about the language - apart from the fact that you will have to acquire a working knowledge of many more verbs, adverbs, and conjunctions.

Before you rush off to the Play Store looking for J, let me add that J is not yet there, although it's supposed to arrive soon. For now, you have to download the APK and install it yourself, using your preferred Android file manager. I should also point out that J is not just for Android. It's been around for more than 20 years, and you can get J for all the common computing platforms from Jsoftware. There's also an iOS version for the iPhone and iPad. J's extreme terseness is a perfect fit for smartphones, where screen space is a scarce resource and where every character you don't have to type saves you a lot of time.

The Nix package manager in computational science

In an earlier post, I mentioned the Nix package management system as a candidate for ensuring reproducibility in computational science. What distinguishes Nix from the better known package managers (Debian, RPM, ...) is that it permits the installation of different versions of the same package in parallel, with a dependency tracking system that refers to a precise version of everything, including the versions of the development tools (compilers, ...) that were used to build the libraries and executables. Nix thus remembers for each package the complete details of how it can be reconstructed, which is what we would like to see for ensuring reproducibility.

There are, however, two caveats. First of all, Nix was designed for software installation management and not for computation. While in principle one could define the results (figures, tables, datasets) of some computation as a Nix package and perform the computation by installing the package, such an approach is quite cumbersome with the Nix support tools designed with a different task in mind. However, computation-specific support tools would probably suffice to fix this. Second, while the design of Nix looks quite sound, it is a young project with much less manpower behind it than the big package managers of the Linux world. This means there are fewer package definitions and they are overall less reliable. For example, I haven't yet managed to install my research computing environment (Python, NumPy, matplotlib, plus a few more packages) using Nix under MacOS X, because some packages simply fail to build. Again this is not an insurmountable problem, but it requires some serious effort to fix.

The Nix documentation is pretty good at describing how to use the package manager and the collection of package definitions for Linux and MacOS X named Nixpkgs. It is not so good at giving a basic understanding of how Nix works, which becomes important when you want to use it for something other than traditional package management. The following overview is the result of my own explorations of Nix. I am not a Nix authority, so be warned that there may be mistakes or misunderstandings.

At the heart of Nix is the "Nix store", a central database where everything managed by Nix is kept. Its default location is /nix/store and if you look at it you see an overwhelmingly long list of cryptic filenames. Let's zoom in on something to see what's going on. Here is what ls -l /nix/store/*zlib* shows on my machine:


-r--r--r-- 1 hinsen staff 1000 Jan 1 1970
/nix/store/12vkkhs36xffzpqjaaa3vqhqv2yc97vs-zlib-1.2.6.drv
-r--r--r-- 1 hinsen staff 1181 Jan 1 1970
/nix/store/gymcn145ihhmymm6yk2wxqfd49s5dzdq-zlib-1.2.6.drv
dr-xr-xr-x 5 hinsen staff 170 Jan 1 1970
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
-r--r--r-- 1 hinsen staff 1000 Jan 1 1970
/nix/store/sj8l48kfc40wh8adb5pa843lwy38hskb-zlib-1.2.6.drv
-r--r--r-- 1 hinsen staff 1686 Jan 1 1970
/nix/store/xpm2xja2zv5agmdzgi362jqd5xx9ny10-zlib-1.2.6.tar.gz.drv

The single directory in that list actually contains the zlib installation in the familiar Unix file layout that you find under /usr or /usr/local:

~> ls -R /nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6:
include lib share

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/include:
zconf.h zlib.h

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib:
libz.1.2.6.dylib libz.1.dylib libz.a libz.dylib pkgconfig

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib/pkgconfig:
zlib.pc

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share:
man

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share/man:
man3

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share/man/man3:
zlib.3.gz

Note that it contains just zlib, and nothing else, in particular not zlib's dependencies. Each library or application has its own directory in the Nix store.

Next, let's look at all the other files, those with the extension .drv (for "derivation", a Nix term for any artefact derived from human-provided input). There are three files that end in zlib-1.2.6.drv and one that ends in zlib-1.2.6.tar.gz.drv. Let's look at the contents of the last one first. I have made it more readable by adding whitespace:


Derive(
[("out",
"/nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz",
"sha256",
"21235e08552e6feba09ea5e8d750805b3391c62fb81c71a235c0044dc7a8a61b")],
[("/nix/store/lhc0qhfdrw32rj1z7s5p90nbjfnkydhb-stdenv.drv",
["out"]),
("/nix/store/pawry9l3415kwfbfh4zrhgnynwfb10bs-mirrors-list.drv",
["out"])],

["/nix/store/01w11lngp8s4lxllyr6xbmjfyrfkrn43-builder.sh"],

"x86_64-darwin",
"/bin/bash",
["-e",
"/nix/store/01w11lngp8s4lxllyr6xbmjfyrfkrn43-builder.sh"],

[("buildInputs",""),
("buildNativeInputs",""),
("builder","/bin/bash"),
("id",""),
("impureEnvVars","http_proxy https_proxy ftp_proxy all_proxy no_proxy NIX_CURL_FLAGS NIX_HASHED_MIRRORS NIX_MIRRORS_apache NIX_MIRRORS_bitlbee NIX_MIRRORS_cpan NIX_MIRRORS_debian NIX_MIRRORS_fedora NIX_MIRRORS_gcc NIX_MIRRORS_gentoo NIX_MIRRORS_gnome NIX_MIRRORS_gnu NIX_MIRRORS_gnupg NIX_MIRRORS_hashedMirrors NIX_MIRRORS_imagemagick NIX_MIRRORS_kde NIX_MIRRORS_kernel NIX_MIRRORS_metalab NIX_MIRRORS_oldsuse NIX_MIRRORS_opensuse NIX_MIRRORS_postgresql NIX_MIRRORS_savannah NIX_MIRRORS_sf NIX_MIRRORS_sourceforge NIX_MIRRORS_ubuntu NIX_MIRRORS_xorg"),
("mirrorsFile","/nix/store/mmk41rbja1fvclbr7ghirzcigxlzl6f0-mirrors-list"),
("name","zlib-1.2.6.tar.gz"),
("out","/nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz"),
("outputHash","06x6m33ls1606ni7275q5z392csvh18dgs55kshfnvrfal45w8r1"),
("outputHashAlgo","sha256"),
("preferHashedMirrors","1"),
("preferLocalBuild","1"),
("propagatedBuildInputs",""),
("propagatedBuildNativeInputs",""),
("showURLs",""),
("stdenv","/nix/store/9fnvs0bvhrszazham5cnl13h52hvm1rk-stdenv"),
("system","x86_64-darwin"),
("urls","http://www.zlib.net/zlib-1.2.6.tar.gz mirror://sourceforge/libpng/zlib/1.2.6/zlib-1.2.6.tar.gz")])

If that looks like a computational expression in a programming language, that's because it is. Don't worry, it's not something you are expected to write yourself; these expressions are created from the package definitions written in a more user-friendly syntax called "Nix expressions", which is very well documented in the Nix documentation. The expression shown above defines how to make (or "realise" in Nix jargon) the derivation /nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz, which is a rather simple one because the file is simply downloaded and verified against a known checksum. But even such a simple derivation has dependencies: the "standard environment" stdenv and the list of download mirror sites, mirrors-list.

It's time to say something about those funny 32-character prefixes in all the file names in the Nix store. You may have noticed that the zlib file list above contains two entries for zlib-1.2.6.drv that are identical except for this prefix. It looks as if the prefix is there to distinguish things that would otherwise be identical. This is true, and the information encoded in the prefix (which is a hash code) is the complete set of dependencies. The two zlib derivations differ in the version of the standard environment they were built with. I have both of these in my Nix store because I have played around with different releases of Nixpkgs. Nix really tries to keep track of every single dependency, including the exact versions of the various tools (mainly compilers) that were used in building a binary installation. That means you can keep lots of different versions of every single item on your system at the same time, and trace back exactly how they were built. You can also send a copy of the relevant derivation files (those with the .drv extension) to someone else, who can reproduce the exact same environment by "realising" those derivations again.
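
To make the idea behind those prefixes a bit more tangible, here is a toy Python computation of a store-style name from a package's name, version, and the full set of its build inputs. This is not Nix's real hashing scheme (Nix hashes serialized derivations and uses a truncated base-32 encoding); it merely shows why two builds of "the same" package end up with different prefixes as soon as any input differs.

import hashlib

def toy_store_name(name, version, build_inputs):
    # Hash the package identity together with every build input.
    h = hashlib.sha256()
    h.update(("%s-%s" % (name, version)).encode())
    for item in sorted(build_inputs):   # any change in any input changes the hash
        h.update(item.encode())
    return "%s-%s-%s" % (h.hexdigest()[:32], name, version)

print(toy_store_name("zlib", "1.2.6", ["stdenv-A", "x86_64-darwin"]))
print(toy_store_name("zlib", "1.2.6", ["stdenv-B", "x86_64-darwin"]))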

With so many zlibs floating around, which one does Nix use when you ask it to install some application that uses zlib? The one you specify. When some application requires zlib as a dependency, you have to tell Nix exactly which zlib derivation you want to be used. You don't normally do this manually for every single build (though you could); instead, you use a coherent set of package definitions (such as Nixpkgs) that specifies all the interdependencies among hundreds of packages. The package definitions take the form of "Nix expressions", which are written in a language specifically designed for this purpose. Files containing Nix expressions have the extension .nix. Since the language is rather well documented in the Nix manual, I won't say any more about it here. A good starting point is to explore Nixpkgs. It helps to know that the central file is pkgs/top-level/all-packages.nix. This file imports the definitions of individual packages from their respective files and makes a consistent package collection from them. When you build a particular derivation from Nixpkgs, only the packages listed explicitly as its dependencies are available in the build environment that is set up specifically for this build operation. No "default library" (such as /usr/lib) is used at all.

There is one more layer to Nix, whose role is twofold: making it convenient for users to work with programs installed through Nix, and permitting the removal of packages that were installed but are no longer needed.
Let's start with the second aspect because it is the simpler one: packages can be removed as soon as nobody needs them any more. This requires a way to figure out which packages are still needed. Obviously the packages that some user on the system wants to access are "needed", and that's why cleanup is related to user profiles which I will cover in a minute. The remaining needed packages are the dependencies of other needed packages. So once we know the packages that all users put together request to use, we can figure out which packages can safely be deleted. This clean-up operation is called "garbage collection" and handled by the command nix-store --gc.

Nix user environments are managed using the command nix-env, and if you don't care about how Nix works, that command is the only one you may ever need. Each user has his/her own environment, of course, which consists mainly of a directory named $HOME/.nix-profile. That directory contains subdirectories called bin, lib, man etc. whose names should sound familiar. They contain nothing but symbolic links into the Nix store. These links define which package the user actually accesses, by putting $HOME/.nix-profile/bin on the PATH environment variable. When you use nix-env to install a package, Nix builds it and puts it into the Nix store (unless it's already there), and then creates symbolic links in your Nix profile, which may replace links to some different version of a package. It is important to understand that your user profile never enters into the build process of any Nix derivation. Your profile is exclusively for your own use and has no impact on Nix package management other than protecting the packages you use from being removed during garbage collection.

So much for a first report on my exploration of Nix. I will continue trying to get my computational environment built with Nix, so that I can start to explore how to use it for reproducible computations. Watch this space for news.

PS: After I published this post initially, the friendly people on the Nix mailing list pointed out some additional material for learning about Nix. First of all, there is Eelco Dolstra's thesis entitled "The Purely Functional Software Deployment Model", which is what you should read if you really want to know everything about Nix. There's also Sander van der Burg's blog which has some very detailed posts about Nix and what it can be used for. You could start with this introduction.


