Recent posts
Bye bye Address Book, welcome BBDB
What started to annoy me seriously about Address Book is a bug that makes the database and its backups grow over time, even if no contacts are added, because the images for the contacts keep getting copied and never deleted under certain circumstances. I ended up having address book backups of 200 MB for just 500 contacts, which is ridiculous. A quick Web search shows that the problem has been known for years but has not yet been fixed.
When I upgraded from MacOS 10.6 to 10.7 about a year ago (I am certainly not an early adopter of new MacOS versions), I had a second reason to dislike Address Book: the user interface had been completely re-designed and become a mess in the process. Every time I use it I have to figure out again how to navigate groups and contacts.
I had been considering moving to BBDB for a while, but I hadn't found any good solution for synchronizing contacts with my Android phone. That changed when I discovered ASynK, which does a bi-directional synchronization between a BBDB database and a Google Contacts account. That setup actually works better than anything I ever tried to synchronize Address Book with Google Contacts, so I gained more than I expected in the transition.
At first glance, it may seem weird to move from technology of the 2000's to technology of the 1970's. But the progress over that period in managing rather simple data such as contact information has been negligible. The big advantage of the Emacs platform over the MacOS platform is that it doesn't try to take control over my data. A BBDB database is just a plain text file whose structure is apparent after five minutes of study, whereas an Address Book database is stored in a proprietary format. A second advantage is that the Emacs developer community fixes bugs a lot faster than Apple does. A less shiny (but perfectly usable) user interface is a small price to pay.
A critical view of altmetrics
At first sight, altmetrics appear as an evident "update" to traditional bibliometry. It sounds pretty obvious that, as scientific communication moves on to new media and finds new forms of expression, bibliometry should adapt. On the other hand, bibliometry is considered a more or less necessary evil by most scientists. Many deplore today's "publish or perish" culture and correctly observe that it is harmful to science in the long term, giving more importance to the marketing of research studies than to their careful design and meticulous execution. I haven't yet seen any discussion of this aspect in the context of altmetrics, so I'd like to start such a discussion with this post.
First of all, why is bibliometry so popular, and why is it harmful in the long run? Second, how will this change if and when altmetrics are adopted by the scientific community?
Bibliometry provides measures of scientific activity that have two important advantages: they are objective, based on data that anyone can check in principle, and they can be evaluated by anyone, even by a computer, without any need to understand the contents of scientific papers. On the downside, those measures can only indirectly represent scientific quality precisely because they ignore the contents. Bibliometry makes the fundamental assumption that the way specific articles are received by the scientific community can be used as a proxy for quality. That assumption is, of course, wrong, and that's how bibliometry ultimately harms the progress of science.
The techniques that people use to improve their bibliometrical scores without contributing to scientific progress are well known: dilution of content (more articles with less content per article), dilution of authorship (agreements between scientists to add each others' names to their works), marketing campaigns for getting more citations, application of a single technique to lots of very similar applications even if that adds no insight whatsoever. Altmetrics will cause the same techniques to be applied to datasets and software. For example, I expect scientific software developers to take Open Source libraries and re-publish them with small modifications under a new name, in order to have their name attached to them. Unless we come up with better techniques for software installation and deployment, this will probably make the management of scientific software a bit more complicated because we will have to deal with lots of small libraries. That's a technical problem that can and should be solved with a technical solution.
However, these most direct and most discussed negative consequences of bibliometry are not the only ones and perhaps not the worst. The replacement of expert judgement by majority vote, which is the basis of bibliometry, also in its altmetrics incarnation, leads to a phenomenon which I will call "scientific bubbles" in analogy to market bubbles in economy. A market bubble occurs if the price of a good is determined not by the people who buy it to satisfy some need, but by traders and speculators who try to estimate the future price of the good and make a profit from a rise or fall relative to the current price. In science, the "client" whose "need" is fulfilled by a scientific study is mainly future science, plus, in the case of applied research, engineering and product development. The role of traders and speculators is taken by referees and journal editors. A scientific bubble is a fashionable topic that many people work on not because of its scientific interest but because of the chance it provides to get a highly visible publication. Like market bubbles, scientific bubbles eventually explode when people realize that the once fashionable topic was a dead end. But before exploding, a bubble has wasted much money and intellectual energy. It may also have blocked alternative and ultimately more fruitful research projects that were refused funding because they were in contradiction with the dominating fashionable point of view.
My prediction is that altmetrics will make bubbles more numerous and more severe. One reason is the wider basis of sources from which references are counted. In today's citation-based bibliometry, citations come from articles that went through some journal's peer-reviewing process. No matter how imperfect peer review is, it does sort out most of the unfounded and obviously wrong contributions. To get a paper published in a journal whose citations count, you need a minimum of scientific competence. In contrast, anyone can publish an opinion on Twitter or Facebook. Since for any given topic the number of experts is much smaller than the number of people with just some interest, a wider basis for judgement automatically means less competence on average. As a consequence, high altmetrics scores are best obtained by writing articles that appeal to the masses who can understand what the work is about but not judge if it is well-founded. Another reason why altmetrics will contribute to bubbles is the positive feedback loop created by people reading and citing publications because they are already widely read and cited. That effect is dampened in traditional bibliometry because of the slowness of the publishing and citation mechanism.
My main argument ends here, but I will try to anticipate some criticisms and reply to them immediately.
One objection I expect is that the analysis of citation graphs can be used to assign a kind of reputation to each source and weight references by this reputation. That is the principle of Google's famous PageRank algorithm. However, any analysis of the citation graph suffers from the same fundamental problem as bibliometry itself: a method that only looks at relations between publications but not at their contents can't distinguish a gem from a shiny bubble. There will be reputation bubbles just like there are topic bubbles. No purely quantitative analysis can ever make a statement about quality. The situation is similar to mathematical formalisms, with citation graph analysis taking the role of formal proof and scientific quality the role of truth in Gödel's incompleteness theorem.
Another likely criticism is that the concept of the scientific bubble is dubious. Many paths of scientific explorations have turned out to be failures, but no one could possibly have predicted this in the beginning. In fact, many ultimately successful strategies have initially been criticized as hopeless. Moreover, exploration of a wrong path can still lead to scientific progress, once the mistake has been understood. How can one distinguish promising but ultimately wrong ideas from bubbles? The borderline is indeed fuzzy, but that doesn't mean that the concept of a bubble is useless. It's the same for market bubbles, which exist but are less severe when a good is traded both for consumption and for speculation. My point is that the bubble phenomenon exists and is detrimental to scientific progress.
Lessons from sixteen years of molecular simulation in Python
First, a bit of history. The first version of MMTK was released about 16 years ago. I don't have the exact date, but the first message on the MMTK mailing list, announcing MMTK release 1.0b2, is dated 29 May 1997. Back then Python 1.4 was the state of the art and Numerical Python was a young project that was just beginning to stabilize. MMTK was one of the first domain-specific scientific libraries written in Python, at a time when the scientific Python community was very small and its members were mostly considered cranks by their peers. MMTK was designed from the start as a Python library, with relatively small bits of C code for the time-critical stuff (mainly energy evaluation and MD integration), with NumPy arrays at the Python-C interface. This has since become one of the two main approaches to using Python in scientific computing, the other one being wrapper code around libraries written in C/C++ or Fortran.
So what would I do differently if I were to start writing MMTK today? Many things, for different reasons. Let's first get the obvious stuff out of the way: the Python ecosystem has evolved significantly since 1997, and of course I would use Python 3, and Cython instead of C for the time-critical parts. I would also adopt many of the conventions that the community has developed but which weren't around in 1997. I might even be tempted to use bleeding-edge tools like Numba, although with hesitation: Numba is not only a moving target at this time, but also requires dependencies (I am thinking mostly of LLVM) which are big and non-trivial to install. One lesson I have learned in 16 years of scientific Python is that dependencies can cause more trouble than they are worth. It's nice in theory to re-use existing tested code, but it also makes installation and deployment more cumbersome.
So much for changes in the Python ecosystem. What has changed as well, though at a slower pace, is the role of computation in science and in particular in molecular simulation. Back in 1997, there were a few molecular simulation ecosystems that operated almost in isolation. The big players were the CHARMM, AMBER, and GROMOS/GROMACS communities. Each of them had their own software, their own file formats, and their own force fields. Members of these communities would of course talk about science to each other, but not share any software or data. Developing new computational methods required a serious investment into one of these ecosystems. That was in fact my main motivation for developing MMTK: I figured that I would be more efficient (not to mention more satisfied) writing a new system from scratch using modern development tools than trying to get familiar with crufty Fortran code. But I adopted basically the same approach with MMTK: I created a new ecosystem without much regard to sharing code or data with the rest of the world. As an illustration, MMTK defines its own trajectory format which I still consider superior to what the rest of the world is doing, but which is undeniably hard to use without MMTK, given that the definition of a universe is stored as an executable Python expression. MMTK also encourages storing data as Python pickle files, which are even harder to deal with for other programs.
Today we are seeing a change in attitude in computational science that I am sure will soon reach the molecular simulation community as well. People are starting to realize that computational results have serious reliability problems. The most publicized case in the structural biology community was the retraction of a few important published protein structures following the discovery of a bug in the data processing software that led to completely wrong final structures. This and similar events point to the urgent need for better validation of computational results. One aspect of validation is re-running the same computation with different tools. Another aspect is publishing both software and raw data, enabling other scientists to inspect them and check their validity. Technology for sharing scientific code and data exists today (have a look at Github, Bitbucket, and figshare, for example). But in molecular simulation, there are still important practical barriers to such validation attempts, in particular the use of program-specific and badly documented file formats. While MMTK's file formats are documented, they are still program-specific and thus incompatible with the requirements of the future.
The sentence that I would like to write now is "If I were to rewrite MMTK today, I would use the exchange data formats accepted by the molecular simulation community". But those formats don't exist yet, although there are a few initiatives to develop them. My own contribution to this effort is the Mosaic data model and data formats - if you are interested in this subject, please have a look at it and send me your feedback. Mosaic will of course find its way into future versions of MMTK.
Finally, there are things I would do differently because the experience with MMTK has shown that a few initial design decisions were not the best ones. Number one is the absence of stable atom numbers. In MMTK, each atom and molecule is represented by a unique Python object, and there are ways to refer uniquely to everything by using Python expressions. But there is no such thing as a unique order of atoms that would assign a number to each one. Atoms do have numbers by which the low-level C code refers to them, but these numbers can be different every time you run a Python script. My original design goal was to discourage the use of numbers to refer to atoms, because this is an important source of mistakes if the simulated system undergoes changes. But every other molecular simulation program out there uses numbers to refer to atoms, so people are used to them. For interoperability with other programs, atom numbers are fundamental. There are ways to handle such situations, of course, but it's a constant source of headaches.
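To make that source of mistakes concrete, here is a toy Python sketch (nothing to do with MMTK's actual data structures): a reference by position silently goes stale when the system changes, whereas a reference to the object itself does not.
# Toy illustration, not MMTK code: why numeric atom indices are fragile.
atoms = ["H1", "H2", "O"]        # stand-ins for atom objects
oxygen_index = 2                 # refer to the oxygen by its position
oxygen_ref = atoms[2]            # refer to it by the object itself
del atoms[0]                     # the system changes: one atom is removed
print(oxygen_index < len(atoms)) # False: the index no longer points at anything valid
print(oxygen_ref in atoms)       # True: the object reference is still good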
The other design aspect that I would change if I were to rewrite MMTK today is the hierarchy of chemical objects. MMTK has Atoms, Groups, Molecules, and Complexes, plus specializations such as AminoAcidResidue (a special Group), PeptideChain (a special Molecule), and Protein (a special Complex). While all of these correspond to some chemical reality, the system is more complex than required for molecular simulation, leading in some situations to code that is bloated by irrelevant special cases. Today I'd go for just Atoms and Groups, with special features of specific kinds of groups indicated by attributes rather than specific classes.
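A minimal sketch of what I mean, in present-day Python (hypothetical code, not MMTK's actual class hierarchy): everything is an Atom or a Group, and being a residue, a peptide chain, or a protein is just an attribute of a Group.
# Hypothetical sketch, not MMTK's actual classes.
class Atom:
    def __init__(self, element, name):
        self.element = element
        self.name = name

class Group:
    def __init__(self, children, **attributes):
        self.children = children        # Atoms or other Groups
        self.attributes = attributes    # e.g. kind='peptide_chain'

glycine = Group([Atom("N", "N"), Atom("C", "CA"), Atom("C", "C"), Atom("O", "O")],
                kind="residue", residue_name="GLY")
chain = Group([glycine], kind="peptide_chain")
protein = Group([chain], kind="protein")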
Integrating scientific software and datasets into the citation record
This morning I read C. Titus Brown's blog post on how science could be so much better if scientific data and the software used to work with it were openly available for reuse. One problem he mentions, like many others have done before, is the lack of incentive for publishing anything else but standard scientific papers. What matters for a scientist's career and for grant applications is papers, papers, papers. Any contribution that's not in a scientific journal with a reputation and an impact factor is usually ignored, even if its real impact exceeds that of many papers that nobody really wants to read.
Ideally, published scientific data and software should be treated just like a paper: it should be citeable and it should appear in the citation databases that are used to calculate impact factors, h factors, and whatever other metrics bibliometrists come up with and evaluation committees appreciate for their ease of use.
Treating text (i.e. papers), data, and code identically also happens to be useful for making scientific publications more useful to the reader, by adding interactive visualization and exploration of procedures (such as varying parameters) to the static presentation of results in a standard paper. This idea of "executable papers" has generated a lot of interest recently, as shown by Elsevier's Executable Paper Challenge and the Beyond the PDF workshop. For a technical description of how this can be achieved, see my ActivePapers project and/or the paper describing it. In the ActivePapers framework, a reference to code being called, or to a dataset being reused, is exactly identical to a reference to a published paper. It would then be much easier for citation databases to include all references rather than filter out the ones that are "classical" citations. And that's a good motivation to finally treat all scientific contributions equally.
Since the executable papers idea is much easier to sell than the idea of an updated incentive system, a seemingly innocent choice in technology could end up helping to change the way scientists and research projects are evaluated.
The ultimate calculator for Android and iOS
It's called "J", which isn't exactly a very descriptive name. And that's probably a good idea because describing it it not so easy. J is much more than a calculator, but it does the calculator job very well. It's actually a full programming language, but one that differs substantially from everything else that goes by that label. The best description for J I can come up with is "executable mathematical notation". You type an expression, and you get the result. That's in fact not very different from working interactively with Python or Matlab, except that the expressions are very different. You can write traditional programs in J, using loops, conditionals, etc., but you can a lot of work done without ever using these features.
The basic data structure in J is the array, which can have any number of dimensions. Array elements can be numbers, characters, or other arrays. Numbers (zero-dimensional arrays) and text strings (one-dimensional arrays of characters) are just special cases. In J jargon, which takes its inspiration from linguistics, data items are called "nouns". Standard mathematical operators (such as + or -) are called "verbs" and can have one or two arguments (one left, one right). An expression is called a "sentence". There are no precedence rules, the right argument of any verb being everything to its right. Given the large number of verbs in J, this initially unfamiliar rule makes a lot of sense. A simple example (also showing the use of arrays) is
2 * 3 + 10 20 30
26 46 66
Up to here, J expressions are not very different from Python or Matlab expressions. What J doesn't have is functions with the familiar f(x, y, z) syntax, accepting any number of arguments. There are only verbs, with one or two arguments. But what makes J really different from the well-known languages for scientific computing are the "parts of speech" that have no simple equivalent elsewhere: adverbs and conjunctions.
An adverb takes a verb argument and produces a derived verb from it. For example, the adverb ~ takes a two-argument verb (a dyad in J jargon) and turns it into a one-argument verb (a monad) that's equivalent to using the dyad with two equal arguments. With + standing for plain addition, +~ thus doubles its argument:
+~ 1 5 10 20
2 10 20 40
meaning it is the same as
1 5 10 20 + 1 5 10 20
2 10 20 40
A conjunction combines a verb with a noun or another verb to produce a derived verb. An example is ^:, the power conjunction, which applies a verb several times:
+~(^:2) 1 5 10 20
4 20 40 80
+~(^:3) 1 5 10 20
8 40 80 160
The parentheses are required to separate the argument of the power conjunction (2 or 3) from the array that is the argument to the resulting derived verb. To see the real power of the power conjunction, consider that it accepts negative arguments as well:
+~(^:_1) 1 5 10 20
0.5 2.5 5 10
You read that right: J can figure out that the inverse of adding a number to itself is dividing that number by two!
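For readers who know Python better than J, here is a rough analogy (plain Python, and only an analogy): an adverb is like a function that takes a function and returns a new one, and the power conjunction is like repeated application. The last step, deriving the inverse for a negative power, is precisely what J does for you and Python does not.
# Rough Python analogy of J's ~ adverb and ^: conjunction.
def reflexive(dyad):                  # plays the role of the adverb ~
    return lambda x: dyad(x, x)

def power(verb, n):                   # plays the role of ^: for n >= 0
    def repeated(x):
        for _ in range(n):
            x = verb(x)
        return x
    return repeated

double = reflexive(lambda a, b: a + b)
print(double(5))                      # 10, like  +~ 5
print(power(double, 3)(5))            # 40, like  +~(^:3) 5
# power(double, -1) would require deriving the inverse (halving) automatically;
# J does exactly that, plain Python has no way to.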
Pretty much any programming language permits you to assign values to names for re-use in later expressions. J is no exception:
data =. 1 5 10 20
double =. +~
double data
2 10 20 40
inv =. ^:_1
halve =. double inv
halve data
0.5 2.5 5 10
As you can see, names can be given not just to nouns (i.e. data), but also to verbs, adverbs, and conjunctions. Most J programs are just pieces of expressions that are assigned to names. Which means that the short summary of J that I have given here could well be all you ever need to know about the language - apart from the fact that you will have to acquire a working knowledge of many more verbs, adverbs, and conjunctions.
Before you rush off to the Play Store looking for J, let me add that J is not yet there, although it's supposed to arrive soon. For now, you have to download the APK and install it yourself, using your preferred Android file manager. I should also point out that J is not just for Android. It's been around for more than 20 years, and you can get J for all the common computing platforms from Jsoftware. There's also an iOS version for the iPhone and iPad. J's extreme terseness is a perfect fit for smartphones, where screen space is a scarce resource and where every character you don't have to type saves you a lot of time.
The Nix package manager in computational science
In an earlier post, I mentioned the Nix package management system as a candidate for ensuring reproducibility in computational science. What distinguishes Nix from the better known package managers (Debian, RPM, ...) is that it permits the installation of different versions of the same package in parallel, with a dependency tracking system that refers to a precise version of everything, including the versions of the development tools (compilers, ...) that were used to build the libraries and executables. Nix thus remembers for each package the complete details of how it can be reconstructed, which is what we would like to see for ensuring reproducibility.
There are, however, two caveats. First of all, Nix was designed for software installation management and not for computation. While in principle one could define the results (figures, tables, datasets) of some computation as a Nix package and perform the computation by installing the package, such an approach is quite cumbersome with the current Nix support tools, which were designed with a different task in mind. However, computation-specific support tools would probably suffice to fix this. Second, while the design of Nix looks quite sound, it is a young project with much less manpower behind it than the big package managers of the Linux world. This means there are fewer package definitions and they are overall less reliable. For example, I haven't yet managed to install my research computing environment (Python, NumPy, matplotlib, plus a few more packages) using Nix under MacOS X, because some packages simply fail to build. Again this is not an insurmountable problem, but it requires some serious effort to fix.
The Nix documentation is pretty good at describing how to use the package manager and the collection of package definitions for Linux and MacOS X named Nixpkgs. It is not so good at giving a basic understanding of how Nix works, which becomes important when you want to use it for something else than traditional package management. The following overview is the result of my own explorations of Nix. I am not a Nix authority, so be warned that there may be mistakes or misunderstandings.
At the heart of Nix is the "Nix store", a central database where everything managed by Nix is kept. Its default location is /nix/store and if you look at it you see an overwhelmingly long list of cryptic filenames. Let's zoom in on something to see what's going on. Here is what ls -l /nix/store/*zlib* shows on my machine:
-r--r--r-- 1 hinsen staff 1000 Jan 1 1970
/nix/store/12vkkhs36xffzpqjaaa3vqhqv2yc97vs-zlib-1.2.6.drv
-r--r--r-- 1 hinsen staff 1181 Jan 1 1970
/nix/store/gymcn145ihhmymm6yk2wxqfd49s5dzdq-zlib-1.2.6.drv
dr-xr-xr-x 5 hinsen staff 170 Jan 1 1970
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
-r--r--r-- 1 hinsen staff 1000 Jan 1 1970
/nix/store/sj8l48kfc40wh8adb5pa843lwy38hskb-zlib-1.2.6.drv
-r--r--r-- 1 hinsen staff 1686 Jan 1 1970
/nix/store/xpm2xja2zv5agmdzgi362jqd5xx9ny10-zlib-1.2.6.tar.gz.drv
The single directory in that list actually contains the zlib installation in the familiar Unix file layout that you find under /usr or /usr/local:
~> ls -R /nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6:
include lib share
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/include:
zconf.h zlib.h
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib:
libz.1.2.6.dylib libz.1.dylib libz.a libz.dylib pkgconfig
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib/pkgconfig:
zlib.pc
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share:
man
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share/man:
man3
/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/share/man/man3:
zlib.3.gz
Note that it contains just zlib, and nothing else, in particular not zlib's dependencies. Each library or application has its own directory in the Nix store.
Next, let's look at all the other files, those with the extension .drv (for "derivation", a Nix term for any artefact derived from human-provided input). There are three files that end in zlib-1.2.6.drv and one that ends in zlib-1.2.6.tar.gz.drv. Let's look at the contents of the last one first. I have made it more readable by adding whitespace:
Derive(
[("out",
"/nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz",
"sha256",
"21235e08552e6feba09ea5e8d750805b3391c62fb81c71a235c0044dc7a8a61b")],
[("/nix/store/lhc0qhfdrw32rj1z7s5p90nbjfnkydhb-stdenv.drv",
["out"]),
("/nix/store/pawry9l3415kwfbfh4zrhgnynwfb10bs-mirrors-list.drv",
["out"])],
["/nix/store/01w11lngp8s4lxllyr6xbmjfyrfkrn43-builder.sh"],
"x86_64-darwin",
"/bin/bash",
["-e",
"/nix/store/01w11lngp8s4lxllyr6xbmjfyrfkrn43-builder.sh"],
[("buildInputs",""),
("buildNativeInputs",""),
("builder","/bin/bash"),
("id",""),
("impureEnvVars","http_proxy https_proxy ftp_proxy all_proxy no_proxy NIX_CURL_FLAGS NIX_HASHED_MIRRORS NIX_MIRRORS_apache NIX_MIRRORS_bitlbee NIX_MIRRORS_cpan NIX_MIRRORS_debian NIX_MIRRORS_fedora NIX_MIRRORS_gcc NIX_MIRRORS_gentoo NIX_MIRRORS_gnome NIX_MIRRORS_gnu NIX_MIRRORS_gnupg NIX_MIRRORS_hashedMirrors NIX_MIRRORS_imagemagick NIX_MIRRORS_kde NIX_MIRRORS_kernel NIX_MIRRORS_metalab NIX_MIRRORS_oldsuse NIX_MIRRORS_opensuse NIX_MIRRORS_postgresql NIX_MIRRORS_savannah NIX_MIRRORS_sf NIX_MIRRORS_sourceforge NIX_MIRRORS_ubuntu NIX_MIRRORS_xorg"),
("mirrorsFile","/nix/store/mmk41rbja1fvclbr7ghirzcigxlzl6f0-mirrors-list"),
("name","zlib-1.2.6.tar.gz"),
("out","/nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz"),
("outputHash","06x6m33ls1606ni7275q5z392csvh18dgs55kshfnvrfal45w8r1"),
("outputHashAlgo","sha256"),
("preferHashedMirrors","1"),
("preferLocalBuild","1"),
("propagatedBuildInputs",""),
("propagatedBuildNativeInputs",""),
("showURLs",""),
("stdenv","/nix/store/9fnvs0bvhrszazham5cnl13h52hvm1rk-stdenv"),
("system","x86_64-darwin"),
("urls","http://www.zlib.net/zlib-1.2.6.tar.gz mirror://sourceforge/libpng/zlib/1.2.6/zlib-1.2.6.tar.gz")])
If that looks like a computational expression in a programming language, that's because it is. Don't worry, it's not something you are expected to write yourself; these expressions are created from the package definitions written in a more user-friendly syntax called "Nix expressions", which is very well documented in the Nix documentation. The expression shown above defines how to make (or "realise" in Nix jargon) the derivation /nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz, which is a rather simple one because the file is simply downloaded and verified against a known checksum. But even such a simple derivation has dependencies: the "standard environment" stdenv and the list of download mirror sites, mirrors-list.
It's time to say something about those funny 32-character prefixes in all the file names in the Nix store. You may have noticed that the zlib file list above contains two entries for zlib-1.2.6.drv that are identical except for this prefix. It looks as if the prefix is there to distinguish things that would otherwise be identical. This is true, and the information encoded in the prefix (which is a hash code) is the complete set of dependencies. The two zlib derivations differ in the version of the standard environment they were built with. I have both of these in my Nix store because I have played around with different releases of Nixpkgs. Nix really tries to keep track of every single dependency, including the exact versions of the various tools (mainly compilers) that were used in building a binary installation. That means you can keep lots of different versions of every single item on your system at the same time, and trace back exactly how they were built. You can also send a copy of the relevant derivation files (those with the .drv extension) to someone else, who can reproduce the exact same environment by "realising" those derivations again.
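To illustrate the idea behind those prefixes (this is not Nix's real hashing scheme, which uses its own serialization and a base-32 alphabet; it merely shows the principle): the prefix is a cryptographic hash of everything that goes into the build, so changing any input changes the store path.
# Illustration only, not Nix's actual algorithm: store path = hash of all build inputs.
import hashlib

def store_path(name, version, inputs):
    description = repr((name, version, sorted(inputs)))
    prefix = hashlib.sha256(description.encode()).hexdigest()[:32]
    return "/nix/store/%s-%s-%s" % (prefix, name, version)

print(store_path("zlib", "1.2.6", ["stdenv-release-A", "mirrors-list"]))
print(store_path("zlib", "1.2.6", ["stdenv-release-B", "mirrors-list"]))
# Same package and version, different standard environment => different path.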
With so many zlibs floating around, which one does Nix use when you ask it to install some application that uses zlib? The one you specify. When some application requires zlib as a dependency, you have to tell Nix exactly which zlib derivation you want to be used. You don't normally do this manually for every single build (though you could); instead you use a coherent set of package definitions (such as Nixpkgs) that specifies all the interdependencies among hundreds of packages. The package definitions take the form of "Nix expressions", which are written in a language specifically designed for this purpose. Files containing Nix expressions have the extension .nix. Since the language is rather well documented in the Nix manual, I won't say any more about it here. A good starting point is to explore Nixpkgs. It helps to know that the central file is pkgs/top-level/all-packages.nix. This file imports the definitions of individual packages from their respective files and makes a consistent package collection from them. When you build a particular derivation from Nixpkgs, only the packages listed explicitly as its dependencies are available in the build environment that is set up specifically for this build operation. No "default library" (such as /usr/lib) is used at all.
There is one more layer to Nix, whose role is twofold: making it convenient for users to work with programs installed through Nix, and making it possible to remove packages that were installed but are no longer needed.
Let's start with the second aspect because it is the simpler one: packages can be removed as soon as nobody needs them any more. This requires a way to figure out which packages are still needed. Obviously the packages that some user on the system wants to access are "needed", and that's why cleanup is related to user profiles, which I will cover in a minute. The remaining needed packages are the dependencies of other needed packages. So once we know the packages that all users put together request to use, we can figure out which packages can safely be deleted. This clean-up operation is called "garbage collection" and handled by the command nix-store --gc.
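Conceptually this is plain reachability analysis. A few lines of Python (purely illustrative, not related to Nix's implementation) make the idea concrete:
# Illustration only: garbage collection as reachability from user profiles.
dependencies = {"app": ["libA", "libB"], "libA": ["libC"],
                "libB": [], "libC": [], "old-lib": ["libC"]}
roots = {"app"}                  # packages referenced by some user profile

needed = set()
stack = list(roots)
while stack:
    package = stack.pop()
    if package not in needed:
        needed.add(package)
        stack.extend(dependencies[package])

garbage = set(dependencies) - needed
print(garbage)                   # {'old-lib'}: nobody needs it, safe to delete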
Nix user environments are managed using the command nix-env, and if you don't care about how Nix works, that command is the only one you may ever need. Each user has his/her own environment, of course, which consists mainly of a directory named $HOME/.nix-profile. That directory contains subdirectories called bin, lib, man etc. whose names should sound familiar. They contain nothing but symbolic links into the Nix store. These links define which packages the user actually accesses, by putting $HOME/.nix-profile/bin on the PATH environment variable. When you use nix-env to install a package, Nix builds it and puts it into the Nix store (unless it's already there), and then creates symbolic links in your Nix profile, which may replace links to some different version of a package. It is important to understand that your user profile never enters into the build process of any Nix derivation. Your profile is exclusively for your own use and has no impact on Nix package management other than protecting the packages you use from being removed during garbage collection.
So much for a first report on my exploration of Nix. I will continue trying to get my computational environment built with Nix, so that I can start to explore how to use it for reproducible computations. Watch this space for news.
PS: After I published this post initially, the friendly people on the Nix mailing list pointed out some additional material for learning about Nix. First of all, there is Eelco Dolstra's thesis entitled "The Purely Functional Software Deployment Model", which is what you should read if you really want to know everything about Nix. There's also Sander van der Burg's blog which has some very detailed posts about Nix and what it can be used for. You could start with this introduction.
Unifying version control and dependency management for reproducible research
Software developers use version control to keep track of the evolution of their software, to coordinate team development, and to manage experimental features. But version control is also of interest for software users: it permits them to refer to a specific version of a piece of software they use in a unique and reproducible way, even if that version is not the current one, nor perhaps even an official numbered release. In fact, official numbered releases are becoming a relic of the past. They make little sense in an Open Source universe where everyone has access to source code repositories under version control. In that situation, an official release is nothing but a bookmark pointing to a specific commit number. There is no need for a release number.
Why would you want to refer to a specific version of a piece of software, rather than always use the latest one? There are many reasons. As software evolves, some bugs get fixed but others sneak in. You may prefer the bugs you know to the ones that could surprise you. Sometimes later versions of some software are not fully compatible with their predecessors, be it by design or by mistake. And even if you want to use the very latest version at any time, you might still want to note which version you used for a specific application. In scientific computing, this is one of the fundamental principles of reproducible research: note carefully, and publish, the exact versions of all pieces of software that were used for obtaining any published research result. It's the only way for you and others to be able to understand exactly what happened when you look at your work many years later.
Another undeniable reality of modern software, in particular in the Open Source universe, is that it's modular. Developers use other people's software, especially if it's well written and has the reputation of being reliable, rather than reinventing the wheel. The typical installation instructions of a piece of Open Source software start with a list of dependencies, i.e. packages you have to install before you can install the current one. And of course the packages in the dependency list have their own dependency list. The number of packages to install can be overwhelming. The difficulties of dependency management are so widespread that the term "dependency hell" has been coined to refer to them.
Systems programmers have come up with a solution to that problem as well: dependency management tools, better known as package managers. Such tools keep a database of what is installed and which package depends on which other ones. The well-known Linux distributions are based on such package managers, of which the ones developed by Debian and RedHat are the most popular ones and are now used by other distributions as well. For MacOS X, MacPorts and Fink are the two most popular package managers, and I suspect that the Windows world has its own ones.
One of the major headaches that many computer users face is that version management and dependency management don't cooperate. While most package managers let you state a minimal version number for a dependency, they don't let you prescribe a precise version number. There is a good reason for this: the way software installation is managed traditionally on Unix systems makes it impossible to install multiple versions of the same package in parallel. If packages A and B both depend on C, but require different versions of it, there is simply no solution. Today's package managers sweep this problem under the rug and pretend that higher version numbers are always at least as good as their predecessors. They will therefore install the higher of the two version numbers required by A and B, forcing one of them to use a version different from its preference.
Anyone who has been using computers intensively for a few years has probably run into such a problem, which manifests itself by some program not working correctly any more after another one, seemingly unrelated, has been installed. Another variant is that an installation fails because some dependency is available in a wrong version. Such problems are part of "dependency hell".
This situation is particularly problematic for the computational scientist who cares about the reproducibility of computed results. At worst, verifying results from 2005 by comparing to results from 2009 can require two completely separate operating system installations running in separate virtual machines. Under such conditions, it is difficult to convince one's colleagues to adopt reproducible research practices.
While I can't propose a ready-to-use solution, I can point out some work that shows that there is hope for the future. One interesting tool is the Nix package manager, which works much like the package managers by Debian or RedHat, but permits installing multiple versions of the same package in parallel, and registers dependencies with precise versions. It could be used as a starting point for managing software for reproducible research, the main advantage being that it should work with all existing software. The next step would be to make each result dataset or figure a separate "package" whose complete dependency list (software and datasets) is managed by Nix with references to precise version numbers. I am currently exploring this approach; watch this space for news about my progress.
For a system even better suited to the needs of reproducible computational science, I refer to my own ActivePapers framework, which combines dependency management and version control for code and data with mechanisms for publishing code+data+documentation packages and for re-using code from other publications in a secure way. I have to admit that it has a major drawback as well: it requires all code to run on the Java Virtual Machine (in order to guarantee portability and secure execution), which unfortunately means that most of today's scientific programs cannot be used. Time will tell if scientific computing will adopt some virtual machine in the future that will make such a system feasible in real life. Reproducible research might actually become a strong argument in favour of such a development.
Julia: a new language for scientific computing
New programming languages are probably invented every day, and even those that get developed and published are too numerous to mention. New programming languages developed specifically for science and engineering are very rare, however, and that's why such a rare event deserves some publicity. A while ago, I saw an announcement for Julia, which announces itself as "a fresh approach to technical computing". I couldn't resist the temptation to download, install, and test-drive it. Here are my first impressions.
The languages used today for scientific computing can be grouped into four categories:
- Traditional compiled languages optimized for number crunching. The big player in this category is of course Fortran, but some recent languages such as X10, Chapel, or Fortress are trying to challenge it.
- Rapid-development domain-specific languages, usually interpreted. Well-known examples are Matlab and R.
- General-purpose statically compiled languages with libraries for scientific computing. C and C++ come to mind immediately.
- General-purpose dynamic languages with libraries for scientific computing. The number one here is Python with its vast library ecosystem.
What sets Julia apart is that it sits somewhere between the first two categories. It's compiled, but fully interactive: there is no separate compilation phase. It is statically typed, allowing for efficient compilation, but also has the default type "Any" that makes it work just like dynamically typed languages in the absence of type declarations. Type inference makes the mix even better. If that sounds like the best of both worlds, it actually is. It has been made possible by modern code transformation techniques that don't really fit into the traditional categories of "compilers" and "interpreters". Like many other recent languages and language implementations, Julia uses LLVM as its infrastructure for these code transformations.
Julia has a well-designed type system with a clear orientation towards maths and number crunching: there is support for complex numbers, and first-class array support. What may seem surprising is that Julia is not object-oriented. This is neither an oversight nor a nostalgic return to the days of Fortran 77, but a clear design decision. Julia has type hierarchies and function polymorphism with dispatch on the types of all arguments. For scientific applications (and arguably for some others), this is more useful than OO style method dispatch on a single value.
Another unusual feature of Julia is a metaprogramming system that is very similar to Lisp macros, although it is slightly more complicated by the fact that Julia has a traditional syntax layer, whereas Lisp represents code by data structures.
So much for a summary of the language. The real question is: does it live up to its promises? Before I try to answer that question, I would like to point out that Julia is a young language that is still in flux and for now has almost no development tool support. For many real-life problems, there is no really good solution at the moment, but it is clear that a good solution can be provided; it just needs to be done. What I am trying to evaluate is not if Julia is ready for real-life use (it is not), but whether there are any fundamental design problems.
The first question I asked myself is how well Julia can handle non-scientific applications. I just happened to see a blog post by John D. Cook explaining why it's preferable to write math in a general-purpose language than to write non-math in a math language. My experience is exactly the same, and that's why I have adopted Python for most of my scientific programming. The point is that any non-trivial program sooner or later requires solving non-math problems (I/O, Web publishing, GUIs, ...). If you use a general-purpose language, you can usually just pick a suitable library and go ahead. With math-only languages such as Matlab, your options are limited, with interfacing to C code sometimes being the only way out.
So is it feasible to write Web servers or GUI libraries in Julia? I would say yes. All the features of general-purpose languages are there or under consideration (I am thinking in particular of namespaces there). With the exception of systems programming (device drivers and the like), pretty much every programming problem can be solved in Julia with no more effort than in most other languages. The real question is if it will happen. Julia is clearly aimed at scientists and engineers. It is probably good enough for doing Web development, but it has nothing to offer for Web developers compared to well-established languages. Will scientists and engineers develop their own Web servers in Julia? Will Web developers adopt Julia? I don't know.
A somewhat related question is that of interfacing to other languages. That's a quick way to make lots of existing code available. Julia has a C interface (which clearly needs better tool support, but I am confident that it will come), which can be used for other sufficiently C-like languages. It is not clear what effort will be required to interface Julia with languages like Python or Ruby. I don't see why it couldn't be done, but I can't say yet whether the result will be pleasant to work with.
The second question I explored is how well Julia is suited to my application domain, which is molecular simulations and the analysis of experimental data. Doing molecular simulation in Julia looks perfectly feasible, although I didn't really implement any non-trivial algorithm yet. What I concentrated on first is data analysis, because that's where I could profit most from Julia's advantages. The kinds of data I mainly deal with are (1) time series and frequency spectra and (2) volumetric data. For time series, Julia works just fine. My biggest stumbling block so far has been volumetric data.
Volumetric data is usually stored in a 3-dimensional array where each axis corresponds to one spatial dimension. Typical operations on such data are interpolation, selection of a plane (2-d subarray) or line (1-d subarray), element-wise multiplication of volume, plane, or line arrays, and sums over selected regions of the data. Using the general-purpose array systems I am familiar with (languages such as APL, libraries such as NumPy for Python), all of this is easy to handle.
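For comparison, here is what those operations look like in NumPy (the library I am used to); everything is plain indexing and elementwise arithmetic:
# The typical volumetric-data operations, written with NumPy.
import numpy as np

volume = np.arange(24.).reshape(2, 3, 4)   # a small 3-d array

plane = volume[0, :, :]                    # a 2-d subarray (a plane), shape (3, 4)
line = volume[0, 1, :]                     # a 1-d subarray (a line), shape (4,)
product = volume * volume                  # elementwise multiplication
region_sum = volume[:, 1:, :2].sum()       # sum over a selected region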
Julia's arrays are different, however. Apparently the developers' priority was to make the transition to Julia easy for people coming from Matlab. Matlab is based on the principle that "everything is a matrix", i.e. a two-dimensional array-like data structure. Matlab vectors come in two flavors, row and column vectors, which are actually matrices with a single row or column, respectively. Matlab scalars are considered 1x1 matrices. Julia is different because it has arrays of arbitrary dimension. However, array literals are made to resemble Matlab literals, and array operations are designed to behave as similarly as possible to Matlab operations, in particular for linear algebra functions. In Julia, as in Matlab, matrix multiplication is considered more fundamental than elementwise multiplication of two arrays.
For someone used to arrays that are nothing more than data structures, the result looks a bit messy. Here are some examples:
julia> a = [1; 2]
[1, 2]
julia> size(a)
(2,)
julia> size(transpose(a))
(1,2)
julia> size(transpose(transpose(a)))
(2,1)
I'd expect that the transpose of the transpose is equal to the original array, but that's not the case. But what does transpose do to a 3d array? Let's see:
julia> a = [x+y+z | x=1:4, y=1:2, z = 1:3]
4x2x3 Int64 Array:
...
julia> transpose(a)
no method transpose(Array{Int64,3},)
in method_missing at base.jl:60
OK, so it seems this was not considered important enough, but of course that can be fixed.
Next comes indexing:
julia> a = [1 2; 3 4]
2x2 Int64 Array:
1 2
3 4
julia> size(a)
(2,2)
julia> size(a[1, :])
(1,2)
julia> size(a[:, 1])
(2,1)
julia> size(a[1, 1])
()
Indexing a 2-d array with a single number (all other indices being the all-inclusive range :) yields a 2-d array. Indexing with two number indices yields a scalar. So how do I extract a 1-d array? This generalizes to higher dimensions: if the number of number indices is equal to the rank of the array, the result is a scalar, otherwise it's an array of the same rank as the original.
Array literals aren't that frequent in practice, but they are used a lot in development, for quickly testing functions. Here are some experiments:
julia> size([1 2])
(1,2)
julia> size([1; 2])
(2,)
julia> size([[1;2] ; [3;4]])
(4,)
julia> size([[1;2] [3;4]])
(2,2)
julia> size([[1 2] [3 4]])
(1,4)
julia> size([[[1 2] [3 4]] [[5 6] [7 8]]])
(1,8)
Can you guess the rules? Once you have them (or looked them up in the Julia manual), can you figure out how to write a 3-d array literal? I suspect it's not possible.
Next, summing up array elements:
julia> sum([1; 2])
3
julia> sum([1 2; 3 4])
10
Apparently sum doesn't care about the shape of my array, it always sums the individual elements. Then how do I do a sum over all the rows?
I have tried to convert some of my basic data manipulation code from Python/NumPy to Julia, but found that I always spent most of the time fighting against the built-in array operations, which are clearly not made for my kind of application. In some cases a change of attitude may be sufficient. It seems natural to me that a plane extracted from volumetric data should be a 2-d array, but maybe if I decide that it should be a 3-d array of "thickness" 1, everything will be easy.
I haven't tried yet, because I know there are cases that cannot be dealt with in that way. Suppose I have a time series of volumetric data that I store in a 4-d array. Obviously I want to be able to apply functions written for static volumetric data (i.e. 3-d arrays) to an element of such a time series. Which means I do need a way to extract a 3-d array out of a 4-d array.
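For comparison again, NumPy handles this case with a single index: indexing the time axis of a 4-d array directly yields a 3-d array.
# NumPy comparison: one frame of a time series of volumetric data.
import numpy as np

series = np.zeros((100, 16, 16, 16))   # 100 time steps of 16x16x16 volumes
frame = series[42]                     # a 3-d array of shape (16, 16, 16)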
I hope that what I need is there and I just didn't find it yet. Any suggestions are welcome. For now, I must conclude that test-driving Julia is a frustrating experience: the language holds so many promises, but fails for my needs due to superficial but practically very important problems.
Binary operators in Python
A two-hour train journey provided the opportunity to watch the video recording of the Panel with Guido van Rossum at the recent PyData Workshop. The lengthy discussion about PEP 225 (which proposes to add additional operators to Python that would make it possible to have both elementwise and aggregate operations on the same objects, in particular for providing both matrix and elementwise multiplication on arrays with a nice syntax) motivated me to write up my own thoughts about what's wrong with operators in Python from my computational scientist's point of view.
The real problem I see is that operators map to methods. In Python, a*b is just syntactic sugar for a.__mul__(b). This means that it's the type of a that decides how to do the multiplication. The method implementing this operation can of course check the type of b, and it can even decide to give up and let b handle everything, in which case Python does b.__rmul__(a). But this is just a kludge to work around the real weakness of the operators-map-to-methods approach. Binary operators fundamentally require a dispatch on both types, the type of a and the type of b. What a*b should map to is __builtins__.__mul__(a, b), a global function that would then implement a binary dispatch operation. Implementing that dispatch would in fact be the real problem to solve, as Python currently has no multiple dispatch mechanisms at all.
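To make the idea concrete, here is a minimal sketch in today's Python, with a plain dictionary as the dispatch table (a real proposal would of course need a far more careful design for extensibility and fallbacks):
# Minimal sketch of binary dispatch on the types of both operands.
_mul_table = {}

def defmul(type_a, type_b, implementation):
    _mul_table[(type_a, type_b)] = implementation

def mul(a, b):
    try:
        implementation = _mul_table[(type(a), type(b))]
    except KeyError:
        raise TypeError("no multiplication for %s * %s"
                        % (type(a).__name__, type(b).__name__))
    return implementation(a, b)

defmul(int, int, lambda a, b: a * b)
defmul(int, list, lambda n, v: [n * x for x in v])   # number * "vector"
defmul(list, int, lambda v, n: [n * x for x in v])   # "vector" * number

print(mul(3, [1, 2]))    # [3, 6]
print(mul([1, 2], 3))    # [3, 6], no __rmul__ kludge needed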
But would multiple dispatch solve the issue addressed by PEP 225? Not at all, directly. But it would make some of the alternatives mentioned there feasible. A proper multiple dispatch system would allow NumPy (or any other library) to decide what multiplication of its own objects by a number means, no matter if the number is the first or the second factor.
More importantly, multiple dispatch would allow a major cleanup of many scientific packages, including NumPy, and even clean up the basic Python language by getting rid of __rmul__ and friends. NumPy's current aggressive handling of binary operations is actually more of a problem for me than the lack of a nice syntax for matrix multiplication.
There are many details that would need to be discussed before binary dispatch could be proposed as a PEP. Of course the old method-based approach would need to remain in place as a fallback, to ensure compatibility with existing code. But the real work is defining a good multiple dispatch system that integrates well with Python's dynamic type system and allows the right kind of extensibility. That same multiple dispatch mechanism could then also be made available for use in plain functions.
Python becomes a platform
The recent announcement of clojure-py made some noise in the Clojure community, but not, as far as I can tell, in the Python community. For those who haven't heard of it before, clojure-py is an implementation of the Clojure language in Python, compiling Clojure code to bytecode for Python's virtual machine. It's still incomplete, but already usable if you can live with the subset of Clojure that has been implemented.
I think that this is an important event for the Python community, because it means that Python is no longer just a language, but is becoming a platform. One of the stated motivations of the clojure-py developers is to tap into the rich set of libraries that the Python ecosystem provides, in particular for scientific applications. Python is thus following the path that Java took in the past: the Java virtual machine, initially designed only to support the Java language, became the target of many different language implementations which all provide interoperation with Java itself.
It will of course be interesting to see if more languages will follow once people realize it can be done. The prospect of speed through PyPy's JIT, another stated motivation for the clojure-py community, could also get more language developers interested in Python as a platform.
Should Python programmers care about clojure-py? I'd say yes. Clojure is strong in two areas in which Python isn't. One of them is metaprogramming, a feature absent from Python which Clojure had from the start through its Lisp heritage. The other feature is persistent immutable data structures, for which clojure-py provides an implementation in Python. Immutable data structures make for more robust code, in particular but not exclusively for concurrent applications.