Posts tagged programming

The future of the Scientific Python ecosystem


SciPy 2015 is over, meaning that many non-participants like myself are now busy catching up with what happened by watching the videos. Today's dose for me was Jake VanderPlas' keynote entitled "State of the Tools". It's about the history, current state, and potential future of what is now generally known as the Scientific Python ecosystem: the large number of libraries and tools written in or for Python that scientists from many disciplines use to get their day-to-day computational work done.




History is done, the present status is a fact, but the future is open to both speculation and planning, so that's what I find most interesting in Jake's keynote. What struck me is that everything he discussed was about paying back technical debt: refactoring the core libraries, fixing compatibility problems, removing technical obstacles to installation and use of various tools. In fact, 20 years after Python showed up in scientific computing, the ecosystem is in a state that is typical for software projects of that age: a bit of a mess. The future work outlined by Jake would help to make it less of a mess, and I hope that something like this will actually happen. The big question mark for me is how this can be funded, given that it is "only" maintenance work, producing nothing fundamentally new. Fortunately there are people much better than me at thinking about funding, for example everyone involved in the NumFOCUS foundation.




Jake's approach to outlining the future is basically "how can we fix known problems and introduce some obvious improvements" (but please do watch the video to get the full story!). What I'd like to present here is an alternate approach: imagine an ideal scientific computing environment in 2015, and try to approximate it by an evolution of the current SciPy ecosystem while retaining a sane level of backwards compatibility. Think of it as the equivalent of Python 3 at the level of the core of the scientific ecosystem.




One aspect that has changed quite a bit over 20 years is the interaction between Python and low-level code. Back then, Python had an excellent C interface, which also worked well for Fortran 77 code, and the ease of wrapping C and Fortran libraries was one of the major reasons for Python's success in scientific computing. We have seen a few generations of wrapper code generators, starting with SWIG, and the idea of a hybrid language called Pyrex that was the ancestor of today's Cython. LLVM has been a major game changer, because it permits low-level code to be generated and compiled on-the-fly, without explicitly generating wrappers and compiling code. While wrapping C/C++/Fortran libraries still remains important, the equally important task of writing low-level code for performance can be handled much better with such tools. Numba is perhaps the best-known LLVM-based code generator in the Python world, providing JIT compilation for a language that is very similar to a subset of Python. But Numba is also an example of the mindset that has led to the current mess: take the existing ecosystem as given, and add a piece to it that solves a specific problem.
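To make the Numba mention concrete, here is the kind of code it handles: plain Python with explicit loops, compiled to machine code through LLVM at the first call. This is a hedged sketch with a made-up function, assuming Numba and NumPy are installed; it is not code discussed in the keynote.

import numpy as np
from numba import jit

@jit(nopython=True)   # compiled through LLVM on first call
def pairwise_distance_sum(points):
    # Naive double loop that would be slow in pure Python,
    # but becomes fast machine code under Numba.
    total = 0.0
    n = points.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(points.shape[1]):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            total += d ** 0.5
    return total

print(pairwise_distance_sum(np.random.rand(100, 3)))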




So how would one approach the high-/low-level interface today, having gained experience with LLVM and PyPy? Some claim that the distinction doesn't make sense any more. The authors of the Julia language, for example, claim that it "avoids the two-language problem". However, as I have pointed out on this blog, Julia is fundamentally a performance-oriented low-level language, in spite of having two features, interactivity and automatic memory management, that are traditionally associated with high-level languages. By the way, I don't believe the idea of a both-high-and-low-level language is worth pursuing for scientific computing. The closest realization of that idea is Common Lisp, which is as high-level as Python, perhaps more so, and also as low-level as Julia, but at the cost of being a very complex language with a very steep learning curve, especially for mastering the low-level aspects. Having two clearly distinct language levels makes it possible to keep both of them manageable, and the separation line serves as a clear warning sign to scientists, who should not attempt to cross it without first acquiring some serious knowledge about software development.




The model to follow, in my opinion, is the one of Lush and Terra. They embed a low-level language into a high-level language in such a way that the low-level code is a data structure at the high level. You can use literals for this data structure and get the equivalent of Numba. But you can also write code generators that specialize low-level code for a given problem. Specialization allows both optimization and simplification, both of which are desirable. The low-level language would have arrays as a primitive data structure, and both NumPy and Pandas, or evolutions such as xray, would become shallow Python APIs to such low-level array functionality. I think this is much more powerful than today's Numba building on NumPy. Moreover, wrapper generators become simple plain Python code, making the construction of interfaces to complex libraries (think of h5py) much easier than it is today. Think of it as ctypes on steroids. For more examples of what one could do with such a system, look at metaprogramming in Julia, which is exactly the same idea.
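For readers who haven't used ctypes, here is the baseline that comparison refers to: calling into a shared C library directly from Python, with the signatures declared by hand. A minimal sketch, assuming a Unix-like system where the C math library can be located; the details of library lookup are platform-dependent.

import ctypes
import ctypes.util

# Locate and load the C math library (name resolution varies by platform).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature of cos() by hand.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))   # prints 1.0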




Another aspect that Jake talks about in some detail is visualization. There again, two decades of code written by people scratching their own itches has led to a mess of different libraries with a lot of overlap and no clear distinctive features. For cleaning it up, I propose the same approach: what are the needs and the available technologies for scientific visualization in 2015? We clearly want to profit from all the Web-based technologies, both for portability (think of mobile platforms) and for integration with Jupyter notebooks. But we also need to be able to integrate visualization into GUI applications. From the API point of view, we need something simple for simple plots (Toyplot looks promising), but also more sophisticated APIs for high-volume data visualization. The main barrier to overcome, in my opinion, is the current dominance of Matplotlib, which isn't particularly good in any of the categories I have outlined. Personally, I don't believe that any evolution of Matplotlib can lead to something pleasant to use, but I'd of course be happy to be proven wrong.




Perhaps the nastiest problem that Jake addresses is packaging. He seems to believe that conda is the solution, but I don't quite agree with that. Unless I missed some recent evolutions, a Python package prepared for installation through conda can only be used easily with a Python distribution built on conda as well. And that means Anaconda, because it's the only one. Since Anaconda is not Open Source, there is no way one can build a Python installation from scratch using conda. Of course, Anaconda is perfectly fine for many users. But if you need something that Anaconda does not provide, you may not be able to add it yourself. On the Mac, for example, I cannot compile C extensions compatible with Anaconda, because Mac Anaconda is built for compatibility with ancient OSX versions that are not supported by a standard XCode installation. Presumably that can be fixed, but I suspect that would be a major headache. And then, how about platforms unsupported by Anaconda?




Unfortunately I will have to leave this at the rant level, because I have no better proposal to make. Packaging has always been a mess, and will likely remain a mess, because the underlying platforms on which Python builds are already a mess. Worse, it's becoming more and more of a problem as scientific Python packages grow in size and features. It's gotten to the point where I am not motivated to figure out how to install the latest version of nMOLDYN on my Mac, although I am a co-author of that program. The previous version is good enough for my own needs, and much simpler to install, though already a bit tricky. That's how you get to love the command line… in 2015.


Exploring Racket


Over the last few months I have been exploring the Racket language for its potential as a language for computational science, and it's time to summarize my first impressions.



Why Racket?



There are essentially two reasons for learning a programming language: (1) getting acquainted with a new tool that promises to get some job done better than with other tools, and (2) learning about other approaches to computing and programming. My interest in Racket was driven by a combination of these two aspects. My background is in computational science (physics, chemistry, and structural biology), so I use computation extensively in my work. Like most computational scientists of my generation, I started working in Fortran, but quickly found this unsatisfactory. Looking for a better way to do computational science, I discovered Python in 1994 and joined the Matrix-SIG that developed what is now known as NumPy. Since then, Python has become my main programming language, and the ecosystem for scientific computing in Python has flourished to a degree unimaginable twenty years ago. For doing computational science, Python is one of the top choices today.




However, we shouldn't forget that we are still living in the stone age of computational science. Fortran was the Paleolithic, Python is the Neolithic, but we have to move on. I am convinced that computing will become as much an integral part of doing science as mathematics, but we are not there yet. One important aspect has not evolved since the beginnings of scientific computing in the 1950s: the work of a computational scientist is dominated by the technicalities of computing, rather than by the scientific concerns. We write, debug, optimize, and extend software, port it to new machines and operating systems, install messy software stacks, convert file formats, etc. These technical aspects, which are mostly unrelated to doing science, take so much of our time and attention that we think less and less about why we do a specific computation, how it fits into more general theoretical frameworks, how we can verify its soundness, and how we can improve the scientific models that underlie our computations. Compare this to how theoreticians in a field like physics or chemistry use mathematics: they have acquired most of their knowledge and expertise in mathematics during their studies, and spend much more time applying mathematics to do science than worrying about the intrinsic problems of mathematics. Computing should one day have the same role. For a more detailed description of what I am aiming at, see my recent article.




This lengthy foreword was necessary to explain what I am looking for in Racket: not so much another language for doing today's computational science (Python is a better choice for that, if only for its well-developed ecosystem), but rather an environment for developing tomorrow's computational science. The Racket Web site opens with the title "A programmable programming language", and that is exactly the aspect of Racket that I am most interested in.




There are two more features of Racket that I found particularly attractive. First, it is one of the few languages that have good support for immutable data structures without being extremist about it. Mutable state is the most important cause of bugs in my experience (see my article on "Managing State" for details), and I fully agree with Clojure's Rich Hickey who says that "immutability is the right default". Racket has all the basic data structures in a mutable and an immutable variant, which provides a nice environment to try "going immutable" in practice. Second, there is a statically typed dialect called Typed Racket which promises a straightforward transition from fast prototyping in plain Racket to type-safe and more efficient production code in Typed Racket. I haven't looked at this yet, so I won't say any more about it.



Racket characteristics



For readers unfamiliar with Racket, I'll give a quick overview of the language. It's part of the Lisp family, more precisely a derivative of Scheme. In fact, Racket was formerly known as "PLT Scheme", but its authors decided that it had diverged sufficiently from Scheme to give it a different name. People familiar with Scheme will still recognize much of the language, but some changes are quite profound, such as the fact that lists are immutable. There are also many extensions not found in standard Scheme implementations.




The hallmark of the Lisp family is that programs are defined in terms of data structures rather than in terms of a text-based syntax. The most visible consequence is a rather peculiar visual aspect, which is dominated by parentheses. The more profound implication, and in fact the motivation for this uncommon choice, is the equivalence of code and data. Program execution in Lisp is nothing but interpretation of a data structure. It is possible, and common practice, to construct data structures programmatically and then evaluate them. The most frequent use of this characteristic is writing macros (which can be seen as code preprocessors) to effectively extend the language with new features. In that sense, all members of the Lisp family are "programmable programming languages".




However, Racket takes this approach to another level. Whereas traditional Lisp macros are small code preprocessors, Racket's macro system feels more like a programming API for the compiler. In fact, much of Racket is implemented in terms of Racket macros. Racket also provides a way to define a complete new language in terms of existing bits and pieces (see the paper "Languages as libraries" for an in-depth discussion of this philosophy). Racket can be seen as a construction kit for languages that are by design interoperable, making it feasible to define highly specific languages for some application domain and yet use them in combination with a general-purpose language.




Another particularity of Racket is its origin: it is developed by a network of academic research groups, who use it as a tool for their own research (much of which is related to programming languages), and as a medium for teaching. However, contrary to most programming languages developed in the academic world, Racket is developed for use in the "real world" as well. There are documentation, learning aids, and development tools, and the members of the core development team are always ready to answer questions on the Racket user mailing list. This mixed academic-application strategy is of interest for both sides: researchers get feedback on the utility of their ideas and developments, and application programmers get quick access to new technology. I am aware of only three other languages developed in a similar context: OCaml, Haskell, and Scala.



Learning and using Racket



A first look at the Racket Guide (an extended tutorial) and the Racket Reference shows that Racket is not a small language: there is a bewildering variety of data types, control structures, abstraction techniques, program structuring methods, and so on. Racket is a very comprehensive language that allows both fine-tuning and large-scale composition. It definitely doesn't fit into the popular "low-level" vs. "high-level" dichotomy. For the experienced programmer, this is good news: whatever technique you know to be good for the task at hand is probably supported by Racket. For students of software development, it's probably easy to get lost. Racket comes with several subsets developed for pedagogical purposes, which are used in courses and textbooks, but I didn't look at those. What I describe here is the "standard" Racket language.




Racket comes with its own development environment called "DrRacket". It looks quite powerful, but I won't say more about it because I haven't used it much. I use too many languages to be interested in any language-specific environment. Instead, I use Emacs for everything, with Geiser for Racket development.




The documentation is complete, precise, and well presented, including a pleasant visual layout. But it is not always an easy read. Be prepared to read through some background material before understanding all the details in the reference documentation of some function you are interested in. It can be frustrating sometimes, but I have never been disappointed: you do find everything you need to know if you just keep on following links.




My personal project for learning Racket is an implementation of the MOSAIC data model for molecular simulations. While my implementation is not yet complete (it supports only two kinds of data items, universes and configurations), it has data structure definitions, I/O to and from XML, data validation code, and contains a test suite for everything. It uses some advanced Racket features such as generators and interfaces, not so much out of necessity but because I wanted to play with them.




Overall I had few surprises during my first Racket project. As I already said, finding what you need in the documentation takes a lot of time initially, mostly because there is so much to look at. But once you find the construct you are looking for, it does what you expect and often more. I remember only one ongoing source of frustration: the multitude of specialized data structures, which force you to make choices you often don't really care about, and to insert conversion functions when function A returns a data structure that isn't exactly the one that function B expects to get. As an illustration, consider the Racket equivalent of Python dictionaries, hash tables. They come in a mutable and an immutable variant, each of which can use one of three different equality tests. It's certainly nice to have that flexibility when you need it, but when you don't, you don't want to have to read about all those details either.




As for Racket's warts, I ran into two of them. First, the worst supported data structure in Racket must be the immutable vector, which is so frustrating to work with (every operation on an immutable vector returns a mutable vector, which has to be manually converted back to an immutable vector) that I ended up switching to lists instead, which are immutable by default. Second, the distinction (and obligatory conversion) between lists, streams, generators and a somewhat unclear sequence abstraction makes you long for the simplicity of a single sequence interface as found in Python or Clojure. In Racket, you can decompose a list into head and tail using first and rest. The same operations on a stream are stream-first and stream-rest. The sequence abstraction, which covers both lists and streams and more, has sequence-tail for the tail, but to the best of my knowledge nothing for getting the first element, other than the somewhat heavy (for/first ([element sequence]) element).




The macro requirements of my first project were modest, not exceeding what any competent Lisp programmer would easily do using defmacro (which, BTW, exists in Racket for compatibility even though its use is discouraged). Nevertheless, in the spirit of my exploration, I tried all three levels of Racket's hygienic macro definitions: syntax-rule, syntax-case, and syntax-parse, in order of increasing power and complexity. The first, syntax-rule, is straightforward but limited. The last one, syntax-parse, is the one you want for implementing industrial-strength compiler extensions. I don't quite see the need for the middle one, syntax-case, so I suppose it's there for historical reasons, being older than syntax-parse. Macros are the one aspect of Racket for which I recommend starting with something other than the Racket documentation: Greg Hendershott's Fear of Macros is a much more accessible introduction.



Scientific computing



As I said in the beginning of this post, my goal in exploring Racket was not to use it for my day-to-day work in computational science, but nevertheless I had a look at the support for scientific computing that Racket offers. In summary, there isn't much, but what there is looks very good.




The basic Racket language has good support for numerical computation, much of which is inherited from Scheme. There are integers of arbitrary size, rational numbers, and floating-point numbers (single and double precision), all with the usual operations. There are also complex numbers whose real/imaginary parts can be exact (integer or rational) or inexact (floats). Unlimited-precision floats are provided by an interface to MPFR in the Racket math library.




The math library (which is part of every standard Racket installation) offers many more goodies: multidimensional arrays, linear algebra, Fourier transforms, special functions, probability distributions, statistics, etc. The plot library, also in the standard Racket installation, adds one of the nicest collections of plotting and visualization routines that I have seen in any language. If you use DrRacket, you can even rotate 3D scenes interactively, a feature that I found quite useful when I used (abused?) plots for molecular visualization.




Outside of the Racket distribution, the only library I could find for scientific applications is Doug Williams' "science collection", which predates the Racket math library. It looks quite good as well, but I didn't find an occasion yet for using it.




Could I do my current day-to-day computations with Racket? A better way to put it is, how much support code would I have to write for functionality that is readily available in more mature scientific languages such as Python? What I miss most is access to my data in HDF5 and netCDF formats. And the domain-specific code for molecular simulation, i.e. the equivalent of my own Molecular Modeling Toolkit. Porting the latter to Racket would be doable (I wrote it myself, so I am familiar with all the algorithms and their pitfalls), and would in fact be an opportunity to improve many details. But interfacing HDF5 or netCDF sounds like a lot of work with no intrinsic interest, at least to me.



The community



Racket has an apparently small but active, competent, and friendly community. I say "apparently" because all I have to base my judgement on is the Racket user mailing list. Given Racket's academic and teaching background, it is quite possible that there are lots of students using Racket who find sufficient support locally that they never manifest themselves on the mailing list. Asking a question on the mailing list almost certainly leads to a competent answer, sometimes from one of the core developers, many of whom are very present. There are clearly many Racket beginners (and also programming newbies) on the list, but compared to other programming language users' lists, there are very few naive questions and comments. It seems like people who get into Racket are serious about programming and are aware that problems they encounter are most probably due to their lack of experience rather than caused by bugs or bad design in Racket.




I also noticed that the Racket community is mostly localized in North America, judging from the peak posting times on the mailing list. This looks strange in today's Internet-dominated world, but perhaps real-life ties still matter more than we think.




Even though the Racket community looks small compared to other languages I have used, it is big and healthy enough to ensure its existence for many years to come. Racket is not the kind of experimental language that is likely to disappear when its inventor moves on to the next project.



Conclusion



Overall I am quite happy with Racket as a development language, though I have to add that I haven't used it for anything mission-critical yet. I plan to continue improving and completing my Racket implementation of Mosaic, and move it to Typed Racket as much as possible. But I am not ready to abandon Python as my workhorse for computational science: there are simply too many good libraries in the scientific Python ecosystem that are important for working efficiently.


The ultimate calculator for Android and iOS

Calculators are among the most popular applications for smartphones, and therefore it is not surprising that the Google Play Store has more than 1000 calculators for the Android platform. Having used HP's scientific calculators for more than 20 years, I picked RealCalc when I got my Android phone and set it to RPN mode. It works fine, I have no complaints about it. But I no longer use it because I found something much more powerful.

It's called "J", which isn't exactly a very descriptive name. And that's probably a good idea because describing it is not so easy. J is much more than a calculator, but it does the calculator job very well. It's actually a full programming language, but one that differs substantially from everything else that goes by that label. The best description for J I can come up with is "executable mathematical notation". You type an expression, and you get the result. That's in fact not very different from working interactively with Python or Matlab, except that the expressions are very different. You can write traditional programs in J, using loops, conditionals, etc., but you can get a lot of work done without ever using these features.

The basic data structure in J is the array, which can have any number of dimensions. Array elements can be numbers, characters, or other arrays. Numbers (zero-dimensional arrays) and text strings (one-dimensional arrays of characters) are just special cases. In J jargon, which takes its inspiration from linguistics, data items are called "nouns". Standard mathematical operators (such as + or -) are called "verbs" and can have one or two arguments (one left, one right). An expression is called a "sentence". There are no precedence rules, the right argument of any verb being everything to its right. Given the large number of verbs in J, this initially unfamiliar rule makes a lot of sense. A simple example (also showing the use of arrays) is
   2 * 3 + 10 20 30
26 46 66

Up to here, J expressions are not very different from Python or Matlab expressions. What J doesn't have is functions with the familiar f(x, y, z) syntax, accepting any number of arguments. There are only verbs, with one or two arguments. But what makes J really different from the well-known languages for scientific computing are the "parts of speech" that have no simple equivalent elsewhere: adverbs and conjunctions.

An adverb takes a verb argument and produces a derived verb from it. For example, the adverb ~ takes a two-argument verb (a dyad in J jargon) and turns it into a one-argument verb (a monad) that's equivalent to using the dyad with two equal arguments. With + standing for plain addition, +~ thus doubles its argument:
   +~ 1 5 10 20
2 10 20 40

meaning it is the same as
   1 5 10 20 + 1 5 10 20
2 10 20 40

A conjunction combines a verb with a noun or another verb to produce a derived verb. An example is ^:, the power conjunction, which applies a verb several times:
   +~(^:2) 1 5 10 20
4 20 40 80
   +~(^:3) 1 5 10 20
8 40 80 160

The parentheses are required to separate the argument of the power conjunction (2 or 3) from the array that is the argument to the resulting derived verb. To see the real power of the power conjunction, consider that it accepts negative arguments as well:
   +~(^:_1) 1 5 10 20
0.5 2.5 5 10

You read that right: J can figure out that the inverse of adding a number to itself is dividing that number by two!

Pretty much any programming language permits you to assign values to names for re-use in later expressions. J is no exception:
   data =. 1 5 10 20
   double =. +~
   double data
2 10 20 40
   inv =. ^:_1
   halve =. double inv
   halve data
0.5 2.5 5 10

As you can see, names can be given not just to nouns (i.e. data), but also to verbs, adverbs, and conjunctions. Most J programs are just pieces of expressions that are assigned to names. Which means that the short summary of J that I have given here could well be all you ever need to know about the language - apart from the fact that you will have to acquire a working knowledge of many more verbs, adverbs, and conjunctions.

Before you rush off to the Play Store looking for J, let me add that J is not yet there, although it's supposed to arrive soon. For now, you have to download the APK and install it yourself, using your preferred Android file manager. I should also point out that J is not just for Android. It's been around for more than 20 years, and you can get J for all the common computing platforms from Jsoftware. There's also an iOS version for the iPhone and iPad. J's extreme terseness is a perfect fit for smartphones, where screen space is a scarce resource and where every character you don't have to type saves you a lot of time.

Unifying version control and dependency management for reproducible research

When the Greek philosopher Heraclitus pronounced his famous "πάντα ῥεῖ" (everything flows), he most probably was not thinking about software. But it applies to software as much as to other aspects of life: software is in perpetual change, being modified to remove bugs, add features, and adapt it to changing environments. The management of change is now a well-established part of software engineering, with the most emblematic tool being version control. If you are developing software without using version control, stop reading this immediately and learn about Mercurial or Git, the two best version control systems available today. That's way more important than reading the rest of this post.

Software developers use version control to keep track of the evolution of their software, to coordinate team development, and to manage experimental features. But version control is also of interest for software users: it permits them to refer to a specific version of a piece of software they use in a unique and reproducible way, even if that version is not the current one, nor perhaps even an official numbered release. In fact, official numbered releases are becoming a relic of the past. They make little sense in an Open Source universe where everyone has access to source code repositories under version control. In that situation, an official release is nothing but a bookmark pointing to a specific commit number. There is no need for a release number.

Why would you want to refer to a specific version of a piece of software, rather than always use the latest one? There are many reasons. As software evolves, some bugs get fixed but others sneak in. You may prefer the bugs you know to the ones that could surprise you. Sometimes later versions of some software are not fully compatible with their predecessors, be it by design or by mistake. And even if you want to use the very latest version at any time, you might still want to note which version you used for a specific application. In scientific computing, this is one of the fundamental principles of reproducible research: note carefully, and publish, the exact versions of all pieces of software that were used for obtaining any published research result. It's the only way for you and others to be able to understand exactly what happened when you look at your work many years later.
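As a trivial illustration of that practice, here is a small sketch (not taken from any particular tool) that records the versions of the Python packages used in a computation; the module list is just an example.

import importlib

def record_versions(module_names):
    # Print the name and version of each module used in a computation,
    # so that the information can be stored alongside the results.
    for name in module_names:
        try:
            module = importlib.import_module(name)
            print(name, getattr(module, "__version__", "unknown"))
        except ImportError:
            print(name, "not installed")

record_versions(["numpy", "scipy", "matplotlib"])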

Another undeniable reality of modern software, in particular in the Open Source universe, is that it's modular. Developers use other people's software, especially if it's well written and has the reputation of being reliable, rather than reinventing the wheel. The typical installation instructions of a piece of Open Source software start with a list of dependencies, i.e. packages you have to install before you can install the current one. And of course the packages in the dependency list have their own dependency list. The number of packages to install can be overwhelming. The difficulties of dependency management are so widespread that the term "dependency hell" has been coined to refer to them.

Systems programmers have come up with a solution to that problem as well: dependency management tools, better known as package managers. Such tools keep a database of what is installed and which package depends on which other ones. The well-known Linux distributions are based on such package managers, of which those developed by Debian and RedHat are the most popular and are now used by other distributions as well. For MacOS X, MacPorts and Fink are the two most popular package managers, and I suspect that the Windows world has its own as well.

One of the major headaches that many computer users face is that version management and dependency management don't cooperate. While most package managers allow stating a minimal version number for a dependency, they don't allow prescribing a precise version number. There is a good reason for this: the way software installation is managed traditionally on Unix systems makes it impossible to install multiple versions of the same package in parallel. If packages A and B both depend on C, but require different versions of it, there is no clean solution. Today's package managers sweep this problem under the rug and pretend that higher version numbers are always at least as good as their predecessors. They will therefore install the higher of the two version numbers required by A and B, forcing one of them to use a version different from its preference.

Anyone who has been using computers intensively for a few years has probably run into such a problem, which manifests itself by some program not working correctly any more after another one, seemingly unrelated, has been installed. Another variant is that an installation fails because some dependency is available in a wrong version. Such problems are part of "dependency hell".

This situation is particularly problematic for the computational scientist who cares about the reproducibility of computed results. At worst, verifying results from 2005 by comparing to results from 2009 can require two completely separate operating system installations running in separate virtual machines. Under such conditions, it is difficult to convince one's colleagues to adopt reproducible research practices.

While I can't propose a ready-to-use solution, I can point out some work that shows that there is hope for the future. One interesting tool is the Nix package manager, which works much like the package managers by Debian or RedHat, but permits installing multiple versions of the same package in parallel, and registers dependencies with precise versions. It could be used as a starting point for managing software for reproducible research, the main advantage being that it should work with all existing software. The next step would be to make each result dataset or figure a separate "package" whose complete dependency list (software and datasets) is managed by Nix with references to precise version numbers. I am currently exploring this approach; watch this space for news about my progress.

For a system even better suited to the needs of reproducible computational science, I refer to my own ActivePapers framework, which combines dependency management and version control for code and data with mechanisms for publishing code+data+documentation packages and re-using code from other publications in a secure way. I have to admit that it has a major drawback as well: it requires all code to run on the Java Virtual Machine (in order to guarantee portability and secure execution), which unfortunately means that most of today's scientific programs cannot be used. Time will tell if scientific computing will adopt some virtual machine in the future that will make such a system feasible in real life. Reproducible research might actually become a strong argument in favour of such a development.

Julia: a new language for scientific computing

New programming languages are probably invented every day, and even those that get developed and published are too numerous to mention. New programming languages developed specifically for science and engineering are very rare, however, and that's why such a rare event deserves some publicity. A while ago, I saw an announcement for Julia, which announces itself as "a fresh approach to technical computing". I couldn't resist the temptation to download, install, and test-drive it. Here are my first impressions.

The languages used today for scientific computing can be grouped into four categories:



What sets Julia apart is that it sits somewhere between the first two categories. It's compiled, but fully interactive, there is no separate compilation phase. It is statically typed, allowing for efficient compilation, but also has the default type "Any" that makes it work just like dynamically typed languages in the absence of type declarations. Type inference makes the mix even better. If that sounds like the best of both worlds, it actually is. It has been made possible by modern code transformation techniques that don't really fit into the traditional categories of "compilers" and "interpreters". Like many other recent languages and language implementations, Julia uses LLVM as its infrastructure for these code transformations.

Julia has a well-designed type system with a clear orientation towards maths and number crunching: there is support for complex numbers, and first-class array support. What may seem surprising is that Julia is not object-oriented. This is neither an oversight nor a nostalgic return to the days of Fortran 77, but a clear design decision. Julia has type hierarchies and function polymorphism with dispatch on the types of all arguments. For scientific applications (and arguably for some others), this is more useful than OO style method dispatch on a single value.

Another unusual feature of Julia is a metaprogramming system that is very similar to Lisp macros, although it is slightly more complicated by the fact that Julia has a traditional syntax layer, whereas Lisp represents code by data structures.

So much for a summary of the language. The real question is: does it live up to its promises? Before I try to answer that question, I would like to point out that Julia is a young language that is still in flux and for now has almost no development tool support. For many real-life problems, there is no really good solution at the moment, but it is clear that a good solution can be provided; it just needs to be done. What I am trying to evaluate is not if Julia is ready for real-life use (it is not), but whether there are any fundamental design problems.

The first question I asked myself is how well Julia can handle non-scientific applications. I just happened to see a blog post by John D. Cook explaining why it's preferable to write math in a general-purpose language than to write non-math in a math language. My experience is exactly the same, and that's why I have adopted Python for most of my scientific programming. The point is that any non-trivial program sooner or later requires solving non-math problems (I/O, Web publishing, GUIs, ...). If you use a general-purpose language, you can usually just pick a suitable library and go ahead. With math-only languages such as Matlab, your options are limited, with interfacing to C code sometimes being the only way out.

So is it feasible to write Web servers or GUI libraries in Julia? I would say yes. All the features of general-purpose languages are there or under consideration (I am thinking in particular of namespaces there). With the exception of systems programming (device drivers and the like), pretty much every programming problem can be solved in Julia with no more effort than in most other languages. The real question is if it will happen. Julia is clearly aimed at scientists and engineers. It is probably good enough for doing Web development, but it has nothing to offer for Web developers compared to well-established languages. Will scientists and engineers develop their own Web servers in Julia? Will Web developers adopt Julia? I don't know.

A somewhat related question is that of interfacing to other languages. That's a quick way to make lots of existing code available. Julia has a C interface (which clearly needs better tool support, but I am confident that it will come), which can be used for other sufficiently C-like languages. It is not clear what effort will be required to interface Julia with languages like Python or Ruby. I don't see why it couldn't be done, but I can't say yet whether the result will be pleasant to work with.

The second question I explored is how well Julia is suited to my application domain, which is molecular simulations and the analysis of experimental data. Doing molecular simulation in Julia looks perfectly feasible, although I didn't really implement any non-trivial algorithm yet. What I concentrated on first is data analysis, because that's where I could profit most from Julia's advantages. The kinds of data I mainly deal with are (1) time series and frequency spectra and (2) volumetric data. For time series, Julia works just fine. My biggest stumbling block so far has been volumetric data.

Volumetric data is usually stored in a 3-dimensional array where each axis corresponds to one spatial dimension. Typical operations on such data are interpolation, selection of a plane (2-d subarray) or line (1-d subarray), element-wise multiplication of volume, plane, or line arrays, and sums over selected regions of the data. Using the general-purpose array systems I am familiar with (languages such as APL, libraries such as NumPy for Python), all of this is easy to handle.
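For concreteness, here is roughly what those operations look like in NumPy. This is a hedged sketch with made-up example data, not code from any of my analysis scripts.

import numpy as np

volume = np.random.rand(10, 10, 10)     # volumetric data on a 10x10x10 grid

plane = volume[:, :, 3]                 # a plane: 2-d array of shape (10, 10)
line = volume[2, 5, :]                  # a line: 1-d array of shape (10,)

# Elementwise multiplication with broadcasting, and a sum over a sub-region.
weighted = volume * plane[:, :, np.newaxis]
region_sum = volume[2:5, 2:5, 2:5].sum()

print(plane.shape, line.shape, weighted.shape, region_sum)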

Julia's arrays are different, however. Apparently the developers' priority was to make the transition to Julia easy for people coming from Matlab. Matlab is based on the principle that "everything is a matrix", i.e. a two-dimensional array-like data structure. Matlab vectors come in two flavors, row and column vectors, which are actually matrices with a single row or column, respectively. Matlab scalars are considered 1x1 matrices. Julia is different because it has arrays of arbitrary dimension. However, array literals are made to resemble Matlab literals, and array operations are designed to behave as similarly as possible to Matlab operations, in particular for linear algebra functions. In Julia, as in Matlab, matrix multiplication is considered more fundamental than elementwise multiplication of two arrays.

For someone used to arrays that are nothing more than data structures, the result looks a bit messy. Here are some examples:


julia> a = [1; 2]
[1, 2]

julia> size(a)
(2,)

julia> size(transpose(a))
(1,2)

julia> size(transpose(transpose(a)))
(2,1)

I'd expect that the transpose of the transpose is equal to the original array, but that's not the case. But what does transpose do to a 3d array? Let's see:

julia> a = [x+y+z | x=1:4, y=1:2, z = 1:3]
4x2x3 Int64 Array:
...

julia> transpose(a)
no method transpose(Array{Int64,3},)
in method_missing at base.jl:60

OK, so it seems this was not considered important enough, but of course that can be fixed.

Next comes indexing:


julia> a = [1 2; 3 4]
2x2 Int64 Array:
1 2
3 4

julia> size(a)
(2,2)

julia> size(a[1, :])
(1,2)

julia> size(a[:, 1])
(2,1)

julia> size(a[1, 1])
()

Indexing a 2-d array with a single number (all other indices being the all-inclusive range :) yields a 2-d array. Indexing with two number indices yields a scalar. So how do I extract a 1-d array? This generalizes to higher dimensions: if the number of number indices is equal to the rank of the array, the result is a scalar, otherwise it's an array of the same rank as the original.

Array literals aren't that frequent in practice, but they are used a lot in development, for quickly testing functions. Here are some experiments:


julia> size([1 2])
(1,2)

julia> size([1; 2])
(2,)

julia> size([[1;2] ; [3;4]])
(4,)

julia> size([[1;2] [3;4]])
(2,2)

julia> size([[1 2] [3 4]])
(1,4)

julia> size([[[1 2] [3 4]] [[5 6] [7 8]]])
(1,8)

Can you guess the rules? Once you have them (or looked them up in the Julia manual), can you figure out how to write a 3-d array literal? I suspect it's not possible.

Next, summing up array elements:


julia> sum([1; 2])
3

julia> sum([1 2; 3 4])
10

Apparently sum doesn't care about the shape of my array; it always sums the individual elements. Then how do I do a sum over all the rows?
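For comparison, NumPy answers that question with an explicit axis argument; a quick illustration, not taken from the post:

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a.sum())         # 10: all elements
print(a.sum(axis=0))   # [4 6]: sum over the row index, i.e. column totals
print(a.sum(axis=1))   # [3 7]: sum over the column index, i.e. row totals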

I have tried to convert some of my basic data manipulation code from Python/NumPy to Julia, but found that I always spent most of the time fighting against the built-in array operations, which are clearly not made for my kind of application. In some cases a change of attitude may be sufficient. It seems natural to me that a plane extracted from volumetric data should be a 2-d array, but maybe if I decide that it should be a 3-d array of "thickness" 1, everything will be easy.

I haven't tried yet, because I know there are cases that cannot be dealt with in that way. Suppose I have a time series of volumetric data that I store in a 4-d array. Obviously I want to be able to apply functions written for static volumetric data (i.e. 3-d arrays) to an element of such a time series. Which means I do need a way to extract a 3-d array out of a 4-d array.

I hope that what I need is there and I just didn't find it yet. Any suggestions are welcome. For now, I must conclude that test-driving Julia is a frustrating experience: the language holds so many promises, but fails for my needs due to superficial but practically very important problems.

Binary operators in Python

A two-hour train journey provided the opportunity to watch the video recording of the Panel with Guido van Rossum at the recent PyData Workshop. The lengthy discussion about PEP 225 (which proposes additional operators for Python that would make it possible to have both elementwise and aggregate operations on the same objects, in particular both matrix and elementwise multiplication on arrays with a nice syntax) motivated me to write up my own thoughts about what's wrong with operators in Python from my computational scientist's point of view.

The real problem I see is that operators map to methods. In Python, a*b is just syntactic sugar for a.__mul__(b). This means that it's the type of a that decides how to do the multiplication. The method implementing this operation can of course check the type of b, and it can even decide to give up and let b handle everything, in which case Python does b.__rmul__(a). But this is just a kludge to work around the real weakness of the operators-map-to-methods approach. Binary operators fundamentally require a dispatch on both types, the type of a and the type of b. What a*b should map to is __builtins__.__mul__(a, b), a global function that would then implement a binary dispatch operation. Implementing that dispatch would in fact be the real problem to solve, as Python currently has no multiple dispatch mechanisms at all.
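To make the idea concrete, here is a toy sketch of dispatching on both operand types. The registry and function names are made up for illustration; this is not a proposal for the actual dispatch mechanism.

_mul_rules = {}

def register_mul(left_type, right_type, func):
    _mul_rules[(left_type, right_type)] = func

def mul(a, b):
    # Look up a rule based on the types of *both* arguments,
    # walking the method resolution order of each side.
    for left in type(a).__mro__:
        for right in type(b).__mro__:
            rule = _mul_rules.get((left, right))
            if rule is not None:
                return rule(a, b)
    raise TypeError("no rule for %s * %s"
                    % (type(a).__name__, type(b).__name__))

register_mul(int, int, lambda a, b: a * b)
register_mul(str, int, lambda s, n: s * n)   # string repetition

print(mul(3, 4))      # 12
print(mul("ab", 3))   # ababab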

But would multiple dispatch solve the issue addressed by PEP 225? Not directly. But it would make some of the alternatives mentioned there feasible. A proper multiple dispatch system would allow NumPy (or any other library) to decide what multiplication of its own objects by a number means, no matter if the number is the first or the second factor.

More importantly, multiple dispatch would allow a major cleanup of many scientific packages, including NumPy, and even clean up the basic Python language by getting rid of __rmul__ and friends. NumPy's current aggressive handling of binary operations is actually more of a problem for me than the lack of a nice syntax for matrix multiplication.

There are many details that would need to be discussed before binary dispatch could be proposed as a PEP. Of course the old method-based approach would need to remain in place as a fallback, to ensure compatibility with existing code. But the real work is defining a good multiple dispatch system that integrates well with Python's dynamic type system and allows the right kind of extensibility. That same multiple dispatch mechanism could then also be made available for use in plain functions.

Python becomes a platform

The recent announcement of clojure-py made some noise in the Clojure community, but not, as far as I can tell, in the Python community. For those who haven't heard of it before, clojure-py is an implementation of the Clojure language in Python, compiling Clojure code to bytecode for Python's virtual machine. It's still incomplete, but already usable if you can live with the subset of Clojure that has been implemented.

I think that this is an important event for the Python community, because it means that Python is no longer just a language, but is becoming a platform. One of the stated motivations of the clojure-py developers is to tap into the rich set of libraries that the Python ecosystem provides, in particular for scientific applications. Python is thus following the path that Java already took in the past: the Java virtual machine, initially designed only to support the Java language, became the target of many different language implementations which all provide interoperation with Java itself.

It will of course be interesting to see if more languages will follow once people realize it can be done. The prospect of speed through PyPy's JIT, another stated motivation for the clojure-py community, could also get more language developers interested in Python as a platform.

Should Python programmers care about clojure-py? I'd say yes. Clojure is strong in two areas in which Python isn't. One of them is metaprogramming, a feature absent from Python which Clojure had from the start through its Lisp heritage. The other feature is persistent immutable data structures, for which clojure-py provides an implementation in Python. Immutable data structures make for more robust code, in particular but not exclusively for concurrent applications.

 

Teaching parallel computing in Python

Every time I teach a class on parallel computing with Python using the multiprocessing module, I wonder if multiprocessing is really mature enough that I should recommend using it. I end up deciding for it, mostly because of the lack of better alternatives. But I am not happy at all with some features of multiprocessing, which are particularly nasty for non-experts in Python. That category typically includes everyone in my classes.

To illustrate the problem, I'll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:


from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))

Do you see the two bugs in this example? Look again. No, it's nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn't.

Imagine your typical student typing this script and running it. Here's what happens:



Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new

Python experts will immediately see what's wrong: numpy.sqrt is not picklable. This is mostly a historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, so it simply wasn't supported. Implementing it today within the framework of Python's existing pickle protocol is unfortunately not trivial, and that's why it hasn't been done.

Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn't hurt of course if they learn a bit about pickling, since it also has a performance impact on parallel programs. But due to restrictions such as this one, you have to explain this right at the start, although it would be better to leave this for the "advanced topics" part.
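If you do take that detour, one quick demonstration helps: pickle stores a plain Python function by reference (module name plus function name), not by value. A small hedged example, independent of multiprocessing:

import pickle
import pickletools

def square_root(x):
    return x ** 0.5

# The disassembly shows only a global lookup of '__main__ square_root';
# no code is transmitted, so the receiving process must be able to
# import that name itself.
pickletools.dis(pickle.dumps(square_root))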

OK, you have passed the message, and your students fix the script:



from multiprocessing import Pool
import numpy

pool = Pool()

def square_root(x):
    return numpy.sqrt(x)

print pool.map(square_root, range(100))

And then run it:



Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
AttributeError: 'module' object has no attribute 'square_root'
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
AttributeError: 'module' object has no attribute 'square_root'

At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its processor pools. And since the answer (on Unix systems) is "fork", you have to have a pretty good idea of Unix process creation to see the cause of the error. That knowledge then leads to a trivial fix:



from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

pool = Pool()

print pool.map(square_root, range(100))

Success! It works! But... how do you explain this to your students?

To make it worse, this script works but is still not correct: it has a portability bug because it doesn't work under Windows. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions in multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.

These issues with multiprocessing are a classical case of a leaky abstraction: multiprocessing provides a "pool of worker processes" abstraction to the programmer, but in order to use it, the programmer has to understand the implementation. In my opinion, it would be preferable to have a less shiny API, but one which reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until this really happens, I'd prefer an API that does not suggest possibilities that don't exist.

I have actually thought about this myself a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions that implement tasks executed in parallel to be methods of a single class, or functions of a single module. My API also contains an explicit "run parallel job now" call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.
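To give an idea of what such an API looks like, here is a toy sketch of the design idea, not the actual framework; the "parallel" execution is simulated sequentially to keep the example self-contained.

class ParallelJob:

    # Every task executed in parallel is a method of this single class.
    def square_root(self, x):
        return x ** 0.5

    def run(self, data):
        # The explicit "run parallel job now" step.  A real implementation
        # would ship the class to the worker processes and distribute the
        # calls; here we simply loop over the data.
        return [self.square_root(x) for x in data]

print(ParallelJob().run(range(10)))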

EuroSciPy 2011

Another EuroSciPy conference is over, and like last year it was very interesting. Here is my personal list of highlights and comments.

The two keynote talks were particularly inspiring. On Saturday, Marian Petre reported on her studies of how people in general and scientists in particular develop software. The first part of her presentation was about how "experts" design and implement software, the definition of an expert being someone who produces software that actually works, is finished on time, and doesn't exceed the planned budget. The second part was about the particularities of software development in science. But perhaps the most memorable quote of the keynote was Marian's reply to a question from the audience about how to deal with unreasonable decisions coming from technically less competent managers. She recommended learning how to manage management - a phrase that I heard repeated several times in discussions throughout the conference.

The Sunday keynote was given by Fernando Perez. As was to be expected, IPython was his number one topic and there was a lot of new stuff to show off. I won't mention all the new features in the recently released version 0.11 because they are already discussed in detail elsewhere. What I find even more exciting is the new Web notebook interface, currently available only from the development site at github. A notebook is a trace of an interactive session that can be edited, saved, stored in a repository, or shared with others. It contains the inputs and outputs of all commands. Inputs are cells that can consist of more than one line. Outputs are by default what Python prints to the terminal, but IPython provides a mechanism for displaying specific types of objects in a special way. This makes it possible to show images (in particular plots) inline, but also to turn SymPy expressions into mathematical formulas typeset in LaTeX.

A more alarming aspect of Fernando's keynote was his statistical analysis of contributions to the major scientific libraries of the Python universe. In summary, the central packages are maintained by a grand total of about 25 people in their spare time. This observation caused a lot of debate, centered around how to encourage more people to contribute to this fundamental work.

Among the other presentations, as usual mostly of high quality, the ones that impressed me most were Andrew Straw's presentation of ROS, the Robot Operating System, Chris Myers' presentation about SloppyCell, and Yann Le Du's talk about large-scale machine learning running on a home-made GPU cluster. Not to forget the numerous posters with lots more interesting stuff.

For the first time, EuroSciPy was complemented by domain-specific satellite meetings. I attended PyPhy, the Python in Physics meeting. Physicists are traditionally rather slow in accepting new technology, but the meeting showed that a lot of high-quality research is based on Python tools today, and that Python has also found its way into physics education at various universities.

Finally, conferences are good also because of what you learn during discussions with other participants. During EuroSciPy, I discovered a new scientific journal called Open Research Computation, which is all about software for scientific research. Scientific software developers regularly complain about the lack of visibility and recognition that their work receives from the scientific community, and in particular from evaluation and grant attribution committees. A dedicated journal might be just what we need to improve the situation. I hope this will be a success.

Executable Papers

The last two days I participated in the "Executable Papers workshop" at this year's ICCS conference. It was not just another workshop among the many ICCS workshops. The participants had all submitted a proposal to the "Executable Paper Grand Challenge" run by Elsevier, one of the biggest scientific publishers. On the first day, the nine finalists presented their work, and on the second day, the remaining accepted proposals were presented.

The term "executable papers" stands for the expected next revolution in scientific publishing. The move from printed journals to electronic on-line journals (or a combination of both) has changed little for authors and readers. It is the libraries that have seen the largest impact because they now do little more than paying subscription fees. Readers obtain papers as PDF files directly from the publishers' Web sites. The one change that does matter to scientists is that most journals now propose the distribute "supplementary material" in addition to the main paper. This can in principle be any kind of file, in practice it is mostly used for additional explanations, images, and tables, i.e. to keep the main paper shorter. Occasionally there are also videos, a first step towards exploring the new possibilities opened up by electronic distribution. The step to executable papers is a much bigger one: the goal is to integrate computer-readable data and executable program code together with the text part of a paper. The goals are a richer reader experience (e.g. interactive visualizations), verifiability of results by both referees and readers (by re-running part of the computations described in the paper), and re-use of data and code in later work by the same or other authors. There is some overlap in these goals with the "reproducible research" movement, whose goal is to make computational research reproducible by providing tools and methods that permit to store a trace of everything that entered into some computational procedure (input data, program code, description of the computing environment) such that someone else (or even the original author a month later) can re-run everything and obtain the same results. The new aspect in executable papers is the packaging and distribution of everything, as well as the handling of bibliographic references.

The proposals' variety mostly reflected the different backgrounds of the presenters. A mathematician documenting proofs obviously has different needs than an astrophysicist simulating a supernova on a supercomputer. Unfortunately this important aspect was never explicitly discussed. Most presenters did not even mention their field of work, much less what it implies in terms of data handling. This was probably due to the enormous time pressure; 15 to 20 minutes for a presentation plus demonstration of a complex tool was clearly not enough.

The proposals could roughly be grouped into three categories:

Some proposals covered two of these categories, but with a clear emphasis on one of them. For the details of each proposal, see the ICCS proceedings, which are freely available.

While it was interesting to see all the different ideas presented, my main impression of the Executable Paper Workshop is that of a missed opportunity. Having all those people who had thought long and hard about the various issues in one room for two days would have been a unique occasion to make progress towards better tools for the future. In fact, none of the solutions presented cover the needs of all the domains of computational science. They make assumptions about the nature of the data and the code that are not universally valid. One or two hours of discussion might have helped a lot to improve everyone's tools.

The implementation of my own proposal, which addresses the questions of how to store code and data in a flexible, efficient, and future-proof way, is available here. It contains a multi-platform binary (MacOS, Linux, Windows, all on the x86 platform) and requires version 6 of the Java Runtime Environment. The source code is also included, but there is no build system at the moment (I use a collection of scripts that have my home-directory hard-coded in lots of places). There is, however, a tutorial. Feedback is welcome!

The future of Python

I have received a number of questions and remarks about my keynote talk at EuroSciPy 2010, ranging from questions about technical details to an inquiry about the release date of Python 4.0! Rather than writing lengthy replies to everyone, I try to address all these issues here.

First of all, my intentions behind the keynote were


  1. Encourage scientists to look at new tools and developments that I believe to be important in the near future (Python 3, Cython) and at others that might become important to scientific applications (JIT compilers, alternative implementations).


  2. Make computational scientists think about future commodity hardware (which is what we use most of the time) and its implications for programming, in particular the generalization of massively parallel computing.


  3. Show that easy-to-use parallel programming paradigms, in particular deterministic ones, exist today. Computational scientists need to realize that MPI and OpenMP are not the last word on parallel programming.


  4. Make my ideas concrete by showing how they could be implemented in Python.




My "Python 4.0" is completely fictitious and will probably never exist in exactly that form. However, it is important to realize that it could be implemented right now. With the GIL-free Python implementations (Jython, IronPython), it would even be rather straightforward to implement. For CPython, any implementation not removing the GIL would probably be too inefficient to be of practical interest.

Most of the ingredients for implementing my "Python 4.0" are well-known and have already been used in other languages or libraries:


Futures may seem to provide most of what declarative concurrency promises, but this is not quite true. Futures are objects representing computations. They have a method that client code must call to wait for the result and retrieve it. Since waiting is an explicit operation on a standard object, it is easy to create a situation in which two futures wait for each other: a deadlock. This can only be avoided by not having futures accessible as standard objects. The language implementation must recognize futures as special and insert a wait call before any access to the value of the result. For this reason, declarative concurrency cannot be implemented as a library.
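A small sketch makes the deadlock scenario concrete. It uses the concurrent.futures module from today's Python standard library purely for illustration, and the function and variable names are mine. The script never terminates: each future waits explicitly for the other one's result.

import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
futures = {}

def wait_for_other(name):
    time.sleep(0.1)                    # crude way to ensure both futures exist
    return futures[name].result() + 1  # explicit wait on the other future

futures['a'] = executor.submit(wait_for_other, 'b')
futures['b'] = executor.submit(wait_for_other, 'a')

print(futures['a'].result())  # never returns: the two futures wait for each other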

Another important condition for implementing declarative concurrency with futures is that code inside a future must be effect-free. Otherwise multiple concurrently running futures can modify the same object and create a race condition.

Probably the only truly original contribution in my "Python 4.0" scenario is the dynamically verified effect-free subset of the Python language. Most languages, even functional ones, provide no way for a compiler or a run-time system to verify that a given function is effect-free. Haskell is perhaps the only exception in having a static type system that can identify effect-free code. In Python, that is not a viable approach because everything is dynamic. But why not provide at least a run-time check for effect-free code where useful? It's still better to have a program crash with an exception saying "you did I/O in what should have been an effect-free function" than get wrong results silently.

Here is an outline of how such an approach could be implemented. Each function and method would have a flag saying "I am supposed to be effect-free." In my examples, this flag is set by the decorator @noeffects, but other ways are possible. Built-in functions would of course have that flag set correctly as well. As soon as the interpreter enters a function marked as effect-free, it goes into "functional mode" until it returns from that function again. In functional mode, it raises an exception whenever an unflagged function or method is called.
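Here is a minimal sketch of how such a check can be emulated in today's CPython, using the profiling hook as a stand-in for real interpreter support. Only the decorator name @noeffects comes from my talk; everything else (the use of sys.setprofile, ignoring C-level calls, not handling comprehensions) is a simplification for illustration, not a proposal for an actual implementation.

import sys

_effect_free_code = set()   # code objects of functions flagged as effect-free

def noeffects(f):
    # Flag f as effect-free and, while it runs, check every Python-level
    # call against the set of flagged functions ("functional mode").
    _effect_free_code.add(f.__code__)
    def wrapper(*args, **kwargs):
        previous = sys.getprofile()
        sys.setprofile(_check)          # enter functional mode
        try:
            return f(*args, **kwargs)
        finally:
            sys.setprofile(previous)    # leave functional mode
    _effect_free_code.add(wrapper.__code__)
    return wrapper

def _check(frame, event, arg):
    # Called for every function call while in functional mode.
    # C-level calls and comprehensions are not handled in this sketch.
    if event == "call" and frame.f_code not in _effect_free_code:
        raise RuntimeError("call to unflagged function %r in @noeffects code"
                           % frame.f_code.co_name)

@noeffects
def square(x):
    return x * x

@noeffects
def sum_of_squares(xs):
    total = 0
    for x in xs:
        total += square(x)   # allowed: square is flagged as effect-free
    return total

print(sum_of_squares([1, 2, 3]))   # prints 14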

Some details to consider:


Finally, a comment on a minor issue. I have been asked if the "async" keyword is strictly necessary. The answer is no, but it makes the code much more readable. The main role of async is to write a function call without having it executed immediately. The same problem occurs in callbacks in GUI programming: you have to specify a function call to be executed at a later time. The usual solution is a parameter-free lambda expression, and that same trick could be used to make async a function rather than a keyword. But readability suffers a lot.
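For illustration, here is roughly what the function-based variant looks like (a sketch using today's concurrent.futures module; the name async_call and the thread pool behind it are just placeholders). The parameter-free lambda is what delays the call, and it is exactly this extra lambda that hurts readability.

from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def async_call(thunk):
    # Run a parameter-free callable in the background and return a future.
    return _executor.submit(thunk)

def f(x, y):
    return x + y

# With a keyword, one could write something like:  result = async f(2, 3)
# As a plain function, the call has to be wrapped in a lambda:
result = async_call(lambda: f(2, 3))
print(result.result())   # 5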

EuroSciPy 2010

This weekend I attended the EuroSciPy 2010 conference in Paris, dedicated to scientific applications of the programming language Python. This was the third EuroSciPy conference, but the US-based SciPy conference has been a regular event for many years already, and recently SciPy India joined the crowd. It looks like Python is becoming ever more popular in scientific computing. Next year, EuroSciPy will take place in Paris again.

There were lots of interesting presentations and announcements, and the breaks provided a much appreciated opportunity for exchanges between the participants. I won't try to provide an exhaustive summary, but rather list my personal highlights. Obviously this choice reflects my personal interests more than the quality of the presentations, and I will even list things that were not presented but that I learned about from other participants during the breaks.

Teaching

The opening keynote was given by Hans-Petter Langtangen, who is best known for his books about Python for scientific computing. His latest book is a textbook for a course on scientific programming for beginning science students, and the first part of his keynote was about this same course that he is teaching at the University of Oslo. As others have noted as well, he observed that the students have no problem at all with picking up Python and using it productively in science. The difficulties with using Python are elsewhere: it is hard to convince the university professors that Python is a good choice of programming language for such a course!

Another important aspect of his presentation was the observation that teaching scientific programming to beginning science students provides more than just training in some useful technique. Converting equations into programs and running them also provides a much better insight into the structure and applicability of the equations. Computational science thus helps to better educate future scientists.

Reproducible research

The reproducible research movement has the goal of improving the standards in computational science. At the moment, it is almost always impossible to reproduce published computational results from the information provided by the authors. Making these results reproducible requires a careful recording of what was calculated using which version of which software running on which machine, and of course making this information available along with the publication.

At EuroSciPy, Andrew Davison presented Sumatra, a Python library for tracking this information (and more) for computational procedures written in Python. The library is in an early stage, with more functionality to come, but those interested in reproducible research should check it out now and contribute to its development.

Jarrod Millman addressed the same topic in his presentation of the plans for creating a Foundation for Mathematical and Scientific Computing, whose goal is to fund development of tools and techniques that improve computational science.

NumPy and Python 3

As a couple of active contributors to the NumPy project were attending the conference, I asked about the state of the porting effort to Python 3. The good news is that the port is done and will soon be released. Those who have been waiting for NumPy to be ported before starting to port their own libraries can go to work right now: check out the NumPy Subversion repository, install, and use!

Useful maths libraries

Three new maths libraries that were presented caught my attention: Sebastian Walter's talk about algorithmic differentiation contained demos of algopy, a rather complete library for algorithmic differentiation in Python. During the Lightning talks on the last day, two apparently similar libraries for working with uncertain numbers (numbers with error bars) were shown: uncertainties, by Eric Lebigot, and upy, by Friedrich Romstedt. Both do error propagation and take correlations into account. Those of us working with experimental data or simulation results will appreciate this.

There was a lot more interesting stuff, of course, and I hope others will write more about it. I'll just point out that the slides for my own keynote about the future of Python in science are available from my Web site. And of course express my thanks to the organizing committee who invested a lot of effort to make this conference a big success!

Eclipse experiences

A few months ago I decided to take a closer look at Eclipse, since several people I know seemed to be quite fond of it. I had tried it earlier on my old iBook G4, but quickly abandoned it because it was much too slow. But my new MacBook Pro should be able to handle it.

Last week I finally decided to retire my Eclipse installation. I haven't removed it yet, since it might be useful for some specific tasks that I rarely have to deal with (such as analyzing someone else's big C++ code). But I don't use it any more for my own work. Here's a summary of my impressions of Eclipse, the good and the bad.

In terms of features, Eclipse is as impressive as it looks. Anything you might wish for in an IDE is there, either in the base distribution or in the form of a plugin - there are hundreds if not thousands of those. And contrary to what one might expect, all those features are relatively easy to get used to. The user interface is very systematic and the most frequent functions are easy to spot. In terms of user interface design, I would call Eclipse a success.

However, in terms of usability it turned out to be a disappointment. Basically there are two major issues: Eclipse is a resource hog, and it isn't as stable as I expect an IDE to be.

The two resources that Eclipse can't get enough of are CPU time and disk space. Even on a brand-new machine (and not a low-end one at that), starting Eclipse takes a good ten seconds, and I get to see the Macintosh's spinning colour wheel quite often. What's worst is that the spinning wheel prevents me from typing, at unpredictable moments. This is not acceptable for an IDE. I don't care if it takes a break in background compilation now and then, but I want to be able to type when I want. Execution times for various commands can also vary unpredictably. Rebuilding all my projects typically took about a minute, but once I waited for 15 minutes for no apparent reason.

In terms of disk space, Eclipse is less of a resource hog, but it creates and updates impressive amounts of data, again for no clear reason. I noticed this because I make incremental backups regularly. Just starting and quitting Eclipse, with no action in between, resulted in a few MB of files to backup again. It's not that I can't live with that, but is this really necessary?

Finally, stability. I had only a single crash in which I lost data (the most recently entered code), which is not so bad for a big application (unfortunately...). But I had Eclipse hanging very often, and displaying verbose yet unintelligible error messages almost daily. All this is not reassuring, and together with the spinning-wheel issue this is what made me abandon Eclipse in the end.

Now I am a 100% Emacs user again, with no regrets. Emacs may look old-fashioned and have fewer high-powered features, but it is reliable and fast.

Scientific computing needs deterministic programming paradigms

Programmers, scientific and otherwise, spend a lot of time discussing which programming languages, libraries, and development tools to use. In such discussions, the notion of a programming paradigm is rarely mentioned, and yet it is a very fundamental one. It becomes particularly important for parallel and concurrent programming, where the most popular languages and libraries do not necessarily provide the best programming paradigm. In this post, I will explain what a programming paradigm is and why its choice matters more than the choice of a language.

A programming paradigm defines a general approach to writing programs. It defines the concepts and abstractions in terms of which a program is expressed. For example, a programming paradigm defines how data is described, how control flow is handled, how program elements are composed, etc. Well-known programming paradigms are structured programming, object-oriented programming, and functional programming.

The implementation of a programming paradigm consists of a programming language, its runtime system, libraries, and sometimes coding conventions. Some programming languages are optimized for a specific paradigm, whereas others are explicitly designed to support multiple paradigms. Paradigms that the language designer did not have in mind can sometimes be implemented by additional conventions, libraries, or preprocessors.

The list of programming paradigms that have been proposed and/or used is already quite long (see the Wikipedia entry, for example), but the ones that are practically important and significantly distinct are much less numerous. A good overview and comparison is given in the book chapter "Programming paradigms for dummies" by Peter van Roy. I will concentrate on one aspect discussed in van Roy's text (look at section 6 in particular), which I consider of particular relevance for scientific computing: determinism.

A deterministic programming paradigm is one in which every possible program has a fully deterministic behaviour: given the same input, it executes its steps in the same order and produces the same output. This is in fact what most of us would intuitively expect from a computer program. However, there are useful programs that could not be written with this restriction. A Web server, for example, has to react to external requests which are outside of its control, and take into account resource usage (e.g. database access) and possible network errors in deciding when and in which order to process requests. This shows that there is a need for non-deterministic programming paradigms. For the vast majority of scientific applications, however, determinism is a requirement, and a programming paradigm that enforces determinism is a big help in avoiding bugs. Most scientific applications that run serially have been written using a deterministic programming paradigm, as implemented by most of the popular programming languages.

Parallel computing has changed the situation significantly. When several independent processors work together on the execution of an algorithm, fully deterministic behavior is no longer desirable, as it would imply frequent synchronizations of all processors. The precise order in which independent operations are executed is typically left unspecified by a program. What matters is that the output of the program is determined only by the input. As long as this is guaranteed, it is acceptable and even desirable to let compilers and run-time systems optimize the scheduling of individual subtasks. In Peter van Roy's classification, this distinction is called "observable" vs. "non-observable" non-determinism. A programming paradigm for scientific computing should permit non-determinism, but should exclude observable non-determinism. While observable non-determinism makes the implementation of certain programs (such as Web servers) possible, it also opens the way to bugs that are particularly nasty to track down: deadlocks, race conditions, results that change with the number of processors or when moving from one parallel machine to another one.

Unfortunately, two of the most popular programming paradigms for parallel scientific applications do allow observable non-determinism: message passing, as implemented by the MPI library, and multi-threading. Those who have used either one have probably suffered the consequences. The problem is thus well known, but the solutions aren't. Fortunately, they do exist: there are several programming paradigms that encapsulate non-determinism in such a way that it cannot influence the results of a program. One of them is widely known and used: OpenMP, which is a layer above multi-threading that guarantees deterministic results. However, OpenMP is limited to shared-memory multiprocessor machines.
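As a concrete illustration of observable non-determinism with multi-threading, here is a small Python sketch: two threads update a shared counter without any synchronization, so the final value depends on how their read-modify-write sequences happen to interleave. Whether the lost updates actually show up in a given run depends on the Python implementation and on timing, but nothing in the program rules them out.

import threading

counter = 0

def work():
    global counter
    for _ in range(100000):
        counter += 1   # not atomic: load, add, store

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # may be less than 200000, and may differ from run to run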

For distributed-memory parallel machines, a category that is at least as important, there are also programming paradigms that don't have the non-deterministic features of message passing; they are typically implemented as a layer above MPI. One example is the BSP model, which I have presented in an article in the magazine Computing in Science and Engineering. Another example is the parallel skeletons model, presented by Joël Falcou in the same magazine. Unfortunately, these paradigms are little known and not well supported by programming tools. As a consequence, most scientific applications for distributed-memory machines are written using the message passing paradigm.

Finally, a pair of programming paradigms discussed by van Roy deserves special mention, because it might well become important in scientific computing in the near future: functional programming and declarative concurrency. I have written about functional programming earlier; its main advantage is the possibility to apply mathematical reasoning and automatic transformations to program code, leading to better code (in the sense of correctness) and to better optimization techniques. Declarative concurrency is functional programming plus annotations for parallelization. The nice feature is that these annotations (not very different in principle from OpenMP pragmas) don't change the result of the program, they only influence its performance on a parallel machine. Starting from a correct functional program, it is thus possible to obtain an equivalent parallel one by automatic or manual (but automatically verified) transformations that is guaranteed to behave identically except for performance. Correctness and performance can thus be separated, which should be a big help in writing correct and efficient parallel programs. I say "should" because this approach hasn't been used very much, and isn't supported yet by any mainstream programming tools. This may change in a couple of years, so you might want to watch out for developments in this area.

Functional programming for scientific computing

With the increasing importance of parallel computers, ranging from multi-core desktop machines to massively parallel machines such as IBM's BlueGene, functional programming could well become an important technique for scientific software development, as it facilitates program transformations (including those for automatic or semi-automatic parallelization) considerably. It also appeals to the mathematical bent of many scientists by making it possible to apply mathematical reasoning to computer programs. The downside: there is a steep learning curve for those familiar with traditional (so-called "imperative") programming.

I have written an introduction to functional programming for scientists for the July issue of Computing in Science and Engineering. It is also available (free access) via IEEE's Computing Now portal: http://www2.computer.org/portal/web/computingnow/0609/whatsnew/cise

While I don't expect functional programming to be adopted rapidly by computational scientists, I am convinced that ten years from now, it will be an essential item in everyone's toolbox. Better start preparing yourself now!

Static typing and code clutter

Among the many characteristics that distinguish programming languages, static vs. dynamic typing is one of the most debated ones. The main advantages claimed by the advocates of static typing are that compile-time type checks make code more robust and that static typing allows a compiler to do better optimizations. The dynamic typing camp points out the simplicity and flexibility of a language that requires no type declarations and that permits a piece of code to handle data objects defined well after it was written. Both sides are right and the choice is ultimately one of personal preference.

I have used various programming languages over the years, including both statically typed and dynamically typed ones. But when given a choice, I have always preferred dynamic typing. Since 1995, my main programming language has been Python, and more recently I have started to use Clojure. One of the reasons for this preference is something that I have never seen expressed before: static typing often adds visual clutter to the code that makes it harder to read.

An important property of any non-trivial computer program is its clarity to human readers. Both verification of a program's correctness and the overall utility of a piece of code in a context of changing requirements depend on this. Well-written specifications and unit tests help as well, but if you want my advice on the quality of a piece of code, or if you want my help with modifying it, my judgement will mostly be based on its clarity. If it's an effort to understand what's going on, I wouldn't want to work with it.

This criterion for code quality immediately translates into a criterion for programming languages: they should be able to express as many concepts of software engineering as possible in a direct, explicit way and without imposing any clutter or obfuscation. Static type systems often get in the way, either by imposing clutter or by encouraging a less clear programming style.

In my examples, I will use the languages Haskell (static typing) and Clojure (dynamic typing) for illustration. Haskell has one of the best type systems available at the moment, so if Haskell can't avoid the problems that I point out, it is likely that no other current language will do a better job. Clojure is a good comparison because like Haskell it is designed for a functional programming style. Of course, it also helps that I am reasonably familiar with both languages.

Example 1: abstract data types

The idea behind abstract data types is that the concrete representation of some data structure should be hidden from client code, which accesses the data structure only through a set of interface functions. Let's look at how this is typically implemented in Haskell, using the PFP library for probabilistic programming as the example (just because I happen to know it, many other libraries could serve the same purpose). In PFP, a probability distribution is represented by an abstract data type Dist a defined as
newtype Dist a = D {unD :: [(a,ProbRep)]}

This says that internally, Dist is a list of (a, ProbRep) pairs. The single constructor D converts such a list to the abstract data type Dist, whereas unD does the inverse: it makes the contents of a Dist value accessible for inspection.

The problem with this is that all of the implementation code for PFP is littered with D and unD, although they don't do anything and add nothing to the clarity of the code. They are there only to make sure that the signatures of the functions contain the abstract type Dist a instead of the internal representation [(a,ProbRep)]. For the reader of the PFP code trying to understand how it works, this is clutter. There are also a couple of functions that exist only for dealing with the artificial distinction between Dist a and [(a,ProbRep)], for example
sizeD :: Dist a -> Int
sizeD = length . unD

which replaces the list function length (familiar to every Haskell programmer) by a special version whose purpose the reader has to remember.

A Clojure library that is essentially equivalent to PFP (look at the source code) is much shorter, and in my opinion much clearer. It represents a probability distribution by a map (known in other languages as a dictionary or an associative array) and directly uses Clojure's map operations to work on it. No visual overhead, no clutter. Of course, as static typing advocates would be quick to point out, no protection of the internal representation either: client code could directly manipulate the maps used to represent distributions, potentially creating maps that are not valid probability distributions. I have never run into such a problem in 15 years of using dynamically typed languages, but in principle it is possible.

It would be possible to avoid the code obfuscation due to abstract data types by recognizing that abstract data types are an interface issue and not a type issue. A language could provide an explicit declaration of an interface for a module where the function signatures would be given with the abstract data type, even though the concrete representation is used in the implementations. The compiler could verify the coherence of everything. But I haven't seen anything like this in any statically typed language.

Note also that something very similar could be implemented in Clojure: A couple of macros would provide wrappers around the exported functions that add type verification at runtime. However, this says more about the advantage of having a powerful macro system than about the advantages of dynamic typing.

Example 2: monads

A monad is a package consisting of a data structure (or, more precisely, certain properties that a data structure must have) and two functions bind and result. A subclass of monads also has a special value called zero, and a subclass of this subclass has one more function called plus. All these definitions must obey certain rules to make a valid monad.

In Haskell, there is a typeclass Monad that defines bind (called >>=) and result (called return), and another typeclass MonadPlus that defines mzero and mplus. A monad is defined by providing instances for concrete data types. When monadic operations are used, the type inference system identifies the data type and selects the corresponding operations. From the client's point of view, a monad thus is defined by a type.

There are a few drawbacks of this setup:

In Clojure, monads are values, not types. In client code, a monad is selected explicitly by the programmer by surrounding the monadic code by a with-monad form that specifies the monad to be used. Usually the monad is named explicitly, but since monads are values, they can also be represented by a variable. A Clojure monad can be used with any data type that the definitions of the monadic operations accept, and any number of monads can be defined for a data type.

In Clojure it is the data structure that almost disappears from monad handling; the constraints on the monadic values are given only as documentation for the human reader. As with other aspects of dynamic typing, this provides more flexibility and less protection.

For standard monads, I'd say that the clarity of code is roughly equivalent in Haskell and Clojure. Haskell gains a bit in making the data structure explicit, Clojure gains a bit in making the monad explicit at the point of use. Both work pretty well.

This changes when monad transformers come into play. In the Haskell world, this is perhaps the most frightening concept to newcomers. Monad transformers are surrounded by an aura of mystery, they are only for the real experts. I think that this is at least in part due to the complexity of defining monad transformers in a type system.

Here's how Haskell defines the list monad transformer; only the relevant parts are shown:
newtype ListT m a = ListT { runListT :: m [a] }

instance (Monad m) => Monad (ListT m) where
    return a = ListT $ return [a]
    m >>= k  = ListT $ do
        a <- runListT m
        b <- mapM (runListT . k) a
        return (concat b)

As in the case of abstract data types, there is a data type definition for ListT that does nothing but introduce a new notation for the type m [a]. ListT and runListT are just notation converters that don't actually do anything useful. But unlike in the case of abstract data types, they are indispensable here. Monads are types, and therefore there has to be a new type to make a new monad. It doesn't help that the name runListT is a particularly bad choice: it suggests an action where there is none.

The definitions of return and >>= aren't masterpieces of clarity either. It takes a careful analysis of each function and its (inferred, and thus unwritten) type to understand which monad is being used where.

For comparison, here is the corresponding monad transformer in Clojure, again reduced to the basics:
(defn sequence-t [m]
  (monad
    [m-result (with-monad m
                (fn [v]
                  (m-result (list v))))
     m-bind   (with-monad m
                (fn [mv f]
                  (domonad
                    [a mv
                     b (m-map f a)]
                    (flatten b))))]))

Since monads are values, monad transformers are simply functions that take a monad argument and return another monad. It is also clear at a glance that inside the definition of each monad operation, all references to monad operations are to be interpreted in the inner monad. This doesn't make monad transformers trivial to understand, of course, but it is a lot clearer.

Monads in Clojure

One of my hobby projects over the last months has been the exploration of monads. Monads are packages consisting of a data structure and associated control structures that are used as abstractions in functional programming. They were popularized by the Haskell language, where they play a central role in introducing side effects (such as I/O) in a controlled way into a language that is otherwise purely functional.

Since I was also exploring Clojure, an interesting new dialect of Lisp that strongly encourages a purely functional programming style (but doesn't enforce it), I decided to explore monads by writing a monad library for Clojure. My experience is that monads are quite useful in Clojure as well, and that once you get used to monads, you see occasions for using them almost everywhere. If you have been hesitating to tackle monads seriously, I can only encourage you to go on!

I have also written a monad tutorial for Clojure programmers, which I published on the OnClojure blog. It consists of four parts:

  1. Part 1 introduces the concept of monads and illustrates it with the identity and maybe monads.

  2. Part 2 explains the importance of m-result using the sequence monad as an example. It also covers the monad laws.

  3. Part 3 is about m-zero and m-plus, and explains the state monad.

  4. Part 4 covers the probability monad and monad transformers.


I hope that this tutorial facilitates a first contact with monads for those who are more familiar with Lisp syntax than with Haskell syntax.
