Recent posts
Scientific computing needs deterministic programming paradigms
A programming paradigm defines a general approach to writing programs. It defines the concepts and abstractions in terms of which a program is expressed. For example, a programming paradigm defines how data is described, how control flow is handled, how program elements are composed, etc. Well-known programming paradigms are structured programming, object-oriented programming, and functional programming.
The implementation of a programming paradigm consists of a programming language, its runtime system, libraries, and sometimes coding conventions. Some programming languages are optimized for a specific paradigm, whereas others are explicitly designed to support multiple paradigms. Paradigms that the language designer did not have in mind can sometimes be implemented by additional conventions, libraries, or preprocessors.
The list of programming paradigms that have been proposed and/or used is already quite long (see the Wikipedia entry, for example), but the ones that are practically important and significantly distinct are much less numerous. A good overview and comparison is given in the book chapter "Programming paradigms for dummies" by Peter van Roy. I will concentrate on one aspect discussed in van Roy's text (look at section 6 in particular), which I consider of particular relevance for scientific computing: determinism.
A deterministic programming paradigm is one in which every possible program has a fully deterministic behaviour: given the same input, it executes its steps in the same order and produces the same output. This is in fact what most of us would intuitively expect from a computer program. However, there are useful programs that could not be written with this restriction. A Web server, for example, has to react to external requests which are outside of its control, and take into account resource usage (e.g. database access) and possible network errors in deciding when and in which order to process requests. This shows that there is a need for non-deterministic programming paradigms. For the vast majority of scientific applications, however, determinism is a requirement, and a programming paradigm that enforces determinism is a big help in avoiding bugs. Most scientific applications that run serially have been written using a deterministic programming paradigm, as implemented by most of the popular programming languages.
Parallel computing has changed the situation significantly. When several independent processors work together on the execution of an algorithm, fully deterministic behavior is no longer desirable, as it would imply frequent synchronizations of all processors. The precise order in which independent operations are executed is typically left unspecified by a program. What matters is that the output of the program is determined only by the input. As long as this is guaranteed, it is acceptable and even desirable to let compilers and run-time systems optimize the scheduling of individual subtasks. In Peter van Roy's classification, this distinction is called "observable" vs. "non-observable" non-determinism. A programming paradigm for scientific computing should permit non-determinism, but should exclude observable non-determinism. While observable non-determinism makes the implementation of certain programs (such as Web servers) possible, it also opens the way to bugs that are particularly nasty to track down: deadlocks, race conditions, results that change with the number of processors or when moving from one parallel machine to another one.
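To make this concrete, here is a small illustration in Clojure (my own example, not taken from van Roy's text). Summing floating-point numbers in parallel in an unspecified order exposes observable non-determinism, because floating-point addition is not associative and the result depends on the order of the operations; a sequential reduce, in contrast, is fully reproducible:

;; Observable non-determinism: the parallel threads update acc in
;; whatever order the scheduler runs them, and since floating-point
;; addition is not associative, the result can vary from run to run.
(defn racy-sum [xs]
  (let [acc (atom 0.0)]
    (doall (pmap #(swap! acc + %) xs))
    @acc))

;; Deterministic: the additions happen in a fixed order.
(defn safe-sum [xs]
  (reduce + xs))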
Unfortunately, two of the most popular programming paradigms for parallel scientific applications do allow observable non-determinism: message passing, as implemented by the MPI library, and multi-threading. Those who have used either one have probably suffered the consequences. The problem is thus well known, but the solutions aren't. Fortunately, they do exist: there are several programming paradigms that encapsulate non-determinism in such a way that it cannot influence the results of a program. One of them is widely known and used: OpenMP, a layer above multi-threading that yields deterministic results as long as its directives are applied according to the rules. However, OpenMP is limited to shared-memory multiprocessor machines.
For distributed-memory parallel machines, a category at least as important, there are also programming paradigms that avoid the non-deterministic features of message passing; they are typically implemented as a layer above MPI. One example is the BSP model, which I have presented in an article in the magazine Computing in Science and Engineering. Another example is the parallel skeletons model, presented by Joël Falcou in the same magazine. Unfortunately, these paradigms are little known and not well supported by programming tools. As a consequence, most scientific applications for distributed-memory machines are written using the message passing paradigm.
Finally, a pair of programming paradigms discussed by van Roy deserves special mention, because it might well become important in scientific computing in the near future: functional programming and declarative concurrency. I have written about functional programming earlier; its main advantage is the possibility to apply mathematical reasoning and automatic transformations to program code, leading to better code (in the sense of correctness) and to better optimization techniques. Declarative concurrency is functional programming plus annotations for parallelization. The nice feature is that these annotations (not very different in principle from OpenMP pragmas) don't change the result of the program, they only influence its performance on a parallel machine. Starting from a correct functional program, it is thus possible to obtain an equivalent parallel one by automatic or manual (but automatically verified) transformations that is guaranteed to behave identically except for performance. Correctness and performance can thus be separated, which should be a big help in writing correct and efficient parallel programs. I say "should" because this approach hasn't been used very much, and isn't supported yet by any mainstream programming tools. This may change in a couple of years, so you might want to watch out for developments in this area.
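Clojure's pmap gives a first taste of this idea, even though declarative concurrency in van Roy's sense goes further: replacing map by pmap parallelizes a purely functional computation without changing its result, only its performance.

;; For a pure function, map and pmap produce identical results;
;; only the execution schedule, and thus the performance, differs.
(defn simulate [parameter]
  ;; stand-in for an expensive, purely functional computation
  (reduce + (map #(Math/sin (* parameter %)) (range 100000))))

(def results-serial   (doall (map  simulate (range 16))))
(def results-parallel (doall (pmap simulate (range 16))))
;; (= results-serial results-parallel)  =>  true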
Sheldrake's New Science of Life
The question that Sheldrake addresses in this book is where form comes from. What defines the arrangements of atoms in a molecule? Or in a crystal? Why do proteins fold into their characteristic structures? How do biological molecules assemble into cells? And how do cells divide and specialize to form an embryo?
The standard reply to these questions, the one you can find in science textbooks, is that all these forms come from the fundamental interactions of physics. Molecules are simply energetically favorable arrangements of atoms. Proteins fold in such a way that free energy is minimized. Cells assemble as a result of complex attractive interactions between their constituents, which ultimately can be reduced to fundamental physics. Embryos develop according to a "genetic program" stored in the fertilized egg's DNA.
As Sheldrake rightly emphasizes, even though it may come as a surprise to most non-experts, these assertions cannot be verified. They express a common belief among practicing scientists, and they are compatible with everything we know about nature, but they may well be wrong. We simply cannot verify them because the fundamental equations of physics can be solved only for very simple systems. Even for one of the simplest molecules, water, we cannot predict the arrangement of its atoms directly from the basic principles of physics. What we use in practice are approximations, and these approximations have been selected because they correctly predict the known molecular structures. We cannot use such approximations to settle more fundamental questions.
Sheldrake proposes an alternative theory, based on what he calls "morphogenetic fields". From my point of view as a physicist, the name is not very well chosen because these entities do not correspond to what a physicist would call a field, but of course this term may be perfectly clear to biologists. It's a minor point because Sheldrake explains this concept very clearly in his book. In summary, his theory says that forms exist because they have existed before; atoms, molecules, and cells arrange themselves into patterns that they "remember" from the past. His morphogenetic fields are a giant database of forms that the universe keeps around forever.
The main "problem" with this theory is that if it is right, even just approximately, then standard science, from physics to biology, is very much wrong. It is probably for this reason that his book has attracted so much criticism from the science establishment. Otherwise, there is little one could criticize: Sheldrake explains his theory and its consequences for chemistry and biology, and he proposes a large number of experiments that would permit testing it. This is science at its best. Of course his theory may turn out to require modifications, or even be completely wrong, but that is true of any scientific theory when it is first formulated.
In fact, I recommend this book to anyone interested in the scientific process because of its detailed discussion of how scientific discovery works. I haven't seen many books accessible to non-specialists that explain the limits of verifiability of a scientific theory, for example. Nor have I seen any other book that makes the distinction between verified theories and widely accepted but untested beliefs so clear as Sheldrake does. Even if you don't care about his theory, you can gain a lot from reading this book.
Functional programming for scientific computing
I have written an introduction to functional programming for scientists for the July issue of Computing in Science and Engineering. It is also available (free access) via IEEE's Computing Now portal: http://www2.computer.org/portal/web/computingnow/0609/whatsnew/cise
While I don't expect functional programming to be adopted rapidly by computational scientists, I am convinced that ten years from now, it will be an essential item in everyone's toolbox. Better start preparing yourself now!
Static typing and code clutter
I have used various programming languages over the years, including both statically typed and dynamically typed ones. But when given a choice, I have always preferred dynamic typing. Since 1995, my main programming language has been Python, and more recently I have started to use Clojure. One of the reasons for this preference is something that I have never seen expressed before: static typing often adds visual clutter to the code that makes it harder to read.
An important property of any non-trivial computer program is its clarity to human readers. Both verification of a program's correctness and the overall utility of a piece of code in a context of changing requirements depend on this. Well-written specifications and unit tests help as well, but if you want my advice on the quality of a piece of code, or if you want my help with modifying it, my judgement will mostly be based on its clarity. If it's an effort to understand what's going on, I wouldn't want to work with it.
This criterion for code quality immediately translates into a criterion for programming languages: they should be able to express as many concepts of software engineering as possible in a direct, explicit way and without imposing any clutter or obfuscation. Static type systems often get in the way, either by imposing clutter or by encouraging a less clear programming style.
In my examples, I will use the languages Haskell (static typing) and Clojure (dynamic typing) for illustration. Haskell has one of the best type systems available at the moment, so if Haskell can't avoid the problems that I point out, it is likely that no other current language will do a better job. Clojure is a good comparison because like Haskell it is designed for a functional programming style. Of course, it also helps that I am reasonably familiar with both languages.
Example 1: abstract data types
The idea behind abstract data types is that the concrete representation of some data structure should be hidden from client code, which accesses the data structure only through a set of interface functions. Let's look at how this is typically implemented in Haskell, using the PFP library for probabilistic programming as the example (just because I happen to know it; many other libraries could serve the same purpose). In PFP, a probability distribution is represented by an abstract data type Dist a, defined as

newtype Dist a = D {unD :: [(a,ProbRep)]}

This says that internally, a Dist a is a list of (a, ProbRep) pairs. The single constructor D converts such a list to the abstract data type Dist a, whereas unD does the inverse: it makes the contents of a Dist value accessible for inspection.

The problem with this is that all of the implementation code for PFP is littered with D and unD, although they don't do anything and add nothing to the clarity of the code. They are there only to make sure that the signatures of the functions contain the abstract type Dist a instead of the internal representation [(a,ProbRep)]. For the reader of the PFP code trying to understand how it works, this is clutter. There are also a couple of functions that exist only for dealing with the artificial distinction between Dist a and [(a,ProbRep)], for example

sizeD :: Dist a -> Int
sizeD = length . unD

which replaces the list function length (familiar to every Haskell programmer) by a special version whose purpose the reader has to remember.

A Clojure library that is essentially equivalent to PFP (look at the source code) is much shorter, and in my opinion much clearer. It represents a probability distribution by a map (known in other languages as a dictionary or an associative array) and directly uses Clojure's map operations to work on it. No visual overhead, no clutter. Of course, as static typing advocates would be quick to point out, no protection of the internal representation either: client code could directly manipulate the maps used to represent distributions, potentially creating maps that are not valid probability distributions. I have never run into such a problem in 15 years of using dynamically typed languages, but in principle it is possible.
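To make the comparison concrete, here is a minimal sketch of that representation in Clojure. The names uniform and expectation are mine, chosen for illustration; they are not necessarily the names used in the library:

;; A distribution is just a map from values to probabilities.
(defn uniform
  "Return the uniform distribution over the values xs."
  [xs]
  (let [p (/ 1 (count xs))]
    (zipmap xs (repeat p))))

(defn expectation
  "Expectation value of (f x) under the distribution dist."
  [f dist]
  (reduce + (map (fn [[x p]] (* (f x) p)) dist)))

;; Example: the expectation value of a fair die throw.
;; (expectation identity (uniform (range 1 7)))  =>  7/2

Note that everything here is an ordinary Clojure map manipulated with ordinary map operations; there is no conversion layer comparable to D and unD.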
It would be possible to avoid the code obfuscation due to abstract data types by recognizing that abstract data types are an interface issue and not a type issue. A language could provide an explicit declaration of an interface for a module where the function signatures would be given with the abstract data type, even though the concrete representation is used in the implementations. The compiler could verify the coherence of everything. But I haven't seen anything like this in any statically typed language.
Note also that something very similar could be implemented in Clojure: A couple of macros would provide wrappers around the exported functions that add type verification at runtime. However, this says more about the advantage of having a powerful macro system than about the advantages of dynamic typing.
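As a rough sketch of what such a macro could look like (everything below is hypothetical, not code from any existing library):

;; Hypothetical macro: define name as a wrapper around f that checks
;; its first argument against a validity predicate at runtime.
(defmacro defverified
  [name pred f]
  `(defn ~name [dist# & args#]
     (assert (~pred dist#) "not a valid probability distribution")
     (apply ~f dist# args#)))

;; Usage, assuming a predicate valid-dist? exists:
;; (defverified checked-size valid-dist? count)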
Example 2: monads
A monad is a package consisting of a data structure (or, more precisely, certain properties that a data structure must have) and two functions, bind and result. A subclass of monads also has a special value called zero, and a subclass of this subclass has one more function called plus. All these definitions must obey certain rules to make a valid monad.

In Haskell, there is a typeclass Monad that defines bind (called >>=) and result (called return), and another typeclass MonadPlus that defines mzero and mplus. A monad is defined by providing instances for concrete data types. When monadic operations are used, the type inference system identifies the data type and selects the corresponding operations. From the client's point of view, a monad thus is defined by a type.

There are a few drawbacks of this setup:
- It is not possible to define a monad with a zero but no plus. This is a technical detail (MonadPlus could well be split into two typeclasses), but it's still a limitation in practical Haskell programming that is due to the rigidity of a type system.
- It is impossible to have two monads for the same data type, although sometimes this would make sense. For example, there are (at least) two practically relevant ways to define plus for the list monad.
- It is cumbersome to use the same monad operations for two different concrete data structures that are similar enough in behaviour to be used with the same monad definition.
In Clojure, monads are values, not types. In client code, a monad is selected explicitly by the programmer by surrounding the monadic code with a with-monad form that specifies the monad to be used. Usually the monad is named explicitly, but since monads are values, they can also be represented by a variable. A Clojure monad can be used with any data type that the definitions of the monadic operations accept, and any number of monads can be defined for a data type.

In Clojure it is the data structure that almost disappears from monad handling; the constraints on the monadic values are given only as documentation for the human reader. As with other aspects of dynamic typing, this provides more flexibility and less protection.
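For illustration, here is what this looks like in client code. The sketch assumes the monad library discussed in the "Monads in Clojure" post below, in which sequence-m names the sequence (list) monad:

(with-monad sequence-m
  ;; m-bind in the sequence monad is just mapcat:
  (m-bind [1 2 3] (fn [x] [x (* 10 x)])))
;; => (1 10 2 20 3 30)

;; Since monads are ordinary values, the monad can also be passed
;; around like any other argument:
(defn pairs-in [m xs]
  (with-monad m
    (m-bind xs (fn [x] [x (* 10 x)]))))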
For standard monads, I'd say that the clarity of code is roughly equivalent in Haskell and Clojure. Haskell gains a bit in making the data structure explicit, Clojure gains a bit in making the monad explicit at the point of use. Both work pretty well.
This changes when monad transformers come into play. In the Haskell world, this is perhaps the most frightening concept to newcomers. Monad transformers are surrounded by an aura of mystery; supposedly they are only for the real experts. I think that this is at least in part due to the complexity of defining monad transformers in a type system.
Here's how Haskell defines the list monad transformer; only the relevant parts are shown:
newtype ListT m a = ListT { runListT :: m [a] }
instance (Monad m) => Monad (ListT m) where
    return a = ListT $ return [a]
    m >>= k  = ListT $ do
        a <- runListT m
        b <- mapM (runListT . k) a
        return (concat b)
As in the case of abstract data types, there is a data type definition for ListT that does nothing but introduce a new notation for the type m [a]. ListT and runListT are just notation converters that don't actually do anything useful. But unlike in the case of abstract data types, they are indispensable here. Monads are types, and therefore there has to be a new type to make a new monad. It doesn't help that the name runListT is a particularly bad choice: it suggests an action where there is none.

The definitions of return and >>= aren't masterpieces of clarity either. It takes a careful analysis of each function and its (inferred, and thus unwritten) type to understand which monad is being used where.

For comparison, here is the corresponding monad transformer in Clojure, again reduced to the basics:
(defn sequence-t [m]
  (monad
    [m-result (with-monad m
                (fn [v]
                  (m-result (list v))))
     m-bind   (with-monad m
                (fn [mv f]
                  (domonad
                    [a mv
                     b (m-map f a)]
                    (flatten b))))]))
Since monads are values, monad transformers are simply functions that take a monad argument and return another monad. It is also clear at a glance that inside the definition of each monad operation, all references to monad operations are to be interpreted in the inner monad. This doesn't make monad transformers trivial to understand, of course, but it is a lot clearer.
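A usage sketch (my example, assuming maybe-m from the same library as the inner monad):

;; Transform the maybe monad into a monad of lists that may fail:
(def sequence-maybe-m (sequence-t maybe-m))

(with-monad sequence-maybe-m
  (m-bind [1 2 3] (fn [x] [x (* 10 x)])))
;; => (1 10 2 20 3 30)

(with-monad sequence-maybe-m
  (m-bind [1 2 3] (fn [x] (when (odd? x) [x]))))
;; => nil, because the computation fails for x = 2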
Monads in Clojure
Since I was also exploring Clojure, an interesting new dialect of Lisp that strongly encourages a purely functional programming style (but doesn't enforce it), I decided to explore monads by writing a monad library for Clojure. My experience is that monads are quite useful in Clojure as well, and that once you get used to monads, you see occasions for using them almost everywhere. If you have been hesitating to tackle monads seriously, I can only encourage you to go on!
I have also written a monad tutorial for Clojure programmers, which I published on the OnClojure blog. It consists of four parts:
- Part 1 introduces the concept of monads and illustrates it with the identity and maybe monads.
- Part 2 explains the importance of m-result using the sequence monad as an example. It also covers the monad laws.
- Part 3 is about m-zero and m-plus, and explains the state monad.
- Part 4 covers the probability monad and monad transformers.
I hope that this tutorial facilitates a first contact with monads for those who are more familiar with Lisp syntax than with Haskell syntax.
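As a small taste of the tutorial's starting point, here is the maybe monad in action. The sketch again assumes the same library; safe-div is an illustrative helper of mine:

;; Division that signals failure as nil instead of throwing.
(defn safe-div [a b]
  (when-not (zero? b) (/ a b)))

(domonad maybe-m
  [x (safe-div 10 2)   ; x = 5
   y (safe-div x 0)]   ; fails, so the whole computation yields nil
  (+ x y))
;; => nil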