Recent posts
Scientific computing needs deterministic programming paradigms
A programming paradigm defines a general approach to writing programs. It defines the concepts and abstractions in terms of which a program is expressed. For example, a programming paradigm defines how data is described, how control flow is handled, how program elements are composed, etc. Well-known programming paradigms are structured programming, object-oriented programming, and functional programming.
The implementation of a programming paradigm consists of a programming language, its runtime system, libraries, and sometimes coding conventions. Some programming languages are optimized for a specific paradigm, whereas others are explicitly designed to support multiple paradigms. Paradigms that the language designer did not have in mind can sometimes be implemented by additional conventions, libraries, or preprocessors.
The list of programming paradigms that have been proposed and/or used is already quite long (see the Wikipedia entry, for example), but the ones that are practically important and significantly distinct are much less numerous. A good overview and comparison is given in the book chapter "Programming paradigms for dummies" by Peter van Roy. I will concentrate on one aspect discussed in van Roy's text (look at section 6 in particular), which I consider of particular relevance for scientific computing: determinism.
A deterministic programming paradigm is one in which every possible program has a fully deterministic behaviour: given the same input, it executes its steps in the same order and produces the same output. This is in fact what most of us would intuitively expect from a computer program. However, there are useful programs that could not be written with this restriction. A Web server, for example, has to react to external requests which are outside of its control, and take into account resource usage (e.g. database access) and possible network errors in deciding when and in which order to process requests. This shows that there is a need for non-deterministic programming paradigms. For the vast majority of scientific applications, however, determinism is a requirement, and a programming paradigm that enforces determinism is a big help in avoiding bugs. Most scientific applications that run serially have been written using a deterministic programming paradigm, as implemented by most of the popular programming languages.
Parallel computing has changed the situation significantly. When several independent processors work together on the execution of an algorithm, fully deterministic behavior is no longer desirable, as it would imply frequent synchronizations of all processors. The precise order in which independent operations are executed is typically left unspecified by a program. What matters is that the output of the program is determined only by the input. As long as this is guaranteed, it is acceptable and even desirable to let compilers and run-time systems optimize the scheduling of individual subtasks. In Peter van Roy's classification, this distinction is called "observable" vs. "non-observable" non-determinism. A programming paradigm for scientific computing should permit non-determinism, but should exclude observable non-determinism. While observable non-determinism makes the implementation of certain programs (such as Web servers) possible, it also opens the way to bugs that are particularly nasty to track down: deadlocks, race conditions, results that change with the number of processors or when moving from one parallel machine to another one.
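To make this concrete, here is a small illustration in Clojure (my own example, not taken from van Roy's text). Summing floating-point numbers in parallel in an unspecified order exposes observable non-determinism, because floating-point addition is not associative and the result depends on the order of the operations; a sequential reduce, in contrast, is fully reproducible:

;; Observable non-determinism: the parallel threads update acc in
;; whatever order the scheduler runs them, and since floating-point
;; addition is not associative, the result can vary from run to run.
(defn racy-sum [xs]
  (let [acc (atom 0.0)]
    (doall (pmap #(swap! acc + %) xs))
    @acc))

;; Deterministic: the additions happen in a fixed order.
(defn safe-sum [xs]
  (reduce + xs))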
Unfortunately, two of the most popular programming paradigms for parallel scientific applications do allow observable non-determinism: message passing, as implemented by the MPI library, and multi-threading. Those who have used either one have probably suffered the consequences. The problem is thus well known, but the solutions aren't. Fortunately, they do exist: there are several programming paradigms that encapsulate non-determinism in such a way that it cannot influence the results of a program. One of them is widely known and used: OpenMP, a layer above multi-threading that yields deterministic results as long as its directives are applied according to the rules. However, OpenMP is limited to shared-memory multiprocessor machines.
For distributed-memory parallel machines, a category at least as important, there are also programming paradigms that avoid the non-deterministic features of message passing; they are typically implemented as a layer above MPI. One example is the BSP model, which I have presented in an article in the magazine Computing in Science and Engineering. Another example is the parallel skeletons model, presented by Joël Falcou in the same magazine. Unfortunately, these paradigms are little known and not well supported by programming tools. As a consequence, most scientific applications for distributed-memory machines are written using the message passing paradigm.
Finally, a pair of programming paradigms discussed by van Roy deserves special mention, because it might well become important in scientific computing in the near future: functional programming and declarative concurrency. I have written about functional programming earlier; its main advantage is the possibility to apply mathematical reasoning and automatic transformations to program code, leading to better code (in the sense of correctness) and to better optimization techniques. Declarative concurrency is functional programming plus annotations for parallelization. The nice feature is that these annotations (not very different in principle from OpenMP pragmas) don't change the result of the program, they only influence its performance on a parallel machine. Starting from a correct functional program, it is thus possible to obtain an equivalent parallel one by automatic or manual (but automatically verified) transformations that is guaranteed to behave identically except for performance. Correctness and performance can thus be separated, which should be a big help in writing correct and efficient parallel programs. I say "should" because this approach hasn't been used very much, and isn't supported yet by any mainstream programming tools. This may change in a couple of years, so you might want to watch out for developments in this area.
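Clojure's pmap gives a first taste of this idea, even though declarative concurrency in van Roy's sense goes further: replacing map by pmap parallelizes a purely functional computation without changing its result, only its performance.

;; For a pure function, map and pmap produce identical results;
;; only the execution schedule, and thus the performance, differs.
(defn simulate [parameter]
  ;; stand-in for an expensive, purely functional computation
  (reduce + (map #(Math/sin (* parameter %)) (range 100000))))

(def results-serial   (doall (map  simulate (range 16))))
(def results-parallel (doall (pmap simulate (range 16))))
;; (= results-serial results-parallel)  =>  true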
Sheldrake's New Science of Life
The question that Sheldrake addresses in this book is where form comes from. What defines the arrangements of atoms in a molecule? Or in a crystal? Why do proteins fold into their characteristic structures? How do biological molecules assemble into cells? And how do cells divide and specialize to form an embryo?
The standard reply to these questions, the one you can find in science textbooks, is that all these forms come from the fundamental interactions of physics. Molecules are simply energetically favorable arrangements of atoms. Proteins fold in such a way that free energy is minimized. Cells assemble as a result of complex attractive interactions between their constituents, which ultimately can be reduced to fundamental physics. Embryos develop according to a "genetic program" stored in the fertilized egg's DNA.
As Sheldrake rightly emphasizes, even though it may come as a surprise to most non-experts, these assertions cannot be verified. They express a common belief among practicing scientists, and they are compatible with everything we know about nature, but they may well be wrong. We simply cannot verify them because the fundamental equations of physics can be solved only for very simple systems. Even for one of the simplest molecules, water, we cannot predict the arrangement of its atoms directly from the basic principles of physics. What we use in practice are approximations, and these approximations have been selected because they correctly predict the known molecular structures. We cannot use such approximations to settle more fundamental questions.
Sheldrake proposes an alternative theory, based on what he calls "morphogenetic fields". From my point of view as a physicist, the name is not very well chosen because these entities do not correspond to what a physicist would call a field, but of course this term may be perfectly clear to biologists. It's a minor point because Sheldrake explains this concept very clearly in his book. In summary, his theory says that forms exist because they have existed before; atoms, molecules, and cells arrange themselves into patterns that they "remember" from the past. His morphogenetic fields are a giant database of forms that the universe keeps around forever.
The main "problem" with this theory is that if it is right, even just approximately, then standard science, from physics to biology, is very much wrong. It is probably for this reason that his book has attracted so much criticism from the science establishment. Otherwise, there is little one could criticize: Sheldrake explains his theory and its consequences for chemistry and biology, and he proposes a large number of experiments that would permit testing it. This is science at its best. Of course his theory may turn out to require modifications, or even be completely wrong, but that is true of any scientific theory when it is first formulated.
In fact, I recommend this book to anyone interested in the scientific process because of its detailed discussion of how scientific discovery works. I haven't seen many books accessible to non-specialists that explain the limits of verifiability of a scientific theory, for example. Nor have I seen any other book that makes the distinction between verified theories and widely accepted but untested beliefs so clear as Sheldrake does. Even if you don't care about his theory, you can gain a lot from reading this book.
Functional programming for scientific computing
I have written an introduction to functional programming for scientists for the July issue of Computing in Science and Engineering. It is also available (free access) via IEEE's Computing Now portal: http://www2.computer.org/portal/web/computingnow/0609/whatsnew/cise
While I don't expect functional programming to be adopted rapidly by computational scientists, I am convinced that ten years from now, it will be an essential item in everyone's toolbox. Better start preparing yourself now!
Static typing and code clutter
I have used various programming languages over the years, including both statically typed and dynamically typed ones. But when given a choice, I have always preferred dynamic typing. Since 1995, my main programming language has been Python, and more recently I have started to use Clojure. One of the reasons for this preference is something that I have never seen expressed before: static typing often adds visual clutter to the code that makes it harder to read.
An important property of any non-trivial computer program is its clarity to human readers. Both verification of a program's correctness and the overall utility of a piece of code in a context of changing requirements depend on this. Well-written specifications and unit tests help as well, but if you want my advice on the quality of a piece of code, or if you want my help with modifying it, my judgement will mostly be based on its clarity. If it's an effort to understand what's going on, I wouldn't want to work with it.
This criterion for code quality immediately translates into a criterion for programming languages: they should be able to express as many concepts of software engineering as possible in a direct, explicit way and without imposing any clutter or obfuscation. Static type systems often get in the way, either by imposing clutter or by encouraging a less clear programming style.
In my examples, I will use the languages Haskell (static typing) and Clojure (dynamic typing) for illustration. Haskell has one of the best type systems available at the moment, so if Haskell can't avoid the problems that I point out, it is likely that no other current language will do a better job. Clojure is a good comparison because like Haskell it is designed for a functional programming style. Of course, it also helps that I am reasonably familiar with both languages.
Example 1: abstract data types
The idea behind abstract data types is that the concrete representation of some data structure should be hidden from client code, which accesses the data structure only through a set of interface functions. Let's look at how this is typically implemented in Haskell, using the PFP library for probabilistic programming as the example (just because I happen to know it; many other libraries could serve the same purpose). In PFP, a probability distribution is represented by an abstract data type Dist a, defined as

newtype Dist a = D {unD :: [(a,ProbRep)]}

This says that internally, a Dist a is a list of (a, ProbRep) pairs. The single constructor D converts such a list to the abstract data type Dist a, whereas unD does the inverse: it makes the contents of a Dist value accessible for inspection.

The problem with this is that all of the implementation code for PFP is littered with D and unD, although they don't do anything and add nothing to the clarity of the code. They are there only to make sure that the signatures of the functions contain the abstract type Dist a instead of the internal representation [(a,ProbRep)]. For the reader of the PFP code trying to understand how it works, this is clutter. There are also a couple of functions that exist only for dealing with the artificial distinction between Dist a and [(a,ProbRep)], for example

sizeD :: Dist a -> Int
sizeD = length . unD

which replaces the list function length (familiar to every Haskell programmer) by a special version whose purpose the reader has to remember.

A Clojure library that is essentially equivalent to PFP (look at the source code) is much shorter, and in my opinion much clearer. It represents a probability distribution by a map (known in other languages as a dictionary or an associative array) and directly uses Clojure's map operations to work on it. No visual overhead, no clutter. Of course, as static typing advocates would be quick to point out, no protection of the internal representation either: client code could directly manipulate the maps used to represent distributions, potentially creating maps that are not valid probability distributions. I have never run into such a problem in 15 years of using dynamically typed languages, but in principle it is possible.
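To make the comparison concrete, here is a minimal sketch of that representation in Clojure. The names uniform and expectation are mine, chosen for illustration; they are not necessarily the names used in the library:

;; A distribution is just a map from values to probabilities.
(defn uniform
  "Return the uniform distribution over the values xs."
  [xs]
  (let [p (/ 1 (count xs))]
    (zipmap xs (repeat p))))

(defn expectation
  "Expectation value of (f x) under the distribution dist."
  [f dist]
  (reduce + (map (fn [[x p]] (* (f x) p)) dist)))

;; Example: the expectation value of a fair die throw.
;; (expectation identity (uniform (range 1 7)))  =>  7/2

Note that everything here is an ordinary Clojure map manipulated with ordinary map operations; there is no conversion layer comparable to D and unD.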
It would be possible to avoid the code obfuscation due to abstract data types by recognizing that abstract data types are an interface issue and not a type issue. A language could provide an explicit declaration of an interface for a module where the function signatures would be given with the abstract data type, even though the concrete representation is used in the implementations. The compiler could verify the coherence of everything. But I haven't seen anything like this in any statically typed language.
Note also that something very similar could be implemented in Clojure: A couple of macros would provide wrappers around the exported functions that add type verification at runtime. However, this says more about the advantage of having a powerful macro system than about the advantages of dynamic typing.
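As a rough sketch of what such a macro could look like (everything below is hypothetical, not code from any existing library):

;; Hypothetical macro: define name as a wrapper around f that checks
;; its first argument against a validity predicate at runtime.
(defmacro defverified
  [name pred f]
  `(defn ~name [dist# & args#]
     (assert (~pred dist#) "not a valid probability distribution")
     (apply ~f dist# args#)))

;; Usage, assuming a predicate valid-dist? exists:
;; (defverified checked-size valid-dist? count)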
Example 2: monads
A monad is a package consisting of a data structure (or, more precisely, certain properties that a data structure must have) and two functions, bind and result. A subclass of monads also has a special value called zero, and a subclass of this subclass has one more function called plus. All these definitions must obey certain rules to make a valid monad.

In Haskell, there is a typeclass Monad that defines bind (called >>=) and result (called return), and another typeclass MonadPlus that defines mzero and mplus. A monad is defined by providing instances for concrete data types. When monadic operations are used, the type inference system identifies the data type and selects the corresponding operations. From the client's point of view, a monad thus is defined by a type.

There are a few drawbacks of this setup:
- It is not possible to define a monad with a zero but no plus. This is a technical detail (MonadPlus could well be split into two typeclasses), but it's still a limitation in practical Haskell programming that is due to the rigidity of a type system.
- It is impossible to have two monads for the same data type, although sometimes this would make sense. For example, there are (at least) two practically relevant ways to define plus for the list monad.
- It is cumbersome to use the same monad operations for two different concrete data structures that are similar enough in behaviour to be used with the same monad definition.
In Clojure, monads are values, not types. In client code, a monad is selected explicitly by the programmer by surrounding the monadic code with a with-monad form that specifies the monad to be used. Usually the monad is named explicitly, but since monads are values, they can also be represented by a variable. A Clojure monad can be used with any data type that the definitions of the monadic operations accept, and any number of monads can be defined for a data type.

In Clojure it is the data structure that almost disappears from monad handling; the constraints on the monadic values are given only as documentation for the human reader. As with other aspects of dynamic typing, this provides more flexibility and less protection.
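For illustration, here is what this looks like in client code. The sketch assumes the monad library discussed in the "Monads in Clojure" post below, in which sequence-m names the sequence (list) monad:

(with-monad sequence-m
  ;; m-bind in the sequence monad is just mapcat:
  (m-bind [1 2 3] (fn [x] [x (* 10 x)])))
;; => (1 10 2 20 3 30)

;; Since monads are ordinary values, the monad can also be passed
;; around like any other argument:
(defn pairs-in [m xs]
  (with-monad m
    (m-bind xs (fn [x] [x (* 10 x)]))))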
For standard monads, I'd say that the clarity of code is roughly equivalent in Haskell and Clojure. Haskell gains a bit in making the data structure explicit, Clojure gains a bit in making the monad explicit at the point of use. Both work pretty well.
This changes when monad transformers come into play. In the Haskell world, this is perhaps the most frightening concept to newcomers. Monad transformers are surrounded by an aura of mystery; supposedly they are only for the real experts. I think that this is at least in part due to the complexity of defining monad transformers in a type system.
Here's how Haskell defines the list monad transformer; only the relevant parts are shown:
newtype ListT m a = ListT { runListT :: m [a] }
instance (Monad m) => Monad (ListT m) where
    return a = ListT $ return [a]
    m >>= k  = ListT $ do
        a <- runListT m
        b <- mapM (runListT . k) a
        return (concat b)
As in the case of abstract data types, there is a data type definition for ListT that does nothing but introduce a new notation for the type m [a]. ListT and runListT are just notation converters that don't actually do anything useful. But unlike in the case of abstract data types, they are indispensable here. Monads are types, and therefore there has to be a new type to make a new monad. It doesn't help that the name runListT is a particularly bad choice: it suggests an action where there is none.

The definitions of return and >>= aren't masterpieces of clarity either. It takes a careful analysis of each function and its (inferred, and thus unwritten) type to understand which monad is being used where.

For comparison, here is the corresponding monad transformer in Clojure, again reduced to the basics:
(defn sequence-t [m]
  (monad
    [m-result (with-monad m
                (fn [v]
                  (m-result (list v))))
     m-bind   (with-monad m
                (fn [mv f]
                  (domonad
                    [a mv
                     b (m-map f a)]
                    (flatten b))))]))
Since monads are values, monad transformers are simply functions that take a monad argument and return another monad. It is also clear at a glance that inside the definition of each monad operation, all references to monad operations are to be interpreted in the inner monad. This doesn't make monad transformers trivial to understand, of course, but it is a lot clearer.
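A usage sketch (my example, assuming maybe-m from the same library as the inner monad):

;; Transform the maybe monad into a monad of lists that may fail:
(def sequence-maybe-m (sequence-t maybe-m))

(with-monad sequence-maybe-m
  (m-bind [1 2 3] (fn [x] [x (* 10 x)])))
;; => (1 10 2 20 3 30)

(with-monad sequence-maybe-m
  (m-bind [1 2 3] (fn [x] (when (odd? x) [x]))))
;; => nil, because the computation fails for x = 2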
Monads in Clojure
Since I was also exploring Clojure, an interesting new dialect of Lisp that strongly encourages a purely functional programming style (but doesn't enforce it), I decided to explore monads by writing a monad library for Clojure. My experience is that monads are quite useful in Clojure as well, and that once you get used to monads, you see occasions for using them almost everywhere. If you have been hesitating to tackle monads seriously, I can only encourage you to go on!
I have also written a monad tutorial for Clojure programmers, which I published on the OnClojure blog. It consists of four parts:
- Part 1 introduces the concept of monads and illustrates it with the identity and maybe monads.
- Part 2 explains the importance of m-result using the sequence monad as an example. It also covers the monad laws.
- Part 3 is about m-zero and m-plus, and explains the state monad.
- Part 4 covers the probability monad and monad transformers.
I hope that this tutorial facilitates a first contact with monads for those who are more familiar with Lisp syntax than with Haskell syntax.
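As a small taste of the tutorial's starting point, here is the maybe monad in action. The sketch again assumes the same library; safe-div is an illustrative helper of mine:

;; Division that signals failure as nil instead of throwing.
(defn safe-div [a b]
  (when-not (zero? b) (/ a b)))

(domonad maybe-m
  [x (safe-div 10 2)   ; x = 5
   y (safe-div x 0)]   ; fails, so the whole computation yields nil
  (+ x y))
;; => nil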