Recent posts
Unifying version control and dependency management for reproducible research
Software developers use version control to keep track of the evolution of their software, to coordinate team development, and to manage experimental features. But version control is also of interest to software users: it permits them to refer to a specific version of a piece of software they use in a unique and reproducible way, even if that version is not the current one, nor perhaps even an official numbered release. In fact, official numbered releases are becoming a relic of the past. They make little sense in an Open Source universe where everyone has access to source code repositories under version control. In that situation, an official release is nothing but a bookmark pointing to a specific commit number. There is no need for a release number.
Why would you want to refer to a specific version of a piece of software, rather than always use the latest one? There are many reasons. As software evolves, some bugs get fixed but others sneak in. You may prefer the bugs you know to the ones that could surprise you. Sometimes later versions of some software are not fully compatible with their predecessors, be it by design or by mistake. And even if you want to use the very latest version at any time, you might still want to note which version you used for a specific application. In scientific computing, this is one of the fundamental principles of reproducible research: note carefully, and publish, the exact versions of all pieces of software that were used for obtaining any published research result. It's the only way for you and others to be able to understand exactly what happened when you look at your work many years later.
Another undeniable reality of modern software, in particular in the Open Source universe, is that it's modular. Developers use other people's software, especially if it's well written and has the reputation of being reliable, rather than reinventing the wheel. The typical installation instructions of a piece of Open Source software start with a list of dependencies, i.e. packages you have to install before you can install the current one. And of course the packages in the dependency list have their own dependency list. The number of packages to install can be overwhelming. The difficulties of dependency management are so widespread that the term "dependency hell" has been coined to refer to them.
Systems programmers have come up with a solution to that problem as well: dependency management tools, better known as package managers. Such tools keep a database of what is installed and which package depends on which other ones. The well-known Linux distributions are based on such package managers, of which those developed by Debian and RedHat are the most popular and are now used by other distributions as well. For MacOS X, MacPorts and Fink are the two most popular package managers, and I suspect that the Windows world has its own.
One of the major headaches that many computer users face is that version management and dependency management don't cooperate. While most package managers allow stating a minimal version number for a dependency, they don't allow prescribing a precise version number. There is a good reason for this: the way software installation is managed traditionally on Unix systems makes it impossible to install multiple versions of the same package in parallel. If packages A and B both depend on C, but require different versions of it, there is no simple solution. Today's package managers sweep this problem under the rug and pretend that higher version numbers are always at least as good as their predecessors. They will therefore install the higher of the two version numbers required by A and B, forcing one of them to use a version different from its preference.
Anyone who has been using computers intensively for a few years has probably run into such a problem, which manifests itself by some program not working correctly any more after another one, seemingly unrelated, has been installed. Another variant is that an installation fails because some dependency is available in a wrong version. Such problems are part of "dependency hell".
This situation is particularly problematic for the computational scientist who cares about the reproducibility of computed results. At worst, verifying results from 2005 by comparing to results from 2009 can require two completely separate operating system installations running in separate virtual machines. Under such conditions, it is difficult to convince one's colleagues to adopt reproducible research practices.
While I can't propose a ready-to-use solution, I can point out some work that shows that there is hope for the future. One interesting tool is the Nix package manager, which works much like the package managers by Debian or RedHat, but permits installing multiple versions of the same package in parallel, and registers dependencies with precise versions. It could be used as a starting point for managing software for reproducible research, the main advantage being that it should work with all existing software. The next step would be to make each result dataset or figure a separate "package" whose complete dependency list (software and datasets) is managed by Nix with references to precise version numbers. I am currently exploring this approach; watch this space for news about my progress.
For a system even better suited to the needs of reproducible computational science, I refer to my own ActivePapers framework, which combines dependency management and version control for code and data with mechanisms for publishing code+data+documentation packages and for re-using code from other publications in a secure way. I have to admit that it has a major drawback as well: it requires all code to run on the Java Virtual Machine (in order to guarantee portability and secure execution), which unfortunately means that most of today's scientific programs cannot be used. Time will tell if scientific computing will adopt some virtual machine in the future that will make such a system feasible in real life. Reproducible research might actually become a strong argument in favour of such a development.
Julia: a new language for scientific computing
New programming languages are probably invented every day, and even those that get developed and published are too numerous to mention. New programming languages developed specifically for science and engineering are very rare, however, and that's why such a rare event deserves some publicity. A while ago, I saw an announcement for Julia, which describes itself as "a fresh approach to technical computing". I couldn't resist the temptation to download, install, and test-drive it. Here are my first impressions.
The languages used today for scientific computing can be grouped into four categories:
- Traditional compiled languages optimized for number crunching. The big player in this category is of course Fortran, but some recent languages such as X10, Chapel, or Fortress are trying to challenge it.
- Rapid-development domain-specific languages, usually interpreted. Well-known examples are Matlab and R.
- General-purpose statically compiled languages with libraries for scientific computing. C and C++ come to mind immediately.
- General-purpose dynamic languages with libraries for scientific computing. The number one here is Python with its vast library ecosystem.
What sets Julia apart is that it sits somewhere between the first two categories. It's compiled, but fully interactive: there is no separate compilation phase. It is statically typed, allowing for efficient compilation, but also has the default type "Any" that makes it work just like dynamically typed languages in the absence of type declarations. Type inference makes the mix even better. If that sounds like the best of both worlds, it actually is. It has been made possible by modern code transformation techniques that don't really fit into the traditional categories of "compilers" and "interpreters". Like many other recent languages and language implementations, Julia uses LLVM as its infrastructure for these code transformations.
Julia has a well-designed type system with a clear orientation towards maths and number crunching: there is support for complex numbers, and first-class array support. What may seem surprising is that Julia is not object-oriented. This is neither an oversight nor a nostalgic return to the days of Fortran 77, but a clear design decision. Julia has type hierarchies and function polymorphism with dispatch on the types of all arguments. For scientific applications (and arguably for some others), this is more useful than OO style method dispatch on a single value.
Another unusual feature of Julia is a metaprogramming system that is very similar to Lisp macros, although it is slightly more complicated by the fact that Julia has a traditional syntax layer, whereas Lisp represents code by data structures.
So much for a summary of the language. The real question is: does it live up to its promises? Before I try to answer that question, I would like to point out that Julia is a young language that is still in flux and for now has almost no development tool support. For many real-life problems, there is no really good solution at the moment, but it is clear that a good solution can be provided; it just needs to be done. What I am trying to evaluate is not whether Julia is ready for real-life use (it is not), but whether there are any fundamental design problems.
The first question I asked myself is how well Julia can handle non-scientific applications. I just happened to see a blog post by John D. Cook explaining why it's preferable to write math in a general-purpose language rather than to write non-math in a math language. My experience is exactly the same, and that's why I have adopted Python for most of my scientific programming. The point is that any non-trivial program sooner or later requires solving non-math problems (I/O, Web publishing, GUIs, ...). If you use a general-purpose language, you can usually just pick a suitable library and go ahead. With math-only languages such as Matlab, your options are limited, with interfacing to C code sometimes being the only way out.
So is it feasible to write Web servers or GUI libraries in Julia? I would say yes. All the features of general-purpose languages are there or under consideration (I am thinking in particular of namespaces there). With the exception of systems programming (device drivers and the like), pretty much every programming problem can be solved in Julia with no more effort than in most other languages. The real question is if it will happen. Julia is clearly aimed at scientists and engineers. It is probably good enough for doing Web development, but it has nothing to offer for Web developers compared to well-established languages. Will scientists and engineers develop their own Web servers in Julia? Will Web developers adopt Julia? I don't know.
A somewhat related question is that of interfacing to other languages. That's a quick way to make lots of existing code available. Julia has a C interface (which clearly needs better tool support, but I am confident that it will come), which can be used for other sufficiently C-like languages. It is not clear what effort will be required to interface Julia with languages like Python or Ruby. I don't see why it couldn't be done, but I can't say yet whether the result will be pleasant to work with.
The second question I explored is how well Julia is suited to my application domain, which is molecular simulations and the analysis of experimental data. Doing molecular simulation in Julia looks perfectly feasible, although I didn't really implement any non-trivial algorithm yet. What I concentrated on first is data analysis, because that's where I could profit most from Julia's advantages. The kinds of data I mainly deal with are (1) time series and frequency spectra and (2) volumetric data. For time series, Julia works just fine. My biggest stumbling block so far has been volumetric data.
Volumetric data is usually stored in a 3-dimensional array where each axis corresponds to one spatial dimension. Typical operations on such data are interpolation, selection of a plane (2-d subarray) or line (1-d subarray), element-wise multiplication of volume, plane, or line arrays, and sums over selected regions of the data. Using the general-purpose array systems I am familiar with (languages such as APL, libraries such as NumPy for Python), all of this is easy to handle.
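To make the comparison concrete, here is roughly what those operations look like with NumPy; this is only meant to illustrate the behaviour I am used to, the array contents and variable names are of course made up:

import numpy

# Volumetric data: a 3-d array with one axis per spatial dimension.
volume = numpy.arange(24.).reshape((2, 3, 4))

plane = volume[0, :, :]                  # a 2-d subarray, shape (3, 4)
line = volume[0, 1, :]                   # a 1-d subarray, shape (4,)
product = volume * volume                # elementwise multiplication
sums_over_x = volume.sum(axis=0)         # sum over the first axis, shape (3, 4)
region_sum = volume[:, 1:3, :2].sum()    # sum over a selected region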
Julia's arrays are different, however. Apparently the developers' priority was to make the transition to Julia easy for people coming from Matlab. Matlab is based on the principle that "everything is a matrix", i.e. a two-dimensional array-like data structure. Matlab vectors come in two flavors, row and column vectors, which are actually matrices with a single row or column, respectively. Matlab scalars are considered 1x1 matrices. Julia is different because it has arrays of arbitrary dimension. However, array literals are made to resemble Matlab literals, and array operations are designed to behave as similarly as possible to Matlab operations, in particular for linear algebra functions. In Julia, as in Matlab, matrix multiplication is considered more fundamental than elementwise multiplication of two arrays.
For someone used to arrays that are nothing more than data structures, the result looks a bit messy. Here are some examples:
julia> a = [1; 2]
[1, 2]
julia> size(a)
(2,)
julia> size(transpose(a))
(1,2)
julia> size(transpose(transpose(a)))
(2,1)
I'd expect that the transpose of the transpose is equal to the original array, but that's not the case. But what does transpose do to a 3d array? Let's see:
julia> a = [x+y+z | x=1:4, y=1:2, z = 1:3]
4x2x3 Int64 Array:
...
julia> transpose(a)
no method transpose(Array{Int64,3},)
in method_missing at base.jl:60
OK, so it seems this was not considered important enough, but of course that can be fixed.
Next comes indexing:
julia> a = [1 2; 3 4]
2x2 Int64 Array:
1 2
3 4
julia> size(a)
(2,2)
julia> size(a[1, :])
(1,2)
julia> size(a[:, 1])
(2,1)
julia> size(a[1, 1])
()
Indexing a 2-d array with a single number (all other indices being the all-inclusive range :) yields a 2-d array. Indexing with two number indices yields a scalar. So how do I extract a 1-d array? This generalizes to higher dimensions: if the number of number indices is equal to the rank of the array, the result is a scalar, otherwise it's an array of the same rank as the original.
Array literals aren't that frequent in practice, but they are used a lot in development, for quickly testing functions. Here are some experiments:
julia> size([1 2])
(1,2)
julia> size([1; 2])
(2,)
julia> size([[1;2] ; [3;4]])
(4,)
julia> size([[1;2] [3;4]])
(2,2)
julia> size([[1 2] [3 4]])
(1,4)
julia> size([[[1 2] [3 4]] [[5 6] [7 8]]])
(1,8)
Can you guess the rules? Once you have them (or looked them up in the Julia manual), can you figure out how to write a 3-d array literal? I suspect it's not possible.
Next, summing up array elements:
julia> sum([1; 2])
3
julia> sum([1 2; 3 4])
10
Apparently sum doesn't care about the shape of my array; it always sums the individual elements. Then how do I do a sum over all the rows?
I have tried to convert some of my basic data manipulation code from Python/NumPy to Julia, but found that I always spent most of the time fighting against the built-in array operations, which are clearly not made for my kind of application. In some cases a change of attitude may be sufficient. It seems natural to me that a plane extracted from volumetric data should be a 2-d array, but maybe if I decide that it should be a 3-d array of "thickness" 1, everything will be easy.
I haven't tried yet, because I know there are cases that cannot be dealt with in that way. Suppose I have a time series of volumetric data that I store in a 4-d array. Obviously I want to be able to apply functions written for static volumetric data (i.e. 3-d arrays) to an element of such a time series. Which means I do need a way to extract a 3-d array out of a 4-d array.
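For comparison, with NumPy this is a plain indexing operation, because indexing with a single number removes the corresponding axis; a small sketch (the sizes and names are made up):

import numpy

def analyze_volume(volume):
    # stands in for any function written for static 3-d volumetric data
    return volume.sum()

# A time series of volumetric data: time is the first axis.
series = numpy.zeros((100, 32, 32, 32))

frame = series[42]              # a 3-d array of shape (32, 32, 32)
result = analyze_volume(frame)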
I hope that what I need is there and I just didn't find it yet. Any suggestions are welcome. For now, I must conclude that test-driving Julia is a frustrating experience: the language holds so many promises, but fails for my needs due to superficial but practically very important problems.
Binary operators in Python
A two-hour train journey provided the opportunity to watch the video recording of the Panel with Guido van Rossum at the recent PyData Workshop. The lengthy discussion about PEP 225 (which proposes adding additional operators to Python that would make it possible to have both elementwise and aggregate operations on the same objects, in particular both matrix and elementwise multiplication on arrays with a nice syntax) motivated me to write up my own thoughts about what's wrong with operators in Python from my computational scientist's point of view.
The real problem I see is that operators map to methods. In Python, a*b is just syntactic sugar for a.__mul__(b). This means that it's the type of a that decides how to do the multiplication. The method implementing this operation can of course check the type of b, and it can even decide to give up and let b handle everything, in which case Python does b.__rmul__(a). But this is just a kludge to work around the real weakness of the operators-map-to-methods approach. Binary operators fundamentally require a dispatch on both types, the type of a and the type of b. What a*b should map to is __builtins__.__mul__(a, b), a global function that would then implement a binary dispatch operation. Implementing that dispatch would in fact be the real problem to solve, as Python currently has no multiple dispatch mechanisms at all.
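To make the idea concrete, here is a minimal sketch of what such a two-argument dispatch could look like; the registry, the function names, and the Matrix class are all made up for illustration, this is not a concrete proposal for an API:

# Hypothetical global dispatch table, keyed by the types of both operands.
_mul_rules = {}

def register_mul(type_a, type_b, implementation):
    _mul_rules[(type_a, type_b)] = implementation

def binary_mul(a, b):
    # Look for an implementation for the type pair, walking up the
    # method resolution order of both arguments.
    for ta in type(a).__mro__:
        for tb in type(b).__mro__:
            impl = _mul_rules.get((ta, tb))
            if impl is not None:
                return impl(a, b)
    raise TypeError("no multiplication rule for %s and %s"
                    % (type(a).__name__, type(b).__name__))

# A library could then register how its objects multiply with numbers,
# no matter on which side of the operator the number appears.
class Matrix(object):
    def __init__(self, rows):
        self.rows = rows

register_mul(Matrix, float,
             lambda m, x: Matrix([[x*e for e in row] for row in m.rows]))
register_mul(float, Matrix, lambda x, m: binary_mul(m, x))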
But would multiple dispatch solve the issue addressed by PEP 225? Not directly. But it would make some of the alternatives mentioned there feasible. A proper multiple dispatch system would allow NumPy (or any other library) to decide what multiplication of its own objects by a number means, no matter whether the number is the first or the second factor.
More importantly, multiple dispatch would allow a major cleanup of many scientific packages, including NumPy, and even clean up the basic Python language by getting rid of __rmul__ and friends. NumPy's current aggressive handling of binary operations is actually more of a problem for me than the lack of a nice syntax for matrix multiplication.
There are many details that would need to be discussed before binary dispatch could be proposed as a PEP. Of course the old method-based approach would need to remain in place as a fallback, to ensure compatibility with existing code. But the real work is defining a good multiple dispatch system that integrates well with Python's dynamic type system and allows the right kind of extensibility. That same multiple dispatch mechanism could then also be made available for use in plain functions.
Python becomes a platform
The recent announcement of clojure-py made some noise in the Clojure community, but not, as far as I can tell, in the Python community. For those who haven't heard of it before, clojure-py is an implementation of the Clojure language in Python, compiling Clojure code to bytecode for Python's virtual machine. It's still incomplete, but already usable if you can live with the subset of Clojure that has been implemented.
I think that this is an important event for the Python community, because it means that Python is no longer just a language, but is becoming a platform. One of the stated motivations of the clojure-py developers is to tap into the rich set of libraries that the Python ecosystem provides, in particular for scientific applications. Python is thus following the path that Java took in the past: the Java virtual machine, initially designed only to support the Java language, became the target of many different language implementations which all provide interoperation with Java itself.
It will of course be interesting to see if more languages will follow once people realize it can be done. The prospect of speed through PyPy's JIT, another stated motivation for the clojure-py community, could also get more language developers interested in Python as a platform.
Should Python programmers care about clojure-py? I'd say yes. Clojure is strong in two areas in which Python isn't. One of them is metaprogramming, a feature absent from Python which Clojure had from the start through its Lisp heritage. The other feature is persistent immutable data structures, for which clojure-py provides an implementation in Python. Immutable data structures make for more robust code, in particular but not exclusively for concurrent applications.
Teaching parallel computing in Python
Every time I teach a class on parallel computing with Python using the multiprocessing module, I wonder if multiprocessing is really mature enough that I should recommend using it. I end up deciding for it, mostly because of the lack of better alternatives. But I am not happy at all with some features of multiprocessing, which are particularly nasty for non-experts in Python. That category typically includes everyone in my classes.
To illustrate the problem, I'll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:
from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))
Do you see the two bugs in this example? Look again. No, it's nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn't.
Imagine your typical student typing this script and running it. Here's what happens:
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
Python experts will immediately see what's wrong: numpy.sqrt is not picklable. This is mostly a historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, so it simply wasn't done. Implementing it today within the framework of Python's existing pickle protocol is unfortunately not trivial, and that's why it hasn't been implemented.
Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn't hurt of course if they learn a bit about pickling, since it also has a performance impact on parallel programs. But due to restrictions such as this one, you have to explain this right at the start, although it would be better to leave this for the "advanced topics" part.
OK, you have passed the message, and your students fix the script:
from multiprocessing import Pool
import numpy
pool = Pool()
def square_root(x):
    return numpy.sqrt(x)
print pool.map(square_root, range(100))
And then run it:
Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
AttributeError: 'module' object has no attribute 'square_root'
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
AttributeError: 'module' object has no attribute 'square_root'
At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its worker process pools. And since the answer (on Unix systems) is "fork", you have to have a pretty good idea of Unix process creation to see the cause of the error. Which then leads to a trivial fix:
from multiprocessing import Pool
import numpy
def square_root(x):
    return numpy.sqrt(x)
pool = Pool()
print pool.map(square_root, range(100))
Success! It works! But... how do you explain this to your students?
To make it worse, this script works but is still not correct: it has a portability bug because it doesn't work under Windows. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions in multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.
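For the record, the portable variant needs the well-known main-module guard: Windows cannot fork, so the worker processes re-import the script, and creating the pool at import time would make each worker try to create its own pool. A sketch of the Windows-safe version:

from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

if __name__ == '__main__':
    # Only the original process creates the pool; the workers that
    # re-import this module on Windows skip this block.
    pool = Pool()
    print pool.map(square_root, range(100))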
These issues with multiprocessing are a classical case of a leaky abstraction: multiprocessing provides a "pool of worker processes" abstraction to the programmer, but in order to use it, the programmer has to understand the implementation. In my opinion, it would be preferable to have a less shiny API, but one which reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until this really happens, I'd prefer an API that does not suggest possibilities that don't exist.
I have actually thought about this myself a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions that implement tasks executed in parallel to be methods of a single class, or functions of a single module. My API also contains an explicit "run parallel job now" call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.
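As an illustration of that design principle (the class and method names below are invented for this post, this is not the actual API of my framework), such a more explicit interface could be built on top of multiprocessing like this:

from multiprocessing import Pool
import numpy

class TaskRunner(object):
    # Hypothetical base class: subclasses define the task methods, and
    # parallel execution happens only in the explicit run() call below.
    def run(self, task_name, items):
        pool = Pool()
        try:
            return pool.map(_BoundTask(self, task_name), items)
        finally:
            pool.close()
            pool.join()

class _BoundTask(object):
    # Picklable callable: it stores the task object and the task name,
    # and looks the method up again inside the worker process.
    def __init__(self, owner, task_name):
        self.owner, self.task_name = owner, task_name
    def __call__(self, item):
        return getattr(self.owner, self.task_name)(item)

class MyTasks(TaskRunner):
    def square_root(self, x):
        return numpy.sqrt(x)

if __name__ == '__main__':
    print MyTasks().run('square_root', range(100))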
A rant about mail clients
From an information processing point of view, an e-mail message is not very different from lots of other pieces of data. It's a sequence of bytes respecting a specific format (defined by a handful of standards) to allow its unambiguous interpretation by various programs in the processing chain. An e-mail message can perfectly well be stored in a file, and in fact most e-mail clients permit saving a message to a file. Unfortunately, the number of e-mail clients able to open such a file and display it correctly is already much smaller. But when it comes to collections of messages, information processing freedom ends completely.
Pretty much every mail client's point of view is that all of a user's mail is stored in some database, and that it (the client) is free to handle this database in whatever way it likes. The user's only access to the messages is the mail client. The one and only. The only exception is server-based mail databases handled via the IMAP protocol, where multiple clients can work with a common database. If you don't use IMAP, you have no control over how and where your mail is stored, who has access to it, etc.
What I'd like to do is manage mail just like I manage other files. A mailbox should just be a directory containing messages, one per file. Mailboxes could be stored anywhere in the file system. Mailboxes could be shared through the file system, and backed up via the file system. They could be grouped with whatever other information in whatever way that suits me. I would double-click on a message to view it, or double-click on a mailbox directory to view a summary, sorted in the way I like it. Or I would use command-line tools to work on a message or a mailbox. I'd pick the best tool for each job, just like I do when working with any other kind of file.
Why all that isn't possible remains a mystery to me. The technology has been around for decades. The good old Maildir format would be just fine for storing mailboxes anywhere in the file system, as would the even more venerable mbox format. But even mail clients that use mbox or Maildir internally insist that all such mailboxes must reside in a single master directory. Moreover, they won't let me open a mailbox from outside; I have to run the mail client and work through its hierarchical presentation of mailboxes to get to my destination.
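Even the scripting side exists already: Python's standard library has had Maildir and mbox support for years. A minimal sketch of listing the subjects of all messages in a Maildir stored wherever I like (the path is of course made up):

import mailbox

# A Maildir is just a directory; it could live anywhere in the file system.
# factory=None makes iteration return MaildirMessage objects.
box = mailbox.Maildir('/home/me/projects/alpha/mail', factory=None)
for message in box:
    print message['Subject']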
Before I get inundated by comments pointing out that mail client X has feature Y from the list above: Yes, I know, there are small exceptions here and there. But unless I have the complete freedom to put my mail where I want it, the isolated feature won't do me much good. If someone knows of a mail client that has all the features I am asking for, plus the features we all expect from a modern mail client, then please do leave a comment!
EuroSciPy 2011
The two keynote talks were particularly inspiring. On Saturday, Marian Petre reported on her studies of how people in general and scientists in particular develop software. The first part of her presentation was about how "experts" design and implement software, the definition of an expert being someone who produces software that actually works, is finished on time, and doesn't exceed the planned budget. The second part was about the particularities of software development in science. But perhaps the most memorable quote of the keynote was Marian's reply to a question from the audience about how to deal with unreasonable decisions coming from technically less competent managers. She recommended learning how to manage management - a phrase that I heard repeated several times in discussions throughout the conference.
The Sunday keynote was given by Fernando Perez. As was to be expected, IPython was his number one topic and there was a lot of new stuff to show off. I won't mention all the new features in the recently released version 0.11 because they are already discussed in detail elsewhere. What I find even more exciting is the new Web notebook interface, available only directly from the development site at github. A notebook is an editable trace of an interactive session that can be edited, saved, stored in a repository, or shared with others. It contains inputs and outputs of all commands. Inputs are cells that can consist of more than one line. Outputs are by default what Python prints to the terminal, but IPython provides a mechanism for displaying specific types of objects in a special way. This makes it possible to show images (in particular plots) inline, but also to turn SymPy expressions into mathematical formulas typeset in LaTeX.
A more alarming aspect of Fernando's keynote was his statistical analysis of contributions to the major scientific libraries of the Python universe. In summary, the central packages are maintained by a grand total of about 25 people in their spare time. This observation caused a lot of debate, centered around how to encourage more people to contribute to this fundamental work.
Among the other presentations, as usual mostly of high quality, the ones that impressed me most were Andrew Straw's presentation of ROS, the Robot Operating System, Chris Myers' presentation about SloppyCell, and Yann Le Du's talk about large-scale machine learning running on a home-made GPU cluster. Not to forget the numerous posters with lots of more interesting stuff.
For the first time, EuroSciPy was complemented by domain-specific satellite meetings. I attended PyPhy, the Python in Physics meeting. Physicists are traditionally rather slow in accepting new technology, but the meeting showed that a lot of high-quality research is based on Python tools today, and that Python has also found its way into physics education at various universities.
Finally, conferences are good also because of what you learn during discussions with other participants. During EuroSciPy, I discovered a new scientific journal called Open Research Computation, which is all about software for scientific research. Scientific software developers regularly complain about the lack of visibility and recognition that their work receives from the scientific community and in particular from evaluation and grant attribution committees. A dedicated journal might just be what we need to improve the situation. I hope this will be a success.
Executable Papers
The term "executable papers" stands for the expected next revolution in scientific publishing. The move from printed journals to electronic on-line journals (or a combination of both) has changed little for authors and readers. It is the libraries that have seen the largest impact because they now do little more than paying subscription fees. Readers obtain papers as PDF files directly from the publishers' Web sites. The one change that does matter to scientists is that most journals now propose the distribute "supplementary material" in addition to the main paper. This can in principle be any kind of file, in practice it is mostly used for additional explanations, images, and tables, i.e. to keep the main paper shorter. Occasionally there are also videos, a first step towards exploring the new possibilities opened up by electronic distribution. The step to executable papers is a much bigger one: the goal is to integrate computer-readable data and executable program code together with the text part of a paper. The goals are a richer reader experience (e.g. interactive visualizations), verifiability of results by both referees and readers (by re-running part of the computations described in the paper), and re-use of data and code in later work by the same or other authors. There is some overlap in these goals with the "reproducible research" movement, whose goal is to make computational research reproducible by providing tools and methods that permit to store a trace of everything that entered into some computational procedure (input data, program code, description of the computing environment) such that someone else (or even the original author a month later) can re-run everything and obtain the same results. The new aspect in executable papers is the packaging and distribution of everything, as well as the handling of bibliographic references.
The proposals' variety mostly reflected the different backgrounds of the presenters. A mathematician documenting proofs obviously has different needs than an astrophysicist simulating a supernova on a supercomputer. Unfortunately this important aspect was never explicitly discussed. Most presenters did not even mention their field of work, much less what it implies in terms of data handling. This was probably due to the enormous time pressure; 15 to 20 minutes for a presentation plus demonstration of a complex tool was clearly not enough.
The proposals could roughly be grouped into three categories:
- Web-based tools that permit the author to compose his executable paper by supplying data, code, and text, and permit the reviewer and reader to consult this material and re-run computations.
- Systems for preserving the author's computational environment in order to permit reviewers and readers to use the author's software with little effort and without any security risks.
- Semantic markup systems that make parts of the written text interpretable by a computer for various kinds of processing.
Some proposals covered two of these categories but with a clear emphasis on one of them. For the details of each proposal, see the ICCS proceedings which are freely available.
While it was interesting to see all the different ideas presented, my main impression of the Executable Paper Workshop is that of a missed opportunity. Having all those people who had thought long and hard about the various issues in one room for two days would have been a unique occasion to make progress towards better tools for the future. In fact, none of the solutions presented cover the needs of all the domains of computational science. They make assumptions about the nature of the data and the code that are not universally valid. One or two hours of discussion might have helped a lot to improve everyone's tools.
The implementation of my own proposal, which addresses the questions of how to store code and data in a flexible, efficient, and future-proof way, is available here. It contains a multi-platform binary (MacOS, Linux, Windows, all on the x86 platform) and requires version 6 of the Java Runtime Environment. The source code is also included, but there is no build system at the moment (I use a collection of scripts that have my home-directory hard-coded in lots of places). There is, however, a tutorial. Feedback is welcome!
Text input on mobile devices
I have been using mobile (pocket size) computers for about 15 years, starting with the Palm Pilot. Currently I use an Android smartphone (Samsung Galaxy S). While mobile devices are mostly used for consulting rather than for entering information, text entry has always been a hot topic of debate.
Apple's Newton Messagepad, probably the first mobile computing device in the modern sense, pursued the ambitious goal of handwriting recognition. It was both an impressive technical achievement and a practical failure. I don't think anyone ever managed to use the Newton's handwriting recognition satisfactorily in daily life.
The Palm Pilot had a more modest but also more achievable goal: its Graffiti technology was based on single letter recognition with simplified letter shapes. It took a while to become fluent with Graffiti, but many people managed and I don't remember anyone complaining about the learning curve.
I don't remember when I first saw a miniature QWERTY keyboard on the screen of a mobile device, but it may well have been on one of the first iPhones. I was definitely not enthusiastic about it. The keys are much too small for touch-typing, and the layout was already a bad choice for desktop computer keyboards. The only argument in its favor is familiarity, but is that a good enough reason to cripple oneself for a long time to come?
When I got my Android phone, I was rapidly confronted with this issue in practice. Samsung left me the choice between the standard keyboard and Swype. Both had the same problem: too small keys for my fingers. I turned to the Android market and found many more QWERTY keyboards. And... Graffiti, my old friend from my Palm days. What a relief!
Of course, my phone is not a Palm. The biggest difference is that the Palm had a stylus whereas today's smartphones are meant to be manipulated with the fingers. But Graffiti works surprisingly well without a stylus. I find that I can write about equally well with the index or the thumb. Graffiti definitely is a good choice for Android, especially for Palm veterans.
Recently I discovered another alternative input method, and I like it enough that I might end up preferring it over Graffiti. It's called MessagEase and it consists of a 3x3 grid of comfortably large keys that display the 9 most frequent characters. The remaining characters, plus punctuation etc., are available by drawing lines outward from the center of a key. The technique doesn't require much time to master, but writing fluently requires a lot of practice because the layout needs to be memorized.
I started using MessagEase about two weeks ago and have reached about the same speed I get with Graffiti. I wrote this whole article with MessagEase as a real-life exercise. Time will tell if I actually get faster than with Graffiti, but MessagEase definitely is a serious candidate for mobile texting in the post-QWERTY era. If you have an Android phone or an iPhone, give it a try.
Bye bye iCal, welcome org-mode
Being unhappy with a tool for an important task implies looking for better options, but I didn't find anything that I liked. Until one day I discovered, mostly by accident, the org-mode package that has been distributed with Emacs for a while. org-mode is one of those pieces of software that is so powerful that it is difficult to describe to someone who has never used it. Basically, org-mode uses plain text files with a special lightweight markup syntax for things like todo items or time stamps (but there is a lot more), and then provides sophisticated and very configurable functions for working with this data. It can be used for keeping agendas, todo lists, journals, simple databases such as bookmark lists, spreadsheets, and much more. Most importantly, all of these can coexist in a single text file if you want, and the contents of this file can be structured in any way you like. You can even add pieces of executable code and thus use org-mode for literate programming, but that's a topic for another post.
To be more concrete, my personal information database in org-mode consists of several files at the top level: work.org for organizing my workday, home.org for tasks and appointments related to private life, research.org for notes about research projects, programming.org for notes (mostly bookmarks) about software development, etc. Inside my work.org, there is a section on research projects, one on teaching, one on my editorial work for CiSE, one for refereeing, etc. Inside each of these sections, there are agenda entries (seminars, meetings, courses etc.) and todo entries with three priority levels and optional deadlines. Any of them can be accompanied by notes of any kind, including links, references to files on my disk, and even executable shell commands. There is no limit to what you can store there.
In October 2010 I started the transition from iCal to org-mode. Initially I entered all data twice, to make sure I could continue to rely on iCal. After a week I was confident enough to enter everything just once, using org-mode. I then transferred all agenda items for 2011 to org-mode and decided to stop using iCal on January 1, 2011. That day has arrived, and the iCal icon has disappeared from my dock. Without any regrets.
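To give a flavour of the markup (the fragment below is made up for illustration, it is not copied from my files), a small piece of such a file could look like this:

* Teaching
** TODO [#A] Prepare the parallel computing lecture
   DEADLINE: <2011-01-14 Fri>
   Reuse last year's slides, add a multiprocessing example.
   [[file:~/teaching/parallel/slides.pdf][Last year's slides]]
** Seminar announcement: reproducible research
   <2011-01-20 Thu 14:00>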
Conclusion: If you need a powerful PIM system and you don't fear Emacs, have a look at org-mode.