Recent posts

Teaching parallel computing in Python

Every time I teach a class on parallel computing with Python using the multiprocessing module, I wonder if multiprocessing is really mature enough that I should recommend using it. I end up deciding in favor of it, mostly because of the lack of better alternatives. But I am not at all happy with some features of multiprocessing, which are particularly nasty for non-experts in Python. That category typically includes everyone in my classes.

To illustrate the problem, I'll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:


from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))

Do you see the two bugs in this example? Look again. No, it's nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn't.

Imagine your typical student typing this script and running it. Here's what happens:



Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new

Python experts will immediately see what's wrong: numpy.sqrt is not picklable. This is mostly a historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, and so it was simply never made possible. Implementing it today within the framework of Python's existing pickle protocol is unfortunately not trivial, and that's why it hasn't been done.

Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn't hurt, of course, if they learn a bit about pickling, since it also has a performance impact on parallel programs. But because of restrictions such as this one, you have to explain pickling right at the start, although it would be better to leave it for the "advanced topics" part.

OK, you have got the message across, and your students fix the script:



from multiprocessing import Pool
import numpy

pool = Pool()

def square_root(x):
    return numpy.sqrt(x)

print pool.map(square_root, range(100))

And then run it:



Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
AttributeError: 'module' object has no attribute 'square_root'
task = get()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
AttributeError: 'module' object has no attribute 'square_root'

At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its process pools. And since the answer (on Unix systems) is "fork", you need a pretty good idea of Unix process creation to see the cause of the error. That understanding then leads to a trivial fix:



from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

pool = Pool()

print pool.map(square_root, range(100))

Success! It works! But... how do you explain this to your students?
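One way to make the explanation concrete is to show what pickle actually stores for a function: not its code, but merely a module name and an attribute name that the receiving process looks up again. Here is a small demonstration, using only the standard library:

import pickle
import pickletools

def square_root(x):
    return x ** 0.5

# Functions are pickled by reference: the pickle contains only the
# module name and the function name, not the function's code.
pickletools.dis(pickle.dumps(square_root))

The worker therefore receives a reference to __main__.square_root and has to look that name up in its own copy of the main module. In the failing script, the workers were forked when Pool() was called, before square_root was defined, which is why the lookup ends in the AttributeError shown above.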

To make it worse, this script works but is still not correct: it has a portability bug, because it doesn't work under Windows. Windows has no fork, so multiprocessing starts its workers by re-importing the main script, and everything that should run only in the parent process must be protected by an if __name__ == '__main__' guard. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions of multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.
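For the record, here is a sketch of the portable version, with the pool creation protected by the main-module guard that the multiprocessing documentation recommends:

from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

if __name__ == '__main__':
    # On Windows, worker processes import this module instead of being
    # forked, so everything that must run only in the parent process
    # has to be placed under this guard.
    pool = Pool()
    print pool.map(square_root, range(100))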

These issues with multiprocessing are a classic case of a leaky abstraction: multiprocessing provides a "pool of worker processes" abstraction to the programmer, but in order to use it, the programmer has to understand the implementation. In my opinion, it would be preferable to have a less shiny API, but one which reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until this really happens, I'd prefer an API that does not suggest possibilities that don't exist.

I actually thought about this a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions implementing tasks executed in parallel to be methods of a single class, or functions of a single module. My API also contains an explicit "run parallel job now" call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.

A rant about mail clients

A while ago I described why I migrated my agendas from iCal to org-mode. To sum it up, my main motivation was to gain more freedom in managing my information: where iCal imposes a rigid format for events and insists on storing them in its own database, inaccessible to other programs, org-mode lets me mix agenda information with whatever else I like in plain text files. Today's story is a similar one, but without the happy ending. I am as fed up with mail clients as I was with iCal, and for much the same reasons, but I haven't yet found anything I could migrate to.

From an information processing point of view, an e-mail message is not very different from lots of other pieces of data. It's a sequence of bytes respecting a specific format (defined by a handful of standards) that allows its unambiguous interpretation by various programs in the processing chain. An e-mail message can perfectly well be stored in a file, and in fact most e-mail clients permit saving a message to a file. Unfortunately, the number of e-mail clients able to open and correctly display such a file is already much smaller. But when it comes to collections of messages, information processing freedom ends completely.

Pretty much every mail client's point of view is that all of a user's mail is stored in some database, and that it (the client) is free to handle this database in whatever way it likes. The user's only access to the messages is the mail client. The one and only. The only exception is server-based mail databases handled via the IMAP protocol, where multiple clients can work with a common database. If you don't use IMAP, you have no control over how and where your mail is stored, who has access to it, etc.

What I'd like to do is manage mail just like I manage other files. A mailbox should just be a directory containing messages, one per file. Mailboxes could be stored anywhere in the file system. Mailboxes could be shared through the file system, and backed up via the file system. They could be grouped with whatever other information in whatever way that suits me. I would double-click on a message to view it, or double-click on a mailbox directory to view a summary, sorted in the way I like it. Or I would use command-line tools to work on a message or a mailbox. I'd pick the best tool for each job, just like I do when working with any other kind of file.

Why all that isn't possible remains a mystery to me. The technology has been around for decades. The good old Maildir format would be just fine for storing mailboxes anywhere in the file system, as would the even more venerable mbox format. But even mail clients that use mbox or Maildir internally insist that all such mailboxes must reside in a single master directory. Moreover, they won't let me open a mailbox from outside, I have to run the mail client and work through its hierarchical presentation of mailboxes to get to my destination.

Before I get inundated by comments pointing out that mail client X has feature Y from the list above: Yes, I know, there are small exceptions here and there. But unless I have the complete freedom to put my mail where I want it, the isolated feature won't do me much good. If someone knows of a mail client that has all the features I am asking for, plus the features we all expect from a modern mail client, then please do leave a comment!

EuroSciPy 2011

Another EuroSciPy conference is over, and like last year it was very interesting. Here is my personal list of highlights and comments.

The two keynote talks were particularly inspiring. On Saturday, Marian Petre reported on her studies of how people in general and scientists in particular develop software. The first part of her presentation was about how "experts" design and implement software, the definition of an expert being someone who produces software that actually works, is finished on time, and doesn't exceed the planned budget. The second part was about the particularities of software development in science. But perhaps the most memorable quote of the keynote was Marian's reply to a question from the audience about how to deal with unreasonable decisions coming from technically less competent managers. She recommended learning how to manage management - a phrase that I heard repeated several times in discussions throughout the conference.

The Sunday keynote was given by Fernando Perez. As was to be expected, IPython was his number one topic and there was a lot of new stuff to show off. I won't mention all the new features in the recently released version 0.11 because they are already discussed in detail elsewhere. What I find even more exciting is the new Web notebook interface, currently available only directly from the development site at github. A notebook is a trace of an interactive session that can be edited, saved, stored in a repository, or shared with others. It contains the inputs and outputs of all commands. Inputs are cells that can consist of more than one line. Outputs are by default what Python prints to the terminal, but IPython provides a mechanism for displaying specific types of objects in a special way. This makes it possible to show images (in particular plots) inline, but also to turn SymPy expressions into mathematical formulas typeset in LaTeX.
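As an illustration of that display mechanism (a minimal sketch, assuming the _repr_latex_ hook of IPython's rich display protocol), an object can declare how it wants to be rendered in the notebook:

class Fraction(object):
    # A toy object that the notebook can render as typeset mathematics.
    def __init__(self, num, den):
        self.num = num
        self.den = den
    def _repr_latex_(self):
        # IPython looks for hooks such as _repr_latex_, _repr_html_ or
        # _repr_png_ and uses the richest representation available.
        return r"$\frac{%d}{%d}$" % (self.num, self.den)

Fraction(3, 4)   # shown as a typeset fraction in a notebook cell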

A more alarming aspect of Fernando's keynote was his statistical analysis of contributions to the major scientific libraries of the Python universe. In summary, the central packages are maintained by a grand total of about 25 people in their spare time. This observation caused a lot of debate, centered around how to encourage more people to contribute to this fundamental work.

Among the other presentations, as usual mostly of high quality, the ones that impressed me most were Andrew Straw's presentation of ROS, the Robot Operating System, Chris Myers' presentation about SloppyCell, and Yann Le Du's talk about large-scale machine learning running on a home-made GPU cluster. Not to forget the numerous posters with lots more interesting stuff.

For the first time, EuroSciPy was complemented by domain-specific satellite meetings. I attended PyPhy, the Python in Physics meeting. Physicists are traditionally rather slow in accepting new technology, but the meeting showed that a lot of high-quality research is based on Python tools today, and that Python has also found its way into physics education at various universities.

Finally, conferences are good also because of what you learn during discussions with other participants. During EuroSciPy, I discovered a new scientific journal called Open Research Computation, which is all about software for scientific research. Scientific software developers regularly complain about the lack of visibility and recognition that their work receives from the scientific community and in particular from evaluation and grant attribution committees. A dedicated journal might just be what we need to improve the situation. I hope it will be a success.

Executable Papers

The last two days I participated in the "Executable Papers workshop" at this year's ICCS conference. It was not just another workshop among the many ICCS workshops. The participants had all submitted a proposal to the "Executable Paper Grand Challenge" run by Elsevier, one of the biggest scientific publishers. On the first day, the nine finalists presented their work, and on the second day, the remaining accepted proposals were presented.

The term "executable papers" stands for the expected next revolution in scientific publishing. The move from printed journals to electronic on-line journals (or a combination of both) has changed little for authors and readers. It is the libraries that have seen the largest impact because they now do little more than paying subscription fees. Readers obtain papers as PDF files directly from the publishers' Web sites. The one change that does matter to scientists is that most journals now propose the distribute "supplementary material" in addition to the main paper. This can in principle be any kind of file, in practice it is mostly used for additional explanations, images, and tables, i.e. to keep the main paper shorter. Occasionally there are also videos, a first step towards exploring the new possibilities opened up by electronic distribution. The step to executable papers is a much bigger one: the goal is to integrate computer-readable data and executable program code together with the text part of a paper. The goals are a richer reader experience (e.g. interactive visualizations), verifiability of results by both referees and readers (by re-running part of the computations described in the paper), and re-use of data and code in later work by the same or other authors. There is some overlap in these goals with the "reproducible research" movement, whose goal is to make computational research reproducible by providing tools and methods that permit to store a trace of everything that entered into some computational procedure (input data, program code, description of the computing environment) such that someone else (or even the original author a month later) can re-run everything and obtain the same results. The new aspect in executable papers is the packaging and distribution of everything, as well as the handling of bibliographic references.

The variety of the proposals mostly reflected the different backgrounds of the presenters. A mathematician documenting proofs obviously has different needs than an astrophysicist simulating a supernova on a supercomputer. Unfortunately this important aspect was never explicitly discussed. Most presenters did not even mention their field of work, much less what it implies in terms of data handling. This was probably due to the enormous time pressure; 15 to 20 minutes for a presentation plus demonstration of a complex tool was clearly not enough.

The proposals could roughly be grouped into three categories:

Some proposals covered two of these categories, but with a clear emphasis on one of them. For the details of each proposal, see the ICCS proceedings, which are freely available.

While it was interesting to see all the different ideas presented, my main impression of the Executable Paper Workshop is that of a missed opportunity. Having all those people who had thought long and hard about the various issues in one room for two days would have been a unique occasion to make progress towards better tools for the future. In fact, none of the solutions presented covers the needs of all the domains of computational science. They make assumptions about the nature of the data and the code that are not universally valid. One or two hours of discussion might have helped a lot to improve everyone's tools.

The implementation of my own proposal, which addresses the questions of how to store code and data in a flexible, efficient, and future-proof way, is available here. It contains a multi-platform binary (MacOS, Linux, Windows, all on the x86 platform) and requires version 6 of the Java Runtime Environment. The source code is also included, but there is no build system at the moment (I use a collection of scripts that have my home-directory hard-coded in lots of places). There is, however, a tutorial. Feedback is welcome!

Text input on mobile devices

I have been using mobile (pocket-size) computers for about 15 years, starting with the Palm Pilot. Currently I use an Android smartphone (Samsung Galaxy S). While mobile devices are mostly used for consulting information rather than entering it, text entry has always been a hot topic of debate.


Apple's Newton MessagePad, probably the first mobile computing device in the modern sense, pursued the ambitious goal of handwriting recognition. It was both an impressive technical achievement and a practical failure. I don't think anyone ever managed to use the Newton's handwriting recognition satisfactorily in daily life.


The Palm Pilot had a more modest but also more achievable goal: its Graffiti technology was based on single-letter recognition with simplified letter shapes. It took a while to become fluent with Graffiti, but many people managed, and I don't remember anyone complaining about the learning curve.


I don't remember when I first saw a miniature QWERTY keyboard on the screen of a mobile device, but it may well have been on one of the first iPhones. I was definitely not enthusiastic about it. The keys are much too small for touch-typing, and the layout was already a bad choice for desktop computer keyboards. The only argument in its favor is familiarity, but is that a good enough reason to cripple oneself for a long time to come?


When I got my Android phone, I was rapidly confronted with this issue in practice. Samsung left me the choice between the standard keyboard and Swype. Both had the same problem: keys too small for my fingers. I turned to the Android market and found many more QWERTY keyboards. And... Graffiti, my old friend from my Palm days. What a relief!


Of course, my phone is not a Palm. The biggest difference is that the Palm had a stylus, whereas today's smartphones are meant to be manipulated with the fingers. But Graffiti works surprisingly well without a stylus. I find that I can write about equally well with the index finger or the thumb. Graffiti definitely is a good choice for Android, especially for Palm veterans.


Recently I discovered another alternative input method, and I like it enough that I might end up preferring it over Graffiti. It's called MessagEase and it consists of a 3x3 grid of comfortably large keys that display the 9 most frequent characters. The remaining characters, plus punctuation etc., are available by drawing lines outward from the center of a key. The technique doesn't require much time to master, but writing fluently requires a lot of practice because the layout needs to be memorized.


I started using MessagEase about two weeks ago and have reached about the same speed I get with Graffiti. I wrote this whole article with MessagEase as a real-life exercise. Time will tell if I actually get faster than with Graffiti, but MessagEase definitely is a serious candidate for mobile texting in the post-QWERTY era. If you have an Android phone or an iPhone, give it a try.

Bye bye iCal, welcome org-mode

I have been using Macintosh computers since 2003, and overall I have been happy with the personal information management (PIM) tools provided by Apple: AddressBook, Mail, Safari (for bookmark management). The one tool I have never liked is iCal. Its user interface is fine for consulting my agenda, but entering information is too complicated and the todo-list management is particularly clumsy. But more importantly, I regularly found myself wanting to add information for which no entry field was provided. I ended up putting it into the "notes" section, or leaving it out. Another unpleasant feature of iCal is that all the information is stored in a complex proprietary database, making synchronization between several computers impossible except through cloud-based server solutions such as Apple's MobileMe (quite expensive) or fruux (much nicer in my opinion, but it still requires trusting your data to a cloud service).

Being unhappy with a tool for an important task implies looking for better options, but I didn't find anything that I liked. Until one day I discovered, mostly by accident, the org-mode package that has been distributed with Emacs for a while. org-mode is one of those pieces of software that is so powerful that it is difficult to describe to someone who has never used it. Basically, org-mode uses plain text files with a special lightweight markup syntax for things like todo items or time stamps (but there is a lot more), and then provides sophisticated and very configurable functions for working with this data. It can be used for keeping agendas, todo lists, journals, simple databases such as bookmark lists, spreadsheets, and much more. Most importantly, all of these can coexist in a single text file if you want, and the contents of this file can be structured in any way you like. You can even add pieces of executable code and thus use org-mode for literate programming, but that's a topic for another post.

To be more concrete, my personal information database in org-mode consists of several files at the top level: work.org for organizing my workday, home.org for tasks and appointments related to private life, research.org for notes about research projects, programming.org for notes (mostly bookmarks) about software development, etc. Inside my work.org, there is a section on research projects, one on teaching, one on my editorial work for CiSE, one for refereeing, etc. Inside each of these sections, there are agenda entries (seminars, meetings, courses etc.) and todo entries with three priority levels and optional deadlines. Any of them can be accompanied by notes of any kind, including links, references to files on my disk, and even executable shell commands. There is no limit to what you store there.

In October 2010 I started the transition from iCal to org-mode. Initially I entered all data twice, to make sure I could continue to rely on iCal. After a week I was confident enough to enter everything just once, using org-mode. I then transferred all agenda items for 2011 to org-mode and decided to stop using iCal on January 1, 2011. That day has arrived, and the iCal icon has disappeared from my dock. Without any regrets.

Conclusion: If you need a powerful PIM system and you don't fear Emacs, have a look at org-mode.

The future of Python

I have received a number of questions and remarks about my keynote talk at EuroSciPy 2010, ranging from questions about technical details to an inquiry about the release date of Python 4.0! Rather than writing lengthy replies to everyone, I try to address all these issues here.

First of all, my intentions behind the keynote were


  1. Encourage scientists to look at new tools and developments that I believe to be important in the near future (Python 3, Cython) and at others that might become important to scientific applications (JIT compilers, alternative implementations).


  2. Make computational scientists think about future commodity hardware (which is what we use most of the time) and its implications for programming, in particular the generalization of massively parallel computing.


  3. Show that easy-to-use parallel programming paradigms, in particular deterministic ones, exist today. Computational scientists need to realize that MPI and OpenMP are not the last word on parallel programming.


  4. Make my ideas concrete by showing how they could be implemented in Python.




My "Python 4.0" is completely fictitious and will probably never exist in exactly that form. However, it is important to realize that it could be implemented right now. With the GIL-free Python implementations (Jython, IronPython), it would even be rather straightforward to implement. For CPython, any implementation not removing the GIL would probably be too inefficient to be of practical interest.

Most of the ingredients for implementing my "Python 4.0" are well known and have already been used in other languages or libraries.


Futures may seem to provide most of what declarative concurrency promises, but this is not quite true. Futures are objects representing computations. They have a method that client code must call to wait for the result and retrieve it. Since waiting is an explicit operation on a standard object, it is easy to create a situation in which two futures wait for each other: a deadlock. This can only be avoided by not having futures accessible as standard objects. The language implementation must recognize futures as special and insert a wait call before any access to the value of the result. For this reason, declarative concurrency cannot be implemented as a library.
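To make the deadlock scenario concrete, here is a minimal sketch using the concurrent.futures module (PEP 3148); the module merely serves to illustrate futures as ordinary objects with an explicit wait operation:

import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
futures = {}

def task(other):
    time.sleep(0.1)                      # let the main thread submit both tasks
    return futures[other].result() + 1   # explicit wait on the other task's future

futures['a'] = executor.submit(task, 'b')
futures['b'] = executor.submit(task, 'a')

# Each task now blocks in result() waiting for the other one: a deadlock.
# A subsequent call such as futures['a'].result() would never return.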

Another important condition for implementing declarative concurrency with futures is that code inside a future must be effect-free. Otherwise multiple concurrently running futures can modify the same object and create a race condition.

Probably the only truly original contribution in my "Python 4.0" scenario is the dynamically verified effect-free subset of the Python language. Most languages, even functional ones, provide no way for a compiler or a run-time system to verify that a given function is effect-free. Haskell is perhaps the only exception in having a static type system that can identify effect-free code. In Python, that is not a viable approach because everything is dynamic. But why not provide at least a run-time check for effect-free code where useful? It's still better to have a program crash with an exception saying "you did I/O in what should have been an effect-free function" than get wrong results silently.

Here is an outline of how such an approach could be implemented. Each function and method would have a flag saying "I am supposed to be effect-free." In my examples, this flag is set by the decorator @noeffects, but other ways are possible. Built-in functions would of course have that flag set correctly as well. As soon as the interpreter enters a function marked as effect-free, it goes into "functional mode" until it returns from that function again. In functional mode, it raises an exception whenever an unflagged function or method is called.
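Here is a very rough sketch of the idea in pure Python. It is only an approximation of what I have in mind: a real implementation would live inside the interpreter, which can intercept every call, whereas this toy version only checks calls that are explicitly routed through it. The names (noeffects, checked_call) are placeholders, not an existing API:

import threading

_state = threading.local()   # per-thread flag for "functional mode"

def noeffects(func):
    # Mark func as supposedly effect-free and run it in functional mode.
    def wrapper(*args, **kwargs):
        previous = getattr(_state, 'functional', False)
        _state.functional = True
        try:
            return func(*args, **kwargs)
        finally:
            _state.functional = previous
    wrapper._noeffects = True
    return wrapper

def checked_call(func, *args, **kwargs):
    # Refuse to call unflagged functions while in functional mode.
    if getattr(_state, 'functional', False) and not getattr(func, '_noeffects', False):
        raise RuntimeError("%s is not flagged as effect-free" % func.__name__)
    return func(*args, **kwargs)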

Some details to consider:


Finally, a comment on a minor issue. I have been asked if the "async" keyword is strictly necessary. The answer is no, but it makes the code much more readable. The main role of async is to write a function call without having it executed immediately. The same problem occurs in callbacks in GUI programming: you have to specify a function call to be executed at a later time. The usual solution is a parameter-free lambda expression, and that same trick could be used to make async a function rather than a keyword. But readability suffers a lot.
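To illustrate the readability argument, here is a small self-contained sketch; async_call is a hypothetical stand-in that simply evaluates the delayed call, whereas a real implementation would schedule it for parallel execution and return a future:

def async_call(thunk):
    # Hypothetical placeholder: evaluate the packaged call immediately.
    return thunk()

def f(x, y):
    return x + y

# With a keyword, one would simply write:  result = async f(1, 2)
# Without it, the call must be wrapped in a parameter-free lambda so
# that it is not executed on the spot:
result = async_call(lambda: f(1, 2))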

EuroSciPy 2010

This weekend I attended the EuroSciPy 2010 conference in Paris, dedicated to scientific applications of the programming language Python. This was the third EuroSciPy conference, but the US-based SciPy conference has been a regular event for many years already, and recently SciPy India joined the crowd. It looks like Python is becoming ever more popular in scientific computing. Next year, EuroSciPy will take place in Paris again.

There were lots of interesting presentations and announcements, and the breaks provided a much appreciated opportunity for exchanges between the participants. I won't try to provide an exhaustive summary, but rather list my personal highlights. Obviously this choice reflects my personal interests more than the quality of the presentations, and I will even list things that were not presented but that I learned about from other participants during the breaks.

Teaching

The opening keynote was given by Hans-Petter Langtangen, who is best known for his books about Python for scientific computing. His latest book is a textbook for a course on scientific programming for beginning science students, and the first part of his keynote was about this same course that he is teaching at the University of Oslo. As others have noted as well, he observed that the students have no problem at all with picking up Python and using it productively in science. The difficulties with using Python are elsewhere: it is hard to convince the university professors that Python is a good choice of programming language for such a course!

Another important aspect of his presentation was the observation that teaching scientific programming to beginning science students provides more than just training in some useful technique. Converting equations into programs and running them also provides a much better insight into the structure and applicability of the equations. Computational science thus helps to better educate future scientists.

Reproducible research

The reproducible research movement has the goal of improving the standards in computational science. At the moment, it is almost always impossible to reproduce published computational results from the information provided by the authors. Making these results reproducible requires a careful recording of what was calculated using which version of which software running on which machine, and of course making this information available along with the publication.

At EuroSciPy, Andrew Davison presented Sumatra, a Python library for tracking this information (and more) for computational procedures written in Python. The library is in an early stage, with more functionality to come, but those interested in reproducible research should check it out now and contribute to its development.

Jarrod Millman addressed the same topic in his presentation of the plans for creating a Foundation for Mathematical and Scientific Computing, whose goal is to fund development of tools and techniques that improve computational science.

NumPy and Python 3

As a couple of active contributors to the NumPy project were attending the conference, I asked about the state of the porting effort to Python 3. The good news is that the port is done and will soon be released. Those who have been waiting for NumPy to be ported before starting to port their own libraries can go to work right now: check out the NumPy Subversion repository, install, and use!

Useful maths libraries

Three new maths libraries that were presented caught my attention: Sebastian Walter's talk about algorithmic differentiation contained demos of algopy, a rather complete library for algorithmic differentiation in Python. During the Lightning talks on the last day, two apparently similar libraries for working with uncertain numbers (numbers with error bars) were shown: uncertainties, by Eric Lebigot, and upy, by Friedrich Romstedt. Both do error propagation and take correlations into account. Those of us working with experimental data or simulation results will appreciate this.

There was a lot more interesting stuff, of course, and I hope others will write more about it. I'll just point out that the slides for my own keynote about the future of Python in science are available from my Web site. And of course express my thanks to the organizing committee who invested a lot of effort to make this conference a big success!

Science and free will

The question whether living beings, in particular those of our own species, possess "free will", and how it works if it exists, has recently become fashionable again. The new idea that brought the topic back into discussion was that our sense of free will might just be an illusion. According to this idea, we would be machines whose fate is entirely determined by the laws of physics (which might themselves be deterministic or not), even though we perceive ourselves as actors who pursue goals and take decisions that are not even in principle predictable by a physical analysis of our bodies, no matter at what level of detail.

The topic itself is an old one, perhaps as old as humanity. I won't go into its philosophical and religious aspects, but limit myself to the scientist's point of view: is free will compatible with scientific descriptions of our world? Perhaps even necessary for such descriptions? Or, on the contrary, in contradiction to the scientific approach? Can the scientific method be used to understand free will or show that it's a useless concept from the past?

What prompted me to write this post is a recent article by Anthony Cashmore in PNAS. In summary, Cashmore says that the majority of scientists do not believe in the existence of free will any more, and that society should draw conclusions from this, in particular concerning the judicial system, whose concept of responsibility for one's acts is based on a view of free will that the author no longer considers defensible. But don't take my word for it: read the article yourself. It's well written and covers many interesting points.

First of all, let me say that I don't agree at all with Cashmore's view that the judicial system should be reformed based on the prevailing view of today's scientists. I do believe that a modern society should take into account scientific findings, i.e. scientific hypotheses that have withstood a number of attempts at falsification. But the mere beliefs of a small subpopulation, even if they are scientists, are not sufficient to justify a radical change of anything. As I will explain below, the question "do human beings possess free will" does not even deserve the label "scientific hypothesis" at this moment, because we have no idea of how we could answer it based on observation and experiment. Nor can we claim to fully understand human behavior in terms of the laws of physics, which would allow us to call free will an unnecessary concept and invoke Occam's razor to get rid of it. Therefore, at this time, the existence of free will remains a matter of belief, and scientists' beliefs are worth no more than anyone else's.

There is also a peculiar circularity to any argument about what "should" be done as a consequence of the non-existence of free will: if that hypothesis is true, nobody can decide anything! If humans have no free will, then societies don't have it either, and our judicial system is just as much a consequence of the laws of nature as my perceived decision to take coffee rather than tea for breakfast this morning.

Back to the main topic of this post: the relation between science and free will. It starts with the observation of a clear conflict. Science is about identifying regularities in the world that surrounds us, which permit the construction of detailed and testable theories. The first scientific theories were all about deterministic phenomena: given the initial state of some well-defined physical system (think of a clockwork, for example), the state of the system at any time in the future can be predicted with certainty. Later, stochastic phenomena entered the scientific world view. With stochasticity, the detailed behavior of a system is no longer predictable, but certain average properties still are. For example, we can predict how the temperature and pressure of water will change when we heat it, even though we cannot predict how each individual molecule will move. It is still a subject of debate whether stochastic elements exist in the fundamental laws of nature (quantum physics being the most popular candidate), or if they are merely a way of describing complex systems whose state we cannot analyze in detail due to insufficient resources. But scientists agree that a scientific theory may contain two forms of causality: determinism and stochasticity.

Free will, if it exists, would have to be added as a third form of causality. But it is hard to see how this could be done. The scientific method is based on identifying conditions from which exact predictions can be made. The decisions of an agent that possesses free will are by definition unpredictable, and therefore any theory about a system containing such an agent would be impossible to verify. Therefore the scientific method as we know it today cannot possibly take into consideration the existence of free will. Obviously this makes it impossible to examine the existence of free will as a scientific hypothesis. It also means that a hard-core scientist, who considers the scientific method as the only way to establish truth, has to deny the existence of free will, or else accept that some important aspects of our universe are forever inaccessible to scientific investigation.

However, there is another aspect to the relation between science and free will, which I haven't seen discussed yet anywhere: the existence of free will is in fact a requirement for the scientific method! Not as part of a system under scientific scrutiny, but as part of the scientist who runs an investigation. Testing a scientific hypothesis requires at the very least observing a specific phenomenon, but in most cases also preparing a well-defined initial state for some system that will then become the subject of observation. A scientist decides to create an experimental setup to verify some hypothesis. If the scientist were just a complex machine whose behavior is governed by the very same laws that he believes to be studying, then his carefully thought-out experiment is nothing but a particularly probable outcome of the laws of nature. We could still draw conclusions from observing it, of course, but these observations then only provide anecdotal evidence that is no more relevant than what we get from passively watching things happen around us.

In summary, our current scientific method supposes the existence of free will as an attribute of scientists, but also its absence from any system subjected to scientific scrutiny. This poses limits to what scientific investigation can yield when applied to humans.

Eclipse experiences

A few months ago I decided to take a closer look at Eclipse, since several people I know seemed to be quite fond of it. I had tried it earlier on my old iBook G4, but quickly abandoned it because it was much too slow. But my new MacBook Pro should be able to handle it.

Last week I finally decided to retire my Eclipse installation. I didn't remove it yet, since it might be useful for some specific tasks that I rarely have to deal with (such as analyzing someone else's big C++ code). But I don't use it any more for my own work. Here's a summary of my impressions of Eclipse, the good and the bad.

In terms of features, Eclipse is as impressive as it looks. Anything you might wish for in an IDE is there, either in the base distribution or in the form of a plugin - there are hundreds if not thousands of those. And contrary to what one might expect, all those features are relatively easy to get used to. The user interface is very systematic and the most frequent functions are easy to spot. In terms of user interface design, I would call Eclipse a success.

However, in terms of usability it turned out to be a disappointment. Basically there are two major issues: Eclipse is a resource hog, and it isn't as stable as I expect an IDE to be.

The two resources that Eclipse can't get enough of are CPU time and disk space. Even on a brand-new machine (and not a low-end one at that), starting Eclipse takes a good ten seconds, and I get to see the Macintosh's spinning colour wheel quite often. What's worse is that the spinning wheel prevents me from typing, at unpredictable moments. This is not acceptable for an IDE. I don't care if it takes a break in background compilation now and then, but I want to be able to type when I want. Execution times for various commands can also vary unpredictably. Rebuilding all my projects typically took about a minute, but once I waited for 15 minutes for no apparent reason.

In terms of disk space, Eclipse is less of a resource hog, but it creates and updates impressive amounts of data, again for no clear reason. I noticed this because I make incremental backups regularly. Just starting and quitting Eclipse, with no action in between, resulted in a few MB of files to backup again. It's not that I can't live with that, but is this really necessary?

Finally, stability. I had only a single crash in which I lost data (the most recently entered code), which is not so bad for a big application (unfortunately...). But I had Eclipse hanging very often, and displaying verbose yet unintelligible error messages almost daily. All this is not reassuring, and together with the spinning-wheel issue this is what made me abandon Eclipse in the end.

Now I am a 100% Emacs user again, with no regrets. Emacs may look old-fashioned and have somewhat fewer high-powered features, but it is reliable and fast.

