A rant about software deployment in 2015
We all know that software deployment in a research environment can be a pain, but knowing this as a fact is not quite the same as experiencing it in reality. Over the last few days, I spent way more time than I would have imagined on what sounds like a simple task: installing a scientific application written in Python on a Linux machine for use by a group of students in a training session. Here is an outline of the difficulties, in the hope that it will (1) help others who face similar problems and (2) contribute a little bit to improving the situation.
The software that I installed is nMOLDYN, an analysis tool for Molecular Dynamics trajectories. From a software engineering point of view, this is a rather standard Python program building on NumPy and MMTK for its computations and on Tkinter and matplotlib for the graphical user interface. There is no need for anything on the bleeding edge; a decent three-year-old installation of the scientific Python stack would support this perfectly well.
The machine that was set up for the training session is configured much like a typical node in a compute cluster: stable and trusted software installed once and never updated. More specifically, the machine runs CentOS 6.7. Another feature rather typical of compute nodes is the very restricted network connectivity: users can log in via `ssh` and copy data in and out using `scp`. Everything else is blocked, in particular all outgoing network traffic. The idea is that students will work on desktop or laptop machines, from where they have full network access to search for information, and connect to the compute server only for running scientific software. For my own software installation I had to limit myself to a user account, i.e. no administrator rights, although I could ask the systems administrator to install additional RPMs from CentOS.
A first exploration of the system's Python installation showed a collection of oldies: Python 2.6.6, NumPy 1.4.1, matplotlib 0.99.1.1. That was the state of the art five years ago. I quickly decided not to use it at all, for two reasons. First, I wasn't sure how much of what I had to add would still work with such old versions. All the software already existed five years ago, but I would have had to track down the versions that were current back then. Second, adding modules in a user account on top of a Python installation at the system level can easily lead to a fragile whole. Following Murphy's law, such problems would show up during the student sessions. So I decided to start with a fresh install of Python 2.7.
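For anyone exploring a similar machine, the initial check takes a few shell one-liners; a minimal sketch (the module names are the standard ones):

```bash
# Check what the system Python stack provides before deciding to reuse it.
python --version
python -c "import numpy; print(numpy.__version__)"
python -c "import matplotlib; print(matplotlib.__version__)"
```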
First surprise: no C compiler. An e-mail to the administrator, and I had gcc. Trying to install Python showed that the Tcl/Tk setup was incomplete: the header files were missing. Another e-mail asking for tcl-devel and tk-devel, and that was settled as well. Python, NumPy, netCDF, ScientificPython, and MMTK were up and running half an hour later. An attempt to install nMOLDYN resulted in the information that I still needed to install Pyro and matplotlib. That can't be so hard, right?
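For reference, a source build of Python under a plain user account is a standard prefix build; a minimal sketch, with an illustrative version number and install path:

```bash
# Build and install Python 2.7 into the home directory; no root needed.
# Version number and prefix are illustrative.
tar xzf Python-2.7.10.tgz
cd Python-2.7.10
./configure --prefix=$HOME/python2.7
make
make install
# Make the new interpreter the default for this account.
export PATH=$HOME/python2.7/bin:$PATH
```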
Pyro was no problem indeed, but matplotlib kept me busy for a few more hours. All I had done in the past was `pip install matplotlib`, but `pip` is useless without outgoing network connections. I had to track down source tarballs for matplotlib and all its dependencies. There's a list of dependencies on the matplotlib Web site, but it's incomplete in two ways: some dependencies are missing (setuptools and six), and others are given by name but without a link. Try googling for "cycler" - you will learn a lot about celestial mechanics before you find a package with this name on PyPI. Of all the matplotlib dependencies, only freetype was already available on my machine, so I had some searching and downloading to do.
The installation instructions for setuptools clearly do not consider the possibility of not having a network connection. They tell me to download a Python script and execute it to download the real software. Fortunately, there is the "advanced" installation option via a tarball. Which ends rather quickly with an error message complaining about the absence of the `zlib` module.
That module is part of the Python standard library, but it is compiled only if zlib (the C library) is installed on the machine. It wasn't on mine. This is not particularly difficult to fix, but rather annoying: I had to install zlib and then run the Python installation once more. Not to forget: I knew `zlib` was in the standard library, and I immediately saw why it was missing on my machine, because I have been installing Pythons in all kinds of environments for twenty years. Someone else might well have spent a few hours figuring out what to do about zlib.
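The check and the fix are easy to reproduce; a sketch assuming the same illustrative prefix as above (version numbers invented):

```bash
# A failing import reveals the problem: the interpreter was compiled
# on a machine without the zlib C library and its headers.
python -c "import zlib" || echo "zlib module missing"

# Install zlib from source into the same prefix (version illustrative)...
tar xzf zlib-1.2.8.tar.gz
cd zlib-1.2.8
./configure --prefix=$HOME/python2.7
make install

# ...then rebuild Python so that it picks up the library.
cd ../Python-2.7.10
./configure --prefix=$HOME/python2.7 \
    CPPFLAGS="-I$HOME/python2.7/include" \
    LDFLAGS="-L$HOME/python2.7/lib"
make install
```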
From then on everything went smoothly, so this is the end of my story. In order to provide something constructive, here is the complete list of matplotlib dependencies with links, and in the order of installation:
Finally, I will pass on a hint that came in this morning via Twitter:
@khinsen @MrTheodor pip install --download . --no-use-wheel <foobar>
Won’t work if there are Linux specific dependencies though.
— Donald Stufft (@dstufft), November 6, 2015
Using `pip install --download . --no-use-wheel matplotlib`, run of course on a machine that has a network connection, you get tarballs for matplotlib and all its dependencies that pip knows about. You still have to add setuptools (which pip doesn't download because it depends on it itself), the C libraries libpng and zlib, and of course Python's standard but not-always-there `zlib` module.
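Put together, the whole offline round trip looks roughly like this; a sketch in which the server's host name is invented:

```bash
# On a machine with network access: collect source tarballs for
# matplotlib and every dependency pip can resolve.
mkdir matplotlib-offline
cd matplotlib-offline
pip install --download . --no-use-wheel matplotlib

# Add setuptools and the C libraries (libpng, zlib) by hand, then copy
# everything over; the server's host name is made up.
cd ..
scp -r matplotlib-offline student-server:

# On the server: install without touching the network.
cd matplotlib-offline
pip install --no-index --find-links=. matplotlib
```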
Looking back at my twenty years using Python, I come to the unfortunate conclusion that software installation is much more of a problem today than it was back in 1995. The main reason is of course that Python software has become more feature-rich and complex - in 1995 something like matplotlib was only a dream. But the state of Python packaging tools is also to blame, with three overlapping and partially compatible tools (distutils, setuptools, and distribute) creating a lot of confusion and various distribution formats (tarballs, eggs, wheels) adding another layer of complexity. What is also sorely missing is a straightforward way to package an application program with all its dependencies in such a way that it can be installed with reasonable effort on all common platforms.
Comments retrieved from Disqus
- ostrokach:
Check out the Anaconda Python distribution (https://anaconda.org). It addresses most of the issues that you list in this blog post.
- Konrad Hinsen:
Anaconda is great if all the packages you need are in it. If you have to add pure Python packages by hand afterwards, that's OK as well, but if you have to add packages with C extension modules, Anaconda is more of a pain than a help in my experience. And that's what I needed in this specific case (Scientific Python + MMTK + nMOLDYN, all with extension modules).
Moreover, I am not sure that Anaconda is usable in an environment without Internet connection, though I haven't tried.
- ostrokach:
You could set up a local Anaconda repository, and copy into it all the packages (`*.tar.bz2` files) that you require. Anaconda comes with `zlib` and all the other binaries which were giving you a hard time, compiled on old CentOS 5 and thus working on most Linux distros. Furthermore, you could create conda packages for MMTK and nMOLDYN on a local machine, and so you wouldn't even need `gcc` on the server.
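A minimal sketch of such a local channel (paths are illustrative; `conda index` ships with conda-build):

```bash
# Turn a directory of downloaded conda packages into a local channel.
mkdir -p local-channel/linux-64
cp *.tar.bz2 local-channel/linux-64/
conda index local-channel/linux-64

# Install from the local channel only, ignoring remote ones.
conda install --override-channels --channel file://$PWD/local-channel matplotlib
```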
Edit:
> If you have to add packages with C extension modules, Anaconda is more of a pain than a help in my experience
If you don't have root access on your machine, you will end up compiling many of the required C libraries in your home directory anyway. You might as well make a conda package out of them so you can use them anywhere. It also helps if you want to install several Python packages that have different dependencies (NumPy, Boost, etc.).
- Konrad Hinsen:
That all sounds nice in theory, but practice is different. I did try to build conda packages for ScientificPython and MMTK but gave up. Linking extension modules to the shared libraries already provided by Anaconda proved to be impossible in a portable way.
- Chris Barker:
It's been another year, and Conda / Anaconda has grown more features and support, particularly with conda-forge.
This would probably be very easy today, though a non-connected environment is a lot harder.
You'd need to download all your conda packages by hand, but that's not too hard to do, if annoying.
And there is the Constructor project:
https://github.com/conda/co...
that might not even be necessary.
So it HAS gotten better!
- Emmanuel V.:
Mmm. You're right about Python packaging tools.
For your setup, an approach could have been:
1) Replicate the "hostile" CentOS environment in a local VM (e.g. VirtualBox on your laptop, same base OS version, but with Internet connectivity)
2) Install all needed software as an unprivileged user
3) Copy this user account to the target CentOS (a sketch follows below)
Emmanuel
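For step 3, a minimal sketch; the host name and directory names are invented, and rsync runs over the ssh access that remains open:

```bash
# Transfer the prepared account from the VM to the real server.
rsync -a ~/python2.7 ~/software user@student-server:
```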
- matthew scholz:
This is a solution that should not be required.
- Chris Barker:
Well, it's not, but really, one of the sources of the problem here is a locked-down environment -- having a non-isolated "copy" of the exact same environment should be a standard part of such a system.
- Konrad Hinsen:
Yes, that's another possible strategy. But setting up a virtual machine is also a lot of work, so it comes down to estimating how close one will be to the break-even point. I really didn't expect to spend much time on this before I started.