Think of all the things you hate about using computers in doing research. Software installation. Getting your colleagues’ scripts to work on your machine. System updates that break your computational code. The multitude of file formats and the eternal need for conversion. That great library that’s unfortunately written in the wrong language for you. Dependency and provenance tracking. Irreproducible computations. They all have something in common: they are consequences of the difficulty of composing digital information. In the following, I will explain the root causes of these problem. That won’t make them go away, but understanding the issues will perhaps help you to deal with them more efficiently, and to avoid them as much as possible in the future.
Composing information is something we all do every day, mostly without thinking of it. A shopping list is the composition of names of things you need to buy. An e-mail message is the composition of the recipients’ addresses, a subject line, and the body of the message. An address book is a composition of addresses, which in turn are compositions of various pieces of information related to some person.
Science has its own information items and associated compositions. Measurements are composed into tables. Mathematical equations are composed into more complex equations. Datasets are composed to make a database. Hypotheses are composed to make a model.
Writing computer programs means composing expressions and statements into procedures or functions, composing procedures to make modules, and composing modules to make programs. Reading data from a file means composing your algorithms with the data they work on into a complete computation. Configuring a new computer and installing software are about composing an operating system, various libraries, and application software into a functioning whole.
When you look at these examples more closely, you might notice that some of these acts of composition are so trivial that we don’t even think about them, whereas others are a real pain. In that second category, we find most of the composition work related to computers. So what is the difference?
Humans process information in terms of concepts. We all have accumulated a vast amount of conceptual knowledge over our lifetime, starting with the most basic concepts that we learned in infancy. This knowledge includes the definitions of all the concepts, but also the relations between them. Our knowledge of concepts helps us to “make sense” of information, which includes the detection of probable mistakes and sometimes even their correction. Humans are very tolerant to mistakes and variations in how some piece of information is expressed. We don’t care if the items in a shopping list are arranged vertically or horizontally, for example.
When composing information, we read the individual items, translate them into concepts, and then write out the composition. I use the vocabulary of processing written language here, but the same holds for oral or visual communication. Variations in notation may be an inconvenience, but not a real problem. As long as the information refers to familiar concepts, we can deal with it.
Computers process information by applying precise mechanical rules. They don’t care about concepts, nor about context. If you ask a computer to do something stupid, it will happily do so. This may look like a criticism of how computers work, but it’s also exactly why they are so useful in research: they have different strengths and weaknesses compared to humans, and are therefore complementary “partners” in solving problems.
At the hardware level of a digital computer, a computation is a multi-step process that transforms an input bit sequence into an output bit sequence under the control of a program that is stored as a bit sequence as well. Information processing by computers thus requires all data to be expressed as bit sequences. Dealing with bit sequences is, however, very inconvenient for humans. We therefore use data representations that are more suitable for human brains, but still exactly convertible from and to the bit sequences that are stored in a computer’s memory. These representations are called formal languages. The definition of a formal language specifies precisely how some piece of information is encoded as sequences of bits. Many formal languages are defined in terms of sequences of text characters instead of sequences of bits, for another level of human convenience. Since the mapping from text characters to bits is straightforward, this makes little difference in practice. The term “formal language” is commonly used in computer science, but in computational science we usually speak of “data formats”, “file formats”, and “programming languages”, all of which are specific kinds of formal languages. The use of formal languages, rather the the informal languages of human communication, is the defining characteristic of digital information.
The definition of a formal language consists of two parts, syntax and semantics. Syntax defines which bit patterns or text strings are valid data items in the language. Syntax rules can be verified by a suitable program called a parser. Semantics define the meaning of syntactically correct data items. With one important exception, semantics are mere conventions for the interpretation of digital data. Meaning refers to conceptual knowledge that a computer neither has nor needs: all it does is process bit sequences. The exception concerns formal languages for expressing algorithms, i.e. rules for the transformation of data. The semantics of an algorithmic language defines how each operation transforms input data into output data. Writing down such transformation rules obviously requires a notation for the data is being worked on. For that reason, a formal language that can express algorithms also defines the syntax and semantics of the input and output data for these algorithms. Your favorite programming language, whichever it is, provides a good illustration.
There is a huge number of formal languages today, which can be organized into a hierarchy of abstraction layers, such that languages at a higher level incorporate languages from lower levels. As a simple example, a programming language such as Fortran incorporates formal languages defining individual data elements - integers, floating-point numbers, etc. At the lowest level of this hierarchy, close to the bit level at which computing hardware operates, we have formal languages such as Unicode for text characters or the floating-point number formats of IEEE standard 754. One level up we find the memory layout of Fortran arrays, the layout of UTF–8 encoded text files, and many other basic data structures and file formats. Structured file formats such as XML or HDF5 are defined on the next higher level, as they incorporate basic data structures such as arrays or text strings. Programming languages such as Python or C reside on that level as well.
Different formal languages that encode the same information at the semantic level can be converted into each other. The two best-known translations of this kind in the daily life of a computational scientist are file-format conversion and the compilation of software source code into processor instructions. However, if you take into account that the in-memory data layout of any program is a formal language as well, all I/O operations can be considered conversions between two formal languages.
Digital information is, by definition, information expressed in a formal language. Composition of digital information produces a new, more complex, digital information item, which is of course expressed in a formal language as well. And since the ingredients remain accessible as parts of the whole, everything must be expressed in one and the same formal language. And that’s where all our trouble comes from.
If we start from ingredients expressed in different languages, we have basically two options: translate everything to a common language, or define a new formal superlanguage that incorporates all the languages used for expressing the various ingredients. We can of course choose a mixture of these two extreme approaches. But both of them imply a lot of overhead and add considerable complexity to the composed assembly. Translation requires either tedious and error-prone manual labor, or writing a program to do the job. Defining a superlanguage requires implementing software tools for processing this new superlanguage.
As an illustration, consider a frequent situation in computational science: a data processing program that reads a specific file format, and a dataset stored in a different format. The translation option means writing a file format converter. The superlanguage option means extending the data processing program to read a second file format. In both cases, the use of multiple formal languages adds complexity to the composition that is unrelated to the real problem to be solved, which is the data analysis. In software engineering, this is known as "accidental complexity", as opposed to the “essential complexity” inherent in the problem.
As a second example, consider writing a program that is supposed to call a procedure written in language A and another procedure written in language B. The translation option means writing a compiler from A to B or vice-versa. The superlanguage option means writing an interpreter or compiler that accepts both languages A and B. A mixed approach could use two compilers, one for A and one for B, that share a common target language. The latter solution seems easy at first sight, because compilers from A and B to processor instructions probably already exist. However, the target language of a compiler is not “processor instruction set” but “processor instruction set plus specific representations of data structures and conventions for memory management”. It is unlikely that two unrelated compilers for A and B are compatible at that level. Practice has shown that combining code written in different programming languages is always a source of trouble, except when using tools that were explicitly designed for implementing the superlanguage from the start.
In the last paragraph, I have adopted a somewhat unusual point of view which I will continue to use in the following. We usually think of a language as something named and documented, such as C or Unicode. The point of view I adopt here is that the language in which a piece of digital information is expressed consists of all the rules and constraints that must be satisfied, including the rules and constraints due to composition. To illustrate the difference, consider the Python language and the Python language with the NumPy extension. According to the standard point of view, Python is the language and NumPy is a library written in Python. In my point of view, Python+NumPy is a language different from plain Python. To see that libraries modify their underlying languages, consider the Python statement
import numpy. It fails in plain Python, so it is not a valid statement in the Python language, whereas it is valid in the Python+NumPy language. Moreover, in the Python+NumPy language you are not allowed to write a module called
numpy. The addition of NumPy to plain Python makes some formerly invalid programs valid and vice versa, which justifies speaking of different, though certainly similar, languages.
The above discussion suggests that to keep our lives simple, we should use as few different formal languages as possible. Unfortunately, an inventory of what we have to deal with shows that we are very far from that optimum.
Data formats are the easiest part. Even the number of “standard” formats is enormous, and many of them aren’t that well standardized, leading to different dialects. Worse, many scientific programs make up their own ad-hoc data formats that are scarcely documented. That’s why file conversion takes up so much of our time. Moreover, we usually have different on-disk and in-memory data formats for the same data, which is why we need to write I/O routines for our software.
But the complexity of formal languages used to define programs completely dwarfs the complexity of data formats. Let’s start at the bottom level: the processor’s instruction set. If you write an operating system (OS), that’s the level you work at. Otherwise, your program is a plug-in to be composed with an operating system, and the operating system defines the formal language in which you need to provide your program. The “OS language” includes the processor’s instruction set, but also adds constraints (memory use, relocatability, …) and access to OS functions. The OS language can be as simple as the COM file format from CP/M and DOS days but also as complex as Linux’ ELF format.
The ELF format introduces the next level of composition: object files and dynamic libraries, in addition to executable files. In a modern OS, a program is composed from several ingredients immediately before execution. The motivation for introducing this last-minute composition was the possibility to share frequently used program building blocks among the hundreds of processes running in parallel, thus reducing their memory footprint. But this comes at the price of considerable accidental complexity. The OS language that your program must be written in now includes note only the processor instruction set and the ELF format specification, but also conventions about where certain shared libraries are stored in the file system. That’s why it is no longer possible to prepare a generic program for the Linux platform. Different Linux distributions have different conventions for arranging the shared libraries in the file system, and moreover these conventions change over time. They have different OS languages.
Upon closer inspection, the situation is actually even worse. The OS language for a given piece of software includes all the software packages that have been installed on the same computer before. Obviously, only one software package can occupy a given filename. Once you have installed a package that uses the file
/usr/lib/libm.so, no other package can occupy the same slot. That makes it impossible to wrap up “my program and all the files it requires” for installation on some other machine. If package A contains
/usr/lib/libm.so and package B another
/usr/lib/libm.so, even if it is only a slightly older version of the same library, the two packages could not coexist. The only solution is to distribute programs and libraries as building blocks to be added to a growing assembly, whose composition - now called “software installation - is left to the system administrator. Each block comes with a list of "required dependencies”, whose presence the system administrator must ensure. Moreover, each block occupies certain slots that must be available in the system. In the terminology of formal languages, each new block must conform to a language that its author cannot know in advance, and cannot even fully describe. I have described this error-prone approach in an earlier blog post as the Tetris model of software installation, because of its obvious similarities with the well-known video game. It’s the most widely used model in scientific computing today.
The obvious problems caused by this approach have motivated the development of various tools for the management of software installations. Some are specific to some OS platform (the package managers of Debian, RedHat, BSD, etc.). Others are specific to a programming language, e.g. Python’s
distutils system and its derivates. The multitude of software installation managers has created a secondary composition problem: to install a Python package on a Debian system, you must negotiate a compromise between Python’s and Debian’s views on how software installation should be managed.
Another approach is to give up on sharing common resources, and provide some way to package programs with all the files they need into a single unit, even if this leads to duplication of data on disk and in memory. This is the idea behind MacOS X application bundles (which go back to NextSTEP) and also Docker containers. Tools such as Python’s
virtualenv proceed in a similar way, by isolating a specific composition of building blocks from other potentially conflicting compositions of building blocks on the same computer.
An ingenious construction that combines the best of both worlds is the approach taken by the Nix package manager and its offshoot Guix. Instead of having building blocks refer to each other through filenames, they use a hash code computed from the actual contents of the files. This allows the composition of arbitrary building blocks, including pairs that would claim the same filenames in a standard Linux system, but also prevents multiple identical copies of any building block. This idea is known as content-addressable storage, and is also used in the popular version control system git.
Up to here, I have described the composition of specific programs with an operating system. But the program that is prepared as a plug-in to an OS is itself already a composition. How it is composed and from which constituents depends on the programming language(s) being used and on the tools that implement them. In Python, for example, a program consists in general of packages which consist of modules which consist of name-value pairs. A C program consists of source code files and header files, which each contain value and function definitions and interact via macro definitions. Like in the case of the “OS language”, the precise formal language in which each piece is written is not just Python or C. It also includes constraints and extensions coming from other building blocks — libraries — that the program refers to, as I have illustrated above for the example of Python plus NumPy.
Comparing these two situations, we can identify the common culprit: the use of a global namespace for composing building blocks. In the “OS language” of a typical Linux system, the global namespace is the filesystem. In Python, it’s the namespace of top-level module names. In C, it’s the namespace of non-static top-level definitions. Composition requires one building block to refer to another building block through a name in that namespace. And that in turn requires each building block to occupy a specific name in that namespace, so that others can refer to it.
One way to alleviate this problem is encouraging the use of very specific names. That’s the approach taken by Java, whose global namespace for packages is supposed to contain “reversed domain names” such as
org.apache.commons.lang3.math. While such a rule, if respected, indeed reduces the risk of name collisions between unrelated packages to almost zero, the most frequent source of name collisions remains: different versions of a package have the same name and can therefore not be used together in a composition. When composing building blocks into a program, one can argue that mixing different versions is bad practice anyway. But in the Tetris model of a single global software collection per computer, not being able to have several versions of a building block is often a serious restriction.
A final kind of formal language worth mentioning in this context is languages for defining compositions. This category includes Makefiles, Dockerfiles autoconf configuration files, and of course the package specification files of the various package managers. Their multitude shows the importance of the composition problem, but it also contributes to it. It is not rare to see a specification file for one package manager refer to another package manager. Conversion from one such language to another is nearly impossible, because the precise language for defining a composition depends not only on the package manager, but also on the other existing packages. It’s exactly the same situation as with the “OS language” and programming languages extended by libraries.
I believe that there is, and I have some ideas about this, but I will leave them for another time as this post is already quite long. I hope that the above analysis contributes to a better understanding of the problems that computational scientists are facing in their daily work, which is the prerequisite to improving the situation.
As a first step, I encourage everyone to prefer solutions to workarounds when faced with composition-related issues. Solutions identify a cause and eliminate it, whereas workarounds merely alleviate the impact of the problem, often re-creating the same problem at another level later on. In the approaches I have discussed above, an example of a solution is content-addressable storage, as used in Nix. In contrast, the traditional Linux package managers are workarounds, because they re-create a composition issue at the package level. Linux distribution authors have done a lot of hard and useful work with these package managers, which I don’t want to play down in any way. But the fruit of that work can be carried over to better foundations. The Tetris model of software installation is not sustainable in my opinion. We have to move on.