Wanted: a hierarchically modular software architecture

2020-05-05 scientific software

In his 1962 classic "The Architecture of Complexity", Herbert Simon described the hierarchical structure found in many complex systems, both natural and human-made. But even though complexity is recognized as a major issue in software development today, the architecture described by Simon is not common in software, and in fact seems unsupported by today's software development and deployment tools.

The prime characteristic that Simon identifies in most complex systems is a hierarchical structure. Systems consist of subsystems, which consist of sub-sub-systems, etc. Simon describes the subsystems at each level as "nearly decomposable", meaning that the interactions between subsystems are much less important than the interactions between the parts inside a subsystem. I prefer the shorter term "modular" for this feature, and thus end up with "hierarchically modular" as my label for the architecture that Simon describes in much detail. I won't repeat his arguments for the ubiquity of such systems, so please read the paper - it's definitely worth it, and it's very clearly written.

It may seem as if many of today's programming languages propose exactly this kind of architecture for designing software systems, but a critical inspection shows that they don't. To explain where the problem is, I will use Python as an example because it is widely known, but the arguments apply with some modifications to most other languages as well.

Python's module system is basically a hierarchy of namespaces, with namespaces containing mainly function and class definitions, but also variables referring to arbitrary data objects. Since namespaces are independent, and can contain sub-namespaces, this looks like a perfect match for a hierarchically modular architecture.

One obstacle is that there is no way to combine independently designed modules into a larger hierarchy. Suppose I want to create a software component called ode_solver that uses the popular packages NumPy and SciPy. In a hierarchically modular architecture, implementation details of a component, such as the names of the packages it uses, would be hidden from outside view. The packages would become ode_solver.numpy and ode_solver.scipy. In real Python, they can only remain numpy and scipy, as their authors decided to call them. Independently written software components in Python always live in the globally shared top-level namespace. And since developers are free to modify their packages as they like, this makes the top-level namespace an instance of shared mutable state, universally recognized as problematic in software engineering.

The shared top-level namespace creates a strong interaction between all components at all levels. Suppose I have another component called visualizer that also uses NumPy and SciPy, but requires different versions. That component becomes impossible to combine with my ode_solver because of conflicting version requirements - the well known dependency hell. Another way to look at this is to consider each package's detailed dependency list, with version requirements, as part of its interface.

The second obstacle is that the full specification of a module's interface (something that's never ever written down in Python) in general includes classes defined by its dependencies. My ode_solver could, for example, return some value as a NumPy array. That would make NumPy not only a run-time dependency of the code, but also a specification dependency for the interface. If visualizer expects a NumPy array as the input to one of its functions, I'd be in trouble again as the class definition in the two different versions of NumPy might not be the same. And that trouble would not go away if I could migrate NumPy and SciPy inside my component's namespace as suggested above.

Some readers' first reaction is likely to be "that's a symptom of bad specifications" or "that's the trouble you deserve for using a dynamically typed language". However, static typing doesn't solve the problem, it merely shifts it from run time to compile time. It's the types introduced by dependencies that end up in the static interface of a component. The impact on component compatibility is the same. And if that's a symptom of bad design, then good design is not only rare but also actively discouraged by today's software development tools. The only way out I can see is to create wrapper types and wrapper functions in the component that hide the implementation in terms of dependencies. Hands up if you find that idea appealing!

The only programming language I know of that does not suffer from this problem is Unison, which refers to functions and data types via hashes rather than names. It's a very young language, so it's too early to say how this feature will change software architecture on a larger scale.

Programming languages are not the only realm in which we can try to construct hierarchically modular software. It would in fact be preferable to do so at a language-neutral level, to escape from the silos that languages tend to represent. I'd love to be able to combine a component written in Python with a component written in R! So maybe we should try to make hierarchically modular assemblies at the level of compiled binaries.

One candidate would then be Linux' Executable and Linkable Format (ELF), which covers several types of binary files: executables, object files, shared libraries, and more. But there is no kind of ELF file that could represent hierarchically composable modules, as far as I can see. There's no way to combine two shared libraries into a bigger shared library, nor two executables into a larger executable, and moreover every executable has a global namespace that would create the same issues that I outlined above for Python. You can't have an executable that includes or refers to two different versions of the zlib library, for example.

The only approach that looks doable in the Unix world is working at the process level. A software component is then a process based on an executable, and data between processes is exchanged via files or sockets. Choosing a clever hash-based naming scheme (as done by Nix and Guix) makes it possible to keep any combination of versions accessible in parallel. Several processes could be managed as child processes by a superprocess, which would thus represent a component one level up in the hierarchy. In the Web world, a very similar setup could be constructed by making each component a Web service. There isn't much tool support for such techniques, but perhaps the most important obstacle is efficiency issues in the communication between components, which would require serialization and either file storage or network communication.

The main merit of the two approaches I have outlined in the last paragraph is that they can accommodate legacy code and systems, unlike the starting-from-scratch approach of Unison. With a bit of luck, improved tooling and optimization could turn the process/service-based approach into a viable technique for some types of real-life application, while Unison and perhaps others introduce the same basic idea at the programming language end of the scale of software component technologies. And then, if the concept turns out to be successful for taming software complexity, it might become the norm after a few decades. So far for my daily dose of wishful thinking!

Finally, let me reveal my motivation for writing this post: I hope that someone will prove me wrong. I'd love to see a comment pointing out that I am simply not aware of the right tools and techniques. And you get bonus points for references to actual hierarchically modular software systems that work!

Comments retrieved from Disqus

Konrad Hinsen:
A Twitter comment says that Rust's package management system satisfies my requirements.