Konrad Hinsen's Blog

The landscapes of digital scientific knowledge

Over the last few years, an interesting metaphor for information and knowledge curation has begun to take root. It compares knowledge to a landscape in which it identifies two key elements in particular: streams and gardens. The first use of this metaphor that I am aware of is this essay by Mike Caulfield, which I strongly recommend reading first. In the following, I will apply this metaphor specifically to scientific knowledge and its possible evolution in the digital era.

In the landscape metaphor, streams are timelines of information parcels. News, RSS feeds, Twitter, Facebook, but also scientific journals, are stream media. Gardens are continuously evolving information assemblies that are actively curated by their authors. Encyclopedias and dictionaries are perhaps the oldest examples. In the printed paper era, updating an information collection was expensive because everything had to be reprinted and redistributed. As a consequence, garden-type resources were rare. Digital gardens have no such overhead, and almost no cost other than the work of their curators. More and more people are setting up their own digital gardens as an alternative or complement to the personal stream, better known as a blog. Click here, here, and here to see a few examples of personal digital gardens. Like blogs, digital gardens can also be collective efforts, run by a company, a research group, or a larger community. The most widespread tool for digital gardening is the Wiki, but there are also more recent developments in this space, such as Notion or Roam.

One distinction that I haven’t seen mentioned yet in this context is the one between a garden and a park. Both are curated and thus continuously evolving. But whereas gardens are set up and maintained for the benefit and enjoyment of their owners, parks are created and maintained for the benefit and enjoyment of the public. The difference can be subtle, as digital gardens are often visible to the public as well. But they are more like the unwalled garden on the roadside that you can admire in passing than like the park in which you can take a walk and sit down to read a book. A good example of a digital park is Wikipedia.

Science is all about acquiring information about our world and distilling it into knowledge, and therefore requires a fair bit of gardening. In its early days, it was managed as a garden by and for a small community of people who were motivated by curiosity and relied on personal wealth or on sponsors for doing their work. Universities employed scientists more for teaching than for doing research. Research was done by individuals or small teams, and presented at conferences or in journal articles, much like today. Unlike today, most scientists were up to date on everything that was happening in their field, and had personal exchanges with almost everyone else, in face-to-face meetings or by correspondence. Conferences were events in which conflicting results and different points of view were actively debated, enabling the formation of consensus. The streams of papers and conference contributions thus watered the garden of scientific knowledge.

All that changed after World War II, when science underwent rapid growth as states injected a lot of money while at the same time expecting the scientific community to cultivate a park rather than a garden, contributing to the common good. Keeping up to date with everybody else’s work became more and more difficult, slowly eroding the possibility of consensus formation through live debate at conferences. Productivity metrics focusing on what is easiest to quantify ended up rewarding scientists for contributing to the stream of journal articles, but not for contributing to the cultivation of the park of scientific knowledge. Today, the streams of journal articles have become torrents whose distillation into knowledge is becoming ever more difficult. A good illustration is the (serious) proposal to use machine learning tools to make sense of the “tsunami” of articles resulting from the intense research on the Covid-19 pandemic.

The design and implementation of new mechanisms for knowledge distillation and consensus formation is thus a major challenge for science today, and even though machine learning techniques may prove to be helpful, I expect this to remain a fundamentally human task for a long time to come. These new mechanisms must combine technological aspects (good tools for working towards these goals) and social aspects (incentives for scientists to participate in this work). As always, the social aspects are the harder problem. As a first step and as a source of inspiration, let’s look at similar existing mechanisms in science and elsewhere. Which digital parks exist? How do they work? Can their mechanisms be adapted to other applications?

I have already cited Wikipedia as a prime example of a digital park. I had expected to see Wikis more widely used as a platform for collective information curation in science, be it as gardens or parks, but when I searched for examples I found surprisingly few, e.g. Tricki (for mathematical problem-solving techniques) or the Complexity Zoo (on classes of computational complexity). One problematic aspect of Wikis is that they present only a single view to the outside world. They are better suited for presenting an established consensus than for supporting the process of consensus formation in rapidly evolving fields. One of the rare cases of a Wiki used for coordinating collaborative research, rather than for summarizing the state of the art, is the Polymath project. It is probably not a coincidence that this has happened in mathematics, a domain whose working habits remain close to those of the early scientific community, with individuals having more agency than in disciplines that are more dependent on material resources.

Federated Wiki is an interesting evolution of the Wiki concept (initiated by the original inventor of the Wiki, Ward Cunningham) that allows individual contributors to maintain and publish their own view while at the same time encouraging reciprocal borrowing of content. This video illustrates the process nicely. Although Federated Wiki looks like a promising approach to consensus formation, the technical obstacles to setting one up are significant (contributors must manage personal Web servers and domains) and make it difficult to evaluate in practice.

Perhaps the most frequent kind of digital park in science today is the collaborative software development project, hosted on platforms such as GitHub, GitLab, or similar platforms operated by research institutions. Ignoring the differences resulting from the focus on code rather than prose, the main differences between these platforms and Wikis are (1) a stronger emphasis on discussion (“issues”) and (2) the co-existence of multiple branches representing different public or private views of a common project, with one branch (conventionally named “master” or “main”) representing the current consensus.

Collaborative software projects are an interesting case study also for the question of incentives. The lack of recognition of software development as a research activity has been deplored for a long time. It is usually attributed to the relative novelty of software as a form of research output. But I suspect that the park nature of software, as opposed to the stream nature of journals, is also an important factor, because it makes it more difficult to evaluate an individual’s contributions based on purely formal (and thus easily measurable) criteria. On the other hand, today’s collaborative platforms make such an evaluation technically feasible, by counting for example the number of commits made by an individual, or the number of lines changed by those commits. Everybody involved in software development will probably agree that this is a stupid metric, but it’s no more stupid than counting publications weighted by journal impact factor.
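To make concrete how little effort such a formal metric requires, here is a minimal Python sketch that tallies commits and changed lines per author. It is only an illustration under stated assumptions: the input is taken to be the text printed by `git log --numstat --pretty=format:@%an`, the `@` prefix is an arbitrary marker chosen here to tell author lines from file lines, and the function name `contribution_counts` and the sample log are made up for this example.

```python
from collections import defaultdict


def contribution_counts(log_text):
    """Tally commits and changed lines per author.

    Assumes log_text is the output of
        git log --numstat --pretty=format:@%an
    where each commit starts with a line "@<author name>" followed by
    one "<added>\t<deleted>\t<path>" line per changed file.
    """
    stats = defaultdict(lambda: {"commits": 0, "lines_changed": 0})
    author = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Start of a new commit: remember the author and count it.
            author = line[1:].strip()
            stats[author]["commits"] += 1
            continue
        parts = line.split("\t")
        if author is None or len(parts) < 3:
            continue  # blank separator line or unexpected content
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" and carry no line counts.
        if added.isdigit() and deleted.isdigit():
            stats[author]["lines_changed"] += int(added) + int(deleted)
    return dict(stats)


# Hypothetical sample input, not taken from any real repository.
SAMPLE_LOG = (
    "@Alice\n"
    "10\t2\tsrc/model.py\n"
    "3\t0\tREADME.md\n"
    "\n"
    "@Bob\n"
    "1\t1\tsrc/model.py\n"
    "\n"
    "@Alice\n"
    "-\t-\tfigures/plot.png\n"
    "5\t4\tsrc/io.py\n"
)

counts = contribution_counts(SAMPLE_LOG)
for name in sorted(counts):
    print(name, counts[name])
```

Note what the numbers leave out: a one-line fix for a subtle bug and a mechanical reformatting of a thousand lines are indistinguishable except in size, which is precisely why the metric says so little about the value of a contribution.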

Another social aspect that is well illustrated by software is the difficulty of the transition from gardens to parks. Projects usually start out as gardens, with a small team developing software for its own use. Then early users start to join, who by necessity have to figure out for themselves how to adapt the software to their needs, and are thus likely to become contributors. With an increasing user base, developers have an interest in working on more robust code and better documentation, in order to reduce the effort of technical support. At that stage, the software becomes attractive to less technically minded users who see no need to ever get in touch with the development community. These users consider the software a park, even if its developers still consider it a garden, leading to contradictory tacit expectations on both sides about the priorities for future maintenance, which I have described in an earlier post. Developers tend to contribute to this confusion by advertising their project as a park while maintaining it as a garden.

The above examples illustrate that the technical challenges of digital gardens and parks are somewhat understood and partially solved. Collaborative software development platforms in particular have proven very effective. Adapting their concepts to different use cases and different users definitely looks possible, although the effort required should not be underestimated, in particular for developing appropriate user interfaces. But the real challenge is creating incentives for collaboration, in a universe currently dominated by competition for limited resources.