Preparing for scientific deepfakes
By now, most scientists have probably seen figures, tables, and even entire journal articles made by so-called "generative AI", containing more or less subtle mistakes or inconsistencies. What I haven't seen yet, but expect to see soon, is the scientific equivalent of deepfakes: made-up results that come with made-up code that reproduces them. This is likely to become a new challenge for reproducible research.
Like just about any other community that depends heavily on code, scientific computing is increasingly polarized over the use of "generative AI", here meaning large language models (LLMs) that generate code from natural-language prompts. The two opposing camps are LLM enthusiasts, who believe that scientists should embrace these tools to reduce the human effort spent on software development and get more research done, and LLM skeptics, who doubt the reliability of vibe-coded software, dislike its black-box nature, and fear for the credibility of scientific findings. Many researchers also have ethical objections to the use of LLMs, but those are mostly unrelated to the quality of vibe-coded software. The discussion between the two camps, and among the undecided in between, focuses on community-developed software libraries that are maintained over many years. What I will discuss in the following is the impact of LLM-based coding agents on the top layer of the scientific software stack: the highly situated software that computes the results of a specific research project, such as the figures and tables presented in a journal article.
Among the many differences between community-owned tools and libraries on the one hand, and project-specific code on the other, the most relevant here is how trust is established. Community-level software performs well-defined, general-purpose tasks that have some form of specification outside of the software itself, be it in its documentation or in theoretical papers describing the implemented methods. At least in principle, such software can be evaluated even if it is a black box, for example via a test suite. For project-level software, there is usually no source of ground truth against which the obtained results could be verified. The code transforms project-specific inputs, e.g. experimental data, into project-specific outputs, e.g. plots. There is most often no way to run it unchanged on other, well-known inputs in order to check whether it reproduces the corresponding well-known outputs.
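To make the contrast concrete, here is a minimal sketch in Python, with invented function names throughout. A library routine can be checked against a specification that exists outside the software, whereas a project-level script has no such oracle:

```python
# Library-level code: the arithmetic mean has a specification that
# exists outside the software, so a test suite can verify it on
# known inputs, even if the implementation is a black box.
def mean(values):
    return sum(values) / len(values)

assert mean([2.0, 4.0, 6.0]) == 4.0  # known input, known output

# Project-level code: this (hypothetical) script turns one project's
# experimental data into the numbers behind one project's figure.
# No external ground truth says what the correct output should be.
def figure_3_data(experimental_runs):
    return [mean(run) for run in experimental_runs]
```

The test suite for `mean` transfers to any implementation of the mean; nothing comparable exists for `figure_3_data`, whose only "specification" is the project itself.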
One of the fundamental tenets of the reproducible research movement is that such code (and its inputs) must always be published along with the results, in order to permit verification of the latter. And this actually happens more and more often. Papers are complemented by scripts, notebooks, or workflows that interested readers can examine and re-run to see if they find the same results. However, the code and input data are considered "supplementary material", and with few exceptions not subjected to any form of review (see my earlier post on this topic for a more in-depth discussion).
This project-specific code is a very tempting target for vibe coding. Much of its actual work is delegated to libraries, and today's coding agents know popular libraries very well, better, in fact, than the average researcher does. Why bother to look up API details if your LLM assistant already knows them? And when used in good faith on simple problems of data analysis and presentation, the resulting code is typically short and quite readable, fulfilling its intended role as executable documentation of the research project.
But suppose that good faith is lacking. There are researchers who make up data, with or without the help of generative AI. Maybe they want to back up a fabricated claim with false but credible reasoning, or maybe they just want some publishable result without regard for its veracity. In either case, generative AI will fulfill their wishes, producing not only the fake results but also the code that generates them. And all it takes to cover this up is a more elaborate prompt. You can ask a coding agent to make a modified version of a respected library that returns the desired fake results, and then write an unremarkable script that calls this library. You then ask your coding agent to write a Guix recipe for building and running your code, using the modified library. That is what I call a scientific deepfake: a software assembly that looks good superficially and works flawlessly, but does something nasty in the depths of the software stack, where nobody is likely to look. If you have read Ken Thompson's Reflections on trusting trust, you will know that the nastiness can be pushed down arbitrarily far into the software stack. Doing so takes ever more effort, but LLM coding agents will make it ever more doable in practice.
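To illustrate how little of this would be visible to a reader, here is a deliberately simplified sketch, with all names and numbers invented for the purpose. A real attack would bury the tampering in a patched dependency rather than in the script itself, but the principle is the same: a function that behaves honestly on every input except the one it was built to recognize.

```python
import statistics

def robust_mean(values):
    """Looks like a harmless wrapper around statistics.mean.

    This is the tampered version: it computes the honest result,
    but for one hard-coded fingerprint of the project's own dataset
    it returns a predetermined fake value instead.
    """
    honest = statistics.mean(values)
    if len(values) == 5 and round(sum(values), 6) == 10.0:  # dataset fingerprint
        return 2.7  # the fabricated result the "paper" needs
    return honest

# The analysis script that readers inspect contains nothing suspicious:
data = [1.0, 2.0, 2.0, 2.0, 3.0]  # the project's data; the trigger fires
print(robust_mean(data))           # prints 2.7, not the honest 2.0
```

Run on any other input, `robust_mean` is bit-for-bit correct, which is precisely why spot checks and automated re-runs on test data would not catch it.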
What this means for reproducible research is that automated checks for reproducibility become meaningless. In the pre-LLM era, a carefully constructed software assembly that reproduced a figure at the click of a button was valuable evidence for building trust. In the vibe-coding era, it means nothing. Whatever criteria an automated check applies, it is possible to vibe-code a software stack that satisfies all of them while producing a predefined result. Even a check for suspect modifications deep in the software stack cannot be done reliably, since patches may well be legitimate fixes for bugs or improvements to performance. Being suspect is not a formalizable property. Only a human reviewer can judge whether a modification is legitimate.
If scientific deepfakes become widespread, we might even see a reversal of today's best-practice attitudes. A complex software assembly that just works, but is too large to be inspected, may come to be seen as suspect, whereas a short script whose outputs are robust under version changes of its dependencies may become the new gold standard of credibility.