Automating science
The advent of AI agents based on large language models (LLMs) has put the idea of automating the intellectual and cognitive work of researchers on the table. A lively, sometimes even heated discussion is already going on. A frequently missing piece in this debate is the question of why we, individually and as a society, actually do science. I will examine this question first, and then consider what it implies for introducing automation into science.
Science
First of all, what is science? Any short definition is necessarily a caricature, but I hope that the following caricature is at least a useful one: science is a collective process that aims at accumulating reliable knowledge about the world we live in, emphasizing doubt and epistemic humility in order to counterbalance human cognitive biases. In other words: science proceeds with the assumption that anything can turn out to be wrong, and that the default answer to any question is "we don't know". Everything we think we know can be questioned and revised in the light of new evidence or new critical examination.
Next, why do we do science? This depends very much on who that "we" is. Science started in the 16th century as a leisure activity by wealthy or sponsored people to satisfy their curiosity. Nowadays, science is funded by governments in order to support economic growth and policy decisions, which is a much more utilitarian stance. And yet, individual researchers are still largely motivated by curiosity. But curiosity and utilitarianism are not as distinct as they may seem. From an evolutionary perspective, it makes sense for organisms living in a complex world to use spare resources for acquiring knowledge that may be useful in an uncertain future. Curiosity is thus a trait that helps make people and societies more robust.
Scientific institutions have the role of maintaining and extending the collective knowledge base, which is a complex network of interconnected pieces of information. There is knowledge about the world we live in, of course, but also knowledge about how to make observations and how to interpret their results, plus theories and knowledge about these theories, and a lot more. And all that knowledge comes with a judgement of its reliability attached to it, which is important because the ultimate goal of science is obtaining reliable knowledge.
If you imagine the collective knowledge base as a huge library full of books and journals, or as a collection of Web sites that look like Wikipedia, you are missing an important piece. Information archives are important for science, but they cannot capture everything required to interpret this information. The archives contain marks on paper, or bit patterns for the digital ones. The procedural knowledge required to make sense of these information snippets, and to relate them to the world we live in, is embodied in practicing scientists. You cannot learn chemistry from reading chemistry books alone. At some point, you have to manipulate chemicals, touch them, smell them, mix them, and see what happens. More generally, if you want to learn everything required to understand and contribute meaningfully to the chemistry literature, you need to work as an apprentice to an experienced chemist. If for some reason all chemists die, the books and Web sites will become unintelligible. This is not abstract theory. We have written documents from the past that nobody can read any more because nobody living today knows the language and writing system used by the authors.
Most popular narratives about science concentrate on the task of extending the collective knowledge base, by making new observations about the world or new theories and models explaining such observations. Maintenance gets a lot less attention. It consists essentially of two processes: training the next generation of scientists, and re-examining existing knowledge, in the light of new information or new ideas for representing this knowledge. A modern textbook on classical mechanics looks very different from Isaac Newton's "Principia Mathematica", but it describes the same theoretical framework. What has changed is the notation and the presentation, making the material easier to understand and apply, and easier to integrate with other theories in physics but also in other fields using the same mathematical notation. And the more a theory is applied, integrated, and tested, the more reliable it becomes. The best evidence for the reliability of Newton's mechanics (when applied within its limits of applicability) is the fact that it underlies a huge part of the technology we use every day. Centuries of refinement have turned Newton's intellectual exercise into knowledge that we can rely on. Maintenance matters!
Not all scientific knowledge has been revised and applied for centuries. How then do we judge its reliability? That's an important question that is not examined often enough, in particular in the ongoing AI-for-science debate. An early and still relevant technique is double-checking. If multiple researchers do similar work and obtain similar results, their results strengthen each other's reliability. And if the results disagree, the causes of the differences can be explored systematically. The simple version of double-checking that I have described here works only for studies of simple systems, where "similar work" and "similar results" are well-defined concepts. But the idea can be extended to more complex systems, where one would examine the coherence of findings from a large number of individual studies.
Trust
But there remains an important condition: a judgement of reliability requires a detailed understanding of all the studies involved. Nobody can have that level of competence in more than one or two narrow domains. And yet, everybody doing research needs to rely on results from other domains and disciplines. A biologist performing data analysis is rarely a trained statistician, for example. And a physicist performing numerical simulations is rarely a trained numericist. All researchers nowadays need to trust the reliability judgements of experts in other domains. And that's also what decision makers in politics and industry do in order to figure out which scientific findings they should turn into plans for action.
Human societies rely on webs of trust, because trust is the foundation for cooperation. In today's industrial societies, this web links together individuals, institutions, ideas, technologies, and physical objects, via numerous mechanisms such as reputation, certification, accountability, or punishment by law. Consider why you trust the train that you take to work every morning to transport you safely. Your trust builds on a trust in the engineers who designed the train, scientific findings that the engineers relied on, the workers that built the train, laws that define safety-related obligations, government agencies that oversee the respect of these obligations, and a lot more.
A large part of the boring grunt work of parliaments and government agencies is maintaining this web of trust, in contact with the scientific web of trust, among others. The web of trust behind train safety has grown over centuries, since long before the first railways were built. A society's web of trust is a big part of its social capital.
Digital technology remains a challenge for the web of trust, because it evolves much faster than traditional trust mechanisms can adapt. The one technology that is almost completely exempt from legal and contractual obligations concerning safety and reliability is software. Major perturbations such as the CrowdStrike incident have contributed to a growing awareness about this problem, but so far nothing much has changed at the legal level. Software vendors are not sanctioned for negligence, nor even for intentional malice (such as Grok producing deepfake porn).
In science, digital technologies have likewise been adopted enthusiastically and uncritically. The publication and quality control process, which has been based on journal publications and peer review since about the 1960s, is no longer adequate for today's research work, which, thanks to digital technology, now features large collaborations, big datasets, and complex computational analyses. The replication crisis is in large part the result of this mismatch between the imagined value of peer review as a quality control mechanism and its real value as a rough credibility check. As with the safety issues I mentioned above, we are only starting to understand and correct for this evolution. And while we are grappling with these issues, LLMs are causing another earthquake in the foundations of the scientific web of trust.
Automation
To what degree can science possibly be automated? Let's start with the highest imaginable level: fully automated science. That would be a machine that supplies supposedly reliable knowledge via some sort of interface, perhaps a supercharged chatbot. You could ask the machine a question, and it would enter into a dialog to request additional input from you, before in the end giving you an answer. This answer could well be "I don't know yet, ask again a month from now while I do some more research". Obviously this machine would have to be more than a bunch of computers. It would have to interact with the real world, making observations, setting up experiments, etc. Think of a network of computers and robots if you want a concrete image.
Would you trust such a machine to provide reliable answers?
Would you agree on having the machine do experiments on you? Would you trust its affirmation that these experiments are in your best interest?
For most of us, answering such questions comes down to trusting others who we perceive as experts or authorities, or who are involved in designing or operating the machine. What would be the profile of an expert whose affirmations about the machine's reliability you would trust? Which institution would you trust to issue a certification for the science machine?
If you have some expertise in science or engineering yourself, you might want to start by inspecting the processes that led to the creation of the machine. That's a good start. You might end up becoming one of those experts that the rest of us rely on. But if the machine will do all future science, then there won't be human scientists left a few decades later. And maybe no engineers either. So... who will take over your job as an expert? Why would your grandchildren trust the machine? And who will keep the machine running? It can't look after itself, as a living organism does.
The good news is that nobody talking about automating science actually proposes this extreme level of automation. The bad news is the obvious conclusion that many people who propose automating science are unaware of many of the aspects of the process they wish to automate. My proposal: when discussing automation, always say explicitly where you see the interface between machines and humans. It's always there, somewhere. As long as there are humans interested in accumulating reliable knowledge, there will be a science process run by humans, who delegate specific tasks to machines. As we have been doing for quite a while already, e.g. when using DNA sequencers, or when deploying software on a computer. Automation, in science and elsewhere, has been with us for a few centuries, since the beginning of industrialization.
There are three main motivations for automating a task, as compared to having humans perform it:
Economy. Machines make many things cheaper than humans do, at least in our current economic model that ignores externalities such as resource depletion and environmental pollution. Often the machines produce less useful or less versatile products, but at a so much lower price that the trade-off looks favorable. As an example, consider buying an industrially produced chair as compared to making a chair yourself, or having one tailor-made by a craftsperson.
Quality. Machines do a better job at producing certain items. Staying with the carpentry theme, consider nails. Humans have made and used nails since prehistoric times, but with the arrival of industrially made nails, human-made nails have disappeared. Machines do a better job at tasks that require high precision. They make nails that are both better and cheaper.
Complexity. Some artefacts are so complex that industrial production is the only viable option. Consider a modern car with its mechanical and electronic complexity. I doubt that anyone has ever even tried to make such a car using nothing but human labor.
In the current debate on automating science, the only motivation I see cited is economy: LLMs would allow us to do more science given the same number of people and the same resources. Most proponents of LLMs for science (e.g. this one, to give a concrete example) conveniently gloss over what "more science" actually means. They use the same bibliometric proxies whose inadequacy for research assessment is finally being recognized: more science means more papers. Some largely LLM-written papers have already been accepted in scientific journals, so the claim that LLMs can write papers that can pass peer review is credible. However, "passing peer review" is not the same as "useful contributions to science". In other words, the problem is not so much LLMs as an outdated quality assessment process from the 1950s that has not kept up with the enormous changes in research over the last 70 years.
If we want to update our quality assessment, the question we should focus on is: how can we assess the reliability of knowledge that we obtain with the help of LLMs? Again this is not a new question. It's a question that we have asked about every single scientific instrument or experimental setup since the dawn of science. The goal is not to eliminate unreliable information sources. They often contribute useful information, and in some cases, such as in the beginnings of a new field of research, all available information may be of low reliability. That's fine, as lack of reliability can be compensated by diversity and coherence. The sum of many information sources is often more reliable than any single one on its own. But it does matter that we can estimate the reliability of each information source. Which, for experimental setups, we usually can.
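The claim that many weak sources can add up to a reliable one can be illustrated with a small simulation. This is only a sketch under strong assumptions that are mine, not part of the argument above: each source reports the true value plus independent, unbiased Gaussian noise of the same size. The function name `rms_error_of_average` and all numbers are made up for illustration. Under these assumptions the error of the average shrinks roughly with the square root of the number of sources. A bias shared by all sources would not average out, which is why diversity matters.

```python
import random
import statistics

def rms_error_of_average(n_sources, noise_sd=1.0, trials=2000, seed=0):
    """Root-mean-square error of the average of n_sources noisy readings.

    Assumption (hypothetical model): every source reports the true value
    plus independent Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    true_value = 10.0
    squared_errors = []
    for _ in range(trials):
        readings = [rng.gauss(true_value, noise_sd) for _ in range(n_sources)]
        estimate = statistics.fmean(readings)
        squared_errors.append((estimate - true_value) ** 2)
    return statistics.fmean(squared_errors) ** 0.5

single = rms_error_of_average(1)     # one unreliable source
combined = rms_error_of_average(25)  # 25 equally unreliable, independent sources
print(f"single source: {single:.2f}, average of 25 sources: {combined:.2f}")
```

With independent errors, the combined estimate should come out about five times more accurate than a single source. The sketch says nothing about how to estimate the reliability of each source in the first place, which remains the hard part.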
It is much harder to estimate the reliability of computed information, due to the complexity of software. And so... like society at large (see the previous section), when it comes to software, scientists have mostly suspended the doubt that used to be their trademark. In parallel to developing computational methods, we should have developed processes for establishing trust in them, but we didn't. Only with the arrival of LLMs did we realize that establishing trust is an important and difficult problem. Well, better late than never. Let's start. My contributions so far are this opinion piece about reviewing research software, and this preprint that analyzes the reviewability of software and AI tools. Unfortunately, in the meantime we will have to deal with the ongoing massive assault on our journals by LLM-generated submissions, most of which are likely to be of bad quality.
My prediction is that, once the excitement about "automating science" has died down, we will forget this idea and concentrate on using LLMs under human supervision for well-defined tasks in which they have proven to be useful, reliable, and cost-effective. The last part is rarely discussed, but it's important to keep in mind that today's AI operators run at a huge loss in order to encourage massive adoption of their product. They won't do this forever, so prices will increase, while research budgets for non-AI topics are diminishing in many countries. Nevertheless, LLMs could turn out to be a good trade-off for specific tasks in software development, in data analysis, or in the presentation of results.
However, establishing responsible LLM use in research is possible only if researchers can try out and evaluate these tools before committing to their use and making themselves dependent on them. This cannot happen in a few weeks. It cannot happen either in the current strongly polarized atmosphere where people are divided into two opposing camps, one crying "Only AI can save us!" and the other one replying "AI is the devil!" More than anything else, we need to remember the self-doubting attitude inherent to science, and admit that anyone's views on LLMs may need a revision.
It is also doubtful whether responsible use is possible at all with today's generation of LLMs. In addition to the ethical issues, which I will address in the next section, there is a contradiction between the complete opacity of these models and the transparency requirements of science. This means you can use them only if you can audit the results in some way, as you can in some software development settings. It doesn't help that the companies that control these models openly support a government that is actively destroying scientific institutions. There is absolutely no reason to trust these companies to support science; at best, we can hope that they completely ignore research applications in developing their tools. The minimum condition for LLMs that are safe for science would be a disclosure of the training data set and the tweaks that happen after the ingestion of the training data. There are various projects to construct LLMs under such conditions, but they don't seem to be ready for practical applications.
Taking a step back
In the last section, I have looked at automation in science by LLMs with a narrow focus on the topic. That means I have taken into account the properties of science as far as automation is concerned, and the properties of LLMs as far as they relate to scientific research. I have not taken into account other characteristics of science and LLMs. This narrow-focus view is our culture's default way of analyzing things. It keeps complexity to a minimum, which is helpful. But it also hides potentially relevant aspects from view. Which is why I will now adopt a wider focus: taking into account more aspects, though necessarily in less detail.
Science doesn't happen in a monastery located on a remote island. It is embedded in industrial societies, where it has ties to philosophy, politics, education, industry, and a lot more. LLMs didn't fall from the sky. They were developed by people inside organizations, tied to philosophy, politics, education, industry, and a lot more. LLMs are deployed on physical machines that need to be designed, built, and operated. All that means that adoption of LLMs in science implies also bringing some of the LLM context into science. And if science one day becomes a major user of LLMs, in terms of quantity or prestige, the reverse will happen as well.
There are two major criticisms of today's LLMs that derive from this wide-focus view:
The impact of LLM use on the natural environment, via their enormous resource requirements.
The process by which LLMs were created: much of the training material was used without permission from its authors, and the unpleasant human labor involved (screening for atrocities) was outsourced to people who are too poor to be able to refuse the job.
Many scientists reject LLMs for these reasons, much like they reject experiments on animals or excessive air travel to conferences: not because LLMs are bad for science, but because they are bad for entities outside of science, be they humans, non-human organisms, or the entire biosphere. Expressed in the jargon of economics, LLM use has significant negative externalities.
Negative externalities are of course not specific to LLMs. Much of what we do in daily life (and even more so in scientific research) comes with negative externalities that we prefer not to think about because we cannot really do much about them. Climate change is the most visible one: the metabolism of our societies runs on fossil fuels, without which we couldn't even feed the human population of the planet, let alone provide the material security and comfort that we are used to. The ubiquity of negative externalities in our lives is probably the main reason why so many people, including scientists, do not see a particular issue with the negative externalities of LLMs. It's just one more item on the long list of negative externalities that we accept in order to go on with life. In this light, LLM rejection is comparable to veganism or flight shame: a conscious rejection of social norms in order to make at least a first step away from industrial societies' path towards ever increasing resource consumption and exploitation of other living beings.
Can these ethical issues with LLMs be overcome? In theory, yes. There are ideas for eliminating every single one. Resource consumption can be reduced by designing more efficient hardware. Less generalist LLMs could be trained with less training material, which could then be gathered with permission of its authors. Screening for atrocities becomes a non-issue when training domain-specific LLMs for science. The whole training process can be made transparent. Unfortunately, in the current economic context, none of this is likely to happen. And even in the best imaginable scenario, it would take years to decades to develop ethical LLMs for science.
Such a delay, however, is unacceptable to those putting forward ethical arguments for LLM use: an acceleration of knowledge acquisition in ethically relevant domains such as health research. What if LLMs can help us cure cancer more rapidly? Is it ethically defensible not to do this? Most of these arguments are fallacious. Whereas the ethical arguments against LLMs are based on real observed negative externalities, the ethical arguments for LLM use that I have seen so far are based on speculation about hypothetical benefits. I have not seen anyone outline a credible path to an accelerated development of cancer treatments with the help of LLMs. The best you can say is that it is not logically impossible. My suspicion is that proponents of such arguments severely underestimate what it takes to develop cancer treatments. Experiments and clinical trials take a lot of time, which is not compressible by computation of any kind. And never forget the trust issue: in the end, a practicing oncologist must trust new treatments before they can actually make a difference.
There are, however, quite probably some less sensational contexts in which LLM use does speed up research that has credible societal benefits. And therefore the argument "my LLM use is ethically justifiable because the benefits outweigh the negative externalities" cannot be rejected in general. However, so far I haven't seen any attempt to estimate such a trade-off, let alone the combination of net ethical benefit and reliable outcomes.
And the verdict is...
Let me end this post with my personal conclusion: I do not use LLMs for any aspect of my scientific research. Not for writing articles (nor blog posts), not for writing software, not for information retrieval, not for anything else. My research has always been more methodological than applied, and over the years it has moved more and more towards foundational questions such as reproducibility, in the topic space of metascience and philosophy of science. I consider these topics important, but not urgent. They don't justify contributing to massive harm elsewhere, nor putting quantity above quality - quite the contrary!
What I haven't made up my mind about yet is the use of LLM-written software. LLM use in software development comes in many shades, from code cleanup to 100% vibe coding. The latter is incompatible with the transparency requirements of science anyway, except for code snippets small enough to be audited by humans. My provisional policy is to take a critical look at LLM-supported software before adopting it. Yes, that's vague, but the only way I see to refine my policy is through practice, and that takes time! What I will not do, however, is completely reject LLM use by others. That would imply no longer collaborating with many of my colleagues, and that's a bad idea: nobody has anything to gain from the scientific community splitting into pro-AI and anti-AI camps.
Recommended reading
AI slop and the destruction of knowledge by cognitive scientist Iris van Rooij. She illustrates with a concrete example what happens when LLM-generated erroneous information is incorporated unchecked into formerly trusted scientific knowledge repositories. This is happening in many places right now, and it is not clear how we will ever manage to clean up these knowledge repositories again, assuming we even decide to do it.
Scientists invented a fake disease. AI told people it was real by science journalist Chris Stokel-Walker illustrates the other side of the knowledge destruction process: fake information clearly marked as such is nevertheless absorbed by LLMs and contributes to their output.
How much of the scientific literature is generated by AI? by science journalist Miryam Naddaf. A report on the magnitude of LLM use in article preparation, and why it is quite difficult to estimate.
AI agents may be skilled researchers—but not always honest ones by science journalist Nicola Jones. Another story about AI agents that are intended to automate aspects of research but do so unreliably.
Context Widows by Kevin Baker. He explains in much more detail than I did above how the goal displacement from "quality contribution to science" to "citation metrics" in the 1960s and 1970s prepared the ground for an exploitation of the new goals via LLMs, while the initial goal of quality contributions to science is silently abandoned.
Academics Need to Wake Up on AI, Part III by sociologist Alexander Kustov. His key point is that today's LLMs can produce research papers that are no worse than many human-written ones that pass peer review. He concludes that in the context of today's incentives and funding criteria, academics cannot afford not to use LLMs without losing out to their competitors who do. This confirms Kevin Baker's point about goal displacement.
A BlueSky thread by cognitive scientist Molly Crockett, pointing out the disequilibrium in science funding that prioritizes AI development while defunding everything else. Quote: "We risk all of science if we rush to build “AI Scientists”, before we understand the value of human science."
Why do we do astrophysics? by astrophysicist David W. Hogg. He argues that for non-utilitarian fields such as his own, automating research work makes no sense because the main value of the research is not the findings but the maintenance of the research community. Many of his arguments are of interest to utilitarian perspectives as well, so this is worth reading even if you care only about "useful" science.
The fall of the theorem economy by mathematician David Bessis. His main point is that the value of theorem proving in mathematics is not the catalog of proven theorems but the insight gained from coming up with the proof. Theorem proving by LLMs doesn't provide this value. He predicts that the increasing use of LLMs will lead to a shift of evaluation criteria, away from valuing proofs as a proxy for the work that went into constructing them. Similar arguments can be made in other disciplines, e.g. theoretical physics.
The machines are fine. I'm worried about us. by physicist and mathematician Minas Karamanis. He worries about the consequences of students using LLMs to speed up their PhD work. The students don't learn much about research, and the supervisors could well be tempted to use LLMs directly and stop bothering with students. In either case, we lose the next generation of scientists able to do research, with or without LLMs.
Against the uncritical adoption of ‘AI’ technologies in academia by a multidisciplinary team of researchers. A very detailed and well-documented analysis of the numerous issues that make many past and present AI technologies problematic in research and education.