Data science is usually considered a very recent invention, made possible by electronic computing and communication technologies. Some consider it the fourth paradigm of science, suggesting that it came after three other paradigms, though the whole idea of distinct paradigms remains controversial. What I want to point out in this post is that the principles of data science are much older than most of today’s practitioners imagine. Let me introduce you to Apollonius, Hipparchus, and Ptolemy, who applied these principles about 2000 years ago.
The focus of interest of these early researchers was a topic that had kept humanity busy for quite a while already, all over the world: the motion of heavenly bodies. The main motivation was making predictions for the near future. The configuration of the stars and planets was widely believed to have an impact on human affairs (a belief we call astrology today), so knowing it in advance was of obvious interest. They had astronomical observations at their disposal, but numbers alone are not sufficient to make predictions. You also need a model for extrapolating the numbers to the future.
The tool that Apollonius, Hipparchus, Ptolemy, and probably many others, developed and improved to near perfection was epicycles: a model for the orbit of a heavenly body consisting of a superposition of circles, with each circle’s center moving along a bigger circle’s circumference. Epicycles are similar in spirit to Fourier series. Any periodic orbit can be described as a superposition of circular motions. Given enough data, one can fit an epicycle model and make predictions. But since the epicycle model does not contain any physics, it doesn’t come with any safeguards against mistakes. Epicycles can equally well describe real and completely unrealistic orbits, and therefore the quality of the data is very important.
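To make the analogy with Fourier series concrete, here is a minimal sketch in Python of what "fitting an epicycle model" can look like with modern tools. It is an illustration under assumptions of my own, not a reconstruction of how the ancient astronomers worked: the orbit is synthetic (one big circle of radius 5 plus one small circle of radius 1), positions are encoded as complex numbers, and the names `fit_epicycles`, `evaluate`, and `orbit` are made up for this example. Each Fourier coefficient stands for one circle: its magnitude is the radius, its index is the number of turns per period.

```python
import cmath
import math


def fit_epicycles(points):
    """Discrete Fourier transform of equally spaced complex positions.

    Returns a dict mapping frequency k (turns per period) to a complex
    coefficient whose magnitude is the radius of the corresponding circle.
    """
    n = len(points)
    coeffs = {}
    for k in range(-(n // 2), n - n // 2):
        coeffs[k] = sum(
            z * cmath.exp(-2j * math.pi * k * t / n)
            for t, z in enumerate(points)
        ) / n
    return coeffs


def evaluate(coeffs, t, n):
    """Position predicted by the circles in `coeffs` at time index t."""
    return sum(
        c * cmath.exp(2j * math.pi * k * t / n) for k, c in coeffs.items()
    )


def orbit(t, n):
    """Synthetic 'observations' (an assumption made for this sketch):
    a circle of radius 5 carrying a smaller circle of radius 1 that
    turns eight times per period."""
    return (5 * cmath.exp(2j * math.pi * t / n)
            + cmath.exp(2j * math.pi * 8 * t / n))


n = 64
observations = [orbit(t, n) for t in range(n)]
coeffs = fit_epicycles(observations)

# Keep only the two largest circles, as a two-circle epicycle model.
largest = sorted(coeffs.items(), key=lambda kc: abs(kc[1]), reverse=True)[:2]
model = dict(largest)

# Extrapolate three steps past the end of the observed period.
prediction = evaluate(model, n + 3, n)
error = abs(prediction - orbit(n + 3, n))
print(sorted(model), error)
```

Note that the fit recovers the two circles without knowing anything about why the body moves this way, and that the prediction is only good because the synthetic orbit really is periodic. Fed with noisy or erroneous observations, the same procedure would fit those just as readily.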
Today’s data science works much the same. Very general models, such as neural networks, are fitted to large datasets and then used to make predictions. Again the models contain very few assumptions about underlying laws of nature. They are by design very general (see e.g. this visual proof that neural networks can compute any function) in order to capture any kind of regularity in the input datasets. As with epicycles, data quality is important, which is why data scientists invest significant effort in cleaning up the raw data they work on.
Aside from the obvious technological aspects and the associated change of scale in the size of datasets, the main improvement of today’s data science over epicycle models for orbits is even more generality. Early astronomers had periodicity baked into their models from the start. Neural networks (and other models used in data science) could predict the motion of heavenly bodies with even less theoretical input. However, it is important to realize that every model imposes some a priori assumptions, even if, as in the case of neural networks, these assumptions are not fully understood and therefore not formalized. Seen in this light, the improvement of modern data science over epicycles is gradual rather than fundamental.
Seen from a historical perspective, data science turns out to mark the beginning of scientific disciplines rather than their refinement. It permits the very first step from raw observations to a description of regularities. Connecting these regularities to known, more fundamental principles, or even discovering new fundamental principles as in the case of Newton’s laws for celestial mechanics, can only happen afterwards.
Perhaps a more fundamental distinction than the one between experiment and theory (plus, according to some, simulation and data science) is the one between data-driven and model-driven science. Data-driven science starts from observations and searches for regularities using generic models. Model-driven science takes more advanced problem-specific models and aims at evaluating and improving their quality on the one hand, and exploring their consequences on the other. In terms of day-to-day research activities, data-driven science collects observations that promise to be interesting and uses statistical methods to interpret them. Model-driven science has theoreticians exploring models and experimentalists asking Nature specific questions arising from this exploration. The oldest and best-known scientific disciplines, i.e. physics and chemistry, are primarily model-driven today, which may contribute to the impression that data-driven science is new. As the epicycle example shows, this is really just a lack of historical perspective.