An unlikely alliance: medical science comes to the aid of astrophysics
How a chance encounter between a medical scientist and an astrophysicist led to a novel adaptation of the random forest method of machine learning towards interpreting the spectra of exoplanetary atmospheres, and its implications for increased transparency and reproducibility in the practice of atmospheric retrieval
The paper in Nature Astronomy is here: go.nature.com/2KEI6ir
Breakthroughs sometimes occur in unanticipated ways and the (true) story I am about to tell you will perhaps serve as a reminder that universities and institutes of higher learning need to allow their researchers the mental space to pursue whimsical, curiosity-driven research that is not encumbered by the daily, mainstream demands of academic life. By complete accident, I met a young assistant professor named Raphael Sznitman, who is a computer scientist by training and was a recent hire by the ARTORG Center for Biomedical Engineering at the University of Bern, Switzerland. We struck up a conversation in which the only motivation was to find out what the other person was doing, without any expectation that it would lead to anything concrete. I was immediately struck by Raphael's ability to explain difficult concepts in image recognition and computer science in accessible language. Intrigued, I met Raphael for lunch several more times, where I learned how he was using machine learning methods to identify cancer cells in images and to control high-precision surgical tools (e.g., for eye operations). During one of these lunches, I explained to Raphael one of the central problems in exoplanetary atmospheres, which is to solve the inverse problem of interpreting a spectrum to obtain chemical abundances and other properties of the atmosphere. Over time, we added two brilliant, younger members to our nascent "crack team" in machine learning at the University of Bern: Pablo Marquez-Neila, the lead author of the current study, coder extraordinaire and a computer scientist by training, as well as Chloe Fisher, the second author, who graduated from Cambridge University with degrees in mathematics and astronomy before joining my research group as a Ph.D. student. Pablo and Chloe did the bulk of the technical work that led to the reported results.
During one of our brainstorming sessions together, we again discussed the inverse problem of exoplanetary atmospheres, known in the community as "atmospheric retrieval". At one point, I distinctly remember Raphael saying, very matter-of-factly, "Random forest." There was no doubt in his mind that this was the correct method. Random forests are composed of decision trees (for discrete data) or regression trees (for continuous data, such as spectra), and it is easy to explain decision trees to a non-specialist with no knowledge of machine learning, because the name could not be more apt: it is a flow chart that records decision-making, structured like a tree with branches and sub-branches, eventually ending in leaves. I must admit I was dumbfounded at first. It took me a month to appreciate that the classification of fruit in a basket based on their colour, size, smell, taste, etc., has the same conceptual structure as quantifying the non-linear correlations between the data points in a spectrum and their underlying physical/chemical parameters. The random forest method itself is not novel and has been known since the late 1990s. What is novel and not immediately intuitive is that it can be adapted to perform atmospheric retrieval, which is the main thrust of our Nature Astronomy Letter.
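To make the idea of regression trees on spectra concrete, here is a minimal sketch in plain Python, not the code used in the paper. Everything in it is an assumption made for illustration: the five-bin `toy_spectrum` forward model, the `BAND_SHAPE` weights, the single retrieved parameter (a log abundance between -6 and -1), and the tree settings. Each tree is a flow chart of "is bin f at most this threshold?" questions ending in leaves that hold a mean parameter value, and the forest averages the answers of many such trees, each grown on a bootstrap resample of the training models.

```python
import random

BAND_SHAPE = [0.2, 0.6, 1.0, 0.6, 0.2]  # hypothetical: central bins respond most strongly

def toy_spectrum(log_abundance, rng, noise=0.01):
    """Hypothetical forward model: a 5-bin 'spectrum' that deepens with abundance."""
    strength = (log_abundance + 6.0) / 5.0  # maps [-6, -1] onto [0, 1]
    return [1.0 + w * strength + rng.gauss(0.0, noise) for w in BAND_SHAPE]

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean (the cost a split tries to reduce)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, targets, rng, depth=0, max_depth=6, min_leaf=5):
    """Grow a regression tree: internal nodes are tuples, leaves are mean targets."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return mean(targets)
    n_features = len(rows[0])
    best = None
    # Consider only a random subset of the spectral bins at each split.
    for f in rng.sample(range(n_features), (n_features + 1) // 2):
        for threshold in rng.sample([r[f] for r in rows], min(10, len(rows))):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, threshold)
    if best is None:
        return mean(targets)
    _, f, threshold = best
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [t for _, t in left], rng, depth + 1, max_depth, min_leaf),
        build_tree([r for r, _ in right], [t for _, t in right], rng, depth + 1, max_depth, min_leaf),
    )

def predict_tree(node, row):
    """Walk the flow chart from the root until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def train_forest(rows, targets, n_trees, rng):
    """Train each tree on a bootstrap resample of the model grid."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """The forest's estimate is the average of its trees' estimates."""
    return mean([predict_tree(tree, row) for tree in forest])

rng = random.Random(0)
train_params = [rng.uniform(-6.0, -1.0) for _ in range(400)]
train_spectra = [toy_spectrum(p, rng) for p in train_params]
forest = train_forest(train_spectra, train_params, n_trees=20, rng=rng)

true_value = -3.0
observed = toy_spectrum(true_value, rng)
estimate = predict_forest(forest, observed)
print(f"true {true_value:.2f}, retrieved {estimate:.2f}")
```

A real retrieval would involve several parameters, a physical forward model and a far larger training set, and would more likely rely on an established library than on hand-written trees; the sketch is only meant to show that the fruit-in-a-basket flow chart and the spectrum-to-abundance mapping share the same structure.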
The human brain is excellent at performing two-dimensional pattern recognition, but falters at recognising patterns in an abstract, multi-dimensional space. In battles of human versus machine, the random forest retrieval managed to mimic my intuition as a theoretical astrophysicist interpreting Hubble-WFC3 spectra, but my mind could not reproduce the more subtle, multi-aspect correlations in the spectrum that the random forest detected. For example, I could guess from staring at the shape of the opacity of the water molecule, across wavelength, that the data points clustered around 1.4 micrometers are important for constraining the water abundance. However, it was much harder for me to estimate the percentage information content of each of the 13 data points in the WFC3 spectrum we used as an example. To be perfectly clear, this does not mean that the method can be used blindly, because the random forest is only as reliable as the set of models used to train it. If I feed it garbage, I expect to get back garbage (the "garbage in, garbage out" rule of thumb that I routinely teach my Ph.D. students). This is true when one constructs any model in general and is part of the craft and gravitas of a theoretical astrophysicist, but machine learning brings it into focus because the potential to fool oneself is very real.
The choice of the random forest method was driven by a desire to obtain a physically meaningful interpretation of a spectrum. Specifically, given a sparse spectrum, one wishes not only to obtain the best-fit values of the model parameters, but also their posterior distributions. To our knowledge, the random forest method is the only machine learning method for which computer scientists have proven that the underlying distributions generated are actually posterior distributions. This has not been proven for deep learning methods such as convolutional neural networks, which are widely used in image recognition, probably because posterior distributions are uninteresting to these practitioners. I was flabbergasted to learn that a complete theory of convolutional neural networks remains elusive, meaning that some aspects of why they work remain empirical. The interaction with the medical scientists also shook one of the foundations of my training as a physicist: what does it mean to understand something? I previously held the belief that understanding was tied to the ability to predict. But the case studies in medical science shook this belief. For example, a machine learning procedure can be trained to predict the presence of cancer cells (provided there is enough data to perform this training, which is not the case yet, but that is another story) with zero understanding of what cancer is (i.e., having no underlying, first-principles theory of cancer). If we let our guard down, machine learning allows us to be ignorant and predictive, and to confuse correlation with causation. It is easy to be seduced by the ostensible precision of numbers.
An implication of the current study that was difficult to express in the paper is its potential for increasing the level of transparency and reproducibility in atmospheric retrieval research. Currently, the standard practice is to generate a family of models on the fly and use them to locate the best-fit solution in a multi-dimensional space of parameters (using, for example, Markov Chain Monte Carlo or nested sampling methods). In principle, it is possible to store this grid of models and make it publicly available, but in practice no one does it. This implies that it is, in practice, impossible to independently examine the model grid used to perform the retrieval and therefore to study the trends it displays and the (sometimes hidden) assumptions built into it. Random forest retrieval separates the construction of the model grid from the seeking of the best-fit solution, making them two distinct logistical steps. This means that one can first compute a model grid, study and understand it, debug it, publish it online, etc., then use it to locate the best-fit solution afterwards. Once the random forest has been trained, the outcome of the training can be published and used by anyone else. One can even use model grids from competing researchers without having access to the computer code that generated them. (To be clear: I do not necessarily advocate for this approach, because of my disdain for "black boxes", but it now becomes a practical possibility if one so desires it.) Another key advantage is that one does not require special hardware or software: the random forest may be trained in several minutes, using the Python programming language, on a laptop. (On MacBooks, for example, Python comes stock.) The interpretation of the spectrum takes one millisecond. It democratises the practice of science, allowing the exoplanetary atmospheres community to focus on who has the best ideas, rather than who has privileged access to software or hardware.
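The two-step workflow described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our actual pipeline: `forward_model` is a hypothetical five-bin toy, the JSON file name `model_grid.json` is invented, and a simple nearest-neighbour lookup stands in for the random forest training purely to keep the example short. The point is only that the second step needs the stored grid and nothing else, not the code that produced it.

```python
import json
import os
import random
import tempfile

def forward_model(log_abundance):
    """Hypothetical stand-in for an atmospheric model: returns a 5-bin 'spectrum'."""
    band_shape = [0.2, 0.6, 1.0, 0.6, 0.2]
    strength = (log_abundance + 6.0) / 5.0  # maps [-6, -1] onto [0, 1]
    return [1.0 + w * strength for w in band_shape]

def build_grid(n_models, seed=0):
    """Step 1: compute the model grid once, independently of any data."""
    rng = random.Random(seed)
    grid = []
    for _ in range(n_models):
        x = rng.uniform(-6.0, -1.0)
        grid.append({"log_abundance": x, "spectrum": forward_model(x)})
    return grid

def save_grid(grid, path):
    """Write the grid to a plain JSON file that could be inspected or published."""
    with open(path, "w") as fh:
        json.dump(grid, fh)

def load_grid(path):
    """Read a grid back; this needs the file only, not forward_model."""
    with open(path) as fh:
        return json.load(fh)

def retrieve(grid, observed):
    """Step 2: nearest-neighbour stand-in for a learner trained on the grid."""
    def distance(entry):
        return sum((m - o) ** 2 for m, o in zip(entry["spectrum"], observed))
    return min(grid, key=distance)["log_abundance"]

# Step 1, done once by whoever owns the forward model.
grid_path = os.path.join(tempfile.mkdtemp(), "model_grid.json")
save_grid(build_grid(500), grid_path)

# Step 2, possibly done later by someone else who only has the file.
shared_grid = load_grid(grid_path)
observed = forward_model(-2.5)  # noise-free mock observation
estimate = retrieve(shared_grid, observed)
print(f"retrieved log abundance: {estimate:.2f}")
```

Because the grid sits on disk as an ordinary file, it can be plotted, checked for trends and debugged before any fitting takes place, which is the transparency argument made above.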