Distilling Free-Form Natural Laws from Experimental Data

Schmidt, M., & Lipson, H. (2009). Science324(5923): 81-85.

Automating the Search for Natural Laws

Scientists have always been concerned with identifying the laws that govern natural phenomenon. We now live in an age where we can collect massive amounts of experimental data from a wide range of systems – subatomic to astronomical. We are also blessed with constantly growing computational power. Can we use these resources to automate the search for the governing laws of any physical system? In this paper, Schmidt and Lipson present an approach based on symbolic regression to automate the search for natural laws from experimental data. Mathematical symmetries and invariants underlie nearly all physical phenomenon. Thus, the search for natural laws is invariably a search for conserved quantities or invariant equations. However, the most prohibitive obstacle for automating this process is finding meaningful invariants. This paper proposes a principle for the identification of non-trivial invariants.

Symbolic regression

Several methods exist for modeling scientific data:

  • Fixed-form parametric models based on expert knowledge
  • Numerical models aimed at prediction, for e.g. neural networks
  • Restricted model spaces using greedy search

The goal here, however, is to find principal unconstrained analytical expressions that explain symbolically precise conserved relations. This requires searching the space of both functions and parameters.

In this paper, Symbolic Regression, which is an evolutionary computation method, is used for searching the space of mathematical functions. Initial mathematical expressions are constructed using basic mathematical building blocks – algebraic operators (+,-,\times), a basic set of analytical functions (for e.g. sine, cosine), constants and state variables. The search algorithm, based on genetic programming, forms new equations by recombining previous equations and probabilistically varying sub-expressions. These models are associated with a fitness score. Models with high fitness are retained and unpromising solutions are discarded. The algorithm terminates after the obtained equations reach a desired level of fitness.

Identification of nontrivial relations

Rather than try to model any specific signal, the goal here is to find any underlying physical law that the system obeys. The candidate equations should predict relationships between dynamics of the components of the system. Specifically, this paper proposes a Predictive Ability Criterion: the candidate equations should predict connections among derivatives of groups of variables over time. The fitness score used by their symbolic regression algorithm is, thus, a measure of the difference between partial derivatives obtained

  • symbolically from the candidate equations
  • numerically from the experimental data.

Further, instead of producing just one candidate, the algorithm outputs a short list of final candidate analytical expressions on the accuracy-complexity Pareto frontier.

Results

The search algorithm with the partial-derivative-pairs criterion was applied to measurements from a few simple physical systems – air-track oscillators and double pendulums. One important feature of the algorithm is that one could control the type of law that the system might find by choosing the input data. For example, for a pendulum, if only position coordinates are provided as input, the algorithm converges to the equation of a circle. Given position and velocity data, the algorithm converges on energy laws. The algorithm could extract Hamiltonians of the air-track oscillators and the conservation of angular momentum laws for the double pendulum.

Caveats

Though the algorithm can present equations corresponding to physical laws in their mathematical form, the bulk constants in the expressions are not characterized. The authors propose a systematic approach to parsing these coefficients by analyzing multiple data sets from the same system, albeit with different configurations and parameters. They demonstrate this approach by using measurements from simulated air-track oscillators and pendulums.

The time to converge on the law equations depends exponentially on complexity of the law itself, dimensionality of the system and measurement noise. Therefore, a key challenge is scaling to higher complexity. One potential solution proposed is the use of candidate equations obtained from simpler systems as initial seeds for complex systems. This seeding approach does not constrain the equation search, but instead biases the search to reuse terms from previous laws.

Problems

In the course of our discussion of this paper, there were a couple of questions raised. The primary doubt stems from the Predictive Ability Criterion. The text describes a “candidate law equation f” whereas f is just an expression, not an equation. Now, if you do make it a conservation equation, then basic calculus gives you the opposite answer from what is reported as the predictive ability criterion. In particular, for a conservation law f(x,y) = c, any change in x must be accompanied by a change in y, and the result is exactly the negative of Eq S2 in the supplement!

  • f(x,y) = c . . . . . . . . . . . . . . . # conservation law
  • \frac{df}{dx} + \frac{df}{dy} \frac{dy}{dx} = 0 . . . . . . . . . . . . . . . # diff. both sides w.r.t x, use chain rule
  • \frac{dy}{dx}= - \frac{df/dx}{df/dy} . . . . . . . . . . . . . . . # solve for dy/dx as desired

Equation S2 in the supplement states instead that 

  • \frac{dy}{dx}= + \frac{df/dx}{df/dy} . . . . . . . . . . . . . . . # note opposite sign!

The proposed algorithm is designed to identify nontrivial conservation laws. However, the authors also claim that this algorithm can be used to identify equations such as Lagrangian equations, which summarize the system dynamics but are not invariant. It is not quite clear how these Lagrangians could be obtained.

When we contacted the corresponding author about these problems, he replied that we should read the supplement, and mentioned that absolute values on the implicit derivatives produce Lagrangians. However, these suggestions did not resolve our questions. Any further clarifications from the authors or scientific community would be most welcome. There is room for commenting below.

Applications to Neuroscience?

This paper presents an approach to identify analytical, human interpretable governing laws from experimental data without any prior knowledge of the system, and demonstrates its success for simple, low-dimensional physical systems. For us, the pertinent question is – can this method be scaled to neural data? We can now simultaneously record the activities of tens of thousands of neurons in the brain. Neural data has various sources of noise and variations across experimental subjects/systems. Further, the governing principles of the brain (if indeed the brain has any) probably describe the dynamics of what the neurons represent. Perhaps with the right kind of fitness score (which also accounts for identifying the appropriate neural representations) and sufficient computational power, the symbolic regression approach could potentially extract some underlying computational principles of the brain.