A theory of multineuronal dimensionality, dynamics and measurement

We recently discussed this paper by Gao et al. from the Ganguli lab. They present a theory of neural dimensionality and sufficient conditions for accurate recovery of neural trajectories, providing a much-needed theoretical perspective from which to judge the majority of systems neuroscience studies that rely on dimensionality reduction. Their results also provide a long overdue mathematical justification for drawing conclusions about entire neural systems based on the activity of a small number of neurons. I felt the paper was well written, and the mathematical arguments used in the proofs were pretty engaging; I don’t remember the last time I enjoyed reading supplementary material quite like this. Here’s a brief summary and some additional thoughts on the paper.

Linear dimensionality reduction techniques are widely used in neuroscience to study how behaviourally-relevant variables are represented in neural populations. The general approach goes like this: (i) apply dimensionality reduction, e.g. PCA, to the trial-averaged activity of a population of M neurons to identify a P-dimensional subspace (P<M) capturing a sufficient fraction of the variance in neural activity, and (ii) examine how neural dynamics evolve within this subspace to (hopefully) gain insights about neural computation. This recipe has largely been successful (ignoring failures that generally go unpublished): the reduced dimensionality of neural datasets is often quite small and the corresponding low-dimensional dynamical portraits are usually interpretable. However, neuroscientists observe only a tiny fraction of the complete neural population. So could the success of dimensionality reduction be an artefact of severe subsampling? This is precisely the question that Gao et al. attempt to answer in their paper.

They first develop a theory that describes how neural dimensionality (defined below) is bounded by the task design and some easy-to-measure properties of neurons. Then they adapt the mathematical theory of random projections to the neuroscience setting and obtain the amount of geometric distortion in the neural trajectories introduced by subsampling, or equivalently, the minimum number of neurons one has to measure in order to achieve an arbitrarily small distortion in a real experiment. Throughout this post, I use the term neural dimensionality in the same sense that the authors use in the paper: the dimension of the smallest affine subspace that contains a large (~80–90%) fraction of the neural trajectories. Note that this notion of dimensionality differs from the intrinsic dimensionality of the neural manifold, which is usually much smaller.

To derive an analytical expression for dimensionality, the authors note that there is an inherent biological limit to how fast the neural trajectory can evolve as a function of the task parameters. Concretely, consider the response of a population of visual neurons to an oriented bar. As you change the orientation from 0 to \pi, the activity of the neural population will likely change too. If \vartheta denotes the minimum change in orientation required to induce an appreciable change in the population activity (i.e. the width of the autocorrelation in the population activity pattern), then the population will be able to explore roughly \pi/\vartheta linear dimensions. Of course, the scale of autocorrelation will differ across brain areas (presumably it increases as one goes from the retina to higher visual areas), so the neural dimensionality would depend on the properties of the population being sampled, not just on the task design. Similar reasoning applies to other task parameters such as time (yes, they consider time as a task parameter because, after all, neural activity varies in time). If you wait for a time period T, the dimensionality will be roughly equal to T/\tau, where \tau is now the width of the temporal autocorrelation. For the general case of K different task parameters, they prove that, even if you record from millions of neurons, the neural dimensionality D is ultimately bounded by:

\displaystyle \LARGE D \le C\frac{\prod_{k=1}^{K}{L_k}}{\prod_{k=1}^{K}{\lambda_k}} \qquad \qquad (1)

where L_k is the range of the k^{th} task parameter, \lambda_k is the corresponding autocorrelation length and C is an O(1) constant which they prove is close to 1. The numerator and denominator depend on task design and smoothness of neural dynamics respectively, so they label the term on the right-hand side neural task complexity (NTC). This terminology was a source of confusion among some of us as it appears to downplay the fundamental role of the neural circuit properties in restricting the dimensionality, but its intended meaning is pretty clear if you read the paper.
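To get a feel for equation (1), here is a minimal sketch that evaluates its right-hand side. The function name, the choice C = 1 and the example task values are my own assumptions for illustration, not numbers from the paper.

```python
import math

def neural_task_complexity(ranges, autocorr_lengths, C=1.0):
    """Right-hand side of Eq. (1): C * prod(L_k) / prod(lambda_k)."""
    if len(ranges) != len(autocorr_lengths):
        raise ValueError("need one autocorrelation length per task parameter")
    return C * math.prod(ranges) / math.prod(autocorr_lengths)

# Hypothetical task with K = 2 parameters:
#   time:        range 2 s,      temporal autocorrelation 0.2 s
#   orientation: range pi rad,   autocorrelation width pi/8 rad
ntc = neural_task_complexity([2.0, math.pi], [0.2, math.pi / 8])
print(f"NTC bound on dimensionality: {ntc:.1f}")
```

Each task parameter contributes a factor L_k/\lambda_k (here 10 and 8), so the bound grows multiplicatively with the number of task parameters.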

To derive NTC, the authors assume that the neural response is stationary in the task parameters and the joint autocorrelation function is factorisable as a product of individual task parameters’ autocorrelation functions, and then show that the above bound becomes weak when these assumptions do not hold for the particular population being studied. The proof was also facilitated in part by a clever choice of the definition of dimensionality: ‘participation ratio’ = \left(\sum_i \mu_i\right)^2 / \left(\sum_i \mu_i^2\right), where \mu_i are the eigenvalues of the neuronal covariance matrix, instead of the more common but analytically cumbersome measure based on ‘fraction x of variance explained’ = \min \left\{ D : \left(\sum_{i=1}^{D} \mu_i\right) / \left(\sum_i \mu_i\right) \geq x \right\}. They also demonstrate that their choice is reasonable.
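The two dimensionality measures are easy to compare numerically. The sketch below is my own illustration (function names and the example spectrum are assumptions); it takes covariance eigenvalues as input and returns both measures.

```python
import numpy as np

def participation_ratio(eigvals):
    """(sum_i mu_i)^2 / (sum_i mu_i^2) for covariance eigenvalues mu_i."""
    mu = np.asarray(eigvals, dtype=float)
    return mu.sum() ** 2 / np.sum(mu ** 2)

def variance_explained_dim(eigvals, x=0.9):
    """Smallest D such that the top-D eigenvalues explain a fraction >= x."""
    mu = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    cum = np.cumsum(mu) / mu.sum()
    return int(np.searchsorted(cum, x) + 1)

# Hypothetical eigenvalue spectrum that halves at every step.
mu = 0.5 ** np.arange(20)
print("participation ratio:", participation_ratio(mu))
print("dims for 90% variance:", variance_explained_dim(mu, 0.9))
```

For a flat spectrum of n equal eigenvalues the participation ratio equals n, and for a single nonzero eigenvalue it equals 1, so it behaves like a smooth count of the dominant dimensions.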

Much of the discussion in our journal club centred on whether equation (1) is just circular reasoning, and whether we really gain any new insight from this theory. This view was somewhat understandable because the authors introduce the paper by promising to present a theory that explains the origin of the simplicity betrayed by the low dimensionality of neural recordings… only to show us that it emerges from the specific way in which neural populations respond (smooth dynamics \approx large denominator) to specific tasks (low complexity \approx small numerator). Although this result may seem qualitatively trivial, the strength of their work lies in making our intuitions precise and packaging them in the form of a compact theorem. Moreover, as shown later in the paper, knowing this bound on dimensionality can be practically helpful in determining how many neurons to record. Before discussing that aspect, I’d like to dwell briefly on a potentially interesting corollary and a possible extension of the above theorem.

Based on the above theorem, one can identify three regimes of dimensionality for a recording size of M neurons:
(i) D\approx M;\ D\ll NTC
(ii) D\approx NTC;\ D\ll M
(iii) D\ll M;\ D\ll NTC

The first two regimes are pretty straightforward to interpret. (i) implies that you might not have sampled enough neurons, while (ii) means that the task was not complex enough to elicit richer dynamics. The authors call (iii) the most interesting and say ‘Then, and only then, can one say that the dimensionality of neural state space dynamics is constrained by neural circuit properties above and beyond the constraints imposed by the task and smoothness of dynamics alone’. What could those properties be? Here, it is worth noting that their theory takes the speed of neural dynamics into account, but not the direction. Recurrent connections, for example, might prevent the neural trajectory from wandering in certain directions thereby constraining the dimensionality. Such constraints may in fact lead to nonstationary and/or unfactorisable neuronal covariance, violating the conditions that are necessary for dimensionality to approach NTC. Although this is not explicitly discussed, they simulate a non-normal network to demonstrate that its dimensionality is reduced by recurrent amplification. So I guess it must be possible to derive a stronger theorem with a tighter bound on neural dimensionality by incorporating the influence of the strength and structure of connections between neurons.

NTC is a bound on the size of the linear subspace within which neural activity is mostly confined. But even if NTC is small, it is not clear whether we can accurately estimate the neural trajectory within this subspace simply by recording M neurons such that M\gg NTC. After all, M is still only a tiny fraction of the total number of neurons in the population N. To explore this, the authors use the theory of random projection and show that it is possible to achieve some desired level of fractional error \epsilon in estimating the neural trajectory by ensuring:

\displaystyle M(\epsilon)=K\left[O(\log NTC)\ +\ O(\log N)\ +\ O(1)\right]\ \epsilon^{-2} \qquad \qquad (2)


where K is the number of task parameters. This means that the demands on the size of the neural recording grow only linearly in the number of task parameters and logarithmically (!!) in both NTC and N. Equation (2) holds as long as the recorded sample is statistically homogeneous with the rest of the neurons, a condition that is guaranteed for most higher brain areas provided the sampling is unbiased, i.e. the experimenter does not cherry-pick which neurons to record/analyse. The authors encourage us to use their theorems to obtain back-of-the-envelope estimates of recording size and to guide experimental design. This is easier said than done, especially when studying a new brain area or when designing a completely new task. Nevertheless, their work is likely to push the status quo in neuroscience experiments by encouraging experimentalists to move boldly towards more complex tasks without radically revising their approach to neural recordings.
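The intuition behind the random projection argument can be checked with a toy simulation. This is a sketch under my own assumptions (a synthetic low-dimensional trajectory embedded in N hypothetical neurons, uniform random subsampling), not the authors’ derivation, and it does not reproduce the constants in equation (2). It measures the worst-case fractional distortion of pairwise distances along the trajectory when only M neurons are recorded.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 2000, 10, 200      # hypothetical: neurons, latent dimensions, time points

# Synthetic population trajectory confined to a D-dimensional subspace.
latents = rng.standard_normal((T, D))
mixing = rng.standard_normal((D, N)) / np.sqrt(N)
X = latents @ mixing         # shape (T, N)

def pairwise_dists(Y):
    """Euclidean distances between all pairs of rows (upper triangle only)."""
    G = Y @ Y.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
    return np.sqrt(np.maximum(sq, 0.0))[np.triu_indices(len(Y), k=1)]

def max_distortion(X, M, rng):
    """Worst-case fractional error in pairwise distances after recording M neurons."""
    idx = rng.choice(X.shape[1], size=M, replace=False)
    Y = X[:, idx] * np.sqrt(X.shape[1] / M)   # rescale so squared norms match on average
    return np.max(np.abs(pairwise_dists(Y) / pairwise_dists(X) - 1.0))

for M in (25, 100, 400, 1600):
    print(M, round(max_distortion(X, M, rng), 3))
```

In this toy setting the distortion shrinks as M grows even though M stays well below N, which is the qualitative behaviour that equation (2) makes precise.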


Rats optimally accumulate and discount evidence in a dynamic environment

This link will take you to the presentation slides used in our journal club. The presented paper’s preprint is available here.

Normative models of evidence accumulation have proven useful in understanding behavioral and neural data in perceptual decision making tasks.  They allow us to understand how humans and animals use the available information to decide between alternatives. However, even for relatively simple tasks, normative models can be quite complex. A lot of recent work therefore aims to uncover when animals can learn the structure of a task, and use the available evidence in a way that is consistent with a normative model, and what computations they perform to do so. The paper under discussion presents the results of an experimental study in which rats perform a two-alternative forced choice task requiring the integration of sensory evidence. Importantly, the correct choice is not constant in time.

The stimulus consisted of two trains of auditory clicks, presented simultaneously, one to each of the rat’s ears. The clicks were produced according to an inhomogeneous Poisson process with two possible instantaneous rates: r_1, r_2. The state of the environment was defined as the assignment of a rate to a specific ear. The experimenters forbade the assignment of the same rate to both ears. Thus at any instant in time, the environment was either in state S^1, meaning that the higher click rate was presented to the right ear and the lower rate to the left ear, or in state S^2, for the opposite assignment. The environment evolved as a telegraph process, alternating between the two states S^1 and S^2 with hazard rate h. Once prompted, a rat had to choose between two reward ports. If it entered the correct port (the one on the side of the higher rate), it received a reward.

The study followed the experimental setup of Brunton et al. 2013 and Erlich et al. 2015, and was inspired by the theoretical framework of recent Bayes-optimal inference algorithms described in Veliz-Cuba et al. 2016 and Glaze et al. 2015 (a mathematical model from Brunton et al. 2013 is also revisited). The stated aim was to:

probe whether rodents can optimally discount evidence by adapting the timescale over which they accumulate it.

The authors reported the following main results:
  1. Optimal timescale for evidence discounting depends on both:
    1. environment volatility (the hazard rate)
    2. noise in sensory processing (modeled as the probability of mislocalizing a click)
  2. Rats accumulate evidence almost optimally, if both variables above are considered.
  3. Rats adapt their integration timescale to the volatility of the environment.
  4. The authors’ model makes quantitative predictions about timing of changes of mind.
  5. Overall, the paper establishes a quantitative behavioral framework to study adaptive evidence accumulation.
The first result above is derived mathematically. The optimal evidence accumulation equation is
\displaystyle \frac{da}{dt} = \delta_{R,t}-\delta_{L,t}-\frac{2h}{\kappa}\sinh(\kappa a)
where
a               is the accumulated evidence in units of clicks (\kappa a is the log posterior-odds ratio)
\delta_{R,t},\ \ \delta_{L,t}  are the right and left auditory click trains (sums of delta functions)
h               is the hazard rate, or volatility of the environment
\kappa               is the click reliability parameter. It indicates how much evidence a single click provides.

The standard formula for the click reliability parameter is (assuming r_1>r_2), 
\displaystyle \kappa = \log\frac{r_1}{r_2}
Sensory noise is modeled by a probability, n, of a click being mislocalized, changing click reliability into: 
 \displaystyle \kappa=\log\frac{r_1\cdot(1-n)+r_2\cdot n}{r_2\cdot(1-n)+r_1\cdot n} 

Thus, sensory noise has the effect of reducing the distance between the two click rates (the numerator and denominator of the above fraction become the effective click rates to the rat), thereby increasing the difficulty of the task. In the supplementary material, the section ‘Sensory noise parameterization details’ analyzes other types of sensory noise, such as the possibility of missing some clicks.
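The effect of mislocalization on click reliability is easy to tabulate. The sketch below implements the two formulas above; the function name and the example rates (38 Hz and 2 Hz) are assumptions chosen for illustration.

```python
import math

def click_reliability(r1, r2, n=0.0):
    """kappa = log of the ratio of effective click rates, with mislocalization probability n."""
    eff_high = r1 * (1 - n) + r2 * n   # effective rate on the high-rate side
    eff_low = r2 * (1 - n) + r1 * n    # effective rate on the low-rate side
    return math.log(eff_high / eff_low)

r1, r2 = 38.0, 2.0   # hypothetical click rates in Hz, with r1 > r2
for n in (0.0, 0.1, 0.25, 0.5):
    print(f"n = {n:4.2f}  kappa = {click_reliability(r1, r2, n):.3f}")
```

With n = 0 this reduces to \log(r_1/r_2), and with n = 1/2 the two effective rates coincide, so a click carries no evidence at all.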

To obtain the second result, the authors performed a sequence of steps. First, they noted that the optimal inference model is well approximated by a linear model of the form,

\displaystyle \frac{da}{dt} = \delta_{R,t}-\delta_{L,t}-\lambda \cdot a

The discounting rate, \lambda, is found by numerical optimization; it is the \lambda that maximizes accuracy of the observer’s choice, for given r_1,r_2,h,n and trial duration. Note that for fixed task parameters, the discounting rate depends on the sensory noise n. Using the linear model allows a straightforward interpretation of the parameter \lambda, as the discounting rate of the accumulated evidence. The authors define the inverse of the discounting rate, 1/\lambda, as the integration timescale.
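Because the linear model is a leaky integrator driven by delta functions, it can be integrated exactly: the accumulator decays exponentially between clicks and jumps by +1 or -1 at each click. The following is a minimal sketch of that idea, with hypothetical click times and function name of my own; it is not the authors’ code.

```python
import numpy as np

def accumulate_linear(right_times, left_times, lam, t_end):
    """Value of a at t_end for da/dt = delta_R - delta_L - lam * a, with a(0) = 0."""
    times = np.concatenate([right_times, left_times])
    signs = np.concatenate([np.ones(len(right_times)), -np.ones(len(left_times))])
    order = np.argsort(times)
    a, t_prev = 0.0, 0.0
    for t, s in zip(times[order], signs[order]):
        a = a * np.exp(-lam * (t - t_prev)) + s   # decay since last event, then jump
        t_prev = t
    return a * np.exp(-lam * (t_end - t_prev))    # decay from last click to trial end

# Hypothetical click times (seconds) within a 1 s trial.
right = np.array([0.10, 0.35, 0.40, 0.80])
left = np.array([0.20, 0.60])
for lam in (0.0, 2.0, 10.0):
    print(lam, accumulate_linear(right, left, lam, t_end=1.0))
```

With \lambda = 0 the result is just the click difference, while a larger \lambda weights recent clicks more heavily: each click contributes \pm e^{-\lambda (T - t_i)} to the final value.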

The second step of the analysis consisted of computing reverse correlation kernels for both the behaving rats and the best linear model, from the same stimulus set. The reverse kernel curves were computed as follows. First, the click trains from trial i, in evidence space, were smoothed with a causal Gaussian filter k(t):
\displaystyle r_i(t)=\delta_{R,t}\star k(t) - \delta_{L,t}\star k(t).
Then, subtracting the expected click rate, given the true state of the environment at each point in time, yielded the normalized variable:
\displaystyle e_i(t)=r_i(t)-\langle r(t)\,|\,S_i(t)\rangle
Finally, the excess click rate, which is the y-value of the reverse correlation curves, was computed by averaging the previous quantity over trials:
\displaystyle \text{excess-rate}(t\,|\,\text{choice})=\langle e(t)\,|\,\text{choice}\rangle
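The three steps above can be sketched in a few lines. Everything specific here is an assumption on my part: clicks are binned on a regular time grid, the causal Gaussian filter is taken to be a half-Gaussian, the expected rate given the state is estimated empirically across trials, and the demo data and the toy decision rule are synthetic.

```python
import numpy as np

def causal_gaussian_kernel(sigma, dt):
    """Half-Gaussian on t >= 0, normalized to integrate to 1."""
    t = np.arange(0.0, 4 * sigma, dt)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / (k.sum() * dt)

def reverse_correlation(right_counts, left_counts, state, choice, sigma, dt):
    """
    right_counts, left_counts: (trials, bins) click counts per time bin
    state:  (trials, bins), +1 when the right ear has the higher rate, else -1
    choice: (trials,), +1 for a right choice, -1 for a left choice
    Returns a dict mapping each choice to its excess click-rate curve.
    """
    k = causal_gaussian_kernel(sigma, dt)
    diff = right_counts - left_counts
    n_bins = diff.shape[1]
    # Causal smoothing: the output at bin t uses only clicks at bins <= t.
    r = np.array([np.convolve(row, k)[:n_bins] for row in diff])
    # Subtract the mean smoothed rate conditioned on the state in each bin.
    e = np.empty_like(r)
    for s in (1, -1):
        mask = state == s
        mean_s = np.nanmean(np.where(mask, r, np.nan), axis=0)
        e[mask] = (r - mean_s)[mask]
    return {c: e[choice == c].mean(axis=0) for c in (1, -1)}

# Synthetic demo: telegraph state, Poisson clicks, toy observer.
rng = np.random.default_rng(1)
trials, bins, dt, h = 400, 50, 0.01, 1.0          # hypothetical values
flips = rng.random((trials, bins)) < h * dt
state = rng.choice([-1, 1], size=trials)[:, None] * (-1) ** np.cumsum(flips, axis=1)
rate_right = np.where(state == 1, 22.0, 18.0)     # hypothetical rates in Hz
R = rng.poisson(rate_right * dt)
L = rng.poisson((40.0 - rate_right) * dt)
choice = np.where((R - L).sum(axis=1) >= 0, 1, -1)  # toy rule: total click difference
kernels = reverse_correlation(R, L, state, choice, sigma=0.02, dt=dt)
print(kernels[1].mean(), kernels[-1].mean())
```

The toy observer weights all clicks equally, so its kernels should be roughly flat in time; replacing the decision rule with a leaky accumulator would tilt them towards the end of the trial.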

As a third step, the authors verified that fitting an exponential, \displaystyle ae^{bt}, to a reverse correlation curve obtained from the linear model allowed them to back out the true discounting rate that was used to generate it. That is, after the fit, they confirmed that b is close to \lambda. This justified the application of the same procedure to the reverse kernel curves obtained from rat behavior. The authors found that the discounting rates backed out from the reverse kernel curves obtained from rat behavior and from the linear model are close to each other, provided that the sensory noise value reported in Brunton et al. 2013 is factored into the linear model. No quantitative measure of ‘closeness’ is provided (see figure 4B in the paper).
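As a sanity check on this logic: in the linear model a click at time t contributes e^{-\lambda (T - t)} to the accumulator at the end of a trial of length T, which is proportional to e^{\lambda t}, so an exponential fit should return b \approx \lambda. The sketch below assumes a least-squares fit on the logarithm of a noiseless, positive curve; the paper’s actual fitting procedure may differ.

```python
import numpy as np

def fit_exponential(t, y):
    """Fit y ~ a * exp(b * t) by linear regression on log(y); requires y > 0."""
    b, log_a = np.polyfit(t, np.log(y), 1)   # slope first, then intercept
    return np.exp(log_a), b

# Hypothetical noiseless kernel from a linear model with discounting rate lam.
lam, T = 3.0, 1.0
t = np.linspace(0.0, T, 50)
kernel = 0.8 * np.exp(-lam * (T - t))
a_hat, b_hat = fit_exponential(t, kernel)
print("recovered discounting rate:", b_hat)
```

With noisy kernels estimated from a finite number of trials, a nonlinear least-squares fit would be more robust than the log transform, since the curve can dip to zero or below.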

In addition to the analysis described above, the authors fit (via Maximum Likelihood Estimation) a more detailed evidence accumulation model to each rat, in order to investigate the difference in sensory noise and integration timescale between individuals. The model from Brunton et al. 2013, Hanks et al. 2015 and Erlich et al. 2015 was revisited, with the absorbing decision boundaries removed. The equations are,
\displaystyle da = (\delta_{R,t}\cdot\eta_R\cdot C - \delta_{L,t}\cdot\eta_L\cdot C)dt -\lambda\cdot adt+ \sigma_a dW,
\displaystyle \frac{dC}{dt} = \frac{1-C}{\tau_\phi}+(\phi-1)\cdot C\cdot(\delta_{R,t}+\delta_{L,t}),
where the additional variables are described below:
\eta_R, \ \eta_L   multiplicative Gaussian sensory noise applied to each click (really to the jump in evidence at each click)
C            adaptation variable that rescales the clicks
\sigma_a           magnitude of the constant additive Gaussian noise (variance \sigma_a^2 per unit time)
\phi            adaptation strength
\tau_{\phi}            adaptation time constant
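The adaptation equation is easiest to understand event by event. In the sketch below I assume the delta-function term is read as an instantaneous jump C \to \phi C at each click, with exponential relaxation back to 1 between clicks; the parameter values are hypothetical.

```python
import numpy as np

def adaptation_trace(click_times, phi, tau_phi):
    """Return C just before each click, i.e. the gain applied to that click."""
    C, t_prev = 1.0, None
    values = []
    for t in np.sort(click_times):
        if t_prev is not None:
            # Between clicks: dC/dt = (1 - C) / tau_phi, solved exactly.
            C = 1.0 + (C - 1.0) * np.exp(-(t - t_prev) / tau_phi)
        values.append(C)
        C = phi * C          # jump of (phi - 1) * C at the click
        t_prev = t
    return np.array(values)

# Hypothetical regular click train (25 ms apart) with depressing adaptation.
clicks = np.arange(0.0, 0.5, 0.025)
print(adaptation_trace(clicks, phi=0.7, tau_phi=0.1).round(3))
```

With \phi < 1 closely spaced clicks are progressively down-weighted (depression), with \phi > 1 they are up-weighted (facilitation), and \phi = 1 switches adaptation off.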
The upshot of this second model analysis was that:
  1. The best fit discounting rate parameter, \lambda, is compared, for each rat, to the values of \lambda obtained on another cohort, in a static environment case, in Brunton et al. 2013. A clear separation between the two cohorts is apparent, indicating that rats in the dynamic environment tend to have much shorter integration time scales.
  2. With the previous linear model, a relationship between the best \lambda and the theoretical level of sensory noise, n, was numerically explored. Here, the sensory noise of each rat is estimated from the model parameters (see section ‘Calculating noise level from model parameters’ in the paper). The pairs (\lambda, n) for each rat lie slightly off the theoretical curve from the linear model (Fig 5C in paper). The authors find that this is due to additional constraints generated by the more detailed model. Given the other parameters from the detailed model, the authors conclude that the rats still use the best possible discounting rate, for a given level of sensory noise.

The third result was only established in a preliminary fashion in the paper, insofar as the authors only tested 3 rats from their cohort. Each of these three rats underwent three consecutive experimental phases (each phase lasting at least 25 daily sessions), with the environmental hazard rate taking on the values 0.5 Hz, 0 Hz and 0.5 Hz, respectively. In other words, each rat underwent a phase in a dynamic environment, followed by a phase in a static environment, and then another phase in a dynamic environment. The reverse correlation curves display a dramatic change in shape between the dynamic and the static phases. As expected from an adaptive decision maker, the reverse kernel curves are fairly flat in the static environment phase (indicating equal weighting of the evidence along the trial duration), but show a decay in the weighting of old evidence in the dynamic environment case.

The authors do not present any analysis of the fourth result but point out that it is a potential application of their model.

In conclusion, we believe that the dynamic clicks task experiment described in this paper is key for the study of adaptive evidence accumulation. The reported evidence that some rats are able to change their evidence discounting rate according to the environment’s volatility is convincing. On the theoretical side, we wonder whether additional models could produce similar reverse correlation curves, and this could represent a route for further research projects.


Brunton, B. W., Botvinick, M. M., and Brody, C. D. (2013). Rats and humans can optimally accumulate evidence for decision-making. Science, 340(6128):95–98.

Erlich, J. C., Brunton, B. W., Duan, C. A., Hanks, T. D., and Brody, C. D. (2015). Distinct effects of prefrontal and parietal cortex inactivations on an accumulation of evidence task in the rat. eLife, 4:e05457.

Glaze, C. M., Kable, J. W., and Gold, J. I. (2015). Normative evidence accumulation in unpredictable environments. eLife, 4:e08825.

Veliz-Cuba, A., Kilpatrick, Z. P., and Josic, K. (2016). Stochastic models of evidence accumulation in changing environments. SIAM Review.

Predictive coding and Bayesian inference in the brain

The paper [1] offers a comprehensive review of the progress on predictive coding and Bayesian inference. In terms of Marr’s three levels [2], Bayesian inference sits at the computational level and predictive coding at the representational level. As discussed in the paper, the two concepts can be developed alone or combined with each other. Predictive coding can be a useful neural representational motif for multiple computational goals (e.g. efficient coding, membrane potential dynamics optimization and reinforcement learning). The brain can also perform Bayesian inference with other neural representations (e.g. probability coding, log-probability coding, and direct variable coding). The authors observe that more experiments need to be designed to offer direct evidence about such representations.

The term predictive coding is used to describe very different approaches in the neuroscience literature [3]. The paper takes it to mean that neural responses represent prediction errors. Alternatively, it has been defined as neurons preferentially encoding stimuli that carry information about the future [4]. The two definitions are not necessarily consistent with each other, since neurons representing errors might not be predictive about the future.

Bayesian inference uses the sensory input to compute the posterior over latent variables. Based on this posterior, the brain might pick an ideal point estimate about the latent. This preferred latent could then be used to compute a prediction about input.

Predictive coding doesn’t specify how to generate predictions, whereas Bayesian inference offers a natural way to compute them. It seems natural to combine the two to create Bayesian predictive coding. In [5], a hierarchical model of the visual system is built in which neurons carry the prediction errors computed from bottom-up inputs and top-down predictions. Several experimental results (e.g. surround suppression) can be explained by this hierarchical Bayesian predictive coding model. But this model also uses neurons to represent the predictions and latent variables, so it is more like a hybrid of Bayesian predictive coding and direct variable coding. We then debated in our journal club whether pure predictive coding is enough to represent the brain’s inference process. Mathematically, this can be understood as follows. If the prior and likelihood are both Gaussian, maximizing the posterior with respect to the weights is equivalent to minimizing a sum-of-squares error function. If the brain uses gradient descent to perform learning, the gradient of the sum-of-squares error depends on the prediction errors. This means the brain needs to represent the prediction errors if it is doing optimal Bayesian inference under Gaussian assumptions. However, the brain might perform suboptimal Bayesian inference, and the distribution of inputs can also be non-Gaussian. Thus, in general, representing only prediction errors is not enough for the brain to perform Bayesian inference.
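For concreteness, here is the Gaussian argument written out. The notation is my own assumption rather than the paper’s: inputs x_n, observations y_n, weights w, prediction f(x_n; w), noise variance \sigma^2 and prior variance \sigma_w^2. Note that the Gaussian prior contributes a quadratic penalty on the weights in addition to the sum-of-squares term.

```latex
% Assumed notation (not from the paper): inputs x_n, observations y_n,
% weights w, prediction f(x_n; w), noise variance \sigma^2, prior variance \sigma_w^2.
\begin{aligned}
p(y_n \mid x_n, w) &= \mathcal{N}\!\left(y_n \mid f(x_n; w), \sigma^2\right), \qquad
p(w) = \mathcal{N}\!\left(w \mid 0, \sigma_w^2 I\right) \\
-\log p\!\left(w \mid \{x_n, y_n\}\right) &= \frac{1}{2\sigma^2} \sum_n \left(y_n - f(x_n; w)\right)^2
  + \frac{1}{2\sigma_w^2} \lVert w \rVert^2 + \text{const} \\
\nabla_w \left[-\log p\!\left(w \mid \{x_n, y_n\}\right)\right] &= -\frac{1}{\sigma^2} \sum_n
  \underbrace{\left(y_n - f(x_n; w)\right)}_{\text{prediction error } e_n} \nabla_w f(x_n; w)
  + \frac{1}{\sigma_w^2}\, w
\end{aligned}
```

The data enter the gradient only through the prediction errors e_n, which is why gradient-based learning under these assumptions requires some representation of them. With a non-Gaussian likelihood the first term is no longer a simple function of e_n.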

One advantage of predictive coding is that the brain can save representational resources by representing only the prediction error. However, this idea doesn’t account for the representation of the prediction itself: the brain must still spend neural resources to represent the prediction. The efficient coding framework [6] claims that neural representations maximize the mutual information between responses and stimuli under some constraints. At high signal-to-noise ratios, the efficient representation coincides with the representation learned from the predictive coding principle [7,8]. However, when the signal-to-noise ratio is low, efficient coding says that preserving the full stimulus input is more beneficial than preserving just the prediction error [7,8]. Thus the predictive coding motif is not always optimal under the efficient coding framework.

In summary, we agreed with the conclusions of this paper. More experiments need to be designed to offer conclusive answers about how the brain performs Bayesian inference. We also can’t rule out the possibility that the brain uses multiple representations. Unifying such possible representations in a more general theoretical framework would be worth the future efforts of computational neuroscientists.

[1]. Aitchison L, Lengyel M. With or without you: predictive coding and Bayesian inference in the brain. Current Opinion in Neurobiology, 2017, 46: 219-227.

[2]. Marr D: Vision: a computational investigation into the human representation and processing of visual information. WH San Francisco: Freeman and Company; 1982.

[3]. Chalk M, Marre O, Tkacik G. Towards a unified theory of efficient, predictive and sparse coding. bioRxiv, 2017: 152660.

[4]. Bialek W, Van Steveninck R R D R, Tishby N. Efficient representation as a design principle for neural coding and computation[C]//Information Theory, 2006 IEEE International Symposium on. IEEE, 2006: 659-663.

[5]. Rao RP, Ballard DH: Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci 1999, 2:79-87.

[6]. Barlow H: Possible principles underlying the transformations of sensory messages. In Sensory Communication. Edited by Rosenblith W. MIT Press; 1961:217-234.

[7]. Atick J J, Redlich A N. Towards a theory of early visual processing. Neural Computation, 1990, 2(3): 308-320.

[8]. Srinivasan M V, Laughlin S B, Dubs A. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London B: Biological Sciences, 1982, 216(1205): 427-459.

Bayesian Efficient Coding

On 15 Sep 2017, we discussed Bayesian Efficient Coding by Il Memming Park and Jonathan Pillow.

As the title suggests, the authors aim to synthesize Bayesian inference with efficient coding. The Bayesian brain hypothesis states that the brain computes posterior probabilities based on its model of the world (prior) and its sensory measurements (likelihood). Efficient coding assumes that the brain distributes its resources to optimize some objective, typically to maximize information. In particular, they note that efficient coding that optimizes mutual information is a special case of their more general framework, but ask whether other objectives based on the Bayesian posterior might better explain data.

Denoting the stimulus by x, the measurements by y, and the model parameters by \theta, they use the following ingredients for their theory: a prior p(x), a likelihood p(y|x), an encoding capacity constraint C(\theta), and a loss functional L(\cdot). They assume that the brain is able to construct the true posterior p(x|y,\theta). The goal is to find a model that optimizes the expected loss

\displaystyle \bar{L}(\theta) = \mathbb{E}_{p(y|\theta)}\left[ L\big(p(x|y,\theta)\big) \right]


under the constraint C(\theta)\leq c.

The loss functional is the key. The authors consider two things the loss might depend on: the posterior L(p(x|y)), or the ground truth L(x,p(x|y)). They needed to make the loss explicitly dependent on the posterior in order to optimize for mutual information. It was unclear whether they also considered a loss depending on both, which seems critical. We communicated with them and they said they’d clarify this in the next version.

They state that there is no clear a priori reason to maximize mutual information (or equivalently to minimize the average posterior entropy, since the prior is fixed). They give a nice example of a multiple choice test for which encodings that maximize information will achieve fewer correct answers than encodings that maximize percent correct for the MAP estimates. The ‘best’ answer depends on how one defines ‘best’.

After a few more interesting Gaussian examples, they revisit the famous Laughlin (1981) result on efficient coding in the blowfly. This was hailed as a triumph for efficient coding theory in predicting the nonlinear input-output photoreceptor curve derived directly from the measured prior over luminance. But here the authors found that a different loss function on the posterior gave a better fit. Interestingly, though, that loss function was based on a point estimate,

\displaystyle L = \mathbb{E}_{p(x|y)}\left[ \left| x-\hat{x}(y) \right|^p \right]


where the point estimate is the Bayesian optimum for this cost function and p is a parameter. The limit p\to 0 gives the familiar entropy, p=2 is the conventional squared error, and the best fit to the data was p=1/2, a “square root loss.” Although it’s hard to provide any normative explanation of why this or any other choice is best (since the loss is basically the definition of ‘best’, and you’d have to relate the theoretical loss to some real consequences in the world), it is very interesting that the efficient coding solution explains data worse than their other Bayesian efficient coding losses.
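To see how the exponent changes what counts as the ‘best’ point estimate, here is a small sketch that minimizes the expected |x - \hat{x}|^p over a discrete posterior by brute force. The posterior values and the function name are made up for illustration and are not taken from the paper.

```python
import numpy as np

def bayes_point_estimate(x, posterior, p):
    """Grid value xhat that minimizes the posterior expectation of |x - xhat|^p."""
    x = np.asarray(x, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    posterior = posterior / posterior.sum()
    cost = np.abs(x[:, None] - x[None, :]) ** p   # cost[i, j] = |x_i - xhat_j|^p
    expected = posterior @ cost                   # expected loss for each candidate xhat_j
    return x[np.argmin(expected)]

# Hypothetical skewed posterior on a coarse grid.
x = [0, 1, 2, 3, 4]
posterior = [0.40, 0.15, 0.15, 0.10, 0.20]
for p in (0.01, 0.5, 1.0, 2.0):
    print(p, bayes_point_estimate(x, posterior, p))
```

On this grid, p = 2 picks the grid point nearest the posterior mean, p = 1 picks the median, and a very small p picks the mode, so the same posterior yields different ‘optimal’ estimates depending on the loss.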

Besides the minor confusion about whether their loss does/should include the ground truth x, and some minor disagreement about how much others have done things along this line (Ganguli and Simoncelli, Wei and Stocker, whom they do cite), my biggest question is whether the cost really should depend on the posterior as opposed to a point estimate. I’m a fan of Bayesianism, but ultimately one must take a single action, not a distribution. I discussed this with Jonathan over email, and he maintained that it’s important to distinguish an action from a point estimate of the stimulus: there’s a difference between the width of the river and whether to try to jump over it. I countered that one could refer actions back to the stimulus: the river is jumpable, or unjumpable (essentially a Gibsonian affordance). In a world of latent variables, any point estimate based on a posterior is a compromise based on the loss function.

So when should you keep around a posterior, rather than a point estimate? It may be that the appropriate loss function changes with context, and so the best point estimate would change too. While one could certainly consider that to be a bigger computation to produce a context-dependent point estimate, it may be more parsimonious to just represent information about the posterior directly.