Bayesian Efficient Coding

On 15 Sep 2017, we discussed Bayesian Efficient Coding by Il Memming Park and Jonathan Pillow.

As the title suggests, the authors aim to synthesize Bayesian inference with efficient coding. The Bayesian brain hypothesis states that the brain computes posterior probabilities based on its model of the world (prior) and its sensory measurements (likelihood). Efficient coding assumes that the brain distributes its resources to optimize an objective, typically to maximize information. In particular, they note that efficient coding that optimizes mutual information is a special case of their more general framework, and they ask whether other objectives based on the Bayesian posterior might better explain data.

Denoting stimulus $x$, measurements $y$, and model parameters $\theta$, they use the following ingredients for their theory: a prior $p(x)$, a likelihood $p(y|x)$, an encoding capacity constraint $C(\theta)$, and a loss functional $L(\cdot)$. They assume that the brain is able to construct the true posterior $p(x|y,\theta)$. The goal is to find a model that optimizes the expected loss

$\bar{L}(\theta)=\mathbb{E}_{p(y|\theta)}\left[L(p(x|y,\theta))\right]$

under the constraint $C(\theta)\leq c$.
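To make the objective concrete, here is a minimal sketch for a discrete stimulus and measurement. Everything in it is an assumption for illustration rather than the authors' implementation: the function names are hypothetical, the likelihood table stands in for the encoder parameters $\theta$, the loss takes only the posterior, and the capacity constraint is left out.

```python
import math

def posterior_table(prior, likelihood):
    """Return (p_y, post), where post[j][i] = p(x_i | y_j).

    prior[i] = p(x_i); likelihood[i][j] = p(y_j | x_i, theta).
    """
    n_x, n_y = len(prior), len(likelihood[0])
    p_y = [sum(prior[i] * likelihood[i][j] for i in range(n_x)) for j in range(n_y)]
    post = [
        [prior[i] * likelihood[i][j] / p_y[j] if p_y[j] > 0 else 0.0 for i in range(n_x)]
        for j in range(n_y)
    ]
    return p_y, post

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def expected_loss(prior, likelihood, loss):
    """Average a posterior-based loss over the measurements y."""
    p_y, post = posterior_table(prior, likelihood)
    return sum(p_y[j] * loss(post[j]) for j in range(len(p_y)))

# Hypothetical binary example: a noisy and a noiseless encoder.
prior = [0.5, 0.5]
noisy = [[0.9, 0.1], [0.1, 0.9]]
noiseless = [[1.0, 0.0], [0.0, 1.0]]
loss_noisy = expected_loss(prior, noisy, entropy_bits)
loss_noiseless = expected_loss(prior, noiseless, entropy_bits)
```

With the entropy as the loss, the expected loss is the average posterior entropy, so minimizing it is the same as maximizing mutual information for a fixed prior. Swapping in a different `loss` changes what counts as the best encoder.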

The loss functional is the key. The authors consider two things the loss might depend on: the posterior, $L(p(x|y))$, or the ground truth, $L(x,p(x|y))$. They needed to make the loss explicitly dependent on the posterior in order to optimize for mutual information. It was unclear whether they also considered a loss depending on both, which seems critical. We communicated with them and they said they’d clarify this in the next version.

They state that there is no clear a priori reason to maximize mutual information (or equivalently to minimize the average posterior entropy, since the prior is fixed). They give a nice example of a multiple choice test for which encodings that maximize information will achieve fewer correct answers than encodings that maximize percent correct for the MAP estimates. The ‘best’ answer depends on how one defines ‘best’.
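A toy comparison shows how the two criteria can rank encoders differently. The numbers below are hypothetical and are not the paper's own example: four equally likely answers, one encoder that reports noiselessly which pair the answer belongs to, and one that reports the answer itself but is right only 70% of the time. The capacity constraint, which would decide whether both encoders are admissible, is ignored here.

```python
import math

def mutual_information(prior, likelihood):
    """I(x; y) in bits, with prior[i] = p(x_i) and likelihood[i][j] = p(y_j | x_i)."""
    n_x, n_y = len(prior), len(likelihood[0])
    p_y = [sum(prior[i] * likelihood[i][j] for i in range(n_x)) for j in range(n_y)]
    total = 0.0
    for i in range(n_x):
        for j in range(n_y):
            joint = prior[i] * likelihood[i][j]
            if joint > 0:
                total += joint * math.log2(likelihood[i][j] / p_y[j])
    return total

def map_accuracy(prior, likelihood):
    """Probability that the MAP guess is correct: sum over y of max over x of p(x, y)."""
    n_x, n_y = len(prior), len(likelihood[0])
    return sum(max(prior[i] * likelihood[i][j] for i in range(n_x)) for j in range(n_y))

prior = [0.25] * 4

# Encoder A: y says which pair of answers contains x, with no noise.
pairing = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Encoder B: y names the answer, correct with probability a, else uniform over the rest.
a = 0.7
noisy = [[a if i == j else (1 - a) / 3 for j in range(4)] for i in range(4)]

info_pairing, acc_pairing = mutual_information(prior, pairing), map_accuracy(prior, pairing)
info_noisy, acc_noisy = mutual_information(prior, noisy), map_accuracy(prior, noisy)
```

In this sketch the pairing encoder carries more information (1 bit) but its MAP guess is right only half the time, while the noisy encoder carries less information and is right 70% of the time.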

After a few more interesting Gaussian examples, they revisit the famous Laughlin (1981) result on efficient coding in the blowfly. This was hailed as a triumph for efficient coding theory in predicting the nonlinear input-output photoreceptor curve derived directly from the measured prior over luminance. But here the authors found that a different loss function on the posterior instead gave a better fit. Interestingly, though, that loss function was based on a point estimate,

$L(x,p(x|y))=\mathbb{E}_{p(x|y)}\left[\left|x-\hat{x}(y)\right|^p\right]$

where the point estimate $\hat{x}(y)$ is the Bayesian optimum for this cost function and $p$ is a parameter (the exponent, not to be confused with the density $p(\cdot)$). The limit $p\to 0$ gives the familiar entropy, $p=2$ is the conventional squared error, and the best fit to the data was $p=1/2$, a “square root loss.” Although it’s hard to provide any normative explanation of why this or any other choice is best (since the loss is basically the definition of ‘best’, and you’d have to relate the theoretical loss to some real consequences in the world), it is very interesting that the efficient coding solution explains data worse than their other Bayesian efficient coding losses.
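To see how the exponent shapes the Bayesian point estimate, here is a small sketch that finds the minimizer of the expected power loss by grid search over a discrete posterior. The posterior values, the function names, and the grid step are assumptions chosen for illustration, and the exponent is called `power` to avoid the clash with the density $p$.

```python
def expected_power_loss(xs, probs, estimate, power):
    """E_{p(x|y)} |x - estimate|^power for a discrete posterior."""
    return sum(q * abs(x - estimate) ** power for x, q in zip(xs, probs))

def bayes_estimate(xs, probs, power, step=0.01):
    """Grid search for the estimate that minimizes the expected power loss."""
    lo, hi = min(xs), max(xs)
    n = int(round((hi - lo) / step))
    candidates = [lo + k * step for k in range(n + 1)]
    return min(candidates, key=lambda c: expected_power_loss(xs, probs, c, power))

# Hypothetical skewed posterior over five stimulus values.
xs = [0, 1, 2, 3, 4]
probs = [0.45, 0.10, 0.10, 0.05, 0.30]

est_squared = bayes_estimate(xs, probs, power=2.0)  # posterior mean
est_absolute = bayes_estimate(xs, probs, power=1.0)  # posterior median
est_sqrt = bayes_estimate(xs, probs, power=0.5)  # lands on the mode for this posterior
```

For this particular posterior the three exponents give three different answers: roughly 1.65 (the mean), 1 (the median), and 0 (the mode). That is the sense in which any point estimate is a compromise dictated by the loss.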

Besides the minor confusion about whether their loss does/should include the ground truth $x$, and some minor disagreement about how much others have done things along this line (Ganguli and Simoncelli, Wei and Stocker, whom they do cite), my biggest question is whether the cost really should depend on the posterior as opposed to a point estimate. I’m a fan of Bayesianism, but ultimately one must take a single action, not a distribution. I discussed this with Jonathan over email, and he maintained that it’s important to distinguish an action from a point estimate of the stimulus: there’s a difference between the width of the river and whether to try to jump over it. I countered that one could refer actions back to the stimulus: the river is jumpable, or unjumpable (essentially a Gibsonian affordance). In a world of latent variables, any point estimate based on a posterior is a compromise based on the loss function.

So when should you keep around a posterior, rather than a point estimate? It may be that the appropriate loss function changes with context, and so the best point estimate would change too. While one could certainly consider that to be a bigger computation to produce a context-dependent point estimate, it may be more parsimonious to just represent information about the posterior directly.

3 thoughts on “Bayesian Efficient Coding”

Jonathan W Pillow on September 25, 2017 at 12:52 am said:

Thanks for the very nice summary and plug, Xaq!

To reply very quickly about the loss function: our intent was to say that the loss is a function of the posterior, not ONLY a function of the posterior. Thus it can depend on the posterior and other stuff (e.g. the true stimulus, as in the example cited).

One other comment: the Bayesian efficient coding framework we proposed is more general than Barlow’s classical efficient coding, because it includes that as a special case, but it is also more general than settings based on estimation errors in point estimates (e.g. MSE), since it also includes those as special cases. But it also extends to cases where loss depends on a decision or action (assuming Bayesian actions or decisions, i.e., those computed using an integral over the posterior).

(This is NOT to say the brain always keeps around a full posterior over the stimulus; BEC is just a framework for normatively optimal coding that (in our view) synthesizes and generalizes a bunch of previous work in this area.)

Thanks again for the comments, we will do our best to clarify these points in the revision!

Krešimir Josić on September 25, 2017 at 10:18 pm said:

One reason to keep more information about the posterior is for learning. If I chose an action based on a point estimate, but the posterior was broad, I may not learn much from being wrong. If I am wrong with a narrow posterior, I may want to reconsider my model of the world. It may be sufficient to keep another point estimate of “certainty” to do so. But going down this road requires keeping more and more information about the posterior.

xaq on September 25, 2017 at 10:34 pm said:

That’s a good point, Krešo.

Learning essentially induces structure at a longer timescale, such that individual trials that are independent (i.i.d. given the true parameters $\theta^*$),
$p(x_t,x_{t'}|\theta^*)=p(x_t|\theta^*)p(x_{t'}|\theta^*)$
are no longer conditionally independent given the estimated parameter $\hat{\theta}$:
$p(x_t,x_{t'}|\hat{\theta})\neq p(x_t|\hat{\theta})p(x_{t'}|\hat{\theta})$
A loss function allowing learning must now account for this longer timescale. Consequently, one can again (in principle) choose a point estimate that optimizes this new, broader loss, and this point estimate will necessarily involve evidence from multiple time points. So technically you don’t need a posterior even during learning.

But this is a much more complicated scheme than just online learning based on posteriors! Parsimony favors a posterior.
