The Dynamical Regime of Sensory Cortex: Stable Dynamics around a Single Stimulus-Tuned Attractor Account for Patterns of Noise Variability

Neurons in the cortex are characterized by irregular spiking patterns that can be correlated across the population. However, both the variability and the covariability of neuronal activity can be modulated by stimuli (Churchland et al., 2010). The reduction in variability and covariability observed experimentally has been explained using multi-attractor models (Ponce-Alvarez et al., 2013) and chaotic networks (Molgedey et al., 1992). In this paper, the authors propose an alternative explanation using a stochastic stabilized supralinear network (SSN). The main mechanism driving the modulation in this model is stabilizing inhibitory feedback.

In multi-attractor models, the network operates in a multi-stable regime; in the absence of a stimulus, noise fluctuations cause the network to wander between different attractors. This meandering gives rise to correlated variability in the activity of different cells. Upon stimulus onset, variability is suppressed as the network's activity becomes pinned to the neighborhood of a single attractor. In chaotic models, variability is due to chaotic fluctuations in activity. Certain types of stimuli can suppress such chaotic fluctuations by forcing the network dynamics onto specific trajectories, thus quenching variability across trials. While both of these models explain stimulus-induced reduction of variability, only multi-attractor models can explain the stimulus tuning of that reduction, and even then only with considerable parameter tuning or very strong noise.

To address this shortcoming, the authors turn to the stochastic SSN. The dynamics of the spiking SSN are defined by equations governing the membrane voltages of the individual neurons:

\tau_i \frac{dV_i}{dt} = -V_i(t) + V_{\text{rest}} + h_i(t) + \eta_i(t) + \sum_{j \in \text{E cells}} J_{ij}\, a_j(t) - \sum_{j \in \text{I cells}} J_{ij}\, a_j(t).

Neurons integrate their external and recurrent inputs linearly in their membrane potential, but their output (firing rate) is a nonlinear function of the voltage. Here h_i(t) is the external (stimulus) input, \eta_i(t) is input noise, and a_j(t) is a low-pass-filtered version of the spike train of neuron j. Neuron j generates a spike in each time bin dt with instantaneous probability dt \times f(V_j), where f(V_j) is that neuron's instantaneous firing rate.
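
For concreteness, here is a minimal sketch of one simulation step of these dynamics (this is not the authors' code; the constants, the noise scaling, and the parameter values are illustrative assumptions). It advances the voltages by one Euler step, draws Bernoulli spikes with probability dt * f(V), and low-pass filters the spike trains to obtain a_j(t):

    import numpy as np

    # One Euler step of the spiking SSN voltage equation (illustrative constants,
    # not the paper's values). J carries signed weights: positive entries in
    # columns of E presynaptic cells, negative entries in columns of I cells.
    def f(V, k=0.3, V_th=-70.0):
        """Threshold-quadratic instantaneous firing rate (Hz)."""
        return k * np.maximum(V - V_th, 0.0) ** 2

    def step(V, a, J, h, rng, dt=1e-3, tau_m=20e-3, tau_s=5e-3,
             V_rest=-70.0, sigma=2.0):
        # White-noise input, scaled so its integrated variance is independent of dt.
        eta = sigma * np.sqrt(tau_m / dt) * rng.standard_normal(V.shape)
        dV = (-V + V_rest + h + eta + J @ a) * dt / tau_m
        V = V + dV
        spikes = rng.random(V.shape) < dt * f(V)    # spike with probability dt * f(V)
        a = a + (-a + spikes / dt) * dt / tau_s     # low-pass filter of the spike train
        return V, a, spikes

Iterating this step over many trials, and comparing the across-trial variance of V before and after a step increase in the input h, is the basic experiment analyzed in the rest of the paper.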

The choice of the nonlinearity f translating voltage into spikes is the crucial part of the model. As shown in earlier work by the authors (Ahmadian et al., 2013), this function needs to be supralinear, and here a threshold-quadratic function is chosen. To understand the mechanics of variability modulation in SSNs, the authors linearize the dynamics around the input-dependent fixed point and obtain a Schur decomposition of the Jacobian matrix. This allows them to show analytically that variability increases for weak inputs, due to weak recurrent inhibitory self-coupling and strong balanced amplification, and decreases for strong inputs, as the supralinear growth of recurrent inhibition stabilizes the fixed point and damps fluctuations. This mechanism can already be understood in a network of two units representing the excitatory (E) and inhibitory (I) subpopulations: weak inputs increase across-trial variability, whereas strong inputs suppress it. Crucially, this modulation of variability requires recurrent interactions; a feed-forward circuit is not enough.
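
To make the two-unit picture concrete, here is a minimal numerical sketch (not the authors' code; the weights, the gain constant k, and the treatment of the noise are all illustrative assumptions). It finds the input-dependent fixed point of a reduced two-population SSN, computes the input-dependent gains, and estimates the E-unit voltage variance by treating the input noise as slow relative to the membrane time constant, so that voltage fluctuations approximately follow the network's static susceptibility:

    import numpy as np
    from scipy.optimize import fsolve

    # Reduced two-population SSN: tau dv/dt = -v + h + W f(v), with f(v) = k [v]_+^2.
    # Weights and k are illustrative, not the paper's values.
    k = 0.04
    W = np.array([[1.0, -1.0],    # onto E: from E, from I
                  [1.5, -1.0]])   # onto I: from E, from I

    f  = lambda v: k * np.maximum(v, 0.0) ** 2       # threshold-quadratic rate
    fp = lambda v: 2.0 * k * np.maximum(v, 0.0)      # gain (slope) at the fixed point

    def voltage_cov(h, noise_cov=np.eye(2)):
        v_star = fsolve(lambda v: -v + h + W @ f(v), np.zeros(2))
        G = np.diag(fp(v_star))                      # input-dependent gains
        M = np.linalg.inv(np.eye(2) - W @ G)         # static susceptibility (I - W_eff)^(-1)
        # For noise slower than the membrane time constant, voltage fluctuations
        # approximately follow the susceptibility: Cov(V) ~ M Cov(eta) M^T.
        return M @ noise_cov @ M.T

    for h in (1.0, 10.0, 40.0):                      # weak, intermediate, strong input
        S = voltage_cov(np.array([h, h]))
        print(f"h = {h:5.1f}   Var(V_E) ~ {S[0, 0]:.2f}")

With these particular weights the printed variance rises from weak to intermediate input and falls again at strong input; the exact numbers depend entirely on the assumed parameters, but the generic ingredient is that the gains f'(V*), and hence the effective connectivity, grow with input strength.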

The insights obtained from the two-population model hold more generally. The authors next considered a network of 5,000 neurons (80% E and 20% I), randomly connected with low probability and with synaptic weights inherited from the population model (any connection from a neuron in population b onto a neuron in population a has the same strength, for a, b = E, I). In this network, variability suppression occurred at the single-cell level (membrane voltage) as well as at the population level (LFP and rates). The model primarily suppressed shared rather than private variability, as observed experimentally (Churchland et al., 2010). This is because the effectively mean-field connectivity couples the dynamics to network-wide patterns of activity across E and across I cells; these patterns are reshaped by the stimulus-induced changes in effective connectivity, and they carry substantial variability because the noise is correlated. Hence, the shared E and I patterns behaved like the units of the two-population model, and suppression of their variability also suppressed covariability. In addition, the model accounts for the stimulus-induced modulation of the power spectrum and cross-coherence of LFP and single-cell voltage seen experimentally in visual area V1 of awake monkeys.

Since the network was randomly connected and neurons were not stimulus selective, variability was suppressed uniformly across the population. Experiments, however, show that variability suppression depends on stimulus tuning. The authors therefore extended the network to a ring architecture, with neurons around the ring having different preferred orientations (PO). In accord with previous models of this type (Ponce-Alvarez et al., 2013; Lombardo et al., 2015; Lin et al., 2015), neurons with similar tuning were connected more strongly than neurons with a large difference in PO. In this network, a bump of activity is present only while a stimulus is shown; unlike in the attractor models above, it does not persist as a short-term memory of the stimulus. The dynamics of this model agree well with data from V1 of awake monkeys: the model displays variability suppression at the single-cell and population levels, a drop in the Fano factor, and a U-shaped tuning of variability suppression with stimulus orientation in both Fano factors and membrane potentials. As before, shared variability was suppressed more than private variability, here because the spatially smooth connectivity profile gives rise to regular patterns of activity across the population.
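
The ring connectivity itself is simple to write down. The sketch below shows the kind of orientation-dependent kernel such a model uses, together with a tuned external input for a stimulus at 90 degrees; the Gaussian profile, widths, and weight scales are illustrative assumptions, not the paper's parameters:

    import numpy as np

    # Ring of N neurons with preferred orientations (POs) tiling 180 degrees.
    N = 180
    po = np.linspace(0.0, 180.0, N, endpoint=False)        # preferred orientations (deg)
    d_po = np.abs(po[:, None] - po[None, :])                # pairwise PO differences
    d_po = np.minimum(d_po, 180.0 - d_po)                   # wrap around the ring

    def ring_kernel(j0, sigma_deg=30.0):
        """Connectivity that is strongest between similarly tuned neurons."""
        return j0 * np.exp(-d_po ** 2 / (2.0 * sigma_deg ** 2))

    W_EE = ring_kernel(1.0)        # E -> E (excitatory)
    W_IE = ring_kernel(1.2)        # E -> I
    W_EI = -ring_kernel(0.8)       # I -> E (inhibitory, hence the sign)
    W_II = -ring_kernel(0.6)       # I -> I

    # Tuned external input for a stimulus at 90 degrees (illustrative amplitude/width).
    d_stim = np.abs(po - 90.0)
    d_stim = np.minimum(d_stim, 180.0 - d_stim)
    h = 20.0 * np.exp(-d_stim ** 2 / (2.0 * 30.0 ** 2))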

The authors then examined the structure of the noise variability that is quenched after stimulus onset. They showed that most of the shared variability arose from trial-to-trial fluctuations in the location and width of the bump of activity. Each of these small transformations produces a characteristic pattern of deviation of network activity from the mean bump, and the two patterns contribute two distinct spatial covariance templates which, taken together, accounted for 87\% of the structure in the full covariance matrix of the network. Hence, bump kinematics correctly predicted membrane potential (co)variances.
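
The two covariance templates can be constructed directly from a parametric bump. The sketch below (with an illustrative Gaussian bump shape and illustrative jitter magnitudes, not the paper's) builds the "location" and "width" patterns as derivatives of the mean bump with respect to its center and width, and forms the rank-two covariance they predict:

    import numpy as np

    # Mean bump of activity over preferred orientation relative to the stimulus (deg).
    x = np.linspace(-90.0, 90.0, 181)
    mu, sigma = 0.0, 25.0
    bump = np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

    # Small fluctuations in bump location and width move activity away from the
    # mean bump along two characteristic patterns: the derivatives of the bump
    # with respect to its center and its width.
    u_loc   = (x - mu) / sigma ** 2 * bump              # d(bump)/d(mu)
    u_width = (x - mu) ** 2 / sigma ** 3 * bump         # d(bump)/d(sigma)

    # If location and width jitter independently across trials, the predicted
    # population covariance is a rank-two combination of the two templates.
    s_loc, s_wid = 3.0, 2.0                             # illustrative jitter (deg)
    cov_pred = (s_loc ** 2 * np.outer(u_loc, u_loc)
                + s_wid ** 2 * np.outer(u_width, u_width))

Comparing cov_pred with the empirical covariance matrix (for example, by projecting the latter onto these two templates) is the spirit of the analysis summarized above.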

As mentioned above, chaotic models cannot account for stimulus-tuned changes in variability because their neurons are not stimulus tuned. The authors therefore compared the SSN to the multi-attractor model, to determine which mechanism better explains the suppression of variability seen in experiments. They found that multi-attractor networks show a more limited repertoire of variability patterns. In the SSN, the contributions of fluctuations in bump location and width largely cancel for orthogonally tuned cells, producing very weak modulation of their covariability, and a much shallower modulation between similarly tuned cells. The attractor model did not display this cancellation even for orthogonally tuned cells; as a result, correlations between orthogonally tuned cells were modulated as strongly as those between similarly tuned cells, in disagreement with experimental data.

Lastly, the authors explored the temporal dynamics of variability modulation, which can potentially distinguish the three mechanisms. They measured the timescales on which suppression and recovery of variability take place. In the SSN, suppression and recovery were fastest, occurring on the same timescale as the membrane time constant, in agreement with experiments. In contrast, chaotic networks were 4-15 times slower, and the multi-attractor network was at least 20 times slower than the single-cell time constant.

In conclusion, the authors propose a robust model (the SSN) that captures key aspects of variability modulation, such as stimulus-induced quenching, stimulus-tuned quenching, and realistic timescales of suppression and recovery, without precise fine-tuning of parameters or large amounts of noise. The main insight is that in SSNs the supralinear dependence of firing on input increases the effective connectivity with increasing input, which in turn modulates the variability and covariability of responses. Two effects explain this modulation: balanced amplification, which amplifies variability through E-I interactions and dominates at weak inputs; and inhibitory feedback, which quenches variability by generating strong inhibitory input and dominates at strong inputs. In combination, these mechanisms robustly reproduce the experimentally observed spatial and temporal patterns of variability quenching and modulation.

References:

[1] Ahmadian, Y., Rubin D.B., and Miller K.D. (2013). Neural Comput. 25, 1994-2037.
[2] Churchland, M.M., Yu, B.M., Cunningham, J.P., Sugrue, L.P., Cohen, M.R., Corrado, G.S., Newsome, W.T., Clark, A.M., Hosseini, P., Scott, B.B., et al. (2010). Nat. Neurosci. 13, 369-378.
[3] Lin, I.-C., Okun, M., Carandini, M., and Harris, K.D. (2015). Neuron 87, 644-656.
[4] Lombardo, J., Macellaio, M., Liu, B., Osborne, L.C., and Palmer, S.E. (2015). In 2015 Neuroscience Meeting Planner (online) (Washington, DC: Society for Neuroscience).
[5] Molgedey, L., Schuchhardt, J., and Schuster, H.G. (1992). Phys. Rev. Lett. 69, 3717-3719.
[6] Ponce-Alvarez, A., Thiele, A., Albright, T.D., Stoner, G.R., and Deco, G. (2013). Proc. Natl. Acad. Sci. USA 110, 13162-13167.

Neural nets for audio predict brain responses

A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy

By  Alexander Kell, Dan Yamins, Erica Shook, Sam V. Norman-Haignere, Josh H. McDermott

Summary by Josue Ortega-Caro

In this work, the authors study the capabilities of a deep neural network to predict both neural responses of human auditory cortex and the behavior of human participants. In addition, the authors point to their results as evidence of hierarchical processing in the auditory cortex. Their results are part of a growing body of work that uses deep neural networks as models of human cortex (see [1] for references), especially the visual cortex.

The assumption behind this literature is that everyday perceptual tasks impose strong constraints on the brain. These constraints may yield general principles of how such tasks can be solved within a distributed network architecture. Therefore, deep neural networks trained directly on these perceptual tasks are likely to implement the same solutions as the cortex, and thereby exhibit brain-like representations and transformations.

Through comparisons to humans across several different noise conditions, the authors show strong similarities between their model and humans on speech and music recognition tasks. These results suggest that deep neural networks can also provide a good model of auditory cortex.

The authors used two tasks: speech recognition and music genre recognition. Every signal was preprocessed into a cochleagram before being passed through the network. In the first task, subjects were given a two-second clip of speech and asked to recognize the word in the middle of the sentence, out of 587 possible words. In the second task, subjects listened to a two-second clip of a song and were asked to identify which of 41 possible genres it belongs to. In addition, every two-second clip was perturbed by different types and levels of background noise, as a way to compare the models' and humans' abilities to perform auditory recognition under difficult conditions.
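
As a rough stand-in for the cochleagram front end (the paper uses a cochlear filterbank with a compressive nonlinearity; a log-mel spectrogram only approximates this, and the sample rate, band count, and compression below are assumptions), one could compute something like:

    import numpy as np
    import librosa

    def pseudo_cochleagram(path, sr=16000, n_bands=128):
        """Rough, illustrative stand-in for the cochleagram input to the network."""
        y, sr = librosa.load(path, sr=sr, duration=2.0)            # two-second clip
        spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands)
        return np.log1p(spec)                                      # compressive nonlinearity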

First, the authors performed an architecture search over 180 architectures to find neural networks that perform well on speech and music recognition independently. Next, they searched for ways to combine the architectures into a single network that performs both tasks. The authors argue that, as in visual recognition tasks, speech and music recognition should share low-level auditory features. (It was not clear to us why they wanted to combine networks trained on each task, except perhaps as a way of decreasing the number of parameters needed for both tasks together.) The final network has three parts: a shared core (up to conv3) for both tasks, followed by a word-classifier branch and a genre-classifier branch, selected on the basis of performance on both the speech and music recognition tasks.
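
The resulting architecture can be sketched as a shared convolutional trunk feeding two task-specific heads. The layer counts and channel sizes below are placeholders rather than the paper's architecture; only the shared-core-plus-branches structure is the point:

    import torch
    import torch.nn as nn

    class BranchedAudioNet(nn.Module):
        """Shared convolutional core feeding a word branch and a genre branch.
        Layer sizes are placeholders, not the paper's architecture."""
        def __init__(self, n_words=587, n_genres=41):
            super().__init__()
            self.shared = nn.Sequential(                   # roughly "up to conv3"
                nn.Conv2d(1, 32, kernel_size=7, stride=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
            )
            def head(n_out):
                return nn.Sequential(
                    nn.Conv2d(128, 256, kernel_size=3), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(256, n_out),
                )
            self.word_branch = head(n_words)      # speech: which of 587 words
            self.genre_branch = head(n_genres)    # music: which of 41 genres

        def forward(self, cochleagram):
            z = self.shared(cochleagram)          # cochleagram: (batch, 1, freq, time)
            return self.word_branch(z), self.genre_branch(z)

During training, the word-branch loss would be applied to speech clips and the genre-branch loss to music clips, so that the shared core receives gradients from both tasks.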

The authors compare the behavior (performance) of the model and humans under different noise conditions. Figure 2 shows that, across five types of noise at six intensity levels, human and model behavior have r^2 values of 0.25 and 0.92 for music and speech recognition, respectively. One concern raised during the discussion was the inter-human variance on the task (the r^2 between humans): if humans cannot perform the task consistently, then the small model-human r^2 for music recognition looks better than it first appears. An important caveat stated by the authors is that the genre recognition task requires very fine distinctions between genres, e.g. “New Age” versus “Ambient”; for this reason, human subjects were asked to perform top-5 categorization.

Next, they tested the ability of the model trained on speech and music recognition to predict fMRI responses of 8 participants to 165 natural sounds. Some of these sounds were speech- or music-related, but most (113 out of 165) were not. They fit independent linear readouts from the activations of different layers of the network in order to determine which layers best predict each voxel's response to the 165 natural sounds. Figure 3 shows that the median variance explained by their model is higher than that of a spectrotemporal filter model (the baseline) and of a random, untrained network, showing that task training improved the model's predictive power.
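
The layer-wise voxel prediction analysis amounts to fitting a regularized linear map from each layer's activations to each voxel's responses and scoring variance explained on held-out sounds. The sketch below is a schematic version, not the authors' pipeline; the array names, ridge penalties, and cross-validation scheme are assumptions:

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import KFold

    def layer_voxel_r2(layer_activations, voxel_responses, n_splits=5):
        """Median cross-validated R^2 over voxels for one network layer.

        layer_activations: (n_sounds, n_features) activations to the natural sounds.
        voxel_responses:   (n_sounds, n_voxels) fMRI responses to the same sounds.
        """
        r2 = np.zeros(voxel_responses.shape[1])
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for v in range(voxel_responses.shape[1]):
            y = voxel_responses[:, v]
            preds = np.zeros_like(y)
            for train, test in kf.split(layer_activations):
                model = RidgeCV(alphas=np.logspace(-2, 4, 13))
                model.fit(layer_activations[train], y[train])
                preds[test] = model.predict(layer_activations[test])
            ss_res = np.sum((y - preds) ** 2)
            ss_tot = np.sum((y - y.mean()) ** 2)
            r2[v] = 1.0 - ss_res / ss_tot
        return np.median(r2)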

The most interesting results come when they examine the fine-grained predictions of different model layers for different voxels in the auditory hierarchy. In Figure 4, their hierarchical map shows that voxels in early auditory cortex are best predicted by the shared-core layers, whereas voxels in higher cortical areas correlate more strongly with their respective task-specific branch. In addition, they show that even though the shared layers can predict the higher cortical areas, the task-specific branches predict those areas significantly better than the shared core, and vice versa. Furthermore, music- and speech-selective voxels are better predicted by their respective branches of the model. Even though this result is very clear, we were not completely sure whether the differences in prediction accuracy between the speech and music branches are significant.

The authors also explore the ability of their model to explicitly represent acoustic features of the stimulus (via linear decodability), which they argue is a way to understand the representations learned by the model. They show that the variance explained for spectral filters is highest in early layers, whereas the variance explained for spectral modulations peaks at intermediate layers. The untrained network has a similar profile, but without the same increases and decreases in decodability as the trained network. One experiment they might have done is to linearly decode the spectral filters from the fMRI data and ask whether the voxels with higher decodability are also the ones best predicted by the early layers of the network.

Lastly, the authors show in Figure 7 that networks at intermediate points during training explain an amount of neural variance that is correlated with their performance on the speech and music tasks. I think this is the clearest expression of their philosophical position: there is a linear relationship between a network's performance on an everyday task and the presence of brain-like representational transformations.

In conclusion, the authors show a correspondence between humans and deep neural networks along neural, behavioral, and representational axes during speech and music recognition tasks. As stated before, this framework of task-optimized neural networks is becoming widespread throughout the neuroscience community, extending from visual cortex [1] and motor cortex [2] to, now, auditory cortex. It is important to note, as the authors do, that the behavioral, neural, and representational comparisons performed here are only sufficient to establish correlations between humans and deep neural networks. We still need more causal ways to compare the two systems in order to see whether the similarities are superficial (merely a consequence of the highly optimized nature of both humans and models on speech and music tasks) or whether they reveal a deeper relationship between deep neural models and humans. In addition, we wondered whether including other neural motifs, such as recurrence, feedback modulation, and cell-specific functions, would improve the results presented in this paper.

[1] Yamins, D.L., Hong, H., Cadieu, C.F., Solomon, E.A., Seibert, D., & DiCarlo, J.J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619-8624.

[2] Michaels, J. A., Dann, B., Intveld, R. W., & Scherberger, H. (2018). Neural dynamics of variable grasp movement preparation in the macaque fronto-parietal network. Journal of Neuroscience, 2557-17.