NeuroTheory Journal Club blog

We are a cross-departmental student-run group, whose aim is to bring together the Houston computational neuroscience community (BCM/RICE/UH/UTHealth). We meet weekly to discuss papers. Every other week will be focused on our NeuroNex center project to infer graphical models for interactions between neurons and the world. Other weeks we will cover general topics in computational neuroscience, including cellular, systems, cognitive, stats, machine learning topics.

Meeting Time & Place: Friday @ 9:00am-10:00am, in BCM room S553.

(Note: this is followed by our Machine Learning journal club from 10–11 in the same place.)

Contact: Yicheng Fei, yf17[a]rice.edu

Review of “Optimal policy for multi-alternative decisions”

Paper by Satohiro Tajima, Jan Drugowitsch, Nisheet Patel, and Alexandre Pouget. Nature Neuroscience, August 5th, 2019

Review by Nick Barendregt (CU, Boulder)

Summary

Organisms must develop robust and accurate strategies to make decisions in order to survive in complex environments. Recent studies have largely focused on value-based or perceptual decisions where observers must choose between two alternatives. However, many real-world situations require choosing between multiple options, and it is not clear if the strategies that are optimal for two-alternative tasks can be directly translated to multi-alternative tasks. To address this question, the authors use dynamic programming to find the optimal strategy for an n-alternative task. Using Bellman’s equation, the authors find that the belief thresholds at which a decision process is terminated are time-varying non-linear functions. To understand how observers could implement such a complex stopping rule, the authors then develop a neural network model that approximates the optimal strategy. Using this network model, they then show that several experimental observations that had been thought to be suboptimal, such as violations of the independence of irrelevant alternatives (IIA) principle, can in fact be explained by the non-linearity of the network. The authors conclude by using their network model to generate testable hypotheses about observer behavior in multi-alternative decision tasks.

Optimal Decision Strategy

To find the optimal strategy for an n-alternative decision task, the authors assume the observer accumulates evidence to update their belief, and that the observer commits to a choice whenever their belief becomes strong enough; this can be described mathematically by the belief crossing a threshold. To find these thresholds, the authors assume that the observer sets their thresholds to maximize their reward rate, or the average reward (less the average cost of accumulating evidence) per average trial length. These assumptions allow them to construct a utility, or value, function for the observer. At each timestep, the observer collects a new piece of evidence and uses it to update their belief. With this new belief, the observer calculates the utility associated with two classes of actions. The first class, which has n total actions, is committing to a choice, which has utility equal to the reward for a correct choice times the probability their choice is correct (i.e., their belief in the choice being correct). The second class, which has a single action, is waiting and drawing a new observation, which has utility equal to the average future utility less some cost of obtaining a new observation. The utility function selects the maximum of these n+1 actions for the observer. The decision thresholds are then given by the belief values where the maximal-utility action changes.

Using Bellman’s equations for the utility function, the authors find the decision thresholds are non-linear functions that evolve in time. From the form of these thresholds, the authors surmise that the belief-update process can be projected onto a lower-dimensional space, and that the thresholds collapse as time increases, reflecting the urgency the observer faces to make a choice and proceed to the next trial.
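To make the "commit versus wait" recursion concrete, here is a minimal sketch of backward induction for a much simpler problem than the one in the paper. It assumes only two alternatives, binary evidence that favors the correct option with probability q, a fixed per-observation cost, and a hard deadline after `horizon` observations, and it maximizes expected reward minus cost rather than reward rate. All names and parameter values are our own illustrative choices, not the authors'. In this toy version the thresholds shrink because of the deadline.

```python
import numpy as np

def optimal_thresholds(q=0.7, cost=0.01, reward=1.0, horizon=30, n_grid=201):
    """Backward induction for a two-alternative task with binary evidence.

    Belief b = P(option 1 is correct). Each observation equals 1 with
    probability q if option 1 is correct and with probability 1 - q otherwise.
    Returns, for each time step, the largest belief at which waiting
    still has higher utility than committing.
    """
    b = np.linspace(0.0, 1.0, n_grid)
    commit = reward * np.maximum(b, 1.0 - b)  # utility of choosing the best option now
    value = commit.copy()                     # at the deadline the observer must choose

    # Probability of observing a 1, and the Bayes-updated belief for each outcome.
    p1 = b * q + (1.0 - b) * (1.0 - q)
    b_if_1 = b * q / p1
    b_if_0 = b * (1.0 - q) / (1.0 - p1)

    thresholds = np.empty(horizon)
    for t in range(horizon - 1, -1, -1):
        # Average future utility, interpolated on the belief grid.
        future = (p1 * np.interp(b_if_1, b, value)
                  + (1.0 - p1) * np.interp(b_if_0, b, value))
        wait = future - cost
        waiting_region = b[wait > commit]
        thresholds[t] = waiting_region.max() if waiting_region.size else 0.5
        value = np.maximum(commit, wait)      # Bellman update: best of the actions
    return thresholds

thresholds = optimal_thresholds()
print(thresholds[0], thresholds[-1])  # early threshold vs. threshold just before the deadline
```

The threshold at each step is read off as the edge of the region where waiting is the maximal-utility action, which mirrors how the decision thresholds are defined above.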

Neural Network Model

To see how observers may approximate this non-linear stopping rule, the authors construct a recurrent neural network that implements a sub-optimal version of the decision strategy. This network model has n neurons, one for each option, which track the belief associated with each option. The network also includes divisive normalization, which is widespread in the cortex, and an urgency signal, which increases the gain as time increases. These two features allow the model to well-approximate the optimal stopping rule, and result in a model that has a similar lower-dimensional projection and collapsing thresholds. When comparing their network model to a standard race model, the authors find that adding normalization and urgency improves model performance in both value-based and perceptual tasks, with normalization having a larger impact on performance.

Results and Predictions

Using their neural network model, the authors are able to reproduce several well-established results, such as Hick’s law for response times, and explain several behavioral and physiological findings in humans that have long been thought to be sub-optimal. First, because of the normalization, the model is able to replicate violations of the IIA principle, which says that in a choice between two high-value options, adding a third option of lower value should not influence the decision process. The normalization also replicates the similarity effect, which says that when choosing between option 1 and option 2, adding a third option similar to option 1 decreases the probability of choosing option 1. The authors then conclude that the key explanation of these behaviors is divisive normalization.
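To illustrate why divisive normalization can make an "irrelevant" option matter, here is a small static sketch. It is not the authors' recurrent network: it is a generic normalization formula with a hypothetical semi-saturation constant `sigma` and made-up option values. Adding a low-value third option leaves the ratio of the two best options unchanged but shrinks the difference between them, which would make them harder to tell apart in the presence of noise.

```python
import numpy as np

def divisive_normalization(values, sigma=1.0):
    """Divide each option's value by sigma plus the summed value of all options."""
    values = np.asarray(values, dtype=float)
    return values / (sigma + values.sum())

two = divisive_normalization([10.0, 8.0])
three = divisive_normalization([10.0, 8.0, 4.0])

gap_two = two[0] - two[1]        # separation between the two best options
gap_three = three[0] - three[1]  # same two options after adding a low-value third

print(gap_two, gap_three)
```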

After validating their model by reproducing these previously observed results, the authors then make predictions about observer behavior in multi-alternative tasks. The main prediction concerns the two types of strategies used for multi-alternative tasks: the “max-vs.-average” strategy and the “max-vs.-next” strategy. The model predicts that the reward distribution across choices should cause observers to smoothly transition between these two strategies; this prediction is something that could be tested in psychophysics experiments.

Unsupervised learning algorithms

Post by Aadith Vittala.

Pehlevan, C., Chklovskii, D. B. (2019). Neuroscience-inspired online unsupervised learning algorithms. ArXiv:1908.01867 [Cs, q-Bio].

This paper serves as a review of similarity-based online unsupervised learning algorithms. These types of algorithms are important because they are biologically-plausible, produce non-trivial results, and sometimes work as well as other non-biological algorithms. In this paper, biologically-plausible algorithms have three main features: they are local (each neuron uses only pre or post synaptic information), they are online (data vectors are presented one at a time and learning occurs after each data vector is passed in), and they are unsupervised (there is no teaching signal to tell the neurons information about error). The simplest example of one of these algorithms is the Oja online PCA algorithm. Here, the system receives \textbf{x}_t each timestep and calculates y_t = \textbf{w}_{t-1} \cdot \textbf{x}_t, which represents the value of the top principal component. The weights are modified each timestep according to

\textbf{w}_t = \textbf{w}_{t-1} + \eta(\textbf{x}_t - \textbf{w}_{t-1} y_t)y_t

This algorithm is both biologically-plausible and potentially useful. This paper aims to find more algorithms like the Oja online PCA algorithm.
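As a quick check of the Oja update written above, here is a minimal sketch on synthetic data. The data set, learning rate, and seeds are our own assumptions, chosen only so that the top principal component is easy to identify.

```python
import numpy as np

def oja_top_component(X, eta=0.005, seed=0):
    """Run the Oja update once over the rows of X and return the weight vector."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for x in X:                          # online: one data vector at a time
        y = w @ x                        # output of the single neuron
        w = w + eta * (x - w * y) * y    # local update: uses only x, y and w
    return w

# Synthetic zero-mean data whose largest variance lies along the first axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3)) * np.array([3.0, 1.0, 0.5])

w = oja_top_component(X)
top_pc = np.linalg.eigh(np.cov(X.T))[1][:, -1]     # eigenvector with the largest eigenvalue
alignment = abs(w @ top_pc) / np.linalg.norm(w)    # |cosine| between w and the top PC
print(alignment)
```

The sign of a principal component is arbitrary, which is why the comparison uses the absolute value of the cosine.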

As a starting point, they aimed to develop a biologically-plausible algorithm that would output multiple principal components from a given data set. To do this, they chose to use a similarity-matching objective function, where the goal is to minimize the following expression

\min_{\textbf{y}_1 \ldots \textbf{y}_T} \frac{1}{T^2}\sum_{t=1}^T \sum_{t'=1}^T \big(\textbf{x}_t^T \textbf{x}_{t'} - \textbf{y}_t^T \textbf{y}_{t'} \big)^2

This expression essentially tries to match the pairwise similarities between input vectors to the pairwise similarities between output vectors. In previous work, they have shown that the solution to this problem (with \textbf{y} having fewer dimensions than \textbf{x}) is PCA. To solve this in a biologically-plausible fashion, they use a variable substitution trick (inspired by the Hubbard-Stratonovich transformation from physics) to convert this problem to a minimax problem over new variables \textbf{W} and \textbf{M}

\min_{\textbf{W}} \max_{\textbf{M}} \frac{1}{T} \sum_{t=1}^T \big[ 2 {\rm Tr\,}(\textbf{W}^T \textbf{W}) - {\rm Tr\,}(\textbf{M}^T \textbf{M}) +\min_{\textbf{y}_t} (-4 \textbf{x}_t^T \textbf{W}^T \textbf{y}_t + 2 \textbf{y}_t^T\textbf{M}\textbf{y}_t) \big]

This expression leads to an online algorithm where you solve for \textbf{y}_t during each time step t using

\dot{\textbf{y}}_t = \textbf{W} \textbf{x}_t - \textbf{M} \textbf{y}_t

and then update \textbf{W} and \textbf{M} with

W_{ij} = W_{ij} + \eta (y_i x_j - W_{ij}) \hspace{20pt} \textrm{Hebbian}

M_{ij} = M_{ij} + \frac{\eta}{2} (y_i y_j - M_{ij}) \hspace{20pt} \textrm{``anti-Hebbian"}

though after discussion, we think it would be better to call the second weights update “Hebbian for inhibitory synapses”. This online algorithm has not yet been proven to converge, but it gives relatively good results when tested. In addition, it provides a simple interpretation of \textbf{W} as the presynaptic weights mapping all inputs to all neurons, \textbf{y} as the postsynaptic output from all neurons, and \textbf{M} as inhibitory lateral projections between neurons.
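The three update equations above can be sketched in a few lines. This is only an illustration under our own assumptions: instead of integrating the dynamics for \textbf{y}_t, it jumps directly to their fixed point \textbf{y}_t = \textbf{M}^{-1}\textbf{W}\textbf{x}_t, and the initialization, learning rate, and synthetic data are arbitrary choices rather than the ones used in the paper.

```python
import numpy as np

def similarity_matching(X, k, eta=0.01, seed=0):
    """Online similarity matching with feedforward weights W and lateral weights M."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(scale=0.1, size=(k, n))   # feedforward (Hebbian) weights
    M = np.eye(k)                            # lateral inhibitory weights
    Y = np.empty((X.shape[0], k))
    for t, x in enumerate(X):
        y = np.linalg.solve(M, W @ x)            # fixed point of dy/dt = W x - M y
        W += eta * (np.outer(y, x) - W)          # W_ij update
        M += 0.5 * eta * (np.outer(y, y) - M)    # M_ij update
        Y[t] = y
    return W, M, Y

# Synthetic data: two high-variance directions and two low-variance ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 4)) * np.array([3.0, 2.0, 0.5, 0.3])
W, M, Y = similarity_matching(X, k=2)
print(np.linalg.solve(M, W))   # effective linear map from input to output
```

One design note: because each M update is a convex combination of the old M and the outer product of y with itself, M stays symmetric and positive definite when it starts at the identity, so the linear solve is always well defined.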

The paper goes on to generalize this algorithm to accept whitening constraints (via Lagrange multipliers), to work for non-negative outputs, and to work for clustering and manifold tiling. The details for all of these processes are covered in cited papers, but not in this specific paper. Overall, the similarity-matching objective seems to give well-performing biologically-plausible algorithms for a wide range of problems. However, there are a few important caveats: none of these online algorithms have been proved to converge to the correct solution, inputs were not allowed to correlate, and there is no theoretical basis for stacking (since multiple layers would be equivalent to a single layer). In addition, during discussion we noted that similarity matching seems to essentially promote networks that just rotate their input into their output (as similarity measures the geometric dot product between vectors), so it is not obvious how this technique can conduct the more non-linear transformations necessary for complex computation. Nonetheless, this paper and similarity-matching in general provide important insight into how networks can perform computations while still remaining within the confines of biological plausibility.

What is the dynamical regime of the cortex?

A review of a preprint by Y. Ahmadian and K. D. Miller

What is the dynamical regime of cortical networks? This question has been debated as long as we have been able to measure cortical activity. The question itself can be interpreted in multiple ways, the answers depending on spatial and temporal scales of the activity, behavioral states, and other factors. Moreover, we can characterize dynamics in terms of dimensionality, correlations, oscillatory structure, or other features of neural activity.

In this review/comment, Y. Ahmadian and K. Miller consider the dynamics of single cells in cortical circuits, as characterized by multi-electrode and intracellular recording techniques. Numerous experiments of this type indicate that individual cells receive excitation and inhibition that are approximately balanced. As a result, activity is driven by fluctuations that cause the cell’s membrane potential to occasionally and irregularly cross a firing threshold. Moreover, this balance is not a result of fine tuning between excitatory and inhibitory weights, but is achieved dynamically.

There have been several theoretical approaches to explain the emergence of such balance. Perhaps the most influential of these theories was developed by C. van Vreeswijk and H. Sompolinsky in 1996. This theory of dynamic balance relies on the assumption that the number of excitatory and inhibitory inputs to a cell, K, is large and that the strength of each input scales like 1/\sqrt{K}. If external inputs to the network are strong, then under fairly general conditions activity is irregular and in a balanced regime: the average difference between the excitatory and inhibitory input to a cell is much smaller than either the excitatory input or inhibitory input itself. Ahmadian and Miller refer to this as the tightly balanced regime.
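The role of the 1/\sqrt{K} scaling can be seen with a back-of-the-envelope calculation. The sketch below assumes K independent binary inputs, each active with probability `rate` and weighted by J/\sqrt{K}; these modeling choices and parameter values are ours, for illustration only. The mean summed excitatory input then grows like \sqrt{K}, while its fluctuations stay of order one, so inhibition has to cancel most of the mean for the net input to remain near threshold.

```python
import numpy as np

def input_statistics(K, rate=0.1, J=1.0):
    """Mean and standard deviation of the summed input from K independent
    binary inputs, each active with probability `rate` and weighted by J / sqrt(K)."""
    weight = J / np.sqrt(K)
    mean = K * weight * rate
    std = np.sqrt(K * weight**2 * rate * (1.0 - rate))
    return mean, std

for K in (100, 10_000):
    mean, std = input_statistics(K)
    print(f"K={K}: mean excitatory input={mean:.2f}, std={std:.2f}")
```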

In contrast, excitation and inhibition still cancel approximately in loosely balanced networks. However, in such networks the residual input is comparable to the excitatory input, and cancellation is thus not as tight. This definition is too broad, however, and the authors also assume that the net input (excitation minus inhibition) driving each neuron grows sublinearly as a function of the external input. As shown previously by the authors and others, such a state emerges when the number of inputs to each cell is not too large, and each cell’s firing rate grows superlinearly with input strength. Under these conditions a sufficiently strong input to the network evokes fast inhibition that loosely balances excitation to prevent runaway activity.

Loose and tight balance can occur in the same model network, but loose balance occurs at intermediate external inputs, while tight balance emerges at high external input levels. While the transition between the two regimes is not sharp, the network behaves very differently in the two regimes: A tightly balanced network responds linearly to its inputs, while the response of a loosely balanced network can be nonlinear. Moreover, external input can be of the same order as the total input for loosely balanced networks, but must be much larger than the total input (of the same order as excitation and inhibition on their own) for tightly balanced networks.

Which of these two regimes describes the state of the cortex? Tightness of balance is difficult to measure directly, as one cannot isolate excitatory and inhibitory inputs to the same cell simultaneously. However, the authors present a number of strong, indirect arguments in favor of loose balance, based on several experimental findings: 1) Recordings suggest that the ratio of the mean to the standard deviation of the excitatory input is not sufficiently large to necessitate precise cancellation by inhibition. This would put the network in the loosely balanced regime. Moreover, excitatory currents alone are not too strong: they are comparable to the voltage difference between the mean and threshold. 2) Cooling and silencing studies suggest that external input, e.g. from thalamus, to local cortical networks is comparable to the net input. This is consistent with loose balance, as tight balance is characterized by strong external inputs. 3) Perhaps most importantly, cortical computations are nonlinear. Sublinear response summation and surround suppression, for instance, can be implemented by loosely balanced networks. However, classical tightly balanced networks exhibit linear responses, and thus cannot implement these computations. 4) Tightly balanced networks are uncorrelated, and do not exhibit the stimulus-modulated correlations observed in cortical networks.

These observations deserve a few comments: 1) The transition from tight to loose balance is gradual. It is therefore not exactly clear when, for instance, the mean excitatory input is sufficiently strong to require tight cancellation. As the authors suggest, some cortical areas may therefore lean more towards a tight balance, while others lean more towards loose balance. 2) It is unclear whether cooling reduces inputs to the cortical areas in question. 3 and 4) Classical tightly balanced networks are unstructured and are driven by uncorrelated inputs. Changes to these assumptions can result in networks that do exhibit a richer dynamical repertoire, including spatiotemporally structured and correlated behaviors, as well as nonlinear computations.

Why does this debate matter? The dynamical regime of the cortex describes how a population of neurons transforms its inputs, and thus the computations that a network can perform. The questions of which computations the cortex performs, and how it does so, are therefore closely related to questions about its dynamics. However, at present our answers are somewhat limited. Most current theories ignore several features of cortical networks that may impact their dynamics: there is a great diversity of cell types, particularly inhibitory cells, each with its own dynamical and connectivity properties. It is likely that this diversity of cells shapes the dynamical state of the network in a way that we do not yet fully understand. Moreover, the distribution of synaptic weights and spatial synaptic interactions across the dendritic trees are not accurately captured in most models. It is possible that these and other details are irrelevant, and that current theories of balance are robust. However, this is not yet completely clear. Thus, while the authors make a strong case that the cortex is loosely balanced, a definitive answer to this question lies in the future.

Thanks go to Robert Rosenbaum for his input on this post.

Inductive bias

I make a distinction between latent variables and parameters. Latent variables change over time; parameters don’t.

(This is really a quantitative matter, not a qualitative one, as a “parameter” might change slowly (often called adaptation) and thus be promoted to a “variable.”)

This distinction carries over to computation: inferring a variable is just called “inference,” but inferring a parameter is typically called “learning.” Mathematically, these are really the same process, just about different quantities and on different timescales.

Analogously, model bias and inductive bias are both biases, but about variables versus parameters. Model bias is an inference bias in the mapping from input to output. Inductive bias is a learning bias, a bias in what is learned from data (i.e. in the mapping from a data set to parameters). This inductive bias in turn creates its own model bias.

An inductive bias is caused by both the model class (such as a neural network architecture) and the optimization procedure.

The following table compares the relevant concepts.

Quantity           Latent variable x                          Parameter Q
Example            Position of falling apple                  Gravity
Changes?           Yes (fast)                                 No (slow)
Computation        Inference: p(x|d,Q),                       Learning: p(Q|{d})
                   or p(x|d,M) = Σ_Q p(x|d,Q) p(Q|M),
                   or p(x|d,M) = Σ_Q p(x|d,Q) p(Q|{d},M)
Bias               Model bias: p(x|Q,M)                       Inductive bias: p(Q|M)
Bias created by    Parameters                                 Model, architecture, parameter dynamics
                                                              (plasticity rule, objective function, optimization)

Table: Comparison of latent variables and parameters, and computations relevant to them. Here x is a vector of latent variables, Q is a parameter vector, d is observable input data, {d} is a data set, and M is a model class.
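As a worked instance of the marginalization p(x|d,M) = Σ p(x|d,Q) p(Q|M) from the table, here is a tiny discrete example. The three parameter values, four latent states, and all probabilities are made up purely for illustration.

```python
import numpy as np

# Hypothetical discrete example: 3 candidate parameter values Q, 4 latent states x.
p_Q_given_M = np.array([0.5, 0.3, 0.2])               # p(Q|M), the inductive bias
p_x_given_dQ = np.array([[0.70, 0.10, 0.10, 0.10],    # p(x|d,Q), one row per value of Q
                         [0.10, 0.70, 0.10, 0.10],
                         [0.25, 0.25, 0.25, 0.25]])

# Inference with unknown parameters: sum over Q, weighting by p(Q|M).
p_x_given_dM = p_Q_given_M @ p_x_given_dQ
print(p_x_given_dM)
```

Replacing p(Q|M) with a learned posterior p(Q|{d},M) gives the second form in the table with the same line of code.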

Cortical Areas Interact through a Communication Subspace

JD Semedo, A Zandvakili, CK Machens, BM Yu, and A Kohn. Neuron, 2019.

Summary

How do populations of neurons in interconnected brain areas communicate? This work proposes the idea that different cortical areas interact through a communication subspace: a low-dimensional subspace of the source population activity fluctuations that is most predictive of the target population fluctuations. Further, this communication subspace is not aligned with the largest activity patterns in a source area. Importantly, the computational advantage of such a subspace is that it allows for selective and flexible routing of signals among cortical areas.

Approach used to study inter-areal interactions

Previous studies of inter-areal interactions in the brain have related spiking activities of pairs of neurons in different brain areas or LFP-LFP interactions. Such studies have provided insight into how interaction strength changes with stimulus drive, attentional states, or task demands. However, these methods do not explain how population spiking activity in different areas is related on a trial-by-trial basis.

This work leverages trial-to-trial co-fluctuations of V1 and V2 neuronal population responses, recorded simultaneously in macaque monkeys, to understand population-level interactions between cortical areas.

Experiment details

The activity of neuronal populations in the output layers (2/3-4B) of V1, and in their primary downstream target, the middle layers of V2, was recorded in three anesthetized monkeys (Fig 1A in the paper). The recorded populations had retinotopically aligned receptive fields. The stimulus comprised drifting sinusoidal gratings at 8 different orientations. The trial-to-trial fluctuations in the responses to repeated presentations of each grating were analyzed.

Source and target populations

The recorded V1 neurons were divided into source and target populations (Fig 1B). For each dataset, the target V1 population was drawn randomly from the full set of V1 neurons such that it matched the neuron count and firing rate distribution of the target V2 population. This matching procedure was repeated 25 times, using different random subsets of neurons. Results for each stimulus condition were based on averages across these repeats.

Results

Strength of population interactions

The V1-V2 interactions were first characterized by (i) measuring noise correlations, and (ii) using multivariate linear regression to see how well the variability of the target populations could be explained by the fluctuations of the source V1 population (Fig 2). Both of these analyses indicated that the interactions between areas (source V1 – target V2) are similar in strength to those within a cortical area (source V1 – target V1).

What about the structure of these interactions?

Consider predicting the activity of a V2 neuron from a population of three V1 neurons using linear regression. The regression weights correspond to a regression dimension. In a basic multivariate regression model, each V2 neuron has its own regression dimension and these dimensions could, in principle, fully span the V1 activity space. But what if they span only a subspace (Fig 3)?

If only a few dimensions of V1 activity are predictive of V2, then using a low dimensional subspace should achieve the same prediction performance as the full regression model.

Testing the existence of these subspaces

To test this hypothesis, the authors used linear regression with a rank constraint. They observed that reduced rank regression achieved nearly the same performance as the full regression model. Further, they observed that the number of dimensions needed to account for V1-V2 interactions was smaller than the number involved in V1-V1 interactions (Fig 4).
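For readers unfamiliar with reduced rank regression, here is a minimal sketch using the standard construction: fit ordinary least squares, then project the fit onto the top principal directions of the predicted target activity. The synthetic "source" and "target" populations, the noise level, and the in-sample R² measure are our own assumptions. The paper evaluates prediction performance with cross-validation, which this sketch omits for brevity.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Least-squares map from X to Y constrained to the given rank."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Principal directions of the OLS predictions in target space.
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V = Vt[:rank].T
    return B_ols @ V @ V.T

def r_squared(X, Y, B):
    """In-sample fraction of target variance explained (data assumed centered)."""
    return 1.0 - np.sum((Y - X @ B) ** 2) / np.sum(Y ** 2)

# Synthetic populations: 10 source neurons drive 8 target neurons
# through a rank-2 map, plus private noise in the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
A = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 8))
Y = X @ A + 0.5 * rng.normal(size=(2000, 8))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

B_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
r2_full = r_squared(X, Y, B_full)
r2_by_rank = {r: r_squared(X, Y, reduced_rank_regression(X, Y, r)) for r in (1, 2, 3)}
print(r2_full, r2_by_rank)
```

In this synthetic case performance should saturate at rank 2, the true dimensionality of the map, which is the signature the authors look for when estimating the number of predictive dimensions.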

Are the V1-V2 interactions low dimensional because the V2 population activity itself is lower dimensional than the target V1? Factor analysis revealed that the dimensionality of the V2 activity was actually higher than that of the target V1 (Fig 5A).

To assess how the complexity of the target population influenced the dimensionality of the interactions, they also compared the number of predictive dimensions to the dimensionality of the target population activity (Fig 5B). For V1-V1 interactions, the number of predictive dimensions matched the target V1 dimensionality. In contrast, for V1-V2 interactions, the number of predictive dimensions was consistently lower than the target V2 dimensionality.

Based on these observations, the authors conclude that

  • the V1-V1 interaction uses as many predictive dimensions as possible
  • but the V1-V2 interaction is confined to a small subspace of source V1 population activity

The authors term this subspace the communication subspace. This low-dimensional interaction structure was also observed in simultaneous population recordings in V1 and V4 of awake monkeys (Fig S2).

Relationship to source population activity

By removing source population activity along required predictive dimensions, they identified that the V2 predictive dimensions are not aligned with target V1 predictive dimensions (Fig 6).

Next, using factor analysis, they identified the dimensions of largest shared fluctuations within the source V1 population. However, these dominant dimensions of the source V1 population are worse than the V2 predictive dimensions at predicting V2 fluctuations (Fig 7A).

Summary of results

The V1 predictive dimensions are aligned with the largest source V1 fluctuations (dominant dimensions, Fig 7B). In contrast, the V2 predictive dimensions are distinct and:

  • they are less numerous
  • they are not well aligned with the V1 predictive dimensions
  • nor are they well aligned with the V1 dominant dimensions 

The authors conclude by suggesting that the communication subspace is an advantageous design principle of inter-area communication in the brain. The ability of a source area to communicate only certain patterns while keeping others private could be a means of selective routing of signals between areas. This selective routing allows moment-to-moment modulation of interactions between cortical areas.

Comments from the journal club

  • An alternative to reduced rank linear regression would be to use canonical correlation analysis (CCA)
  • What information is encoded in the different dimensions, both predictive and dominant? This should be easy to check.
  • The analyses here are entirely linear, but V2 neurons most likely perform nonlinear operations on inputs received from V1. The approach used here was to study local fluctuations around set points. The justification provided for this approach is that the trial-to-trial variability around the mean response functions effectively as local linear perturbations in the nonlinear transformation between V1 and V2.
  • All the analyses reveal subspaces of relatively low dimensionality. Might this be a consequence of the low-dimensional stimulus? Nonetheless, why would the (“noise”) fluctuations be low-dimensional even for a low-dimensional stimulus?

Distilling Free-Form Natural Laws from Experimental Data

Schmidt, M., & Lipson, H. (2009). Science 324(5923): 81-85.

Automating the Search for Natural Laws

Scientists have always been concerned with identifying the laws that govern natural phenomena. We now live in an age where we can collect massive amounts of experimental data from a wide range of systems – subatomic to astronomical. We are also blessed with constantly growing computational power. Can we use these resources to automate the search for the governing laws of any physical system? In this paper, Schmidt and Lipson present an approach based on symbolic regression to automate the search for natural laws from experimental data. Mathematical symmetries and invariants underlie nearly all physical phenomena. Thus, the search for natural laws is invariably a search for conserved quantities or invariant equations. However, the most prohibitive obstacle for automating this process is finding meaningful invariants. This paper proposes a principle for the identification of non-trivial invariants.

Symbolic regression

Several methods exist for modeling scientific data:

  • Fixed-form parametric models based on expert knowledge
  • Numerical models aimed at prediction, e.g. neural networks
  • Restricted model spaces using greedy search

The goal here, however, is to find principled, unconstrained analytical expressions that capture symbolically precise conserved relations. This requires searching the space of both functions and parameters.

In this paper, Symbolic Regression, which is an evolutionary computation method, is used for searching the space of mathematical functions. Initial mathematical expressions are constructed using basic mathematical building blocks – algebraic operators (+,-,\times), a basic set of analytical functions (e.g. sine, cosine), constants and state variables. The search algorithm, based on genetic programming, forms new equations by recombining previous equations and probabilistically varying sub-expressions. These models are associated with a fitness score. Models with high fitness are retained and unpromising solutions are discarded. The algorithm terminates after the obtained equations reach a desired level of fitness.

Identification of nontrivial relations

Rather than try to model any specific signal, the goal here is to find any underlying physical law that the system obeys. The candidate equations should predict relationships between dynamics of the components of the system. Specifically, this paper proposes a Predictive Ability Criterion: the candidate equations should predict connections among derivatives of groups of variables over time. The fitness score used by their symbolic regression algorithm is, thus, a measure of the difference between partial derivatives obtained

  • symbolically from the candidate equations
  • numerically from the experimental data.

Further, instead of producing just one candidate, the algorithm outputs a short list of final candidate analytical expressions on the accuracy-complexity Pareto frontier.

Results

The search algorithm with the partial-derivative-pairs criterion was applied to measurements from a few simple physical systems – air-track oscillators and double pendulums. One important feature of the algorithm is that one can control the type of law that it finds by choosing the input data. For example, for a pendulum, if only position coordinates are provided as input, the algorithm converges to the equation of a circle. Given position and velocity data, the algorithm converges on energy laws. The algorithm could extract Hamiltonians of the air-track oscillators and the conservation of angular momentum laws for the double pendulum.

Caveats

Though the algorithm can present equations corresponding to physical laws in their mathematical form, the bulk constants in the expressions are not characterized. The authors propose a systematic approach to parsing these coefficients by analyzing multiple data sets from the same system, albeit with different configurations and parameters. They demonstrate this approach by using measurements from simulated air-track oscillators and pendulums.

The time to converge on the law equations depends exponentially on complexity of the law itself, dimensionality of the system and measurement noise. Therefore, a key challenge is scaling to higher complexity. One potential solution proposed is the use of candidate equations obtained from simpler systems as initial seeds for complex systems. This seeding approach does not constrain the equation search, but instead biases the search to reuse terms from previous laws.

Problems

In the course of our discussion of this paper, there were a couple of questions raised. The primary doubt stems from the Predictive Ability Criterion. The text describes a “candidate law equation f” whereas f is just an expression, not an equation. Now, if you do make it a conservation equation, then basic calculus gives you the opposite answer from what is reported as the predictive ability criterion. In particular, for a conservation law f(x,y) = c, any change in x must be accompanied by a change in y, and the result is exactly the negative of Eq S2 in the supplement!

  • f(x,y) = c    # conservation law
  • \frac{df}{dx} + \frac{df}{dy} \frac{dy}{dx} = 0    # differentiate both sides w.r.t. x, using the chain rule
  • \frac{dy}{dx} = - \frac{df/dx}{df/dy}    # solve for dy/dx as desired

Equation S2 in the supplement states instead that

  • \frac{dy}{dx} = + \frac{df/dx}{df/dy}    # note opposite sign!
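The sign in the derivation above can be checked numerically. The sketch below is only an illustration and is not taken from the paper: it assumes a hypothetical conserved quantity f(x,y) = x^2 + y^2 (a circle), measures the slope dy/dx along the level set, and compares it with both signs of the ratio of partial derivatives. The helper `partial` and all constants are assumptions made for this example.

```python
import math

def f(x, y):
    # Hypothetical conserved quantity: points on a circle satisfy f(x, y) = c.
    return x**2 + y**2

def partial(func, x, y, wrt, h=1e-6):
    # Central finite difference for a partial derivative of func.
    if wrt == "x":
        return (func(x + h, y) - func(x - h, y)) / (2 * h)
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

c = 1.0
x0 = 0.6
y0 = math.sqrt(c - x0**2)  # upper branch of the level set

# Slope measured along the level set itself, y(x) = sqrt(c - x^2).
h = 1e-6
slope_on_curve = (math.sqrt(c - (x0 + h)**2) - math.sqrt(c - (x0 - h)**2)) / (2 * h)

# Ratio of partial derivatives, (df/dx) / (df/dy).
ratio = partial(f, x0, y0, "x") / partial(f, x0, y0, "y")

print("slope along level set:", slope_on_curve)
print("minus the ratio:      ", -ratio)
print("plus the ratio:       ", +ratio)
```

For this example the slope along the level set agrees with the negative ratio, which is the sign obtained from the chain rule.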

The proposed algorithm is designed to identify nontrivial conservation laws. However, the authors also claim that this algorithm can be used to identify equations such as Lagrangian equations, which summarize the system dynamics but are not invariant. It is not quite clear how these Lagrangians could be obtained.

When we contacted the corresponding author about these problems, he replied that we should read the supplement, and mentioned that absolute values on the implicit derivatives produce Lagrangians. However, these suggestions did not resolve our questions. Any further clarifications from the authors or scientific community would be most welcome. There is room for commenting below.

Applications to Neuroscience?

This paper presents an approach to identify analytical, human interpretable governing laws from experimental data without any prior knowledge of the system, and demonstrates its success for simple, low-dimensional physical systems. For us, the pertinent question is – can this method be scaled to neural data? We can now simultaneously record the activities of tens of thousands of neurons in the brain. Neural data has various sources of noise and variations across experimental subjects/systems. Further, the governing principles of the brain (if indeed the brain has any) probably describe the dynamics of what the neurons represent. Perhaps with the right kind of fitness score (which also accounts for identifying the appropriate neural representations) and sufficient computational power, the symbolic regression approach could potentially extract some underlying computational principles of the brain.

The Dynamical Regime of Sensory Cortex: Stable Dynamics around a Single Stimulus-Tuned Attractor Account for Patterns of Noise Variability

Neurons in the cortex are characterized by irregular spiking patterns that can be correlated across the population. However, both the variability and covariability in neuronal activity can be modulated by stimuli (Churchland et al., 2010). The reduction in variability and covariability observed experimentally has been explained using multi-attractor models (Ponce-Alvarez et al., 2013), and chaotic networks (Molgedey et al., 1992). In this paper the authors propose an alternative explanation using a stochastic, stabilized supralinear network (SSN).  The main mechanism driving the modulation in this model is stabilizing inhibitory feedback.

In multi-attractor models, the network operates in a multi-stable regime, and in the absence of a stimulus, fluctuations cause the network to wander between different attractors. This meandering gives rise to correlated variability in the activity of different cells. Upon stimulus onset, the variability is suppressed as the network’s activity is pinned to the neighborhood of a single attractor. In chaotic models, variability is due to chaotic fluctuations in activity. Certain types of stimuli can suppress such chaotic fluctuations by forcing network dynamics to follow specific trajectories, thus quenching variability across trials. While both of these models explain stimulus-induced reduction of variability, only the multi-attractor models can explain the stimulus-tuning of variability reduction. Even then, this requires either considerable parameter tuning or very strong noise.

To address these limitations, the authors present an alternative explanation of this phenomenon using a stochastic stabilized supralinear network (SSN). The dynamics of the spiking SSN are defined via equations governing the voltages of the individual neurons:

\tau_i \frac{dV_i}{dt}=-V_i(t)+V_{\text{rest}}+h_i(t)+\eta_i(t)+\sum_{j\in \text{E cells}}J_{ij}a_j(t)-\sum_{j\in \text{I cells}}J_{ij}a_j(t).

Neurons integrate their external and recurrent inputs linearly in their membrane potential, but their output (rate) is a nonlinear function of the voltage. The quantity \eta_i(t) is noise, and a_j(t) is the low pass filtered version of the spike train of neuron j. Neuron j generates spikes at each time bin dt with an instantaneous probability dt\times f(V_j), where f(V_j) is the instantaneous firing rate of that neuron. 
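The spike-generation rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a threshold-quadratic f, holds the voltage fixed (in the model it evolves over time), and uses hypothetical values for the gain `k`, the resting potential `v_rest`, the bin size `dt` and the duration.

```python
import random

def rate(v, v_rest=-70.0, k=0.3):
    # Threshold-quadratic nonlinearity: zero below v_rest, quadratic above.
    # k (Hz per mV^2) and v_rest (mV) are hypothetical values.
    return k * max(v - v_rest, 0.0) ** 2

def spike_train(v, duration=10.0, dt=0.001, seed=0):
    # One Bernoulli draw per time bin, with spike probability dt * f(V).
    rng = random.Random(seed)
    p = dt * rate(v)
    n_bins = int(round(duration / dt))
    return [1 if rng.random() < p else 0 for _ in range(n_bins)]

quiet = spike_train(v=-75.0)   # below threshold, so the rate is 0 Hz
active = spike_train(v=-60.0)  # 10 mV above rest, so the rate is 30 Hz
print(sum(quiet), sum(active))
```

With these assumed values the second train should contain roughly 300 spikes in 10 seconds, since the per-bin probability is 0.03.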

The choice of the nonlinearity, f, translating voltage into spikes is the crucial part of the model. As shown in earlier work by the authors (Ahmadian et al., 2013), this function needs to be supralinear, and here a threshold-quadratic function is chosen. To understand the mechanics of variability modulation in SSNs, the authors linearize the dynamics around the input-dependent fixed point and obtain a Schur decomposition of the Jacobian matrix. This allows them to show analytically that transient variability increases for weak inputs, due to weak recurrent inhibitory self-coupling and strong balanced amplification. Variability decreases for strong inputs due to the supralinear growth in recurrent inhibition, which stabilizes the fixed point and damps fluctuations. This mechanism can already be understood using a network of two units representing the excitatory (E) and inhibitory (I) subpopulations, respectively: weak inputs lead to an increase in variability (across trials), whereas strong inputs suppress it. Modulation of variability requires recurrent interactions: a feed-forward circuit is not enough.

The insights obtained from analyzing a two-population model hold more generally. The authors next considered a network of 5,000 neurons (80% E and 20% I) randomly connected with low probability and synaptic weights chosen from the population model (meaning that any two neurons from populations a and b are connected with the same strength, for a=E,I and b=E,I). In this case, variability suppression was achieved at the single cell level (voltage), as well as at the population level (LFP and rates). This model primarily suppressed shared rather than private variability, which was also observed in experiments (Churchland et al., 2010). This is due to effective mean field connectivity in the network causing the dynamics to be coupled to network-wide patterns of activity across E or across I cells. These patterns were affected by the changes in effective connectivity due to the stimulus. Significant variability in these patterns was present due to correlated noise. Hence, shared E and I patterns behaved as units from the population model, and variability suppression also caused suppression of covariability. In addition, the model accounts for the stimulus-induced modulation of the power spectrum and cross-coherence of LFP and single cell voltage seen experimentally in visual area V1 of awake monkeys.

Since in this model the network was randomly connected, and neurons were not stimulus selective, variability was suppressed uniformly across the population. However, experiments show that variability suppression depends on stimulus tuning. Therefore, the authors next introduced an extension of the network with a ring architecture, with neurons around the ring having different preferred orientations (PO). In accord with previous models of this type (Ponce-Alvarez et al., 2013; Lombardo et al., 2015; Lin et al., 2015), neurons with similar tuning were connected more strongly than neurons with a large difference in PO. In this network, a bump of activity is present only when a stimulus is present, and does not represent the short term memory of the stimulus. The dynamics of this model agree well with data from V1 in awake monkeys: the model displays variability suppression at the single cell level and at the population level, a drop in the Fano factor, and U-shaped tuning of variability suppression with stimulus orientation in Fano factors and membrane potentials. In particular, the fact that mainly shared, rather than private, variability was suppressed is due to the spatially smooth connectivity profile, which results in regular patterns of activity across the population.

The authors examined the structure of quenched noise variability after stimulus onset. They showed that most of the shared variability arose from variability in the location and width of the bump of activity. These small transformations resulted in a characteristic pattern of deviation of network activity from the mean bump. These two patterns contributed two distinct spatial covariance templates, which taken together accounted for 87\% of the structure in the full covariance matrix of the network. Hence, bump kinetics correctly predicted membrane potential variances.

As mentioned before, chaotic models cannot account for stimulus-modulated changes in variability, because neurons are not tuned to different stimuli. Therefore, the authors compared the SSN to the multi-attractor model, in order to determine if their proposed mechanism better explains the suppression in variability seen in experiments. They find that multi-attractor networks show a more limited repertoire of variability patterns. In the SSN model, fluctuations in bump location and width led to very weak variability modulation between orthogonal cells, and a much shallower modulation between similarly tuned cells. The attractor model did not display this cancellation even for orthogonally tuned cells. Therefore, correlations between orthogonally tuned cells were modulated as strongly as similarly tuned cells, in disagreement with experimental data.

Lastly, the authors explored the temporal dynamics of variability modulation, because this can potentially show fundamental differences between the three different mechanisms. In order to do so, they measured the timescales on which suppression and recovery of variability took place. For SSNs, suppression and recovery were fastest and occurred on the same timescale as the membrane time constant, in agreement with experiments. In contrast, chaotic networks were 4-15 times slower, and the multi-attractor network was at least 20 times slower than the single cell time constant.

In conclusion, the authors proposed a robust model (SSN) which can capture key aspects of variability modulation, such as stimulus-induced quenching, stimulus-tuned quenching, and realistic timescales of suppression and recovery, without the need for precise fine-tuning of parameters or large amounts of noise. The main insight here is that in SSNs, the supralinear dependence of firing on inputs increases the effective connectivity with increasing input, which in turn modulates the variability and covariability of responses. Two effects explain the modulation of variability: balanced amplification, which amplifies variability due to E-I interactions and dominates at weak input; and inhibitory feedback, which quenches variability by generating strong inhibitory input in the network and dominates at large inputs. In combination, these mechanisms robustly reproduce experimentally observed spatial and temporal patterns of variability quenching and modulation.

References:

[1] Ahmadian, Y., Rubin D.B., and Miller K.D. (2013). Neural Comput. 25, 1994-2037.
[2] Churchland, M.M., Yu, B.M., Cunningham, J.P., Sugrue, L.P., Cohen, M.R., Corrado, G.S., Newsome, W.T., Clark, A.M., Hosseini, P., Scott, B.B., et al. (2010). Nat. Neurosci. 13, 369-378.
[3] Lin, I.-C., Okun, M., Carandini, M., and Harris, K.D. (2015). Neuron 87, 644-656.
[4] Lombardo, J., Macelliao, M., Liu, B., Osborne, L.C., and Palmer, S.E. (2015). In 2015 Neuroscience Meeting Planner (online) (Washington, DC: Society for Neuroscience).
[5] Molgedey, L., Schuchhardt, J., and Schuster, H.G. (1992). Phys. Rev. Lett. 69, 3717-3719.
[6] Ponce-Alvarez, A., Thiele, A., Albright, T.D., Stoner, G.R., and Deco, G. (2013). Proc. Natl. Acad. Sci. USA 110, 13162-13167.

Neural nets for audio predict brain responses

A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy

By Alexander Kell, Dan Yamins, Erica Shook, Sam V. Norman-Haignere, Josh H. McDermott

Summary by Josue Ortega-Caro

In this work, the authors study the capabilities of a deep neural network to predict both neural responses of human auditory cortex and the behavior of human participants. In addition, the authors point to their results as evidence of hierarchical processing in the auditory cortex. Their results are part of a growing body of work that uses deep neural networks as models of human cortex (see [1] for references), especially the visual cortex.

The assumption behind this literature states that everyday perceptual tasks may impose strong constraints on the brain. These constraints may yield general principles of how tasks can be solved within a distributed network architecture. Therefore, deep neural networks trained directly on these perceptual tasks are likely to implement the same solutions as the cortex, and thereby exhibit brain-like representations and transformations.

Through comparisons to humans in several different noise conditions, the authors show the strong similarities between their models and humans in speech and music recognition tasks. These results suggest that deep neural networks also provide a good model for auditory cortex.

The authors used two tasks: speech recognition and music recognition. Every signal was preprocessed by transforming it into a cochleogram before passing it through the network. On the first task, subjects were given a two-second clip of speech, and were asked to recognize one of 587 possible words in the middle of a sentence. On the second task, subjects listened to a two-second clip of a song, and were asked to identify one of 41 possible genres to which it belongs. In addition, every two-second clip was perturbed by different types and levels of background noise, as a way to compare the models’ and humans’ abilities to perform auditory recognition under difficult conditions.

First, the authors performed an architecture search over 180 architectures to find neural networks that perform well on speech and music recognition independently. Next they searched for ways to combine the architectures into a single network that performed both tasks. The authors argue that speech and music recognition should share low-level auditory features, as is the case for visual recognition tasks. (It was not clear why they wanted to combine networks trained on each task, except perhaps as a way of decreasing the number of parameters needed for both tasks together.) The final network has three parts: a shared core (up to Conv3) for both tasks, followed by word-classifier and genre-classifier branches, selected by their performance on both the speech and music recognition tasks.

The authors compare the behavior (performance) of the model and humans under different noise conditions. Figure 2 shows that, across five types of noise at six intensity levels, the human and model behavior have r^2 values of 0.25 and 0.92 for music and speech recognition, respectively. One concern raised during the discussion was the inter-human variance in the task (measured via r^2 between humans): if humans do not perform this task consistently, then the small r^2 between the model and humans for music recognition is a better result than it first appears. An important caveat stated by the authors is that the genre recognition task requires very fine distinctions between genres, e.g. “New Age” versus “Ambient”. Therefore, human subjects were asked to perform top-5 categorization.
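To make the comparison concrete, here is a minimal sketch of how an r^2 across noise conditions could be computed, assuming it is the squared Pearson correlation between per-condition scores (the review does not specify the exact definition used in the paper). The scores below are made-up numbers for illustration only, not data from the study.

```python
def r_squared(xs, ys):
    # Squared Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical proportion-correct scores, one entry per noise condition.
human = [0.90, 0.80, 0.65, 0.40, 0.20]
model = [0.85, 0.78, 0.60, 0.45, 0.15]
print(r_squared(human, model))
```

The same function applied to two groups of human listeners would give the human-to-human r^2, which bounds how high a model-to-human value can reasonably be expected to go.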

Moreover, they tested the ability of the model trained on speech and music recognition to predict the fMRI responses of 8 participants to 165 natural sounds. Some of these sounds were speech and music related but most (113 out of 165) were not. They fit independent linear readouts from the activations of different layers of their network in order to find which layers better predict each voxel's response to the 165 natural sounds. Figure 3 shows that the median variance explained by their model is higher than that of a spatiotemporal model (baseline model) and a random network baseline. This shows that training the model did help its predictive power.

The most interesting result comes when they examine the fine-grained predictions of different model layers for different voxels in the auditory hierarchy. In Figure 4, their hierarchical map shows higher correlation between early auditory cortex and the voxel responses predicted from the shared-core layers. At the same time, higher cortical areas have higher correlation with their respective task-specific branch. In addition, they show that even though the shared layers can predict the higher cortical areas, the task-specific branches predict those areas significantly better than the shared core does, and vice versa. Furthermore, music- and speech-specific voxels are better predicted by their respective branches of the model. Even though this result is very clear, we were not completely sure whether the correlation differences between the speech and music branches are significant.

Moreover, the authors explore the ability of their model to explicitly represent acoustic features of the stimulus (via linear decodability). They argue that this is a way to understand the representation learned by the model. They show that the variance explained for the spectral filters is higher in early layers, and that the variance explained for the spectral modulation peaks at intermediate layers. In contrast, the untrained network has a similar profile but not the same increase or decrease in decodability as the trained network. One experiment they might have done is to linearly decode the spectral filters from the fMRI data and see whether the voxels with higher decodability also have high correlation with the early layers of the network.

Lastly, the authors show in Figure 7 that the variance explained by networks at intermediate points during training is correlated with their performance in the speech and music tasks. I think this is the clearest result of their philosophical position: There is a linear relationship between the network’s performance on an everyday task, and the presence of brain-like representational transformations.

In conclusion, the authors show a correlation between humans and deep neural networks along neural, behavioral and representational axes during speech and music recognition tasks. As stated before, this framework of task-optimized neural networks is becoming widespread throughout the neuroscience community, having been applied to visual cortex [1], motor cortex [2], and other areas, and now to auditory cortex. It is important to note, as the authors also do, that their behavioral, neural and representational comparisons are only sufficient to suggest correlations between humans and deep neural networks. We still need more causal ways to compare both systems in order to see if the similarities are superficial (merely because of the highly-optimized nature of both humans and models on speech and music tasks) or if they reveal a deeper relationship between deep neural models and humans. In addition, we wondered if including other neural motifs such as recurrence, feedback modulation and cell-specific functions would help improve the results presented in this paper.

[1] Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619-8624.

[2] Michaels, J. A., Dann, B., Intveld, R. W., & Scherberger, H. (2018). Neural dynamics of variable grasp movement preparation in the macaque fronto-parietal network. Journal of Neuroscience, 2557-17.

Sparse Manifold Transform

The Sparse Manifold Transform (SMT)

by Yubei Chen, Dylan Paiton, and Bruno Olshausen

https://arxiv.org/abs/1806.08887

Theories of neural computation have included several different ideas: efficient coding, sparse coding, nonlinear transformations, extracting invariances and equivariances, hierarchical probabilistic inference using priors over natural sensory inputs. This paper combined a few of these in a theoretically satisfying way. It is not a machine-learning paper with benchmarks, but is rather a conceptual framework for thinking about neural coding and geometry. It can be applied hierarchically to generate a deep learning method, and the authors are interested in applying this to more complex problems and data sets, but haven’t yet.

The three main ideas synthesized in this paper are: sparse coding, manifold flattening, and continuous representations.

Sparse coding represents an input \boldsymbol{x} using an overcomplete basis set \Phi (a “dictionary”), by encouraging sparseness (rarity of nonzero coefficients a_i): \boldsymbol{x}\approx\sum_i \Phi_i a_i. This can be implemented using a biologically plausible neural network that creates a nonnegative sparse code \boldsymbol{a} for input \boldsymbol{x} with inhibitory lateral connections [Rozell et al 2008].
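As a small worked example of what "sparse" means here, the sketch below infers a code for one input using ISTA (iterative soft thresholding). This is a swapped-in, generic sparse-coding solver chosen for brevity; it is not the neural circuit of Rozell et al. and not the method of the SMT paper. The three-element dictionary, the penalty `lam` and the step size are hypothetical values chosen so that the answer is easy to inspect.

```python
import math

def ista(x, Phi, lam=0.1, step=0.5, n_iter=200):
    # Phi is a list of dictionary elements, each a list as long as x.
    # Minimizes 0.5 * ||x - sum_i Phi_i a_i||^2 + lam * sum_i |a_i|.
    # step must not exceed 1 / (largest eigenvalue of Phi^T Phi); 0.5 suits
    # the toy dictionary below, whose largest eigenvalue is 2.
    dim = len(x)
    a = [0.0] * len(Phi)
    for _ in range(n_iter):
        recon = [sum(a[i] * Phi[i][d] for i in range(len(Phi))) for d in range(dim)]
        resid = [x[d] - recon[d] for d in range(dim)]
        for i in range(len(Phi)):
            # Gradient step on the reconstruction error ...
            z = a[i] + step * sum(Phi[i][d] * resid[d] for d in range(dim))
            # ... followed by soft thresholding, which zeroes small coefficients.
            a[i] = math.copysign(max(abs(z) - step * lam, 0.0), z)
    return a

s = 1.0 / math.sqrt(2.0)
Phi = [[1.0, 0.0], [0.0, 1.0], [s, s]]  # overcomplete: 3 elements in 2 dimensions
x = [1.0, 1.0]
a = ista(x, Phi)
print(a)
```

The input could be written with the first two elements, but the sparsity penalty favors the single diagonal element, so only the third coefficient ends up nonzero.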

A manifold is a smooth curved space that is locally isomorphic to a Euclidean space. Natural images are often thought of as occupying low-dimensional manifolds embedded within the much larger space of all possible images. That is, any single natural image with N pixels is a point \boldsymbol{x} that lives in a manifold \boldsymbol{x}\in\mathcal{M}\subset\mathbb{R}^N. Since the intrinsic dimensionality of the manifold is much smaller than the embedding space, we should be able to compress the image by constructing coordinates within the manifold. These intrinsic coordinates could then provide a locally flat space for the data. They may also provide greater interpretability. With some possibly substantial geometric or topological distortions (including rips), the whole manifold could be flattened too.

People have published several ways to flatten manifolds. A prominent one is Locally Linear Embedding [Roweis and Saul 2000], which expresses points on the manifold as linear combinations of nearby data points. A coarser and more compressed variant is Locally Linear Landmarks (LLL) [Vladymyrov and Carreira-Perpinán 2013], which just uses a smaller set of points as “landmarks” rather than using all data. In both variants, each landmark \Phi_i in the original space is mapped to a landmark P_i in a lower-dimensional space, so that points in the original curved manifold, \boldsymbol{x}=\Phi\boldsymbol{a}, are mapped to points in the flat space, \boldsymbol{y}=P\boldsymbol{a}, using the same coordinates \boldsymbol{a}.

The SMT authors observe that sparse coding and manifold flattening are connected. Since you only use a few landmarks to reconstruct the original signal, the coefficients \boldsymbol{a} are sparse: the LLL is a sparse code for the manifold.

Continuous representations: In general, sparse coding does not impose any particular organization on the dictionary elements. One can pre-specify some locality structure to the dictionary, as in Topographic ICA [Hyvärinen et al 2001] which favors dictionary elements \Phi_i that have high similarity to other dictionary elements \Phi_j when i and j are close. In SMT, the authors propose to find dictionaries P in the target space for which smooth trajectories \boldsymbol{y}(t)=P\boldsymbol{a}(t) are flat. That is, any representation \boldsymbol{y}(t) at time t should be halfway between the representations at the previous and next frames: P\boldsymbol{a}(t)=\frac{1}{2}P\boldsymbol{a}(t-1)+\frac{1}{2}P\boldsymbol{a}(t+1). If the target trajectories \boldsymbol{y}(t) are curved, this will not be true. The authors achieve this by minimizing the average second derivative of the data trajectories in the flattened space, \langle\|\ddot{\boldsymbol{y}}\|^2\rangle. They have an analytic solution that works in some cases, but they use stochastic gradient descent to accommodate more flexible constraints.
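The flatness criterion above can be illustrated with a discrete second difference, \boldsymbol{y}(t-1)-2\boldsymbol{y}(t)+\boldsymbol{y}(t+1), which is zero exactly when each point is the midpoint of its neighbors. The sketch below is an illustration only: it evaluates the average squared second difference on two hypothetical trajectories given directly in the target space, rather than learning P as the authors do.

```python
import math

def mean_sq_second_difference(traj):
    # traj is a list of points y(t), each a list of coordinates.
    # Returns the average over interior frames of ||y(t-1) - 2 y(t) + y(t+1)||^2.
    total = 0.0
    for t in range(1, len(traj) - 1):
        for prev, cur, nxt in zip(traj[t - 1], traj[t], traj[t + 1]):
            total += (prev - 2.0 * cur + nxt) ** 2
    return total / (len(traj) - 2)

# A straight trajectory: every frame is the midpoint of its neighbors.
straight = [[0.5 * t, 1.0 - 0.25 * t] for t in range(10)]
# A curved trajectory: points moving around the unit circle.
curved = [[math.cos(0.3 * t), math.sin(0.3 * t)] for t in range(10)]

print(mean_sq_second_difference(straight))
print(mean_sq_second_difference(curved))
```

The straight trajectory gives zero and the circular one gives a positive value. In SMT the same quantity would be evaluated on P\boldsymbol{a}(t) and minimized with respect to P.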

The authors relate this idea to Slow Feature Analysis (SFA) [Wiskott and Sejnowski 2002], another unsupervised learning concept that I really like. In SFA, one finds nonlinear transformations \boldsymbol{y}=\boldsymbol{g}(\boldsymbol{x})=\sum_i w_ih_i(\boldsymbol{x}) that change slowly, as measured by \langle\|\dot{\boldsymbol{y}}\|^2\rangle. The idea is that these directions are features that are invariant to fast changes in the appearance of objects (lighting, pose) and encode instead the consistent properties of those objects. In SMT, this is achieved by minimizing the second derivative of \boldsymbol{y} instead of the first.

There’s actually an interesting distinction to be drawn here: their manifold flattening should not throw away information, as do many models for extracting invariances, including SFA. Instead the coordinates on the manifold actually try to represent all relevant directions, and the flat representation ensures that changes in the appearance \boldsymbol{x} are equally matched by changes in the representation \boldsymbol{y}. This concept is called equivariance. The goal is reformatting for easier access to properties of interest.

Applications

Finally, the authors note that they can apply this method hierarchically, too, to progressively flatten the manifold more and more. Each step expands the representation nonlinearly to obtain a sparse code \boldsymbol{a}, and then compresses it to a flatter representation \boldsymbol{y}.

They demonstrate their approach on one toy problem for illustration, and then apply it to time sequences of image patches. As in many other methods, they recover Gabor-like features once again, but they discover some locality to their Gabors and smoothness in the sparse coefficients and target representation. When applied iteratively, the Gabors seem to get a bit longer and may find some striped textures, but nothing impressive yet. Again I think the conceptual framework is the real value here. The authors plan on deploying their method on larger scales, and it will be interesting to see what emerges.

Discussion

Besides the foundational concepts (sparse coding, LLL, SFA), the authors only briefly mention other related works. It would be helpful to see this section significantly expanded, both in terms of how the conceptual framework relates to other approaches, and how performance compares.

One fundamental and interesting concept that the authors could elaborate on substantially more is how they think of the data manifold. They state that “natural images are not well described as a single point on a continuous manifold of fixed dimensionality,” but that is how I described the manifold above: an image was just a point \boldsymbol{x}\in\mathcal{M}. Instead they favor viewing images as a function over a manifold. A single location on the manifold is like a 1-sparse function over the manifold (with a fixed amplitude). They prefer instead to allow an h-sparse function, i.e. h points on the same manifold. I emailed the authors asking for clarification about this, and they graciously answered quickly. Bruno said “an image patch extracted from a natural scene could easily contain two or more edges moving through it in different directions… which you can think of that as two points moving in different trajectories along the same manifold. We would call that a 2-sparse function on the manifold. It is better described this way rather than collapsing the entire image patch into a single point on a bigger manifold, because when you do that the structure of the manifold is going to get very complicated.” This is an interesting perspective, even though it does not come through well in the current version. One crucial thing to highlight from this quote is that the h-sparse functions lie on a smaller manifold than the N-dimensional image space. What is this smaller manifold? Can it be embedded in the same pixel space? And how do the h points on the manifold interact in cases like occlusion or yoked parts?

Yubei suggested this view could have a major benefit to hierarchical inference: “Just like we reuse the pixels for different objects, we can reuse the dictionary elements too. Sparsity models the discreteness of the signal, and manifolds model the continuity of the signal, so we propose that a natural way to combine them is to imagine a sparse function defined on a low dimensional manifold.”

The geometry of the sensory space is quite interesting and important and merits deeper thinking. It would also be interesting to consider how the manifold structure of sensory inputs affects the manifold structure of beliefs (e.g. posterior probabilities) about those inputs. This seems like a job for Information Geometry [Amari and Nagaoka 2007].

References:

Amari SI, Nagaoka H. Methods of information geometry. American Mathematical Soc.; 2007.

Hyvärinen A, Hoyer PO, Inki M. Topographic independent component analysis. Neural computation. 2001 Jul 1;13(7):1527-58.

Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000 Dec 22;290(5500):2323-6.

Rozell CJ, Johnson DH, Baraniuk RG, Olshausen BA. Sparse coding via thresholding and local competition in neural circuits. Neural computation. 2008 Oct;20(10):2526-63.

Vladymyrov M, Carreira-Perpinán MA. Locally linear landmarks for large-scale manifold learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases 2013 Sep 23 (pp. 256-271). Springer, Berlin, Heidelberg.

Wiskott L, Sejnowski TJ. Slow feature analysis: Unsupervised learning of invariances. Neural computation. 2002 Apr 1;14(4):715-70.

Linking connectivity, dynamics and computations in recurrent neural networks

Summary by Alan Akil

In a recent preprint, Mastrogiuseppe and Ostojic discuss an important extension of classical results on neural networks. Theoretical results dating from the late 80s [e.g. Sompolinsky, Crisanti and Sommers 1988] show how high-dimensional, random (chaotic) activity arises robustly in networks of rate units whose activity evolves according to

x'_i(t) =-x_i(t)+\sum_{j=1}^{N}J_{ij}\phi(x_j(t))+I_i

for nonlinear activation functions \phi. However, this classical work assumes that connectivity in the network is unstructured. Mastrogiuseppe and Ostojic discuss the case when the connectivity matrix has structure that is defined by a low-rank matrix. In particular, they assume that the connectivity is given by a sum of a low-rank and a random matrix, as

J_{ij}=g \chi_{ij}+P_{ij}

where g is the disorder strength, \chi_{ij} is an all-to-all Gaussian random matrix whose elements are drawn independently from a centered normal distribution with variance 1/N, and P_{ij}=\frac{m_in_j}{N}, where m and n are N-dimensional vectors.
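The construction above is easy to simulate on a small scale. The following is a rough sketch under simplifying assumptions that are not from the preprint: the entries of m are random signs, n is a scaled copy of m (so the structure strength is set by the hypothetical parameter `strength`), \phi is tanh, there is no external input, and the dynamics are integrated with a plain Euler scheme. The returned quantity `kappa`, the average of n_j \phi(x_j), is used here simply as a measure of how much activity lies along the structured direction.

```python
import math
import random

def simulate_rank_one(N=100, g=0.2, strength=2.0, T=40.0, dt=0.1, seed=0):
    rng = random.Random(seed)
    # Structure vectors (simplifying assumption: random signs, n parallel to m).
    m = [rng.choice([-1.0, 1.0]) for _ in range(N)]
    n = [strength * mi for mi in m]
    # J_ij = g * chi_ij + m_i n_j / N, with chi_ij drawn from N(0, 1/N).
    J = [[g * rng.gauss(0.0, 1.0 / math.sqrt(N)) + m[i] * n[j] / N
          for j in range(N)] for i in range(N)]
    # Start with a small amount of activity along m.
    x = [0.1 * mi for mi in m]
    for _ in range(int(round(T / dt))):
        phi = [math.tanh(xj) for xj in x]
        # Euler step of x_i' = -x_i + sum_j J_ij phi(x_j), with I_i = 0.
        x = [xi + dt * (-xi + sum(Jij * pj for Jij, pj in zip(row, phi)))
             for xi, row in zip(x, J)]
    # Overlap of the network output with the structure vector n.
    return sum(nj * math.tanh(xj) for nj, xj in zip(n, x)) / N

kappa_structured_only = simulate_rank_one(g=0.0)
kappa_weak_disorder = simulate_rank_one(g=0.2)
print(kappa_structured_only, kappa_weak_disorder)
```

With the random part switched off (g = 0), the activity settles where kappa equals strength times tanh(kappa). With weak disorder the structured state should persist with a similar overlap.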

Interestingly, the model remains highly tractable: the activity can either be predominantly confined to a few dimensions determined by the (constant) input and the vectors that define the low-rank component of the connectivity, or it can exhibit high-dimensional chaos, when the unstructured component dominates. The tractability of the model allowed the authors to design networks that perform specific computations. Moreover, increasing the rank of the structured part of the connectivity leads to networks that can support an ever wider dynamical repertoire, accompanied by an expanding computational capacity. This allows for the implementation of complex tasks such as context-dependent decision making.

The authors start by studying a recurrent network with a rank-one structured connectivity component and no external input. In this case the network supports four states of spontaneous activity that depend mainly on disorder strength and structure strength. For instance, strong structure and low disorder lead to heterogeneous firing rates that are approximately constant in time. The most interesting case occurs when the strengths are comparable, leading to a structured chaotic state characterized by approximately one-dimensional dynamics accompanied by high-dimensional temporal fluctuations. This state is characterized by the emergence of very slow timescales, which may be of separate interest [40]. Importantly, the transitions between these four states can be obtained analytically.

Next, the authors examine what happens when a constant, spatially heterogeneous input drives the neurons in the network [e.g. Rajan, Abbott, Sompolinsky 2010]. In this case, the relation between the left- and right-structure vectors and the input gives rise to different dynamics in the network. The two structure vectors play different roles: the right-structure vector determines the output pattern of network activity, while the left-structure vector selects the inputs that give rise to patterned outputs. An increase in external input generally leads to suppression of chaotic and bistable dynamics.
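
The different roles of the two vectors can be checked numerically. In the sketch below, n is chosen to overlap with the input I, so the input is "selected", and the steady-state activity is then well described by a combination of m and I. The dynamics dx/dt = -x + J\phi(x) + I with \phi = tanh, the choice n = 2I, and all parameter values are our assumptions for illustration.

```python
import numpy as np

def run_to_steady_state(J, I, T=60.0, dt=0.1):
    """Euler-integrate dx/dt = -x + J @ tanh(x) + I starting from x = 0."""
    x = np.zeros_like(I)
    for _ in range(int(round(T / dt))):
        x = x + dt * (-x + J @ np.tanh(x) + I)
    return x

rng = np.random.default_rng(2)
N, g = 600, 0.3                      # assumed values
m = rng.normal(size=N)               # shapes the output pattern
I = rng.normal(size=N)               # constant, heterogeneous input
n = 2.0 * I                          # assumption: n overlaps with the input
chi = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
J = g * chi + np.outer(m, n) / N

x = run_to_steady_state(J, I)

# Activation of the structured direction: kappa = n . phi(x) / N.
kappa = n @ np.tanh(x) / N

# Fraction of the squared norm of x captured by the plane spanned by m and I.
basis = np.column_stack([m, I])
coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
resid = x - basis @ coef
explained = 1.0 - (resid @ resid) / (x @ x)
```

A value of `explained` close to 1 reflects the low-dimensional nature of the response: apart from a small contribution from the random connectivity, activity lies in the plane spanned by m and I.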

Networks with structured connectivity can be used to perform specific computations, and the authors start with a simple Go-Nogo discrimination task (equivalent to simple classification). Here the animal has to produce a specific motor output in response to one sensory stimulus (the Go stimulus) and ignore all others (Nogo stimuli). This implementation shows desirable computational properties, such as generalization to noisy or novel stimuli, and was extended to the detection of multiple stimuli. However, as far as we could see, although individual units are nonlinear, the network still acts as a linear discriminator.
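
A toy version of the Go-Nogo setup can be sketched as follows. We assume n is aligned with the Go pattern and the readout is aligned with m, and we switch off the random part (g = 0) so that the rank-one term can be applied without building the full matrix. These choices, the function `readout_after_stimulus`, and all parameter values are our own simplifications and not the authors' exact construction.

```python
import numpy as np

def readout_after_stimulus(stim, m, n, w, T=50.0, dt=0.1):
    """Integrate dx/dt = -x + m * (n . tanh(x)) / N + stim; return w . tanh(x) / N."""
    N = m.size
    x = np.zeros(N)
    for _ in range(int(round(T / dt))):
        r = np.tanh(x)
        x = x + dt * (-x + m * (n @ r) / N + stim)
    return w @ np.tanh(x) / N

rng = np.random.default_rng(3)
N = 4000
m = rng.normal(size=N)
go = rng.normal(size=N)              # Go stimulus pattern
nogo = rng.normal(size=N)            # a Nogo pattern, independent of Go
n = 2.0 * go                         # assumption: n selects the Go pattern
w = m                                # assumption: readout along m

z_go = readout_after_stimulus(go, m, n, w)
z_nogo = readout_after_stimulus(nogo, m, n, w)
# A noisy version of the Go stimulus, to illustrate generalization.
z_noisy_go = readout_after_stimulus(go + 0.3 * rng.normal(size=N), m, n, w)
```

In this sketch the readout is clearly larger for the Go pattern (and its noisy version) than for the Nogo pattern, because only inputs that overlap with n activate the structured direction m.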

Rank-two structure in the connectivity matrix leads to a richer repertoire of behaviors. The authors do not provide a full dynamical description. However, they show that the two unit-rank terms in the connectivity can implement two independent input-output channels. This observation allows for the implementation of a network that performs a two-alternative forced-choice (2AFC) task, which requires two different classes of inputs to be mapped to two different readout directions.

Networks with a rank-two structure can also support a continuum of spontaneous states that lie on a circle in the m_1-m_2 plane. The points on this ring-like attractor lie on a slow manifold, and the ring structure is remarkably robust to changes in the disorder strength.
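
One simple way to obtain such a ring is a rotationally symmetric rank-two structure. The sketch below assumes P = \rho (m_1 m_1^T + m_2 m_2^T)/N with independent Gaussian m_1 and m_2, no random part (g = 0), and \phi = tanh. This is our own choice for illustration and not necessarily the construction used in the paper. Starting from different angles in the m_1-m_2 plane, the state relaxes to roughly the same radius while keeping its angle.

```python
import numpy as np

rng = np.random.default_rng(4)
N, rho = 4000, 2.0                   # assumed values
m1 = rng.normal(size=N)
m2 = rng.normal(size=N)
basis = np.column_stack([m1, m2])

def relax(theta0, r0=1.0, T=20.0, dt=0.1):
    """Start at angle theta0 in the m1-m2 plane and return the final (kappa1, kappa2)."""
    x = r0 * (np.cos(theta0) * m1 + np.sin(theta0) * m2)
    for _ in range(int(round(T / dt))):
        r = np.tanh(x)
        # Rank-two recurrent input, applied without forming the N x N matrix.
        x = x + dt * (-x + rho * (m1 * (m1 @ r) + m2 * (m2 @ r)) / N)
    kappa, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return kappa

thetas = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
kappas = np.array([relax(th) for th in thetas])

radii = np.hypot(kappas[:, 0], kappas[:, 1])
radius_spread = radii.std() / radii.mean()   # relative variation of the radius
kappa1_range = np.ptp(kappas[:, 0])          # spread of final states along m1
```

At finite N the ring is only approximate: the radius varies slightly with angle and the state drifts slowly along the ring, which is consistent with the description of these states as lying on a slow manifold.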

A rank-two structure can also be used to implement a context-dependent discrimination task. In this case, the stimuli are random dot kinetograms characterized by two features, A and B: direction of motion and color, respectively. The task consists of classifying these stimuli according to one of the two features, selected by an explicit contextual cue. The stimulus features are represented as independent and thus mutually orthogonal. The key to this implementation is that the irrelevant feature must be ignored, no matter how strong it is. The task was implemented successfully: in context A, the output was nearly independent of feature B, and similarly for context B.

Lastly, the authors considered an example in which the geometrical configuration was such that the right- and left-structure vectors exhibited cross-overlaps. In particular, one of these cross-overlaps was negative, which implies that the two vectors were anti-correlated. This gave rise to an effective negative feedback loop, which could generate oscillatory activity. In a particularly interesting regime this activity was a low-dimensional mixture of oscillatory and chaotic activity. Also, since different units have very diverse temporal profiles of activity, a linear readout unit added to the network can exploit them as a rich basis set for constructing a range of periodic outputs.
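
The effect of a negative cross-overlap can be illustrated with a small sketch. We assume n_1 = a m_1 + b m_2 and n_2 = -b m_1 + a m_2, so that n_2 has a negative overlap with m_1, no random part (g = 0), and \phi = tanh. The values of a and b are arbitrary choices of ours. The nonzero eigenvalues of the rank-two matrix are those of the 2x2 overlap matrix, which form a complex pair here, and the simulated activity along m_1 oscillates.

```python
import numpy as np

rng = np.random.default_rng(5)
N, a, b = 2000, 2.0, 2.0             # assumed values
m1 = rng.normal(size=N)
m2 = rng.normal(size=N)
n1 = a * m1 + b * m2
n2 = -b * m1 + a * m2                # negative cross-overlap between n2 and m1

# 2x2 matrix of overlaps n_k . m_l / N; its eigenvalues are the nonzero
# eigenvalues of P = (m1 n1^T + m2 n2^T) / N.
overlaps = np.array([[n1 @ m1, n1 @ m2],
                     [n2 @ m1, n2 @ m2]]) / N
eigs = np.linalg.eigvals(overlaps)

dt, T = 0.05, 40.0
x = 0.5 * m1                         # small initial state along m1
k1 = []
for _ in range(int(round(T / dt))):
    r = np.tanh(x)
    x = x + dt * (-x + (m1 * (n1 @ r) + m2 * (n2 @ r)) / N)
    k1.append(m1 @ x / N)            # approximate projection of activity on m1
k1 = np.array(k1)

zero_crossings = int(np.sum(np.diff(np.sign(k1)) != 0))
late_amplitude = np.abs(k1[-200:]).max()   # amplitude over the last 10 time units
```

Because each unit mixes the two oscillating components with its own weights m_1 and m_2, different units oscillate with different amplitudes and phases, which is what a linear readout can exploit as a basis set.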

This work builds on a range of ideas in computational neuroscience, from Hopfield networks and echo-state networks (ESN) to FORCE learning. In the framework of Hopfield networks, memory patterns are stored in the network by adding a rank-one term for each pattern to the connectivity matrix. There are studies in which the connectivity matrix consists of a sum of rank-one terms and a random part [51, 52, 53]. This is similar to the approach used here, but it differs in three ways. First, in those studies the rank-one terms are symmetric, whereas here the authors consider arbitrary right- and left-structure vectors. Second, the rank-one terms are generally uncorrelated, whereas here general vectors are considered. Third, the focus of this paper is not on fixed points of spontaneous activity but on responses to external inputs and on input-output computations. While in Hopfield networks the focus is on stored patterns and network capacity, here the authors show that the full dynamical repertoire relies on the geometrical arrangement of the structure vectors, and that moving to a rank-two structure significantly increases computational capacity.

In the frameworks of ESN and FORCE learning, randomly connected recurrent networks are trained to produce specified outputs using a feedback loop from the readout unit to the network. This feedback loop is equivalent to adding a rank-one term to the connectivity matrix, where the left-structure vector corresponds to the readout vector and the right-structure vector corresponds to the feedback vector. When the analysis was extended to the ESN case, the solutions matched those obtained with ESNs. Also, the correlations between the rank-one structure obtained through training and the realization of the random matrix are weak (they are zero for ESN), and the readout error scales as 1/\sqrt{N}.
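
The equivalence between a readout-feedback loop and a rank-one connectivity term is a short piece of linear algebra, which the sketch below verifies numerically. The names `w_out` and `w_fb` and the 1/N scaling of the readout are our assumptions, chosen to match the definition of P_{ij} above. Actual ESN and FORCE implementations may use a different normalization.

```python
import numpy as np

rng = np.random.default_rng(6)
N, g = 300, 0.8                      # assumed values
chi = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
w_out = rng.normal(size=N)           # readout weights, playing the role of n
w_fb = rng.normal(size=N)            # feedback weights, playing the role of m
x = rng.normal(size=N)               # an arbitrary network state
r = np.tanh(x)                       # firing rates

# Explicit loop: read out z, then feed it back to every unit.
z = w_out @ r / N
input_with_loop = g * (chi @ r) + w_fb * z

# Equivalent recurrent network: random part plus a rank-one term.
J_eff = g * chi + np.outer(w_fb, w_out) / N
input_effective = J_eff @ r

max_difference = np.max(np.abs(input_with_loop - input_effective))
```

The two recurrent inputs agree up to floating-point error for any state x, since w_fb (w_out . r)/N is the same as (w_fb w_out^T / N) r.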

It is important to note that the class of networks proposed here lacks many biophysical constraints. Regardless, the authors show that in low-rank recurrent networks the representation of stimuli and outputs is high-dimensional, distributed and mixed, while the computations are based on emergent low-dimensional dynamics, as found in large-scale recordings of behaving animals [2]. Additionally, this class of networks has the property that stimulus onset reduces variability in neural activity, which is also seen in experiments. The unit-rank structure inferred from computational constraints reproduces known properties of synaptic connectivity: if two neurons both strongly encode some stimulus, their reciprocal connections are stronger than expected, in accord with experimental findings.

In conclusion, the authors used mean-field analysis to describe in detail the spontaneous and stimulus-evoked activity of a network whose connectivity combines a random component with a low-rank structured component. A key result is that low-rank structure in the connectivity induces low-dimensional dynamics in the network, a hallmark of population activity recorded in behaving animals. Additionally, they predicted the low-dimensional subspace that contains the dominant part of the dynamics based on the connectivity and input structure, and showed that the dynamical repertoire increases sharply with the rank of the connectivity structure. Finally, they also showed how to easily implement context-dependent computations, a task that can be challenging in realistic neural networks.

Note: The authors have notified us that they can also implement a discrimination task that cannot be performed with a linear discriminator.