Review of “Optimal policy for multi-alternative decisions”

Paper by Satohiro Tajima, Jan Drugowitsch, Nisheet Patel, and Alexandre Pouget. Nature Neuroscience, August 5, 2019

Review by Nick Barendregt (CU, Boulder)

Summary

Organisms must develop robust and accurate decision-making strategies to survive in complex environments. Recent studies have largely focused on value-based or perceptual decisions in which observers must choose between two alternatives. However, many real-world situations require choosing among multiple options, and it is not clear whether strategies that are optimal for two-alternative tasks translate directly to multi-alternative tasks. To address this question, the authors use dynamic programming to find the optimal strategy for an n-alternative task. Using Bellman’s equation, they find that the belief thresholds at which the decision process terminates are time-varying, non-linear functions. To understand how observers could implement such a complex stopping rule, the authors then develop a neural network model that approximates the optimal strategy. Using this network model, they show that several experimental observations, such as violations of the independence of irrelevant alternatives (IIA) principle, that had been thought to reflect suboptimal behavior can in fact be explained by the non-linearity of the network. The authors conclude by using their network model to generate testable hypotheses about observer behavior in multi-alternative decision tasks.

Optimal Decision Strategy

To find the optimal strategy for an n-alternative decision task, the authors assume the observer accumulates evidence to update their belief, and that the observer commits to a choice whenever their belief becomes strong enough; mathematically, this corresponds to the belief crossing a threshold. To find these thresholds, the authors assume the observer sets them to maximize their reward rate: the average reward (less the average cost of accumulating evidence) per average trial length. These assumptions allow them to construct a utility, or value, function for the observer. At each timestep, the observer collects a new piece of evidence and uses it to update their belief. With this new belief, the observer calculates the utility associated with two classes of actions. The first class, which contains n actions, is committing to a choice; each such action has utility equal to the reward for a correct choice times the probability that the choice is correct (i.e., the belief that the choice is correct). The second class, which contains a single action, is waiting and drawing a new observation, which has utility equal to the average future utility less the cost of obtaining a new observation. The utility function takes the maximum over these n+1 actions. The decision thresholds are then given by the belief values at which the maximal-utility action changes.
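The n+1-way utility comparison above can be sketched in a few lines of Python. This is a hedged illustration, not the paper’s implementation: the reward, observation cost, and expected future utility are placeholder values, and in the actual dynamic program the expected future utility would itself come from solving Bellman’s equation.

```python
import numpy as np

def action_utilities(belief, reward=1.0, cost=0.05, expected_future_utility=0.6):
    """Utilities of the n 'commit' actions plus the single 'wait' action.

    belief: length-n vector of probabilities that each option is correct.
    All constants here are illustrative placeholders.
    """
    commit = reward * np.asarray(belief)   # n commit actions
    wait = expected_future_utility - cost  # 1 wait action
    return np.append(commit, wait)

def best_action(belief, **kw):
    """Index 0..n-1 = commit to that option; index n = wait."""
    return int(np.argmax(action_utilities(belief, **kw)))
```

With a weak belief the waiting action wins, while a sufficiently strong belief makes committing optimal; the belief values where the winner flips are exactly the decision thresholds described above.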

Using Bellman’s equation for the utility function, the authors find that the decision thresholds are non-linear functions that evolve in time. From the form of these thresholds, the authors surmise that the belief-update process can be projected onto a lower-dimensional space, and that the thresholds collapse as time increases, reflecting the observer’s urgency to make a choice and proceed to the next trial.
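As a toy illustration of a collapsing threshold: the exact shape in the paper comes from solving Bellman’s equation, so the exponential form and constants below are assumptions made only for this sketch.

```python
import math

# Illustrative collapsing belief threshold for a 3-alternative task:
# starts near certainty and decays toward chance level (1/3) as the
# pressure to decide grows. The functional form is an assumption.
def collapsing_threshold(t, theta0=0.9, tau=5.0, floor=1.0 / 3.0):
    """Belief threshold at time t, decaying from theta0 toward floor."""
    return floor + (theta0 - floor) * math.exp(-t / tau)
```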

Neural Network Model

To see how observers might approximate this non-linear stopping rule, the authors construct a recurrent neural network that implements a sub-optimal version of the decision strategy. The network has n neurons, one per option, which track the belief associated with each option. The network also includes divisive normalization, which is widespread in cortex, and an urgency signal that increases the gain over time. These two features allow the model to closely approximate the optimal stopping rule, and yield a similar lower-dimensional projection and collapsing thresholds. When comparing their network model to a standard race model, the authors find that adding normalization and urgency improves performance in both value-based and perceptual tasks, with normalization having the larger impact.
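A minimal sketch of the two key ingredients, divisive normalization and an urgency gain, in a race among accumulators. The dynamics, constants, and fixed firing-rate threshold here are illustrative assumptions, not the paper’s exact network; noise is omitted so the sketch is deterministic.

```python
import numpy as np

def simulate_race(mean_inputs, threshold=1.0, sigma=1.0,
                  urgency_slope=0.05, dt=0.1, max_steps=500):
    """Race among n accumulators with divisive normalization and urgency.

    mean_inputs plays the role of the n option values/evidence rates.
    Returns (choice index, decision time).
    """
    acc = np.zeros(len(mean_inputs))
    for step in range(1, max_steps + 1):
        t = step * dt
        acc += dt * np.asarray(mean_inputs)       # evidence accumulation
        gain = 1.0 + urgency_slope * t            # urgency signal
        rates = gain * acc / (sigma + acc.sum())  # divisive normalization
        if rates.max() >= threshold:              # fixed bound on activity
            return int(np.argmax(rates)), t
    return int(np.argmax(rates)), max_steps * dt
```

Because the gain grows with time, a fixed bound on the normalized activity acts like a collapsing bound on the underlying belief. The sketch also shows a Hick’s-law-like effect: with more alternatives, normalization shrinks each activity, so decisions take longer.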

Results and Predictions

Using their neural network model, the authors are able to reproduce several well-established results, such as Hick’s law for response times, and explain several behavioral and physiological findings in humans that had long been thought to be sub-optimal. First, because of the normalization, the model replicates violations of IIA, the principle that in a choice between two high-value options, adding a third, lower-value option should not influence the decision process. The normalization also replicates the similarity effect: when choosing between option 1 and option 2, adding a third option similar to option 1 decreases the probability of choosing option 1. The authors conclude that divisive normalization is the key explanation for these behaviors.
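The IIA violation can be seen in a two-line numerical example of divisive normalization (the option values and the semi-saturation constant sigma below are illustrative, not taken from the paper):

```python
import numpy as np

def normalized(values, sigma=1.0):
    """Divisively normalize a vector of option values (illustrative)."""
    v = np.asarray(values, dtype=float)
    return v / (sigma + v.sum())
```

Adding a low-value third option lowers the activities of both high-value options and shrinks the gap between them, so the supposedly irrelevant alternative does affect the comparison, violating IIA.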

After validating their model by reproducing these previously observed results, the authors make predictions about observer behavior in multi-alternative tasks. The main prediction concerns two types of strategies for multi-alternative tasks: the “max-vs.-average” strategy and the “max-vs.-next” strategy. The model predicts that the reward distribution across choices should cause observers to transition smoothly between these two strategies, a prediction that could be tested in psychophysics experiments.
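The two candidate stopping statistics can be written down directly; thresholding either one yields a different stopping rule. This is a sketch of the comparison, with an illustrative belief vector in mind rather than any dataset from the paper.

```python
import numpy as np

def max_vs_average(belief):
    """Best option's belief minus the average belief of all the others."""
    b = np.sort(np.asarray(belief))[::-1]
    return b[0] - b[1:].mean()

def max_vs_next(belief):
    """Best option's belief minus the runner-up's belief."""
    b = np.sort(np.asarray(belief))[::-1]
    return b[0] - b[1]
```

The two statistics disagree most when the runner-up is close to the best option but the remaining options are weak, which is where the predicted transition between strategies should be easiest to detect experimentally.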
