Okinawa Computational Neuroscience Course

Computational Modeling of Learning and Action Selection in Conditioned Behavior

Yael Niv, Gatsby Computational Neuroscience Unit, UCL, and Interdisciplinary Center for Neural Computation, Hebrew University

Humans, as well as animals, are decision makers at every instant: we must continuously decide (implicitly or explicitly) what actions to take in order to achieve our goals. A century of animal research in experimental psychology has revealed the role of two basic constantly ongoing learning processes, Pavlovian (classical) conditioning and instrumental conditioning, in shaping behavior. More recent investigations have further divided instrumental responding into goal-directed and habitual behavior, a distinction based on the underlying information controlling action selection in each.

Reinforcement learning (RL) is the computational field that models these types of animal learning and behavior within a normative, optimal control framework. This approach has achieved much success; most notably, it has led to the now well-accepted link between dopamine and prediction learning via a reward prediction error (Montague, Dayan & Sejnowski, 1996). This tight link between behavior, optimal control and neural substrates brought about a new understanding of the role of dopamine in learning and action selection and led to numerous verifiable predictions. It exemplifies the benefits of using a computational normative approach to study behavior and the brain.

In this project group we will use several different RL tools/approaches within a Markov decision process formalism, in order to understand and model conditioned behavior and action choice. The models we will consider are not biophysical models and do not correspond to the neural substrate directly, but are simplified abstractions of the computations presumably taking place in whole neural systems. The projects suggested below are most directly related to the lectures of Bernard Balleine, Peter Dayan, John O'Doherty and Nathaniel Daw. They are of different levels of difficulty, and all assume a learning curve on the student's part (i.e., they do not assume you already know everything needed for the project). The suggestions are meant as general guidelines only, and ideas for extensions or related projects are most welcome.

Suggested projects:

1) The effects of noise on temporal difference learning

Temporal difference (TD) learning is by now well ingrained in our thinking about the role of dopamine in learning. However, TD models usually assume a fully observable state space, which is known to be an unrealistic simplification. In this project we will examine the effects of different sources of noise on TD learning. We will consider external noise (probabilistic rewards, as in Nakahara et al. (2004), Morris et al. (2004) and Fiorillo et al. (2003)), internal noise resulting from a noisy representation, and, most importantly, timing noise, which is inherent in most learning scenarios (see the first part of "Time, rate and conditioning" by Gallistel & Gibbon (2000) for a comprehensive review).
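As a starting point, the following is a minimal sketch of TD(0) learning with a tapped-delay-line representation and probabilistic rewards (external noise only). It is not taken from any of the target papers: the function name run_td, the trial layout (a stimulus followed by a reward a fixed number of steps later), and all parameter values are illustrative assumptions.

```python
import random

def run_td(n_trials=2000, delay=5, p_reward=0.5, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) on a tapped delay line (hypothetical toy setup).

    Tap k is active k steps after stimulus onset; the reward, if any,
    arrives on leaving the last tap. The inter-trial state has value 0.
    Returns the learned weights and the per-trial TD errors.
    """
    rng = random.Random(seed)
    w = [0.0] * delay          # one weight (value estimate) per tap
    deltas = []
    for _ in range(n_trials):
        rewarded = rng.random() < p_reward
        # TD error at stimulus onset: transition from the inter-trial
        # state (value fixed at 0, not learned) into tap 0.
        trial_deltas = [gamma * w[0] - 0.0]
        for k in range(delay):
            last = (k == delay - 1)
            r = 1.0 if (last and rewarded) else 0.0
            v_next = 0.0 if last else w[k + 1]
            delta = r + gamma * v_next - w[k]   # TD prediction error
            w[k] += alpha * delta               # update the active tap
            trial_deltas.append(delta)
        deltas.append(trial_deltas)
    return w, deltas

if __name__ == "__main__":
    weights, errors = run_td(p_reward=0.5)
    print("learned tap weights:", [round(v, 2) for v in weights])
    print("TD errors on last trial:", [round(d, 2) for d in errors[-1]])
```

With deterministic rewards (p_reward=1.0) the prediction error in this sketch moves from the time of reward to stimulus onset. Internal noise and timing noise would have to be added on top, for example by jittering the reward time from trial to trial.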

Preliminary suggested directions:

a. (Basic) Investigate robustness of tapped-delay line TD to each source of noise and compare to available data.

Target papers:
Niv Y., Duff M.O. and Dayan P. -- The effects of uncertainty on TD learning (unpublished) (pdf)
Niv Y., Duff M.O. and Dayan P. (2005) -- Dopamine, uncertainty and TD learning (pdf)

b. (More advanced) Incorporate a semi-Markov framework and investigate scalar timing noise.

Target papers:
Daw N.D., Courville A.C. and Touretzky D.S. (2003) -- Timing and partial observability in the dopamine system (pdf)
Daw N.D., Courville A.C. and Touretzky D.S. (2002) -- Dopamine and inference about timing (pdf)
Gibbon J. (1977) -- Scalar expectancy theory and Weber's Law in animal timing -- Psychological Review, 84, 279-325.
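To make the notion of scalar timing noise in direction (b) concrete, here is a small sketch under one common reading of the scalar property: the standard deviation of a timed interval grows in proportion to the interval itself, so the coefficient of variation stays constant. The Gaussian noise model, the function names and the Weber fraction of 0.15 are assumptions chosen for illustration only.

```python
import random
import statistics

def noisy_intervals(true_interval, weber_fraction=0.15, n=5000, seed=0):
    """Sample perceived durations whose s.d. scales with the true interval."""
    rng = random.Random(seed)
    sd = weber_fraction * true_interval   # scalar property: sd proportional to T
    return [rng.gauss(true_interval, sd) for _ in range(n)]

def coefficient_of_variation(samples):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)

if __name__ == "__main__":
    for t in (2.0, 10.0, 50.0):
        cv = coefficient_of_variation(noisy_intervals(t))
        print("interval", t, "coefficient of variation", round(cv, 3))
```

Samples like these could serve as jittered reward times when testing how a TD model copes with timing noise.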

2) Uncertainty and learning: competition and collaboration between different predictors

Frequently a relevant event (say, a reward) is predicted by more than one stimulus, but these stimuli may be differentially reliable. If your broker told you that he thinks the stock of company X is likely to go up, but your friend working for company X told you that the situation at work is dire -- would you buy or sell? In day-to-day learning scenarios animals and humans are very often faced with different predictive cues (such as from different modalities) which they must integrate in order to make informed decisions. In this project we will look at competition between predictive cues based on normative (Bayesian) methods, and their behavioral predictions.
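One normative way to treat competition between cues is a Kalman filter over association weights, which tracks both a prediction and its uncertainty for each stimulus. The sketch below is a generic textbook Kalman update, not the specific model of any target paper. The names kalman_step and run_schedule and the drift and noise variances are illustrative assumptions. It is run on a blocking schedule (stimulus A alone, then A and B together) to show that B acquires little weight once A already predicts the reward.

```python
def kalman_step(w, sigma, x, r, drift_var=0.001, noise_var=0.25):
    """One Kalman filter update of weights w (mean) and sigma (covariance).

    Assumed generative model: weights follow a random walk with variance
    drift_var, and reward r = x . w + Gaussian noise with variance noise_var.
    """
    n = len(w)
    # Prediction step: weights may have drifted, so uncertainty grows.
    sigma = [[sigma[i][j] + (drift_var if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    sx = [sum(sigma[i][j] * x[j] for j in range(n)) for i in range(n)]  # Sigma x
    s = sum(x[i] * sx[i] for i in range(n)) + noise_var   # predicted reward variance
    gain = [sx[i] / s for i in range(n)]                   # Kalman gain
    delta = r - sum(w[i] * x[i] for i in range(n))         # prediction error
    w = [w[i] + gain[i] * delta for i in range(n)]
    # Correction step: Sigma <- Sigma - K (Sigma x)^T
    sigma = [[sigma[i][j] - gain[i] * sx[j] for j in range(n)] for i in range(n)]
    return w, sigma, delta

def run_schedule(trials, prior_var=1.0):
    """Run a list of (stimulus_vector, reward) trials; return final weights."""
    n = len(trials[0][0])
    w = [0.0] * n
    sigma = [[prior_var if i == j else 0.0 for j in range(n)] for i in range(n)]
    for x, r in trials:
        w, sigma, _ = kalman_step(w, sigma, x, r)
    return w

if __name__ == "__main__":
    blocking = [([1.0, 0.0], 1.0)] * 50 + [([1.0, 1.0], 1.0)] * 50
    control = [([1.0, 1.0], 1.0)] * 50
    print("blocking (A, B):", [round(v, 3) for v in run_schedule(blocking)])
    print("control  (A, B):", [round(v, 3) for v in run_schedule(control)])
```

The per-stimulus gain plays the role of a learning rate that depends on uncertainty, which is what lets a more reliable cue dominate a less reliable one in this framework.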

Preliminary suggested directions:

a. (basic) Explore a Kalman filter model for learning based on uncertain stimuli.

Target paper and resources:
Dayan P., Kakade S. and Montague P.R. (2000) -- Learning and selective attention (pdf)
Ghahramani Z. -- Lecture slides on the Kalman filter and pseudocode (pdf)

b. (more advanced) Study the Kalman filter model and explaining away in fairly bizarre (but very interesting) paradigms in classical conditioning.

Target paper:
Dayan P. and Kakade S. (2000) -- Explaining away in weight space (pdf)

c. (Open ended, related also to project 1) Study the effects of two sequential stimuli which predict the reward in light of different reliabilities and differences in timing noise.

Target paper:
Kakade S. and Dayan P. (2002) -- Acquisition and extinction in autoshaping (pdf)

d. (advanced) Uncertainty-based competition between model-based, goal-directed behavior and model-free, habitual behavior -- replicate results from the target paper, and/or compare the results of value learning using a distribution over possible models to those of sampling models and using certainty-equivalent value estimation.

Target paper:
Daw N.D., Niv Y. and Dayan P. (submitted) -- Uncertainty-based competition between prefrontal and striatal systems for behavioral control (pdf)

3) Exploration vs. exploitation in two-armed bandit situations

The basic characteristic of bandit situations is that of performance in an uncertain scenario, in which our choices determine not only the rewards we harvest, but also the information we gather. Imagine a situation in which two different slot machines (or food sources) have two different (unknown) probabilities of reward, and you can play these (choose which patch to forage on) repeatedly and sequentially. How should you choose in order to maximize the obtained rewards? In this project we will examine the different approximations and exact solutions to this famous exploration vs. exploitation problem, and compare them in terms of the harvested rewards, the complexity of computation, and their resulting behavioral strategies (such as different degrees of risk aversion).
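As a baseline for the comparison, the sketch below simulates the two simplest heuristics, epsilon-greedy and softmax action selection, on a two-armed Bernoulli bandit with running-average value estimates. The function name run_bandit and all parameter values (reward probabilities, learning rate, epsilon, inverse temperature beta) are assumptions for illustration. Exploration bonuses and Gittins indices are not implemented here.

```python
import math
import random

def run_bandit(policy, p_arms=(0.3, 0.7), n_steps=5000, alpha=0.1,
               epsilon=0.1, beta=5.0, seed=0):
    """Play a two-armed Bernoulli bandit; return (mean reward, value estimates)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]            # estimated value of each arm
    total = 0.0
    for _ in range(n_steps):
        if policy == "epsilon_greedy":
            # Explore at random with probability epsilon (and on ties),
            # otherwise pick the arm with the higher estimate.
            if rng.random() < epsilon or q[0] == q[1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[0] > q[1] else 1
        elif policy == "softmax":
            # Two-arm softmax reduces to a logistic function of the value difference.
            p0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
            a = 0 if rng.random() < p0 else 1
        else:
            raise ValueError("unknown policy: " + policy)
        reward = 1.0 if rng.random() < p_arms[a] else 0.0
        q[a] += alpha * (reward - q[a])   # running-average value update
        total += reward
    return total / n_steps, q

if __name__ == "__main__":
    for name in ("epsilon_greedy", "softmax"):
        mean_reward, values = run_bandit(name)
        print(name, "mean reward:", round(mean_reward, 3))
```

Because the value estimates use a constant learning rate, the same code can be reused for a changing environment by making p_arms vary over time.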

Preliminary suggested directions:

a. Strategies for stationary environments -- comparison of simple heuristics such as epsilon-greedy or softmax policies to more sophisticated policies which use exploration bonuses, and finally to the exact solution using Gittins indices.

Target papers:
Duff M.O. (2002) -- Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes (chapters 2+3 of thesis)
Kakade S. and Dayan P. (2002) -- Dopamine: generalization and bonuses (pdf)

b. (advanced) Generalization to MDPs -- smart approximations in cases where the exact solution is not available.

Target papers:
Dayan P. and Sejnowski T.J. (1996) -- Exploration bonuses and dual control (pdf)
Dearden R., Friedman N. and Russell S. (1998) -- Bayesian Q-learning (pdf)

c. (more basic and open ended) Dealing with a changing environment -- the effects of online learning and decision strategies on risk aversion and probability matching.

Target paper:
Niv Y., Joel D., Meilijson I. and Ruppin E. (2002) -- Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors (pdf)

4) Modeling free operant behavior rates using a hidden Markov model

An important but still largely open question in experimental psychology is whether animals show graded behavior (a "response curve") or merely all-or-none behavior that seems graded only when averaged over trials and/or subjects. This seemingly innocuous question has important implications for theories of acquisition, as well as for timing and performance on free operant schedules of reinforcement.

In this project we will use advanced RL techniques not to model a process that corresponds to some hypothesized neural computation, but to directly attempt to understand the structure of animal behavior using a Markov framework. By modeling instantaneous response rates in free operant behavior using a hidden Markov model, we will try to distinguish data in which behavior was generated strictly by a low and a high rate from data in which behavior was generated by a gradually ramping rate. We will explore the sensitivity of this novel approach to the underlying rates, and its power to answer the above question conclusively. After studying synthetic (but realistic) data sets, we will apply our model to real data from an instrumental conditioning experiment, and finally resolve the mystery...
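A first step toward such an analysis might look like the sketch below: response counts per time bin are treated as Poisson emissions from a two-state (low rate, high rate) hidden Markov model, and the most likely state sequence is recovered with the Viterbi algorithm. The rates, the probability of staying in a state, and the function names are assumptions for illustration. In a real analysis these parameters would have to be fitted (for example with EM), and the two-state fit compared against a ramping-rate alternative.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def viterbi(counts, rates=(1.0, 8.0), p_stay=0.95):
    """Most likely state path for counts under a Poisson-emission HMM."""
    n = len(rates)

    def log_emit(k, lam):
        # Log of the Poisson probability of observing count k at rate lam.
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    log_trans = [[math.log(p_stay) if i == j else math.log((1.0 - p_stay) / (n - 1))
                  for j in range(n)] for i in range(n)]
    # Uniform prior over the initial state.
    score = [math.log(1.0 / n) + log_emit(counts[0], lam) for lam in rates]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best] + log_trans[best][j] + log_emit(k, rates[j]))
            pointers.append(best)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic abrupt-switch data: 50 bins at the low rate, then 50 at the high rate.
    counts = [sample_poisson(rng, 1.0) for _ in range(50)]
    counts += [sample_poisson(rng, 8.0) for _ in range(50)]
    print("decoded states:", "".join(str(s) for s in viterbi(counts)))
```

Running the same decoder on counts generated from a slowly ramping rate, and comparing how well a two-state model accounts for each data set, is one way to probe the sensitivity of the approach.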

Related papers/resources:
Gallistel C.R., Fairhurst S. and Balsam P. (2004) -- The learning curve: Implications of a quantitative analysis (pdf)
Ghahramani Z. (2004) -- Lecture slides on Hidden Markov Models and pseudocode (pdf)
Roweis S. (1999) -- Tutorial on Hidden Markov Models and pseudocode (pdf)

 

 

Copyright © Okinawa Computational Neuroscience Course 2005. All Rights Reserved.