
POMDP

An implementation of a reinforcement learning algorithm based on a partially observable Markov decision process (POMDP).

The task
The model
Running the code
References

The task

The agent is presented with a two-alternative forced-choice task. On each trial, the agent is shown a stimulus and chooses an action in response. Stimulus values range from -0.5 to 0.5. When the stimulus value is less than 0, the correct choice is left; when it is greater than 0, the correct choice is right. When the stimulus value is 0, the correct side is assigned at random to left or right for that trial, and the agent is rewarded accordingly.
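The decision rule above can be written as a small function. This is a minimal Python sketch for illustration only, not the repository's MATLAB code; the name `correct_side` and the string labels are assumptions.

```python
import random


def correct_side(stimulus, rng=random):
    """Return the rewarded side ('left' or 'right') for one stimulus value.

    `stimulus` is expected to lie in [-0.5, 0.5]. The function name and
    the string labels are illustrative, not taken from the repository.
    """
    if stimulus < 0:
        return "left"
    if stimulus > 0:
        return "right"
    # A zero stimulus carries no side information, so the correct side
    # is assigned at random for this trial.
    return rng.choice(["left", "right"])
```

Passing a seeded `random.Random` instance as `rng` makes the zero-stimulus trials reproducible.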

Block structure

The agent is rewarded asymmetrically. On some trials, the agent receives an additional reward for a correct left action. On the remaining trials, it receives an additional reward for a correct right action. The trials are presented to the agent in blocks, and the current block determines which side earns the additional reward.

Figure: Reward structure
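The asymmetric reward can be sketched as follows. This is a hedged Python illustration rather than the repository's MATLAB code. The function name `trial_reward`, the base reward of 1 for a correct choice, the reward of 0 for an incorrect choice, and the default additional reward of 0.5 are all assumptions; the text above only states that a correct action on the block's side earns an additional reward.

```python
def trial_reward(action, correct, block, extra_reward=0.5, base_reward=1.0):
    """Reward for a single trial under the block structure.

    action, correct: 'left' or 'right'.
    block: 'left', 'right' or 'none', the side that currently earns
           the additional reward.
    base_reward and the default extra_reward are assumed values.
    """
    if action != correct:
        return 0.0  # assumption: incorrect choices earn nothing
    if block == action:
        # Correct choice on the side favoured by the current block.
        return base_reward + extra_reward
    return base_reward
```

With these assumed values, a correct left choice earns 1.5 in a 'left' block and 1.0 in a 'right' or 'none' block.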

Task parameters

The code lets the user set some of the task parameters, for instance:

the number of trials
the number of reward blocks
the reward block options ('right', 'left' or 'none'; including 'none' is optional)
stimulus values
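One way these parameters could be turned into a trial sequence is sketched below in Python. It is not the repository's MATLAB code. The function name `make_task`, the contiguous blocks of near-equal length, and the independent random draw of each block's type are assumptions made for illustration.

```python
import random


def make_task(n_trials, n_blocks, block_options, stimulus_values, seed=None):
    """Generate per-trial stimuli and reward-block labels.

    Assumptions (not confirmed by the source): blocks are contiguous and
    of near-equal length, and each block's type is drawn independently
    and uniformly from block_options.
    """
    rng = random.Random(seed)
    # One stimulus per trial, drawn uniformly from the allowed values.
    stimuli = [rng.choice(stimulus_values) for _ in range(n_trials)]
    # One reward type per block.
    block_types = [rng.choice(block_options) for _ in range(n_blocks)]
    # Map each trial index to the block it falls in.
    blocks = [block_types[t * n_blocks // n_trials] for t in range(n_trials)]
    return stimuli, blocks


stimuli, blocks = make_task(
    n_trials=100,
    n_blocks=4,
    block_options=["left", "right"],
    stimulus_values=[-0.5, -0.25, 0.0, 0.25, 0.5],
    seed=0,
)
```

Because the block types are drawn independently in this sketch, two neighbouring blocks can share the same type; alternating the types would be an equally reasonable choice.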

The model

The model implements a POMDP with Q-values. A Q-value quantifies the value the agent assigns to choosing a particular action: the higher the Q-value, the more the agent currently values that action. Q-values are updated on every trial based on the reward received.

At the beginning of each trial, the agent receives a stimulus, s. The larger the absolute value of the stimulus, the clearer the stimulus appears to the agent.
In order to model the agent having an imperfect perception of the stimulus, noise is added to the stimulus value. The perceived stimulus value is sampled from a normal distribution with mean s (the stimulus value) and standard deviation, sigma. The value of sigma is a parameter of the model.
Using its perceived, noisy value of the stimulus, the agent then forms a belief about which side the stimulus is on. The probability that the stimulus is on the left is the cumulative probability at zero of a normally distributed random variable with mean equal to the perceived stimulus value and standard deviation sigma (as above). The probability that it is on the right is one minus this value.
The agent then combines its belief as to the current side of the stimulus with its stored Q-values.
The agent chooses either a left or right action, and receives the appropriate reward. This reward depends firstly on whether the agent has chosen the correct side, and secondly on the current reward block. The current reward block dictates whether the agent receives an additional reward for a correct action. The value of this additional reward is a second parameter of the model.
The agent calculates the error in its prediction. This is equivalent to the reward minus the Q-value of the action taken.
The prediction error, the agent's belief and the agent's learning rate (a third parameter of the model) are then used to update the Q-values for the next iteration.
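The steps above can be sketched as a single-trial function. This Python sketch is an illustration under stated assumptions, not the repository's MATLAB code. Assumed here: Q-values are stored per (believed side, action) pair; the belief is combined with the Q-values as a belief-weighted sum; the agent picks the higher-valued action, breaking ties at random; a correct choice earns a base reward of 1; and each Q-value is updated in proportion to the belief in its side. The text does not specify the combination rule, the choice rule or the base reward.

```python
import random
from statistics import NormalDist

SIDES = ("left", "right")


def run_trial(q, stimulus, correct, block, sigma, extra_reward, alpha, rng):
    """Run one trial and update q in place.

    q[side][action] holds the stored Q-values (an assumed layout).
    Returns (action, reward, prediction_error).
    """
    # Noisy perception: sample from a normal centred on the true stimulus.
    percept = rng.gauss(stimulus, sigma)

    # Belief: cumulative probability at zero of N(percept, sigma).
    p_left = NormalDist(percept, sigma).cdf(0.0)
    belief = {"left": p_left, "right": 1.0 - p_left}

    # Combine belief with stored Q-values (assumed: belief-weighted sum).
    value = {a: sum(belief[s] * q[s][a] for s in SIDES) for a in SIDES}

    # Choose an action (assumed: greedy, ties broken at random).
    if value["left"] == value["right"]:
        action = rng.choice(SIDES)
    else:
        action = max(SIDES, key=value.get)

    # Reward: assumed base reward of 1 for a correct choice, plus the
    # additional reward when the choice matches the current block.
    reward = 0.0
    if action == correct:
        reward = 1.0 + (extra_reward if block == action else 0.0)

    # Prediction error: reward minus the value of the action taken.
    delta = reward - value[action]

    # Update, scaled by the learning rate and the belief in each side.
    for s in SIDES:
        q[s][action] += alpha * delta * belief[s]
    return action, reward, delta


q = {s: {a: 0.0 for a in SIDES} for s in SIDES}
rng = random.Random(0)
action, reward, delta = run_trial(
    q, stimulus=0.5, correct="right", block="right",
    sigma=0.1, extra_reward=0.5, alpha=0.5, rng=rng,
)
```

Calling `run_trial` in a loop over the trial sequence, reusing the same `q`, gives the trial-by-trial learning described above. Replacing the greedy choice with a softmax is a common alternative.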

Figure: POMDP model flowchart

Model parameters

sigma, the standard deviation of the noise added to the agent's perception of the stimulus, which is also the standard deviation of the agent's belief distribution.
the value of the additional reward.
the learning rate.
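To give a feel for the role of sigma, the short Python sketch below computes the agent's belief that the stimulus is on the right for a fixed perceived value and several values of sigma. It is an illustration, not the repository's MATLAB code; the function name `belief_right` and the example values are assumptions.

```python
from statistics import NormalDist


def belief_right(percept, sigma):
    """Belief that the true stimulus is greater than 0.

    Computed as one minus the cumulative probability at zero of a
    normal distribution with mean `percept` and standard deviation sigma.
    """
    return 1.0 - NormalDist(percept, sigma).cdf(0.0)


# The same percept becomes less convincing as sigma grows.
for sigma in (0.1, 0.2, 0.4):
    print(sigma, round(belief_right(0.1, sigma), 3))
```

For a perceived value of 0.1, the belief falls from roughly 0.84 at sigma = 0.1 to roughly 0.69 at sigma = 0.2 and 0.60 at sigma = 0.4, so a larger sigma pushes the agent's belief towards chance.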

Running the code

The file 'Main.m' runs the model. The code runs as is and plots the results.

The first two sections of the file allow the user to alter the task parameters and the model parameters. The third section generates random stimulus values and reward blocks to be fed to the agent. The fourth section implements the POMDP with the function 'RunPOMDP'. The final section plots the results.

References

The ideas used to build the model implemented here are largely drawn from

Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto
Planning and acting in partially observable stochastic domains by Kaelbling et al. (1998)

Terminology and the majority of the notation are also taken from these sources.

The task implemented is based upon

Midbrain Dopamine Neurons Signal Belief in Choice Accuracy during a Perceptual Decision by Lak et al. (2017)
High-yield methods for accurate two-alternative visual psychophysics in head-fixed mice by Burgess et al. (2017)