Policy Gradient Algorithm

This post came out of a discussion with a friend about policy gradients. To begin with, it is worth walking through the derivation of the basic policy gradient equation.

Let \(\tau\) be a trajectory and \(R(\tau) := \sum_{t=0}^{H} R(s_t, u_t)\) be the total reward obtained along that trajectory. One should note that the distribution of trajectories depends on the policy \(\pi_{\theta}\) that one adopts. The next step is to compute the expected reward and maximize it.

Let us define it to be equal to the following: \(U({\theta}) := \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\)

Now, one needs to take the gradient and find the set of parameters \(\theta\) that maximizes the utility \(U\).

Taking the gradient on both sides, we get:

\[\begin{align} \nabla_{\theta} U(\theta) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\\ &= \sum_{\tau} \nabla_{\theta} P(\tau; \theta)R(\tau)\\ &= \sum_{\tau} \frac{P(\tau; \theta)}{P(\tau; \theta)} \nabla_{\theta} P(\tau; \theta)R(\tau)\\ &= \sum_{\tau} P(\tau; \theta) \frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)} R(\tau)\\ &= \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[\frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)} R(\tau)\right]\\ &= \mathbb{E}_{\tau \sim P(\tau;\theta)}[\nabla_{\theta} \log P(\tau; \theta)R(\tau)] \end{align}\]
  1. By definition
  2. Writing the expectation as a sum over trajectories; \(\theta\) is the same for all trajectories, and the gradient is a linear operator, so it moves inside the sum
  3. Multiplying and dividing by the same expression
  4. Rearranging
  5. Definition of expectation
  6. Using the log-derivative identity \(\nabla_{\theta} \log P(\tau; \theta) = \frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)}\)

So, to get a gradient estimate, do a set of “rollouts” by sampling trajectories from the distribution \(P(\tau; \theta)\) and average over them. The Monte Carlo approximation to (6) is given below.

\[\nabla_{\theta} U(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} \nabla_{\theta} \log P(\tau^{i}; \theta)R(\tau^{i})\]

The intuition is explicit in the expression: each term points in the direction that increases the (log) probability of the sampled trajectory, and its size is scaled by the reward that trajectory collected. If the reward is negative, the update pushes the probability of that trajectory down instead.
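As a toy illustration of this averaging (the numbers are made up, and the per-trajectory gradients \(\nabla_{\theta} \log P(\tau^{i}; \theta)\) are assumed to have been computed already), a minimal JAX sketch might look like this:

```python
import jax.numpy as jnp

# Made-up example: m = 3 sampled trajectories, a policy with 2 parameters.
# Each row is the (already computed) gradient of log P(tau^i; theta).
grad_log_probs = jnp.array([[ 0.2, -0.1],
                            [-0.3,  0.4],
                            [ 0.1,  0.0]])
returns = jnp.array([1.0, -0.5, 2.0])  # R(tau^i) for each trajectory

# Monte Carlo policy-gradient estimate: average of the reward-weighted gradients.
grad_estimate = jnp.mean(grad_log_probs * returns[:, None], axis=0)
print(grad_estimate)  # the direction in which to nudge theta
```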

Now we bring in the dynamics of the system, where things get interesting; this is also what lets us compute \(\nabla_{\theta} \log P(\tau ^{i}; \theta)\). Remember that \(\tau ^ {i}\) is composed of several states and actions, so expanding it results in the following:

\[\nabla_{\theta} \log P(\tau ^{i}; \theta) = \nabla_{\theta} \log \prod_{t=0}^{H} P(s_{t+1} \mid s_{t}, u_{t})P(u_t \mid s_t, \theta)\]

The last term is what a neural network can deliver to us (remember, this is all for one trajectory, and I am omitting the superscript \(i\) on \(s\) and \(u\)).

\[\begin{align} \nabla_{\theta} \log P(\tau ^{i}; \theta) &= \nabla_{\theta} \log \prod_{t=0}^{H} P(s_{t+1} \mid s_{t}, u_{t}) \pi_{\theta}(u_t \mid s_t)\\ &= \sum_{t=0}^{H} \nabla_{\theta}\log P(s_{t+1} \mid s_{t}, u_{t}) + \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t)\\ &= \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t) \end{align}\]

The first sum vanishes because the transition probabilities \(P(s_{t+1} \mid s_t, u_t)\) do not depend on \(\theta\). This implies we can ignore the dynamics model entirely and focus on the gradient of the policy's log-probability, which the neural network's autograd provides.
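Plugging this back into the Monte Carlo estimate above, the estimator involves only the policy:

\[\nabla_{\theta} U(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} \left(\sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t^{i} \mid s_t^{i})\right) R(\tau^{i})\]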

While it is trivial in the discrete case to output the probability of each action given the state, in the continuous case the neural net has to output the parameters of a distribution (for example, a mean and variance), which in turn give the probability of different actions.

What needs to be noted is that \(u_t\) is the action at time \(t\), and we need the probability of that particular action given the state and the parameters \(\theta\). So, we essentially need a probability distribution as the output of the neural net, either through its parameters (mean, variance, etc.) in the continuous case, or directly as a vector of action probabilities in the discrete case.
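For a Gaussian policy, for example, the log-probability of an action \(u\) given a predicted mean \(\mu(s)\) and standard deviation \(\sigma(s)\) is just the Gaussian log-density evaluated at \(u\):

\[\log \pi_{\theta}(u \mid s) = -\frac{(u - \mu(s))^2}{2\sigma(s)^2} - \log \sigma(s) - \frac{1}{2}\log 2\pi\]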

What I mean is that, from an implementation perspective, it is just the following:

```python
def policy_gradient_loss(params, state, action, reward, key):
    # `action` comes from sampling the policy's distribution; `key` is unused here.
    mean, std = policy_network(params, state)        # comes from the neural net
    log_prob = log_prob_gaussian(mean, std, action)  # a function you define based on mean and std
    return -log_prob * reward                        # negative for gradient descent
```
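
To see how this plugs into `jax.grad`, here is a hedged, self-contained sketch: the linear `policy_network` and the concrete numbers are made up purely for illustration, `log_prob_gaussian` implements the Gaussian log-density written out earlier, and `policy_gradient_loss` is the function from the snippet above.

```python
import jax
import jax.numpy as jnp

def log_prob_gaussian(mean, std, action):
    # Log-density of `action` under a diagonal Gaussian, summed over dimensions.
    var = std ** 2
    return jnp.sum(-0.5 * ((action - mean) ** 2 / var + jnp.log(2.0 * jnp.pi * var)))

def policy_network(params, state):
    # Stand-in "network": a linear map to the action mean, plus a learned log-std.
    mean = params["w"] @ state + params["b"]
    std = jnp.exp(params["log_std"])
    return mean, std

params = {"w": jnp.zeros((1, 3)), "b": jnp.zeros(1), "log_std": jnp.zeros(1)}
state = jnp.array([0.1, -0.2, 0.3])
action = jnp.array([0.5])  # in practice, sampled from the policy at this state
reward = 1.0               # in practice, the return collected for this rollout

# Autograd gives the gradient of the surrogate loss with respect to `params`;
# averaging such gradients over many rollouts is the estimator derived above.
grads = jax.grad(policy_gradient_loss)(params, state, action, reward, None)
```

A gradient-descent step using `grads` then moves \(\theta\) in the direction that increases the reward-weighted log-probability, i.e. an ascent step on the estimate of \(U(\theta)\).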

I hope it helps.
