Policy Gradient Algorithm
This came from a discussion with a friend about policy gradients. To begin with, it is good to consider the derivation of the basic policy gradient equation.
Let \(\tau\) be a trajectory and \(R(\tau) := \sum_{t=0}^{H} R(s_t, u_t)\) be the total reward obtained along that trajectory while acting under the policy \(\pi_{\theta}\). One should note that the trajectories we see, and hence the rewards we collect, depend on the policy \(\pi_{\theta}\) that one adopts. The next step is to compute the expected reward and maximize it.
Let us define it as \(U(\theta) := \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\).
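As a tiny illustrative sketch (the rollout data here is hypothetical and not part of the post), \(U(\theta)\) can be estimated by averaging the total reward over trajectories sampled while running \(\pi_\theta\):

```python
import jax.numpy as jnp

def estimate_utility(rollout_rewards):
    # `rollout_rewards[i]` holds the per-step rewards R(s_t, u_t) of the
    # i-th trajectory sampled under the current policy pi_theta.
    returns = jnp.array([jnp.sum(r) for r in rollout_rewards])  # R(tau_i) for each trajectory
    return jnp.mean(returns)                                    # Monte Carlo estimate of U(theta)
```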
Now, one needs to take the gradient and find the set of parameters \(\theta\) that maximizes the utility \(U\).
Taking the gradient on both sides, we get
\[\begin{align} \nabla_{\theta} U(\theta) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\\ &= \sum_{\tau} \nabla_{\theta} P(\tau; \theta)\,R(\tau)\\ &= \sum_{\tau} \frac{P(\tau; \theta)}{P(\tau; \theta)} \nabla_{\theta} P(\tau; \theta)\,R(\tau)\\ &= \sum_{\tau} P(\tau; \theta)\, \frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)}\, R(\tau)\\ &= \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[\frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)}\, R(\tau)\right]\\ &= \mathbb{E}_{\tau \sim P(\tau;\theta)}[\nabla_{\theta} \log P(\tau; \theta)\,R(\tau)] \end{align}\]
- By the definition of \(U(\theta)\)
- Writing the expectation as a sum over trajectories and moving the gradient inside the sum (\(\theta\) is the same for all trajectories, and the gradient is a linear operator)
- Multiplying and dividing by the same expression \(P(\tau; \theta)\)
- Rearranging
- By the definition of expectation
- Since \(\nabla_{\theta} \log P(\tau; \theta) = \nabla_{\theta} P(\tau; \theta)/P(\tau; \theta)\)
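As a quick sanity check of that identity (not part of the original derivation), here is a minimal JAX sketch on a toy Bernoulli "policy" with two outcomes: the score-function estimate \(\mathbb{E}[\nabla_\theta \log P(x;\theta)\,R(x)]\) matches the exact gradient of the expected reward.

```python
import jax
import jax.numpy as jnp

# Toy check of the log-derivative trick: two outcomes with rewards r1, r0
# and P(x = 1) = p, so U(p) = p*r1 + (1-p)*r0 and dU/dp = r1 - r0.
r1, r0 = 3.0, 1.0

def expected_reward(p):
    return p * r1 + (1.0 - p) * r0

def log_prob(p, x):
    # log P(x; p) for a Bernoulli outcome x in {0, 1}
    return x * jnp.log(p) + (1.0 - x) * jnp.log(1.0 - p)

def score_function_estimate(p, samples):
    # E[ grad_p log P(x; p) * R(x) ], estimated from samples of x
    grads = jax.vmap(jax.grad(log_prob), in_axes=(None, 0))(p, samples)
    rewards = jnp.where(samples == 1.0, r1, r0)
    return jnp.mean(grads * rewards)

key = jax.random.PRNGKey(0)
p = 0.3
samples = jax.random.bernoulli(key, p, shape=(100_000,)).astype(jnp.float32)
print(jax.grad(expected_reward)(p))         # exact gradient: r1 - r0 = 2.0
print(score_function_estimate(p, samples))  # Monte Carlo estimate, close to 2.0
```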
So, to get the gradient estimate, do a set of “rollouts” by sampling trajectories from \(P(\tau; \theta)\), and average over them. The sample approximation of the final expectation above is
\[\nabla_{\theta} U(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} \nabla_{\theta} \log P(\tau^{i}; \theta)\,R(\tau^{i})\]
The intuition is explicit in the expression: each term points in the direction in parameter space that increases the probability of trajectory \(\tau^{i}\), scaled in magnitude by the reward that trajectory earned. If the reward is negative, the update pushes in the opposite direction, making that trajectory less likely.
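As a sketch of how that average might look in code (assuming a hypothetical `grad_log_prob_trajectory` helper and treating `theta` as a flat parameter array; neither appears in the post, and the next section shows the helper reduces to a policy-only term):

```python
import jax.numpy as jnp

def policy_gradient_estimate(theta, trajectories, rewards, grad_log_prob_trajectory):
    # (1/m) * sum_i grad_theta log P(tau_i; theta) * R(tau_i)
    # `grad_log_prob_trajectory(theta, tau)` stands in for whatever computes
    # grad_theta log P(tau; theta); `rewards[i]` is the total reward R(tau_i).
    grads = [grad_log_prob_trajectory(theta, tau) * R
             for tau, R in zip(trajectories, rewards)]
    return jnp.mean(jnp.stack(grads), axis=0)
```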
Now we bring in the dynamics of the system; this is where things get interesting, and it also lets us compute \(\nabla_{\theta} \log P(\tau^{i}; \theta)\). Remember that \(\tau^{i}\) is composed of a sequence of states and actions, so expanding the trajectory probability gives the following:
\[\nabla_{\theta} \log P(\tau^{i}; \theta) = \nabla_{\theta} \log \prod_{t=0}^{H} P(s_{t+1} \mid s_{t}, u_{t})\,\pi_{\theta}(u_t \mid s_t)\]
The last factor, \(\pi_{\theta}(u_t \mid s_t)\), is what a neural network can deliver to us (remember, this is all for one trajectory, and I am omitting the superscript \(i\) on \(s\) and \(u\)).
\[\begin{align} \nabla_{\theta} \log P(\tau^{i}; \theta) &= \nabla_{\theta} \log \prod_{t=0}^{H} P(s_{t+1} \mid s_{t}, u_{t})\,\pi_{\theta}(u_t \mid s_t)\\ &= \sum_{t=0}^{H} \nabla_{\theta}\log P(s_{t+1} \mid s_{t}, u_{t}) + \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t)\\ &= \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t) \end{align}\]
The dynamics term \(P(s_{t+1} \mid s_{t}, u_{t})\) does not depend on \(\theta\), so its gradient vanishes. This implies we can ignore the dynamics model entirely and focus on the gradient of the policy's log-probability, which the neural network's autograd provides.
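To make the reduction concrete, here is a minimal sketch (the names `log_pi`, `states`, and `actions` are placeholders, not from the post): the gradient of the trajectory log-probability is obtained purely from the policy's per-step log-probabilities via autograd.

```python
import jax
import jax.numpy as jnp

def make_trajectory_score(log_pi):
    # `log_pi(theta, state, action)` is a stand-in for the policy's
    # log-probability, e.g. the log of a softmax or Gaussian head.
    def trajectory_log_prob(theta, states, actions):
        # sum_t log pi_theta(u_t | s_t); the dynamics terms carry no
        # theta dependence, so they contribute nothing to the gradient.
        step_log_probs = jax.vmap(log_pi, in_axes=(None, 0, 0))(theta, states, actions)
        return jnp.sum(step_log_probs)
    # grad_theta log P(tau; theta) needs only the policy log-probabilities
    return jax.grad(trajectory_log_prob)
```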
In the discrete case it is straightforward for the network to output the probability of each action given the state; in the continuous case, the network instead has to output a mean and variance (parameters of a distribution, which in turn give the probability of different actions).
What needs to be noted is that \(u_t\) is the action actually taken at time \(t\), and we need the probability of that particular action given the state and the parameters \(\theta\). So we essentially need a probability distribution over actions as the output of the neural net, either via distribution parameters (mean, variance, etc.) in the continuous case or as the action probabilities directly in the discrete case. Both cases are sketched below.
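As a rough sketch of those two cases (these helper functions are illustrative, not from the post; the continuous case assumes a diagonal Gaussian):

```python
import jax
import jax.numpy as jnp

# Discrete actions: the net outputs one logit per action; log-softmax
# turns them into log-probabilities and we pick out the action taken.
def log_prob_discrete(logits, action):
    return jax.nn.log_softmax(logits)[action]

# Continuous actions: the net outputs distribution parameters (here a
# diagonal Gaussian mean and std); the density at the action taken
# gives log pi_theta(u_t | s_t).
def log_prob_gaussian(mean, std, action):
    z = (action - mean) / std
    return jnp.sum(-0.5 * z**2 - jnp.log(std) - 0.5 * jnp.log(2.0 * jnp.pi))
```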
What I mean is that, from an implementation perspective, the per-step loss is just the following:
```python
def policy_gradient_loss(params, state, action, reward, key):
    # `action` comes from sampling the policy distribution during the rollout
    # (which is what `key` would be used for; it is unused here).
    mean, std = policy_network(params, state)        # distribution parameters from the neural net
    log_prob = log_prob_gaussian(mean, std, action)  # log pi_theta(u_t | s_t), from mean and std
    return -log_prob * reward                        # negative because optimizers do gradient descent
```
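A possible way to wire this up end to end with JAX autograd is sketched below; `policy_network`, `rollouts`, `learning_rate`, and `params` are assumed to exist and are not defined in the post, and `log_prob_gaussian` is the Gaussian helper sketched earlier.

```python
import jax
import jax.numpy as jnp

# Per-trajectory loss: -(sum_t log pi_theta(u_t | s_t)) * R(tau).
def trajectory_loss(params, states, actions, total_reward):
    means, stds = jax.vmap(lambda s: policy_network(params, s))(states)
    log_probs = jax.vmap(log_prob_gaussian)(means, stds, actions)
    return -jnp.sum(log_probs) * total_reward

# One update: average the gradient over the sampled trajectories, then take
# a small step (gradient ascent on U, i.e. descent on the loss).
grad_fn = jax.grad(trajectory_loss)
grads = [grad_fn(params, s, a, R) for s, a, R in rollouts]
mean_grad = jax.tree_util.tree_map(lambda *g: sum(g) / len(g), *grads)
params = jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, mean_grad)
```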
I hope it helps.