Policy Gradient Algorithm

This post came out of a discussion with a friend about policy gradients. To begin with, it is worth walking through the derivation of the basic policy gradient equation.

Let \(\tau\) be a trajectory and \(R(\tau) := \sum_{t=0}^{H} R(s_t, u_t)\) be the total reward obtained along that trajectory. One should note that the distribution of trajectories depends on the policy \(\pi_{\theta}\) that one adopts. The next step is to compute the expected reward and maximize it.

Let us define it to be equal to the following: \(U({\theta}) := \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\)

Now, one needs to take the gradient and find the set of parameters \(\theta\) that maximizes the utility \(U\).

Taking the gradient on both sides, we get:

\[\begin{align} \nabla_{\theta} U(\theta) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\\ &= \sum_{\tau} \nabla_{\theta} P(\tau; \theta)R(\tau)\\ &= \sum_{\tau} \frac{P(\tau; \theta)}{P(\tau; \theta)} \nabla_{\theta} P(\tau; \theta)R(\tau)\\ &= \sum_{\tau} P(\tau; \theta) \frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)} R(\tau)\\ &= \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[\frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)} R(\tau)\right]\\ &= \mathbb{E}_{\tau \sim P(\tau;\theta)}[\nabla_{\theta} \log P(\tau; \theta)R(\tau)] \end{align}\]
  1. By definition
  2. Writing the expectation as a sum over trajectories; \(\theta\) is the same for all trajectories, and the gradient is a linear operator, so it moves inside the sum
  3. Multiplying and dividing by the same expression
  4. Rearranging
  5. Definition of expectation
  6. Using the log-derivative identity \(\nabla_{\theta} \log P(\tau; \theta) = \frac{\nabla_{\theta}P(\tau; \theta)}{P(\tau; \theta)}\)

So, to get a gradient estimate, do a set of “rollouts” by sampling trajectories from the distribution \(P(\tau; \theta)\) and average over them. The Monte Carlo approximation to (6) is given below.

\[\nabla_{\theta} U(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} \nabla_{\theta} \log P(\tau^{i}; \theta)R(\tau^{i})\]

The intuition is explicit in the expression: each term points in the direction that increases the (log) probability of the sampled trajectory, and its size is scaled by the reward that trajectory collected. If the reward is negative, the update pushes the probability of that trajectory down instead.
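As a toy illustration of this averaging (the numbers are made up, and the per-trajectory gradients \(\nabla_{\theta} \log P(\tau^{i}; \theta)\) are assumed to have been computed already), a minimal JAX sketch might look like this:

```python
import jax.numpy as jnp

# Made-up example: m = 3 sampled trajectories, a policy with 2 parameters.
# Each row is the (already computed) gradient of log P(tau^i; theta).
grad_log_probs = jnp.array([[ 0.2, -0.1],
                            [-0.3,  0.4],
                            [ 0.1,  0.0]])
returns = jnp.array([1.0, -0.5, 2.0])  # R(tau^i) for each trajectory

# Monte Carlo policy-gradient estimate: average of the reward-weighted gradients.
grad_estimate = jnp.mean(grad_log_probs * returns[:, None], axis=0)
print(grad_estimate)  # the direction in which to nudge theta
```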

Now we bring in the dynamics of the system, where things get interesting; this is also what lets us compute \(\nabla_{\theta} \log P(\tau ^{i}; \theta)\). Remember that \(\tau ^ {i}\) is composed of several states and actions, so expanding it results in the following:

\[\nabla_{\theta} \log P(\tau ^{i}; \theta) = \nabla_{\theta} \log \prod_{t=0}^{H} P(s_{t+1} \mid s_{t}, u_{t})P(u_t \mid s_t, \theta)\]

The last term is what a neural network can deliver to us (remember, this is all for one trajectory, and I am omitting the superscript \(i\) on \(s\) and \(u\)).

\[\begin{align} \nabla_{\theta} \log P(\tau ^{i}; \theta) &= \nabla_{\theta} \log \prod_{t=0}^{H} P(s_{t+1} \mid s_{t}, u_{t}) \pi_{\theta}(u_t \mid s_t)\\ &= \sum_{t=0}^{H} \nabla_{\theta}\log P(s_{t+1} \mid s_{t}, u_{t}) + \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t)\\ &= \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t) \end{align}\]

The first sum vanishes because the transition probabilities \(P(s_{t+1} \mid s_t, u_t)\) do not depend on \(\theta\). This implies we can ignore the dynamics model entirely and focus on the gradient of the policy's log-probability, which the neural network's autograd provides.
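Plugging this back into the Monte Carlo estimate above, the estimator involves only the policy:

\[\nabla_{\theta} U(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} \left(\sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(u_t^{i} \mid s_t^{i})\right) R(\tau^{i})\]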

While it is trivial in the discrete case to output the probability of each action given the state, in the continuous case the neural net has to output the parameters of a distribution (for example, a mean and variance), which in turn give the probability of different actions.

What needs to be noted is that \(u_t\) is the action at time \(t\), and we need the probability of that particular action given the state and the parameters \(\theta\). So, we essentially need a probability distribution as the output of the neural net, either through its parameters (mean, variance, etc.) in the continuous case, or directly as a vector of action probabilities in the discrete case.
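For a Gaussian policy, for example, the log-probability of an action \(u\) given a predicted mean \(\mu(s)\) and standard deviation \(\sigma(s)\) is just the Gaussian log-density evaluated at \(u\):

\[\log \pi_{\theta}(u \mid s) = -\frac{(u - \mu(s))^2}{2\sigma(s)^2} - \log \sigma(s) - \frac{1}{2}\log 2\pi\]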

What I mean is that, from an implementation perspective, it is just the following:

```python
def policy_gradient_loss(params, state, action, reward, key):
    # `action` comes from sampling the policy's distribution; `key` is unused here.
    mean, std = policy_network(params, state)        # comes from the neural net
    log_prob = log_prob_gaussian(mean, std, action)  # a function you define based on mean and std
    return -log_prob * reward                        # negative for gradient descent
```
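
To see how this plugs into `jax.grad`, here is a hedged, self-contained sketch: the linear `policy_network` and the concrete numbers are made up purely for illustration, `log_prob_gaussian` implements the Gaussian log-density written out earlier, and `policy_gradient_loss` is the function from the snippet above.

```python
import jax
import jax.numpy as jnp

def log_prob_gaussian(mean, std, action):
    # Log-density of `action` under a diagonal Gaussian, summed over dimensions.
    var = std ** 2
    return jnp.sum(-0.5 * ((action - mean) ** 2 / var + jnp.log(2.0 * jnp.pi * var)))

def policy_network(params, state):
    # Stand-in "network": a linear map to the action mean, plus a learned log-std.
    mean = params["w"] @ state + params["b"]
    std = jnp.exp(params["log_std"])
    return mean, std

params = {"w": jnp.zeros((1, 3)), "b": jnp.zeros(1), "log_std": jnp.zeros(1)}
state = jnp.array([0.1, -0.2, 0.3])
action = jnp.array([0.5])  # in practice, sampled from the policy at this state
reward = 1.0               # in practice, the return collected for this rollout

# Autograd gives the gradient of the surrogate loss with respect to `params`;
# averaging such gradients over many rollouts is the estimator derived above.
grads = jax.grad(policy_gradient_loss)(params, state, action, reward, None)
```

A gradient-descent step using `grads` then moves \(\theta\) in the direction that increases the reward-weighted log-probability, i.e. an ascent step on the estimate of \(U(\theta)\).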

I hope it helps.
