Arguably, the goal of reinforcement learning is to find an optimal policy: one that yields higher rewards than any other policy. With policy gradients, we incentivize distributions of actions that generate higher rewards and discourage distributions of actions that generate lower ones. Over time we generate better trajectories, converging toward the optimal policy.

The goal of RL is to optimize the objective function.

$$\definecolor{red}{RGB}{255,59,48}\definecolor{orange}{RGB}{255,149,0}\definecolor{yellow}{RGB}{255,204,0}\definecolor{green}{RGB}{76,217,100}\definecolor{tealblue}{RGB}{90,200,250}\definecolor{blue}{RGB}{0,122,255}\definecolor{purple}{RGB}{88,86,214}\definecolor{pink}{RGB}{255,45,85}$$

$$\pi_{\theta}^\star = \text{arg}\underset{\pi_{\theta}}{\max}\color{orange}E_{\tau\sim p_{\pi_{\theta}}(\tau)}[\sum_{t} r(s_t,a_t)]$$

What we're doing here is taking all the state-action pairs along the trajectory, $(s_t, a_t)$, summing their rewards to get the total reward, and maximizing the expected total reward with respect to $\theta$, the parameters of the policy.
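As a quick sketch, the quantity inside the expectation is just the summed reward along one trajectory, and the expectation itself can be approximated by averaging over sampled trajectories. The reward values below are made up purely for illustration:

```python
import numpy as np

def trajectory_return(rewards):
    """Total reward R(tau) = sum_t r(s_t, a_t) along one trajectory."""
    return sum(rewards)

# Hypothetical per-step rewards from three sampled trajectories
sampled_rewards = [[1.0, 0.5, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]

# Monte Carlo estimate of J(pi_theta): average return over sampled trajectories
J_estimate = np.mean([trajectory_return(r) for r in sampled_rewards])
print(J_estimate)
```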

Let's denote the expectation of the sum $\color{orange}E_{\tau\sim p_{\pi_{\theta}}(\tau)}[\sum_{t} r(s_t,a_t)]$ as $\color{red}J(\pi_{\theta})$.

Now that we have $\color{red}J(\pi_{\theta})$ what do we do? Find the gradient to optimize this expectation via gradient ascent.

Expanding $\color{red}J(\pi_{\theta})$ from expectation form (using the definition of expectation):

$$\color{red}J(\pi_{\theta}) \color{black}=\color{orange} E_{\tau\sim p_{\pi_{\theta}}(\tau)}[\sum_{t} r(s_t,a_t)] \color{black}= \color{green}\int \color{blue}P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

Finding the gradient:

$$\nabla_{\theta}\color{red}J(\pi_{\theta}) \color{black}= \nabla_{\theta}\color{green}\int \color{blue}P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

$$= \color{green}\int \color{black}\nabla_{\theta}\color{blue}P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

Using the log derivative trick:

$$\nabla_{\theta}\color{blue}P(\tau|\pi_{\theta}) \color{black}= \color{blue}P(\tau|\pi_{\theta})\frac{\color{black}\nabla_{\theta}\color{blue}P(\tau|\pi_{\theta})}{P(\tau|\pi_{\theta})} = \color{yellow}P(\tau|\pi_{\theta})\color{black}\nabla_{\theta}\color{yellow}\log P(\tau|\pi_{\theta})$$
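We can sanity-check this identity numerically. Below, a Gaussian density with mean $\theta$ stands in for $P(\tau|\pi_{\theta})$ (the specific density is just an example): the finite-difference gradient of $p$ should match $p \cdot \nabla_{\theta}\log p$.

```python
import numpy as np

# Gaussian density p(x; theta) with mean theta and unit variance
def pdf(x, theta):
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

x, theta, eps = 0.3, 1.2, 1e-6

# Left side: finite-difference gradient of p with respect to theta
grad_p = (pdf(x, theta + eps) - pdf(x, theta - eps)) / (2 * eps)

# Right side: p * grad_theta log p; for this Gaussian, grad log p = (x - theta)
rhs = pdf(x, theta) * (x - theta)

print(abs(grad_p - rhs) < 1e-6)  # the two sides agree numerically
```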

Continuing from our previous step:

$$= \color{green}\int \color{yellow}P(\tau|\pi_{\theta})\color{black}\nabla_{\theta}\color{black}\log P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

Going back to expectation form we get:

$$= \underset{\tau\sim\pi_{\theta}}{E}[\nabla_{\theta}\color{yellow}\log P(\tau|\pi_{\theta})\color{green}R(\tau)\color{black}]$$

But we still don't have $P(\tau|\pi_{\theta})$, which is the probability of the trajectory:

$$P(\tau|\pi_{\theta}) = p(s_{0})\prod_{t=1}^{T} P(s_{t+1}|s_{t},a_{t})\pi_{\theta}(a_{t}|s_{t})$$
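This product is straightforward to compute given the initial-state probability, the transition probabilities, and the policy's action probabilities. All the numbers below are made up, not from any real environment:

```python
import numpy as np

p_s0 = 0.5                          # p(s_0): initial-state probability
transition_probs = [0.9, 0.8, 0.7]  # P(s_{t+1} | s_t, a_t) at each step
policy_probs = [0.6, 0.5, 0.4]      # pi_theta(a_t | s_t) at each step

# P(tau | pi_theta): product of initial-state, transition, and policy terms
P_tau = p_s0 * np.prod(np.multiply(transition_probs, policy_probs))
print(P_tau)
```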

Taking the log of both sides to help us simplify our previous expectation:

$$\log P(\tau|\pi_{\theta}) = \log p(s_{0}) + \sum_{t=1}^{T} (\log P(s_{t+1}|s_{t},a_{t})+ \log \pi_{\theta}(a_t|s_t))$$

$$\nabla_{\theta}\log P(\tau|\pi_{\theta}) = \nabla_{\theta} \log p(s_{0}) + \sum_{t=1}^{T} (\nabla_{\theta}\log P(s_{t+1}|s_{t},a_{t})+ \nabla_{\theta}\log \pi_{\theta}(a_t|s_t))$$

And since $\log p(s_{0})$ and $\log P(s_{t+1}|s_{t},a_{t})$ don't depend on $\theta$, their gradients vanish and we can simplify the gradient of the expectation to:

$$\nabla_{\theta}J(\pi_{\theta}) = \underset{\tau\sim\pi_{\theta}}{E}[\cancel{\nabla_{\theta}\log p(s_{0})} + \sum_{t=1}^{T} (\cancel{\nabla_{\theta}\log P(s_{t+1}|s_{t},a_{t})}+ \color{yellow}\nabla_{\theta}\log \pi_{\theta}(a_t|s_t)\color{black})\color{green}R(\tau)\color{black}]$$

$$\nabla_{\theta}J(\pi_{\theta}) = \underset{\tau\sim\pi_{\theta}}{E}\Big[\Big(\sum_{t=1}^{T} \color{yellow}\nabla_{\theta}\log \pi_{\theta}(a_t|s_t)\color{black}\Big)\color{green}R(\tau)\color{black}\Big]$$
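Once only the policy terms remain, the gradient can be estimated by sampling trajectories and ascending the resulting estimate; this is the classic REINFORCE estimator. A minimal sketch on a made-up two-armed bandit with a softmax policy (the arm rewards, learning rate, and sample counts are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy two-armed bandit: pulling arm a gives deterministic reward rewards[a],
# so each "trajectory" is a single action and R(tau) = rewards[a].
rewards = np.array([1.0, 2.0])
theta = np.zeros(2)  # logits of a softmax policy over the two arms

# REINFORCE: estimate grad_theta J = E[ grad_theta log pi(a) * R ] by
# sampling actions, then take a gradient-ascent step.
lr, n_samples = 0.1, 500
for _ in range(200):
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(2, p=probs)
        # gradient of log softmax at action a: one-hot(a) - probs
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0
        grad += grad_log_pi * rewards[a]
    theta += lr * grad / n_samples

print(softmax(theta))  # the policy concentrates on the higher-reward arm
```

Note that no environment model appears anywhere in the update: exactly as the cancellation above shows, only $\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)$ and the observed return are needed.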

Unfinished derivation (I've finished the implementation and will update this post with it soon). Currently working on a paper and will come back to finish this post.

Surya Dantuluri writes articles on Machine Learning, Full Stack Development, and Insightful Topics