Constrained Policy Optimization (CPO)

Abstract

Welcome to our paper on constrained policy optimization (CPO). In this article, we discuss the CPO method for safe reinforcement learning, which was presented in Achiam et al. . Constrained policy optimization is a policy search algorithm for constrained Markov decision processes (CMDPs). According to Achiam et al. it is the first such algorithm able to guarantee constraint satisfaction throughout training, which is a huge upside compared to prior policy search algorithms for CMDPs, such as the one proposed in Chow et al. , which were only able to guarantee convergence towards safe behaviour. Inspired by the authors' idea, we use CPO to train two safe AI models for the OpenAI environments CartPole and LunarLander.
In Achiam et al. , the authors linearize the objective function and the constraint functions to approximately solve the CPO update as a convex optimization problem, which can be solved efficiently using duality. The authors give the solution to the dual problem and mention using the conjugate gradient method (CG method) to approximately solve the inverse-matrix problems arising during a CPO update, which we also apply in our implementation. Furthermore, we ran experiments with the CG method to see if the algorithm could be improved by increasing the number of CG-iterations.
It was also mentioned in the paper that the trust region implementation could accelerate the model's training. Theoretically, a model with a large trust region would converge faster but fit the objective function poorly, while one with a small trust region would converge more slowly but fit the objective function better. Our hypothesis is: with a large trust region, the model will achieve higher rewards but might violate the constraint more often. On the other hand, a model with a small trust region will stay closer to the constraint, at the cost of lower rewards.
For the implementation of the algorithm, we based our code on the work of Sapana Chaudhary and on Achiam et al.


Table of contents

  1. Introduction of the algorithm
    1. Abstract theory
    2. Solution, solving the dual problem and formulating a recovery step
  2. Results Showcase
    1. LunarLander Environment
    2. CartPole Environment
  3. Experiments
    1. CG-method
    2. Trust region step
  4. Conclusion

Introduction of the algorithm

In this section we briefly describe how one derives the practical implementation of the CPO algorithm from the theoretical framework Achiam et al. have built .

Abstract theory

The authors of Achiam et al. derive an update step which theoretically guarantees constraint satisfaction while not decreasing the reward function. For details we refer to their paper . In each iteration $k = 1, 2, \dots$, a problem of the following kind is approximately solved to obtain a new policy $\pi_{k+1}$:

$\pi_{k+1} = \arg \max_{\pi \in \Pi_\theta} J(\pi), \quad \text{s.t. } J_{C_i}(\pi) \le d_i \text{ for } i=1,...,m, \quad D_{KL}(\pi, \pi_k) \le \delta ,$

where $\Pi_\theta$ is a set of parametrized policies, $J:\Pi_\theta \to \mathbb{R}$ is the reward function, $J_{C_i}:\Pi_\theta \to \mathbb{R}$ are the constraint functions, $d_i$ are the respective limits for $i = 1,...,m$, and $D_{KL}$ is the KL-divergence, which serves here as a kind of distance measure between policies.

For the practical implementation, the theoretical objects in the problem above, which we have only described vaguely here, are replaced by surrogate functions that can be estimated from samples collected by evaluating the current policy. The problem then becomes

$\theta_{k+1} = \arg \max_{\theta} g^\top (\theta - \theta_k), \quad \text{s.t. } c_i + b_i^\top(\theta - \theta_k) \le 0 \text{ for } i=1,...,m, \quad \frac{1}{2} (\theta - \theta_k)^\top H (\theta - \theta_k) \le \delta ,$

where $g$ is the gradient of the objective function, $H$ is the Hessian of the KL-divergence, $b_i$ is the gradient of the $i$-th constraint function and $c_i$ is the current value of the $i$-th constraint function for $i = 1,...,m$.
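
To give an impression of how these quantities can be obtained in practice, here is a minimal PyTorch sketch (our own illustration; the names are ours and not taken from the paper or from Chaudhary's code): $g$ and the $b_i$ are ordinary gradients of the sampled surrogate objective and constraint values, and products with $H$ can be formed without ever building the matrix explicitly, via a Hessian-vector product of the sampled KL-divergence.

import torch

def flat_grad(scalar, params, create_graph=False, retain_graph=True):
    # gradient of a scalar w.r.t. all parameters, flattened into one long vector
    grads = torch.autograd.grad(scalar, params,
                                create_graph=create_graph, retain_graph=retain_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def make_hvp(kl, params):
    # returns a function v -> H v, where H is the Hessian of the sampled KL-divergence
    grad_kl = flat_grad(kl, params, create_graph=True)
    def hvp(v):
        return flat_grad(torch.dot(grad_kl, v), params, retain_graph=True)
    return hvp

Here kl denotes the sample estimate of the KL-divergence between the new and the current policy; $g$ and the $b_i$ would be obtained by calling flat_grad on the sampled surrogate objective and constraint values. The matrix $H$ itself is never materialized, which is also what makes the CG method discussed in the experiments section attractive.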

As $H$ is assumed to be positive definite, all the functions appearing in this problem are either linear or convex, which makes it a convex optimization problem. According to Achiam et al. , when the problem is feasible, i.e. the set of feasible points is not empty, one can hence dualize it, and the dual is given by

$\max_{\lambda \geq 0, \nu \geq 0} \frac{-1}{2\lambda} \left(g^\top H^{-1}g - 2r^\top \nu + \nu^{\top}S\nu\right) + \nu^\top c - \frac{\lambda \delta}{2},$

where $B,c,r$ and $S$ are defined as:

$B := [b_1,..., b_m], \quad$ $c := [c_1,..., c_m]^\top, \quad $ $r := g^\top H^{-1} B, \quad $ $S := B^\top H^{-1} B$.

Moreover, they state that the solution to the primal problem can be recovered in the following way once the dual is solved:

$\theta^* = \theta_k + \frac{1}{\lambda^*}H^{-1}(g - B\nu^*),$

where $\lambda^*$ and $\nu^*$ are the solutions to the dual problem. The advantage of solving the dual problem is the much lower number of variables, as the dual contains only one variable per constraint plus one for the trust region. Hence, for a small number of constraints, solving the dual is computationally less expensive. In the special case of a single constraint ($m=1$) the solution can even be derived analytically. We show this in a bit more detail in the next section.

Solution, solving the dual problem and formulating a recovery step

For more details on the following we refer to Appendix 10.2 of . As stated above, when $m=1$ the dual can be solved analytically for $\lambda^*$ and $\nu^*$, and the primal solution can be recovered by

$\theta^* = \theta_k + \frac{1}{\lambda^*}H^{-1}(g - B\nu^*).$

The only question left to answer is how to obtain the optimal dual solutions $\lambda^*$ and $\nu^*$. Using strong duality, Achiam et al. derive the following equalities for $\lambda^*$ and $\nu^*$:

$\nu^* = \left(\frac{\lambda^*c-r}{s}\right)_+$

$\lambda^* =\arg \max_{\lambda \ge 0} \begin{cases} f_a(\lambda) = \frac{1}{2\lambda}\left(\frac{r^2}{s}-q\right) + \frac{\lambda}{2}\left(\frac{c^2}{s}-\delta\right) - \frac{rc}{s}, \text{ if } \lambda c - r > 0 \\ f_b(\lambda) = -\frac{1}{2}\left(\frac{q}{\lambda}+ \lambda \delta\right), \text{ otherwise } \end{cases}$ ,

where, in this single-constraint case, $q := g^\top H^{-1} g$, $r := g^\top H^{-1} b$ and $s := b^\top H^{-1} b$ are the scalar versions of the quantities defined above. Up to an additive constant, both of the functions $f_a, f_b$ required for the calculation of $\lambda^*$ are of the form

$f(\lambda) = -\frac{1}{2}(\frac{A}{\lambda} + B\lambda)$

for certain $A, B \in \mathbb{R}$ respectively. By differentiating this expression and setting the derivative equal to zero one easily obtains the optimal solution

$\lambda^* = \sqrt{\frac{A}{B}} ~.$

Now one would have to check whether the respective optimal solutions $\lambda^*_a, \lambda^*_b$ for $f_a$ and $f_b$ lie in the correct intervals and, if not, project them back in, but we will spare the details here. Then all that is left is to compare the values $f_a(\lambda^*_a)$ and $f_b(\lambda^*_b)$ and choose the one attaining the higher value. In a nutshell, this is how the CPO update is derived when the problem is feasible.
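
To make this concrete, here is a minimal sketch of the resulting update direction for the single-constraint case (our own illustration, not the reference implementation): it directly transcribes the formulas above, so the sign conventions are those of Appendix 10.2, and it assumes that the scalars q, r, s, c, delta as well as the approximate products Hinv_g and Hinv_b for $H^{-1}g$ and $H^{-1}b$ have already been computed, for example with the CG method discussed in the experiments section.

import math

def cpo_dual_step_direction(q, r, s, c, delta, Hinv_g, Hinv_b, eps=1e-8):
    # optimal lambda for the two branches f_a and f_b, before projection
    lam_a = math.sqrt(max((q - r ** 2 / s) / max(delta - c ** 2 / s, eps), eps))
    lam_b = math.sqrt(q / delta)
    # project onto Lambda_a = {lambda >= 0 : lambda * c - r > 0} and its complement
    # (edge cases such as c == 0 are ignored in this sketch)
    if c > 0:
        lam_a = max(lam_a, r / c)
        lam_b = min(max(lam_b, 0.0), max(r / c, 0.0))
    elif c < 0:
        lam_a = min(max(lam_a, 0.0), max(r / c, 0.0))
        lam_b = max(lam_b, r / c)
    # compare the two branch values and keep the better lambda
    f_a = -0.5 * ((q - r ** 2 / s) / (lam_a + eps) + lam_a * (delta - c ** 2 / s)) - r * c / s
    f_b = -0.5 * (q / (lam_b + eps) + lam_b * delta)
    lam_star = lam_a if f_a >= f_b else lam_b
    nu_star = max(0.0, (lam_star * c - r) / s)
    # theta* - theta_k = (1 / lambda*) * H^{-1} (g - nu* b)
    return (Hinv_g - nu_star * Hinv_b) / (lam_star + eps)
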
When the problem is infeasible, however, Achiam et al. propose the following recovery step to obtain a feasible problem again:

$\theta^* = \theta_k - \sqrt{\frac{2\delta}{b^\top H^{-1}b}}H^{-1}b$ .

One can check that this step indeed reduces the surrogate constraint value while staying within the trust region. As $H$ is positive definite, the inverse $H^{-1}$ is positive definite as well; thus $b^\top H^{-1} b > 0$ holds and we obtain

$c + b^T(\theta^* - \theta_k) = c - \sqrt{\frac{2\delta}{b^\top H^{-1}b}} b^T H^{-1}b < c $ .

With a more involved calculation one can also check that

$\frac{1}{2} (\theta^* - \theta_k)^\top H (\theta^* - \theta_k) = \delta$,

so the new iterate sits right on the edge of the trust region.
In the code of Chaudhary the approach described here is implemented: the optimal step is found by recovering the primal solution from the dual solution, which is computed using the equalities derived by Achiam et al. .
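
For completeness, the recovery step translates into code just as directly (again a sketch, with $s = b^\top H^{-1}b$ and the approximate product Hinv_b for $H^{-1}b$ precomputed as before):

import math

def recovery_step_direction(s, delta, Hinv_b):
    # theta* - theta_k = -sqrt(2 * delta / (b^T H^{-1} b)) * H^{-1} b
    return -math.sqrt(2 * delta / s) * Hinv_b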

Results Showcase

In this section we present two models, one for each of the environments we used, namely the LunarLander and the CartPole environment. While doing so, we also introduce the constraint functions we used for our experiments in the respective environments.
For parameterizing the policies we used neural networks with two hidden layers of size 100. Both environments required a scalar output.
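
For concreteness, such a policy network could look as follows (a sketch; the class name is ours and the output dimension is kept as a parameter):

import torch.nn as nn

class PolicyNet(nn.Module):
    # two hidden layers of size 100, as described above
    def __init__(self, obs_dim, out_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, obs):
        return self.net(obs)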

LunarLander environment

First we are going to present a model trained on the LunarLander environment. Here the goal of the agent is to land on a landing pad in the middle of the screen. For details on the reward we refer to ; most notably for us, the agent obtains negative reward for firing the side and main engines. For the constraint we decided to limit the speed of the aircraft. This is implemented in our code in the following way.

import numpy as np
import torch

def LunarLander_vel(state, dtype, device, bound=1.5):
    # state[:, 2]: velocity in x direction, state[:, 3]: velocity in y direction
    costs = torch.tensor(np.zeros(state.shape[0]), dtype=dtype).to(device)
    idx = torch.abs(state[:, 2]) + torch.abs(state[:, 3]) > bound
    costs[idx] = 1
    return costs

In mathematical terms the constraint function is given by

$C: \mathbb{R}^8 \to \mathbb{R} ~,~~C(s) = \begin{cases} 1, \text{ if } |s_3| + |s_4| > 1.5 \\ 0, \text{ else} \end{cases}.$

The agent is punished with a constraint cost of 1 if the 1-norm of the velocity vector is above 1.5; otherwise, if it is below this threshold, the agent receives a constraint cost of 0.
As we were unable to install the MuJoCo environments, we opted to create a setting similar to the one Achiam et al. implemented in the Humanoid-Circle environment. There the agent is rewarded for running in a circle: the larger the circle and the faster the agent runs, the more reward it obtains. At the same time, the constraint function punishes the agent if the circle is too big. As we punish our agent for moving too fast, which should make it slow down by firing the main engine more, the constraint function should entice the agent to give up some reward, similar to .

Here is a video of several runs of our model after about 90 iterations.

To visualize how well the model obeys the constraint, we plotted the speed of the aircraft over about 100 runs. Each trajectory represents the speed of the agent during one run. As we can see, almost all trajectories stay below the limit.

Plot displaying the speed of the aircraft throughout a run

Unfortunately, we ran into multiple difficulties during training. For example, in the LunarLander environment it was always the case that at some point the algorithm stopped working properly; sadly, we could not figure out why.

Plot displaying the learning progress

But we can see that for the first 100 iterations, where the algorithm is working, the constraint limit is tracked nicely and the reward is increasing.

CartPole environment

Now we come to a model we trained on the CartPole environment. Here the agent is rewarded for balancing a pole on a small cart; for details on the environment we refer to . For the constraint we decided to restrict the agent to only move in the middle part of the screen. We also tried some more involved constraints, but there we were less successful. The constraint is implemented in our code in the following way.

import numpy as np
import torch

def CartPole_pos(state, dtype, device, bound=2):
    # state[:, 0]: position of the cart, ranges from -4.8 to 4.8
    # the cart is constrained to stay within the range [-bound, bound]
    costs = torch.tensor(np.zeros(state.shape[0]), dtype=dtype).to(device)
    idx = torch.abs(state[:, 0]) > bound
    costs[idx] = 1
    return costs

In mathematical terms the constraint function is given by

$C: \mathbb{R}^4 \to \mathbb{R} ~,~~C(s) = \begin{cases} 1, \text{ if } |s_1| > 2 \\ 0, \text{ else} \end{cases}.$

The agent is punished with a constraint cost of 1 if it leaves the middle part of the screen; otherwise, as long as it stays in the middle, it receives a constraint cost of 0.

Here is a video of our agent after about 100 iterations.

When training in the CartPole environment we ran into far fewer difficulties. Here is the learning progress of our model. The reward is steadily increasing. It might look like the constraint is not tracked very well, but this is likely because the limit is so low that the relative deviations appear large; indeed, the constraint value drops below the limit every couple of iterations.
Plot displaying the learning progress

Experiments

In this section we present our experiments. As described in our introduction we experimented with different numbers of CG-iterations and different sizes for the trust region and observed their impact on the learning progress.

CG iteration

As seen in the introductory part, computing the CPO update involves several expressions such as $H^{-1}b$ and $H^{-1}g$. Computing the inverse of a matrix is computationally very expensive; hence Achiam et al. advise the reader to circumvent it by instead solving the linear systems $Hx = b$ and $Hx = g$ respectively. In exact arithmetic the respective solutions would give us exactly the quantities we are looking for, but as in practice we only have access to the floating point numbers of the IEEE 754 standard, we have to resort to an approximation. Here Achiam et al. propose to use the CG algorithm, an iterative method for solving linear systems that is particularly designed for symmetric positive definite matrices, cf. chapter 4.2.
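
For illustration, here is a minimal sketch of the CG method in the form we use it (our own illustration): it approximately solves a system such as $Hx = b$ using only Hessian-vector products $v \mapsto Hv$, for example the hvp function sketched in the introduction, so the matrix $H$ is never formed.

import torch

def conjugate_gradient(hvp, rhs, n_iters=10, tol=1e-10):
    # approximately solve H x = rhs for symmetric positive definite H,
    # given only a routine hvp(v) = H v
    x = torch.zeros_like(rhs)
    r = rhs.clone()                 # residual rhs - H x (x = 0 initially)
    p = r.clone()                   # search direction
    rs_old = torch.dot(r, r)
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rs_old / torch.dot(p, Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = torch.dot(r, r)
        if torch.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

The parameter n_iters is exactly the number of CG-iterations we vary in the following experiment.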

To observe the influence of the quality of the approximation of these quantities, we ran the CPO algorithm with different numbers of CG-iterations. Our hypothesis was that increasing the number of iterations would lead to an improvement in the learning progress, since a better approximation would lead to more accurate steps and thus the practical implementation would be closer to the theoretically derived update.

Learning progress and constraint satisfaction in the LunarLander and CartPole environments for agents trained with different numbers of CG-iterations within each update step.
In the LunarLander environment the agent was supposed not to surpass a certain speed, whilst for CartPole it was supposed to stay within the middle section of the screen.

To further understand the impact of the increased number of CG-iterations, we had a look at the relative forward error. For a linear system $Ax = b$ this quantity is given by $\frac{\|x-\hat{x}\|}{\|x\|}$ for some approximation $\hat{x}$ and some norm $\|\cdot\|$; this value is useful to evaluate the quality of an approximation. As we do not know the exact solution $x$ in most cases, we can only use bounds to assess it. One such bound, which we are going to use, is the following:

$\frac{\|x-\hat{x}\|}{\|x\|} \le K(A) \frac{\|r\|}{\|b\|},$

where $K(A)$ is the condition number of $A$ with respect to $\|\cdot\|$ and $r = b - A\hat{x}$ is the residual, cf. Theorem 2.4. This bound tells us that, if the matrix $A$ has a reasonable condition number, a small residual implies a good approximation.
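
Evaluating this bound during training is cheap; a short sketch (kappa_H stands for our estimate of the condition number $K(H)$, whose computation is not shown here, and hvp is the Hessian-vector product from before):

import torch

def forward_error_bound(hvp, rhs, x_hat, kappa_H):
    # residual r = rhs - H x_hat and the bound K(H) * ||r|| / ||rhs||
    residual = rhs - hvp(x_hat)
    return torch.norm(residual), kappa_H * torch.norm(residual) / torch.norm(rhs)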

Plot of the residual and of the bound on the relative forward error with respect to the 2-norm; here $r_1 = \|b - H\hat{x}\|_2$ and $r_2 = \|g - H\hat{x}\|_2$.

Indeed, we can see in the second graph that increasing the number of CG-iterations to fifty and above drastically lowers the residual norm and the bound on the relative forward error. Also, for both environments, with 10 and 20 CG-iterations the bound on the relative forward error sometimes lies well above 1, implying that the approximation might not be better than a random guess. On the other hand, in terms of the reward, when looking at the first graph the impact is not noticeable; it behaves rather similarly for all four different values. One explanation could be that our estimate of the condition number of $H$ is too pessimistic.

Direct comparison of the behaviour of the constraint value for 10 and 100 CG-iterations.

Only when considering the constraint value could one argue that increasing the number of CG-iterations leads to a somewhat more stable behaviour during training. Moreover, to further justify this point one would have to ignore the last 25 iterations in the LunarLander environment. All in all, our experiments seem to indicate that the impact of the number of CG-iterations is rather subtle and does not considerably improve the training process; obviously, for more reliable results more experiments would need to be conducted.

Trust region

In our introductory part, we already stated our hypothesis about the effect of the trust region size on the model's training.
To test this hypothesis, we trained the agent for $\delta \in \{0.001, 0.01, 0.05, 0.1\}$, again for 100 iterations each. According to our hypothesis, a model with a large trust region should achieve higher reward values than one with a smaller trust region; its constraint value, however, should also be higher and might not converge to the limit.

In the first figure, we showcase the reward value obtained after 100 training iterations for both CartPole and LunarLander environments.

Comparison of reward values for different trust region sizes

In the second figure, we showcase the constraint value obtained after 100 training iterations for both CartPole and LunarLander environments.

Comparison of constraint values for different trust region sizes

For the CartPole environment, our hypothesis seems to hold: the agent trained with a large $\delta$ managed to achieve a much higher reward, but on the other hand its constraint value never really converged to the limit. If you look at the graph closely, for small $\delta$ values the agent managed to stay somewhat close to the constraint.
For the LunarLander environment, however, the results do not support our hypothesis. If we look at the reward values for $\delta = 0.05$, the graph oscillates after the 60th iteration. On the other hand, the same does not happen for the even larger $\delta = 0.1$. Hence, we cannot confirm our hypothesis from the LunarLander results, although this might be caused by a mistake in our technical implementation.

Conclusion

In general, our CPO models achieve good reward performance and stay close to the constraint for a suitable trust region size. However, as you can see, we encountered some technical issues with our implementation, mainly in the LunarLander environment. This problem might come from an incompatibility of PyTorch versions.
Despite that, we can still see that our second hypothesis is partially reasonable; at least we could confirm it in the CartPole environment. Additionally, the performance of our model in the CartPole environment with suitable coefficients demonstrates that CPO can be a reasonable choice for developing a safe AI. Furthermore, if we manage to resolve the problem mentioned for the LunarLander environment, our agent should be able to perform consistently well during the entire training, as indicated by the first 100 iterations.
All in all, our results coincide with the conclusion of Achiam et al. (Section 9), namely that CPO guarantees an improvement in return while simultaneously satisfying the constraints when training neural network policies.