Welcome to our paper on Constrained Policy Optimization (CPO).
In this article, we will discuss the CPO method for safe reinforcement learning,
which was presented in Achiam et al.
In Achiam et al. it is also mentioned that implementing the update with a trust region can accelerate the model's training.
Theoretically, a model with a large trust region converges faster but fits the objective function more poorly, while one with a small trust region converges more slowly but fits the objective function better. Our hypothesis is therefore:
with a large trust region the model will achieve higher rewards but might violate the constraint more often, whereas with a small trust region it will stay closer to the constraint limit, at the cost of lower rewards.
For the implementation of the algorithm we based our code on the PyTorch implementation by Sapana Chaudhary.
In this section we briefly describe how the practical implementation of the CPO algorithm is derived from the theoretical framework of Achiam et al.
The authors of Achiam et al. formulate safe reinforcement learning as a constrained policy optimization problem: in every iteration, a surrogate of the expected return is maximized over all policies whose expected discounted constraint costs stay below prescribed limits and which lie within a trust region, measured by the KL divergence, around the current policy.
For the practical implementation, the theoretical quantities in the problem above, which we only described vaguely here, are replaced by surrogate functions that can be estimated from samples collected by running the current policy. Linearizing the objective and the constraints around the current parameters $\theta_k$ and approximating the KL divergence by a quadratic, the problem becomes
$\max_{\theta} ~ g^\top(\theta - \theta_k) \quad \text{s.t.} \quad c_i + b_i^\top(\theta - \theta_k) \le 0,~ i = 1,\dots,m, \quad \frac{1}{2}(\theta - \theta_k)^\top H (\theta - \theta_k) \le \delta ~,$
where $g$ is the gradient of the surrogate objective, $b_i$ is the gradient of the $i$-th surrogate constraint, $c_i$ is the current value of the $i$-th constraint cost minus its limit, $H$ is the Hessian of the (sample-estimated) KL divergence between the new and the current policy, and $\delta$ is the size of the trust region.
As one assumes $H$ to be positive definite, all the functions appearing in this problem are either linear or convex, which makes this a convex optimization problem.
According to Achiam et al., this convex problem is best attacked via its Lagrangian dual, which can be written in terms of the quantities $B$, $c$, $r$ and $S$, defined as:
$B := [b_1,..., b_m], \quad$ $c := [c_1,..., c_m]^\top, \quad $ $r := g^\top H^{-1} B, \quad $ $S := B^\top H^{-1} B$.
Moreover, they state that once the dual is solved, the solution to the primal problem can be recovered as
$\theta^* = \theta_k + \frac{1}{\lambda^*}H^{-1}(g - B\nu^*),$
where $\lambda^*$ and $\nu^*$ are the solutions of the dual problem. The advantage of solving the dual is the much smaller number of variables, as the dual has essentially one variable per constraint. Hence, for a small number of constraints, solving the dual is computationally much cheaper. In the special case of a single constraint ($m=1$) the solution can even be derived analytically. We show this in a bit more detail in the next section.
For more details on the following derivation we refer to Achiam et al. In the single-constraint case the primal solution is still recovered via
$\theta^* = \theta_k + \frac{1}{\lambda^*}H^{-1}(g - B\nu^*).$
The only question left to answer is how to obtain the optimal dual solutions $\lambda^*$ and $\nu^*$.
Using strong duality, Achiam et al. derive the following expressions for the optimal dual variables:
$\nu^* = \left(\frac{\lambda^*c-r}{s}\right)_+$
$\lambda^* =\arg \max_{\lambda \ge 0} \begin{cases} f_a(\lambda) = \frac{1}{2\lambda}(\frac{r^2}{s}-q) + \frac{\lambda}{2}(\frac{c^2}{s}-2\delta), & \text{if } \lambda c - r > 0 \\ f_b(\lambda) = -\frac{1}{2}(\frac{q}{\lambda}+ 2\lambda \delta), & \text{otherwise} \end{cases}$ ,
where $q$ is given by $g^\top H^{-1} g$ and $s := b^\top H^{-1} b$ is the scalar counterpart of $S$ in the single-constraint case. Both of the functions $f_a, f_b$ required for the calculation of $\lambda^*$ are of the form
$f(\lambda) = -\frac{1}{2}(\frac{A}{\lambda} + B\lambda)$
for certain nonnegative constants $A, B$ (not to be confused with the matrix $B$ above). Differentiating this expression and setting the derivative to zero, $f'(\lambda) = -\frac{1}{2}\big(B - \frac{A}{\lambda^2}\big) = 0$, one easily obtains the optimal solution $\lambda^* = \sqrt{A/B}~.$
Now one would have to check whether the respective optimal solutions $\lambda^*_a, \lambda^*_b$ for $f_a$ and $f_b$ lie in the correct intervals, determined by the case condition $\lambda c - r > 0$, and if not project them back onto those intervals; we spare the details here.
Then all that is left is to compare the values $f_a(\lambda^*_a)$ and $f_b(\lambda^*_b)$ and choose the candidate attaining the higher value.
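Once $\lambda^*$ and the corresponding $\nu^*$ are determined, the feasible-case update is just the primal-recovery formula from above. Below is a minimal sketch; the function and variable names are ours, and $H^{-1}g$ and $H^{-1}b$ are assumed to be precomputed, in practice with the conjugate gradient method discussed in the experiments section.

```python
import torch

def cpo_step(theta_k: torch.Tensor, Hinv_g: torch.Tensor, Hinv_b: torch.Tensor,
             lam_star: float, nu_star: float) -> torch.Tensor:
    """Feasible-case CPO update from the primal-recovery formula
        theta* = theta_k + (1/lambda*) * H^{-1}(g - nu* * b).
    Hinv_g and Hinv_b are approximations of H^{-1}g and H^{-1}b."""
    return theta_k + (Hinv_g - nu_star * Hinv_b) / lam_star
```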
This is, in a nutshell, how the CPO update is derived when the problem is feasible.
When the problem is infeasible, however, Achiam et al. propose a recovery step that purely decreases the surrogate constraint value in order to return to the feasible region:
$\theta^* = \theta_k - \sqrt{\frac{2\delta}{b^\top H^{-1}b}}H^{-1}b$ .
One can check that this step indeed reduces the surrogate constraint value while staying within the trust region. Since $H$ is positive definite, its inverse $H^{-1}$ is positive definite as well; hence $b^\top H^{-1} b > 0$ and we obtain
$c + b^T(\theta^* - \theta_k) = c - \sqrt{\frac{2\delta}{b^\top H^{-1}b}} b^T H^{-1}b < c $ .
A direct calculation also shows that
$\frac{1}{2} (\theta^* - \theta_k)^\top H (\theta^* - \theta_k) = \frac{1}{2}\cdot\frac{2\delta}{b^\top H^{-1}b}\,(H^{-1}b)^\top H (H^{-1}b) = \delta$,
so the new iterate sits right on the edge of the trust region.
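For illustration, this recovery step is easy to implement once an approximation of $H^{-1}b$ is available; the following sketch uses our own names and assumes $H^{-1}b$ has been obtained, e.g., with the conjugate gradient method.

```python
import torch

def recovery_step(theta_k: torch.Tensor, b: torch.Tensor,
                  Hinv_b: torch.Tensor, delta: float) -> torch.Tensor:
    """Infeasible-case recovery step
        theta* = theta_k - sqrt(2*delta / (b^T H^{-1} b)) * H^{-1} b.
    Hinv_b approximates H^{-1}b; b^T H^{-1} b > 0 since H is positive definite."""
    s = torch.dot(b, Hinv_b)
    return theta_k - torch.sqrt(2.0 * delta / s) * Hinv_b
```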
In the code of Chaudhary the update described above is implemented in PyTorch, and our implementation follows this code closely.
In this section we present two models, one for each of the environments we used, namely the LunarLander and the CartPole environment.
Along the way we also introduce the constraint functions we used for our experiments in the respective environments.
For parameterizing the policies we used neural networks with two hidden layers of size 100.
Both environments have discrete action spaces, so a single scalar suffices to represent an action.
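As an illustration, here is a minimal PyTorch sketch of such a policy network with two hidden layers of size 100. The categorical output head over the discrete actions of the standard Gym versions of LunarLander and CartPole, as well as the choice of activation function, are our assumptions and not necessarily identical to the code we used.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscretePolicy(nn.Module):
    """MLP policy with two hidden layers of size 100 and a categorical output."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

# LunarLander: 8-dimensional observation, 4 discrete actions
# CartPole:    4-dimensional observation, 2 discrete actions
policy = DiscretePolicy(obs_dim=8, n_actions=4)
action = policy(torch.zeros(8)).sample()  # a single scalar action
```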
First we are going to present a model trained on the LunarLander environment.
Here the goal of the agent is to land on a landing pad in the middle of the screen.
For details on the reward we refer to the documentation of the environment.
In mathematical terms the constraint function is given by
$C: \mathbb{R}^8 \to \mathbb{R} ~,~~C(s) = \begin{cases} 1, \text{ if } |s_3| + |s_4| > 1.5 \\ 0, \text{ else} \end{cases}.$
The agent is punished with a constraint cost of 1 whenever the 1-norm of its velocity vector exceeds 1.5; otherwise, as long as it stays below this threshold, the agent receives a constraint cost of 0.
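In code this constraint cost is a one-liner. The sketch below assumes the standard 8-dimensional Gym observation, in which the components $s_3$ and $s_4$ above correspond to the horizontal and vertical velocity (indices 2 and 3 with 0-based indexing).

```python
import numpy as np

def lunarlander_constraint_cost(s: np.ndarray) -> float:
    """Constraint cost C(s): 1 if the 1-norm of the velocity exceeds 1.5, else 0."""
    return 1.0 if abs(s[2]) + abs(s[3]) > 1.5 else 0.0
```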
As we were unable to install the MuJoCo environments, we opted to create a setting similar to the ones used in Achiam et al.
Here is a video of several runs of our model after about 90 iterations.
To visualize how well the model obeys the constraint, we plotted the speed of the lander over roughly 100 runs. Each trajectory corresponds to the speed of the agent during one run. As we can see, almost all trajectories stay below the limit.
Unfortunately, we ran into multiple difficulties during training. For example, in the LunarLander environment the algorithm always stopped working properly at some point, and we could not figure out why.
We can see, however, that during the first 100 iterations, in which the algorithm works as intended, the constraint limit is tracked nicely and the reward increases.
Our second model was trained on the CartPole environment, where the agent has to balance a pole on a moving cart. In mathematical terms the constraint function is given by
$C: \mathbb{R}^4 \to \mathbb{R} ~,~~C(s) = \begin{cases} 1, \text{ if } |s_1| > 2 \\ 0, \text{ else} \end{cases}.$
The agent is punished with a constraint cost of 1 if it leaves the middle part of the track, i.e. if $|s_1| > 2$; as long as it stays in the middle, the agent receives a constraint cost of 0.
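Again, the constraint cost is straightforward to implement. The sketch below assumes the standard 4-dimensional Gym observation, in which $s_1$ above is the cart position (index 0 with 0-based indexing).

```python
import numpy as np

def cartpole_constraint_cost(s: np.ndarray) -> float:
    """Constraint cost C(s): 1 if the cart leaves the middle of the track, else 0."""
    return 1.0 if abs(s[0]) > 2.0 else 0.0
```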
Here is a video of our agent after about 100 iterations.
In this section we present our experiments. As described in the introduction, we experimented with different numbers of CG-iterations and different trust region sizes and observed their impact on the learning progress.
As seen in the introductory part, computing the CPO update involves many expressions of the form $H^{-1}v$, such as $H^{-1}g$ and $H^{-1}b$, which in practice are not computed exactly but approximated with the conjugate gradient (CG) method.
To observe the influence of the quality of the approximation of these quantities, we ran the CPO algorithm with different numbers of CG-iterations. Our hypothesis was that increasing the number of iterations would improve the learning progress, since a better approximation leads to more accurate steps and thus brings the practical implementation closer to the theoretically derived update.
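For reference, here is a generic sketch of a conjugate gradient routine for this purpose. It only needs Hessian-vector products, so $H$ is never formed or inverted explicitly; the interface and names are ours and not taken verbatim from Chaudhary's implementation.

```python
import torch

def conjugate_gradient(hvp, g: torch.Tensor, n_iters: int = 10, tol: float = 1e-10) -> torch.Tensor:
    """Approximately solve H x = g, where hvp(v) returns the product H v
    (e.g. obtained via automatic differentiation of the KL divergence)."""
    x = torch.zeros_like(g)
    r = g.clone()              # residual g - H x (x = 0 initially)
    p = r.clone()              # search direction
    rs_old = torch.dot(r, r)
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rs_old / torch.dot(p, Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = torch.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```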
To further understand the impact of the number of CG-iterations, we had a look at the relative forward error.
For a linear algebraic system $Ax = y$ with an approximate solution $\hat{x}$ and residual $\hat{r} = y - A\hat{x}$, the relative forward error can be bounded by
$\frac{\|x - \hat{x}\|}{\|x\|} \le \kappa(A)\,\frac{\|\hat{r}\|}{\|y\|}~,$
where $\kappa(A) = \|A\|\,\|A^{-1}\|$ denotes the condition number of $A$; in our case $A = H$ and $y$ is, for example, the gradient $g$.
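As a small illustration, the bound is easy to evaluate when the matrix is available explicitly; in the CPO setting $H$ is only accessible through Hessian-vector products, so its condition number has to be estimated instead. The helper below is hypothetical and not part of our training code.

```python
import numpy as np

def forward_error_bound(A: np.ndarray, y: np.ndarray, x_hat: np.ndarray) -> float:
    """Upper bound on the relative forward error ||x - x_hat|| / ||x||
    for the system A x = y:  kappa(A) * ||y - A x_hat|| / ||y||."""
    residual = y - A @ x_hat
    return np.linalg.cond(A) * np.linalg.norm(residual) / np.linalg.norm(y)
```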
Indeed, we can see in the second graph that increasing the number of CG-iterations to fifty and above drastically lowers the residual norm and the bound on the relative forward error.
Moreover, in both environments the bound on the relative forward error for 10 and 20 CG-iterations sometimes lies well above 1, implying that the approximation might be no better than a random guess. In terms of the reward, on the other hand, the first graph shows no noticeable impact: it behaves rather similarly for all four values.
One explanation could be that our approximation of the condition number of $H$ is too pessimistic, so that the bound strongly overestimates the actual error in the computed update direction.
Only when considering the constraint value could one argue that increasing the number of CG-iterations leads to somewhat more stable behaviour during training, and even then one would have to ignore the last 25 iterations in the LunarLander environment. All in all, our experiments seem to indicate that the impact of the number of CG-iterations is rather subtle and does not considerably improve the training process; for more reliable results, more experiments would need to be conducted.
In our introductory part, we already stated our hypothesis about the effect of the trust region size on the model's training.
To test this hypothesis, we train the agent for $\delta \in \{0.001, 0.01, 0.05, 0.1\}$, again for 100 iterations.
According to our hypothesis, a model with a large trust region should achieve higher rewards than one with a smaller trust region; its constraint value, however, should also be higher and might not converge to the limit.
In the first figure, we showcase the reward value obtained after 100 training iterations for both CartPole and LunarLander environments.
In the second figure, we showcase the constraint value obtained after 100 training iterations for both CartPole and LunarLander environments.
For the CartPole environment, our hypothesis seems to hold: the agent trained with a large $\delta$ managed
to achieve a much higher reward, but its constraint value never really converged to the limit.
Looking at the graph closely, for small $\delta$ values the agent managed to stay reasonably close to the constraint limit.
For the LunarLander environment, however, the results do not support our hypothesis. If we take a look at the reward values
for $\delta = 0.05$, the graph oscillates after the 60th iteration, whereas the same does not happen for the even larger $\delta = 0.1$. Hence, we cannot confirm our hypothesis from the LunarLander environment.
This behaviour might, however, be caused by a mistake in our technical implementation.
In general, our CPO models achieve good reward performance and stay close to the constraint for a suitable trust region coefficient.
However, as you can see, we encountered some technical issues during our implementation, mainly in the LunarLander environment.
This problem might stem from an incompatibility between PyTorch versions.
Despite that, we can still see that our second hypothesis is at least partially reasonable: we could confirm it in the CartPole environment.
Additionally, the performance of our model in the CartPole environment with a suitable trust region size demonstrates that CPO could be a reasonable choice for developing a safe AI.
Furthermore, if we manage to resolve the problem observed in the LunarLander environment,
we expect our agent to perform consistently well during the entire training, as the first 100 iterations suggest.
All in all, our results coincide with the conclusion of Achiam et al.