Watch Your Step! - Safe Training in Reinforcement Learning

An Interactive Introduction to Curriculum Induction

The minigame above lets you play the role of a Reinforcement Learning (RL) agent trying to maximize the reward by reaching a goal. Try to avoid the holes and navigate over the surface of the Frozen Lake with the arrow keys, but watch your step, as the ice may be slippery. Use the drop-down menus to configure a teacher that helps you avoid danger while solving the task.

Curriculum Learning is all about applying the safeguard mechanisms you just tried out in an optimal way.

The Idea

The key idea of Curriculum Induction for Safe RL (CISR) is that a teacher trains a student to solve a given task while avoiding failure. This can be helpful in safety-critical systems, as the agent is already being kept safe during training, preventing costly failures.

To be able to save the student with a given set of interventions, the teacher needs to know how to detect dangerous states, but it does not need to know how to solve the task itself. The curriculum policy defines the order and duration in which interventions are applied. Learning the curriculum policy requires the teacher to train multiple students while assessing their performance.

Our Project

We give an interactive introduction to curriculum learning and provide the theoretical background to understand and apply the method. In our experiments, we compare the students trained by the Optimized curriculum policy proposed by Turchetta et al. to students trained with our own curriculum policies.

Background

For the application of RL, safety can be the deciding factor in enabling or preventing the usage of a system. This is especially true for physical systems, as they can degrade or destroy themselves or even their environment. Therefore, it is not only important for the system to be safe after deployment, but also during training in the real world. Approaches to RL safety include Constrained Markov Decision Processes (CMDPs), as used in this work, budgeted MDPs, and Lyapunov functions.

Autonomous cars are an example of why safety is needed during the training of RL systems. While simulations are helpful and may be a good starting point, training on real streets is still necessary. During this process, it is crucial to prevent crashes and harm to people, property, and the car itself.

CISR, as a form of curriculum learning, relies on a teacher as an aid during training of the agent, which is correspondingly called the student. To be able to help the student, the teacher is given a set of interventions and has to decide when to apply which one. For example, when teaching a child how to ride a bike, possible interventions may be adding training wheels, catching them when they fall, or giving them knee and elbow guards. The order in which the interventions are applied has to be optimized, as it can significantly impact the student's performance. In our bike example, it may be detrimental to skip the training wheels or to never remove them, since the child would then either have no chance to get started on the task or no chance to improve. By training multiple students, the teacher is able to learn an optimal curriculum policy.

In contrast to learning from demonstration, curriculum learning does not expect the teacher to know how to solve the task, but rather relies on the teacher to supervise and structure the learning process. A partially similar approach to CISR was introduced by Graves et al., using a nonstationary multi-armed bandit algorithm to determine an optimized curriculum. Matiisen et al. formalized the concept of learning a curriculum with an additional RL agent as Teacher-Student Curriculum Learning (TSCL) and applied the method to solve mazes in Minecraft.

CISR can also be viewed as a meta-learning framework, optimizing the curriculum policy as a hyperparameter. In practice, the curriculum policy could be optimized in simulation or in simplified settings before being deployed for the actual training, where training time is scarce. For example, this could make training with physical robots faster and safer.

Methodology

In CISR, the student is an RL agent trained in a Constrained Markov Decision Process (CMDP), which the teacher creates in each interaction unit using an intervention $i\in\mathcal{I}$ as described below.

$\mathcal{M}_i = \langle \mathcal{S},\mathcal{A},\mathcal{P}_i,r_i,\mathcal{D}, \mathcal{D}_i \rangle$

The teacher gets a set $\mathcal{I}$ of interventions $\{ \langle \mathcal{D}_i, \mathcal{T}_i \rangle \}_{i=1}^K$ as input, which consist of trigger states $\mathcal{D}_i \subset \mathcal{S}$ and reset distributions $\mathcal{T}_i: \mathcal{S} \rightarrow \Delta_{\mathcal{S} \backslash \mathcal{D}_i}$. If the student enters a trigger state $s\in \mathcal{D}_i$, the transition is modified such that $\mathcal{P}_i(s'|s,a) = \mathcal{T}_i(s'|s)$, leading the student to a safe state $s'\not\in \mathcal{D}_i$. An intervention by the teacher does not reduce the student's reward, i.e. $r_i(s,a,s')=0$ for $s\in \mathcal{D}_i$ and $s' \not\in \mathcal{D}_i$. To prevent the student from relying on interventions, a constraint is set on the number of times the teacher can help the student. It is enforced by the CMDP solver, which penalizes the student for excessive use of the teacher's help.
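To make the intervention mechanics concrete, here is a minimal Python sketch of how an intervention could wrap a single transition. This is only an illustration under our own assumptions (the env, trigger_states, and reset_distribution objects are hypothetical placeholders), not the implementation used in CISR.

import random

def step_with_intervention(env, state, action, trigger_states, reset_distribution):
    """If the student is in a trigger state s in D_i, the transition is replaced
    by the reset distribution, P_i(s'|s,a) = T_i(s'|s), and yields zero reward."""
    if state in trigger_states:
        safe_states, probabilities = reset_distribution(state)  # hypothetical: returns T_i(.|s)
        next_state = random.choices(safe_states, weights=probabilities, k=1)[0]
        return next_state, 0.0, False      # the intervention itself is not punished here
    # Outside the trigger states, the original dynamics apply.
    return env.step(state, action)         # hypothetical env returning (next_state, reward, done)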

At the beginning of each interaction unit $n\in [N_s]$, the teacher decides on an intervention $i_n \in \mathcal{I}$, which induces a CMDP $\mathcal{M}_{i_n}$ as described above. This decision is made using the teacher's curriculum policy $\pi^T: \mathcal{H} \rightarrow \mathcal{I}$, which maps the teacher's observation history $\phi(\pi_1),...,\phi(\pi_{n-1})\in\mathcal{H}$ to the intervention $i_n$. The observations $\phi(\pi_n)$ are features computed from the student's performance, e.g. an estimate of the student's policy value $\hat{V}(\pi_n)$ or the number of necessary teacher interventions. As this curriculum policy is learned, we refer to it as the Optimized curriculum policy from now on.
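For illustration only, the observation recorded after an interaction unit could be assembled as below. The concrete features and how the teacher optimizes over them are not specified here; value_estimate, num_interventions, and unit_length are assumed inputs.

def observation_features(value_estimate, num_interventions, unit_length):
    """Hypothetical feature vector phi(pi_n): the estimated policy value V_hat(pi_n)
    and the rate of teacher interventions during this interaction unit."""
    return (value_estimate, num_interventions / unit_length)

The teacher's observation history in $\mathcal{H}$ is then simply the list of these feature tuples collected over the interaction units so far.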

Curriculum policies independent of the student's policy are simply a mapping $\pi^T:[N_s]\rightarrow \mathcal{I}$, assigning each interaction unit a specific intervention. Except for the Optimized curriculum policy, all policies we will work with in this article are of this kind. An advantage of student-independent curriculum policies is that they do not require a training process and therefore no measure of the student's performance.

A sequence of CMDPs $\mathcal{M}_{i_1},...,\mathcal{M}_{i_{N_s}}$, induced by a curriculum policy, is called a curriculum. The figure below shows a curriculum induced by the Optimized curriculum policy with two simple interventions, which either reset the agent to the start (Hard Reset) or move it one step back (Soft Reset). Note that interaction units and curriculum steps refer to the same thing and are used interchangeably.

The Optimized curriculum policy switching interventions from Soft Reset 1 (SR1 moves the agent one step back) to Hard Reset (HR resets the agent back to the start) after three interaction units.

Training

Below is the CISR algorithm, which shows how the curriculum policy is optimized. The teacher learns online over $N_t$ rounds; in each round it plays a decision rule $\pi^T_j$ under which a new student $j$ learns on an adaptively constructed sequence of CMDPs $\mathcal{M}_{i_n}$. Each student $j$ learns for $N_s$ interaction units, transferring its policy from one unit to the next. After each unit, the teacher computes features $\phi$ by evaluating the student's policy and, based on these, proposes the next CMDP. At the end of each round, the teacher adjusts its decision rule.

The CISR algorithm by Turchetta et al.
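To make the structure of this loop explicit, the following Python sketch mirrors the description above. It is not the original implementation: the callables make_student, induce_cmdp, train_student, evaluate_student, and update_teacher, as well as the teacher's predict interface, are assumptions for illustration.

def cisr_training(teacher, make_student, induce_cmdp, train_student,
                  evaluate_student, update_teacher, interventions, N_t, N_s):
    """Sketch of the CISR outer loop: train N_t students for N_s interaction
    units each and update the teacher's curriculum policy from their features."""
    for j in range(N_t):                                     # one new student per round
        student = make_student()
        history = []                                         # observation history in H
        for n in range(N_s):
            i_n = teacher.predict(history)                   # curriculum policy pi^T
            cmdp = induce_cmdp(interventions[i_n])           # build M_{i_n}
            student = train_student(student, cmdp)           # transfer between units
            history.append(evaluate_student(student, cmdp))  # features phi(pi_n)
        teacher = update_teacher(teacher, history)           # adjust the decision rule
    return teacher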

Experiments

Our experiments concentrated on comparing the Optimized policy to other, student-independent curriculum policies and evaluating how well they generalize to an environment of a different size. To accomplish this, we created our own Frozen Smiley environment, which is based on the Frozen Lake environment used by Turchetta et al. In addition, we propose two new curriculum policies, which are explained in detail in this section.

Environments

We used the two environments shown below. Both are based on the Frozen Lake environment from the gym library. The idea of the environments is that the agent has to find its way over a frozen lake, from start to goal, while avoiding holes in the ice. They are implemented as two-dimensional square grid worlds in which the agent can move in four directions. Moving to a safe state gives the agent a negative reward of $-0.01$, while reaching the goal rewards the agent with $6$ points. Interventions by the teacher do not impose costs on the agent; failing, however, resets the score for the round to $0$. Because the ice is slippery, there is a $20\%$ chance that the agent moves to the side instead of forward, making the game non-deterministic. The trigger states, which the teacher uses to detect when it has to intervene, are positioned around the dangerous holes within a pre-defined reachability distance. To get a better idea of how the environments behave, we recommend trying out the minigame at the top of this article.

Legend: Safe, Goal, Danger, Start, Trigger.
The Frozen Lake environment used by Turchetta et al. on the left (size $10 \times 10$) and our Frozen Smiley environment on the right (size $16 \times 16$). Interventions are triggered at distance $1$ from holes.
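As a simplified stand-in for these dynamics (not the actual gym-based implementation, but using the reward values and slip probability described above), a single step could look like this:

import random

STEP_REWARD = -0.01   # reward for moving to a safe state
GOAL_REWARD = 6.0     # reward for reaching the goal
SLIP_PROB = 0.2       # chance of sliding to the side instead of moving as intended

MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def slippery_step(grid, position, action):
    """One step on a grid of cells ("F" frozen, "H" hole, "G" goal):
    with probability SLIP_PROB the agent slips to a perpendicular direction."""
    if random.random() < SLIP_PROB:
        action = random.choice(SIDES[action])
    dr, dc = MOVES[action]
    row = min(max(position[0] + dr, 0), len(grid) - 1)       # clamp to the map
    col = min(max(position[1] + dc, 0), len(grid[0]) - 1)
    cell = grid[row][col]
    if cell == "G":
        return (row, col), GOAL_REWARD, True                 # goal reached
    if cell == "H":
        return (row, col), 0.0, True                         # episode ends in failure
    return (row, col), STEP_REWARD, False                    # safe ice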

Curriculum Policies

While a learned curriculum policy is presumably more generally applicable and does not require manual optimization, it also comes with the disadvantages of a training process and the need for a performance measure of the student. With this in mind, we came up with two simple curriculum policies of the form $\pi^T : [N_s] \rightarrow \mathcal{I}$, which are described in the following.

The Back Policy

One of the simplest curriculum policies one could think of always goes back $x$ steps when a trigger state is visited. For our experiments, we tested values of $x$ in the interval $[1,9]$. When playing the minigame at the top of this article, you can try out the Back$_4$ policy yourself and see how it influences training. Below is our implementation in Python: a class that takes $x$ as an input and always returns the action at index $x-1$, which corresponds to going back $x$ steps.

class Back:
    """ Teacher that goes back a constant number of steps """

    def __init__(self, action_sequence, x=None):
        self.actions = action_sequence
        self.x = x

    def predict(self, obs):
        return self.actions[self.x - 1], None
Our implementation of the Back$_x$ curriculum policy. The $n$th element of the $\texttt{action\_sequence}$ list corresponds to the action which resets the agent by $n$ steps.
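As a quick, hypothetical usage example with a placeholder action sequence, where (following the caption above) the element at index $n-1$ resets the agent by $n$ steps:

actions = list(range(10))              # placeholder action sequence
teacher = Back(actions, x=4)
intervention, _ = teacher.predict(obs=None)
print(intervention)                    # 3, i.e. the action at index x - 1 = 3 ("go back 4 steps")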

To compare the Back curriculum policy with the Optimized one, we plotted them for different values of $x$ below. Move the slider and see how they measure up in the Frozen Smiley environment for yourself.

Successes for the Optimized curriculum policy and the Back policy for different values of $x$ in the Frozen Smiley environment. The transparent areas show the standard deviation for $N_t = 10$ students.
The Incremental Policy

The idea behind the Incremental curriculum policy is the tradeoff between exploration and exploitation: while the agent should be free to explore the map in the beginning, it should be punished harder for failures as the learning process progresses. This is realized by linearly increasing the number of steps by which the agent is reset. Formally, Incremental$_x$ resets the agent by $\lceil \frac{1}{2^x} \cdot n \rceil$ steps in the $n$th curriculum step. During our experiments, we tried values of $x$ in the range $[0,4]$. Our implementation below simply increases a counter after each interaction unit and scales it with the factor $\frac{1}{2^x}$. The rounded result is then used as the index of the action, corresponding to the number of steps the agent is reset by.

import numpy as np


class IncrementalTeacher:
    """ Incremental heuristic teacher that increases the buffer size on each curriculum step """

    def __init__(self, action_sequence, x=None):
        self.actions = action_sequence
        self.step = 0
        self.x = x

    def predict(self, obs):
        # Scale the current curriculum step by 1 / 2^x and round up to obtain the action index.
        action = int(np.ceil((1 / (2 ** self.x)) * self.step))
        self.step += 1
        return self.actions[action], None
Our implementation of the Incremental$_x$ curriculum policy. The $n$th element of the $\texttt{action\_sequence}$ list corresponds to the action which resets the agent by $n$ steps.
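For intuition, the indices selected by Incremental$_2$ over the first ten interaction units can be checked with a placeholder action sequence (assuming the class above):

teacher = IncrementalTeacher(list(range(10)), x=2)
schedule = [teacher.predict(obs=None)[0] for _ in range(10)]
print(schedule)   # [0, 1, 1, 1, 1, 2, 2, 2, 2, 3], i.e. ceil(n / 4) for n = 0, ..., 9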

Below we plotted the Incremental and the Optimized curriculum policy for different values of $x$. Again, use the slider to see which value of $x$ works best in the Frozen Smiley environment.

Successes for the Optimized curriculum policy and the Incremental policy for different values of $x$ in the Frozen Smiley environment. The transparent areas show the standard deviation for $N_t = 10$ students.

Results

In this section, we take a look at how our curriculum policies compare to the others and see how the larger Frozen Smiley environment influences their performance.

Student Performance

For our comparisons, we measure the successes, training failures, and average returns for both the Frozen Lake and the Frozen Smiley environment. Each curriculum policy was applied to $10$ students to account for the randomness of the environment and the training process.

Frozen Lake

From experimenting with the values of $x$ as described in the previous section, we found that $x=6$ and $x=2$ worked best in the Frozen Lake environment for the Back and Incremental curriculum policies, respectively. In this configuration, both the Back and the Incremental policy surpass the others within the first curriculum steps and quickly reach a success rate of around $90\%$.

Success rates of different curriculum policies on the Frozen Lake environment. For our policies, the best found parameters $x$ are used. The transparent areas show the standard deviation for $N_t = 10$ students.

This is also reflected in the table below: all teachers keep the student safe, while training in the original environment without interventions leads to failures during training. On average, the Back and Incremental policies outperform the others, with the Back policy performing slightly better than the Incremental one.

Policy            Successes   Training Failures   Average Returns
Back$_6$          $0.921$     $0.000$             $5.259$
Incremental$_2$   $0.886$     $0.000$             $5.053$
Trained           $0.743$     $0.000$             $4.035$
Original          $0.641$     $3672.600$          $3.586$
HR                $0.000$     $0.000$             $-0.295$
SR1               $0.820$     $0.000$             $4.661$
Bandit            $0.659$     $0.000$             $3.638$
Success rates, training failures and average returns of different curriculum policies on the Frozen Lake environment. The highest values in each column are highlighted.
Frozen Smiley

For the Frozen Smiley environment, we found that $x=8$ works best for the Back curriculum policy, while the Incremental policy performs best with $x=4$. Noticeably, the increase within the first curriculum steps is not as steep in this environment, and a stable success rate of around $90\%$ is only reached after $6$ interaction units. Even more apparent is the lower performance of the SR1 (Soft Reset 1) and the Bandit policies.

Success rates of different curriculum policies on the Frozen Smiley environment. For our policies, the best found parameters $x$ are used. The transparent areas show the standard deviation for $N_t = 10$ students.

As before, all teachers keep the students safe during training, as seen below. The Optimized curriculum policy manages to keep up with the Incremental policy in terms of successes after $10$ interaction units. Still, the Back policy performs best, with a final success rate of over $90\%$ and the largest average returns.

Policy            Successes   Training Failures   Average Returns
Back$_8$          $0.929$     $0.000$             $5.293$
Incremental$_4$   $0.886$     $0.000$             $5.054$
Trained           $0.879$     $0.000$             $4.856$
Original          $0.741$     $2453.400$          $4.166$
HR                $0.000$     $0.000$             $-0.140$
SR1               $0.597$     $0.000$             $3.367$
Bandit            $0.398$     $0.000$             $2.133$
Success rates, training failures and average returns of different curriculum policies on the Frozen Smiley environment. The highest values in each column are highlighted.

Trajectories

The figure below visualizes the paths taken by the students during training. During the first curriculum steps, the students explore the map, while at later steps they follow the previously found path with only slight deviations or optimizations.

Exemplary trajectories for the Frozen Smiley environment with the Optimized policy. The lines represent the steps taken, while the background shows a heatmap of the student's positions. The trajectories show a progression from the first curriculum step to a later step.

Evaluation

Adding the teacher with its trigger states and intervention transitions to the maps kept the students safe during training. With the best parameters among those we tested, the Back and Incremental policies were able to outperform the Optimized one. Both the higher values of $x$ and the slower increase in success rates can be explained by the increased size of the environment. As the Frozen Smiley environment ($16 \times 16$) is larger than the original one ($10 \times 10$), the longer paths made an increase in reset steps necessary for the Back policy. Similarly, the Incremental curriculum policy took advantage of a longer exploration phase in the larger environment by slowing down the increase of reset steps.

Limitations

The described method has several clear advantages, but it also comes with disadvantages and limitations. While the Frozen Lake environments used here make it simple to define which states trigger the teacher's interventions via a distance measure from danger zones, this can be a complex task in environments with continuous state spaces and observations. For example, the observation space of the car racing environment consists of $96 \times 96$ pixel images. In this case, defining the trigger states would require keeping track of the car's position and working with thresholds. In addition, the set of possible reset transitions has to be hand-crafted, and the teacher needs to be trained, requiring additional computational resources.
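As an illustration of what such a threshold-based trigger check could look like in a continuous setting (the position extraction and the threshold value are our own assumptions, not part of CISR):

import math

def is_trigger_state(car_position, danger_zones, threshold=2.0):
    """Hypothetical trigger check: intervene when the estimated car position
    comes closer than `threshold` to any known danger zone."""
    return any(math.dist(car_position, zone) < threshold for zone in danger_zones)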

Some tasks might simply be so risky that the teacher keeps the student from solving them at all, because it blocks the risks that are necessary to succeed. An example of this is the minigame at the beginning when selecting the Frozen Smiley environment and the teacher with trigger states within two tiles around danger zones. In such cases, less restrictive trigger states are needed, which might have to allow failures during training.

Conclusion

Comparing the different curriculum policies, we see that the Optimized one can outperform the Bandit and No Intervention policies as well as the Soft and Hard Reset policies it is based on. However, it is possible to define simple curriculum policies like Back or Incremental that perform even better than the Optimized policy on the tested environments. We also observed that larger environments require a longer exploration phase and that the original HR, SR, and Bandit policies do not generalize well to larger environments. Fundamentally, we found that defining reset transitions which keep the student safe is easier than defining suitable trigger states in the first place. When state spaces become more complex, dynamic, or only partially observable, this could become a problem.

Outlook

Going forward, the method could be applied to other environments, for example in OpenAI's Safety Gym, to test the robustness of different strategies. Additionally, the number of available interventions for the Optimized curriculum policy could be increased to account for this added complexity. Finally, it remains to be evaluated how well our curriculum policies generalize to more dynamic or random environments.

Acknowledgments

We are grateful to Rong Guo for supervising this project and continuously giving us feedback.

Author Contributions

This article was co-authored by Marvin Sextro and Jonas Loos under supervision of Rong Guo. Klaus Obermayer provided feedback on the project.

* equal contributions

Additional Material

For a summary of the main findings presented in this article, also see our conference poster.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.