Watch Your Step! - Safe Training in Reinforcement Learning

An Interactive Introduction to Curriculum Induction

The minigame above lets you play the role of a Reinforcement Learning (RL) agent trying to maximize the reward by reaching a goal. Try to avoid the holes and navigate over the surface of the Frozen Lake with the arrow keys, but watch your step, as the ice may be slippery. Use the drop-down menus to configure a teacher that helps you avoid danger while solving the task.

Curriculum Learning is all about applying the safeguard mechanisms you just tried out in an optimal way.

The Idea

The key idea of Curriculum Induction for Safe RL (CISR) is that a teacher trains a student to solve a given task while avoiding failure. This can be helpful in safety-critical systems, as the agent is already being kept safe during training, preventing costly failures.

To be able to save the student with a given set of interventions, the teacher needs to know how to detect dangerous states, but it does not need to know how to solve the task itself. The curriculum policy defines the order and duration in which interventions are applied. Learning the curriculum policy requires the teacher to train multiple students while assessing their performance.

Our Project

We give an interactive introduction to curriculum learning and provide the theoretical background to understand and apply the method. In our experiments, we compare the students trained by the Optimized curriculum policy proposed by Turchetta et al. to students trained with our own curriculum policies.

Background

For the application of RL, safety can be the deciding factor in enabling or preventing the usage of a system. This is especially true for physical systems, as they can degrade or destroy themselves or even their environment. Therefore, it is not only important for the system to be safe after deployment, but also during training in the real world. Approaches to RL safety include Constrained Markov Decision Processes (CMDPs), as used in this work, budgeted MDPs, and Lyapunov functions.

Autonomous cars are an example of why safety is needed during the training of RL systems. While simulations are helpful and may be a good starting point, training on real streets is still necessary. During this process, it is crucial to prevent crashes and harm to people, property, and the car itself.

CISR, as a form of curriculum learning, relies on a teacher as an aid during training of the agent, which is correspondingly called the student. To be able to help the student, the teacher is given a set of interventions and has to decide when to apply which one. For example, when teaching a child how to ride a bike, possible interventions may be adding training wheels, catching them when they fall, or giving them knee and elbow guards. The order in which the interventions are applied has to be optimized, as it can significantly impact the student's performance. In our bike example, it may be detrimental to skip the training wheels or to never remove them, since the child would then either have no chance to get started on the task or no chance to improve. By training multiple students, the teacher is able to learn an optimal curriculum policy.

In contrast to learning from demonstration, curriculum learning does not expect the teacher to know how to solve the task, but rather relies on the teacher to supervise and structure the learning process. A partially similar approach to CISR was introduced by Graves et al., using a nonstationary multi-armed bandit algorithm to determine an optimized curriculum. Matiisen et al. formalized the concept of learning a curriculum with an additional RL agent as Teacher-Student Curriculum Learning (TSCL) and applied the method to solve mazes in Minecraft.

CISR can also be viewed as a meta-learning framework, optimizing the curriculum policy as a hyperparameter. In practice, the curriculum policy could be optimized in simulation or in simplified settings before being deployed for the actual training, where training time is scarce. For example, this could make training with physical robots faster and safer.

Methodology

In CISR, the student is an RL agent trained in a Constrained Markov Decision Process (CMDP), which the teacher creates in each interaction unit using an intervention $i\in\mathcal{I}$ as described below.

$\mathcal{M}_i = \langle \mathcal{S},\mathcal{A},\mathcal{P}_i,r_i,\mathcal{D}, \mathcal{D}_i \rangle$

The teacher gets a set $\mathcal{I}$ of interventions $\{ \langle \mathcal{D}_i, \mathcal{T}_i \rangle \}_{i=1}^K$ as input, which consist of trigger states $\mathcal{D}_i \subset \mathcal{S}$ and reset distributions $\mathcal{T}_i: \mathcal{S} \rightarrow \Delta_{\mathcal{S} \backslash \mathcal{D}_i}$. If the student enters a trigger state $s\in \mathcal{D}_i$, the transition is modified such that $\mathcal{P}_i(s'|s,a) = \mathcal{T}_i(s'|s)$, leading the student to a safe state $s'\not\in \mathcal{D}_i$. An intervention by the teacher does not reduce the student's reward, i.e. $r_i(s,a,s')=0$ for $s\in \mathcal{D}_i$ and $s' \not\in \mathcal{D}_i$. To prevent the student from relying on interventions, a constraint is set on the number of times the teacher can help the student. It is enforced by the CMDP solver, which penalizes the student for excessive use of the teacher's help.
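To make the intervention mechanics concrete, here is a minimal Python sketch of how an intervention could wrap a single transition. This is only an illustration under our own assumptions (the env, trigger_states, and reset_distribution objects are hypothetical placeholders), not the implementation used in CISR.

import random

def step_with_intervention(env, state, action, trigger_states, reset_distribution):
    """If the student is in a trigger state s in D_i, the transition is replaced
    by the reset distribution, P_i(s'|s,a) = T_i(s'|s), and yields zero reward."""
    if state in trigger_states:
        safe_states, probabilities = reset_distribution(state)  # hypothetical: returns T_i(.|s)
        next_state = random.choices(safe_states, weights=probabilities, k=1)[0]
        return next_state, 0.0, False      # the intervention itself is not punished here
    # Outside the trigger states, the original dynamics apply.
    return env.step(state, action)         # hypothetical env returning (next_state, reward, done)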

At the beginning of each interaction unit $n\in [N_s]$, the teacher decides on an intervention $i_n \in \mathcal{I}$, which induces a CMDP $\mathcal{M}_{i_n}$ as described above. This decision is made using the teacher's curriculum policy $\pi^T: \mathcal{H} \rightarrow \mathcal{I}$, which maps the teacher's observation history $\phi(\pi_1),...,\phi(\pi_{n-1})\in\mathcal{H}$ to the intervention $i_n$. The observations $\phi(\pi_n)$ are features computed from the student's performance, e.g. an estimate of the student's policy value $\hat{V}(\pi_n)$ or the number of necessary teacher interventions. As this curriculum policy is learned, we refer to it as the Optimized curriculum policy from now on.
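For illustration only, the observation recorded after an interaction unit could be assembled as below. The concrete features and how the teacher optimizes over them are not specified here; value_estimate, num_interventions, and unit_length are assumed inputs.

def observation_features(value_estimate, num_interventions, unit_length):
    """Hypothetical feature vector phi(pi_n): the estimated policy value V_hat(pi_n)
    and the rate of teacher interventions during this interaction unit."""
    return (value_estimate, num_interventions / unit_length)

The teacher's observation history in $\mathcal{H}$ is then simply the list of these feature tuples collected over the interaction units so far.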

Curriculum policies independent of the student's policy are simply a mapping $\pi^T:[N_s]\rightarrow \mathcal{I}$, assigning each interaction unit a specific intervention. Except for the Optimized curriculum policy, all policies we will work with in this article are of this kind. An advantage of student-independent curriculum policies is that they do not require a training process and therefore no measure of the student's performance.

A sequence of CMDPs $\mathcal{M}_{i_1},...,\mathcal{M}_{i_{N_s}}$, induced by a curriculum policy, is called a curriculum. The figure below shows a curriculum induced by the Optimized curriculum policy with two simple interventions, which either reset the agent to the start (Hard Reset) or move it one step back (Soft Reset). Note that interaction units and curriculum steps refer to the same thing and are used interchangeably.

The Optimized curriculum policy switching interventions from Soft Reset 1 (SR1 moves the agent one step back) to Hard Reset (HR resets the agent back to the start) after three interaction units.

Training

Below is the CISR algorithm, which shows how the curriculum policy is optimized. The teacher learns online over $N_t$ rounds; in each round it plays a decision rule $\pi^T_j$ under which a new student $j$ learns on an adaptively constructed sequence of CMDPs $\mathcal{M}_{i_n}$. Each student $j$ learns for $N_s$ interaction units, transferring its policy from one unit to the next. After each unit, the teacher computes features $\phi$ by evaluating the student's policy and, based on these, proposes the next CMDP. At the end of each round, the teacher adjusts its decision rule.

The CISR algorithm by Turchetta et al.
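To make the structure of this loop explicit, the following Python sketch mirrors the description above. It is not the original implementation: the callables make_student, induce_cmdp, train_student, evaluate_student, and update_teacher, as well as the teacher's predict interface, are assumptions for illustration.

def cisr_training(teacher, make_student, induce_cmdp, train_student,
                  evaluate_student, update_teacher, interventions, N_t, N_s):
    """Sketch of the CISR outer loop: train N_t students for N_s interaction
    units each and update the teacher's curriculum policy from their features."""
    for j in range(N_t):                                     # one new student per round
        student = make_student()
        history = []                                         # observation history in H
        for n in range(N_s):
            i_n = teacher.predict(history)                   # curriculum policy pi^T
            cmdp = induce_cmdp(interventions[i_n])           # build M_{i_n}
            student = train_student(student, cmdp)           # transfer between units
            history.append(evaluate_student(student, cmdp))  # features phi(pi_n)
        teacher = update_teacher(teacher, history)           # adjust the decision rule
    return teacher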

Experiments

Our experiments concentrated on comparing the Optimized policy to other, student-independent curriculum policies and evaluating how well they generalize to an environment of a different size. To accomplish this, we created our own Frozen Smiley environment, which is based on the Frozen Lake environment used by Turchetta et al. In addition, we propose two new curriculum policies, which are explained in detail in this section.

Environments

We used the two environments shown below. Both are based on the Frozen Lake environment from the gym library. The idea of the environments is that the agent has to find its way over a frozen lake, from start to goal, while avoiding holes in the ice. They are implemented as two-dimensional square grid worlds in which the agent can move in four directions. Moving to a safe state gives the agent a negative reward of $-0.01$, while reaching the goal rewards the agent with $6$ points. Interventions by the teacher do not impose costs on the agent; failing, however, resets the score for the round to $0$. Because the ice is slippery, there is a $20\%$ chance that the agent moves to the side instead of forward, making the game non-deterministic. The trigger states, which the teacher uses to detect when it has to intervene, are positioned around the dangerous holes within a pre-defined reachability distance. To get a better idea of how the environments behave, we recommend trying out the minigame at the top of this article.

Legend: Safe, Goal, Danger, Start, Trigger.
The Frozen Lake environment used by Turchetta et al. on the left (size $10 \times 10$) and our Frozen Smiley environment on the right (size $16 \times 16$). Interventions are triggered at distance $1$ from holes.
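As a simplified stand-in for these dynamics (not the actual gym-based implementation, but using the reward values and slip probability described above), a single step could look like this:

import random

STEP_REWARD = -0.01   # reward for moving to a safe state
GOAL_REWARD = 6.0     # reward for reaching the goal
SLIP_PROB = 0.2       # chance of sliding to the side instead of moving as intended

MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def slippery_step(grid, position, action):
    """One step on a grid of cells ("F" frozen, "H" hole, "G" goal):
    with probability SLIP_PROB the agent slips to a perpendicular direction."""
    if random.random() < SLIP_PROB:
        action = random.choice(SIDES[action])
    dr, dc = MOVES[action]
    row = min(max(position[0] + dr, 0), len(grid) - 1)       # clamp to the map
    col = min(max(position[1] + dc, 0), len(grid[0]) - 1)
    cell = grid[row][col]
    if cell == "G":
        return (row, col), GOAL_REWARD, True                 # goal reached
    if cell == "H":
        return (row, col), 0.0, True                         # episode ends in failure
    return (row, col), STEP_REWARD, False                    # safe ice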

Curriculum Policies

While a learned curriculum policy is presumably more generally applicable and does not require manual optimization, it also comes with the disadvantages of a training process and the need for a performance measure of the student. With this in mind, we came up with two simple curriculum policies of the form $\pi^T : [N_s] \rightarrow \mathcal{I}$, which are described in the following.

The Back Policy

One of the simplest curriculum policies one could think of always goes back $x$ steps when a trigger state is visited. For our experiments, we tested values of $x$ in the interval $[1,9]$. When playing the minigame at the top of this article, you can try out the Back$_4$ policy yourself and see how it influences training. Below is our implementation in Python: a class that takes $x$ as an input and always returns the action at index $x-1$, which corresponds to going back $x$ steps.

class Back:
    """ Teacher that goes back a constant number of steps """

    def __init__(self, action_sequence, x=None):
        self.actions = action_sequence
        self.x = x

    def predict(self, obs):
        return self.actions[self.x - 1], None
Our implementation of the Back$_x$ curriculum policy. The $n$th element of the $\texttt{action\_sequence}$ list corresponds to the action which resets the agent by $n$ steps.
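As a quick, hypothetical usage example with a placeholder action sequence, where (following the caption above) the element at index $n-1$ resets the agent by $n$ steps:

actions = list(range(10))              # placeholder action sequence
teacher = Back(actions, x=4)
intervention, _ = teacher.predict(obs=None)
print(intervention)                    # 3, i.e. the action at index x - 1 = 3 ("go back 4 steps")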

To compare the Back curriculum policy with the Optimized one, we plotted them for different values of $x$ below. Move the slider and see how they measure up in the Frozen Smiley environment for yourself.

Successes for the Optimized curriculum policy and the Back policy for different values of $x$ in the Frozen Smiley environment. The transparent areas show the standard deviation for $N_t = 10$ students.
The Incremental Policy

The idea behind the Incremental curriculum policy is the tradeoff between exploration and exploitation: while the agent should be free to explore the map in the beginning, it should be punished harder for failures as the learning process progresses. This is realized by linearly increasing the number of steps by which the agent is reset. Formally, Incremental$_x$ resets the agent by $\lceil \frac{1}{2^x} \cdot n \rceil$ steps in the $n$th curriculum step. During our experiments, we tried values of $x$ in the range $[0,4]$. Our implementation below simply increases a counter after each interaction unit and scales it with the factor $\frac{1}{2^x}$. The rounded result is then used as the index of the action, corresponding to the number of steps the agent is reset by.

import numpy as np


class IncrementalTeacher:
    """ Incremental heuristic teacher that increases the buffer size on each curriculum step """

    def __init__(self, action_sequence, x=None):
        self.actions = action_sequence
        self.step = 0
        self.x = x

    def predict(self, obs):
        # Scale the current curriculum step by 1 / 2^x and round up to obtain the action index.
        action = int(np.ceil((1 / (2 ** self.x)) * self.step))
        self.step += 1
        return self.actions[action], None
Our implementation of the Incremental$_x$ curriculum policy. The $n$th element of the $\texttt{action\_sequence}$ list corresponds to the action which resets the agent by $n$ steps.
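For intuition, the indices selected by Incremental$_2$ over the first ten interaction units can be checked with a placeholder action sequence (assuming the class above):

teacher = IncrementalTeacher(list(range(10)), x=2)
schedule = [teacher.predict(obs=None)[0] for _ in range(10)]
print(schedule)   # [0, 1, 1, 1, 1, 2, 2, 2, 2, 3], i.e. ceil(n / 4) for n = 0, ..., 9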

Below we plotted the Incremental and the Optimized curriculum policy for different values of $x$. Again, use the slider to see which value of $x$ works best in the Frozen Smiley environment.

Successes for the Optimized curriculum policy and the Incremental policy for different values of $x$ in the Frozen Smiley environment. The transparent areas show the standard deviation for $N_t = 10$ students.

Results

In this section, we take a look at how our curriculum policies compare to the others and see how the larger Frozen Smiley environment influences their performance.

Student Performance

For our comparisons, we measure the successes, training failures, and average returns for both the Frozen Lake and the Frozen Smiley environment. Each curriculum policy was applied to $10$ students to account for the randomness of the environment and the training process.

Frozen Lake

From experimenting with the values of $x$ as described in the previous section, we found that $x=6$ and $x=2$ worked best in the Frozen Lake environment for the Back and Incremental curriculum policies, respectively. In this configuration, both the Back and the Incremental policy surpass the others within the first curriculum steps and quickly reach a success rate of around $90\%$.

Success rates of different curriculum policies on the Frozen Lake environment. For our policies, the best found parameters $x$ are used. The transparent areas show the standard deviation for $N_t = 10$ students.

This is also reflected in the table below: all teachers keep the student safe, while training in the original environment without interventions leads to failures during training. On average, the Back and Incremental policies outperform the others, with the Back policy performing slightly better than the Incremental one.

Policy            Successes   Training Failures   Average Returns
Back$_6$          $0.921$     $0.000$             $5.259$
Incremental$_2$   $0.886$     $0.000$             $5.053$
Trained           $0.743$     $0.000$             $4.035$
Original          $0.641$     $3672.600$          $3.586$
HR                $0.000$     $0.000$             $-0.295$
SR1               $0.820$     $0.000$             $4.661$
Bandit            $0.659$     $0.000$             $3.638$
Success rates, training failures and average returns of different curriculum policies on the Frozen Lake environment. The highest values in each column are highlighted.
Frozen Smiley

For the Frozen Smiley environment, we found that $x=8$ works best for the Back curriculum policy, while the Incremental policy performs best with $x=4$. Noticeably, the increase within the first curriculum steps is not as steep in this environment, and a stable success rate of around $90\%$ is only reached after $6$ interaction units. Even more apparent is the lower performance of the SR1 (Soft Reset 1) and the Bandit policies.

Success rates of different curriculum policies on the Frozen Smiley environment. For our policies, the best found parameters $x$ are used. The transparent areas show the standard deviation for $N_t = 10$ students.

As before, all teachers keep the students safe during training, as seen below. The Optimized curriculum policy manages to keep up with the Incremental policy in terms of successes after $10$ interaction units. Still, the Back policy performs best, with a final success rate of over $90\%$ and the largest average returns.

Policy            Successes   Training Failures   Average Returns
Back$_8$          $0.929$     $0.000$             $5.293$
Incremental$_4$   $0.886$     $0.000$             $5.054$
Trained           $0.879$     $0.000$             $4.856$
Original          $0.741$     $2453.400$          $4.166$
HR                $0.000$     $0.000$             $-0.140$
SR1               $0.597$     $0.000$             $3.367$
Bandit            $0.398$     $0.000$             $2.133$
Success rates, training failures and average returns of different curriculum policies on the Frozen Smiley environment. The highest values in each column are highlighted.

Trajectories

The figure below visualizes the paths taken by the students during training. During the first curriculum steps, the students explore the map, while at later steps they follow the previously found path with only slight deviations or optimizations.

Exemplary trajectories for the Frozen Smiley environment with the Optimized policy. The lines represent the steps taken, while the background shows a heatmap of the student's positions. The trajectories show a progression from the first curriculum step to a later step.

Evaluation

Adding the teacher with its trigger states and intervention transitions to the maps kept the students safe during training. With the best parameters among those we tested, the Back and Incremental policies were able to outperform the Optimized one. Both the higher values of $x$ and the slower increase in success rates can be explained by the increased size of the environment. As the Frozen Smiley environment ($16 \times 16$) is larger than the original one ($10 \times 10$), the longer paths made an increase in reset steps necessary for the Back policy. Similarly, the Incremental curriculum policy took advantage of a longer exploration phase in the larger environment by slowing down the increase of reset steps.

Limitations

The described method has several clear advantages, but it also comes with disadvantages and limitations. While the Frozen Lake environments used here make it simple to define which states trigger the teacher's interventions via a distance measure from danger zones, this can be a complex task in environments with continuous state spaces and observations. For example, the observation space of the car racing environment consists of $96 \times 96$ pixel images. In this case, defining the trigger states would require keeping track of the car's position and working with thresholds. In addition, the set of possible reset transitions has to be hand-crafted, and the teacher needs to be trained, requiring additional computational resources.
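As an illustration of what such a threshold-based trigger check could look like in a continuous setting (the position extraction and the threshold value are our own assumptions, not part of CISR):

import math

def is_trigger_state(car_position, danger_zones, threshold=2.0):
    """Hypothetical trigger check: intervene when the estimated car position
    comes closer than `threshold` to any known danger zone."""
    return any(math.dist(car_position, zone) < threshold for zone in danger_zones)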

Some tasks might simply be so risky that the teacher keeps the student from solving them at all, because it blocks the risks that are necessary to succeed. An example of this is the minigame at the beginning when selecting the Frozen Smiley environment and the teacher with trigger states within two tiles around danger zones. In such cases, less restrictive trigger states are needed, which might have to allow failures during training.

Conclusion

Comparing the different curriculum policies, we see that the Optimized one can outperform the Bandit and No Intervention policies as well as the Soft and Hard Reset policies it is based on. However, it is possible to define simple curriculum policies like Back or Incremental that perform even better than the Optimized policy on the tested environments. We also observed that larger environments require a longer exploration phase and that the original HR, SR, and Bandit policies do not generalize well to larger environments. Fundamentally, we found that defining reset transitions which keep the student safe is easier than defining suitable trigger states in the first place. When state spaces become more complex, dynamic, or only partially observable, this could become a problem.

Outlook

Going forward, the method could be applied to other environments, for example in OpenAI's Safety Gym, to test the robustness of different strategies. Additionally, the number of available interventions for the Optimized curriculum policy could be increased to account for this added complexity. Finally, it remains to be evaluated how well our curriculum policies generalize to more dynamic or random environments.

Acknowledgments

We are grateful to Rong Guo for supervising this project and continuously giving us feedback.

Author Contributions

This article was co-authored by Marvin Sextro and Jonas Loos under supervision of Rong Guo. Klaus Obermayer provided feedback on the project.

* equal contributions

Additional Material

For a summary of the main findings presented in this article, also see our conference poster.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.