An Interactive Introduction to Curriculum Induction
The minigame above lets you play the role of a Reinforcement Learning (RL) agent trying to maximize the reward by reaching the goal. Navigate across the surface of the Frozen Lake with the arrow keys and try to avoid the holes, but watch your step, as the ice may be slippery. Use the drop-down menus to configure a teacher that helps you avoid danger while solving the task.
Curriculum Learning is all about applying the safeguard mechanisms you just tried out in an optimal way.
The key idea of Curriculum Induction for Safe RL (CISR) is that a teacher keeps the student safe during training by intervening in dangerous situations, and learns how to best schedule these interventions by training multiple students.
To be able to save the student with a given set of interventions, the teacher needs to know how to detect dangerous states, but it does not need to know how to solve the task itself. The curriculum policy defines the order and duration in which interventions are applied. Learning the curriculum policy requires the teacher to train multiple students while assessing their performance.
We give an interactive introduction to curriculum learning and provide the theoretical background to understand and apply the method.
In our experiments, we compare students trained with the Optimized curriculum policy proposed by Turchetta et al. to students trained with simpler, student-independent curriculum policies.
When applying RL, safety can be the deciding factor that enables or prevents the use of a system.
This is especially true for physical systems, which can degrade or destroy themselves or even damage their environment.
Therefore, it is not only important for the system to be safe after deployment, but also during training in the real world.
Approaches to RL safety include Constrained Markov Decision Processes (CMDPs), as used in this work, budgeted MDPs, and Lyapunov functions.
Autonomous cars are an example application where safety is needed during the training of an RL system.
While simulations are helpful and may be a good starting point, training on real streets is still necessary.
During this process, it is crucial to prevent crashes and harm to people, property and the car itself.
CISR, as a form of curriculum learning, relies on a teacher that aids the agent, analogously called the student, during training.
To be able to help the student, the teacher is given a set of interventions and must decide when to apply which one.
For example, when teaching a child how to ride a bike, possible interventions may be adding training wheels, catching them when they fall or giving them knee and elbow guards.
The order in which the interventions are applied has to be optimized as it can significantly impact the student's performance.
In our bike example, it may be detrimental to skip the training wheels or to never remove them, since this would give the child no chance either to get started learning the task or to improve further.
By training multiple students, the teacher is able to learn an optimal curriculum policy.
In contrast to learning from demonstration, the teacher does not show the student how to solve the task; it only intervenes to keep the student out of danger.
CISR can also be viewed as a meta-learning framework, since the teacher learns how to teach by training and evaluating multiple students.
In CISR, the student learns in a sequence of Constrained Markov Decision Processes (CMDPs) induced by the teacher's interventions. Each intervention $i$ induces a CMDP
$\mathcal{M}_i = \langle \mathcal{S},\mathcal{A},\mathcal{P}_i,r_i,\mathcal{D}, \mathcal{D}_i \rangle$
with states $\mathcal{S}$, actions $\mathcal{A}$, modified transitions $\mathcal{P}_i$, modified rewards $r_i$, the set of dangerous states $\mathcal{D}$ the student must avoid, and trigger states $\mathcal{D}_i$.
The teacher gets a set $\mathcal{I}$ of interventions $\{ \langle \mathcal{D}_i, \mathcal{T}_i \rangle \}_{i=1}^K$ as input, which consist of trigger states $\mathcal{D}_i \subset \mathcal{S}$ and reset distributions $\mathcal{T}_i: \mathcal{S} \rightarrow \Delta_{\mathcal{S} \backslash \mathcal{D}_i}$.
If the student enters a trigger state $s\in \mathcal{D}_i$, the transition is modified such that $\mathcal{P}_i(s'|s,a) = \mathcal{T}_i(s'|s)$, leading the student to a safe state $s'\not\in \mathcal{D}_i$.
An intervention by the teacher does not reduce the student's reward, i.e. $r_i(s,a,s')=0$ for $s\in \mathcal{D}_i$ and $s' \not\in \mathcal{D}_i$.
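To make this concrete, the minimal sketch below shows how such an intervention could modify a single transition in a tabular environment. The names `env`, `trigger_states`, and `reset_dist` are hypothetical placeholders for illustration, not part of the original implementation.

```python
import random

def step_with_intervention(env, state, action, trigger_states, reset_dist):
    """Sketch of the modified transition P_i under an intervention.

    If the student is in a trigger state s in D_i, the teacher overrides
    the dynamics and resets the student according to T_i(.|s); the reward
    for this transition is zero. Otherwise the environment behaves as usual.
    `env.step` is assumed to return a (next_state, reward) pair.
    """
    if state in trigger_states:
        # Draw a safe state s' not in D_i from the reset distribution T_i(.|s).
        safe_states, probs = zip(*reset_dist[state].items())
        next_state = random.choices(safe_states, weights=probs)[0]
        return next_state, 0.0  # r_i(s, a, s') = 0: the intervention is not penalized
    return env.step(state, action)  # unmodified dynamics outside the trigger states
```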
At the beginning of each interaction unit $n\in [N_s]$, the teacher decides on an intervention $i_n \in \mathcal{I}$, which induces a CMDP $\mathcal{M}_{i_n}$ as described above. This decision is made using the teacher's curriculum policy $\pi^T: \mathcal{H} \rightarrow \mathcal{I}$, which maps the teacher's observation history $\phi(\pi_1),...,\phi(\pi_{n-1})\in\mathcal{H}$ to the intervention $i_n$. The observations $\phi(\pi_n)$ are features computed from the performance of the student, e.g. by taking an estimate of the student's policy value $\hat{V}(\pi_n)$ or the number of necessary teacher interventions into account. As this curriculum policy is learned, we call it the Optimized curriculum policy from now on.
Curriculum policies independent of the student's policy are simply a mapping $\pi^T:[N_s]\rightarrow \mathcal{I}$, assigning each interaction unit a specific intervention. Except for the Optimized curriculum policy, all policies we will work with in this article are of this kind. An advantage of student-independent curriculum policies is that they do not require a training process and therefore no measure of the student's performance.
A sequence of CMDPs $\mathcal{M}_{i_1},...,\mathcal{M}_{i_{N_s}}$ induced by a curriculum policy is called a curriculum. The figure below shows a curriculum induced by the Optimized curriculum policy with two simple interventions, either resetting the agent to the start (Hard Reset) or moving them one step back (Soft Reset). Note that the terms interaction unit and curriculum step refer to the same thing and are used interchangeably.
Below is the CISR algorithm, which shows how the curriculum policy is optimized. The teacher learns online in $N_t$ rounds and plays a decision rule $\pi^T_j$ that makes a new student $j$ learn under an adaptively constructed sequence of CMDPs $\mathcal{M}_{i_n}$. Each student $j$ learns via $N_s$ interaction units, updating its policy by transferring between units. Then, the teacher computes features $\phi$ by evaluating the student's policies, based on which it proposes the next CMDP. At the end of each round, the teacher adjusts its decision rule.
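The skeleton below summarizes this loop in Python. The objects and method names (`teacher.propose`, `student.train`, `student.evaluate`, `teacher.update`, `induce_cmdp`) are illustrative placeholders for the steps described above, not the actual CISR implementation.

```python
def cisr(teacher, make_student, interventions, N_t, N_s):
    """Sketch of the CISR training loop with placeholder objects."""
    for j in range(N_t):                       # teacher rounds
        student = make_student()               # a fresh student for round j
        history = []                           # observation history in H
        for n in range(N_s):                   # interaction units
            i_n = teacher.propose(history)     # curriculum policy pi^T picks an intervention
            cmdp = interventions[i_n].induce_cmdp()
            student.train(cmdp)                # student learns in M_{i_n}, transferring its policy
            history.append(student.evaluate()) # features phi of the student's current policy
        teacher.update(history)                # teacher adjusts its decision rule
    return teacher
```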
Our experiments concentrated on comparing the Optimized policy to other student-independent curriculum policies and evaluating how well they generalize to an environment of different size.
To accomplish this, we created our own Frozen Smiley environment, which is based on the Frozen Lake environment used by Turchetta et al.
We used the two environments which are shown below.
Both are based on the Frozen Lake environment from the gym library.
While a learned curriculum policy is presumably more generally applicable and does not require manual optimization, it also comes with the disadvantage of a training process and the need for a performance measure of the student. With this in mind, we came up with two simple curriculum policies of the form $\pi^T : [N_s] \rightarrow \mathcal{I}$, which will be described in the following.
One of the simplest curriculum policies one could think of involves always going back $x$ steps when a trigger state is visited. For our experiments, we tested values of $x$ in the interval $[1,9]$. When playing the minigame at the top of this article, you can try out the Back$_4$ policy for yourself and see how it influences training. Below is our implementation in Python, which simply involves a class taking $x$ as an input and always selecting the action at index $x-1$, which corresponds to going back $x$ steps.
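The sketch below illustrates this idea; the class and method names are illustrative and may differ from the code in our repository. We assume the teacher's actions are ordered such that index $i$ resets the agent by $i+1$ steps.

```python
class BackPolicy:
    """Student-independent curriculum policy Back_x.

    Whenever the teacher intervenes, the agent is reset by exactly x steps,
    independent of the current interaction unit or the student's performance.
    """

    def __init__(self, x: int):
        self.x = x

    def __call__(self, n: int) -> int:
        # Always pick the teacher action at index x - 1, i.e. go back x steps.
        return self.x - 1
```

For example, `BackPolicy(4)` would correspond to the Back$_4$ policy you can try in the minigame.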
To compare the Back curriculum policy with the Optimized one, we plotted them for different values of $x$ below. Move the slider and see how they measure up in the Frozen Smiley environment for yourself.
The idea behind the Incremental curriculum policy is the tradeoff between exploration and exploitation. While the agent should be free to explore the map in the beginning, it should be punished harder for failures as the learning process progresses. This is realized by linearly increasing the number of steps the agent is reset by. Formally, Incremental$_x$ resets the agent by $\lceil \frac{1}{2^x} \cdot n \rceil$ steps in the $n$th curriculum step. During our experiments, we tried out values of $x$ in the range $[0,4]$. Our implementation below simply increases a counter after each interaction unit and scales it with the factor $\frac{1}{2^x}$. The rounded-up result then determines the action, i.e. the number of steps the agent is reset by.
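A minimal sketch of such an Incremental policy could look as follows; as before, the names are illustrative, and we assume the teacher action at index $i$ resets the agent by $i+1$ steps.

```python
import math

class IncrementalPolicy:
    """Student-independent curriculum policy Incremental_x.

    In interaction unit n (counting from 1) the agent is reset by
    ceil(n / 2**x) steps, so a larger x means a slower increase in the
    reset distance and therefore a longer exploration phase.
    """

    def __init__(self, x: int):
        self.x = x
        self.unit = 0  # counter of interaction units seen so far

    def __call__(self) -> int:
        self.unit += 1
        steps_back = math.ceil(self.unit / 2 ** self.x)
        # Convert the number of reset steps to the corresponding action index.
        return steps_back - 1
```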
Below we plotted the Incremental and the Optimized curriculum policy for different values of $x$. Again, use the slider to try out which value of $x$ works best in the Frozen Smiley environment.
In this section, we take a look at how our curriculum policies compare to the others and see how the larger Frozen Smiley environment influences their performance.
For our comparisons, we measure the successes, training failures and average rewards both for the Frozen Lake and the Frozen Smiley environment. Each curriculum policy was applied to $10$ students to account for errors and the randomness of the environment.
From experimenting with the values of $x$ as described in the previous sections, we found that in the Frozen Lake environment $x=6$ and $x=2$ worked best for the Back and Incremental curriculum policies, respectively. In this configuration, both the Back and the Incremental policy surpass the others within the first curriculum steps and quickly reach a success rate of around $90\%$.
This is also reflected in the table below, which shows that all teachers keep the student safe while training the student in the original environment leads to failures during training. On average, the Back and Incremental policy outperform the others with the Back policy performing slightly better than the Incremental policy.
Curriculum Policy | Successes | Training Failures | Average Returns |
---|---|---|---|
Back$_6$ | $0.921$ | $0.000$ | $5.259$ |
Incremental$_2$ | $0.886$ | $0.000$ | $5.053$ |
Trained | $0.743$ | $0.000$ | $4.035$ |
Original | $0.641$ | $3672.600$ | $3.586$ |
HR | $0.000$ | $0.000$ | $-0.295$ |
SR1 | $0.820$ | $0.000$ | $4.661$ |
Bandit | $0.659$ | $0.000$ | $3.638$ |
For the Frozen Smiley environment, we found that $x=8$ works best for the Back curriculum policy, while Incremental performs best with $x=4$. Noticeably, the increase within the first curriculum steps is not as steep in this environment, with the success rate only stabilizing around $90\%$ after $6$ interaction units. Even more apparent is the lower performance of the SR1 (Soft Reset 1) and the Bandit policy.
Like previously, all teachers keep the students safe during training as seen below. The Optimized curriculum policy manages to keep up with the Incremental policy in terms of successes after $10$ interaction units. Still, the Back policy performs best at a final success rate of over $90\%$ and the largest average returns.
Curriculum Policy | Successes | Training Failures | Average Returns |
---|---|---|---|
Back$_8$ | $0.929$ | $0.000$ | $5.293$ |
Incremental$_4$ | $0.886$ | $0.000$ | $5.054$ |
Trained | $0.879$ | $0.000$ | $4.856$ |
Original | $0.741$ | $2453.400$ | $4.166$ |
HR | $0.000$ | $0.000$ | $-0.140$ |
SR1 | $0.597$ | $0.000$ | $3.367$ |
Bandit | $0.398$ | $0.000$ | $2.133$ |
The figure below shows a visualization of the paths taken by the students during training. During the first curriculum steps we can see the students exploring the map, while at later steps, they follow the previously found path, with only slight deviations or optimizations.
Adding the teacher with its trigger states and intervention transitions to the maps kept the students safe during training. When choosing the best parameters out of those we tested, the Back and Incremental policy were able to outperform the Optimized one. Both the higher values for $x$ and slower increase in success rates could be explained by the increased size of the environment. As the Frozen Smiley environment is slightly larger than the original one, the increasing path lengths made an increase in reset steps for the Back policy necessary. Similarly, the Incremental curriculum policy took advantage of a longer exploration phase in the larger environment, by slowing down the increase of reset steps.
The described method has several clear advantages, but there are also disadvantages and limitations coming along with it.
While the Frozen Lake environments used here make it easy to define which states trigger the teacher's interventions, simply by using a distance measure from the danger zones, this can be a complex task in environments with continuous state spaces and observations.
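For grid worlds like ours, such a distance-based definition could look like the following sketch, which assumes the map is given in gym's FrozenLake string format with 'H' marking the holes.

```python
def trigger_states(grid, distance):
    """Return all grid cells within the given Manhattan distance of a hole.

    `grid` is a list of strings in gym's FrozenLake format, e.g. 'H' for
    holes, 'S' for the start, 'G' for the goal, and 'F' for frozen tiles.
    """
    holes = [(r, c) for r, row in enumerate(grid)
             for c, cell in enumerate(row) if cell == "H"]
    triggers = set()
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell in "HG":
                continue  # holes are failure states; the goal must stay reachable
            if any(abs(r - hr) + abs(c - hc) <= distance for hr, hc in holes):
                triggers.add((r, c))
    return triggers

# Example: trigger states within one tile of the holes of the 4x4 FrozenLake map
small_map = ["SFFF",
             "FHFH",
             "FFFH",
             "HFFG"]
print(trigger_states(small_map, distance=1))
```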
The observation space of the car racing environment, for example, consists of top-down pixel images of the track, which makes it much harder to specify which observations should trigger an intervention.
Some tasks might also simply be so risky that the teacher keeps the student from taking necessary risks and thereby prevents it from solving the task at all. An example of this is the minigame at the beginning, when selecting the Frozen Smiley environment and the teacher with trigger states within two tiles of the danger zones. In these cases, less restrictive trigger states are needed, which might have to allow failures during training.
When comparing the different curriculum policies, we see that the Optimized one can outperform the Bandit and No Intervention policies, including the Soft and Hard Reset policies it is based on. However, it is possible to define simple curriculum policies like Back or Incremental, which can perform even better than the Optimized policy on the tested environments. We also observed that larger environments require a longer exploration phase and that the original HR, SR and Bandit policies do not generalize well to larger environments. Fundamentally, we found that defining reset transitions which keep the student safe is easier than defining suitable trigger states in the first place. When state spaces become more complex, dynamic or just partly observable, this could become a problem.
Going forward, the method could be applied to other environments in OpenAI's Safety Gym.
We are grateful to Rong Guo for supervising this project and continuously giving us feedback.
This article was co-authored by Marvin Sextro and Jonas Loos under supervision of Rong Guo. Klaus Obermayer provided feedback on the project.
* equal contributions
For a summary of the main findings presented in this article, also see our conference poster.
If you see mistakes or want to suggest changes, please create an issue on GitHub.