Teachable Reinforcement Learning via Advice Distillation
An explanation of advice distillation with off-policy learning and an extension making it on-policy
Abstract
Reinforcement Learning is a promising machine learning technique, but it typically requires
very large amounts of data to learn.
The paper we investigated tackles this issue with a learning scheme similar to how humans learn:
gradually, first mastering easy tasks before trying more complex ones.
The approach is to gradually teach an agent to follow coaching instructions of increasing complexity, in a process called distillation.
We provide an in-depth mathematical explanation of how learning with distillation works.
Our extension of the paper was to create an alternative way of distilling coaching instructions and compare it to the paper's
original approach. Our method showed a smaller initial drop in performance at the start of distillation.
Introduction
Reinforcement Learning is a machine learning technique in which agents explore their environments, receive rewards and
learn strategies for maximizing the amount of reward received. Desired behaviour is reinforced through the reward, hence the name Reinforcement Learning.
The typical way in which agents learn is through exploration of the environment and by receiving rewards when
executing desired actions or reaching certain states.
Agents have to "try out" various actions in each state by "choosing" from a set of possible actions.
Normally, agents learn from scratch and without any guidance.
Thus, they must try out many (state, action) pairs to make sure they have gathered enough information about the environment.
RL has been employed successfully to solve some tasks with better-than-human performance.
However, the settings tackled so far have been quite limited.
Ideally, we wish to have agents that are able to solve complex to very complex tasks.
Complex tasks usually come with a high-dimensional (state, action) space. Combined with
random exploration, this forces the agent to do a lot of unguided exploration, which in turn
leads to the issue of requiring a high number of samples.
For example, algorithms needed at least 10 million samples to reach at least 20% of human performance on Atari games.
This is in stark contrast to how humans learn. Humans start as children imitating what other humans do.
They then continue by learning indirectly, through communication in natural language.
Human communication is considered low effort and high bandwidth.
In this way, humans are told how to solve tasks,
typically by more expert peers (e.g. by going to school or university, or by having a coach).
During the learning process, students receive constant feedback from their peers on how well they are doing. Thus
humans quickly calibrate their task-solving strategies.
The experts also usually follow a stepwise teaching strategy: the student starts with some very basic
training and is introduced to more complex tasks only after gaining a good understanding of how to solve the easier ones.
Moreover, humans typically learn fast and require far fewer samples than typical RL techniques.
Interestingly enough, research suggests that humans themselves are driven by inner rewards.
One of the main neurotransmitters is dopamine. Its purpose is to encode the reward prediction error.
Work has been done suggesting dopamine is a very good candidate signal for driving learning.
This potentially mirrors the purpose of reward signals in reinforcement learning.
The paper we investigated suggests reducing the number of samples agents require
by enabling them to follow a similar stepwise learning strategy.
More concretely, agents are made teachable, i.e. they learn how to follow instructions from humans.
The teachers give the agent instructions on how to solve intermediary steps of a task and are not
allowed to directly control the agent's movements. The paper calls these instructions advice.
Similarly to the stepwise school curriculum of humans, agents are trained on advice of increasing complexity.
The paper suggests 4 steps of learning:
- Grounding - teaching the agent how to follow simple, low-level advice
- Grounding to multiple types of advice - teaching the agent how to follow tuples of simple, low-level advice
- Improvement to higher-level advice - teaching the grounded agent to follow more complex, higher-level advice
- Improvement to advice independence - removing the teacher completely and allowing the agent to interact with its environment independently
After learning, the agent goes through a typical evaluation phase to test its performance.
The paper claims that it "proposes a framework for training automated agents using similarly
rich interactive supervision", which we do not regard as being true. The advice implemented in the codebase
is not rich at all, coming mostly in the shape of a 2-D vector. This is described in more detail in Experimental
Setup. In Conclusion we suggest a possible way to extend this to a richer language.
Tiered learning, also called distillation and formally defined later, is achieved by augmenting the
reward signal typical in an RL setting. The teacher can give the agent a reward depending on how well it is
following the given advice. Thus, the teacher acts as a coach and the agent learns how to react to human feedback.
To understand how this works, we first present the Coaching-Augmented Markov Decision Process formalism.
We then explain how this formalism leverages the tiered structure of learning using off-policy learning.
We then present our contribution, which makes the algorithm use on-policy learning instead.
Finally, we present some preliminary results, talk about the challenges we faced and discuss our findings.
Other attempts have been made at enabling agents to learn more like humans do. These include:
- Imitation Learning, i.e. closely mimicking demonstrated behaviour
- No-Regret Learning, such as DAgger
- Preference Learning
The big disadvantage of these techniques, though, is the low bandwidth of communication:
little information is extracted from each interaction with humans.
Background
Markov Decision Processes
RL typically works by implementing the Markov Decision Process formalism. The MDP is defined as a tuple
{S, A, T, R, ρ, γ, p} where
- S is the state space and represents valid positions where the agent could be found at any time
- A(s) is the action space and represents the valid actions that an agent can take while in a
particular state
- T(s_t, a, s_{t+1}) is the transition dynamic and represents the probability of arriving at s_{t+1} if at time t the agent was in s_t and executed action a
- R(s, a) is the reward that an agent receives while in state s and executing action a
- ρ(s0) is the initial state distribution representing where the agent starts each episode
- γ is the discount factor balancing how important future rewards vs immediate ones are
- p(τ) is the distribution over tasks i.e. what kind of task the agent is supposed to solve
The agent decides on an action to take at each time step t. The agent's decision rule is called a
policy and is typically denoted by π_θ(·|s_t, τ). The policy is parameterized by θ and is usually
implemented as a probability distribution over the set A.
The agent thus interacts with the environment and collects trajectories of the shape
D = {(s_0, a_0, r_1), (s_1, a_1, r_2), ..., (s_{H-1}, a_{H-1}, r_H)}_{j=1}^{N}.
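For concreteness, a trajectory of this shape can be collected with a standard interaction loop. The sketch below is illustrative only: it assumes a Gym-style environment with reset() and step() methods and a hypothetical policy(state, task) callable, not the paper's codebase.

```python
# Minimal sketch of collecting one trajectory (s_0, a_0, r_1), ..., (s_{H-1}, a_{H-1}, r_H).
# `env` is assumed to follow the classic Gym API; `policy` is a hypothetical callable.

def collect_trajectory(env, policy, task, horizon):
    trajectory = []
    state = env.reset()
    for t in range(horizon):
        action = policy(state, task)                 # a_t ~ pi_theta(.|s_t, tau)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))   # stores (s_t, a_t, r_{t+1})
        state = next_state
        if done:
            break
    return trajectory

# The dataset D is simply N such trajectories:
# D = [collect_trajectory(env, policy, task, horizon) for _ in range(N)]
```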
Solving the MDP
The objective of a multi-task MDP is to find the policy parameters θ that maximize the expected amount of
future discounted reward. Formally, it looks for
max_θ E_{a_t ∼ π_θ(·|s_t, τ)} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t, τ) ]
where E[X] denotes the expected value of the random variable X.
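As a worked example of the quantity inside the expectation, the discounted return of a single recorded reward sequence can be computed as follows (a small illustrative helper, not part of the paper's code):

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Example: rewards [1, 0, 1] with gamma = 0.9 give 1 + 0 + 0.81 = 1.81.
print(discounted_return([1, 0, 1], 0.9))
```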
Exploration/exploitation dilemma
Typically, agents need to execute random actions to discover trajectories that yield high reward. When such
trajectories are found, the agent increases the probability of taking similar actions in the future. Because of the high-dimensional
(state, action) space, the agent typically needs to try out a lot of combinations to make sure it has found the best one.
The agent therefore always needs a balance between trying out new random actions and committing to already known high-reward ones.
Finding this optimal balance, known as the exploration/exploitation dilemma, is still an unsolved problem,
and it largely explains the need for many samples, as described in the Introduction.
Coaching-Augmented Markov Decision Processes
The paper extends the classical MDP with two additions:
- C = {c_t}, the set of coaching functions, where c_t represents the advice given to the agent at time t.
- R_CAMDP = R(s, a) + R_coach(s, a), where R(s, a) is the reward provided by the environment as before and
R_coach(s, a) represents the additional reward the coach provides if the agent follows its advice.
The advice c_t used in the paper is either:
- Cardinal Advice, i.e. North (0, 1), South (0, -1), East (-1, 0) or West (1, 0)
- Directional Advice (e.g. Direction (0.5, 0.5))
- Waypoint Advice (e.g. Go To (3, 1))
- Offset Waypoint Advice, where a waypoint (e.g. Go To (3, 1)) is given relative to the agent's position
The advice could also be extended to include natural language or other richer types of advice (see Conclusion).
Thus, we formally define the Coaching-Augmented MDP (CAMDP) as the tuple {S, A, T, R_CAMDP, ρ, γ, p, C}.
The agent then captures trajectories of the shape:
D = {(s_0, a_0, c_0, r_1), (s_1, a_1, c_1, r_2), ..., (s_{H-1}, a_{H-1}, c_{H-1}, r_H)}_{j=1}^{N}.
The new optimization problem is to find the policy parameters θ that maximize the reward from both the environment
and the coaching functions, i.e.
max_θ E_{a_t ∼ π_θ(·|s_t, τ, c_t)} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t, c_t, τ) ]
representing an agent that interacts with the environment and has access to advice presented in the form of
coaching functions c_t.
The big advantage of the CAMDP over the plain MDP is that it formalizes the interaction of the agent with
a human-in-the-loop trainer. The agent learns that following human instructions/advice provides
reward and starts doing so, enabling it to take advantage of expert knowledge.
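To make the augmented reward concrete, below is a minimal sketch of a CAMDP-style environment step in which the agent receives R(s, a) from the environment plus R_coach(s, a) from the coach. The env, coach and coach_reward callables are hypothetical stand-ins of ours, not the paper's implementation.

```python
# Illustrative CAMDP step: environment reward plus coaching reward.
# `env`, `coach` (returns the advice c_t) and `coach_reward` are hypothetical stand-ins.

def camdp_step(env, state, action, coach, coach_reward):
    next_state, env_reward, done, info = env.step(action)
    advice = coach(state)                                            # c_t, e.g. a 2-D direction or waypoint
    total_reward = env_reward + coach_reward(state, action, advice)  # R(s, a) + R_coach(s, a)
    return next_state, total_reward, advice, done
```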
Method
Our target is to quickly train agents that are able to solve complex tasks.
Considering the exploration/exploitation dilemma, we want agents that quickly find high-reward
policies while eliminating much of the random exploration.
The paper suggests a tier-based teaching scheme that speeds up learning compared to solving a typical MDP.
This is done by:
- making the agent follow the coaching it receives
- introducing increasingly complex coaching
- guiding the agent to the goal
- allowing it to quickly understand that specific policies provide high reward
- eliminating the coaching
- allowing the agent to follow the already-found high-reward policies
The paper introduces the following phases:
- Grounding - with the focus of making the agent interpret and follow low-level, simple advice
- Improvement, which is of two types:
  - from one type of advice to another type of advice - typically from low-level, simple advice to high-level, more complex advice
  - from one type of advice to no advice - allowing the agent to figure out policies that let it decide independently on its next actions
- Evaluation - the phase in which the agent does not learn anymore and the already learned policy is evaluated
Grounding
The main objective of grounding is to make the agent follow/interpret the provided advice.
The big advantage over solving a plain MDP is that the agent can be trained in a very simple
environment. The trajectories can be a lot simpler and shorter than those in a complex environment, where the agent
must take many steps to reach a goal (e.g. a game or a maze).
Theoretically, the advice in the grounding phase can be of any nature. However, chosen wisely, it can support the idea of tiered
learning. Therefore, the grounding phase uses the simplest available advice, i.e. Directional Advice.
At every time step, the agent is rewarded with the dot product between the advised direction and the action it
took.
For example, if the agent is advised to move up (i.e. Direction (0, 1)) and it moves in direction (0, 0.5), it is
rewarded with (0, 1) · (0, 0.5) = 0.5.
Should it move in direction (1, -0.5), i.e. diagonally down, it receives a negative reward of
(0, 1) · (1, -0.5) = -0.5.
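This coaching reward is simply the dot product of the advised direction and the executed movement; a minimal sketch (the function name is ours) reproduces the two numbers above:

```python
import numpy as np

def directional_advice_reward(advice, action):
    """Coaching reward for Directional Advice: dot product of advice and action."""
    return float(np.dot(advice, action))

print(directional_advice_reward((0, 1), (0, 0.5)))   # 0.5
print(directional_advice_reward((0, 1), (1, -0.5)))  # -0.5
```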
By applying the CAMDP framework with the provided low-level advice, we obtain the grounded policy
π_{θ_grounded}(·|s_t, τ, c^l_t),
i.e. a policy that takes the state s_t, the task τ and the low-level advice c^l_t and provides
a probability distribution over next actions.
Distillation to other types of advice
Once we have a policy able to interpret the simplest type of advice, we can use it to more quickly teach the agent
other types of advice.
The process of using one type of advice to more quickly learn another one is called distillation and represents
the key innovation of this paper.
Formally, the agent gathers trajectories of the shape:
D = {(s_0, a_0, c^l_0, c^h_0, r_1), (s_1, a_1, c^l_1, c^h_1, r_2), ..., (s_{H-1}, a_{H-1}, c^l_{H-1}, c^h_{H-1}, r_H)}_{j=1}^{N},
where c^l_t represents the low-level advice and c^h_t the high-level, more complex type of advice.
Distillation can be achieved using two types of learning:
- using off-policy actor-critic learning - the codebase mainly implements this method
- using on-policy actor-critic learning combined with supervised learning of the mapping from high-level to low-level advice - done in the code extension we implemented
In the first method, the new policy to be learned, π_{Φ_new}, is a newly initialized policy. The agent
explores the environment using Φ_new but learns off-policy by using θ_grounded.
A consequence of this approach is that the exploration/exploitation dilemma is essentially reset: the agent
is forced to start by randomly exploring again. Once enough trajectories are collected, θ_grounded offloads its grounded knowledge base
into the new policy, so the new policy still takes advantage of the grounding phase.
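One way to picture this off-policy reuse is a DAgger-style relabeling loop: Φ_new collects states while exploring, θ_grounded (which still sees the low-level advice) labels those states with its preferred actions, and Φ_new is trained to imitate them. The sketch below is our simplified reading of that scheme, with hypothetical batch fields and deterministic policy outputs; it is not the paper's actual actor-critic training code.

```python
import torch

def distillation_update(new_policy, grounded_policy, batch, optimizer):
    """One supervised step pushing phi_new toward theta_grounded's actions.

    `batch` is assumed to hold states, tasks and both advice types gathered
    while exploring with phi_new (field names are hypothetical).
    """
    with torch.no_grad():
        # The grounded policy relabels the visited states using the low-level advice.
        target_actions = grounded_policy(batch["states"], batch["tasks"], batch["low_level_advice"])
    # phi_new only sees the high-level advice and is regressed toward those actions.
    predicted_actions = new_policy(batch["states"], batch["tasks"], batch["high_level_advice"])
    loss = torch.nn.functional.mse_loss(predicted_actions, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```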
We tried to tackle the issue of restarting with random exploration in our implementation. We reuse the already existing
θ_grounded by learning a mapper from the new type of advice to the old one.
This way, the old policy keeps working, since its parameter structure is unchanged.
The mapping from c^h_t to c^l_t was learned via supervised learning.
Our reasoning was that we can take advantage of the available pairs (c^h_t, c^l_t),
from which the mapping can be learned in a supervised way.
Our expectation was that θ_grounded combined with the high-level-to-low-level mapper would start from a higher
baseline than Φ_new. This should be measurable in experiments.
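In contrast to the off-policy scheme, our approach keeps θ_grounded fixed and only learns the advice mapper; at run time the grounded policy is simply queried with the mapped advice. A minimal sketch of this composition (all names are ours):

```python
# Sketch of our approach: keep the grounded policy untouched and translate the advice.
# `grounded_policy` and `advice_mapper` are hypothetical callables.

def composed_policy(state, task, high_level_advice, grounded_policy, advice_mapper):
    low_level_advice = advice_mapper(high_level_advice, state)  # learned c^h -> c^l mapping
    return grounded_policy(state, task, low_level_advice)       # unchanged grounded policy
```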
After this step we have reached the goal of grounding to multiple types of advice, i.e. having
π_Φ(·|s_t, τ, c_t),
a policy that can accept a tuple of advice of the shape (c^l_t, c^{h1}_t, c^{h2}_t, ...).
Improvement
The ultimate goal is to obtain a policy
π_θ(·|s_t, τ)
which does not require the coaching functions. The paper uses the already explained distillation technique
to learn such a policy.
Distillation can be done either:
- by distilling from the grounded policy to another intermediary policy that accepts an even more complex, abstract and sparse type of advice, OR
- by distilling to no advice, achieving advice independence by taking advantage of already known high-reward trajectories.
Even though the agent collects
D = {(s_0, a_0, c_0, r_1), (s_1, a_1, c_1, r_2), ..., (s_{H-1}, a_{H-1}, c_{H-1}, r_H)}_{j=1}^{N},
it optimizes
max_θ E_{(s_t, a_t, τ) ∼ D} [log π_θ(a_t|s_t, τ)],
thus eliminating the coaching functions.
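In code, this objective is an ordinary maximum-likelihood (behaviour-cloning) loss over the collected data, with the advice simply dropped from the policy's input. A hedged sketch, assuming a policy that returns a torch distribution over actions:

```python
import torch

def advice_free_loss(policy, batch):
    """Negative log-likelihood of the recorded actions under pi_theta(a|s, tau).

    `policy` is assumed to return a torch.distributions.Distribution;
    the advice stored alongside the data is intentionally ignored.
    """
    action_distribution = policy(batch["states"], batch["tasks"])
    log_probs = action_distribution.log_prob(batch["actions"])
    return -log_probs.mean()  # minimizing this maximizes E[log pi_theta(a_t | s_t, tau)]
```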
The advantage of advice distillation over imitation learning is that the agent accepts a sparser and more
abstract type of advice. This allows the agent to generalize better, because the advice is invariant to internal
distribution shifts of the agent.
Evaluation
During evaluation, we let the agent act using π_θ and compute the actual amount of reward the
environment provides.
Experimental setup
To test the paper's approach, we compared the method of advice distillation described above with a simple baseline case: training a multi-layer perceptron (mlp) to convert high-level to low-level advice. The basic steps for our method are:
- Train an mlp to take high-level advice as input and return equivalent low-level advice
- For the grounding phase, train our agent on low-level advice just like in the paper's method
- For the distillation phase, keep the agent the same and replace the low-level advice with the mlp's output
With this approach, the agent does not have to learn how to follow a new kind of advice because the advice it gets is equivalent to what it was receiving before. Instead, the training between advice types is done in advance by pre-training the mlp.
We chose this baseline of comparison because the goal of advice distillation is to quickly transfer the already learned knowledge from low-level advice to higher-level advice. As proposed in the original paper and supported by their experiments, this allows the agent to learn faster (both in terms of literal training time and the amount of instruction needed) than it does if it starts with only the high-level advice.
Our advice-conversion mlp applies the same principle with a very basic architecture, directly mapping high-level onto low-level advice instead of training the agent to follow the high-level advice directly. By comparing the paper's method against this baseline, we can test whether giving the agent access to high-level advice results in better performance, or if a direct advice-mapping to low-level advice is sufficient.
Our advice-conversion mlp had a 383-value input layer, consisting of a 255-value observation of the environment state and a 128-value advice component, a 128-value hidden layer, and a 2-value output layer. For our experiments, the input advice was offset waypoint (a sparse, high-level advice type), and the label advice was directional (a low-level advice type). Each advice type is a 2-D vector describing the agent's optimal movement. The offset waypoint advice was passed through a fixed-weight mlp to expand it to 128 dimensions before being passed as input to the advice-conversion mlp.
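A PyTorch sketch of this architecture is given below. The layer sizes follow the description above; the activation function and the exact form of the fixed-weight expansion (here a single frozen linear layer) are our assumptions.

```python
import torch
import torch.nn as nn

class AdviceConverter(nn.Module):
    """Maps (observation, offset waypoint advice) to 2-D directional advice."""

    def __init__(self, obs_dim=255, advice_embed_dim=128, hidden_dim=128):
        super().__init__()
        # Fixed-weight expansion of the 2-D offset waypoint advice to 128 values
        # (modelled here as a single frozen linear layer).
        self.advice_encoder = nn.Linear(2, advice_embed_dim)
        for p in self.advice_encoder.parameters():
            p.requires_grad = False
        # 383-value input (255 observation + 128 advice), 128 hidden units, 2 outputs.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + advice_embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, observation, offset_waypoint):
        advice_embedding = self.advice_encoder(offset_waypoint)
        return self.net(torch.cat([observation, advice_embedding], dim=-1))
```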
The training set consisted of observation/offset waypoint advice/direction advice triples. For each triple, the waypoint location, agent position and velocity were randomly generated, and the agent's usual offset waypoint and direction teachers were queried to get the input and label respectively.
Because the high-level advice is sparse, we have to take into account the movement of the agent, which can cause the old offset waypoint to no longer indicate the direction of the true waypoint. Therefore, the correct direction that would be given as low-level advice may not be the same direction given by the old high-level advice. To simulate this, we included each generated waypoint five times in the training set, each with a different random nearby agent position. The offset waypoint advice given was always based on the first position in the set, but the directional advice label was based on the actual current position. This ensures that the agent will not just copy the offset waypoint given but will also take into account the actual state of the environment.
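The generation scheme can be sketched roughly as below. The 'teachers' and the observation are tiny geometric placeholders for the environment's real teachers and its 255-value observation, and the coordinate ranges and noise scales are arbitrary choices of ours for illustration.

```python
import numpy as np

def generate_training_triples(num_waypoints, positions_per_waypoint=5, rng=np.random.default_rng()):
    """Build (observation, offset waypoint advice, directional advice) triples."""
    triples = []
    for _ in range(num_waypoints):
        waypoint = rng.uniform(-4, 4, size=2)
        first_position = rng.uniform(-4, 4, size=2)
        # The offset waypoint advice is computed once, from the first position only.
        offset_advice = waypoint - first_position
        for _ in range(positions_per_waypoint):
            position = first_position + rng.normal(scale=0.5, size=2)   # nearby random position
            velocity = rng.normal(scale=0.5, size=2)
            # The directional label always reflects the agent's *current* position.
            to_waypoint = waypoint - position
            direction_label = to_waypoint / (np.linalg.norm(to_waypoint) + 1e-8)
            observation = np.concatenate([position, velocity])          # placeholder observation
            triples.append((observation, offset_advice, direction_label))
    return triples
```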
One weakness of our mlp architecture is that it does not have any memory of previous environmental states. In our input generation the agent positions are independent of each other, but in an actual environment the next position would be based on the current position and velocity and the action taken. We did not include this because we wanted to keep our architecture simple and focused on the advice rather than the environment itself. But having an understanding of previous states and actions is one advantage the advice distillation agent has over this baseline. Future experiments could expand this mlp to take this information into account, for example changing the output to a time series representing the best actions to take over several time steps to reach the waypoint.
The advice converter was trained using stochastic gradient descent with 5000 batches of 10,000 values each. The step size was initially 0.001 and was annealed to 0.0001 after 100 epochs. After a training time of about 7 hours, the mlp achieved a final loss of about 2.47. See Results for a more detailed analysis of the training loss and its implications for the mlp's performance. During the distillation phase, the offset waypoint advice that would normally be passed directly to the agent was instead run through this mlp, and the mlp's output was passed to the agent instead.
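A hedged sketch of that training setup follows; the data loader, the precise point of the step-size anneal and the batch handling are simplified assumptions rather than the exact procedure we ran.

```python
import torch

def train_advice_converter(model, data_loader, num_batches=5000):
    """SGD training of the advice converter with an MSE loss and a simple step-size anneal."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    for batch_idx, (observation, offset_advice, direction_label) in enumerate(data_loader):
        if batch_idx >= num_batches:
            break
        if batch_idx == 100:  # anneal 0.001 -> 0.0001 (the "after 100 epochs" point, approximated here)
            for group in optimizer.param_groups:
                group["lr"] = 1e-4
        prediction = model(observation, offset_advice)
        loss = loss_fn(prediction, direction_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```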
For our experiment, both our method and the paper's method used the same grounded policy θ_grounded, which was run for 320 iterations on directional advice. For the paper's method, a new policy Φ_new that took offset waypoint advice was created, and θ_grounded was used for off-policy relabeling.
For our method, θ_grounded was reused and the directional advice was replaced with the output of our pretrained advice-conversion mlp. Our method was run for 900 more iterations, and the paper's method was only run for 440 more iterations before an issue caused the training to stop early. As a result, we focused on the first 440 post-grounding iterations for our analysis.
Results and Discussion
We measured the loss (mean squared error) of our advice-conversion mlp during pretraining as an indicator of how well it could approximate the real directional advice.
At the end of training, the network's loss was around 2.5. The values in the direction vector were restricted to a range of [-3.8, 3.8], so a loss of 2.5 means the network's output is still relatively inaccurate.
This seems to be a consequence of the number of weights involved in the network. An earlier version of the advice converter took a 6-value input (just the offset waypoint advice, the agent position, and the agent velocity) and achieved a much better performance, with a loss of 0.5 after only an hour of training. However, that model only worked when the offset waypoint advice was given densely, because it had no way to know the true waypoint location if the given offset waypoint was inaccurate. The current advice converter, while much slower to converge, is able to properly interpret sparsely given advice.
While we did not have time to train the mlp for longer, its performance was still improving at the end of the pretraining period, so it would likely continue to improve with more training time. Future experiments could confirm this by testing the effects of increased training time, and the effect of a better-converged model on the agent's overall performance.
We also compared the average reward of the two policies using both the paper's method and ours as a measurement of how well the agents were able to complete the task.
Some of the specific reward values are highlighted in the table below:
Iteration | Original Distillation Reward | Our Method Reward
Grounding Phase (same agent for both methods)
0 | 0.00266 | 0.00266
100 | 0.06159 | 0.06159
200 | 0.07142 | 0.07142
300 | 0.08013 | 0.08013
Start of Distillation (at iteration 320)
320 | 0.00533 | 0.01
420 | 0.00416 | 0.01886
520 | 0.00466 | 0.02290
620 | 0.002 | 0.02207
720 | 0.00666 | 0.03841
The agent improves rapidly during the grounding phase, as it is given relatively simple and informative directional advice to follow. When the distillation phase begins, both agents' performances drop. However, while the original distillation agent's performance is as poor as it was at the beginning, the agent using our method remains somewhat better.
This is consistent with what we would expect given how the agents work and with what we had hoped to measure. The paper's version of the distillation starts with a new agent and a newly-initialized exploration policy, so it essentially has to restart its learning from scratch. (However, the paper shows that learning with the offloaded grounded policy is still faster than starting learning with only the high-level advice. See the next figure for a comparison of advice distillation vs direct learning of offset waypoint advice.) Our version, meanwhile, keeps the same agent and just changes to a new advice type that is trained to match the old advice type, so it retains some of its progress.
As mentioned above, our mlp was still a relatively inaccurate approximation of the actual low-level advice, which explains the drop that we do see. Presumably, if the mlp were allowed to train longer, this drop would be smaller because the network's output would be closer to the accurate directional advice that the agent is used to receiving. Alternatively, the drop may be due to the policy not having had time to become fully grounded. As the new advice types are not as easy to interpret as the old directional advice, the agents do not improve as quickly during the distillation phase, but we would expect them to converge to a better performance given more iterations to run.
Our method's ability to switch to a higher-level advice type with only a small drop in performance may be useful in situations where a smooth transition between advice types is necessary. However, because the mlp's output will currently always be at least somewhat different from the actual best action, we suspect that the original advice distillation method will eventually converge to a better performance. While our method accomplishes the basic goal of allowing an agent trained on low-level advice to understand high-level advice using only a simple mlp architecture, more distillation iterations would be needed to compare the long-term performance of the two methods.
Conclusion
The point-maze agent learns quickly when given low-level directional advice, but its performance drops when switching to high-level offset waypoint advice at the start of the distillation phase. Because the new advice is an approximation of the old directional advice, our advice-conversion method experiences less of a drop and initially performs better than the paper's advice-distillation method.
Limitations
The main limitation of our experiment is the limited time we were able to run the grounding, distillation and pre-training processes for. Because of this, we chose to focus on the immediate consequences of the switch in advice type, but we would need more time for an effective comparison of the two methods' convergence speeds or overall performance. More pre-training time would also likely improve the performance of our method's agent because the advice it is receiving would be closer to the directional advice it was grounded on.
Additionally, our method has some limits that make it impractical in certain cases. First, it needs a large amount of paired high- and low-level advice to serve as a training dataset, which could be difficult to obtain, for instance if the advice needs to be human-provided. In that case, producing the high- and low-level advice pairs for training becomes expensive and having enough data for proper pretraining is implausible. Allowing the advice-conversion mlp to continue to train during the agent's own training would help with this issue, but would not allow as smooth a transition as the pre-training does. The pairing of high-level with low-level advice also would not work in cases where the two advice types do not have a clear relationship. Finally, the methods used to generate the mlp's training data may not reflect the actual environmental conditions (for instance, our training data assumed no walls inside the maze), which may hurt the agent's performance when using this data.
Future Research Directions
Simply running the phases of our method for longer would be a good future experiment, allowing the later performance of the two methods to be compared. There are also several tweaks to our method that could be tested in future experiments. The time-convergence tradeoff of the advice-conversion mlp could be explored, and the mlp could be allowed to continue training during the distillation phase or even integrated into the agent's own network rather than being separate. Both the paper's and our methods could also be applied to other types of environments and advice, in order to see if the results hold across environment and advice types.
Finally, we raised earlier the critique that the advice provided in this experiment is fairly simple and low-bandwidth, being just a 2-D vector. This is not rich in a way comparable to the advice humans learn from. It would be a very interesting future experiment to add a real natural language processing layer that could parse human language into an advice signal that could be provided to and interpreted by an RL agent. This addition would allow much easier human coaching, as coaches would not need a technical background and would not need to provide low-level, possibly cryptic advice, making this approach more relevant to practical situations.