Teachable Reinforcement Learning via Advice Distillation

An explanation of advice distillation with off-policy learning and an extension making it on-policy

Abstract

Reinforcement Learning is a very promising machine learning technique, but it typically requires very large amounts of data. The paper we investigated tries to tackle this issue with a learning scheme similar to how humans learn: gradual, first mastering easy tasks before trying more complex ones. The approach is to teach an agent to follow coaching instructions of increasing complexity in a process called distillation. We provide an in-depth mathematical explanation of how learning with distillation works.

Our extension of the paper was to create an alternative way of distilling coaching instructions and compare it to the paper's original approach. Our method showed a smaller initial drop in performance at the start of distillation.

Introduction

Reinforcement Learning is a machine learning technique in which agents explore their environments, receive rewards and learn strategies for maximizing the amount of received reward. Desired behaviour is reinforced through the reward, hence the name Reinforcement Learning.

The typical way in which agents learn is through exploration of the environment and by receiving rewards when executing desired actions or reaching certain states. Agents have to "try out" various actions in each state by "choosing" from a set of possible actions. Normally, agents learn from scratch and without any "guidance". Thus, they must try out many (state, action) pairs to gather "enough" information about the environment.

RL has been employed successfully on some tasks with better-than-human performance. However, the settings considered so far have been quite limited. Ideally, we wish to have agents that are able to solve complex to very complex tasks. Complex tasks usually come with a high-dimensional (state, action) space. Combined with random exploration, this forces the agent to do a lot of unguided exploration, which in turn leads to a very high sample requirement. For example, algorithms needed at least 10 million samples to reach even 20% of human performance on Atari games.

This is in stark contrast to how humans learn. Humans start as children imitating what other humans do. They then continue learning indirectly through communication in natural language. Human communication is considered low effort and high bandwidth. In this way, humans are told how to solve tasks, typically by more expert peers (e.g. at school, at university or with a coach). During the learning process, students receive constant feedback from their teachers on how well they are doing, so they quickly calibrate their task-solving strategies. The experts also usually follow a stepwise teaching strategy: the student starts with very basic training and is introduced to more complex tasks only after gaining a good understanding of how to solve the easier ones.

Moreover, humans typically learn fast and require fewer samples when compared to typical RL techniques.

Interestingly, research suggests that humans themselves are driven by internal rewards. One of the main neurotransmitters, dopamine, is thought to encode reward prediction error, and work has been done suggesting it is a strong candidate signal for driving learning. This potentially mirrors the purpose of reward signals in reinforcement learning.

The paper we investigated suggests a strategy to reduce the number of samples agents require by enabling them to follow a similar stepwise learning strategy. More concretely, agents are made teachable, i.e. they learn how to follow instructions from humans. The teachers give the agent instructions on how to solve intermediate steps of a task but are not allowed to directly control the agent's movements. The paper calls these instructions advice.

Similarly to the stepwise school curriculum of humans, agents are trained on advice of increasing complexity. The paper suggests four steps of learning:

  1. Grounding - teaching the agent how to follow simple, low-level advice
  2. Grounding to multiple types of advice - teaching the agent how to follow tuples of simple, low-level advice
  3. Improvement to higher-level advice - teaching the grounded agent to follow more complex, higher-level advice
  4. Improvement to advice independence - removing the teacher completely and allowing the agent to interact with its environment independently

After learning, the agent goes through a typical evaluation phase to test its performance.

The paper claims that it "proposes a framework for training automated agents using similarly rich interactive supervision", a claim we do not regard as accurate. The advice implemented in the codebase is not rich at all, mostly coming in the shape of a 2-D vector. This is described in more detail in Experimental setup. In Conclusion we suggest a possible way to extend this to a richer form of language.

Tiered learning, also called distillation and formally defined later, is achieved by augmenting the reward signal typical of an RL setting. The teacher has the ability to present a reward to the agent depending on how well it is following the given advice. Thus, the teacher acts as a coach and the agent learns how to react to human feedback.

To understand how this works, we first present the Coaching-Augmented Markov Decision Process formalism. We then explain how this formalism leverages the tiered structure of learning using off-policy learning. We then present our contribution, which makes the algorithm use on-policy learning. Finally, we present some preliminary results, talk about the challenges we faced and discuss our findings.


Other attempts have been made at enabling agents to learn more like humans do. The big disadvantage of these techniques, though, is their low bandwidth of communication: little information is extracted from each interaction with a human.

Background

Markov Decision Processes

RL typically works by implementing the Markov Decision Process formalism. The MDP is defined as a tuple {S, A, T, R, ρ, γ, p} where

  1. S is the state space and represents valid positions where the agent could be found at any time
  2. A(s) is the action space and represents the valid actions that an agent can take while in a particular state
  3. T(st, a, st+1) is the transition dynamics and represents the probability of arriving in state st+1 if at time t the agent was in state st and executed action a
  4. R(s, a) is the reward that an agent receives while in state s and executing action a
  5. ρ(s0) is the initial state distribution representing where the agent starts each episode
  6. γ is the discount factor balancing how important future rewards vs immediate ones are
  7. p(τ) is the distribution over tasks i.e. what kind of task the agent is supposed to solve

The agent decides on an action to take at each time step t. The rule by which the agent chooses actions is called a policy and is typically denoted by πθ(·|st, τ). The policy is parameterized by θ and is usually implemented as a probability distribution over the action set A. The agent thus interacts with the environment and collects trajectories of the shape

D = { (s0, a0, r1), (s1, a1, r2), ..., (sH−1, aH−1, rH) }j=1..N
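To make the notation concrete, here is a minimal sketch of how one such trajectory could be collected; `env`, `policy` and `task` are hypothetical stand-ins for a gym-style environment, a parameterized policy and a task descriptor, not the paper's actual codebase:

```python
# Minimal sketch of trajectory collection in an MDP. `env` and `policy` are
# hypothetical objects, not taken from the paper's codebase.
def collect_trajectory(env, policy, task, horizon):
    """Roll out one episode and return a list of (s_t, a_t, r_{t+1}) tuples."""
    trajectory = []
    state = env.reset()                      # s_0 ~ rho(s_0)
    for t in range(horizon):
        action = policy.sample(state, task)  # a_t ~ pi_theta(.|s_t, tau)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```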

Solving the MDP

The objective in a multi-task MDP is to find the policy parameters θ that maximize the expected discounted return. Formally, we look for

maxθ Eat∼πθ(·|st, τ) [ ∑t=0..H−1 γ^t r(st, at, τ) ]

where E[X] denotes the expected value of the random variable X.
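As a small worked illustration of this objective, the inner sum is just a discounted return; a Monte Carlo estimate over a batch of trajectories in the (s, a, r) format sketched above might look like this (plain Python, illustrative only):

```python
# Sketch of the quantity being maximized: the discounted return of a trajectory,
# averaged over N rollouts as a Monte Carlo estimate of the expectation.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(trajectories, gamma):
    returns = [discounted_return([r for _, _, r in traj], gamma)
               for traj in trajectories]
    return sum(returns) / len(returns)
```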

Exploration/exploitation dilemma

Typically, agents need to execute random actions to discover trajectories that yield high reward. When such trajectories are found, the agent increases the probability of taking similar actions in the future. Because of the high-dimensional (state, action) space, the agent typically needs to try out many combinations to make sure it has found the best one. The agent therefore always needs a balance between trying out new random actions and committing to already known high-reward ones, and finding this optimal balance is still an unsolved problem. This is the exploration/exploitation dilemma agents typically face, and it explains the need for many samples, as described in Introduction.

Coaching-Augmented Markov Decision Processes

The paper extends the classical MDP with two additions:

  1. C = {ct}, the set of coaching functions where ct represents advice given to the agent at time t.
  2. RCAMDP(s, a) = R(s, a) + Rcoach(s, a), where R(s, a) is the usual reward provided by the environment and Rcoach(s, a) is the additional reward the coach provides if the agent follows its advice.


The advice ct used in the paper is one of:

  1. Cardinal Advice (North (0, 1), South (0, -1), East (1, 0) or West (-1, 0))
  2. Directional Advice (e.g. Direction (0.5, 0.5))
  3. Waypoint Advice (e.g. Go To (3,1))
  4. Offset Waypoint Advice where a waypoint (e.g. Go To (3,1)) is considered relative to the agent's position

but could be extended to include natural language or other richer types of advice (see Conclusion).

Thus, we formally define the Coaching Augmented MDP (CAMDP) as the tuple {S, A, T, RCAMDP, ρ, γ, p, C}. The agent then captures trajectories of the shape:

D = { (s0, a0, c0, r1), (s1, a1, c1, r2), ..., (sH−1, aH−1, cH−1, rH) }j=1..N
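A minimal sketch of how trajectory collection changes under a CAMDP is shown below; `coach` is a hypothetical object providing both the advice ct and the coaching reward, and the other names are the same stand-ins as in the MDP sketch above:

```python
# Sketch of coaching-augmented trajectory collection. The coach supplies advice
# and an extra reward; R_CAMDP = R + R_coach as defined above.
def collect_camdp_trajectory(env, policy, coach, task, horizon):
    trajectory = []
    state = env.reset()
    for t in range(horizon):
        advice = coach.advise(state, task)           # c_t, e.g. a 2-D vector
        action = policy.sample(state, task, advice)  # a_t ~ pi_theta(.|s_t, tau, c_t)
        next_state, env_reward, done = env.step(action)
        reward = env_reward + coach.reward(state, action, advice)
        trajectory.append((state, action, advice, reward))
        state = next_state
        if done:
            break
    return trajectory
```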

The new optimization problem is to find the best policy θ that maximizes rewards from both the environment and the coaching functions i.e.

maxθ Eat∼πθ(·|st, τ, ct) [ ∑t=0..H−1 γ^t r(st, at, ct, τ) ]

representing an agent that interacts with the environment and has access to advice presented in the form of coaching functions ct.

The big advantage of the CAMDP over the plain MDP is that it formalizes the interaction of the agent with a human-in-the-loop trainer. The agent learns that following human instructions/advice provides reward and starts doing so, enabling it to take advantage of expert knowledge.

Method

Our target is to quickly train agents that are able to solve complex tasks. Considering the Exploration/exploitation dilemma, we want agents that quickly find high-reward policies, eliminating a lot of random exploration.

The paper suggests a tier-based teaching scheme that speeds up learning compared to solving a plain MDP.

This is done by:

  1. making the agent follow the coaching it receives
  2. introducing increasingly complex coaching
  3. guiding the agent to the goal
  4. allowing it to quickly discover that specific policies provide high reward
  5. eliminating the coaching
  6. allowing the agent to follow the already discovered high-reward policies

The paper introduces the following phases:

  1. Grounding - with the focus of making the agent interpret and follow low-level, simple advice
  2. Improvement, which is of two types:
    1. from one type of advice to another type of advice - typically from low-level, simple advice to high-level, more complex advice
    2. from one type of advice to no advice - allowing the agent to figure out policies that let it decide independently on next actions
  3. Evaluation - which represents the phase in which the agent no longer learns and the already learned policy is evaluated

Grounding

The main objective of grounding is to make the agent follow/interpret the provided advice. The big advantage over solving a plain MDP is that the agent can be trained in a very simple environment. The trajectories can be much simpler/shorter than those in a complex environment, where the agent must take many steps to reach a goal (e.g. a game or a maze).

Theoretically, the advice in the grounding phase can be of any nature. However, chosen wisely, it supports the idea of tiered learning. The grounding phase therefore uses the simplest available advice, i.e. Directional Advice. At every time step, the agent is rewarded with the dot product between the advised direction and the action it took. For example, if the agent is advised to move up (i.e. Direction (0, 1)) and it moves in direction (0, 0.5), it is rewarded with (0, 1) · (0, 0.5) = 0.5.
If it moves in direction (1, -0.5), i.e. diagonally down, it receives a negative reward of (0, 1) · (1, -0.5) = -0.5.
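The directional coaching reward is easy to state in code; the following sketch reproduces the two worked examples above (NumPy, with advice and actions as 2-D vectors):

```python
import numpy as np

# Coaching reward for directional advice: the dot product between the advised
# direction and the action the agent actually took.
def directional_coach_reward(advice, action):
    return float(np.dot(advice, action))

directional_coach_reward(np.array([0.0, 1.0]), np.array([0.0, 0.5]))   # -> 0.5
directional_coach_reward(np.array([0.0, 1.0]), np.array([1.0, -0.5]))  # -> -0.5
```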

By applying the CAMDP framework with the provided low-level advice, we obtain the grounded policy

πθgrounded(·|st, τ, clow-level,t)

i.e. a policy that takes the state st, the task τ and the low-level advice ct and provides a probability distribution over next actions.

Distillation to other types of advice

Once we have a policy able to interpret the simplest type of advice, we can use it to more quickly teach the agent other types of advice.

The process of using one type of advice to more quickly learn another one is called distillation and represents the key innovation of this paper.

Formally, the agent gathers trajectories of the shape:

D = { (s0, a0, cl0, ch0, r1), (s1, a1, cl1, ch1, r2), ..., (sH−1, aH−1, clH−1, chH−1, rH) }j=1..N

where clt represents the low-level advice and cht the high-level, more complex type of advice.

Distillation can be achieved using two types of learning:

  1. using off-policy actor-critic learning - the method mainly implemented in the codebase
  2. using on-policy actor-critic learning combined with supervised learning of the mapping from high-level to low-level advice - done in the code extension we implemented

In the first method, the new policy to be learned, πΦnew, is newly initialized. The agent explores the environment using Φnew but learns off-policy with the help of θgrounded. A drawback of this approach is that the exploration/exploitation dilemma is essentially reset: the agent is forced to start with random exploration again. Once enough trajectories have been collected, θgrounded offloads its grounded knowledge, so the new policy still takes advantage of the grounding phase.

We tried to tackle the issue of restarting with random exploration in our implementation. We reuse the already existing θgrounded by learning a mapper from the new type of advice to the old one. This way, the old policy continues to work because its parameter structure does not change.

The mapping from cht to clt was learned via supervised learning. Our reasoning was that we can take advantage of the existing pairs (cht, clt), which can be learned from in a supervised way.
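A minimal sketch of this supervised mapping step is shown below (PyTorch, with illustrative layer sizes; the actual converter described in Experimental setup also conditions on the observation):

```python
import torch
import torch.nn as nn

# Sketch of the supervised high-level -> low-level advice mapper. Layer sizes
# are illustrative; the real converter also takes the observation as input.
mapper = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(mapper.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(high_level_advice, low_level_advice):
    # Both arguments are (batch, 2) tensors of paired advice (c_h, c_l).
    optimizer.zero_grad()
    loss = loss_fn(mapper(high_level_advice), low_level_advice)
    loss.backward()
    optimizer.step()
    return loss.item()
```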

Our expectation was that θgrounded combined with the high-level -> low-level mapper would start from a higher baseline than Φnew, which should be measurable in experiments.

After this step we have reached the goal of grounding to multiple types of advice, i.e. having

πΦ(·|st, τ, ct)

a policy that accepts a tuple of advice of the shape (clt, ch1t, ch2t, ...).

Improvement

The ultimate goal is to obtain a policy

πθ(·|st, τ)

which does not require the coaching functions. The paper uses the already explained distillation technique to learn such a policy.


Distillation can be done either:

  1. by distilling from the grounded policy to another intermediary policy that accepts an even more complex, abstract, and sparse type of advice, or
  2. by distilling to no advice, achieving advice independence by taking advantage of already known high-reward trajectories.

Even though the agent collects

D = { (s0, a0, c0, r1), (s1, a1, c1, r2), ..., (sH−1, aH−1, cH−1, rH) }j=1..N

the agent optimizes:

maxθ E(st, at, τ)∼D [ log πθ(at|st, τ) ]

thus eliminating the coaching functions.
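A sketch of this advice-free objective as a training loss is given below (PyTorch, assuming a hypothetical `policy` callable that returns a torch distribution over actions):

```python
import torch

# Advice-free distillation objective: maximize the log-likelihood of the
# actions stored in D given only state and task, i.e. the advice c_t is dropped.
def advice_free_loss(policy, states, tasks, actions):
    log_probs = policy(states, tasks).log_prob(actions)  # log pi_theta(a_t | s_t, tau)
    return -log_probs.mean()                             # minimized by gradient descent
```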

The advantage of advice distillation over imitation learning is that the agent accepts a sparser and more abstract type of advice. This allows the agent to generalize better because the advice is invariant to internal distribution shifts of the agent.

Evaluation

During evaluation, we let the agent act using πθ and measure the actual amount of reward the environment provides.

Experimental setup

To test the paper's approach, we compared the method of advice distillation described above with a simple baseline case: training a multi-layer perceptron (mlp) to convert high-level to low-level advice. The basic steps for our method are:

  1. Train an mlp to take high-level advice as input and return equivalent low-level advice
  2. For the grounding phase, train our agent on low-level advice just like in the paper's method
  3. For the distillation phase, keep the agent the same and replace the low-level advice with the mlp's output

With this approach, the agent does not have to learn how to follow a new kind of advice because the advice it gets is equivalent to what it was receiving before. Instead, the training between advice types is done in advance by pre-training the mlp.
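Step 3 of this procedure amounts to a simple substitution at action-selection time; a sketch, assuming hypothetical `grounded_policy` and `advice_converter` objects rather than the codebase's actual API, is:

```python
# During our distillation phase, the grounded agent keeps its parameters, and
# the low-level advice it expects is produced by the converter from the
# high-level advice. Names are illustrative.
def act_with_converted_advice(grounded_policy, advice_converter, state, task,
                              high_level_advice):
    low_level_advice = advice_converter(state, high_level_advice)
    return grounded_policy.sample(state, task, low_level_advice)
```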

We chose this baseline of comparison because the goal of advice distillation is to quickly transfer the already learned knowledge from low-level advice to higher-level advice. As proposed in the original paper and supported by their experiments, this allows the agent to learn faster (both in terms of literal training time and the amount of instruction needed) than it does if it starts with only the high-level advice.

Our advice-conversion mlp applies the same principle with a very basic architecture, directly mapping high-level onto low-level advice instead of training the agent to follow the high-level advice directly. By comparing the paper's method against this baseline, we can test whether giving the agent access to high-level advice results in better performance, or if a direct advice-mapping to low-level advice is sufficient.

Our advice-conversion mlp had a 383-value input layer, consisting of a 255-value observation of the environment state and a 128-value advice component, a 128-value hidden layer, and a 2-value output layer. For our experiments, the input advice was offset waypoint (a sparse, high-level advice type), and the label advice was directional (a low-level advice type). Each advice type is a 2-D vector describing the agent's optimal movement. The offset waypoint advice was passed through a fixed-weight mlp to expand it to 128 dimensions before being passed as input to the advice-conversion mlp.
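A sketch of this architecture in PyTorch is shown below; the hidden-layer activation and the exact form of the fixed-weight advice embedding are assumptions, since only the layer sizes are specified above:

```python
import torch
import torch.nn as nn

# Sketch of the advice-conversion mlp: 383-value input (255-value observation
# plus a 128-value embedding of the 2-D offset waypoint advice), one 128-value
# hidden layer, and a 2-value directional-advice output.
advice_embedding = nn.Linear(2, 128)
for p in advice_embedding.parameters():
    p.requires_grad = False  # fixed weights, as described above

converter = nn.Sequential(
    nn.Linear(255 + 128, 128),  # 383 -> 128 hidden layer (ReLU assumed)
    nn.ReLU(),
    nn.Linear(128, 2),          # 2-D directional advice output
)

def convert(observation, offset_waypoint):
    # observation: (batch, 255); offset_waypoint: (batch, 2)
    advice = advice_embedding(offset_waypoint)
    return converter(torch.cat([observation, advice], dim=-1))
```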

The training set consisted of observation/offset waypoint advice/direction advice triples. For each triple the waypoint location, agent position and velocity were randomly generated, and the agent's usual offset waypoint and direction teachers were queried to get the input and label respectively.

Because the high-level advice is sparse, we have to take into account the movement of the agent, which can cause the old offset waypoint to no longer indicate the direction of the true waypoint. Therefore, the correct direction that would be given as low-level advice may not be the same direction given by the old high-level advice. To simulate this, we included each generated waypoint five times in the training set, each with a different random nearby agent position. The offset waypoint advice given was always based on the first position in the set, but the directional advice label was based on the actual current position. This ensures that the agent will not just copy the offset waypoint given but will also take into account the actual state of the environment.
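The data-generation procedure can be sketched as follows; the direction teacher below is a simplified stand-in for the codebase's teacher, the sampling ranges are illustrative, and the full 255-value observation is reduced to the agent position for brevity:

```python
import numpy as np

# Sketch of training-triple generation: one offset waypoint advice per waypoint
# (computed from the first position), five perturbed nearby positions, and a
# directional label computed from each actual position.
def direction_teacher(position, waypoint):
    # Simplified stand-in: points toward the true waypoint.
    return np.clip(waypoint - position, -3.8, 3.8)

def generate_triples(num_waypoints, rng=np.random.default_rng(0)):
    triples = []
    for _ in range(num_waypoints):
        waypoint = rng.uniform(-3.8, 3.8, size=2)       # illustrative range
        base_position = rng.uniform(-3.8, 3.8, size=2)
        offset_advice = waypoint - base_position        # based on first position only
        for _ in range(5):
            position = base_position + rng.normal(scale=0.2, size=2)
            label = direction_teacher(position, waypoint)
            triples.append((position, offset_advice, label))
    return triples
```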

One weakness of our mlp architecture is that it does not have any memory of previous environmental states. In our input generation the agent positions are independent of each other, but in an actual environment the next position would be based on the current position and velocity and the action taken. We did not include this because we wanted to keep our architecture simple and focused on the advice rather than the environment itself. But having an understanding of previous states and actions is one advantage the advice distillation agent has over this baseline. Future experiments could expand this mlp to take this information into account, for example changing the output to a time series representing the best actions to take over several time steps to reach the waypoint.

The advice converter was trained using stochastic gradient descent with 5000 batches of 10,000 values each. The step size was initially 0.001 and was annealed to 0.0001 after 100 epochs. After a training time of about 7 hours, the mlp achieved a final loss of about 2.47. See Results for a more detailed analysis of the training loss and its implications for the mlp's performance. During the distillation phase, the offset waypoint advice that would normally be passed directly to the agent was instead run through this mlp, and the mlp's output was passed to the agent instead.
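The pretraining schedule can be written as a short loop; this sketch reuses the `converter`/`convert` names from the architecture sketch above and assumes a data loader yielding (observation, offset advice, direction label) batches:

```python
import torch
import torch.nn as nn

# Sketch of converter pretraining: plain SGD with mean squared error, step size
# annealed from 1e-3 to 1e-4 after 100 epochs, as described above.
def pretrain(loader, num_epochs):
    optimizer = torch.optim.SGD(converter.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(num_epochs):
        if epoch == 100:
            for group in optimizer.param_groups:
                group["lr"] = 1e-4  # anneal the step size
        for observations, offset_advice, direction_labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(convert(observations, offset_advice), direction_labels)
            loss.backward()
            optimizer.step()
```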

For our experiment, both our method and the paper's method used the same grounded policy θgrounded, which was run for 320 iterations on directional advice. For the paper's method, a new policy Φnew that took offset waypoint advice was created, and θgrounded was used for off-policy relabeling.

For our method, θgrounded was reused and the directional advice was replaced with the output of our pretrained advice-conversion mlp. Our method was run for 900 more iterations, and the paper's method was only run for 440 more iterations before an issue caused the training to stop early. As a result, we focused on the first 440 post-grounding iterations for our analysis.

Results and Discussion

We measured the loss (mean squared error) of our advice-conversion mlp during pretraining as an indicator of how well it could approximate the real directional advice.

At the end of training, the network's loss was around 2.5. The values in the direction vector were restricted to a range of [-3.8, 3.8], so a loss of 2.5 means the network's output is still relatively inaccurate.

This seems to be because of the number of weights involved in the network. An earlier version of the advice converter took a 6-value input (just the offset waypoint advice, the agent position, and the agent velocity) and achieved much better performance, with a loss of 0.5 after only an hour of training. However, that model only worked when the offset waypoint advice was given densely, because it had no way to know the true waypoint location if the offset waypoint given was inaccurate. The current advice converter, while much slower to converge, is able to properly interpret sparsely given advice.

While we did not have time to train the mlp for longer, its performance was still improving at the end of the pretraining period, so it would likely continue to improve with more training time. Future experiments could confirm this by testing the effects of increased training time, and the effect of a better-converged model on the agent's overall performance.

We also compared the average reward of the two policies using both the paper's method and ours as a measurement of how well the agents were able to complete the task.

Some of the specific reward values are highlighted in the table below:

Grounding Phase (same agent for both methods):

| Iteration | Original Distillation Reward | Our Method Reward |
|-----------|------------------------------|-------------------|
| 0         | 0.00266                      | 0.00266           |
| 100       | 0.06159                      | 0.06159           |
| 200       | 0.07142                      | 0.07142           |
| 300       | 0.08013                      | 0.08013           |

Start of Distillation (at iteration 320):

| Iteration | Original Distillation Reward | Our Method Reward |
|-----------|------------------------------|-------------------|
| 320       | 0.00533                      | 0.01              |
| 420       | 0.00416                      | 0.01886           |
| 520       | 0.00466                      | 0.02290           |
| 620       | 0.002                        | 0.02207           |
| 720       | 0.00666                      | 0.03841           |

The agent improves rapidly during the grounding phase, as it is given relatively simple and informative directional advice to follow. When the distillation phase begins, both agents' performances drop. However, while the original distillation agent's performance is as poor as it was at the beginning, the agent using our method remains somewhat better.

This is consistent with what we would expect given how the agents work and with what we had hoped to measure. The paper's version of distillation starts with a new agent and a newly initialized exploration policy, so it essentially has to restart its learning from scratch. (However, the paper shows that learning with the offloaded grounded policy is still faster than starting to learn with only the high-level advice, as illustrated by its comparison of advice distillation against direct learning of offset waypoint advice.) Our version, meanwhile, keeps the same agent and just changes to a new advice type that is trained to match the old advice type, so it retains some of its progress.


As mentioned above, our mlp was still a relatively inaccurate approximation of the actual low-level advice, explaining the drop that we do see. Presumably, if the mlp were allowed to train longer, this drop would be smaller because the network's output would be closer to the accurate directional advice that the agent is used to receiving. Alternatively, the drop may be due to the policy not having had time to become fully grounded. As the new advice types are not as easy to interpret as the old directional advice, the agents do not improve as quickly during the distillation phase, but we would expect them to converge to a better performance given more iterations to run.

Our method's ability to switch to a higher-level advice type with only a small drop in performance may be useful in situations where a smooth transition between advice types is necessary. However, because the mlp's output will currently always be at least somewhat different from the actual best action, we suspect that the original advice distillation method will eventually converge to a better performance. While our method accomplishes the basic goal of allowing an agent trained on low-level advice to understand high-level advice using only a simple mlp architecture, more distillation iterations would be needed to compare the long-term performance of the two methods.

Conclusion

The point-maze agent learns quickly when given low-level directional advice, but its performance drops when switching to high-level offset waypoint advice at the start of the distillation phase. Because the new advice is an approximation of the old directional advice, our advice-conversion method experiences less of a drop and initially performs better than the paper's advice-distillation method.

Limitations

The main limitation of our experiment is the limited time we were able to run the grounding, distillation and pre-training processes for. Because of this, we chose to focus on the immediate consequences of the switch in advice type, but we would need more time for an effective comparison of the two methods' convergence speeds or overall performance. More pre-training time would also likely improve the performance of our method's agent because the advice it is receiving would be closer to the directional advice it was grounded on.

Additionally, our method has some limits that make it impractical in certain cases. First, it needs a large amount of paired high- and low-level advice to serve as a training dataset, which could be a problem if, for instance, the advice has to be human-provided: providing the advice pairs for training then becomes expensive, and gathering enough data for proper pretraining is implausible. Allowing the advice-conversion mlp to continue training during the agent's own training would help with this issue, but would not allow as smooth a transition as pretraining does. Pairing high-level with low-level advice also would not work in cases where the two advice types do not have a clear relationship. Finally, the methods used to generate the mlp's training data may not reflect the actual environmental conditions (for instance, our training data assumed no walls inside the maze), which may hurt the agent's performance when using this data.

Future Research Directions

Simply running the phases of our method for longer would be a good future experiment, allowing the later performance of the two methods to be compared. There are also several tweaks to our method that could be tested in future experiments. The time-convergence tradeoff of the advice-conversion mlp could be explored, and the mlp could be allowed to continue training during the distillation phase or even integrated into the agent's own network rather than being separate. Both the paper's and our methods could also be applied to other types of environments and advice, in order to see if the results hold across environment and advice types.

Finally, we addressed earlier the critique that the advice provided in this experiment is fairly simple and low-bandwidth, being just a 2-D vector. This is not rich in a way comparable to the advice humans learn from. It would be a very interesting future experiment to add a real natural language processing layer that could parse human language into an advice signal that could be provided to and interpreted by an RL agent. This addition would allow much easier human coaching, as coaches would not need a technical background and would not need to provide low-level, possibly cryptic advice, making this approach more relevant to practical situations.