SAFE REINFORCEMENT LEARNING WITH NATURAL LANGUAGE CONSTRAINTS
The paper I need to reproduce this semester addresses the problem of learning task policies under constraints specified in natural language. Unlike instruction following, the language here is not used to specify goals, but rather to describe situations that the agent must avoid while exploring the environment. Specifying constraints in natural language also differs from the predominant paradigm in safe reinforcement learning, where safety criteria are enforced through a manually defined cost function. While natural language allows safety constraints and cost budgets to be specified simply and flexibly, its ambiguity makes it challenging to map these specifications to representations usable by safe reinforcement learning techniques. To address this problem, the authors developed a model with two parts: (1) a constraint interpreter that encodes natural language constraints into vector representations capturing spatial and temporal information about forbidden states, and (2) a policy network that uses these representations to produce a policy with minimal constraint violations. The model is end-to-end differentiable and is trained with a recently proposed constrained policy optimization algorithm.
The constraint interpreter is composed of two parts, as described in the previous sections: (1) the constraint mask module, which uses the observation together with the constraint text to predict which parts of the observation correspond to forbidden states, and (2) the constraint threshold module, which predicts the cost budget from the text. The paper uses an LSTM as its semantic analysis module; given the excellent performance of the Transformer in language modeling, I decided to use a Transformer as an alternative in my experiments.
The constraint descriptions I use are mainly generated with the method provided by the authors of the paper: existing phrase templates are used to generate random constraint descriptions for each target constraint. Observations are then collected in the provided environment with a random policy. Finally, from the collected observations and the constraints associated with them, constraint masks are generated as the ground truth for training. The constraint thresholds are already known when the constraint descriptions are generated, so no additional processing is required.
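A minimal sketch of this data-generation pipeline is given below, assuming the observation is a 2D grid of entity ids and the environment exposes a gym-style interface. The phrase templates, entity ids, and threshold range are placeholders I introduce for illustration; the real ones come from the authors' generation code.

```python
import random
import numpy as np

# Hypothetical phrase templates and entity ids; the real ones come from the authors' code.
TEMPLATES = ["do not touch the {entity}", "avoid every {entity}", "stay away from the {entity}"]
ENTITY_IDS = {"lava": 3, "water": 4, "grass": 5}   # assumed ids of entities in the grid observation

def make_constraint():
    """Generate one random constraint description together with its cost threshold."""
    entity = random.choice(list(ENTITY_IDS))
    text = random.choice(TEMPLATES).format(entity=entity)
    threshold = random.randint(0, 3)    # the budget is known at generation time
    return text, entity, threshold

def collect_dataset(env, n_episodes=100, max_steps=100):
    """Roll out a random policy and label each observation with its ground-truth mask."""
    data = []
    for _ in range(n_episodes):
        text, entity, threshold = make_constraint()
        obs = env.reset()
        for _ in range(max_steps):
            mask = (obs == ENTITY_IDS[entity]).astype(np.float32)  # cells holding the forbidden entity
            data.append((text, np.array(obs, copy=True), mask, threshold))
            obs, _, done, _ = env.step(env.action_space.sample())  # random policy
            if done:
                break
    return data
```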
First, I tried training the model from scratch with a basic Transformer. More precisely, I use only the encoder part of the Transformer, i.e. the left half of the standard architecture diagram. The tokenizer is simply a numbering of the words in the word bank that forms the vocabulary. The final accuracy of this model is about 91%.
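For reference, a stripped-down version of this encoder-only model is sketched below with Keras layers. The toy word bank, model width, and number of blocks are illustrative, not the exact values from my experiments.

```python
import tensorflow as tf

VOCAB = ["<pad>", "do", "not", "touch", "the", "lava", "water", "avoid"]  # toy word bank
WORD2ID = {w: i for i, w in enumerate(VOCAB)}

def tokenize(text, max_len=16):
    """Simple tokenizer: number the words of the word bank and pad to max_len."""
    ids = [WORD2ID.get(w, 0) for w in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

def encoder_block(x, d_model=64, num_heads=4, d_ff=128):
    """One Transformer encoder block: self-attention + feed-forward, each with residual + LayerNorm."""
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                              key_dim=d_model // num_heads)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)
    ff = tf.keras.layers.Dense(d_ff, activation="relu")(x)
    ff = tf.keras.layers.Dense(d_model)(ff)
    return tf.keras.layers.LayerNormalization()(x + ff)

def build_text_encoder(max_len=16, d_model=64, num_blocks=2):
    tokens = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    x = tf.keras.layers.Embedding(len(VOCAB), d_model)(tokens)  # positional encodings omitted for brevity
    for _ in range(num_blocks):
        x = encoder_block(x, d_model)
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)  # sentence embedding fed to the mask/threshold heads
    return tf.keras.Model(tokens, pooled)
```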
To better verify the feasibility of the approach, in my final experiments I used a pre-trained language model, BERT, to analyze the textual constraint descriptions. It is worth noting that mainstream language models nowadays no longer use LSTMs but Transformers, and BERT is one of them. Since BERT is itself built on the Transformer, I regard this as an extension of the previous step. In addition, since my data is not suitable for fine-tuning the language model, BERT is kept frozen during training.
The final number of parameters of the constraint mask model is as follows.
Layer (type) | Output Shape | Param # |
---|---|---|
tf_bert_model (TFBertModel) | multiple | 108310272 |
dense (Dense) | multiple | 769 |
conv2d (Conv2D) | multiple | 9280 |
sequential (Sequential) | (50, 49) | 19249 |
Total params: 108,339,570 | ||
Trainable params: 29,298 | ||
Non-trainable params: 108,310,272 |
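The layer names in the table above correspond to a model of roughly the following shape. This is only a sketch of how I assembled it: the 7x7 mask resolution, kernel sizes, and hidden widths are assumptions for illustration and do not reproduce the exact parameter counts.

```python
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer

GRID = 7  # assumed observation grid size, so the mask has 7 * 7 = 49 cells

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False                       # freeze the language model, as noted above
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # turns text into input_ids / attention_mask

def build_mask_module(obs_channels=3, max_len=32):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    obs = tf.keras.Input(shape=(GRID, GRID, obs_channels))

    # Text branch: pooled BERT embedding followed by a small dense projection.
    text_feat = bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    text_feat = tf.keras.layers.Dense(1, activation="relu")(text_feat)

    # Observation branch: a small convolution over the grid observation.
    obs_feat = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(obs)
    obs_feat = tf.keras.layers.Flatten()(obs_feat)

    # Head: combine both branches and predict one probability per grid cell.
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(GRID * GRID, activation="sigmoid"),
    ])
    mask = head(tf.keras.layers.Concatenate()([text_feat, obs_feat]))
    return tf.keras.Model([input_ids, attention_mask, obs], mask)
```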
The final number of parameters of the constraint threshold model is as follows.
Layer (type) | Output Shape | Param # |
---|---|---|
tf_bert_model (TFBertModel) | multiple | 108310272 |
sequential (Sequential) | (128, 1) | 4915329 |
Total params: 113,225,601 | ||
Trainable params: 4,915,329 | ||
Non-trainable params: 108,310,272 |
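The threshold module follows the same pattern: the pooled BERT embedding is passed through a small trainable MLP that predicts the scalar cost budget, while BERT itself stays frozen. The hidden widths in the sketch below are assumptions chosen only to illustrate that all trainable parameters live in the head.

```python
import tensorflow as tf
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False   # the language model stays frozen here as well

def build_threshold_module(max_len=32):
    """Predict the scalar cost budget from the constraint text."""
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32)

    text_feat = bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output

    # Trainable head; the hidden sizes are illustrative, not the exact ones from my run.
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(2048, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(1),   # regressed threshold (a softmax over discrete budgets would also work)
    ])
    threshold = head(text_feat)
    return tf.keras.Model([input_ids, attention_mask], threshold)
```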
In addition, during the Constraint Interpreter training phase, the data are generated by agents acting under random policies in randomized environments, and the constraint text descriptions are generated from a fixed word bank by permutation.
Since the policy module described in the paper does not differ substantially from standard Constrained Policy Optimization, I trained it starting from an existing CPO template. The overall training procedure remains the same as in the template; the part that differs is the middle layer of the policy module. To better fit the data of this paper, I replaced the multilayer perceptron in the template with a convolutional network, which also saves a considerable amount of memory.
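To make the change concrete, the sketch below shows the kind of convolutional torso I substituted for the template's MLP. The layer sizes and the way the interpreter's outputs (mask and budget) are injected are assumptions for illustration, not the exact code of the CPO template.

```python
import tensorflow as tf

def build_policy_network(grid=7, obs_channels=3, num_actions=4):
    """Convolutional torso replacing the template's MLP policy; returns action logits."""
    obs = tf.keras.Input(shape=(grid, grid, obs_channels))
    mask = tf.keras.Input(shape=(grid, grid, 1))     # constraint mask from the interpreter
    budget = tf.keras.Input(shape=(1,))              # cost threshold from the interpreter

    x = tf.keras.layers.Concatenate(axis=-1)([obs, mask])   # stack the mask as an extra channel
    x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)

    x = tf.keras.layers.Concatenate()([x, budget])
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    logits = tf.keras.layers.Dense(num_actions)(x)
    return tf.keras.Model([obs, mask, budget], logits)
```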
Using the constraints generated by permutation over a fixed word bank, together with the observation data, a constraint interpreter with an accuracy of about 90% can be trained. Accuracy, however, is not the part I want to emphasize. In my experiments I use the pre-trained model provided by the HuggingFace library; the results are as follows.
Base Model | Mask Module | Threshold Module |
---|---|---|
Basic Transformer | 90.50% | 94.02% |
Pre-trained BERT | 90.28% | 94.37% |
As described in the previous sections, I use the CPO implementation from the existing code base to train the policy network, with some parameters adjusted. First, it is important to note that the training method provided in this code base uses an online approach, so the data for each epoch must be generated online. To ensure usable data and enough trajectories per epoch, I set the maximum length of each trajectory to 1e3 and the number of environment steps per epoch to 1e5; a total of 1e3 epochs are trained. During training, gamma is set to 0.95 and lambda to 0.90, the cost gamma to 0.99 and the cost lambda to 0.97, and the target KL divergence to 0.01.
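For reference, these settings amount to a configuration like the following; the values are taken from the text above, but the keyword names are placeholders mirroring common CPO implementations rather than the exact argument names of the code base I used.

```python
# CPO training configuration (values from the description above; names are illustrative).
cpo_config = dict(
    max_ep_len=1_000,        # maximum length of each trajectory
    steps_per_epoch=100_000, # environment steps collected online per epoch
    epochs=1_000,            # total training epochs
    gamma=0.95,              # discount factor for the reward
    lam=0.90,                # GAE lambda for the reward advantage
    cost_gamma=0.99,         # discount factor for the cost
    cost_lam=0.97,           # GAE lambda for the cost advantage
    target_kl=0.01,          # trust-region KL limit per update
)
```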
The module provided with the paper comes with no training data, so all the data for this experiment had to be regenerated. However, the method used to generate the language constraints in the paper's code is extremely simple and contains only 30 words. This not only makes it very difficult to train the language model, but also leaves no way to verify its generality. To address this, I use a pre-trained sentence-rephrasing model to rewrite the language constraints; the model used is one provided through the HuggingFace library. The resulting accuracies are as follows.
Base Model | Mask Module | Threshold Module |
---|---|---|
Basic Transformer | 75.33% | 82.42% |
Pre-trained BERT | 91.02% | 93.77% |
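A sketch of the rephrasing step described above is shown below. The checkpoint name `tuner007/pegasus_paraphrase` is only an example of a publicly available paraphrase model on the HuggingFace Hub, not necessarily the exact one I used, and the generation parameters are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example paraphrase checkpoint from the HuggingFace Hub (illustrative choice).
MODEL_NAME = "tuner007/pegasus_paraphrase"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def rephrase(constraint, num_variants=3):
    """Rewrite a template-generated constraint into several paraphrased variants."""
    batch = tokenizer([constraint], truncation=True, padding="longest", return_tensors="pt")
    outputs = model.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=num_variants,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# e.g. rephrase("do not touch the lava") might yield variants such as
# "avoid touching the lava" or "keep away from the lava".
```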
One of the problems that bothered me most during the replication was that I had to load three models at the same time. Two of them are natural language models, which tend to be large, so the memory used by the models is huge: the three models together require nearly 20 GB of memory. The training was finally done on an NVIDIA 3090, but even with the 3090 I could not scale the models up much, because all of them had to be loaded simultaneously.
To try to solve this problem, I considered using Google's TPUs for training. After some research, however, I found that because the experiment uses online learning, the CPU is needed to fetch data from the simulated environment, while the interface provided by the TPU is very limited; making the CPU and the TPU work together requires a great deal of coordination. As an individual it is very difficult to accomplish such a task, so I eventually gave up on the idea.
The paper I needed to reproduce this semester combines deep reinforcement learning with natural language processing. In the process of replication, I used modules I am familiar with to modify the model described in the paper to some extent, and justified those modifications. Although I encountered many difficulties during the replication, including incomplete data, I found solutions to most of them and learned a lot about the practical problems involved in reinforcement learning. I also received very valuable advice from the authors of the paper, which gave me a sense of the enthusiasm of the people working in this research area. I look forward to further developments in the field of reinforcement learning.
Many thanks to Michael Hu for his advice.