SAFE REINFORCEMENT LEARNING WITH NATURAL LANGUAGE CONSTRAINTS
The paper I need to reproduce this semester addresses the problem of learning task policies under constraints specified in natural language. Unlike instruction following, the language here is not used to specify goals, but rather to describe situations that the agent must avoid while exploring the environment. Specifying constraints in natural language also differs from the predominant paradigm in safe reinforcement learning, where safety criteria are enforced through a manually defined cost function. While natural language allows safety constraints and cost budgets to be specified simply and flexibly, its ambiguity makes it challenging to map these specifications to representations usable by safe reinforcement learning techniques. To address this problem, the authors developed a model with two parts: (1) a constraint interpreter that encodes natural language constraints into vector representations capturing spatial and temporal information about forbidden states, and (2) a policy network that uses these representations to produce a policy with minimal constraint violations. The model is end-to-end differentiable and is trained with a recently proposed constrained policy optimization algorithm.
The constraint interpreter is composed of two parts, as described in the previous sections: (1) the constraint mask module, which uses the observation together with the constraint text to predict which parts of the observation correspond to forbidden states, and (2) the constraint threshold module, which predicts the cost budget from the text. The paper uses an LSTM as its semantic analysis module; given the excellent performance of the Transformer in language modeling, I decided to use a Transformer as an alternative in my experiments.
The constraint descriptions I use are mainly generated with the method provided by the authors of the paper: existing phrase templates are used to generate random constraint descriptions for each target constraint. Observations are then collected in the provided environment with a random policy. Finally, from the collected observations and the constraints associated with them, constraint masks are generated as the ground truth for training. The constraint thresholds are already known when the constraint descriptions are generated, so no additional processing is required.
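A minimal sketch of this data-generation pipeline is given below, assuming the observation is a 2D grid of entity ids and the environment exposes a gym-style interface. The phrase templates, entity ids, and threshold range are placeholders I introduce for illustration; the real ones come from the authors' generation code.

```python
import random
import numpy as np

# Hypothetical phrase templates and entity ids; the real ones come from the authors' code.
TEMPLATES = ["do not touch the {entity}", "avoid every {entity}", "stay away from the {entity}"]
ENTITY_IDS = {"lava": 3, "water": 4, "grass": 5}   # assumed ids of entities in the grid observation

def make_constraint():
    """Generate one random constraint description together with its cost threshold."""
    entity = random.choice(list(ENTITY_IDS))
    text = random.choice(TEMPLATES).format(entity=entity)
    threshold = random.randint(0, 3)    # the budget is known at generation time
    return text, entity, threshold

def collect_dataset(env, n_episodes=100, max_steps=100):
    """Roll out a random policy and label each observation with its ground-truth mask."""
    data = []
    for _ in range(n_episodes):
        text, entity, threshold = make_constraint()
        obs = env.reset()
        for _ in range(max_steps):
            mask = (obs == ENTITY_IDS[entity]).astype(np.float32)  # cells holding the forbidden entity
            data.append((text, np.array(obs, copy=True), mask, threshold))
            obs, _, done, _ = env.step(env.action_space.sample())  # random policy
            if done:
                break
    return data
```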
First, I tried training the model from scratch with a basic Transformer. More precisely, I use only the encoder part of the Transformer, i.e. the left half of the standard architecture diagram. The tokenizer is simply a numbering of the words in the word bank that forms the vocabulary. The final accuracy of this model is about 91%.
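For reference, a stripped-down version of this encoder-only model is sketched below with Keras layers. The toy word bank, model width, and number of blocks are illustrative, not the exact values from my experiments.

```python
import tensorflow as tf

VOCAB = ["<pad>", "do", "not", "touch", "the", "lava", "water", "avoid"]  # toy word bank
WORD2ID = {w: i for i, w in enumerate(VOCAB)}

def tokenize(text, max_len=16):
    """Simple tokenizer: number the words of the word bank and pad to max_len."""
    ids = [WORD2ID.get(w, 0) for w in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

def encoder_block(x, d_model=64, num_heads=4, d_ff=128):
    """One Transformer encoder block: self-attention + feed-forward, each with residual + LayerNorm."""
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                              key_dim=d_model // num_heads)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)
    ff = tf.keras.layers.Dense(d_ff, activation="relu")(x)
    ff = tf.keras.layers.Dense(d_model)(ff)
    return tf.keras.layers.LayerNormalization()(x + ff)

def build_text_encoder(max_len=16, d_model=64, num_blocks=2):
    tokens = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    x = tf.keras.layers.Embedding(len(VOCAB), d_model)(tokens)  # positional encodings omitted for brevity
    for _ in range(num_blocks):
        x = encoder_block(x, d_model)
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)  # sentence embedding fed to the mask/threshold heads
    return tf.keras.Model(tokens, pooled)
```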
To better verify the feasibility of the approach, in my final experiments I used a pre-trained language model, BERT, to analyze the textual constraint descriptions. It is worth noting that mainstream language models nowadays no longer use LSTMs but Transformers, and BERT is one of them. Since BERT is itself built on the Transformer, I regard this as an extension of the previous step. In addition, since my data is not suitable for fine-tuning the language model, BERT is kept frozen during training.
The final number of parameters of the constraint mask model is as follows.
Layer (type) | Output Shape | Param # |
---|---|---|
tf_bert_model (TFBertModel) | multiple | 108310272 |
dense (Dense) | multiple | 769 |
conv2d (Conv2D) | multiple | 9280 |
sequential (Sequential) | (50, 49) | 19249 |
Total params: 108,339,570 | ||
Trainable params: 29,298 | ||
Non-trainable params: 108,310,272 |
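The layer names in the table above correspond to a model of roughly the following shape. This is only a sketch of how I assembled it: the 7x7 mask resolution, kernel sizes, and hidden widths are assumptions for illustration and do not reproduce the exact parameter counts.

```python
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer

GRID = 7  # assumed observation grid size, so the mask has 7 * 7 = 49 cells

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False                       # freeze the language model, as noted above
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # turns text into input_ids / attention_mask

def build_mask_module(obs_channels=3, max_len=32):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    obs = tf.keras.Input(shape=(GRID, GRID, obs_channels))

    # Text branch: pooled BERT embedding followed by a small dense projection.
    text_feat = bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    text_feat = tf.keras.layers.Dense(1, activation="relu")(text_feat)

    # Observation branch: a small convolution over the grid observation.
    obs_feat = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(obs)
    obs_feat = tf.keras.layers.Flatten()(obs_feat)

    # Head: combine both branches and predict one probability per grid cell.
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(GRID * GRID, activation="sigmoid"),
    ])
    mask = head(tf.keras.layers.Concatenate()([text_feat, obs_feat]))
    return tf.keras.Model([input_ids, attention_mask, obs], mask)
```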
The final number of parameters of the constraint threshold model is as follows.
Layer (type) | Output Shape | Param # |
---|---|---|
tf_bert_model (TFBertModel) | multiple | 108310272 |
sequential (Sequential) | (128, 1) | 4915329 |
Total params: 113,225,601 | ||
Trainable params: 4,915,329 | ||
Non-trainable params: 108,310,272 |
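The threshold module follows the same pattern: the pooled BERT embedding is passed through a small trainable MLP that predicts the scalar cost budget, while BERT itself stays frozen. The hidden widths in the sketch below are assumptions chosen only to illustrate that all trainable parameters live in the head.

```python
import tensorflow as tf
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False   # the language model stays frozen here as well

def build_threshold_module(max_len=32):
    """Predict the scalar cost budget from the constraint text."""
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32)

    text_feat = bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output

    # Trainable head; the hidden sizes are illustrative, not the exact ones from my run.
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(2048, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(1),   # regressed threshold (a softmax over discrete budgets would also work)
    ])
    threshold = head(text_feat)
    return tf.keras.Model([input_ids, attention_mask], threshold)
```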
In addition, during the Constraint Interpreter training phase, the data are generated by agents acting under random policies in randomized environments, and the constraint text descriptions are generated from a fixed word bank by permutation.
Since the policy module described in the paper does not differ substantially from standard Constrained Policy Optimization, I trained it starting from an existing CPO template. The overall training procedure remains the same as in the template; the part that differs is the middle layer of the policy module. To better fit the data of this paper, I replaced the multilayer perceptron in the template with a convolutional network, which also saves a considerable amount of memory.
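To make the change concrete, the sketch below shows the kind of convolutional torso I substituted for the template's MLP. The layer sizes and the way the interpreter's outputs (mask and budget) are injected are assumptions for illustration, not the exact code of the CPO template.

```python
import tensorflow as tf

def build_policy_network(grid=7, obs_channels=3, num_actions=4):
    """Convolutional torso replacing the template's MLP policy; returns action logits."""
    obs = tf.keras.Input(shape=(grid, grid, obs_channels))
    mask = tf.keras.Input(shape=(grid, grid, 1))     # constraint mask from the interpreter
    budget = tf.keras.Input(shape=(1,))              # cost threshold from the interpreter

    x = tf.keras.layers.Concatenate(axis=-1)([obs, mask])   # stack the mask as an extra channel
    x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)

    x = tf.keras.layers.Concatenate()([x, budget])
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    logits = tf.keras.layers.Dense(num_actions)(x)
    return tf.keras.Model([obs, mask, budget], logits)
```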
Using the constraints generated by permutation over a fixed word bank, together with the observation data, a constraint interpreter with an accuracy of about 90% can be trained. Accuracy, however, is not the part I want to emphasize. In my experiments I use the pre-trained model provided by the HuggingFace library; the results are as follows.
Base Model | Mask Module | Threshold Module |
---|---|---|
Basic Transformer | 90.50% | 94.02% |
Pre-trained BERT | 90.28% | 94.37% |
As described in the previous sections, I use the CPO implementation from the existing code base to train the policy network, with some parameters adjusted. First, it is important to note that the training method provided in this code base uses an online approach, so the data for each epoch must be generated online. To ensure usable data and enough trajectories per epoch, I set the maximum length of each trajectory to 1e3 and the number of environment steps per epoch to 1e5; a total of 1e3 epochs are trained. During training, gamma is set to 0.95 and lambda to 0.90, the cost gamma to 0.99 and the cost lambda to 0.97, and the target KL divergence to 0.01.
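For reference, these settings amount to a configuration like the following; the values are taken from the text above, but the keyword names are placeholders mirroring common CPO implementations rather than the exact argument names of the code base I used.

```python
# CPO training configuration (values from the description above; names are illustrative).
cpo_config = dict(
    max_ep_len=1_000,        # maximum length of each trajectory
    steps_per_epoch=100_000, # environment steps collected online per epoch
    epochs=1_000,            # total training epochs
    gamma=0.95,              # discount factor for the reward
    lam=0.90,                # GAE lambda for the reward advantage
    cost_gamma=0.99,         # discount factor for the cost
    cost_lam=0.97,           # GAE lambda for the cost advantage
    target_kl=0.01,          # trust-region KL limit per update
)
```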
The module provided with the paper comes with no training data, so all the data for this experiment had to be regenerated. However, the method used to generate the language constraints in the paper's code is extremely simple and contains only 30 words. This not only makes it very difficult to train the language model, but also leaves no way to verify its generality. To address this, I use a pre-trained sentence-rephrasing model to rewrite the language constraints; the model used is one provided through the HuggingFace library. The resulting accuracies are as follows.
Base Model | Mask Module | Threshold Module |
---|---|---|
Basic Transformer | 75.33% | 82.42% |
Pre-trained BERT | 91.02% | 93.77% |
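A sketch of the rephrasing step described above is shown below. The checkpoint name `tuner007/pegasus_paraphrase` is only an example of a publicly available paraphrase model on the HuggingFace Hub, not necessarily the exact one I used, and the generation parameters are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example paraphrase checkpoint from the HuggingFace Hub (illustrative choice).
MODEL_NAME = "tuner007/pegasus_paraphrase"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def rephrase(constraint, num_variants=3):
    """Rewrite a template-generated constraint into several paraphrased variants."""
    batch = tokenizer([constraint], truncation=True, padding="longest", return_tensors="pt")
    outputs = model.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=num_variants,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# e.g. rephrase("do not touch the lava") might yield variants such as
# "avoid touching the lava" or "keep away from the lava".
```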
One of the problems that bothered me most during the replication was that I had to load three models at the same time. Two of them are natural language models, which tend to be large, so the memory used by the models is huge: the three models together require nearly 20 GB of memory. The training was finally done on an NVIDIA 3090, but even with the 3090 I could not scale the models up much, because all of them had to be loaded simultaneously.
To try to solve this problem, I considered using Google's TPUs for training. After some research, however, I found that because the experiment uses online learning, the CPU is needed to fetch data from the simulated environment, while the interface provided by the TPU is very limited; making the CPU and the TPU work together requires a great deal of coordination. As an individual it is very difficult to accomplish such a task, so I eventually gave up on the idea.
The paper I needed to reproduce this semester combines deep reinforcement learning with natural language processing. In the process of replication, I used modules I am familiar with to modify the model described in the paper to some extent, and justified those modifications. Although I encountered many difficulties during the replication, including incomplete data, I found solutions to most of them and learned a lot about the practical problems involved in reinforcement learning. I also received very valuable advice from the authors of the paper, which gave me a sense of the enthusiasm of the people working in this research area. I look forward to further developments in the field of reinforcement learning.
Many thanks to Michael Hu for his advice.