RL-22-23-Report

SAFE REINFORCEMENT LEARNING WITH NATURAL LANGUAGE CONSTRAINTS

Abstract

The paper I reproduce this semester addresses the problem of learning task control policies under constraints given in natural language. Unlike instruction following, the language here is not used to specify goals, but rather to describe situations that the agent must avoid while exploring the environment. Specifying constraints in natural language also differs from the predominant paradigm in safe reinforcement learning, where safety criteria are enforced through a manually defined cost function. While natural language allows simple and flexible specification of safety constraints and budget limits, its ambiguity makes it challenging to map these specifications to representations usable by safe reinforcement learning techniques. To address this problem, the authors developed a model with two parts: (1) a constraint interpreter that encodes natural language constraints into vector representations capturing spatial and temporal information about forbidden states, and (2) a policy network that uses these representations to produce a policy with minimal constraint violations. Their model is end-to-end differentiable and is trained with a recently proposed constrained policy optimization algorithm.

Constraint Interpreter

The focus here is on how to convert textual descriptions of constraints into values that the reinforcement learning algorithm can use directly. The paper uses the following model.

(1) The constraint mask module uses the observation o_t and the text x to predict a binary constraint mask \hat{M}_C, which is a prediction of the true constraint mask M_C. If the cell in row i and column j of observation o_t (denoted o_{t(i, j)}) contains a cost entity (i.e., a forbidden state mentioned in the text), the corresponding cell of \hat{M}_C contains a one; otherwise it contains a zero. The authors use \hat{M}_C to identify the cost entities mentioned in the text while preserving their spatial information for the policy network. (2) The constraint threshold module uses an LSTM to obtain a vector representation of the text, followed by a dense layer that generates \hat{h}_C, a prediction of the true constraint threshold h_C.
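Written out in the paper's notation, the ground-truth mask that the first module is trained to predict is

M_C(i, j) = \begin{cases} 1 & \text{if } o_{t(i, j)} \text{ contains a cost entity mentioned in } x \\ 0 & \text{otherwise} \end{cases}

and the mask module is trained so that \hat{M}_C approximates M_C.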

Policy Network

Training Procedure for Constraint Interpreter

As described in the previous section, the constraint interpreter is composed of two parts: the constraint mask module and the constraint threshold module. The paper uses an LSTM as the text-encoding module; given the strong performance of the Transformer in language modeling, I decided to use a Transformer encoder as an alternative in my experiments.

The constraint descriptions I use are mainly produced with the generation method written by the authors of the paper, which composes random constraint descriptions from a bank of existing phrases according to the target constraints. Observations are then collected in the provided environment using a random policy. Finally, from the collected observations and the constraints associated with them, the ground-truth constraint masks for training can be generated. The constraint thresholds are already known when the constraint descriptions are generated, so no additional processing is required. A sketch of this labeling step is shown below.
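As an illustration of the labeling step only (this is not the authors' code), the following minimal sketch builds a ground-truth mask from one grid observation and the set of entity IDs mentioned in a constraint. The 7x7x3 observation layout and the assumption that channel 0 holds integer object IDs are my own simplifications.

import numpy as np

def ground_truth_mask(observation, cost_entity_ids):
    """Build the 7x7 binary constraint mask M_C for one observation.

    observation: array of shape (7, 7, 3); channel 0 is assumed to hold
    the integer object ID of each cell (an assumption about the encoding).
    cost_entity_ids: set of object IDs forbidden by the constraint text.
    """
    object_ids = observation[:, :, 0]
    mask = np.isin(object_ids, list(cost_entity_ids)).astype(np.float32)
    return mask  # shape (7, 7); 1 where a forbidden entity is visible

# Example: cells containing entity 4 (a hypothetical "lava" ID) are marked.
obs = np.zeros((7, 7, 3), dtype=np.int32)
obs[2, 5, 0] = 4
print(ground_truth_mask(obs, {4}))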

[Figure: the modified constraint mask interpreter. The input w is tokenized to length 100, embedded (embedding_size x 100), passed through a Transformer encoder and a dense layer (49), and replicated to a 7x7x5 tensor; this is concatenated with the 7x7x3 observation into a 7x7x8 tensor, followed by a 3x3 convolution with 10 filters (5x5x10), flatten (250), dense layers (64 and 49), and a reshape to the 7x7 mask.]
[Figure: the modified constraint threshold interpreter. The input w is tokenized to length 100, embedded (embedding_size x 100), passed through a Transformer encoder, a dense layer (64), and a final dense layer that outputs h_C.]

First, I tried to train the model from scratch using a basic Transformer. More precisely, I use only the encoder part of the Transformer, i.e., the left half of the original encoder-decoder architecture. The tokenizer is simply a numbering of the words in the word bank used as the vocabulary. The final accuracy of this model is about 91%.
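For reference, here is a minimal sketch of how the constraint mask interpreter in the figure above could be reconstructed in TensorFlow/Keras. It is an illustration rather than my exact training script: the vocabulary size, embedding size, and number of attention heads are placeholder values, positional encoding is omitted, and the "Replicate" step is interpreted as tiling the 7x7 text map over 5 channels.

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 200   # size of the fixed word bank (placeholder value)
SEQ_LEN = 100      # tokenized constraint length, as in the figure
EMB = 32           # embedding size (placeholder value)

# Text branch: embedding -> one Transformer encoder block -> dense(49)
# (positional encoding omitted for brevity in this sketch)
text_in = layers.Input(shape=(SEQ_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB_SIZE, EMB)(text_in)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=EMB)(emb, emb)
x = layers.LayerNormalization()(layers.Add()([emb, attn]))
ff = layers.Dense(EMB)(layers.Dense(4 * EMB, activation="relu")(x))
x = layers.LayerNormalization()(layers.Add()([x, ff]))
x = layers.GlobalAveragePooling1D()(x)               # collapse the sequence dimension
text_feat = layers.Dense(49, activation="relu")(x)

# "Replicate": reshape the 49 text features to a 7x7 map, tile over 5 channels
text_map = layers.Reshape((7, 7, 1))(text_feat)
text_map = layers.Concatenate(axis=-1)([text_map] * 5)   # 7x7x5

# Observation branch and fusion
obs_in = layers.Input(shape=(7, 7, 3))
fused = layers.Concatenate(axis=-1)([obs_in, text_map])  # 7x7x8
h = layers.Conv2D(10, 3, activation="relu")(fused)       # 5x5x10
h = layers.Flatten()(h)                                   # 250
h = layers.Dense(64, activation="relu")(h)
h = layers.Dense(49, activation="sigmoid")(h)             # per-cell probabilities
mask_out = layers.Reshape((7, 7))(h)                      # predicted constraint mask

mask_model = tf.keras.Model([text_in, obs_in], mask_out)
mask_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])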

To further verify the feasibility of the approach, I used a pre-trained language model, BERT, in my final experiments to analyze the textual constraint descriptions. It is worth noting that mainstream language models today no longer use LSTMs but are built on the Transformer, and BERT is one of them. Since BERT's backbone is also a Transformer encoder, this can be regarded as an extension of the previous step. In addition, because our data is not meant to fine-tune the language model, the BERT weights are frozen during training.
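Below is a minimal sketch of how a frozen BERT encoder can be wired into the constraint threshold interpreter using the HuggingFace transformers library and Keras. The flatten -> dense(64) -> dense(1) head matches the parameter table below, but the loss and the exact layer layout are my own reconstruction.

import tensorflow as tf
from tensorflow.keras import layers
from transformers import BertTokenizer, TFBertModel

MAX_LEN = 100  # tokenized constraint length

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False   # freeze the pre-trained language model

input_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")
attention_mask = layers.Input(shape=(MAX_LEN,), dtype="int32")

# (batch, MAX_LEN, 768) token representations from the frozen encoder
hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state

x = layers.Flatten()(hidden)              # 100 x 768 = 76800 features
x = layers.Dense(64, activation="relu")(x)
h_c = layers.Dense(1)(x)                  # predicted constraint threshold \hat{h}_C

threshold_model = tf.keras.Model([input_ids, attention_mask], h_c)
threshold_model.compile(optimizer="adam", loss="mse")  # regression loss as an assumption

# Example usage with a single (made-up) constraint description
enc = tokenizer("do not touch the lava more than twice", padding="max_length",
                truncation=True, max_length=MAX_LEN, return_tensors="tf")
print(threshold_model([enc["input_ids"], enc["attention_mask"]]))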

The final number of parameters of the constraint mask model is as follows.

Layer (type) Output Shape Param #
tf_bert_model (TFBertModel) multiple 108310272
dense (Dense) multiple 769
conv2d (Conv2D) multiple 9280
sequential (Sequential) (50, 49) 19249
Total params: 108,339,570
Trainable params: 29,298
Non-trainable params: 108,310,272

The final number of parameters of the constraint threshold model is as follows.

Layer (type) Output Shape Param #
tf_bert_model (TFBertModel) multiple 108310272
sequential (Sequential) (128, 1) 4915329
Total params: 113,225,601
Trainable params: 4,915,329
Non-trainable params: 108,310,272

In addition, during the Constraint Interpreter training phase, the data are generated by agents acting in randomized environments with random policies. The constraint text descriptions are generated by permuting phrases from a fixed word bank.

Training Procedure for Policy Network

Since the policy module described in the paper does not differ from standard Constrained Policy Optimization (CPO), I use the code base provided by OpenAI (https://github.com/openai/safety-starter-agents) as a template. The environment used for the experiments is based on Safety Gym (https://openai.com/blog/safety-gym/), customized by the authors of the paper; their code is available on GitHub (https://github.com/michahu/hazard-world-grid).

The general training process remains the same as in the template; the part that differs is the hidden layers of the policy module. To better fit the grid observations used in this paper, I replaced the multilayer perceptron in the template with a convolutional network, which greatly reduces memory usage. A sketch of this replacement is given below.
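The following is a minimal sketch of the kind of convolutional trunk substituted for the template's MLP. It is written against plain Keras rather than the TF1 code in safety-starter-agents, and the input layout, layer sizes, and action count are illustrative assumptions rather than my exact configuration.

import tensorflow as tf
from tensorflow.keras import layers

def cnn_policy_trunk(obs_shape=(7, 7, 8), num_actions=7):
    """Convolutional trunk replacing the MLP in the policy network.

    obs_shape: grid observation concatenated with the constraint mask
    channels (an assumption about the input layout).
    num_actions: size of the discrete action space (assumed).
    """
    obs_in = layers.Input(shape=obs_shape)
    x = layers.Conv2D(16, 3, activation="relu")(obs_in)   # 5x5x16
    x = layers.Conv2D(32, 3, activation="relu")(x)        # 3x3x32
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    logits = layers.Dense(num_actions)(x)                 # action logits for the policy head
    return tf.keras.Model(obs_in, logits)

policy = cnn_policy_trunk()
policy.summary()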

Results

Challenging Issues

Summary

The paper I reproduced this semester combines safe reinforcement learning with natural language understanding. In the process of replication, I modified the model described in the paper using modules I am familiar with and justified those modifications. Although I encountered many difficulties during the replication, including incomplete data, I found solutions to most of them and learned a lot about the practical problems of applying reinforcement learning. I also received very valuable advice from the authors of the paper, which gave me a sense of the enthusiasm of the people working in this research area. I look forward to further developments in the field of reinforcement learning.

Acknowledgments

Many thanks to Michael Hu for his advice.