Responsive Safety in Reinforcement Learning by PID Lagrangian Methods
Introduction
Lagrangian methods are classical approaches to solving constrained optimization problems and have become popular baselines in deep RL for their simplicity and effectiveness. However, gradient Lagrangian methods for safe RL often lead to constraint violations in intermediate iterations. Our key insight is that the traditional Lagrange multiplier update amounts to integral control on the constraint. To address this deficiency, we expand the scope of possible Lagrange multiplier update rules by interpreting the overall learning algorithm as a dynamical system. Specifically, we employ proportional and derivative control, adding terms corresponding to derivatives of the constraint function. This novel approach results in a more responsive safety mechanism, reducing violations and improving safe RL performance.
Method
The learning dynamics of Lagrangian methods exhibit oscillations and overshoot, which, when applied to safe reinforcement learning, lead to constraint-violating behavior during agent training.
To overcome these shortcomings, adding a derivative term is a good choice, which naturally leads to the PID (proportional-integral-derivative) control method.
This effectively reduces oscillations and overshoot, so next we discuss how to combine Lagrangian methods with PID control.
Feedback Control for Constrained RL
The general expression for a discrete-time feedback system that can be controlled by a PID method is:
\[x_{k+1} = F(x_k, u_k)\]
\[y_k = Z(x_k)\]
\[u_k = h(y_0,..., y_k)\]
Here \(x\) is the state vector, \(F\) the dynamics function, \(y\) the measurement, \(u\) the control signal, and the subscript denotes the time step. The feedback rule \(h\) has access to all past and present measurements.
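For reference, the classical PID feedback rule \(h\) computes the control signal from the error \(e_k\) between the measurement \(y_k\) and its setpoint:
\[u_k = K_p e_k + K_i \sum_{j=0}^{k} e_j + K_d (e_k - e_{k-1})\]
where \(K_p\), \(K_i\), and \(K_d\) are the proportional, integral, and derivative gains.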
Taking the parameters of the neural network as the state vector of the control system, we can write constrained RL as the following first-order dynamical system:
\[\theta_{k+1} = F(\theta_k, \lambda_k)\]
\[y_k = J_c(\pi_{\theta_k})\]
\[\lambda_k = h(y_0,..., y_k, d)\]
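In particular, the traditional Lagrange multiplier update is exactly integral control on the constraint: the multiplier accumulates the violation of the cost limit \(d\),
\[\lambda_{k+1} = \big(\lambda_k + K_i\,(J_c(\pi_{\theta_k}) - d)\big)_+\]
where \((\cdot)_+\) denotes projection onto the nonnegative reals.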
F can be written as the following linear system:
\[F(\theta_k, \lambda_k) = f(\theta_k) + g(\theta_k)\lambda_k\]
\[f(\theta_k) = \theta_k + \eta\nabla_\theta J(\pi_{\theta_k})\]
\[g(\theta_k) = -\eta\nabla_\theta J_c(\pi_{\theta_k})\]
This simply splits the stochastic gradient update on the Lagrangian \(J - \lambda J_c\) into f and g:
\[\theta_{k+1} = \theta_k + \eta\nabla_\theta\big(J(\pi_{\theta_k}) - \lambda_k J_c(\pi_{\theta_k})\big)\]
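As a minimal sketch (not the authors' implementation), assume grad_J and grad_Jc are policy-gradient estimates of \(\nabla_\theta J\) and \(\nabla_\theta J_c\) stored as NumPy arrays; all names here are placeholders. One primal step is then:

# f(theta) = theta + lr * grad_J  and  g(theta) = -lr * grad_Jc
def primal_step(theta, grad_J, grad_Jc, lam, lr):
    f = theta + lr * grad_J
    g = -lr * grad_Jc
    return f + g * lam  # F(theta, lambda) = f(theta) + g(theta) * lambda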
Algorithms
Now we introduce proportional and derivative control and directly obtain the algorithm shown in the figure below. Note that the new rule adds the proportional and derivative terms of \(J_c\) on top of the integral term that the traditional multiplier update already provides.
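A minimal Python sketch of this PID multiplier update (my paraphrase of the idea, not the authors' code; the gain names and the exact projection details are assumptions):

# PID control of the Lagrange multiplier, driven by the measured cost J_c
# relative to the cost limit d. The integral term alone recovers the
# traditional Lagrangian update; the P and D terms make the response faster.
class PIDLagrangian:
    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit   # d
        self.integral = 0.0            # accumulated constraint violation
        self.prev_cost = 0.0           # J_c from the previous iteration

    def update(self, cost):
        error = cost - self.cost_limit                    # proportional input: J_c - d
        self.integral = max(0.0, self.integral + error)   # integral state, kept nonnegative
        derivative = max(0.0, cost - self.prev_cost)      # derivative input, only positive changes
        self.prev_cost = cost
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, lam)                              # project onto lambda >= 0

At each training iteration the measured cost \(J_c(\pi_{\theta_k})\) is passed to update(), and the returned \(\lambda_k\) is used in the policy update above.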
Implementation
Since PID is a parameter-tuning-based algorithm, I chose the CartPole-v0 environment from OpenAI Gym, which trains quickly and lets me compare results under different parameter settings.
import gym

# CartPole-v0; .unwrapped removes the default episode step limit
env = gym.make('CartPole-v0').unwrapped
Then I built a neural network with two layers and 10 hidden neurons whose output is used to compute the loss:
# Hidden layer: 10 tanh units on the observation placeholder self.tf_ob
layer = tf.layers.dense(
    inputs=self.tf_ob,
    units=10,
    activation=tf.nn.tanh,
    kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
    bias_initializer=tf.constant_initializer(0.1),
    name='fc1'
)
# Output layer: one logit per action (softmax is applied afterwards)
all_ac = tf.layers.dense(
    inputs=layer,
    units=self.n_actions,
    activation=None,
    kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
    bias_initializer=tf.constant_initializer(0.1),
    name='fc2'
)
Then I used the proportional term:
self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
the integral term:
running_add = running_add * self.gamma + self.ep_reward[t]
and the derivative term:
loss = tf.reduce_mean(self.Derivative * neg_log_prob * self.tf_v)
to compute the loss; a softmax converts the network output into action probabilities, and an action is then sampled according to those probabilities.
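For completeness, here is a minimal sketch of the softmax and action-sampling step in this TF1 setup; the action placeholder self.tf_acts, the session sess, and the NumPy import (np) are assumptions, not part of the original code:

# Convert logits to action probabilities; -log pi(a|s) enters the loss above
self.all_act_prob = tf.nn.softmax(all_ac, name='act_prob')
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=all_ac, labels=self.tf_acts)

# Sample an action from the current policy for a single observation ob
prob = sess.run(self.all_act_prob, feed_dict={self.tf_ob: ob[np.newaxis, :]})
action = np.random.choice(prob.shape[1], p=prob.ravel())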
Result
With a proper implementation, the cart balances the pole as expected.
With reasonable parameter settings, the PID algorithm outperforms the traditional Lagrangian algorithm in training speed and in avoiding error fluctuations.
Unsolved problems
However, I could not reproduce the figures presented by the author in the paper.
The PID algorithm relies on parameter tuning, while reinforcement learning improves performance over many rounds of training. The author does not clarify under which conditions the different parameter settings were compared: in the first round, after a fixed number of rounds, or only once good performance had already been reached? In my opinion, comparing parameters without a fixed benchmark is not rigorous.