Safe Reinforcement Learning

Learning Barrier Certificates in Higher-Dimensional MuJoCo Environments

Authors: Lars Chen, Jeremiah Flannery

Abstract

Using a barrier function to verify safe states is a common strategy for guaranteeing that no training-time errors are incurred in Safe RL. Depending on how the barrier function is constructed, it can require laborious hand-tuning for each new environment. Last year, Luo and Ma proposed a method that sidesteps this effort by co-learning three elements: 1) improving the confidence of the physics model, 2) expanding the verified safe region, and 3) optimizing the policy. They posit that each of the three elements improves incrementally by benefiting from improvements in the other two, creating a complementary, iterative structure. Instead of a pre-made barrier function, their algorithm requires only an initial safe policy as a starting point. In simulations with low-dimensional environments, they showed that the algorithm could expand the safe region while incurring no training errors. We applied the algorithm to two higher-dimensional environments, Double Inverted Pendulum and Hopper, and analyzed the safety of pre-trained agents in both. We found that pretraining achieved safety in fewer steps than expected, but the agent did not reach safe behavior within the number of epochs we were able to train the barrier certificates.

Table of Contents
