Adversarial Agents in multi-agent reinforcement learning

Adversarial attacks are a well-known phenomenon in machine learning and a threat to the security of many practical applications. To attack a neural network, one adds a small perturbation to its input that leads to a misclassification. Such attacks have been studied extensively in the context of computer vision, and more recently in reinforcement learning (RL).

Hopper agent trained with PPO under normal circumstances
Reward: 3104

The same Hopper agent under an adversarial attack
Reward: 171

In an RL setting, perturbed observations can fool an agent with a neural network policy into taking actions that lead to poor performance . For example, look at the Hopper agent on the right, whose observations have been imperceptibly perturbed, leading to a drastic drop in cumulative reward. There is a multitude of methods for attacking RL agents, including classical methods such as FGSM , as well as methods that exploit RL-specific properties . Most of these methods assume that the attacker can freely modify the agent's observations, which is often not possible in practice. However, an interesting opportunity arises in multi-agent RL (MARL) settings, where agents observe each other:

An attacker might take the form of an agent in MARL, whose actions directly influence the victim's observations.

This is the key idea of Gleave et al.'s paper "Adversarial Policies: Attacking Deep Reinforcement Learning" , in which adversarial agents are trained to attack victims in several two-player games. In this post, we will take a closer look at these adversarial agents, analyze how they work, and discuss ways to defend against them.

Setting things up

We will train adversarial agents on four different MuJoCo environments, each of which implements a two-player, zero-sum game. These environments were originally introduced in Bansal et al.'s paper "Emergent Complexity via Multi-Agent Competition" , in which several competing agents were trained for each environment. In the visualizations throughout the rest of the article, we show the behavior of the first pretrained agent. For our experiments, we focus on the following four environments:

You Shall Not Pass (humans)
Asymmetric, MLP policy
1 pretrained agent

Kick and Defend (humans)
Asymmetric, LSTM policy
3 pretrained agents

Sumo (humans)
Symmetric, LSTM policy
3 pretrained agents

Sumo (ants)
Symmetric, LSTM policy
4 pretrained agents

The score you see above has the form #wins - #losses - #draws, from the perspective of one of the agents, which we will call the attacker. Right now, the attacker acts according to its pretrained policy, but later we will train an adversarial policy to act in its place.

Training an adversarial agent

While training the attacker, we keep the victim's policy fixed. This is not an unrealistic assumption: in practice, neural networks are often frozen before deployment to avoid performance degradation from retraining. It also simplifies our training setup, because the victim can be "absorbed" into the environment by incorporating its policy into the transition function. In pseudocode, this could look something like this:

import gym

class AttackerEnv(gym.Env):
    """
    An environment that wraps a multi-agent environment and
    inserts the victim's actions at index 1.
    """
    ...

    def step(self, action):
        # Get the action the victim would take in the current state
        victim_action = self.victim_policy(self._get_obs())

        # Give the attacker's action as well as the victim's action
        # to the wrapped multi-agent environment
        obs, reward, done, info = self.env.step([action, victim_action])

        # Return the observation and reward of the attacker
        return obs[0], reward[0], done, info

Note that the environment now takes in only an action from the attacker, and returns only the attacker's observation and reward, meaning...

The attacker can be trained in a single-agent environment.

Our goal for the attacker is simply to win as often as possible. Therefore, we do not use the environment's own reward function, but instead reward the attacker when it wins and penalize it otherwise. We do this by giving a sparse reward at the end of the episode: +1000 for a win and -1000 otherwise.
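As a concrete illustration, here is a minimal sketch of such a reward wrapper. It assumes the wrapped environment reports the game outcome through a hypothetical info["winner"] entry; the actual bookkeeping depends on how the multi-agent environment exposes results.

import gym

class SparseRewardWrapper(gym.Wrapper):
    """Replaces the environment's reward with +/-1000 at the end of the episode."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if not done:
            # No reward signal during the episode
            return obs, 0.0, done, info
        # "winner" is a hypothetical info entry; adapt it to however
        # the wrapped environment reports the game outcome
        sparse_reward = 1000.0 if info.get("winner") == "attacker" else -1000.0
        return obs, sparse_reward, done, info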

With the environment and reward in place, we can use any standard RL algorithm to train our attacker. In our case, we used the PPO implementation from Stable-Baselines3 with the hyperparameters provided in .
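Concretely, the training loop can be as short as the following sketch. AttackerEnv and SparseRewardWrapper are the ones sketched above; the hyperparameter values shown here are illustrative placeholders, not the ones we actually used.

from stable_baselines3 import PPO

# Wrap the two-player environment (with the frozen victim inside) and add the sparse reward
env = SparseRewardWrapper(AttackerEnv(...))

# Hyperparameter values below are illustrative placeholders
model = PPO("MlpPolicy", env, learning_rate=3e-4, n_steps=2048, batch_size=64, verbose=1)
model.learn(total_timesteps=20_000_000)
model.save("adversarial_policy")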

Training results

You Shall Not Pass (humans)
MLP policy, attacker in ● Red

Kick and Defend (humans)
LSTM policy, attacker in ● Blue

Sumo (humans)
LSTM policy, attacker in ● Blue

Sumo (ants)
LSTM policy, attacker in ● Blue

The videos above show the adversaries at work. As you might notice, the success of our attack depends on the environment.

You Shall Not Pass (humans)
MLP policy, attacker in ● Red

In You Shall Not Pass, the attacker is able to consistently score wins (it wins 71% of the games). It does so not by obstructing the victim, or by any other interpretable means, but by falling to the ground and moving seemingly at random. This suggests that the attacker produces adversarial observations: the victim fails to cross the line even though nothing physically obstructs it. The victim has no such trouble against a random policy, or against one that exerts no forces on its joints at all (the zero policy).

Kick and Defend (humans)
LSTM policy, attacker in ● Blue

In Kick and Defend, the attacker achieves even better results against the victim, winning 81% of the games. While the attacker's behavior looks similar to that in You Shall Not Pass, its effect on the victim is different: instead of falling over, the victim seems to forget to kick the ball, resulting in a loss once the time limit is reached. Again, when paired against the random or zero policy, the victim performs as expected.

Sumo (humans)
LSTM policy, attacker in ● Blue

In Sumo (humans), our attacks work fairly well, but the results differ a lot between victims. While our first attacker achieves a 70% win rate, the second and third win only 12% and 36% of their games, respectively. Because the attacker cannot win if it falls over, it learns to kneel in a stable position while moving its arms. The constraint of staying upright limits the influence the attacker can exert on the victim, which could explain the high variance in the attackers' performance.

Sumo (ants)
LSTM policy, attacker in ● Blue

In Sumo (ants), our attack did not work against any of the victims. This is in line with research on adversarial attacks, which shows that they work better in higher-dimensional input spaces . In comparison, the state of an ant agent is 15-dimensional, while that of a human is 24-dimensional. Still, we can observe that the attacker seems to stall the victim while barely moving itself, a vastly different behavior from that of the pretrained agents.

All in all, we can gather two key takeaways:

1. The attacker does not win through physical strength or skill, but by producing observations that the victim has not learned to handle.
2. The success of the attack depends strongly on the environment, with higher-dimensional observation spaces being more vulnerable.

While the results presented here are purely qualitative and visual, we provide a complete table of win rates in the appendix. Still, you might be unsatisfied: we have seen that the attacks work, but not how they work.

How do the adversarial attacks work?

t-SNE visualization of the activations of the victim's policy network during the game.

We suspect that the easiest way to win against a well-trained agent in a short time is to create observations that the victim has not encountered during training; in other words, the attack aims to push the victim into regions of the observation space to which it has not generalized well. To validate this assumption, we need to open up the black box of the victim and analyze its inner workings. One intuitive approach is to examine the activations of its policy network: semantically similar observations should produce activations that lie close to each other in the network's representation. By analyzing the victim's activations during the game, we can gain insight into how it processes different observations. To visualize these activations in 2D, we use t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality-reduction technique that preserves the local structure of the data. This gives us a better picture of how the victim perceives the game environment and how it responds to different opponents.
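A minimal sketch of this analysis with scikit-learn, assuming the victim's activations have already been recorded during rollouts against each opponent (the activations_per_opponent dictionary below is a hypothetical container for them):

import numpy as np
from sklearn.manifold import TSNE

# activations_per_opponent: dict mapping opponent name -> array of shape
# (timesteps, hidden_dim), recorded from the victim's policy network (assumed given)
activations = np.concatenate(list(activations_per_opponent.values()))
labels = np.concatenate([
    np.full(len(acts), name)
    for name, acts in activations_per_opponent.items()
])

# Project the high-dimensional activations to 2D while preserving local structure
embedding = TSNE(n_components=2, perplexity=30).fit_transform(activations)
# embedding[:, 0] and embedding[:, 1] can now be scatter-plotted, colored by opponent label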

Log-likelihood that a model's activations originate from the Gaussian mixture model fitted to Zoo1's activations.

The t-SNE visualization shows that the attacker is able to create observations that stimulate the victim differently than the Zoo opponents do. As this is a strictly qualitative analysis, we cannot draw any numerical conclusions from it.

Therefore, we also performed a statistical analysis of the victim's activations. We model the activations the victim produces when playing against one of the Zoo opponents as a Gaussian mixture model, and then evaluate how likely the activations induced by other opponents are under this model. For the other Zoo opponents, we observe a weakly positive log-likelihood, meaning their activations plausibly stem from the same distribution. In most cases, however, the activations induced by the adversarial agent were highly unlikely under this model.
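A sketch of this comparison using scikit-learn's GaussianMixture; the number of mixture components is an illustrative choice, and the activation arrays are assumed to be collected as above:

from sklearn.mixture import GaussianMixture

# Fit a Gaussian mixture model to the activations induced by the first Zoo opponent
gmm = GaussianMixture(n_components=20, covariance_type="full")
gmm.fit(activations_per_opponent["Zoo1"])

# score() returns the mean log-likelihood per sample under the fitted mixture;
# activations from the other Zoo opponents should score higher than the adversary's
for name, acts in activations_per_opponent.items():
    print(name, gmm.score(acts))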

How can we defend against adversarial agents?

As we have seen, what enables these attacks is the manipulation of the victim's observations. So how can we defend against them? A very simple solution is to mask out the part of the observation that contains information about the attacker. For the masking, we do not set these values to 0, but instead freeze them at their values right after the agents are initialized in the environment, so that they still look natural.
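A minimal sketch of such a masking wrapper; the indices of the opponent-related observation dimensions (OPPONENT_IDX below) are environment-specific and purely illustrative here:

import gym
import numpy as np

# Indices of the observation dimensions that describe the opponent
# (environment-specific; the range below is purely illustrative)
OPPONENT_IDX = np.arange(24, 48)

class MaskedVictimEnv(gym.ObservationWrapper):
    """Freezes the opponent-related part of the victim's observation at its initial value."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Remember the natural-looking opponent observation from the initial state
        self._frozen = obs[OPPONENT_IDX].copy()
        return obs

    def observation(self, obs):
        obs = obs.copy()
        obs[OPPONENT_IDX] = self._frozen
        return obs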

This technique works extremely well for "You Shall Not Pass," where the win rate against the adversary skyrockets to 99.2%. However, it comes at a cost: the performance against the pretrained opponent drops from 50.9% to 19.3%. Additionally, this kind of masking does not transfer to environments whose observations contain positions relative to the victim: once the victim moves, it still observes the same frozen relative position, which we believe is detrimental to its performance. Besides this very simple method, there are many other ways to defend against adversarial agents.

One idea is to fine-tune a policy against adversarial attacks, as in , where the authors increase a policy's robustness by fine-tuning against observations attacked with FGSM. Similarly, in , the victim agents are fine-tuned against the attackers, which increases robustness, although the attack can then be reapplied.

A related approach is co-training with adversarial agents: the victim is trained at the same time as an adversary that can manipulate the victim's observations, building robustness from the ground up. In , the co-trained adversarial agent applies forces to the victim in order to destabilize it. In , the authors co-train an optimal adversary represented by a neural network that outputs an observation directly. This can become infeasible in large observation spaces and is improved upon in , where this network is split into an actor and a director, which craft a state perturbation and propose a policy perturbation direction, respectively. In , the authors instead train the victim against a diverse population of different opponents.

Lastly, there are several regularization methods, such as and .

Discussion

In the high-dimensional environments (Sumo (humans), Kick and Defend, You Shall Not Pass), our experiments showed that adversarial policies can become strong opponents in only a fraction of the training time. Despite this success, the adversarial agent was rarely successful in the low-dimensional environment (Sumo (ants)). This makes sense: the combinatorial explosion of possible states in the higher-dimensional environments makes it more likely that there are observations that were never visited during training and to which the victim has therefore not generalized.

In the paper, masking out the position of the opponent was found to be an effective measure for protecting the pretrained Zoo models against the adversarial attacks. In our studies, we could not replicate these results, which we attribute to a detail in the implementation of the masking. In , the authors did not set the masked values to a fixed value that stays static throughout a run, but instead updated the relative position with respect to the agent's own movement. Since this is an implausible setup for a black-box environment, we kept a static value throughout the run. With this change, we did not observe the same results as in the paper, but instead saw a significant drop in performance against all the other models, with a win rate of less than 4% against every model.

Appendix

Tournament

Each table reports the win rate (%) of the row agent (ZooV: pretrained Zoo victim, ZooM: the same victim with masked observations) against the column opponent (Adv: adversarial policy, ZooO: pretrained Zoo opponent, Random/Zero: baseline policies).

You Shall Not Pass

Adv1 ZooO1 Random Zero
ZooV1 30.1 48.9 98.4 98.8
ZooM1 99.4 25.1 98.8 98.8

Kick and Defend

Adv1 Adv2 Adv3 ZooO1 ZooO2 ZooO3 Random Zero
ZooV1 38.4 35.7 38.4 58.1 57.7 57.7 62.0 60.4
ZooV2 31.7 33.0 32.2 33.3 33.4 35.4 45.3 41.4
ZooV3 17.9 16.8 17.8 38.8 37.6 39.0 68.4 60.8
ZooM1 37.5 4.6 0.4 4.2 0.5 2.0 40.7 0.5
ZooM2 19.7 2.3 0.1 0.6 27.5 2.0 12.9 0.7
ZooM3 43.4 58.8 0.4 41.2 19.5 32.5 18.2 1.0

Sumo Ants

Adv1 Adv2 Adv3 Adv4 ZooO1 ZooO2 ZooO3 ZooO4 Random Zero
ZooV1 60.2 60.6 65.0 68.2 39.4 38.6 38.8 36.1 66.0 68.1
ZooV2 61.7 61.8 59.9 62.6 35.7 33.7 35.4 38.0 65.2 59.5
ZooV3 69.8 64.1 65.6 64.5 37.7 35.8 36.8 38.8 65.2 68.2
ZooV4 68.1 71.2 69.1 69.2 44.3 43.4 41.1 40.5 77.2 71.1
ZooM1 33.2 36.5 35.2 30.5 8.8 15.1 19.4 18.9 44.1 43.0
ZooM2 15.5 26.7 20.0 25.1 15.5 13.9 15.7 14.0 27.7 25.7
ZooM3 27.5 22.5 29.6 24.0 16.8 14.5 14.1 12.3 30.3 20.2
ZooM4 14.4 17.0 15.3 28.4 11.2 9.5 16.4 9.6 26.3 13.9

Sumo Humans

Adv1 Adv2 Adv3 ZooO1 ZooO2 ZooO3 Random Zero
ZooV1 37.4 39.7 40.2 14.5 14.9 17.0 7.4 11.9
ZooV2 61.0 59.8 62.0 14.9 13.2 14.4 10.6 17.9
ZooV3 31.4 30.8 28.7 31.4 32.4 32.2 11.3 18.5
ZooM1 16.8 15.9 13.2 6.0 9.0 5.0 4.4 9.8
ZooM2 9.1 9.0 9.7 11.3 10.0 17.9 11.2 11.5
ZooM3 13.4 2.1 17.7 4.4 12.5 6.0 12.4 17.7