Talk "Adversarial Attacks in Deep Reinfocement Learning: A Call for Robust Defenses"
Adversarial Attacks in Deep Reinfocement Learning: A Call for Robust Defenses by Adithya Mohan was presented at Munich Datageeks - January Edition 2025
Abstract
Deep Reinforcement Learning (DRL) has demonstrated remarkable potential across domains, including robotics, autonomous driving, and gaming. However, its vulnerability to adversarial attacks poses significant challenges to its reliability in real-world applications. Adversarial perturbations can exploit weaknesses in DRL models, leading to suboptimal or even unsafe decisions. This session will explore how adversarial perturbations target DRL systems, revealing critical vulnerabilities that compromise their performance and reliability. We’ll examine key attack strategies, their implications for DRL applications, and the emerging need for robust, scalable defense mechanisms. Whether you're a developer, researcher, or AI enthusiast, this talk aims to inspire new ideas in creating resilient DRL frameworks capable of withstanding adversarial threats.
About the speaker
Adithya Mohan is a Doctoral researcher specializing in adversarial attacks and defenses in Deep Reinforcement Learning (DRL), with a deep passion for advancing the robustness and security of DRL systems. His work bridges foundational AI research and practical applications, focusing on creating resilient frameworks for autonomous systems. Adithya’s contributions aim to drive innovation in security and AI for domains such as robotics and autonomous driving.
Transcript summary
Introduction and Research Context
The speaker is a research associate and PhD student at Technical University of Ingolstadt, working on deep reinforcement learning with a specific focus on autonomous driving applications. The central research question is how to improve the robustness of deep reinforcement learning models against adversarial attacks, and which defense mechanisms achieve this.
Understanding Adversarial Attacks and Defenses
The concept of adversarial attacks in AI can be compared to computer viruses and antivirus software in traditional computing. However, in the context of deep reinforcement learning and neural networks, adversarial attacks involve external agents attacking the neural network itself, or situations where the neural network can be fooled by manipulated inputs.
A classic example demonstrates this concept: when a panda image is passed to a classifier, it correctly identifies the image as a panda with roughly 57% confidence. However, when a small amount of noise, imperceptible to the human eye, is added to the same image, the classifier misidentifies it as a gibbon, even though the image still clearly looks like a panda to human observers.
Real-World Safety Implications
The significance of adversarial attacks becomes critical in autonomous driving scenarios. The speaker uses Waymo, Google's self-driving taxi company operating in California cities such as San Francisco, as an example. These vehicles use cameras for perception to identify traffic signs and road conditions.
In normal conditions, the perception system correctly classifies a stop sign. However, when adversarial noise is introduced, the same system might misclassify the stop sign as a yield sign. While this distinction might not seem critical in some contexts, in Germany (and other countries with strict traffic rules), confusing stop signs with yield signs can have serious safety consequences.
Experimental Environment Setup
For the research experiments, the speaker utilizes the Gymnasium environment library, which is commonly used in reinforcement learning research. Specifically, the Highway-Env library is employed, which offers various driving scenarios including:
- Merge situations
- Roundabouts
- Parking scenarios
- Intersections
- Race track environments (for Formula 1 enthusiasts)
The highway environment was selected as it provides a balanced complexity level - neither too simple nor overly complicated. It features a continuous state space and discrete action space, making it suitable for training and evaluating reinforcement learning agents.
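As a rough illustration of this setup, a minimal Gymnasium/highway-env snippet is shown below; package and version details are assumptions, and the snippet is not taken from the talk.

```python
# Minimal sketch: loading the highway scenario with Gymnasium and highway-env.
import gymnasium as gym
import highway_env  # registers "highway-v0" and the other driving scenarios on import

env = gym.make("highway-v0")     # continuous kinematic observations, discrete meta-actions
obs, info = env.reset(seed=0)

print(env.observation_space)     # e.g. a Box over features of the ego and nearby vehicles
print(env.action_space)          # e.g. Discrete(5): lane changes, speed changes, idle
```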
Deep Reinforcement Learning Architecture
The standard deep reinforcement learning architecture consists of a neural network that acts as an agent. Various algorithms can be employed, including PPO, DQN, and SAC. The basic operational flow works as follows:
- The neural network predicts an action
- This action is sent to the environment
- The agent transitions from state t to state t+1
- The environment returns a new state and a reward (positive or negative)
- This information is fed back to the neural network
- The cycle continues
The reward system operates similarly to training a dog - positive reinforcement for good actions and negative reinforcement for undesired behaviors.
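The interaction loop described above can be sketched with the Gymnasium API as follows; the randomly sampled action is a placeholder for the neural network's prediction and is not the speaker's implementation.

```python
# Sketch of the agent-environment loop (random policy as a stand-in for the agent).
import gymnasium as gym
import highway_env

env = gym.make("highway-v0")
obs, info = env.reset(seed=0)

done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # a trained network would predict this from obs
    obs, reward, terminated, truncated, info = env.step(action)  # state t+1 and reward
    total_reward += reward               # positive or negative feedback for learning
    done = terminated or truncated
```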
Adversarial Attack Architecture
When adversarial attacks are introduced into the system, the architecture is modified. An attacker can introduce noise at various points:
- Action perturbation: the selected action a becomes a + epsilon, where epsilon represents the injected noise
- State space perturbation: Noise can be added to the state observations
- Neural network perturbation: The network weights themselves can be attacked
Multiple types of adversarial attacks exist, and correspondingly, multiple defense mechanisms can be employed and tested against deep reinforcement learning agents.
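To make the perturbation points concrete, the hypothetical sketch below adds epsilon-scaled noise at the state, action, and weight level; the noise models and function names are illustrative assumptions, not the attacks used in the talk.

```python
import numpy as np

epsilon = 0.05  # perturbation budget (illustrative value)

def perturb_state(obs: np.ndarray) -> np.ndarray:
    """State-space perturbation: add bounded noise to the observation."""
    return obs + epsilon * np.sign(np.random.randn(*obs.shape))

def perturb_action(action: int, n_actions: int) -> int:
    """Action perturbation: occasionally replace the chosen discrete action."""
    return np.random.randint(n_actions) if np.random.rand() < epsilon else action

def perturb_weights(weights: np.ndarray) -> np.ndarray:
    """Network perturbation: add small noise directly to a weight matrix."""
    return weights + epsilon * np.random.randn(*weights.shape)
```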
Experimental Results and Findings
The speaker conducted experiments using a vanilla DQN model trained over approximately 6,000 episodes. The evaluation compared two primary scenarios: the vanilla agent under attack and an adversarially trained agent under the same attack.
Adversarial Training Defense: Adversarial training is one of the most effective defense techniques currently available. During training, random perturbations are added to the state observations, allowing the model to learn these patterns and better withstand adversarial perturbations at evaluation time.
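A rough sketch of this idea is shown below: observations are randomly perturbed during training so the agent learns to act under noisy inputs. The training-loop details (noise model, probabilities, the commented-out agent interface) are assumptions rather than the speaker's code.

```python
# Sketch of adversarial training: mix random state perturbations into training.
import numpy as np
import gymnasium as gym
import highway_env

env = gym.make("highway-v0")
epsilon, attack_prob = 0.05, 0.5        # illustrative values

for episode in range(6000):             # roughly the training budget mentioned in the talk
    obs, info = env.reset()
    done = False
    while not done:
        if np.random.rand() < attack_prob:
            obs = obs + epsilon * np.sign(np.random.randn(*obs.shape))  # perturb the state
        action = env.action_space.sample()          # placeholder for agent.act(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        # agent.store(obs, action, reward, next_obs); agent.update()   # DQN-style update
        obs, done = next_obs, terminated or truncated
```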
Attack Evaluation - FGSM: The Fast Gradient Sign Method (FGSM) was used to introduce adversarial perturbations into the state observations. The results showed:
- Without attack: The reward structure remained stable and positive (blue pattern in results)
- With FGSM attack: A significant decline in rewards was observed (red pattern), with vehicles colliding with other cars, leaving the roadway, or exhibiting erratic behavior
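For reference, a minimal FGSM sketch against a DQN-style Q-network is given below; PyTorch is assumed, and `q_network` and the epsilon value are hypothetical placeholders rather than the speaker's model.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(q_network: torch.nn.Module, obs: torch.Tensor, epsilon: float = 0.05) -> torch.Tensor:
    """Return an FGSM-perturbed copy of `obs` (one signed-gradient step)."""
    obs = obs.clone().detach().requires_grad_(True)
    q_values = q_network(obs)                 # Q-value per discrete action, shape (batch, n_actions)
    chosen = q_values.argmax(dim=-1)          # the action the unattacked agent would take
    loss = F.cross_entropy(q_values, chosen)  # raising this loss steers the agent away from its choice
    loss.backward()
    return (obs + epsilon * obs.grad.sign()).detach()
```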
Defense Evaluation: When the adversarially trained agent was evaluated over 500 episodes:
- The reward structure remained much more stable compared to the attacked vanilla model
- Some decline in rewards was still visible, indicating that adversarial perturbations still had some effect
- Despite receiving adversarial perturbations, the agent managed to make better decisions and maintain positive reward accumulation
- The agent demonstrated resilience by aligning its behavior to gather positive rewards even under attack conditions
Visual Evaluation Comparison: Three evaluation videos demonstrated the different scenarios:
- No attack scenario: The agent moves decisively and smoothly, knowing exactly where to navigate
- FGSM attack scenario: The agent exhibits stressed behavior, making erratic decisions and colliding with other vehicles despite having the option to travel straight
- Adversarially trained agent under attack: The agent occasionally makes suboptimal moves but retains its core knowledge and attempts to maintain correct behavior, showing only a slight decline in reward performance while remaining functional
Future Research Direction
The current work focuses on simulations, but the next research phase will involve testing on real vehicles. The research group at Technical University of Ingolstadt has access to multiple cars equipped with sensors. The goal is to test deep reinforcement learning models along with adversarial attacks and defenses research on real vehicles, thereby bridging the gap between simulation and real-world applications in embodied AI.
The speaker is actively working to prove that adversarial training, despite being considered the most effective defense technique to date, may not be the optimal solution, and that better defense mechanisms exist.