Talk "A Competitive Time Trial AI for Need for Speed Most Wanted Using Deep RL"
A deep reinforcement learning agent achieves competitive esports lap times in Need for Speed Most Wanted using reverse-engineered game memory and a custom training environment, trained in just 24 hours on 10-year-old consumer hardware.
Presented at the 100th Munich Datageeks Meetup, October 2025.
Abstract
This talk presents the development of a competitive AI agent for Need for Speed Most Wanted using deep reinforcement learning, achieving near-professional esports performance on consumer hardware. The project reimplements concepts from Sony's Gran Turismo research in a constrained environment without official APIs or source code, demonstrating that sophisticated AI capabilities can be achieved through creative engineering on modest resources.
The implementation required reverse-engineering the game's memory to extract telemetry data including XYZ coordinates, speed, acceleration, and vehicle orientation. A custom OpenAI Gym environment was built for real-time training at 30 Hz, using the Soft Actor-Critic algorithm. Instead of image-based observations, the agent uses a 594-component feature vector comprising vehicle telemetry, a lidar system providing 180-degree forward vision, and a GPS navigation system indicating the optimal racing line ahead.
After 24 hours of training on a 10-year-old consumer PC (Intel Xeon E3, GeForce GTX 1070), the agent achieves a lap time of 1:08.30 on the Heritage Heights track, faster than 99% of players and competitive with professional esports times. The human world record stands at 1:06.85. The agent demonstrates remarkable stability, completing laps consistently and recovering from disruptions. It generalizes well to different vehicles without retraining but struggles with unseen track elements such as grass or sharp corners not encountered during training.
Compared to Sony's Gran Turismo research approach using thousands of PlayStation 4 consoles in parallel, this project demonstrates that significant reinforcement learning results can be achieved through efficient problem formulation and careful engineering on accessible hardware.
About the speaker
The speaker Sebastian has a unique history with Need for Speed Most Wanted, having been a professional esports athlete 15 years ago, playing for Team Alternate, one of Germany's biggest esports teams at the time. They competed in multiple Need for Speed titles including Most Wanted, Carbon, Pro Street, and Shift, primarily in the Electronic Sports League (ESL) and World Cyber Games. The speaker still holds a world record lap time of 1:06.85 on the Heritage Heights track. After moving from data science into the role of Head of Data and AI at Eon, they embarked on this project in late 2022 as a personal endeavor, inspired by Sony's papers on using deep reinforcement learning in Gran Turismo Sport.
Transcript summary
Game Context and Competitive Rules
Need for Speed Most Wanted is an arcade racing game released in 2005, and it remains the best-selling title in the franchise, with 16 million copies sold according to the German Wikipedia (the English Wikipedia cites as many as 80 million). The game was a popular esports title with specific competitive rules:
- Races were one-versus-one on circuits with 5-6 laps
- All tuning parts were allowed in ESL, but Junkman parts were banned in World Cyber Games due to cheating potential
- All cars were permitted, with the Lotus Elise and Porsche Carrera GT being the most competitive choices
- NOS (turbo boost) was disabled because of infinite NOS hacks
- Collisions were disabled due to poor netcode, especially problematic given that some players still used ISDN connections in 2005
Technical Challenge Overview
The project faced significant obstacles from the start. Unlike Sony's Gran Turismo research, which had official game APIs and dedicated hardware, this project required building everything from scratch. There was no game API, no code reference from Sony's papers, and training had to run in real time on consumer hardware. The speaker needed to create a custom OpenAI Gym environment that could operate in real time, as the game itself runs in real time without external control options.
Custom API Development Through Memory Hacking
Since no official API existed, the speaker reverse-engineered the game's memory to extract necessary data. They used a Need for Speed mod that enabled windowed mode, more than five laps, and teleportation capabilities for resetting car positions. The memory extraction process involved:
- Opening the game's approximately 300 MB memory space
- Modifying values in-game (like speed) and searching for corresponding byte changes
- Iteratively narrowing down memory addresses through multiple observations
- Using correlation analysis for difficult-to-find variables like acceleration
The PyMem package enabled reading game memory in nanoseconds, which was crucial for real-time operation. Key variables discovered included XYZ coordinates (in meters), Euler angles, speed (in meters per second), acceleration, spin, pitch, roll, lift distance, surface type, current lap, and controller inputs. Acceleration proved particularly challenging: it is stored as 248 times the value in meters per second squared, with sign flips depending on the coordinate quadrant, and took three nights to figure out.
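A minimal sketch of such a memory read with the PyMem package is shown below; the process name and addresses are illustrative placeholders, not the actual offsets reverse-engineered for the project.

```python
# Minimal sketch of reading telemetry from the game's memory with pymem.
# The process name and addresses are illustrative placeholders, not the
# actual offsets found during reverse engineering.
import pymem

pm = pymem.Pymem("speed.exe")   # attach to the running game process (name assumed)

SPEED_ADDR = 0x009142C8         # hypothetical address of speed in m/s
POS_X_ADDR = 0x00914D40         # hypothetical address of the X coordinate in meters

def read_speed_and_x():
    """Read two example telemetry values directly from game memory."""
    speed = pm.read_float(SPEED_ADDR)
    pos_x = pm.read_float(POS_X_ADDR)
    return speed, pos_x
```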
Track Mapping and Spatial Features
To provide the AI with environmental awareness, the speaker manually mapped tracks by driving along boundaries while recording XYZ coordinates in real-time. This 10-minute process per track captured thousands of boundary points. The system uses K-Dimensional Tree (KD-Tree) nearest neighbor algorithms to efficiently calculate distances to boundaries in real-time, operating in nanoseconds.
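A sketch of how that boundary lookup can be wrapped with SciPy's cKDTree; the function and variable names are illustrative.

```python
# Sketch of the boundary-distance lookup with a KD-tree (SciPy).
import numpy as np
from scipy.spatial import cKDTree

def build_boundary_tree(boundary_points):
    """boundary_points: (N, 2) array of recorded track-border XY coordinates."""
    return cKDTree(boundary_points)

def distance_to_boundary(tree, car_xy):
    """Distance from the car's XY position to the nearest recorded border point."""
    dist, _ = tree.query(car_xy)   # single nearest-neighbour query
    return dist
```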
The lidar system measures the distance to the nearest track boundary across a 180-degree arc ahead of the car in one-degree steps, giving the AI vision of the upcoming track geometry. The GPS navigation system calculates the tangent along the optimal racing line for the next 200 meters, providing lookahead information similar to what a human driver uses.
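Building on the KD-tree above, one way to approximate such a discrete lidar is to sample points along all 180 rays at once and take, per ray, the first sample that lands within a small radius of a recorded border point. The step size, hit radius, and maximum range below are assumptions, not values from the talk.

```python
# Hedged, vectorized sketch of a discrete "lidar": sample points along 180 rays
# ahead of the car and report, per ray, the first radius at which a recorded
# border point is closer than hit_radius. Step, hit_radius, and max_range are
# assumptions; heading_rad is the car's 2D heading derived from its Euler angles.
import numpy as np

def lidar_scan(tree, car_xy, heading_rad, n_rays=180, step=1.0, max_range=200.0, hit_radius=1.0):
    # ray directions spanning 180 degrees ahead of the car, one degree apart
    angles = heading_rad + np.deg2rad(np.arange(n_rays) - 90)
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (180, 2)
    radii = np.arange(step, max_range + step, step)                   # (R,)
    # all sample points along all rays, shape (180, R, 2), queried in one call
    points = np.asarray(car_xy) + radii[None, :, None] * directions[:, None, :]
    dists, _ = tree.query(points.reshape(-1, 2))
    hits = dists.reshape(n_rays, radii.size) < hit_radius
    # first hit radius per ray; rays without a hit report max_range
    return np.where(hits.any(axis=1), radii[hits.argmax(axis=1)], max_range)
```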
Reinforcement Learning Architecture
The implementation uses the Soft Actor-Critic (SAC) algorithm through the Stable Baselines3 Python package. SAC aims to maximize the sum of all rewards plus an entropy term that balances exploitation (getting faster) and exploration (trying new approaches). The system consists of two main components:
The actor (policy) is a distribution over possible actions given the current state, essentially serving as the driver. The critic is a value function that estimates the maximum achievable future reward, discounted by a factor gamma, acting as a co-driver. Training updates the policy by minimizing the Kullback-Leibler divergence between the policy and the distribution obtained by exponentiating the critic's value estimates, with both functions represented by deep neural networks trained on a replay buffer that serves as the agent's memory.
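In the notation of the original SAC paper, the objective and the policy-improvement step the talk alludes to read roughly as follows, where α is the entropy temperature and Z a normalizing constant:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\,\cdot \mid s_t)\big) \big]

\pi_{\text{new}} = \arg\min_{\pi'} \;
  D_{\mathrm{KL}}\!\left( \pi'(\,\cdot \mid s_t) \;\Big\|\;
  \frac{\exp\!\big(Q^{\pi_{\text{old}}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{\text{old}}}(s_t)} \right)
```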
Gym Environment Implementation
The custom gym environment required four core methods, sketched in the skeleton after this list:
- Init: Defines action space (braking, throttling, steering) and observation space
- Step: Called 30 times per second, performs one action and returns observation, reward, done indicator, and info
- Reset: Returns agent to start/finish line at specific angle and speed after each episode
- Render: Can capture screenshots at up to 300 fps, though the game already renders itself on screen
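A hedged skeleton of such an environment, following the four-method Gym interface described above; the class name and all underscore-prefixed helpers are placeholders standing in for the project's memory-reading and input-writing code.

```python
# Hedged skeleton of the custom real-time environment (class name assumed).
import gym
import numpy as np
from gym import spaces

class NFSMostWantedEnv(gym.Env):
    def __init__(self):
        super().__init__()
        # steering and throttle/brake, both as continuous values in [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        # 594-component feature vector: telemetry, lidar, GPS lookahead, history
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(594,), dtype=np.float32)

    def step(self, action):
        self._apply_action(action)                 # write steering/throttle to the game
        obs = self._build_observation()            # telemetry + lidar + GPS features
        reward = self._delta_lap_completion()      # lap progress made during this step
        done = self._lap_finished() or self._timed_out() or self._reversed()
        return obs, reward, done, {}

    def reset(self):
        self._teleport_to_start()                  # reset position, angle, and speed
        return self._build_observation()

    def render(self, mode="human"):
        pass                                       # the game already renders on screen

    # --- placeholders standing in for the project's memory-hacking code ---
    def _apply_action(self, action): ...
    def _build_observation(self): return np.zeros(594, dtype=np.float32)
    def _delta_lap_completion(self): return 0.0
    def _lap_finished(self): return False
    def _timed_out(self): return False
    def _reversed(self): return False
    def _teleport_to_start(self): ...
```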
Actions were initially implemented through a virtual Xbox 360 controller, later changed to direct memory writes for potential performance gains. The action space uses continuous values from -1 to +1 for steering, while throttle/brake uses three discrete values: -0.4 (full brake without reverse), 0.7 (half throttle), and 1.0 (full throttle). This discretization prevents the agent from wasting time optimizing between 98% and 100% throttle.
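A small sketch of mapping a continuous throttle/brake output in [-1, 1] onto those three discrete values; the bin edges chosen here are an assumption.

```python
# Map the continuous throttle/brake action in [-1, 1] onto the three discrete
# values used in the project; the bin edges are an assumption.
def discretize_throttle(raw: float) -> float:
    if raw < -1.0 / 3.0:
        return -0.4   # full brake without engaging reverse
    elif raw < 1.0 / 3.0:
        return 0.7    # half throttle
    else:
        return 1.0    # full throttle
```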
Observation Vector Design
The observation vector contains 594 components comprising multiple data types. Telemetry includes speed, acceleration, direction vectors (replacing raw angles), spin, pitch, roll, lift distance, and surface type. The lidar provides 180 distance measurements in one-degree increments ahead of the car. GPS navigation supplies the inverse curve radii and angles of the racing line ahead. Additional features include collision indicators, reverse indicators, and the last five steering inputs for temporal context.
This approach uses roughly 500 times less data than image-based training, with significantly less noise. The feature calculation runs in under 8 milliseconds on a 10-year-old PC, enabling 30 Hz training with time left over for gradient steps. The system works across different camera perspectives, weather conditions, and vehicles.
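A hedged sketch of how such a vector could be assembled; only the 180 lidar rays and the five-step steering history are counts stated in the talk, the remaining block sizes are left to the caller.

```python
# Hedged sketch of assembling the observation vector with NumPy.
# Only the 180 lidar rays and the last 5 steering inputs are counts stated in
# the talk; the sizes of the other blocks are up to the caller.
import numpy as np

def build_observation(telemetry, lidar, gps_radii, gps_angles, collision, reverse, steering_history):
    return np.concatenate([
        telemetry,               # speed, acceleration, direction vectors, spin, pitch, roll, ...
        lidar,                   # 180 distances, one per degree ahead of the car
        gps_radii,               # inverse curve radii of the racing line ahead
        gps_angles,              # angles of the racing line ahead
        [collision, reverse],    # indicator flags
        steering_history,        # last 5 steering inputs for temporal context
    ]).astype(np.float32)
```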
Reward Function Design and Pitfalls
The reward function measures delta lap completion per step (30 times per second) plus an optional collision penalty. This approximates lap time minimization while providing frequent feedback, unlike sparse lap time rewards that only appear after hundreds of actions. An episode terminates when a lap completes, a 3-minute time limit is reached, or the agent reverses.
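A minimal sketch of the per-step reward as described, with variable names and the penalty weight assumed.

```python
# Minimal sketch of the per-step reward: progress made on the lap during this
# step, minus an optional collision penalty. Names and the penalty weight are
# assumptions.
def compute_reward(lap_completion_now, lap_completion_prev, collided, collision_penalty=0.0):
    reward = lap_completion_now - lap_completion_prev   # delta lap completion this step
    if collided:
        reward -= collision_penalty                     # optional, see the pitfalls below
    return reward
```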
The speaker shared several instructive failures from incorrect reward functions. Using speed as a reward caused the agent to do endless burnouts, as this provided infinite reward without risk of crashing or episode termination. Using delta lap completion without blocking reverse caused the agent to deliberately cross the finish line backwards, jumping from 0% to 100% completion for maximum reward. Setting collision penalties too high made the agent learn that living is painful, so it would simply stand still to avoid punishment.
Training Infrastructure and Performance
Training runs on modest hardware: a 10-year-old Intel Xeon E3 processor (4th generation) and a GeForce GTX 1070 with 8 GB of VRAM. Training in real time takes approximately 24 hours to reach competitive performance. The system performs 30 gradient steps per second while the game runs simultaneously. All feature calculations avoid Python for loops and are streamlined to complete within the 6-millisecond window available at 30 Hz operation.
The replay buffer size is maximized to fit within 8 GB VRAM alongside the running game on Windows. Callbacks handle logging and recovery from random game crashes that would otherwise lose the replay buffer. The game has a timing bug that introduces 5-10 second errors after 5 hours of continuous play, so external timing is used.
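A hedged sketch of how the training loop might be wired up with Stable Baselines3, reusing the environment skeleton from earlier; the buffer size, save frequency, and step count are assumptions chosen only to illustrate the pattern of a large replay buffer plus crash-safe checkpoints.

```python
# Hedged sketch of the training loop with Stable Baselines3.
# buffer_size, save_freq, and total_timesteps are assumptions, chosen only to
# illustrate a large replay buffer plus periodic, crash-safe checkpoints.
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback

env = NFSMostWantedEnv()

model = SAC(
    "MlpPolicy",
    env,
    buffer_size=1_000_000,     # sized so buffer and game fit on the machine together
    train_freq=1,              # one gradient step per environment step (30 per second)
    gradient_steps=1,
    verbose=1,
)

checkpoint_cb = CheckpointCallback(
    save_freq=50_000,          # periodic saves so a game crash does not lose everything
    save_path="./checkpoints/",
    name_prefix="nfs_sac",
    save_replay_buffer=True,   # keep the replay buffer recoverable as well
)

model.learn(total_timesteps=2_600_000, callback=checkpoint_cb)  # roughly 24 hours at 30 Hz
```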
Results and Performance Comparison
The trained agent achieves a lap time of 1:08.30 on Heritage Heights, approximately 1.5 seconds slower than the human world record of 1:06.85. To contextualize this performance: the in-game AI on hard difficulty completes laps in 1:33.17, casual gamers typically achieve 1:18, good gamers reach 1:10, and pro gamers manage 1:09. The agent's 1:08.30 would have won nearly every match in ESL except against the top five players. The world record represents a one-in-a-million perfect lap.
Training progress shows the agent first completing a lap successfully, then consistently completing every subsequent lap while improving times. The agent finds the racing line reliably, doesn't collide with obstacles, and recovers if thrown off course. It makes consistent line choices through the tunnel and corner sequences, though with slight variations from human racing lines. Some jittering in steering occurs but doesn't significantly impact lap times.
Generalization and Limitations
The agent generalizes remarkably well to other cars without retraining. When given the Porsche Carrera GT, which handles very differently from the training Lotus Elise, the agent completes laps only 1-2 seconds slower without going off-track. However, it doesn't know the Porsche is wider, occasionally causing boundary collisions.
Generalization to other tracks is limited. If a new track contains elements not seen during training, like grass surfaces or 90-degree turns requiring hard braking, the agent fails. Grass causes drifting issues because the agent doesn't understand slippery surfaces. This limitation stems from the lack of transfer learning between tracks.
Collision Strategy Variations
Training with and without collision penalties produces different but equally fast strategies. Without collision penalties, the agent intentionally hits walls before corners, using the impact to position optimally for the exit. This technique exists in human competitive play but is difficult to execute consistently. With collision penalties, the agent uses brakes instead, which is more visually appealing and easier for humans to replicate while maintaining similar lap times.
Computational Efficiency Comparison
Compared to Sony's Gran Turismo research, this project is remarkably efficient. Sony's first paper achieved superhuman performance after 73 hours using 80 cars in parallel across four PS4s. Their second paper used 1,000 PS4s in parallel with 80 cars each, essentially brute-forcing the solution space. This project achieves competitive results in 24 hours on a single consumer PC, demonstrating that sophisticated infrastructure isn't always necessary for strong reinforcement learning performance.
Technical Challenges and Solutions
Maintaining the Markov property proved challenging since the game runs in real-time without external control. Setting the game to 30 fps helped maintain consistent frame rates and update frequencies. Some features like the lidar are inherently noisy due to discrete calculations from available data, but this doesn't significantly impact performance. The system required approximately 100 experimental runs totaling dozens of hours of training, providing excellent heating during winter months.
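A purely illustrative sketch of holding a fixed 30 Hz step cadence with wall-clock pacing; the project's actual timing code is not described in the talk.

```python
# Illustrative wall-clock pacing to hold a fixed 30 Hz step cadence so that
# observations and actions stay roughly evenly spaced in real time.
import time

STEP_DT = 1.0 / 30.0   # ~33.3 ms budget per step

def paced_step(env, action):
    start = time.perf_counter()
    result = env.step(action)                # feature calculation + action write
    elapsed = time.perf_counter() - start
    if elapsed < STEP_DT:
        time.sleep(STEP_DT - elapsed)        # wait out the remainder of the frame
    return result
```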
The speaker noted that getting reinforcement learning right is difficult, which is why using Stable Baselines3 was valuable. The package name reflects the challenge of achieving stable training. All calculations must complete within strict time constraints while the game runs, requiring careful optimization and elimination of all Python for loops in the codebase.