Academic Project · Reinforcement Learning
Highway-Env RL
Train an agent to drive on a 3-lane highway without crashing into other cars. We coded a DQN from scratch in PyTorch and compared it to PPO from Stable-Baselines3.

Trained DQN agent weaving through traffic
The Environment
Observation: 5 × 5 matrix (ego car + 4 nearest vehicles)
Actions: 5 discrete (left, stay, right, faster, slower)
Training: 100k steps on highway-fast-v0 (15× speed)
Evaluation: 40 vehicles (aggressive driving, tight spacing)
We used the highway-v0 environment from the Farama Foundation's highway-env package. The agent is rewarded for driving fast and keeping to the right lanes; a crash gives a reward of -1. Because the environment is stochastic (vehicles spawn in different positions each episode), running the same model twice gives different scores, so we report the best results from each experiment.
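A minimal configuration sketch of this setup, assuming the standard highway-env config keys (Kinematics observation, DiscreteMetaAction) and the config keyword accepted by recent highway-env releases; exact values should be taken from the report:

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway-* environments)

config = {
    "observation": {
        "type": "Kinematics",                  # 5 x 5 matrix: ego + 4 nearest vehicles
        "vehicles_count": 5,
        "features": ["presence", "x", "y", "vx", "vy"],
    },
    "action": {"type": "DiscreteMetaAction"},  # left, idle, right, faster, slower
    "vehicles_count": 40,                      # dense traffic for evaluation
}

env = gym.make("highway-fast-v0", config=config)
obs, info = env.reset(seed=0)
print(obs.shape)  # (5, 5)
```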
DQN vs PPO
Two approaches, same task. One we built from scratch, one we used off the shelf.
DQN
Winner · 33.40
mean reward (± 8.15)
Implemented from scratch in PyTorch: a simple 3-layer MLP (25 → 256 → 256 → 5). The key change was lowering gamma from 0.99 to 0.8. With 0.99 the agent tries to plan far ahead, which fails when the surrounding drivers are aggressive; with 0.8 it reacts to what's right in front of it.
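A minimal sketch of that network, using the layer sizes quoted above; class and variable names are illustrative, not the report's:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """3-layer MLP mapping the flattened 5 x 5 observation to 5 Q-values."""

    def __init__(self, obs_dim: int = 25, n_actions: int = 5, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per discrete action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs arrives as (batch, 5, 5); flatten to 25 features before the MLP
        return self.net(obs.flatten(start_dim=1))
```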
PPO
Plateaued · ~27.8
mean reward (stuck after ~30k steps)
Used Stable-Baselines3 with 4 parallel environments. We tried longer rollouts, an entropy bonus, and different learning rates, but the curve always flattened. PPO learns from full on-policy rollouts and seems to need a higher gamma, which hurts it in this chaotic environment.
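For reference, a hedged sketch of the PPO baseline; the specific n_steps, ent_coef, and learning_rate values below are placeholders, not the exact values we swept:

```python
import highway_env  # noqa: F401
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env("highway-fast-v0", n_envs=4)  # 4 parallel environments

model = PPO(
    "MlpPolicy",
    vec_env,
    n_steps=256,        # rollout length per env (one of the knobs we varied)
    ent_coef=0.01,      # entropy bonus
    learning_rate=3e-4,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```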
DQN won because the action space is small (5 choices) and the replay buffer lets it reuse past experience many times, whereas PPO, being on-policy, discards each rollout after updating on it. PPO is built for harder problems such as continuous control; for 5 discrete actions on a highway, it was overkill.
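The two ingredients credited here, experience replay and the lowered discount, boil down to a few lines; the buffer and batch sizes below are assumptions, not the report's values:

```python
import random
from collections import deque
import torch

GAMMA = 0.8                    # the lowered discount factor
replay = deque(maxlen=15_000)  # transitions get reused across many updates

def sample(batch_size: int = 32):
    """Draw a random minibatch of (obs, action, reward, next_obs, done) tuples."""
    batch = random.sample(replay, batch_size)
    return [torch.stack(field) for field in zip(*batch)]

def td_target(rewards, next_obs, dones, target_net):
    """Standard DQN target: r + gamma * max_a' Q_target(s', a') for non-terminal s'."""
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
    return rewards + GAMMA * next_q * (1.0 - dones)
```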
What we took away from this
Gamma matters most
Lowering it from 0.99 to 0.8 improved the score by ~4 points and completely changed the driving behavior.
Simple wins
Double DQN scored worse. PPO scored worse. The plain DQN with the right hyperparameters won.
Context matters
On the racetrack bonus (continuous actions), PPO was the right call; DQN can't handle a continuous action space at all. The right algorithm depends on the problem (a minimal sketch follows below).
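A minimal sketch of that bonus setup, assuming highway-env's racetrack-v0 with its default continuous action space; the training budget is illustrative:

```python
import highway_env  # noqa: F401
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("racetrack-v0")            # continuous (Box) action space
model = PPO("MlpPolicy", env, verbose=1)  # DQN would reject this action space
model.learn(total_timesteps=100_000)      # illustrative budget, not the report's
```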
Project Report
Full write-up with both implementations, hyperparameter experiments, training curves, and the final comparison.