Academic Project · Reinforcement Learning
Highway-Env RL
Train an agent to drive on a 3-lane highway without crashing into other cars. We coded a DQN from scratch in PyTorch and compared it to PPO from Stable-Baselines3.

Trained DQN agent weaving through traffic
The Environment
Observation: 5 × 5 matrix (ego car + 4 nearest vehicles)
Actions: 5 discrete (left, stay, right, faster, slower)
Training: 100k steps on highway-fast-v0 (15× speed)
Evaluation: 40 vehicles (aggressive driving, tight spacing)
We used the highway-v0 environment from the Farama Foundation's highway-env package. The agent is rewarded for driving fast and keeping to the right lanes; a crash gives a reward of -1. Because the environment is stochastic (vehicles spawn in different positions each episode), running the same model twice gives different scores, so we report the best results from each experiment.
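A minimal configuration sketch of this setup, assuming the standard highway-env config keys (Kinematics observation, DiscreteMetaAction) and the config keyword accepted by recent highway-env releases; exact values should be taken from the report:

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway-* environments)

config = {
    "observation": {
        "type": "Kinematics",                  # 5 x 5 matrix: ego + 4 nearest vehicles
        "vehicles_count": 5,
        "features": ["presence", "x", "y", "vx", "vy"],
    },
    "action": {"type": "DiscreteMetaAction"},  # left, idle, right, faster, slower
    "vehicles_count": 40,                      # dense traffic for evaluation
}

env = gym.make("highway-fast-v0", config=config)
obs, info = env.reset(seed=0)
print(obs.shape)  # (5, 5)
```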
DQN vs PPO
Two approaches, same task. One we built from scratch, one we used off the shelf.
DQN
Winner · 33.40
mean reward (± 8.15)
Implemented from scratch in PyTorch: a simple 3-layer MLP (25 → 256 → 256 → 5). The key change was lowering gamma from 0.99 to 0.8. With 0.99 the agent tries to plan far ahead, which fails when the surrounding drivers are aggressive; with 0.8 it reacts to what's right in front of it.
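A minimal sketch of that network, using the layer sizes quoted above; class and variable names are illustrative, not the report's:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """3-layer MLP mapping the flattened 5 x 5 observation to 5 Q-values."""

    def __init__(self, obs_dim: int = 25, n_actions: int = 5, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per discrete action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs arrives as (batch, 5, 5); flatten to 25 features before the MLP
        return self.net(obs.flatten(start_dim=1))
```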
PPO
Plateaued · ~27.8
mean reward (stuck after ~30k steps)
Used Stable-Baselines3 with 4 parallel environments. We tried longer rollouts, an entropy bonus, and different learning rates, but the curve always flattened. PPO learns from full on-policy rollouts and seems to need a higher gamma, which hurts it in this chaotic environment.
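For reference, a hedged sketch of the PPO baseline; the specific n_steps, ent_coef, and learning_rate values below are placeholders, not the exact values we swept:

```python
import highway_env  # noqa: F401
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env("highway-fast-v0", n_envs=4)  # 4 parallel environments

model = PPO(
    "MlpPolicy",
    vec_env,
    n_steps=256,        # rollout length per env (one of the knobs we varied)
    ent_coef=0.01,      # entropy bonus
    learning_rate=3e-4,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```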
DQN won because the action space is small (5 choices) and the replay buffer lets it reuse past experience many times, whereas PPO, being on-policy, discards each rollout after updating on it. PPO is built for harder problems such as continuous control; for 5 discrete actions on a highway, it was overkill.
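The two ingredients credited here, experience replay and the lowered discount, boil down to a few lines; the buffer and batch sizes below are assumptions, not the report's values:

```python
import random
from collections import deque
import torch

GAMMA = 0.8                    # the lowered discount factor
replay = deque(maxlen=15_000)  # transitions get reused across many updates

def sample(batch_size: int = 32):
    """Draw a random minibatch of (obs, action, reward, next_obs, done) tuples."""
    batch = random.sample(replay, batch_size)
    return [torch.stack(field) for field in zip(*batch)]

def td_target(rewards, next_obs, dones, target_net):
    """Standard DQN target: r + gamma * max_a' Q_target(s', a') for non-terminal s'."""
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
    return rewards + GAMMA * next_q * (1.0 - dones)
```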
What we took away from this
Gamma matters most
Lowering it from 0.99 to 0.8 improved the score by ~4 points and completely changed the driving behavior.
Simple wins
Double DQN scored worse. PPO scored worse. The plain DQN with the right hyperparameters won.
Context matters
On the racetrack bonus (continuous actions), PPO was the right call; DQN can't handle a continuous action space at all. The right algorithm depends on the problem (a minimal sketch follows below).
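A minimal sketch of that bonus setup, assuming highway-env's racetrack-v0 with its default continuous action space; the training budget is illustrative:

```python
import highway_env  # noqa: F401
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("racetrack-v0")            # continuous (Box) action space
model = PPO("MlpPolicy", env, verbose=1)  # DQN would reject this action space
model.learn(total_timesteps=100_000)      # illustrative budget, not the report's
```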
Project Report
Full write-up with both implementations, hyperparameter experiments, training curves, and the final comparison.