# Solving the Cliff Walking Problem: RL Algorithms Comparison 🧗
A comparative study of Q-Learning, SARSA, and Monte Carlo methods in the classic "Cliff Walking" environment.
## The Challenge
The Cliff Walking environment (from Gymnasium) is a grid world where the agent must travel from a Start cell to a Goal cell while avoiding a cliff that results in an instant reset to the start.
I implemented and compared three fundamental Reinforcement Learning algorithms on a stochastic variant of this problem (a 10% probability that a random action replaces the chosen one).
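For illustration, here is a minimal sketch of what one noisy step can look like with Gymnasium's `CliffWalking-v0`; the manual 10% action noise and the variable names are assumptions made for this example, not necessarily how the repository wires it up.

```python
# Minimal sketch (assumed setup): Gymnasium's CliffWalking-v0 with
# 10% action noise applied manually on top of the agent's choice.
import gymnasium as gym
import numpy as np

env = gym.make("CliffWalking-v0")
rng = np.random.default_rng(0)

state, _ = env.reset()
chosen_action = 0  # placeholder for the agent's greedy / eps-greedy choice

# With 10% probability, a random action overrides the intended one.
if rng.random() < 0.10:
    action = env.action_space.sample()
else:
    action = chosen_action

next_state, reward, terminated, truncated, _ = env.step(action)
```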
## The Algorithms
- Q-Learning (Off-Policy): Learns the optimal policy directly. It tends to take the riskiest (shortest) path right along the cliff edge because it assumes optimal action selection.
- SARSA (On-Policy): Learns the "safe" path. Since it accounts for the ε-greedy policy (random exploration), it keeps a buffer zone away from the cliff to avoid falling due to random moves.
- Monte Carlo (First-Visit): Learns from complete episode returns. It converges eventually, but suffers from high variance in this specific environment.
```python
# src/algorithms.py
import numpy as np

def update_q_learning(self, state, action, reward, next_state):
    # Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
    best_next_action = np.argmax(self.q_table[next_state])
    td_target = reward + self.gamma * self.q_table[next_state][best_next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])
```
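For comparison, here is a hedged sketch of what the corresponding on-policy SARSA update could look like; the method name and attributes simply mirror the Q-Learning snippet above and are assumptions, not the repository's exact code.

```python
# Sketch of the SARSA update (assumed signature, mirroring update_q_learning).
# Q(s,a) <- Q(s,a) + alpha * (R + gamma * Q(s',a') - Q(s,a))
def update_sarsa(self, state, action, reward, next_state, next_action):
    # next_action is the action the eps-greedy policy actually takes in s',
    # so the target includes exploration noise -- which is why SARSA learns
    # the safer path away from the cliff edge.
    td_target = reward + self.gamma * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])
```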
## Results
The experiments revealed a clear trade-off between optimality and safety. SARSA learned a safer path (keeping a distance from the cliff), while Q-Learning learned the optimal but risky path.
| Algorithm | Success Rate | Training Time | Optimal α |
|---|---|---|---|
| SARSA | 95.6% | 1.7s | 0.1 |
| Q-Learning | 95.4% | 2.1s | 0.1 |
| Monte Carlo | 66.4% | 17.2s | 0.01 |
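One way such a success rate can be measured (this is an assumption about the evaluation protocol, not a description of the repository's code) is to run greedy rollouts of the learned Q-table and count the fraction of episodes that reach the goal:

```python
# Assumed evaluation protocol: greedy rollouts of a learned q_table,
# counting the fraction of episodes that reach the goal.
import numpy as np

def evaluate_success_rate(env, q_table, episodes=500, max_steps=200):
    successes = 0
    for _ in range(episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            action = int(np.argmax(q_table[state]))
            state, reward, terminated, truncated, _ = env.step(action)
            if terminated:  # CliffWalking-v0 terminates only at the goal
                successes += 1
                break
            if truncated:
                break
    return successes / episodes
```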
## Key Findings
### Why TD > Monte Carlo?
Temporal Difference methods (SARSA/Q-Learning) propagate value information step by step, which makes them far more efficient in this environment, where a single mistake (stepping off the cliff) ruins the episode. Monte Carlo's high variance makes it slow to converge.
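To make the contrast concrete, here is a minimal sketch of a first-visit Monte Carlo update from one complete episode (the method name and the `returns_count` attribute are illustrative assumptions): the whole discounted return G is backed up at once, so a single fall off the cliff swings the targets of every visited state-action pair.

```python
# Sketch of a first-visit Monte Carlo update (assumed names, not the repo's code).
def update_monte_carlo(self, episode):
    # episode: list of (state, action, reward) tuples from start to terminal state
    g = 0.0
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        g = reward + self.gamma * g  # discounted return from step t onward
        # First-visit check: update only if (state, action) was not seen earlier.
        if not any(s == state and a == action for s, a, _ in episode[:t]):
            self.returns_count[(state, action)] = self.returns_count.get((state, action), 0) + 1
            n = self.returns_count[(state, action)]
            # Incremental average of the observed returns for (state, action).
            self.q_table[state][action] += (g - self.q_table[state][action]) / n
```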
Check out the full comparison and code in the repository.