# Solving the Cliff Walking Problem: RL Algorithms Comparison 🧗
A comparative study of Q-Learning, SARSA, and Monte Carlo methods in the classic "Cliff Walking" environment.
## The Challenge
The Cliff Walking environment (from Gymnasium) is a grid world where the agent must travel from a Start cell to a Goal cell while avoiding a cliff that results in an instant reset to the start.
I implemented and compared three fundamental Reinforcement Learning algorithms on a stochastic variant of this problem (a 10% probability that a random action replaces the chosen one).
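For illustration, here is a minimal sketch of what one noisy step can look like with Gymnasium's `CliffWalking-v0`; the manual 10% action noise and the variable names are assumptions made for this example, not necessarily how the repository wires it up.

```python
# Minimal sketch (assumed setup): Gymnasium's CliffWalking-v0 with
# 10% action noise applied manually on top of the agent's choice.
import gymnasium as gym
import numpy as np

env = gym.make("CliffWalking-v0")
rng = np.random.default_rng(0)

state, _ = env.reset()
chosen_action = 0  # placeholder for the agent's greedy / eps-greedy choice

# With 10% probability, a random action overrides the intended one.
if rng.random() < 0.10:
    action = env.action_space.sample()
else:
    action = chosen_action

next_state, reward, terminated, truncated, _ = env.step(action)
```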
## The Algorithms
- Q-Learning (Off-Policy): Learns the optimal policy directly. It tends to take the riskiest (shortest) path right along the cliff edge because it assumes optimal action selection.
- SARSA (On-Policy): Learns the "safe" path. Since it accounts for the ε-greedy policy (random exploration), it keeps a buffer zone away from the cliff to avoid falling due to random moves.
- Monte Carlo (First-Visit): Learns from complete episode returns. It converges eventually, but suffers from high variance in this specific environment.
```python
# src/algorithms.py
import numpy as np

def update_q_learning(self, state, action, reward, next_state):
    # Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
    best_next_action = np.argmax(self.q_table[next_state])
    td_target = reward + self.gamma * self.q_table[next_state][best_next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])
```
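For comparison, here is a hedged sketch of what the corresponding on-policy SARSA update could look like; the method name and attributes simply mirror the Q-Learning snippet above and are assumptions, not the repository's exact code.

```python
# Sketch of the SARSA update (assumed signature, mirroring update_q_learning).
# Q(s,a) <- Q(s,a) + alpha * (R + gamma * Q(s',a') - Q(s,a))
def update_sarsa(self, state, action, reward, next_state, next_action):
    # next_action is the action the eps-greedy policy actually takes in s',
    # so the target includes exploration noise -- which is why SARSA learns
    # the safer path away from the cliff edge.
    td_target = reward + self.gamma * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])
```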
## Results
The experiments revealed a clear trade-off between optimality and safety. SARSA learned a safer path (keeping a distance from the cliff), while Q-Learning learned the optimal but risky path.
| Algorithm | Success Rate | Training Time | Optimal α |
|---|---|---|---|
| SARSA | 95.6% | 1.7s | 0.1 |
| Q-Learning | 95.4% | 2.1s | 0.1 |
| Monte Carlo | 66.4% | 17.2s | 0.01 |
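One way such a success rate can be measured (this is an assumption about the evaluation protocol, not a description of the repository's code) is to run greedy rollouts of the learned Q-table and count the fraction of episodes that reach the goal:

```python
# Assumed evaluation protocol: greedy rollouts of a learned q_table,
# counting the fraction of episodes that reach the goal.
import numpy as np

def evaluate_success_rate(env, q_table, episodes=500, max_steps=200):
    successes = 0
    for _ in range(episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            action = int(np.argmax(q_table[state]))
            state, reward, terminated, truncated, _ = env.step(action)
            if terminated:  # CliffWalking-v0 terminates only at the goal
                successes += 1
                break
            if truncated:
                break
    return successes / episodes
```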
## Key Findings
### Why TD > Monte Carlo?
Temporal Difference methods (SARSA/Q-Learning) propagate value information step by step, which makes them far more efficient in this environment, where a single mistake (stepping off the cliff) ruins the episode. Monte Carlo's high variance makes it slow to converge.
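To make the contrast concrete, here is a minimal sketch of a first-visit Monte Carlo update from one complete episode (the method name and the `returns_count` attribute are illustrative assumptions): the whole discounted return G is backed up at once, so a single fall off the cliff swings the targets of every visited state-action pair.

```python
# Sketch of a first-visit Monte Carlo update (assumed names, not the repo's code).
def update_monte_carlo(self, episode):
    # episode: list of (state, action, reward) tuples from start to terminal state
    g = 0.0
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        g = reward + self.gamma * g  # discounted return from step t onward
        # First-visit check: update only if (state, action) was not seen earlier.
        if not any(s == state and a == action for s, a, _ in episode[:t]):
            self.returns_count[(state, action)] = self.returns_count.get((state, action), 0) + 1
            n = self.returns_count[(state, action)]
            # Incremental average of the observed returns for (state, action).
            self.q_table[state][action] += (g - self.q_table[state][action]) / n
```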
Check out the full comparison and code in the repository.