Dec 18, 2025 • Entry #006

Solving the Cliff Walking Problem: RL Algorithms Comparison 🧗

A comparative study of Q-Learning, SARSA, and Monte Carlo methods in the classic "Cliff Walking" environment.

The Challenge

The Cliff Walking environment (from Gymnasium) is a grid world where the agent must travel from a Start cell to a Goal cell while avoiding a cliff; stepping into the cliff results in a large penalty and an instant reset to the start.

I implemented and compared three fundamental Reinforcement Learning algorithms on a stochastic variant of this problem (a 10% probability that the intended action is replaced by a random one).
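For reference, here is a minimal sketch of how such a stochastic variant can be set up on top of Gymnasium's CliffWalking-v0. The RandomActionWrapper below is a hypothetical helper used for illustration; the repository may inject the 10% action noise differently.

import gymnasium as gym
import numpy as np

class RandomActionWrapper(gym.Wrapper):
    """With probability `noise`, replace the agent's action with a random one."""

    def __init__(self, env, noise=0.1):
        super().__init__(env)
        self.noise = noise

    def step(self, action):
        # 10% of the time the intended action is discarded and a random one is taken.
        if np.random.random() < self.noise:
            action = self.env.action_space.sample()
        return self.env.step(action)

env = RandomActionWrapper(gym.make("CliffWalking-v0"), noise=0.1)
state, info = env.reset()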

The Algorithms

$/ src/algorithms.py

def update_q_learning(self, state, action, reward, next_state):
    # Off-policy TD update:
    # Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
    best_next_action = np.argmax(self.q_table[next_state])
    td_target = reward + self.gamma * self.q_table[next_state][best_next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])
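
For contrast, SARSA's on-policy update bootstraps from the action actually taken next rather than the greedy one. A minimal sketch, assuming the same agent attributes (q_table, alpha, gamma) as above:

def update_sarsa(self, state, action, reward, next_state, next_action):
    # On-policy TD update:
    # Q(s,a) <- Q(s,a) + alpha * (R + gamma * Q(s',a') - Q(s,a))
    # next_action comes from the behaviour policy (e.g. epsilon-greedy), so
    # exploratory slips near the cliff lower the value of risky states.
    td_target = reward + self.gamma * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])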

Results

The experiments revealed a clear trade-off between optimality and safety. SARSA learned a safer path, keeping its distance from the cliff, while Q-Learning learned the shorter, optimal path that hugs the cliff edge and is therefore riskier under the 10% action noise (the ε-greedy sketch after the table illustrates why).

Algorithm     Success Rate   Training Time   Optimal α
SARSA         95.6%          1.7 s           0.1
Q-Learning    95.4%          2.1 s           0.1
Monte Carlo   66.4%          17.2 s          0.01
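
Assuming both TD agents explore with an ε-greedy policy (the standard setup for this comparison), the mechanism behind SARSA's caution is that its target uses the action actually sampled from that policy, so the occasional exploratory step off the edge is priced into the values of cliff-adjacent states. A minimal sketch of the selection rule, assuming a fixed ε and a q_table stored as an (n_states, n_actions) NumPy array:

def select_action(self, state, epsilon=0.1):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if np.random.random() < epsilon:
        return np.random.randint(self.q_table.shape[1])  # random action
    return int(np.argmax(self.q_table[state]))           # greedy action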

Key Findings

Why TD > Monte Carlo?

Temporal Difference methods (SARSA/Q-Learning) bootstrap from their current estimates and propagate knowledge step by step, which makes them far more efficient in an environment where a single mistake (falling off the cliff) ruins the episode. Monte Carlo only updates from complete episode returns, so its high variance makes convergence much slower.
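
For comparison, a minimal sketch of a constant-α, every-visit Monte Carlo update (the repository's exact variant may differ). Nothing is learned until the episode ends, and every visited state-action pair is pulled toward the same noisy return G, which is where the variance comes from:

def update_monte_carlo(self, episode):
    # episode: list of (state, action, reward) tuples from one finished rollout.
    G = 0.0
    for state, action, reward in reversed(episode):
        # Accumulate the discounted return from the end of the episode backwards.
        G = reward + self.gamma * G
        # One fall off the cliff contaminates every update in the episode.
        self.q_table[state][action] += self.alpha * (G - self.q_table[state][action])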

Check out the full comparison and code in the repository.

View Repository