Gradient Descent Optimizers: SGD, Adam, RMSprop Complete Comparison
“The choice of optimizer can make or break your model’s training.” — Sebastian Ruder, Research Scientist at Google
Choosing the right optimizer is one of the most impactful decisions in deep learning. The same model architecture can converge quickly or fail to train entirely based on your optimizer choice.
In this comprehensive guide, you’ll understand every major optimizer, when to use each one, and how to tune their hyperparameters for optimal results.
Why Do Optimizers Matter?
The Optimization Problem
Training a neural network means finding parameters θ that minimize the loss:
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta)$$
The challenge: modern networks have millions of parameters, and the loss landscape is highly non-convex with many local minima, saddle points, and flat regions.
What Makes a Good Optimizer?
| Property | Importance |
|---|---|
| Convergence speed | Reach good solution quickly |
| Stability | Don’t oscillate or diverge |
| Generalization | Find solutions that work on test data |
| Memory efficiency | Don’t require too much RAM |
| Hyperparameter sensitivity | Work well without extensive tuning |
How Does Vanilla Gradient Descent Work?
The Basic Update Rule
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$
Where:
- $\theta_t$ = parameters at step t
- $\eta$ = learning rate
- $\nabla_\theta \mathcal{L}$ = gradient of loss
import numpy as np
import matplotlib.pyplot as plt
class VanillaGradientDescent:
"""Basic gradient descent optimizer."""
def __init__(self, lr=0.01):
self.lr = lr
def update(self, params, grads):
"""Update parameters."""
for i in range(len(params)):
params[i] -= self.lr * grads[i]
return params
# Demonstrate on simple 2D loss function
def rosenbrock(x, y):
"""Rosenbrock function - challenging optimization landscape."""
return (1 - x)**2 + 100 * (y - x**2)**2
def rosenbrock_gradient(x, y):
"""Gradient of Rosenbrock function."""
dx = -2 * (1 - x) - 400 * x * (y - x**2)
dy = 200 * (y - x**2)
return np.array([dx, dy])
def optimize_and_visualize(optimizer, start, n_steps=1000):
"""Run optimizer and return path."""
path = [start.copy()]
params = start.copy()
for _ in range(n_steps):
grads = rosenbrock_gradient(params[0], params[1])
params = np.array(optimizer.update([params], [grads])[0])
path.append(params.copy())
# Early stopping if converged
if np.linalg.norm(grads) < 1e-8:
break
return np.array(path)
# Test vanilla GD
start = np.array([-1.5, 2.0])
gd = VanillaGradientDescent(lr=0.001)
path_gd = optimize_and_visualize(gd, start, n_steps=5000)
print(f"Vanilla GD:")
print(f" Start: {start}")
print(f" End: {path_gd[-1]}")
print(f" Steps: {len(path_gd)}")
print(f" Final loss: {rosenbrock(path_gd[-1][0], path_gd[-1][1]):.6f}")
Problems with Vanilla Gradient Descent
- Uniform learning rate: Same rate for all parameters
- Sensitive to LR choice: Too high = diverge, too low = slow (see the sketch after this list)
- Stuck in saddle points: Zero gradient = no progress
- Oscillation in ravines: Bounces between steep walls
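To see the learning-rate sensitivity concretely, here is a minimal 1-D sketch (a hypothetical example, not part of the demo above) on $f(x) = x^2$: the gradient is $2x$, so each update multiplies $x$ by $(1 - 2\eta)$, and any $\eta > 1$ makes the iterate grow without bound.

# Gradient descent on f(x) = x^2: each step computes x <- (1 - 2*lr) * x
for lr in [1.1, 0.01, 0.4]:
    x = 1.0
    for _ in range(20):
        x -= lr * 2 * x  # x minus lr times the gradient 2x
    print(f"lr={lr:<5}: x after 20 steps = {x:.3e}")
# lr=1.1 diverges, lr=0.01 barely moves, lr=0.4 converges fast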
What Is Stochastic Gradient Descent (SGD)?
From Batch to Stochastic
Batch GD: Uses all data to compute gradient $$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t; X, y)$$
Stochastic GD: Uses one sample $$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$$
Mini-batch GD: Uses a batch of samples $$\theta_{t+1} = \theta_t - \eta \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$$
class SGD:
"""Stochastic Gradient Descent optimizer."""
def __init__(self, lr=0.01):
self.lr = lr
def update(self, params, grads):
"""
Update parameters.
params: list of parameter arrays
grads: list of gradient arrays
"""
updated_params = []
for param, grad in zip(params, grads):
updated_params.append(param - self.lr * grad)
return updated_params
# Training loop with mini-batch SGD
def train_with_sgd(model, X, y, batch_size=32, epochs=100, lr=0.01):
"""Train model using mini-batch SGD."""
optimizer = SGD(lr=lr)
n_samples = len(X)
losses = []
for epoch in range(epochs):
# Shuffle data
indices = np.random.permutation(n_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
epoch_loss = 0
n_batches = 0
for i in range(0, n_samples, batch_size):
X_batch = X_shuffled[i:i+batch_size]
y_batch = y_shuffled[i:i+batch_size]
# Forward pass
y_pred = model.forward(X_batch)
loss = model.compute_loss(y_pred, y_batch)
# Backward pass
grads = model.backward()
# Update
model.params = optimizer.update(model.params, grads)
epoch_loss += loss
n_batches += 1
losses.append(epoch_loss / n_batches)
return losses
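The training loop above assumes a model object that exposes `forward`, `compute_loss`, `backward`, and a `params` list. Here is a minimal linear-regression model satisfying that interface so the loop runs end to end; `LinearModel` is a hypothetical helper written for this example, not part of any framework.

class LinearModel:
    """Minimal linear regression with the interface train_with_sgd expects."""
    def __init__(self, n_features):
        self.params = [np.zeros(n_features), np.zeros(1)]  # [weights, bias]

    def forward(self, X):
        self.X = X  # cache the batch for the backward pass
        return X @ self.params[0] + self.params[1]

    def compute_loss(self, y_pred, y):
        self.error = y_pred - y  # cache residuals for the backward pass
        return np.mean(self.error ** 2)

    def backward(self):
        n = len(self.error)
        grad_w = 2 * self.X.T @ self.error / n
        grad_b = np.array([2 * np.mean(self.error)])
        return [grad_w, grad_b]

# Fit y = 3*x0 - 2*x1 + noise
X = np.random.randn(500, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(500)
model = LinearModel(n_features=2)
losses = train_with_sgd(model, X, y, batch_size=32, epochs=50, lr=0.05)
print(f"Learned weights: {model.params[0]}, final loss: {losses[-1]:.4f}")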
Advantages of SGD
| Advantage | Explanation |
|---|---|
| Memory efficient | Only need one batch in memory |
| Faster iterations | Each update is cheap |
| Escapes local minima | Noise helps exploration |
| Online learning | Can learn from streaming data |
How Does Momentum Improve SGD?
The Physics Analogy
Think of a ball rolling down a hill. With momentum, the ball accumulates velocity and can roll through small bumps and flat regions.
$$v_t = \gamma v_{t-1} + \eta \nabla_\theta \mathcal{L}(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$
Where $\gamma$ (typically 0.9) is the momentum coefficient.
class SGDMomentum:
"""SGD with momentum."""
def __init__(self, lr=0.01, momentum=0.9):
self.lr = lr
self.momentum = momentum
self.velocity = None
def update(self, params, grads):
"""Update with momentum."""
if self.velocity is None:
self.velocity = [np.zeros_like(p) for p in params]
updated_params = []
for i, (param, grad) in enumerate(zip(params, grads)):
self.velocity[i] = self.momentum * self.velocity[i] + self.lr * grad
updated_params.append(param - self.velocity[i])
return updated_params
# Nesterov Accelerated Gradient (NAG)
class SGDNesterov:
"""SGD with Nesterov momentum - look ahead before stepping."""
def __init__(self, lr=0.01, momentum=0.9):
self.lr = lr
self.momentum = momentum
self.velocity = None
def update(self, params, grads):
"""
Nesterov update: compute gradient at looked-ahead position.
v_t = γ * v_{t-1} + η * ∇L(θ - γ * v_{t-1})
θ_{t+1} = θ_t - v_t
"""
if self.velocity is None:
self.velocity = [np.zeros_like(p) for p in params]
updated_params = []
        for i, (param, grad) in enumerate(zip(params, grads)):
            # Update velocity first...
            self.velocity[i] = self.momentum * self.velocity[i] + self.lr * grad
            # ...then step using the *updated* velocity plus the raw gradient:
            # θ_{t+1} = θ_t - γ·v_t - η·g_t, the standard simplified NAG form
            updated_params.append(param - self.momentum * self.velocity[i] - self.lr * grad)
return updated_params
Visualizing Momentum Effect
import numpy as np
import matplotlib.pyplot as plt
def compare_momentum(n_steps=100):
"""Compare SGD with and without momentum."""
# Simple quadratic loss with different curvatures
def loss_grad(x, y):
# Loss = x² + 10*y² (steep in y direction)
return np.array([2*x, 20*y])
# SGD without momentum
start = np.array([3.0, 3.0])
path_sgd = [start.copy()]
pos = start.copy()
lr = 0.1
for _ in range(n_steps):
grad = loss_grad(pos[0], pos[1])
pos = pos - lr * grad
path_sgd.append(pos.copy())
# SGD with momentum
path_momentum = [start.copy()]
pos = start.copy()
velocity = np.zeros(2)
momentum = 0.9
for _ in range(n_steps):
grad = loss_grad(pos[0], pos[1])
velocity = momentum * velocity + lr * grad
pos = pos - velocity
path_momentum.append(pos.copy())
path_sgd = np.array(path_sgd)
path_momentum = np.array(path_momentum)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Contour plot
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + 10*Y**2
for ax, path, title in [(axes[0], path_sgd, 'SGD (no momentum)'),
(axes[1], path_momentum, 'SGD with Momentum')]:
ax.contour(X, Y, Z, levels=20, cmap='viridis')
ax.plot(path[:, 0], path[:, 1], 'r.-', markersize=3, linewidth=0.5)
ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
ax.plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title(f'{title}\n{len(path)} steps, final pos: ({path[-1,0]:.4f}, {path[-1,1]:.4f})')
ax.legend()
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
plt.tight_layout()
plt.show()
return path_sgd, path_momentum
paths = compare_momentum()
What Makes AdaGrad Different?
Adaptive Learning Rates
Problem: Parameters tied to rare features (e.g., embeddings of rare words in NLP) receive few gradient updates, so they need larger steps than parameters tied to frequent features.
Solution: AdaGrad adapts the learning rate for each parameter based on historical gradients.
$$G_t = G_{t-1} + g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$
class AdaGrad:
"""AdaGrad optimizer - adaptive learning rates."""
def __init__(self, lr=0.01, epsilon=1e-8):
self.lr = lr
self.epsilon = epsilon
self.G = None # Accumulated squared gradients
def update(self, params, grads):
"""Update with adaptive learning rates."""
if self.G is None:
self.G = [np.zeros_like(p) for p in params]
updated_params = []
for i, (param, grad) in enumerate(zip(params, grads)):
# Accumulate squared gradients
self.G[i] += grad ** 2
# Adaptive learning rate
adjusted_lr = self.lr / (np.sqrt(self.G[i]) + self.epsilon)
updated_params.append(param - adjusted_lr * grad)
return updated_params
AdaGrad Pros and Cons
| Pros | Cons |
|---|---|
| Good for sparse gradients | Learning rate decays to zero |
| No manual LR tuning per param | Can stop learning prematurely |
| Works well for NLP | Not ideal for non-convex |
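You can see the decay problem with a quick back-of-the-envelope check (assuming, for illustration, a constant unit gradient): $G_t$ then equals $t$, so the effective step size $\eta/\sqrt{G_t}$ shrinks like $1/\sqrt{t}$ and never recovers.

# Effective AdaGrad step size under a constant gradient of 1 (so G_t = t)
lr = 0.1
for t in [1, 10, 100, 1000, 10000]:
    print(f"step {t:>5}: effective lr = {lr / np.sqrt(t):.5f}")
# The step size shrinks from 0.1 toward 0.001 -- updates become negligible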
How Does RMSprop Fix AdaGrad’s Issues?
Exponential Moving Average
RMSprop uses an exponentially decaying average instead of sum:
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$
class RMSprop:
"""RMSprop optimizer - fixes AdaGrad's decaying learning rate."""
def __init__(self, lr=0.001, rho=0.9, epsilon=1e-8):
self.lr = lr
self.rho = rho # Decay rate
self.epsilon = epsilon
self.E_g2 = None # Running average of squared gradients
def update(self, params, grads):
"""Update with RMSprop."""
if self.E_g2 is None:
self.E_g2 = [np.zeros_like(p) for p in params]
updated_params = []
for i, (param, grad) in enumerate(zip(params, grads)):
# Update running average
self.E_g2[i] = self.rho * self.E_g2[i] + (1 - self.rho) * grad ** 2
# Adaptive update
adjusted_lr = self.lr / (np.sqrt(self.E_g2[i]) + self.epsilon)
updated_params.append(param - adjusted_lr * grad)
return updated_params
RMSprop in Practice
# Typical RMSprop hyperparameters
rmsprop_config = {
'lr': 0.001, # Learning rate (smaller than SGD)
'rho': 0.9, # Decay rate (0.9 is almost always good)
'epsilon': 1e-8 # Numerical stability
}
# RMSprop works well for:
# - RNNs and LSTMs
# - Non-stationary objectives
# - Online learning
What Makes Adam the Most Popular Optimizer?
Best of Both Worlds
Adam (Adaptive Moment Estimation) combines momentum AND adaptive learning rates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$ (momentum) $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$ (adaptive)
With bias correction: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
class Adam:
"""Adam optimizer - adaptive learning rates with momentum."""
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = lr
self.beta1 = beta1 # Momentum decay
self.beta2 = beta2 # RMSprop decay
self.epsilon = epsilon
self.m = None # First moment (momentum)
self.v = None # Second moment (RMSprop)
self.t = 0 # Time step for bias correction
def update(self, params, grads):
"""Update with Adam."""
self.t += 1
if self.m is None:
self.m = [np.zeros_like(p) for p in params]
self.v = [np.zeros_like(p) for p in params]
updated_params = []
for i, (param, grad) in enumerate(zip(params, grads)):
# Update biased first moment estimate
self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
# Update biased second moment estimate
self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad ** 2
# Bias correction
m_hat = self.m[i] / (1 - self.beta1 ** self.t)
v_hat = self.v[i] / (1 - self.beta2 ** self.t)
# Update
updated_params.append(param - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon))
return updated_params
# Default Adam hyperparameters (rarely need to change)
adam_config = {
'lr': 0.001, # Learning rate
'beta1': 0.9, # Momentum decay
'beta2': 0.999, # RMSprop decay
'epsilon': 1e-8 # Numerical stability
}
Why Adam Works So Well
- Combines momentum and adaptivity: Best of SGD+Momentum and RMSprop
- Bias correction: Handles the cold-start problem (see the numeric check after this list)
- Works out of the box: Default hyperparameters work for most cases
- Fast convergence: Often reaches good solutions quickly
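To make the bias-correction point concrete, here is a small numeric check under an assumed constant gradient of 1: the raw moment $m_t$ starts near zero because it is initialized at zero, while the corrected $\hat{m}_t$ recovers the true mean from the very first step.

beta1, m = 0.9, 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0   # biased estimate, warms up slowly
    m_hat = m / (1 - beta1 ** t)        # bias-corrected estimate
    print(f"t={t}: m={m:.3f}  m_hat={m_hat:.3f}")
# m crawls: 0.100, 0.190, 0.271, ... while m_hat is exactly 1.000 every step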
Adam Variants
class AdamW:
"""AdamW - Adam with decoupled weight decay (better regularization)."""
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.weight_decay = weight_decay
self.m = None
self.v = None
self.t = 0
def update(self, params, grads):
"""Update with decoupled weight decay."""
self.t += 1
if self.m is None:
self.m = [np.zeros_like(p) for p in params]
self.v = [np.zeros_like(p) for p in params]
updated_params = []
for i, (param, grad) in enumerate(zip(params, grads)):
# Update moments
self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad ** 2
# Bias correction
m_hat = self.m[i] / (1 - self.beta1 ** self.t)
v_hat = self.v[i] / (1 - self.beta2 ** self.t)
# Update with decoupled weight decay
update = self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
update += self.lr * self.weight_decay * param # Decoupled!
updated_params.append(param - update)
return updated_params
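A quick sanity check of the decoupling, using the two classes defined above (a hypothetical zero-gradient demo): Adam makes no update when the gradient is zero, while AdamW still shrinks the weights through the decay term.

w_adam = [np.array([1.0, -2.0])]
w_adamw = [np.array([1.0, -2.0])]
zero_grad = [np.zeros(2)]

adam = Adam(lr=0.1)
adamw = AdamW(lr=0.1, weight_decay=0.1)
for _ in range(10):
    w_adam = adam.update(w_adam, zero_grad)
    w_adamw = adamw.update(w_adamw, zero_grad)

print(f"Adam after 10 zero-gradient steps:  {w_adam[0]}")   # unchanged
print(f"AdamW after 10 zero-gradient steps: {w_adamw[0]}")  # scaled by (1 - lr*wd)^10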
Complete Optimizer Comparison
Side-by-Side Comparison
import numpy as np
import matplotlib.pyplot as plt
def compare_all_optimizers():
"""Compare all optimizers on Rosenbrock function."""
def loss_and_grad(pos):
x, y = pos
loss = (1 - x)**2 + 100 * (y - x**2)**2
dx = -2 * (1 - x) - 400 * x * (y - x**2)
dy = 200 * (y - x**2)
return loss, np.array([dx, dy])
optimizers = {
'SGD (lr=0.0001)': SGD(lr=0.0001),
'SGD+Momentum': SGDMomentum(lr=0.0001, momentum=0.9),
'AdaGrad': AdaGrad(lr=0.5),
'RMSprop': RMSprop(lr=0.01),
'Adam': Adam(lr=0.01)
}
n_steps = 5000
start = np.array([-1.5, 2.0])
results = {}
for name, optimizer in optimizers.items():
pos = start.copy()
losses = []
path = [pos.copy()]
for _ in range(n_steps):
loss, grad = loss_and_grad(pos)
losses.append(loss)
pos = np.array(optimizer.update([pos], [grad])[0])
path.append(pos.copy())
# Early stopping
if loss < 1e-10:
break
results[name] = {
'losses': losses,
'path': np.array(path),
'final_loss': losses[-1],
'final_pos': pos
}
# Plot results
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Loss curves
ax = axes[0, 0]
for name, result in results.items():
ax.semilogy(result['losses'], label=name)
ax.set_xlabel('Step')
ax.set_ylabel('Loss (log scale)')
ax.set_title('Convergence Comparison')
ax.legend()
ax.grid(True)
# Paths
x = np.linspace(-2, 2, 100)
y = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100 * (Y - X**2)**2
for idx, (name, result) in enumerate(results.items()):
if idx < 5:
ax = axes[(idx+1)//3, (idx+1)%3]
ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis')
path = result['path'][:500] # First 500 steps
ax.plot(path[:, 0], path[:, 1], 'r.-', markersize=1, linewidth=0.5)
ax.plot(path[0, 0], path[0, 1], 'go', markersize=8)
ax.plot(1, 1, 'r*', markersize=15) # Optimum
ax.set_title(f"{name}\nFinal loss: {result['final_loss']:.2e}")
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()
# Summary table
print("\nOptimizer Comparison Summary")
print("=" * 60)
print(f"{'Optimizer':<20} {'Final Loss':<15} {'Final Position'}")
print("-" * 60)
for name, result in results.items():
pos = result['final_pos']
print(f"{name:<20} {result['final_loss']:<15.2e} ({pos[0]:.4f}, {pos[1]:.4f})")
return results
results = compare_all_optimizers()
Optimizer Selection Guide
| Scenario | Recommended Optimizer | Why |
|---|---|---|
| Default choice | Adam | Works well in most cases |
| Computer vision | SGD + Momentum | Better generalization |
| NLP / Transformers | AdamW | Handles weight decay properly |
| RNNs / LSTMs | RMSprop | Handles non-stationary well |
| Sparse gradients | AdaGrad | Per-parameter learning rates |
| Final fine-tuning | SGD (low LR) | Often reaches lower final loss |
| Limited memory | SGD | Minimal overhead |
Hyperparameter Tuning Guide
Learning Rate
The most important hyperparameter across all optimizers:
def learning_rate_finder(model, train_loader, start_lr=1e-7, end_lr=10, steps=100):
"""
Find good learning rate range using LR finder technique.
1. Train with exponentially increasing LR
2. Plot loss vs LR
3. Choose LR where loss is decreasing fastest
"""
    import copy
    import torch.optim as optim

    # Save initial state so it can be restored afterwards (assumes a PyTorch model)
    initial_state = copy.deepcopy(model.state_dict())
    # Multiplier that sweeps the LR from start_lr to end_lr in `steps` steps
    lr_mult = (end_lr / start_lr) ** (1 / steps)
    lr = start_lr
    losses = []
    lrs = []
    optimizer = optim.Adam(model.parameters(), lr=lr)
    data_iter = iter(train_loader)
    for step in range(steps):
        # Get the next batch, restarting the loader when it is exhausted
        try:
            batch_x, batch_y = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            batch_x, batch_y = next(data_iter)
        # Forward and backward (train_step is an assumed helper that runs one
        # optimizer step and returns the batch loss)
        loss = train_step(model, batch_x, batch_y, optimizer)
        losses.append(loss)
        lrs.append(lr)
        # Increase the learning rate exponentially
        lr *= lr_mult
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        # Stop if the loss explodes
        if loss > 4 * min(losses):
            break
# Restore initial state
model.load_state_dict(initial_state)
# Plot
plt.figure(figsize=(10, 5))
plt.semilogx(lrs, losses)
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.title('Learning Rate Finder')
plt.grid(True)
plt.show()
# Suggest LR (where loss is decreasing fastest)
gradients = np.gradient(losses)
suggested_idx = np.argmin(gradients)
suggested_lr = lrs[suggested_idx]
print(f"Suggested learning rate: {suggested_lr:.2e}")
return lrs, losses
Recommended Hyperparameters by Optimizer
# SGD
sgd_config = {
'lr': 0.1, # Start higher, use scheduler
'momentum': 0.9, # Almost always 0.9
'weight_decay': 1e-4 # L2 regularization
}
# Adam
adam_config = {
'lr': 0.001, # Default, rarely needs change
'betas': (0.9, 0.999), # (β1, β2)
'epsilon': 1e-8, # Numerical stability
'weight_decay': 0 # Use AdamW for regularization
}
# AdamW (preferred for transformers)
adamw_config = {
'lr': 1e-4, # Smaller for fine-tuning
'betas': (0.9, 0.999),
'epsilon': 1e-8,
'weight_decay': 0.01 # Decoupled weight decay
}
# RMSprop
rmsprop_config = {
'lr': 0.001,
'alpha': 0.99, # Decay rate (called rho in some implementations)
'epsilon': 1e-8
}
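The SGD comment above says to pair a higher initial LR with a scheduler. Here is a minimal PyTorch sketch of that recipe using cosine annealing; the dummy parameter stands in for a real model's parameters, purely for illustration.

import torch
import torch.optim as optim

w = torch.nn.Parameter(torch.zeros(10))  # stand-in for model.parameters()
optimizer = optim.SGD([w], lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... run one training epoch with `optimizer` here ...
    scheduler.step()  # decay the LR along a cosine curve, once per epoch
    if epoch % 25 == 0:
        print(f"epoch {epoch:>3}: lr = {optimizer.param_groups[0]['lr']:.4f}")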
Real-World Training Example
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Create a simple classification model
class SimpleClassifier(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.BatchNorm1d(hidden_dim),
nn.Dropout(0.3),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.BatchNorm1d(hidden_dim),
nn.Dropout(0.3),
nn.Linear(hidden_dim, output_dim)
)
def forward(self, x):
return self.net(x)
# Training function with different optimizers
def train_and_compare(optimizers_config, model_class, X_train, y_train, X_val, y_val,
epochs=100, batch_size=32):
"""
Train model with different optimizers and compare results.
"""
# Create data loaders
train_dataset = TensorDataset(torch.FloatTensor(X_train), torch.LongTensor(y_train))
val_dataset = TensorDataset(torch.FloatTensor(X_val), torch.LongTensor(y_val))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
results = {}
for opt_name, opt_config in optimizers_config.items():
print(f"\nTraining with {opt_name}...")
# Create fresh model
model = model_class(X_train.shape[1], 64, len(np.unique(y_train)))
# Create optimizer
optimizer = opt_config['class'](model.parameters(), **opt_config['params'])
# Loss function
criterion = nn.CrossEntropyLoss()
# Training history
train_losses = []
val_accs = []
for epoch in range(epochs):
# Training
model.train()
epoch_loss = 0
for batch_x, batch_y in train_loader:
optimizer.zero_grad()
outputs = model(batch_x)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
train_losses.append(epoch_loss / len(train_loader))
# Validation
model.eval()
correct = 0
total = 0
with torch.no_grad():
for batch_x, batch_y in val_loader:
outputs = model(batch_x)
_, predicted = torch.max(outputs.data, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
val_accs.append(correct / total)
if (epoch + 1) % 20 == 0:
print(f" Epoch {epoch+1}: Loss={train_losses[-1]:.4f}, Val Acc={val_accs[-1]:.2%}")
results[opt_name] = {
'train_losses': train_losses,
'val_accs': val_accs,
'final_acc': val_accs[-1]
}
return results
# Example usage
optimizers_to_test = {
'SGD': {
'class': optim.SGD,
'params': {'lr': 0.1, 'momentum': 0.9}
},
'Adam': {
'class': optim.Adam,
'params': {'lr': 0.001}
},
'AdamW': {
'class': optim.AdamW,
'params': {'lr': 0.001, 'weight_decay': 0.01}
},
'RMSprop': {
'class': optim.RMSprop,
'params': {'lr': 0.001}
}
}
# Generate synthetic data
np.random.seed(42)
X_train = np.random.randn(1000, 20)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_val = np.random.randn(200, 20)
y_val = (X_val[:, 0] + X_val[:, 1] > 0).astype(int)
# results = train_and_compare(optimizers_to_test, SimpleClassifier,
# X_train, y_train, X_val, y_val)
FAQs
Which optimizer should I use for my project?
Start with Adam. It works well out of the box for most problems. If you need better generalization (especially for computer vision), try SGD with momentum. For transformers and language models, use AdamW.
Why does Adam sometimes generalize worse than SGD?
Adam can converge to sharper minima that generalize poorly. The adaptive learning rates can be too aggressive. Solutions:
- Use AdamW with proper weight decay
- Switch to SGD for final fine-tuning
- Use learning rate warmup (sketched below)
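A minimal warmup sketch (an illustrative recipe, not from the original text): ramp the learning rate linearly from near zero to its target over the first `warmup_steps` updates, then hold it constant.

def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear ramp from ~0 to base_lr over warmup_steps, then constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

for step in [0, 250, 500, 999, 5000]:
    print(f"step {step:>4}: lr = {warmup_lr(step):.2e}")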
How do I know if my learning rate is too high or too low?
- Too high: Loss oscillates wildly or increases
- Too low: Loss decreases very slowly
- Just right: Loss decreases smoothly then plateaus
Use a learning rate finder to identify the optimal range.
Should I use weight decay with Adam?
Use AdamW instead of Adam + L2 regularization. In Adam, weight decay is coupled with the adaptive learning rate, reducing its effectiveness. AdamW decouples weight decay for better regularization.
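In update-rule form, with weight-decay coefficient $\lambda$: Adam+L2 folds the decay into the gradient, where it gets rescaled by the adaptive denominator, while AdamW subtracts it from the weights directly:

$$\text{Adam + L2:} \quad g_t \leftarrow g_t + \lambda \theta_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

$$\text{AdamW:} \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t$$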
Key Takeaways
- SGD + Momentum often gives best generalization for vision tasks
- Adam is the best default choice for most problems
- AdamW is preferred for transformers and when using weight decay
- RMSprop works well for RNNs and online learning
- Learning rate is the most important hyperparameter to tune
- Combine optimizers: Use Adam to converge fast, then SGD to fine-tune (sketched below)
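A minimal PyTorch sketch of that last recipe (the stand-in model and the elided training loops are illustrative assumptions):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(20, 2)  # stand-in model, purely illustrative

# Phase 1: Adam for fast initial convergence
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# ... train most of the schedule with this optimizer ...

# Phase 2: rebuild the optimizer around the SAME parameters and fine-tune
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# ... continue training; SGD often settles into a flatter, better-generalizing minimum ...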
Next Steps
Continue learning about training optimization:
- Learning Rate Schedules - Warmup, cosine annealing, one-cycle
- Backpropagation Explained - How gradients are computed
- Regularization Techniques - Prevent overfitting
References
- Kingma, D. P., & Ba, J. “Adam: A Method for Stochastic Optimization” (ICLR 2015)
- Ruder, S. “An Overview of Gradient Descent Optimization Algorithms” (2016)
- Loshchilov, I., & Hutter, F. “Decoupled Weight Decay Regularization” (ICLR 2019) (introduces AdamW)
- Smith, L. N. “Cyclical Learning Rates for Training Neural Networks” (WACV 2017)
Last updated: January 2024. This guide is part of our Mathematics for Machine Learning series.