Backpropagation Explained: How Neural Networks Actually Learn
“Backpropagation is the key algorithm that makes deep learning work.” — Geoffrey Hinton, Turing Award Winner
Every time you train a neural network, backpropagation runs millions of times. It’s the algorithm that computes gradients efficiently, enabling neural networks to learn from data. Yet many practitioners treat it as a black box.
In this comprehensive guide, you’ll understand backpropagation from first principles—with mathematical derivations, visual explanations, and practical code implementations.
What Is Backpropagation?
The Core Idea
Backpropagation (backward propagation of errors) is an algorithm for computing the gradient of the loss function with respect to every parameter in a neural network.
It has two key insights:
- Chain Rule Application: Gradients propagate backwards through the computational graph
- Efficient Reuse: Each intermediate gradient is computed once and reused
Forward Pass: Input → Layer 1 → Layer 2 → ... → Layer L → Loss
Backward Pass: Input ← Layer 1 ← Layer 2 ← ... ← Layer L ← Loss
↑ gradients flow backwards
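To make the "efficient reuse" point concrete for a single pre-activation $z = wx + b$ (the same setup derived in detail later in this guide): the upstream gradient $\partial L/\partial z$ is computed once and then shared by every parameter feeding into $z$:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w} = \frac{\partial L}{\partial z}\cdot x, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial b} = \frac{\partial L}{\partial z}\cdot 1$$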
Why Do We Need Backpropagation?
Without backpropagation, we’d need to compute gradients using finite differences:
$$\frac{\partial L}{\partial w_i} \approx \frac{L(w_1, \dots, w_i + h, \dots, w_n) - L(w_1, \dots, w_i, \dots, w_n)}{h}$$
For a model with n parameters, this requires n+1 forward passes! Modern models have billions of parameters, so computing gradients this way is prohibitively expensive.
Backpropagation computes all gradients in just TWO passes: one forward, one backward.
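To see that cost concretely, here is a minimal finite-difference sketch (the helper name and toy quadratic loss are illustrative, not from any library); it needs one extra loss evaluation per parameter:
import numpy as np
def numerical_grad(loss_fn, params, h=1e-5):
    """Finite-difference gradient: one extra loss evaluation per parameter."""
    base = loss_fn(params)
    grad = np.zeros_like(params)
    for i in range(params.size):
        bumped = params.copy()
        bumped.flat[i] += h
        grad.flat[i] = (loss_fn(bumped) - base) / h
    return grad  # n parameters -> n + 1 loss evaluations in total
# Toy quadratic loss; the exact gradient is 2 * w.
w = np.array([1.0, -2.0, 3.0])
print(numerical_grad(lambda p: np.sum(p ** 2), w))  # ≈ [2, -4, 6]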
How Do Computational Graphs Work?
Building the Computation Graph
A computational graph represents the sequence of operations from input to output. Each node is an operation, each edge is data flowing between operations.
import numpy as np
# Example: y = (w * x + b)^2
# Computational graph:
#
# x ──┐
#     ├──> mul ──> add ──> square ──> y
# w ──┘             ↑
#                   b
class ComputationalGraph:
"""Simple computational graph for demonstration."""
def __init__(self):
self.graph = []
def forward(self, x, w, b):
"""Build graph during forward pass."""
# Clear previous graph
self.graph = []
# Node 1: multiplication
self.z1 = w * x
self.graph.append(('mul', x, w, self.z1))
# Node 2: addition
self.z2 = self.z1 + b
self.graph.append(('add', self.z1, b, self.z2))
# Node 3: square
self.y = self.z2 ** 2
self.graph.append(('square', self.z2, None, self.y))
return self.y
def backward(self, dy=1.0):
"""Compute gradients using backpropagation."""
# Start with gradient of output
grad = dy
# Traverse graph in reverse
for op, input1, input2, output in reversed(self.graph):
if op == 'square':
# d(x^2)/dx = 2x
grad = grad * 2 * input1
elif op == 'add':
# d(a+b)/da = 1, d(a+b)/db = 1
grad_z1 = grad * 1
self.grad_b = grad * 1
grad = grad_z1
elif op == 'mul':
# d(w*x)/dw = x, d(w*x)/dx = w
self.grad_w = grad * input1 # input1 is x
self.grad_x = grad * input2 # input2 is w
return self.grad_x, self.grad_w, self.grad_b
# Test
x, w, b = 2.0, 3.0, 1.0
graph = ComputationalGraph()
y = graph.forward(x, w, b)
print(f"Forward: y = (w*x + b)² = ({w}*{x} + {b})² = {y}")
grad_x, grad_w, grad_b = graph.backward()
print(f"\nBackward:")
print(f" ∂y/∂x = {grad_x}")
print(f" ∂y/∂w = {grad_w}")
print(f" ∂y/∂b = {grad_b}")
# Verify with analytical derivatives
# y = (wx + b)²
# dy/dw = 2(wx + b) * x = 2 * 7 * 2 = 28
# dy/dx = 2(wx + b) * w = 2 * 7 * 3 = 42
# dy/db = 2(wx + b) * 1 = 2 * 7 = 14
print(f"\nVerification:")
print(f" 2(wx+b)*x = 2*{w*x+b}*{x} = {2*(w*x+b)*x}")
print(f" 2(wx+b)*w = 2*{w*x+b}*{w} = {2*(w*x+b)*w}")
print(f" 2(wx+b)*1 = 2*{w*x+b}*1 = {2*(w*x+b)}")
Forward vs Backward Mode Differentiation
| Mode | Direction | Best For | Complexity |
|---|---|---|---|
| Forward | Input → Output | Few inputs, many outputs | O(n) forward passes for n inputs |
| Reverse (Backprop) | Output → Input | Many inputs, few outputs | O(m) backward passes for m outputs |
Neural networks have millions of parameters (inputs to the gradient function) and typically one scalar loss (output). Reverse mode is perfect!
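For contrast, here is a minimal forward-mode sketch built on dual numbers (the `Dual` class is purely illustrative). One forward pass carries the derivative with respect to a single seeded input, so getting all three gradients of y = (wx + b)² this way takes three passes:
class Dual:
    """Minimal dual number: carries a value and its derivative w.r.t. one seeded input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = self._wrap(other)
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__
    def __pow__(self, n):
        return Dual(self.val ** n, n * self.val ** (n - 1) * self.dot)
# One forward pass per seeded input: here we seed w, so y.dot is dy/dw only.
x, b = 2.0, 1.0
w = Dual(3.0, dot=1.0)
y = (w * x + b) ** 2
print(y.val, y.dot)   # 49.0 28.0 -> matches 2(wx + b) * x = 28
With three parameters that is a wash; with millions of parameters, forward mode would need millions of passes while reverse mode still needs a single backward pass.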
Step-by-Step Backpropagation Through a Network
Single Neuron Example
Let’s derive backpropagation for a single neuron with sigmoid activation:
Input: x
Parameters: w (weight), b (bias)
Forward: z = wx + b → a = σ(z) → L = (1/2)(a - y)²
import numpy as np
def sigmoid(z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(z):
s = sigmoid(z)
return s * (1 - s)
class SingleNeuron:
"""Single neuron with backpropagation."""
def __init__(self, w=0.5, b=0.1):
self.w = w
self.b = b
def forward(self, x, y_true):
"""
Forward pass storing intermediate values.
z = wx + b
a = σ(z)
L = (1/2)(a - y)²
"""
self.x = x
self.y_true = y_true
# Linear transformation
self.z = self.w * x + self.b
# Activation
self.a = sigmoid(self.z)
# Loss (MSE)
self.L = 0.5 * (self.a - y_true) ** 2
return self.a, self.L
def backward(self):
"""
Backward pass computing gradients.
dL/da = a - y
da/dz = σ'(z) = σ(z)(1 - σ(z))
dz/dw = x
dz/db = 1
Chain rule:
dL/dw = dL/da · da/dz · dz/dw
dL/db = dL/da · da/dz · dz/db
"""
# Output gradient
dL_da = self.a - self.y_true
# Sigmoid gradient
da_dz = sigmoid_derivative(self.z)
# Chain to z
dL_dz = dL_da * da_dz
# Parameter gradients
dL_dw = dL_dz * self.x
dL_db = dL_dz * 1
return dL_dw, dL_db
def update(self, lr=0.1):
"""Gradient descent update."""
dL_dw, dL_db = self.backward()
self.w -= lr * dL_dw
self.b -= lr * dL_db
return dL_dw, dL_db
# Training example
neuron = SingleNeuron(w=0.5, b=0.1)
x, y_true = 1.0, 1.0
print("Single Neuron Backpropagation")
print("=" * 50)
for step in range(10):
a, L = neuron.forward(x, y_true)
dL_dw, dL_db = neuron.update(lr=0.5)
if step % 2 == 0:
print(f"Step {step}: Loss={L:.6f}, a={a:.4f}, w={neuron.w:.4f}, b={neuron.b:.4f}")
Two-Layer Network Backpropagation
Now let’s scale up to a full two-layer network:
import numpy as np
class TwoLayerNetwork:
"""
Two-layer neural network with full backpropagation.
Architecture:
Input (n_in) → Hidden (n_hidden) → Output (n_out)
"""
def __init__(self, n_in, n_hidden, n_out):
# He initialization (scaled for ReLU)
self.W1 = np.random.randn(n_in, n_hidden) * np.sqrt(2.0 / n_in)
self.b1 = np.zeros((1, n_hidden))
self.W2 = np.random.randn(n_hidden, n_out) * np.sqrt(2.0 / n_hidden)
self.b2 = np.zeros((1, n_out))
def relu(self, z):
return np.maximum(0, z)
def relu_derivative(self, z):
return (z > 0).astype(float)
def softmax(self, z):
exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
return exp_z / np.sum(exp_z, axis=1, keepdims=True)
def forward(self, X):
"""
Forward pass.
Layer 1: z1 = X @ W1 + b1, a1 = relu(z1)
Layer 2: z2 = a1 @ W2 + b2, a2 = softmax(z2)
"""
self.X = X
self.batch_size = X.shape[0]
# Layer 1
self.z1 = X @ self.W1 + self.b1
self.a1 = self.relu(self.z1)
# Layer 2
self.z2 = self.a1 @ self.W2 + self.b2
self.a2 = self.softmax(self.z2)
return self.a2
def compute_loss(self, y_true):
"""Cross-entropy loss."""
# Add small epsilon to prevent log(0)
eps = 1e-10
self.y_true = y_true
loss = -np.mean(np.sum(y_true * np.log(self.a2 + eps), axis=1))
return loss
def backward(self):
"""
Backward pass.
For softmax + cross-entropy, the gradient simplifies to:
dL/dz2 = a2 - y_true
Then chain backwards:
dL/dW2 = a1.T @ dL/dz2
dL/db2 = sum(dL/dz2)
dL/da1 = dL/dz2 @ W2.T
dL/dz1 = dL/da1 * relu'(z1)
dL/dW1 = X.T @ dL/dz1
dL/db1 = sum(dL/dz1)
"""
# Output layer gradient
dL_dz2 = (self.a2 - self.y_true) / self.batch_size
# Layer 2 parameter gradients
self.dW2 = self.a1.T @ dL_dz2
self.db2 = np.sum(dL_dz2, axis=0, keepdims=True)
# Backprop to hidden layer
dL_da1 = dL_dz2 @ self.W2.T
dL_dz1 = dL_da1 * self.relu_derivative(self.z1)
# Layer 1 parameter gradients
self.dW1 = self.X.T @ dL_dz1
self.db1 = np.sum(dL_dz1, axis=0, keepdims=True)
return self.dW1, self.db1, self.dW2, self.db2
def update(self, lr=0.01):
"""Gradient descent update."""
self.W1 -= lr * self.dW1
self.b1 -= lr * self.db1
self.W2 -= lr * self.dW2
self.b2 -= lr * self.db2
def train_step(self, X, y, lr=0.01):
"""Complete training step."""
# Forward
self.forward(X)
loss = self.compute_loss(y)
# Backward
self.backward()
# Update
self.update(lr)
return loss
def predict(self, X):
"""Make predictions."""
return np.argmax(self.forward(X), axis=1)
# Test on synthetic data
np.random.seed(42)
# Generate spiral dataset
def generate_spiral_data(n_points, n_classes):
X = np.zeros((n_points * n_classes, 2))
y = np.zeros(n_points * n_classes, dtype=int)
for class_idx in range(n_classes):
ix = range(n_points * class_idx, n_points * (class_idx + 1))
r = np.linspace(0.0, 1, n_points)
t = np.linspace(class_idx * 4, (class_idx + 1) * 4, n_points) + np.random.randn(n_points) * 0.2
X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
y[ix] = class_idx
return X, y
X, y_int = generate_spiral_data(100, 3)
y_onehot = np.eye(3)[y_int]
# Train network
net = TwoLayerNetwork(n_in=2, n_hidden=100, n_out=3)
print("Training Two-Layer Network")
print("=" * 50)
for epoch in range(1001):
loss = net.train_step(X, y_onehot, lr=1.0)
if epoch % 200 == 0:
predictions = net.predict(X)
accuracy = np.mean(predictions == y_int)
print(f"Epoch {epoch:4d}: Loss={loss:.4f}, Accuracy={accuracy:.2%}")
Understanding the Chain Rule in Backpropagation
Visualizing the Chain Rule
For a network with L layers:
$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}}$$
dL/da^(L) → da^(L)/dz^(L) → dz^(L)/da^(L-1) → ... → dz^(1)/dW^(1)
Reading the product term by term, from the output back to layer 1: the output gradient $\partial L/\partial a^{(L)}$, the activation derivative $\partial a^{(L)}/\partial z^{(L)}$, the weight-matrix terms $\partial z^{(l)}/\partial a^{(l-1)} = W^{(l)}$ that link consecutive layers, and finally the weight gradient $\partial z^{(1)}/\partial W^{(1)}$ for the parameters being updated.
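Written recursively (column-vector convention, with $a^{(0)} = x$), the same chain collapses into the standard per-layer error signal that the batched code above implements:
$$\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot \sigma'\left(z^{(L)}\right), \qquad \delta^{(l)} = \left(W^{(l+1)}\right)^{T} \delta^{(l+1)} \odot \sigma'\left(z^{(l)}\right)$$
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{T}, \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$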
Deriving Gradients for Common Layers
Linear Layer: $z = Wx + b$
def linear_forward(x, W, b):
return W @ x + b
def linear_backward(dL_dz, x, W):
"""
Gradients for linear layer.
z = Wx + b
dL/dW = dL/dz @ x.T (outer product)
dL/db = dL/dz
dL/dx = W.T @ dL/dz (for backprop to previous layer)
"""
dL_dW = np.outer(dL_dz, x)
dL_db = dL_dz
dL_dx = W.T @ dL_dz
return dL_dW, dL_db, dL_dx
ReLU Activation: $a = \max(0, z)$
def relu_forward(z):
return np.maximum(0, z)
def relu_backward(dL_da, z):
"""
Gradient for ReLU.
a = max(0, z)
da/dz = 1 if z > 0 else 0
dL/dz = dL/da * da/dz
"""
dL_dz = dL_da * (z > 0).astype(float)
return dL_dz
Sigmoid Activation: $a = \sigma(z) = \frac{1}{1 + e^{-z}}$
def sigmoid_forward(z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_backward(dL_da, z):
"""
Gradient for sigmoid.
a = σ(z)
da/dz = σ(z) * (1 - σ(z)) = a * (1 - a)
dL/dz = dL/da * a * (1 - a)
"""
a = sigmoid_forward(z)
dL_dz = dL_da * a * (1 - a)
return dL_dz
Softmax + Cross-Entropy
def softmax_forward(z):
exp_z = np.exp(z - np.max(z))
return exp_z / np.sum(exp_z)
def cross_entropy_loss(y_pred, y_true):
return -np.sum(y_true * np.log(y_pred + 1e-10))
def softmax_cross_entropy_backward(y_pred, y_true):
"""
Combined gradient for softmax + cross-entropy.
This simplifies beautifully to:
dL/dz = y_pred - y_true
"""
return y_pred - y_true
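Why it simplifies: with $p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ and $L = -\sum_i y_i \log p_i$ for a one-hot target $y$, the softmax Jacobian is $\frac{\partial p_k}{\partial z_i} = p_k(\delta_{ki} - p_i)$, so
$$\frac{\partial L}{\partial z_i} = \sum_k \left(-\frac{y_k}{p_k}\right) p_k (\delta_{ki} - p_i) = -y_i + p_i \sum_k y_k = p_i - y_i,$$
using $\sum_k y_k = 1$. That is exactly what `softmax_cross_entropy_backward` returns.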
Implementing Backpropagation from Scratch
Modular Implementation
import numpy as np
class Layer:
"""Base class for neural network layers."""
def forward(self, x):
raise NotImplementedError
def backward(self, dL_dout):
raise NotImplementedError
def update(self, lr):
pass # Override in parameterized layers
class Linear(Layer):
"""Fully connected layer."""
def __init__(self, in_features, out_features):
# He initialization
self.W = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
self.b = np.zeros((1, out_features))
self.dW = None
self.db = None
def forward(self, x):
self.x = x
return x @ self.W + self.b
def backward(self, dL_dout):
self.dW = self.x.T @ dL_dout
self.db = np.sum(dL_dout, axis=0, keepdims=True)
dL_dx = dL_dout @ self.W.T
return dL_dx
def update(self, lr):
self.W -= lr * self.dW
self.b -= lr * self.db
class ReLU(Layer):
"""ReLU activation."""
def forward(self, x):
self.x = x
return np.maximum(0, x)
def backward(self, dL_dout):
return dL_dout * (self.x > 0)
class Sigmoid(Layer):
"""Sigmoid activation."""
def forward(self, x):
self.out = 1 / (1 + np.exp(-np.clip(x, -500, 500)))
return self.out
def backward(self, dL_dout):
return dL_dout * self.out * (1 - self.out)
class Tanh(Layer):
"""Tanh activation."""
def forward(self, x):
self.out = np.tanh(x)
return self.out
def backward(self, dL_dout):
return dL_dout * (1 - self.out ** 2)
class Softmax(Layer):
"""Softmax activation (use with cross-entropy loss)."""
def forward(self, x):
exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
self.out = exp_x / np.sum(exp_x, axis=1, keepdims=True)
return self.out
def backward(self, dL_dout):
# When used with cross-entropy, pass gradient directly
return dL_dout
class CrossEntropyLoss:
"""Cross-entropy loss for classification."""
def forward(self, y_pred, y_true):
self.y_pred = y_pred
self.y_true = y_true
eps = 1e-10
return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))
def backward(self):
batch_size = self.y_true.shape[0]
# Combined softmax + cross-entropy gradient
return (self.y_pred - self.y_true) / batch_size
class MSELoss:
"""Mean squared error loss for regression."""
def forward(self, y_pred, y_true):
self.y_pred = y_pred
self.y_true = y_true
return np.mean((y_pred - y_true) ** 2)
def backward(self):
batch_size = self.y_true.shape[0]
return 2 * (self.y_pred - self.y_true) / batch_size
class Sequential:
"""Container for sequential layers."""
def __init__(self, layers):
self.layers = layers
def forward(self, x):
for layer in self.layers:
x = layer.forward(x)
return x
def backward(self, dL_dout):
for layer in reversed(self.layers):
dL_dout = layer.backward(dL_dout)
return dL_dout
def update(self, lr):
for layer in self.layers:
layer.update(lr)
def __call__(self, x):
return self.forward(x)
# Build and train a network
np.random.seed(42)
# Create network
model = Sequential([
Linear(2, 64),
ReLU(),
Linear(64, 32),
ReLU(),
Linear(32, 3),
Softmax()
])
loss_fn = CrossEntropyLoss()
# Generate data
X, y_int = generate_spiral_data(100, 3)
y_onehot = np.eye(3)[y_int]
# Training loop
print("Training Modular Network")
print("=" * 50)
for epoch in range(1001):
# Forward pass
y_pred = model(X)
loss = loss_fn.forward(y_pred, y_onehot)
# Backward pass
dL = loss_fn.backward()
model.backward(dL)
# Update parameters
model.update(lr=0.5)
if epoch % 200 == 0:
predictions = np.argmax(y_pred, axis=1)
accuracy = np.mean(predictions == y_int)
print(f"Epoch {epoch:4d}: Loss={loss:.4f}, Accuracy={accuracy:.2%}")
Gradient Checking: Verifying Your Backpropagation
Numerical Gradient Verification
import numpy as np
def numerical_gradient(model, loss_fn, X, y, layer_idx, param_name, epsilon=1e-5):
"""
Compute numerical gradient using finite differences.
Args:
model: Neural network model
loss_fn: Loss function
X: Input data
y: Target labels
layer_idx: Index of layer to check
param_name: 'W' or 'b'
epsilon: Small perturbation value
"""
layer = model.layers[layer_idx]
param = getattr(layer, param_name)
numerical_grad = np.zeros_like(param)
# Iterate over all elements
it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
idx = it.multi_index
original = param[idx]
# f(x + epsilon)
param[idx] = original + epsilon
y_pred = model(X)
loss_plus = loss_fn.forward(y_pred, y)
# f(x - epsilon)
param[idx] = original - epsilon
y_pred = model(X)
loss_minus = loss_fn.forward(y_pred, y)
# Numerical gradient
numerical_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)
# Restore original value
param[idx] = original
it.iternext()
return numerical_grad
def gradient_check(model, loss_fn, X, y, epsilon=1e-5):
"""
Check analytical gradients against numerical gradients.
"""
# Forward and backward pass to get analytical gradients
y_pred = model(X)
loss = loss_fn.forward(y_pred, y)
dL = loss_fn.backward()
model.backward(dL)
print("Gradient Check Results")
print("=" * 60)
# Check each parameterized layer
for i, layer in enumerate(model.layers):
if hasattr(layer, 'W'):
# Check weights
numerical_dW = numerical_gradient(model, loss_fn, X, y, i, 'W', epsilon)
analytical_dW = layer.dW
diff_W = np.linalg.norm(numerical_dW - analytical_dW)
norm_sum = np.linalg.norm(numerical_dW) + np.linalg.norm(analytical_dW)
relative_diff_W = diff_W / (norm_sum + 1e-10)
status_W = "✓ PASS" if relative_diff_W < 1e-5 else "✗ FAIL"
print(f"Layer {i} W: relative diff = {relative_diff_W:.2e} {status_W}")
# Check biases
numerical_db = numerical_gradient(model, loss_fn, X, y, i, 'b', epsilon)
analytical_db = layer.db
diff_b = np.linalg.norm(numerical_db - analytical_db)
norm_sum_b = np.linalg.norm(numerical_db) + np.linalg.norm(analytical_db)
relative_diff_b = diff_b / (norm_sum_b + 1e-10)
status_b = "✓ PASS" if relative_diff_b < 1e-5 else "✗ FAIL"
print(f"Layer {i} b: relative diff = {relative_diff_b:.2e} {status_b}")
# Run gradient check on small data
X_small = X[:5]
y_small = y_onehot[:5]
small_model = Sequential([
Linear(2, 4),
ReLU(),
Linear(4, 3),
Softmax()
])
gradient_check(small_model, CrossEntropyLoss(), X_small, y_small)
Common Pitfalls and How to Avoid Them
1. Vanishing Gradients
Problem: Gradients become extremely small in deep networks, especially with sigmoid/tanh.
# Demonstrate vanishing gradients
def demonstrate_vanishing_gradient():
"""Show how gradients vanish through many sigmoid layers."""
np.random.seed(42)
x = np.random.randn(10)
gradient = np.ones_like(x)
print("Gradient magnitude through sigmoid layers:")
for layer in range(1, 21):
# Sigmoid derivative: s(1-s), max = 0.25
z = np.random.randn(10)
s = 1 / (1 + np.exp(-z))
sigmoid_grad = s * (1 - s)
gradient = gradient * sigmoid_grad
if layer % 5 == 0:
print(f" Layer {layer:2d}: ||gradient|| = {np.linalg.norm(gradient):.2e}")
demonstrate_vanishing_gradient()
Solutions:
- Use ReLU or its variants (Leaky ReLU, ELU)
- Use batch normalization
- Use residual connections (skip connections), as sketched below
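A rough sketch of why residual connections help (illustrative 10-unit layers with deliberately small Jacobians): a block $y = x + f(x)$ has Jacobian $I + \partial f/\partial x$, so the identity path keeps the gradient from shrinking even when the layer's own Jacobian is tiny.
import numpy as np
np.random.seed(0)
g_plain = np.ones(10)
g_resid = np.ones(10)
for layer in range(20):
    J = np.random.randn(10, 10) * 0.1        # small per-layer Jacobian (vanishing regime)
    g_plain = J.T @ g_plain                   # plain chain: gradient keeps shrinking
    g_resid = (np.eye(10) + J).T @ g_resid    # residual block y = x + f(x): identity path preserved
print(f"plain:    ||gradient|| = {np.linalg.norm(g_plain):.2e}")
print(f"residual: ||gradient|| = {np.linalg.norm(g_resid):.2e}")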
2. Exploding Gradients
Problem: Gradients become extremely large, causing unstable training.
def demonstrate_exploding_gradient():
"""Show how gradients can explode."""
np.random.seed(42)
gradient = np.ones(10)
print("Gradient magnitude with large weights:")
for layer in range(1, 21):
# Large random weights
W = np.random.randn(10, 10) * 2 # Large initialization
gradient = W.T @ gradient
if layer % 5 == 0:
print(f" Layer {layer:2d}: ||gradient|| = {np.linalg.norm(gradient):.2e}")
demonstrate_exploding_gradient()
Solutions:
- Gradient clipping (sketched below)
- Proper weight initialization (Xavier, He)
- Batch normalization
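Gradient clipping is simple to add on top of any of the networks above. A minimal global-norm version (the function name and `max_norm` threshold are illustrative choices, not a library API):
import numpy as np
def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
# Example with the modular network: clip dW after model.backward(), before model.update().
# grads = [layer.dW for layer in model.layers if hasattr(layer, 'dW')]
# clipped = clip_by_global_norm(grads, max_norm=5.0)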
3. Dying ReLU
Problem: Neurons get stuck outputting 0 and never recover.
def demonstrate_dying_relu():
"""Show the dying ReLU problem."""
# If a neuron always outputs negative values, ReLU kills it
z_values = np.array([-1.0, -0.5, 0.1, -2.0, -0.3])
relu_output = np.maximum(0, z_values)
relu_gradient = (z_values > 0).astype(float)
print("Dying ReLU demonstration:")
for i, (z, out, grad) in enumerate(zip(z_values, relu_output, relu_gradient)):
status = "DEAD 💀" if grad == 0 else "alive"
print(f" Neuron {i}: z={z:5.2f}, output={out:.2f}, gradient={grad:.0f} - {status}")
demonstrate_dying_relu()
Solutions:
- Use Leaky ReLU: $f(x) = \max(0.01x, x)$, as sketched below
- Use ELU or SELU
- Careful initialization
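A drop-in Leaky ReLU for the modular `Layer` framework above (a sketch reusing that base class and the NumPy import; `alpha=0.01` matches the formula in the list):
class LeakyReLU(Layer):
    """Leaky ReLU: a small negative slope keeps a nonzero gradient when z < 0."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
    def forward(self, x):
        self.x = x
        return np.where(x > 0, x, self.alpha * x)
    def backward(self, dL_dout):
        return dL_dout * np.where(self.x > 0, 1.0, self.alpha)
Swapping it in for ReLU in the Sequential model requires no other changes, which is the point of the modular design.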
How Modern Frameworks Handle Backpropagation
PyTorch Autograd
import torch
# PyTorch handles backprop automatically
x = torch.tensor([2.0], requires_grad=True)
w = torch.tensor([3.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
# Forward
y = (w * x + b) ** 2
# Backward (one line!)
y.backward()
print("PyTorch Autograd:")
print(f" dy/dx = {x.grad.item()}")
print(f" dy/dw = {w.grad.item()}")
print(f" dy/db = {b.grad.item()}")
# Compare with manual calculation
# y = (wx + b)² = (3*2 + 1)² = 49
# dy/dw = 2(wx+b)*x = 2*7*2 = 28
# dy/dx = 2(wx+b)*w = 2*7*3 = 42
# dy/db = 2(wx+b) = 2*7 = 14
TensorFlow GradientTape
import tensorflow as tf
x = tf.Variable([2.0])
w = tf.Variable([3.0])
b = tf.Variable([1.0])
with tf.GradientTape() as tape:
y = (w * x + b) ** 2
# Get gradients
gradients = tape.gradient(y, [x, w, b])
print("TensorFlow GradientTape:")
print(f" dy/dx = {gradients[0].numpy()[0]}")
print(f" dy/dw = {gradients[1].numpy()[0]}")
print(f" dy/db = {gradients[2].numpy()[0]}")
FAQs
Why is backpropagation efficient?
Backpropagation is efficient because it reuses intermediate computations. Instead of computing each gradient independently, it computes all gradients in a single backward pass by propagating the error signal through the network.
What’s the difference between backpropagation and gradient descent?
- Backpropagation: Algorithm to compute gradients
- Gradient descent: Algorithm to use gradients to update parameters
They work together: backprop computes ∂L/∂θ, then gradient descent applies θ = θ - η·∂L/∂θ.
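To make the separation concrete, here is a hedged sketch against the `Sequential`/`Linear` classes from earlier: backpropagation fills in each layer's `dW` and `db`, and a separate optimizer (here SGD with momentum, a class introduced only for this illustration) decides how to turn those gradients into parameter updates.
class SGDWithMomentum:
    """Optimizer sketch: consumes the dW/db that backprop stored, applies the update rule."""
    def __init__(self, layers, lr=0.1, momentum=0.9):
        self.params = [layer for layer in layers if hasattr(layer, 'W')]
        self.lr = lr
        self.momentum = momentum
        self.vel = [(np.zeros_like(layer.W), np.zeros_like(layer.b)) for layer in self.params]
    def step(self):
        for layer, (vW, vb) in zip(self.params, self.vel):
            vW *= self.momentum
            vW -= self.lr * layer.dW
            vb *= self.momentum
            vb -= self.lr * layer.db
            layer.W += vW
            layer.b += vb
# Usage with the modular network above:
# optimizer = SGDWithMomentum(model.layers, lr=0.1)
# model.backward(loss_fn.backward())   # backpropagation: compute gradients
# optimizer.step()                     # gradient descent: apply them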
Can backpropagation get stuck in local minima?
Theoretically yes, but in practice:
- In high dimensions, most critical points of the loss surface are saddle points rather than poor local minima
- Stochastic gradient descent helps escape poor regions
- Modern optimizers like Adam have momentum to overcome small bumps
Key Takeaways
- Backpropagation computes gradients efficiently using the chain rule
- Computational graphs track operations for automatic differentiation
- Two passes: forward (compute outputs), backward (compute gradients)
- Gradient checking verifies your implementation is correct
- Modern frameworks handle backprop automatically via autograd
Next Steps
Continue your deep learning journey:
- Gradient Descent Optimizers - SGD, Adam, and beyond
- Calculus for Deep Learning - Mathematical foundations
- Building Neural Networks from Scratch - Complete implementation guide
References
- Rumelhart, D., Hinton, G., Williams, R. “Learning representations by back-propagating errors” (1986) - Nature
- Goodfellow, I., et al. “Deep Learning” (2016) - Chapter 6
- Karpathy, A. “Yes you should understand backprop” - Medium
- Stanford CS231n: “Backpropagation, Intuitions” - https://cs231n.github.io/optimization-2/
Last updated: January 2024. This guide is part of our Mathematics for Machine Learning series.