Calculus for Deep Learning: Derivatives, Gradients, and Chain Rule Explained

Master the calculus foundations for deep learning. Learn derivatives, gradients, partial derivatives, Jacobians, and chain rule with practical neural network examples.

15 min read Jan 15, 2024


“Calculus is the language of neural networks—every weight update speaks it fluently.” — Yann LeCun

When you train a neural network, you’re essentially performing millions of calculus operations. The model learns by computing derivatives to understand how to adjust its parameters. Without calculus, deep learning as we know it wouldn’t exist.

In this comprehensive guide, we’ll build your calculus intuition from derivatives to gradients to the chain rule—everything you need to understand how neural networks actually learn.


Why Does Deep Learning Need Calculus?

The Optimization Perspective

Every machine learning model is an optimization problem:

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta)$$

Where:

  • $\theta$ = model parameters (weights and biases)
  • $\mathcal{L}$ = loss function
  • $\theta^*$ = optimal parameters

To find the minimum, we need to know which direction to move and how far. Calculus gives us both through derivatives.

What Happens During Training

  1. Forward Pass: Compute predictions using current parameters
  2. Loss Calculation: Measure how wrong predictions are
  3. Backward Pass: Calculate derivatives (gradients) of loss w.r.t. parameters
  4. Update: Adjust parameters in the direction that reduces loss

# Simplified training loop (pseudocode: model, compute_loss, and
# compute_gradients stand in for a framework's loss and autodiff machinery)
for epoch in range(num_epochs):
    # Forward pass
    predictions = model(inputs)
    loss = compute_loss(predictions, targets)
    
    # Backward pass (calculus happens here!)
    gradients = compute_gradients(loss, model.parameters)
    
    # Update (gradient descent)
    for param, grad in zip(model.parameters, gradients):
        param -= learning_rate * grad

What Are Derivatives and Why Do They Matter?

The Intuition

A derivative measures how a function changes when its input changes. It’s the instantaneous rate of change—the slope at any point.

$$\frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

For deep learning, this tells us: “If I change this parameter slightly, how much will my loss change?”

Derivatives in Machine Learning

import numpy as np
import matplotlib.pyplot as plt

# Example: Simple quadratic loss
def loss_function(w):
    """Simplified loss as function of weight w."""
    return (w - 3)**2 + 1

def loss_derivative(w):
    """Analytical derivative."""
    return 2 * (w - 3)

# Numerical derivative (finite difference)
def numerical_derivative(f, x, h=1e-5):
    """Approximate derivative using finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare analytical and numerical
w = 5.0
analytical = loss_derivative(w)
numerical = numerical_derivative(loss_function, w)

print(f"Analytical derivative at w={w}: {analytical}")
print(f"Numerical derivative at w={w}: {numerical:.6f}")
print(f"Difference: {abs(analytical - numerical):.2e}")

# Visualization
w_values = np.linspace(-2, 8, 100)
losses = [loss_function(w) for w in w_values]
derivatives = [loss_derivative(w) for w in w_values]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(w_values, losses, 'b-', linewidth=2)
axes[0].axhline(y=1, color='r', linestyle='--', label='Minimum')
axes[0].axvline(x=3, color='r', linestyle='--')
axes[0].set_xlabel('Weight (w)')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Function')
axes[0].legend()

axes[1].plot(w_values, derivatives, 'g-', linewidth=2)
axes[1].axhline(y=0, color='r', linestyle='--', label='Zero gradient')
axes[1].set_xlabel('Weight (w)')
axes[1].set_ylabel('Derivative')
axes[1].set_title('Derivative of Loss')
axes[1].legend()

plt.tight_layout()
plt.show()

Common Derivatives in Deep Learning

| Function | Formula | Derivative | Used In |
|---|---|---|---|
| Linear | $f(x) = ax + b$ | $f'(x) = a$ | Linear layers |
| Square | $f(x) = x^2$ | $f'(x) = 2x$ | MSE loss |
| Exponential | $f(x) = e^x$ | $f'(x) = e^x$ | Softmax |
| Logarithm | $f(x) = \ln(x)$ | $f'(x) = 1/x$ | Cross-entropy |
| Power | $f(x) = x^n$ | $f'(x) = nx^{n-1}$ | Polynomial features |
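
Each entry can be sanity-checked numerically with the same central-difference trick used earlier; a minimal sketch (the evaluation point x = 2.0 and the constants a = 3, b = 1, n = 5 are arbitrary choices for illustration):

import numpy as np

def num_deriv(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
checks = [
    ("Linear (a=3, b=1)", lambda t: 3 * t + 1, 3.0),
    ("Square",            lambda t: t**2,      2 * x),
    ("Exponential",       np.exp,              np.exp(x)),
    ("Logarithm",         np.log,              1 / x),
    ("Power (n=5)",       lambda t: t**5,      5 * x**4),
]
for name, f, analytical in checks:
    print(f"{name:18s} numerical={num_deriv(f, x):.6f}  analytical={analytical:.6f}")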

How Do Partial Derivatives Extend to Multiple Variables?

The Need for Multiple Variables

Real neural networks have millions of parameters. We need to know how changing each parameter affects the loss. That’s where partial derivatives come in.

A partial derivative measures how a function changes when you change just one variable while holding others constant:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$

Example: Loss with Two Parameters

import numpy as np

def loss_2d(w1, w2):
    """Loss function with two parameters."""
    return (w1 - 2)**2 + (w2 - 3)**2 + w1 * w2

def partial_w1(w1, w2):
    """Partial derivative with respect to w1."""
    return 2 * (w1 - 2) + w2

def partial_w2(w1, w2):
    """Partial derivative with respect to w2."""
    return 2 * (w2 - 3) + w1

# Example point
w1, w2 = 1.0, 1.0

print(f"Loss at ({w1}, {w2}): {loss_2d(w1, w2)}")
print(f"∂L/∂w1 = {partial_w1(w1, w2)}")
print(f"∂L/∂w2 = {partial_w2(w1, w2)}")

# Numerical verification
h = 1e-5
numerical_partial_w1 = (loss_2d(w1 + h, w2) - loss_2d(w1 - h, w2)) / (2 * h)
numerical_partial_w2 = (loss_2d(w1, w2 + h) - loss_2d(w1, w2 - h)) / (2 * h)

print(f"\nNumerical ∂L/∂w1 = {numerical_partial_w1:.6f}")
print(f"Numerical ∂L/∂w2 = {numerical_partial_w2:.6f}")

Visualizing Partial Derivatives

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib versions

# Uses partial_w1 and partial_w2 defined in the previous snippet

# Create grid
w1_range = np.linspace(-2, 6, 50)
w2_range = np.linspace(-2, 8, 50)
W1, W2 = np.meshgrid(w1_range, w2_range)
Z = (W1 - 2)**2 + (W2 - 3)**2 + W1 * W2

# 3D plot
fig = plt.figure(figsize=(12, 5))

ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(W1, W2, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w1')
ax1.set_ylabel('w2')
ax1.set_zlabel('Loss')
ax1.set_title('Loss Surface')

# Contour plot with gradient arrows
ax2 = fig.add_subplot(122)
contour = ax2.contour(W1, W2, Z, levels=20)
ax2.clabel(contour, inline=True, fontsize=8)

# Add gradient arrows at selected points
for w1 in [0, 2, 4]:
    for w2 in [0, 2, 4, 6]:
        gw1 = partial_w1(w1, w2)
        gw2 = partial_w2(w1, w2)
        ax2.arrow(w1, w2, -gw1*0.2, -gw2*0.2, 
                  head_width=0.2, head_length=0.1, fc='red', ec='red')

ax2.set_xlabel('w1')
ax2.set_ylabel('w2')
ax2.set_title('Contour Plot with Negative Gradients')

plt.tight_layout()
plt.show()

What Is a Gradient and Why Is It Central to Training?

Definition of Gradient

The gradient is the vector of all partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of steepest increase. To minimize loss, we move in the opposite direction (negative gradient).

Properties of Gradients

  1. Direction: Points toward steepest ascent
  2. Magnitude: Indicates steepness (large gradient = steep slope)
  3. At minimum: Gradient equals zero vector

import numpy as np

class GradientDemo:
    """Demonstrate gradient properties."""
    
    def __init__(self):
        self.path = []
    
    def loss(self, params):
        """Bowl-shaped loss function."""
        w1, w2 = params
        return w1**2 + 2*w2**2
    
    def gradient(self, params):
        """Compute gradient."""
        w1, w2 = params
        return np.array([2*w1, 4*w2])
    
    def gradient_descent(self, start, lr=0.1, steps=50):
        """Perform gradient descent."""
        params = np.array(start, dtype=float)
        self.path = [params.copy()]
        
        for _ in range(steps):
            grad = self.gradient(params)
            params = params - lr * grad
            self.path.append(params.copy())
        
        return params
    
    def visualize(self):
        """Visualize gradient descent path."""
        import matplotlib.pyplot as plt
        
        # Create loss surface
        x = np.linspace(-5, 5, 100)
        y = np.linspace(-5, 5, 100)
        X, Y = np.meshgrid(x, y)
        Z = X**2 + 2*Y**2
        
        plt.figure(figsize=(10, 8))
        plt.contour(X, Y, Z, levels=20, cmap='viridis')
        plt.colorbar(label='Loss')
        
        # Plot path
        path = np.array(self.path)
        plt.plot(path[:, 0], path[:, 1], 'ro-', markersize=3, linewidth=1)
        plt.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
        plt.plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End')
        
        plt.xlabel('w1')
        plt.ylabel('w2')
        plt.title('Gradient Descent Path')
        plt.legend()
        plt.show()

# Run demonstration
demo = GradientDemo()
final_params = demo.gradient_descent(start=[4.0, 3.0], lr=0.1, steps=50)
print(f"Final parameters: {final_params}")
print(f"Final loss: {demo.loss(final_params):.6f}")
demo.visualize()

Gradient Descent Update Rule

The fundamental update rule of gradient descent:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

Where:

  • $\theta_t$ = parameters at step t
  • $\eta$ = learning rate
  • $\nabla_\theta \mathcal{L}$ = gradient of loss with respect to parameters
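
In code, this rule is a single line; a minimal sketch applying one step to the quadratic loss from the derivatives section (the learning rate 0.1 is an arbitrary choice):

# One gradient descent step on the earlier loss L(w) = (w - 3)^2 + 1
w = 5.0                # theta_t
eta = 0.1              # learning rate (arbitrary choice for illustration)
grad = 2 * (w - 3)     # gradient of the loss at w
w = w - eta * grad     # theta_{t+1} = theta_t - eta * gradient
print(w)               # 4.6, one step closer to the minimum at w = 3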

What Is the Chain Rule and Why Does Backpropagation Need It?

The Chain Rule Explained

When functions are composed (nested), the chain rule tells us how to compute derivatives:

For $y = f(g(x))$:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} = f'(g(x)) \cdot g'(x)$$
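
A quick numerical check makes the rule concrete; a minimal sketch with the arbitrarily chosen composition $g(x) = x^2$ and $f(u) = \sin(u)$:

import numpy as np

# y = f(g(x)) with g(x) = x^2 and f(u) = sin(u), so dy/dx = cos(x^2) * 2x
x = 1.3
analytical = np.cos(x**2) * 2 * x

h = 1e-6
numerical = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)

print(f"Analytical: {analytical:.6f}")
print(f"Numerical:  {numerical:.6f}")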

Why Neural Networks Need the Chain Rule

A neural network is a composition of many functions:

$$\text{output} = f_L(f_{L-1}(\dots f_2(f_1(x)) \dots))$$

Each layer applies a linear transformation followed by an activation:

$$a^{(l)} = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})$$

To compute how the loss changes with respect to weights in early layers, we must chain derivatives through all subsequent layers.

Chain Rule Example: Simple Network

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Simple 2-layer network
# Input -> Hidden -> Output
# x -> z1 -> a1 -> z2 -> a2 -> loss

class SimpleNetwork:
    def __init__(self):
        np.random.seed(42)
        # Layer 1: 2 inputs -> 3 hidden
        self.W1 = np.random.randn(2, 3) * 0.5
        self.b1 = np.zeros(3)
        # Layer 2: 3 hidden -> 1 output
        self.W2 = np.random.randn(3, 1) * 0.5
        self.b2 = np.zeros(1)
    
    def forward(self, x):
        """Forward pass with cached values for backprop."""
        # Layer 1
        self.z1 = x @ self.W1 + self.b1
        self.a1 = sigmoid(self.z1)
        
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = sigmoid(self.z2)
        
        return self.a2
    
    def backward(self, x, y):
        """
        Backward pass using chain rule.
        
        Loss = (1/2) * (a2 - y)^2
        """
        m = x.shape[0]  # batch size
        
        # Output layer gradient
        # dL/da2 = a2 - y
        dL_da2 = self.a2 - y
        
        # Chain rule: dL/dz2 = dL/da2 * da2/dz2
        da2_dz2 = sigmoid_derivative(self.z2)
        dL_dz2 = dL_da2 * da2_dz2
        
        # dL/dW2 = dL/dz2 * dz2/dW2 = a1.T @ dL_dz2
        dL_dW2 = self.a1.T @ dL_dz2 / m
        dL_db2 = np.mean(dL_dz2, axis=0)
        
        # Hidden layer gradient (chain through W2)
        # dL/da1 = dL/dz2 @ W2.T
        dL_da1 = dL_dz2 @ self.W2.T
        
        # dL/dz1 = dL/da1 * da1/dz1
        da1_dz1 = sigmoid_derivative(self.z1)
        dL_dz1 = dL_da1 * da1_dz1
        
        # dL/dW1 = x.T @ dL_dz1
        dL_dW1 = x.T @ dL_dz1 / m
        dL_db1 = np.mean(dL_dz1, axis=0)
        
        return {'dW1': dL_dW1, 'db1': dL_db1, 
                'dW2': dL_dW2, 'db2': dL_db2}
    
    def train_step(self, x, y, lr=0.1):
        """One training step."""
        # Forward
        pred = self.forward(x)
        loss = 0.5 * np.mean((pred - y)**2)
        
        # Backward
        grads = self.backward(x, y)
        
        # Update
        self.W1 -= lr * grads['dW1']
        self.b1 -= lr * grads['db1']
        self.W2 -= lr * grads['dW2']
        self.b2 -= lr * grads['db2']
        
        return loss

# Training example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

net = SimpleNetwork()

print("Training Simple Network with Chain Rule Backprop")
print("-" * 50)

for epoch in range(1000):
    loss = net.train_step(X, y, lr=1.0)
    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")

print("\nFinal predictions:")
for i in range(len(X)):
    pred = net.forward(X[i:i+1])
    print(f"Input: {X[i]} -> Prediction: {pred[0,0]:.4f}, Target: {y[i,0]}")

Chain Rule Visualization

Forward Pass:
x → [W1, b1] → z1 → σ → a1 → [W2, b2] → z2 → σ → a2 → Loss

Backward Pass (Chain Rule):
dL/dW1 = dL/da2 · da2/dz2 · dz2/da1 · da1/dz1 · dz1/dW1
         ↑         ↑         ↑         ↑         ↑
       output   sigmoid    W2.T     sigmoid    x.T
       error   derivative          derivative

What Are Jacobians and When Do You Need Them?

Jacobian Matrix Definition

When a function maps vectors to vectors, the Jacobian is the matrix of all partial derivatives:

For $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Jacobian in Neural Networks

Consider a layer that transforms $\mathbf{x} \in \mathbb{R}^n$ to $\mathbf{y} \in \mathbb{R}^m$:

import numpy as np

def layer_forward(x, W, b):
    """Linear layer: y = Wx + b"""
    return W @ x + b

def jacobian_layer(W):
    """Jacobian of linear layer w.r.t. input is just W."""
    return W

# Example
np.random.seed(42)
W = np.random.randn(3, 4)  # 3 outputs, 4 inputs
b = np.zeros(3)
x = np.random.randn(4)

y = layer_forward(x, W, b)
J = jacobian_layer(W)

print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
print(f"Jacobian shape: {J.shape}")  # (3, 4)

# Verify: small change in x should cause J @ dx change in y
dx = np.random.randn(4) * 0.001
y_new = layer_forward(x + dx, W, b)
dy_actual = y_new - y
dy_predicted = J @ dx

print(f"\nActual dy: {dy_actual}")
print(f"Predicted dy (J @ dx): {dy_predicted}")
print(f"Difference: {np.linalg.norm(dy_actual - dy_predicted):.2e}")

Jacobian of Softmax

import numpy as np

def softmax(x):
    """Compute softmax."""
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

def softmax_jacobian(x):
    """
    Compute Jacobian of softmax.
    
    J[i,j] = s[i] * (delta[i,j] - s[j])
    where delta[i,j] = 1 if i==j else 0
    """
    s = softmax(x)
    n = len(s)
    J = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            if i == j:
                J[i, j] = s[i] * (1 - s[i])
            else:
                J[i, j] = -s[i] * s[j]
    
    return J

# Vectorized version
def softmax_jacobian_vectorized(x):
    """Efficient Jacobian computation."""
    s = softmax(x).reshape(-1, 1)
    return np.diagflat(s) - s @ s.T

# Example
x = np.array([2.0, 1.0, 0.1])
s = softmax(x)
J = softmax_jacobian_vectorized(x)

print(f"Input: {x}")
print(f"Softmax: {s}")
print(f"\nJacobian:\n{J}")
print(f"Sum of each row: {J.sum(axis=1)}")  # Should be 0

Vector-Jacobian Products (VJP)

In practice, we don’t compute full Jacobians. Instead, we compute Vector-Jacobian Products:

$$\mathbf{v}^T \mathbf{J}$$

This is what backpropagation computes efficiently:

import numpy as np

def vjp_linear_layer(v, W):
    """
    Vector-Jacobian product for linear layer.
    
    v: gradient from next layer (shape: m)
    W: weight matrix (shape: m x n)
    
    Returns: gradient w.r.t. input (shape: n)
    """
    return v @ W

# Example
np.random.seed(42)
W = np.random.randn(3, 4)  # 3 outputs, 4 inputs
v = np.random.randn(3)     # Upstream gradient

# VJP gives us gradient w.r.t. input
grad_input = vjp_linear_layer(v, W)
print(f"Upstream gradient shape: {v.shape}")
print(f"Input gradient shape: {grad_input.shape}")

# This is the same as computing full Jacobian then multiplying
J = W  # Jacobian of linear layer is W
grad_input_full = v @ J
print(f"VJP result: {grad_input}")
print(f"Full Jacobian result: {grad_input_full}")

How Does Automatic Differentiation Work?

The Magic Behind PyTorch and TensorFlow

Modern deep learning frameworks use automatic differentiation (autodiff) to compute gradients. They build a computational graph and apply the chain rule automatically.

import numpy as np

class Value:
    """Simple autodiff implementation inspired by micrograd."""
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        
        return out
    
    def __pow__(self, other):
        out = Value(self.data ** other, (self,), f'**{other}')
        
        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        
        return out
    
    def __neg__(self):
        return self * -1
    
    def __sub__(self, other):
        return self + (-other)
    
    def __truediv__(self, other):
        return self * (other ** -1)
    
    def tanh(self):
        x = self.data
        t = (np.exp(2*x) - 1) / (np.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        
        return out
    
    def backward(self):
        """Compute gradients using reverse-mode autodiff."""
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# Example: gradient of (w1 * x1 + w2 * x2)^2
x1 = Value(2.0)
x2 = Value(3.0)
w1 = Value(0.5)
w2 = Value(-0.3)

# Forward pass
y = w1 * x1 + w2 * x2
loss = y ** 2

print(f"y = {y.data:.4f}")
print(f"loss = {loss.data:.4f}")

# Backward pass
loss.backward()

print(f"\nGradients:")
print(f"dL/dw1 = {w1.grad:.4f}")
print(f"dL/dw2 = {w2.grad:.4f}")
print(f"dL/dx1 = {x1.grad:.4f}")
print(f"dL/dx2 = {x2.grad:.4f}")

# Verify with manual calculation
# y = w1*x1 + w2*x2 = 0.5*2 + (-0.3)*3 = 0.1
# loss = y^2 = 0.01
# dL/dy = 2y = 0.2
# dL/dw1 = dL/dy * dy/dw1 = 0.2 * x1 = 0.2 * 2 = 0.4
print(f"\nManual verification: dL/dw1 = 2 * y * x1 = 2 * {y.data:.2f} * {x1.data:.2f} = {2 * y.data * x1.data:.4f}")

Computational Graph Visualization

Input Layer          Hidden Layer         Output
     x1 ─────┐
             ├──> z1 = w1*x1 + w2*x2 ──> y = z1² ──> Loss
     x2 ─────┘
      ↑   ↑
     w1   w2

Backward flow:
dL/dw1 = dL/dy · dy/dz1 · dz1/dw1
       =   1   ·  2*z1  ·   x1
       = 2 * z1 * x1
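
The same computation can be cross-checked against a production autodiff engine; a minimal sketch, assuming PyTorch is available (the tensor values mirror the Value example above):

import torch

# Same computation as the Value example: loss = (w1*x1 + w2*x2)^2
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)

loss = (w1 * x1 + w2 * x2) ** 2
loss.backward()

print(f"dL/dw1 = {w1.grad.item():.4f}")  # 0.4000, matching the Value class
print(f"dL/dw2 = {w2.grad.item():.4f}")  # 0.6000 = 2 * y * x2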

Real-World Examples: Derivatives of Common Functions

Activation Function Derivatives

import numpy as np
import matplotlib.pyplot as plt

def plot_activation_and_derivative(name, func, deriv, x_range=(-5, 5)):
    """Plot activation function and its derivative."""
    x = np.linspace(x_range[0], x_range[1], 200)
    y = func(x)
    dy = deriv(x)
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    axes[0].plot(x, y, 'b-', linewidth=2)
    axes[0].axhline(y=0, color='k', linewidth=0.5)
    axes[0].axvline(x=0, color='k', linewidth=0.5)
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].set_title(f'{name} Function')
    axes[0].grid(True, alpha=0.3)
    
    axes[1].plot(x, dy, 'r-', linewidth=2)
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    axes[1].axvline(x=0, color='k', linewidth=0.5)
    axes[1].set_xlabel('x')
    axes[1].set_ylabel("f'(x)")
    axes[1].set_title(f'{name} Derivative')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Sigmoid
sigmoid = lambda x: 1 / (1 + np.exp(-np.clip(x, -500, 500)))
sigmoid_deriv = lambda x: sigmoid(x) * (1 - sigmoid(x))
plot_activation_and_derivative('Sigmoid', sigmoid, sigmoid_deriv)

# Tanh
tanh_deriv = lambda x: 1 - np.tanh(x)**2
plot_activation_and_derivative('Tanh', np.tanh, tanh_deriv)

# ReLU
relu = lambda x: np.maximum(0, x)
relu_deriv = lambda x: (x > 0).astype(float)
plot_activation_and_derivative('ReLU', relu, relu_deriv)

# Leaky ReLU
alpha = 0.01
leaky_relu = lambda x: np.where(x > 0, x, alpha * x)
leaky_relu_deriv = lambda x: np.where(x > 0, 1, alpha)
plot_activation_and_derivative('Leaky ReLU', leaky_relu, leaky_relu_deriv)

# GELU (used in transformers)
gelu = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
gelu_deriv = lambda x: 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3))) + \
                        0.5 * x * (1 - np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3))**2) * \
                        np.sqrt(2/np.pi) * (1 + 3 * 0.044715 * x**2)
plot_activation_and_derivative('GELU', gelu, gelu_deriv)

Loss Function Derivatives

| Loss | Formula | Derivative |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y - \hat{y})^2$ | $\frac{2}{n}(\hat{y} - y)$ |
| Cross-Entropy | $-\sum y \log(\hat{y})$ | $-\frac{y}{\hat{y}}$ |
| Binary CE | $-[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$ | $\frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$ |

import numpy as np

# MSE Loss and derivative
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / len(y_true)

# Cross-entropy loss and derivative (with softmax)
def cross_entropy_loss(y_true, y_pred):
    """y_pred should be probabilities from softmax."""
    return -np.sum(y_true * np.log(y_pred + 1e-10))

def softmax_cross_entropy_derivative(y_true, y_pred):
    """Combined softmax + cross-entropy derivative."""
    return y_pred - y_true  # Simplified!

# Example
y_true = np.array([1.0, 0.0, 0.0])  # One-hot
y_pred = np.array([0.7, 0.2, 0.1])  # Softmax output

loss = cross_entropy_loss(y_true, y_pred)
grad = softmax_cross_entropy_derivative(y_true, y_pred)

print(f"Cross-entropy loss: {loss:.4f}")
print(f"Gradient: {grad}")

FAQs

What’s the difference between gradient and derivative?

  • Derivative: Rate of change of a function with one variable
  • Gradient: Vector of all partial derivatives (multi-variable functions)

A gradient is essentially a collection of derivatives for functions with multiple inputs.

Why do vanishing/exploding gradients happen?

When gradients are repeatedly multiplied through layers:

  • Vanishing: repeated factors with magnitude < 1 (e.g., sigmoid and tanh derivatives) shrink the gradient toward zero
  • Exploding: repeated factors with magnitude > 1 make the gradient grow without bound

Solutions: ReLU activations, residual connections, batch normalization, gradient clipping.
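
To see the vanishing case concretely, consider a minimal sketch: the sigmoid derivative never exceeds 0.25 (attained at z = 0), so stacking sigmoid layers multiplies the gradient by at most 0.25 per layer (layer counts chosen arbitrarily):

# The sigmoid derivative is at most 0.25, so the activation factors alone
# shrink the gradient geometrically with network depth
MAX_SIGMOID_DERIV = 0.25

for depth in [5, 10, 20, 50]:
    factor = MAX_SIGMOID_DERIV ** depth
    print(f"{depth:2d} layers: gradient factor at most {factor:.2e}")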

How do you verify gradient computations?

Use gradient checking with finite differences:

import numpy as np

def gradient_check(f, x, epsilon=1e-7):
    """Check analytical gradient against numerical gradient."""
    numerical_grad = np.zeros_like(x)
    
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += epsilon
        x_minus = x.copy()
        x_minus[i] -= epsilon
        numerical_grad[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    
    return numerical_grad
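
For example, checking the analytical gradient of the bowl-shaped loss $f(w) = w_1^2 + 2w_2^2$ used earlier (a minimal usage sketch):

# Usage: verify the analytical gradient [2*w1, 4*w2]
f = lambda w: w[0]**2 + 2 * w[1]**2
x = np.array([1.0, -2.0])

analytical = np.array([2 * x[0], 4 * x[1]])
numerical = gradient_check(f, x)

print(f"Analytical: {analytical}")
print(f"Numerical:  {numerical}")
print(f"Max difference: {np.max(np.abs(analytical - numerical)):.2e}")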

Key Takeaways

  1. Derivatives measure change: They tell us how to adjust parameters to reduce loss

  2. Gradients extend to multiple variables: The gradient vector points toward steepest ascent

  3. Chain rule enables backpropagation: Composed functions require chained derivatives

  4. Jacobians generalize to vector functions: Matrix of all partial derivatives

  5. Autodiff automates everything: Modern frameworks handle calculus automatically

  6. Numerical stability matters: Use appropriate activations and gradient clipping


Next Steps

Continue mastering the mathematics behind neural networks:

  1. Backpropagation Explained - Deep dive into how networks learn
  2. Gradient Descent Optimizers - SGD, Adam, and beyond
  3. Linear Algebra for ML - Matrix operations for neural networks

References

  1. Goodfellow, I., et al. “Deep Learning” (2016) - Chapter 6: Deep Feedforward Networks
  2. Ruder, S. “An overview of gradient descent optimization algorithms” (2016)
  3. Karpathy, A. “Micrograd” - https://github.com/karpathy/micrograd
  4. Stanford CS231n: “Backpropagation, Intuitions” - https://cs231n.github.io/optimization-2/

Last updated: January 2024. This guide is part of our Mathematics for Machine Learning series.