Convex Optimization for Machine Learning: From Lagrange to KKT Conditions

Master convex optimization fundamentals for machine learning. Learn convex functions, Lagrange multipliers, KKT conditions, and their applications in SVMs, regularization, and more.

14 min read Jan 15, 2024

Convex Optimization for Machine Learning: From Lagrange to KKT Conditions

“Machine learning is convex optimization in disguise.” — Stephen Boyd, Stanford

Many machine learning algorithms are optimization problems. Understanding convex optimization gives you powerful tools to analyze, improve, and create new algorithms. From SVMs to regularized regression, convex optimization is everywhere in ML.

In this comprehensive guide, you’ll learn the theory and practice of convex optimization for machine learning.

Why Does Convexity Matter in Machine Learning?

The Guarantee of Global Optima

For convex problems:

Every local minimum is a global minimum
Gradient descent always converges to optimal solution
No saddle points to get stuck in

For non-convex problems (deep learning):

Many local minima and saddle points
No guarantee of finding global optimum
Requires careful initialization and hyperparameters

Convex ML Algorithms

Algorithm	Convex?	Why It Matters
Linear Regression	✅ Yes	Closed-form solution exists
Logistic Regression	✅ Yes	Guaranteed convergence
SVM	✅ Yes	Unique optimal hyperplane
Lasso/Ridge	✅ Yes	Sparse/stable solutions
Neural Networks	❌ No	Multiple local minima
K-Means	❌ No	Results depend on initialization

What Makes a Function Convex?

Definition of Convexity

A function $f$ is convex if for all $x, y$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

Geometrically: the line segment between any two points lies above the function.

import numpy as np
import matplotlib.pyplot as plt

def visualize_convexity():
    """Visualize convex vs non-convex functions."""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    x = np.linspace(-3, 3, 100)
    
    # Convex function: x²
    y_convex = x ** 2
    
    # Non-convex function: x⁴ - 2x²
    y_nonconvex = x**4 - 2*x**2
    
    # Plot convex
    ax = axes[0]
    ax.plot(x, y_convex, 'b-', linewidth=2, label='f(x) = x²')
    
    # Show line segment above curve
    x1, x2 = -2, 1.5
    y1, y2 = x1**2, x2**2
    ax.plot([x1, x2], [y1, y2], 'r--', linewidth=2, label='Line segment')
    ax.plot([x1, x2], [y1, y2], 'ro', markersize=8)
    
    # Show convexity inequality
    lam = 0.6
    x_mid = lam * x1 + (1-lam) * x2
    y_line = lam * y1 + (1-lam) * y2
    y_func = x_mid ** 2
    ax.plot([x_mid, x_mid], [y_func, y_line], 'g-', linewidth=2)
    ax.plot(x_mid, y_func, 'go', markersize=8, label='f(mid) ≤ line')
    ax.plot(x_mid, y_line, 'g^', markersize=8)
    
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title('CONVEX: f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)')
    ax.legend()
    ax.grid(True)
    
    # Plot non-convex
    ax = axes[1]
    ax.plot(x, y_nonconvex, 'b-', linewidth=2, label='f(x) = x⁴ - 2x²')
    
    # Show line segment below curve in some regions
    x1, x2 = -1.5, 1.5
    y1, y2 = x1**4 - 2*x1**2, x2**4 - 2*x2**2
    ax.plot([x1, x2], [y1, y2], 'r--', linewidth=2, label='Line segment')
    ax.plot([x1, x2], [y1, y2], 'ro', markersize=8)
    
    # Show violation
    x_mid = 0
    y_line = (y1 + y2) / 2
    y_func = x_mid**4 - 2*x_mid**2
    ax.plot([x_mid, x_mid], [y_func, y_line], 'g-', linewidth=2)
    ax.plot(x_mid, y_func, 'go', markersize=8, label='f(mid) > line (violation!)')
    ax.plot(x_mid, y_line, 'g^', markersize=8)
    
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title('NON-CONVEX: Multiple local minima')
    ax.legend()
    ax.grid(True)
    
    plt.tight_layout()
    plt.show()

visualize_convexity()

Testing for Convexity

For twice-differentiable functions:

1D: $f’’(x) \geq 0$ for all $x$

nD: Hessian $\nabla^2 f(x) \succeq 0$ (positive semidefinite)

import numpy as np

def check_convexity_1d(f, x_range, num_points=1000):
    """Check if 1D function is convex using second derivative."""
    x = np.linspace(x_range[0], x_range[1], num_points)
    h = (x_range[1] - x_range[0]) / num_points
    
    # Numerical second derivative
    f_values = f(x)
    second_deriv = np.diff(f_values, 2) / h**2
    
    is_convex = np.all(second_deriv >= -1e-10)  # Small tolerance
    
    return is_convex, second_deriv

# Test common functions
functions = {
    'x²': lambda x: x**2,
    'exp(x)': lambda x: np.exp(x),
    'x⁴ - 2x²': lambda x: x**4 - 2*x**2,
    'log(x)': lambda x: np.log(np.abs(x) + 1e-10),
    'sin(x)': lambda x: np.sin(x),
    '|x|': lambda x: np.abs(x)
}

print("Convexity Check Results:")
print("-" * 40)
for name, f in functions.items():
    is_convex, _ = check_convexity_1d(f, (-2, 2))
    status = "✓ Convex" if is_convex else "✗ Non-convex"
    print(f"{name:15s}: {status}")

Common Convex Functions in ML

Function	Formula	Use in ML
Squared error	$(y - \hat{y})^2$	Regression loss
Cross-entropy	$-y\log(\hat{y})$	Classification loss
Hinge loss	$\max(0, 1 - y\hat{y})$	SVM
L1 norm	$\|x\|_1$	Lasso regularization
L2 norm squared	$\|x\|_2^2$	Ridge regularization
Log-sum-exp	$\log(\sum e^{x_i})$	Softmax approximation

Convex Sets and Feasibility

What Is a Convex Set?

A set $C$ is convex if for any $x, y \in C$ and $\lambda \in [0, 1]$:

$$\lambda x + (1-\lambda)y \in C$$

import numpy as np
import matplotlib.pyplot as plt

def visualize_convex_sets():
    """Visualize convex vs non-convex sets."""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Convex set: Circle
    theta = np.linspace(0, 2*np.pi, 100)
    x_circle = np.cos(theta)
    y_circle = np.sin(theta)
    
    ax = axes[0]
    ax.fill(x_circle, y_circle, alpha=0.3, color='blue')
    ax.plot(x_circle, y_circle, 'b-', linewidth=2)
    
    # Show line segment inside
    x1, y1 = -0.5, 0.5
    x2, y2 = 0.7, -0.3
    ax.plot([x1, x2], [y1, y2], 'r-', linewidth=2)
    ax.plot([x1, x2], [y1, y2], 'ro', markersize=8)
    
    ax.set_title('CONVEX: Circle\nAll line segments stay inside')
    ax.set_aspect('equal')
    ax.grid(True)
    
    # Convex set: Polytope
    vertices = np.array([[0, 1], [1, 0.5], [0.8, -0.5], [-0.5, -0.7], [-1, 0.3], [0, 1]])
    
    ax = axes[1]
    ax.fill(vertices[:, 0], vertices[:, 1], alpha=0.3, color='green')
    ax.plot(vertices[:, 0], vertices[:, 1], 'g-', linewidth=2)
    
    # Show line segment inside
    x1, y1 = -0.3, 0.2
    x2, y2 = 0.5, -0.1
    ax.plot([x1, x2], [y1, y2], 'r-', linewidth=2)
    ax.plot([x1, x2], [y1, y2], 'ro', markersize=8)
    
    ax.set_title('CONVEX: Polytope\nAll line segments stay inside')
    ax.set_aspect('equal')
    ax.grid(True)
    
    # Non-convex set: Star shape
    theta = np.linspace(0, 2*np.pi, 100)
    r = 1 + 0.5 * np.sin(5*theta)
    x_star = r * np.cos(theta)
    y_star = r * np.sin(theta)
    
    ax = axes[2]
    ax.fill(x_star, y_star, alpha=0.3, color='red')
    ax.plot(x_star, y_star, 'r-', linewidth=2)
    
    # Show line segment going outside
    x1, y1 = -1.2, 0.3
    x2, y2 = 1.2, 0.3
    ax.plot([x1, x2], [y1, y2], 'k-', linewidth=2)
    ax.plot([x1, x2], [y1, y2], 'ko', markersize=8)
    
    ax.set_title('NON-CONVEX: Star\nLine segment goes outside!')
    ax.set_aspect('equal')
    ax.grid(True)
    
    plt.tight_layout()
    plt.show()

visualize_convex_sets()

Lagrange Multipliers: Handling Equality Constraints

The Problem

Minimize $f(x)$ subject to $g(x) = 0$

The Lagrangian

$$\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$$

At the optimum, gradients of $f$ and $g$ are parallel: $$\nabla f(x^) = -\lambda \nabla g(x^)$$

import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

def lagrange_example():
    """
    Example: Minimize x² + y² subject to x + y = 1
    
    Lagrangian: L = x² + y² + λ(x + y - 1)
    
    ∂L/∂x = 2x + λ = 0  →  x = -λ/2
    ∂L/∂y = 2y + λ = 0  →  y = -λ/2
    ∂L/∂λ = x + y - 1 = 0
    
    From x = y = -λ/2 and x + y = 1:
    -λ = 1  →  λ = -1
    x = y = 0.5
    """
    
    # Objective function
    def f(xy):
        return xy[0]**2 + xy[1]**2
    
    # Constraint: x + y = 1
    def constraint(xy):
        return xy[0] + xy[1] - 1
    
    # Solve using scipy
    from scipy.optimize import minimize
    
    result = minimize(
        f, 
        x0=[0, 0],
        constraints={'type': 'eq', 'fun': constraint}
    )
    
    print("Lagrange Multiplier Example")
    print("=" * 40)
    print(f"Problem: min x² + y² s.t. x + y = 1")
    print(f"\nAnalytical solution:")
    print(f"  x* = 0.5, y* = 0.5, λ* = -1")
    print(f"\nNumerical solution:")
    print(f"  x* = {result.x[0]:.4f}, y* = {result.x[1]:.4f}")
    print(f"  min f = {result.fun:.4f}")
    
    # Visualization
    x = np.linspace(-1, 2, 100)
    y = np.linspace(-1, 2, 100)
    X, Y = np.meshgrid(x, y)
    Z = X**2 + Y**2
    
    plt.figure(figsize=(10, 8))
    
    # Contours of objective
    contours = plt.contour(X, Y, Z, levels=20, cmap='viridis')
    plt.clabel(contours, inline=True, fontsize=8)
    
    # Constraint line
    plt.plot(x, 1 - x, 'r-', linewidth=2, label='x + y = 1')
    
    # Optimal point
    plt.plot(0.5, 0.5, 'r*', markersize=20, label='Optimal (0.5, 0.5)')
    
    # Gradient of f at optimal
    scale = 0.3
    plt.arrow(0.5, 0.5, scale*1, scale*1, head_width=0.05, color='blue', label='∇f')
    
    # Gradient of g (normal to constraint)
    plt.arrow(0.5, 0.5, scale*1, scale*1, head_width=0.05, color='green', label='∇g')
    
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Lagrange Multiplier: ∇f parallel to ∇g at optimum')
    plt.legend()
    plt.axis('equal')
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.grid(True)
    plt.show()
    
    return result

lagrange_example()

Lagrange Multipliers in Machine Learning

Application: Maximum Entropy Distribution

Find probability distribution $p$ that maximizes entropy subject to constraints:

$$\max_p -\sum_i p_i \log p_i \quad \text{s.t.} \quad \sum_i p_i = 1, \quad \mathbb{E}[f(x)] = c$$

KKT Conditions: Handling Inequality Constraints

The General Problem

$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \quad h_j(x) = 0$$

The Lagrangian

$$\mathcal{L}(x, \mu, \lambda) = f(x) + \sum_i \mu_i g_i(x) + \sum_j \lambda_j h_j(x)$$

KKT Conditions

For a point $x^*$ to be optimal, these conditions must hold:

Stationarity: $\nabla_x \mathcal{L}(x^, \mu^, \lambda^*) = 0$
Primal feasibility: $g_i(x^) \leq 0$, $h_j(x^) = 0$
Dual feasibility: $\mu_i^* \geq 0$
Complementary slackness: $\mu_i^* g_i(x^*) = 0$

import numpy as np
from scipy.optimize import minimize

def kkt_example():
    """
    Example: min x² + y²  s.t.  x + y ≥ 1
    
    Convert to: min x² + y²  s.t.  1 - x - y ≤ 0
    
    KKT conditions:
    1. 2x - μ = 0  →  x = μ/2
    2. 2y - μ = 0  →  y = μ/2
    3. 1 - x - y ≤ 0 (primal feasibility)
    4. μ ≥ 0 (dual feasibility)
    5. μ(1 - x - y) = 0 (complementary slackness)
    
    Case 1: μ = 0 (constraint inactive)
    Then x = y = 0, but this violates x + y ≥ 1
    
    Case 2: μ > 0 (constraint active, x + y = 1)
    From x = y = μ/2 and x + y = 1: μ = 1
    So x* = y* = 0.5, μ* = 1
    """
    
    # Objective
    def f(xy):
        return xy[0]**2 + xy[1]**2
    
    # Inequality constraint: x + y >= 1 → -(x + y - 1) <= 0
    def ineq_constraint(xy):
        return xy[0] + xy[1] - 1  # Must be >= 0
    
    result = minimize(
        f,
        x0=[0, 0],
        constraints={'type': 'ineq', 'fun': ineq_constraint}
    )
    
    print("KKT Example")
    print("=" * 40)
    print("Problem: min x² + y² s.t. x + y ≥ 1")
    print(f"\nSolution: x* = {result.x[0]:.4f}, y* = {result.x[1]:.4f}")
    print(f"Constraint value: x + y = {sum(result.x):.4f} ≥ 1 ✓")
    
    # Verify KKT
    x_opt = result.x
    
    # Compute gradients at optimum
    grad_f = 2 * x_opt  # [2x, 2y]
    grad_g = np.array([-1, -1])  # gradient of -(x+y-1)
    
    # From stationarity: grad_f + μ * grad_g = 0
    # 2x - μ = 0 → μ = 2x
    mu = 2 * x_opt[0]
    
    print(f"\nKKT Verification:")
    print(f"  μ* = {mu:.4f} ≥ 0 ✓ (dual feasibility)")
    print(f"  Constraint active: x + y = 1 ✓")
    print(f"  μ * (1 - x - y) = {mu * (1 - sum(x_opt)):.6f} ≈ 0 ✓ (complementary slackness)")
    
    # Visualization
    x = np.linspace(-0.5, 1.5, 100)
    y = np.linspace(-0.5, 1.5, 100)
    X, Y = np.meshgrid(x, y)
    Z = X**2 + Y**2
    
    plt.figure(figsize=(10, 8))
    
    # Contours of objective
    contours = plt.contour(X, Y, Z, levels=15, cmap='viridis')
    plt.clabel(contours, inline=True, fontsize=8)
    
    # Constraint: x + y >= 1 (shade feasible region)
    plt.fill_between(x, 1-x, 2, alpha=0.2, color='green', label='Feasible region')
    plt.plot(x, 1 - x, 'r-', linewidth=2, label='x + y = 1 (boundary)')
    
    # Optimal point
    plt.plot(0.5, 0.5, 'r*', markersize=20, label='Optimal (0.5, 0.5)')
    
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('KKT: Optimal point on constraint boundary')
    plt.legend()
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.grid(True)
    plt.show()
    
    return result

kkt_example()

Duality: A Powerful Optimization Tool

The Dual Problem

For the primal problem: $$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0$$

The dual problem is: $$\max_\mu d(\mu) = \min_x \mathcal{L}(x, \mu) \quad \text{s.t.} \quad \mu \geq 0$$

Weak and Strong Duality

Weak duality: $d^* \leq p^*$ (dual optimal ≤ primal optimal)
Strong duality: $d^* = p^*$ (for convex problems under mild conditions)

Why Duality Matters in ML

SVM dual formulation is easier to solve and reveals support vectors
Lower bounds on optimal value
Kernel trick enabled by dual formulation

import numpy as np
from scipy.optimize import minimize

def svm_dual_example():
    """
    SVM Dual Formulation Example
    
    Primal: min (1/2)||w||² s.t. y_i(w·x_i + b) ≥ 1
    
    Dual: max Σα_i - (1/2)ΣΣ α_i α_j y_i y_j (x_i·x_j)
          s.t. α_i ≥ 0, Σα_i y_i = 0
    """
    
    # Simple 2D dataset (linearly separable)
    X = np.array([
        [1, 2],
        [2, 3],
        [3, 3],
        [6, 5],
        [7, 8],
        [8, 8]
    ])
    y = np.array([-1, -1, -1, 1, 1, 1])
    
    n_samples = len(X)
    
    # Compute Gram matrix
    K = X @ X.T
    
    # Dual objective (to maximize, so we negate for minimization)
    def dual_objective(alpha):
        return -(np.sum(alpha) - 0.5 * np.sum(np.outer(alpha * y, alpha * y) * K))
    
    def dual_gradient(alpha):
        return -(np.ones(n_samples) - K @ (alpha * y) * y)
    
    # Constraints
    constraints = [
        {'type': 'eq', 'fun': lambda a: np.dot(a, y)},  # Σα_i y_i = 0
    ]
    bounds = [(0, None) for _ in range(n_samples)]  # α_i ≥ 0
    
    # Solve dual
    result = minimize(
        dual_objective,
        x0=np.zeros(n_samples),
        method='SLSQP',
        bounds=bounds,
        constraints=constraints
    )
    
    alpha = result.x
    
    # Recover w from α: w = Σα_i y_i x_i
    w = np.sum(alpha[:, np.newaxis] * y[:, np.newaxis] * X, axis=0)
    
    # Find support vectors (α > 0)
    sv_mask = alpha > 1e-5
    support_vectors = X[sv_mask]
    
    # Compute b using support vector
    sv_idx = np.where(sv_mask)[0][0]
    b = y[sv_idx] - np.dot(w, X[sv_idx])
    
    print("SVM Dual Problem Example")
    print("=" * 40)
    print(f"Optimal α: {alpha.round(4)}")
    print(f"Support vectors (α > 0): {np.where(sv_mask)[0]}")
    print(f"w = {w.round(4)}")
    print(f"b = {b:.4f}")
    print(f"\nDecision boundary: {w[0]:.2f}x₁ + {w[1]:.2f}x₂ + {b:.2f} = 0")
    
    # Visualization
    plt.figure(figsize=(10, 8))
    
    # Plot points
    plt.scatter(X[y==-1, 0], X[y==-1, 1], c='red', s=100, label='Class -1', edgecolors='k')
    plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', s=100, label='Class +1', edgecolors='k')
    
    # Mark support vectors
    plt.scatter(support_vectors[:, 0], support_vectors[:, 1], 
                s=200, facecolors='none', edgecolors='green', linewidths=2,
                label='Support Vectors')
    
    # Plot decision boundary
    x_line = np.linspace(0, 10, 100)
    y_line = -(w[0] * x_line + b) / w[1]
    plt.plot(x_line, y_line, 'g-', linewidth=2, label='Decision boundary')
    
    # Plot margins
    margin = 1 / np.linalg.norm(w)
    y_margin_pos = -(w[0] * x_line + b - 1) / w[1]
    y_margin_neg = -(w[0] * x_line + b + 1) / w[1]
    plt.plot(x_line, y_margin_pos, 'g--', linewidth=1)
    plt.plot(x_line, y_margin_neg, 'g--', linewidth=1)
    plt.fill_between(x_line, y_margin_neg, y_margin_pos, alpha=0.1, color='green')
    
    plt.xlabel('x₁')
    plt.ylabel('x₂')
    plt.title(f'SVM: Found via Dual Problem (margin = {margin:.2f})')
    plt.legend()
    plt.xlim(0, 10)
    plt.ylim(0, 10)
    plt.grid(True)
    plt.show()
    
    return alpha, w, b

svm_dual_example()

Applications in Machine Learning

1. Ridge Regression (L2 Regularization)

$$\min_w |Xw - y|_2^2 + \lambda|w|_2^2$$

This is convex (sum of convex functions) with closed-form solution: $$w^* = (X^TX + \lambda I)^{-1}X^Ty$$

import numpy as np

class RidgeRegression:
    """Ridge regression - convex optimization with closed form."""
    
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.w = None
    
    def fit(self, X, y):
        """Solve via normal equations (convex -> unique solution)."""
        n_features = X.shape[1]
        # (X'X + αI)⁻¹ X'y
        self.w = np.linalg.solve(
            X.T @ X + self.alpha * np.eye(n_features),
            X.T @ y
        )
        return self
    
    def predict(self, X):
        return X @ self.w

# Example
np.random.seed(42)
X = np.random.randn(100, 5)
w_true = np.array([1, -2, 3, 0, 0])  # Sparse ground truth
y = X @ w_true + np.random.randn(100) * 0.1

model = RidgeRegression(alpha=0.1)
model.fit(X, y)

print("Ridge Regression (Convex)")
print(f"True weights:      {w_true}")
print(f"Estimated weights: {model.w.round(3)}")

2. Lasso Regression (L1 Regularization)

$$\min_w |Xw - y|_2^2 + \lambda|w|_1$$

Convex but not differentiable at zero. Use subgradient or proximal methods.

def soft_threshold(x, threshold):
    """Soft thresholding operator (proximal operator for L1)."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0)

class LassoRegression:
    """Lasso regression using coordinate descent (ISTA)."""
    
    def __init__(self, alpha=1.0, max_iter=1000, tol=1e-4):
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol
        self.w = None
    
    def fit(self, X, y):
        """Coordinate descent for Lasso."""
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        
        # Precompute X'X and X'y
        XTX = X.T @ X
        XTy = X.T @ y
        
        for iteration in range(self.max_iter):
            w_old = self.w.copy()
            
            for j in range(n_features):
                # Compute residual excluding feature j
                r_j = XTy[j] - np.dot(XTX[j], self.w) + XTX[j, j] * self.w[j]
                
                # Soft thresholding update
                self.w[j] = soft_threshold(r_j, self.alpha * n_samples) / XTX[j, j]
            
            # Check convergence
            if np.linalg.norm(self.w - w_old) < self.tol:
                print(f"Converged at iteration {iteration}")
                break
        
        return self
    
    def predict(self, X):
        return X @ self.w

# Example
model_lasso = LassoRegression(alpha=0.1)
model_lasso.fit(X, y)

print("\nLasso Regression (Convex, Sparse)")
print(f"True weights:      {w_true}")
print(f"Estimated weights: {model_lasso.w.round(3)}")
print(f"Sparsity: {np.sum(np.abs(model_lasso.w) < 0.01)}/{len(model_lasso.w)} zeros")

3. Logistic Regression

$$\min_w \sum_i \log(1 + e^{-y_i w^T x_i}) + \lambda|w|_2^2$$

Convex! No local minima. Gradient descent converges to global optimum.

class LogisticRegression:
    """Logistic regression - convex optimization."""
    
    def __init__(self, lr=0.1, max_iter=1000, reg=0.01):
        self.lr = lr
        self.max_iter = max_iter
        self.reg = reg
        self.w = None
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def loss(self, X, y):
        """Logistic loss (convex)."""
        z = X @ self.w
        return -np.mean(y * np.log(self.sigmoid(z) + 1e-10) + 
                        (1 - y) * np.log(1 - self.sigmoid(z) + 1e-10)) + \
               self.reg * np.sum(self.w ** 2)
    
    def fit(self, X, y):
        """Gradient descent (guaranteed to converge for convex)."""
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        
        losses = []
        for i in range(self.max_iter):
            # Gradient
            pred = self.sigmoid(X @ self.w)
            grad = X.T @ (pred - y) / n_samples + 2 * self.reg * self.w
            
            # Update
            self.w -= self.lr * grad
            
            loss = self.loss(X, y)
            losses.append(loss)
            
            if i > 0 and abs(losses[-1] - losses[-2]) < 1e-6:
                print(f"Converged at iteration {i}")
                break
        
        return self, losses
    
    def predict_proba(self, X):
        return self.sigmoid(X @ self.w)
    
    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

FAQs

When is a problem convex?

A problem is convex when:

The objective function is convex
Equality constraints are affine (linear)
Inequality constraints define convex sets

What if my problem is non-convex?

Options:

Convex relaxation: Approximate with convex problem
Local optimization: Gradient descent finds local optimum
Global methods: Branch and bound, random restarts
Heuristics: Problem-specific approaches

Why do we care about KKT conditions?

KKT conditions:

Provide necessary conditions for optimality
Enable deriving dual problems
Give insight into solution structure (active constraints)
Are used in many optimization algorithms

How do SVM support vectors relate to KKT?

By complementary slackness ($\mu_i g_i = 0$):

Points with $\mu_i > 0$ are support vectors (on margin)
Points with $\mu_i = 0$ have zero influence (inside margin)

Key Takeaways

Convex functions have no local minima - gradient descent always succeeds
Lagrange multipliers handle equality constraints
KKT conditions extend to inequality constraints
Duality enables efficient algorithms (SVM kernel trick)
Many ML algorithms are convex: regression, classification, SVMs

Next Steps

Continue your optimization journey:

Second-Order Optimization - Newton, L-BFGS methods
Gradient Descent Optimizers - SGD, Adam, and more
SVM Complete Guide - Convex optimization in action

References

Boyd, S., Vandenberghe, L. “Convex Optimization” (2004) - The bible
Nocedal, J., Wright, S. “Numerical Optimization” (2006)
Shalev-Shwartz, S., Ben-David, S. “Understanding Machine Learning” (2014) - Chapter 13
Stanford EE364a: Convex Optimization - https://web.stanford.edu/class/ee364a/

Last updated: January 2024. This guide is part of our Mathematics for Machine Learning series.

Documentation

Convex Optimization for Machine Learning: From Lagrange to KKT Conditions

Why Does Convexity Matter in Machine Learning?

The Guarantee of Global Optima

Convex ML Algorithms

What Makes a Function Convex?

Definition of Convexity

Testing for Convexity

Common Convex Functions in ML

Convex Sets and Feasibility

What Is a Convex Set?

Lagrange Multipliers: Handling Equality Constraints

The Problem

The Lagrangian

Lagrange Multipliers in Machine Learning

KKT Conditions: Handling Inequality Constraints

The General Problem

The Lagrangian

KKT Conditions

Duality: A Powerful Optimization Tool

The Dual Problem

Weak and Strong Duality

Why Duality Matters in ML

Applications in Machine Learning

1. Ridge Regression (L2 Regularization)

2. Lasso Regression (L1 Regularization)

3. Logistic Regression

FAQs

When is a problem convex?

What if my problem is non-convex?

Why do we care about KKT conditions?

How do SVM support vectors relate to KKT?

Key Takeaways

Next Steps

References