Minimum Description Length and Model Selection: Complete Guide 2025

Master Minimum Description Length (MDL) principle for model selection: understand Occam's razor, complexity regularization, and practical applications in machine learning.

21 min read Jan 26, 2025

The Minimum Description Length (MDL) principle provides a profound connection between data compression, statistical inference, and machine learning. It formalizes Occam’s razor—the preference for simpler explanations—into a rigorous mathematical framework for model selection.

Why Minimum Description Length Matters in Machine Learning

Consider the fundamental challenge of model selection: given data, how do you choose between a simple model that might underfit and a complex model that might overfit? MDL provides an elegant answer: choose the model that provides the shortest complete description of the data.

“The best model is the one that compresses the data the most.” — Jorma Rissanen, Founder of MDL Theory

Real-World Impact

Indian tech companies and research institutions apply MDL principles to:

  • Flipkart: Feature selection for recommendation systems
  • TCS Research: Anomaly detection in time series
  • IIT Bombay: Neural architecture search
  • Reliance Jio: Network traffic pattern analysis

Section 1: The Minimum Description Length Principle

What Is MDL and How Does It Work?

The MDL principle states that the best hypothesis for data is the one that minimizes the total description length:

$$L(\text{total}) = L(\text{model}) + L(\text{data} | \text{model})$$

where:

  • $L(\text{model})$ = bits needed to describe the model
  • $L(\text{data} | \text{model})$ = bits needed to describe data given the model

The Key Insight: A model that fits the data perfectly but is very complex (high $L(\text{model})$) is not necessarily better than a simpler model with slightly worse fit.

How Does MDL Relate to Kolmogorov Complexity?

Kolmogorov complexity $K(x)$ is the length of the shortest program that outputs $x$. While $K(x)$ is uncomputable, MDL provides a practical approximation:

| Concept | Definition | Computability |
| --- | --- | --- |
| Kolmogorov Complexity | Shortest program length | Uncomputable |
| MDL | Shortest description within a model class | Computable |
| Shannon Entropy | Expected code length | Computable |

import numpy as np
from scipy.stats import norm

def description_length_gaussian(data, mu, sigma):
    """
    Compute description length for data under Gaussian model.
    
    L(data|model) = -log p(data|mu, sigma)
    
    Args:
        data: Observed data points
        mu: Gaussian mean
        sigma: Gaussian standard deviation
    
    Returns:
        Description length in bits
    """
    n = len(data)
    
    # Negative log-likelihood (in bits)
    nll = 0.5 * n * np.log2(2 * np.pi * sigma**2)
    nll += np.sum((data - mu)**2) / (2 * sigma**2 * np.log(2))
    
    return nll

def description_length_polynomial(data, degree, x):
    """
    Compute description length for polynomial regression.
    
    Includes model complexity penalty.
    
    Args:
        data: (x, y) observed data
        degree: Polynomial degree
        x: Input values
    
    Returns:
        total_dl: Total description length
        model_dl: Model description length
        data_dl: Data description length given model
    """
    y = data
    
    # Fit polynomial
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    residuals = y - y_pred
    sigma = np.std(residuals) if np.std(residuals) > 0 else 1e-10
    
    # Model description length
    # Number of parameters * bits per parameter
    n_params = degree + 1
    bits_per_param = 32  # Assuming 32-bit float precision
    model_dl = n_params * bits_per_param
    
    # Data description length given model
    n = len(y)
    data_dl = description_length_gaussian(residuals, 0, sigma)
    
    # Add parameter encoding cost
    data_dl += n_params * np.log2(n) / 2  # Approximate MDL penalty
    
    total_dl = model_dl + data_dl
    
    return total_dl, model_dl, data_dl

# Example: Model selection using MDL
print("=== MDL for Polynomial Model Selection ===\n")

np.random.seed(42)

# Generate data from quadratic function
n = 100
x = np.linspace(-3, 3, n)
y_true = 0.5 * x**2 - x + 1  # True function is degree 2
y = y_true + np.random.randn(n) * 0.5  # Add noise

# Compare different polynomial degrees
print(f"{'Degree':<10} {'Total DL':<15} {'Model DL':<15} {'Data DL':<15}")
print("-" * 55)

best_degree = 0
best_dl = float('inf')

for degree in range(1, 10):
    total_dl, model_dl, data_dl = description_length_polynomial(y, degree, x)
    
    if total_dl < best_dl:
        best_dl = total_dl
        best_degree = degree
    
    print(f"{degree:<10} {total_dl:<15.2f} {model_dl:<15.2f} {data_dl:<15.2f}")

print("-" * 55)
print(f"\nBest model: degree {best_degree} (minimum description length)")
print(f"True model: degree 2")

Section 2: Two-Part vs Prequential MDL

What Are the Different MDL Approaches?

Two-Part MDL (Crude MDL):

$$L_{\text{two-part}}(x) = L(\hat{\theta}) + L(x | \hat{\theta})$$

Encode the model parameters $\hat{\theta}$ separately from the data.

Prequential (Predictive) MDL:

$$L_{\text{prequential}}(x_1, \ldots, x_n) = -\sum_{i=1}^{n} \log p(x_i | x_1, \ldots, x_{i-1})$$

No explicit model encoding—uses sequential predictions.

Normalized Maximum Likelihood (NML):

$$L_{\text{NML}}(x) = -\log p(x | \hat{\theta}(x)) + \log \sum_{y} p(y | \hat{\theta}(y))$$

The theoretically optimal MDL, but often intractable.

def two_part_mdl(data, model_class, param_precision=32):
    """
    Compute two-part MDL code length.
    
    L = L(parameters) + L(data|parameters)
    
    Args:
        data: Observed data
        model_class: Model class with fit() and nll() methods
        param_precision: Bits per parameter
    
    Returns:
        Total description length
    """
    # Fit model to get MLE parameters
    params = model_class.fit(data)
    n_params = len(params)
    
    # Model description length
    model_dl = n_params * param_precision
    
    # Data description length given model
    data_dl = model_class.nll(data, params) / np.log(2)  # Convert nats to bits
    
    return model_dl + data_dl, model_dl, data_dl

def prequential_mdl(data, model_class, initial_data=10):
    """
    Compute prequential (predictive) MDL code length.
    
    L = -sum_{i=1}^{n} log p(x_i | x_1, ..., x_{i-1})
    
    Uses online learning to make predictions.
    
    Args:
        data: Observed data sequence
        model_class: Model with online_update() and predict_prob() methods
        initial_data: Number of initial samples before prediction
    
    Returns:
        Total description length in bits
    """
    n = len(data)
    total_dl = 0
    
    # Initialize with uniform code for initial data
    # (simplified—real implementation would use universal prior)
    total_dl += initial_data * 8  # Assume 8 bits per initial sample
    
    # Prequential coding for remaining data
    model = model_class()
    model.fit(data[:initial_data])
    
    for i in range(initial_data, n):
        # Predict probability of next observation
        prob = model.predict_prob(data[i])
        
        # Code length for this observation
        total_dl += -np.log2(prob + 1e-10)
        
        # Update model with new observation
        model.update(data[i])
    
    return total_dl

class OnlineGaussian:
    """
    Online Gaussian model for prequential MDL.
    """
    
    def __init__(self):
        self.n = 0
        self.mean = 0
        self.M2 = 0  # Sum of squared differences
    
    def fit(self, data):
        """Initialize from data."""
        self.n = len(data)
        self.mean = np.mean(data)
        self.M2 = np.sum((data - self.mean)**2)
    
    def update(self, x):
        """Welford's online algorithm for mean and variance."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2
    
    def predict_prob(self, x):
        """Predict probability of observation."""
        if self.n < 2:
            return 0.1  # Uninformative prior
        
        var = self.M2 / (self.n - 1)
        std = np.sqrt(var) if var > 0 else 1e-10
        
        # Gaussian probability
        return norm.pdf(x, self.mean, std)

print("=== Prequential MDL Example ===\n")

# Generate Gaussian data
np.random.seed(42)
data = np.random.randn(200) * 2 + 5  # N(5, 4)

# Compute prequential MDL
model = OnlineGaussian()
model.fit(data[:10])

total_dl = 10 * 8  # Initial samples
for i in range(10, len(data)):
    prob = model.predict_prob(data[i])
    total_dl += -np.log2(prob + 1e-10)
    model.update(data[i])

print(f"Prequential MDL: {total_dl:.2f} bits")
print(f"Average bits per sample: {total_dl / len(data):.2f}")
print(f"Theoretical entropy: {0.5 * np.log2(2 * np.pi * np.e * 4):.2f} bits")

Section 3: MDL and Regularization

How Does MDL Connect to Regularization?

MDL provides a principled justification for regularization in machine learning. The connection is:

$$\text{MDL} = -\log p(\text{data}|\theta) + \lambda \cdot \text{complexity}(\theta)$$

This is equivalent to regularized maximum likelihood!

| MDL Component | ML Equivalent | Effect |
| --- | --- | --- |
| $L(\text{data} \mid \theta)$ | Negative log-likelihood | Fit data |
| $L(\theta)$ | Regularization term | Control complexity |
| Total MDL | Regularized loss | Balance fit and complexity |

def mdl_regularization_equivalence():
    """
    Demonstrate the equivalence between MDL and regularized ML.
    """
    print("=== MDL-Regularization Equivalence ===\n")
    
    # Generate linear regression data
    np.random.seed(42)
    n, d = 100, 10
    
    # True model uses only first 3 features
    X = np.random.randn(n, d)
    true_weights = np.zeros(d)
    true_weights[:3] = [2, -1, 0.5]
    y = X @ true_weights + np.random.randn(n) * 0.5
    
    # Method 1: MDL-based feature selection
    def mdl_feature_selection(X, y):
        """Select features using MDL criterion."""
        n, d = X.shape
        
        best_mdl = float('inf')
        
        # Greedy forward selection
        remaining = list(range(d))
        selected = []
        
        while remaining:
            best_next = None
            
            for f in remaining:
                test_features = selected + [f]
                X_sub = X[:, test_features]
                
                # Fit model
                coeffs = np.linalg.lstsq(X_sub, y, rcond=None)[0]
                residuals = y - X_sub @ coeffs
                sigma = np.std(residuals)
                
                # MDL: model complexity + data given model
                n_params = len(test_features) + 1  # +1 for variance
                model_dl = n_params * 32  # bits per parameter
                data_dl = n * 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
                total_mdl = model_dl + data_dl
                
                if total_mdl < best_mdl:
                    best_mdl = total_mdl
                    best_next = f
            
            if best_next is not None:
                selected.append(best_next)
                remaining.remove(best_next)
            else:
                break
        
        return selected, best_mdl
    
    # Method 2: L1 regularization (Lasso)
    def lasso_feature_selection(X, y, alpha=0.1):
        """Select features using Lasso."""
        from sklearn.linear_model import Lasso
        
        lasso = Lasso(alpha=alpha)
        lasso.fit(X, y)
        
        selected = np.where(np.abs(lasso.coef_) > 1e-6)[0].tolist()
        return selected, lasso.coef_
    
    # Compare methods
    mdl_features, mdl_score = mdl_feature_selection(X, y)
    lasso_features, lasso_coefs = lasso_feature_selection(X, y)
    
    print("True non-zero features: [0, 1, 2]")
    print(f"MDL selected features: {sorted(mdl_features)}")
    print(f"Lasso selected features: {sorted(lasso_features)}")
    print(f"\nBoth methods identify sparse structure through complexity penalties!")

mdl_regularization_equivalence()

What Is the Relationship Between MDL and BIC/AIC?

MDL is closely related to information criteria used for model selection:

Bayesian Information Criterion (BIC): $$\text{BIC} = -2 \log L(\hat{\theta}) + k \log n$$

Akaike Information Criterion (AIC): $$\text{AIC} = -2 \log L(\hat{\theta}) + 2k$$

MDL (Stochastic Complexity): $$\text{MDL} \approx -\log L(\hat{\theta}) + \frac{k}{2} \log n + O(1)$$

def compare_model_selection_criteria():
    """
    Compare AIC, BIC, and MDL for model selection.
    """
    print("=== Model Selection Criteria Comparison ===\n")
    
    np.random.seed(42)
    
    # Generate polynomial data
    n = 50
    x = np.linspace(-2, 2, n)
    y_true = 0.5 * x**3 - x**2 + 2*x - 1  # Cubic polynomial
    y = y_true + np.random.randn(n) * 0.5
    
    def fit_polynomial(x, y, degree):
        """Fit polynomial and return statistics."""
        coeffs = np.polyfit(x, y, degree)
        y_pred = np.polyval(coeffs, x)
        residuals = y - y_pred
        
        # Residual sum of squares
        rss = np.sum(residuals**2)
        
        # Maximum likelihood variance estimate
        sigma2 = rss / n
        
        # Log-likelihood (Gaussian)
        log_lik = -n/2 * np.log(2*np.pi) - n/2 * np.log(sigma2) - rss/(2*sigma2)
        
        return log_lik, degree + 1  # k = degree + 1 parameters
    
    print(f"{'Degree':<8} {'Log-Lik':<12} {'AIC':<12} {'BIC':<12} {'MDL':<12}")
    print("-" * 56)
    
    best_aic = (float('inf'), 0)
    best_bic = (float('inf'), 0)
    best_mdl = (float('inf'), 0)
    
    for degree in range(1, 10):
        log_lik, k = fit_polynomial(x, y, degree)
        
        # AIC: -2*log_lik + 2*k
        aic = -2 * log_lik + 2 * k
        
        # BIC: -2*log_lik + k*log(n)
        bic = -2 * log_lik + k * np.log(n)
        
        # MDL approximation: -log_lik + (k/2)*log(n)
        mdl = -log_lik + (k/2) * np.log(n)
        
        if aic < best_aic[0]:
            best_aic = (aic, degree)
        if bic < best_bic[0]:
            best_bic = (bic, degree)
        if mdl < best_mdl[0]:
            best_mdl = (mdl, degree)
        
        print(f"{degree:<8} {log_lik:<12.2f} {aic:<12.2f} {bic:<12.2f} {mdl:<12.2f}")
    
    print("-" * 56)
    print(f"\nTrue model: degree 3")
    print(f"AIC selects: degree {best_aic[1]}")
    print(f"BIC selects: degree {best_bic[1]}")
    print(f"MDL selects: degree {best_mdl[1]}")
    
    print(f"\nNote: BIC and MDL have stronger complexity penalty than AIC")
    print(f"      BIC/MDL penalty ~ (k/2)*log(n), AIC penalty ~ k")

compare_model_selection_criteria()

Section 4: MDL for Neural Network Model Selection

MDL provides a principled criterion for choosing neural network architectures:

import torch
import torch.nn as nn

def neural_net_mdl(model, data_loader, bits_per_weight=32):
    """
    Compute MDL for a neural network.
    
    MDL = L(weights) + L(data|weights)
    
    Args:
        model: PyTorch model
        data_loader: Data loader
        bits_per_weight: Bits to encode each weight
    
    Returns:
        total_mdl: Total description length
        model_dl: Model (weights) description length
        data_dl: Data description length given model
    """
    # Count parameters
    n_params = sum(p.numel() for p in model.parameters())
    
    # Model description length
    model_dl = n_params * bits_per_weight
    
    # Data description length (negative log-likelihood)
    model.eval()
    total_nll = 0
    n_samples = 0
    
    with torch.no_grad():
        for x, y in data_loader:
            logits = model(x)
            nll = nn.functional.cross_entropy(logits, y, reduction='sum')
            total_nll += nll.item()
            n_samples += x.size(0)
    
    # Convert to bits
    data_dl = total_nll / np.log(2)
    
    total_mdl = model_dl + data_dl
    
    return total_mdl, model_dl, data_dl

class MDLArchitectureSearch:
    """
    Neural Architecture Search using MDL criterion.
    """
    
    def __init__(self, input_dim, output_dim, candidate_architectures):
        """
        Args:
            input_dim: Input feature dimension
            output_dim: Number of output classes
            candidate_architectures: List of hidden layer configurations
        """
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.candidates = candidate_architectures
    
    def build_model(self, hidden_dims):
        """Build MLP with given hidden dimensions."""
        layers = []
        prev_dim = self.input_dim
        
        for h_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, h_dim))
            layers.append(nn.ReLU())
            prev_dim = h_dim
        
        layers.append(nn.Linear(prev_dim, self.output_dim))
        return nn.Sequential(*layers)
    
    def count_parameters(self, hidden_dims):
        """Count parameters for architecture."""
        model = self.build_model(hidden_dims)
        return sum(p.numel() for p in model.parameters())
    
    def train_and_evaluate(self, hidden_dims, train_loader, val_loader, epochs=10):
        """Train model and compute MDL."""
        model = self.build_model(hidden_dims)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        
        # Training
        model.train()
        for epoch in range(epochs):
            for x, y in train_loader:
                optimizer.zero_grad()
                logits = model(x)
                loss = nn.functional.cross_entropy(logits, y)
                loss.backward()
                optimizer.step()
        
        # Compute MDL on validation set
        total_mdl, model_dl, data_dl = neural_net_mdl(model, val_loader)
        
        return total_mdl, model_dl, data_dl, model
    
    def search(self, train_loader, val_loader, epochs=10):
        """
        Search for best architecture using MDL.
        
        Returns:
            best_architecture: Hidden dimensions of best model
            best_model: Trained best model
            results: MDL for all candidates
        """
        results = []
        best_mdl = float('inf')
        best_arch = None
        best_model = None
        
        for hidden_dims in self.candidates:
            n_params = self.count_parameters(hidden_dims)
            total_mdl, model_dl, data_dl, model = self.train_and_evaluate(
                hidden_dims, train_loader, val_loader, epochs
            )
            
            results.append({
                'architecture': hidden_dims,
                'n_params': n_params,
                'total_mdl': total_mdl,
                'model_dl': model_dl,
                'data_dl': data_dl
            })
            
            if total_mdl < best_mdl:
                best_mdl = total_mdl
                best_arch = hidden_dims
                best_model = model
        
        return best_arch, best_model, results

# Example architecture search
print("=== MDL-based Neural Architecture Search ===\n")

# Define candidate architectures
candidates = [
    [32],
    [64],
    [128],
    [64, 32],
    [128, 64],
    [256, 128],
    [128, 64, 32],
    [256, 128, 64],
]

search = MDLArchitectureSearch(
    input_dim=784,
    output_dim=10,
    candidate_architectures=candidates
)

# Show parameter counts
print(f"{'Architecture':<20} {'Parameters':<15}")
print("-" * 35)
for arch in candidates:
    n_params = search.count_parameters(arch)
    print(f"{str(arch):<20} {n_params:<15,}")

print("\nNote: Larger architectures have higher model_dl but potentially lower data_dl")
print("MDL balances this tradeoff to find optimal complexity")

How Does Weight Pruning Relate to MDL?

Weight pruning can be viewed through the MDL lens—removing weights reduces $L(\text{model})$ at the cost of potentially increasing $L(\text{data}|\text{model})$:

class MDLPruning:
    """
    Prune neural network weights using MDL criterion.
    """
    
    def __init__(self, model, val_loader, bits_per_weight=32):
        self.model = model
        self.val_loader = val_loader
        self.bits_per_weight = bits_per_weight
        
        # Store original MDL
        self.original_mdl, _, _ = neural_net_mdl(
            model, val_loader, bits_per_weight
        )
    
    def get_weight_importance(self):
        """
        Compute importance of each weight using magnitude.
        
        Returns dictionary mapping parameter name to importance scores.
        """
        importance = {}
        
        for name, param in self.model.named_parameters():
            if 'weight' in name:
                importance[name] = torch.abs(param.data).flatten()
        
        return importance
    
    def prune_by_magnitude(self, sparsity):
        """
        Prune weights by magnitude.
        
        Args:
            sparsity: Fraction of weights to set to zero
        
        Returns:
            Pruned model (modifies in place)
        """
        # Collect all weight magnitudes
        all_magnitudes = []
        for name, param in self.model.named_parameters():
            if 'weight' in name:
                all_magnitudes.append(torch.abs(param.data).flatten())
        
        all_magnitudes = torch.cat(all_magnitudes)
        
        # Find threshold
        threshold = torch.quantile(all_magnitudes, sparsity)
        
        # Prune weights below threshold
        with torch.no_grad():
            for name, param in self.model.named_parameters():
                if 'weight' in name:
                    mask = torch.abs(param.data) > threshold
                    param.data *= mask.float()
        
        return self.model
    
    def find_optimal_sparsity(self, sparsity_levels):
        """
        Find optimal sparsity level using MDL.
        
        Tests different sparsity levels and selects the one
        with minimum MDL.
        
        Args:
            sparsity_levels: List of sparsity values to test
        
        Returns:
            best_sparsity: Optimal sparsity level
            results: MDL for each sparsity level
        """
        results = []
        best_mdl = self.original_mdl
        best_sparsity = 0
        
        # Save original weights
        original_state = {k: v.clone() for k, v in self.model.state_dict().items()}
        
        for sparsity in sparsity_levels:
            # Restore original weights
            self.model.load_state_dict(original_state)
            
            # Prune
            self.prune_by_magnitude(sparsity)
            
            # Compute MDL
            total_mdl, model_dl, data_dl = neural_net_mdl(
                self.model, self.val_loader, self.bits_per_weight
            )
            
            # Adjust model_dl for sparsity
            # Zero weights don't need to be encoded
            n_nonzero = sum(
                (p != 0).sum().item() 
                for p in self.model.parameters()
            )
            adjusted_model_dl = n_nonzero * self.bits_per_weight
            adjusted_mdl = adjusted_model_dl + data_dl
            
            results.append({
                'sparsity': sparsity,
                'n_nonzero': n_nonzero,
                'model_dl': adjusted_model_dl,
                'data_dl': data_dl,
                'total_mdl': adjusted_mdl
            })
            
            if adjusted_mdl < best_mdl:
                best_mdl = adjusted_mdl
                best_sparsity = sparsity
        
        # Restore and apply best sparsity
        self.model.load_state_dict(original_state)
        self.prune_by_magnitude(best_sparsity)
        
        return best_sparsity, results

print("=== MDL-based Weight Pruning ===\n")

# Create example model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

n_params = sum(p.numel() for p in model.parameters())
print(f"Original model parameters: {n_params:,}")

# Demonstrate pruning concept
sparsity_levels = [0, 0.3, 0.5, 0.7, 0.9]

print(f"\n{'Sparsity':<12} {'Non-zero':<15} {'% Remaining':<15}")
print("-" * 42)

for sparsity in sparsity_levels:
    remaining = int(n_params * (1 - sparsity))
    print(f"{sparsity:<12.1%} {remaining:<15,} {(1-sparsity):<15.1%}")

print("\nMDL-based pruning finds optimal sparsity that minimizes:")
print("  model_dl (proportional to non-zero weights) + data_dl (negative log-likelihood)")

Section 5: MDL for Anomaly Detection

How Can MDL Detect Anomalies?

Anomalies are data points that require disproportionately long descriptions. MDL provides a natural criterion:

class MDLAnomalyDetector:
    """
    Detect anomalies using Minimum Description Length.
    
    Anomalies are points that require unusually long code lengths.
    """
    
    def __init__(self, model_type='gaussian'):
        """
        Args:
            model_type: 'gaussian' or 'mixture'
        """
        self.model_type = model_type
        self.params = None
    
    def fit(self, X):
        """Fit model to training data."""
        if self.model_type == 'gaussian':
            self.params = {
                'mean': np.mean(X, axis=0),
                'cov': np.cov(X.T) if X.ndim > 1 else np.var(X)
            }
        elif self.model_type == 'mixture':
            # Simple 2-component GMM
            from sklearn.mixture import GaussianMixture
            gmm = GaussianMixture(n_components=2, random_state=42)
            gmm.fit(X)
            self.params = {'gmm': gmm}
    
    def code_length(self, x):
        """
        Compute code length for individual point.
        
        L(x) = -log p(x|model)
        """
        if self.model_type == 'gaussian':
            mean = self.params['mean']
            cov = self.params['cov']
            
            if np.ndim(cov) == 0:
                # 1D case
                diff = x - mean
                log_prob = -0.5 * np.log(2 * np.pi * cov) - 0.5 * diff**2 / cov
            else:
                # Multi-dimensional
                from scipy.stats import multivariate_normal
                log_prob = multivariate_normal.logpdf(x, mean, cov)
            
            return -log_prob / np.log(2)  # Convert to bits
        
        elif self.model_type == 'mixture':
            gmm = self.params['gmm']
            log_prob = gmm.score_samples(x.reshape(1, -1))[0]
            return -log_prob / np.log(2)
    
    def detect(self, X, threshold_percentile=95):
        """
        Detect anomalies based on code length.
        
        Args:
            X: Data to test
            threshold_percentile: Percentile for anomaly threshold
        
        Returns:
            anomaly_scores: Code length for each point
            is_anomaly: Boolean mask of anomalies
        """
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        
        # Compute code lengths
        scores = np.array([self.code_length(x) for x in X])
        
        # Set threshold
        threshold = np.percentile(scores, threshold_percentile)
        
        # Detect anomalies
        is_anomaly = scores > threshold
        
        return scores, is_anomaly, threshold

# Example: Anomaly detection
print("=== MDL Anomaly Detection ===\n")

np.random.seed(42)

# Normal data
normal_data = np.random.randn(100, 2) * 1.5 + np.array([5, 5])

# Anomalies
anomalies = np.array([
    [5, 12],    # Far from cluster
    [-2, 5],    # Far from cluster
    [12, 5],    # Far from cluster
])

# Combine
X_train = normal_data
X_test = np.vstack([normal_data[:20], anomalies])

# Fit detector
detector = MDLAnomalyDetector(model_type='gaussian')
detector.fit(X_train)

# Detect
scores, is_anomaly, threshold = detector.detect(X_test, threshold_percentile=90)

print(f"Threshold code length: {threshold:.2f} bits")
print(f"Detected anomalies: {is_anomaly.sum()}")

# Show scores for known anomalies
print("\nCode lengths for test points:")
print(f"{'Type':<15} {'Code Length':<15} {'Anomaly?':<10}")
print("-" * 40)

for i in range(20, 23):
    print(f"{'Injected':<15} {scores[i]:<15.2f} {'Yes' if is_anomaly[i] else 'No':<10}")

Section 6: MDL for Time Series and Sequences

How Does MDL Apply to Sequential Data?

For time series, MDL helps select the optimal model order and detect change points:

class MDLTimeSeriesAnalysis:
    """
    MDL-based time series analysis.
    
    Applications:
    - AR model order selection
    - Change point detection
    - Segmentation
    """
    
    @staticmethod
    def ar_model_selection(data, max_order=10):
        """
        Select AR model order using MDL.
        
        Args:
            data: Time series data
            max_order: Maximum AR order to consider
        
        Returns:
            best_order: Optimal AR order
            mdl_scores: MDL for each order
        """
        n = len(data)
        mdl_scores = []
        
        for p in range(1, max_order + 1):
            # Build design matrix
            X = np.column_stack([
                data[i:n-p+i] for i in range(p)
            ])
            y = data[p:]
            
            # Fit AR model (OLS)
            coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
            residuals = y - X @ coeffs
            sigma2 = np.var(residuals)
            
            # MDL components
            # Model: p coefficients + variance
            model_dl = (p + 1) * 32
            
            # Data given model
            n_obs = len(y)
            data_dl = n_obs * 0.5 * np.log2(2 * np.pi * np.e * sigma2)
            data_dl += (p / 2) * np.log2(n_obs)  # MDL penalty
            
            total_mdl = model_dl + data_dl
            mdl_scores.append(total_mdl)
        
        best_order = np.argmin(mdl_scores) + 1
        
        return best_order, mdl_scores
    
    @staticmethod
    def change_point_detection(data, min_segment=10):
        """
        Detect change points using MDL.
        
        A change point is where a different model fits better.
        
        Args:
            data: Time series data
            min_segment: Minimum segment length
        
        Returns:
            change_points: List of detected change points
            segments: Segment boundaries
        """
        n = len(data)
        
        def segment_mdl(segment):
            """MDL for a single segment with Gaussian model."""
            if len(segment) < 2:
                return float('inf')
            
            # Fit Gaussian
            mean = np.mean(segment)
            var = np.var(segment) if np.var(segment) > 0 else 1e-10
            
            # Model: mean + variance
            model_dl = 2 * 32
            
            # Data
            data_dl = len(segment) * 0.5 * np.log2(2 * np.pi * np.e * var)
            
            return model_dl + data_dl
        
        # Greedy change point detection
        change_points = []
        
        def find_best_split(start, end):
            """Find best split point in range."""
            if end - start < 2 * min_segment:
                return None, float('inf')
            
            # MDL without split
            no_split_mdl = segment_mdl(data[start:end])
            
            best_point = None
            best_improvement = 0
            
            for cp in range(start + min_segment, end - min_segment):
                # MDL with split
                split_mdl = segment_mdl(data[start:cp]) + segment_mdl(data[cp:end])
                
                improvement = no_split_mdl - split_mdl
                
                if improvement > best_improvement:
                    best_improvement = improvement
                    best_point = cp
            
            return best_point, best_improvement
        
        # Binary segmentation
        segments = [(0, n)]
        
        while True:
            best_overall_point = None
            best_overall_improvement = 0
            best_segment_idx = None
            
            for idx, (start, end) in enumerate(segments):
                cp, improvement = find_best_split(start, end)
                
                if improvement > best_overall_improvement:
                    best_overall_improvement = improvement
                    best_overall_point = cp
                    best_segment_idx = idx
            
            if best_overall_point is None or best_overall_improvement < 100:  # threshold
                break
            
            # Split segment
            start, end = segments[best_segment_idx]
            segments[best_segment_idx] = (start, best_overall_point)
            segments.insert(best_segment_idx + 1, (best_overall_point, end))
            change_points.append(best_overall_point)
        
        return sorted(change_points), segments

# Example: Time series analysis
print("=== MDL Time Series Analysis ===\n")

np.random.seed(42)

# Generate AR(2) process
n = 200
ar_coeffs = [0.5, -0.3]
noise = np.random.randn(n)
ts = np.zeros(n)
ts[:2] = noise[:2]

for t in range(2, n):
    ts[t] = ar_coeffs[0] * ts[t-1] + ar_coeffs[1] * ts[t-2] + noise[t]

# Select AR order using MDL
best_order, mdl_scores = MDLTimeSeriesAnalysis.ar_model_selection(ts, max_order=8)

print("AR Model Order Selection:")
print(f"True order: 2")
print(f"MDL selected order: {best_order}")
print(f"\n{'Order':<8} {'MDL Score':<15}")
print("-" * 23)
for i, score in enumerate(mdl_scores, 1):
    marker = " *" if i == best_order else ""
    print(f"{i:<8} {score:<15.2f}{marker}")

# Change point detection
print("\n" + "="*50)
print("\nChange Point Detection:")

# Generate data with change points
segment1 = np.random.randn(50) * 1.0 + 0
segment2 = np.random.randn(50) * 1.5 + 3
segment3 = np.random.randn(50) * 0.8 - 2

ts_with_changes = np.concatenate([segment1, segment2, segment3])

change_points, segments = MDLTimeSeriesAnalysis.change_point_detection(
    ts_with_changes, min_segment=15
)

print(f"True change points: [50, 100]")
print(f"Detected change points: {change_points}")

Section 7: Advanced MDL Topics

What Is Normalized Maximum Likelihood (NML)?

NML provides the theoretically optimal MDL code, but is often intractable:

$$p_{NML}(x^n) = \frac{p(x^n | \hat{\theta}(x^n))}{\sum_{y^n} p(y^n | \hat{\theta}(y^n))}$$

The denominator (stochastic complexity) is the key challenge.

def compute_nml_simple(data, model_class):
    """
    Compute NML code length for simple cases.
    
    This is typically intractable for complex models.
    
    Args:
        data: Observed data
        model_class: Model with MLE fitting
    
    Returns:
        nml_code_length: NML description length
    """
    # MLE for observed data
    theta_hat = model_class.fit(data)
    log_prob_data = model_class.log_prob(data, theta_hat)
    
    # For tractable cases, we can approximate the normalizer
    # Here we show the concept
    
    # NML = -log p(data|theta_hat) + log(COMP)
    # where COMP = sum over all possible data of p(data'|theta_hat(data'))
    
    return -log_prob_data  # Simplified (ignoring normalizer)

class BernoulliNML:
    """
    NML for Bernoulli model (tractable case).
    """
    
    @staticmethod
    def stochastic_complexity(n):
        """
        Compute stochastic complexity for n Bernoulli trials.
        
        COMP = sum_{k=0}^{n} C(n,k) * (k/n)^k * ((n-k)/n)^(n-k)
        """
        from scipy.special import comb
        
        comp = 0
        for k in range(n + 1):
            if k == 0:
                term = 1  # 0^0 = 1
            elif k == n:
                term = 1
            else:
                p_mle = k / n
                term = comb(n, k) * (p_mle ** k) * ((1 - p_mle) ** (n - k))
            comp += term
        
        return comp
    
    @staticmethod
    def nml_code_length(data):
        """
        Compute NML code length for binary data.
        
        Args:
            data: Binary array
        
        Returns:
            NML code length in bits
        """
        n = len(data)
        k = sum(data)  # Number of 1s
        
        # MLE
        p_mle = k / n if n > 0 else 0.5
        
        # Log-likelihood at MLE
        if k == 0:
            log_lik = n * np.log(1 - p_mle + 1e-10)
        elif k == n:
            log_lik = n * np.log(p_mle + 1e-10)
        else:
            log_lik = k * np.log(p_mle) + (n - k) * np.log(1 - p_mle)
        
        # Stochastic complexity
        comp = BernoulliNML.stochastic_complexity(n)
        
        # NML code length
        nml = -log_lik / np.log(2) + np.log2(comp)
        
        return nml

# Example: NML for Bernoulli
print("=== Normalized Maximum Likelihood ===\n")

# Compare NML for different data
test_cases = [
    np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]),  # 50% ones
    np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]),  # 80% ones
    np.array([1] * 10),  # All ones
]

print(f"{'Data':<30} {'k':<5} {'NML (bits)':<15}")
print("-" * 50)

for data in test_cases:
    nml = BernoulliNML.nml_code_length(data)
    k = sum(data)
    print(f"{str(data.tolist()):<30} {k:<5} {nml:<15.2f}")

print(f"\nStochastic complexity for n=10: {BernoulliNML.stochastic_complexity(10):.4f}")

Comparison: MDL vs Other Model Selection Methods

| Method | Principle | Penalty | Asymptotics | Best For |
| --- | --- | --- | --- | --- |
| MDL | Compression | $\frac{k}{2}\log n$ | Consistent | General |
| AIC | Information loss | $2k$ | Efficient | Prediction |
| BIC | Bayesian | $k\log n$ | Consistent | Model selection |
| Cross-validation | Empirical | Data-dependent | Variable | Finite samples |
| NML | Optimal coding | Stochastic complexity | Optimal | Theory |

Frequently Asked Questions

How do I choose between MDL and cross-validation?

Use MDL when:

  • You want a principled, theory-grounded approach
  • Computational resources are limited (no repeated training)
  • You need interpretable model complexity measures

Use cross-validation when:

  • Sample size is small
  • The relationship between complexity and generalization is unclear
  • Prediction accuracy is the primary goal
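
The two approaches can also be compared directly. The sketch below (assuming scikit-learn is installed) applies 5-fold cross-validation to the quadratic data from Section 1 and contrasts the selected degree with the MDL choice stored in `best_degree` there.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Reuses x, y, and best_degree from the Section 1 polynomial example
cv_mse = []
for degree in range(1, 10):
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(pipe, x.reshape(-1, 1), y, cv=5,
                             scoring='neg_mean_squared_error')
    cv_mse.append(-scores.mean())

cv_best = int(np.argmin(cv_mse)) + 1
print(f"Cross-validation selects degree {cv_best}")
print(f"MDL selected degree {best_degree}")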

Can MDL handle deep neural networks?

MDL can be applied to neural networks, but challenges include:

  1. Parameter counting: Modern networks have millions of parameters
  2. Precision: What precision to use for weights?
  3. Architecture: How to account for architecture choices?

Practical approaches use:

  • Effective number of parameters (via Fisher information)
  • Pruning-based complexity measures
  • Variational approximations
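
One way to make the first idea concrete is sketched below: estimate the diagonal of the empirical Fisher information (mean squared gradient of the loss) and count only the weights whose signal exceeds a threshold as "effective" parameters. Both the diagonal approximation and the threshold are simplifying assumptions for illustration, not a standard recipe.

def effective_param_count(model, data_loader, threshold=1e-6):
    """Heuristic effective-parameter count from the diagonal empirical Fisher information."""
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    n_batches = 0

    for x, y in data_loader:
        model.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        for f, p in zip(fisher, model.parameters()):
            if p.grad is not None:
                f += p.grad.detach() ** 2  # accumulate squared gradients
        n_batches += 1

    n_effective = sum(int((f / max(n_batches, 1) > threshold).sum()) for f in fisher)
    n_total = sum(p.numel() for p in model.parameters())
    return n_effective, n_total

# Example (reusing the model and synthetic validation loader from Section 4):
# n_eff, n_tot = effective_param_count(model, val_data)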

How does MDL relate to Occam’s razor?

MDL is a formalization of Occam’s razor:

  • Simple models have short descriptions
  • Complex models need more bits to describe
  • MDL prefers the simplest model that explains the data well

The key insight is that “simplicity” is measured in bits, providing an objective criterion.

What is the relationship between MDL and PAC learning?

Both provide learning guarantees, but differently:

  • PAC: Bounds generalization error in terms of sample complexity
  • MDL: Bounds excess risk in terms of code length

The MDL bound is: with high probability, $$\text{Risk}(\hat{h}) - \text{Risk}(h^*) \leq O\left(\sqrt{\frac{L(\hat{h})}{n}}\right)$$
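
As a worked reading of this bound (ignoring the hidden constants): a hypothesis whose description takes $L(\hat{h}) = 500$ bits, fit on $n = 10{,}000$ samples, gives a gap of order $\sqrt{500/10{,}000} \approx 0.22$, while the same hypothesis on $n = 1{,}000{,}000$ samples gives $\approx 0.02$. Longer descriptions demand proportionally more data for the same guarantee.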

Key Takeaways

  1. MDL formalizes Occam’s razor by measuring model complexity in bits

  2. Two-part MDL separates model and data descriptions; prequential MDL uses sequential prediction

  3. MDL justifies regularization by showing complexity penalties prevent overfitting

  4. BIC (up to a factor of 2) coincides with the asymptotic MDL criterion; on the log-likelihood scale both apply a $\frac{k}{2}\log n$ complexity penalty

  5. NML provides optimal coding but is often intractable

  6. Applications include model selection, anomaly detection, change point detection, and neural architecture search

Next Steps in Your Learning Journey

Now that you understand MDL:

  1. Read Grünwald’s “The Minimum Description Length Principle” for comprehensive theory
  2. Implement MDL model selection for your own problems
  3. Explore connections to Bayesian inference and information theory
  4. Study computational tractability of different MDL variants
  5. Apply MDL to neural network compression and architecture search

Last updated: January 2025. This article covers MDL principle for model selection as of current research.