SVD Explained: Singular Value Decomposition for Data Science
Master Singular Value Decomposition (SVD) - understand the math, implement from scratch, and apply to recommender systems, image compression, and NLP.
Reading Time: 16 minutes | Word Count: 4,000+ | Difficulty: Intermediate
“SVD is arguably the most important matrix decomposition.” — Gilbert Strang
Singular Value Decomposition (SVD) is one of the most powerful tools in linear algebra. It’s behind Netflix’s recommendation engine, Google’s latent semantic indexing, image compression, and countless ML applications.
In this guide, we’ll demystify SVD, build intuition for what it does, implement it from scratch, and apply it to real-world problems.
What Is SVD?
The Decomposition
SVD factorizes any matrix $\mathbf{A}$ (even rectangular) into three matrices:
$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
Where:
- $\mathbf{A}$ is the original $m \times n$ matrix
- $\mathbf{U}$ is an $m \times m$ orthogonal matrix (left singular vectors)
- $\mathbf{\Sigma}$ is an $m \times n$ diagonal matrix (singular values)
- $\mathbf{V}$ is an $n \times n$ orthogonal matrix whose columns are the right singular vectors (it appears transposed, as $\mathbf{V}^T$, in the product)
Visual Intuition
SVD tells us that any linear transformation can be broken down into:
- Rotation/reflection ($\mathbf{V}^T$)
- Scaling ($\mathbf{\Sigma}$)
- Another rotation/reflection ($\mathbf{U}$)
import numpy as np
import matplotlib.pyplot as plt
# Any matrix can be decomposed
A = np.array([[3, 2, 2],
[2, 3, -2]])
# SVD decomposition
U, S, Vt = np.linalg.svd(A)
print(f"Original A shape: {A.shape}") # (2, 3)
print(f"U shape: {U.shape}") # (2, 2)
print(f"S (singular values): {S}") # [5., 3.]
print(f"Vt shape: {Vt.shape}") # (3, 3)
# Reconstruct A
Sigma = np.zeros((2, 3))
np.fill_diagonal(Sigma, S)
A_reconstructed = U @ Sigma @ Vt
print(f"\nReconstructed A:\n{A_reconstructed}")
print(f"Match: {np.allclose(A, A_reconstructed)}")
SVD vs Eigendecomposition
| Feature | Eigendecomposition | SVD |
|---|---|---|
| Matrix shape | Square only ($n \times n$) | Any shape ($m \times n$) |
| Orthogonal vectors | Only if symmetric | Always orthogonal |
| Values | Eigenvalues (can be negative/complex) | Singular values (always non-negative real) |
| Use cases | PCA, spectral analysis | All of the above + more |
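To make the comparison concrete, here is a small check on a hand-picked non-symmetric matrix (the values are illustrative only): its eigenvectors are not orthogonal to each other, while the SVD factors always are.
import numpy as np
# A hand-picked non-symmetric 2x2 matrix (illustrative values)
B = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
# Eigendecomposition: eigenvalues can be negative (or complex),
# and the eigenvectors are generally NOT orthogonal
eigvals, eigvecs = np.linalg.eig(B)
print("Eigenvalues:", eigvals)  # both negative here
print("Eigenvectors orthogonal?", np.allclose(eigvecs.T @ eigvecs, np.eye(2)))  # False
# SVD: singular values are non-negative, U and V are always orthogonal
U, S, Vt = np.linalg.svd(B)
print("Singular values:", S)
print("U orthogonal?", np.allclose(U.T @ U, np.eye(2)))  # True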
The Mathematics Behind SVD
Key Properties
Singular values are non-negative and typically sorted: $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0$
U and V have orthonormal columns:
- $\mathbf{U}^T\mathbf{U} = \mathbf{I}$
- $\mathbf{V}^T\mathbf{V} = \mathbf{I}$
Rank equals number of non-zero singular values: $\text{rank}(\mathbf{A}) = r$
Connection to eigenvalues:
- Singular values of $\mathbf{A}$ = square roots of eigenvalues of $\mathbf{A}^T\mathbf{A}$ (or $\mathbf{A}\mathbf{A}^T$)
# Verify SVD properties
A = np.array([[3, 2, 2],
[2, 3, -2]])
U, S, Vt = np.linalg.svd(A)
# Property 1: U has orthonormal columns
print("U^T @ U (should be identity):")
print(np.round(U.T @ U, 10))
# Property 2: Connection to eigenvalues
AtA = A.T @ A # (3, 3)
eigenvalues_AtA = np.linalg.eigvalsh(AtA)
eigenvalues_AtA = np.sort(eigenvalues_AtA)[::-1] # Sort descending
print(f"\nSingular values: {S}")
print(f"sqrt(eigenvalues of A^T A): {np.sqrt(eigenvalues_AtA[:2])}")
Full vs Reduced SVD
Full SVD: Complete decomposition; $\mathbf{U}$ is $m \times m$, $\mathbf{V}^T$ is $n \times n$, and $\mathbf{\Sigma}$ is padded with zero rows or columns
U_full, S_full, Vt_full = np.linalg.svd(A, full_matrices=True)
Reduced (Economy) SVD: Keeps only the first $\min(m, n)$ columns of $\mathbf{U}$ and rows of $\mathbf{V}^T$ (more efficient)
U_reduced, S_reduced, Vt_reduced = np.linalg.svd(A, full_matrices=False)
print(f"U reduced shape: {U_reduced.shape}") # (2, 2) instead of (2, 2)
print(f"Vt reduced shape: {Vt_reduced.shape}") # (2, 3) instead of (3, 3)
Low-Rank Approximation: The Power of SVD
Truncated SVD
The key insight: we can approximate a matrix using only the top-k singular values:
$$\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$
This is the best rank-$k$ approximation in terms of the Frobenius norm (the Eckart-Young theorem).
def low_rank_approximation(A, k):
"""
Compute rank-k approximation of matrix A using SVD.
"""
U, S, Vt = np.linalg.svd(A, full_matrices=False)
# Keep only top-k components
U_k = U[:, :k]
S_k = S[:k]
Vt_k = Vt[:k, :]
# Reconstruct
A_k = U_k @ np.diag(S_k) @ Vt_k
return A_k, S
# Example: Approximating a matrix
np.random.seed(42)
A = np.random.randn(100, 50)
for k in [1, 5, 10, 25, 50]:
    A_k, S = low_rank_approximation(A, k)
    error = np.linalg.norm(A - A_k, 'fro') / np.linalg.norm(A, 'fro')
    energy = 1 - error**2  # energy is measured in squared Frobenius norm
    print(f"Rank-{k:2d} approximation: {100*energy:.1f}% of energy preserved")
Energy/Variance Explained
Singular values tell us how much “energy” each component captures:
$$\text{Energy from top-}k = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$
def explained_variance_ratio(S):
"""Calculate variance explained by each singular value."""
variance = S ** 2
return variance / variance.sum()
# Example
A = np.random.randn(100, 20)
_, S, _ = np.linalg.svd(A, full_matrices=False)
var_ratio = explained_variance_ratio(S)
cumulative = np.cumsum(var_ratio)
print("Variance explained by each component:")
for i, (single, cum) in enumerate(zip(var_ratio[:5], cumulative[:5])):
print(f" Component {i+1}: {100*single:.1f}% (cumulative: {100*cum:.1f}%)")
Application 1: Image Compression
SVD naturally compresses images by keeping only significant singular values:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
# Load a grayscale image (or convert)
# Using a simple example
np.random.seed(42)
# Create a simple image with structure
x = np.linspace(0, 4*np.pi, 200)
y = np.linspace(0, 4*np.pi, 200)
X, Y = np.meshgrid(x, y)
image = np.sin(X) * np.cos(Y) + 0.1 * np.random.randn(200, 200)
# SVD
U, S, Vt = np.linalg.svd(image, full_matrices=False)
print(f"Original image: {image.shape}, requires {image.size} values")
# Compress with different ranks
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
ranks = [1, 5, 10, 25, 50, 100]
for ax, k in zip(axes.flatten(), ranks):
# Rank-k approximation
compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
# Compression ratio
original_size = image.size
compressed_size = k * (U.shape[0] + Vt.shape[1] + 1) # U_k, V_k, S_k
ratio = compressed_size / original_size
ax.imshow(compressed, cmap='gray')
    ax.set_title(f'Rank {k}\nStorage: {100*ratio:.1f}% of original')
ax.axis('off')
plt.tight_layout()
plt.savefig('svd_image_compression.png', dpi=100, bbox_inches='tight')
print("Image saved: svd_image_compression.png")
Application 2: Recommender Systems
Matrix factorization techniques inspired by SVD rose to fame during the Netflix Prize recommendation challenge.
The Problem
User-item rating matrix with many missing values:
| | Item1 | Item2 | Item3 | Item4 |
|---|---|---|---|---|
| User1 | 5 | 3 | ? | 1 |
| User2 | 4 | ? | ? | 1 |
| User3 | 1 | 1 | ? | 5 |
| User4 | ? | 1 | 5 | 4 |
SVD for Collaborative Filtering
import numpy as np
def svd_recommender(ratings, k=2):
"""
Simple SVD-based recommender system.
Args:
ratings: User-item matrix (NaN for unknown ratings)
k: Number of latent factors
Returns:
Predicted ratings matrix
"""
# Replace NaN with mean for initial SVD
mask = ~np.isnan(ratings)
ratings_filled = ratings.copy()
ratings_filled[~mask] = np.nanmean(ratings)
# SVD
U, S, Vt = np.linalg.svd(ratings_filled, full_matrices=False)
# Truncate to k components
U_k = U[:, :k]
S_k = np.diag(S[:k])
Vt_k = Vt[:k, :]
# Predict all ratings
predictions = U_k @ S_k @ Vt_k
return predictions, U_k, Vt_k
# Example: Movie ratings
np.random.seed(42)
# Create user-item matrix (5 users, 8 movies)
ratings = np.array([
[5, 3, np.nan, 1, 4, np.nan, 2, np.nan],
[4, np.nan, np.nan, 1, 5, np.nan, np.nan, 2],
[1, 1, np.nan, 5, np.nan, 4, 5, 5],
[np.nan, 1, 5, 4, np.nan, 4, 4, np.nan],
[2, np.nan, 4, np.nan, 1, 5, 5, 4],
])
predictions, U_k, Vt_k = svd_recommender(ratings, k=3)
print("Original ratings (NaN = unknown):")
print(np.round(ratings, 1))
print("\nPredicted ratings:")
print(np.round(predictions, 1))
# Find recommendations for user 0
user_0_predictions = predictions[0]
user_0_known = ~np.isnan(ratings[0])
user_0_unknown = np.isnan(ratings[0])
print(f"\nRecommendations for User 0:")
unknown_indices = np.where(user_0_unknown)[0]
unknown_predictions = [(i, predictions[0, i]) for i in unknown_indices]
unknown_predictions.sort(key=lambda x: x[1], reverse=True)
for movie_idx, pred_rating in unknown_predictions:
print(f" Movie {movie_idx}: predicted rating {pred_rating:.2f}")
Latent Factors Interpretation
The decomposition $\mathbf{R} \approx \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ gives us:
- U: User latent factors (user preferences in k-dimensional space)
- V: Item latent factors (item characteristics in k-dimensional space)
print("\nUser latent factors (preferences):")
print(U_k)
print("\nItem latent factors (characteristics):")
print(Vt_k.T)
# Users with similar latent factors have similar tastes
from scipy.spatial.distance import cosine
print("\nUser similarity (1 - cosine distance):")
for i in range(len(U_k)):
for j in range(i+1, len(U_k)):
sim = 1 - cosine(U_k[i], U_k[j])
print(f" User {i} - User {j}: {sim:.3f}")
Application 3: Latent Semantic Analysis (LSA)
SVD powers semantic search and document similarity:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks with many layers",
"AI and ML are transforming technology industries",
"Python is popular for data science and machine learning",
"Neural networks can learn complex patterns from data",
"TensorFlow and PyTorch are deep learning frameworks",
"Data science involves statistics and programming",
"Artificial intelligence mimics human intelligence",
]
# Create TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}") # (8 docs, vocab_size)
# Apply SVD (TruncatedSVD for sparse matrices)
n_components = 3
svd = TruncatedSVD(n_components=n_components)
doc_vectors = svd.fit_transform(tfidf_matrix)
print(f"\nDocument vectors (LSA space): {doc_vectors.shape}")
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.1%}")
# Find similar documents
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(doc_vectors)
print("\nDocument similarity matrix:")
print(np.round(similarity_matrix, 2))
# Query: find documents similar to a new query
query = "neural networks for artificial intelligence"
query_tfidf = vectorizer.transform([query])
query_lsa = svd.transform(query_tfidf)
similarities = cosine_similarity(query_lsa, doc_vectors)[0]
top_indices = np.argsort(similarities)[::-1][:3]
print(f"\nQuery: '{query}'")
print("Most similar documents:")
for idx in top_indices:
print(f" {similarities[idx]:.3f}: {documents[idx][:50]}...")
SVD for Noise Reduction
Low-rank approximation naturally removes noise:
np.random.seed(42)
# Create signal: low-rank structure
m, n, true_rank = 50, 30, 3
U_true = np.random.randn(m, true_rank)
V_true = np.random.randn(true_rank, n)
signal = U_true @ V_true
# Add noise
noise_level = 0.5
noisy_signal = signal + noise_level * np.random.randn(m, n)
# Denoise using SVD
U, S, Vt = np.linalg.svd(noisy_signal, full_matrices=False)
# Find elbow in singular values (signal vs noise)
print("Singular values:")
print(np.round(S[:10], 2))
# Keep only top-k (true rank)
k = true_rank
denoised = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
# Evaluate
print(f"\nNoise level: {noise_level}")
print(f"MSE before denoising: {np.mean((noisy_signal - signal)**2):.4f}")
print(f"MSE after denoising: {np.mean((denoised - signal)**2):.4f}")
print(f"Improvement: {100*(1 - np.mean((denoised - signal)**2) / np.mean((noisy_signal - signal)**2)):.1f}%")
Efficient SVD Computation
For Large Sparse Matrices
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
# Large sparse matrix
m, n = 10000, 5000
density = 0.01 # 1% non-zero
A_sparse = sparse_random(m, n, density=density, format='csr')
print(f"Matrix shape: {A_sparse.shape}")
print(f"Non-zero elements: {A_sparse.nnz} ({100*A_sparse.nnz/(m*n):.2f}%)")
# Truncated SVD for top-k components (much faster than full SVD)
k = 10
U, S, Vt = svds(A_sparse, k=k)
print(f"\nTop {k} singular values: {S[::-1]}") # svds returns ascending order
Randomized SVD (Even Faster)
from sklearn.utils.extmath import randomized_svd
# For dense but large matrices
np.random.seed(42)
A_large = np.random.randn(5000, 2000)
import time
# Standard SVD (slow for large matrices)
start = time.time()
U1, S1, Vt1 = np.linalg.svd(A_large, full_matrices=False)
print(f"Full SVD time: {time.time() - start:.2f}s")
# Randomized SVD (much faster)
k = 50
start = time.time()
U2, S2, Vt2 = randomized_svd(A_large, n_components=k, random_state=42)
print(f"Randomized SVD (k={k}) time: {time.time() - start:.2f}s")
# Quality check
print(f"\nTop 10 singular values match: {np.allclose(S1[:10], S2[:10], rtol=0.01)}")
Summary and Key Takeaways
What You Learned
| Concept | Formula/Definition | Application |
|---|---|---|
| SVD | $\mathbf{A} = \mathbf{U\Sigma V}^T$ | Factorize any matrix |
| Singular values | Diagonal of $\mathbf{\Sigma}$ | Importance of each component |
| Low-rank approximation | Keep top-k singular values | Compression, denoising |
| U columns | Left singular vectors | Row space representation |
| V columns | Right singular vectors | Column space representation |
When to Use SVD
| Task | Why SVD? |
|---|---|
| Image compression | Optimal low-rank approximation |
| Recommender systems | Matrix completion |
| Text analysis (LSA) | Semantic relationships |
| Noise reduction | Separate signal from noise |
| Pseudoinverse | $\mathbf{A}^+ = \mathbf{V\Sigma}^+\mathbf{U}^T$ |
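The pseudoinverse row is worth a quick demonstration: invert the non-zero singular values, transpose the shape, and the result agrees with np.linalg.pinv. A minimal sketch on the same 2 × 3 matrix used earlier:
import numpy as np
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])
# A+ = V Sigma+ U^T: invert the non-zero singular values, transpose the shape
U, S, Vt = np.linalg.svd(A, full_matrices=False)
S_inv = np.where(S > 1e-12, 1.0 / S, 0.0)     # guard against zero singular values
A_pinv = Vt.T @ np.diag(S_inv) @ U.T          # shape (3, 2)
print(np.allclose(A_pinv, np.linalg.pinv(A))) # True
print(np.allclose(A @ A_pinv @ A, A))         # defining property of the pseudoinverse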
NumPy/SciPy Functions
# Dense matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)
# Sparse matrix (top-k only)
from scipy.sparse.linalg import svds
U, S, Vt = svds(A_sparse, k=k)
# Large matrices (randomized)
from sklearn.utils.extmath import randomized_svd
U, S, Vt = randomized_svd(A, n_components=k)
Practice Exercises
Exercise 1: Image Compression
Load a real image, apply SVD compression at different ranks, and plot PSNR vs compression ratio.
Exercise 2: Build a Movie Recommender
Use the MovieLens dataset to build an SVD-based recommender. Evaluate with RMSE on held-out ratings.
Exercise 3: Document Clustering
Apply LSA to a text corpus and cluster documents in the reduced space.
Frequently Asked Questions
Q: SVD vs PCA - what’s the difference?
A: PCA is equivalent to SVD of the centered data matrix. For centered $\mathbf{X}$, the principal components (eigenvectors of the covariance matrix) are the right singular vectors of $\mathbf{X}$, and the covariance eigenvalues equal $\sigma_i^2 / (n-1)$.
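A quick numerical check of that statement, assuming scikit-learn's PCA (sign flips between the two factorizations are expected, so the comparison uses absolute values):
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(0)
X = np.random.randn(200, 5)          # stand-in data
X_centered = X - X.mean(axis=0)
pca = PCA(n_components=5).fit(X)     # PCA centers the data internally
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Principal components are the right singular vectors (up to sign)
print(np.allclose(np.abs(pca.components_), np.abs(Vt)))
# Covariance eigenvalues are sigma_i^2 / (n - 1)
print(np.allclose(pca.explained_variance_, S**2 / (len(X) - 1)))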
Q: How do I choose the number of components k?
A: Common methods:
- Explained variance threshold (e.g., 95%), as in the snippet after this list
- Elbow in singular value plot
- Cross-validation for downstream task
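For example, the explained-variance rule fits in a few lines; the data matrix here is a random stand-in:
import numpy as np
np.random.seed(0)
X = np.random.randn(200, 30)                 # stand-in for your data
_, S, _ = np.linalg.svd(X, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)      # cumulative energy fraction
k = int(np.searchsorted(energy, 0.95)) + 1   # smallest k reaching 95%
print(f"Keep k = {k} of {len(S)} components for 95% energy")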
Q: Can SVD handle missing values?
A: Standard SVD cannot. Use iterative methods like Alternating Least Squares (ALS) or matrix completion algorithms.
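A minimal sketch of one such iterative idea, a simplified impute-and-refactor loop (reconstruct with a rank-k SVD, overwrite only the missing entries, repeat); this is for illustration, not a production recommender. The matrix is the 4 × 4 example from the recommender section above, and svd_impute is a name chosen here.
import numpy as np
def svd_impute(ratings, k=2, n_iters=20):
    """Iteratively fill missing entries with a rank-k SVD reconstruction."""
    mask = ~np.isnan(ratings)
    filled = np.where(mask, ratings, np.nanmean(ratings))
    for _ in range(n_iters):
        U, S, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
        filled = np.where(mask, ratings, approx)  # keep observed ratings fixed
    return filled
R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [np.nan, 1, 5, 4]], dtype=float)
print(np.round(svd_impute(R, k=2), 1))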
Next Steps
- Read Next: PCA Deep Dive
- Practice: Build a movie recommender with real data
- Explore: Try truncated SVD on NLP tasks
Continue with our Mathematics for ML series. Next: PCA Deep Dive: Principal Component Analysis.