Chapter 19

SVD (Singular Value Decomposition)

Every matrix -- every transformation -- can be broken into three simple steps: rotate, scale along the axes, rotate again. That's the SVD, and it's the most powerful decomposition in linear algebra.

We've built up a rich vocabulary for understanding transformations. We know how matrices stretch, rotate, shear, and collapse space. We've seen how eigenvalues reveal the scaling along special directions, how orthogonal matrices preserve shape, and how rank tells us what survives. But eigenvectors have a limitation: not every matrix has enough of them, and when the matrix isn't square, eigenvalues don't even apply.

The Singular Value Decomposition (SVD) has no such limitation. It works for every matrix -- square or rectangular, full rank or rank-deficient, symmetric or not. And it answers the most fundamental geometric question you can ask about a transformation: what does it do to the unit circle?

The unit circle becomes an ellipse

Take any 2x2 matrix and apply it to every point on the unit circle. The result is always an ellipse (which could be degenerate -- collapsed to a line segment or a point, but let's start with the full case).

Consider the matrix:

A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}

Every point (x, y) on the unit circle (x^2 + y^2 = 1) gets mapped to a new point A\begin{bmatrix}x\\y\end{bmatrix}. The collection of all those output points forms an ellipse. The SVD tells you everything about that ellipse: how long its axes are, and which directions they point.

[Figure: the unit circle and its image under A; the point (1, 0) lands at (3, 1) and (0, 1) lands at (1, 2). Every matrix maps the unit circle to an ellipse.]

The dashed circle is the unit circle -- all vectors of length 1. The blue ellipse is where those vectors land after applying A. Dashed lines connect a few sample points to their images. The ellipse is tilted and stretched: some directions got amplified more than others. The SVD explains exactly which directions and by exactly how much.

The shape of this ellipse encodes everything about the transformation. The length of the longest axis tells you the maximum stretching. The length of the shortest axis tells you the minimum. The directions of the axes tell you where the stretching happens. These are precisely the singular values and singular vectors.
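You can check this claim numerically. Here's a small sketch (using NumPy purely for illustration -- the chapter's own code is JavaScript-style pseudocode): sample points on the unit circle, push them through A, and compare the longest and shortest image lengths to the singular values.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Sample points on the unit circle and push them through A.
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # 2 x 1000
ellipse = A @ circle

# The lengths of the image vectors range between the two singular values.
lengths = np.linalg.norm(ellipse, axis=0)
sigma = np.linalg.svd(A, compute_uv=False)

print(round(lengths.max(), 3), round(sigma[0], 3))   # both ~3.618
print(round(lengths.min(), 3), round(sigma[1], 3))   # both ~1.382
```

The maximum and minimum stretch of unit vectors are exactly the semi-axes of the ellipse -- the singular values we meet below.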

The three-step decomposition

The SVD says that any matrix A can be factored as:

A = U \Sigma V^T

where V^T is a rotation (or reflection), \Sigma is a diagonal scaling, and U is another rotation (or reflection). Each step is simple on its own. Together, they produce the full transformation.

For our matrix A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}, the decomposition works out to singular values \sigma_1 \approx 3.618 and \sigma_2 \approx 1.382. Watch what happens to the unit circle at each stage:

[Figure: three panels. Step 1 (V^T): rotate -- circle to circle. Step 2 (\Sigma): scale the axes -- circle to axis-aligned ellipse with semi-axes \sigma_1 and \sigma_2. Step 3 (U): rotate again -- axis-aligned ellipse to tilted ellipse.]

Three steps, each simple. Step 1 (V^T): rotate the input space -- the circle stays a circle, but coordinates are aligned to special directions. Step 2 (\Sigma): scale along the axes -- the circle stretches into an axis-aligned ellipse, with lengths \sigma_1 and \sigma_2. Step 3 (U): rotate the ellipse into its final orientation. The result is the same ellipse as applying A directly.

This is the deep insight: no matter how complicated a transformation looks, it's secretly just rotate-scale-rotate. The rotations are rigid (they preserve shape), so all the "interesting" geometry -- the stretching and compressing -- happens in the middle step. The singular values \sigma_1 and \sigma_2 are the stretching factors, and they capture the transformation's essential character.
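The factorization is easy to verify numerically. A short NumPy sketch (illustrative; `np.linalg.svd` returns U, the singular values, and V^T directly):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# numpy returns U, the singular values, and V^T.
U, s, Vt = np.linalg.svd(A)
print(np.round(s, 3))                    # [3.618 1.382]

# Rebuild A as rotate (V^T), scale (Sigma), rotate (U).
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, A))     # True

# The two rotation factors are orthogonal: they preserve lengths.
print(np.allclose(U.T @ U, np.eye(2)))   # True
```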

Singular values as semi-axes

The singular values \sigma_1 and \sigma_2 are the semi-axis lengths of the output ellipse. The directions of those axes come from the columns of U (the "output" singular vectors). The directions on the input circle that map to those axes come from the columns of V (the "input" singular vectors).

For our matrix, \sigma_1 \approx 3.618 is the semi-major axis (the most stretching), and \sigma_2 \approx 1.382 is the semi-minor axis (the least stretching). The column \vec{u}_1 of U points along the major axis. The column \vec{u}_2 points along the minor axis. On the input side, \vec{v}_1 is the direction that gets stretched the most, and \vec{v}_2 gets stretched the least.

[Figure: left, the input unit circle with the columns of V marked \vec{v}_1 and \vec{v}_2; right, the output ellipse with the columns of U scaled to \sigma_1\vec{u}_1 and \sigma_2\vec{u}_2. Input directions V map to output directions U, scaled by the singular values: A\vec{v}_1 = \sigma_1\vec{u}_1 and A\vec{v}_2 = \sigma_2\vec{u}_2.]

Left: the unit circle with the input singular vectors \vec{v}_1 (green) and \vec{v}_2 (purple) -- the directions on the input that get mapped to the ellipse axes. Right: the output ellipse with the output singular vectors \vec{u}_1 (blue) and \vec{u}_2 (purple), scaled by \sigma_1 and \sigma_2. The transformation maps \vec{v}_1 to \sigma_1\vec{u}_1 and \vec{v}_2 to \sigma_2\vec{u}_2. The singular values are the semi-axis lengths of the ellipse.

This is the cleanest summary of what any matrix does: it picks out special input directions (V columns), maps them to special output directions (U columns), and scales each by its singular value. Everything else follows from linearity.
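That summary is one equation per direction, and it's directly testable. A quick NumPy check (illustrative) that A\vec{v}_i = \sigma_i\vec{u}_i for each pair:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)
V = Vt.T   # columns of V are the input singular vectors

# The defining relation A v_i = sigma_i u_i, one pair at a time.
checks = [np.allclose(A @ V[:, i], s[i] * U[:, i]) for i in range(2)]
print(checks)   # [True, True]
```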

Low-rank approximation

Here's where SVD becomes a practical superpower. If some singular values are small, you can set them to zero and get a matrix that's close to the original but has lower rank. That lower-rank matrix is the best rank-k approximation of A -- no other rank-k matrix is closer (in the Frobenius norm or the operator norm).

Imagine a matrix with singular values \sigma_1 = 10, \sigma_2 = 3, \sigma_3 = 0.1. The third singular value is tiny -- the transformation barely does anything in that direction. If you drop it, setting \sigma_3 = 0, you get a rank-2 approximation that captures \frac{10^2 + 3^2}{10^2 + 3^2 + 0.1^2} = \frac{109}{109.01} \approx 99.99\% of the matrix's "energy."
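The arithmetic in that fraction is worth doing once by hand (or machine). "Energy" here means the sum of squared singular values; keeping the two largest:

```python
import numpy as np

sigma = np.array([10.0, 3.0, 0.1])

# Energy = sum of squared singular values; keep the two largest.
energy_kept = np.sum(sigma[:2] ** 2) / np.sum(sigma ** 2)
print(round(energy_kept, 6))   # 0.999908, i.e. ~99.99%
```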

[Figure: dropping small singular values. Left, the full rank-3 matrix with \sigma_1 = 10, \sigma_2 = 3, \sigma_3 = 0.1; right, the rank-2 approximation after setting \sigma_3 = 0. Why this works for compression: full matrix storage is m x n values, while rank-k SVD storage is k(m + n + 1) values. For a 1000x1000 image, rank 50 uses ~10% of the storage with ~95% quality.]

Left: all three singular values of a matrix. The third (\sigma_3 = 0.1) is tiny compared to the others. Right: set \sigma_3 = 0, reducing the rank from 3 to 2. The approximation captures 99.99% of the matrix's energy. For data compression, this means massive storage savings with minimal quality loss.

This is why Netflix uses SVD for recommendations (users and movies form a matrix; most of the structure lives in a few singular values). It's why image compression works (natural images have rapidly decaying singular values). It's why principal component analysis (PCA) is SVD in disguise -- the principal components are the leading singular vectors.
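The PCA connection is easy to see in code. A minimal sketch (assuming NumPy; the data matrix here is synthetic): center the data, take its SVD, and the leading right singular vector matches the top eigenvector of the covariance matrix, up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: isotropic noise pushed through a stretching matrix.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

# PCA via SVD: center the data, then read off right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Compare against the eigenvectors of the covariance matrix.
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order

pc1_svd = Vt[0]           # first principal component from the SVD
pc1_eig = eigvecs[:, -1]  # top covariance eigenvector
print(np.allclose(abs(pc1_svd @ pc1_eig), 1.0))   # True: same direction
```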

The formal bit

The Singular Value Decomposition of any m \times n matrix A is:

A = U \Sigma V^T

where:

- U is an m \times m orthogonal matrix whose columns \vec{u}_i are the left singular vectors
- \Sigma is an m \times n diagonal matrix with non-negative entries \sigma_1 \geq \sigma_2 \geq \cdots \geq 0, the singular values
- V is an n \times n orthogonal matrix whose columns \vec{v}_i are the right singular vectors

The key relationships:

A\vec{v}_i = \sigma_i \vec{u}_i

This says: the matrix A maps the right singular vector \vec{v}_i to the left singular vector \vec{u}_i, scaled by \sigma_i. That's the entire content of the decomposition in one equation.

Connection to eigenvalues. The singular values are the square roots of the eigenvalues of A^TA (or equivalently, of AA^T):

A^TA = V\Sigma^T\Sigma V^T

The matrix A^TA is symmetric and positive semi-definite, so it always has non-negative eigenvalues and an orthogonal eigenvector basis. The eigenvalues of A^TA are \sigma_1^2, \sigma_2^2, \ldots, and the eigenvectors are the columns of V. Similarly, the eigenvalues of AA^T are the same \sigma_i^2, with eigenvectors being the columns of U.

Rank and singular values. The rank of A equals the number of nonzero singular values. A rank-k matrix has exactly k nonzero singular values. This gives us the best rank-k approximation: truncate the SVD to the k largest singular values:

A_k = \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^T

The Eckart-Young theorem guarantees that A_k is the closest rank-k matrix to A in both the Frobenius norm and the operator norm. No other rank-k matrix does better.
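A standard consequence of the theorem is that the Frobenius error of the truncation is exactly the norm of the dropped singular values: \|A - A_k\|_F = \sqrt{\sum_{i>k}\sigma_i^2}. A quick NumPy check on a random matrix (illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Best rank-k approximation: keep the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Frobenius error equals the norm of the dropped singular values.
err = np.linalg.norm(A - A_k)   # Frobenius norm is numpy's default
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```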

Let's verify with our example. A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}.

First, compute A^TA:

A^TA = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}^T\begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix}

The eigenvalues of A^TA are found from:

\lambda^2 - 15\lambda + 25 = 0 \quad \Rightarrow \quad \lambda = \frac{15 \pm \sqrt{125}}{2}

\lambda_1 \approx 13.09, \quad \lambda_2 \approx 1.91

So \sigma_1 = \sqrt{13.09} \approx 3.618 and \sigma_2 = \sqrt{1.91} \approx 1.382. The maximum stretch is about 3.6x, and the minimum is about 1.4x. The transformation stretches space in every direction -- nothing gets compressed below the original size.
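The same computation, done by machine (a NumPy sketch for illustration): the square roots of the eigenvalues of A^TA are the singular values of A.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Eigenvalues of A^T A (eigvalsh returns them in ascending order)...
lams = np.linalg.eigvalsh(A.T @ A)
print(np.round(np.sqrt(lams[::-1]), 3))   # [3.618 1.382]

# ...match the singular values of A.
print(np.round(np.linalg.svd(A, compute_uv=False), 3))   # [3.618 1.382]
```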

Worked example: image compression with SVD

A grayscale image is just a matrix -- each entry is a pixel intensity from 0 (black) to 255 (white). An 800x600 image is an 800x600 matrix with 480,000 values. The SVD can compress this dramatically.

Here's the idea. Compute the SVD of the image matrix: A = U\Sigma V^T. The matrix has at most 600 singular values (the minimum dimension). But natural images have rapidly decaying singular values -- the first few capture the broad shapes, the next few add detail, and most of the rest are near zero.

Instead of storing all 480,000 pixel values, store only the k largest singular values and their corresponding singular vectors:

A_k = \sigma_1 \vec{u}_1 \vec{v}_1^T + \sigma_2 \vec{u}_2 \vec{v}_2^T + \cdots + \sigma_k \vec{u}_k \vec{v}_k^T

Each term \sigma_i \vec{u}_i \vec{v}_i^T is a rank-1 matrix: an 800-dimensional column times a 600-dimensional row, which costs 800 + 600 + 1 = 1401 values to store. So a rank-k approximation costs 1401k values.

Rank k      Storage    Compression          Quality
1           1,401      0.3% of original     Blobby -- a single weighted average of the image's dominant gradient
5           7,005      1.5% of original     Recognizable -- major shapes and contrast visible
20          28,020     5.8% of original     Sharp -- most details recovered, minor artifacts
50          70,050     14.6% of original    Nearly indistinguishable from original
600 (full)  480,000    100%                 Exact reproduction
function svdCompress(imageMatrix, k) {
  // imageMatrix is m x n. computeSVD and zeros are assumed helpers
  // from a numerical library -- plain JavaScript has no built-in SVD.
  const { U, sigma, V } = computeSVD(imageMatrix);
  const m = imageMatrix.length;
  const n = imageMatrix[0].length;

  // Accumulate the k largest rank-1 layers into an m x n matrix.
  const compressed = zeros(m, n);

  for (let i = 0; i < k; i++) {
    // Add the rank-1 contribution: sigma[i] * U[:,i] * V[:,i]^T
    for (let row = 0; row < m; row++) {
      for (let col = 0; col < n; col++) {
        compressed[row][col] += sigma[i] * U[row][i] * V[col][i];
      }
    }
  }

  return compressed;
}

// Storage comparison
const m = 800, n = 600;
const fullStorage = m * n;                    // 480,000
const rank50Storage = 50 * (m + n + 1);       // 70,050
const compressionRatio = rank50Storage / fullStorage; // ~14.6%

What's happening geometrically? Each rank-1 term \sigma_i \vec{u}_i \vec{v}_i^T is a "layer" of the image. The first layer (i = 1, largest \sigma) captures the coarsest structure -- the overall brightness gradient. The next layers add progressively finer detail. Dropping the layers with small \sigma_i means dropping the barely-visible fine details.

This same principle powers dimensionality reduction in machine learning (PCA), latent semantic analysis in search engines (treating documents as vectors), and noise reduction (noise contributes to small singular values, so truncating the SVD filters it out).
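The noise-reduction claim can be demonstrated on synthetic data. A minimal sketch (assuming NumPy; the "signal" is an artificial rank-2 matrix, not a real image): add noise, truncate the SVD to rank 2, and the truncation lands closer to the clean signal than the noisy matrix does.

```python
import numpy as np

rng = np.random.default_rng(2)

# A genuinely rank-2 "signal" matrix, plus small dense noise.
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30))
noisy = signal + 0.05 * rng.normal(size=(50, 30))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Noise spreads thinly across all singular values;
# the signal concentrates in the top two.
denoised = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]

err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(err_denoised < err_noisy)   # True: truncation filtered noise
```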

Key Takeaway: SVD reveals the skeleton of any transformation: its fundamental directions and magnitudes. Every matrix is rotate-scale-rotate. It's the Swiss Army knife of linear algebra -- it works for any matrix (square or not), finds the best low-rank approximation, and decomposes the transformation into its simplest possible pieces.

Where to go from here

We started this book with a single arrow in space -- a vector. We learned to combine vectors, transform them with matrices, and measure the effect of those transformations with determinants, rank, and null spaces. We discovered that dot products reveal projections and angles, cross products give us perpendicular directions and areas, and eigenvalues expose the directions where a transformation acts most simply. We built orthonormal bases with Gram-Schmidt, and finally arrived here: the SVD, which unifies all of it into a single factorization that works for every matrix.

That's the visual core of linear algebra. You can now look at any matrix and see what it does: the stretching, the rotation, the collapsing, the rank. Every concept connects to a geometric picture.

If you want to keep going, the trail branches in several directions. Abstract vector spaces generalize these ideas beyond arrows in \mathbb{R}^n -- function spaces, polynomial spaces, and infinite-dimensional spaces all obey the same rules. Tensors extend matrices to higher dimensions, essential for physics and deep learning. Differential forms bring linear algebra into calculus, connecting determinants to integration and cross products to curl. And numerical linear algebra -- the study of how to compute all of this efficiently and stably on a real computer -- is a field unto itself, where the SVD is the undisputed champion algorithm.

Every one of those paths starts from what you've learned here. The geometry doesn't change -- it just gets richer.