Chapter 19

SVD (Singular Value Decomposition)

Every matrix -- every transformation -- can be broken into three simple steps: rotate, scale along the axes, rotate again. That's the SVD, and it's the most powerful decomposition in linear algebra.

We've built up a rich vocabulary for understanding transformations. We know how matrices stretch, rotate, shear, and collapse space. We've seen how eigenvalues reveal the scaling along special directions, how orthogonal matrices preserve shape, and how rank tells us what survives. But eigenvectors have a limitation: not every matrix has enough of them, and when the matrix isn't square, eigenvalues don't even apply.

The Singular Value Decomposition (SVD) has no such limitation. It works for every matrix -- square or rectangular, full rank or rank-deficient, symmetric or not. And it answers the most fundamental geometric question you can ask about a transformation: what does it do to the unit circle?

The unit circle becomes an ellipse

Take any 2x2 matrix and apply it to every point on the unit circle. The result is always an ellipse (which could be degenerate -- collapsed to a line segment or a point, but let's start with the full case).

Consider the matrix:

A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}

Every point (x, y) on the unit circle (x^2 + y^2 = 1) gets mapped to a new point A\begin{bmatrix}x\\y\end{bmatrix}. The collection of all those output points forms an ellipse. The SVD tells you everything about that ellipse: how long its axes are, and which directions they point.

[Figure: the unit circle and its image under A; the point (1, 0) lands at (3, 1) and (0, 1) lands at (1, 2). Every matrix maps the unit circle to an ellipse.]

The dashed circle is the unit circle -- all vectors of length 1. The blue ellipse is where those vectors land after applying A. Dashed lines connect a few sample points to their images. The ellipse is tilted and stretched: some directions got amplified more than others. The SVD explains exactly which directions and by exactly how much.

The shape of this ellipse encodes everything about the transformation. The length of the longest axis tells you the maximum stretching. The length of the shortest axis tells you the minimum. The directions of the axes tell you where the stretching happens. These are precisely the singular values and singular vectors.
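You can check this claim numerically. Here's a small sketch (using NumPy purely for illustration -- the chapter's own code is JavaScript-style pseudocode): sample points on the unit circle, push them through A, and compare the longest and shortest image lengths to the singular values.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Sample points on the unit circle and push them through A.
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # 2 x 1000
ellipse = A @ circle

# The lengths of the image vectors range between the two singular values.
lengths = np.linalg.norm(ellipse, axis=0)
sigma = np.linalg.svd(A, compute_uv=False)

print(round(lengths.max(), 3), round(sigma[0], 3))   # both ~3.618
print(round(lengths.min(), 3), round(sigma[1], 3))   # both ~1.382
```

The maximum and minimum stretch of unit vectors are exactly the semi-axes of the ellipse -- the singular values we meet below.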

The three-step decomposition

The SVD says that any matrix A can be factored as:

A = U \Sigma V^T

where V^T is a rotation (or reflection), \Sigma is a diagonal scaling, and U is another rotation (or reflection). Each step is simple on its own. Together, they produce the full transformation.

For our matrix A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}, the decomposition works out to singular values \sigma_1 \approx 3.618 and \sigma_2 \approx 1.382. Watch what happens to the unit circle at each stage:

[Figure: three panels. Step 1 (V^T): rotate -- circle to circle. Step 2 (\Sigma): scale the axes -- circle to axis-aligned ellipse with semi-axes \sigma_1 and \sigma_2. Step 3 (U): rotate again -- axis-aligned ellipse to tilted ellipse.]

Three steps, each simple. Step 1 (V^T): rotate the input space -- the circle stays a circle, but coordinates are aligned to special directions. Step 2 (\Sigma): scale along the axes -- the circle stretches into an axis-aligned ellipse, with lengths \sigma_1 and \sigma_2. Step 3 (U): rotate the ellipse into its final orientation. The result is the same ellipse as applying A directly.

This is the deep insight: no matter how complicated a transformation looks, it's secretly just rotate-scale-rotate. The rotations are rigid (they preserve shape), so all the "interesting" geometry -- the stretching and compressing -- happens in the middle step. The singular values \sigma_1 and \sigma_2 are the stretching factors, and they capture the transformation's essential character.
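The factorization is easy to verify numerically. A short NumPy sketch (illustrative; `np.linalg.svd` returns U, the singular values, and V^T directly):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# numpy returns U, the singular values, and V^T.
U, s, Vt = np.linalg.svd(A)
print(np.round(s, 3))                    # [3.618 1.382]

# Rebuild A as rotate (V^T), scale (Sigma), rotate (U).
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, A))     # True

# The two rotation factors are orthogonal: they preserve lengths.
print(np.allclose(U.T @ U, np.eye(2)))   # True
```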

Singular values as semi-axes

The singular values \sigma_1 and \sigma_2 are the semi-axis lengths of the output ellipse. The directions of those axes come from the columns of U (the "output" singular vectors). The directions on the input circle that map to those axes come from the columns of V (the "input" singular vectors).

For our matrix, \sigma_1 \approx 3.618 is the semi-major axis (the most stretching), and \sigma_2 \approx 1.382 is the semi-minor axis (the least stretching). The column \vec{u}_1 of U points along the major axis. The column \vec{u}_2 points along the minor axis. On the input side, \vec{v}_1 is the direction that gets stretched the most, and \vec{v}_2 gets stretched the least.

[Figure: left, the input unit circle with the columns of V marked \vec{v}_1 and \vec{v}_2; right, the output ellipse with the columns of U scaled to \sigma_1\vec{u}_1 and \sigma_2\vec{u}_2. Input directions V map to output directions U, scaled by the singular values: A\vec{v}_1 = \sigma_1\vec{u}_1 and A\vec{v}_2 = \sigma_2\vec{u}_2.]

Left: the unit circle with the input singular vectors \vec{v}_1 (green) and \vec{v}_2 (purple) -- the directions on the input that get mapped to the ellipse axes. Right: the output ellipse with the output singular vectors \vec{u}_1 (blue) and \vec{u}_2 (purple), scaled by \sigma_1 and \sigma_2. The transformation maps \vec{v}_1 to \sigma_1\vec{u}_1 and \vec{v}_2 to \sigma_2\vec{u}_2. The singular values are the semi-axis lengths of the ellipse.

This is the cleanest summary of what any matrix does: it picks out special input directions (V columns), maps them to special output directions (U columns), and scales each by its singular value. Everything else follows from linearity.
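That summary is one equation per direction, and it's directly testable. A quick NumPy check (illustrative) that A\vec{v}_i = \sigma_i\vec{u}_i for each pair:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)
V = Vt.T   # columns of V are the input singular vectors

# The defining relation A v_i = sigma_i u_i, one pair at a time.
checks = [np.allclose(A @ V[:, i], s[i] * U[:, i]) for i in range(2)]
print(checks)   # [True, True]
```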

Low-rank approximation

Here's where SVD becomes a practical superpower. If some singular values are small, you can set them to zero and get a matrix that's close to the original but has lower rank. That lower-rank matrix is the best rank-k approximation of A -- no other rank-k matrix is closer (in the Frobenius norm or the operator norm).

Imagine a matrix with singular values \sigma_1 = 10, \sigma_2 = 3, \sigma_3 = 0.1. The third singular value is tiny -- the transformation barely does anything in that direction. If you drop it, setting \sigma_3 = 0, you get a rank-2 approximation that captures \frac{10^2 + 3^2}{10^2 + 3^2 + 0.1^2} = \frac{109}{109.01} \approx 99.99\% of the matrix's "energy."
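The arithmetic in that fraction is worth doing once by hand (or machine). "Energy" here means the sum of squared singular values; keeping the two largest:

```python
import numpy as np

sigma = np.array([10.0, 3.0, 0.1])

# Energy = sum of squared singular values; keep the two largest.
energy_kept = np.sum(sigma[:2] ** 2) / np.sum(sigma ** 2)
print(round(energy_kept, 6))   # 0.999908, i.e. ~99.99%
```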

[Figure: dropping small singular values. Left, the full rank-3 matrix with \sigma_1 = 10, \sigma_2 = 3, \sigma_3 = 0.1; right, the rank-2 approximation after setting \sigma_3 = 0. Why this works for compression: full matrix storage is m x n values, while rank-k SVD storage is k(m + n + 1) values. For a 1000x1000 image, rank 50 uses ~10% of the storage with ~95% quality.]

Left: all three singular values of a matrix. The third (\sigma_3 = 0.1) is tiny compared to the others. Right: set \sigma_3 = 0, reducing the rank from 3 to 2. The approximation captures 99.99% of the matrix's energy. For data compression, this means massive storage savings with minimal quality loss.

This is why Netflix uses SVD for recommendations (users and movies form a matrix; most of the structure lives in a few singular values). It's why image compression works (natural images have rapidly decaying singular values). It's why principal component analysis (PCA) is SVD in disguise -- the principal components are the leading singular vectors.
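The PCA connection is easy to see in code. A minimal sketch (assuming NumPy; the data matrix here is synthetic): center the data, take its SVD, and the leading right singular vector matches the top eigenvector of the covariance matrix, up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: isotropic noise pushed through a stretching matrix.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

# PCA via SVD: center the data, then read off right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Compare against the eigenvectors of the covariance matrix.
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order

pc1_svd = Vt[0]           # first principal component from the SVD
pc1_eig = eigvecs[:, -1]  # top covariance eigenvector
print(np.allclose(abs(pc1_svd @ pc1_eig), 1.0))   # True: same direction
```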

The formal bit

The Singular Value Decomposition of any m \times n matrix A is:

A = U \Sigma V^T

where:

- U is an m \times m orthogonal matrix whose columns \vec{u}_i are the left singular vectors
- \Sigma is an m \times n diagonal matrix with non-negative entries \sigma_1 \geq \sigma_2 \geq \cdots \geq 0, the singular values
- V is an n \times n orthogonal matrix whose columns \vec{v}_i are the right singular vectors

The key relationships:

A\vec{v}_i = \sigma_i \vec{u}_i

This says: the matrix A maps the right singular vector \vec{v}_i to the left singular vector \vec{u}_i, scaled by \sigma_i. That's the entire content of the decomposition in one equation.

Connection to eigenvalues. The singular values are the square roots of the eigenvalues of A^TA (or equivalently, of AA^T):

A^TA = V\Sigma^T\Sigma V^T

The matrix A^TA is symmetric and positive semi-definite, so it always has non-negative eigenvalues and an orthogonal eigenvector basis. The eigenvalues of A^TA are \sigma_1^2, \sigma_2^2, \ldots, and the eigenvectors are the columns of V. Similarly, the eigenvalues of AA^T are the same \sigma_i^2, with eigenvectors being the columns of U.

Rank and singular values. The rank of A equals the number of nonzero singular values. A rank-k matrix has exactly k nonzero singular values. This gives us the best rank-k approximation: truncate the SVD to the k largest singular values:

A_k = \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^T

The Eckart-Young theorem guarantees that A_k is the closest rank-k matrix to A in both the Frobenius norm and the operator norm. No other rank-k matrix does better.
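A standard consequence of the theorem is that the Frobenius error of the truncation is exactly the norm of the dropped singular values: \|A - A_k\|_F = \sqrt{\sum_{i>k}\sigma_i^2}. A quick NumPy check on a random matrix (illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Best rank-k approximation: keep the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Frobenius error equals the norm of the dropped singular values.
err = np.linalg.norm(A - A_k)   # Frobenius norm is numpy's default
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```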

Let's verify with our example. A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}.

First, compute A^TA:

A^TA = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}^T\begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix}

The eigenvalues of A^TA are found from:

\lambda^2 - 15\lambda + 25 = 0 \quad \Rightarrow \quad \lambda = \frac{15 \pm \sqrt{125}}{2}

\lambda_1 \approx 13.09, \quad \lambda_2 \approx 1.91

So \sigma_1 = \sqrt{13.09} \approx 3.618 and \sigma_2 = \sqrt{1.91} \approx 1.382. The maximum stretch is about 3.6x, and the minimum is about 1.4x. The transformation stretches space in every direction -- nothing gets compressed below the original size.
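The same computation, done by machine (a NumPy sketch for illustration): the square roots of the eigenvalues of A^TA are the singular values of A.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Eigenvalues of A^T A (eigvalsh returns them in ascending order)...
lams = np.linalg.eigvalsh(A.T @ A)
print(np.round(np.sqrt(lams[::-1]), 3))   # [3.618 1.382]

# ...match the singular values of A.
print(np.round(np.linalg.svd(A, compute_uv=False), 3))   # [3.618 1.382]
```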

Worked example: image compression with SVD

A grayscale image is just a matrix -- each entry is a pixel intensity from 0 (black) to 255 (white). An 800x600 image is an 800x600 matrix with 480,000 values. The SVD can compress this dramatically.

Here's the idea. Compute the SVD of the image matrix: A = U\Sigma V^T. The matrix has at most 600 singular values (the minimum dimension). But natural images have rapidly decaying singular values -- the first few capture the broad shapes, the next few add detail, and most of the rest are near zero.

Instead of storing all 480,000 pixel values, store only the k largest singular values and their corresponding singular vectors:

A_k = \sigma_1 \vec{u}_1 \vec{v}_1^T + \sigma_2 \vec{u}_2 \vec{v}_2^T + \cdots + \sigma_k \vec{u}_k \vec{v}_k^T

Each term \sigma_i \vec{u}_i \vec{v}_i^T is a rank-1 matrix: an 800-dimensional column times a 600-dimensional row, which costs 800 + 600 + 1 = 1401 values to store. So a rank-k approximation costs 1401k values.

Rank k      Storage    Compression          Quality
1           1,401      0.3% of original     Blobby -- a single weighted average of the image's dominant gradient
5           7,005      1.5% of original     Recognizable -- major shapes and contrast visible
20          28,020     5.8% of original     Sharp -- most details recovered, minor artifacts
50          70,050     14.6% of original    Nearly indistinguishable from original
600 (full)  480,000    100%                 Exact reproduction
function svdCompress(imageMatrix, k) {
  // imageMatrix is m x n. computeSVD and zeros are assumed helpers
  // from a numerical library -- plain JavaScript has no built-in SVD.
  const { U, sigma, V } = computeSVD(imageMatrix);
  const m = imageMatrix.length;
  const n = imageMatrix[0].length;

  // Accumulate the k largest rank-1 layers into an m x n matrix.
  const compressed = zeros(m, n);

  for (let i = 0; i < k; i++) {
    // Add the rank-1 contribution: sigma[i] * U[:,i] * V[:,i]^T
    for (let row = 0; row < m; row++) {
      for (let col = 0; col < n; col++) {
        compressed[row][col] += sigma[i] * U[row][i] * V[col][i];
      }
    }
  }

  return compressed;
}

// Storage comparison
const m = 800, n = 600;
const fullStorage = m * n;                    // 480,000
const rank50Storage = 50 * (m + n + 1);       // 70,050
const compressionRatio = rank50Storage / fullStorage; // ~14.6%

What's happening geometrically? Each rank-1 term \sigma_i \vec{u}_i \vec{v}_i^T is a "layer" of the image. The first layer (i = 1, largest \sigma) captures the coarsest structure -- the overall brightness gradient. The next layers add progressively finer detail. Dropping the layers with small \sigma_i means dropping the barely-visible fine details.

This same principle powers dimensionality reduction in machine learning (PCA), latent semantic analysis in search engines (treating documents as vectors), and noise reduction (noise contributes to small singular values, so truncating the SVD filters it out).
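The noise-reduction claim can be demonstrated on synthetic data. A minimal sketch (assuming NumPy; the "signal" is an artificial rank-2 matrix, not a real image): add noise, truncate the SVD to rank 2, and the truncation lands closer to the clean signal than the noisy matrix does.

```python
import numpy as np

rng = np.random.default_rng(2)

# A genuinely rank-2 "signal" matrix, plus small dense noise.
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30))
noisy = signal + 0.05 * rng.normal(size=(50, 30))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Noise spreads thinly across all singular values;
# the signal concentrates in the top two.
denoised = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]

err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(err_denoised < err_noisy)   # True: truncation filtered noise
```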

Key Takeaway: SVD reveals the skeleton of any transformation: its fundamental directions and magnitudes. Every matrix is rotate-scale-rotate. It's the Swiss Army knife of linear algebra -- it works for any matrix (square or not), finds the best low-rank approximation, and decomposes the transformation into its simplest possible pieces.

Where to go from here

We started this book with a single arrow in space -- a vector. We learned to combine vectors, transform them with matrices, and measure the effect of those transformations with determinants, rank, and null spaces. We discovered that dot products reveal projections and angles, cross products give us perpendicular directions and areas, and eigenvalues expose the directions where a transformation acts most simply. We built orthonormal bases with Gram-Schmidt, and finally arrived here: the SVD, which unifies all of it into a single factorization that works for every matrix.

That's the visual core of linear algebra. You can now look at any matrix and see what it does: the stretching, the rotation, the collapsing, the rank. Every concept connects to a geometric picture.

If you want to keep going, the trail branches in several directions. Abstract vector spaces generalize these ideas beyond arrows in \mathbb{R}^n -- function spaces, polynomial spaces, and infinite-dimensional spaces all obey the same rules. Tensors extend matrices to higher dimensions, essential for physics and deep learning. Differential forms bring linear algebra into calculus, connecting determinants to integration and cross products to curl. And numerical linear algebra -- the study of how to compute all of this efficiently and stably on a real computer -- is a field unto itself, where the SVD is the undisputed champion algorithm.

Every one of those paths starts from what you've learned here. The geometry doesn't change -- it just gets richer.