Chapter 18

3D Transformations & Camera

How does a 3D scene become a flat image on your screen? A pipeline of matrix multiplications -- each one transforming space for a different purpose.

Every 3D application -- game engines, CAD tools, film renderers -- solves the same fundamental problem. You have objects defined in their own local coordinate systems, a camera positioned somewhere in the world, and a flat screen that needs to show what the camera sees. The entire journey from "a cube modeled at the origin" to "pixels on a screen" is a chain of matrix multiplications. Each matrix reshapes space in a specific way: placing objects, repositioning the camera, and projecting 3D depth onto a 2D surface. Understanding this pipeline is understanding how every 3D renderer works.

The key tool is the 4x4 homogeneous matrix. We've been working in 2D with 2x2 matrices, but 3D transformations need an extra trick. A 3x3 matrix can rotate and scale 3D space, but it can't translate (shift) points. By embedding our 3D coordinates into 4D -- writing $(x, y, z)$ as $(x, y, z, 1)$ -- we gain the ability to encode translation inside a matrix multiplication. Every transformation in the pipeline is a 4x4 matrix, and the entire pipeline composes into a single 4x4 matrix called the MVP matrix.

The rendering pipeline

The journey from a 3D model to pixels on screen passes through a sequence of coordinate spaces. Each transition is a matrix multiplication:

Model Space $\xrightarrow{M}$ World Space $\xrightarrow{V}$ Camera Space $\xrightarrow{P}$ Clip Space $\rightarrow$ Screen

The model matrix $M$ places an object into the world. The view matrix $V$ repositions everything relative to the camera. The projection matrix $P$ squashes 3D depth into a normalized volume ready for rasterization. These three matrices compose into one: $M_{\text{clip}} = P \cdot V \cdot M$.

[Figure: the rendering pipeline -- Model Space -> World Space -> Camera Space -> Clip Space -> Screen, with the MVP matrix applied once per vertex]

The rendering pipeline is a sequence of coordinate space transformations. Each arrow is a matrix multiplication. The three core matrices -- Model, View, and Projection -- compose into a single MVP matrix that takes a vertex from its local definition all the way to screen-ready coordinates.

Every vertex in a 3D scene goes through this pipeline. A model with 10,000 vertices means 10,000 MVP multiplications per frame. That's why GPUs are built as massive parallel matrix multipliers -- the entire graphics pipeline is linear algebra at high throughput.

Model transform: placing objects in the world

A 3D model is typically defined with its center at the origin. A cube might have vertices at $(\pm 1, \pm 1, \pm 1)$. To place this cube at position $(5, 0, -10)$ in the world, rotated 45 degrees around the y-axis, and scaled to double size, you construct a model matrix -- a single 4x4 matrix that encodes translation, rotation, and scaling.

The 4x4 model matrix has a specific structure:

M = \begin{bmatrix} & & & t_x \\ & R \cdot S & & t_y \\ & & & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}

The upper-left 3x3 block encodes rotation and scaling. The rightmost column encodes translation. The bottom row is always $(0, 0, 0, 1)$. This is why we need 4x4 matrices -- a 3x3 matrix can't translate, but the fourth dimension sneaks it in.
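To make the trick concrete, here's a minimal sketch (the helper name `mulMat4Vec4` is ours, not from this chapter's code): multiplying $(x, y, z, 1)$ by a 4x4 translation matrix adds the translation through the fourth coordinate -- something no 3x3 multiply can do.

```javascript
// Row-major 4x4 matrix (array of 16) times a 4-vector.
function mulMat4Vec4(m, v) {
  const out = [0, 0, 0, 0];
  for (let r = 0; r < 4; r++)
    for (let c = 0; c < 4; c++)
      out[r] += m[r * 4 + c] * v[c];
  return out;
}

// Translation by (5, 0, -10) as a 4x4 homogeneous matrix.
const T = [
  1, 0, 0, 5,
  0, 1, 0, 0,
  0, 0, 1, -10,
  0, 0, 0, 1,
];

// w = 1 picks up the translation column: (1, 1, 1) -> (6, 1, -9).
const p = mulMat4Vec4(T, [1, 1, 1, 1]); // [6, 1, -9, 1]
```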

[Figure: a cube at the origin in model space, and the same cube translated, rotated, and scaled into world space; inset shows the 4x4 model matrix structure]

Left: a cube defined at the origin in model space. Right: the same cube placed into the world -- translated to $(5, 0, -10)$, rotated, and scaled. The model matrix $M$ encodes all three operations in a single 4x4 matrix. The upper-left 3x3 block handles rotation and scaling; the right column handles translation.

The order matters. Scale first, then rotate, then translate: $M = T \cdot R \cdot S$. If you translate first and then rotate, the object orbits around the world origin instead of rotating in place. This is exactly the same composition-order issue we saw with 2D matrix multiplication -- transformations are applied right to left.
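The composition order can be checked directly. In this sketch (row-major matrices; helper names are ours), building $M = T \cdot R \cdot S$ keeps the object centered at the translation, while multiplying in the other order rotates the translation itself, so the object ends up orbiting the origin.

```javascript
// Row-major 4x4 multiply: a . b means b is applied first.
function mulMat4(a, b) {
  const out = new Array(16).fill(0);
  for (let r = 0; r < 4; r++)
    for (let c = 0; c < 4; c++)
      for (let k = 0; k < 4; k++)
        out[r * 4 + c] += a[r * 4 + k] * b[k * 4 + c];
  return out;
}

const translation = (tx, ty, tz) =>
  [1,0,0,tx, 0,1,0,ty, 0,0,1,tz, 0,0,0,1];
const rotationY = (angle) => {
  const c = Math.cos(angle), s = Math.sin(angle);
  return [c,0,s,0, 0,1,0,0, -s,0,c,0, 0,0,0,1];
};
const scaling = (s) => [s,0,0,0, 0,s,0,0, 0,0,s,0, 0,0,0,1];

// Correct order: scale, then rotate, then translate.
// The translation column of M stays (5, 0, -10).
const M = mulMat4(translation(5, 0, -10),
                  mulMat4(rotationY(Math.PI / 4), scaling(2)));

// Translate-then-rotate instead: the rotation spins the
// translation too, so the object orbits the world origin.
const orbiting = mulMat4(rotationY(Math.PI / 4),
                         mulMat4(translation(5, 0, -10), scaling(2)));
```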

View transform: the camera's perspective

The view matrix does something conceptually simple: it moves the entire world so that the camera ends up at the origin, looking down the negative z-axis. Instead of moving the camera, you move everything else in the opposite direction.

If the camera is at position $\vec{e}$ (the "eye"), with right vector $\vec{r}$, up vector $\vec{u}$, and forward vector $\vec{f}$ (pointing where the camera looks), the view matrix is:

V = \begin{bmatrix} r_x & r_y & r_z & -\vec{r} \cdot \vec{e} \\ u_x & u_y & u_z & -\vec{u} \cdot \vec{e} \\ -f_x & -f_y & -f_z & \vec{f} \cdot \vec{e} \\ 0 & 0 & 0 & 1 \end{bmatrix}

This is a change of basis (the rotation part -- the 3x3 block of camera axes) combined with a translation (the dot-product terms that shift the camera to the origin). Because the camera axes form an orthonormal basis, the rotation part is just the transpose of the matrix whose columns are $\vec{r}$, $\vec{u}$, $-\vec{f}$ -- the inverse is free because orthogonal matrices invert by transposing.

[Figure: the world with a camera at eye = (1, 2, 5), and the same scene in camera space with the camera at the origin looking down the forward axis]

Left: the world from above, with a camera at some position and objects scattered around. Right: after applying the view matrix, the camera sits at the origin looking down $-z$, and all objects have been repositioned relative to it. The view matrix is a change of basis -- the geometry is unchanged, only the coordinate system.

This is the change of basis from Chapter 12 applied in full 3D. The camera's right/up/forward vectors define a basis, and the view matrix converts from world coordinates into camera coordinates. The rotation part of the view matrix is the transpose of the camera's orientation matrix (because orthogonal matrices invert by transposing), and the translation part shifts the world to center on the camera.
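As a sketch of how this looks in code (helper names are ours; real math libraries such as glMatrix expose a lookAt-style function that also derives the axes from a target point), the view matrix is just the camera axes laid out as rows plus the dot-product translation terms:

```javascript
// Assumes the camera axes are already known and orthonormal.
function dot(a, b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

// Row-major 4x4: the rows are the camera axes (the transpose of
// the orientation matrix, i.e. its inverse), and the last column
// shifts the eye to the origin.
function viewMatrix(eye, right, up, forward) {
  return [
     right[0],    right[1],    right[2],   -dot(right, eye),
     up[0],       up[1],       up[2],      -dot(up, eye),
    -forward[0], -forward[1], -forward[2],  dot(forward, eye),
     0, 0, 0, 1,
  ];
}

// A camera at the origin with standard axes (looking down -z)
// leaves the world untouched: the view matrix is the identity.
const V = viewMatrix([0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, -1]);
```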

Projection: depth becomes flatness

The projection matrix is where the magic of perspective happens. Objects farther away appear smaller. Parallel lines converge to a vanishing point. A 3D scene becomes a 2D image.

The perspective projection maps a truncated pyramid (a "frustum") into a normalized cube. The frustum is defined by a near plane, a far plane, and the field of view angle. Everything inside the frustum is visible; everything outside gets clipped.

The key idea: divide $x$ and $y$ by $z$. If a point is twice as far away (double $z$), its projected position is halved. That's perspective. The projection matrix encodes this division using a clever trick with homogeneous coordinates -- it places $z$ (negated, since the camera looks down $-z$) into the $w$ component, and the GPU performs the division $x/w$, $y/w$ after the matrix multiplication.

P_{\text{persp}} = \begin{bmatrix} \frac{1}{\text{aspect} \cdot \tan(\frac{\theta}{2})} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\frac{\theta}{2})} & 0 & 0 \\ 0 & 0 & \frac{-(f+n)}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix}

where $\theta$ is the vertical field of view, $n$ and $f$ are the near and far planes, and "aspect" is the width/height ratio.
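In code, the matrix above is straightforward to build. A sketch (the function name is ours; libraries like glMatrix ship an equivalent `perspective` helper):

```javascript
// Build the perspective matrix from the formula above.
// fovY: vertical field of view in radians; aspect: width/height;
// n, f: near and far plane distances. Row-major 4x4.
function perspective(fovY, aspect, n, f) {
  const t = Math.tan(fovY / 2);
  return [
    1 / (aspect * t), 0,     0,                  0,
    0,                1 / t, 0,                  0,
    0,                0,     -(f + n) / (f - n), -2 * f * n / (f - n),
    0,                0,     -1,                 0,
  ];
}

// 90-degree vertical FOV, square viewport, near 1, far 100.
const P = perspective(Math.PI / 2, 1, 1, 100);
```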

[Figure: the view frustum (camera space, side view) with near and far planes and the FOV angle, and the normalized clip-space cube it maps to after projection]

Left: the view frustum -- the truncated pyramid of visible space in front of the camera. Near objects take up more of the view; far objects take up less. Right: after the projection matrix, the frustum is warped into a normalized cube. Near objects remain large; far objects shrink. This is perspective -- achieved by dividing x and y by z.

The bottom row of the perspective matrix is $(0, 0, -1, 0)$ instead of the usual $(0, 0, 0, 1)$. This is the trick: when you multiply a point $(x, y, z, 1)$ by this matrix, the output's $w$ component becomes $-z$. The subsequent perspective divide ($x/w$, $y/w$, $z/w$) is what creates the foreshortening effect. Points farther away get divided by a larger $w$, making them smaller on screen.

The formal bit

All 3D transformations in the rendering pipeline use 4x4 homogeneous matrices. A 3D point $(x, y, z)$ is represented as $(x, y, z, 1)$, and a 3D direction as $(x, y, z, 0)$. The fourth component distinguishes points (which are affected by translation) from directions (which are not).
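The distinction is easy to verify numerically. A minimal sketch (the helper name is ours): translating a point moves it, while translating a direction leaves it alone, because $w = 0$ zeroes out the translation column.

```javascript
// Row-major 4x4 matrix times a 4-vector.
function mulMat4Vec4(m, v) {
  const out = [0, 0, 0, 0];
  for (let r = 0; r < 4; r++)
    for (let c = 0; c < 4; c++)
      out[r] += m[r * 4 + c] * v[c];
  return out;
}

// Translate by (5, 0, -10).
const T = [1,0,0,5, 0,1,0,0, 0,0,1,-10, 0,0,0,1];

const point     = mulMat4Vec4(T, [0, 0, 1, 1]); // w = 1: moves
const direction = mulMat4Vec4(T, [0, 0, 1, 0]); // w = 0: unchanged
```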

The model matrix $M$ places an object in the world. It composes scale, rotation, and translation:

M = T \cdot R \cdot S = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \cdot R_{4 \times 4} \cdot \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

The view matrix $V$ transforms from world space to camera space. Given camera position $\vec{e}$, and orthonormal camera axes $\vec{r}$ (right), $\vec{u}$ (up), $\vec{f}$ (forward):

V = \begin{bmatrix} r_x & r_y & r_z & -\vec{r} \cdot \vec{e} \\ u_x & u_y & u_z & -\vec{u} \cdot \vec{e} \\ -f_x & -f_y & -f_z & \vec{f} \cdot \vec{e} \\ 0 & 0 & 0 & 1 \end{bmatrix}

The negation of $\vec{f}$ is because cameras conventionally look down $-z$ in camera space. The dot-product terms in the last column combine the rotation and translation into one matrix.

The perspective projection matrix maps the view frustum to the $[-1, 1]^3$ clip cube:

P = \begin{bmatrix} \frac{1}{a \cdot \tan(\theta/2)} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\theta/2)} & 0 & 0 \\ 0 & 0 & \frac{-(f+n)}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix}

where $a$ is the aspect ratio, $\theta$ is the vertical field of view, and $n$, $f$ are the near and far clip distances.

The MVP matrix combines all three:

M_{\text{clip}} = P \cdot V \cdot M

For each vertex $\vec{v}$ in model space, the clip-space position is $\vec{v}_{\text{clip}} = M_{\text{clip}} \cdot \vec{v}$. A single matrix multiplication per vertex, no matter how complex the scene setup.

Worked example: a cube viewed from the origin

Let's trace a vertex through the full pipeline. We have a unit cube centered at the origin in model space, and we want to place it at position $(5, 0, -10)$ in the world. The camera is at the world origin, looking down the $-z$ axis.

Step 1: Model matrix. We translate by $(5, 0, -10)$ with no rotation or scaling:

M = \begin{bmatrix} 1 & 0 & 0 & 5 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -10 \\ 0 & 0 & 0 & 1 \end{bmatrix}

Take the front-top-right vertex of the cube: $\vec{v} = (1, 1, 1, 1)$.

M \vec{v} = \begin{bmatrix} 1 + 5 \\ 1 + 0 \\ 1 + (-10) \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 1 \\ -9 \\ 1 \end{bmatrix}

The vertex is now at $(6, 1, -9)$ in world space.

Step 2: View matrix. The camera is at the origin looking down $-z$ with standard orientation. That means $\vec{r} = (1, 0, 0)$, $\vec{u} = (0, 1, 0)$, $\vec{f} = (0, 0, -1)$, and $\vec{e} = (0, 0, 0)$. The view matrix simplifies to the identity:

V = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

So $V \cdot M\vec{v} = (6, 1, -9, 1)$ -- unchanged, because the camera is already at the origin facing $-z$.

Step 3: Projection matrix. Assume a 90-degree field of view ($\tan(45^\circ) = 1$), aspect ratio $a = 1$, near plane $n = 1$, far plane $f = 100$:

P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -\frac{101}{99} & -\frac{200}{99} \\ 0 & 0 & -1 & 0 \end{bmatrix}

Apply to $(6, 1, -9, 1)$:

P \cdot \begin{bmatrix} 6 \\ 1 \\ -9 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 1 \\ (-1.0202)(-9) + (-2.0202) \\ -(-9) \end{bmatrix} = \begin{bmatrix} 6 \\ 1 \\ 9.182 - 2.020 \\ 9 \end{bmatrix} = \begin{bmatrix} 6 \\ 1 \\ 7.162 \\ 9 \end{bmatrix}

Step 4: Perspective divide. Divide $x$, $y$, $z$ by $w = 9$:

\vec{v}_{\text{NDC}} = \left(\frac{6}{9},\; \frac{1}{9},\; \frac{7.162}{9}\right) = (0.667,\; 0.111,\; 0.796)

All components are in $[-1, 1]$, so the vertex is visible. Its screen position (ignoring the viewport transform) is approximately $(0.667, 0.111)$ -- slightly right of center and slightly above center. The $z$-value $0.796$ is used for depth testing.

That single vertex went through three matrix multiplications and one division. In practice, the three matrices are pre-multiplied into one MVP matrix, so each vertex only needs one 4x4 multiplication plus the perspective divide. At 60 frames per second with millions of vertices, this is why GPU hardware is optimized for exactly this operation.

// The full pipeline in code
function transformVertex(vertex, model, view, projection) {
  // Combine into one matrix (done once per object, not per vertex)
  const mvp = multiply4x4(projection, multiply4x4(view, model));

  // Transform vertex (done per vertex)
  const clip = multiply4x4byVec4(mvp, [...vertex, 1]);

  // Perspective divide
  const w = clip[3];
  return {
    x: clip[0] / w,
    y: clip[1] / w,
    z: clip[2] / w   // used for depth testing
  };
}
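As a sanity check, the worked example above can be reproduced end to end. This sketch (the helpers are ours, row-major matrices) builds the three matrices from Steps 1-3, composes them into one MVP matrix, and recovers the NDC coordinates $(0.667, 0.111, 0.796)$:

```javascript
// Row-major 4x4 helpers.
function mulMat4(a, b) {
  const out = new Array(16).fill(0);
  for (let r = 0; r < 4; r++)
    for (let c = 0; c < 4; c++)
      for (let k = 0; k < 4; k++)
        out[r * 4 + c] += a[r * 4 + k] * b[k * 4 + c];
  return out;
}
function mulMat4Vec4(m, v) {
  const out = [0, 0, 0, 0];
  for (let r = 0; r < 4; r++)
    for (let c = 0; c < 4; c++)
      out[r] += m[r * 4 + c] * v[c];
  return out;
}

// Steps 1-3 from the worked example.
const M = [1,0,0,5,  0,1,0,0,  0,0,1,-10, 0,0,0,1]; // translate
const V = [1,0,0,0,  0,1,0,0,  0,0,1,0,   0,0,0,1]; // identity
const P = [1,0,0,0,  0,1,0,0,                       // fov 90 deg,
           0,0,-101/99,-200/99, 0,0,-1,0];          // n = 1, f = 100

// One MVP matrix, then one multiply per vertex plus the divide.
const mvp  = mulMat4(P, mulMat4(V, M));
const clip = mulMat4Vec4(mvp, [1, 1, 1, 1]);
const ndc  = clip.slice(0, 3).map(c => c / clip[3]);
// ndc ~ [0.667, 0.111, 0.796], w = clip[3] = 9
```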

Key Takeaway: The graphics pipeline is a sequence of matrix multiplications: model, view, and projection. Each transforms space for a different purpose. The model matrix places an object in the world. The view matrix reframes everything relative to the camera. The projection matrix collapses 3D depth into 2D by dividing by distance. Together they compose into a single MVP matrix: $M_{\text{clip}} = P \cdot V \cdot M$.

What's next

We've seen specific types of transformations -- rotations, projections, scaling. SVD reveals that every matrix decomposes into these primitives: rotate, scale, rotate. Any linear transformation, no matter how complex, is secretly just those three steps.