Multivariable Calculus

Peter Ma | May 30th 2026

Have feedback on these notes?

Open the feedback form (new tab). I use responses to improve this module for new students. If your comments are about this page, choose Multivariable Calculus in the first question.

Worksheet (Google Colab)

Work through this companion notebook while you read: open the Colab worksheet (new tab). Sign in with a Google account if Colab asks; then use File → Save a copy in Drive if you want your own editable version.

Who these notes are for. If you have taken something like AP Calculus (or an equivalent first-year calculus course), you already bring a lot to the table: you are comfortable with functions, graphs, slopes, and solving equations. You do not need to have seen a full college multivariable calculus course. This page is meant to onboard you into the vocabulary ML and physics papers use: partial derivatives, gradients, Jacobians, and Hessians, so the rest of the lab feels less like alphabet soup. It builds on the companion linear algebra primer; wherever you see a blue link like this in the text, it jumps to the matching section there.

How to read it. Go at your own pace. When a paragraph feels heavy, try the colored questions (red) on paper, peek at a picture or interactive widget, or ask a tool (blue) for a second explanation in different words. This is not a substitute for a real multivariable calculus course; it is a friendly on-ramp so you can read code and papers and ask sharper questions.

Red \(=\) stop and think / work on question

Blue \(=\) centered boxes tagged Ask AI placed near new ideas; paste the prompt into ChatGPT or another tool for a second explanation.


Why Multivariable Calculus?

Multivariable calculus is the generalization of ordinary calculus to multiple variables. Why do we care about multiple variables? Many real-world quantities depend on more than one input at once. Its tools are the foundation of many critical areas of research from physics, to machine learning, to statistics and beyond.

Limits

As you probably remember, calculus is the study of making something infinitely divisible and studying how things change. At its core is the idea of a limit. The limit is what defines differentiation and then integration. The idea of the limit that you are most likely familiar with is approaching a function from both sides. \[\lim_{x\to 0^-}x^2 = \lim_{x\to 0^+} x^2 = 0 \Rightarrow \lim_{x\to 0} x^2 = 0 \] The tricky part is... how do we make sense of this left-and-right scenario in higher dimensions? Here is where we need to upgrade our definition of the limit to the true grown-up definition of the limit, called the delta-epsilon limit definition.

First let us translate the 1-D idea of a limit into the new definition, which states \[\lim_{x\to a} f(x) = L \text{ if and only if } \forall \epsilon > 0,\ \exists \delta > 0 : 0 < |x-a|<\delta \Rightarrow |f(x)-L|< \epsilon \] where \(L\) is the limit.

Hold on. What does this mean in English? You will come across this notation in more "advanced" math classes, but put simply it says: "for all (\(\forall\)) \(\epsilon\) greater than \(0\), there exists (\(\exists\)) a \(\delta\) greater than \(0\) such that (\(:\)) whenever \(x\) is within \(\delta\) of \(a\) (but not equal to \(a\)), that implies (\(\Rightarrow\)) \(f(x)\) is within \(\epsilon\) of \(L\)." The order matters: \(\epsilon\) is chosen first, and \(\delta\) is allowed to depend on it.

This looks and sounds confusing... what does this mean? The analogy I like to give is to imagine playing a game against a computer. In this game, you have the function \(x^2\) and a point \(a\) where you want the limit. The computer hands you an \(\epsilon\) and asks, "Can you find me a \(\delta\) value that guarantees that for any \(x\) between \(a-\delta\) and \(a+\delta\), the function \(x^2\) will not deviate from the limit \(L\) by more than \(\epsilon\)?"

Let's play one round of this game! Let's say the place where we want to find the limit is at \(x \to 0\), so \(L = 0\). Okay, and the computer hands us \(\epsilon = 2\). That's easy! If we pick \(\delta = 1\), then we can easily win, right? Every \(x\) with \(|x-0|<1\) has \(|f(x)-L| = x^2 < 1 < \epsilon\). That's just one turn. What the true definition of the limit is really saying is that no matter what \(\epsilon\) the computer hands you, if you can always win against the computer by picking a winning \(\delta\), then the limit exists!
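To make the game concrete, here is a minimal sketch of several rounds for \(x^2\) at \(0\). The helper names (`winning_delta`, `check_round`) are made up for this illustration, and the check only samples points inside the interval, so it is a spot-check of the idea rather than a proof.

```python
import math

def f(x):
    return x * x

def winning_delta(epsilon):
    # Assumption for this sketch: a = 0 and L = 0. If |x| < delta then
    # x^2 < delta^2, so delta = sqrt(epsilon) is one winning answer.
    return math.sqrt(epsilon)

def check_round(epsilon, samples=1000):
    """Spot-check one round by sampling x with 0 < |x| < delta."""
    delta = winning_delta(epsilon)
    for k in range(1, samples):
        x = delta * k / samples          # 0 < x < delta
        for point in (x, -x):            # test both sides of 0
            if abs(f(point) - 0.0) >= epsilon:
                return False
    return True

for epsilon in (2.0, 0.1, 1e-6):
    print(epsilon, winning_delta(epsilon), check_round(epsilon))
```

Notice that the winning \(\delta\) shrinks as the computer hands over a smaller \(\epsilon\); that dependence is allowed by the rules of the game.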

Here is an example where the limit breaks! \(f(x) =\frac{1}{x}\) at \(0\). We know from the old definition that one side goes to \(\infty\) while the other side goes to \(-\infty\), and thus the limit doesn't exist. In the game, no matter which candidate limit \(L\) and which \(\delta\) we pick, there are points within \(\delta\) of \(0\) where \(\frac{1}{x}\) is enormous, far more than \(\epsilon\) away from \(L\). That's really bad. That means we cannot win this game, and the limit does not exist.

Now why is this definition useful for multidimensional stuff? Well, this \(\delta\), instead of determining what is effectively an interval range, determines a ball. Let me explain what I mean. So in \(\mathbb{R}^3\), for example, the definition of a limit becomes: if you can always find a ball (sphere) of radius \(\delta\) around some point \(x\) such that it guarantees you win this delta-epsilon game within that ball, then you have a limit! If not, then the limit does not exist.

Ask AI Explain the delta-epsilon definition of a limit using the "ball in \(\mathbb{R}^3\)" picture instead of left/right limits. Then give one example of a function whose limit fails in 2D because two paths disagree.

Do you see what we've done here? In high school calculus, you were told to approach from the left or from the right. In higher dimensions, you can approach from any direction you please. And the definition of the limit that we've formulated neatly describes that idea: if you approach the value from 360 degrees in any direction, then the values must agree with one another.
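Here is a small sketch of what "the values must agree" rules out. The function \(g(x,y) = \frac{xy}{x^2+y^2}\) is not from these notes; it is a standard example chosen for illustration. We walk toward \((0,0)\) along two straight lines and compare what we see.

```python
def g(x, y):
    # Illustrative function (an assumption of this sketch); undefined at (0, 0).
    return x * y / (x * x + y * y)

def approach(slope, steps=6):
    """Values of g along the line y = slope * x as x shrinks toward 0."""
    return [g(10.0 ** -k, slope * 10.0 ** -k) for k in range(1, steps + 1)]

along_x_axis = approach(0.0)    # path y = 0
along_diagonal = approach(1.0)  # path y = x

print(along_x_axis[-1], along_diagonal[-1])
```

Along the \(x\)-axis every value is \(0\), while along the diagonal every value is \(\frac{1}{2}\). Two directions disagree, so no single \(L\) can win the game inside any ball around the origin, and the limit does not exist.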

Why do we care? This new definition of the limit is what allows us to define derivatives.

Partial Derivatives and the Gradient (The New Derivative)

Okay, now look at the derivative. Famously, the derivative is the rate of change, and we defined the derivative via the limit \[\frac{df}{dx} = \lim_{h\to 0 } \frac{f(x+h)- f(x)}{h}\] We face the same problem as before: in the world of multiple dimensions, how do we define rate of change... and in what direction? This is where the idea of a gradient comes in.

Let's say we have a simple function that takes a vector (a bundle of numbers in each coordinate) \((x,y,z)\) and gives you a number \(f(x,y,z) = x\cdot y\cdot z \) as an example. We can pick a direction to find the derivative in. The direction we pick is along one of the variables. Let's say along the \(x\)-axis. Then we write \(\frac{\partial f}{\partial x} \). This is called a partial derivative, and it is formally defined as \[\frac{\partial f}{\partial x} = \lim_{h\to 0 } \frac{f(x+h, y, z)- f(x, y,z)}{h}\]

A partial derivative is just like a normal derivative, except you treat all the other variables as constants. In the example I gave, \(\frac{\partial f}{\partial x} = \frac{\partial }{\partial x} (xyz) = yz \). Think about why this is. I just treat \(x\) as the variable and the others as constant, and since \(x\) is linear we get \(yz\) as the answer.
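If you want to check a partial derivative numerically, a finite difference does the job. This is a minimal sketch: the helper name `partial_x` and the test point are arbitrary choices, and it uses a central difference (stepping both forward and backward) instead of the one-sided limit written above, because that is usually more accurate for the same \(h\).

```python
def f(x, y, z):
    return x * y * z

def partial_x(func, x, y, z, h=1e-6):
    # Central difference: only x moves; y and z are held fixed.
    return (func(x + h, y, z) - func(x - h, y, z)) / (2 * h)

x, y, z = 2.0, 3.0, 5.0
approx = partial_x(f, x, y, z)
exact = y * z                      # the formula from the text: yz

print(approx, exact)
```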

Great, but we haven't addressed the elephant in the room: which direction do we approach from? We take partial derivatives in the direction of each coordinate axis! We introduce the fancy \(\nabla\) operation called the gradient. It is a vector, and it is very simple. \[\nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right].\] This definition can be made precise as well, where \(h = (h_x,h_y, h_z)\) is a vector and \(|h|\) is the norm of that vector. We say \(f\) is differentiable at \((x,y,z)\) with gradient \(\nabla f\) when \[\lim_{h\to (0,0,0) } \frac{f(x+h_x, y+h_y, z+ h_z)- f(x,y,z) - \nabla f \cdot h}{|h|} = 0.\] In words: the linear guess \(\nabla f \cdot h\) predicts the change in \(f\) so well that the leftover error shrinks faster than the step length \(|h|\). Note: this definition only really makes sense using our delta-epsilon definition of a limit, where we approach the point via a ball in all directions!

Ask AI Explain why the gradient points in the direction of steepest increase, and how gradient descent uses the opposite direction to minimize a loss function. Use a simple 2D bowl-shaped example.

What this is now saying is that you are measuring how much change is occurring in each direction, along the x-axis, the y-axis, the z-axis, and so on... Together, this gives you the direction the function is trending toward overall! Let's see this in action. \[\nabla (xyz) = [yz, xz, xy]\] Simple!

What is special about the gradient is that it always points in the direction where \(f\) increases fastest (steepest ascent). The reason is a dot product: the rate of change in unit direction \(\mathbf{u}\) is \(\nabla f \cdot \mathbf{u}\), and that is largest when \(\mathbf{u}\) lines up with \(\nabla f\). This is super important in machine learning: gradient descent steps in the opposite direction, \(-\nabla f\), to make a loss smaller, and this simple fact is what allows us to make machines learn to begin with.
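A quick way to convince yourself of the dot product claim is to try a few unit directions and compare rates. This sketch uses the gradient \([yz, xz, xy]\) from above; the test point and the candidate directions are arbitrary choices for illustration.

```python
import math

def grad_f(x, y, z):
    # Gradient of f(x, y, z) = x*y*z, from the text: [yz, xz, xy].
    return [y * z, x * z, x * y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

g = grad_f(2.0, 3.0, 5.0)          # [15, 10, 6], which has length 19
steepest = normalize(g)

candidates = {
    "along gradient": steepest,
    "x-axis": [1.0, 0.0, 0.0],
    "diagonal": normalize([1.0, 1.0, 1.0]),
    "against gradient": [-a for a in steepest],
}
# Rate of change in each unit direction u is the dot product grad . u
rates = {name: dot(g, u) for name, u in candidates.items()}

for name, rate in rates.items():
    print(name, round(rate, 4))
```

The largest rate is the one along the gradient, and it equals the length of the gradient. The most negative rate is in the exact opposite direction, which is the direction gradient descent walks in.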

Divergence

Let's take this up a notch. The gradient is special. What you notice is that at each point \((x,y,z)\), we can compute the gradient and get out a vector. This means each point has a vector. An assignment of a vector to every point like this is called a vector field. The vector field produced by the gradient operation gives us a sense of how the landscape is changing and in which direction things are changing. (Review vectors in \(\mathbb{R}^n\) if that language still feels new.)

More broadly, when every point in space has a vector attached to it, we call that a vector field. Vector fields have special properties like sinks and sources! What do I mean by this? Think about a whirlpool. The fluid has some velocity as it is swirling around, and that velocity points in various directions. Remember, this is what we call a vector field (at every point in space there is some vector associated with it). The velocity of a fluid is a vector field (each part of the fluid has a velocity vector telling it where it is headed)! But at the center of the whirlpool, all the vectors point toward the same place: the sink where the whirlpool flows in. Divergence is a tool to identify those sinks!

Likewise, if there is a source, say water flowing out of a single location, the divergence also tells you where that is!

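Specifically, for a vector field \(\mathbf{F} = [F_x, F_y, F_z]\), the divergence adds up how each component changes along its own axis: \[\nabla \cdot \mathbf{F} = \frac{\partial F_x}{\partial x} + \frac{\partial F_y}{\partial y} + \frac{\partial F_z}{\partial z}\] The \(\nabla \cdot\) symbol is a dot product between the gradient operator and the vector field, just like curl uses a cross product. The output is a single number at each point: positive at a source, negative at a sink.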

Ask AI What is divergence measuring in a vector field? Give me a fluid-flow analogy for a source, a sink, and a region with zero divergence.

The divergence tells you how much stuff is "diverging" from each other, kind of like measuring how things are splitting away.
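Here is a minimal numerical sketch of divergence for a vector field with components \([F_x, F_y, F_z]\): differentiate each component along its own axis and add. The three fields (`source`, `sink`, `swirl`) and the test point are hypothetical examples picked for illustration.

```python
def divergence(F, p, h=1e-5):
    """Sum of dF_i/dx_i at point p, using central differences."""
    total = 0.0
    for i in range(len(p)):
        forward = list(p)
        backward = list(p)
        forward[i] += h
        backward[i] -= h
        total += (F(forward)[i] - F(backward)[i]) / (2 * h)
    return total

def source(p):
    x, y, z = p
    return [x, y, z]          # arrows point away from the origin

def sink(p):
    x, y, z = p
    return [-x, -y, -z]       # arrows point toward the origin

def swirl(p):
    x, y, z = p
    return [-y, x, 0.0]       # arrows circle around the z-axis

point = [0.3, -0.2, 0.5]
div_source = divergence(source, point)
div_sink = divergence(sink, point)
div_swirl = divergence(swirl, point)

print(div_source, div_sink, div_swirl)
```

The source comes out positive (about \(3\)), the sink negative (about \(-3\)), and the pure swirl comes out at about \(0\): it moves fluid around without creating or destroying any.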

Curl

Curl is the companion to divergence. Divergence measures whether vectors are spreading out from a point or collapsing into it. Curl measures whether the field is locally spinning.

Go back to the whirlpool analogy. Divergence helped us spot the sink at the center, where everything gets sucked in. But most of the whirlpool is not a pure sink. It is fluid circling around. If you put a tiny paddle wheel in the water and it starts rotating, that is exactly what curl is detecting. If the whirlpool did not have fluid flowing straight in and was just spinning in circles, curl would tell you how strong that rotation is and which way it turns.

So if divergence is about sources and sinks, curl is about rotation. A field with zero curl everywhere is sometimes called irrotational, meaning locally nothing is spinning. A field with zero divergence everywhere is solenoidal, meaning locally nothing is being created or destroyed. Real fluid flows often have both effects happening at once!

Like divergence, curl is built from partial derivatives, but now we are differentiating a vector field instead of adding derivatives of one scalar function. If \[\mathbf{F}(x,y,z) = \begin{bmatrix} F_x \\ F_y \\ F_z \end{bmatrix},\] then the curl is written with the same \(\nabla\) operator, but using a cross product this time: \[\nabla \times \mathbf{F}.\] In 3D, you can write this as a determinant:

Ask AI Compare divergence and curl in plain language. When would a vector field have high curl but zero divergence, or the other way around?

\[ \nabla \times \mathbf{F} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\[0.3em] \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\[0.3em] F_x & F_y & F_z \end{vmatrix}. \] If you expand that determinant, you get \[ \nabla \times \mathbf{F} = \begin{bmatrix} \dfrac{\partial F_z}{\partial y} - \dfrac{\partial F_y}{\partial z} \\[0.6em] \dfrac{\partial F_x}{\partial z} - \dfrac{\partial F_z}{\partial x} \\[0.6em] \dfrac{\partial F_y}{\partial x} - \dfrac{\partial F_x}{\partial y} \end{bmatrix}. \] Do not worry about memorizing this on day one. The important idea is the pattern: each component compares how one component of the field changes in a different direction. That mismatch is what creates rotation.

Linear algebra connection: look at the symbol again: \(\nabla \times \mathbf{F}\). The curl is literally a cross product between the gradient operator and the vector field. Cross products measure spinning and perpendicular directions in 3D space, so it makes sense that the cross product shows up here. If the output curl vector points along the \(z\) axis, your local rotation is happening mostly in the \(xy\) plane, and the right hand rule tells you which way things are turning.

Here is a simple example. Suppose \[\mathbf{F}(x,y,z) = \begin{bmatrix} -y \\ x \\ 0 \end{bmatrix}.\] This field says: at point \((x,y)\), move in the direction \((-y,x,0)\). That is exactly the kind of pattern you see when things rotate counterclockwise around the origin in the \(xy\) plane. If you compute the curl, you get \[\nabla \times \mathbf{F} = \begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix},\] a constant vector pointing straight up along the \(z\) axis. So the field is spinning, and the curl captures that rotation cleanly in one vector.
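You can check that curl numerically with finite differences. This is a rough sketch: the helper names and the test point are arbitrary, and `curl` just follows the three-component formula for \(\nabla \times \mathbf{F}\) given earlier.

```python
def partial(F, component, variable, p, h=1e-5):
    """Central-difference estimate of d F[component] / d x[variable] at p."""
    forward = list(p)
    backward = list(p)
    forward[variable] += h
    backward[variable] -= h
    return (F(forward)[component] - F(backward)[component]) / (2 * h)

def curl(F, p):
    X, Y, Z = 0, 1, 2
    return [
        partial(F, Z, Y, p) - partial(F, Y, Z, p),
        partial(F, X, Z, p) - partial(F, Z, X, p),
        partial(F, Y, X, p) - partial(F, X, Y, p),
    ]

def swirl(p):
    x, y, z = p
    return [-y, x, 0.0]       # the rotating field from the text

curl_swirl = curl(swirl, [0.7, -1.2, 0.4])
print(curl_swirl)
```

Try moving the test point around: the result stays at roughly \([0, 0, 2]\), which is what "a constant vector" means here.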

Why do we care? In physics and engineering, curl shows up in fluid flow, electromagnetism, and vorticity. In machine learning, you will not compute curl as often as gradients or Jacobians, but the same vector calculus language keeps appearing whenever a model outputs a vector at every point in space. Understanding curl helps you read those fields without guessing what the arrows mean.

Red = stop and think: If a vector field is constant everywhere, like \(\mathbf{F}(x,y,z) = [1,0,0]\), what should the curl be? Does that match your intuition that nothing is spinning?

Jacobians (Upgrade the Gradient)

Okay, we need to upgrade the gradient yet again. The problem is that our gradient idea only works for \(f(x,y,z)\) where the output is a single number. What if the output is a vector as well? \[f(x,y,z) = [xyz, y^2,z^3]\] How does one even make sense of the derivative here? Here is where we introduce the Jacobian. Here is where you will need some of your linear algebra skills.

The Jacobian is defined as the following matrix: \[\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \]

Ask AI Explain what the Jacobian matrix represents geometrically as a local linear approximation of a nonlinear map. Work through a small 2D-to-2D example numerically.

Notice what we did? We basically took multiple gradients and stacked them together! So we looked at the first entry of the function, computed the partial derivatives and got a vector, then looked at the second entry, computed the gradient, and repeated! Let's look at this concrete example: \[ f_1 = xyz,\quad f_2 = y^2, \quad f_3 = z^3\] We compute their gradients: \[\frac{\partial f_1}{\partial x} = yz, \quad \frac{\partial f_1}{\partial y} = xz, \quad \frac{\partial f_1}{\partial z} = xy\] \[\frac{\partial f_2}{\partial x} = 0, \quad \frac{\partial f_2}{\partial y} = 2y, \quad \frac{\partial f_2}{\partial z} = 0\] \[\frac{\partial f_3}{\partial x} = 0, \quad \frac{\partial f_3}{\partial y} = 0, \quad \frac{\partial f_3}{\partial z} = 3z^2\] So the Jacobian is \[\mathbf{J} = \begin{bmatrix} yz & xz & xy \\ 0 & 2y & 0 \\ 0 & 0 & 3z^2 \end{bmatrix}\]
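As a sanity check on the matrix above, this sketch compares it to a finite-difference Jacobian at one point. The function names and the test point \((2, 3, 5)\) are arbitrary choices for illustration.

```python
def f(p):
    x, y, z = p
    return [x * y * z, y ** 2, z ** 3]

def analytic_jacobian(p):
    x, y, z = p
    return [
        [y * z, x * z, x * y],
        [0.0, 2 * y, 0.0],
        [0.0, 0.0, 3 * z ** 2],
    ]

def numerical_jacobian(func, p, h=1e-6):
    columns = []
    for j in range(len(p)):
        forward = list(p)
        backward = list(p)
        forward[j] += h
        backward[j] -= h
        f_plus, f_minus = func(forward), func(backward)
        columns.append([(a - b) / (2 * h) for a, b in zip(f_plus, f_minus)])
    # Each entry of `columns` holds d f_i / d x_j for one fixed j,
    # so transpose to get rows = outputs, columns = inputs.
    return [list(row) for row in zip(*columns)]

point = [2.0, 3.0, 5.0]
exact = analytic_jacobian(point)
approx = numerical_jacobian(f, point)
max_error = max(
    abs(e - a)
    for row_e, row_a in zip(exact, approx)
    for e, a in zip(row_e, row_a)
)
print(exact)
print(max_error)
```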

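Hold on. You are a sharp student. You see a matrix. If you remember from linear algebra, that means this Jacobian is a linear map?! You are exactly correct, with one caveat: the entries depend on \((x,y,z)\), so you get a different linear map at each point. Here is where the deep connection comes into play. Evaluated at a point, the Jacobian sends a small step in the input to the matching small step in the output, and its columns span the directions tangent to the surface that \(f\) traces out at that point!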

More concretely, remember calculus is just the process of pretending that non-linear things are linear if you zoom in close enough (think Taylor series approximations). If you take a wildly curving, non-linear function and zoom in infinitely close to one specific point, the transformation will look flat and linear. Because it looks linear at that microscopic level, it can be represented by a matrix. That matrix is the Jacobian evaluated at a point. In symbols, for a small step \(\mathbf{h}\) away from a point \(\mathbf{a}\), \[f(\mathbf{a}+\mathbf{h}) \approx f(\mathbf{a}) + \mathbf{J}(\mathbf{a})\,\mathbf{h}.\] Linear algebra gives us the tools to handle flat, uniform transformations using matrices, and the Jacobian tells linear algebra exactly which flat linear transformation best approximates the curved calculus problem at that specific location.
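The "zoom in and it looks linear" claim can be tested: compare \(f(\mathbf{a}+\mathbf{h})\) with the linear guess \(f(\mathbf{a}) + \mathbf{J}(\mathbf{a})\mathbf{h}\) for smaller and smaller steps. This sketch reuses the example \(f = [xyz, y^2, z^3]\) and its Jacobian; the base point and step direction are arbitrary assumptions.

```python
def f(p):
    x, y, z = p
    return [x * y * z, y ** 2, z ** 3]

def jacobian(p):
    x, y, z = p
    return [
        [y * z, x * z, x * y],
        [0.0, 2 * y, 0.0],
        [0.0, 0.0, 3 * z ** 2],
    ]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

a = [2.0, 3.0, 5.0]
direction = [1.0, -1.0, 0.5]

def linearization_error(scale):
    """Largest gap between f(a + h) and f(a) + J(a) h, for h = scale * direction."""
    step = [scale * d for d in direction]
    exact = f([ai + si for ai, si in zip(a, step)])
    linear = [fi + ji for fi, ji in zip(f(a), matvec(jacobian(a), step))]
    return max(abs(e - l) for e, l in zip(exact, linear))

errors = [linearization_error(s) for s in (1e-1, 1e-2, 1e-3)]
print(errors)
```

Each time the step shrinks by a factor of 10, the error shrinks by roughly a factor of 100. That quadratic drop is the signature of a good linear approximation.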

Jacobians are the bread and butter of gradient-based optimizers, and they are what allow these machine learning algorithms to learn.

Hessians (Second-Order Derivatives?)

Remember back in calculus, there is the second derivative, and that told us when functions are curving up or curving down. We can generalize this to multiple dimensions as well! This is called the Hessian. The Hessian is the matrix that perfectly describes the local curvature of a multivariable landscape. So how do you compute it?

\[\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}\]

Ask AI How do the eigenvalues of the Hessian tell you whether a critical point is a minimum, maximum, or saddle? Give a 2D example with a symmetric Hessian matrix.

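Now you might be wondering, well... how do we actually determine the curvature? At a critical point (where \(\nabla f = 0\)), the curvature is read off from the signs of the eigenvalues of the Hessian matrix!

If every eigenvalue is positive, the surface curves upward in every direction (a bowl, so a local minimum); if every eigenvalue is negative, it curves downward in every direction (a local maximum); if the signs are mixed, you are at a saddle. In 2D the determinant of the Hessian, which equals the product of the two eigenvalues, gives a quick first check: a negative determinant means a saddle, while a positive determinant means both eigenvalues share a sign, and the sign of \(f_{xx}\) tells you which. A zero determinant is inconclusive.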

You might be wondering, what then do the eigenvectors do? The eigenvectors point along the direction of that measured curvature. Some also say the eigenvectors of the Hessian tell you the principal axes of curvature. Imagine standing on the side of a sloping, twisting hill. The eigenvectors point in the very specific, magic directions where the "twisting" completely vanishes. If you walk strictly along an eigenvector line, the landscape curves perfectly straight up or straight down in front of you; the mixed partial derivatives in this custom coordinate system are effectively zero. By finding the eigenvectors, you are rotating your coordinate system so that the axes align perfectly with the natural grain of the landscape's curvature!
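For a \(2\times 2\) symmetric Hessian you can get the eigenvalues from a closed-form formula, so here is a small sketch that classifies a critical point by their signs. The helper names and the four example functions are assumptions made for illustration, each with a critical point at the origin.

```python
import math

def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues of [[a, b], [b, c]] from the closed-form 2x2 formula."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify(a, b, c, tol=1e-12):
    low, high = eigenvalues_2x2_symmetric(a, b, c)
    if low > tol:
        return "bowl (local minimum)"
    if high < -tol:
        return "dome (local maximum)"
    if low < -tol and high > tol:
        return "saddle"
    return "inconclusive"

# (f_xx, f_xy, f_yy) at the origin for a few hand-differentiated functions.
examples = {
    "x^2 + y^2": (2.0, 0.0, 2.0),
    "-(x^2 + y^2)": (-2.0, 0.0, -2.0),
    "x^2 - y^2": (2.0, 0.0, -2.0),
    "x*y": (0.0, 1.0, 0.0),
}
results = {name: classify(*h) for name, h in examples.items()}

for name, verdict in results.items():
    print(name, "->", verdict)
```

The last example, \(xy\), is worth a look: both diagonal entries are zero, yet the eigenvalues are \(\pm 1\), so it is a saddle whose natural axes are the diagonals \(y = x\) and \(y = -x\).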

Chain Rule / Product Rule

In single-variable calculus, the product rule and chain rule are your two best friends for untangling messy equations. When we upgrade to multiple dimensions, these rules survive, but they get some linear algebra upgrades!

Let's start with the Product Rule. In 1D, if you have two functions multiplying each other, the derivative is: \[ \frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x) \] In higher dimensions, let's say \(f(x,y,z)\) and \(g(x,y,z)\) are both scalar functions (they take in a vector, spit out a single number). What happens if we take the gradient of their product? The structure is exactly the same, but our 1D derivatives upgrade to gradient vectors! \[ \nabla(f \cdot g) = g \nabla f + f \nabla g \]

Linear Algebra Connection: Look closely at the right side of that equation. \(g\) and \(f\) are just scalar numbers, while \(\nabla f\) and \(\nabla g\) are vectors. This means the multivariable product rule is fundamentally just a linear combination of two gradient vectors!

Red = stop and think: If \(f(x)\) and \(g(x)\) both have a gradient of zero at a specific coordinate, what is the gradient of their product at that coordinate?

Now, for the main event: The Chain Rule. This is arguably the single most important calculus concept in all of Machine Learning, because it is the mathematical engine behind Backpropagation.

In 1D, the chain rule handles nested functions: \[ \frac{d}{dx}f(g(x)) = f'(g(x))g'(x) \] But what happens if you have a multivariable function \(f(x, y)\), and both \(x\) and \(y\) are themselves functions of time, \(t\)? So, \(x(t)\) and \(y(t)\). If you want to know how \(f\) changes as time moves forward, you have to track the changes through all possible paths and add them up: \[ \frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} \]

Do you see the hidden linear algebra here? That sum of products is exactly how you compute a dot product! We can rewrite the multivariable chain rule beautifully as the dot product of the gradient vector of \(f\) and the velocity vector of your path: \[ \frac{df}{dt} = \nabla f \cdot \begin{bmatrix} \frac{dx}{dt} \\ \frac{dy}{dt} \end{bmatrix} \]
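Here is a quick numerical check of that dot product form. The function \(f(x,y) = x^2 y\) and the circular path \(x(t) = \cos t,\ y(t) = \sin t\) are hypothetical choices for this sketch; the point is that the chain-rule answer matches a direct finite difference of \(f\) along the path.

```python
import math

def f(x, y):
    return x * x * y

def grad_f(x, y):
    return [2 * x * y, x * x]            # [df/dx, df/dy]

def path(t):
    return [math.cos(t), math.sin(t)]    # [x(t), y(t)]

def velocity(t):
    return [-math.sin(t), math.cos(t)]   # [dx/dt, dy/dt]

t = 0.7
x, y = path(t)

# Chain rule: gradient dotted with the velocity of the path.
chain_rule = sum(g * v for g, v in zip(grad_f(x, y), velocity(t)))

# Direct check: differentiate t -> f(x(t), y(t)) numerically.
h = 1e-6
finite_difference = (f(*path(t + h)) - f(*path(t - h))) / (2 * h)

print(chain_rule, finite_difference)
```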

The Ultimate Upgrade: Matrix Multiplication
Let's take it to the extreme. What if we have a vector-valued function feeding into another vector-valued function? Let's say \(\mathbf{y} = G(\mathbf{x})\) and \(\mathbf{z} = F(\mathbf{y})\). To find the derivative of the whole nested system with respect to \(\mathbf{x}\), we don't just multiply numbers, and we don't just take a dot product. We multiply their Jacobian matrices! \[ \mathbf{J}_{F(G(\mathbf{x}))} = \mathbf{J}_F(G(\mathbf{x})) \mathbf{J}_G(\mathbf{x}) \]

This is the biggest "Aha!" moment bridging these two subjects. Matrix multiplication was literally invented to track the composition of linear maps. Because derivatives (Jacobians) are just local linear approximations, tracking the derivative of nested multivariable functions is exactly the same as multiplying their Jacobian matrices together!
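To see the matrix version at work, this sketch multiplies two hand-computed Jacobians and compares the product against a finite-difference Jacobian of the composed function. The maps `G` and `F` below are invented for illustration and do not come from any particular model.

```python
def G(x):
    x1, x2 = x
    return [x1 * x2, x1 + x2]

def F(y):
    y1, y2 = y
    return [y1 ** 2, y1 * y2]

def jacobian_G(x):
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0]]

def jacobian_F(y):
    y1, y2 = y
    return [[2 * y1, 0.0], [y2, y1]]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def numerical_jacobian(func, p, h=1e-6):
    columns = []
    for j in range(len(p)):
        forward = list(p)
        backward = list(p)
        forward[j] += h
        backward[j] -= h
        columns.append(
            [(a - b) / (2 * h) for a, b in zip(func(forward), func(backward))]
        )
    return [list(row) for row in zip(*columns)]   # rows = outputs

x = [1.5, -0.5]
# Chain rule: J_F is evaluated at G(x), then multiplied by J_G at x.
chain_rule = matmul(jacobian_F(G(x)), jacobian_G(x))
direct = numerical_jacobian(lambda p: F(G(p)), x)

print(chain_rule)
print(direct)
```

The order matters: \(\mathbf{J}_F\) goes on the left because \(F\) is applied last, the same way the outermost layer of a network shows up first when gradients are propagated backward.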

Ask AI Explain how multiplying Jacobian matrices is the same idea as backpropagation in a neural network. Walk through a tiny network with two layers and name each Jacobian.

Taylor Approximation (Multi-dimensional)

If you want to understand how optimization algorithms (like Newton's Method in machine learning) actually "see" a loss landscape, you have to understand the multi-dimensional Taylor series.

Back in 1D calculus, the Taylor series allowed us to approximate a crazy, impossible-to-compute curve by stacking simple polynomials (lines, parabolas, cubics) on top of each other. In multiple dimensions, we do the exact same thing, but instead of fitting lines and parabolas to a curve, we fit flat planes and 3D bowls to a messy, curved landscape!

Let's say we have a scalar function \(f(\mathbf{x})\) where \(\mathbf{x}\) is a vector of inputs (like \(x, y\)). We want to approximate the landscape near a specific starting point, \(\mathbf{a}\). Here is how the approximation is built step-by-step:

0th Order (The Flat Guess):
If we know absolutely nothing about the slope, our best guess for the surrounding area is just a completely flat horizontal plane set exactly at our current elevation. \[ f(\mathbf{x}) \approx f(\mathbf{a}) \]

1st Order (The Linear Approximation):
Now we bring in the Gradient (\(\nabla f\)). The gradient tells us the exact tilt of the landscape. By taking the dot product of the gradient and our step vector \((\mathbf{x} - \mathbf{a})\), we create a tilted tangent plane that perfectly hugs the slope at point \(\mathbf{a}\). \[ f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a}) \cdot (\mathbf{x} - \mathbf{a}) \]

2nd Order (The Quadratic Approximation):
A flat, tilted plane is a terrible approximation if the landscape curves sharply right in front of you. To capture the curve, we bring in the Hessian matrix (\(\mathbf{H}\)). This adds a "bowl" or "saddle" shape to our approximation, bending our flat tangent plane so it perfectly matches the true curvature of the landscape. \[ f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^T \mathbf{H}(\mathbf{a}) (\mathbf{x} - \mathbf{a}) \]

Ask AI Explain the multivariable Taylor approximation at 1st and 2nd order. How does Newton's method use the Hessian to take smarter optimization steps than plain gradient descent?

The Linear Algebra Connection: Look closely at that 2nd-order term: \( (\mathbf{x} - \mathbf{a})^T \mathbf{H} (\mathbf{x} - \mathbf{a}) \). This is a famous structure called a Quadratic Form. You are taking a matrix (\(\mathbf{H}\)) and "sandwiching" it between a vector and its transpose (the same idea as \(u^\top v\) in the dot product notes). This specific matrix multiplication is the mathematical mechanism that generates bowl and saddle shapes in higher dimensions! The eigenvalues of \(\mathbf{H}\) then tell you whether you are looking at a bowl, a saddle, or something flatter.
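Here is a small sketch comparing the 0th, 1st, and 2nd order approximations on one example. The function \(f(x,y) = e^x \cos y\), the base point \(\mathbf{a} = (0,0)\), and the step are assumptions chosen so the gradient and Hessian are easy to compute by hand.

```python
import math

def f(x, y):
    return math.exp(x) * math.cos(y)

# Hand-computed at a = (0, 0) for this hypothetical example:
f_a = 1.0
grad_a = [1.0, 0.0]                    # [f_x, f_y] = [e^x cos y, -e^x sin y]
hess_a = [[1.0, 0.0], [0.0, -1.0]]     # [[f_xx, f_xy], [f_yx, f_yy]]

def taylor(step, order):
    """Taylor approximation of f at a + step, for order 0, 1, or 2."""
    value = f_a
    if order >= 1:
        value += sum(g * s for g, s in zip(grad_a, step))
    if order >= 2:
        # Quadratic form: step^T H step, computed as step . (H step).
        h_step = [sum(h * s for h, s in zip(row, step)) for row in hess_a]
        value += 0.5 * sum(s * hs for s, hs in zip(step, h_step))
    return value

step = [0.1, 0.2]
exact = f(*step)
errors = [abs(exact - taylor(step, order)) for order in (0, 1, 2)]

print(exact, errors)
```

Each extra order cuts the error noticeably. The Hessian here has eigenvalues \(1\) and \(-1\), so the 2nd order term is bending the tangent plane into a saddle.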

Red = stop and think: If the Hessian matrix at point \(\mathbf{a}\) is completely full of zeros, what does the 2nd-order approximation look like?

Integrals

Integration should also be something you are familiar with from AP Calculus courses. But that definition isn't very precise. We are taught to divide the region under a function into thin rectangles and sum them up: take a rectangle of small width \(dx\), multiply by the height \(f(x)\), and add these up as the widths shrink to zero. This is not precise. How do we know if something is even integrable to begin with?

The classical way integration is defined is via Riemann sums, specifically upper and lower sums built from slices of rectangles. But this is limiting. What even are rectangles in, say, 4-D or 5-D or \(\mathbb{R}^n\)? We need to be more general. Here is the slightly more formal definition, using what we call partitions.

Ask AI Explain upper and lower Riemann sums and why an integral exists only when those sums can be squeezed together. Why is this definition easier to generalize beyond rectangles?

We define an upper sum \(U\) of a function \(f\) over a partition \(P\) (for intuition, think of a partition as a collection of patches that tile the region; in the case of Riemann sums these are the rectangles or 3-D boxes that we are adding up) \[U(f,P) = \sum_{p \in P} M_p\cdot v(p), \quad M_p = \max_{x \in p} f(x), \quad v(p) \text{ is the volume of the patch } p\] This makes some sense, right? We have divided up the region into multiple rectangular patches \(p\). In each patch we find the largest value of \(f\) (because this is the upper sum), use it as the height, and then we add it all up! Each term is a height times a patch volume, so the whole upper sum is one big weighted sum over the partition.

We also have a lower sum \(L\), and it's exactly what you think it is... \[L(f,P) = \sum_{p \in P} m_p\cdot v(p), \quad m_p = \min_{x \in p} f(x), \quad v(p) \text{ is the volume of the patch } p\] Same thing, except we take the smallest value in each patch, hence the lower sum.

It should be clear to see that \[L(f,P) \leq U(f,P)\]

To define integration, it is something that sits in between these two for every partition \(P\): \[L(f,P) \leq \int f \, dp \leq U(f,P)\] More specifically, we say the integral EXISTS if the lower and upper sums can be squeezed together as the partition gets finer, so that exactly one number fits between all of them: \[\sup_P L(f,P) = \int f \, dp = \inf_P U(f,P)\]

Whoa, what does that mean? This is exactly like the definition of the limit! An integral exists if adding up the function from below approaches the same value as adding it up from above. More specifically: \[\int f \, dp \text{ exists if } \quad \forall \epsilon > 0,\ \exists P : U(f,P) - L(f,P) < \epsilon\] Which is just saying: however tight a bound \(\epsilon\) we ask for, we can always find a partition whose upper and lower sums differ by less than \(\epsilon\).
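You can watch the squeeze happen numerically. This sketch uses \(f(x) = x^2\) on \([0,1]\) with \(n\) equal-width patches, both of which are assumptions picked for illustration. Because this \(f\) is increasing on the interval, the largest value on each patch sits at its right end and the smallest at its left end, which keeps the code short.

```python
def f(x):
    return x * x

def upper_lower_sums(n):
    """Lower and upper sums of f on [0, 1] over n equal-width patches."""
    width = 1.0 / n
    # f is increasing on [0, 1]: min at the left end, max at the right end.
    lower = sum(f(k * width) * width for k in range(n))
    upper = sum(f((k + 1) * width) * width for k in range(n))
    return lower, upper

for n in (10, 100, 1000):
    lower, upper = upper_lower_sums(n)
    print(n, lower, upper, upper - lower)
```

The gap between the two sums is \(\frac{1}{n}\), so for any \(\epsilon\) the computer hands you, a partition with \(n > \frac{1}{\epsilon}\) wins the game, and the value trapped in between is \(\frac{1}{3}\).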

Why, you might ask, is this good for generalization to higher dimensions? Because patches can be any shape, as long as the volume function \(v(p)\) is precisely defined for them. No longer are we bound to integrating using rectangles; we can define things however we want!

Let's see some examples of real integrals you may come across.

Line/Path Integrals

The line integral is famous. Imagine that in 2-D space you wish to integrate along some path. The thing being integrated is a function \(f(x,y)\) that outputs a scalar value (a single value, not a vector). Let's say the path you wish to integrate along is some curve \(S\). We write the path integral as \[\int_S f(l) dl\]

Ask AI Explain how a scalar line integral works: what does \(dl\) mean, and why do we substitute a path parametrization \(r(t)\) before integrating?

Hold on... what is all of this? Where did my \(x,y\) coordinates go? Well, let's read this equation: \(dl\) here means one small chunk of your path. Remember, we are no longer integrating along a straight axis but along some custom path.

What this means is that you need to plug in the values for your path! Okay, so let's say your path is a line \(r(t) = [2t, 3t]\), a vector-valued parametrization of the curve. This is just a straight line where you move \(2t\) in the \(x\) direction and \(3t\) in the \(y\) direction. Great! Now \(dl\) is a small step along your path, so \(dl = |r'(t)|\,dt\): the norm of the tangent vector times a small step in \(t\). This is the length of that small chunk of the path (note that \(r'(t)\) is the derivative with respect to \(t\)). Since we are on the path, we replace \(f(l)\) with \(f(r(t))\)! Now we can integrate. Let's say your custom path runs over \(t\) from \(0\) to \(1\): \[\int_S f(l) dl = \int_0^1 f(r(t))|r'(t)|dt\] Dropping in the values for \(r(t) = [2t, 3t]\): \[\int_S f(l) dl = \int_0^1 f(2t, 3t)|[2,3]|dt\] \[\int_S f(l) dl = \int_0^1 f(2t, 3t)\sqrt{2^2+3^2}dt\] \[\int_S f(l) dl = \int_0^1 f(2t, 3t)\sqrt{13}dt\] If we say we have some function \(f(x,y) = xy\): \[\int_S f(l) dl = \int_0^1 2t\cdot 3t\sqrt{13}dt\] \[\int_S f(l) dl = 6 \sqrt{13}\int_0^1 t^2dt = 6\sqrt{13}\cdot\frac{1}{3} = 2\sqrt{13}\] Wow, this has become something you are already familiar with. Easy peasy!
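If you would rather let a computer add up the little chunks, here is a minimal sketch using the midpoint rule on \(\int_0^1 f(r(t))\,|r'(t)|\,dt\) for the same path and function. The helper names and the number of slices are arbitrary choices; working the antiderivative by hand gives \(6\sqrt{13}\cdot\frac{1}{3} = 2\sqrt{13}\), which is what the sum should approach.

```python
import math

def f(x, y):
    return x * y

def r(t):
    return [2 * t, 3 * t]

def speed(t):
    # |r'(t)| for r(t) = [2t, 3t]; constant because the path is a straight line.
    return math.sqrt(2 ** 2 + 3 ** 2)

def scalar_line_integral(n=1000):
    """Midpoint-rule estimate of the integral of f(r(t)) |r'(t)| dt over [0, 1]."""
    dt = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt               # midpoint of the k-th slice
        x, y = r(t)
        total += f(x, y) * speed(t) * dt
    return total

numeric = scalar_line_integral()
exact = 2 * math.sqrt(13)
print(numeric, exact)
```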

We can actually take this one step further! What happens if \(f\) the function is not just a single value? What happens if the function returns multiple values and is a vector function? Then we have the following line integral. \[\int_S f(l) \cdot dl\] Where now the line integral requires a dot product!

Ask AI Explain the vector line integral \(\int \mathbf{F}\cdot d\mathbf{l}\). Give a physics example where it represents work done by a force along a path.

Here \(dl\) is now a vector, \(dl = r'(t)\,dt\). Remember that \(r'(t)\) is a vector; before, \(|r'(t)|\) only looked at the length of that vector. And then rinse and repeat! \[\int_S f(l) \cdot dl = \int_0^1 f(r(t)) \cdot r'(t) dt\] If the function is \(f(x,y) = [x^2,y^3]\), then on the same path \(r(t) = [2t, 3t]\): \[ \int_0^1 f(2t, 3t) \cdot [2,3] dt\] \[ \int_0^1 [4t^2, 27t^3] \cdot [2,3] dt\] \[ \int_0^1 8t^2+ 81t^3 \, dt = \frac{8}{3} + \frac{81}{4} = \frac{275}{12}\] Now this looks easy :)
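The same midpoint-rule idea works for the vector version: at each small step, dot the field with the tangent vector and add. This is a sketch with made-up helper names; integrating \(8t^2 + 81t^3\) by hand over \([0,1]\) gives \(\frac{8}{3} + \frac{81}{4} = \frac{275}{12}\), which the sum should approach.

```python
def field(x, y):
    return [x ** 2, y ** 3]

def r(t):
    return [2 * t, 3 * t]

def r_prime(t):
    return [2.0, 3.0]                    # tangent vector of r(t) = [2t, 3t]

def vector_line_integral(n=2000):
    """Midpoint-rule estimate of the integral of field(r(t)) . r'(t) dt over [0, 1]."""
    dt = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        fx, fy = field(*r(t))
        vx, vy = r_prime(t)
        total += (fx * vx + fy * vy) * dt   # dot product times the step in t
    return total

numeric = vector_line_integral()
exact = 8 / 3 + 81 / 4                   # = 275/12
print(numeric, exact)
```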

Linear algebra connection: a vector line integral is a path integral of a dot product at each tiny step. That is why work integrals look like \(\int \mathbf{F} \cdot d\mathbf{l}\): you are adding up how much of the field points along the motion.

Surface Integrals (Extra)

Now that you know how to integrate along any path you like, what about integrating on a surface? To do so, you first need to define a surface...

Let's define the surface as a sphere in \(\mathbb{R}^3\): \[\Omega = \{(x,y,z) : x^2+y^2+z^2 = 1\}.\] This is the set of all points whose distance from the origin is exactly \(1\).

Now let us define the function \(f\) as some value it takes on that sphere, \(f(x,y,z) = xyz\), let's just say. Then the surface integral is \[\int_\Omega f \, dA\] Hold on... what is \(dA\)? It is one small patch of area on the surface of \(\Omega\), not volume and not length along a curve.

This is the same big idea as the line integral. On a path, we replaced \(dl\) with \(|r'(t)|\,dt\) after plugging in a parametrization \(r(t)\). On a surface, we replace \(dA\) with something built from a surface parametrization \(r(u,v)\). Think of \(u\) and \(v\) as two coordinates that tell you where you are on the surface, the way \(t\) told you where you were on the curve.

For the unit sphere, spherical coordinates are a natural choice. Let \[ r(\theta,\phi) = \begin{bmatrix} \sin\theta\cos\phi \\ \sin\theta\sin\phi \\ \cos\theta \end{bmatrix}, \qquad 0 \leq \theta \leq \pi,\quad 0 \leq \phi \leq 2\pi. \] Every pair \((\theta,\phi)\) picks out one point on the sphere, and as \(\theta\) and \(\phi\) vary, you sweep out the whole surface.

Now take tiny steps \(d\theta\) and \(d\phi\). The vectors \(r_\theta = \frac{\partial r}{\partial \theta}\) and \(r_\phi = \frac{\partial r}{\partial \phi}\) span a little parallelogram tangent to the surface. Its area is the length of their cross product: \[ dA = |r_\theta \times r_\phi|\, d\theta\, d\phi. \] For this unit sphere parametrization, a calculation (or a trusted table) gives \(|r_\theta \times r_\phi| = \sin\theta\), so \[ dA = \sin\theta\, d\theta\, d\phi. \] That \(\sin\theta\) factor is doing real work: it says patches near the equator are bigger than patches near the north pole, which matches the picture on a globe.
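If you would rather not take \(|r_\theta \times r_\phi| = \sin\theta\) on trust, here is a small sketch that checks it at a few sample angles. The partial derivatives are written out by hand from the parametrization above; the helper names (`r_theta`, `r_phi`, `area_factor`) are chosen just for this illustration.

```python
import math

def r_theta(theta, phi):
    # Partial derivative of r(theta, phi) with respect to theta.
    return (math.cos(theta) * math.cos(phi),
            math.cos(theta) * math.sin(phi),
            -math.sin(theta))

def r_phi(theta, phi):
    # Partial derivative of r(theta, phi) with respect to phi.
    return (-math.sin(theta) * math.sin(phi),
            math.sin(theta) * math.cos(phi),
            0.0)

def cross(a, b):
    # Cross product of two 3D vectors.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    # Euclidean length of a vector.
    return math.sqrt(sum(c * c for c in v))

def area_factor(theta, phi):
    # |r_theta x r_phi|: area of the tangent parallelogram per unit d_theta * d_phi.
    return norm(cross(r_theta(theta, phi), r_phi(theta, phi)))

samples = [(0.3, 1.0), (math.pi / 2, 2.5), (2.8, 4.0)]
for theta, phi in samples:
    print(theta, phi, area_factor(theta, phi), math.sin(theta))
```

In each printed row the last two numbers should match, and the factor is largest at \(\theta = \pi/2\) (the equator), as the globe picture suggests.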

Ask AI Explain where \(|r_\theta \times r_\phi|\) comes from in a surface integral, and what flux \(\iint \mathbf{F}\cdot d\mathbf{A}\) measures on a surface.

Putting it together, a scalar surface integral becomes an ordinary double integral: \[ \int_\Omega f \, dA = \int_0^{2\pi}\int_0^{\pi} f(r(\theta,\phi))\,\sin\theta\, d\theta\, d\phi. \] We evaluate \(f\) on the surface by substituting the parametrization, exactly like we replaced \(f(l)\) with \(f(r(t))\) on a path.

Warm-up example: surface area. Take \(f(x,y,z) = 1\). Then the integral just adds up area: \[ \int_\Omega 1\, dA = \int_0^{2\pi}\int_0^{\pi} \sin\theta\, d\theta\, d\phi = 2\pi\left[-\cos\theta\right]_0^{\pi} = 4\pi. \] That is the surface area of a unit sphere, which is a good sanity check that the \(dA\) formula is reasonable.

Back to \(f(x,y,z) = xyz\) on the sphere. On the surface, \[ f(r(\theta,\phi)) = (\sin\theta\cos\phi)(\sin\theta\sin\phi)(\cos\theta) = \sin^2\theta\cos\theta\sin\phi\cos\phi. \] So \[ \int_\Omega xyz\, dA = \int_0^{2\pi}\int_0^{\pi} \sin^2\theta\cos\theta\sin\phi\cos\phi\,\sin\theta\, d\theta\, d\phi. \] You could grind this out with trig integrals. But there is a faster observation: \(xyz\) is positive in some octants of the sphere and equally negative in others, and the sphere is symmetric. So the integral is \[ \int_\Omega xyz\, dA = 0. \] That is a common pattern: if flipping the sign of one coordinate (here \(x \to -x\)) maps the surface onto itself and flips the sign of \(f\), the integral is zero.
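Both sphere results above (area \(4\pi\) and the zero from symmetry) can be sanity-checked with a double Riemann sum over \((\theta,\phi)\). This is a rough sketch, assuming the spherical parametrization and \(dA = \sin\theta\,d\theta\,d\phi\) from the text; `surface_integral` is a made-up helper name and the grid sizes are arbitrary.

```python
import math

def sphere_point(theta, phi):
    # Point on the unit sphere for the parametrization r(theta, phi).
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def surface_integral(f, n_theta=200, n_phi=400):
    # Midpoint Riemann sum of f(r(theta, phi)) * sin(theta) over the
    # rectangle 0 <= theta <= pi, 0 <= phi <= 2*pi.
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            x, y, z = sphere_point(theta, phi)
            total += f(x, y, z) * math.sin(theta) * d_theta * d_phi
    return total

sphere_area = surface_integral(lambda x, y, z: 1.0)
xyz_on_sphere = surface_integral(lambda x, y, z: x * y * z)
print(sphere_area, 4.0 * math.pi)  # should be close to each other
print(xyz_on_sphere)               # should be very close to 0
```

The first line compares the numerical area against \(4\pi\); the second should print a number that is zero up to floating-point noise.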

Vector surface integrals (flux). Just like the line integral upgraded from \(\int f\,dl\) to \(\int \mathbf{F}\cdot d\mathbf{l}\), surfaces have a vector version too. If \(\mathbf{F}(x,y,z)\) is a vector field, the flux of \(\mathbf{F}\) through \(\Omega\) is \[ \iint_\Omega \mathbf{F}\cdot d\mathbf{A}. \] Here \(d\mathbf{A}\) is a tiny area vector pointing perpendicular to the surface (outward, if we choose the outward normal). In parametrization form, \[ \iint_\Omega \mathbf{F}\cdot d\mathbf{A} = \int \int \mathbf{F}(r(u,v)) \cdot \big(r_u \times r_v\big)\, du\, dv. \] Notice the pattern from line integrals: scalar case uses length \(|r'(t)|\), vector case uses the vector \(r'(t)\). On a surface, scalar case uses area \(|r_u \times r_v|\), vector case uses the area vector \(r_u \times r_v\).
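To see the flux formula in action, here is a sketch with one illustrative field that is not from the notes: \(\mathbf{F}(x,y,z) = (x,y,z)\), which points straight out of the unit sphere. On the sphere \(r_\theta \times r_\phi = \sin\theta\, r\), so \(\mathbf{F}\cdot(r_\theta\times r_\phi) = \sin\theta\) and the flux should come out to the surface area \(4\pi\). All helper names are hypothetical.

```python
import math

def sphere_point(theta, phi):
    # Point on the unit sphere.
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def r_theta(theta, phi):
    # Partial derivative of the parametrization with respect to theta.
    return (math.cos(theta) * math.cos(phi),
            math.cos(theta) * math.sin(phi),
            -math.sin(theta))

def r_phi(theta, phi):
    # Partial derivative of the parametrization with respect to phi.
    return (-math.sin(theta) * math.sin(phi),
            math.sin(theta) * math.cos(phi),
            0.0)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def radial_field(x, y, z):
    # Illustrative vector field (an assumption for this sketch): F = (x, y, z).
    return (x, y, z)

def flux_through_unit_sphere(F, n_theta=200, n_phi=400):
    # Midpoint sum of F(r) . (r_theta x r_phi) over the parameter rectangle.
    # With (theta, phi) in this order, the area vector points outward.
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            area_vec = cross(r_theta(theta, phi), r_phi(theta, phi))
            total += dot(F(*sphere_point(theta, phi)), area_vec) * d_theta * d_phi
    return total

flux = flux_through_unit_sphere(radial_field)
print(flux, 4.0 * math.pi)
```

Swapping the order of the cross product flips the sign of the result, which is the code-level version of choosing the inward instead of the outward normal.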

Linear algebra connection: the cross product \(r_u \times r_v\) is built from partial derivatives, so it is really a Jacobian story in disguise. The tangent plane at a point is spanned by \(r_u\) and \(r_v\), and the cross product packages that 2D patch of directions into one perpendicular vector whose length is the patch area. The flux integral then uses a dot product to measure how much of the field points through that patch.

Why do we care? Surface integrals show up whenever you integrate over a curved boundary instead of a flat region: flux through a membrane, heat flow through a surface, electric field through a closed surface, and (later) the divergence theorem, which connects flux to what happens inside a volume.

Red = stop and think: For the unit sphere, why does \(f=1\) give total area \(4\pi\), but \(f=xyz\) gives \(0\)? What would you guess for \(f=x^2+y^2+z^2\)?

Volume Integrals (Extra)

We have climbed one dimension at a time. A line integral adds up values along a curve. A surface integral adds up values on a surface. A volume integral adds up values through a solid region in space.

If \(V\) is a region in \(\mathbb{R}^3\) and \(f(x,y,z)\) is a scalar function, we write \[ \iiint_V f \, dV. \] Here \(dV\) is one tiny chunk of volume, the 3D analogue of \(dl\) and \(dA\). In ordinary \(xyz\) coordinates, the most familiar form is \[ dV = dx\, dy\, dz. \]

Ask AI Explain what a volume integral \(\iiint_V f\, dV\) is adding up, and how it is the same partition idea as line and surface integrals, just in 3D.

Hold on... how is this different from the partition definition at the top of this section? It is the same idea! You chop space into small patches, multiply each patch volume by a function value, and add. The reason we write \(\iiint\) is that we are integrating over a 3D region, not a line or a surface.

Example 1: the unit cube. Let \[ V = \{(x,y,z) : 0 \leq x \leq 1,\; 0 \leq y \leq 1,\; 0 \leq z \leq 1\}, \] and take \(f(x,y,z) = xyz\). Because the region is a box, we can integrate one variable at a time: \[ \iiint_V xyz\, dV = \int_0^1 \int_0^1 \int_0^1 xyz\, dx\, dy\, dz. \] The \(x\) integral is easy because \(y\) and \(z\) act like constants: \[ \int_0^1 xyz\, dx = yz \int_0^1 x\, dx = \frac{yz}{2}. \] Then \[ \int_0^1 \frac{yz}{2}\, dy = \frac{z}{2}\int_0^1 y\, dy = \frac{z}{4}, \] and finally \[ \int_0^1 \frac{z}{4}\, dz = \frac{1}{8}. \] So \[ \iiint_V xyz\, dV = \frac{1}{8}. \] That is exactly the kind of calculation you already know from AP Calculus, just repeated three times.
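The "chop into small chunks, multiply, and add" idea translates directly into three nested loops. The sketch below approximates \(\iiint_V xyz\,dV\) on the unit cube with a midpoint sum; `cube_integral` and the grid size `n` are illustrative choices, not part of the worksheet.

```python
def cube_integral(f, n=20):
    # Midpoint Riemann sum over the unit cube [0, 1]^3:
    # split each axis into n pieces, so each small box has volume h^3.
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            for k in range(n):
                z = (k + 0.5) * h
                total += f(x, y, z) * h ** 3
    return total

cube_xyz = cube_integral(lambda x, y, z: x * y * z)
cube_volume = cube_integral(lambda x, y, z: 1.0)
print(cube_xyz)     # compare with 1/8
print(cube_volume)  # compare with 1
```

Because \(xyz\) is linear in each variable separately, the midpoint rule happens to be essentially exact here even on a coarse grid; for a general \(f\) you would need a finer grid.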

Example 2: a ball instead of a box. Now let \[ V = \{(x,y,z) \in \mathbb{R}^3 : x^2+y^2+z^2 \leq 1\}, \] the solid unit ball (the inside of the sphere from the surface integral section, not just its skin). Boxes are nice in \(xyz\) coordinates. Balls are nicer in spherical coordinates: \[ x = \rho\sin\theta\cos\phi,\qquad y = \rho\sin\theta\sin\phi,\qquad z = \rho\cos\theta, \] with \(0 \leq \rho \leq 1\) for the unit ball, \(0 \leq \theta \leq \pi\), and \(0 \leq \phi \leq 2\pi\).

The volume element becomes \[ dV = \rho^2\sin\theta\, d\rho\, d\theta\, d\phi. \] Where did \(\rho^2\sin\theta\) come from? It is the 3D version of the same story as \(|r'(t)|\) and \(|r_u\times r_v|\): when you change coordinates, the little volume patch gets stretched, and that stretch factor is a Jacobian determinant. For spherical coordinates, that determinant is \(\rho^2\sin\theta\).

Warm-up: volume of the unit ball. Take \(f(x,y,z)=1\). Then the integral just adds up volume: \[ \iiint_V 1\, dV = \int_0^{2\pi}\int_0^{\pi}\int_0^1 \rho^2\sin\theta\, d\rho\, d\theta\, d\phi. \] The \(\rho\) integral gives \(\frac{1}{3}\), the \(\theta\) integral gives \(2\), and the \(\phi\) integral gives \(2\pi\), so \[ \iiint_V 1\, dV = \frac{4\pi}{3}. \] That is the volume of a unit ball, a formula worth remembering.

A nontrivial density on the ball. Suppose \(f(x,y,z) = x^2+y^2+z^2\) on the same unit ball. In spherical coordinates, on the ball we have \(x^2+y^2+z^2 = \rho^2\), so \[ \iiint_V (x^2+y^2+z^2)\, dV = \int_0^{2\pi}\int_0^{\pi}\int_0^1 \rho^2 \cdot \rho^2\sin\theta\, d\rho\, d\theta\, d\phi = \int_0^{2\pi}\int_0^{\pi}\int_0^1 \rho^4\sin\theta\, d\rho\, d\theta\, d\phi. \] Compute step by step: \[ \int_0^1 \rho^4\, d\rho = \frac{1}{5}, \qquad \int_0^{\pi}\sin\theta\, d\theta = 2, \qquad \int_0^{2\pi} d\phi = 2\pi, \] so \[ \iiint_V (x^2+y^2+z^2)\, dV = \frac{4\pi}{5}. \]
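Both ball results (\(\frac{4\pi}{3}\) and \(\frac{4\pi}{5}\)) can be checked with a triple midpoint sum in spherical coordinates. This is a rough sketch assuming the volume element \(\rho^2\sin\theta\,d\rho\,d\theta\,d\phi\) from the text; `ball_integral` is a hypothetical helper and the grid size is arbitrary, so expect agreement to only a few decimal places.

```python
import math

def ball_integral(f, n=60):
    # Midpoint Riemann sum over the unit ball in spherical coordinates.
    # Each small chunk has volume rho^2 * sin(theta) * d_rho * d_theta * d_phi.
    d_rho = 1.0 / n
    d_theta = math.pi / n
    d_phi = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        rho = (i + 0.5) * d_rho
        for j in range(n):
            theta = (j + 0.5) * d_theta
            for k in range(n):
                phi = (k + 0.5) * d_phi
                x = rho * math.sin(theta) * math.cos(phi)
                y = rho * math.sin(theta) * math.sin(phi)
                z = rho * math.cos(theta)
                dV = rho ** 2 * math.sin(theta) * d_rho * d_theta * d_phi
                total += f(x, y, z) * dV
    return total

ball_volume = ball_integral(lambda x, y, z: 1.0)
ball_r2 = ball_integral(lambda x, y, z: x * x + y * y + z * z)
print(ball_volume, 4.0 * math.pi / 3.0)
print(ball_r2, 4.0 * math.pi / 5.0)
```

Dropping the `rho ** 2 * math.sin(theta)` factor from `dV` gives visibly wrong answers, which is a quick way to convince yourself the stretch factor matters.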

Change of variables in general. If you reparametrize a volume with a smooth map \( \mathbf{r}(u,v,w)\), then \[ dV = \left|\det \mathbf{J}\right|\, du\, dv\, dw, \] where \(\mathbf{J}\) is the Jacobian of the coordinate change (the matrix of all partial derivatives). Spherical coordinates, cylindrical coordinates, and any other coordinate system you meet in physics are all special cases of this one rule.
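As a sketch of the general rule, the code below builds the Jacobian of the spherical coordinate map numerically (central finite differences) and compares its determinant with \(\rho^2\sin\theta\). The step size `h`, the sample point, and every function name here are assumptions made for the illustration.

```python
import math

def spherical_map(rho, theta, phi):
    # Coordinate change (rho, theta, phi) -> (x, y, z).
    return (rho * math.sin(theta) * math.cos(phi),
            rho * math.sin(theta) * math.sin(phi),
            rho * math.cos(theta))

def numerical_jacobian(g, u, h=1e-6):
    # J[i][j] = d g_i / d u_j, estimated with central differences.
    J = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        up = list(u)
        down = list(u)
        up[j] += h
        down[j] -= h
        g_up = g(*up)
        g_down = g(*down)
        for i in range(3):
            J[i][j] = (g_up[i] - g_down[i]) / (2.0 * h)
    return J

def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

rho, theta, phi = 0.7, 1.1, 2.3  # arbitrary sample point
jac_det = det3(numerical_jacobian(spherical_map, (rho, theta, phi)))
expected_det = rho ** 2 * math.sin(theta)
print(abs(jac_det), expected_det)
```

The same `numerical_jacobian` helper works for any smooth map from three parameters to \((x,y,z)\), so you can try cylindrical coordinates and check that the stretch factor comes out as the radius.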

Ask AI Explain why a change of variables in a volume integral picks up a Jacobian determinant. Set up \(\iiint_V 1\, dV\) for the unit ball in spherical coordinates and show you get \(\frac{4\pi}{3}\).

Linear algebra connection: a volume integral is where the Jacobian stops being a matrix you multiply by a vector and becomes a determinant that tells you how volume scales. That is the same object behind determinants and invertibility in the linear algebra notes, now measuring how a curved coordinate change distorts little 3D boxes.

Why do we care? Volume integrals let you compute total mass from a density, average values over a 3D region, probabilities in 3D, and (with the divergence theorem) connect what happens inside a region to flux through its boundary surface.

Red = stop and think: On the unit ball, why does \(f=1\) give volume \(\frac{4\pi}{3}\), while on the unit cube \([0,1]^3\) it gives volume \(1\)? What is \(\iiint_V xyz\, dV\) on the unit ball by symmetry?

Made by @peterma02 - ©