# The Unapologetic Mathematician

## The Chain Rule

Since the components of the differential are given by partial derivatives, and partial derivatives (like all single-variable derivatives) are linear, it’s straightforward to see that the differential operator is linear as well. That is, if $f:\mathbb{R}^m\rightarrow\mathbb{R}^n$ and $g:\mathbb{R}^m\rightarrow\mathbb{R}^n$ are two functions, both of which are differentiable at a point $x$, and $a$ and $b$ are real constants, then the linear combination $af+bg$ is also differentiable at $x$, and the differential is given by

$\displaystyle d(af+bg)=adf+bdg$
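As a quick numerical sanity check of this linearity, here is a minimal sketch in plain Python. The functions `f` and `g`, the constants `a` and `b`, the point `x`, and the displacement `t` are all arbitrary choices made for illustration; none of them come from the argument above. It compares a central-difference estimate of $d(af+bg)(x)t$ against $a\,df(x)t+b\,dg(x)t$ computed from hand-derived partial derivatives.

```python
import math

# Hypothetical example functions R^2 -> R^2, chosen only for illustration.
def f(x):
    return [x[0] * x[1], math.sin(x[0])]

def g(x):
    return [x[0] + x[1] ** 2, math.exp(x[1])]

def df(x, t):
    """Exact differential of f at x applied to the displacement t."""
    return [x[1] * t[0] + x[0] * t[1], math.cos(x[0]) * t[0]]

def dg(x, t):
    """Exact differential of g at x applied to the displacement t."""
    return [t[0] + 2 * x[1] * t[1], math.exp(x[1]) * t[1]]

def directional(func, x, t, h=1e-6):
    """Central-difference estimate of d(func)(x) applied to t."""
    plus = func([xi + h * ti for xi, ti in zip(x, t)])
    minus = func([xi - h * ti for xi, ti in zip(x, t)])
    return [(p - m) / (2 * h) for p, m in zip(plus, minus)]

a, b = 2.0, -3.0  # arbitrary real constants

def combo(x):
    """The linear combination a*f + b*g."""
    return [a * fi + b * gi for fi, gi in zip(f(x), g(x))]

x, t = [0.7, -0.4], [1.0, 2.0]
lhs = directional(combo, x, t)                             # d(af+bg)(x) t, numerically
rhs = [a * u + b * v for u, v in zip(df(x, t), dg(x, t))]  # a df(x) t + b dg(x) t
max_err = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_err)  # should be tiny (finite-difference error only)
```

The two sides should agree up to the error of the finite-difference estimate, which is all a numerical check like this can show.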

Function values in $\mathbb{R}^n$ don’t usually come with a product, so there’s usually no analogue of the product rule, and definitely none of the quotient rule. We can ignore those for now.

But we do have a higher-dimensional analogue for the chain rule. If we have a function $g:X\rightarrow\mathbb{R}^n$ defined on some open region $X\subseteq\mathbb{R}^m$ and another function $f:Y\rightarrow\mathbb{R}^p$ defined on a region $Y\subseteq\mathbb{R}^n$ that contains the image $g(X)$, then we can compose them to get a single function $f\circ g:X\rightarrow\mathbb{R}^p$ defined by $\left[f\circ g\right](x)=f(g(x))$. And if $g$ is differentiable at a point $x$ and $f$ is differentiable at the image point $g(x)$, then the composite function is differentiable at $x$.

First of all, what should the differential be? Remember that the differential $dg(x)$ is a linear transformation that takes displacements $t\in\mathbb{R}^m$ from the point $x$ and turns them into displacements $dg(x)t\in\mathbb{R}^n$ from the point $g(x)$. Then the differential $df(y)$ is a linear transformation that takes displacements $s\in\mathbb{R}^n$ from the point $y$ and turns them into displacements $df(y)s\in\mathbb{R}^p$ from the point $f(y)$. Putting these together, we have a composite linear transformation $df(g(x))dg(x)$ that takes displacements $t\in\mathbb{R}^m$ from the point $x$ and turns them into displacements $df(g(x))dg(x)t\in\mathbb{R}^p$ from the point $f(g(x))$. I assert that this composite transformation is exactly the differential of the composite function.

Just as a sanity check, what happens when we look at single-variable real-valued functions? In this case, $df(y)$ and $dg(x)$ are both linear transformations from one-dimensional spaces to other one-dimensional spaces. That is, they’re represented as $1\times1$ matrices that just multiply by the single real entry. So the composite of the two transformations is given by the matrix whose single entry is the product of the two matrices’ single entries. In other words, in one variable the differentials look like single real numbers $f'(y)$ and $g'(x)$, and their composite is given by multiplication: $f'(g(x))g'(x)$. This is exactly the one-variable chain rule. To understand multiple variables we have to move from products of real numbers to compositions of linear transformations, which will be products of real matrices.
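To make the one-variable case concrete, here is a small sketch in plain Python. The particular functions $g(x)=x^2$ and $f(y)=\sin y$ and the point $x=1.3$ are assumptions picked for illustration. It compares the product $f'(g(x))g'(x)$ with a central-difference estimate of the derivative of $f\circ g$.

```python
import math

# Hypothetical one-variable functions, chosen only for illustration.
def g(x):
    return x ** 2

def f(y):
    return math.sin(y)

def g_prime(x):
    return 2 * x

def f_prime(y):
    return math.cos(y)

def numeric_derivative(func, x, h=1e-6):
    """Central-difference estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.3
chain = f_prime(g(x)) * g_prime(x)                  # product of the two 1x1 "matrices"
numeric = numeric_derivative(lambda s: f(g(s)), x)  # derivative of the composite
print(chain, numeric)  # the two values should agree to several decimal places
```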

Okay, so let’s verify that $d\left[f\circ g\right](x)=df(g(x))dg(x)$ does indeed act as a differential for $f\circ g$. It’s clearly a linear transformation between the appropriate two spaces of displacements. What we need to verify is that it gives a good approximation. That is, for every $\epsilon>0$ there is a $\delta>0$ so that if $\delta>\lVert t\rVert>0$ we have

$\displaystyle\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))dg(x)t\right\rVert<\epsilon\lVert t\rVert$
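It may help to watch this inequality at work numerically before proving it. The following is only a sketch: the functions `f` and `g`, the point `x`, and the direction of the displacement are hypothetical choices, and the matrices `df` and `dg` are the hand-computed partial derivatives of those particular functions. It prints the ratio of the left-hand side to $\lVert t\rVert$ for shrinking displacements; if $df(g(x))dg(x)$ really is the differential, this ratio should drop below any given $\epsilon$.

```python
import math

# Hypothetical example: g and f both map R^2 -> R^2.
def g(x):
    return [x[0] * x[1], x[0] + x[1]]

def f(y):
    return [y[0] ** 2 + y[1], math.sin(y[1])]

def dg(x):
    """Matrix of partial derivatives of g at x (computed by hand)."""
    return [[x[1], x[0]], [1.0, 1.0]]

def df(y):
    """Matrix of partial derivatives of f at y (computed by hand)."""
    return [[2 * y[0], 1.0], [0.0, math.cos(y[1])]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def apply(A, t):
    return [sum(a * ti for a, ti in zip(row, t)) for row in A]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

x = [1.0, 2.0]
A = matmul(df(g(x)), dg(x))  # candidate differential df(g(x)) dg(x)
direction = [0.6, 0.8]       # a unit vector, so norm(t) equals the scale s
ratios = []
for k in range(1, 6):
    s = 10.0 ** (-k)
    t = [s * c for c in direction]
    shifted = f(g([xi + ti for xi, ti in zip(x, t)]))
    base = f(g(x))
    linear = apply(A, t)
    err = [sh - ba - li for sh, ba, li in zip(shifted, base, linear)]
    ratios.append(norm(err) / norm(t))
print(ratios)  # the ratios should shrink along with the displacement
```

For smooth functions like these the ratio shrinks roughly in proportion to $\lVert t\rVert$, which is stronger than the definition demands. Differentiability only requires that it eventually falls below each $\epsilon$.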

First of all, since $f$ is differentiable at $g(x)$, given $\tilde{\epsilon}>0$ there is a $\tilde{\delta}>0$ so that if $\tilde{\delta}>\lVert s\rVert>0$ we have

$\displaystyle\left\lVert\left[f(g(x)+s)-f(g(x))\right]-df(g(x))s\right\rVert<\tilde{\epsilon}\lVert s\rVert$

Now since $g$ is differentiable at $x$ it satisfies a Lipschitz condition there. We showed that this works for real-valued functions, but extending the result is very straightforward. That is, there is some radius $r_1$ and a constant $M>0$ so that if $r_1>\lVert t\rVert>0$ we have the inequality $\lVert g(x+t)-g(x)\rVert<M\lVert t\rVert$. That is, $g$ cannot stretch displacements by more than a factor of $M$ as long as the displacements are small enough.
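Here is a rough numerical illustration of such a bound, as a sketch under assumptions rather than a proof. The function `g`, the point `x`, the radius `r1 = 0.1`, and the choice of $M$ as the Frobenius norm of $dg(x)$ plus one are all picked by hand for this example. It samples small displacements and records the largest stretch factor $\lVert g(x+t)-g(x)\rVert/\lVert t\rVert$ that turns up.

```python
import math

# Hypothetical example function R^2 -> R^2.
def g(x):
    return [x[0] * x[1], x[0] + x[1]]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

x = [1.0, 2.0]
dg = [[x[1], x[0]], [1.0, 1.0]]  # partial derivatives of g at x, by hand

# Assumed choice of constant: Frobenius norm of dg(x), plus some slack.
M = math.sqrt(sum(e * e for row in dg for e in row)) + 1.0
r1 = 0.1  # assumed radius; all sampled displacements stay inside it

worst = 0.0
base = g(x)
for radius in (0.09, 0.01, 0.001):
    for k in range(16):
        angle = 2 * math.pi * k / 16
        t = [radius * math.cos(angle), radius * math.sin(angle)]
        moved = g([x[0] + t[0], x[1] + t[1]])
        stretch = norm([m - b for m, b in zip(moved, base)]) / norm(t)
        worst = max(worst, stretch)
print(worst, M)  # the largest observed stretch should stay below M
```

Sampling finitely many displacements cannot establish the inequality, of course. It only shows what the constant $M$ is controlling.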

Now $r_1$ may be smaller than $\frac{\tilde{\delta}}{M}$ already, but just in case let’s shrink it until it is. Then we know that

$\displaystyle\lVert g(x+t)-g(x)\rVert<M\lVert t\rVert<M\frac{\tilde{\delta}}{M}=\tilde{\delta}$

so we can use this difference as a displacement $s$ from $g(x)$. We find

$\displaystyle\begin{aligned}\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))\left(g(x+t)-g(x)\right)\right\rVert&<\tilde{\epsilon}\lVert g(x+t)-g(x)\rVert\\&<\tilde{\epsilon}M\lVert t\rVert\end{aligned}$

Now we’re going to find a constant $N\geq0$ and a radius $r_2$ so that

$\displaystyle\left\lVert df(g(x))\left(g(x+t)-g(x)\right)-df(g(x))dg(x)t\right\rVert\leq\tilde{\epsilon}nN\lVert t\rVert$

whenever $r_2>\lVert t\rVert>0$. Once this is established, we are done. Given an $\epsilon>0$ we can set $\tilde{\epsilon}=\frac{\epsilon}{M+nN}$ and let $\delta$ be the smaller of the two resulting radii $r_1$ and $r_2$. Within this smaller radius, the desired inequality will hold.
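To spell out that last step: it is just the triangle inequality applied to the two estimates. For $\delta>\lVert t\rVert>0$ we add and subtract $df(g(x))\left(g(x+t)-g(x)\right)$ inside the norm and find

$\displaystyle\begin{aligned}\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))dg(x)t\right\rVert&\leq\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))\left(g(x+t)-g(x)\right)\right\rVert\\&\quad+\left\lVert df(g(x))\left(g(x+t)-g(x)\right)-df(g(x))dg(x)t\right\rVert\\&<\tilde{\epsilon}M\lVert t\rVert+\tilde{\epsilon}nN\lVert t\rVert\\&=\tilde{\epsilon}(M+nN)\lVert t\rVert=\epsilon\lVert t\rVert\end{aligned}$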

To get this result, we choose orthonormal coordinates on the space $\mathbb{R}^n$. We can then use these coordinates to write

$\displaystyle df(g(x))\left(\left[g(x+t)-g(x)\right]-dg(x)t\right)=\left[D_if\right](g(x))\left(\left[g^i(x+t)-g^i(x)\right]-dg^i(x)t\right)$

But since each of the several $g^i$ is differentiable we can pick our radius $r_2$ so that all of the inequalities

$\displaystyle\left\lvert\left[g^i(x+t)-g^i(x)\right]-dg^i(x)t\right\rvert<\tilde{\epsilon}\lVert t\rVert$

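hold for $r_2>\lVert t\rVert>0$. Then we let $N$ be the magnitude of the largest of the component partial derivatives $\left\lVert\left[D_if\right](g(x))\right\rVert$. Each of the $n$ terms in the sum over $i$ then has norm less than $N\tilde{\epsilon}\lVert t\rVert$, so the whole sum has norm at most $\tilde{\epsilon}nN\lVert t\rVert$, and we’re done.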

Thus if $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$, then the composite $f\circ g$ is differentiable at $x$, and the differential of the composite function is given by

$\displaystyle d\left[f\circ g\right](x)=df(g(x))dg(x)$

the composite of the differentials, considered as linear transformations.
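As a final illustration of the matrix picture, here is a minimal sketch in plain Python with $m=2$, $n=3$, and $p=2$. The functions `f` and `g` and the point `x` are hypothetical, and every matrix of partial derivatives is estimated by central differences rather than computed exactly. It multiplies the $2\times3$ matrix for $df(g(x))$ by the $3\times2$ matrix for $dg(x)$ and compares the result with the $2\times2$ matrix estimated directly from $f\circ g$.

```python
import math

# Hypothetical example functions, chosen only for illustration.
def g(x):  # R^2 -> R^3
    return [x[0] * x[1], math.sin(x[0]), x[1] ** 2]

def f(y):  # R^3 -> R^2
    return [y[0] * y[1] + y[2], y[0] - y[2] ** 2]

def jacobian(func, x, h=1e-6):
    """Central-difference estimate of the matrix of partial derivatives at x."""
    columns = []
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += h
        minus[j] -= h
        columns.append([(p - m) / (2 * h)
                        for p, m in zip(func(plus), func(minus))])
    # columns[j][i] estimates D_j func^i; transpose so rows index outputs.
    return [list(row) for row in zip(*columns)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [0.5, -1.2]
product = matmul(jacobian(f, g(x)), jacobian(g, x))  # (2x3)(3x2) = 2x2
composite = jacobian(lambda s: f(g(s)), x)           # 2x2, estimated directly
max_err = max(abs(p - c)
              for row_p, row_c in zip(product, composite)
              for p, c in zip(row_p, row_c))
print(max_err)  # should be tiny (finite-difference error only)
```

Note how the inner dimension $n=3$ disappears in the product, just as the intermediate space $\mathbb{R}^n$ disappears when we compose the linear transformations.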

Posted October 7, 2009 in Analysis, Calculus