The Unapologetic Mathematician

Mathematics for the interested outsider

Higher Differentials and Composite Functions

Last time we saw an example of what can go wrong when we try to translate higher differentials the way we did the first-order differential. Today I want to identify exactly what goes wrong, and I’ll make use of the summation convention to greatly simplify the process.

So, let’s take a function f of n variables \left\{y^j\right\}_{j=1}^n and a collection of n functions \left\{g^j\right\}_{j=1}^n, each depending on m variables \left\{x^i\right\}_{i=1}^m. We can think of these as the components of a vector-valued function g:X\rightarrow\mathbb{R}^n which has continuous second partial derivatives on some region X\subseteq\mathbb{R}^m. If the function f:Y\rightarrow\mathbb{R} has continuous second partial derivatives on some region Y\subseteq\mathbb{R}^n containing the image g(X), then we can compose the two functions to give a single function h=f\circ g:X\rightarrow\mathbb{R}, and we’re going to investigate the second differential of h with respect to the variables x^i.

To that end, we want to calculate the second partial derivative

\displaystyle\frac{\partial^2h}{\partial x^b\partial x^a}=\frac{\partial}{\partial x^b}\frac{\partial}{\partial x^a}h

First, we take the derivative in terms of x^a, and we use the chain rule to write

\displaystyle\frac{\partial h}{\partial x^a}=\frac{\partial g^j}{\partial x^a}\frac{\partial f}{\partial y^j}

Now we have to take the derivative in terms of x^b. Luckily, this operation is linear, so we don’t have to worry about the hidden summations in the notation. We do, however, have to use the product rule to handle the multiplications

\displaystyle\begin{aligned}\frac{\partial}{\partial x^b}\frac{\partial h}{\partial x^a}&=\frac{\partial}{\partial x^b}\left(\frac{\partial g^j}{\partial x^a}\frac{\partial f}{\partial y^j}\right)\\&=\frac{\partial^2g^j}{\partial x^b\partial x^a}\frac{\partial f}{\partial y^j}+\frac{\partial g^j}{\partial x^a}\frac{\partial^2f}{\partial x^b\partial y^j}\\&=\frac{\partial^2g^j}{\partial x^b\partial x^a}\frac{\partial f}{\partial y^j}+\frac{\partial g^j}{\partial x^a}\frac{\partial g^k}{\partial x^b}\frac{\partial^2f}{\partial y^k\partial y^j}\end{aligned}

where we’ve used the chain rule again to convert a derivative in terms of x^b into one in terms of y^k.

And here we’ve come to the problem itself. For we can write out the second differential in terms of the x^i

\displaystyle\begin{aligned}d^2h&=\frac{\partial^2h}{\partial x^b\partial x^a}dx^adx^b\\&=\frac{\partial^2f}{\partial y^k\partial y^j}\left(\frac{\partial g^j}{\partial x^a}dx^a\right)\left(\frac{\partial g^k}{\partial x^b}dx^b\right)+\frac{\partial f}{\partial y^j}\frac{\partial^2g^j}{\partial x^b\partial x^a}dx^adx^b\\&=\frac{\partial^2f}{\partial y^k\partial y^j}dg^jdg^k+\frac{\partial f}{\partial y^j}d^2g^j\end{aligned}

The first term here is the second differential in terms of the y^j. If there were an analogue of Cauchy’s invariant rule, this would be all there is to the formula. But we’ve got another term — one due to the product rule — based on the second differentials of the functions g^j themselves. This is the term that ruins the nice transformation properties of higher differentials, and which makes them unsuitable for many of our purposes.

Notice, though, that we have not contradicted Clairaut’s theorem here. Indeed, as long as f and all the g^j have continuous second partial derivatives, then so will h. Further, the formula we derived for the second partial derivatives of h is manifestly symmetric between the two derivatives, and so the mixed partials commute.
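
As a sanity check on the two-term formula for the second partial derivatives of h, here is a minimal numerical sketch. The test functions g(x^1,x^2)=(x^1x^2,\,x^1+(x^2)^2) and f(y^1,y^2)=\sin(y^1)y^2 are hypothetical choices made for illustration, not taken from the discussion above; the formula is compared against a plain finite-difference estimate.

```python
import math

# Hypothetical test functions, chosen only for illustration.
def g(x1, x2):
    return (x1 * x2, x1 + x2 ** 2)

def f(y1, y2):
    return math.sin(y1) * y2

def h(x1, x2):
    return f(*g(x1, x2))

def second_partial_by_formula(x1, x2):
    """Return (product-rule term, chain-rule term) of d^2h/dx^2 dx^1."""
    y1, y2 = g(x1, x2)
    grad_f = [math.cos(y1) * y2, math.sin(y1)]
    hess_f = [[-math.sin(y1) * y2, math.cos(y1)],
              [math.cos(y1), 0.0]]
    dg = [[x2, x1], [1.0, 2 * x2]]   # dg[j][a] = d g^j / d x^a
    d2g = [1.0, 0.0]                 # d^2 g^j / dx^2 dx^1
    a, b = 0, 1
    product_rule_term = sum(d2g[j] * grad_f[j] for j in range(2))
    chain_rule_term = sum(dg[j][a] * dg[k][b] * hess_f[k][j]
                          for j in range(2) for k in range(2))
    return product_rule_term, chain_rule_term

def second_partial_numeric(x1, x2, eps=1e-4):
    """Central finite-difference estimate of the mixed second partial of h."""
    return (h(x1 + eps, x2 + eps) - h(x1 + eps, x2 - eps)
            - h(x1 - eps, x2 + eps) + h(x1 - eps, x2 - eps)) / (4 * eps * eps)

extra, chain = second_partial_by_formula(0.7, 0.3)
print(extra + chain, second_partial_numeric(0.7, 0.3))
```

The two printed numbers should agree closely, and the product-rule term is visibly nonzero at this point, which is exactly the piece that an invariant rule for second differentials would have to ignore.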


October 19, 2009 | Analysis, Calculus

Higher-Order Differentials

Just like we assembled partial derivatives into the differential of a function, so we can assemble higher partial derivatives into higher-order differentials. The differential measures how the function itself changes as we move around, and the higher differentials will measure how lower differentials change.

First let’s look at the second-order differential of a real-valued function f of n variables x^i. We’ll use the dx^i as a basis for the space of differentials, which allows us to write out the components of the differential:

\displaystyle df(x)=\frac{\partial f}{\partial x^i}dx^i

So, just as we did for vector-valued functions, we’ll just take the differentials of each of these components separately, and then cobble them together.

\displaystyle d\left[df\right](x)=\left(d\frac{\partial f}{\partial x^i}\right)dx^i=\left(\frac{\partial^2 f}{\partial x^j\partial x^i}dx^j\right)dx^i

Now this second displacement may have nothing to do with the first, but it should be the same for all components. That is, we could write out the second differential as a function of not only the point x but of two displacements t_1 and t_2 from the point:

\displaystyle d^2f(x;t_1,t_2)=\frac{\partial^2 f}{\partial x^j\partial x^i}t_1^it_2^j

Commonly we’ll collapse this into a function of a point and a single displacement. We just put the same vector t in for both t_1 and t_2

\displaystyle d^2f(x;t)=\frac{\partial^2 f}{\partial x^j\partial x^i}t^it^j
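
To make the second differential concrete, here is a small sketch that evaluates it as a function of a point and two displacements. The function f(x,y)=x^3-3xy^2 is used purely as an illustrative choice, and the helper names are hypothetical.

```python
def hessian(x, y):
    """Second partial derivatives of f(x, y) = x**3 - 3*x*y**2."""
    return [[6 * x, -6 * y],
            [-6 * y, -6 * x]]

def d2f(point, t1, t2):
    """Second differential at `point` applied to displacements t1 and t2."""
    H = hessian(*point)
    return sum(H[j][i] * t1[i] * t2[j] for i in range(2) for j in range(2))

# Two different displacements, then the collapsed one-displacement form.
print(d2f((1.0, 2.0), (1.0, 0.0), (0.0, 1.0)))
print(d2f((1.0, 2.0), (1.0, 1.0), (1.0, 1.0)))
```

Feeding in the two coordinate directions picks out a single mixed partial derivative, while feeding in the same vector twice gives the quadratic form d^2f(x;t).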

Unfortunately, these higher differentials are more complicated than our first-order differentials. In particular, they don’t obey anything like Cauchy’s invariant rule, meaning they don’t transform well when we compose functions. As an example, let’s go back and look at the polar coordinate transform again:

\displaystyle\begin{aligned}x&=r\cos(\theta)\\y&=r\sin(\theta)\end{aligned}

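We’ve seen how we can use Cauchy’s invariant rule to rewrite differentials:

\displaystyle\begin{aligned}dx&=\cos(\theta)dr-r\sin(\theta)d\theta\\dy&=\sin(\theta)dr+r\cos(\theta)d\theta\end{aligned}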

We can also invert the transformation and rewrite differential operators:

\displaystyle\begin{aligned}\frac{\partial f}{\partial x}&=\cos(\theta)\frac{\partial f}{\partial r}-\frac{\sin(\theta)}{r}\frac{\partial f}{\partial\theta}\\\frac{\partial f}{\partial y}&=\sin(\theta)\frac{\partial f}{\partial r}+\frac{\cos(\theta)}{r}\frac{\partial f}{\partial\theta}\end{aligned}

So let’s take our second-order differential

\displaystyle d^2f=\left(\frac{\partial}{\partial x}\frac{\partial}{\partial x}f\right)(dx)^2+\left(\frac{\partial}{\partial x}\frac{\partial}{\partial y}f\right)(dx)(dy)+\left(\frac{\partial}{\partial y}\frac{\partial}{\partial x}f\right)(dy)(dx)+\left(\frac{\partial}{\partial y}\frac{\partial}{\partial y}f\right)(dy)^2

and try to rewrite it. The nasty bit is working out all these second-order partial derivatives in terms of r and \theta.

\displaystyle\begin{aligned}\frac{\partial}{\partial x}\frac{\partial}{\partial x}f=&\left[\cos(\theta)\frac{\partial}{\partial r}-\frac{\sin(\theta)}{r}\frac{\partial}{\partial\theta}\right]\left(\cos(\theta)\frac{\partial f}{\partial r}-\frac{\sin(\theta)}{r}\frac{\partial f}{\partial\theta}\right)\\=&\cos(\theta)\frac{\partial}{\partial r}\left(\cos(\theta)\frac{\partial f}{\partial r}-\frac{\sin(\theta)}{r}\frac{\partial f}{\partial\theta}\right)-\frac{\sin(\theta)}{r}\frac{\partial}{\partial\theta}\left(\cos(\theta)\frac{\partial f}{\partial r}-\frac{\sin(\theta)}{r}\frac{\partial f}{\partial\theta}\right)\\=&\left(\cos(\theta)^2\frac{\partial^2f}{\partial r^2}+\frac{\cos(\theta)\sin(\theta)}{r^2}\frac{\partial f}{\partial\theta}-\frac{\cos(\theta)\sin(\theta)}{r}\frac{\partial^2f}{\partial r\partial\theta}\right)\\&+\left(\frac{\sin(\theta)^2}{r}\frac{\partial f}{\partial r}-\frac{\sin(\theta)\cos(\theta)}{r}\frac{\partial^2f}{\partial\theta\partial r}+\frac{\sin(\theta)\cos(\theta)}{r^2}\frac{\partial f}{\partial\theta}+\frac{\sin(\theta)^2}{r^2}\frac{\partial^2f}{\partial\theta^2}\right)\\=&\cos(\theta)^2\frac{\partial^2f}{\partial r^2}-2\frac{\cos(\theta)\sin(\theta)}{r}\frac{\partial^2f}{\partial r\partial\theta}+\frac{\sin(\theta)^2}{r^2}\frac{\partial^2f}{\partial\theta^2}\\&+\frac{\sin(\theta)^2}{r}\frac{\partial f}{\partial r}+2\frac{\cos(\theta)\sin(\theta)}{r^2}\frac{\partial f}{\partial\theta}\end{aligned}

\displaystyle\begin{aligned}\frac{\partial}{\partial x}\frac{\partial}{\partial y}f=\frac{\partial}{\partial y}\frac{\partial}{\partial x}f=&\cos(\theta)\sin(\theta)\frac{\partial^2f}{\partial r^2}+\frac{\cos(\theta)^2-\sin(\theta)^2}{r}\frac{\partial^2f}{\partial r\partial\theta}-\frac{\sin(\theta)\cos(\theta)}{r^2}\frac{\partial^2f}{\partial\theta^2}\\&-\frac{\sin(\theta)\cos(\theta)}{r}\frac{\partial f}{\partial r}+\frac{\sin(\theta)^2-\cos(\theta)^2}{r^2}\frac{\partial f}{\partial\theta}\end{aligned}

\displaystyle\begin{aligned}\frac{\partial}{\partial y}\frac{\partial}{\partial y}f=&\sin(\theta)^2\frac{\partial^2f}{\partial r^2}+2\frac{\sin(\theta)\cos(\theta)}{r}\frac{\partial^2f}{\partial r\partial\theta}+\frac{\cos(\theta)^2}{r^2}\frac{\partial^2f}{\partial\theta^2}\\&+\frac{\cos(\theta)^2}{r}\frac{\partial f}{\partial r}-2\frac{\cos(\theta)\sin(\theta)}{r^2}\frac{\partial f}{\partial\theta}\end{aligned}

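After that it’s no trouble at all to transform the differential terms

\displaystyle\begin{aligned}(dx)^2&=\cos(\theta)^2(dr)^2-2r\cos(\theta)\sin(\theta)(dr)(d\theta)+r^2\sin(\theta)^2(d\theta)^2\\(dx)(dy)=(dy)(dx)&=\cos(\theta)\sin(\theta)(dr)^2+r\left(\cos(\theta)^2-\sin(\theta)^2\right)(dr)(d\theta)-r^2\cos(\theta)\sin(\theta)(d\theta)^2\\(dy)^2&=\sin(\theta)^2(dr)^2+2r\cos(\theta)\sin(\theta)(dr)(d\theta)+r^2\cos(\theta)^2(d\theta)^2\end{aligned}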

Let’s just work out the component that goes with (d\theta)^2 when we put these all together

\displaystyle\begin{aligned}\left(r^2\cos(\theta)^2\sin(\theta)^2-2r^2\cos(\theta)^2\sin(\theta)^2+r^2\cos(\theta)^2\sin(\theta)^2\right)&\frac{\partial^2f}{\partial r^2}\\+\left(-2r\cos(\theta)\sin(\theta)^3+2r(\cos(\theta)\sin(\theta)^3-\cos(\theta)^3\sin(\theta))+2r\cos(\theta)^3\sin(\theta)\right)&\frac{\partial^2f}{\partial r\partial\theta}\\+\left(\sin(\theta)^4+2\cos(\theta)^2\sin(\theta)^2+\cos(\theta)^4\right)&\frac{\partial^2f}{\partial\theta^2}\\+\left(r\sin(\theta)^4+2r\sin(\theta)^2\cos(\theta)^2+r\cos(\theta)^4\right)&\frac{\partial f}{\partial r}\\+\left(2\cos(\theta)\sin(\theta)^3+2(\cos(\theta)^3\sin(\theta)-\cos(\theta)\sin(\theta)^3)-2\cos(\theta)^3\sin(\theta)\right)&\frac{\partial f}{\partial\theta}\\=&\frac{\partial^2f}{\partial\theta^2}+r\frac{\partial f}{\partial r}\end{aligned}

Which has an extraneous term! If an invariance rule held, we should just get \frac{\partial^2f}{\partial\theta^2}.

The difference comes from the way that the differential operators themselves change as we move our point around. Increasing \theta by a little bit means something different at the point (x,y)=(1,0) than it does at the point (x,y)=(0,1). This doesn’t really matter when we’re talking about first-order differentials because we’re never putting two differential operators together, and so we never get any measurement of how an operator changes from point to point. We will eventually learn how to compensate for this effect, but that will wait until we have a significantly more general approach.
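
The extraneous term can be seen on a concrete function. The sketch below assumes the hypothetical test function f(x,y)=x^2y, which in polar coordinates is r^3\cos(\theta)^2\sin(\theta), and compares the (d\theta)^2 component obtained by substitution with the naive guess \frac{\partial^2f}{\partial\theta^2}.

```python
import math

def dtheta2_component(r, theta):
    """Coefficient of (d theta)^2 after substituting dx, dy into d^2 f for f = x**2 * y."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    fxx, fxy, fyy = 2 * y, 2 * x, 0.0
    # Coefficients of d(theta) in dx and dy for the polar transform.
    dx, dy = -r * math.sin(theta), r * math.cos(theta)
    return fxx * dx * dx + 2 * fxy * dx * dy + fyy * dy * dy

def polar_terms(r, theta):
    """Return (d^2 f / d theta^2, r * df/dr) for f written as r**3 cos^2 sin."""
    c, s = math.cos(theta), math.sin(theta)
    f_theta_theta = r ** 3 * (2 * s ** 3 - 7 * c * c * s)
    f_r = 3 * r * r * c * c * s
    return f_theta_theta, r * f_r

component = dtheta2_component(2.0, 0.6)
naive, extra = polar_terms(2.0, 0.6)
print(component, naive, naive + extra)
```

The component matches the naive second derivative only after the extra r\frac{\partial f}{\partial r} term is added back in.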

October 16, 2009 | Analysis, Calculus

Clairaut’s Theorem

Now for the most common sufficient condition ensuring that mixed partial derivatives commute. If f is a function of n\geq2 variables, we can for the moment hold the values of all but two of them constant. We’ll only consider two variables at a time, which will simplify our notation. For the moment, then, we write f(x,y). We will also assume that f is real-valued, and deal with vector values one component at a time.

I assert that if the partial derivatives D_xf and D_yf are continuous in a neighborhood of the point (a,b), and if the mixed second partial derivative D_{y,x}f exists and is continuous there, then the other mixed partial derivative D_{x,y}f exists at (a,b), and we have the equality

\displaystyle\left[D_{x,y}f\right](a,b)=\left[D_{y,x}f\right](a,b)

By definition, within the neighborhood in the statement of the theorem the partial derivative \frac{\partial f}{\partial y} is given by the limit

\displaystyle\left[D_yf\right](x,y)=\lim\limits_{k\to0}\frac{f(x,y+k)-f(x,y)}{k}

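So the numerator of the difference quotient defining the desired mixed partial derivative is

\displaystyle\left[D_yf\right](a+h,b)-\left[D_yf\right](a,b)=\lim\limits_{k\to0}\frac{f(a+h,b+k)-f(a+h,b)-f(a,b+k)+f(a,b)}{k}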

For a fixed k, we define the function

\displaystyle g_k(t)=f(a+t,b+k)-f(a+t,b)

We compute the derivative of g_k as

\displaystyle g_k'(t)=\left[D_xf\right](a+t,b+k)-\left[D_xf\right](a+t,b)

so we can apply the mean value theorem to write

\displaystyle g_k(h)-g_k(0)=hg_k'(\bar{h})

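for some \bar{h} between 0 and h. We use the above expression for g_k' to write the difference quotient

\displaystyle\frac{\left[D_yf\right](a+h,b)-\left[D_yf\right](a,b)}{h}=\lim\limits_{k\to0}\frac{\left[D_xf\right](a+\bar{h},b+k)-\left[D_xf\right](a+\bar{h},b)}{k}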

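In a similar trick to the one above, we can see that \left[D_xf\right](a+\bar{h},b+s) is differentiable as a function of s with derivative \left[D_{y,x}f\right](a+\bar{h},b+s). And so the mean value theorem tells us that we can write our difference quotient as

\displaystyle\frac{\left[D_yf\right](a+h,b)-\left[D_yf\right](a,b)}{h}=\lim\limits_{k\to0}\left[D_{y,x}f\right](a+\bar{h},\bar{y})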

for some \bar{y} between b and b+k.

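And so we come to try taking the limit

\displaystyle\lim\limits_{h\to0}\frac{\left[D_yf\right](a+h,b)-\left[D_yf\right](a,b)}{h}=\lim\limits_{h\to0}\lim\limits_{k\to0}\left[D_{y,x}f\right](a+\bar{h},\bar{y})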

If \bar{h} didn’t depend in its definition on k, this would be easy. First we could let k go to zero, which would make \bar{y} go to b, and then letting h go to zero would make \bar{h} go to zero as well. But it’s not going to be quite so easy, and limits in two variables like this usually call for some delicacy.

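Given an \epsilon>0, there (by the assumption of continuity) is some \delta>0 so that

\displaystyle\left\lvert\left[D_{y,x}f\right](x,y)-\left[D_{y,x}f\right](a,b)\right\rvert<\frac{\epsilon}{2}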

for (x,y) within a radius \delta of (a,b). As long as we keep \lvert h\rvert and \lvert k\rvert below \frac{\delta}{2}, the point (a+\bar{h},\bar{y}) will be within this radius. So we can keep h fixed at some small enough value, and find that \lvert k\rvert<\frac{\delta}{2} implies the inequality

\displaystyle\left\lvert\left[D_{y,x}f\right](a+\bar{h},\bar{y})-\left[D_{y,x}f\right](a,b)\right\rvert<\frac{\epsilon}{2}

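Now we can take the limit as k goes to zero. As we do so, the inequality here may become an equality, but since we kept it below \frac{\epsilon}{2}, we still have some wiggle room. So, if \lvert h\rvert<\frac{\delta}{2}, we have the inequality

\displaystyle\left\lvert\frac{\left[D_yf\right](a+h,b)-\left[D_yf\right](a,b)}{h}-\left[D_{y,x}f\right](a,b)\right\rvert\leq\frac{\epsilon}{2}<\epsilon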

which gives us the limit we need.

Of course we could instead assume that the second mixed partial derivative exists and is continuous near (a,b), and conclude that the first one exists and is equal to the second.

October 15, 2009 | Analysis, Calculus

Higher Partial Derivatives

Let’s say we’ve got a function f that’s differentiable within an open region S\subseteq\mathbb{R}^n. In particular, if we pick coordinates on \mathbb{R}^n the function has all partial derivatives \frac{\partial f}{\partial x^i} at each point in S. As we move around within S the value of the partial derivative changes, justifying the functional notation \left[D_{x^i}f\right](x). And if we’re lucky, these functions themselves may be differentiable.

In particular, it makes sense to ask about the existence of so-called “second partial derivatives”, defined as

\displaystyle D_{x^i,x^j}f=D_{x^i}\left(D_{x^j}f\right)

Or in Leibniz’ notation:

\displaystyle\frac{\partial^2f}{\partial x^i\partial x^j}=\frac{\partial}{\partial x^i}\frac{\partial f}{\partial x^j}=\frac{\partial}{\partial x^i}\frac{\partial}{\partial x^j}f

If we take the derivative in terms of the same variable twice in a row we sometimes write this as

\displaystyle\frac{\partial}{\partial x^i}\frac{\partial}{\partial x^i}=\frac{\partial^2}{(\partial x^i)^2}

Yes, there’s some dissonance between superscripts as indices and superscripts as powers. But, again, this is pretty much the received notation in many areas. If it seems like it might be confusing we just write out \partial x^i twice in a row.

These, of course, may be defined within the region S, and we can then sensibly ask about third partial derivatives, like

\displaystyle\frac{\partial^3}{\partial x^i\partial x^j\partial x^k}f=\frac{\partial}{\partial x^i}\frac{\partial}{\partial x^j}\frac{\partial}{\partial x^k}f

and so on.

As an example, let’s consider the function f(x,y)=x^3 - 3xy^2. We can easily calculate the two first partial derivatives.

\displaystyle\begin{aligned}\frac{\partial f}{\partial x}&=3x^2-3y^2\\\frac{\partial f}{\partial y}&=-6xy\end{aligned}

And then we take each derivative of each of these two

\displaystyle\begin{aligned}\frac{\partial^2f}{\partial x^2}&=\frac{\partial}{\partial x}\left(3x^2-3y^2\right)=6x\\\frac{\partial^2f}{\partial y\partial x}&=\frac{\partial}{\partial y}\left(3x^2-3y^2\right)=-6y\\\frac{\partial^2f}{\partial x\partial y}&=\frac{\partial}{\partial x}\left(-6xy\right)=-6y\\\frac{\partial^2f}{\partial y^2}&=\frac{\partial}{\partial y}\left(-6xy\right)=-6x\end{aligned}

where, since we’re not using superscripts as indices in these examples, the meaning of the exponents should be clear.

We notice here that the two in the middle — the “mixed” partial derivatives — are the same. This will happen in many cases of interest to us, but not always. As a pathological example, let’s go back and consider the function defined by

\displaystyle f(x,y)=\frac{xy(x^2-y^2)}{x^2+y^2}

away from the origin, and patched by f(0,0)=0. Again, we calculate the first partial derivatives (at least away from the origin):

\displaystyle\begin{aligned}\frac{\partial f}{\partial x}&=\frac{y(x^4+4x^2y^2-y^4)}{(x^2+y^2)^2}\\\frac{\partial f}{\partial y}&=\frac{x(x^4-4x^2y^2-y^4)}{(x^2+y^2)^2}\end{aligned}

Each partial derivative is 0 at the origin.

Now we can check that \left[D_xf\right](0,y)=-y for all y, and that \left[D_yf\right](x,0)=x for all x. Thus we can calculate

\displaystyle\begin{aligned}\frac{\partial^2f}{\partial x\partial y}\biggr\vert_{(0,0)}&=1\\\frac{\partial^2f}{\partial y\partial x}\biggr\vert_{(0,0)}&=-1\end{aligned}

and the mixed partial derivatives are not equal.
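
The failure of the mixed partials to commute can also be observed numerically. The following is a rough sketch using nested difference quotients; the step sizes are assumptions picked so that the inner step is much smaller than the outer one, and the helper names are hypothetical.

```python
def f(x, y):
    """The pathological function, patched by f(0, 0) = 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x * x - y * y) / (x * x + y * y)

def Dx(x, y, h=1e-7):
    """Central-difference estimate of the partial derivative in x."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def Dy(x, y, h=1e-7):
    """Central-difference estimate of the partial derivative in y."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

def mixed_partials_at_origin(step=1e-3):
    """Return estimates of (d/dx of D_y f, d/dy of D_x f) at the origin."""
    d_xy = (Dy(step, 0.0) - Dy(0.0, 0.0)) / step
    d_yx = (Dx(0.0, step) - Dx(0.0, 0.0)) / step
    return d_xy, d_yx

print(mixed_partials_at_origin())
```

The two estimates come out near 1 and -1 respectively, matching the calculation by hand.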

October 14, 2009 | Analysis, Calculus

The Mean Value Theorem

Here’s a nice technical result we may have call for from time to time: a higher-dimensional version of the differential mean value theorem. Remember that this says that if we’ve got a function f continuous on the closed interval \left[a,b\right]\subseteq\mathbb{R} and differentiable on its interior, there is some point \xi in the middle where the derivative of the function is the same as the average — the mean — rate of change of the function over the interval. In more than one dimension we’re going to modify this a bit to make it clearer what it means.

First of all, instead of talking about the closed interval \left[a,b\right], we’re going to use the closed straight line segment. That is, the collection of all the points between a and b in a straight line, and including the endpoints. We first look at the total displacement b-a from one point to the other. Then we start at a and move some portion of this displacement towards b. That is, the closed line segment \left[a,b\right] consists of all points of the form a+t(b-a) for t in the closed interval \left[0,1\right]. Setting t=0 gives us the point a, and t=1 gives us the point b. Similarly, the open line segment \left(a,b\right) consists of all points of the form a+t(b-a) for t in the open interval \left(0,1\right).

Next, we have to be clear about the average rate of change. As we move from a to b, the value of the function f changes by f(b)-f(a). It takes a displacement of \lVert b-a\rVert to get there, so on average the rate of change is

\displaystyle\frac{f(b)-f(a)}{\lVert b-a\rVert}

Finally, we don’t just have a single value for the instantaneous rate of change, we have a differential df(\xi). But we can use it to find directional derivatives. Specifically, we’ll consider the derivative of f in the direction pointing from a to b. We’ll pick out this direction with the unit vector we get by normalizing the displacement

\displaystyle\frac{b-a}{\lVert b-a\rVert}

So the mean value theorem will tell us that if f is differentiable in some open region S that contains the whole closed line segment \left[a,b\right], then there is some point \xi in the open line segment \left(a,b\right) so that the average rate of change of f from a to b is equal to the directional derivative of f at \xi in the direction pointing from a to b:

\displaystyle\frac{f(b)-f(a)}{\lVert b-a\rVert}=df(\xi)\left(\frac{b-a}{\lVert b-a\rVert}\right)

or, more simply

\displaystyle f(b)-f(a)=df(\xi)\left(b-a\right)

We’ll get at this by changing to a function of one variable so we can bring the one-dimensional version to bear. To that end, we define h(t)=f(a+t(b-a)) for t in the closed interval \left[0,1\right]. Then f(b)-f(a)=h(1)-h(0), and we can also show that h is differentiable everywhere inside the interval. Indeed, we can evaluate the difference quotient

\displaystyle\frac{h(s)-h(t)}{s-t}=\frac{f\left(a+t(b-a)+(s-t)(b-a)\right)-f\left(a+t(b-a)\right)}{s-t}

Taking the limit as s approaches t, we find

\displaystyle h'(t)=\left[D_{b-a}f\right](a+t(b-a))=df(a+t(b-a))(b-a)

which exists since f is differentiable.

So our old differential mean-value theorem tells us that there is some \tau\in\left(0,1\right) so that

\displaystyle f(b)-f(a)=h(1)-h(0)=h'(\tau)(1-0)=df(a+\tau(b-a))(b-a)=df(\xi)(b-a)

where \xi=a+\tau(b-a) is a point in the open line segment (a,b).
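
Here is a minimal sketch of the theorem in action, assuming the hypothetical example f(x,y)=x^2+y^2 with a=(0,0) and b=(1,2). It searches the segment by bisection for a parameter \tau where df(\xi)(b-a) equals f(b)-f(a); bisection is just one convenient way to locate such a point in this example, not part of the proof.

```python
def f(p):
    x, y = p
    return x * x + y * y

def df(p, t):
    """Differential of f at p applied to the displacement t (gradient dot t)."""
    x, y = p
    return 2 * x * t[0] + 2 * y * t[1]

def mean_value_point(a, b, tol=1e-12):
    """Find tau in [0, 1] with df(a + tau*(b - a))(b - a) = f(b) - f(a)."""
    disp = (b[0] - a[0], b[1] - a[1])
    target = f(b) - f(a)

    def gap(tau):
        xi = (a[0] + tau * disp[0], a[1] + tau * disp[1])
        return df(xi, disp) - target

    lo, hi = 0.0, 1.0
    # Bisection assumes gap changes sign on [0, 1], which holds in this example.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    tau = (lo + hi) / 2
    return tau, (a[0] + tau * disp[0], a[1] + tau * disp[1])

tau, xi = mean_value_point((0.0, 0.0), (1.0, 2.0))
print(tau, xi)
```

For this example h(t)=5t^2, so the mean value point sits at the midpoint of the segment.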

October 13, 2009 | Analysis, Calculus

Transforming Differential Operators

Because of the chain rule and Cauchy’s invariant rule, we know that we can transform differentials along with functions. For example, if we write

\displaystyle\begin{aligned}x&=r\cos(\theta)\\y&=r\sin(\theta)\end{aligned}

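we can write the differentials of x and y in terms of the differentials of r and \theta:

\displaystyle\begin{aligned}dx&=\cos(\theta)dr-r\sin(\theta)d\theta\\dy&=\sin(\theta)dr+r\cos(\theta)d\theta\end{aligned}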

It turns out that the chain rule also tells us how to rewrite differential operators in terms of the variables. But these go in the other direction. That is, we can write the differential operators \frac{\partial}{\partial r} and \frac{\partial}{\partial\theta} in terms of the operators \frac{\partial}{\partial x} and \frac{\partial}{\partial y}.

First of all, let’s write down the differential of f in terms of x and y and in terms of r and \theta:

\displaystyle\begin{aligned}df&=\frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy\\df&=\frac{\partial f}{\partial r}dr+\frac{\partial f}{\partial\theta}d\theta\end{aligned}

and now we can rewrite dx and dy in terms of dr and d\theta.

\displaystyle\begin{aligned}df&=\frac{\partial f}{\partial x}\left(\cos(\theta)dr-r\sin(\theta)d\theta\right)+\frac{\partial f}{\partial y}\left(\sin(\theta)dr+r\cos(\theta)d\theta\right)\\&=\frac{\partial f}{\partial x}\cos(\theta)dr-\frac{\partial f}{\partial x}r\sin(\theta)d\theta+\frac{\partial f}{\partial y}\sin(\theta)dr+\frac{\partial f}{\partial y}r\cos(\theta)d\theta\\&=\left(\cos(\theta)\frac{\partial f}{\partial x}+\sin(\theta)\frac{\partial f}{\partial y}\right)dr+\left(-r\sin(\theta)\frac{\partial f}{\partial x}+r\cos(\theta)\frac{\partial f}{\partial y}\right)d\theta\end{aligned}

Now by uniqueness we can read off the partial derivatives of f in terms of r and \theta:

\displaystyle\begin{aligned}\frac{\partial f}{\partial r}&=\cos(\theta)\frac{\partial f}{\partial x}+\sin(\theta)\frac{\partial f}{\partial y}\\\frac{\partial f}{\partial\theta}&=-r\sin(\theta)\frac{\partial f}{\partial x}+r\cos(\theta)\frac{\partial f}{\partial y}\end{aligned}

Finally, we pull all mention of f out of our notation and just write out the differential operators.

\displaystyle\begin{aligned}\frac{\partial}{\partial r}&=\cos(\theta)\frac{\partial}{\partial x}+\sin(\theta)\frac{\partial}{\partial y}\\\frac{\partial}{\partial\theta}&=-r\sin(\theta)\frac{\partial}{\partial x}+r\cos(\theta)\frac{\partial}{\partial y}\end{aligned}

Now we’re done rewriting, but for good form we should express these coefficients in terms of x and y.

\displaystyle\begin{aligned}\frac{\partial}{\partial r}&=\frac{x}{\sqrt{x^2+y^2}}\frac{\partial}{\partial x}+\frac{y}{\sqrt{x^2+y^2}}\frac{\partial}{\partial y}\\\frac{\partial}{\partial\theta}&=-y\frac{\partial}{\partial x}+x\frac{\partial}{\partial y}\end{aligned}

It’s important to note that there’s really no difference between these last two steps. The first one uses the variables r and \theta while the second uses the variables x and y, but they express the exact same functions, given the original substitutions above.
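
The operator formulas can be checked numerically on any smooth function. The sketch below assumes the hypothetical test function f(x,y)=x^2y+y^3 and compares the operator expressions, written in terms of x and y, against finite-difference derivatives taken directly in r and \theta.

```python
import math

# Hypothetical test function and its partial derivatives in x and y.
def f(x, y):
    return x * x * y + y ** 3

def fx(x, y):
    return 2 * x * y

def fy(x, y):
    return x * x + 3 * y * y

def F(r, theta):
    """The same function after the polar substitution."""
    return f(r * math.cos(theta), r * math.sin(theta))

def polar_partials_from_operators(r, theta):
    """Apply the transformed operators, with coefficients in terms of x and y."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    norm = math.hypot(x, y)
    d_r = (x / norm) * fx(x, y) + (y / norm) * fy(x, y)
    d_theta = -y * fx(x, y) + x * fy(x, y)
    return d_r, d_theta

def polar_partials_numeric(r, theta, eps=1e-6):
    """Central finite differences taken directly in r and theta."""
    d_r = (F(r + eps, theta) - F(r - eps, theta)) / (2 * eps)
    d_theta = (F(r, theta + eps) - F(r, theta - eps)) / (2 * eps)
    return d_r, d_theta

print(polar_partials_from_operators(1.5, 0.4))
print(polar_partials_numeric(1.5, 0.4))
```

Both lines should print nearly the same pair of numbers, whichever set of variables the coefficients are written in.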

More generally, let’s say we have a vector-valued function g:\mathbb{R}^m\rightarrow\mathbb{R}^n defining a substitution

\displaystyle\begin{aligned}y^1&=g^1(x^1,\dots,x^m)\\&\vdots\\y^n&=g^n(x^1,\dots,x^m)\end{aligned}

Cauchy’s invariant rule tells us that this gives rise to a substitution for differentials.

\displaystyle\begin{aligned}dy^1=dg^1(x^1,\dots,x^m)&=\frac{\partial g^1}{\partial x^1}dx^1+\dots+\frac{\partial g^1}{\partial x^m}dx^m=\frac{\partial g^1}{\partial x^i}dx^i\\&\vdots\\dy^n=dg^n(x^1,\dots,x^m)&=\frac{\partial g^n}{\partial x^1}dx^1+\dots+\frac{\partial g^n}{\partial x^m}dx^m=\frac{\partial g^n}{\partial x^i}dx^i\end{aligned}

We can play it a little loose and write this out in matrix notation:

\displaystyle\begin{pmatrix}dy^1\\\vdots\\dy^n\end{pmatrix}=\begin{pmatrix}\frac{\partial g^1}{\partial x^1}&\dots&\frac{\partial g^1}{\partial x^m}\\\vdots&\ddots&\vdots\\\frac{\partial g^n}{\partial x^1}&\dots&\frac{\partial g^n}{\partial x^m}\end{pmatrix}\begin{pmatrix}dx^1\\\vdots\\dx^m\end{pmatrix}

Now if we have a function f in terms of the y variables, we can use the substitution above to write it as a function of the x variables. We can write the differential of f in terms of each

\displaystyle\begin{aligned}df&=\frac{\partial f}{\partial y^j}dy^j\\df&=\frac{\partial f}{\partial x^i}dx^i\end{aligned}

Next we use the substitutions of the differentials to rewrite the first form as

\displaystyle df=\frac{\partial f}{\partial y^j}\frac{\partial g^j}{\partial x^i}dx^i

Then uniqueness allows us to match up the coefficients and write out the partial derivatives in terms of the x variables

\displaystyle\frac{\partial f}{\partial x^i}=\frac{\partial g^j}{\partial x^i}\frac{\partial f}{\partial y^j}

It is in this form that the chain rule is most often introduced, or the similar form

\displaystyle\frac{\partial f}{\partial x^i}=\frac{\partial y^j}{\partial x^i}\frac{\partial f}{\partial y^j}

And now we can remove mention of f from the formulæ and speak directly in terms of the operators

\displaystyle\frac{\partial}{\partial x^i}=\frac{\partial y^j}{\partial x^i}\frac{\partial}{\partial y^j}

Again, we can play it a little loose and write this in matrix notation

\displaystyle\begin{pmatrix}\frac{\partial}{\partial x^1}\\\vdots\\\frac{\partial}{\partial x^m}\end{pmatrix}=\begin{pmatrix}\frac{\partial y^1}{\partial x^1}&\dots&\frac{\partial y^n}{\partial x^1}\\\vdots&\ddots&\vdots\\\frac{\partial y^1}{\partial x^m}&\dots&\frac{\partial y^n}{\partial x^m}\end{pmatrix}\begin{pmatrix}\frac{\partial}{\partial y^1}\\\vdots\\\frac{\partial}{\partial y^n}\end{pmatrix}

This is very similar to the substitution for differentials written in matrix notation. The differences are that we transform from y-derivations to x-derivations instead of from x-differentials to y-differentials, and the two substitution matrices are the transposes of each other. Those who have been following closely (or who have some background in differential geometry) should start to see the importance of this latter fact, but for now we’ll consider this a statement about formulas and methods of calculation. We’ll come to the deeper geometric meaning when we come through again in a wider context.

October 12, 2009 | Analysis, Calculus

Product and Quotient rules

As I said before, there’s generally no product of higher-dimensional vectors, and so there’s no generalization of the product rule. But we can multiply and divide real-valued functions of more than one variable. Finding the differential of such a product or quotient function is a nice little exercise in using Cauchy’s invariant rule.

For all that follows we’re considering two real-valued functions of n real variables: f(x)=f(x^1,\dots,x^n) and g(x)=g(x^1,\dots,x^n). We’ll put them together to give a single map from \mathbb{R}^n to \mathbb{R}^2 by picking orthonormal coordinates u and v on the latter space and defining

\displaystyle\begin{aligned}u&=f(x^1,\dots,x^n)\\v&=g(x^1,\dots,x^n)\end{aligned}

We also have two familiar functions that we don’t often think of explicitly as functions from \mathbb{R}^2 to \mathbb{R}:

\displaystyle\begin{aligned}p(u,v)&=uv\\q(u,v)&=\frac{u}{v}\end{aligned}

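Now we can find the differentials of p and q

\displaystyle\begin{aligned}dp&=vdu+udv\\dq&=\frac{1}{v}du-\frac{u}{v^2}dv=\frac{vdu-udv}{v^2}\end{aligned}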

Notice that the differential for q is exactly the alternate notation I mentioned when defining the one-variable quotient rule!

With all this preparation out of the way, the product function f(x)g(x) can be seen as the composition p(f(x),g(x)), while the quotient function \frac{f(x)}{g(x)} can be seen as the composition q(f(x),g(x)). So to calculate the differentials of the product and quotient we can use Cauchy’s invariant rule to make the substitutions

\displaystyle\begin{aligned}u&=f(x)\\v&=g(x)\\du&=df(x)\\dv&=dg(x)\end{aligned}

The upshot is that just like in the case of one variable we can differentiate a product of two functions by differentiating each of the functions, multiplying by the other function, and adding the two resulting terms. We just use the differential instead of the derivative.

\displaystyle d\left[fg\right](x)=df(x)g(x)+f(x)dg(x)

Similarly, we can differentiate the quotient of two functions just as in the one-variable case, but using the differential instead of the derivative.

\displaystyle d\left[\frac{f}{g}\right](x)=\frac{df(x)g(x)-f(x)dg(x)}{g(x)^2}
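
A quick numerical sketch of both rules follows. The functions f(x)=(x^1)^2+x^2 and g(x)=1+x^1x^2 are hypothetical examples, and each differential is compared against a central difference of the product or quotient along a displacement t.

```python
# Hypothetical example functions of two variables and their differentials.
def f(x):
    return x[0] ** 2 + x[1]

def g(x):
    return 1.0 + x[0] * x[1]

def df(x, t):
    return 2 * x[0] * t[0] + t[1]

def dg(x, t):
    return x[1] * t[0] + x[0] * t[1]

def d_product(x, t):
    """Product rule: d[fg] = df g + f dg, applied to displacement t."""
    return df(x, t) * g(x) + f(x) * dg(x, t)

def d_quotient(x, t):
    """Quotient rule: d[f/g] = (df g - f dg) / g^2, applied to displacement t."""
    return (df(x, t) * g(x) - f(x) * dg(x, t)) / g(x) ** 2

def directional_difference(func, x, t, eps=1e-6):
    """Central-difference estimate of the derivative of func at x along t."""
    xp = [xi + eps * ti for xi, ti in zip(x, t)]
    xm = [xi - eps * ti for xi, ti in zip(x, t)]
    return (func(xp) - func(xm)) / (2 * eps)

x, t = [1.0, 2.0], [0.3, -0.5]
print(d_product(x, t), directional_difference(lambda p: f(p) * g(p), x, t))
print(d_quotient(x, t), directional_difference(lambda p: f(p) / g(p), x, t))
```

Each line should print two nearly equal numbers, provided g is nonzero at the chosen point.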

October 9, 2009 | Analysis, Calculus

Cauchy’s Invariant Rule

An immediate corollary of the chain rule is another piece of “syntactic sugar”.

If we have functions g:X\rightarrow\mathbb{R}^n and f:Y\rightarrow\mathbb{R}^p for some open regions X\subseteq\mathbb{R}^m and Y\subseteq\mathbb{R}^n so that the image g(X) is contained in Y, we can compose the two functions to get a new function f\circ g:X\rightarrow\mathbb{R}^p. In terms of formulas, we can choose coordinates y^i on \mathbb{R}^n and write out both the function f(y^1,\dots,y^n) and the component functions g^1(x),\dots,g^n(x). We get a formula for \left[f\circ g\right](x) by substituting g^i(x) for y^i in the formula for f and write y^i=g^i(x).

The language there seems a little convoluted, so I’d like to give an example. We might define a function f(x,y)=e^{x^2+y^2} for all points (x,y) in the plane \mathbb{R}^2. This is all well and good, but we might want to talk about the function in polar coordinates. To this end, we may define x=r\cos(\theta) and y=r\sin(\theta). These are the component functions describing a transformation g from the region (r,\theta)\in(0,\infty)\times(-\pi,\pi)\subseteq\mathbb{R}^2 to the region where (x,y)\neq(0,0). We can substitute r\cos(\theta) for x and r\sin(\theta) for y in our formula for f to get a new function f\circ g with formula

\displaystyle f(g(r,\theta))=e^{r^2\cos(\theta)^2+r^2\sin(\theta)^2}=e^{r^2}

This much is straightforward. The thing is, now we want to take differentials. What Cauchy’s invariant rule tells us is that we can calculate the differential of f\circ g by not only substituting g^i(x) for y^i, but also substituting dg^i(x;t) for s^i in the formula for df(y;s). That is, if h=f\circ g then we have the equivalence

\displaystyle dh(x;t)=df(g^1(x),\dots,g^n(x);dg^1(x;t),\dots,dg^n(x;t))

In our particular example, we can easily calculate the differential of f using our first formula:

\displaystyle df(x,y)=2xe^{x^2+y^2}dx+2ye^{x^2+y^2}dy

or using our second formula:

\displaystyle df(r,\theta)=2re^{r^2}dr

We want to call both of these simply df. But can we do so unambiguously? Indeed, if x=r\cos(\theta) then we find

\displaystyle dx=\cos(\theta)dr-r\sin(\theta)d\theta

and if y=r\sin(\theta) then we find

\displaystyle dy=\sin(\theta)dr+r\cos(\theta)d\theta

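We substitute these into our formula for df(x,y) to find

\displaystyle\begin{aligned}df&=2xe^{x^2+y^2}dx+2ye^{x^2+y^2}dy\\&=2r\cos(\theta)e^{r^2}\left(\cos(\theta)dr-r\sin(\theta)d\theta\right)+2r\sin(\theta)e^{r^2}\left(\sin(\theta)dr+r\cos(\theta)d\theta\right)\\&=2re^{r^2}\left(\cos(\theta)^2+\sin(\theta)^2\right)dr+2r^2e^{r^2}\left(-\cos(\theta)\sin(\theta)+\sin(\theta)\cos(\theta)\right)d\theta\\&=2re^{r^2}dr\end{aligned}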

just the same as if we calculated directly from the formula in terms of r and \theta.

That is, we can substitute our formulæ for the coordinate functions y^i=g^i(x) before taking the differential in terms of x, or we can take the differential in terms of y and then substitute our formulæ for the coordinate functions y^i=g^i(x) and their differentials dy^i=dg^i(x) into the result. Either way, we end up in the same place, so we don’t have to worry about ending up with two (or more!) “different” differentials of f.
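
The two routes can be compared numerically for the example f(x,y)=e^{x^2+y^2}. This is only a sketch: the function names are hypothetical, and the particular values of r, \theta, dr, and d\theta are arbitrary.

```python
import math

def df_xy(x, y, dx, dy):
    """Differential of f(x, y) = exp(x^2 + y^2) in rectangular coordinates."""
    e = math.exp(x * x + y * y)
    return 2 * x * e * dx + 2 * y * e * dy

def df_polar(r, theta, dr, dtheta):
    """Differential of the substituted formula exp(r^2); d(theta) does not appear."""
    return 2 * r * math.exp(r * r) * dr

def df_by_substitution(r, theta, dr, dtheta):
    """Take df in x and y, then substitute x, y, dx, dy in terms of r and theta."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    dx = math.cos(theta) * dr - r * math.sin(theta) * dtheta
    dy = math.sin(theta) * dr + r * math.cos(theta) * dtheta
    return df_xy(x, y, dx, dy)

args = (0.8, 1.1, 0.2, -0.7)
print(df_polar(*args), df_by_substitution(*args))
```

Both routes give the same value up to rounding, which is what lets us call the result simply df.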

So, how do we verify this using the chain rule? Just write out the differentials using partial derivatives. For example, we know that

\displaystyle df(y;s^1,\dots,s^n)=\frac{\partial f}{\partial y^i}\biggr\vert_ys^i

and so on. So, performing our substitutions we can find:

\displaystyle\begin{aligned}df(g(x);dg^1(x;t),\dots,dg^n(x;t))&=\frac{\partial f}{\partial y^i}\biggr\vert_{y=g(x)}dg^i(x;t)\\&=\frac{\partial f}{\partial y^i}\biggr\vert_{y=g(x)}\frac{\partial g^i}{\partial x^j}\biggr\vert_xt^j\\&=\frac{\partial\left[f\circ g\right]}{\partial x^j}\biggr\vert_xt^j\\&=d\left[f\circ g\right](x;t)\end{aligned}

The important part here is the passage from products of two partial derivatives to single partial derivatives of f\circ g. This works out because when we consider differentials as linear transformations, the matrix entries are the partial derivatives. The composition of the linear transformations df(g(x)) and dg(x) is given by the product of these matrices, and the entries of the resulting matrix must (by uniqueness) be the partial derivatives of the composite function.

October 8, 2009 Posted by | Analysis, Calculus | 5 Comments

The Chain Rule

Since the components of the differential are given by partial derivatives, and partial derivatives (like all single-variable derivatives) are linear, it’s straightforward to see that the differential operator is linear as well. That is, if f:\mathbb{R}^m\rightarrow\mathbb{R}^n and g:\mathbb{R}^m\rightarrow\mathbb{R}^n are two functions, both of which are differentiable at a point x, and a and b are real constants, then the linear combination af+bg is also differentiable at x, and the differential is given by

\displaystyle d(af+bg)=adf+bdg

There’s not usually a product for function values in \mathbb{R}^n, so there’s not usually any analogue of the product rule and definitely none of the quotient rule, so we can ignore those for now.

But we do have a higher-dimensional analogue for the chain rule. If we have a function g:X\rightarrow\mathbb{R}^n defined on some open region X\subseteq\mathbb{R}^m and another function f:Y\rightarrow\mathbb{R}^p defined on a region Y\subseteq\mathbb{R}^n that contains the image g(X), then we can compose them to get a single function f\circ g:X\rightarrow\mathbb{R}^p defined by \left[f\circ g\right](x)=f(g(x)). And if g is differentiable at a point x and f is differentiable at the image point g(x), then the composite function is differentiable at x.

First of all, what should the differential be? Remember that the differential dg(x) is a linear transformation that takes displacements t\in\mathbb{R}^m from the point x and turns them into displacements dg(x)t\in\mathbb{R}^n from the point g(x). Then the differential df(y) is a linear transformation that takes displacements s\in\mathbb{R}^n from the point y and turns them into displacements df(y)s\in\mathbb{R}^p from the point f(y). Putting these together, we have a composite linear transformation df(g(x))dg(x) that takes displacements t\in\mathbb{R}^m from the point x and turns them into displacements df(g(x))dg(x)t\in\mathbb{R}^p from the point f(g(x)). I assert that this composite transformation is exactly the differential of the composite function.

Just as a sanity check, what happens when we look at single-variable real-valued functions? In this case, df(y) and dg(x) are both linear transformations from one-dimensional spaces to other one-dimensional spaces. That is, they’re represented as 1\times1 matrices that just multiply by the single real entry. So the composite of the two transformations is given by the matrix whose single entry is the product of the two matrices’ single entries. In other words, in one variable the differentials look like single real numbers f'(y) and g'(x), and their composite is given by multiplication: f'(g(x))g'(x). This is exactly the one-variable chain rule. To understand multiple variables we have to move from products of real numbers to compositions of linear transformations, which will be products of real matrices.

Okay, so let’s verify that d\left[f\circ g\right](x)=df(g(x))dg(x) does indeed act as a differential for f\circ g. It’s clearly a linear transformation between the appropriate two spaces of displacements. What we need to verify is that it gives a good approximation. That is, for every \epsilon>0 there is a \delta>0 so that if \delta>\lVert t\rVert>0 we have

\displaystyle\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))dg(x)t\right\rVert<\epsilon\lVert t\rVert

First of all, since f is differentiable at g(x), given \tilde{\epsilon}>0 there is a \tilde{\delta}>0 so that if \tilde{\delta}>\lVert s\rVert>0 we have

\displaystyle\left\lVert\left[f(g(x)+s)-f(g(x))\right]-df(g(x))s\right\rVert<\tilde{\epsilon}\lVert s\rVert

Now since g is differentiable at x it satisfies a Lipschitz condition there. We showed that this works for real-valued functions, but extending the result is very straightforward. That is, there is some radius r_1 and a constant M>0 so that if r_1>\lVert t\rVert>0 we have the inequality \lVert g(x+t)-g(x)\rVert<M\lVert t\rVert. That is, g cannot stretch displacements by more than a factor of M as long as the displacements are small enough.

Now r_1 may be smaller than \frac{\tilde{\delta}}{M} already, but just in case let’s shrink it until it is. Then we know that

\displaystyle\lVert g(x+t)-g(x)\rVert<M\lVert t\rVert<Mr_1<M\frac{\tilde{\delta}}{M}=\tilde{\delta}

so we can use this difference as a displacement s from g(x). We find

\displaystyle\begin{aligned}\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))\left(g(x+t)-g(x)\right)\right\rVert&<\tilde{\epsilon}\lVert g(x+t)-g(x)\rVert\\&<\tilde{\epsilon}M\lVert t\rVert\end{aligned}

Now we’re going to find a constant N\geq0 and a radius r_2 so that

\displaystyle\left\lVert df(g(x))\left(g(x+t)-g(x)\right)-df(g(x))dg(x)t\right\rVert\leq\tilde{\epsilon}nN\lVert t\rVert

whenever r_2>\lVert t\rVert>0. Once this is established, we are done. Given an \epsilon>0 we can set \tilde{\epsilon}=\frac{\epsilon}{M+nN} and let \delta be the smaller of the two resulting radii r_1 and r_2. Within this smaller radius, the desired inequality will hold.
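To spell out why the two estimates are enough, here is the triangle-inequality step written out as a sketch. It uses only the two bounds above and the choice \tilde{\epsilon}=\frac{\epsilon}{M+nN}:

\displaystyle\begin{aligned}\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))dg(x)t\right\rVert&\leq\left\lVert\left[f(g(x+t))-f(g(x))\right]-df(g(x))\left(g(x+t)-g(x)\right)\right\rVert\\&\quad+\left\lVert df(g(x))\left(g(x+t)-g(x)\right)-df(g(x))dg(x)t\right\rVert\\&<\tilde{\epsilon}M\lVert t\rVert+\tilde{\epsilon}nN\lVert t\rVert\\&=\tilde{\epsilon}(M+nN)\lVert t\rVert\\&=\epsilon\lVert t\rVert\end{aligned}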

To get this result, we choose orthonormal coordinates on the space \mathbb{R}^n. We can then use these coordinates to write

\displaystyle df(g(x))\left(\left[g(x+t)-g(x)\right]-dg(x)t\right)=\left[D_if\right](g(x))\left(\left[g^i(x+t)-g^i(x)\right]-dg^i(x)t\right)

But since each of the several g^i is differentiable we can pick our radius r_2 so that all of the inequalities

\displaystyle\left\lvert\left[g^i(x+t)-g^i(x)\right]-dg^i(x)t\right\rvert<\tilde{\epsilon}\lVert t\rVert

hold for r_2>\lVert t\rVert>0. Then we let N be the magnitude of the largest of the component partial derivatives \left\lVert\left[D_if\right](g(x))\right\rVert, and we’re done.
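For the record, the estimate can be written out as follows. This is a sketch that makes the sum over i explicit and uses only the triangle inequality, the component inequalities above, and the definition of N:

\displaystyle\begin{aligned}\left\lVert df(g(x))\left(\left[g(x+t)-g(x)\right]-dg(x)t\right)\right\rVert&\leq\sum\limits_{i=1}^n\left\lVert\left[D_if\right](g(x))\right\rVert\left\lvert\left[g^i(x+t)-g^i(x)\right]-dg^i(x)t\right\rvert\\&\leq\sum\limits_{i=1}^nN\tilde{\epsilon}\lVert t\rVert\\&=\tilde{\epsilon}nN\lVert t\rVert\end{aligned}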

Thus when g is differentiable at x and f is differentiable at g(x), then the composite f\circ g is differentiable at x, and the differential of the composite function is given by

\displaystyle d\left[f\circ g\right](x)=df(g(x))dg(x)

the composite of the differentials, considered as linear transformations.
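The approximation condition can also be watched numerically. The following is a minimal sketch with hypothetical maps g and f (my own choices, not taken from anything above) that measures the error of the composite differential relative to \lVert t\rVert as the displacement shrinks along a fixed direction.

```python
import math

# Hypothetical maps for illustration: g: R^2 -> R^2 and f: R^2 -> R.
def g(x):
    return (x[0] ** 2 - x[1], x[0] * x[1])

def f(y):
    return math.sin(y[0]) + y[1] ** 2

def composite_differential(x, t):
    # dg(x) t, using the partial derivatives of g worked out by hand.
    s0 = 2 * x[0] * t[0] - t[1]
    s1 = x[1] * t[0] + x[0] * t[1]
    # df(g(x)) applied to that displacement.
    y = g(x)
    return math.cos(y[0]) * s0 + 2 * y[1] * s1

def error_ratio(x, t):
    # |[f(g(x+t)) - f(g(x))] - df(g(x)) dg(x) t| divided by the length of t
    actual = f(g((x[0] + t[0], x[1] + t[1]))) - f(g(x))
    return abs(actual - composite_differential(x, t)) / math.hypot(t[0], t[1])

x0 = (0.5, 1.2)
direction = (0.6, -0.8)  # a unit vector
ratios = [error_ratio(x0, (direction[0] * h, direction[1] * h))
          for h in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

The ratios should shrink along with the displacement, which is what the \epsilon-\delta condition demands. Of course this only probes one point and one direction, so it illustrates the theorem rather than proving anything.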

October 7, 2009 Posted by | Analysis, Calculus | 20 Comments

Vector-Valued Functions

Now we know how to modify the notion of the derivative of a function to deal with vector inputs by defining the differential. But what about functions that have vectors as outputs?

Well, luckily when we defined the differential we didn’t really use anything about the space where our function took its values except that it was a topological vector space. Indeed, when defining the differential of a function f:\mathbb{R}^m\rightarrow\mathbb{R}^n we need to set up a new function that takes a point x in the Euclidean space \mathbb{R}^m and a displacement vector t in \mathbb{R}^m as inputs, and which gives a displacement vector df(x;t) in \mathbb{R}^n as its output. It must be linear in the displacement, meaning that we can view df(x) as a linear transformation from \mathbb{R}^m to \mathbb{R}^n. And it must satisfy a similar approximation condition, replacing the absolute value with the notion of length in \mathbb{R}^n: for every \epsilon>0 there is a \delta>0 so that if \delta>\lVert t\rVert_{\mathbb{R}^m}>0 we have

\displaystyle\lVert\left[f(x+t)-f(x)\right]-df(x)t\rVert_{\mathbb{R}^n}<\epsilon\lVert t\rVert_{\mathbb{R}^m}

From here on we’ll just determine which norm we mean by context, since we only have one norm on each vector space.

Okay, so we can talk about differentials of vector-valued functions: the differential of f at a point x (if it exists) is a linear transformation df(x) that turns displacements in the input space into displacements in the output space, and does so in the way that most closely approximates the action of the function itself. But how do we define the function?

If we pick an orthonormal basis e_i for \mathbb{R}^n we can write the components of f as separate functions. That is, we say

\displaystyle f(x)=f^i(x)e_i

Now I assert that the differential can be taken component-by-component, just as continuity works: df(x)=df^i(x)e_i. On the left is the differential of f as a vector-valued function, while on the right we find the differentials of the several real-valued functions f^i. The differential exists if and only if the component differentials do.

First, from components to the vector-valued function. Clearly this definition of df(x) gives us a linear map from displacements in \mathbb{R}^m to displacements in \mathbb{R}^n. But does it satisfy the approximation inequality? Indeed, for every \epsilon>0 we can find a \delta so that all the inequalities

\displaystyle\left\lvert\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right\rvert<\frac{\epsilon}{n}\lVert t\rVert

are satisfied when \delta>\lVert t\rVert>0. Of course, there are different \deltas that work for each component, but we can pick the smallest of them. Then it’s a simple matter to find

\displaystyle\begin{aligned}\left\lVert\left[f(x+t)-f(x)\right]-df(x)t\right\rVert&=\left\lVert\left(\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right)e_i\right\rVert\\&\leq\left\lvert\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right\rvert\lVert e_i\rVert\\&=\sum\limits_{i=1}^n\left\lvert\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right\rvert\\&<\sum\limits_{i=1}^n\frac{\epsilon}{n}\lVert t\rVert\\&=\epsilon\lVert t\rVert\end{aligned}

so if the component functions are differentiable, then so is the function as a whole.

On the other hand, if the differential df(x) exists then for every \epsilon>0 there exists a \delta>0 so that if \delta>\lVert t\rVert>0 we have

\displaystyle\lVert\left[f(x+t)-f(x)\right]-df(x)t\rVert<\epsilon\lVert t\rVert

But then it’s easy to see that

\displaystyle\begin{aligned}\left\lvert\left[f^k(x+t)-f^k(x)\right]-df^k(x)t\right\rvert&=\sqrt{\left\lvert\left[f^k(x+t)-f^k(x)\right]-df^k(x)t\right\rvert^2}\\&\leq\sqrt{\sum\limits_{i=1}^n\left\lvert\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right\rvert^2}\\&=\lVert\left[f(x+t)-f(x)\right]-df(x)t\rVert\\&<\epsilon\lVert t\rVert\end{aligned}

and so each of the component differentials exists.

Finally, I should mention that if we also pick an orthonormal basis \tilde{e}_j for the input space \mathbb{R}^m we can expand each component differential df^i(x) in terms of the dual basis dx^j:

\displaystyle df^i(x)=\frac{\partial f^i}{\partial x^1}dx^1+\dots+\frac{\partial f^i}{\partial x^m}dx^m=\frac{\partial f^i}{\partial x^j}dx^j

Then we can write the whole differential df(x) out as a matrix whose entry in the ith row and jth column is \frac{\partial f^i}{\partial x^j}. If we write a displacement in the input as an m-dimensional column vector we find our estimate of the displacement in the output as an n-dimensional column vector:

\displaystyle\begin{pmatrix}df^1(x;t)\\\vdots\\df^n(x;t)\end{pmatrix}=\begin{pmatrix}\frac{\partial f^1}{\partial x^1}&\dots&\frac{\partial f^1}{\partial x^m}\\\vdots&\ddots&\vdots\\\frac{\partial f^n}{\partial x^1}&\dots&\frac{\partial f^n}{\partial x^m}\end{pmatrix}\begin{pmatrix}t^1\\\vdots\\t^m\end{pmatrix}
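Here is a small sketch of this matrix picture in Python. The function f:\mathbb{R}^2\rightarrow\mathbb{R}^3 is a hypothetical example given by its components, and the partial derivatives are only estimated by central differences, so the matrix is an approximation to the true one.

```python
import math

# Hypothetical function f: R^2 -> R^3, written as a list of its components f^i.
components = [
    lambda x: x[0] * x[1],
    lambda x: math.sin(x[0]) + x[1],
    lambda x: x[0] ** 2,
]

def partial(fi, x, j, h=1e-6):
    # Central-difference estimate of the partial derivative of f^i in x^j.
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    return (fi(xp) - fi(xm)) / (2 * h)

def jacobian(x):
    # Row i holds the coefficients of df^i(x) in the basis dx^1, ..., dx^m.
    return [[partial(fi, x, j) for j in range(len(x))] for fi in components]

def apply(matrix, t):
    # Matrix times column vector.
    return [sum(row[j] * t[j] for j in range(len(t))) for row in matrix]

x0 = [1.0, 2.0]
t = [1e-3, -2e-3]
estimate = apply(jacobian(x0), t)
actual = [fi([x0[0] + t[0], x0[1] + t[1]]) - fi(x0) for fi in components]
print(estimate)
print(actual)
```

For a small displacement like this one the estimated and actual output displacements should agree closely, with a discrepancy of second order in the size of t.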

October 6, 2009 Posted by | Analysis, Calculus | 5 Comments