## Higher Differentials and Composite Functions

Last time we saw an example of what can go wrong when we try to translate higher differentials the way we did the first-order differential. Today I want to identify exactly what goes wrong, and I’ll make use of the summation convention to greatly simplify the process.

So, let’s take a function $f$ of $n$ variables $x^1,\dots,x^n$, and a collection of $n$ functions $x^i$, each depending on the $m$ variables $t^1,\dots,t^m$. We can think of these as the components of a vector-valued function $x$ which has continuous second partial derivatives on some region $S\subseteq\mathbb{R}^m$. If the function $f$ has continuous second partial derivatives on some region containing the image $x(S)$, then we can compose the two functions to give a single function $g(t)=f(x(t))$, and we’re going to investigate the second differential of $g$ with respect to the variables $t^j$.

To that end, we want to calculate the second partial derivative

$$\frac{\partial^2 g}{\partial t^j\partial t^k}$$

First, we take the derivative in terms of $t^k$, and we use the chain rule to write

$$\frac{\partial g}{\partial t^k}=\frac{\partial f}{\partial x^i}\frac{\partial x^i}{\partial t^k}$$

Now we have to take the derivative in terms of $t^j$. Luckily, this operation is linear, so we don’t have to worry about the hidden summations in the notation. We do, however, have to use the product rule to handle the multiplications

$$\frac{\partial^2 g}{\partial t^j\partial t^k}=\frac{\partial^2 f}{\partial x^l\partial x^i}\frac{\partial x^l}{\partial t^j}\frac{\partial x^i}{\partial t^k}+\frac{\partial f}{\partial x^i}\frac{\partial^2 x^i}{\partial t^j\partial t^k}$$

where we’ve used the chain rule again to convert a derivative in terms of $t^j$ into one in terms of $x^l$.

And here we’ve come to the problem itself. For now we can write out the second differential in terms of the $dt^j$:

$$d^2g=\frac{\partial^2 g}{\partial t^j\partial t^k}dt^jdt^k=\frac{\partial^2 f}{\partial x^l\partial x^i}dx^ldx^i+\frac{\partial f}{\partial x^i}d^2x^i$$

The first term here is the second differential in terms of the $dx^i$. If there were an analogue of Cauchy’s invariant rule, this would be all there is to the formula. But we’ve got another term — one due to the product rule — based on the second differentials $d^2x^i$ of the functions $x^i$ themselves. This is the term that ruins the nice transformation properties of higher differentials, and which makes them unsuitable for many of our purposes.

Notice, though, that we have *not* contradicted Clairaut’s theorem here. Indeed, as long as $f$ and all the $x^i$ have continuous second partial derivatives, then so will $g$. Further, the formula we derived for the second partial derivatives of $g$ is manifestly symmetric between the two derivatives, and so the mixed partials commute.
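To make the formula concrete, here is a quick symbolic check of the derivation above. The particular component functions and the outer function are hypothetical choices of mine, made only for illustration; any smooth functions would do:

```python
import sympy as sp

# Check the formula
#   d²g/dt^j dt^k = d²f/dx^l dx^i · dx^l/dt^j · dx^i/dt^k + df/dx^i · d²x^i/dt^j dt^k
# with the summations written out explicitly.
s, t = sp.symbols('s t')
x1, x2 = s*t, s + t**2          # hypothetical inner components x^i(s, t)
u, v = sp.symbols('u v')
f = u**2 * v                    # hypothetical outer function f(x^1, x^2)

g = f.subs({u: x1, v: x2})      # the composite g = f∘x

lhs = sp.diff(g, s, t)

xs = [x1, x2]
us = [u, v]
first = sum(sp.diff(f, us[l], us[i]).subs({u: x1, v: x2})
            * sp.diff(xs[l], s) * sp.diff(xs[i], t)
            for l in range(2) for i in range(2))
second = sum(sp.diff(f, us[i]).subs({u: x1, v: x2}) * sp.diff(xs[i], s, t)
             for i in range(2))

assert sp.simplify(lhs - (first + second)) == 0
print("formula verified")
```

Note that `second` vanishes exactly when the substitution is affine, which is when an invariance rule for second differentials would hold.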

## Higher-Order Differentials

Just like we assembled partial derivatives into the differential of a function, so we can assemble higher partial derivatives into higher-order differentials. The differential measures how the function itself changes as we move around, and the higher differentials will measure how lower differentials change.

First let’s look at the second-order differential of a real-valued function $f$ of $n$ variables $x^1,\dots,x^n$. We’ll use the $dx^i$ as a basis for the space of differentials, which allows us to write out the components of the differential:

$$df=\frac{\partial f}{\partial x^i}dx^i$$

So, just as we did for vector-valued functions, we’ll just take the differentials of each of these components separately, and then cobble them together:

$$d\left(\frac{\partial f}{\partial x^i}\right)=\frac{\partial^2 f}{\partial x^j\partial x^i}dx^j$$

Now this second displacement may have nothing to do with the first, but it should be the same for all components. That is, we could write out the second differential as a function of not only the point $x$ but of two displacements $s$ and $t$ from the point:

$$d^2f(x;s,t)=\frac{\partial^2 f}{\partial x^j\partial x^i}s^jt^i$$

Commonly we’ll collapse this into a function of a point and a single displacement. We just put the same vector in for both $s$ and $t$:

$$d^2f(x;t)=\frac{\partial^2 f}{\partial x^j\partial x^i}t^jt^i$$

Unfortunately, these higher differentials are more complicated than our first-order differentials. In particular, they don’t obey anything like Cauchy’s invariant rule, meaning they don’t transform well when we compose functions. As an example, let’s go back and look at the polar coordinate transform again:

$$x=r\cos(\theta)\qquad y=r\sin(\theta)$$

We’ve seen how we can use Cauchy’s invariant rule to rewrite differentials:

$$dx=\cos(\theta)dr-r\sin(\theta)d\theta\qquad dy=\sin(\theta)dr+r\cos(\theta)d\theta$$

We can also invert the transformation and rewrite differential operators:

$$\frac{\partial}{\partial x}=\cos(\theta)\frac{\partial}{\partial r}-\frac{\sin(\theta)}{r}\frac{\partial}{\partial\theta}\qquad\frac{\partial}{\partial y}=\sin(\theta)\frac{\partial}{\partial r}+\frac{\cos(\theta)}{r}\frac{\partial}{\partial\theta}$$

So let’s take our second-order differential

$$d^2f=\frac{\partial^2 f}{\partial x^2}dx^2+2\frac{\partial^2 f}{\partial x\partial y}dx\,dy+\frac{\partial^2 f}{\partial y^2}dy^2$$

and try to rewrite it. The nasty bit is working out all these second-order partial derivatives in terms of $r$ and $\theta$:

$$\begin{aligned}\frac{\partial^2 f}{\partial x^2}&=\cos(\theta)^2\frac{\partial^2 f}{\partial r^2}-\frac{2\sin(\theta)\cos(\theta)}{r}\frac{\partial^2 f}{\partial r\partial\theta}+\frac{\sin(\theta)^2}{r^2}\frac{\partial^2 f}{\partial\theta^2}+\frac{\sin(\theta)^2}{r}\frac{\partial f}{\partial r}+\frac{2\sin(\theta)\cos(\theta)}{r^2}\frac{\partial f}{\partial\theta}\\\frac{\partial^2 f}{\partial x\partial y}&=\sin(\theta)\cos(\theta)\frac{\partial^2 f}{\partial r^2}+\frac{\cos(\theta)^2-\sin(\theta)^2}{r}\frac{\partial^2 f}{\partial r\partial\theta}-\frac{\sin(\theta)\cos(\theta)}{r^2}\frac{\partial^2 f}{\partial\theta^2}-\frac{\sin(\theta)\cos(\theta)}{r}\frac{\partial f}{\partial r}+\frac{\sin(\theta)^2-\cos(\theta)^2}{r^2}\frac{\partial f}{\partial\theta}\\\frac{\partial^2 f}{\partial y^2}&=\sin(\theta)^2\frac{\partial^2 f}{\partial r^2}+\frac{2\sin(\theta)\cos(\theta)}{r}\frac{\partial^2 f}{\partial r\partial\theta}+\frac{\cos(\theta)^2}{r^2}\frac{\partial^2 f}{\partial\theta^2}+\frac{\cos(\theta)^2}{r}\frac{\partial f}{\partial r}-\frac{2\sin(\theta)\cos(\theta)}{r^2}\frac{\partial f}{\partial\theta}\end{aligned}$$

After that it’s no trouble at all to transform the differential terms

$$\begin{aligned}dx^2&=\cos(\theta)^2dr^2-2r\sin(\theta)\cos(\theta)dr\,d\theta+r^2\sin(\theta)^2d\theta^2\\dx\,dy&=\sin(\theta)\cos(\theta)dr^2+r\left(\cos(\theta)^2-\sin(\theta)^2\right)dr\,d\theta-r^2\sin(\theta)\cos(\theta)d\theta^2\\dy^2&=\sin(\theta)^2dr^2+2r\sin(\theta)\cos(\theta)dr\,d\theta+r^2\cos(\theta)^2d\theta^2\end{aligned}$$

Let’s just work out the component that goes with $dr\,d\theta$ when we put these all together

$$2\frac{\partial^2 f}{\partial r\partial\theta}-\frac{2}{r}\frac{\partial f}{\partial\theta}$$

Which has an extraneous term! If an invariance rule held, we should just get $2\frac{\partial^2 f}{\partial r\partial\theta}$.

The difference comes from the way that the differential operators *themselves* change as we move our point around. Increasing $\theta$ by a little bit means something different at a point with $r=1$ than it does at a point with $r=2$. This doesn’t really matter when we’re talking about first-order differentials because we’re never putting two differential operators together, and so we never get any measurement of how an operator changes from point to point. We will eventually learn how to compensate for this effect, but that will wait until we have a significantly more general approach.
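We can watch the extraneous term appear symbolically. The test function below is an arbitrary choice of mine; the computation transforms the $dr\,d\theta$ coefficient exactly as in the text and checks that it misses $2\,\partial^2f/\partial r\partial\theta$ by $\frac{2}{r}\frac{\partial f}{\partial\theta}$:

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
X, Y = r*sp.cos(th), r*sp.sin(th)

# an arbitrary smooth test function, chosen only for illustration
f = lambda x, y: x**2 * y
F = f(X, Y)                       # the same function written in polar coordinates

# second x,y-partials, rewritten in terms of r and theta
x, y = sp.symbols('x y')
fxx = sp.diff(f(x, y), x, 2).subs({x: X, y: Y})
fxy = sp.diff(f(x, y), x, y).subs({x: X, y: Y})
fyy = sp.diff(f(x, y), y, 2).subs({x: X, y: Y})

# coefficient of dr dtheta after substituting dx, dy into d^2 f
coeff = (fxx*(-2*r*sp.sin(th)*sp.cos(th))
         + 2*fxy*r*(sp.cos(th)**2 - sp.sin(th)**2)
         + fyy*(2*r*sp.sin(th)*sp.cos(th)))

# the transformed coefficient is 2 F_{r theta} minus the extraneous (2/r) F_theta
extra = sp.simplify(2*sp.diff(F, r, th) - coeff - (2/r)*sp.diff(F, th))
assert extra == 0
print("extraneous term is (2/r) df/dtheta")
```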

## Clairaut’s Theorem

Now for the most common sufficient condition ensuring that mixed partial derivatives commute. If $f$ is a function of $n$ variables, we can for the moment hold the values of all but two of them constant. We’ll only consider two variables at a time, which will simplify our notation. For the moment, then, we write $f(x,y)$. We will also assume that $f$ is real-valued, and deal with vector values one component at a time.

I assert that if the partial derivatives $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ are continuous in a neighborhood of the point $(x_0,y_0)$, and if the mixed second partial derivative $\frac{\partial^2 f}{\partial x\partial y}$ exists and is continuous there, then the other mixed partial derivative $\frac{\partial^2 f}{\partial y\partial x}$ exists at $(x_0,y_0)$, and we have the equality

$$\frac{\partial^2 f}{\partial y\partial x}\bigg|_{(x_0,y_0)}=\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(x_0,y_0)}$$

By definition, within the neighborhood in the statement of the theorem the partial derivative $\frac{\partial f}{\partial x}$ is given by the limit

$$\frac{\partial f}{\partial x}\bigg|_{(x,y)}=\lim\limits_{\Delta x\to0}\frac{f(x+\Delta x,y)-f(x,y)}{\Delta x}$$

So the numerator of the difference quotient defining the desired mixed partial derivative is

$$\frac{\partial f}{\partial x}\bigg|_{(x_0,y_0+\Delta y)}-\frac{\partial f}{\partial x}\bigg|_{(x_0,y_0)}=\lim\limits_{\Delta x\to0}\frac{f(x_0+\Delta x,y_0+\Delta y)-f(x_0,y_0+\Delta y)-f(x_0+\Delta x,y_0)+f(x_0,y_0)}{\Delta x}$$

For a fixed $\Delta x$, we define the function

$$g(y)=f(x_0+\Delta x,y)-f(x_0,y)$$

We compute the derivative of $g$ as

$$g'(y)=\frac{\partial f}{\partial y}\bigg|_{(x_0+\Delta x,y)}-\frac{\partial f}{\partial y}\bigg|_{(x_0,y)}$$

so we can apply the mean value theorem to write

$$g(y_0+\Delta y)-g(y_0)=g'(\eta)\Delta y$$

for some $\eta$ between $y_0$ and $y_0+\Delta y$. We use the above expression for $g'$ to write the difference quotient

$$\frac{g(y_0+\Delta y)-g(y_0)}{\Delta x\,\Delta y}=\frac{1}{\Delta x}\left(\frac{\partial f}{\partial y}\bigg|_{(x_0+\Delta x,\eta)}-\frac{\partial f}{\partial y}\bigg|_{(x_0,\eta)}\right)$$

In a similar trick to the one above, we can see that $\frac{\partial f}{\partial y}\big|_{(x,\eta)}$ is differentiable as a function of $x$ with derivative $\frac{\partial^2 f}{\partial x\partial y}\big|_{(x,\eta)}$. And so the mean value theorem tells us that we can write our difference quotient as

$$\frac{g(y_0+\Delta y)-g(y_0)}{\Delta x\,\Delta y}=\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(\xi,\eta)}$$

for some $\xi$ between $x_0$ and $x_0+\Delta x$.

And so we come to try taking the limit

$$\frac{\partial^2 f}{\partial y\partial x}\bigg|_{(x_0,y_0)}=\lim\limits_{\Delta y\to0}\lim\limits_{\Delta x\to0}\frac{g(y_0+\Delta y)-g(y_0)}{\Delta x\,\Delta y}=\lim\limits_{\Delta y\to0}\lim\limits_{\Delta x\to0}\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(\xi,\eta)}$$

If the point $(\xi,\eta)$ didn’t depend in its definition on $\Delta x$ and $\Delta y$, this would be easy. First we could let $\Delta x$ go to zero, which would make $\xi$ go to $x_0$, and then letting $\Delta y$ go to zero would make $\eta$ go to $y_0$ as well. But it’s not going to be quite so easy, and limits in two variables like this usually call for some delicacy.

Given an $\epsilon>0$, there (by the assumption of continuity) is some $\delta$ so that

$$\left|\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(x,y)}-\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(x_0,y_0)}\right|<\epsilon$$

for $(x,y)$ within a radius $\delta$ of $(x_0,y_0)$. As long as we keep $|\Delta x|$ and $|\Delta y|$ below $\frac{\delta}{2}$, the point $(\xi,\eta)$ will be within this radius. So we can keep $\Delta y$ fixed at some small enough value, and find that $0<|\Delta x|<\frac{\delta}{2}$ implies the inequality

$$\left|\frac{g(y_0+\Delta y)-g(y_0)}{\Delta x\,\Delta y}-\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(x_0,y_0)}\right|<\epsilon$$

Now we can take the limit as $\Delta x$ goes to zero. As we do so, the inequality here may become an equality, but since we kept $|\Delta y|$ below $\frac{\delta}{2}$, we still have some wiggle room. So, if $0<|\Delta y|<\frac{\delta}{2}$, we have the inequality

$$\left|\frac{1}{\Delta y}\left(\frac{\partial f}{\partial x}\bigg|_{(x_0,y_0+\Delta y)}-\frac{\partial f}{\partial x}\bigg|_{(x_0,y_0)}\right)-\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(x_0,y_0)}\right|\leq\epsilon$$

which gives us the limit we need.

Of course we could instead assume that the other mixed partial derivative $\frac{\partial^2 f}{\partial y\partial x}$ exists and is continuous near $(x_0,y_0)$, and conclude that the first one exists and is equal to the second.
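The double difference quotient at the heart of the proof is easy to watch converge numerically. The test function and base point below are arbitrary choices of mine for illustration:

```python
import math

# The symmetric double difference quotient from the proof:
#   [f(x0+dx, y0+dy) - f(x0, y0+dy) - f(x0+dx, y0) + f(x0, y0)] / (dx dy)
# approaches the (common) mixed partial when it is continuous.
def f(x, y):
    return math.sin(x * y) + x**2 * y

def double_quotient(x0, y0, dx, dy):
    return (f(x0 + dx, y0 + dy) - f(x0, y0 + dy)
            - f(x0 + dx, y0) + f(x0, y0)) / (dx * dy)

# exact mixed partial: d²f/dy dx = cos(xy) - xy sin(xy) + 2x
x0, y0 = 0.7, 0.3
exact = math.cos(x0*y0) - x0*y0*math.sin(x0*y0) + 2*x0

for h in (1e-2, 1e-3, 1e-4):
    print(h, double_quotient(x0, y0, h, h))
print("exact:", exact)
```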

## Higher Partial Derivatives

Let’s say we’ve got a function $f:X\rightarrow\mathbb{R}$ that’s differentiable within an open region $X\subseteq\mathbb{R}^n$. In particular, if we pick coordinates $x^1,\dots,x^n$ on $\mathbb{R}^n$ the function has all partial derivatives at each point in $X$. As we move around within $X$ the value of the partial derivative changes, justifying the functional notation $\frac{\partial f}{\partial x^i}:X\rightarrow\mathbb{R}$. And if we’re lucky, these functions themselves may be differentiable.

In particular, it makes sense to ask about the existence of so-called “second partial derivatives”, defined as

$$\frac{\partial}{\partial x^j}\left(\frac{\partial f}{\partial x^i}\right)$$

Or in Leibniz’ notation:

$$\frac{\partial^2 f}{\partial x^j\partial x^i}$$

If we take the derivative in terms of the same variable twice in a row we sometimes write this as

$$\frac{\partial^2 f}{{\partial x^i}^2}$$

Yes, there’s some dissonance between superscripts as indices and superscripts as powers. But, again, this is pretty much the received notation in many areas. If it seems like it might be confusing we just write out $\partial x^i$ twice in a row.

These, of course, may be defined within the region $X$, and we can then sensibly ask about *third* partial derivatives, like

$$\frac{\partial^3 f}{\partial x^k\partial x^j\partial x^i}$$

and so on.

As an example, let’s consider the function $f(x,y)=x^3y^2$. We can easily calculate the two first partial derivatives:

$$\frac{\partial f}{\partial x}=3x^2y^2\qquad\frac{\partial f}{\partial y}=2x^3y$$

And then we take each derivative of each of these two

$$\frac{\partial^2 f}{\partial x^2}=6xy^2\qquad\frac{\partial^2 f}{\partial y\partial x}=6x^2y\qquad\frac{\partial^2 f}{\partial x\partial y}=6x^2y\qquad\frac{\partial^2 f}{\partial y^2}=2x^3$$

where we write $\partial x^2$ and $\partial y^2$; since we’re not using superscripts as indices in these examples the meaning should be clear.

We notice here that the two in the middle — the “mixed” partial derivatives — are the same. This will happen in many cases of interest to us, but not always. As a pathological example, let’s go back and consider the function $f:\mathbb{R}^2\rightarrow\mathbb{R}$ defined by

$$f(x,y)=\frac{xy\left(x^2-y^2\right)}{x^2+y^2}$$

away from the origin, and patched by $f(0,0)=0$. Again, we calculate the first partial derivatives (at least away from the origin):

$$\frac{\partial f}{\partial x}=\frac{y\left(x^4+4x^2y^2-y^4\right)}{\left(x^2+y^2\right)^2}\qquad\frac{\partial f}{\partial y}=\frac{x\left(x^4-4x^2y^2-y^4\right)}{\left(x^2+y^2\right)^2}$$

Each partial derivative is $0$ at the origin.

Now we can check that $\frac{\partial f}{\partial x}\big|_{(0,y)}=-y$ for all $y$, and that $\frac{\partial f}{\partial y}\big|_{(x,0)}=x$ for all $x$. Thus we can calculate

$$\frac{\partial^2 f}{\partial y\partial x}\bigg|_{(0,0)}=\lim\limits_{\Delta y\to0}\frac{-\Delta y-0}{\Delta y}=-1\qquad\frac{\partial^2 f}{\partial x\partial y}\bigg|_{(0,0)}=\lim\limits_{\Delta x\to0}\frac{\Delta x-0}{\Delta x}=1$$

and the mixed partial derivatives are *not* equal.
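The axis values claimed above can be verified symbolically:

```python
import sympy as sp

x, y = sp.symbols('x y')
# the pathological function, away from the origin
f = x*y*(x**2 - y**2)/(x**2 + y**2)

fx = sp.simplify(sp.diff(f, x))
fy = sp.simplify(sp.diff(f, y))

# along the axes the first partials reduce to -y and x respectively,
# so the difference quotients at the origin give mixed partials -1 and +1
assert sp.simplify(fx.subs(x, 0)) == -y
assert sp.simplify(fy.subs(y, 0)) == x
print("d²f/dydx(0,0) = -1, d²f/dxdy(0,0) = +1")
```

Clairaut’s theorem is not contradicted: the mixed partials of this function exist everywhere but fail to be continuous at the origin.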

## The Mean Value Theorem

Here’s a nice technical result we may have call for from time to time: a higher-dimensional version of the differential mean value theorem. Remember that this says that if we’ve got a function continuous on the closed interval $[a,b]$ and differentiable on its interior, there is some point in the middle where the derivative of the function is the same as the average — the mean — rate of change of the function over the interval. In more than one dimension we’re going to modify this a bit to make it clearer what it means.

First of all, instead of talking about the closed interval $[a,b]$, we’re going to use the closed straight line segment $[x,y]$. That is, the collection of all the points between $x$ and $y$ in a straight line, and including the endpoints. We first look at the total displacement $y-x$ from one point to the other. Then we start at $x$ and move some portion of this displacement towards $y$. That is, the closed line segment $[x,y]$ consists of all points of the form $x+t(y-x)$ for $t$ in the closed interval $[0,1]$. Setting $t=0$ gives us the point $x$, and $t=1$ gives us the point $y$. Similarly, the open line segment $(x,y)$ consists of all points of the form $x+t(y-x)$ for $t$ in the *open* interval $(0,1)$.

Next, we have to be clear about the average rate of change. As we move from $x$ to $y$, the value of the function changes by $f(y)-f(x)$. It takes a displacement of $y-x$ to get there, so on average the rate of change is

$$\frac{f(y)-f(x)}{\|y-x\|}$$

Finally, we don’t just have a single value for the instantaneous rate of change, we have a differential $df$. But we can use it to find directional derivatives. Specifically, we’ll consider the derivative of $f$ in the direction pointing from $x$ to $y$. We’ll pick out this direction with the unit vector we get by normalizing the displacement

$$u=\frac{y-x}{\|y-x\|}$$

So the mean value theorem will tell us that if $f$ is differentiable in some open region that contains the whole closed line segment $[x,y]$, then there is some point $\xi$ in the open line segment $(x,y)$ so that the average rate of change of $f$ from $x$ to $y$ is equal to the directional derivative of $f$ at $\xi$ in the direction pointing from $x$ to $y$:

$$\frac{f(y)-f(x)}{\|y-x\|}=df(\xi)\frac{y-x}{\|y-x\|}$$

or, more simply

$$f(y)-f(x)=df(\xi)(y-x)$$

We’ll get at this by changing $f$ to a function of one variable so we can bring the one-dimensional version to bear. To that end, we define $h(t)=f(x+t(y-x))$ for $t$ in the closed interval $[0,1]$. Then $h(1)-h(0)=f(y)-f(x)$, and we can also show that $h$ is differentiable everywhere inside the interval. Indeed, we can evaluate the difference quotient

$$\frac{h(t+\Delta t)-h(t)}{\Delta t}=\frac{f\left(x+t(y-x)+\Delta t(y-x)\right)-f\left(x+t(y-x)\right)}{\Delta t}$$

Taking the limit as $\Delta t$ approaches $0$, we find

$$h'(t)=df\left(x+t(y-x)\right)(y-x)$$

which exists since $f$ is differentiable.

So our old differential mean-value theorem tells us that there is some $t_0\in(0,1)$ so that

$$f(y)-f(x)=h(1)-h(0)=h'(t_0)=df(\xi)(y-x)$$

where $\xi=x+t_0(y-x)$ is a point in the open line segment $(x,y)$.
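We can locate such a mean value point numerically for a concrete case. The function, gradient, and segment endpoints below are arbitrary illustrative choices of mine:

```python
# Find t0 with f(b) - f(a) = df(xi)(b - a), xi = a + t0(b - a).
def f(p):
    x, y = p
    return x**2 + x*y + y**2

def grad_f(p):                     # df as the gradient
    x, y = p
    return (2*x + y, x + 2*y)

a, b = (0.0, 0.0), (1.0, 2.0)
disp = (b[0] - a[0], b[1] - a[1])
change = f(b) - f(a)

def directional(t):
    # df at the point a + t(b - a), applied to the displacement b - a
    p = (a[0] + t*disp[0], a[1] + t*disp[1])
    g = grad_f(p)
    return g[0]*disp[0] + g[1]*disp[1]

# bisection on directional(t) - change, which is increasing for this f
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if directional(mid) < change:
        lo = mid
    else:
        hi = mid
print("t0 =", (lo + hi) / 2)
```

For this quadratic function the mean value point sits exactly halfway along the segment, at $t_0=\frac{1}{2}$.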

## Transforming Differential Operators

Because of the chain rule and Cauchy’s invariant rule, we know that we can transform differentials along with functions. For example, if we write

$$x=r\cos(\theta)\qquad y=r\sin(\theta)$$

we can write the differentials of $x$ and $y$ in terms of the differentials of $r$ and $\theta$:

$$dx=\cos(\theta)dr-r\sin(\theta)d\theta\qquad dy=\sin(\theta)dr+r\cos(\theta)d\theta$$

It turns out that the chain rule also tells us how to rewrite differential operators in terms of the new variables. But these go in the *other* direction. That is, we can write the differential operators $\frac{\partial}{\partial r}$ and $\frac{\partial}{\partial\theta}$ in terms of the operators $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$.

First of all, let’s write down the differential of $f$ in terms of $x$ and $y$ and in terms of $r$ and $\theta$:

$$df=\frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy=\frac{\partial f}{\partial r}dr+\frac{\partial f}{\partial\theta}d\theta$$

and now we can rewrite $dx$ and $dy$ in terms of $dr$ and $d\theta$:

$$df=\left(\cos(\theta)\frac{\partial f}{\partial x}+\sin(\theta)\frac{\partial f}{\partial y}\right)dr+\left(-r\sin(\theta)\frac{\partial f}{\partial x}+r\cos(\theta)\frac{\partial f}{\partial y}\right)d\theta$$

Now by uniqueness we can read off the partial derivatives of $f$ in terms of $r$ and $\theta$:

$$\frac{\partial f}{\partial r}=\cos(\theta)\frac{\partial f}{\partial x}+\sin(\theta)\frac{\partial f}{\partial y}\qquad\frac{\partial f}{\partial\theta}=-r\sin(\theta)\frac{\partial f}{\partial x}+r\cos(\theta)\frac{\partial f}{\partial y}$$

Finally, we pull all mention of $f$ out of our notation and just write out the differential operators.

$$\frac{\partial}{\partial r}=\cos(\theta)\frac{\partial}{\partial x}+\sin(\theta)\frac{\partial}{\partial y}\qquad\frac{\partial}{\partial\theta}=-r\sin(\theta)\frac{\partial}{\partial x}+r\cos(\theta)\frac{\partial}{\partial y}$$

Now we’re done rewriting, but for good form we should express these coefficients in terms of $x$ and $y$.

$$\frac{\partial}{\partial r}=\frac{x}{\sqrt{x^2+y^2}}\frac{\partial}{\partial x}+\frac{y}{\sqrt{x^2+y^2}}\frac{\partial}{\partial y}\qquad\frac{\partial}{\partial\theta}=-y\frac{\partial}{\partial x}+x\frac{\partial}{\partial y}$$

It’s important to note that there’s really no difference between these last two steps. The first one uses the variables $r$ and $\theta$ while the second uses the variables $x$ and $y$, but they express the exact same functions, given the original substitutions above.

More generally, let’s say we have a vector-valued function $g:\mathbb{R}^n\rightarrow\mathbb{R}^m$ defining a substitution

$$x^j=g^j(t^1,\dots,t^n)$$

Cauchy’s invariant rule tells us that this gives rise to a substitution for differentials.

$$dx^j=\frac{\partial g^j}{\partial t^i}dt^i$$

We can play it a little loose and write this out in matrix notation:

$$\begin{pmatrix}dx^1\\\vdots\\dx^m\end{pmatrix}=\begin{pmatrix}\frac{\partial g^1}{\partial t^1}&\cdots&\frac{\partial g^1}{\partial t^n}\\\vdots&\ddots&\vdots\\\frac{\partial g^m}{\partial t^1}&\cdots&\frac{\partial g^m}{\partial t^n}\end{pmatrix}\begin{pmatrix}dt^1\\\vdots\\dt^n\end{pmatrix}$$

Now if we have a function $f$ in terms of the $x$ variables, we can use the substitution above to write it as a function of the $t$ variables. We can write the differential of $f$ in terms of each

$$df=\frac{\partial f}{\partial x^j}dx^j=\frac{\partial f}{\partial t^i}dt^i$$

Next we use the substitutions of the differentials to rewrite the first form as

$$df=\frac{\partial f}{\partial x^j}\frac{\partial g^j}{\partial t^i}dt^i$$

Then uniqueness allows us to match up the coefficients and write out the partial derivatives in terms of the $t$ variables

$$\frac{\partial f}{\partial t^i}=\frac{\partial f}{\partial x^j}\frac{\partial g^j}{\partial t^i}$$

It is in this form that the chain rule is most often introduced, or the similar form

$$\frac{\partial f}{\partial t^i}=\frac{\partial f}{\partial x^j}\frac{\partial x^j}{\partial t^i}$$

And now we can remove mention of $f$ from the formulæ and speak directly in terms of the operators

$$\frac{\partial}{\partial t^i}=\frac{\partial g^j}{\partial t^i}\frac{\partial}{\partial x^j}$$

Again, we can play it a little loose and write this in matrix notation

$$\begin{pmatrix}\frac{\partial}{\partial t^1}\\\vdots\\\frac{\partial}{\partial t^n}\end{pmatrix}=\begin{pmatrix}\frac{\partial g^1}{\partial t^1}&\cdots&\frac{\partial g^m}{\partial t^1}\\\vdots&\ddots&\vdots\\\frac{\partial g^1}{\partial t^n}&\cdots&\frac{\partial g^m}{\partial t^n}\end{pmatrix}\begin{pmatrix}\frac{\partial}{\partial x^1}\\\vdots\\\frac{\partial}{\partial x^m}\end{pmatrix}$$

This is *very* similar to the substitution for differentials written in matrix notation. The differences are that we transform from $x$-derivations to $t$-derivations instead of from $t$-differentials to $x$-differentials, and the two substitution matrices are the transposes of each other. Those who have been following closely (or who have some background in differential geometry) should start to see the importance of this latter fact, but for now we’ll consider this a statement about formulas and methods of calculation. We’ll come to the deeper geometric meaning when we come through again in a wider context.
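The transpose relationship can be checked numerically in the polar example: the column of $(r,\theta)$-partials equals $J^{\mathsf T}$ applied to the column of $(x,y)$-partials, where $J$ is the matrix that substitutes differentials. The test function is an arbitrary choice of mine:

```python
import numpy as np

# Check: grad_{(r,theta)} f = J^T grad_{(x,y)} f, with J = d(x,y)/d(r,theta).
def f(x, y):
    return x**2 * y + np.sin(y)   # arbitrary smooth test function

def num_grad(fun, p, h=1e-6):
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = h
        g[i] = (fun(*(p + e)) - fun(*(p - e))) / (2*h)
    return g

r, th = 1.3, 0.7
x, y = r*np.cos(th), r*np.sin(th)

J = np.array([[np.cos(th), -r*np.sin(th)],
              [np.sin(th),  r*np.cos(th)]])   # substitutes (dr, dtheta) -> (dx, dy)

grad_xy = num_grad(f, (x, y))
grad_rth = num_grad(lambda r_, t_: f(r_*np.cos(t_), r_*np.sin(t_)), (r, th))

print(np.allclose(grad_rth, J.T @ grad_xy, atol=1e-5))  # → True
```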

## Product and Quotient rules

As I said before, there’s generally no product of higher-dimensional vectors, and so there’s no generalization of the product rule. But we can multiply and divide real-valued functions of more than one variable. Finding the differential of such a product or quotient function is a nice little exercise in using Cauchy’s invariant rule.

For all that follows we’re considering two real-valued functions of $n$ real variables: $f(x^1,\dots,x^n)$ and $g(x^1,\dots,x^n)$. We’ll put them together to give a single map from $\mathbb{R}^n$ to $\mathbb{R}^2$ by picking orthonormal coordinates $u$ and $v$ on the latter space and defining

$$u=f(x^1,\dots,x^n)\qquad v=g(x^1,\dots,x^n)$$

We also have two familiar functions that we don’t often think of explicitly as functions from $\mathbb{R}^2$ to $\mathbb{R}$:

$$p(u,v)=uv\qquad q(u,v)=\frac{u}{v}$$

Now we can find the differentials of $p$ and $q$

$$dp=v\,du+u\,dv\qquad dq=\frac{1}{v}du-\frac{u}{v^2}dv=\frac{v\,du-u\,dv}{v^2}$$

Notice that the differential for $q$ is exactly the alternate notation I mentioned when defining the one-variable quotient rule!

With all this preparation out of the way, the product function can be seen as the composition $p\circ(f,g)$, while the quotient function can be seen as the composition $q\circ(f,g)$. So to calculate the differentials of the product and quotient we can use Cauchy’s invariant rule to make the substitutions

$$d(fg)=g\,df+f\,dg\qquad d\left(\frac{f}{g}\right)=\frac{g\,df-f\,dg}{g^2}$$

The upshot is that just like in the case of one variable we can differentiate a product of two functions by differentiating each of the functions, multiplying by the other function, and adding the two resulting terms. We just use the differential instead of the derivative.

Similarly, we can differentiate the quotient of two functions just as in the one-variable case, but using the differential instead of the derivative.
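The two substitution rules can be verified componentwise with symbolic placeholder functions of two variables (the names `f` and `g` here are generic sympy functions, not any particular pair):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.Function('f')(x, y)
g = sp.Function('g')(x, y)

# coefficient of dx in d(fg) should be g df/dx + f dg/dx
lhs = sp.diff(f*g, x)
rhs = g*sp.diff(f, x) + f*sp.diff(g, x)
assert sp.simplify(lhs - rhs) == 0

# coefficient of dx in d(f/g) should be (g df/dx - f dg/dx)/g²
lhs_q = sp.diff(f/g, x)
rhs_q = (g*sp.diff(f, x) - f*sp.diff(g, x))/g**2
assert sp.simplify(lhs_q - rhs_q) == 0
print("product and quotient rules check out")
```

The same identity holds for the coefficient of $dy$, and indeed of every $dx^i$, which is exactly the statement about differentials.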

## Cauchy’s Invariant Rule

An immediate corollary of the chain rule is another piece of “syntactic sugar”.

If we have functions $g:X\rightarrow\mathbb{R}^m$ and $f:Y\rightarrow\mathbb{R}$ for some open regions $X\subseteq\mathbb{R}^n$ and $Y\subseteq\mathbb{R}^m$ so that the image $g(X)$ is contained in $Y$, we can compose the two functions to get a new function $f\circ g:X\rightarrow\mathbb{R}$. In terms of formulas, we can choose coordinates on $\mathbb{R}^n$ and $\mathbb{R}^m$ and write out both the function $f(y^1,\dots,y^m)$ and the component functions $g^j(x^1,\dots,x^n)$. We get a formula for the composite by substituting $g^j$ for $y^j$ in the formula for $f$ and write $\left[f\circ g\right](x)=f\left(g^1(x),\dots,g^m(x)\right)$.

The language there seems a little convoluted, so I’d like to give an example. We might define a function $f(x,y)=x^2+y^2$ for all points $(x,y)$ in the plane $\mathbb{R}^2$. This is all well and good, but we might want to talk about the function in polar coordinates. To this end, we may define $x=r\cos(\theta)$ and $y=r\sin(\theta)$. These are the component functions describing a transformation from the region where $r>0$ to the region where $(x,y)\neq(0,0)$. We can substitute $r\cos(\theta)$ for $x$ and $r\sin(\theta)$ for $y$ in our formula for $f$ to get a new function with formula

$$f\left(r\cos(\theta),r\sin(\theta)\right)=r^2\cos(\theta)^2+r^2\sin(\theta)^2=r^2$$

This much is straightforward. The thing is, now we want to take differentials. What Cauchy’s invariant rule tells us is that we can calculate the differential of $f\circ g$ by not only substituting $g^j$ for $y^j$, but also substituting $dg^j$ for $dy^j$ in the formula for $df$. That is, if

$$df=\frac{\partial f}{\partial y^j}dy^j$$

then we have the equivalence

$$d(f\circ g)=\frac{\partial f}{\partial y^j}dg^j$$

In our particular example, we can easily calculate the differential of $f$ using our first formula:

$$df=2x\,dx+2y\,dy$$

or using our second formula:

$$df=2r\,dr$$

We want to call both of these simply $df$. But can we do so unambiguously? Indeed, if $x=r\cos(\theta)$ then we find

$$dx=\cos(\theta)dr-r\sin(\theta)d\theta$$

and if $y=r\sin(\theta)$ then we find

$$dy=\sin(\theta)dr+r\cos(\theta)d\theta$$

We substitute these into our formula for $df$ to find

$$\begin{aligned}df&=2r\cos(\theta)\left(\cos(\theta)dr-r\sin(\theta)d\theta\right)+2r\sin(\theta)\left(\sin(\theta)dr+r\cos(\theta)d\theta\right)\\&=2r\left(\cos(\theta)^2+\sin(\theta)^2\right)dr+2r^2\left(\sin(\theta)\cos(\theta)-\sin(\theta)\cos(\theta)\right)d\theta\\&=2r\,dr\end{aligned}$$

just the same as if we calculated $df$ directly from the formula in terms of $r$ and $\theta$.

That is, we can substitute our formulæ for the coordinate functions before taking the differential in terms of $r$ and $\theta$, or we can take the differential in terms of $x$ and $y$ and then substitute our formulæ for the coordinate functions *and their differentials* into the result. Either way, we end up in the same place, so we don’t have to worry about ending up with two (or more!) “different” differentials of $f$.

So, how do we verify this using the chain rule? Just write out the differentials using partial derivatives. For example, we know that

$$dg^j=\frac{\partial g^j}{\partial x^i}dx^i$$

and so on. So, performing our substitutions we can find:

$$\frac{\partial f}{\partial y^j}dg^j=\frac{\partial f}{\partial y^j}\frac{\partial g^j}{\partial x^i}dx^i=\frac{\partial(f\circ g)}{\partial x^i}dx^i=d(f\circ g)$$

The important part here is the passage from products of two partial derivatives to single partial derivatives of $f\circ g$. This works out because when we consider differentials as linear transformations, the matrix entries are the partial derivatives. The composition of the linear transformations $df$ and $dg$ is given by the product of these matrices, and the entries of the resulting matrix must (by uniqueness) be the partial derivatives of the composite function.
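Both routes through the polar example can be run symbolically, assuming the illustrative choice $f(x,y)=x^2+y^2$ used above:

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
x_expr, y_expr = r*sp.cos(th), r*sp.sin(th)

# Route 1: substitute first, then take partials in r and theta
F = x_expr**2 + y_expr**2
route1 = (sp.simplify(sp.diff(F, r)), sp.simplify(sp.diff(F, th)))

# Route 2: take df = 2x dx + 2y dy, then substitute x, y, dx, dy
dx_r, dx_th = sp.diff(x_expr, r), sp.diff(x_expr, th)
dy_r, dy_th = sp.diff(y_expr, r), sp.diff(y_expr, th)
route2 = (sp.simplify(2*x_expr*dx_r + 2*y_expr*dy_r),
          sp.simplify(2*x_expr*dx_th + 2*y_expr*dy_th))

assert route1 == route2 == (2*r, 0)
print("df = 2r dr either way")
```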

## The Chain Rule

Since the components of the differential are given by partial derivatives, and partial derivatives (like all single-variable derivatives) are linear, it’s straightforward to see that the differential operator is linear as well. That is, if $f$ and $g$ are two functions, both of which are differentiable at a point $x$, and $a$ and $b$ are real constants, then the linear combination $af+bg$ is also differentiable at $x$, and the differential is given by

$$d(af+bg)=a\,df+b\,dg$$

There’s not usually a product for function values in $\mathbb{R}^m$, so there’s not usually any analogue of the product rule and definitely none of the quotient rule, so we can ignore those for now.

But we *do* have a higher-dimensional analogue for the chain rule. If we have a function $g:X\rightarrow\mathbb{R}^n$ defined on some open region $X\subseteq\mathbb{R}^m$ and another function $f:Y\rightarrow\mathbb{R}^p$ defined on a region $Y\subseteq\mathbb{R}^n$ that contains the image $g(X)$, then we can compose them to get a single function $h=f\circ g$ defined by $h(x)=f(g(x))$. And if $g$ is differentiable at a point $x_0$ and $f$ is differentiable at the image point $g(x_0)$, then the composite function is differentiable at $x_0$.

First of all, what should the differential be? Remember that the differential $dg(x_0)$ is a linear transformation that takes displacements from the point $x_0$ and turns them into displacements from the point $g(x_0)$. Then the differential $df(g(x_0))$ is a linear transformation that takes displacements from the point $g(x_0)$ and turns them into displacements from the point $f(g(x_0))$. Putting these together, we have a composite linear transformation $df(g(x_0))dg(x_0)$ that takes displacements from the point $x_0$ and turns them into displacements from the point $h(x_0)=f(g(x_0))$. I assert that this composite transformation is exactly the differential of the composite function.

Just as a sanity check, what happens when we look at single-variable real-valued functions? In this case, $df$ and $dg$ are both linear transformations from one-dimensional spaces to other one-dimensional spaces. That is, they’re represented as $1\times1$ matrices that just multiply by the single real entry. So the composite of the two transformations is given by the matrix whose single entry is the product of the two matrices’ single entries. In other words, in one variable the differentials look like single real numbers $f'(g(x_0))$ and $g'(x_0)$, and their composite is given by multiplication: $f'(g(x_0))g'(x_0)$. This is exactly the one-variable chain rule. To understand multiple variables we have to move from products of real numbers to compositions of linear transformations, which will be products of real matrices.

Okay, so let’s verify that $df(g(x_0))dg(x_0)$ does indeed act as a differential for $h$. It’s clearly a linear transformation between the appropriate two spaces of displacements. What we need to verify is that it gives a good approximation. That is, for every $\epsilon>0$ there is a $\delta>0$ so that if $\|t\|<\delta$ we have

$$\left\|h(x_0+t)-h(x_0)-df(g(x_0))dg(x_0)t\right\|<\epsilon\|t\|$$

First of all, since $f$ is differentiable at $g(x_0)$, given $\epsilon$ there is a $\delta_1$ so that if $\|s\|<\delta_1$ we have

$$\left\|f(g(x_0)+s)-f(g(x_0))-df(g(x_0))s\right\|<\epsilon\|s\|$$

Now since $g$ is differentiable it satisfies a Lipschitz condition. We showed that this works for real-valued functions, but extending the result is very straightforward. That is, there is some radius $r$ and a constant $M$ so that if $\|t\|<r$ we have the inequality $\|g(x_0+t)-g(x_0)\|\leq M\|t\|$. That is, $g$ cannot stretch displacements by more than a factor of $M$ as long as the displacements are small enough.

Now $r$ may be smaller than $\frac{\delta_1}{M}$ already, but just in case let’s shrink it until it is. Then we know that

$$\left\|g(x_0+t)-g(x_0)\right\|\leq M\|t\|<M\frac{\delta_1}{M}=\delta_1$$

so we can use this difference as a displacement $s$ from $g(x_0)$. We find

$$\left\|f(g(x_0+t))-f(g(x_0))-df(g(x_0))\left(g(x_0+t)-g(x_0)\right)\right\|<\epsilon\left\|g(x_0+t)-g(x_0)\right\|\leq\epsilon M\|t\|$$

Now we’re going to find a constant $N$ and a radius $\delta_2$ so that

$$\left\|df(g(x_0))\left(g(x_0+t)-g(x_0)\right)-df(g(x_0))dg(x_0)t\right\|\leq N\epsilon\|t\|$$

whenever $\|t\|<\delta_2$. Once this is established, we are done. Given an $\epsilon$ we can run the argument with $\frac{\epsilon}{M+N}$ in its place and let $\delta$ be the smaller of the two resulting radii $r$ and $\delta_2$. Within this smaller radius, the desired inequality will hold.

To get this result, we choose orthonormal coordinates on the space $\mathbb{R}^n$. We can then use these coordinates to write

$$df(g(x_0))\left(g(x_0+t)-g(x_0)-dg(x_0)t\right)=\left(g^j(x_0+t)-g^j(x_0)-dg^j(x_0)t\right)df(g(x_0))e_j$$

But since each of the several $g^j$ is differentiable we can pick our radius $\delta_2$ so that all of the inequalities

$$\left|g^j(x_0+t)-g^j(x_0)-dg^j(x_0)t\right|<\epsilon\|t\|$$

hold for $\|t\|<\delta_2$. Then we let $N$ be a suitable multiple (depending only on the dimensions involved) of the magnitude of the largest of the component partial derivatives $\frac{\partial f^k}{\partial y^j}\big|_{g(x_0)}$, and we’re done.

Thus when $g$ is differentiable at $x_0$ and $f$ is differentiable at $g(x_0)$, then the composite $h=f\circ g$ is differentiable at $x_0$, and the differential of the composite function is given by

$$dh(x_0)=df(g(x_0))dg(x_0)$$

the composite of the differentials, considered as linear transformations.
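Viewed as matrices, the rule says the Jacobian of the composite is the product of the Jacobians, which we can check numerically. The two maps below are arbitrary smooth test maps of mine, $g:\mathbb{R}^2\rightarrow\mathbb{R}^3$ and $f:\mathbb{R}^3\rightarrow\mathbb{R}^2$:

```python
import numpy as np

def g(x):
    return np.array([x[0]*x[1], np.sin(x[0]), x[1]**2])

def f(y):
    return np.array([y[0] + y[1]*y[2], np.exp(y[0])])

def jacobian(fun, p, h=1e-6):
    # central-difference Jacobian, one column per input coordinate
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = h
        cols.append((fun(p + e) - fun(p - e)) / (2*h))
    return np.column_stack(cols)

x0 = np.array([0.4, 1.1])
J_h = jacobian(lambda x: f(g(x)), x0)        # differential of the composite
J_fg = jacobian(f, g(x0)) @ jacobian(g, x0)  # composite of the differentials

print(np.allclose(J_h, J_fg, atol=1e-4))  # → True
```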

## Vector-Valued Functions

Now we know how to modify the notion of the derivative of a function to deal with vector inputs by defining the differential. But what about functions that have vectors as outputs?

Well, luckily when we defined the differential we didn’t really use anything about the space where our function took its values but that it was a topological vector space. Indeed, when defining the differential of a function $f:\mathbb{R}^n\rightarrow\mathbb{R}^m$ we need to set up a new function $df$ that takes a point $x$ in the Euclidean space $\mathbb{R}^n$ and a displacement vector $t$ in $\mathbb{R}^n$ as inputs, and which gives a displacement vector in $\mathbb{R}^m$ as its output. It must be linear in the displacement, meaning that we can view $df(x)$ as a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$. And it must satisfy a similar approximation condition, replacing the absolute value with the notion of length in $\mathbb{R}^m$: for every $\epsilon>0$ there is a $\delta>0$ so that if $0<\|t\|<\delta$ we have

$$\left\|f(x+t)-f(x)-df(x)t\right\|<\epsilon\|t\|$$

From here on we’ll just determine which norm we mean by context, since we only have one norm on each vector space.

Okay, so we can talk about differentials of vector-valued functions: the differential of at a point (if it exists) is a linear transformation that turns displacements in the input space into displacements in the output space, and does so in the way that most closely approximates the action of the function itself. But how do we define the function?

If we pick an orthonormal basis $\{e_j\}$ for $\mathbb{R}^m$ we can write the components of $f$ as separate functions. That is, we say

$$f(x)=f^j(x)e_j$$

Now I assert that the differential can be taken component-by-component, just as continuity works: $df(x)t=\left[df^j(x)t\right]e_j$. On the left is the differential of $f$ as a vector-valued function, while on the right we find the differentials of the several real-valued functions $f^j$. The differential exists if and only if the component differentials do.

First, from components to the vector-valued function. Clearly this definition of $df$ gives us a linear map from displacements in $\mathbb{R}^n$ to displacements in $\mathbb{R}^m$. But does it satisfy the approximation inequality? Indeed, for every $\epsilon$ we can find a $\delta$ so that *all* the inequalities

$$\left|f^j(x+t)-f^j(x)-df^j(x)t\right|<\frac{\epsilon}{\sqrt{m}}\|t\|$$

are satisfied when $0<\|t\|<\delta$. Of course, there are different $\delta$s that work for each component, but we can pick the smallest of them. Then it’s a simple matter to find

$$\left\|f(x+t)-f(x)-df(x)t\right\|=\sqrt{\sum\limits_{j=1}^m\left(f^j(x+t)-f^j(x)-df^j(x)t\right)^2}<\sqrt{m\left(\frac{\epsilon}{\sqrt{m}}\|t\|\right)^2}=\epsilon\|t\|$$

so if the component functions are differentiable, then so is the function as a whole.

On the other hand, if the differential $df$ exists then for every $\epsilon$ there exists a $\delta$ so that if $0<\|t\|<\delta$ we have

$$\left\|f(x+t)-f(x)-df(x)t\right\|<\epsilon\|t\|$$

But then it’s easy to see that

$$\left|f^j(x+t)-f^j(x)-df^j(x)t\right|\leq\left\|f(x+t)-f(x)-df(x)t\right\|<\epsilon\|t\|$$

and so each of the component differentials exists.

Finally, I should mention that if we also pick an orthonormal basis for the input space $\mathbb{R}^n$ we can expand each component differential in terms of the dual basis $\{dx^i\}$:

$$df^j=\frac{\partial f^j}{\partial x^i}dx^i$$

Then we can write the whole differential out as a matrix whose entry in the $j$th row and $i$th column is $\frac{\partial f^j}{\partial x^i}$. If we write a displacement $t$ in the input as an $n$-dimensional column vector we find our estimate of the displacement in the output as an $m$-dimensional column vector:

$$df(x)t=\begin{pmatrix}\frac{\partial f^1}{\partial x^1}&\cdots&\frac{\partial f^1}{\partial x^n}\\\vdots&\ddots&\vdots\\\frac{\partial f^m}{\partial x^1}&\cdots&\frac{\partial f^m}{\partial x^n}\end{pmatrix}\begin{pmatrix}t^1\\\vdots\\t^n\end{pmatrix}$$
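As a numerical illustration of this matrix acting on displacements, here is an arbitrary test map $f:\mathbb{R}^2\rightarrow\mathbb{R}^3$ of my own choosing, with its matrix of partials applied to a small displacement:

```python
import numpy as np

# f(x + t) - f(x) should be approximated by the matrix df(x) applied to t.
def f(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

def df(x):
    # rows are the component differentials df^1, df^2, df^3
    return np.array([[2*x[0],  0.0],
                     [x[1],    x[0]],
                     [0.0,     np.cos(x[1])]])

x = np.array([1.0, 0.5])
t = np.array([1e-3, -2e-3])

actual = f(x + t) - f(x)
estimate = df(x) @ t
print(np.linalg.norm(actual - estimate))  # small compared to ||t||
```

The residual shrinks quadratically as the displacement shrinks, which is exactly the approximation condition above.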