Higher Differentials and Composite Functions
Last time we saw an example of what can go wrong when we try to translate higher differentials the way we did the first-order differential. Today I want to identify exactly what goes wrong, and I’ll make use of the summation convention to greatly simplify the process.
So, let’s take a function $f$ of $n$ variables $y^1,\dots,y^n$ and a collection of $n$ functions $g^1,\dots,g^n$, each depending on $m$ variables $x^1,\dots,x^m$. We can think of these as the components of a vector-valued function $g:X\to\mathbb{R}^n$ which has continuous second partial derivatives on some region $X\subseteq\mathbb{R}^m$. If the function $f:Y\to\mathbb{R}$ has continuous second partial derivatives on some region $Y\subseteq\mathbb{R}^n$ containing the image $g(X)$, then we can compose the two functions to give a single function $h=f\circ g:X\to\mathbb{R}$, and we’re going to investigate the second differential of $h$ with respect to the variables $x^i$.
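To that end, we want to calculate the second partial derivative
$$\frac{\partial^2 h}{\partial x^j\,\partial x^i}$$
First, we take the derivative in terms of $x^i$, and we use the chain rule to write
$$\frac{\partial h}{\partial x^i}=\frac{\partial f}{\partial y^k}\frac{\partial g^k}{\partial x^i}$$
Now we have to take the derivative in terms of $x^j$. Luckily, this operation is linear, so we don’t have to worry about the hidden summations in the notation. We do, however, have to use the product rule to handle the multiplications
$$\frac{\partial^2 h}{\partial x^j\,\partial x^i}=\frac{\partial^2 f}{\partial y^l\,\partial y^k}\frac{\partial g^l}{\partial x^j}\frac{\partial g^k}{\partial x^i}+\frac{\partial f}{\partial y^k}\frac{\partial^2 g^k}{\partial x^j\,\partial x^i}$$
where we’ve used the chain rule again to convert a derivative in terms of $x^j$ into one in terms of $y^l$.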
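And here we’ve come to the problem itself. We can write out the second differential in terms of the $x^i$
$$d^2h=\frac{\partial^2 h}{\partial x^j\,\partial x^i}dx^j\,dx^i=\frac{\partial^2 f}{\partial y^l\,\partial y^k}\frac{\partial g^l}{\partial x^j}\frac{\partial g^k}{\partial x^i}dx^j\,dx^i+\frac{\partial f}{\partial y^k}\frac{\partial^2 g^k}{\partial x^j\,\partial x^i}dx^j\,dx^i=\frac{\partial^2 f}{\partial y^l\,\partial y^k}dg^l\,dg^k+\frac{\partial f}{\partial y^k}d^2g^k$$
The first term here is the second differential in terms of the $y^k$. If there were an analogue of Cauchy’s invariant rule, this would be all there is to the formula. But we’ve got another term, one due to the product rule, based on the second differentials of the functions $g^k$ themselves. This is the term that ruins the nice transformation properties of higher differentials, and which makes them unsuitable for many of our purposes.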
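As a quick numerical sanity check of the product-rule term, here is a minimal sketch in the one-variable case, where the formula reads $h''=f''(g)\,(g')^2+f'(g)\,g''$. The choices $f(y)=\sin y$ and $g(x)=x^2$, the sample point, and the step size are all assumptions made for illustration; none of them comes from the text.

```python
import math

def second_derivative(fn, x, step=1e-4):
    """Central finite-difference estimate of fn''(x)."""
    return (fn(x + step) - 2.0 * fn(x) + fn(x - step)) / step**2

# Hypothetical choices, not from the text: f(y) = sin(y), g(x) = x**2.
def g(x):
    return x * x

def h(x):
    return math.sin(g(x))

def invariant_term(x):
    # f''(g(x)) * g'(x)**2: what an invariance rule alone would predict.
    return -math.sin(g(x)) * (2.0 * x) ** 2

def product_rule_term(x):
    # f'(g(x)) * g''(x): the extra term coming from the product rule.
    return math.cos(g(x)) * 2.0

x0 = 0.7
numeric = second_derivative(h, x0)
with_extra_term = invariant_term(x0) + product_rule_term(x0)
without_extra_term = invariant_term(x0)
```

Comparing `numeric` against `with_extra_term` and `without_extra_term` shows that only the version including the product-rule term matches the finite-difference estimate.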
Notice, though, that we have not contradicted Clairaut’s theorem here. Indeed, as long as $f$ and all the $g^k$ have continuous second partial derivatives, then so will $h$. Further, the formula we derived for the second partial derivatives of $h$ is manifestly symmetric between the two derivatives, and so the mixed partials commute.
Higher-Order Differentials
Just like we assembled partial derivatives into the differential of a function, so we can assemble higher partial derivatives into higher-order differentials. The differential measures how the function itself changes as we move around, and the higher differentials will measure how lower differentials change.
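First let’s look at the second-order differential of a real-valued function $f$ of $n$ variables $x^1,\dots,x^n$. We’ll use the $dx^i$ as a basis for the space of differentials, which allows us to write out the components of the differential:
$$df(x;t)=\sum_{i=1}^n\frac{\partial f}{\partial x^i}\,t^i$$
So, just as we did for vector-valued functions, we’ll just take the differentials of each of these components separately, and then cobble them together. Each component differential takes a second displacement $s$ as its input:
$$d\left(\frac{\partial f}{\partial x^i}\right)(x;s)=\sum_{j=1}^n\frac{\partial^2 f}{\partial x^j\,\partial x^i}\,s^j$$
Now this second displacement may have nothing to do with the first, but it should be the same for all components. That is, we could write out the second differential as a function of not only the point but of two displacements $s$ and $t$ from the point:
$$d^2f(x;s,t)=\sum_{i,j=1}^n\frac{\partial^2 f}{\partial x^j\,\partial x^i}\,s^j\,t^i$$
Commonly we’ll collapse this into a function of a point and a single displacement. We just put the same vector in for both $s$ and $t$:
$$d^2f(x;t)=\sum_{i,j=1}^n\frac{\partial^2 f}{\partial x^j\,\partial x^i}\,t^j\,t^i$$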
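Unfortunately, these higher differentials are more complicated than our first-order derivatives. In particular, they don’t obey anything like Cauchy’s invariant rule, meaning they don’t transform well when we compose functions. As an example, let’s go back and look at the polar coordinate transform again:
$$x=r\cos\theta\qquad y=r\sin\theta$$
We’ve seen how we can use Cauchy’s invariant rule to rewrite differentials:
$$dx=\cos\theta\,dr-r\sin\theta\,d\theta\qquad dy=\sin\theta\,dr+r\cos\theta\,d\theta$$
We can also invert the transformation and rewrite differential operators:
$$\frac{\partial}{\partial x}=\cos\theta\frac{\partial}{\partial r}-\frac{\sin\theta}{r}\frac{\partial}{\partial\theta}\qquad\frac{\partial}{\partial y}=\sin\theta\frac{\partial}{\partial r}+\frac{\cos\theta}{r}\frac{\partial}{\partial\theta}$$
So let’s take our second-order differential
$$d^2f=\frac{\partial^2f}{\partial x^2}dx^2+2\frac{\partial^2f}{\partial x\,\partial y}dx\,dy+\frac{\partial^2f}{\partial y^2}dy^2$$
and try to rewrite it. The nasty bit is working out all these second-order partial derivatives in terms of $r$ and $\theta$.
$$\frac{\partial^2f}{\partial x^2}=\cos^2\theta\frac{\partial^2f}{\partial r^2}-\frac{2\sin\theta\cos\theta}{r}\frac{\partial^2f}{\partial r\,\partial\theta}+\frac{\sin^2\theta}{r^2}\frac{\partial^2f}{\partial\theta^2}+\frac{\sin^2\theta}{r}\frac{\partial f}{\partial r}+\frac{2\sin\theta\cos\theta}{r^2}\frac{\partial f}{\partial\theta}$$
$$\frac{\partial^2f}{\partial x\,\partial y}=\sin\theta\cos\theta\frac{\partial^2f}{\partial r^2}+\frac{\cos^2\theta-\sin^2\theta}{r}\frac{\partial^2f}{\partial r\,\partial\theta}-\frac{\sin\theta\cos\theta}{r^2}\frac{\partial^2f}{\partial\theta^2}-\frac{\sin\theta\cos\theta}{r}\frac{\partial f}{\partial r}-\frac{\cos^2\theta-\sin^2\theta}{r^2}\frac{\partial f}{\partial\theta}$$
$$\frac{\partial^2f}{\partial y^2}=\sin^2\theta\frac{\partial^2f}{\partial r^2}+\frac{2\sin\theta\cos\theta}{r}\frac{\partial^2f}{\partial r\,\partial\theta}+\frac{\cos^2\theta}{r^2}\frac{\partial^2f}{\partial\theta^2}+\frac{\cos^2\theta}{r}\frac{\partial f}{\partial r}-\frac{2\sin\theta\cos\theta}{r^2}\frac{\partial f}{\partial\theta}$$
After that it’s no trouble at all to transform the differential terms
$$dx^2=\cos^2\theta\,dr^2-2r\sin\theta\cos\theta\,dr\,d\theta+r^2\sin^2\theta\,d\theta^2$$
$$dx\,dy=\sin\theta\cos\theta\,dr^2+r(\cos^2\theta-\sin^2\theta)\,dr\,d\theta-r^2\sin\theta\cos\theta\,d\theta^2$$
$$dy^2=\sin^2\theta\,dr^2+2r\sin\theta\cos\theta\,dr\,d\theta+r^2\cos^2\theta\,d\theta^2$$
Let’s just work out the component that goes with $d\theta^2$ when we put these all together
$$r^2\sin^2\theta\frac{\partial^2f}{\partial x^2}-2r^2\sin\theta\cos\theta\frac{\partial^2f}{\partial x\,\partial y}+r^2\cos^2\theta\frac{\partial^2f}{\partial y^2}=\frac{\partial^2f}{\partial\theta^2}+r\frac{\partial f}{\partial r}$$
Which has an extraneous term! If an invariance rule held, we should just get $\frac{\partial^2f}{\partial\theta^2}$.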
The difference comes from the way that the differential operators themselves change as we move our point around. Increasing $\theta$ by a little bit means something different at a point far from the origin than it does at a point close to the origin. This doesn’t really matter when we’re talking about first-order differentials because we’re never putting two differential operators together, and so we never get any measurement of how an operator changes from point to point. We will eventually learn how to compensate for this effect, but that will wait until we have a significantly more general approach.
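The failure of invariance can be seen numerically. The sketch below assumes a hypothetical test function $f(x,y)=x^2y$ (not taken from the text) and compares the $d\theta^2$ coefficient obtained by substituting $dx$ and $dy$ into the Cartesian second differential with the polar quantities $\frac{\partial^2f}{\partial\theta^2}$ and $r\frac{\partial f}{\partial r}$, both estimated by finite differences.

```python
import math

# Hypothetical test function, chosen only for illustration.
def f(x, y):
    return x * x * y

def h(r, theta):
    """The same function written in polar coordinates."""
    return f(r * math.cos(theta), r * math.sin(theta))

def cartesian_dtheta2_coefficient(r, theta):
    """Coefficient of d(theta)^2 after substituting dx, dy into d^2 f."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    f_xx, f_xy, f_yy = 2.0 * y, 2.0 * x, 0.0  # exact for f = x^2 y
    dx_dtheta = -r * math.sin(theta)
    dy_dtheta = r * math.cos(theta)
    return (f_xx * dx_dtheta**2
            + 2.0 * f_xy * dx_dtheta * dy_dtheta
            + f_yy * dy_dtheta**2)

def polar_pieces(r, theta, step=1e-4):
    """Finite-difference estimates of d^2h/dtheta^2 and r * dh/dr."""
    h_theta_theta = (h(r, theta + step) - 2.0 * h(r, theta)
                     + h(r, theta - step)) / step**2
    h_r = (h(r + step, theta) - h(r - step, theta)) / (2.0 * step)
    return h_theta_theta, r * h_r

r0, theta0 = 1.5, 0.6
coefficient = cartesian_dtheta2_coefficient(r0, theta0)
second_theta, extra_term = polar_pieces(r0, theta0)
```

Here `coefficient` agrees with `second_theta + extra_term` rather than with `second_theta` alone, which is exactly the extraneous term at work.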
Clairaut’s Theorem
Now for the most common sufficient condition ensuring that mixed partial derivatives commute. If $f$ is a function of $n$ variables, we can for the moment hold the values of all but two of them constant. We’ll only consider two variables at a time, which will simplify our notation. For the moment, then, we write $f(x,y)$. We will also assume that $f$ is real-valued, and deal with vector values one component at a time.
I assert that if the partial derivatives $D_xf$ and $D_yf$ are continuous in a neighborhood of the point $(a,b)$, and if the mixed second partial derivative $D_{y,x}f=D_y(D_xf)$ exists and is continuous there, then the other mixed partial derivative $D_{x,y}f=D_x(D_yf)$ exists at $(a,b)$, and we have the equality
$$D_{x,y}f(a,b)=D_{y,x}f(a,b)$$
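By definition, within the neighborhood in the statement of the theorem the partial derivative $D_yf$ is given by the limit
$$D_yf(x,y)=\lim_{k\to0}\frac{f(x,y+k)-f(x,y)}{k}$$
So the numerator of the difference quotient defining the desired mixed partial derivative is
$$D_yf(a+h,b)-D_yf(a,b)=\lim_{k\to0}\frac{\left[f(a+h,b+k)-f(a+h,b)\right]-\left[f(a,b+k)-f(a,b)\right]}{k}$$
For a fixed $k$, we define the function
$$F(t)=f(a+t,b+k)-f(a+t,b)$$
We compute the derivative of $F$ as
$$F'(t)=D_xf(a+t,b+k)-D_xf(a+t,b)$$
so we can apply the mean value theorem to write
$$F(h)-F(0)=hF'(\bar{h})$$
for some $\bar{h}$ between $0$ and $h$. We use the above expression for $F'$ to write the difference quotient
$$\frac{D_yf(a+h,b)-D_yf(a,b)}{h}=\lim_{k\to0}\frac{D_xf(a+\bar{h},b+k)-D_xf(a+\bar{h},b)}{k}$$
In a similar trick to the one above, we can see that $D_xf(a+\bar{h},y)$ is differentiable as a function of $y$ with derivative $D_{y,x}f(a+\bar{h},y)$. And so the mean value theorem tells us that we can write our difference quotient as
$$\frac{D_yf(a+h,b)-D_yf(a,b)}{h}=\lim_{k\to0}D_{y,x}f(a+\bar{h},b+\bar{k})$$
for some $\bar{k}$ between $0$ and $k$.
And so we come to try taking the limit
$$\lim_{h\to0}\lim_{k\to0}D_{y,x}f(a+\bar{h},b+\bar{k})$$
If $\bar{h}$ didn’t depend in its definition on $k$, this would be easy. First we could let $k$ go to zero, which would make $\bar{k}$ go to $0$, and then letting $h$ go to zero would make $\bar{h}$ go to zero as well. But it’s not going to be quite so easy, and limits in two variables like this usually call for some delicacy.
Given an $\epsilon>0$, there (by the assumption of continuity) is some $\delta>0$ so that
$$\left|D_{y,x}f(x,y)-D_{y,x}f(a,b)\right|<\frac{\epsilon}{2}$$
for $(x,y)$ within a radius $\delta$ of $(a,b)$. As long as we keep $|h|$ and $|k|$ below $\frac{\delta}{2}$, the point $(a+\bar{h},b+\bar{k})$ will be within this radius. So we can keep $h$ fixed at some small enough value, and find that $0<|k|<\frac{\delta}{2}$ implies the inequality
$$\left|D_{y,x}f(a+\bar{h},b+\bar{k})-D_{y,x}f(a,b)\right|<\frac{\epsilon}{2}$$
Now we can take the limit as $k$ goes to zero. As we do so, the inequality here may become an equality, but since we kept it below $\frac{\epsilon}{2}$, we still have some wiggle room. So, if $0<|h|<\frac{\delta}{2}$, we have the inequality
$$\left|\frac{D_yf(a+h,b)-D_yf(a,b)}{h}-D_{y,x}f(a,b)\right|\leq\frac{\epsilon}{2}<\epsilon$$
which gives us the limit we need.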
Of course we could instead assume that the second mixed partial derivative exists and is continuous near $(a,b)$, and conclude that the first one exists and is equal to the second.
Higher Partial Derivatives
Let’s say we’ve got a function $f$ that’s differentiable within an open region $S\subseteq\mathbb{R}^n$. In particular, if we pick coordinates on $\mathbb{R}^n$ the function has all partial derivatives $D_if$ at each point in $S$. As we move around within $S$ the value of the partial derivative changes, justifying the functional notation $D_if(x)$. And if we’re lucky, these functions themselves may be differentiable.
In particular, it makes sense to ask about the existence of so-called “second partial derivatives”, defined as
$$D_{j,i}f=D_j(D_if)$$
Or in Leibniz’ notation:
$$\frac{\partial^2f}{\partial x^j\,\partial x^i}=\frac{\partial}{\partial x^j}\frac{\partial f}{\partial x^i}$$
If we take the derivative in terms of the same variable twice in a row we sometimes write this as
$$\frac{\partial^2f}{\partial(x^i)^2}$$
Yes, there’s some dissonance between superscripts as indices and superscripts as powers. But, again, this is pretty much the received notation in many areas. If it seems like it might be confusing we just write out $\partial x^i$ twice in a row.
These, of course, may be defined within the region $S$, and we can then sensibly ask about third partial derivatives, like
$$D_{k,j,i}f=D_k(D_j(D_if))$$
and so on.
As an example, let’s consider the function . We can easily calculate the two first partial derivatives.
And then we take each derivative of each of these two
where since we’re not using superscripts as indices in these examples its meaning should be clear.
We notice here that the two in the middle — the “mixed” partial derivatives — are the same. This will happen in many cases of interest to us, but not always. As a pathological example, let’s go back and consider the function defined by
away from the origin, and patched by . Again, we calculate the first partial derivatives (at least away from the origin):
Each partial derivative is at the origin.
Now we can check that for all
, and that
for all
. Thus we can calculate
and the mixed partial derivatives are not equal.
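To see unequal mixed partials in action, here is a minimal numerical sketch. It uses the classic textbook counterexample $f(x,y)=\frac{xy(x^2-y^2)}{x^2+y^2}$ with $f(0,0)=0$, which is an assumption on my part and may not be the function used above. The step sizes are also illustrative choices: the inner step is taken much smaller than the outer one so that each first partial is resolved before it is differenced again.

```python
def f(x, y):
    """Classic counterexample (an assumed stand-in), patched at the origin."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x * x - y * y) / (x * x + y * y)

def d_x(x, y, step=1e-7):
    """Central-difference estimate of the partial derivative in x."""
    return (f(x + step, y) - f(x - step, y)) / (2.0 * step)

def d_y(x, y, step=1e-7):
    """Central-difference estimate of the partial derivative in y."""
    return (f(x, y + step) - f(x, y - step)) / (2.0 * step)

outer = 1e-3  # outer step, deliberately much larger than the inner step

# Differentiate in x first, then in y, at the origin.
d_y_of_d_x = (d_x(0.0, outer) - d_x(0.0, -outer)) / (2.0 * outer)

# Differentiate in y first, then in x, at the origin.
d_x_of_d_y = (d_y(outer, 0.0) - d_y(-outer, 0.0)) / (2.0 * outer)
```

For this particular function the two estimates come out with opposite signs, so the order of differentiation genuinely matters at the origin.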
The Mean Value Theorem
Here’s a nice technical result we may have call for from time to time: a higher-dimensional version of the differential mean value theorem. Remember that this says that if we’ve got a function $f$ continuous on the closed interval $[a,b]$ and differentiable on its interior, there is some point $c$ in the middle where the derivative of the function is the same as the average (the mean) rate of change of the function over the interval. In more than one dimension we’re going to modify this a bit to make it clearer what it means.
First of all, instead of talking about the closed interval $[a,b]$, we’re going to use the closed straight line segment. That is, the collection of all the points between $a$ and $b$ in a straight line, and including the endpoints. We first look at the total displacement $b-a$ from one point to the other. Then we start at $a$ and move some portion of this displacement towards $b$. That is, the closed line segment $[a,b]$ consists of all points of the form $a+t(b-a)$ for $t$ in the closed interval $[0,1]$. Setting $t=0$ gives us the point $a$, and $t=1$ gives us the point $b$. Similarly, the open line segment $(a,b)$ consists of all points of the form $a+t(b-a)$ for $t$ in the open interval $(0,1)$.
Next, we have to be clear about the average rate of change. As we move from $a$ to $b$, the value of the function $f$ changes by $f(b)-f(a)$. It takes a displacement of $b-a$, of length $\lVert b-a\rVert$, to get there, so on average the rate of change is
$$\frac{f(b)-f(a)}{\lVert b-a\rVert}$$
Finally, we don’t just have a single value for the instantaneous rate of change, we have a differential $df$. But we can use it to find directional derivatives. Specifically, we’ll consider the derivative of $f$ in the direction pointing from $a$ to $b$. We’ll pick out this direction with the unit vector we get by normalizing the displacement
$$\frac{b-a}{\lVert b-a\rVert}$$
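So the mean value theorem will tell us that if $f$ is differentiable in some open region $S$ that contains the whole closed line segment $[a,b]$, then there is some point $c$ in the open line segment $(a,b)$ so that the average rate of change of $f$ from $a$ to $b$ is equal to the directional derivative of $f$ at $c$ in the direction pointing from $a$ to $b$:
$$\frac{f(b)-f(a)}{\lVert b-a\rVert}=df\left(c;\frac{b-a}{\lVert b-a\rVert}\right)$$
or, more simply
$$f(b)-f(a)=df(c;b-a)$$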
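We’ll get at this by changing to a function of one variable so we can bring the one-dimensional version to bear. To that end, we define $g(t)=f(a+t(b-a))$ for $t$ in the closed interval $[0,1]$. Then $g(1)-g(0)=f(b)-f(a)$, and we can also show that $g$ is differentiable everywhere inside the interval. Indeed, we can evaluate the difference quotient
$$\frac{g(t+h)-g(t)}{h}=\frac{f\big(a+t(b-a)+h(b-a)\big)-f\big(a+t(b-a)\big)}{h}$$
Taking the limit as $h$ approaches $0$, we find
$$g'(t)=df\big(a+t(b-a);b-a\big)$$
which exists since $f$ is differentiable.
So our old differential mean-value theorem tells us that there is some $t_0\in(0,1)$ so that
$$f(b)-f(a)=g(1)-g(0)=g'(t_0)=df(c;b-a)$$
where $c=a+t_0(b-a)$ is a point in the open line segment $(a,b)$.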
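The reduction to one variable can be traced in code. This is only a sketch under stated assumptions: the function $f(x,y)=x^2+y^2$, the endpoints, and the bisection search are all illustrative choices, and the bisection relies on the gap changing sign along the segment, which holds for this example but not for every differentiable function.

```python
def f(p):
    x, y = p
    return x * x + y * y

def gradient(p):
    x, y = p
    return (2.0 * x, 2.0 * y)

def point_on_segment(a, b, t):
    """The point a + t(b - a) on the segment from a to b."""
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))

def mean_value_parameter(a, b, tolerance=1e-12):
    """Bisect on t for df(c; b - a) = f(b) - f(a), where c = a + t(b - a)."""
    displacement = tuple(bi - ai for ai, bi in zip(a, b))
    total_change = f(b) - f(a)

    def gap(t):
        c = point_on_segment(a, b, t)
        directional = sum(g * d for g, d in zip(gradient(c), displacement))
        return directional - total_change

    low, high = 0.0, 1.0  # assumes gap changes sign on [0, 1]
    while high - low > tolerance:
        mid = 0.5 * (low + high)
        if gap(low) * gap(mid) <= 0.0:
            high = mid
        else:
            low = mid
    return 0.5 * (low + high)

a, b = (0.0, 0.0), (1.0, 2.0)
t0 = mean_value_parameter(a, b)
c = point_on_segment(a, b, t0)
```

For these choices the one-variable function is $g(t)=5t^2$, so the search should settle at the midpoint of the segment.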
Transforming Differential Operators
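Because of the chain rule and Cauchy’s invariant rule, we know that we can transform differentials along with functions. For example, if we write
$$x=r\cos\theta\qquad y=r\sin\theta$$
we can write the differentials of $x$ and $y$ in terms of the differentials of $r$ and $\theta$:
$$dx=\cos\theta\,dr-r\sin\theta\,d\theta\qquad dy=\sin\theta\,dr+r\cos\theta\,d\theta$$
It turns out that the chain rule also tells us how to rewrite differential operators in terms of the variables. But these go in the other direction. That is, we can write the differential operators $\frac{\partial}{\partial r}$ and $\frac{\partial}{\partial\theta}$ in terms of the operators $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$.
First of all, let’s write down the differential of $f$ in terms of $x$ and $y$ and in terms of $r$ and $\theta$:
$$df=\frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy\qquad df=\frac{\partial f}{\partial r}dr+\frac{\partial f}{\partial\theta}d\theta$$
and now we can rewrite $dx$ and $dy$ in terms of $dr$ and $d\theta$.
$$df=\frac{\partial f}{\partial x}\left(\cos\theta\,dr-r\sin\theta\,d\theta\right)+\frac{\partial f}{\partial y}\left(\sin\theta\,dr+r\cos\theta\,d\theta\right)=\left(\cos\theta\frac{\partial f}{\partial x}+\sin\theta\frac{\partial f}{\partial y}\right)dr+\left(-r\sin\theta\frac{\partial f}{\partial x}+r\cos\theta\frac{\partial f}{\partial y}\right)d\theta$$
Now by uniqueness we can read off the partial derivatives of $f$ in terms of $r$ and $\theta$:
$$\frac{\partial f}{\partial r}=\cos\theta\frac{\partial f}{\partial x}+\sin\theta\frac{\partial f}{\partial y}\qquad\frac{\partial f}{\partial\theta}=-r\sin\theta\frac{\partial f}{\partial x}+r\cos\theta\frac{\partial f}{\partial y}$$
Finally, we pull all mention of $f$ out of our notation and just write out the differential operators.
$$\frac{\partial}{\partial r}=\cos\theta\frac{\partial}{\partial x}+\sin\theta\frac{\partial}{\partial y}\qquad\frac{\partial}{\partial\theta}=-r\sin\theta\frac{\partial}{\partial x}+r\cos\theta\frac{\partial}{\partial y}$$
Now we’re done rewriting, but for good form we should express these coefficients in terms of $x$ and $y$.
$$\frac{\partial}{\partial r}=\frac{x}{\sqrt{x^2+y^2}}\frac{\partial}{\partial x}+\frac{y}{\sqrt{x^2+y^2}}\frac{\partial}{\partial y}\qquad\frac{\partial}{\partial\theta}=-y\frac{\partial}{\partial x}+x\frac{\partial}{\partial y}$$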
It’s important to note that there’s really no difference between these last two steps. The first one uses the variables $r$ and $\theta$ while the second uses the variables $x$ and $y$, but they express the exact same functions, given the original substitutions above.
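The operator substitution for polar coordinates can be spot-checked numerically. The sketch below assumes a hypothetical test function $f(x,y)=x^2y$ (not from the text) and compares finite-difference estimates of $\frac{\partial f}{\partial r}$ and $\frac{\partial f}{\partial\theta}$ with the combinations $\cos\theta\frac{\partial f}{\partial x}+\sin\theta\frac{\partial f}{\partial y}$ and $-y\frac{\partial f}{\partial x}+x\frac{\partial f}{\partial y}$.

```python
import math

# Hypothetical test function (not from the text) and its Cartesian partials.
def f(x, y):
    return x * x * y

def f_x(x, y):
    return 2.0 * x * y

def f_y(x, y):
    return x * x

def h(r, theta):
    """f rewritten in polar coordinates."""
    return f(r * math.cos(theta), r * math.sin(theta))

def polar_partials_numeric(r, theta, step=1e-6):
    """Central-difference estimates of the partials in r and theta."""
    d_r = (h(r + step, theta) - h(r - step, theta)) / (2.0 * step)
    d_theta = (h(r, theta + step) - h(r, theta - step)) / (2.0 * step)
    return d_r, d_theta

def polar_partials_from_operators(r, theta):
    """The same partials via the transformed differential operators."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    d_r = math.cos(theta) * f_x(x, y) + math.sin(theta) * f_y(x, y)
    d_theta = -y * f_x(x, y) + x * f_y(x, y)
    return d_r, d_theta

numeric = polar_partials_numeric(2.0, 0.9)
transformed = polar_partials_from_operators(2.0, 0.9)
```

The two tuples should agree to within finite-difference error, whichever pair of variables is used to express the coefficients.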
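More generally, let’s say we have a vector-valued function $g:\mathbb{R}^m\to\mathbb{R}^n$ defining a substitution
$$y^i=g^i(x^1,\dots,x^m)$$
Cauchy’s invariant rule tells us that this gives rise to a substitution for differentials.
$$dy^i=\sum_{j=1}^m\frac{\partial y^i}{\partial x^j}dx^j$$
We can play it a little loose and write this out in matrix notation:
$$\begin{pmatrix}dy^1\\\vdots\\dy^n\end{pmatrix}=\begin{pmatrix}\frac{\partial y^1}{\partial x^1}&\cdots&\frac{\partial y^1}{\partial x^m}\\\vdots&\ddots&\vdots\\\frac{\partial y^n}{\partial x^1}&\cdots&\frac{\partial y^n}{\partial x^m}\end{pmatrix}\begin{pmatrix}dx^1\\\vdots\\dx^m\end{pmatrix}$$
Now if we have a function $f$ in terms of the $y$ variables, we can use the substitution above to write it as a function of the $x$ variables. We can write the differential of $f$ in terms of each
$$df=\sum_{i=1}^n\frac{\partial f}{\partial y^i}dy^i\qquad df=\sum_{j=1}^m\frac{\partial f}{\partial x^j}dx^j$$
Next we use the substitutions of the differentials to rewrite the first form as
$$df=\sum_{i=1}^n\sum_{j=1}^m\frac{\partial f}{\partial y^i}\frac{\partial y^i}{\partial x^j}dx^j$$
Then uniqueness allows us to match up the coefficients and write out the partial derivatives in terms of the $x$ variables
$$\frac{\partial f}{\partial x^j}=\sum_{i=1}^n\frac{\partial f}{\partial y^i}\frac{\partial y^i}{\partial x^j}$$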
It is in this form that the chain rule is most often introduced, or the similar form
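And now we can remove mention of $f$ from the formulæ and speak directly in terms of the operators
$$\frac{\partial}{\partial x^j}=\sum_{i=1}^n\frac{\partial y^i}{\partial x^j}\frac{\partial}{\partial y^i}$$
Again, we can play it a little loose and write this in matrix notation
$$\begin{pmatrix}\frac{\partial}{\partial x^1}\\\vdots\\\frac{\partial}{\partial x^m}\end{pmatrix}=\begin{pmatrix}\frac{\partial y^1}{\partial x^1}&\cdots&\frac{\partial y^n}{\partial x^1}\\\vdots&\ddots&\vdots\\\frac{\partial y^1}{\partial x^m}&\cdots&\frac{\partial y^n}{\partial x^m}\end{pmatrix}\begin{pmatrix}\frac{\partial}{\partial y^1}\\\vdots\\\frac{\partial}{\partial y^n}\end{pmatrix}$$
This is very similar to the substitution for differentials written in matrix notation. The differences are that we transform from $y$-derivations to $x$-derivations instead of from $x$-differentials to $y$-differentials, and the two substitution matrices are the transposes of each other. Those who have been following closely (or who have some background in differential geometry) should start to see the importance of this latter fact, but for now we’ll consider this a statement about formulas and methods of calculation. We’ll come to the deeper geometric meaning when we come through again in a wider context.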
Product and Quotient Rules
As I said before, there’s generally no product of higher-dimensional vectors, and so there’s no generalization of the product rule. But we can multiply and divide real-valued functions of more than one variable. Finding the differential of such a product or quotient function is a nice little exercise in using Cauchy’s invariant rule.
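For all that follows we’re considering two real-valued functions of $n$ real variables: $f(x)=f(x^1,\dots,x^n)$ and $g(x)=g(x^1,\dots,x^n)$. We’ll put them together to give a single map from $\mathbb{R}^n$ to $\mathbb{R}^2$ by picking orthonormal coordinates $u$ and $v$ on the latter space and defining
$$u=f(x)\qquad v=g(x)$$
We also have two familiar functions that we don’t often think of explicitly as functions from $\mathbb{R}^2$ to $\mathbb{R}$ (the second being defined only where $v\neq0$):
$$p(u,v)=uv\qquad q(u,v)=\frac{u}{v}$$
Now we can find the differentials of $p$ and $q$
$$dp=v\,du+u\,dv\qquad dq=\frac{1}{v}du-\frac{u}{v^2}dv=\frac{v\,du-u\,dv}{v^2}$$
Notice that the differential for $q$ is exactly the alternate notation I mentioned when defining the one-variable quotient rule!
With all this preparation out of the way, the product function $fg$ can be seen as the composition $p(f(x),g(x))$, while the quotient function $\frac{f}{g}$ can be seen as the composition $q(f(x),g(x))$. So to calculate the differentials of the product and quotient we can use Cauchy’s invariant rule to make the substitutions
$$d(fg)=g\,df+f\,dg\qquad d\left(\frac{f}{g}\right)=\frac{g\,df-f\,dg}{g^2}$$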
The upshot is that just like in the case of one variable we can differentiate a product of two functions by differentiating each of the functions, multiplying by the other function, and adding the two resulting terms. We just use the differential instead of the derivative.
Similarly, we can differentiate the quotient of two functions just as in the one-variable case, but using the differential instead of the derivative.
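As an informal check of the product and quotient rules for differentials, the sketch below compares finite-difference gradients of a product and a quotient with the combinations $g\,df+f\,dg$ and $\frac{g\,df-f\,dg}{g^2}$, component by component. The two functions, the sample point, and the helper `gradient` are hypothetical choices made only for illustration.

```python
def gradient(fn, point, step=1e-6):
    """Central-difference estimate of the partial derivatives of fn."""
    partials = []
    for i in range(len(point)):
        forward, backward = list(point), list(point)
        forward[i] += step
        backward[i] -= step
        partials.append((fn(forward) - fn(backward)) / (2.0 * step))
    return partials

# Hypothetical example functions of two variables.
def f(p):
    return p[0] ** 2 + p[1]

def g(p):
    return 1.0 + p[0] * p[1]

point = [1.0, 2.0]  # g(point) = 3, so the quotient is defined here
df, dg = gradient(f, point), gradient(g, point)

# Components of g df + f dg and of (g df - f dg) / g^2.
product_rule = [g(point) * a + f(point) * b for a, b in zip(df, dg)]
quotient_rule = [(g(point) * a - f(point) * b) / g(point) ** 2
                 for a, b in zip(df, dg)]

# Direct differentiation of the product and quotient functions.
d_product = gradient(lambda p: f(p) * g(p), point)
d_quotient = gradient(lambda p: f(p) / g(p), point)
```

Each component of `d_product` and `d_quotient` should match the corresponding rule up to finite-difference error.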
Cauchy’s Invariant Rule
An immediate corollary of the chain rule is another piece of “syntactic sugar”.
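If we have functions $g:X\to\mathbb{R}^n$ and $f:Y\to\mathbb{R}^p$ for some open regions $X\subseteq\mathbb{R}^m$ and $Y\subseteq\mathbb{R}^n$ so that the image $g(X)$ is contained in $Y$, we can compose the two functions to get a new function $h=f\circ g:X\to\mathbb{R}^p$. In terms of formulas, we can choose coordinates $y^i$ on $\mathbb{R}^n$ and write out both the function $f(y^1,\dots,y^n)$ and the component functions $g^i(x)$. We get a formula for $h$ by substituting $g^i(x)$ for $y^i$ in the formula for $f$ and write $h(x)=f(g^1(x),\dots,g^n(x))$.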
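The language there seems a little convoluted, so I’d like to give an example. We might define a function $f(x,y)=x^2+y^2$ for all points $(x,y)$ in the plane $\mathbb{R}^2$. This is all well and good, but we might want to talk about the function in polar coordinates. To this end, we may define $x=r\cos\theta$ and $y=r\sin\theta$. These are the component functions describing a transformation $g$ from the region $(0,\infty)\times\mathbb{R}$ to the region where $(x,y)\neq(0,0)$. We can substitute $r\cos\theta$ for $x$ and $r\sin\theta$ for $y$ in our formula for $f$ to get a new function $h$ with formula
$$h(r,\theta)=r^2\cos^2\theta+r^2\sin^2\theta=r^2$$
This much is straightforward. The thing is, now we want to take differentials. What Cauchy’s invariant rule tells us is that we can calculate the differential of $h$ by not only substituting $g^i(x)$ for $y^i$, but also substituting $dg^i(x)$ for $dy^i$ in the formula for $df$. That is, if
$$df=\sum_{i=1}^n\frac{\partial f}{\partial y^i}(y)\,dy^i$$
then we have the equivalence
$$dh=\sum_{i=1}^n\frac{\partial f}{\partial y^i}(g(x))\,dg^i(x)$$
In our particular example, we can easily calculate the differential of $f$ using our first formula:
$$df=2x\,dx+2y\,dy$$
or using our second formula:
$$dh=2r\,dr$$
We want to call both of these simply $df$. But can we do so unambiguously? Indeed, if $x=r\cos\theta$ then we find
$$dx=\cos\theta\,dr-r\sin\theta\,d\theta$$
and if $y=r\sin\theta$ then we find
$$dy=\sin\theta\,dr+r\cos\theta\,d\theta$$
We substitute these into our formula for $df$ to find
$$2r\cos\theta\left(\cos\theta\,dr-r\sin\theta\,d\theta\right)+2r\sin\theta\left(\sin\theta\,dr+r\cos\theta\,d\theta\right)=2r\,dr$$
just the same as if we calculated directly from the formula in terms of $r$ and $\theta$.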
That is, we can substitute our formulæ for the coordinate functions before taking the differential in terms of $x^j$, or we can take the differential in terms of $y^i$ and then substitute our formulæ for the coordinate functions $g^i(x)$ and their differentials $dg^i(x)$ into the result. Either way, we end up in the same place, so we don’t have to worry about ending up with two (or more!) “different” differentials of $f$.
So, how do we verify this using the chain rule? Just write the differentials out using partial derivatives. For example, we know that
$$dg^i=\sum_{j=1}^m\frac{\partial g^i}{\partial x^j}dx^j$$
and so on. So, performing our substitutions we can find:
$$\sum_{i=1}^n\frac{\partial f}{\partial y^i}dg^i=\sum_{i=1}^n\sum_{j=1}^m\frac{\partial f}{\partial y^i}\frac{\partial g^i}{\partial x^j}dx^j=\sum_{j=1}^m\frac{\partial h}{\partial x^j}dx^j=dh$$
The important part here is the passage from products of two partial derivatives to single partial derivatives of $h$. This works out because when we consider differentials as linear transformations, the matrix entries are the partial derivatives. The composition of the linear transformations $df(g(x))$ and $dg(x)$ is given by the product of these matrices, and the entries of the resulting matrix must (by uniqueness) be the partial derivatives of the composite function.
The Chain Rule
Since the components of the differential are given by partial derivatives, and partial derivatives (like all single-variable derivatives) are linear, it’s straightforward to see that the differential operator is linear as well. That is, if $f$ and $g$ are two functions, both of which are differentiable at a point $x$, and $a$ and $b$ are real constants, then the linear combination $af+bg$ is also differentiable at $x$, and the differential is given by
$$d(af+bg)(x)=a\,df(x)+b\,dg(x)$$
There’s not usually a product for function values in $\mathbb{R}^n$, so there’s not usually any analogue of the product rule and definitely none of the quotient rule, so we can ignore those for now.
But we do have a higher-dimensional analogue for the chain rule. If we have a function $f:X\to\mathbb{R}^n$ defined on some open region $X\subseteq\mathbb{R}^m$ and another function $g:Y\to\mathbb{R}^p$ defined on a region $Y\subseteq\mathbb{R}^n$ that contains the image $f(X)$, then we can compose them to get a single function $g\circ f:X\to\mathbb{R}^p$ defined by $[g\circ f](x)=g(f(x))$. And if $f$ is differentiable at a point $x$ and $g$ is differentiable at the image point $f(x)$, then the composite function is differentiable at $x$.
First of all, what should the differential be? Remember that the differential $df(x)$ is a linear transformation that takes displacements $t\in\mathbb{R}^m$ from the point $x$ and turns them into displacements $df(x)t\in\mathbb{R}^n$ from the point $f(x)$. Then the differential $dg(f(x))$ is a linear transformation that takes displacements $s\in\mathbb{R}^n$ from the point $f(x)$ and turns them into displacements $dg(f(x))s\in\mathbb{R}^p$ from the point $g(f(x))$. Putting these together, we have a composite linear transformation $dg(f(x))df(x)$ that takes displacements $t\in\mathbb{R}^m$ from the point $x$ and turns them into displacements $dg(f(x))df(x)t\in\mathbb{R}^p$ from the point $g(f(x))$. I assert that this composite transformation is exactly the differential of the composite function.
Just as a sanity check, what happens when we look at single-variable real-valued functions? In this case, $df(x)$ and $dg(f(x))$ are both linear transformations from one-dimensional spaces to other one-dimensional spaces. That is, they’re represented as $1\times1$ matrices that just multiply by the single real entry. So the composite of the two transformations is given by the matrix whose single entry is the product of the two matrices’ single entries. In other words, in one variable the differentials look like single real numbers $f'(x)$ and $g'(f(x))$, and their composite is given by multiplication: $g'(f(x))f'(x)$. This is exactly the one-variable chain rule. To understand multiple variables we have to move from products of real numbers to compositions of linear transformations, which will be products of real matrices.
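Okay, so let’s verify that $dg(f(x))df(x)$ does indeed act as a differential for $g\circ f$. It’s clearly a linear transformation between the appropriate two spaces of displacements. What we need to verify is that it gives a good approximation. That is, for every $\epsilon>0$ there is a $\delta>0$ so that if $\delta>\lVert t\rVert>0$ we have
$$\left\lVert\left[g(f(x+t))-g(f(x))\right]-dg(f(x))df(x)t\right\rVert<\epsilon\lVert t\rVert$$
First of all, since $g$ is differentiable at $f(x)$, given $\tilde{\epsilon}>0$ there is a $\tilde{\delta}>0$ so that if $\tilde{\delta}>\lVert s\rVert>0$ we have
$$\left\lVert\left[g(f(x)+s)-g(f(x))\right]-dg(f(x))s\right\rVert<\tilde{\epsilon}\lVert s\rVert$$
Now since $f$ is differentiable it satisfies a Lipschitz condition. We showed that this works for real-valued functions, but extending the result is very straightforward. That is, there is some radius $r$ and a constant $M$ so that if $r>\lVert t\rVert>0$ we have the inequality $\lVert f(x+t)-f(x)\rVert<M\lVert t\rVert$. That is, $f$ cannot stretch displacements by more than a factor of $M$ as long as the displacements are small enough.
Now $r$ may be smaller than $\frac{\tilde{\delta}}{M}$ already, but just in case let’s shrink it until it is. Then we know that
$$\lVert f(x+t)-f(x)\rVert<M\lVert t\rVert<Mr\leq\tilde{\delta}$$
so we can use this difference as a displacement from $f(x)$. We find
$$\left\lVert\left[g(f(x+t))-g(f(x))\right]-dg(f(x))\left(f(x+t)-f(x)\right)\right\rVert\leq\tilde{\epsilon}\lVert f(x+t)-f(x)\rVert<\tilde{\epsilon}M\lVert t\rVert$$
Now we’re going to find a constant $N$ and a radius $\hat{\delta}$ so that
$$\left\lVert dg(f(x))\left(f(x+t)-f(x)\right)-dg(f(x))df(x)t\right\rVert<\tilde{\epsilon}N\lVert t\rVert$$
whenever $\hat{\delta}>\lVert t\rVert>0$. Once this is established, we are done. Given an $\epsilon>0$ we can set $\tilde{\epsilon}=\frac{\epsilon}{M+N}$ and let $\delta$ be the smaller of the two resulting radii $r$ and $\hat{\delta}$. Within this smaller radius, the triangle inequality combines the two estimates into the desired inequality.
To get this result, we choose orthonormal coordinates on the space $\mathbb{R}^n$. We can then use these coordinates to write
$$dg(f(x))\left(f(x+t)-f(x)\right)-dg(f(x))df(x)t=\sum_{i=1}^nD_ig(f(x))\left[f^i(x+t)-f^i(x)-df^i(x)t\right]$$
But since each of the several $f^i$ is differentiable we can pick our radius $\hat{\delta}$ so that all of the inequalities
$$\left|f^i(x+t)-f^i(x)-df^i(x)t\right|<\frac{\tilde{\epsilon}}{n}\lVert t\rVert$$
hold for $\hat{\delta}>\lVert t\rVert>0$. Then we let $N$ be the magnitude of the largest of the component partial derivatives $\lVert D_ig(f(x))\rVert$, and we’re done.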
Thus when $f$ is differentiable at $x$ and $g$ is differentiable at $f(x)$, then the composite $g\circ f$ is differentiable at $x$, and the differential of the composite function is given by
$$d(g\circ f)(x)=dg(f(x))df(x)$$
the composite of the differentials, considered as linear transformations.
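The statement that the differential of a composite is the composite of the differentials can be checked numerically by comparing matrices of partial derivatives. In this sketch the inner function $f:\mathbb{R}^2\to\mathbb{R}^3$, the outer function $g:\mathbb{R}^3\to\mathbb{R}^2$, and the helpers `jacobian` and `matmul` are all hypothetical, with the partial derivatives estimated by central differences.

```python
import math

def jacobian(fn, point, step=1e-6):
    """Finite-difference matrix of partials: rows are outputs, columns inputs."""
    columns = []
    for j in range(len(point)):
        forward, backward = list(point), list(point)
        forward[j] += step
        backward[j] -= step
        columns.append([(p - q) / (2.0 * step)
                        for p, q in zip(fn(forward), fn(backward))])
    outputs = len(columns[0])
    # Transpose the list of columns into a list of rows.
    return [[columns[j][i] for j in range(len(point))] for i in range(outputs)]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def f(p):  # hypothetical inner function from R^2 to R^3
    x, y = p
    return [x * y, x + y, math.sin(x)]

def g(q):  # hypothetical outer function from R^3 to R^2
    u, v, w = q
    return [u * v, v + w * w]

def composite(p):
    return g(f(p))

x0 = [0.5, -1.2]
direct = jacobian(composite, x0)
composed = matmul(jacobian(g, f(x0)), jacobian(f, x0))
max_gap = max(abs(direct[i][j] - composed[i][j])
              for i in range(2) for j in range(2))
```

Note the order of the product: the matrix for $g$ is evaluated at the image point $f(x)$ and multiplies the matrix for $f$ on the left.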
Vector-Valued Functions
Now we know how to modify the notion of the derivative of a function to deal with vector inputs by defining the differential. But what about functions that have vectors as outputs?
Well, luckily when we defined the differential we didn’t really use anything about the space where our function took its values but that it was a topological vector space. Indeed, when defining the differential of a function $f:\mathbb{R}^m\to\mathbb{R}^n$ we need to set up a new function $df$ that takes a point $x$ in the Euclidean space $\mathbb{R}^m$ and a displacement vector $t$ in $\mathbb{R}^m$ as inputs, and which gives a displacement vector $df(x;t)$ in $\mathbb{R}^n$ as its output. It must be linear in the displacement, meaning that we can view $df(x)$ as a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}^n$. And it must satisfy a similar approximation condition, replacing the absolute value with the notion of length in $\mathbb{R}^n$: for every $\epsilon>0$ there is a $\delta>0$ so that if $\delta>\lVert t\rVert_{\mathbb{R}^m}>0$ we have
$$\left\lVert\left[f(x+t)-f(x)\right]-df(x)t\right\rVert_{\mathbb{R}^n}<\epsilon\lVert t\rVert_{\mathbb{R}^m}$$
From here on we’ll just determine which norm we mean by context, since we only have one norm on each vector space.
Okay, so we can talk about differentials of vector-valued functions: the differential of $f$ at a point $x$ (if it exists) is a linear transformation $df(x)$ that turns displacements in the input space into displacements in the output space, and does so in the way that most closely approximates the action of the function itself. But how do we define the function?
If we pick an orthonormal basis $\{e_i\}_{i=1}^n$ for $\mathbb{R}^n$ we can write the components of $f$ as separate functions. That is, we say
$$f(x)=\sum_{i=1}^nf^i(x)e_i$$
Now I assert that the differential can be taken component-by-component, just as continuity works: $df(x)t=\sum_{i=1}^n\left(df^i(x)t\right)e_i$. On the left is the differential of $f$ as a vector-valued function, while on the right we find the differentials of the several real-valued functions $f^i$. The differential exists if and only if the component differentials do.
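First, from components to the vector-valued function. Clearly this definition of $df(x)$ gives us a linear map from displacements in $\mathbb{R}^m$ to displacements in $\mathbb{R}^n$. But does it satisfy the approximation inequality? Indeed, for every $\epsilon>0$ we can find a $\delta>0$ so that all the inequalities
$$\left|\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right|<\frac{\epsilon}{n}\lVert t\rVert$$
are satisfied when $\delta>\lVert t\rVert>0$. Of course, there are different $\delta$s that work for each component, but we can pick the smallest of them. Then it’s a simple matter to find
$$\left\lVert\left[f(x+t)-f(x)\right]-df(x)t\right\rVert\leq\sum_{i=1}^n\left|\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right|<\epsilon\lVert t\rVert$$
so if the component functions are differentiable, then so is the function as a whole.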
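On the other hand, if the differential $df(x)$ exists then for every $\epsilon>0$ there exists a $\delta>0$ so that if $\delta>\lVert t\rVert>0$ we have
$$\left\lVert\left[f(x+t)-f(x)\right]-df(x)t\right\rVert<\epsilon\lVert t\rVert$$
But then it’s easy to see that
$$\left|\left[f^i(x+t)-f^i(x)\right]-df^i(x)t\right|\leq\left\lVert\left[f(x+t)-f(x)\right]-df(x)t\right\rVert<\epsilon\lVert t\rVert$$
and so each of the component differentials exists.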
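Finally, I should mention that if we also pick an orthonormal basis for the input space $\mathbb{R}^m$ we can expand each component differential $df^i(x)$ in terms of the dual basis $dx^j$:
$$df^i(x)=\sum_{j=1}^m\frac{\partial f^i}{\partial x^j}dx^j$$
Then we can write the whole differential out as a matrix whose entry in the $i$th row and $j$th column is $\frac{\partial f^i}{\partial x^j}$. If we write a displacement in the input as an $m$-dimensional column vector we find our estimate of the displacement in the output as an $n$-dimensional column vector:
$$\begin{pmatrix}df^1(x;t)\\\vdots\\df^n(x;t)\end{pmatrix}=\begin{pmatrix}\frac{\partial f^1}{\partial x^1}&\cdots&\frac{\partial f^1}{\partial x^m}\\\vdots&\ddots&\vdots\\\frac{\partial f^n}{\partial x^1}&\cdots&\frac{\partial f^n}{\partial x^m}\end{pmatrix}\begin{pmatrix}t^1\\\vdots\\t^m\end{pmatrix}$$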
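To make the matrix form concrete, here is a minimal sketch that uses the polar-to-Cartesian map as a hypothetical vector-valued function from $\mathbb{R}^2$ to $\mathbb{R}^2$. It multiplies the matrix of partial derivatives by a small input displacement and compares the estimate with the actual change in the output. The sample point and displacement are arbitrary illustrative choices.

```python
import math

def polar_to_cartesian(p):
    """A sample vector-valued function: (r, theta) -> (x, y)."""
    r, theta = p
    return (r * math.cos(theta), r * math.sin(theta))

def jacobian_matrix(p):
    """Matrix of partials: row i is the output component, column j the input."""
    r, theta = p
    return [[math.cos(theta), -r * math.sin(theta)],
            [math.sin(theta), r * math.cos(theta)]]

def apply_matrix(matrix, column):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(entry * t for entry, t in zip(row, column)) for row in matrix]

point = (2.0, 0.5)
displacement = [1e-3, -2e-3]

# Linear estimate of the output displacement from the differential.
estimate = apply_matrix(jacobian_matrix(point), displacement)

# Actual output displacement, for comparison.
moved = polar_to_cartesian((point[0] + displacement[0],
                            point[1] + displacement[1]))
start = polar_to_cartesian(point)
actual = [m - s for m, s in zip(moved, start)]

error = math.hypot(actual[0] - estimate[0], actual[1] - estimate[1])
```

The error should be small compared with the length of the displacement itself, and it shrinks faster than the displacement does as the displacement is made smaller.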
