## Elementary Matrices Generate the General Linear Group

Okay, so we can use elementary row operations to put any matrix into its (unique) reduced row echelon form. As we stated last time, this consists of building up a basis for the image of the transformation the matrix describes by walking through a basis for the domain space and either adding a new, independent basis vector or writing the image of a domain basis vector in terms of the existing image basis vectors.

So let’s say we’ve got a transformation in . Given a basis, we get an invertible matrix (which we’ll also call ). Then we can use elementary row operations to put this matrix into its reduced row echelon form. But now every basis vector gets sent to a vector that’s linearly independent of all the others, or else the transformation wouldn’t be invertible! That is, the reduced row echelon form of the matrix must be the identity matrix.

But remember that every one of our elementary row operations is the result of multiplying on the left by an elementary matrix. So we can take the matrices corresponding to the list of all the elementary row operations and write

which tells us that applying all these elementary row operations one after another leads us to the identity matrix. But this means that the product of all the elementary matrices on the right is . And since we can also apply this to the transformation , we can find a list of elementary matrices whose product is . That is, any invertible linear transformation can be written as the product of a finite list of elementary matrices, and thus the elementary matrices generate the general linear group.

## Reduced Row Echelon Form

Let’s take row echelon form and push it a little further into a form that *is* unique: *reduced* row echelon form.

When we were doing Gaussian elimination we saved some steps by leaving pivoted rows alone. This was fine for the purposes of solving equations, where the point is just to eliminate progressively more and more variables in each equation. We also saved steps by leaving the pivots as whatever values they started with. Both of these led to a lot of variation in the possibilities for row echelon forms. We’ll tweak the method of Gaussian elimination to give a new algorithm called “Gauss-Jordan elimination”, which will take more steps but will remedy these problems.

First of all, when we pick a pivot and swap it into place, we scale that row by the inverse of the pivot value. This will set the pivot value itself to (which, incidentally, helps with the shears that come next). Then we shear to eliminate not only the nonzero entries below the pivot, but the nonzero entries *above* the pivot as well. That leaves the pivot as the only nonzero entry in its column.

Let’s see what this looks like in terms of our usual example:

We use the in the upper-left as a pivot and scale the top row by

Next, we clear out the first column with shears. Note that the shear value is now just the negative of the entry in the first column.

So far it’s almost the same as before, except for normalizing the pivot. But now when we choose the in the second column as the pivot we scale its row by

and use shears to clear the column above and below this pivot

Now the only choice for a pivot in the third column is . Again we normalize it and use shears to clear the column

This matrix is in reduced row echelon form, and I say that it is unique. Indeed, we are not allowed to alter the basis of the input space (since that would involve elementary *column* operations), so we can view this as a process of creating a basis for the output space in terms of the given basis of the input space.

What we do is walk down the input basis, vector by vector. At each step, we ask if the image of this basis vector is linearly independent of the images of those that came before. If so, this image is a new basis vector. If not, then we can write the image (uniquely) in terms of the output basis vectors we’ve already written down. In our example above, each of the first three input basis vectors gives a new output basis vector, and the image of the last input basis vector can be written as the column vector in the last column.

The only possibility of nonuniqueness is if we run out of input basis vectors before spanning the output space. But in that case the last rows of the matrix will consist of all zeroes anyway, and so it doesn’t matter what output basis vectors we choose for those rows! And so the reduced row echelon form of the matrix is uniquely determined by the input basis we started with.

## Solving Equations with Gaussian Elimination

The row echelon form of a matrix isn’t unique, but it’s still useful for solving systems of linear equations. If we have a system written in matrix form as

then applying elementary row operations builds up a new matrix that we multiply on the left of both sides

If the new matrix is in row echelon form, the resulting system of equations will be simple to solve.

A little more explicitly, this works because each of the equations is linear, and so each of the elementary row operations corresponds to some manipulation of the equations which gives rise to an equivalent system. for example, the order of the equations doesn’t matter, so swapping two of them (swapping two rows) doesn’t change the solution. Similarly, multiplying both sides of one equation by a nonzero number doesn’t change the solution to that equation, and so doesn’t change the solution of the system. Finally, adding two equations gives another valid equation. Since the three equations “overdetermine” the solution (they’re linearly dependent), we can drop one of the first two. The result is the same as a shear.

Since we’re just doing the same thing to the rows of the column vector as we are to the matrix , we may as well stack them beside each other. In this way, we get the “augmented matrix” of the system of equations. For example, we might have the system

with the augmented matrix

Then we can follow along our elimination from yesterday, applying the same steps to the system of equations. Foe example, if at this point we add the first and third equations we get , “eliminating” the variable from the equation. We replace the third equation with this new, simpler equation, giving the effect of a shear just like we did yesterday.

At the end, we found the row echelon form

which corresponds with the system

Now the equations are arranged like a ladder for us to climb up. First, we solve the bottom equation to find . Then we can plug this into the next equation up to find , which is easily solved to find . Then we plug both of these answers into the next equation up to find , which is easily solved to find , and we’re done. In fact, this is why we use the word “echelon” — French for a rung of a ladder — to describe this form.

What about our other example from yesterday? This corresponds to a system of *four* equations with three unknowns, since we’re interpreting the last column as the column vector on the right side of the equation. The system is

We convert this to an augmented matrix, put this matrix into row echelon form, and return it to the form of a system of equations to find

The empty row on the bottom is fine, since it just tells us the tautology that . Unfortunately, the line above that contains the contradiction that . That is, whenever the row echelon form of a system of equations has a leading coefficient in the last column, that system is unsolvable. We don’t even need to climb the ladder to know that.

On the other hand, if we started with the system

and put it in row echelon form (do it yourself) we get

Now we have no contradictions, but the system is underdetermined. Indeed, the rank is , so the kernel has dimension , and thus we’ll have one free parameter in our solution. It’s still easy to solve, though. We first find that . Plugging this into the first equation we find , which is easily solved to find .

So whether the system is exactly solvable, underdetermined, or contradictory, we can find whatever solutions may exist explicitly using a row echelon form.

## Row Echelon Form

For now, I only want to focus on elementary row operations. That is, transformations of matrices that can be effected by multiplying by elementary matrices on the left, not on the right.

Row echelon form is a (non-unique) simplification of the form of a matrix that can be reached by elementary row operations. A matrix is in row echelon form if all the empty (all-zero) rows are below all the nonzero rows, and the leftmost nonzero entry in a given row is to the right of the leftmost nonzero entry in the row above it. For example, an upper-triangular matrix is in row echelon form. We put a matrix into row echelon form by a method called “Gaussian elimination”, which always moves a matrix closer and closer to row echelon form.

First, we look at the first column. If it’s empty we leave it alone and move on the next column. If there are any nonzero entries, there must be only one of them at the top of the column, or the matrix can’t be in row echelon form. So, if this isn’t the case then we pick one of the nonzero entries and call it our “pivot”. Swap its whole row up to the top row, and then use shears to eliminate all the other nonzero entries in the column before moving on to the second column.

Before we continue, I’ll point out that the pivot is now the leading entry on a nonzero row at the top of the matrix. We’ll regard it as fixed, so whenever I say “all rows” I mean all the rows that we haven’t locked down yet.

Now that we’ve moved on to the next column, we again look for any nonzero entries. If there are none, we can move on. But if there are, we choose one to be a pivot, swap it to the top (below the previously locked rows) and use shears to remove all the other nonzero entries in the column (other than the locked rows). We continue like this, choosing pivots where there are nonzero entries to deal with, moving them to the top, and eliminating all the other nonzero entries below them, until we’ve locked down all the rows or run out of columns.

Let’s look at this method applied to the matrix

in the first column we have the nonzero value in the first row, so we may as well leave it where it is. We add times the first row to the second to eliminate the there, and add the first row to the third to eliminate the there. At this point our matrix looks like

Now in the second row we have at the top (since the very top row is finished). We leave it where it is and add times the second row to the third to eliminate the there. Now the matrix looks like

now there are no empty rows, so “all” of them are below all the nonzero rows. Further, within the nonzero rows the leading entries move from left to right as we move down the rows. The matrix is now in row echelon form.

Now let’s try this one:

Again, we’ll start with using the upper-left as our first pivot. We use shears to eliminate the rest of the entries in the first column and get

Now there are nonzero entries in the second column, but none of them are at the top. So we swap the second and third rows to bring a pivot to the top

and eliminate using a shear

the third column has no nonzero entries (other than the two locked-in entries at the top), so we skip it. In the fourth column we use as a pivot and eliminate, getting

Now the empty row is on the bottom, and the leading entries of the other rows move from left to right as we move down the rows. Thus, the matrix is now in row echelon form.

## Subgroups Generated by Shears

Okay, when I introduced elementary matrices I was a bit vague on the subgroup that the shears generate. I mean to partially rectify that now that we’ve got elementary row operations to work with.

I assert that the upper shears — those elementary matrices with one nonzero entry *above* the diagonal — generate the group of “upper unipotent” matrices. A matrix is unipotent if it is the identity plus a nilpotent transformation, so all of its eigenvalues are . Specifically, we want those that are also upper-triangular. Thus the matrices we’re talking about have everywhere below the diagonal and all on the diagonal, as

What I’m going to do, now, is turn this matrix into the identity by using elementary row operations. Specifically, I’ll only ever add a multiple of one row to another row, which can be effected by multiplying the matrix on the left by shear matrices. And I’ll only add multiples of one row to rows *above* it, which can be effected by using *upper* shears.

So first, let’s add times the th row to the th.

We’ve cleared out the last entry in the next-to-last row in the matrix. Keep going, clearing out all the rest of the last column

Now we can use the next-to-last row — which has only a single nonzero entry left — to clear out the rest of the next-to-last column

Keep going, clearing out the columns from right to left

and we’ve got the identity matrix! So that means that this big product of all these upper shears on the left is actually the *inverse* to the matrix that we started with. Now we just multiply the inverse of each of these shears together in reverse order to find

So any upper-unipotent matrix can be written as a product of upper shears. Similarly, any lower-unipotent matrix (all above the diagonal and all on the diagonal) can be written as a product of lower shears. If we add in scalings, we can adjust the diagonal entries too. Given an invertible upper-triangular matrix, first factor it into the product of a diagonal matrix and an upper-unipotent matrix by dividing each row by its diagonal entry. Then the diagonal part can be built from scalings, while the upper-unipotent part can be built from shears. And, of course, scalings and lower shears together generate the subgroup of invertible lower-triangular matrices.

## Elementary Row and Column Operations

Here’s a topic that might be familiar all the way back to high school mathematics classes. We’re going to use the elementary matrices to manipulate a matrix. Rather than work out abstract formulas, I’m going to take the specific matrix

and perform example manipulations on it to draw conclusions. The effects of elementary matrices are so, well, *elementary* that it will be clear how they generalize.

So how do we use the matrices to manipulate matrices? Well, we’re using the elementary matrices to change the input or output bases of the linear transformation represented by the matrix. So to change the output basis we’ll multiply on the left by an elementary matrix, while to change the input basis we’ll multiply on the right by the inverse of an elementary matrix — which itself is an elementary matrix of the same kind.

Let’s start with the swaps. On the left we might have

while on the right we might have

On the left, the action of a swap is to swap two rows, while on the right the action is to swap two columns of the matrix.

Next come the scalings. On the left

and on the right

On the left, the action of a scaling is to multiply a row by the scaling factor, while on the right the effect is to multiply a column by the scaling factor.

Finally, the shears. On the left

and on the right

On the left, the shear adds times the th row to the th row, while on the right, the shear adds times the th column to the th column.

So in general we see that acting on the left manipulates the rows of a matrix, while acting on the right manipulates the columns. We call these the “elementary row operations” and “elementary column operations”, respectively. Any manipulation of the form of a matrix we can effect by these operations can be seen as the result of applying a change of basis matrix on the left (output) or right (input) side. And so any two matrices related by these operations can be seen as representing “the same” transformation in two different bases.

## Elementary Matrices

Today we’ll write down three different collections of matrices that together provide us all the tools we need to modify bases.

First, and least important, are the swaps. A swap is a matrix that looks like the identity, but has two of its nonzero entries in reversed columns.

where the two swapped columns (or, equivalently, rows) are and . The swaps generate a subgroup of isomorphic to the symmetric group . In fact, these are the image of the usual generators of under the permutation representation. They just rearrange the order of the basis elements.

Next are the scalings. A scaling is a matrix that looks like the identity, but one of its nonzero entries isn’t the identity.

where the entry is in the th row and column. The scalings generate the subgroup of diagonal matrices, which is isomorphic to — independent copies of the group of nonzero elements of under multiplication. They stretch, squeeze, or reverse individual basis elements.

Finally come the shears. A shear is a matrix that looks like the identity, but one of its off-diagonal entries is nonzero.

where the entry is in the th row and th column. If , then the extra nonzero entry falls above the diagonal and we call it an “upper shear”. On the other hand, if then the extra nonzero entry falls below the diagonal, and we call it a “lower shear”. The shears also generate useful subgroups, but the proof of this fact is more complicated, and I’ll save it for its own post.

Now I said that the swaps are the least important of the three elementary transformations, and I should explain myself. It turns out that swaps aren’t really elementary. Indeed, consider the following calculation

So we can build a swap from three shears and a scaling. It should be clear how to generalize this to build *any* swap from three shears and a scaling. But it’s often simpler to just thing of swapping two basis elements as a single basic operation rather than as a composition of shears and scalings.

On the other hand, we can tell that we can’t build any shears from scalings, since the product of scalings is always diagonal. We also can’t build any scalings from shears, since the determinant of any shear is always , and so the product of a bunch of shears also has determinant . Meanwhile, the determinant of a scaling is always the scaling factor .

## Decompositions Past and Future

Okay, let’s review some of the manipulations we’ve established.

When we’re looking at an endomorphism , we have found that we can pick a basis so that the matrix of is in Jordan normal form, which is almost diagonal. That is, we can find an invertible transformation and a “Jordan transformation” so that .

If we work with a normal transformation on a complex inner product space we can pick orthonormal basis of eigenvectors. That is, we can find a unitary transformation and a diagonal transformation so that .

Similarly, if we work with a self-adjoint transformation on a real inner product space we can pick orthonormal basis of eigenvectors. That is, we can find an orthogonal transformation and a diagonal transformation so that .

Then we generalized: if is any linear transformation between two inner product spaces we can find orthonormal bases giving the singular value decomposition. There are two unitary (or orthogonal) transformations and and a “diagonal” transformation so we can write .

We used this to show in particular if is an endomorphism on an inner product space we can write where is unitary and is positive-semidefinite. That is, if we can choose the “output basis” separately from the “input basis”, we can put into a nice form.

Now we want to continue in this direction of choosing input and output bases separately. It’s obvious we have to do this when the source and target spaces are different, but even for endomorphisms we’ll move them separately. But now we’ll move back away from inner product spaces to *any* vector spaces over *any* fields. Just like for the singular value decomposition, what we’ll end up with is essentially captured in the first isomorphism theorem, but we’ll be able to be a lot more explicit about how to find the right bases to simplify the matrix of our transformation.

So here’s the central question for the last chuck of linear algebra (for now): Given a linear transformation between any two vector spaces, how can we pick invertible transformations and so that and is in “as simple a form as possible” (and what does this mean?). To be even more particular, we’ll start with arbitrary bases of and — so already has a matrix — and we’ll look at how to modify the matrix by modifying the two bases.

## The Uniqueness of Polar Decomposition

I asserted in a comment that the polar decomposition of a transformation has certain uniqueness properties. First, we can always uniquely recover the positive-semidefinite part, and then if that’s actually positive-definite we can recover the unitary part. And it turns out our running analogy with complex numbers is the perfect way to get at these facts.

Recall that if we take a complex number in polar form we can recover its length as

Similarly, using the right polar decomposition we can write

where the radical denotes the unique positive-semidefinite square root any positive-semidefinite transformation has.

Now if is nonsingular then so is , since we can write and is definitely nonsingular. So in this case we can write and is uniquely determined. On the other hand, if is singular then we can’t invert . We then have some freedom in our choice of , corresponding to this nontrivial cokernel of .

We could also use the left polar decomposition to write

So the positive-semidefinite part is again uniquely determined, and again the unitary part will be uniquely determined so long as is nonsingular.

## Square Roots

Here’s a neat thing we can do with the spectral theorems: we can take square roots of positive-semidefinite transformations. And this makes sense, since positive-semidefinite transformations are analogous to nonnegative real numbers, and every nonnegative real number has a unique nonnegative real square root. So we expect that every positive-semidefinite transformation will have a unique positive-semidefinite square root.

So we start by writing down the spectral decomposition

Where is diagonal. And since is positive-semidefinite, every diagonal entry of — every eigenvalue of — is a nonnegative real number. We can arrange them nonincreasing order, with the largest eigenvalue in the upper left, and so on down to the lowest eigenvalue (maybe ) in the lower right corner. Since the eigenvalues are uniquely determined, with this arrangement is uniquely determined. If there are repeated eigenvalues, might not be completely determined, since we have some freedom in picking the basis for degenerate eigenspaces.

Anyhow, since each entry in is a nonnegative real number, we can replace each one with its unique nonnegative square root. We call this new matrix , and observe that . Now we can define , and calculate

So is a square root of . Since the eigenvalues of (the diagonal entries of ) are nonnegative real numbers, is positive-semidefinite.

On the other hand, what if we have some other positive-semidefinite square root . Saying that it’s a square root of means that

That is, we must have

The matrix is diagonal, and its entries — the squares of the diagonal entries of — must be the eigenvalues of . And so the entries of are the same as those of , though possibly in a different order. The rearrangement, then, is the content of conjugating by . That is, we must have

and so

And so we really have the exact same square root again. This establishes the uniqueness of the positive-semidefinite square root.