Sören Laue, creator and first author of this work [1], gave a nice talk [2,3] about it at our reading group not long ago. He has also written about how automatic and symbolic differentiation are essentially equivalent. [4]
I hope that something like his "generic tensor multiplication operator" [1], a slightly adapted Einstein notation, becomes the standard notation in the future. The traditional linear algebra notation with its ugly transpose formalism is nothing I would ever want to use.
I fully agree with you on this one. It took us ages to figure out that linear algebra notation is not the right way. Using tensor notation and einsum, tensor and matrix derivatives become almost trivial. And there is no need to use ugly transposes, Kronecker symbols, etc. That's why we wrote this article and created the website. (Sören Laue, creator of this website and my personal biased view)
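For a taste of what that looks like in practice, here's a throwaway NumPy sketch (my own toy example, nothing from the paper): for f(X) = aᵀXb, index notation gives ∂f/∂X_kl = a_k b_l directly, and you can read that off the einsum strings.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = rng.standard_normal((3, 4))

# f(X) = a_i X_ij b_j in Einstein/einsum notation
f = lambda M: np.einsum('i,ij,j->', a, M, b)

# Index calculus reads the gradient straight off: d f / d X_kl = a_k b_l
grad = np.einsum('k,l->kl', a, b)       # i.e. the outer product of a and b

# Finite-difference spot check of one entry
eps = 1e-6
E = np.zeros_like(X); E[1, 2] = eps
print((f(X + E) - f(X)) / eps, grad[1, 2])   # should agree
```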
https://en.wikipedia.org/wiki/Tensor_calculus is a term you're much likelier to find as the subject of a book. Matrices are just specific special cases of tensors, after all, and mathematicians like to generalize.
A good book on differential geometry will also probably start with an overview of tensor calculus.
Hubbard and Hubbard's Vector Calculus is a great introduction to multivariable calculus and linear algebra that ends with an introduction to differential forms. And it has a complete solutions manual you can purchase as well.
Example where it comes in handy: OLS (ordinary least squares).
You want to solve Ax = b approximately. So, minimise the two-norm |Ax-b|, or equivalently, |Ax-b|^2, or equivalently (Ax-b)ᵀ(Ax-b) = xᵀAᵀAx - 2xᵀAᵀb + bᵀb.
How to minimise it? Easy, take the derivative wrt the vector x and set to zero (the zero vector):
2AᵀAx - 2Aᵀb = 0, so x = (AᵀA)⁻¹ Aᵀb.
(Note: that's the mathematical formulation of the solution, not how you'd actually compute it.)
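For what it's worth, a quick NumPy sketch with made-up data, showing that the formula above and a proper least-squares routine agree (on a small, well-conditioned problem):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))    # tall matrix: over-determined system
b = rng.standard_normal(20)

# Closed form from the derivation above: x = (AᵀA)⁻¹ Aᵀ b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# How you'd actually compute it: a least-squares routine (QR/SVD under the hood)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True
```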
Or, equivalently, you can do the "engineer's trick": To solve an over-determined system Ax=b, where A is a rectangular matrix, you multiply both sides by Aᵀ. Thus you obtain a new system whose matrix AᵀA is square, and you can solve normally.
Sure, it's mostly a mnemonic. But you can do it easily in your head, even in infinite dimensions.
For example, if you recall that (due to Stokes' theorem) the divergence is minus the transpose of the gradient, you can solve many variational problems that way.
The canonical example is: find a function u whose gradient is a given vector field F. This is an over-determined system without a solution. Applying the above trick, you take the divergence of both sides to obtain the Poisson equation Δu = div(F), which you can solve. This is equivalent to finding the least-squares minimizer of the energy E(u) = ∫|∇u - F|²
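If you want to see that numerically, here's a rough sketch (my own tiny dense discretisation, purely illustrative): the least-squares solution of the discrete ∇u ≈ F matches the solution of the normal equations GᵀG u = GᵀF, i.e. the discrete Poisson equation.

```python
import numpy as np

def forward_diff(n):
    """(n-1) x n forward-difference matrix: (D v)_i = v_{i+1} - v_i."""
    D = np.zeros((n - 1, n))
    D[np.arange(n - 1), np.arange(n - 1)] = -1.0
    D[np.arange(n - 1), np.arange(1, n)] = 1.0
    return D

m = n = 12                                    # tiny m x n grid, dense matrices for clarity
Dx = np.kron(np.eye(n), forward_diff(m))      # differences along one grid direction
Dy = np.kron(forward_diff(n), np.eye(m))      # differences along the other
G = np.vstack([Dx, Dy])                       # discrete gradient operator

rng = np.random.default_rng(0)
F = rng.standard_normal(G.shape[0])           # generic field: almost surely not a gradient

# Least-squares minimiser of |G u - F|^2  (the energy E(u) above, discretised)
u_ls, *_ = np.linalg.lstsq(G, F, rcond=None)

# Normal equations GᵀG u = Gᵀ F: GᵀG is (minus) the discrete Laplacian,
# and Gᵀ applied to a field is (minus) the discrete divergence -> discrete Poisson equation.
u_poisson, *_ = np.linalg.lstsq(G.T @ G, G.T @ F, rcond=None)   # singular (constants), so lstsq

# The two agree up to an additive constant:
print(np.allclose(u_ls - u_ls.mean(), u_poisson - u_poisson.mean()))
```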
Knowing matrix derivatives well is one of those skills that were essential in machine learning 10 years ago. Not so much anymore with the dominance of massive neural networks.
I wouldn't completely discount it. For example, I wanted to speed up a certain loss function by rewriting it in Triton, but that meant manually coding the backward pass in Triton, and for that I needed to know how to calculate the derivative. Then again, I guess with PyTorch 2.0's compile function it might not be necessary.
It's much more relevant now because of the massive neural networks, specifically the need to compress them. Most of the state-of-the-art model compression methods (quantization or pruning) rely on second-order gradients, so you need to know how to compute, invert, and approximate the Hessian matrix (the matrix of all second-order partial derivatives of the loss function wrt the model parameters).
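As a toy illustration (my own example, not tied to any particular compression method): here is the Hessian of a small least-squares loss wrt its three parameters, estimated by finite differences and checked against the closed form AᵀA.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
w0 = rng.standard_normal(3)

# Toy "model": loss(w) = 0.5 * |A w - b|^2, whose Hessian wrt w is exactly AᵀA.
loss = lambda w: 0.5 * np.sum((A @ w - b) ** 2)

# Hessian by central finite differences (fine for 3 parameters, hopeless for millions,
# which is why the compression literature is about approximating this matrix).
eps = 1e-4
n = len(w0)
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = eps
        ej = np.zeros(n); ej[j] = eps
        H[i, j] = (loss(w0 + ei + ej) - loss(w0 + ei - ej)
                   - loss(w0 - ei + ej) + loss(w0 - ei - ej)) / (4 * eps ** 2)

print(np.allclose(H, A.T @ A, atol=1e-4))   # matches the closed-form Hessian
```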
The term "matrix derivative" is a bit loaded - you can either mean the derivative of functions with matrix arguments, or functions with vector arguments that have some matrix multiplication terms. Either way, I don't really understand what the confusion is about - if you slightly modify the definition of a derivative to be directional (e.g. lim h->0 (f(X + hA) - f(X))/h) then all of this stuff looks the same (vector derivatives, matrix derivatives and so forth). Taking this perspective was very useful during my PhD where I had to work with analytic operator valued functions.
There’s a little more to it than that. Matrices have several properties like being symmetric or unitary (maybe even diagonal) that scalars don’t. Those enable rewritings to make systems more stable or computationally efficient.
That bears no relation to the symbolic differentiation in the OP though.
Even plain old multiplication and division, and even addition and subtraction have stability and efficiency problems on floats, which don't appear in symbolic solvers.
How do you enter the matrix transpose in this tool?
In an example they use X' for transpose and it works.
But if I try A' * A, it either becomes just A * A, or it shows nothing and says " This 4th order tensor cannot be displayed as a matrix. See the documentation section for more details."
The documentation also doesn't show how to enter the transpose operator.
I have an annoyance reflex when people talk about things like "the derivative of a matrix". A matrix is a notational concept, not an object in itself. It makes as much sense to me as saying the derivative of a set, or the derivative of an array (i.e. as opposed to a vector).
It should be derivatives "with" matrices, not "of", in my mind.
Not that it matters in practice, but ... if there's one field where precision of language matters, it should be mathematics. So it bothers me.
A derivative is a derivative of a function (or a function like object, say a functional). The phrase "the derivative of a matrix" keeps it unclear whether "the matrix" is the function, or whether "the matrix" is a variable, an element in the domain of the function.
Bereft of that information, the phrase is as meaningful as "the derivative of 4.2"
The complete statement is: derivative of a matrix expression or derivative of an expression involving matrices. But this is usually clear from the context.
Yes; like I said, it's a reflex. I know that it's abuse of terminology and I shouldn't overthink it, but, there's always a bit of a nag in my head when I hear it.
I have a similar annoyance with the notation for probability. I think stuff like `p(a)` is a historical mistake. We would have been better served with a notation that would make clear the distinction between a domain as an entity in itself, the set of possible values in the domain, and the action/event of selecting one or more values from that domain.
But oh well. The naming problem is one of the hardest problems in science after all :)
No, it's a collection of vectors (which may be considered ordered or unordered depending on context). The 'vectors' part is fine, but the 'collection' part is the bit that causes the confusion. You can't differentiate a collection (even of vectors), any more than you can differentiate a bag of coloured balls. What's the derivative of 'blue'?
But if that collection is plugged into an expression, so as to denote the collection of expressions that result when the individual element vectors are plugged in, and those resulting expressions are differentiable, then you can differentiate each of them, collect the results back in the same order, and represent that as a matrix. That's fine.
This is what happens when we talk about "matrix differentiation": we mean "differentiation with matrices". But the matrix itself is not a differentiable object. Only the expressions upon which it confers its notational semantics have a derivative.
No. It is a mathematical fact that matrices fulfill the conditions of forming a vector space (subject to reasonable conditions like having elements drawn from a field). Matrices are just as differentiable as (other) vectors!
I think you're arguing that a matrix with constant real entries isn't differentiable with the standard derivative. This is true of course, but hasn't anything to do with vectors.
If you want to generalize it even further you get tensor calculus. An example: each point in a field is a 3x3 matrix with the special property of being symmetric positive-definite. A diffusion tensor MRI image has one at each voxel, whereas a regular MRI image might have a scalar value (grayscale). The principal eigenvector might represent the direction of a muscle fibre for that voxel.
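As a made-up illustration (a random symmetric positive-definite tensor, not real DTI data), extracting that principal direction is just a symmetric eigendecomposition per voxel:

```python
import numpy as np

# Hypothetical diffusion tensor for a single voxel: a symmetric positive-definite
# 3x3 matrix, constructed here as B Bᵀ + εI just to guarantee those properties.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
D = B @ B.T + 1e-3 * np.eye(3)

# eigh is the symmetric eigensolver; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(D)

principal_direction = eigvecs[:, -1]   # eigenvector of the largest eigenvalue
print(eigvals)
print(principal_direction)
```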
I used matrix derivatives in grad school a lot (my dissertation was on nonconvex optimization) and I'm not sure I've ever needed an entire textbook. Matrix calculus is just plain calculus with a few extra rules to generalize it to matrices/vectors.
(to be effective, you do need to know the rules of matrix algebra however)
I am a big fan of "Matrix Differential Calculus with Applications in Statistics and Econometrics" by Jan R. Magnus and Heinz Neudecker. They are very thorough and rigorous with their exposition and proofs, which demands a degree of mathematical maturity from their readers. If you are comfortable with multivariate calculus (at the level of mathematical analysis for some material) and linear algebra, then this book will teach you matrix calculus well.
On the other hand, if you are already familiar with calculus and linear algebra, then most of the material is just straightforward derivations using theorems you already know, so you might not need a textbook. I liked this textbook because it took the mystery out of formulas I had been taught without ever being shown the derivations, and because it was actually a quick read.
There are some limitations though. The book stops at Hessian matrices and matrix derivatives of vectors or thereabouts, because the chain rule starts breaking down and one needs multilinear algebra for higher-order derivatives. However, for deep learning you don't need to worry about that. The book will teach you enough to derive the gradient of a convolution when it is realized as a matrix product.
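A rough sketch of what that means (my own minimal 1D example using 'valid' correlation, not any particular library's convention): once the convolution is written as a Toeplitz-matrix product, both gradients fall out of ordinary matrix calculus.

```python
import numpy as np

def conv_matrix(w, n):
    """Toeplitz-like matrix C so that C @ x is the 'valid' correlation of x with w."""
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        C[i, i:i + k] = w
    return C

rng = np.random.default_rng(0)
x = rng.standard_normal(8)          # input signal
w = rng.standard_normal(3)          # filter
g = rng.standard_normal(6)          # upstream gradient dL/dy

C = conv_matrix(w, len(x))
y = C @ x                           # forward pass as a plain matrix product

# Ordinary matrix calculus then gives the backward pass:
dx = C.T @ g                                                # dL/dx = Cᵀ g
X = np.stack([x[i:i + len(y)] for i in range(len(w))])      # shifted copies of x
dw = X @ g                                                  # dL/dw

# Finite-difference spot check on dL/dw[0]
eps = 1e-6
w2 = w.copy(); w2[0] += eps
print(((conv_matrix(w2, len(x)) @ x - y) @ g) / eps, dw[0])   # should agree
```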
[1]: https://papers.nips.cc/paper/2018/file/0a1bf96b7165e962e90cb...
[2]: https://www.youtube.com/watch?v=IbTRRlPZwgc
[3]: https://compcalc.github.io/public/laue/tensor_derivatives.pd...
[4]: https://arxiv.org/pdf/1904.02990.pdf