Sören Laue, creator and first author of this work [1], gave a nice talk [2,3] about it at our reading group not long ago. He has also written about how automatic and symbolic differentiation are essentially equivalent. [4]
I hope that something like his "generic tensor multiplication operator" [1], a slightly adapted Einstein notation, becomes the standard notation in the future. The traditional linear algebra notation with its ugly transpose formalism is nothing I would ever want to use.
I fully agree with you on this one. It took us ages to figure out that linear algebra notation is not the right way. Using tensor notation and einsum, tensor and matrix derivatives become almost trivial. And there is no need to use ugly transposes, Kronecker symbols, etc. That's why we wrote this article and created the website. (Sören Laue, creator of this website and my personal biased view)
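For a taste of what that looks like in practice, here's a throwaway NumPy sketch (my own toy example, nothing from the paper): for f(X) = aᵀXb, index notation gives ∂f/∂X_kl = a_k b_l directly, and you can read that off the einsum strings.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = rng.standard_normal((3, 4))

# f(X) = a_i X_ij b_j in Einstein/einsum notation
f = lambda M: np.einsum('i,ij,j->', a, M, b)

# Index calculus reads the gradient straight off: d f / d X_kl = a_k b_l
grad = np.einsum('k,l->kl', a, b)       # i.e. the outer product of a and b

# Finite-difference spot check of one entry
eps = 1e-6
E = np.zeros_like(X); E[1, 2] = eps
print((f(X + E) - f(X)) / eps, grad[1, 2])   # should agree
```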
https://en.wikipedia.org/wiki/Tensor_calculus is a term you're much likelier to find as the subject of a book. Matrices are just specific special cases of tensors, after all, and mathematicians like to generalize.
A good book on differential geometry will also probably start with an overview of tensor calculus.
Hubbard and Hubbard's Vector Calculus is a great introduction to multivariable calculus and linear algebra that ends with an introduction to differential forms. And it has a complete solutions manual you can purchase as well.
Example where it comes in handy: OLS (ordinary least squares).
You want to solve Ax = b approximately. So, minimise the two-norm |Ax-b|, or equivalently, |Ax-b|^2, or equivalently (Ax-b)ᵀ(Ax-b) = xᵀAᵀAx - 2xᵀAᵀb + bᵀb.
How to minimise it? Easy, take the derivative wrt the vector x and set to zero (the zero vector):
2AᵀAx - 2Aᵀb = 0, so x = (AᵀA)⁻¹ Aᵀb.
(Note: that's the mathematical formulation of the solution, not how you'd actually compute it.)
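For what it's worth, a quick NumPy sketch with made-up data, showing that the formula above and a proper least-squares routine agree (on a small, well-conditioned problem):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))    # tall matrix: over-determined system
b = rng.standard_normal(20)

# Closed form from the derivation above: x = (AᵀA)⁻¹ Aᵀ b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# How you'd actually compute it: a least-squares routine (QR/SVD under the hood)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True
```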
Or, equivalently, you can do the "engineer's trick": To solve an over-determined system Ax=b, where A is a rectangular matrix, you multiply both sides by Aᵀ. Thus you obtain a new system whose matrix AᵀA is square, and you can solve normally.
Sure, it's mostly a mnemonic. But you can do it easily in your head, even in infinite dimensions.
For example, if you recall that (due to Stokes' theorem) the divergence is minus the transpose of the gradient, you can solve many variational problems that way.
The canonical example is: find a function u whose gradient is a given vector field F. This is an over-determined system without a solution. Applying the above trick, you take the divergence of both sides to obtain the Poisson equation Δu = div(F), which you can solve. This is equivalent to finding the least-squares minimizer of the energy E(u) = ∫|∇u - F|²
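If you want to see that numerically, here's a rough sketch (my own tiny dense discretisation, purely illustrative): the least-squares solution of the discrete ∇u ≈ F matches the solution of the normal equations GᵀG u = GᵀF, i.e. the discrete Poisson equation.

```python
import numpy as np

def forward_diff(n):
    """(n-1) x n forward-difference matrix: (D v)_i = v_{i+1} - v_i."""
    D = np.zeros((n - 1, n))
    D[np.arange(n - 1), np.arange(n - 1)] = -1.0
    D[np.arange(n - 1), np.arange(1, n)] = 1.0
    return D

m = n = 12                                    # tiny m x n grid, dense matrices for clarity
Dx = np.kron(np.eye(n), forward_diff(m))      # differences along one grid direction
Dy = np.kron(forward_diff(n), np.eye(m))      # differences along the other
G = np.vstack([Dx, Dy])                       # discrete gradient operator

rng = np.random.default_rng(0)
F = rng.standard_normal(G.shape[0])           # generic field: almost surely not a gradient

# Least-squares minimiser of |G u - F|^2  (the energy E(u) above, discretised)
u_ls, *_ = np.linalg.lstsq(G, F, rcond=None)

# Normal equations GᵀG u = Gᵀ F: GᵀG is (minus) the discrete Laplacian,
# and Gᵀ applied to a field is (minus) the discrete divergence -> discrete Poisson equation.
u_poisson, *_ = np.linalg.lstsq(G.T @ G, G.T @ F, rcond=None)   # singular (constants), so lstsq

# The two agree up to an additive constant:
print(np.allclose(u_ls - u_ls.mean(), u_poisson - u_poisson.mean()))
```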
Knowing matrix derivatives well is one of those skills that were essential in machine learning 10 years ago. Not so much anymore with the dominance of massive neural networks.
I wouldn't completely discount it. For example, I wanted to speed up a certain loss function by rewriting it in Triton, but that meant manually coding the backward pass in Triton, and for that I needed to know how to calculate the derivative. Then again, I guess with PyTorch 2.0's compile function it might not be necessary.
It's much more relevant now because of the massive neural networks, specifically the need to compress them. Most of the state-of-the-art model compression methods (quantization or pruning) rely on second-order gradients, so you need to know how to compute, invert, and approximate the Hessian matrix (the matrix of all second-order partial derivatives of the loss function wrt the model parameters).
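As a toy illustration (my own example, not tied to any particular compression method): here is the Hessian of a small least-squares loss wrt its three parameters, estimated by finite differences and checked against the closed form AᵀA.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
w0 = rng.standard_normal(3)

# Toy "model": loss(w) = 0.5 * |A w - b|^2, whose Hessian wrt w is exactly AᵀA.
loss = lambda w: 0.5 * np.sum((A @ w - b) ** 2)

# Hessian by central finite differences (fine for 3 parameters, hopeless for millions,
# which is why the compression literature is about approximating this matrix).
eps = 1e-4
n = len(w0)
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = eps
        ej = np.zeros(n); ej[j] = eps
        H[i, j] = (loss(w0 + ei + ej) - loss(w0 + ei - ej)
                   - loss(w0 - ei + ej) + loss(w0 - ei - ej)) / (4 * eps ** 2)

print(np.allclose(H, A.T @ A, atol=1e-4))   # matches the closed-form Hessian
```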
The term "matrix derivative" is a bit loaded - you can either mean the derivative of functions with matrix arguments, or functions with vector arguments that have some matrix multiplication terms. Either way, I don't really understand what the confusion is about - if you slightly modify the definition of a derivative to be directional (e.g. lim h->0 (f(X + hA) - f(X))/h) then all of this stuff looks the same (vector derivatives, matrix derivatives and so forth). Taking this perspective was very useful during my PhD where I had to work with analytic operator valued functions.
There’s a little more to it than that. Matrices have several properties like being symmetric or unitary (maybe even diagonal) that scalars don’t. Those enable rewritings to make systems more stable or computationally efficient.
That bears no relation to the symbolic differentiation in the OP though.
Even plain old multiplication and division, and even addition and subtraction have stability and efficiency problems on floats, which don't appear in symbolic solvers.
How do you enter the matrix transpose in this tool?
In an example they use X' for transpose and it works.
But if I try A' * A, it either becomes just A * A, or it shows nothing and says " This 4th order tensor cannot be displayed as a matrix. See the documentation section for more details."
The documentation also doesn't show how to enter the transpose operator.
I have an annoyance reflex when people talk about things like "the derivative of a matrix". A matrix is a notational concept, not an object in itself. It makes as much sense to me as saying the derivative of a set, or the derivative of an array (i.e. as opposed to a vector).
It should be derivatives "with" matrices, not "of", in my mind.
Not that it matters in practice, but ... if there's one field where precision of language matters, it should be mathematics. So it bothers me.
A derivative is a derivative of a function (or a function like object, say a functional). The phrase "the derivative of a matrix" keeps it unclear whether "the matrix" is the function, or whether "the matrix" is a variable, an element in the domain of the function.
Bereft of that information, the phrase is as meaningful as "the derivative of 4.2"
The complete statement is: derivative of a matrix expression or derivative of an expression involving matrices. But this is usually clear from the context.
Yes; like I said, it's a reflex. I know that it's abuse of terminology and I shouldn't overthink it, but, there's always a bit of a nag in my head when I hear it.
I have a similar annoyance with the notation for probability. I think stuff like `p(a)` is a historical mistake. We would have been better served with a notation that would make clear the distinction between a domain as an entity in itself, the set of possible values in the domain, and the action/event of selecting one or more values from that domain.
But oh well. The naming problem is one of the hardest problems in science after all :)
No, it's a collection of vectors (which may be considered ordered or unordered depending on context). The 'vectors' part is fine, but the 'collection' part is the bit that causes the confusion. You can't differentiate a collection (even of vectors), any more than you can differentiate a bag of coloured balls. What's the derivative of 'blue'?
But if that collection is plugged into an expression, so as to denote the collection of expressions that result when the individual element vectors are plugged in, and those resulting expressions are differentiable, then you can differentiate each of them, collect the results back in the same order, and represent that as a matrix. That's fine.
This is what happens when we talk about "matrix differentiation": we mean "differentiation with matrices". But the matrix itself is not a differentiable object. Only the expressions upon which it confers its notational semantics have a derivative.
No. It is a mathematical fact that matrices fulfill the conditions of forming a vector space (subject to reasonable conditions like having elements drawn from a field). Matrices are just as differentiable as (other) vectors!
I think you're arguing that a matrix with constant real entries isn't differentiable with the standard derivative. This is true of course, but hasn't anything to do with vectors.
If you want to generalize it even further you get tensor calculus. An example: each point in a field is a 3x3 matrix with the special property of being symmetric positive-definite. A diffusion tensor MRI image has one at each voxel, whereas a regular MRI image might have a scalar value (grayscale). The principal eigenvector might represent the direction of a muscle fibre for that voxel.
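As a made-up illustration (a random symmetric positive-definite tensor, not real DTI data), extracting that principal direction is just a symmetric eigendecomposition per voxel:

```python
import numpy as np

# Hypothetical diffusion tensor for a single voxel: a symmetric positive-definite
# 3x3 matrix, constructed here as B Bᵀ + εI just to guarantee those properties.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
D = B @ B.T + 1e-3 * np.eye(3)

# eigh is the symmetric eigensolver; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(D)

principal_direction = eigvecs[:, -1]   # eigenvector of the largest eigenvalue
print(eigvals)
print(principal_direction)
```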
I used matrix derivatives in grad school a lot (my dissertation was on nonconvex optimization) and I'm not sure I've ever needed an entire textbook. Matrix calculus is just plain calculus with a few extra rules to generalize it to matrices/vectors.
(to be effective, you do need to know the rules of matrix algebra however)
I am a big fan of "Matrix Differential Calculus with Applications in Statistics and Econometrics" by Jan R. Magnus and Heinz Neudecker. They are very thorough and rigorous with their exposition and proofs, which demands a degree of mathematical maturity from their readers. If you are comfortable with multivariate calculus (at the level of mathematical analysis for some material) and linear algebra, then this book will teach you matrix calculus well.
On the other hand, if you are already familiar with calculus and linear algebra, then most of the material is just straightforward derivations using theorems you already know, so you might not need a textbook. I liked this textbook because it took the mystery out of formulas I had been taught without ever being shown the derivations, and because it was actually a quick read.
There are some limitations though. The book stops at Hessian matrices and matrix derivatives of vectors or thereabouts, because the chain rule starts breaking down and one needs multilinear algebra for higher-order derivatives. However, for deep learning you don't need to worry about that. The book will teach you enough to derive the gradient of a convolution when it is realized as a matrix product.
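A rough sketch of what that means (my own minimal 1D example using 'valid' correlation, not any particular library's convention): once the convolution is written as a Toeplitz-matrix product, both gradients fall out of ordinary matrix calculus.

```python
import numpy as np

def conv_matrix(w, n):
    """Toeplitz-like matrix C so that C @ x is the 'valid' correlation of x with w."""
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        C[i, i:i + k] = w
    return C

rng = np.random.default_rng(0)
x = rng.standard_normal(8)          # input signal
w = rng.standard_normal(3)          # filter
g = rng.standard_normal(6)          # upstream gradient dL/dy

C = conv_matrix(w, len(x))
y = C @ x                           # forward pass as a plain matrix product

# Ordinary matrix calculus then gives the backward pass:
dx = C.T @ g                                                # dL/dx = Cᵀ g
X = np.stack([x[i:i + len(y)] for i in range(len(w))])      # shifted copies of x
dw = X @ g                                                  # dL/dw

# Finite-difference spot check on dL/dw[0]
eps = 1e-6
w2 = w.copy(); w2[0] += eps
print(((conv_matrix(w2, len(x)) @ x - y) @ g) / eps, dw[0])   # should agree
```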
[1]: https://papers.nips.cc/paper/2018/file/0a1bf96b7165e962e90cb...
[2]: https://www.youtube.com/watch?v=IbTRRlPZwgc
[3]: https://compcalc.github.io/public/laue/tensor_derivatives.pd...
[4]: https://arxiv.org/pdf/1904.02990.pdf