The big six matrix factorizations (nhigham.com)
268 points by jjgreen on May 18, 2022 | 78 comments


Spectral decomposition is pretty cool.

Application 1 - spectral clustering - an alternative to k-means for nonlinearly shaped clusters. Build an affinity (similarity) matrix from your data, form its graph Laplacian, take the spectral decomposition, and run k-means on the top k eigenvectors; those are your clusters.

Application 2 - graph clustering - (run spectral clustering on the adjacency matrix!)

There are some tricks to getting it to work in practice, like normalizing, but it's a simple and powerful method. Also, the matrices can get big, so it helps a lot to use sparse matrix libraries for the computations.

[1] https://towardsdatascience.com/spectral-clustering-aba2640c0....

[2] https://www.hindawi.com/journals/ddns/2020/4540302/
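
A rough sketch of that recipe in Python, for the curious. This is illustration only: the function and parameter names are made up, it assumes an RBF affinity and the symmetric normalized Laplacian, and scikit-learn's SpectralClustering does the same job with more care (and with sparse solvers for big matrices).

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        # Affinity (similarity) matrix from pairwise squared distances
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq_dists / (2 * sigma ** 2))
        np.fill_diagonal(W, 0.0)

        # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

        # Eigenvectors for the k smallest eigenvalues, rows normalized
        _, U = eigh(L, subset_by_index=[0, k - 1])
        U = U / np.linalg.norm(U, axis=1, keepdims=True)

        # Plain k-means on the spectral embedding gives the cluster labels
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)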


I've never given a second thought to the etymology of "spectral" in spectral decomposition. Somewhere in the back of my mind (and I guess many students of physics have the same notion) I subconsciously assumed it originates from the eigenvalues of the Hamiltonian determining the atomic spectral lines. But I've never followed up on it and actually looked it up.


I might be wrong about the exact historical reason.

But the way I see it, the "spectral decomposition of A" is a way to express A as a sum of orthogonal rank-1 operators: A = \sum_i l_i u_i u_i^T. The l_i are the eigenvalues; the u_i are the eigenvectors.

The eigenvectors look a whole lot like the "modes" in a Fourier decomposition. And if you plot (i, l_i), the eigenvalues are a bit like the "spectrum" (the amplitude of each mode).

In fact, the complex exponentials (the modes in the Fourier decomposition) are also eigenvectors of a specific operator (the Laplacian).
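
A quick numerical check of that rank-1 sum, as a throwaway numpy sketch (the matrix is just a random symmetric example):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    A = (B + B.T) / 2                       # symmetric, so eigenvectors are orthonormal

    l, U = np.linalg.eigh(A)                # eigenvalues l_i, eigenvectors as columns of U
    A_rebuilt = sum(l[i] * np.outer(U[:, i], U[:, i]) for i in range(len(l)))
    print(np.allclose(A, A_rebuilt))        # True: A really is the sum of rank-1 pieces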

Math people are good at finding connections between things.


The spectrum of the matrix A is also closely related to the frequencies at which the ordinary differential equation xdot = Ax oscillates!
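
A tiny numpy illustration of that connection (the matrix is chosen purely for convenience): the imaginary parts of the eigenvalues of A are the angular frequencies of the oscillating modes of xdot = Ax.

    import numpy as np

    omega = 2.0
    A = np.array([[0.0, omega],
                  [-omega, 0.0]])           # harmonic oscillator written in first-order form
    eigvals = np.linalg.eigvals(A)
    print(eigvals)                          # approximately [0+2j, 0-2j]
    print(np.abs(eigvals.imag))             # angular frequency omega = 2.0 for both modes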


According to the almighty Wikipedia, the connection is correct but turned out to be an accident: David Hilbert, who coined the term "spectral theory", was surprised when it was found to be applicable to quantum mechanical spectra.


If a spectrum is just a range of numbers, then the idea of a spectrum should apply to many phenomena, since many things can be described by a number. So when the matrix spectrum and the atomic spectrum were formulated, perhaps "spectrum" was a word in vogue for describing a quantity. I think that's why the same term is used for both: many connected phenomena can be quantified. I know only a little about matrix spectra and atomic spectra, so take my thought with a grain of salt.


I love that on this website I can find comments that I simply don't understand a word of. Good on you for doing the stuff you do sir.


I love Dr. Higham’s blog. In case anyone missed it, the whole series is on GitHub: https://github.com/higham/what-is


Any suggestions on what to learn in Linear Algebra after Gilbert Strang’s 18.06SC?

https://ocw.mit.edu/courses/18-06sc-linear-algebra-fall-2011...

My goal is to learn the math behind machine learning.


Here are some links from my favourite reference website:

Book: Mathematics for Machine Learning (mml-book.github.io) https://news.ycombinator.com/item?id=16750789

Mathematics for Machine Learning [pdf] (mml-book.com) https://news.ycombinator.com/item?id=21293132

Mathematics of Machine Learning (2016) (ibm.com) https://news.ycombinator.com/item?id=15146746


If you want a beautiful abstract perspective on linear algebra to complement Strang's more down-to-earth, matrix- and linear equation-oriented lectures, pick up Axler's Linear Algebra Done Right.


He also released a lecture series recently (Axler himself!) which barely anyone seems to be talking about: https://www.youtube.com/playlist?list=PLGAnmvB9m7zOBVCZBUUmS...


These lectures are fantastic after you've mastered the basics of linear algebra https://www.youtube.com/watch?v=McLq1hEq3UY (convex optimization, by a very experienced and often funny lecturer)


Stephen Boyd is a very good lecturer! I watched his videos on linear dynamical systems almost a decade ago and thought he did a fantastic job. Would highly recommend.


The problem sets for the Stanford linear dynamical systems course he taught are also fantastic. The level of difficulty and the breadth of applications leave you with so many tools.


The math used in ML papers is very diverse; there is too much to learn it all. It is easier if you pick a problem and learn the math for it. Find a group that is already working on that problem and ask them for the best way to learn the relevant math.


A strong foundation in linear algebra, multivariable calculus, probability, and statistics is going to be generally applicable almost no matter what problem you work on.


The vast majority of the mathematics required to really understand ML is just probability, calculus and basic linear algebra. If you know these already and still struggle, it's because the notation is often terse and assumes a specific context. The only answer to that is to keep reading key papers and working through the math, ideally in code as well.

For most current-gen deep learning there's not even that much math; it's mostly a growing library of what are basically engineering tricks. For example, an LSTM is almost exclusively very basic linear algebra with some calculus used to optimize it. But by its nature the calculus can't be done by hand, and the actual implementation of all that basic linear algebra is tricky and takes practice.
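
To make that concrete, here is a single LSTM step in plain numpy; it really is a handful of matrix-vector products plus elementwise nonlinearities. The weight names, shapes and gate ordering are illustrative, not any particular framework's convention.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W: (4*hidden, n_in), U: (4*hidden, hidden), b: (4*hidden,)
        z = W @ x + U @ h_prev + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)    # input, forget, output gates
        g = np.tanh(g)                                  # candidate cell state
        c = f * c_prev + i * g                          # new cell state
        h = o * np.tanh(c)                              # new hidden state
        return h, c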

You'll learn more by implementing things from scratch based on the math than you will by trying to read through all the background material hoping that one day it will all make sense. It only ever makes sense through implementation and continuous reading and practice.


Amen!! This is the best way to learn anything technical -- put things into practice to understand the theory. It's also important to keep revisiting the theory to understand results, rather than parroting some catchphrase to "explain" results.


You might enjoy these notes: http://mlg.eng.cam.ac.uk/teaching/4f13/2122/ They give (I think) a good general overview, while also going a little bit more in-depth in a few areas (e.g., Gaussian Processes).


You could take Strang’s follow-up course on learning from data.

https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-an...


I don't know if 18.065 was worth the time; it felt like a repeat of 18.06 with hardly anything new (and not nearly enough context around actual applications).


I am a big fan of Murphy's ML book [1].

[1] https://smile.amazon.com/Probabilistic-Machine-Learning-Intr...

It covers almost all the math you'd need to start doing ML research. I find it to be the ideal 'one book to rule them all' for CS people in ML, although pure math grads in ML might find that it doesn't go deep enough.


Now apply them to the data used to identify the "Big Five" personality traits! This is an interesting application of factor analysis/matrix factorization:

https://en.wikipedia.org/wiki/Big_Five_personality_traits


Yes, the Big Five were found by factor analysis, usually PCA, which is basically an eigenvalue (spectral) decomposition of the (symmetric positive semidefinite) covariance matrix of the data; that in turn coincides with the SVD of the centred data matrix, whose right singular vectors are the eigenvectors and whose squared singular values are proportional to the eigenvalues.
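
A quick numpy check of that equivalence (a sketch; it assumes the data matrix has been column-centred):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
    Xc = X - X.mean(axis=0)                 # centre the columns

    # Route 1: eigendecomposition of the covariance matrix
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

    # Route 2: SVD of the centred data matrix
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Same variances along the principal directions
    print(np.allclose(np.sort(evals), np.sort(s**2 / (len(Xc) - 1))))   # True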

The SVD is everywhere. IIRC, the first Netflix Prize was won using the SVD:

https://en.wikipedia.org/wiki/Netflix_Prize

https://cseweb.ucsd.edu/classes/fa17/cse291-b/reading/Progre...



Where's the data? It could indeed be interesting.


Thank you for this wonderfully concise summary: it’s convenient to have all this in one compact document.

I suppose “flops” means “floating-point operations” here? Heretofore I’ve always encountered this as an abbreviation for “floating-point operations per second”.


When I used Matlab as an undergrad in the late '80s, it reported the flop count after each command, and those were floating-point operations, no time units involved. I tried to find an image of the old command line but found this historical note first [1] and thought it would be of interest to readers of the article.

[1] http://www.stat.uchicago.edu/~lekheng/courses/309f14/flops/v... and


Indeed, the meaning of “flops” is ambiguous, but that seems hard to avoid. Fortunately, the ambiguity is easily resolved from context in most cases, as it is here.


It's always been very ambiguous in practice. To avoid ambiguity I usually use `flop/s` but not everyone likes that :)


Speaking of SVD doctors, I heard many years ago (from Alan Edelman) that Gene Golub's license plate used to be "DR SVD". Later he switched to "PROF SVD".

I couldn't confirm the DR SVD part, but the PROF SVD story appears to be real: https://www.mathworks.com/company/newsletters/articles/profe...


As I recall, DR SVD was the license plate on his second car. He once left the car at SFO airport, and when he heard I was flying in, he asked me to drive it to his house (keys under the seat!). What he didn't tell me was that it was a stick shift, and only once before had a friend taught me how that works. After driving around San Francisco for Gene I got better at it, but it was definitely scary at times. :-)


I heard something very similar from some of his very close (ex-)colleagues, so it appears to be true :)


> To avoid ambiguity I usually use `flop/s` but not everyone likes that :)

Flop/s makes no sense and is outright wrong. The whole point of flops is to express how many floating point operations are required by an algorithm. How many operations are performed per second is a property of the hardware you're using to run an implementation of the algorithm.

You want to express computational complexity in terms of floating point operations.


Well, yeah, I use it in the context of floating-point operations per second. That's the most common use in my field. I was replying to the parent comment. No need to use this kind of tone.


Maybe the_svd_doctor uses flop/s to describe hardware, like you said


The number of operations per second varies by at least an order of magnitude on the same hardware depending on the GEMM algorithm you use (reference or Goto), and you quote the performance of the implementation in terms of FLOP/s knowing the number of FLOPs required by the computation. That makes sense to people who implement and measure these things.


See this playlist on semidefinite programming: https://www.youtube.com/playlist?list=PLqwozWPBo-FtOsjrla4jk... for the surprising way factorization can be used to partially solve an NP-hard problem.


The article presents a note on the six well-known matrix decompositions. He states that all of them have cubic complexity, but practical algorithms with better exponents exist for all of them.


They're all basically O(M(n)), where M(n) is your matrix multiplication time. Even though M(n) <= n^2.37..., it's reasonable to say it's n^3 because, in practice, no one uses the sub-cubic algorithms. Strassen is possibly workable, but it isn't widely used, and all of the sub-cubic algorithms have accuracy tradeoffs.
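
For reference, here is what a toy Strassen recursion looks like in numpy (purely illustrative: it assumes n is a power of two and ignores the accuracy and cache issues that keep it out of most production BLAS libraries):

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:
            return A @ B                    # fall back to ordinary GEMM for small blocks
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

        # 7 recursive multiplications instead of 8
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)

        C = np.empty_like(A)
        C[:m, :m] = M1 + M4 - M5 + M7
        C[:m, m:] = M3 + M5
        C[m:, :m] = M2 + M4
        C[m:, m:] = M1 - M2 + M3 + M6
        return C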


The fact that they're all cubic isn't really the notable part of the runtime of computing the different decompositions, because the constants involved are really different. In practice, a common reason for computing many of these decompositions is to solve a linear system Ax = b, because with the decomposition in hand it is really easy to solve the whole system (using e.g. back substitution). For instance, with C++'s Eigen, look at the 100x100 column of [1]: there are orders-of-magnitude differences between the fast and slow approaches. They're all still cubic, sure, but we're talking a 168x difference here.

(Of course, it's not so clear cut, since robustness varies, not all methods are applicable, and the benchmark is for solving the system rather than computing the decomposition; but overall, knowing which decompositions are fast and which are not is absolutely crucial for practitioners.)

[1]: https://eigen.tuxfamily.org/dox/group__DenseDecompositionBen...
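
For concreteness, here is roughly what that "factor once, then solve cheaply" pattern looks like with SciPy's LU routines (a sketch; the other decompositions slot into the same shape):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(2)
    A = rng.standard_normal((100, 100))
    lu, piv = lu_factor(A)                  # the O(n^3) factorization, done once

    for _ in range(10):                     # each solve is only O(n^2) substitution
        b = rng.standard_normal(100)
        x = lu_solve((lu, piv), b)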


> but practical algorithms with better exponents exist for all of them.

I'm aware of randomized algorithms with better complexity, which come at the cost of only giving approximate results (though the approximation may be perfectly good for practical purposes). See e.g. [1]. Are there other approaches?

[1] https://doi.org/10.1137/090771806


If the data is sparse (which is not uncommon for large matrices in the real world), you can exploit the sparsity to do significantly better than O(n^3).
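
For instance, with SciPy's sparse machinery (a sketch; how much you gain depends entirely on the sparsity pattern and the ordering the solver picks):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 10_000
    # Tridiagonal (1-D Laplacian-like) system: about 3n nonzeros instead of n^2 entries
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
    b = np.ones(n)
    x = spla.spsolve(A, b)                  # sparse LU under the hood, far cheaper than dense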


Could you link a practical algorithm with an exponent lower than 3? (I think of these things https://en.wikipedia.org/wiki/Computational_complexity_of_ma... as not being practical, but I'd love to be wrong.)


For something Strassen-ish you could look at https://jianyuhuang.com/papers/sc16.pdf and the GPU implementation https://apps.cs.utexas.edu/apps/sites/default/files/tech_rep...


That's matrix-matrix multiplication. Nobody disputes that Strassen etc. have sub-cubic complexity. What about one of the six decompositions mentioned, as GP claimed?


I responded to a post about the practicality of MM multiplication methods, though GEMM is quite fundamental.


https://www.fgemm.com, coming soon.


Can you provide a source?

For example, the Golub-Kahan-Reinsch SVD method generally involves bidiagonalisation of the matrix and then implicit QR steps, which are often achieved with Householder reflections and Givens rotations. Sure, all of those are conceptually matrix multiplications, but they're not implemented as O(N^3) matrix multiplications; rather, their special structure is exploited so that they're faster. Yet the entire algorithm is still cubic, so I'm not sure Strassen would accelerate things.
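
For illustration, here is the kind of Givens rotation those QR steps apply; it zeroes one chosen entry, and applying it only touches two rows (the helper name is made up, and it assumes (a, b) is not the zero vector):

    import numpy as np

    def givens(a, b):
        # Rotation [[c, s], [-s, c]] with c*a + s*b = r and -s*a + c*b = 0
        r = np.hypot(a, b)
        return a / r, b / r

    A = np.array([[3.0, 1.0],
                  [4.0, 2.0]])
    c, s = givens(A[0, 0], A[1, 0])
    G = np.array([[c, s],
                  [-s, c]])
    print(G @ A)                            # the (1, 0) entry is now zero; first column has norm 5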


Imo, a better way to present this is to draw a diagram where various decompositions are connected by arrows that explain how one decomposition can be turned into another.


That is fascinating. Do you have an example of this which you can point to?


This is a fantastically clear outline of this topic. Thank you!

> The terms "factorization" and "decomposition" are synonymous and it is a matter of convention which is used. Our list comprises three factorizations and three decompositions.

I can't tell if this is a joke: right after saying that these two words mean the same thing in this context, they are then used to categorize the methods.

Edit: This is the kind of content that proves the "dead internet theory" is wrong.


> I can't tell if this is a joke

It makes sense; the author is just using the original name of each method to bucket them, e.g. spectral decomposition.


It's not a categorisation, but some trivia regarding the name. For example: By convention, the SVD is called the SVD (singular value decomposition), not the SVF (singular value factorisation). That would be synonymous, but that's not what it's called. Similarly for the other 5 methods.


the words are used synonymously in practice, but i think there's an opportunity to distinguish them usefully.

suggestion:

- 'factoring' is a multiplicative breakdown, a 'composition', like prime factorization.

- 'decomposition' could also be called 'partition', and is an additive breakdown, like how 3 could be split into 2 + 1


From a theoretical perspective the most fundamental decomposition is the rank-decomposition:

A = X D Y

with X, Y invertible and D diagonal. It's called the rank decomposition because you can read off the rank of A by counting the non-zero entries in D. It's also useful for determining bases for the image and the kernel of A. Every math student learns a version of this in their first linear algebra course.

Curiously, the rank decomposition is not covered in the numerical literature. Also, it's not possible to derive a rank decomposition directly from the LU or QR decompositions, although the underlying algorithms (Gaussian elimination, Gram-Schmidt) could be used to do so.

It took me multiple weeks of work to understand what the problem with these algorithms is and what the practical options for establishing a rank decomposition are. Full details are available on my blog:

https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom...

TL;DR. Your options are (1) SVD, (2) QR factorization with column pivoting (part of LAPACK), (3) LDU factorization with total pivoting (not implemented in BLAS/LAPACK/etc.)
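
As a sketch of option (2), here is QR with column pivoting through SciPy's LAPACK bindings, with a numerical rank read off the diagonal of R. The tolerance is a modelling choice, not something the factorization hands you.

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(3)
    A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))   # rank 3 by construction

    Q, R, piv = qr(A, pivoting=True)        # A[:, piv] = Q @ R, with |R[0,0]| >= |R[1,1]| >= ...
    tol = np.abs(R[0, 0]) * 1e-10           # an arbitrary relative threshold
    rank = int(np.sum(np.abs(np.diag(R)) > tol))
    print(rank)                             # 3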


This decomposition is not used much because it is not unique in any sense, it is not numerically stable, and it is fairly uninteresting: the matrix D can always be taken to consist of a diagonal of 1s of length rank(A) and the rest zeros. It is much more interesting once X and Y are constrained to something like unitary matrices (in which case we get SVD).

This “rank decomposition” is a bit interesting algebraically rather than numerically, since then the numerical stability problems disappear. Also, if we take all matrices to be over the integers (and require that “invertible” means the inverse is also an integer matrix), then the rank decomposition is a lot like the Smith normal form of a matrix.


It's not unique, but it can be implemented in a way that is stable. E.g. SVD does this (but is quite overkill for the requirements).

It's highly relevant from an algebraic perspective, so it's curious that it's not covered (at all) in the numerical literature.


I could be misunderstanding, but I think you'd get all the algebraic content out of it by computing the dimensions of the image and kernel, then working with them from there on. Why would you want to have a matrix decomposition that separated them?

Granted, computing the dimension of the kernel is not so easy, especially because a pair of vectors can be arbitrarily close without being linearly dependent. No wonder there is no stable way to do it; it technically exceeds the capability of finite-precision numbers. Scale a vector with unequal components by most numbers, especially ones with non-terminating binary representations, and the result, once rounded back to finite precision, is linearly independent of the original.

Clearly then, linear independence on computers has to be considered in the continuous sense in which singular values reveal it, where a very small singular value represents an "almost-kernel," which is the closest thing to a kernel you are likely to find outside of carefully constructed examples or integer matrices.
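
A small numpy illustration of that "almost-kernel" point (the sizes and the perturbation are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))   # exactly rank 2
    A_noisy = A + 1e-10 * rng.standard_normal((5, 5))               # tiny perturbation

    s = np.linalg.svd(A_noisy, compute_uv=False)
    print(s)                                # two sizeable values, three values around 1e-10
    tol = 1e-8                              # the rank decision is now a tolerance choice
    print(int(np.sum(s > tol)))             # 2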


> I could be misunderstanding, but I think you'd get all the algebraic content out of it by computing the dimensions of the image and kernel, then working with them from there on. Why would you want to have a matrix decomposition that separated them?

Well. Sometimes you want to know a solution to a problem and not only the dimension of the solution space : )

Also, for composition: if you have "compatible" matrices B, C, how do you compute the restrictions A|_ker(B), A|_im(B), or the co-restrictions (factor projections) A/ker(C), A/im(C), etc.?


Numerical linear algebra is very different from linear algebra.


As it turns out, the author does cover the rank-decomposition (under the name "Rank-Revealing Factorization") in his blog as well:

https://nhigham.com/2021/05/19/what-is-a-rank-revealing-fact...


Higham there adds the condition that X, Y be well-conditioned (as they are in the SVD, for example). That's a problem with the Jordan decomposition A = X J X^-1: the X is non-singular in theory, but can be ill-conditioned, and thus the Jordan decomposition is not stable and not really used in practice (as highlighted in the original article).


Isn't that just Gaussian elimination?


What makes it the most fundamental?


It's basically the matrix form of dim(domain(f)) = dim(ker(f)) + dim(im(f)), or domain(f)/ker(f) ≅ im(f) with f (inducing) the isomorphism.


But the image and kernel are not the only two properties of a matrix.


Sure, but from an algebraic (categorical) perspective, taking kernels, cokernels and images is about the most fundamental operation you can think of.

The question is: how do you compute them in an efficient (avoiding a full SVD) and stable way?


Isn't the Jordan decomposition the most fundamental one for theory?


Higham is an outstanding researcher. One book I really enjoyed from him is https://epubs.siam.org/doi/book/10.1137/1.9780898717778. He has many many other excellent ones.


It is important to mention the role of randomized algorithms for approximate solutions. See this excellent and practical survey.

https://arxiv.org/abs/2002.01387


QR factorization leads to a lot of useful stuff: eigenvalues/eigenvectors, the SVD (singular value decomposition), PCA (principal component analysis).


QR iteration (which uses QR decomposition) truly is a remarkable and beautiful algorithm.

It was even named one of the top 10 algorithms of the 20th century: https://archive.siam.org/pdf/news/637.pdf
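
For anyone curious, a bare-bones unshifted QR iteration really is this short. Treat it as a sketch: production codes add Hessenberg reduction, shifts and deflation, and the plain version below only converges nicely when the eigenvalue magnitudes are well separated.

    import numpy as np

    def qr_iteration(A, steps=500):
        Ak = A.copy()
        for _ in range(steps):
            Q, R = np.linalg.qr(Ak)
            Ak = R @ Q                      # similarity transform, so the eigenvalues are preserved
        return np.diag(Ak)                  # the diagonal approaches the eigenvalues

    B = np.random.default_rng(5).standard_normal((4, 4))
    A = (B + B.T) / 2                       # symmetric, so the eigenvalues are real
    print(np.sort(qr_iteration(A)))
    print(np.sort(np.linalg.eigvalsh(A)))   # compare against LAPACK's answer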


Missing the LDLT decomposition.


It can't always be computed accurately; I suspect that's why it's not mentioned.

Bunch–Kaufman is the "right" factorization for indefinite Hermitian/symmetric matrices, but it's not as well known.


Boo LU!!! Go with stable QR!


But LU, when it works, takes fewer flops (only half as many as QR, though, IIRC) and doesn't require square roots. I think that's why it used to be preferred when compute was a very scarce resource. Clearly that's not really the case anymore today.

But otherwise I agree: QR is better on almost every front.



