The big six matrix factorizations (nhigham.com)
268 points by jjgreen on May 18, 2022 | 78 comments


Spectral decomposition is pretty cool.

Application 1 - spectral clustering - an alternative to k-means for nonlinearly shaped clusters. Build an affinity (similarity) matrix from your data, form its graph Laplacian, take the spectral decomposition, and run k-means on the top k eigenvectors; those are your clusters.

Application 2 - graph clustering - (run spectral clustering on the adjacency matrix!)

There are some tricks to getting it to work in practice, like normalizing, but it's a simple and powerful method. Also, the matrices can get big, so it helps a lot to use sparse matrix libraries for the computations.

[1] https://towardsdatascience.com/spectral-clustering-aba2640c0....

[2] https://www.hindawi.com/journals/ddns/2020/4540302/
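
A rough sketch of that recipe in Python, for the curious. This is illustration only: the function and parameter names are made up, it assumes an RBF affinity and the symmetric normalized Laplacian, and scikit-learn's SpectralClustering does the same job with more care (and with sparse solvers for big matrices).

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        # Affinity (similarity) matrix from pairwise squared distances
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq_dists / (2 * sigma ** 2))
        np.fill_diagonal(W, 0.0)

        # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

        # Eigenvectors for the k smallest eigenvalues, rows normalized
        _, U = eigh(L, subset_by_index=[0, k - 1])
        U = U / np.linalg.norm(U, axis=1, keepdims=True)

        # Plain k-means on the spectral embedding gives the cluster labels
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)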


I've never given a second thought to the etymology of "spectral" in spectral decomposition. Somewhere in the back of my mind (and I guess many students of physics have the same notion) I subconsciously assumed it originates from the eigenvalues of the Hamiltonian determining the atomic spectral lines. But I've never followed up on it and actually looked it up.


I might be wrong about the exact historical reason.

But the way I see it, the "spectral decomposition of A" is a way to express A as a sum of orthogonal rank-1 operators: A = \sum_i l_i u_i u_i^T. The l_i are the eigenvalues; the u_i are the eigenvectors.

The eigenvectors look a whole lot like the "modes" in a Fourier decomposition. And if you plot (i, l_i), the eigenvalues are a bit like the "spectrum" (the amplitude of each mode).

In fact, the complex exponentials (the modes in the Fourier decomposition) are also eigenvectors of a specific operator (the Laplacian).
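
A quick numerical check of that rank-1 sum, as a throwaway numpy sketch (the matrix is just a random symmetric example):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    A = (B + B.T) / 2                       # symmetric, so eigenvectors are orthonormal

    l, U = np.linalg.eigh(A)                # eigenvalues l_i, eigenvectors as columns of U
    A_rebuilt = sum(l[i] * np.outer(U[:, i], U[:, i]) for i in range(len(l)))
    print(np.allclose(A, A_rebuilt))        # True: A really is the sum of rank-1 pieces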

Math people are good at finding connections between things.


The spectrum of the matrix A is also closely related to the frequencies at which the ordinary differential equation xdot = Ax oscillates!
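
A tiny numpy illustration of that connection (the matrix is chosen purely for convenience): the imaginary parts of the eigenvalues of A are the angular frequencies of the oscillating modes of xdot = Ax.

    import numpy as np

    omega = 2.0
    A = np.array([[0.0, omega],
                  [-omega, 0.0]])           # harmonic oscillator written in first-order form
    eigvals = np.linalg.eigvals(A)
    print(eigvals)                          # approximately [0+2j, 0-2j]
    print(np.abs(eigvals.imag))             # angular frequency omega = 2.0 for both modes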


According to the almighty Wikipedia, the connection is correct but turned out to be an accident: David Hilbert, who coined the term "spectral theory", was surprised when it was found to be applicable to quantum mechanical spectra.


If a spectrum is just a range of numbers, then the idea of a spectrum should apply to many phenomena, since many things can be described by a number. So when the matrix spectrum and the atomic spectrum were formulated, perhaps "spectrum" was a word in vogue for describing a quantity. I think that's why the same term is used for both: many connected phenomena can be quantified. I know only a little about matrix spectra and atomic spectra, so take my thought with a grain of salt.


I love that on this website I can find comments that I simply don't understand a word of. Good on you for doing the stuff you do sir.


I love Dr. Higham’s blog. In case anyone missed it, the whole series is on GitHub: https://github.com/higham/what-is


Any suggestions on what to learn in Linear Algebra after Gilbert Strang’s 18.06SC?

https://ocw.mit.edu/courses/18-06sc-linear-algebra-fall-2011...

My goal is to learn the math behind machine learning.


Here are some links from my favourite reference website:

Book: Mathematics for Machine Learning (mml-book.github.io) https://news.ycombinator.com/item?id=16750789

Mathematics for Machine Learning [pdf] (mml-book.com) https://news.ycombinator.com/item?id=21293132

Mathematics of Machine Learning (2016) (ibm.com) https://news.ycombinator.com/item?id=15146746


If you want a beautiful abstract perspective on linear algebra to complement Strang's more down-to-earth, matrix- and linear equation-oriented lectures, pick up Axler's Linear Algebra Done Right.


He also released a lecture series recently (Axler himself!) which barely anyone seems to be talking about: https://www.youtube.com/playlist?list=PLGAnmvB9m7zOBVCZBUUmS...


These lectures are fantastic after you've mastered the basics of linear algebra https://www.youtube.com/watch?v=McLq1hEq3UY (convex optimization, by a very experienced and often funny lecturer)


Stephen Boyd is a very good lecturer! I watched his videos on linear dynamical systems almost a decade ago and thought he did a fantastic job. Would highly recommend.


The problem sets for the Stanford linear dynamical systems course he taught are also fantastic. The level of difficulty and the breadth of applications leave you with so many tools.


The math used in ML papers is very diverse; there is too much to learn it all. It is easier if you pick a problem and learn the math for it. Find a group that is already working on that problem and ask them for the best way to learn the relevant math.


A strong foundation in linear algebra, multivariable calculus, probability, and statistics is going to be generally applicable almost no matter what problem you work on.


The vast majority of the mathematics required to really understand ML is just probability, calculus and basic linear algebra. If you know these already and still struggle, it's because the notation is often terse and assumes a specific context. The only answer to that is to keep reading key papers and working through the math, ideally in code as well.

For most current-gen deep learning there's not even that much math; it's mostly a growing library of what are basically engineering tricks. For example, an LSTM is almost exclusively very basic linear algebra with some calculus used to optimize it. But by its nature the calculus can't be done by hand, and the actual implementation of all that basic linear algebra is tricky and takes practice.
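
To make that concrete, here is a single LSTM step in plain numpy; it really is a handful of matrix-vector products plus elementwise nonlinearities. The weight names, shapes and gate ordering are illustrative, not any particular framework's convention.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W: (4*hidden, n_in), U: (4*hidden, hidden), b: (4*hidden,)
        z = W @ x + U @ h_prev + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)    # input, forget, output gates
        g = np.tanh(g)                                  # candidate cell state
        c = f * c_prev + i * g                          # new cell state
        h = o * np.tanh(c)                              # new hidden state
        return h, c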

You'll learn more by implementing things from scratch based on the math than you will by trying to read through all the background material hoping that one day it will all make sense. It only ever makes sense through implementation and continuous reading and practice.


Amen!! This is the best way to learn anything technical -- put things into practice to understand the theory. It's also important to keep revisiting the theory to understand results, rather than parroting some catchphrase to "explain" results.


You might enjoy these notes: http://mlg.eng.cam.ac.uk/teaching/4f13/2122/ They give (I think) a good general overview, while also going a little bit more in-depth in a few areas (e.g., Gaussian Processes).


You could take Strang’s follow-up course on learning from data.

https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-an...


I don't know if 18.065 was worth the time; it felt like a repeat of 18.06 with hardly anything new (and not nearly enough context around actual applications).


I am a big fan of Murphy's ML book [1].

[1] https://smile.amazon.com/Probabilistic-Machine-Learning-Intr...

It covers almost all the math you'd need to start doing ML research. I find it to be the ideal 'one book to rule them all' for CS people in ML, although pure math grads in ML might find that it doesn't go deep enough.


Now apply them to the data used to identify the "Big Five" personality traits! This is an interesting application of factor analysis/matrix factorization:

https://en.wikipedia.org/wiki/Big_Five_personality_traits


Yes, the Big Five were found by factor analysis, usually PCA, which is basically an eigenvalue (spectral) decomposition of the (symmetric positive semidefinite) covariance matrix of the data; that in turn coincides with the SVD of the centred data matrix, whose right singular vectors are the eigenvectors and whose squared singular values are proportional to the eigenvalues.
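
A quick numpy check of that equivalence (a sketch; it assumes the data matrix has been column-centred):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
    Xc = X - X.mean(axis=0)                 # centre the columns

    # Route 1: eigendecomposition of the covariance matrix
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

    # Route 2: SVD of the centred data matrix
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Same variances along the principal directions
    print(np.allclose(np.sort(evals), np.sort(s**2 / (len(Xc) - 1))))   # True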

The SVD is everywhere. IIRC, the first Netflix Prize was won using the SVD:

https://en.wikipedia.org/wiki/Netflix_Prize

https://cseweb.ucsd.edu/classes/fa17/cse291-b/reading/Progre...



Where's the data? It could indeed be interesting.


Thank you for this wonderfully concise summary: it’s convenient to have all this in one compact document.

I suppose “flops” means “floating-point operations” here? Heretofore I’ve always encountered this as an abbreviation for “floating-point operations per second”.


When I used Matlab as an undergrad in the late '80s, it reported the flop count after each command, and those were floating-point operations, no time units involved. I tried to find an image of the old command line but found this historical note first [1] and thought it would be of interest to readers of the article.

[1] http://www.stat.uchicago.edu/~lekheng/courses/309f14/flops/v... and


Indeed, the meaning of “flops” is ambiguous, but that seems hard to avoid. Fortunately, the ambiguity is easily resolved from context in most cases, as it is here.


It's always been very ambiguous in practice. To avoid ambiguity I usually use `flop/s` but not everyone likes that :)


Speaking of SVD doctors, I heard many years ago (from Alan Edelman) that Gene Golub's license plate used to be "DR SVD". Later he switched to "PROF SVD".

I couldn't confirm the DR SVD part, but the PROF SVD story appears to be real: https://www.mathworks.com/company/newsletters/articles/profe...


As I recall, DR SVD was the license plate on his second car. He once left the car at SFO airport, and when he heard I was flying in, he asked me to drive it to his house (keys under the seat!). What he didn't tell me was that it was a stick shift, and only once before had a friend taught me how that works. After driving around San Francisco for Gene I got better at it, but it was definitely scary at times. :-)


I heard something very similar from some of his very close (ex-)colleagues, so it appears to be true :)


> To avoid ambiguity I usually use `flop/s` but not everyone likes that :)

Flop/s makes no sense and is outright wrong. The whole point of flops is to express how many floating point operations are required by an algorithm. How many operations are performed per second is a property of the hardware you're using to run an implementation of the algorithm.

You want to express computational complexity in terms of floating point operations.


Well, yeah, I use it in the context of floating-point operations per second. That's the most common use in my field. I was replying to the parent comment. No need to use this kind of tone.


Maybe the_svd_doctor uses flop/s to describe hardware, like you said


The number of operations per second varies by at least an order of magnitude on the same hardware depending on the GEMM algorithm you use (reference or Goto), and you quote the performance of the implementation in terms of FLOP/s knowing the number of FLOPs required by the computation. That makes sense to people who implement and measure these things.


See this playlist on semidefinite programming: https://www.youtube.com/playlist?list=PLqwozWPBo-FtOsjrla4jk... for the surprising way factorization can be used to partially solve an NP-hard problem.


The article presents a note on the six well-known matrix decompositions. He states that all of them have cubic complexity, but practical algorithms with better exponents exist for all of them.


They're all basically O(M(n)), where M(n) is your matrix multiplication time. Even though M(n) <= n^2.37..., it's reasonable to say it's n^3 because, in practice, no one uses the sub-cubic algorithms. Strassen is possibly workable, but it isn't widely used, and all of the sub-cubic algorithms have accuracy tradeoffs.
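
For reference, here is what a toy Strassen recursion looks like in numpy (purely illustrative: it assumes n is a power of two and ignores the accuracy and cache issues that keep it out of most production BLAS libraries):

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:
            return A @ B                    # fall back to ordinary GEMM for small blocks
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

        # 7 recursive multiplications instead of 8
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)

        C = np.empty_like(A)
        C[:m, :m] = M1 + M4 - M5 + M7
        C[:m, m:] = M3 + M5
        C[m:, :m] = M2 + M4
        C[m:, m:] = M1 - M2 + M3 + M6
        return C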


The fact that they're all cubic isn't really the notable part of the runtime of computing the different decompositions, because the constants involved are really different. In practice, a common reason for computing many of these decompositions is to solve a linear system Ax = b, because with the decomposition in hand it is really easy to solve the whole system (using e.g. back substitution). For instance, with C++'s Eigen, look at the 100x100 column of [1]: there are orders-of-magnitude differences between the fast and slow approaches. They're all still cubic, sure, but we're talking a 168x difference here.

(Of course, it's not so clear cut, since robustness varies, not all methods are applicable, and the benchmark is for solving the system rather than computing the decomposition; but overall, knowing which decompositions are fast and which are not is absolutely crucial for practitioners.)

[1]: https://eigen.tuxfamily.org/dox/group__DenseDecompositionBen...
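
For concreteness, here is roughly what that "factor once, then solve cheaply" pattern looks like with SciPy's LU routines (a sketch; the other decompositions slot into the same shape):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(2)
    A = rng.standard_normal((100, 100))
    lu, piv = lu_factor(A)                  # the O(n^3) factorization, done once

    for _ in range(10):                     # each solve is only O(n^2) substitution
        b = rng.standard_normal(100)
        x = lu_solve((lu, piv), b)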


> but practical algorithms with better exponents exist for all of them.

I'm aware of randomized algorithms with better complexity, which come at the cost of only giving approximate results (though the approximation may be perfectly good for practical purposes). See e.g. [1]. Are there other approaches?

[1] https://doi.org/10.1137/090771806


If the data is sparse (which is not uncommon for large matrices in the real world), you can exploit the sparsity to do significantly better than O(n^3).
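
For instance, with SciPy's sparse machinery (a sketch; how much you gain depends entirely on the sparsity pattern and the ordering the solver picks):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 10_000
    # Tridiagonal (1-D Laplacian-like) system: about 3n nonzeros instead of n^2 entries
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
    b = np.ones(n)
    x = spla.spsolve(A, b)                  # sparse LU under the hood, far cheaper than dense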


Could you link a practical algorithm with an exponent lower than 3? (I think of these things https://en.wikipedia.org/wiki/Computational_complexity_of_ma... as not being practical, but I'd love to be wrong.)


For something Strassen-ish you could look at https://jianyuhuang.com/papers/sc16.pdf and the GPU implementation https://apps.cs.utexas.edu/apps/sites/default/files/tech_rep...


That's matrix-matrix multiplication. Nobody disputes that Strassen etc. have sub-cubic complexity. What about one of the six decompositions mentioned, as GP claimed?


I responded to a post about the practicality of MM multiplication methods, though GEMM is quite fundamental.


https://www.fgemm.com, coming soon.


Can you provide a source?

For example, the Golub-Kahan-Reinsch SVD method generally involves bidiagonalisation of the matrix and then implicit QR steps, which are often achieved with Householder reflections and Givens rotations. Sure, all of those are conceptually matrix multiplications, but they're not implemented as O(N^3) matrix multiplications; rather, their special structure is exploited so that they're faster. Yet the entire algorithm is still cubic, so I'm not sure Strassen would accelerate things.
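
For illustration, here is the kind of Givens rotation those QR steps apply; it zeroes one chosen entry, and applying it only touches two rows (the helper name is made up, and it assumes (a, b) is not the zero vector):

    import numpy as np

    def givens(a, b):
        # Rotation [[c, s], [-s, c]] with c*a + s*b = r and -s*a + c*b = 0
        r = np.hypot(a, b)
        return a / r, b / r

    A = np.array([[3.0, 1.0],
                  [4.0, 2.0]])
    c, s = givens(A[0, 0], A[1, 0])
    G = np.array([[c, s],
                  [-s, c]])
    print(G @ A)                            # the (1, 0) entry is now zero; first column has norm 5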


Imo, a better way to present this is to draw a diagram where various decompositions are connected by arrows that explain how one decomposition can be turned into another.


That is fascinating. Do you have an example of this which you can point to?


This is a fantastically clear outline of this topic. Thank you!

> The terms "factorization" and "decomposition" are synonymous and it is a matter of convention which is used. Our list comprises three factorizations and three decompositions.

I can't tell if this is a joke: right after saying that these two words mean the same thing in this context, they are then used to categorize the methods.

Edit: This is the kind of content that proves the "dead internet theory" is wrong.


> I can't tell if this is a joke

It makes sense; the author is just using the original name of each method to bucket them, e.g. spectral decomposition.


It's not a categorisation, but some trivia regarding the name. For example: By convention, the SVD is called the SVD (singular value decomposition), not the SVF (singular value factorisation). That would be synonymous, but that's not what it's called. Similarly for the other 5 methods.


the words are used synonymously in practice, but i think there's an opportunity to distinguish them usefully.

suggestion:

- 'factoring' is a multiplicative breakdown, a 'composition', like prime factorization.

- 'decomposition' could also be called 'partition', and is an additive breakdown, like how 3 could be split into 2 + 1


From a theoretical perspective the most fundamental decomposition is the rank-decomposition:

A = X D Y

with X, Y invertible and D diagonal. It's called the rank decomposition because you can read off the rank of A by counting the non-zero entries in D. It's also useful for determining bases for the image and the kernel of A. Every math student learns a version of this in their first linear algebra course.

Curiously, the rank decomposition is not covered in the numerical literature. Also, it's not possible to derive a rank decomposition directly from the LU or QR decompositions, although the underlying algorithms (Gaussian elimination, Gram-Schmidt) could be used to do so.

It took me multiple weeks of work to understand what the problem with these algorithms is and what the practical options for establishing a rank decomposition are. Full details are available on my blog:

https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom...

TL;DR. Your options are (1) SVD, (2) QR factorization with column pivoting (part of LAPACK), (3) LDU factorization with total pivoting (not implemented in BLAS/LAPACK/etc.)
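
As a sketch of option (2), here is QR with column pivoting through SciPy's LAPACK bindings, with a numerical rank read off the diagonal of R. The tolerance is a modelling choice, not something the factorization hands you.

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(3)
    A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))   # rank 3 by construction

    Q, R, piv = qr(A, pivoting=True)        # A[:, piv] = Q @ R, with |R[0,0]| >= |R[1,1]| >= ...
    tol = np.abs(R[0, 0]) * 1e-10           # an arbitrary relative threshold
    rank = int(np.sum(np.abs(np.diag(R)) > tol))
    print(rank)                             # 3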


This decomposition is not used much because it is not unique in any sense, it is not numerically stable, and it is fairly uninteresting: the matrix D can always be taken to consist of a diagonal of 1s of length rank(A) and the rest zeros. It is much more interesting once X and Y are constrained to something like unitary matrices (in which case we get SVD).

This “rank decomposition” is a bit interesting algebraically rather than numerically, since then the numerical stability problems disappear. Also, if we take all matrices to be over the integers (and require that “invertible” means the inverse is also an integer matrix), then the rank decomposition is a lot like the Smith normal form of a matrix.


It's not unique, but it can be implemented in a way that is stable. E.g. SVD does this (but is quite overkill for the requirements).

It's highly relevant from an algebraic perspective, so it's curious that it's not covered (at all) in the numerical literature.


I could be misunderstanding, but I think you'd get all the algebraic content out of it by computing the dimensions of the image and kernel, then working with them from there on. Why would you want to have a matrix decomposition that separated them?

Granted, computing the dimension of the kernel is not so easy, especially because a pair of vectors can be arbitrarily close without being linearly dependent. No wonder there is no stable way to do it; it technically exceeds the capability of finite-precision numbers. Scale a vector with unequal components by most numbers, especially ones with non-terminating binary representations, and the result, once rounded back to finite precision, is linearly independent of the original.

Clearly then, linear independence on computers has to be considered in the continuous sense in which singular values reveal it, where a very small singular value represents an "almost-kernel," which is the closest thing to a kernel you are likely to find outside of carefully constructed examples or integer matrices.
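
A small numpy illustration of that "almost-kernel" point (the sizes and the perturbation are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))   # exactly rank 2
    A_noisy = A + 1e-10 * rng.standard_normal((5, 5))               # tiny perturbation

    s = np.linalg.svd(A_noisy, compute_uv=False)
    print(s)                                # two sizeable values, three values around 1e-10
    tol = 1e-8                              # the rank decision is now a tolerance choice
    print(int(np.sum(s > tol)))             # 2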


> I could be misunderstanding, but I think you'd get all the algebraic content out of it by computing the dimensions of the image and kernel, then working with them from there on. Why would you want to have a matrix decomposition that separated them?

Well. Sometimes you want to know a solution to a problem and not only the dimension of the solution space : )

Also, for composition: if you have "compatible" matrices B, C, how do you compute the restrictions A|_ker(B), A|_im(B), or the co-restrictions (factor projections) A/ker(C), A/im(C), etc.?


Numerical linear algebra is very different from linear algebra.


As it turns out, the author does cover the rank-decomposition (under the name "Rank-Revealing Factorization") in his blog as well:

https://nhigham.com/2021/05/19/what-is-a-rank-revealing-fact...


Higham there adds the condition that X, Y be well-conditioned (as they are in the SVD, for example). That's a problem with the Jordan decomposition A = X J X^-1: the X is non-singular in theory, but can be ill-conditioned, and thus the Jordan decomposition is not stable and not really used in practice (as highlighted in the original article).


Isn't that just Gaussian elimination?


What makes it the most fundamental?


It's basically the matrix form of dim(domain(f)) = dim(ker(f)) + dim(im(f)), or domain(f)/ker(f) ≅ im(f) with f (inducing) the isomorphism.


But the image and kernel are not the only two properties of a matrix.


Sure, but from an algebraic (categorical) perspective, taking kernels, cokernels and images is about the most fundamental operation you can think of.

The question is: how do you compute them in an efficient (avoiding a full SVD) and stable way?


Isn't the Jordan decomposition the most fundamental one for theory?


Higham is an outstanding researcher. One book I really enjoyed from him is https://epubs.siam.org/doi/book/10.1137/1.9780898717778. He has many many other excellent ones.


It is important to mention the role of randomized algorithms for approximate solutions. See this excellent and practical survey.

https://arxiv.org/abs/2002.01387


QR factorization leads to a lot of useful stuff: eigenvalues/eigenvectors, the SVD (singular value decomposition), PCA (principal component analysis).


QR iteration (which uses QR decomposition) truly is a remarkable and beautiful algorithm.

It was even named one of the top 10 algorithms of the 20th century: https://archive.siam.org/pdf/news/637.pdf
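
For anyone curious, a bare-bones unshifted QR iteration really is this short. Treat it as a sketch: production codes add Hessenberg reduction, shifts and deflation, and the plain version below only converges nicely when the eigenvalue magnitudes are well separated.

    import numpy as np

    def qr_iteration(A, steps=500):
        Ak = A.copy()
        for _ in range(steps):
            Q, R = np.linalg.qr(Ak)
            Ak = R @ Q                      # similarity transform, so the eigenvalues are preserved
        return np.diag(Ak)                  # the diagonal approaches the eigenvalues

    B = np.random.default_rng(5).standard_normal((4, 4))
    A = (B + B.T) / 2                       # symmetric, so the eigenvalues are real
    print(np.sort(qr_iteration(A)))
    print(np.sort(np.linalg.eigvalsh(A)))   # compare against LAPACK's answer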


Missing the LDLT decomposition.


It can't always be computed accurately; I suspect that's why it's not mentioned.

Bunch–Kaufman is the "right" factorization for indefinite Hermitian/symmetric matrices, but it's not as well known.


Boo LU!!! Go with stable QR!


But LU, when it works, takes fewer flops (only half as many as QR, though, IIRC) and doesn't require square roots. I think that's why it used to be preferred when compute was a very scarce resource. Clearly that's not really the case anymore today.

But otherwise I agree: QR is better on almost every front.



