"Why yet another book on learning theory? ...the main reason is that I felt that the current trend in the mathematical analysis of machine learning was leading to overly complicated arguments and results that are often not relevant to practitioners. Therefore, my aim was to propose the simplest formulations that can be derived from first principles, trying to remain rigorous without overwhelming readers with more powerful results that require too much mathematical sophistication."
From my own reading and experience on the mathematical analysis approach of this "training goes brrr" approach, I thought the material in Chapter 12, Overparameterized Models, was interesting and coherent with 12.2.4 Linear Regression with Gaussian Projections being an especially elegant explanation. It would be interesting to hear if you had read/skimmed/purused this section and found it wanting etc.
This is the PDF of the following 2011 book focused on FFTs and fast arithmetic for both real numbers and finite fields. https://www.amazon.com/Matters-Computational-Ideas-Algorithm...
The author is Jörg Arndt: born 1964 in Berlin, Germany. He studied theoretical physics at the University of Bayreuth and the Technical University of Berlin (Diploma, 1995), and received his PhD in Mathematics, supervised by Richard Brent, at the Australian National University, Canberra, in 2010.
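For a flavour of the "fast arithmetic via FFTs" theme the book covers, here is a small illustrative sketch in numpy (my own toy code, not code from the book): multiplying two polynomials through FFT convolution in O(n log n) instead of the schoolbook O(n^2).

    import numpy as np

    def poly_multiply_fft(a, b):
        """Multiply two polynomials (coefficient lists, lowest degree first)
        via FFT convolution in O(n log n) instead of the schoolbook O(n^2)."""
        n = len(a) + len(b) - 1              # length of the product polynomial
        size = 1 << (n - 1).bit_length()     # next power of two for the FFT
        fa = np.fft.rfft(a, size)            # transform both coefficient vectors
        fb = np.fft.rfft(b, size)
        prod = np.fft.irfft(fa * fb, size)[:n]   # pointwise multiply, then invert
        return np.rint(prod).astype(int)     # round away floating-point noise

    # (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
    print(poly_multiply_fft([1, 2, 3], [4, 5]))   # [ 4 13 22 15]

The same pointwise-multiply-in-the-transform-domain trick underlies fast integer multiplication and, with number-theoretic transforms, the finite-field arithmetic the book also treats.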
Interesting that the underlying model, a LoRA fine-tune of Qwen2.5-Coder-32B, relies on synthetic data from Claude[1]:
But we had a classic chicken-and-egg problem—we needed data to train the model, but we didn't have any real examples yet. So we started by having Claude generate about 50 synthetic examples that we added to our dataset. We then used that initial fine-tune to ship an early version of Zeta behind a feature flag and started collecting examples from our own team's usage.
...
This approach let us quickly build up a solid dataset of around 400 high-quality examples, which improved the model a lot!
I checked the training set[2], but couldn't quickly identify which examples were Claude-produced. It would be interesting to see them called out separately.
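For anyone curious how small the training side of this can be in code, here is a rough sketch of what a LoRA fine-tune of this kind might look like with Hugging Face transformers + peft. To be clear, this is my own illustrative configuration, not the Zeta team's actual training code: the ranks, hyperparameters, target modules and the edit_examples.jsonl file are all placeholders.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    base = "Qwen/Qwen2.5-Coder-32B"          # base model named in the post
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    # Wrap the base model with low-rank adapters; only the adapters are trained.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()       # a tiny fraction of the 32B parameters

    # A few hundred examples (synthetic or collected), one "text" field per record.
    ds = load_dataset("json", data_files="edit_examples.jsonl")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments("zeta-lora", per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, num_train_epochs=2,
                               learning_rate=2e-4, bf16=True, logging_steps=10),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
    model.save_pretrained("zeta-lora-adapter")   # saves only the adapter weights

The adapter checkpoint is small relative to the base model, which is part of why iterating on a fine-tune like this is so cheap compared with pretraining.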
The hardware, tooling and time required to do a LoRA fine-tune like this are extremely accessible.
Financially this is also not a big expense; I assume it would have cost on the order of hundreds of dollars in GPU rentals, possibly less if you ignore experimentation time.
So what is the barrier to entry here? The data? Well, they didn't have that either, so they bootstrapped a dataset of just a few hundred examples to achieve the task.
I'm sure they spent some time on that, but again it doesn't sound like an incredibly challenging task.
It's worth realising, if you've not delved into fine-tuning LLMs before, that in terms of time, scale and financial cost there is a world of difference between building a product like this and building a base model.
I'll defer to other experts, but (briefly) normalizing flows are a method for constructing complex distributions by transforming a probability density through a series of invertible transformations. Normalizing flows are trained with a plain log-likelihood objective, and they are capable of exact density evaluation and efficient sampling. See:
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation. ICLR Workshop, 2015. https://arxiv.org/pdf/1410.8516
And for your direct question, the following paper "Efficient Bayesian Sampling Using Normalizing Flows to Assist Markov Chain Monte Carlo Methods" appears upon a superficial glance to be relevant. Link: https://arxiv.org/pdf/2107.08001
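To make the "invertible transformations + exact log-likelihood" point concrete, here is a toy sketch (my own code, not from either paper) of a RealNVP-style affine coupling layer in PyTorch; the NICE paper's additive coupling is the special case where the scale is fixed to 1. The transform is invertible by construction, and its log-det-Jacobian is just the sum of the predicted log-scales, so exact log-likelihood training is cheap.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """Coupling layer: invertible by construction, with a log-det-Jacobian
        equal to the sum of the predicted log-scales."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.d = dim // 2
            self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * (dim - self.d)))

        def forward(self, x):                      # data -> latent
            x1, x2 = x[:, :self.d], x[:, self.d:]
            log_s, t = self.net(x1).chunk(2, dim=1)
            z2 = x2 * torch.exp(log_s) + t
            return torch.cat([x1, z2], dim=1), log_s.sum(dim=1)

        def inverse(self, z):                      # latent -> data (sampling)
            z1, z2 = z[:, :self.d], z[:, self.d:]
            log_s, t = self.net(z1).chunk(2, dim=1)
            x2 = (z2 - t) * torch.exp(-log_s)
            return torch.cat([z1, x2], dim=1)

    # Exact log-likelihood via change of variables:
    #   log p_x(x) = log p_z(f(x)) + log |det df/dx|
    flow = AffineCoupling(dim=2)
    base = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))
    x = torch.randn(8, 2)                          # stand-in for training data
    z, log_det = flow(x)
    loss = -(base.log_prob(z) + log_det).mean()    # negative log-likelihood
    loss.backward()

Real flows stack many such layers (permuting which coordinates are held fixed), but the training objective stays this plain log-likelihood.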
It comes from the Jacobian, which you can get from autodiff. It measures how much volume distortion the transformation creates and normalizes for it, so that the density integrates correctly without blowing up gradients.
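A small sketch of what I take that to mean (my own toy example, not from the thread): for an invertible map f, the change-of-variables correction is log|det J_f(x)|, and an autodiff framework can hand you the full Jacobian directly.

    import torch
    from torch.autograd.functional import jacobian

    def f(x):
        # A toy invertible map R^2 -> R^2, standing in for one flow transform.
        return torch.stack([x[0] + x[1] ** 3, torch.exp(x[1])])

    x = torch.tensor([0.5, -1.2])
    J = jacobian(f, x)                        # full 2x2 Jacobian via autodiff
    sign, logabsdet = torch.linalg.slogdet(J) # stable log|det J|

    # Change of variables: log p_x(x) = log p_z(f(x)) + log|det J_f(x)|.
    base = torch.distributions.Normal(0.0, 1.0)
    log_px = base.log_prob(f(x)).sum() + logabsdet
    print(float(logabsdet), float(log_px))

Computing a dense log-determinant is O(d^3), which is why practical flow architectures are designed so the Jacobian is triangular (as in the coupling layer above) and the log-det reduces to a cheap sum.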
This actually seems like just a very clever market segmentation solution, since the GPU was already limited to 8x PCIe lanes (it's a laptop GPU, see https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4060-Laptop...). The 'addition' of the M.2 SSD makes it a unique offering. Limiting it to only one drive is another way to keep the thermal envelope down. Kudos to the Asus design and product development folks.
Most desktop-class boards can only support bifurcating a 16x slot into two 8x slots, so it's more about preventing user error, as two M.2 slots would likely not function in most cases.
Even on server-class hardware, the split options for a single slot are usually (but not always) 16x, 2× 8x, and 4× 4x; an 8x + 2× 4x option is somewhat unusual.
The paper, Trigger-Level Event Reconstruction for Neutrino Telescopes Using Sparse Submanifold Convolutional Neural Networks, is here https://arxiv.org/abs/2303.08812.
The crucial advantage of SSCNNs is that the approach "replaces traditional convolutions with sparse submanifold convolutions", which operate "only on the non-zero elements", and this creates efficiencies.
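A rough sketch of the core idea (my own illustration in plain Python/numpy, not the paper's implementation, which uses a dedicated sparse-convolution library): a submanifold convolution only evaluates the kernel at the already-active (non-zero) sites and only reads from active neighbours, so the active set, and therefore the cost, does not grow layer by layer.

    import numpy as np

    def submanifold_conv2d(active, kernel):
        """active: dict mapping (row, col) -> feature value (the non-zero sites).
        kernel:  a (3, 3) array. Outputs are defined only at the same active
        sites, and each output sums over active neighbours only, so the
        sparsity pattern is preserved through the layer."""
        out = {}
        for (r, c) in active:
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nbr = (r + dr, c + dc)
                    if nbr in active:                 # skip empty sites entirely
                        acc += kernel[dr + 1, dc + 1] * active[nbr]
            out[(r, c)] = acc
        return out

    # A sparse "track" of hits in a 1000x1000 grid: only 4 sites are non-zero,
    # so the convolution touches 4 sites instead of 1,000,000.
    hits = {(10, 10): 1.0, (11, 11): 2.0, (12, 12): 1.5, (13, 13): 0.5}
    k = np.ones((3, 3)) / 9.0
    print(submanifold_conv2d(hits, k))

That sparsity preservation is exactly what makes the approach attractive for detector data, where hits occupy a tiny fraction of the instrumented volume.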
The author, David Mumford, is known for his distinguished work in algebraic geometry, and was awarded the Fields Medal in 1974. [2]
Pattern theory was formulated by Ulf Grenander to describe knowledge of the world as patterns. [3]
Prof. Mumford explains that "[s]everal essential ideas brought me to realize how Grenander's Pattern Theory was the right way to understand almost all cognitive skills and especially vision. One was the emphasis on pattern synthesis as well as pattern analysis." [0]. Second, "was that natural signals given by functions f vary not only by random additive perturbations but often by composition with random rearrangements of their domain. The resulting probability distribution in the vector space of signals is nothing like Gaussian. Its support is usually a twisted snakey submanifold. This puts the lie to all simplistic Gaussian pattern recognition algorithms." [0]. And third, "graphical structures were everywhere in the representations of ideas in cognitive domains" [0].
His most recent work seems to be "Pattern Theory, the Stochastic Analysis of Real World Signals" [1].
Something I stumbled across in the last day or two brought Ulf Grenander[1] and his idea of "Pattern Theory" to my attention, and I've been going down a bit of a rabbit-hole digging into this. Discovering David Mumford's work was especially interesting given that he is, as you say, a Fields Medalist. Not to mention his also being a MacArthur Fellow and a recipient of a National Medal of Science. Not to go all "appeal to authority", but if he thinks this is valuable territory, then it's probably worth a little bit of my time exploring it.
Would you consider doing a write-up somewhere after getting further into your investigation of pattern theory? I'd certainly be interested in a distillation of any insights you find and I imagine others would too.
Francis Bach, the author, makes a good faith effort to explain exactly why this material is beneficial (see https://francisbach.com/my-book-is-out/):
"Why yet another book on learning theory? ...the main reason is that I felt that the current trend in the mathematical analysis of machine learning was leading to overly complicated arguments and results that are often not relevant to practitioners. Therefore, my aim was to propose the simplest formulations that can be derived from first principles, trying to remain rigorous without overwhelming readers with more powerful results that require too much mathematical sophistication."
From my own reading and experience with the mathematical analysis of this "training goes brrr" approach, I thought the material in Chapter 12, Overparameterized Models, was interesting and coherent, with 12.2.4, Linear Regression with Gaussian Projections, being an especially elegant explanation. It would be interesting to hear whether you had read/skimmed/perused this section and found it wanting, etc.
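For anyone curious what that setup looks like, here is a rough numerical sketch as I understand it (my own toy code, not Bach's): inputs are passed through a random Gaussian projection of dimension m and fitted with a minimum-norm least-squares estimator, and sweeping m through the interpolation threshold shows the spike in test error that the overparameterized-models chapter analyses. The dimensions, noise level and the claim that the spike sits near m ≈ n are my own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, n_test, noise = 50, 40, 2000, 0.5

    # Ground-truth linear model with Gaussian inputs.
    w = rng.normal(size=d) / np.sqrt(d)
    X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
    y = X @ w + noise * rng.normal(size=n)
    yt = Xt @ w + noise * rng.normal(size=n_test)

    for m in [5, 20, 35, 40, 45, 80, 200]:
        S = rng.normal(size=(d, m)) / np.sqrt(m)    # random Gaussian projection
        Z, Zt = X @ S, Xt @ S                       # projected features
        theta = np.linalg.pinv(Z) @ y               # min-norm least-squares fit
        err = np.mean((Zt @ theta - yt) ** 2)
        print(f"m = {m:4d}  test MSE = {err:.3f}")  # expect a spike near m ≈ n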