"Why yet another book on learning theory? ...the main reason is that I felt that the current trend in the mathematical analysis of machine learning was leading to overly complicated arguments and results that are often not relevant to practitioners. Therefore, my aim was to propose the simplest formulations that can be derived from first principles, trying to remain rigorous without overwhelming readers with more powerful results that require too much mathematical sophistication."
From my own reading and experience on the mathematical analysis approach of this "training goes brrr" approach, I thought the material in Chapter 12, Overparameterized Models, was interesting and coherent with 12.2.4 Linear Regression with Gaussian Projections being an especially elegant explanation. It would be interesting to hear if you had read/skimmed/purused this section and found it wanting etc.
This is the PDF of the following 2011 book focused on FFTs and fast arithmetic for both real numbers and finite fields. https://www.amazon.com/Matters-Computational-Ideas-Algorithm...
The author is Jörg Arndt: born 1964 in Berlin, Germany. He studied theoretical physics at the University of Bayreuth and the Technical University of Berlin (Diploma, 1995), and received his PhD in Mathematics, supervised by Richard Brent, at the Australian National University, Canberra, in 2010.
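For a flavour of the "fast arithmetic via FFTs" theme the book covers, here is a small illustrative sketch in numpy (my own toy code, not code from the book): multiplying two polynomials through FFT convolution in O(n log n) instead of the schoolbook O(n^2).

    import numpy as np

    def poly_multiply_fft(a, b):
        """Multiply two polynomials (coefficient lists, lowest degree first)
        via FFT convolution in O(n log n) instead of the schoolbook O(n^2)."""
        n = len(a) + len(b) - 1              # length of the product polynomial
        size = 1 << (n - 1).bit_length()     # next power of two for the FFT
        fa = np.fft.rfft(a, size)            # transform both coefficient vectors
        fb = np.fft.rfft(b, size)
        prod = np.fft.irfft(fa * fb, size)[:n]   # pointwise multiply, then invert
        return np.rint(prod).astype(int)     # round away floating-point noise

    # (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
    print(poly_multiply_fft([1, 2, 3], [4, 5]))   # [ 4 13 22 15]

The same pointwise-multiply-in-the-transform-domain trick underlies fast integer multiplication and, with number-theoretic transforms, the finite-field arithmetic the book also treats.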
Interesting that the underlying model, a LoRA fine-tune of Qwen2.5-Coder-32B, relies on synthetic data from Claude[1]:
But we had a classic chicken-and-egg problem—we needed data to train the model, but we didn't have any real examples yet. So we started by having Claude generate about 50 synthetic examples that we added to our dataset. We then used that initial fine-tune to ship an early version of Zeta behind a feature flag and started collecting examples from our own team's usage.
...
This approach let us quickly build up a solid dataset of around 400 high-quality examples, which improved the model a lot!
I checked the training set[2], but couldn't quickly identify which examples were Claude-produced. It would be interesting to see them called out separately.
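For anyone curious how small the training side of this can be in code, here is a rough sketch of what a LoRA fine-tune of this kind might look like with Hugging Face transformers + peft. To be clear, this is my own illustrative configuration, not the Zeta team's actual training code: the ranks, hyperparameters, target modules and the edit_examples.jsonl file are all placeholders.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    base = "Qwen/Qwen2.5-Coder-32B"          # base model named in the post
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    # Wrap the base model with low-rank adapters; only the adapters are trained.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()       # a tiny fraction of the 32B parameters

    # A few hundred examples (synthetic or collected), one "text" field per record.
    ds = load_dataset("json", data_files="edit_examples.jsonl")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments("zeta-lora", per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, num_train_epochs=2,
                               learning_rate=2e-4, bf16=True, logging_steps=10),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
    model.save_pretrained("zeta-lora-adapter")   # saves only the adapter weights

The adapter checkpoint is small relative to the base model, which is part of why iterating on a fine-tune like this is so cheap compared with pretraining.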
The hardware, tooling and time required to do a LoRA fine-tune like this are extremely accessible.
Financially this is also not a big expense; I assume it would have cost on the order of hundreds of dollars in GPU rentals, possibly less if you ignore experimentation time.
So what is the barrier to entry here? The data? Well, they didn't have that either, so they bootstrapped a dataset of just a few hundred examples to achieve the task.
I'm sure they spent some time on that, but again it doesn't sound like an incredibly challenging task.
It's worth realising, if you've not delved into fine-tuning LLMs before, that in terms of time, scale and financial cost there is a world of difference between building a product like this and building a base model.
I'll defer to other experts, but (briefly) normalizing flows are a method for constructing complex distributions by transforming a probability density through a series of invertible transformations. Normalizing flows are trained with a plain log-likelihood objective, and they are capable of exact density evaluation and efficient sampling. See:
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation. ICLR Workshop, 2015. https://arxiv.org/pdf/1410.8516
And for your direct question, the following paper "Efficient Bayesian Sampling Using Normalizing Flows to Assist Markov Chain Monte Carlo Methods" appears upon a superficial glance to be relevant. Link: https://arxiv.org/pdf/2107.08001
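To make the "invertible transformations + exact log-likelihood" point concrete, here is a toy sketch (my own code, not from either paper) of a RealNVP-style affine coupling layer in PyTorch; the NICE paper's additive coupling is the special case where the scale is fixed to 1. The transform is invertible by construction, and its log-det-Jacobian is just the sum of the predicted log-scales, so exact log-likelihood training is cheap.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """Coupling layer: invertible by construction, with a log-det-Jacobian
        equal to the sum of the predicted log-scales."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.d = dim // 2
            self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * (dim - self.d)))

        def forward(self, x):                      # data -> latent
            x1, x2 = x[:, :self.d], x[:, self.d:]
            log_s, t = self.net(x1).chunk(2, dim=1)
            z2 = x2 * torch.exp(log_s) + t
            return torch.cat([x1, z2], dim=1), log_s.sum(dim=1)

        def inverse(self, z):                      # latent -> data (sampling)
            z1, z2 = z[:, :self.d], z[:, self.d:]
            log_s, t = self.net(z1).chunk(2, dim=1)
            x2 = (z2 - t) * torch.exp(-log_s)
            return torch.cat([z1, x2], dim=1)

    # Exact log-likelihood via change of variables:
    #   log p_x(x) = log p_z(f(x)) + log |det df/dx|
    flow = AffineCoupling(dim=2)
    base = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))
    x = torch.randn(8, 2)                          # stand-in for training data
    z, log_det = flow(x)
    loss = -(base.log_prob(z) + log_det).mean()    # negative log-likelihood
    loss.backward()

Real flows stack many such layers (permuting which coordinates are held fixed), but the training objective stays this plain log-likelihood.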
It comes from the Jacobian, which you can get from autodiff. It measures how much volume distortion the transformation creates and normalizes for it, so that the density integrates correctly without blowing up gradients.
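A small sketch of what I take that to mean (my own toy example, not from the thread): for an invertible map f, the change-of-variables correction is log|det J_f(x)|, and an autodiff framework can hand you the full Jacobian directly.

    import torch
    from torch.autograd.functional import jacobian

    def f(x):
        # A toy invertible map R^2 -> R^2, standing in for one flow transform.
        return torch.stack([x[0] + x[1] ** 3, torch.exp(x[1])])

    x = torch.tensor([0.5, -1.2])
    J = jacobian(f, x)                        # full 2x2 Jacobian via autodiff
    sign, logabsdet = torch.linalg.slogdet(J) # stable log|det J|

    # Change of variables: log p_x(x) = log p_z(f(x)) + log|det J_f(x)|.
    base = torch.distributions.Normal(0.0, 1.0)
    log_px = base.log_prob(f(x)).sum() + logabsdet
    print(float(logabsdet), float(log_px))

Computing a dense log-determinant is O(d^3), which is why practical flow architectures are designed so the Jacobian is triangular (as in the coupling layer above) and the log-det reduces to a cheap sum.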
This actually seems like just a very clever market segmentation solution, since the GPU was already limited to 8x PCIe lanes (it's a laptop GPU, see https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4060-Laptop...). The 'addition' of the M.2 SSD makes it a unique offering. Limiting it to only one drive is another way to keep the thermal envelope down. Kudos to the Asus design and product development folks.
Most desktop-class boards can only support bifurcating a 16x slot into two 8x slots, so it's more about preventing user error, as two M.2 slots would likely not function in most cases.
Even on server-class hardware, the split options for a single slot are usually (but not always) 16x, 2× 8x, and 4× 4x; an 8x + 2× 4x option is somewhat unusual.
The paper, Trigger-Level Event Reconstruction for Neutrino Telescopes Using Sparse Submanifold Convolutional Neural Networks, is here https://arxiv.org/abs/2303.08812.
The crucial advantage of SSCNNs is that the approach "replaces traditional convolutions with sparse submanifold convolutions", which operate "only on the non-zero elements", and this creates efficiencies.
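A rough sketch of the core idea (my own illustration in plain Python/numpy, not the paper's implementation, which uses a dedicated sparse-convolution library): a submanifold convolution only evaluates the kernel at the already-active (non-zero) sites and only reads from active neighbours, so the active set, and therefore the cost, does not grow layer by layer.

    import numpy as np

    def submanifold_conv2d(active, kernel):
        """active: dict mapping (row, col) -> feature value (the non-zero sites).
        kernel:  a (3, 3) array. Outputs are defined only at the same active
        sites, and each output sums over active neighbours only, so the
        sparsity pattern is preserved through the layer."""
        out = {}
        for (r, c) in active:
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nbr = (r + dr, c + dc)
                    if nbr in active:                 # skip empty sites entirely
                        acc += kernel[dr + 1, dc + 1] * active[nbr]
            out[(r, c)] = acc
        return out

    # A sparse "track" of hits in a 1000x1000 grid: only 4 sites are non-zero,
    # so the convolution touches 4 sites instead of 1,000,000.
    hits = {(10, 10): 1.0, (11, 11): 2.0, (12, 12): 1.5, (13, 13): 0.5}
    k = np.ones((3, 3)) / 9.0
    print(submanifold_conv2d(hits, k))

That sparsity preservation is exactly what makes the approach attractive for detector data, where hits occupy a tiny fraction of the instrumented volume.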
The author, David Mumford, is known for his distinguished work in algebraic geometry, and was awarded the Fields Medal in 1974. [2]
Pattern theory was formulated by Ulf Grenander to describe knowledge of the world as patterns. [3]
Prof. Mumford explains that "[s]everal essential ideas brought me to realize how Grenander's Pattern Theory was the right way to understand almost all cognitive skills and especially vision. One was the emphasis on pattern synthesis as well as pattern analysis." [0]. Second, "was that natural signals given by functions f vary not only by random additive perturbations but often by composition with random rearrangements of their domain. The resulting probability distribution in the vector space of signals is nothing like Gaussian. Its support is usually a twisted snakey submanifold. This puts the lie to all simplistic Gaussian pattern recognition algorithms." [0]. And third, "graphical structures were everywhere in the representations of ideas in cognitive domains" [0].
His most recent work seems to be "Pattern Theory, the Stochastic Analysis of Real World Signals" [1].
Something I stumbled across in the last day or two brought Ulf Grenander[1] and his idea of "Pattern Theory" to my attention, and I've been going down a bit of a rabbit-hole digging into this. Discovering David Mumford's work was especially interesting given that he is, as you say, a Fields Medalist. Not to mention his also being a MacArthur Fellow and a recipient of a National Medal of Science. Not to go all "appeal to authority", but if he thinks this is valuable territory, then it's probably worth a little bit of my time exploring it.
Would you consider doing a write-up somewhere after getting further into your investigation of pattern theory? I'd certainly be interested in a distillation of any insights you find and I imagine others would too.
Francis Bach, the author, makes a good faith effort to explain exactly why this material is beneficial (see https://francisbach.com/my-book-is-out/):
"Why yet another book on learning theory? ...the main reason is that I felt that the current trend in the mathematical analysis of machine learning was leading to overly complicated arguments and results that are often not relevant to practitioners. Therefore, my aim was to propose the simplest formulations that can be derived from first principles, trying to remain rigorous without overwhelming readers with more powerful results that require too much mathematical sophistication."
From my own reading and experience with the mathematical analysis of this "training goes brrr" approach, I thought the material in Chapter 12, Overparameterized Models, was interesting and coherent, with 12.2.4, Linear Regression with Gaussian Projections, being an especially elegant explanation. It would be interesting to hear whether you had read/skimmed/perused this section and found it wanting, etc.
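For anyone curious what that setup looks like, here is a rough numerical sketch as I understand it (my own toy code, not Bach's): inputs are passed through a random Gaussian projection of dimension m and fitted with a minimum-norm least-squares estimator, and sweeping m through the interpolation threshold shows the spike in test error that the overparameterized-models chapter analyses. The dimensions, noise level and the claim that the spike sits near m ≈ n are my own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, n_test, noise = 50, 40, 2000, 0.5

    # Ground-truth linear model with Gaussian inputs.
    w = rng.normal(size=d) / np.sqrt(d)
    X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
    y = X @ w + noise * rng.normal(size=n)
    yt = Xt @ w + noise * rng.normal(size=n_test)

    for m in [5, 20, 35, 40, 45, 80, 200]:
        S = rng.normal(size=(d, m)) / np.sqrt(m)    # random Gaussian projection
        Z, Zt = X @ S, Xt @ S                       # projected features
        theta = np.linalg.pinv(Z) @ y               # min-norm least-squares fit
        err = np.mean((Zt @ theta - yt) ** 2)
        print(f"m = {m:4d}  test MSE = {err:.3f}")  # expect a spike near m ≈ n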