Decomposing language models into understandable components (anthropic.com)
445 points by tompark on Oct 8, 2023 | 62 comments


Just ran across this useful comparison with another very recent paper that effectively corroborates some of the core findings; I believe it's by an author of the other paper: https://www.lesswrong.com/posts/F4iogK5xdNd7jDNyw/comparing-...


What a great post, thanks for sharing.


Oh dang, I am quite literally working on this as a side project (out of mere curiosity).

Well, sort of: I'm refining an algorithm that takes several (carefully calibrated) outputs from a given LLM and infers the most plausible set of parameters behind it. I was expecting to find clusters of parameters very similar to what they observe.

I informally call this problem "inverting" an LLM, and obviously it turns out to be non-trivial to solve. Not completely impossible, though: so far I've found some good approximations.

Anyway, quite an interesting read; I'll definitely keep an eye on what they publish in the future.

Also, from the linked manuscript at the end,

>Another hypothesis is that some features are actually higher-dimensional feature manifolds which dictionary learning is approximating.

Well, you have something that behaves like a continuous, smooth space, so you could define as many manifolds as you need, so yes :^). But, pedantry off, I get the idea, and IMO that's definitely what's going on and the right framework to approach this problem from.

One amazing realization you can get from this: what is the conceptual equivalent of the transition functions that connect all the different manifolds in this LLM space? When you see it, your mind will be blown, not because of its complexity but because of its exceptional simplicity.


At first I thought this was an ode to dang.


Oh dang, a name so spry, A clever soul, with humor wry, In life's vast game, you do not shy, A friend to all, a bond we tie.


> One amazing realization one can get from this is, what is the conceptual equivalent of the transition functions that connect all different manifolds in this LLM space?

Could you elaborate on what you mean by "transition functions" here?


likely https://en.m.wikipedia.org/wiki/Atlas_(topology)#Transition_...

although the quoted sentence does not make sense to me; transition maps connect different patches of one manifold. it's possible the "LLM space" gp is talking about is a parameter space of some nature each of whose points is a manifold, but that seems like a stretch
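
(for anyone else following along, the textbook definition: given two overlapping charts (U_alpha, phi_alpha) and (U_beta, phi_beta) on a single manifold M, the transition map is

    \tau_{\beta\alpha}
      = \varphi_\beta \circ \varphi_\alpha^{-1}
      : \varphi_\alpha(U_\alpha \cap U_\beta) \to \varphi_\beta(U_\alpha \cap U_\beta)

i.e. it translates coordinates between two descriptions of the same patch of the same manifold, which is why "connecting different manifolds" reads oddly to me.)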


You're right, that's a bit ambiguous.

Instead of "all different manifolds" I should have written "all different ways to define a manifold". Also, now that I think about it, it may not necessarily be all of them.

The important thing here is that, however you define them and their transition maps, you'll find they're very much alike. As if there was some sort of general structure that is highly preferred over others ...


"if you look at the (chart) transition maps for different atlases on this manifold, they tend to look similar for different atlases"

what is the manifold here and what evidence do you have for this / what does it look like when you "define" them?



Yeah, that's the definition I was thinking of, too, but the sentence didn't make sense to me, either.


What was your approach to getting started doing this?

I’m curious to learn more about how LLMs work too.


Try to get something like tinygrad[1] running locally; that way you can tweak things a bit, run it again, and see how it performs. While doing this you'll pick up most of the concepts and get a feel for how things work. Also take a look at projects like llama.cpp[2], though you don't have to fully understand what's going on there.

You may need some intermediate knowledge of linear algebra and of this thing called "data science" nowadays, which is pretty much knowing how to wrangle data and visualize it.

Try creating a small model on your own; it doesn't have to be super fancy, just make sure it does something you want it to do. After that you can probably carry on by yourself.

1: https://github.com/tinygrad/tinygrad

2: https://github.com/ggerganov/llama.cpp
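
In case a concrete starting point helps, here's an illustrative from-scratch sketch of the kind of small model I mean (plain numpy, nothing tinygrad-specific; the XOR task, layer sizes and learning rate are arbitrary): a two-layer net trained with hand-written backprop.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy task: learn XOR with a two-layer MLP, trained by hand-written backprop.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
    W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 1.0
    for step in range(5000):
        # forward pass
        h = np.tanh(X @ W1 + b1)      # hidden layer, shape (4, 8)
        p = sigmoid(h @ W2 + b2)      # output probabilities, shape (4, 1)

        # backward pass (binary cross-entropy loss)
        dlogits = (p - y) / len(X)    # gradient w.r.t. pre-sigmoid output
        dW2 = h.T @ dlogits; db2 = dlogits.sum(0)
        dh = dlogits @ W2.T * (1 - h ** 2)   # tanh derivative
        dW1 = X.T @ dh; db1 = dh.sum(0)

        # plain gradient descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(np.round(p, 3))  # should be close to [[0], [1], [1], [0]]

Once something like this makes sense, seeing how tinygrad builds the same thing out of autodiff ops is a much smaller jump.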


Do you mean in the sense it’s hierarchical? Or am I missing the point entirely


I'm not talking about that specifically, but your intuition is also correct, and there's a lot of research going on around constructing/defining hierarchies of "learning" behavior.


This looks like a big advance in alignment research. A big problem has been that LLMs were just a giant set of inscrutable numbers, and we had no idea what was going on inside.

But if this technique scales up, then Anthropic has fixed that. They can figure out what different groups of neurons are actually doing, and use that to control the LLM's behavior. That could help with preventing accidentally misaligned AIs.


To me, it sounds more like a good lead for pruning.


> We find that the features that are learned are largely universal between different models, so the lessons learned by studying the features in one model may generalize to others.

Hm. I wish they'd said more about that. Does that mean they found the same feature recognizers when training with the same training set? Or what? This tells us something, but what does it tell us?


Some architectures are relatively well understood. E.g. in CNNs, the first layers detect low-level features like edges, gradients, etc. The next layer then combines these into more complex structures like corners or circles. The layer after that combines those into even higher-level features, and so on. [1]

Typically, you can take a pre-trained model and retrain it on your new dataset by only changing the weights of the last layer(s).

Some loss functions even measure the difference between the high-level features of two images, typically extracted from a pre-trained CNN (perceptual loss).

[1] Matt Zeiler did amazing work on these findings 10 years ago (https://arxiv.org/abs/1311.2901).
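
The "retrain only the last layer(s)" part looks roughly like this in PyTorch (just an illustrative sketch: resnet18 and the 10-class head are placeholders for whatever backbone and dataset you actually use):

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load a CNN pre-trained on ImageNet; its early layers already detect
    # edges, textures and object parts, as in the Zeiler & Fergus visualizations.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Freeze every pre-trained weight so training won't touch them.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final classification layer with a fresh one for the new task
    # (10 classes here as a placeholder). Only this layer gets trained.
    model.fc = nn.Linear(model.fc.in_features, 10)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One illustrative training step on a dummy batch.
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 10, (8,))
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()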


This makes me wonder what would happen if neural networks contained manually programmed components. It seems like trivial components, such as detecting DNA sequences, could be programmed in by manually setting the weights. The same could be done, for example, to give neural networks a maths component. Would the network, when training, discover and make use of these predefined components, or would it ignore them and make up its own ways of detecting DNA sequences?
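
As a rough illustration of what "manually setting the weights" could look like (the TATA motif, the conv layer and the wiring are all made up for the example): a frozen 1-D convolution whose kernel is hand-set to fire on a specific DNA substring, which a trainable network could then use or ignore.

    import torch
    import torch.nn as nn

    # One-hot DNA encoding: A, C, G, T -> channels 0..3.
    BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

    def one_hot(seq):
        x = torch.zeros(1, 4, len(seq))
        for i, b in enumerate(seq):
            x[0, BASES[b], i] = 1.0
        return x

    # A 1-D convolution whose single kernel is hand-set to fire on "TATA".
    motif = "TATA"
    detector = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=len(motif), bias=True)
    with torch.no_grad():
        detector.weight.zero_()
        for i, b in enumerate(motif):
            detector.weight[0, BASES[b], i] = 1.0
        detector.bias.fill_(-(len(motif) - 0.5))  # only a full match stays above 0
    for p in detector.parameters():
        p.requires_grad = False  # frozen: the network can use it but not change it

    # The fixed detector's output could then be concatenated with learned
    # features and fed into the rest of a trainable network.
    scores = torch.relu(detector(one_hot("GGTATACC")))
    print(scores)  # nonzero exactly where the motif starts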


This is called feature engineering, if you want to look up the history and uses of this idea.

Edit: tokenising is a form of this; you're pre-transforming the data to save the model from having to learn patterns you know are important.


You can manually program transformers:

https://srush.github.io/raspy/

I don't know if you can integrate them into a model. I think you might run out of space, since these aren't polysemantic and so would take up a lot more "room" than learned neurons.


In a way, this could be considered adding a speculative transformation of the input as part of the input to some layer, and the network deciding whether or not to use that transformation. It would be akin to a convolution layer in a CNN, albeit far more domain-specific. But I’m not sure how much research has been done on weird layers like this!


This is indeed interesting. In certain use cases where precision is paramount, we might opt for manually crafted code for the computations. This allows us to be confident in the efficiency of our manual method, rather than relying on an LLM for such a specific task. However, it remains unclear whether this would be directly integrated into the network or simply be a tool at the LLM's disposal. Interestingly, this situation seems to parallel the choice between enhancing the human brain with something like Neuralink and simply equipping it with a calculator.


I wonder what the limitations are. Are LLMs Turing complete?


I am hoping that this type of research leads into ways to create highly tuned and steerable models that are also much smaller and more efficient.

Because if you can see what each part is doing, then theoretically you can find ways to create just the set of features you want. Or maybe tune features that have redundant capacity or something.

Maybe by studying the features they will get to the point where the knowledge can be distilled into something more like a very rich and finely defined knowledge graph.


Anthropic must be walking on multi-dimensional tightropes. They want AI safety, and probably want to avoid every Tom, Dick and Harry having a powerful model. But research output picked up by Meta and various Discord groups could turn the woolly open LLMs into powerful contenders, and then the power is accessible to all. I don't have a strong opinion on what is better, but I lean slightly towards models in the open.

After all, us plebs are already allowed to use computers and the latest CPUs and the internet and all that! Yes, there is shit happening, like scams and worse, but it is better than limiting what people can do.


On the other hand, GPS is a precedent for intentional nerfing for civilians.


Easier to do if you are literally interoperating with things in orbit than if it's just software.


One large model is not how the brain works. It’s not how org charts work.

That LLMs are capable of what they are, at the compute density they run at, strongly signals to me that the task of making a productive knowledge worker is in overhang territory.

The missing piece isn’t LLM advancement, it’s LLM management.

Building trust in an inwardly-adversarial LLM org chart that reports to you.


The way these systems work feels massively inefficient.

We don't re-evaluate our astrophysics models when reading a cookbook.


Neither does GPT-4, nor do other sparse mixtures of experts, such as switch transformers [1].

[1] https://arxiv.org/abs/2101.03961
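
For anyone unfamiliar: a sparse mixture-of-experts layer routes each token to one (or a few) expert sub-networks, so most of the model stays idle for any given input. A minimal top-1 ("switch"-style) routing sketch, with arbitrary sizes and without the load-balancing loss the paper adds:

    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        """Minimal switch-style layer: each token is sent to a single expert."""

        def __init__(self, d_model=64, d_ff=256, n_experts=4):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                        # x: (tokens, d_model)
            gates = torch.softmax(self.router(x), dim=-1)
            weight, expert_idx = gates.max(dim=-1)   # top-1 expert choice per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():                       # only the chosen expert runs
                    out[mask] = weight[mask, None] * expert(x[mask])
            return out

    tokens = torch.randn(10, 64)
    print(Top1MoE()(tokens).shape)  # torch.Size([10, 64])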


Speak for yourself :)


I'm just curious: how polysemantic is the human brain at the level of individual neurons? Because it feels to me that what you really want, and what the human brain might have, is a high-information (feature-based / concept-based / macro-pattern-based) monosemantic neural network, and where there are polysemantic neurons, they share similar or the same information within the feature they are part of (leading to space efficiency, as well as computational efficiency). Whereas in transformer models like this, it's as if you're superimposing a million human brains on top of the same network and then somehow averaging out all the features in the training set into unique neurons (naturally leading to a much larger "brain").

They also mention in the paper that monosemantic neurons in the network don't work well, but my intuition is that this is because they are way too "high precision" and aren't encoding enough information at the feature level. Features are IMO low-dimensional, so a monosemantic high-dimensional neuron would then encode way too little information, or something. But this is based on my lack of knowledge of the human brain, so maybe there are way more similarities than I'm aware of...


This is kind of really cool.

All these LLMs appear to be converging around these features.


I am a layperson. As I understand it, a trained model describes transitions from one symbol to the next, with probabilities between nodes. There is a structure to this graph — after all, if there weren't, training would be impossible — but this structure is as if it were all written on one sheet of paper with the definitions of each node all inked on top of each other in different colors.

This research (and its parent and sibling papers, from the LW article) seems to be about picking out those colored graph components from the floating-point soup?


Wait, embeddings have been used for classification for a long time now. Can somebody explain what is new here?

edit: ah, I looked at the paper; they did it unsupervised, with a sparse autoencoder.
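
For anyone curious, the rough shape of that sparse autoencoder idea: train a wide, L1-penalized autoencoder on recorded activations so that the dictionary directions, rather than individual neurons, become the candidate interpretable features. An illustrative sketch (sizes and the L1 coefficient are made up; the actual paper has more machinery than this):

    import torch
    import torch.nn as nn

    d_act, d_dict = 512, 4096          # dictionary ~8x overcomplete (illustrative sizes)

    class SparseAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(d_act, d_dict)
            self.decoder = nn.Linear(d_dict, d_act)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(features)              # reconstruction of the activations
            return recon, features

    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    l1_coeff = 1e-3                                     # sparsity penalty strength (made up)

    acts = torch.randn(1024, d_act)                     # stand-in for recorded MLP activations
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    loss.backward()
    opt.step()
    # After training on real activations, each column of sae.decoder.weight is a
    # candidate "feature" direction, and `features` says how strongly each one fires.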


From a machine learning layman's point of view, but with some experience in modeling, it's hard to see this as a discovery. Model decomposition and model reduction techniques are very basic concepts in mathematical modeling, and decomposing models into modes with high participation is a very basic technique, which boils down to finding linear combinations of basis vectors that are more expressive.

This is even less surprising given LLMs are applied to models with a known hierarchical structure and symmetry.

Can anyone say exactly what's novel in these findings? From a layman's point of view, this sounds like announcing the invention of gunpowder.


All machine learning is just renormalization, which in turn is a convolution in a Hopf algebra. That's why you see superposition.

"In physics, wherever there is a linear system with a "superposition principle", a convolution operation makes an appearance."

I'm working this out in more detail, but it is uncanny how well it works out.

I have a discord if you want to discuss this further

https://discord.cofunctional.ai


Do you mean all ML or just large neural networks? Where is renormalization in a tree model? What superposition are you referring to?


Renormalization is all about this symmetric partitioning.


I suppose we should be cautious; the human mind is capable of overfitting too.


You have no clue what you are talking about.


Is this going to be submitted for publication?


...so that we can censor these even more.


So, I came up with a pretty decent neural net from scratch about 20 years ago - it ran in the browser in Flash. It basically had a 10x10 bitmap input and an output of the same size, and lots of "neurons" in between that strengthened or weakened their connections based on feedback from the end result. And at a certain point they randomly mutated how they processed the input.

I don't see anything wildly different now, other than scale and youth and the hubris that accompanies those things.


Except the emergent properties at scale? At some point you go from producing word-like sentences to, after upping the neurons/architecture, real-sounding sentences, and then, upping again with RLHF loops, you get impressive emergent intelligence and the ability to solve tasks that were not foreseen. It is a rare bird that's not impressed with 2020s AI.


> emergent intelligence and ability to solve tasks that were not forseen

What's your best examples of this? Some of the most impressive examples I've seen ended up being likely in the dataset, or very close to being so. I've yet to see something where it definitely wasn't approximately in the dataset and was solved in a way that seemed to use some sort of novel process, but open to being wrong.


A good example is the hundreds of conversation histories I have with GPT-4, where it does everything from helping me code entirely novel and original ideas to developing more abstract ones.

Every single day, I get immense use out of modern language models. Even if an output is similar to something it's already processed, that's fine! Such is the nature of synthesis.


> entirely novel and original ideas

They are not novel if there is an equivalent pattern in the training dataset. I guess you are not really trying anything that isn't already available on GitHub or Google in some form. If you think you are, then please show an example of an "entirely novel and original idea" that GPT-4 developed for you. I had at least 4 cases in which ChatGPT failed to produce a correct solution (after pushing it for hours to correct itself in many ways) to an actually novel problem (solution not longer than 200 lines of code) for which there was no solution on Google or GitHub. But you can't blame a statistical model that was trained to create the most probable outcomes based on its training data.


What you need to understand is the concept of metapatterns. GPT-4 has generalized so much that it is able to learn the "patterns of patterns" in many domains. I don't require anything fancier to drastically improve my workflow right now.

Usually, drawing from existing knowledge is the whole appeal of using GPT. Over time, you actually begin to get a sense of what the model is good at and bad at, and it's good at an incredible amount of things. I get it to write novel code constantly and I think that playing with it and confirming that for yourself is better than me showing you.


That's anything at scale. Emergence isn't a sole feature of NNs. NNs are to emergent behavior what crypto is to cash: hyping an enormous waste of resources with the promise to solve every problem, when any given problem has already been solved more elegantly. If you don't believe me about NNs, look at the caloric burden of the human brain, for fuck's sake.


I agree the energy cost is concerning. And we are lucky we don't have unlimited coal, unlimited power and unlimited GPUs because we'd hit 4 degrees warming by Christmas with everyone trying it out.

The human brain is a salient point because often we are using AI so that the human brain can do less. Get this GPU to RTFM instead of the human. The human time is more valuable. All the while making the human brain probably less effective (compare someone who learns another language vs. someone who speaks it through an AI translator only).

I hold both points of view that AI is both marvelous, but also concerning in terms of energy use.

To nitpick: "NNs are to emergent behavior what crypto is to cash" applies more to large language models. Simpler NNs for easy tasks that don't consume much power wouldn't apply (they might be more like a Visa card?).


I'm sorry, what is your argument?

Is it that this behavior is the result of any system at scale? That is undeniably preposterous.

Is it that the human brain is more efficient? At energy usage, sure, but for me to find an individual who is capable enough to assist me in the manner GPT does, at the speed and level of breadth and depth that it does, would be next to impossible. If I did, their required compensation would be astronomical.

What are you arguing for or against? Are you aware that these systems will, like all previous computationally intensive systems, become drastically more efficient over time?


Maybe not novel novel, but you can get it to write code in an application's automation language and assist users of that application with its general intelligence too (so it can figure out what the user intends, decide what to do in the app, and generate the code to do it). With a good UI that passes along and executes the automation code automatically, you now have magic in your app.


As a fellow old person, the way I think about it is that every time I have a thought like that it's because the neural networks inside my head have stopped being updated and are resulting in wildly outdated pattern matching. "So this car thing is just like a horse but this time with circular legs? Nothing new under the sun, I swear".


Nah. It really isn't new. Some of the neologisms take a moment to understand, but they all refer to the same ideas. "Inference" is when you show it stuff and it shows you results. "Tokens" are the matrix of bitmaps or whatever you show it, turned into a "vector" which is a set of bleeps and boops like what you send over a modem. "transformers" are just squeezing and scaling your tokens. It's fucking bog simple, stupid script kiddie shit. Inferring it or getting the inference to do what you want, anyway. Or whatever these, uh, "data scientists" with an online certificate actually do.

Apparently, sit there and write plain English at a billion-dollar cluster and wait for astonishing answers at 300bps.


You don't see the intermediate steps from a 10x10 neural net to LLMs?

Like, a whole decade of ML: better optimisers, better init, residual connections, tokenisation and token embeddings, training with large batches over thousands of machines, the attention mechanism, causal masking, flash attention and other memory optimisations, and even having the foresight to train on the totality of web text.

Not seeing the intermediate steps doesn't mean they are not essential and needed.

If you still believe a toy 10x10 fully connected net is the same as current models (bar scaling), then what is your opinion on MLP-Mixer? That was an "MLP is all you need" moment, but it didn't lead to adoption.


You're describing genetic programming and a very simple neural net, which is cool. However, the utility of transformer models should not be discounted, and if that interested you 20 years ago, you would be blown away by what's possible today.


Wow, the hubris in this comment... Almost as much as in the infamous Dropbox comment: https://news.ycombinator.com/item?id=9224

You too might be a Hubris News celebrity one day.


So you wrote a toy script 20 years ago and somehow think you were doing just the same thing as OpenAI now? And you thought writing this down was a good flex?

Reminds me of the old dudes in the gym who come to you to tell you how they used to bench four plates when they were young. In their mind, they are badasses. In their mind only.


No, I'm saying that, even lacking a modern GPU back then, the results of Microsoft doing the same thing at massive scale were already fairly obvious. That there has been no major innovation. That what you think of as amazing is actually banal.

BTW, awesome "flex" about how much time you spend at the gym, you sound like a guy who knows what he's talking about ;)



