
Wait, are you saying SoTA NN research hasn't evolved from hardcoding a bunch of layer structures and sizes?

I'm kind of shocked. I thought there would be more dynamism by now and I stopped dabbling in like 2018.



There is a tick-tock between searching for the dominant NN architecture (tick) and optimizing it for accuracy, compute, and inference latency and throughput (tock).

This particular (tock) is still playing out. The next (tick) does not feel imminent and will likely depend on when we discover the limits of transformers when it comes to solving for the long tail of use cases.

My $0.02.


You have to consider that there are still some low hanging fruit that let you improve prompt processing (not token generation) performance by an order of magnitude or even two, but there are no takers. The reason is quite simple. You can just buy more GPUs and forget about the optimizations.

If a 100x improvement in performance is left on the table, then surely even lower priority optimizations won't be implemented any time soon.

Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down which tokens actually matter and discard the rest from the KV cache. If this were actually possible, then how come the first few layers of the LLM don't already do this numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're wasting it on mostly... nothing.

I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No. What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?

Notice how the model has already learned the optimal attention scheme on its own. You just need to give it less stuff to attend over and it will get faster automatically.
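
To make the idea concrete, here is a toy NumPy sketch (my own naming, not anyone's actual kernel) of single-query attention that only touches the top 10% of cached tokens, ranked here by their raw attention scores:

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def pruned_attention(q, K, V, keep_frac=0.1):
        # q: (d,), K: (T, d), V: (T, d) -- the full KV cache of the previous layer
        scores = K @ q                             # raw attention logits over all T tokens
        k = max(1, int(len(scores) * keep_frac))
        top = np.argpartition(scores, -k)[-k:]     # keep only the k strongest tokens
        w = softmax(scores[top])                   # renormalize over the survivors
        return w @ V[top]                          # weighted sum over the kept values only

    # toy usage: an 8k-token cache, but the deep layer only touches ~800 entries
    rng = np.random.default_rng(0)
    T, d = 8192, 64
    q, K, V = rng.standard_normal(d), rng.standard_normal((T, d)), rng.standard_normal((T, d))
    out = pruned_attention(q, K, V)

The ranking heuristic is just a stand-in; the point is that the softmax and the weighted sum only ever touch the kept subset.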


I don't get your point. How is what you're suggesting here different from the KV cache pruning papers we already have, like [1]?

[1] https://arxiv.org/abs/2305.15805


My wish is that they would move on to the next phase. The whole deal with SSMs looks really good. But the search for better architectures is countered with "a regular architecture with more parameters does better, so what's the point of this?"


IMO, SSMs are an optimization. They don't represent enough of a fundamental departure from the kinds of things Transformers can _do_. So, while I like the idea of saving on the energy costs, I speculate that such savings can be obtained with other optimizations while staying with transformer blocks. Hence, the motivation to change is a bit of an uphill battle here. I would love to hear counter-arguments to this view. :)

Furthermore, I think a replacement will require that we _understand_ what the current crop of models are doing mechanically. Some of it was motivated in [1].

[1] https://openaipublic.blob.core.windows.net/neuron-explainer/...


Quadratic vs linear is not an optimization. It's a completely new game. With selective SSMs (Mamba), the win is that training can be parallelized with a log-depth associative scan, so you go from cost that is quadratic in the input sequence length to linear work with logarithmic parallel depth. If that's just an optimization, it's a huge one.
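
For anyone who hasn't seen it, here is a minimal pure-Python sketch of the scan trick. The combine operator is the standard one for linear recurrences; real implementations (e.g. jax.lax.associative_scan or fused CUDA kernels) run the same thing in parallel, this just shows the idea:

    def combine(left, right):
        # Associative combine for h_t = a_t * h_{t-1} + b_t: each element is a
        # pair (a, b) representing the affine map h -> a*h + b, applied left first.
        a1, b1 = left
        a2, b2 = right
        return a1 * a2, a2 * b1 + b2

    def scan(elems):
        # Inclusive scan with the combine op. Written recursively for clarity;
        # on parallel hardware the recursion depth is O(log T), not O(T).
        if len(elems) == 1:
            return elems
        mid = len(elems) // 2
        left, right = scan(elems[:mid]), scan(elems[mid:])
        carry = left[-1]
        return left + [combine(carry, e) for e in right]

    # check against the sequential recurrence with h_0 = 0
    a, b = [0.9, 0.5, 1.1, 0.7], [1.0, 2.0, 3.0, 4.0]
    h, seq = 0.0, []
    for ai, bi in zip(a, b):
        h = ai * h + bi
        seq.append(h)
    par = [bt for _, bt in scan(list(zip(a, b)))]   # each scanned pair applied to h_0 = 0
    assert all(abs(x - y) < 1e-9 for x, y in zip(seq, par))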


Okay. Respect your point of view. I am curious, what applications do you think SSMs enable that a Transformer cannot? I have always seen it as a drop-in replacement (like for like) but maybe there is more to it.

Personally, I think going linear instead of quadratic for a core operation that a system needs to do is by definition an optimization.


There's something about a transformer being at its core based on a differentiable hash table data structure that makes them special.
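
What I mean by that, as a rough sketch (toy NumPy, my own function name): attention is a soft key-value lookup where exact key matching is replaced by a softmax over similarities.

    import numpy as np

    def soft_lookup(query, keys, values):
        # A "differentiable hash table": no exact key match; every stored value is
        # returned, weighted by how well its key matches the query.
        scores = keys @ query                    # similarity of the query to each key
        w = np.exp(scores - scores.max())
        w /= w.sum()                             # softmax over the keys
        return w @ values                        # soft mixture of the stored values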

I think its dominance is not going to substantially change any time soon. Don't you know, the solution to all leetcode interviews is a hash table?


Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics of the year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!).

There are certainly tradeoffs to both; the general transformer motif scales very well on a number of axes, so it may be the dominant algorithm for a while to come, though almost certainly it will change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).


The solution to AGI is not deep learning. Maybe with more compute and a shitload of engineering it can produce some kind of baby AGI.

My bet is on something other than gradient descent and backprop, but really I don't wish for any company or country to reach AGI or any sophisticated AI...


Magical thinking. Nature uses gradient descent to evolve all of us and our companions on this planet. If something better were out there, we would see it at work in the natural world.


Are you also saying that thoughts are formed using gradient descent? I don't think gradient descent is an accurate way to describe either process in nature. Also, we don't know that we "see" everything that is happening; we don't even understand the brain yet.


Maybe it's there, but in an ethereal form that is ungraspable by mere conscious forms such as ourselves? :P


I like your analogy of a tick-tock ~= epoch of progress.

Step change, then optimization of that step change

Kind of like a grandfather clock with a huge pendulum swinging to one side, then the other (a commonly used metaphor).


Intel has been doing "tick-tock" for almost 20 years - https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model


It's a metaphor that's been used with the advancement of CPU designs at least as far back as the 80s or 90s. Intel uses it explicitly in their marketing nowadays, I believe.


The innovation is the amount of resources people are willing to spend right now. From looking at the research code it's clear that the whole field is basically doing a (somewhat) guided search in the entire space of possible layer permutations.

There seems to be no rhyme or reason, no scientific insight, no analysis. They just try a million different permutations, and whatever scores the highest on the benchmarks gets published.


There's definitely scientific insight and analysis.

E.g. "In-context Learning and Induction Heads" is an excellent paper.

Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypotheses about how these models store information and provides experimental evidence.

The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them you can compute or memorize pretty much anything.
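
As a toy illustration of the associative-memory reading (hypothetical numbers and names, just to make the point concrete):

    import numpy as np

    # Toy: a 1-hidden-layer MLP used purely as an associative memory. A hidden unit
    # fires when the input is close to its stored key; its outgoing weights then
    # write the associated value into the output.
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((16, 32))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # 16 stored keys, 32-dim inputs
    values = rng.standard_normal((16, 8))                 # the values to recall

    def mlp_memory(x, threshold=0.8):
        h = np.maximum(keys @ x - threshold, 0.0)   # ReLU: only near-matching keys fire
        return h @ values                           # fired units emit their stored value

    # querying with a noisy copy of key 3 retrieves (a scaled copy of) value 3
    x = keys[3] + 0.05 * rng.standard_normal(32)
    out = mlp_memory(x)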

Attention provides information routing. Again, that is pretty well-understood.

The rest is basically finding an optimal trade-off. These trade-offs are based on insights drawn from experimental data.

So this architecture is not so much accidental as it is general.

Specific representations used by MLPs are poorly understood, but there's definitely progress on understanding them from first principles by building specialized models.


One 3-layer (1 hidden layer) neural network can already approximate pretty much any function. You don't even need to stack them.
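
A quick toy to back that up (random ReLU hidden layer, only the output weights are fit; this only shows approximation on a grid, nothing about learnability or generalization):

    import numpy as np

    # One hidden layer of random ReLU features + a fitted linear readout
    # approximating sin(x) on a grid.
    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200)[:, None]
    y = np.sin(x).ravel()

    W, b = rng.standard_normal((1, 256)), rng.standard_normal(256)   # fixed random hidden layer
    H = np.maximum(x @ W + b, 0.0)                                   # (200, 256) hidden activations
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)                     # fit only the output layer
    max_err = np.max(np.abs(H @ coef - y))                           # tiny on this grid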


Well it took evolution 4 billion years of testing out random permutations that resulted in a pretty good local maximum, so there is hope for us yet.


"I'm a pretty good local maximum" is what any local maximum would tell you if you asked it how it likes itself.


"The brain is the most important part of the body", the brain said.


Note that not all brains are so severely afflicted by this illusion. Most of them actually understand pretty clearly that they are next to useless without their organic, social, and environmental companions.


[flagged]


Or maybe a watch made by a blind guy instead of a bridge?


Would that make it more of a hear? Since obviously the guy can't watch.


I can't see the flagged comment, but I'm fairly sure @stefs is making a reference to this: https://en.wikipedia.org/wiki/The_Blind_Watchmaker


Yes. The flagged comment wanted to sell moffkalast a bridge, without further explanation. I interpreted it as him saying that the human brain isn't governed by evolution and calling moffkalast naive.


The only thing that has changed since 2018 is the most popular network structure to play with. The code looks the same as always: Python notebooks where someone manually calculated the size of each hard-coded layer to make it fit.
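
The kind of notebook arithmetic I mean looks something like this (hypothetical example):

    # Input images are 28x28, two 3x3 convs with no padding, then a 2x2 max pool,
    # and the flatten size feeding the dense layer gets worked out by hand.
    H = W = 28
    H, W = H - 2, W - 2        # first 3x3 conv, no padding -> 26x26
    H, W = H - 2, W - 2        # second 3x3 conv            -> 24x24
    H, W = H // 2, W // 2      # 2x2 max pool               -> 12x12
    channels = 64
    flatten_size = channels * H * W   # 9216, hard-coded into the next dense layer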


> someone manually calculated the size of each hard-coded layer

I wonder, shouldn't AI be the best tool to optimize itself?


In theory yes, but unfortunately AI hasn't been invented yet


I don't know, wouldn't the AI then be stuck evaluating all possible AI implementations? And since it would face the halting problem, it couldn't single out the very best one, though it could probably return the best one reachable by exhaustive search within a capped amount of resources. That won't necessarily be better than what human beings can come up with given an equivalent amount of resources.


The innovation is that everything is just one standardized structure now (transformer models) and you make it bigger if you feel like you need that.

There's still some room for experimenting if you care about memory/power efficiency, like MoE models, but they're not as well understood yet.
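
For reference, the MoE idea itself is simple enough to sketch in a few lines (toy NumPy, my own names, not any particular model's router):

    import numpy as np

    def moe_forward(x, experts, gate_w, top_k=2):
        # Toy mixture-of-experts: a gate scores the experts, only the top_k run,
        # and their outputs are combined with renormalized gate weights.
        logits = gate_w @ x                       # one score per expert
        top = np.argsort(logits)[-top_k:]         # indices of the chosen experts
        w = np.exp(logits[top] - logits[top].max())
        w /= w.sum()
        return sum(wi * experts[i](x) for wi, i in zip(w, top))

    # usage with 4 "experts" that are just random linear maps
    rng = np.random.default_rng(0)
    d = 16
    mats = [rng.standard_normal((d, d)) for _ in range(4)]
    experts = [lambda x, M=M: M @ x for M in mats]
    gate_w = rng.standard_normal((4, d))
    y = moe_forward(rng.standard_normal(d), experts, gate_w)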


There are too many papers throwing transformers on everything without thinking. Transformers are amazing for language but kinda mid on everything else. CS researchers tend to jump on trends really hard, so it will probably go back to normal again soon.


I don't know what you mean by amazing for language. Almost everything is built on transformers nowadays. Image segmentation uses transformers. Text to speech uses transformers. Voice recognition uses transformers. There are robotics transformers that take image inputs and output motion sequences. Transformers are inherently multi-modal. They handle whatever you throw at them, it's just that language tends to be a very common input or output.


That is not true. Transformers are being applied all over because they work better than what was used before in so many cases.


I've occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for maxing out GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can; a simpler, static model architecture that trains faster can train on much more data.

So until the hardware allows for comparable (say, within 2-4x) throughput in samples per second, I expect the architecture of the most effective models to stay mostly static, and dynamic architectures to remain an interesting side area.


My wild guess is that adjusting the shape before each step is not worth the speed hit. Uniform structures make GPUs go brrrrr


It's also easier to train and in particular easier to parallelize.


There are things like NAS (neural architecture search), but all you are doing is growing the search space and making the optimization problem much harder. Typically you do the architecture optimization by hand, using heuristics and past experiments as guidance.
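
Stripped of the tooling, NAS in its simplest form is basically this loop (hypothetical sketch; train_and_eval stands in for a full training run, which is exactly why it gets expensive):

    import random

    # Sample architectures from a search space, train and evaluate each one,
    # keep the best.
    SEARCH_SPACE = {
        "depth":     [4, 8, 12, 24],
        "width":     [256, 512, 1024],
        "num_heads": [4, 8, 16],
    }

    def sample_config():
        return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

    def random_search(train_and_eval, budget=20):
        best_score, best_cfg = float("-inf"), None
        for _ in range(budget):
            cfg = sample_config()
            score = train_and_eval(cfg)        # this is where all the cost goes
            if score > best_score:
                best_score, best_cfg = score, cfg
        return best_cfg, best_score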


People would love to have dynamism. It's a cost thing.



