
There is a tick-tock between searching for the dominant NN architecture (tick) and optimizing it for accuracy, compute, and inference latency and throughput (tock).

This particular (tock) is still playing out. The next (tick) does not feel imminent and will likely depend on when we discover the limits of transformers when it comes to solving for the long tail of use-cases.

My $0.02.



You have to consider that there is still some low-hanging fruit that would let you improve prompt processing (not token generation) performance by an order of magnitude or even two, but there are no takers. The reason is quite simple: you can just buy more GPUs and forget about the optimizations.

If a 100x improvement in performance is left on the table, then surely even lower priority optimizations won't be implemented any time soon.

Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down the important tokens and discard the rest from the KV cache. If this were actually possible, then how come the first few layers of the LLM don't already do this numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're wasting it on mostly... nothing.

I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No. What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?

Notice how the model has already learned the optimal attention scheme. You just need to give it less stuff to do and it will get faster automatically.
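
To make that concrete, here is a toy NumPy sketch (my own illustration, not any published method; the 10% cutoff is just the number from above): an early layer attends over the full cache and tells you where the attention mass actually went, and a later layer only scores the keys that mattered.

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def attention(q, K, V):
        # q: (d,), K/V: (n, d) -> weighted sum over the n cached positions
        w = softmax(q @ K.T / np.sqrt(K.shape[-1]))
        return w @ V, w

    rng = np.random.default_rng(0)
    n_ctx, d = 8192, 64                       # toy 8k context, single head
    K, V = rng.normal(size=(n_ctx, d)), rng.normal(size=(n_ctx, d))

    # "Early" layer: full attention; record where the mass actually went.
    _, w_early = attention(rng.normal(size=d), K, V)

    # "Late" layer: only score the 10% of positions the early layer cared about.
    keep = np.argsort(w_early)[-n_ctx // 10:]
    out_late, _ = attention(rng.normal(size=d), K[keep], V[keep])

    print(out_late.shape)                     # (64,), with ~10x fewer keys scored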


I don't get your point. How is what you're suggesting here different from the papers we already have on KV cache pruning methods like [1]?

[1] https://arxiv.org/abs/2305.15805


My wish is that they would move on to the next phase. The whole deal with SSMs looks really good. But the search for better architectures is countered with "a regular architecture with more parameters is doing better, so what's the point of this?"


IMO, SSMs are an optimization. They don't represent enough of a fundamental departure from the kinds of things Transformers can _do_. So, while I like the idea of saving on the energy costs, I speculate that such savings can be obtained with other optimizations while staying with transformer blocks. Hence, the motivation to change is a bit of an uphill battle here. I would love to hear counter-arguments to this view. :)

Furthermore, I think a replacement will require that we _understand_ what the current crop of models are doing mechanically. Some of it was motivated in [1].

[1] https://openaipublic.blob.core.windows.net/neuron-explainer/...


Quadratic vs linear is not an optimization. It's a completely new game. With selective SSMs (Mamba) the win is that the recurrence can be parallelized during training via a log-depth associative scan, so you go from something quadratic w.r.t. input sequence length to linear work with logarithmic parallel depth. If that's just an optimization, it's a huge one.
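
To illustrate the trick (my own toy NumPy, not the actual Mamba kernel): the recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so a Hillis-Steele style scan reproduces the sequential result in log2(n) rounds of elementwise combines.

    import numpy as np

    def scan_sequential(a, b):
        # h_t = a_t * h_{t-1} + b_t, one step at a time
        h, out = 0.0, []
        for at, bt in zip(a, b):
            h = at * h + bt
            out.append(h)
        return np.array(out)

    def scan_logdepth(a, b):
        # Hillis-Steele inclusive scan: combine element i with element i-step.
        # (a1, b1) followed by (a2, b2) composes to (a2*a1, a2*b1 + b2),
        # which is associative, so log2(n) rounds suffice.
        A, B = a.astype(float).copy(), b.astype(float).copy()
        step = 1
        while step < len(A):
            A_prev, B_prev = A.copy(), B.copy()
            A[step:] = A_prev[step:] * A_prev[:-step]
            B[step:] = A_prev[step:] * B_prev[:-step] + B_prev[step:]
            step *= 2
        return B                              # B[t] is h_t

    rng = np.random.default_rng(0)
    a, b = rng.uniform(0.5, 1.0, 1024), rng.normal(size=1024)
    print(np.allclose(scan_sequential(a, b), scan_logdepth(a, b)))   # True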


Okay, I respect your point of view. I am curious: what applications do you think SSMs enable that a Transformer cannot? I have always seen them as a drop-in replacement (like for like), but maybe there is more to it.

Personally, I think going linear instead of quadratic for a core operation that a system needs to do is by definition an optimization.


There's something about a transformer being, at its core, based on a differentiable hash table data structure that makes it special.

I think its dominance is not going to substantially change any time soon. Don't you know, the solution to all leetcode interviews is a hash table?
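
For what it's worth, that intuition fits in a few lines of NumPy (just the analogy, nothing more): a hash table returns table[key] on an exact match, while attention returns a softmax-weighted blend of the values whose keys are close to the query, which is what makes it differentiable.

    import numpy as np

    def soft_lookup(query, keys, values, temp=0.1):
        scores = keys @ query / temp          # similarity instead of exact equality
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax: how strongly each key "matches"
        return weights @ values               # differentiable blend of stored values

    rng = np.random.default_rng(0)
    keys   = rng.normal(size=(16, 64))        # 16 entries, 64-dim keys
    values = rng.normal(size=(16, 8))
    query  = keys[3] + 0.01 * rng.normal(size=64)   # query almost identical to key 3

    out = soft_lookup(query, keys, values)
    print(np.allclose(out, values[3], atol=0.1))    # True: behaves like table[3]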


Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics-of-the-year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!)

There are certainly tradeoffs to both; the general transformer motif scales very well on a number of axes, so it may be the dominant algorithm for a while to come, though almost certainly it will change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).


The solution to AGI is not deep learning. Maybe with more compute and a shitload of engineering it can produce a kind of baby AGI.

My bet is on something other than gradient descent and backprop, but really I don't wish for any company or country to reach AGI or any sophisticated AI...


Magical thinking. Nature uses gradient descent to evolve all of us and our companions on this planet. If something better were out there, we would see it at work in the natural world.


Are you also saying that thoughts are formed using gradient descent? I don't think gradient descent is an accurate way to describe either process in nature. Also, we don't know that we "see" everything that is happening; we don't even understand the brain yet.


Maybe it's there, but in an ethereal form that is ungraspable to mere conscious forms like ourselves? :P


I like your analogy of tick-tock ~= an epoch of progress.

Step change, then optimization of that step change

Kind of like a grandfather clock with a huge pendulum swinging to one side, then the other (a commonly used metaphor).


Intel has been doing "tick-tock" for almost 20 years - https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model


It's a metaphor that's been used to describe the advancement of CPU designs at least as far back as the '80s or '90s. Intel uses it explicitly in their marketing nowadays, I believe.



