New attention mechanisms that outperform standard multi-head attention (arxiv.org)
233 points by snats on May 29, 2024 | 49 comments



> The Transformer models, used in this experiment, all have a single attention layer with model dimension and context length 32.

I think we are going to need to see more experiments, especially because the theoretical motivations here are weak.


I’m certainly not a domain expert, but one thing I have read repeatedly about Transformers is that not all tricks scale the same.


Because self-attention can be replaced with an FFT at some loss in accuracy and a reduction in kWh [1], I suspect that the Quantum Fourier Transform could also be substituted for attention in LLMs.

[1] https://syncedreview.com/2021/05/14/deepmind-podracer-tpu-ba...

"You Need to Pay Better Attention" (2024) https://arxiv.org/abs/2403.01643 :

> Our first contribution is Optimised Attention, which performs similarly to standard attention, but has 3/4 as many parameters and one matrix multiplication fewer per head. Next, we introduce Efficient Attention, which performs on par with standard attention with only 1/2 as many parameters and two matrix multiplications fewer per head and is up to twice as fast as standard attention. Lastly, we introduce Super Attention, which surpasses standard attention by a significant margin in both vision and natural language processing tasks while having fewer parameters and matrix multiplications.

"Leave No Context Behind: Efficient Infinite Context Transformers" (2024) https://arxiv.org/abs/2404.07143 :

> A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.


The 2D Discrete Fourier Transform (DFT) can replace attention because the 2D DFT looks a lot like attention. The algorithm loops over each pair of values in the input matrix, first calculating a linear combination of each pair that scales based on the distance between the values in the matrix: closer values have more influence on each other than more distant values. This linear combination is fed into Euler's formula, e^(iθ) = cos θ + i sin θ, which transforms it into a sinusoid. The sinusoids are summed up, giving you an output matrix that is a sum of sinusoids representing the frequency domain.

By comparison, the attention algorithm also combines each pair of values in the input matrix, but that combination involves learned weights, rather than the simple fixed basis functions (sines and cosines) used by the DFT.

The FNet paper suggests that the DFT works as a replacement for attention because in making the pairwise calculations to generate the 2D frequency representation of the input, information on these pairwise relationships is surfaced and made available to later feed-forward layers. As a handy add-on feature, the DFT also makes positional embedding unnecessary, because positional information is encoded by the DFT at each layer. That being said, in the paper they still applied positional embeddings so that they could make a direct comparison with the BERT architecture.
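For anyone who wants to see the shape of this, here is a minimal numpy sketch of the FNet-style mixing step as I understand it (toy dimensions and made-up feed-forward weights; the real model adds embeddings, layer norm, and residual connections):

    import numpy as np

    def fnet_mixing(x):
        # FNet-style token mixing: 2D DFT over the sequence and hidden dims,
        # keeping only the real part. No learned weights in the mixing itself.
        return np.real(np.fft.fft2(x))              # (seq_len, hidden) -> (seq_len, hidden)

    def feed_forward(x, w1, b1, w2, b2):
        # Position-wise feed-forward that consumes the mixed representation.
        return np.maximum(x @ w1 + b1, 0) @ w2 + b2

    # Toy shapes: seq_len=8, hidden=16, ff=32 (illustrative only)
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    w1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
    w2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
    print(feed_forward(fnet_mixing(x), w1, b1, w2, b2).shape)   # (8, 16)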

Amazing how these old signal processing ideas have made their way into neural networks, still proving their effectiveness and efficiency after two centuries or more (Gauss devised the Fast Fourier Transform algorithm in 1805, but we left it mostly on the shelf until it was reinvented by Cooley and Tukey in 1965).


Can't believe that FNet paper flew under my radar all this time—what a cool idea! It's remarkable to me that it works so well considering I haven't heard anyone mention it before! Do you know if any follow-up work was done?


This is the paper, which appears to be cited in hundreds of others, some of which appear to be about efficiency gains. https://arxiv.org/abs/2105.03824


"Fnet: Mixing tokens with fourier transforms" (2021) https://arxiv.org/abs/2105.03824

https://scholar.google.com/scholar?cites=1423699627588508486...

Fourier Transform and convolution.

Shouldn't a deconvolvable NN be more explainable? #XAI

Deconvolution: https://en.wikipedia.org/wiki/Deconvolution


I know AI is moving fast but they were only published within the last month or three.


I was referring to link [1] in GP’s comment, which is to a paper published in 2021, not to either of the more recent papers they published.


Oops, apologies.


Late to the party, but I think my summary is (L is context length, C is hidden dimension, H is head size, C = H * nh):

3.1 Optimised attention: Instead of using a learned W_V to project from C to H, slice V into H sized vectors. (V is just the input tokens X). This is because the matrix multiply is to a lower dimension anyway, so why not just slice. Slicing is just reshaping (L, C) -> (L, nh, H)

3.2 Efficient attention: I think this opens with a typo, "In the last section, we discussed how and why we can remove W_O..." should be W_V not W_O I think. Anyways, same as above, just for the keys this time. Reshape K (which is just X) from (L, C) -> (L, nh, H)

3.3 Super attention: Introduce an (L, L) W_A (lower triangular for masked) that transforms V on the left (X again) from (L, C) -> (L, C), whereas standard attention has a (C, C) W_V that transforms (L, C) -> (L, C) from the right. And they share W_A between heads. It's more efficient when C > L, so for long-context models it's probably not more efficient.
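A shape-level numpy sketch of how I'm reading 3.1/3.2 (names and sizes are mine, and I've glossed over the output projection and causal masking, so treat it as a guess):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    # Toy sizes: context L, model dim C, nh heads of size H = C // nh
    L, C, nh = 4, 8, 2
    H = C // nh
    rng = np.random.default_rng(0)
    X = rng.normal(size=(L, C))

    W_Q = rng.normal(size=(nh, C, H))                    # queries are still learned projections

    # 3.1/3.2 as I read them: drop W_V (and then W_K) and just slice X into heads.
    K_sliced = X.reshape(L, nh, H).transpose(1, 0, 2)    # (nh, L, H), i.e. columns h*H:(h+1)*H
    V_sliced = K_sliced                                  # values are slices of X too

    heads = []
    for h in range(nh):
        Q = X @ W_Q[h]                                   # (L, H)
        A = softmax(Q @ K_sliced[h].T / np.sqrt(H))      # (L, L) attention weights
        heads.append(A @ V_sliced[h])                    # (L, H)
    out = np.concatenate(heads, axis=-1)                 # (L, C)
    print(out.shape)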

I think the first two modifications are equivalent to just setting W_V and W_K to the constant identity matrices right? So that makes me think what would happen if you instead restrict W_V (and/or W_K, W_Q) to be block diagonal (though non square) such that each head has in effect an (H, H) matrix which transforms the slice of X it receives. This is different than standard attention right? Because there the W_V acts over the full C dimension. Almost surely someone has thought of this so I will try to find out

Still learning so all this could be wrong


The models tested are extremely small (a few thousand parameters), and the performance is of course not great; I don't think we can extrapolate much from this. I don't understand why they chose such small models when you can train much larger ones for free on Colab or Kaggle if you really need to.


These seem like very tiny models, and as I understand it, LLMs behave fairly differently at different scales.

The speed gain seems to show up only on an M2 chip, and I wonder whether there are already much better non-GPU-optimized attention approaches out there for those use cases.


The first two changes appear theoretically sound, but it's not clear that they would result in an actual performance improvement at scale. Their analysis ignores that a single matrix multiplication is typically used to calculate the Q, K, and V values from the inputs.
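To make that concrete, a tiny numpy illustration of the fused QKV projection (hypothetical shapes; real implementations also fuse across heads and batch):

    import numpy as np

    # In practice Q, K and V are often produced by one fused matmul, so
    # "one matrix multiplication fewer per head" may not translate directly
    # into wall-clock savings.
    L, C = 4, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(L, C))

    W_qkv = rng.normal(size=(C, 3 * C))          # single fused projection
    Q, K, V = np.split(X @ W_qkv, 3, axis=-1)    # one GEMM, then a cheap split
    print(Q.shape, K.shape, V.shape)             # (4, 8) each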

The third change looks like it would break causal masking for autoregressive language models. For masked-token language models and ViTs, perhaps it's an improvement, though.


Just skimmed so far and didn't see any reference to the Simplified Transformer block of https://arxiv.org/abs/2311.01906 (and it seems they also left out grouped-query attention, as pointed out by another comment).

While lazy me wants them to explain how their approach compares to those methods, their exposition looks pretty clear (quite nice for a preprint!), so I guess I'll just have to actually read the paper to see for myself.

Given how well I've seen Simplified Transformer blocks work in my own playground experiments, I would not at all be surprised if other related tweaks work out well even on larger scale models. I wish some of the other commenters here had a bit more curiosity and/or empathy for these two authors who did a fine job coming up with and initially testing out some worthwhile ideas.


Another piece of solid work in this space is DeepSeek-V2. They proposed MLA, which outperforms standard attention a little but reduces the KV cache by over an order of magnitude. Not sure if these improvements could be combined.


I'm surprised they don't mention grouped query attention at all.


> more efficient than standard attention whenever the model dimension is greater than or equal to the context length

All practical models have a context length significantly larger than the model dimension.
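Back-of-the-envelope, with made-up numbers (not from the paper), assuming the new (L, L) alignment matrix replaces a (C, C) projection as described elsewhere in this thread:

    # Rough parameter-count comparison (illustrative numbers only)
    C = 4096                                   # hypothetical model dimension
    for L in (2048, 4096, 32768, 131072):      # hypothetical context lengths
        print(f"L={L}: (L, L) matrix has {L * L} params vs (C, C) with {C * C}")
    # Once L exceeds C, the (L, L) matrix dominates, which is the concern above.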


Can anyone point me to models that look like they might actually be useful in moving towards AGI? I feel like I have a basic understanding of the transformer architecture, and multiplying X tokens in a sliding window across a set of static matrices to produce 1 new token does not look like a path to AGI.

Yes, the complex feature extraction is impressive. But are there any models that, I don't know, are more dynamic? Have a working memory? Have a less limited execution path?


The answer is simple: AI systems aren’t just one technique, agent, or even agency — any somewhat anthropomorphic ones will be ensemblematic on an extensive and fundamentally-recursive level. LLMs are a groundbreaking technique that solve the “Frame Problem” by emulating human subconscious generative networks.

To paraphrase an old comment on here: the problem isn’t a chatbot gaining sapience inside a browser window, the problem is when billions of dollars are allocated to a self-administering ensemble of 10,000 GPT agents, each specialized for some task (aka functions). That, plus Wikipedia, Cyc, WolframAlpha, YouTube, and Google Books at its fingertips.

“General” doesn’t even begin to cover what we’re already capable of, IMO.

See: Marvin Minsky, 1991; https://ojs.aaai.org/aimagazine/index.php/aimagazine/article...


I look at all the advice on prompts and I don't feel like the "Frame Problem" has been solved. It feels like it has shifted into the "Frame Invocation Problem".

And it is this very problem which led me to ask my question about different architectures.


Well said. I’d say the new “problem” is using the word differently though, namely to denote “optimization environment” rather than the original’s sense of “unsolved paradox”


Current models can already be argued to have something like working memory, by storing information in little-used parts of the tokens. If placeholder tokens are handed to them that they can use as working memory, performance improves.

https://openreview.net/forum?id=2dnO3LLiJ1

https://news.ycombinator.com/item?id=40329675
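A toy illustration of the idea (the tokens and counts here are made up; the linked paper's actual setup differs):

    # Give the model extra "scratch" positions whose hidden states it can
    # use as working memory, on top of the actual prompt.
    prompt = ["What", "is", "17", "*", "23", "?"]
    scratch = ["<pause>"] * 8                 # hypothetical placeholder tokens
    model_input = prompt + scratch
    print(model_input)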


Look at JEPA and Modulo-LLM.

Also, AGI is a poor term to use because we as humans have no clear notion of what general intelligence is. Does GI have morals and ethics? Does it make decisions like we do, based on executive functioning, or does it work more like ants do?


The answer might just be scale for all we know.


There are sporadic attempts at making things more dynamic, like the Neural Turing Machine. It doesn’t seem to buy much actual power.


xLSTM has a working memory and seems to outperform transformer architectures: https://arxiv.org/abs/2405.04517


Thanks for that. It looks like the kind of thing I'm looking for. I'll give it a read.


Pressing [X] to doubt.

There are many alternatives to the good old transformers: RWKV, Mamba, etc.

Yet here we are, still using transformers (actually, just the decoder part). Is it because the industry has too much inertia to pick up new methods? I doubt it because there's $BILLIONS in this market and everyone wants a piece of the AI cake, so it doesn't make sense to ignore promising methods.

Why, then, do we barely see any non-transformer production-ready LLMs these days?


I believe the attention mechanism we use now was introduced in 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in their paper titled "Neural Machine Translation by Jointly Learning to Align and Translate."

2014. It took almost a decade for the potential of this technique to be realized and come to the attention (heh) of most developers. I don't know what researchers are doing with Mamba and RWKV, but we should let them cook.


It’s going to take time. I can’t speak to the actual quality of mamba other than to say the authors are extraordinary and should be taken seriously.

But training a large model requires a huge amount of capital, so the biggest runs are designed around risk minimization. And remember, many of the decision makers behind these runs got into their positions by doing transformer-centric work. The true value of mamba is still unclear to me, given that very-long-context techniques are proving effective for transformers.


To be frank, the long-context techniques that you're describing are still extremely limited in the context lengths they can handle, on the order of only a million tokens, and extremely expensive to apply.


And yet quadratic attention is still unavoidable. Anything that lets you have sub-quadratic attention is going to have an accuracy vs. performance trade-off.


No, you can make multiple passes across the data set. There's no reason that you have to use quadratic attention. I point to the shape of our own brains, which are limited in size. Yet, through multiple passes over information and random access, which we guide, we're able to process very large systems.


> Why, then, do we barely see any non-transformer production-ready LLMs these days?

Because having a 5% better non-transformer model doesn't help you if as a result you can't use the 10% improvements people publish that only apply to transformers. Very quickly you'll be 5% worse than those who stuck with transformers, and have wasted a ton of time and money.


> I doubt it because there's $BILLIONS in this market and everyone wants a piece of the AI cake, so it doesn't make sense to ignore promising methods.

I also doubt this result. The "why have $BILLIONS not already invested" question is interesting in its own right though. Generally, the literature on the theoretical bounds of swarm optimization is pertinent. Those $BILLIONS aren't being invested by a single omniscient entity, so they're subject to interesting constraints.

As one of many explanations, fragmentation is common. If $BILLIONS are split between greedy, mostly non-interacting entities (e.g., competing companies each trying to replace the transformer in a bounded number of hours and dollars while securing their market dominance), you expect, probabilistically, for each of them to converge on the same strateg(y/ies), especially if the "best" alternatives are obvious or globally known for some reason (e.g., some solutions intuitively feel "natural" or your researchers publish early results or you have employee movement between companies or whatever). Riskier strategies won't be touched, and you'll have $BILLIONS spent duplicating the same most likely alternatives when $MILLIONS would have sufficed.

The normal counterpoint is that a few big players dominate the spending, and they would have higher internal coordination. Interestingly though, they don't usually, except when that coordination would tend to enforce the same strategies smaller competition are pursuing. How often do you hear about stories like the misaligned Google+ integrations resulting in employee bonuses for poor customer experiences vs a forward-thinking executive actively devoting funds to a meaningful number of competing solutions? Approximately never. It's career suicide if you fail and depend on other people for your position, you _are_ actually more likely to outdo the competition with your increased resources if you just lean into the "best" alternatives, and for a whole host of reasons very few executives (except for people with real power) will coordinate a more comprehensive strategy, certainly not one orthogonal to the competition's just for the sake of allocating the global $BILLIONS more efficiently.

Separately (going back to the merits of the preprint), I'll probably read the full thing later, but a few points stuck out as suspicious on an initial skim. Notably, they seem to mix linear transformations in different domains. E.g., `xa` is linear in both `x` and `a`, and `vx` is linear in both `v` and `x`, but `xax` is _not_ linear in `x`, even if you try to "prove" that idea with `v = xa`. Linearity in `v` isn't enough to make the composition linear in `x`. A lot of their results seem to rely on eliminating those "redundant" computations, even though the things they're replacing with linear computations are actually higher order polynomials. On an initial skim, the other "novel" ideas also don't seem well grounded.
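A quick numerical check of the linearity point, with toy matrices:

    import numpy as np

    # f(X) = X A X is linear in V = X A for fixed A, but not linear in X:
    # linearity would need f(X + Y) == f(X) + f(Y), which fails because of
    # the cross terms X A Y + Y A X.
    rng = np.random.default_rng(0)
    n = 3
    A = rng.normal(size=(n, n))
    X = rng.normal(size=(n, n))
    Y = rng.normal(size=(n, n))

    f = lambda Z: Z @ A @ Z
    print(np.allclose(f(X + Y), f(X) + f(Y)))   # False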

Their experimental results are decent. That could mean a lot of things (normally that the authors made more errors in their competitors' implementations than in their own work), but it's probably worth looking into for a few hours despite my other complaints.


It's too late to edit. My initial skepticism was unfounded.

Separately, the paper largely comprises "super attention" and everything else.

The "everything else" part of the paper might matter, but it's basically just operator fusion and the impacts of doing so on training dynamics, except they left out the impact on performance for different model parameters and didn't study how training dynamics impact the result. It's not a new idea, even on the ML arxiv, I'm glad they got good results, and it needs more study before being sold so strongly.

The "super attention" part of the paper is interesting. It basically ups the matrix polynomial rank of an attention layer by 1 and claims good results from the process. That's believable, especially given that the main contribution of attention is good empirical results from upping the previous layer matrix polynomial rank by a bit. You'd want to dive into the code and check that they didn't screw up the masking before taking the results at face value though (information leakage can make even very weak models seem to perform well).


I feel like FlashAttention is the relevant baseline here.


FlashAttention is completely orthogonal to this. This work is about speeding up the computation of Q, K and V vectors while FlashAttention is about speeding up the attention algorithm itself.

You could combine the two.


Then they should have. It isn't clear to me that this change is measurable when the code is optimized. Even in the presented numbers, the difference seems meh at times.


> However, the behemothic sizes of these models have introduced numerous challenges, such as expensive and slow training and inference, leading to secondary problems such as high carbon emission, contributing to global warming

Yes, there really has been an awful lot of hot air about AI.


> we evaluate the presented attention mechanisms on MNIST, CIFAR100, IMDB Movie Reviews, and Amazon Reviews datasets.

It sounds amazing, but I'm not holding my breath this one will scale.


Sometimes it doesn’t need to. You might have a problem that isn’t web scale and where transfer learning is hard. We also need techniques for small datasets even if they are slower to train or are outperformed after 5 billion tokens.


Yep, came here to say this. The big thing about the results here that might not be obvious to someone not in AI is that the models trained in this paper are many orders of magnitude smaller than the LLMs we've all heard so much about recently, and they're also trained on specific tasks instead of general language modeling.

So I'm not expecting this will find its way into a LLaMA near me any time soon, but maybe this is an interesting result for people working in the specific domains represented in the evaluations.


You could provide the quote in full ("In addition to providing rigorous mathematical comparisons,") so that the authors' work in proving their point is not hidden by your effortless snark.


I am not sure how much experience you have in this area of research, but maybe I can shed some light on the background here. The "Attention Is All You Need" paper is now almost 7 years old. Those 7 years have seen a flood of proposals for improving transformers; only very few have been retained.

There is very little theory about transformer-style architectures. Fundamentally, the proof is in the pudding, not in "mathematical comparisons". A proposed change needs to scale better; that is all that matters. And the datasets mentioned are simply unsuitable for showing any scaling. I think the biggest dataset in this list is 160MB compressed.

I am not sure why this article was posted here on hackernews. I would estimate that even just today, there have probably been about three papers posted on arXiv with proposed transformer architecture changes, tested on larger datasets than the ones mentioned here.


I checked, and on the 28th of May, arXiv saw 14 submissions with "transformer" in the title, and I found 3 of them with proposals tested on larger datasets (I did not check all of them; there might have been more than these three).

https://arxiv.org/pdf/2405.18240 https://arxiv.org/abs/2405.17951 https://arxiv.org/pdf/2405.17821


> I am not sure why this article was posted here on hackernews.

New is where progress comes from, so new is interesting. New is why we come here, and the first three letters of News.

> There is very little theory about transformer-style architectures.

Only way to fix that is with new.

> Fundamentally, the proof is in the pudding, not in "mathematical comparisons"

"Can it scale" is something only someone with money can answer. It can be tested, but only if it's known. Now new is better known.


Where is the code for it?



