
I looked through their torch implementation and noticed that they apply RoPE to both the query and key matrices in every layer of the transformer. Is this standard? I thought positional encodings were usually just added once, at the first layer.
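Roughly this pattern, if it helps — this is my own toy single-head paraphrase, not their actual code (names and shapes are mine):

```python
import torch

def rope(x, pos):
    # x: (seq, head_dim); rotate each consecutive pair of dims by a
    # position-dependent angle -- the core RoPE operation
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    theta = pos.float()[:, None] * inv_freq[None, :]      # (seq, d/2)
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attn_layer(x, wq, wk, wv, pos):
    # single-head attention, no masking, just to show where RoPE goes
    q, k, v = x @ wq, x @ wk, x @ wv
    # rotate q and k inside *every* attention layer -- nothing is added
    # to the token embeddings at the input
    q, k = rope(q, pos), rope(k, pos)
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# toy usage
seq_len, d_model = 8, 64
x = torch.randn(seq_len, d_model)
wq, wk, wv = (torch.randn(d_model, d_model) for _ in range(3))
out = attn_layer(x, wq, wk, wv, torch.arange(seq_len))
```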


No, they're usually applied at each attention layer.


Do you know when this was introduced (or which paper)? AFAIK it's not done that way in the original transformer paper, or in BERT/GPT-2.


All the Llamas have done it (well, 2 and 3, and I believe 1; I don't know about 4). I think they have a citation for it, though it might just be the RoPE paper (https://arxiv.org/abs/2104.09864).

I'm not actually aware of any recent model that doesn't apply positional embeddings on a per-layer basis (excepting BERT and the original transformer; I haven't read the GPT-2 paper in a while, so I'm not sure about that one either).


Thanks! I'm not super up to date on all the ML stuff :)


Should be in the RoPE paper. The OG transformer used additive sinusoidal embeddings (added to the token embeddings once, at the input), while RoPE applies a pairwise rotation to the queries and keys in each layer.
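For contrast, here's a toy sketch of the additive sinusoidal table that gets added to the embeddings once at the input (my own sketch, not from any particular codebase):

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    # original-transformer-style table: sin on even dims, cos on odd dims
    pos = torch.arange(seq_len).float()[:, None]                            # (seq, 1)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))
    angles = pos * inv_freq[None, :]                                        # (seq, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = angles.sin()
    pe[:, 1::2] = angles.cos()
    return pe

# added once, before the first layer:
# x = token_embeddings + sinusoidal_pe(seq_len, d_model)
```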

There's also NoPE; I think SmolLM3 "uses NoPE" (i.e., no positional encoding at all) in every fourth layer.
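Something like this per-layer toggle, if I'm reading the SmolLM3 description right (layer count and indexing here are made up for illustration):

```python
n_layers = 12  # made-up count, just for illustration
# every 4th layer is a "NoPE" layer: no rotation applied to q/k there
apply_rope_per_layer = [(i + 1) % 4 != 0 for i in range(n_layers)]
# -> [True, True, True, False, True, True, True, False, ...]
```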


This is normal. RoPE was introduced after BERT/GPT-2.



