
hmm. after my engineering degree put all of the vector math in the form

k = Wx

seeing

k = xW

is jarring. Is there a reason for using horizontal vectors? Common for data science docs?



It’s mostly a convention. In many deep learning frameworks (PyTorch, TensorFlow, etc.), inputs are stored with the “batch × length × hidden-dim” shape, effectively making the token embeddings row vectors. Multiplying “xW” is then the natural shape-wise operation. On the other hand, classical linear algebra references often treat vectors as column vectors and write “Wx.”
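
To make the shape bookkeeping concrete, here's a minimal sketch (PyTorch assumed; the dimension sizes are made up for illustration):

    import torch

    batch, length, hidden, out_dim = 2, 5, 16, 8
    x = torch.randn(batch, length, hidden)   # batch x length x hidden-dim
    W = torch.randn(hidden, out_dim)         # weights stored (in, out)

    k = x @ W                                # "xW": contracts the hidden dim
    assert k.shape == (batch, length, out_dim)

Incidentally, PyTorch's nn.Linear stores its weight as (out_features, in_features) and computes x @ W.T internally, so both conventions coexist even within one framework.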


Isn't batch-first a PyTorch thing? I started with TensorFlow and it's batch-last.


TFv1 or TFv2? AFAIK it's batch-first in TFv2


You are in the right here. Horizontal vectors are common in (some) deep learning docs, but column vectors are the literature standard elsewhere.


It can also be more efficient to compute k = xW (storing the weights transposed) than k = Wx.
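
If you want to sanity-check that on your own machine, here's a rough timing sketch (PyTorch assumed; the shapes are arbitrary, and results depend heavily on hardware and BLAS backend):

    import time
    import torch

    batch, hidden, out_dim = 4096, 1024, 1024
    x = torch.randn(batch, hidden)
    W_row = torch.randn(hidden, out_dim)   # layout for k = xW
    W_col = torch.randn(out_dim, hidden)   # layout for k = Wx (column convention)

    def bench(fn, iters=100):
        fn()                               # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - t0) / iters

    print("xW:", bench(lambda: x @ W_row))
    print("Wx:", bench(lambda: (W_col @ x.T).T))  # batch of column vectors

Mathematically the two layouts are equivalent up to a transpose of W, so any difference is purely about memory access patterns.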



