The model (weights, plus activations and the KV cache) can fill all the memory you have and more, and to a first (very rough) approximation every byte of the weights has to be read for each token generated. You can see how that adds up.
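To make that concrete, here's a back-of-envelope sketch of why decoding is memory-bandwidth bound. The numbers (7B params, fp16, 100 GB/s) are illustrative assumptions, not measurements of any particular system:

```python
# Rough decoding-speed estimate for a memory-bandwidth-bound LLM.
# All numbers below are hypothetical, chosen just for illustration.
model_params = 7e9        # a 7B-parameter model
bytes_per_param = 2       # fp16 weights
weight_bytes = model_params * bytes_per_param  # ~14 GB read per token

mem_bandwidth = 100e9     # 100 GB/s, an assumed hardware figure

# If every weight byte must be read once per token, throughput is
# bounded by bandwidth / bytes-per-token:
tokens_per_sec = mem_bandwidth / weight_bytes
print(f"~{tokens_per_sec:.1f} tokens/sec")  # ~7.1 tokens/sec
```

Quantizing to 4 bits cuts `weight_bytes` by 4x and, in this crude model, multiplies throughput by the same factor.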
I highly recommend Andrej Karpathy's videos if you want to learn details.
A very simplified version is: to compute a matrix × vector product you need to read the entire matrix, even if the vector is mostly zeroes.
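A toy numpy sketch of that point. The dense routine reads every entry of `W` no matter how sparse `x` is, even though (for a one-hot `x`) the result is just one column of `W`:

```python
import numpy as np

# Dense mat-vec: y[i] = sum_j W[i, j] * x[j].
# The routine reads all of W regardless of how many x[j] are zero.
W = np.arange(16.0).reshape(4, 4)   # stand-in for a layer's weight matrix
x = np.array([0.0, 0.0, 1.0, 0.0])  # mostly-zero input vector

y = W @ x  # touches all 16 entries of W

# With a one-hot x, the answer is just column 2 of W,
# yet a dense kernel still streamed the whole matrix from memory:
assert np.allclose(y, W[:, 2])
```

Sparse-aware kernels can skip that work, but the dense case is what the rough every-byte-per-token approximation describes.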
Edit: obviously my simplification is wrong in the details, but once you factor in compression etc. you get the idea.