How does LoRA save more than 50% of the memory usage? I see that the weight updates have a much lower memory footprint by virtue of being low rank, but you still need the dense weights for the forward pass, don't you?
I'm not an expert, but I believe it only saves memory in the final model, after training is done, by merging the low-rank LoRA adapter matrices with the original weight matrices.
For example, if an original layer has N inputs and N outputs (an NxN weight matrix), LoRA adds a small trainable bypass alongside it: a 16xN matrix that acts on the input, followed by an Nx16 matrix. Only those two small matrices are trained, and afterwards their product (an NxN update) is added back into the original frozen weights, so the final model is once again a single NxN matrix.
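To make the shapes concrete, here is a rough NumPy sketch of that merge step. It is just the arithmetic, not an actual LoRA implementation: rank 16, N = 1024 picked arbitrarily, and the usual alpha/r scaling factor left out for brevity.

    import numpy as np

    N, r = 1024, 16              # layer width and LoRA rank (illustrative values)

    W = np.random.randn(N, N)    # original dense weight, frozen during training
    A = np.random.randn(r, N)    # trainable down-projection (r x N)
    B = np.zeros((N, r))         # trainable up-projection (N x r), typically zero-initialized

    # During training the layer computes W @ x + B @ (A @ x); only A and B get gradients.
    # After training, the low-rank update is folded back into the dense weight:
    W_merged = W + B @ A         # still N x N, so inference costs the same as the original layer

    print(W_merged.shape)        # (1024, 1024)

So the merged checkpoint is no smaller than the original; what you save is storing and shipping only the tiny A and B matrices per fine-tune (and, during training, their gradients and optimizer states) instead of a full copy of W.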