RAM is the reason LLMs are so power-inefficient. Shuttling weights and partial results between RAM and the compute units for every operation is where most of the power goes.
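To see the scale of the problem, here's a back-of-envelope sketch in Python. The per-operation energies are rough assumptions in the spirit of well-known published estimates (e.g., Horowitz, ISSCC 2014), not measurements of any particular chip, and the model size and quantization are illustrative:

    # Energy to generate one token for an assumed 7B-parameter model at
    # 8-bit weights, streaming every weight from DRAM once per token.
    # All constants below are rough assumptions, not measured values.
    PARAMS = 7e9            # assumed model size (parameters)
    BYTES_PER_PARAM = 1     # assumed 8-bit quantized weights

    DRAM_PJ_PER_BYTE = 160  # assumed ~20 pJ/bit for off-chip DRAM access
    MAC_PJ = 1.0            # assumed ~1 pJ per 8-bit multiply-accumulate

    dram_j = PARAMS * BYTES_PER_PARAM * DRAM_PJ_PER_BYTE * 1e-12
    mac_j = PARAMS * MAC_PJ * 1e-12   # roughly one MAC per weight per token

    print(f"DRAM traffic per token: {dram_j:.3f} J")        # ~1.12 J
    print(f"Arithmetic per token:   {mac_j:.3f} J")         # ~0.007 J
    print(f"Memory/compute ratio:   {dram_j / mac_j:.0f}x") # ~160x

Under these assumptions, moving the weights burns two orders of magnitude more energy than the arithmetic itself. That gap is exactly what keeping the weights resident in the fabric attacks.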
It doesn't have to be that way. For a sufficiently large load, it makes sense to use reconfigurable hardware and bake in the constants and the dataflow at runtime.
Think of it like an array of FPGAs large enough to hold the whole model unrolled, but reconfigurable in seconds at runtime. Fully pipelined, you'd get tokens out at the clock rate: 100 MHz or more.
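As a software analogy for baking in the constants, here's a toy partial-evaluation sketch in Python. The weights become literals in generated code, so the specialized function never fetches them from a weight array, the same way an FPGA bitstream would hard-wire them into LUTs and DSP blocks. The function name and scale are purely illustrative:

    # Toy sketch of runtime specialization: fold the weights into the
    # generated source as literals, so the compiled function carries no
    # weight array at all -- the software cousin of configuring an FPGA
    # with the model's constants baked into the fabric.
    def specialize_dot(weights):
        terms = " + ".join(f"x[{i}] * {w!r}" for i, w in enumerate(weights))
        src = f"def dot(x):\n    return {terms}\n"
        ns = {}
        exec(src, ns)   # "reconfigure" for this weight vector
        return ns["dot"]

    dot = specialize_dot([0.5, -1.25, 2.0])
    print(dot([1.0, 2.0, 3.0]))   # 0.5*1 + -1.25*2 + 2.0*3 = 4.0

The hardware version goes further: with the whole model laid out spatially and pipelined, a new token can emerge every clock cycle, which is where the 100 MHz token rate comes from.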
You would think saving 95% or more on power and infrastructure for a given token rate would be worth it, especially when contemplating trillion-dollar outlays.
Many things don't have to be the way they are. But as long as big tech can subsidize its costs by offloading environmental damage onto the commons without regulation, it will only pay lip service to efficiency. Money is a far more powerful motivator for the unscrupulous than protecting the long-term health of the commons.
Not if spending a little extra money and sticking with the inefficiency makes money in other ways, such as shipping the product faster or letting their engineers focus on the tech stack. Saving a little on electricity might cost them development velocity, so they're likely to burn the cheap electricity and offload the cost onto the environment.