Our first prototype ran an 80B model at its full 256k context window at 40 tokens/s while using only 14 GB of RAM.
We are currently leveraging this tech to build https://cortex.build, a terminal AI coding assistant.