It's weird that not once do they mention or compare their results to the already-available quantization methods. I normally try to give benefit of the doubt, but there's really no way they're not aware that there are already widely used techniques for accomplishing this same thing, so the comparison benchmarks really should be there.
To fill in the gap, here's llama.cpp's comparison chart[0] for the different quantizations available for Llama 1. We can't compare directly with their Llama 2 metrics, but just comparing the percent change in speed and perplexity, MK-1 looks very similar to Q5_1. There's a small but not insignificant hit to perplexity, and a just over 2x speedup.
If these numbers are accurate, you can download pre-quantized Llama 2 models from Hugging Face that will perform essentially the same as what MK-1 is offering, with the Q5 files here: https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
Attempting to address some of the comments in a single message.
To help explain why we decided not to compare against existing methods: I think it would be difficult to do so fairly, since there are many tradeoffs and different use cases. It's not that one technique is bad and the other is good; it's more about the targeted design point (say, cloud vs local). We are openly offering our numbers and benchmarks and looking for early partners that are aligned with our current value proposition (hence the closed beta).
A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp’s targeted use case (afaik). For example MKML achieves ~2700 tok/sec at batch 32 (i.e. 32 prompts in parallel) on a 4090 for a Llama-2 7B, with a ~5.2GB memory footprint, and perplexity that is ~fp16.
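(For anyone who wants a rough point of reference for that kind of batched measurement, a plain fp16 Hugging Face run can be sketched along these lines; the model id, prompts, and token counts below are placeholders rather than our actual harness.)

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(model_id)
    tok.pad_token = tok.eos_token           # Llama has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    prompts = ["Hello"] * 32                # batch of 32 identical short prompts,
                                            # so padding side doesn't matter here
    inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)

    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # rough count: ignores sequences that stop early on EOS
    new_tokens = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
    print(f"~{new_tokens / elapsed:.0f} tok/sec at batch 32")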
Also, we're not currently wrapping any open source tools or techniques for quantization. Everything is our own and there’s more news to come soon.
If anyone has specific technical questions I'd be happy to answer as best I can.
> A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp’s targeted use case (afaik).
Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch. If that's the case, then comparing MKML to llama.cpp is apples to oranges—the correct comparison would be to GGML and the various quantization methods. The inference engine and its intended use cases aren't what's in question here.
If a model compressed with MKML outperforms a standard quantized model in a batch setting, that's useful information! It would not be at all unfair for you to cite that as a strength, and it would increase your credibility because you wouldn't seem to be dodging the question of how you compare to your substitutes.
We compared MKML mk600 (5.2GB) against llama.cpp Q5_1 (4.7GB) and Q6_K (5.1GB) on a 4090 for llama-7B. The test is the same in all cases: we generate 128 tokens from a single-token prompt (batch=1) and measure performance of the forward pass during auto-regression.
Please feel free to post your llama.cpp results if they are different.
>Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch.
MKML is not a compression tool that feeds into another framework. It is an inference runtime (like FasterTransformer or vLLM), except that MKML is also plug and play with existing frameworks like Hugging Face.
OK, so this is a case of bad measurement and comparison.
If you bothered to look at the llama.cpp output, you would see this line:
llama_model_load_internal: offloaded 32/35 layers to GPU
The "-ngl 32" means that only 32 out of 35 layers are being run on the GPU, and this results in a huge slow down as the GPU syncs with the CPU, and then computes the last 3 layers on the CPU.
On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
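If anyone wants to sanity-check this on their own card, a quick llama-cpp-python sketch that pushes every layer onto the GPU looks roughly like this (the model path is a placeholder, and you need a GPU-enabled build, e.g. cuBLAS or ROCm):

    import time
    from llama_cpp import Llama

    # any n_gpu_layers >= the model's layer count offloads everything;
    # check the load log for "offloaded 35/35 layers to GPU"
    llm = Llama(model_path="llama-2-7b.ggmlv3.q5_1.bin", n_gpu_layers=100)

    start = time.time()
    out = llm("Hello", max_tokens=128)
    elapsed = time.time() - start

    n = out["usage"]["completion_tokens"]
    print(f"~{n / elapsed:.0f} tok/sec")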
>The "-ngl 32" means that only 32 out of 35 layers are being run on the GPU, and this results in a huge slow down as the GPU syncs with the CPU, and then computes the last 3 layers on the CPU.
Thanks for the updated run configuration. It was a misunderstanding on our part about what llama.cpp counts as “layers”, since layers are traditionally understood as the learned-parameter decoder layers (as they are in Hugging Face models), and in this case the Llama 7B model has 32 of those.
>On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
On my 4090 I now get 128 t/s for Q5_1, and 116 t/s for Q6_K. So these are in the same ballpark as mk600's 125 t/s at batch=1. It's not surprising that different inference runtimes approach the same speed as they become memory bound at similar model sizes.
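Back-of-the-envelope: at batch=1 every generated token has to stream the full set of weights from VRAM, so the ceiling is roughly memory bandwidth divided by model size. On a 4090 (~1 TB/s) that puts all of these in the same neighborhood:

    # rough memory-bound ceiling for single-batch decoding on a 4090
    bandwidth_gb_s = 1008          # RTX 4090 spec-sheet bandwidth
    for name, size_gb in [("Q5_1", 4.7), ("Q6_K", 5.1), ("mk600", 5.2)]:
        ceiling = bandwidth_gb_s / size_gb
        print(f"{name}: <= ~{ceiling:.0f} tok/sec")
    # measured 116-128 tok/sec sits under these ceilings, with the gap
    # going to kernel launch, attention, and dequantization overheads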
Such sloppy errors in measurement and comparison (from people who are supposedly experts?), and caginess about answering technical questions, remind me of the era of cryptocurrency scams...
> [...] llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1)
> [...] I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1)
Ok so you agree that llama.cpp etc are great for batch==1, right?
And I agree their targeted use case is not batch==32 (because who is doing that really?)
But if we extended llama.cpp or some other faster batch==1 implementation to support batch==32, why do you suppose it wouldn't still be faster than MKML? It seems to me that if you can do batch==1 faster, you could easily do batch>>1 faster too -- it is just that no one really needed that (yet?)
> If anyone has specific technical questions I'd be happy to answer as best I can.
What is the context size for these measurements? Is it the full 4k for Llama-2? And just to be clear, when you say memory footprint, this is the entire memory footprint, right? Weights, 4k KV cache, etc.?
And more generally, I'm curious about the use case for running puny models like Llama-2 7B in the cloud on desktop GPUs (like a 4090) with batch==32?
Also, using the word "codecs" kind of leaves a bad taste in my mouth. It's like they're trying to sound like they invented an entirely new paradigm, with their own fancy name that reminds people of video compression.
For sure, but I couldn't find numbers for the K quants that included inference speeds, so I settled on the older one. If MK-1 were trying to be honest they'd definitely want to benchmark against the newest methods!
MKML says they reduced the size of the Llama2-13B model from 26GB to 10.5GB. A similar offering from TheBloke (your first link) is a 10.7GB Q6_K model. Maybe they are using GGML and llama.cpp and packaging it in an attractive way while making people believe it is some proprietary tech.
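The sizes work out to roughly the same bits per weight either way:

    params = 13e9                              # Llama-2-13B, roughly
    print(f"fp16: {params * 2 / 1e9:.0f} GB")  # ~26 GB, matches their "before" number
    for name, size_gb in [("MKML", 10.5), ("Q6_K", 10.7)]:
        bpw = size_gb * 1e9 * 8 / params
        print(f"{name}: ~{bpw:.1f} bits/weight")   # both come out around 6.5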
Based on the integration examples, I don't think they are simply repackaging llama.cpp.
Rather, it looks like they are reimplementing their own quantization scheme, in such a way that it is a little easier to integrate for basic Python users, at the cost of performance (compared to llama.cpp and others).
Given that the bar for integrating something with higher perf like llama.cpp isn't very high (and that's the way the world is heading -- ask any 15-year-old interested in this stuff), I can't see anything of value here.
I've worked on ML model quantization. The open source 4-bit or 8-bit quantization isn't as good as one can get - there are much fancier techniques to keep predictive performance while squeezing size.
Some techniques (like quantization-aware training) involve changes to training.
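For a sense of what the simple end of the open source spectrum looks like, plain round-to-nearest absmax int8 quantization is just a few lines (toy numpy sketch with a single per-tensor scale; real schemes at least use per-channel or per-group scales):

    import numpy as np

    def quantize_int8(w):
        # single scale for the whole tensor; fancier methods pick scales
        # per channel/group, clip outliers, or use calibration data
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())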
I'm sure there are better methods! But in this case, MKML's numbers just don't look impressive when placed alongside the prominent quantization techniques already in use. According to this chart [0] it's most similar in size to a Q6_K quantization, and if anything has slightly worse perplexity.
If their technique were better, I imagine that the company would acknowledge the existence of the open source techniques and show them in their comparisons, instead of pretending the only other option is the raw fp16 model.
From what I remember, non-power-of-2 compression schemes tank inference speed (assuming Q6_K is 6-bit; I haven't actually verified whether ggml Q6_K llama is slow). Meanwhile, the site claims a speed-up.
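(The intuition is that two 4-bit values split a byte cleanly, while 6-bit values straddle byte boundaries and need alignment-dependent shifts to unpack; a toy illustration, not the actual Q6_K layout:)

    packed = 0b10110100                      # one byte holding two 4-bit values
    lo, hi = packed & 0x0F, packed >> 4      # unpack: one mask, one shift

    b0, b1 = 0b11010110, 0b00001011          # a 6-bit stream packed into bytes
    v0 = b0 & 0x3F                           # value 0: bits 0-5 of b0
    v1 = ((b0 >> 6) | (b1 << 2)) & 0x3F      # value 1: spans b0 and b1
    print(lo, hi, v0, v1)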
But I do actually agree with you - they should really be benchmarking against popular competitors. In my experience, fancier quantization is a _lot_ of work for fairly little gain (at least for neural nets). I also think that ML techniques such as quantization (or fancy param sweeps, feature pruning, that kind of stuff) tend to either get in-housed (i.e. the model will come quantized from the source) or get open-sourced.
In-housing of ML techniques tends to happen more often if there's a money-making model where the hardware running the model costs money, but running the model brings in money.
Not familiar with Unum. From a quick glance, it seems that they truncate the least significant bits, which is the simplest but also the fastest quantization method.
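(If that read is right, it amounts to masking off low mantissa bits, e.g. for float32 below; Unum's actual format may well differ:)

    import numpy as np

    def truncate_lsbs(x, keep_mantissa_bits=10):
        # float32 has 23 mantissa bits; zero out the lowest ones
        drop = 23 - keep_mantissa_bits
        bits = x.astype(np.float32).view(np.uint32)
        mask = np.uint32((0xFFFFFFFF >> drop) << drop)
        return (bits & mask).view(np.float32)

    x = np.random.randn(8).astype(np.float32)
    print(np.abs(x - truncate_lsbs(x)).max())   # small, resolution-limited error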
Exactly what I was thinking. Everyone already does this. Unless they’re doing something else, they’ll have to show why it’s better than just quickly quantizing to 8 bits or 4 bits or whatever.
Whatever it is, it will likely be copied into open source tooling soonish, or something similar will arrive in llama.cpp on its own. It doesn't seem like a defensible advantage. It seems like a feature, and one fighting against fast-moving open source alternatives.
I seriously doubt this will go anywhere. The open source community has already achieved basically the same performance improvements via quantization. This feels like someone has repackaged those libraries and is going to try to sell them to unwary and uninformed AI startups.
How does this compare to mlc-llm with 4-bit quantization? It runs Llama 2 13B incredibly fast on my 4090, at multiples of the speed of llama.cpp even on GPU with the same 4-bit quantization.
Yeah, that TVM Vulkan autotuning is incredible. And it's not even using the matmul Vulkan extension, I don't think.
MLC's 4-bit quantization is "dumb" compared to llama.cpp's, which hurts perplexity (and also explains some of the speed difference), but the biggest missing feature is CPU offloading (which would allow you to run 70B reasonably well on a 4090).
I think the holy grail of local LLM inference is Llama 70B, run in TVM, split between the GPU and iGPU. It feels like we are inches away... All the pieces are there, but there are no front-end devs connecting those dots.
Interestingly, mlc's web-llm runs much better on my iGPU than dGPU when the model size goes over available dGPU vram. Llama 2 7B runs faster on dGPU, but when I switch to Llama 2 13B suddenly my iGPU outperforms. I think because the iGPU effectively utilizes shared memory?
MLC has no CPU offloading. I assume the driver itself is spilling over into system RAM, and this is essentially a silent failure mode because that spill path is extremely slow.
You can do this stuff on a MacBook Pro these days... not sure why you'd want to be locked into another vendor here. Either use the best (OpenAI, Anthropic) or just roll your own.
Is this the true effect of Ultra Instinct^H^H Llama2?
Facebook is effectively supercharging the ecosystem of tool builders and smaller inference services.
This company had access to a credible, popular model (with an actual OSS license), and the relevant weights so they could optimize on it and sell the optimization without worrying about the license/restrictions on the weights themselves.
> Today, we’re announcing our first product, MKML. MKML is a software package that can reduce LLM inference costs on GPUs by 2x with just a few lines of Python code. And it is plug and play with popular ecosystems like Hugging Face and PyTorch
No judgement taken... I try to go through HN article headers as quickly as possible. If there's an idiot that thinks "MK-1" is an appropriate title, I'd prefer they don't bother, to be honest. Missing that, I go through some comments to find out what it is about. If I have to waste minutes to find out what it is about, then I'll go and comment a summary.
[0] https://github.com/ggerganov/llama.cpp#quantization