It's weird that not once do they mention or compare their results to the already-available quantization methods. I normally try to give benefit of the doubt, but there's really no way they're not aware that there are already widely used techniques for accomplishing this same thing, so the comparison benchmarks really should be there.
To fill in the gap, here's llama.cpp's comparison chart[0] for the different quantizations available for Llama 1. We can't compare directly with their Llama 2 metrics, but just comparing the percent change in speed and perplexity, MK-1 looks very similar to Q5_1. There's a small but not insignificant hit to perplexity, and a just over 2x speedup.
If these numbers are accurate, you can download pre-quantized Llama 2 models from Hugging Face that will perform essentially the same as what MK-1 is offering, with the Q5 files here: https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
Attempting to address some of the comments in a single message.
To help explain why we decided not to compare against existing methods: I think it would be difficult to do so fairly, since there are many tradeoffs and different use cases. It's not that one technique is bad and the other is good; it's more about the targeted design point (say, cloud vs local). We are openly offering our numbers and benchmarks and looking for early partners that are aligned with our current value proposition (hence the closed beta).
A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp’s targeted use case (afaik). For example MKML achieves ~2700 tok/sec at batch 32 (i.e. 32 prompts in parallel) on a 4090 for a Llama-2 7B, with a ~5.2GB memory footprint, and perplexity that is ~fp16.
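(For anyone who wants a rough point of reference for that kind of batched measurement, a plain fp16 Hugging Face run can be sketched along these lines; the model id, prompts, and token counts below are placeholders rather than our actual harness.)

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(model_id)
    tok.pad_token = tok.eos_token           # Llama has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    prompts = ["Hello"] * 32                # batch of 32 identical short prompts,
                                            # so padding side doesn't matter here
    inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)

    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # rough count: ignores sequences that stop early on EOS
    new_tokens = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
    print(f"~{new_tokens / elapsed:.0f} tok/sec at batch 32")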
Also, we're not currently wrapping any open source tools or techniques for quantization. Everything is our own and there’s more news to come soon.
If anyone has specific technical questions I'd be happy to answer as best I can.
> A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp’s targeted use case (afaik).
Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch. If that's the case, then comparing MKML to llama.cpp is apples to oranges—the correct comparison would be to GGML and the various quantization methods. The inference engine and its intended use cases aren't what's in question here.
If a model compressed with MKML outperforms a standard quantized model in a batch setting, that's useful information! It would not be at all unfair for you to cite that as a strength, and it would increase your credibility because you wouldn't seem to be dodging the question of how you compare to your substitutes.
We compared MKML mk600 (5.2GB) against llama.cpp Q5_1 (4.7GB) and Q6_K (5.1GB) on a 4090 for llama-7B. The test is the same in all cases: we generate 128 tokens from a single-token prompt (batch=1) and measure performance of the forward pass during auto-regression.
Please feel free to post your llama.cpp results if they are different.
>Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch.
MKML is not a compression tool that feeds into another framework. It is an inference runtime (like FasterTransformer or vLLM), except that MKML is also plug and play with existing frameworks like Hugging Face.
OK, so this is a case of bad measurement and comparison.
If you bothered to look at the llama.cpp output, you would see this line:
llama_model_load_internal: offloaded 32/35 layers to GPU
The "-ngl 32" means that only 32 out of 35 layers are being run on the GPU, and this results in a huge slow down as the GPU syncs with the CPU, and then computes the last 3 layers on the CPU.
On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
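If anyone wants to sanity-check this on their own card, a quick llama-cpp-python sketch that pushes every layer onto the GPU looks roughly like this (the model path is a placeholder, and you need a GPU-enabled build, e.g. cuBLAS or ROCm):

    import time
    from llama_cpp import Llama

    # any n_gpu_layers >= the model's layer count offloads everything;
    # check the load log for "offloaded 35/35 layers to GPU"
    llm = Llama(model_path="llama-2-7b.ggmlv3.q5_1.bin", n_gpu_layers=100)

    start = time.time()
    out = llm("Hello", max_tokens=128)
    elapsed = time.time() - start

    n = out["usage"]["completion_tokens"]
    print(f"~{n / elapsed:.0f} tok/sec")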
>The "-ngl 32" means that only 32 out of 35 layers are being run on the GPU, and this results in a huge slow down as the GPU syncs with the CPU, and then computes the last 3 layers on the CPU.
Thanks for the updated run configuration. It was a misunderstanding on our part about what llama.cpp counts as “layers”, since layers are traditionally understood as the learned-parameter decoder layers (as they are in Hugging Face models), and in this case the Llama 7B model has 32 of those.
>On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
On my 4090 I now get 128 t/s for Q5_1, and 116 t/s for Q6_K. So these are in the same ballpark as mk600's 125 t/s at batch=1. It's not surprising that different inference runtimes approach the same speed as they become memory bound at similar model sizes.
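Back-of-the-envelope: at batch=1 every generated token has to stream the full set of weights from VRAM, so the ceiling is roughly memory bandwidth divided by model size. On a 4090 (~1 TB/s) that puts all of these in the same neighborhood:

    # rough memory-bound ceiling for single-batch decoding on a 4090
    bandwidth_gb_s = 1008          # RTX 4090 spec-sheet bandwidth
    for name, size_gb in [("Q5_1", 4.7), ("Q6_K", 5.1), ("mk600", 5.2)]:
        ceiling = bandwidth_gb_s / size_gb
        print(f"{name}: <= ~{ceiling:.0f} tok/sec")
    # measured 116-128 tok/sec sits under these ceilings, with the gap
    # going to kernel launch, attention, and dequantization overheads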
Such sloppy errors in measurement and comparison (from people who are supposedly experts?), and caginess about answering technical questions, remind me of the era of cryptocurrency scams...
> [...] llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1)
> [...] I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1)
Ok so you agree that llama.cpp etc are great for batch==1, right?
And I agree their targeted use case is not batch==32 (because who is doing that really?)
But if we extended llama.cpp or some other faster batch==1 implementation to support batch==32, why do you suppose it wouldn't still be faster than MKML? It seems to me that if you can do batch==1 faster, you could easily do batch>>1 faster too -- it is just that no one really needed that (yet?)
> If anyone has specific technical questions I'd be happy to answer as best I can.
What is the context size for these measurements? Is it the full 4k for Llama-2? And just to be clear, when you say memory footprint, this is the entire memory footprint, right? Weights, 4k KV cache, etc.?
And more generally, I'm curious about the use case for running puny models like Llama-2 7B in the cloud on desktop GPUs (like a 4090) with batch==32?
Also, using the word "codecs" kind of leaves a bad taste in my mouth. It's like they're trying to sound like they invented an entirely new paradigm, with their own fancy name that reminds people of video compression.
For sure, but I couldn't find numbers for the K quants that included inference speeds, so I settled on the older one. If MK-1 were trying to be honest they'd definitely want to benchmark against the newest methods!
MKML says they reduced the size of the Llama2-13B model from 26GB to 10.5GB. A similar offering from TheBloke (your first link) is a 10.7GB Q6_K model. Maybe they are using GGML and llama.cpp and packaging it in an attractive way while making people believe it is some proprietary tech.
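The sizes work out to roughly the same bits per weight either way:

    params = 13e9                              # Llama-2-13B, roughly
    print(f"fp16: {params * 2 / 1e9:.0f} GB")  # ~26 GB, matches their "before" number
    for name, size_gb in [("MKML", 10.5), ("Q6_K", 10.7)]:
        bpw = size_gb * 1e9 * 8 / params
        print(f"{name}: ~{bpw:.1f} bits/weight")   # both come out around 6.5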
Based on the integration examples, I don't think they are simply repackaging llama.cpp.
Rather, it looks like they are reimplementing their own quantization scheme, in such a way that it is a little easier to integrate for basic Python users, at the cost of performance (compared to llama.cpp and others).
Given that the bar for integrating something with higher perf like llama.cpp isn't very high (and that's the way the world is heading -- ask any 15-year-old interested in this stuff), I can't see anything of value here.
I've worked on ML model quantization. The open source 4-bit or 8-bit quantization isn't as good as one can get - there are much fancier techniques to keep predictive performance while squeezing size.
Some techniques (like quantization-aware training) involve changes to training.
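For a sense of what the simple end of the open source spectrum looks like, plain round-to-nearest absmax int8 quantization is just a few lines (toy numpy sketch with a single per-tensor scale; real schemes at least use per-channel or per-group scales):

    import numpy as np

    def quantize_int8(w):
        # single scale for the whole tensor; fancier methods pick scales
        # per channel/group, clip outliers, or use calibration data
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())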
I'm sure there are better methods! But in this case, MKML's numbers just don't look impressive when placed alongside the prominent quantization techniques already in use. According to this chart [0] it's most similar in size to a Q6_K quantization, and if anything has slightly worse perplexity.
If their technique were better, I imagine that the company would acknowledge the existence of the open source techniques and show them in their comparisons, instead of pretending the only other option is the raw fp16 model.
From what I remember, non-power-of-2 compression schemes tank inference speed (assuming Q6_K is 6-bit; I haven't actually verified whether ggml Q6_K llama is slow). Meanwhile, the site claims a speed-up.
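(The intuition is that two 4-bit values split a byte cleanly, while 6-bit values straddle byte boundaries and need alignment-dependent shifts to unpack; a toy illustration, not the actual Q6_K layout:)

    packed = 0b10110100                      # one byte holding two 4-bit values
    lo, hi = packed & 0x0F, packed >> 4      # unpack: one mask, one shift

    b0, b1 = 0b11010110, 0b00001011          # a 6-bit stream packed into bytes
    v0 = b0 & 0x3F                           # value 0: bits 0-5 of b0
    v1 = ((b0 >> 6) | (b1 << 2)) & 0x3F      # value 1: spans b0 and b1
    print(lo, hi, v0, v1)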
But I do actually agree with you - they should really be benchmarking against popular competitors. In my experience, fancier quantization is a _lot_ of work for fairly little gain (at least for neural nets). I also think that ML techniques such as quantization (or fancy param sweeps, feature pruning, that kind of stuff) tend to either get in-housed (i.e. the model will come quantized from the source) or get open-sourced.
In-housing of ML techniques tends to happen more often if there's a money-making model where the hardware running the model costs money, but running the model brings in money.
Not familiar with Unum. From a quick glance, it seems that they truncate the least significant bits, which is the simplest but also the fastest quantization method.
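(If that read is right, it amounts to masking off low mantissa bits, e.g. for float32 below; Unum's actual format may well differ:)

    import numpy as np

    def truncate_lsbs(x, keep_mantissa_bits=10):
        # float32 has 23 mantissa bits; zero out the lowest ones
        drop = 23 - keep_mantissa_bits
        bits = x.astype(np.float32).view(np.uint32)
        mask = np.uint32((0xFFFFFFFF >> drop) << drop)
        return (bits & mask).view(np.float32)

    x = np.random.randn(8).astype(np.float32)
    print(np.abs(x - truncate_lsbs(x)).max())   # small, resolution-limited error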
Exactly what I was thinking. Everyone already does this. Unless they’re doing something else, they’ll have to show why it’s better than just quickly quantizing to 8 bits or 4 bits or whatever.
Whatever it is, it will likely be copied into open source tooling soonish, or something similar will arrive in llama.cpp on its own. It doesn't seem like a defensible advantage. It seems like a feature, and one fighting against fast-moving open source alternatives.
I seriously doubt this will go anywhere. The open source community has already achieved basically the same performance improvements via quantization. This feels like someone has repackaged those libraries and is going to try to sell them to unwary and uninformed AI startups.
How does this compare to mlc-llm with 4-bit quantization? It runs Llama 2 13B incredibly fast on my 4090, at multiples of the speed of llama.cpp even on GPU with the same 4-bit quantization.
Yeah, that TVM Vulkan autotuning is incredible. And it's not even using the matmul Vulkan extension, I don't think.
MLC's 4-bit quantization is "dumb" compared to llama.cpp's, which hurts perplexity (and also explains some of the speed difference), but the biggest missing feature is CPU offloading (which would allow you to run 70B reasonably well on a 4090).
I think the holy grail of local LLM inference is Llama 70B, run in TVM, split between the GPU and iGPU. It feels like we are inches away... All the pieces are there, but there are no front-end devs connecting those dots.
Interestingly, mlc's web-llm runs much better on my iGPU than dGPU when the model size goes over available dGPU vram. Llama 2 7B runs faster on dGPU, but when I switch to Llama 2 13B suddenly my iGPU outperforms. I think because the iGPU effectively utilizes shared memory?
MLC has no CPU offloading. I assume the driver itself is spilling over into system RAM, and this is essentially a silent failure mode because that spill path is extremely slow.
You can do this stuff on a MacBook Pro these days... not sure why you'd want to be locked into another vendor here. Either use the best (OpenAI, Anthropic) or just roll your own.
Is this the true effect of Ultra Instinct^H^H Llama2?
Facebook is effectively supercharging the ecosystem of tool builders and smaller inference services.
This company had access to a credible, popular model (with an actual OSS license), and the relevant weights so they could optimize on it and sell the optimization without worrying about the license/restrictions on the weights themselves.
> Today, we’re announcing our first product, MKML. MKML is a software package that can reduce LLM inference costs on GPUs by 2x with just a few lines of Python code. And it is plug and play with popular ecosystems like Hugging Face and PyTorch
No judgement taken... I try to go through HN article headers as quickly as possible. If there's an idiot that thinks "MK-1" is an appropriate title, I'd prefer they don't bother, to be honest. Missing that, I go through some comments to find out what it is about. If I have to waste minutes to find out what it is about, then I'll go and comment a summary.
[0] https://github.com/ggerganov/llama.cpp#quantization