
Copying helpers (gdrcopy) are about pumping data in and out of a single card. docker-nvidia and the rest of the stack are enablement for using the cards at all.

GPUDirect is about pumping data from storage devices straight to the cards, especially from high-speed storage systems across networks.

MIG partitions a single card into multiple instances, so many processes or VMs can share one card for smaller tasks.

Nothing I wrote in my previous comment is about inter-card or inter-server communication; it is all about disk-to-GPU, CPU-to-GPU, or RAM-to-GPU communication.
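To make the data-path point concrete, here is a minimal sketch (assuming PyTorch with CUDA and a hypothetical weights.pt file, neither of which is from the comment above) of the ordinary disk -> host RAM -> GPU route. GPUDirect Storage exists to cut out the host-RAM staging hop, and gdrcopy-style helpers speed up the RAM <-> GPU leg.

    import torch

    # Ordinary path: disk -> host RAM -> GPU. GPUDirect Storage is about
    # skipping the host-RAM staging step when reading from fast storage.
    state = torch.load("weights.pt", map_location="cpu")      # disk -> host RAM
    gpu_state = {k: v.to("cuda") for k, v in state.items()}   # host RAM -> GPU

    print(f"moved {len(gpu_state)} tensors onto the card")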

Edit: I know it's not OK to talk about downvoting, and downvote as you like, but I install and enable these cards for researchers. I know what I'm installing and what it does. C'mon now. :D



Mostly, I think, we don’t really understand your argument that Intel couldn’t easily replicate the parts needed only for inference.


Yeah, for example, llama.cpp runs on Intel GPUs via Vulkan or SYCL; the latter backend is actively maintained by Intel developers.

Obviously that is only one piece of software, but it's certainly a useful one if you are running one of the many LLMs it supports.
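As a sketch of what that looks like in practice (assuming the llama-cpp-python bindings built against the SYCL or Vulkan backend, and a hypothetical GGUF model path):

    from llama_cpp import Llama

    # Hypothetical GGUF file; with a SYCL/Vulkan build, the offloaded
    # layers run on the Intel GPU instead of the CPU.
    llm = Llama(
        model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",
        n_gpu_layers=-1,   # offload every layer to the GPU backend
        n_ctx=4096,
    )

    out = llm("Briefly explain what SYCL is.", max_tokens=64)
    print(out["choices"][0]["text"])

The Python side is the same regardless of which backend the library was compiled with; only the build changes.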


I've run inference on Intel Arc and it works just fine, so I am not sure what you're talking about. I certainly didn't need Docker! I've never tried to do anything on AMD yet.

I had the 16GB Arc, and it ran inference at the speed I expected, but with twice as many sequences per batch as my 8GB card, which I think is about what you'd expect.

Once the model is on the card, there's no "disk" anymore: having enough VRAM to hold the model, the tokenizer, and whatever else means disk speed stops mattering, and realistically, when I'm running loads on my 24GB 3090, the CPU is maybe 4% over idle. My bottleneck for running large models is VRAM, not anything else.
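A back-of-envelope sketch of why VRAM is the limit (all the numbers here are illustrative assumptions, not measurements from my cards):

    # Rough VRAM budget for a quantized ~7B model (illustrative numbers only).
    params = 7e9                 # parameter count
    bytes_per_weight = 0.5       # ~4-bit quantization
    weights_gb = params * bytes_per_weight / 1e9

    layers, kv_heads, head_dim = 32, 8, 128      # assumed model shape
    ctx, batch, fp16_bytes = 4096, 1, 2
    kv_gb = 2 * layers * kv_heads * head_dim * ctx * batch * fp16_bytes / 1e9

    print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.2f} GB per sequence")
    # Once the weights are resident, leftover VRAM mostly goes to KV cache,
    # which is why a 16GB card fits roughly twice the batch of an 8GB card.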

If I needed to train (from scratch or whatever) I'd just rent time somewhere, even with a 128GB card locally, because obviously more tensors is better.

And you're getting downvoted because there are literally LM Studio, llama.cpp, and sd-webui running just fine for inference on our non-datacenter, non-NVLink, 1/15th-the-cost GPUs.



