Recently I did some testing of the IPEX-LLM llama.cpp backend on Lunar Lake's (LNL) Xe2: https://www.reddit.com/r/LocalLLaMA/comments/1gheslj/testing...

Based on napkin math (scaling by XMX count and engine clock), the B580 should have 230 FP16 TFLOPS and 456 GB/s theoretical memory bandwidth. At similar efficiency to LNL's Xe2, that should work out to about pp512 ~4700 t/s and tg128 ~77 t/s for a 7B-class model. That would be about 75% of a 3090 for prompt processing and 50% for token generation (and, of course, half the memory). For $250, that's not too bad.
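
To make the scaling explicit, here is a minimal sketch of that napkin math in Python. The B580 theoreticals are the ones above; the LNL Arc 140V theoreticals (~32 FP16 TFLOPS, ~136 GB/s) and the LNL measured pp512/tg128 values are my own illustrative placeholders, so substitute the real measurements from the linked thread:

    # "Similar efficiency" = the measured/theoretical ratio on LNL carries
    # over to the B580. pp (prompt processing) is compute-bound and scales
    # with XMX FP16 throughput; tg (token generation) is bandwidth-bound
    # and scales with memory bandwidth.
    B580_TFLOPS, B580_GBPS = 230, 456   # theoretical, per the napkin math above
    LNL_TFLOPS, LNL_GBPS = 32, 136      # assumed Arc 140V theoreticals

    lnl_pp512 = 650.0   # placeholder measured t/s (7B model); see linked post
    lnl_tg128 = 23.0    # placeholder measured t/s (7B model); see linked post

    est_pp512 = lnl_pp512 * (B580_TFLOPS / LNL_TFLOPS)   # ~4700 t/s
    est_tg128 = lnl_tg128 * (B580_GBPS / LNL_GBPS)       # ~77 t/s
    print(f"B580 estimate: pp512 {est_pp512:.0f} t/s, tg128 {est_tg128:.0f} t/s")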

I do want to note a couple of things from my poking around. The IPEX-LLM team [1] was very responsive and was able to address an issue I had with llama.cpp within days. They are doing weekly update releases, which is great. IPEX stands for Intel Extension for PyTorch [2], and it is a mostly drop-in replacement for PyTorch: "Intel® Extension for PyTorch* extends PyTorch* with up-to-date features and optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device."
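
For reference, here is a minimal sketch of what that drop-in usage looks like (toy model; the import name, the xpu device, and the ipex.optimize call follow the IPEX docs):

    import torch
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device

    # Toy stand-in for a real model; any nn.Module works the same way.
    model = torch.nn.Linear(1024, 1024).eval().to("xpu")
    model = ipex.optimize(model, dtype=torch.float16)  # optional optimization pass

    x = torch.randn(8, 1024, device="xpu")
    with torch.no_grad(), torch.autocast(device_type="xpu", dtype=torch.float16):
        y = model(x)   # runs on the Intel GPU
    print(y.shape)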

All of this depends on the Intel oneAPI Base Toolkit [3], which has easy Linux (and presumably Windows) support. I am normally an AUR guy on my Arch Linux workstation, but those packages are basically broken, and I had much more success installing the oneAPI Base Toolkit directly (without issues) on Arch Linux. Sadly, this is also where problems crop up: some of the code depends on older versions of the oneAPI Base Toolkit that are no longer available (vLLM requires oneAPI Base Toolkit 2024.1, which can no longer be downloaded from the Intel site), or is in dependency hell (GPU Whisper simply will not work, and ipex-llm[xpu] has internal conflicts from the get-go), so it's not all sunshine. On the whole, ROCm with RDNA3 is much more mature (while not always the fastest, most basic things do just work now).
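
As a quick sanity check that oneAPI and IPEX are actually wired up (my own snippet, not from any of these projects), the xpu device should enumerate after sourcing the oneAPI environment script (/opt/intel/oneapi/setvars.sh by default):

    import torch
    import intel_extension_for_pytorch as ipex  # provides torch.xpu.*

    print("xpu available:", torch.xpu.is_available())
    if torch.xpu.is_available():
        print("device count:", torch.xpu.device_count())
        print("device name: ", torch.xpu.get_device_name(0))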

[1] https://github.com/intel-analytics/ipex-llm

[2] https://github.com/intel/intel-extension-for-pytorch

[3] https://www.intel.com/content/www/us/en/developer/tools/onea...


