Do you know how much faster llama.cpp would go on something like an Intel Core i9 (has AVX2 but not AVX512) when it's compiled using `cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx`? Are we talking like 10% faster inference, or 100%?
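The cleanest way to answer that is probably an A/B benchmark of the two builds. A sketch of what that might look like, assuming a llama.cpp checkout, an installed oneAPI toolkit, and a hypothetical model path (flag names are the ones from the question and may differ in newer llama.cpp versions):

```shell
# Baseline: llama.cpp's own AVX2 kernels
mkdir build-native && cd build-native
cmake .. && cmake --build . --config Release -j
./bin/llama-bench -m ../models/model.gguf

# MKL build: source the oneAPI environment first so CMake can find MKL and icx
source /opt/intel/oneapi/setvars.sh
mkdir ../build-mkl && cd ../build-mkl
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp \
         -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release -j
./bin/llama-bench -m ../models/model.gguf
```

One caveat: llama.cpp's BLAS paths historically only engage above a batch-size threshold, so you'd expect any win to show up in prompt processing (pp) rather than single-token generation (tg).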

Right now I'm reasonably certain llamafile is doing about the best job it can on Intel/AMD, supporting SSSE3-only, AVX-only, and AVX2+F16C+FMA microprocessors via runtime dispatch. In fact, there's an open issue on the upstream llama.cpp project where they want to get rid of all the external BLAS dependencies. The llama.cpp authors claim their quantization trick has actually enabled them to outperform libraries like cuBLAS, and I'd assume MKL too, since those can, at best, only operate on f32 and f16: https://github.com/ggerganov/ggml/issues/293

My concern with MKL is also that, judging by the llama.cpp README's brief mention of it, adding support sounds like it'd entail a lot more than just dynamically linking a couple of GEMM functions from libmkl.so/dll/dylib. It sounds like we'd have to go all in on some environment shell script and the Intel compiler. I also remember MKL being a huge pain back on the TensorFlow team, since it's about as proprietary as it gets.



Last year we got a 10x performance improvement on PyTorch Stable Diffusion, although there was more to it than just using MKL.

Not sure how well this carries over to LLMs. But the hardware is much, much faster than people think, even before using the ML accelerators that some new CPUs have; the software support just seems to be lacking.
