Do you know how much faster llama.cpp would go on something like an Intel Core i9 (has AVX2 but not AVX512) when it's compiled using `cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx`? Are we talking like 10% faster inference, or 100%?
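The cleanest way to answer that is probably an A/B benchmark of the two builds. A sketch of what that might look like, assuming a llama.cpp checkout, an installed oneAPI toolkit, and a hypothetical model path (flag names are the ones from the question and may differ in newer llama.cpp versions):

```shell
# Baseline: llama.cpp's own AVX2 kernels
mkdir build-native && cd build-native
cmake .. && cmake --build . --config Release -j
./bin/llama-bench -m ../models/model.gguf

# MKL build: source the oneAPI environment first so CMake can find MKL and icx
source /opt/intel/oneapi/setvars.sh
mkdir ../build-mkl && cd ../build-mkl
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp \
         -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release -j
./bin/llama-bench -m ../models/model.gguf
```

One caveat: llama.cpp's BLAS paths historically only engage above a batch-size threshold, so you'd expect any win to show up in prompt processing (pp) rather than single-token generation (tg).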

Right now I'm reasonably certain llamafile is doing about the best job it can on Intel/AMD, supporting SSSE3-only, AVX-only, and AVX2+F16C+FMA microprocessors via runtime dispatch. In fact, there's an open issue on the upstream llama.cpp project where they want to get rid of all the external BLAS dependencies. The llama.cpp authors claim their quantization trick has actually enabled them to outperform libraries like cuBLAS, and I'd assume MKL too, since those can, at best, only operate on f32 and f16: https://github.com/ggerganov/ggml/issues/293

My concern with MKL is also that, judging by the llama.cpp README's brief mention of it, adding support sounds like it'd entail a lot more than just dynamically linking a couple of GEMM functions from libmkl.so/dll/dylib. It sounds like we'd have to go all in on some environment shell script and the Intel compiler. I also remember MKL being a huge pain back on the TensorFlow team, since it's about as proprietary as it gets.



Last year we got a 10x performance improvement on PyTorch Stable Diffusion, although there was more to it than just using MKL.

Not sure how well this carries over to LLMs. But the hardware is much, much faster than people think, even before using the ML accelerators that some new CPUs have; the software support just seems to be lacking.
