I don’t see any indication this beats Llama3 70B, but still requires a beefy GPU...

azeirah · on July 18, 2024

You don't need a 4090 at all. 16 bit requires about 24GB of VRAM, 8bit quants (99% same performance) requires only 12GB of VRAM.

That's without the context window, so depending on how much context you want to use you'll need some more GB.

That is, assuming you'll be using llama.cpp (which is standard for consumer inference. Ollama is also llama.cpp, as is kobold)

This thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.

You'll still get good performance on an 8GB card with offloading, since you'll be running most of it on the gpu anyway.

reissbaker · on July 19, 2024

Comparing this to 70b doesn't make sense: this is a 12b model, which should easily fit on consumer GPUs. A 70b will have to be quantized to near-braindead to fit on a consumer GPU; 4bit is about as small as you can go without serious degradation, and 70b quantized to 4bit is still ~35GB before accounting for context space. Even a 4090 can't run a 70b.

Supposedly Mistral NeMo better than Llama-3-8b, which is the more apt comparison, although benchmarks usually don't tell the full story; we'll see how it does on the LMSYS Chatbot Arena leaderboards. The other (huge) advantage of Mistral NeMo over Llama-3-8b is the massive context window: 128k (and supposedly 1MM with RoPE scaling, according to their HF repo), vs 8k.

Also, this was trained with 8bit quantization awareness, so it should handle quantization better than the Llama 3 series in general, which will help more people be able to run it locally. You don't need a 4090.