
IMO Google's Gemma2 27B [1] is the sweet spot for running locally on commodity 16GB VRAM cards.

[1] https://ollama.com/library/gemma2:27b
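If you want to try it, a minimal sketch using the ollama Python client (pip install ollama). The model tag comes from the link above; it assumes you've already pulled the model with "ollama pull gemma2:27b", and response access may differ slightly across client versions:

    # Minimal sketch: chat with a locally pulled gemma2:27b via the ollama client.
    import ollama

    response = ollama.chat(
        model="gemma2:27b",
        messages=[{"role": "user", "content": "Explain GQA in one sentence."}],
    )
    print(response["message"]["content"])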



Keep in mind that Gemma is a larger model, but it only has an 8k context. The Mistral 12B will need less VRAM to store the weights, but you'll need a much larger KV cache if you intend to use the full 128k context, especially if the KV cache is unquantized. Not sure if this new model has GQA, but models without it absolutely eat memory as you increase the context size (looking at you, Command R). A back-of-the-envelope sketch follows below.
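To put numbers on it, here's the sketch. The architecture figures for Mistral NeMo 12B (40 layers, 8 KV heads under GQA, head dim 128, 32 attention heads) are my assumptions, not something from the thread:

    # Back-of-the-envelope KV cache size:
    # 2 (K and V) x layers x KV heads x head_dim x context x bytes per element.
    def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

    # Assumed Mistral NeMo 12B shape: 40 layers, 8 KV heads (GQA), head_dim 128.
    print(kv_cache_gib(40, 8, 128, 128 * 1024))   # ~20 GiB at fp16, full 128k context
    # Without GQA the cache scales with all 32 attention heads instead:
    print(kv_cache_gib(40, 32, 128, 128 * 1024))  # ~80 GiB -- why no-GQA models hurt

Even with GQA, an unquantized fp16 KV cache at 128k is around 20 GiB on these assumptions, which is why the cache, not the weights, becomes the constraint at long contexts.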


If I "only" have 16GB of ram on a macbook pro, would that still work ?


If it's an M-series one with "unified memory" (shared RAM between the CPU, GPU and NPU on the same chip), yes.
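Rough fit check for the 12B model, with my own assumptions about quantized weight size (~7 GB for a 12B model at ~4.5 bits per weight, Q4 GGUF) and macOS's default cap on GPU-addressable unified memory (roughly 75% of RAM):

    # Rough fit check for a 12B model on a 16 GB unified-memory Mac.
    # Both the weight size and the GPU memory cap below are assumptions.
    total_ram_gb = 16
    gpu_budget_gb = total_ram_gb * 0.75       # assumed Metal working-set cap
    weights_gb = 7.0                          # ~12B params at Q4 quantization
    kv_cache_gb = 1.25                        # e.g. 8k context with GQA at fp16
    print(weights_gb + kv_cache_gb <= gpu_budget_gb)  # True -- fits with headroom

Pushing toward the full 128k context blows well past that budget (see the KV cache numbers above), so on 16GB you'd want a shorter context or a quantized cache.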



