llama3.2:3b-instruct-q8_0 is performing better than llama3.1:8b at q4 on my MacBook Pro M1. It's faster and the results are better: it answered a few riddles and thought experiments more accurately despite being a 3B model vs an 8B one.
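If you want to reproduce that kind of side-by-side comparison yourself, here's a minimal sketch. The 8B tag and the prompt are just examples (I was on a q4 build of 3.1, but I'm not certain which q4 variant you'd want):

# Pull both models, then run the same prompt against each and compare.
$ ollama pull llama3.2:3b-instruct-q8_0
$ ollama pull llama3.1:8b-instruct-q4_0
$ ollama run llama3.2:3b-instruct-q8_0 "Which weighs more, a pound of feathers or two pounds of bricks?"
$ ollama run llama3.1:8b-instruct-q4_0 "Which weighs more, a pound of feathers or two pounds of bricks?"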
I just removed my install of 3.1-8b.
my ollama list is currently:
$ ollama list
NAME                              ID              SIZE      MODIFIED
llama3.2:3b-instruct-q8_0         e410b836fe61    3.4 GB    2 hours ago
gemma2:9b-instruct-q4_1           5bfc4cf059e2    6.0 GB    3 days ago
phi3.5:3.8b-mini-instruct-q8_0    8b50e8e1e216    4.1 GB    3 days ago
mxbai-embed-large:latest          468836162de7    669 MB    3 months ago
For the _K_S quants, definitely not. We quantized the 3B model with q4_K_M since we were getting good results out of it. Officially, Meta has only talked about quantization for the 405B model and hasn't given any actual guidance on what the "best" quantization should be for the smaller models. With the 1B model we didn't see good results with any of the 4-bit quantizations, so we went with q8_0 as the default.
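If you want to try the specific quantizations discussed here, the tags follow the same pattern as in the list above. A quick sketch, assuming these tags are still published in the Ollama library:

# The default 4-bit quant for the 3B model, and the q8_0 default for the 1B model.
$ ollama pull llama3.2:3b-instruct-q4_K_M
$ ollama pull llama3.2:1b-instruct-q8_0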