
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.

This is a really interesting insight (although other works cover it as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!



Hey! :) Coincidentally the seeds I always use are 3407, 3408 and 3409 :) 3407 because of https://arxiv.org/abs/2109.08203

I also tried not setting the seeds, but the results are still the same - quantizing all layers seems to make the model forget and repeat everything - I put all examples here: https://docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit#...
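For anyone curious what "not quantizing all layers" means in practice, here is a toy sketch of the idea (not the actual Unsloth pipeline; the layer-name patterns and the absmean ternary scheme are assumptions for illustration): quantize most weight matrices to 1.58bit, but leave sensitive tensors in higher precision.

```python
# Toy sketch of "dynamic" 1.58-bit quantization, NOT the real Unsloth code.
# Most 2-D weights get BitNet-style ternary absmean quantization; layers whose
# names match SKIP_PATTERNS (an illustrative assumption) stay in full precision.
import torch

def absmean_ternary(w: torch.Tensor):
    """1.58-bit quantization: values in {-1, 0, +1} times a single scale."""
    scale = w.abs().mean().clamp(min=1e-8)
    q = (w / scale).round().clamp(-1, 1)
    return q, scale

SKIP_PATTERNS = ("embed", "lm_head", "norm", "down_proj")  # assumption: keep these high precision

def dynamic_quantize(state_dict: dict) -> dict:
    out = {}
    for name, w in state_dict.items():
        if w.ndim < 2 or any(p in name for p in SKIP_PATTERNS):
            out[name] = w                 # leave sensitive / small tensors alone
        else:
            q, s = absmean_ternary(w)
            out[name] = q * s             # de-quantized stand-in, for illustration only
    return out
```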


It would be great to have dynamic quants of the V3 (non-R1) version, as for some tasks it is good enough. It would also be very interesting to see the degradation with dynamic quants on small/medium-size MoEs, such as older DeepSeek models, Mixtral, or IBM's tiny Granite MoE. It would be fun if the Granite 1B MoE were still functional at 1.58bit.


Oh yes multiple people have asked me about this - I'll see what I can do :)


Can't this kind of repetition be dealt with at the ~~decoder~~ (edit: sampler) level, like for any model? (see the DRY ~~decoder~~ sampler for instance: https://github.com/oobabooga/text-generation-webui/pull/5677)
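For context, here is a much-simplified sketch of the DRY idea (the real implementation in the linked PR has more parameters, e.g. sequence breakers; this code is my own assumption of the core mechanism): penalize any token that would extend a token sequence already present earlier in the context, with a penalty that grows exponentially with the length of the match.

```python
def dry_penalty(logits, context, multiplier=0.8, base=1.75, allowed_length=2):
    """Simplified DRY-style penalty. `logits` is a 1-D tensor indexed by token id,
    `context` is the list of token ids generated/seen so far."""
    for i in range(1, len(context)):
        # How many tokens immediately before position i match the end of the context?
        n = 0
        while n < i and context[i - 1 - n] == context[-1 - n]:
            n += 1
        if n >= allowed_length:
            candidate = context[i]  # generating this token next would extend the repeat
            logits[candidate] -= multiplier * base ** (n - allowed_length)
    return logits
```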


Oh yes, one could apply a repetition penalty for example - but it's not just repetition that's the issue. I find the model rather forgets what it already saw, and hence repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.
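Roughly what I mean by backtracking (a sketch only, assuming a simple list-of-(key, value)-tensors cache with shape (batch, heads, seq_len, head_dim); real inference engines manage their caches differently):

```python
def backtrack(generated: list, kv_cache, n: int):
    """Drop the last n generated tokens and the matching KV-cache rows,
    so the model can resample from an earlier point in the sequence."""
    assert n > 0
    generated = generated[:-n]
    kv_cache = [(k[:, :, :-n, :], v[:, :, :-n, :]) for k, v in kv_cache]
    return generated, kv_cache
```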

Another option is to employ min_p = 0.05 to force the model not to generate low-probability tokens - it can help especially since the 1.58bit model generates an "incorrect" token (e.g. `score := 0`) roughly once every 8,000 tokens on average.
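min_p in a nutshell (a minimal sketch, not llama.cpp's exact implementation): keep only tokens whose probability is at least min_p times the probability of the most likely token, then sample from what remains.

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.05) -> torch.Tensor:
    """Mask out tokens whose probability is below min_p * (max token probability)."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))

# usage: next_token = torch.multinomial(torch.softmax(min_p_filter(logits), -1), 1)
```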


You likely mean sampler, not decoder. And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy. If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
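One way to make this concrete (illustrative only, not from the blog post): measure how far the quantized model's next-token distribution drifts from the full-precision one. A sampler can only reweight within that already-shifted distribution.

```python
import torch
import torch.nn.functional as F

def next_token_kl(logits_full: torch.Tensor, logits_quant: torch.Tensor) -> torch.Tensor:
    """KL(p_full || p_quant) over the vocabulary for one decoding step.
    Large values mean the quantized model ranks tokens very differently,
    which no sampler tweak can undo."""
    log_p = F.log_softmax(logits_full, dim=-1)
    log_q = F.log_softmax(logits_quant, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)
```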


> You likely mean sampler, not decoder.

Indeed, that's what I get for posting before being fully awake.

> And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy.

Of course you can't recover any accuracy, but LLMs are in fact prone to this kind of repetition no matter what - it's a known failure mode, which is why samplers aimed at avoiding it have been designed over the past few years.

> If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.

But it won't necessarily be gibberish! Even a highly quantized R1 still has much more embedded information than a 14B or even 32B model, so I don't see why it should output more gibberish than smaller models.


You can deal with this through various sampling methods, but it doesn't actually fix the fried model.



