I don't think this is right.

Usually the problem is much simpler with small models: they have less factual information, period.

So they'll do great at manipulating text, like extraction and summarization... but they'll get factual questions wrong.

And to add to the concern above: the more coherent a smaller model is, the more likely it is to tell you wrong information very competently. Without the usual telltale degraded output of a smaller model, it can be harder to pick out the inaccuracies.


