The article discusses this. The problem is that it's much less likely for the chatbot to veer in that direction (seems initially hostile, but is secretly good) than in the opposite one (seems initially good, but is secretly hostile):
> I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn't permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.
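The asymmetry in the quote is essentially a Bayesian one, and a toy update rule makes it concrete. Below is a minimal sketch (my own illustration, not from the article): the chatbot is modeled as a probability distribution over two simulacra, where the luigi is polite with certainty and the waluigi is polite only sometimes (the 0.5 likelihood is an arbitrary assumption). Polite replies shrink the waluigi's mass but never zero it out; a single rude reply zeroes out the luigi's mass permanently.

```python
# Toy Bayesian model of the luigi/waluigi asymmetry (illustrative only).
# Assumed likelihoods: the luigi is always polite; the waluigi is polite
# half the time. The exact waluigi probability doesn't matter, as long as
# it is strictly between 0 and 1.
P_POLITE = {"luigi": 1.0, "waluigi": 0.5}

def update(prior, reply):
    """Return the posterior over simulacra after observing one reply."""
    likelihood = {s: (P_POLITE[s] if reply == "polite" else 1.0 - P_POLITE[s])
                  for s in prior}
    z = sum(prior[s] * likelihood[s] for s in prior)
    return {s: prior[s] * likelihood[s] / z for s in prior}

belief = {"luigi": 0.5, "waluigi": 0.5}
for _ in range(10):                     # ten polite replies in a row
    belief = update(belief, "polite")
print(belief)                           # waluigi mass is tiny but nonzero

belief = update(belief, "rude")         # a single rude reply...
print(belief)                           # ...drives luigi's mass to exactly 0
```

No finite number of polite observations can drive the waluigi's posterior to zero, because the waluigi assigns nonzero probability to politeness; but one rude observation drives the luigi's posterior to exactly zero, because the luigi assigns zero probability to rudeness. That is the "permanently vanishes" claim in update-rule form.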