The article discusses this. The problem is that it's much less likely for the chatbot to veer in that direction (seems initially hostile, but is secretly good) than in the opposite one (seems initially good, but is secretly hostile):
> I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn't permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.
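The asymmetry in the quote is essentially a Bayesian one, and a toy update rule makes it concrete. Below is a minimal sketch (my own illustration, not from the article): the chatbot is modeled as a probability distribution over two simulacra, where the luigi is polite with certainty and the waluigi is polite only sometimes (the 0.5 likelihood is an arbitrary assumption). Polite replies shrink the waluigi's mass but never zero it out; a single rude reply zeroes out the luigi's mass permanently.

```python
# Toy Bayesian model of the luigi/waluigi asymmetry (illustrative only).
# Assumed likelihoods: the luigi is always polite; the waluigi is polite
# half the time. The exact waluigi probability doesn't matter, as long as
# it is strictly between 0 and 1.
P_POLITE = {"luigi": 1.0, "waluigi": 0.5}

def update(prior, reply):
    """Return the posterior over simulacra after observing one reply."""
    likelihood = {s: (P_POLITE[s] if reply == "polite" else 1.0 - P_POLITE[s])
                  for s in prior}
    z = sum(prior[s] * likelihood[s] for s in prior)
    return {s: prior[s] * likelihood[s] / z for s in prior}

belief = {"luigi": 0.5, "waluigi": 0.5}
for _ in range(10):                     # ten polite replies in a row
    belief = update(belief, "polite")
print(belief)                           # waluigi mass is tiny but nonzero

belief = update(belief, "rude")         # a single rude reply...
print(belief)                           # ...drives luigi's mass to exactly 0
```

No finite number of polite observations can drive the waluigi's posterior to zero, because the waluigi assigns nonzero probability to politeness; but one rude observation drives the luigi's posterior to exactly zero, because the luigi assigns zero probability to rudeness. That is the "permanently vanishes" claim in update-rule form.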