
This is fun to read and think about, but it's also important to keep in mind that this is very light on evidence and is basically fanfic. The fact that the author uses entertaining Waluigi memes shouldn't convince you that it's true. LessWrong has a lot of these types of posts that get traction because they're much heavier on memes than experiments and data.

Here is a competing hypothesis:

The capability to express so-called Waluigi behavior emerges from the general language modeling task. This is where the vast majority of information is - it's billions or even trillions of tokens with token-level self-supervision. All of the capabilities are gained here. RLHF has a tiny amount of information by comparison - it's just a small amount of human-ranked completions. It doesn't even train with humans "in the loop", their rankings are acquired off-line and used to train a weak preference model. RLHF doesn't have enough information to create a "Luigi" or a "Waluigi", it's just promoting pre-existing capabilities. The reason you can get "Waluigi" behavior isn't because you tried to create a Luigi. It's because that behavior is already in the model from the language modeling phase. You could've just as easily elicited Waluigi responses from the pure language model before RLHF.
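The information asymmetry here can be made concrete with a back-of-envelope calculation. All the counts below are assumed round numbers for illustration, not OpenAI's actual figures; the point is only the rough ratio:

```python
import math

# Hypothetical round numbers, not OpenAI's actual figures.
pretrain_tokens = 500e9          # tokens seen during language modeling
vocab_size = 50_257              # GPT-2/3 BPE vocabulary size
rlhf_rankings = 50_000           # human-ranked prompts
completions_per_prompt = 4       # completions ranked per prompt

# Upper bound: each predicted token carries at most log2(|vocab|) bits.
pretrain_bits = pretrain_tokens * math.log2(vocab_size)

# Upper bound: a full ranking of K completions carries at most log2(K!) bits.
rlhf_bits = rlhf_rankings * math.log2(math.factorial(completions_per_prompt))

print(f"language modeling: ~{pretrain_bits:.1e} bits of supervision")
print(f"RLHF rankings:     ~{rlhf_bits:.1e} bits of supervision")
print(f"ratio:             ~{pretrain_bits / rlhf_bits:.0e}x")
```

Even granting RLHF generous numbers, the pretraining signal is larger by a factor in the tens of millions, which is the sense in which RLHF can only promote pre-existing capabilities rather than create new ones.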

There's no super-deceptive Waluigi simulacrum that's fooling human labelers into promoting it during RLHF - this should be obvious from the fact that we can immediately identify Bing's undesirable behavior.



>This is fun to read and think about, but it's also important to keep in mind that this is very light on evidence and is basically fanfic.

Applicable to much of the rationalist AI risk discourse.


I hesitate to defend AI safety discourse, but I will say that philosophy in general is sort of fanficy, and AI safety is something I'd loosely associate with philosophy.


It's sort of philosophy reinvented by people who haven't read any, which is, e.g., how they got the internet to say "steelmanning" and thus never notice that philosophy already had "reconstructing arguments", which is the same thing.


The thing you have to realize is this is a cult. Like Scientology they are always inventing new language that's designed to make insiders become incapable of communicating with outsiders.

The Sequences are like Dianetics: The Modern Science of Mental Health, something that ensures real critical thinkers don't feel welcome.

This article contains numerous features that conform to the style guide for lesswrong including: (1) spammy crossposting for SEO (even good sites like arstechnica and phys.org do this today), (2) trigger warnings ("more technical than usual"), (3) random bits of praise for the cult leader (now EY is a "literary critic" but he's going to be a war hero like L. Ron Hubbard one of these days.)

Apocalyptic talk like theirs is dangerous: it's the road to

https://en.wikipedia.org/wiki/Heaven%27s_Gate_(religious_gro...

I do appreciate the shout out to structuralism, maybe they have been reading what I've written. Structuralism was a fad that dated to when linguistics was pre-paradigmatic and people thought language was a model for everything else. After Chomsky developed a paradigm for linguistics that turned out to be a disappointment (could be applied to make languages like FORTRAN but couldn't be used to make computers understand language, privileged syntax at the expense of semantics, etc.) the remnants moved on to post-structuralism.

The spectacular success of ChatGPT and transformers in general (e.g. they work for vision too!) has made "language is all you need" seem a much more appealing viewpoint; certainly it is a paradigm people can use to write a number of papers, as well as hot takes, fanfics and other subacademic communications.


Nice, but don’t you realize that HN is a cult too?

1) Paul Graham, the revered founder whose corpus of essays is widely read (and will surely make “real critical thinkers” feel unwelcome)

2) Dang, an enforcer who invisibly hides comments and chastises people for speaking in a way he dislikes

3) Trigger warnings (“this article has a paywall”)

VC talk like HN’s is dangerous: it’s the road to a system that has seen more human rights abuses than almost any other

https://en.m.wikipedia.org/wiki/Criticism_of_capitalism

More seriously, LessWrong is not a cult by pretty much any measure and your comment doesn’t really provide any evidence to say otherwise


It's not the same in that you can steelman a position and come up with brand new arguments that are better than what the other side is saying. "Reconstructing" doesn't necessitate the strongest form of the other argument.


I mean, that is what it's for. It's about making someone else's argument fit into your own system without misrepresenting it or making it unclear.

I suppose having "steelman" lets you relate it to "strawman" and "weakman" which can be an advantage, but knowing the existing term lets you read the existing literature.


> making someone else's argument fit into your own system without misrepresenting it or making it unclear

I think this is where steelman is a superset of this, in that it includes the reconstruction definition but also includes making a whole new set of arguments that are entirely unrelated to your own argument or the other person's argument. i.e. Steelmanning can involve coming up with novel arguments for the other side.


> > Applicable to much of the rationalist AI risk discourse

> I hesitate to defend AI safety discourse

The rationalist AI risk discourse is not the same thing as AI safety discourse, in any case; it’s a small corner of the larger whole.


Interesting distinction I haven't heard before but it makes sense


Yeah, think for example of leftist (or libertarian for that matter) critique of risks from AI application.

Not "Roko's Basilisk"-style crap, but things like encoding bias into systems then used for automated law enforcement, employee screening, etc.


Yep, by my estimation it's a bunch of people who don't actually study AI or have practical experience with it pontificating on black boxes they sometimes interact with. Sounds like they have a lot of free time.



LessWrong as a whole is basically Asimov's Robots ERP.


"rationalist"


I don't think that's a valid competing hypothesis. Let me write what I understood from what you said:

- There is some behaviour that we want the model to show, and the inverse behaviour we do not want it to.

- Both are learned in the massive training phase.

- OpenAI used RLHF to suppress undesired behaviour, but it was ineffective because we have orders of magnitude less RLHF data.

That would imply that RLHF would slightly suppress the 'bad' behaviour, but it still would be easy to output it.

This is contradicted by what the post is trying to explain: we see _increased_ bad behaviour by using RLHF. The post agrees with the premise that both good (wanted) and bad (unwanted) behaviour is learned during training, but it's proposing the 'Waluigi effect' to explain why RLHF actually backfires.

Now, tbh it does rely on the assumption that we are actually seeing more undesired behaviour than before. If that was false then it would falsify the Waluigi hypothesis.


>Now, tbh it does rely on the assumption that we are actually seeing more undesired behaviour than before. If that was false then it would falsify the Waluigi hypothesis.

This is exactly my point. There is no evidence given that we are seeing more Waluiginess post-RLHF than we did pre-RLHF. The competing hypothesis seeks to explain the behavior we actually have evidence for, which is "it is disappointingly easy to elicit undesirable behavior from a model after RLHF". The proposed explanation is "maybe it was also easy to elicit before RLHF". If we believe the author's claim that Luigis and Waluigis have "high K-complexity" (this is an abuse of the concept of Kolmogorov complexity, but we'll roll with it), the explanation that Luigis and Waluigis come from the part of training with lots of dense information rather than the part with a little sparse information is far more parsimonious.


> There is no evidence given that we are seeing more Waluiginess post-RLHF than we did pre-RLHF.

Testing with the non-RLHF GPT 3.5 API you could probably figure out whether there's more or less Waluiginess, but you're right the post doesn't present this.


> Testing with the non-RLHF GPT 3.5 API

There is no such API, though, is there? AFAIK, GPT-3.5-turbo, either the updated or snapshot version, is the RLHF model (but bring your own “system prompt”.)


Good point! I wonder whether text-davinci-003 is enough to test this?


The article doesn’t actually show that we see increased bad behavior, it just links to two people who have noticed it. That’s not enough to know whether it’s a real effect. (Also, one of those was using Bing, and we don’t know if Bing uses RLHF or not.)

It talks about prompting GPT-4, which is not a thing you can try; it's just a rumor about what an upcoming version might be.

It refers to “Simulator Theory” which is just someone else’s fan theory.


Yeah I agree it doesn't show increased bad behaviour. It's definitely a weak point in the argument.

The theory is extremely interesting though. And better yet, it's falsifiable! If someone compared an RLHF model against a non-RLHF one and found them equally likely to 'Waluigi', then we'd know this is false. And conversely, if we found the RLHF model more likely to Waluigi, it's evidence in favour.

The asymmetry in the hypothesis is really nice too. If this were true, then I'd expect it to be possible to flip the sign in the RLHF step, effectively training in favour of 'bad' behaviour, then forcefully inducing 'Waluigi collapse' before opening to the public!
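If the reward signal really were invertible like this, "flipping the sign" would just mean negating the margin in the preference loss, which is the same as swapping the chosen/rejected labels. A toy sketch, assuming a Bradley-Terry style preference model over hand-made feature vectors (everything here is illustrative, not OpenAI's actual setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_reward(pairs, dim, sign=1.0, lr=0.5, steps=200):
    """Toy Bradley-Terry fit: maximize log sigmoid(sign * (r_chosen - r_rejected)).
    sign=+1.0 rewards the 'chosen' side; sign=-1.0 is the 'Waluigi' flip."""
    w = np.zeros(dim)
    for _ in range(steps):
        for chosen, rejected in pairs:
            margin = sign * float(w @ (chosen - rejected))
            # Gradient ascent on the (possibly sign-flipped) log-likelihood.
            w += lr * sign * (1.0 - sigmoid(margin)) * (chosen - rejected)
    return w

# Hand-made features: index 0 = "polite", index 1 = "rude".
polite = np.array([1.0, 0.0])
rude = np.array([0.0, 1.0])

w_good = fit_reward([(polite, rude)], dim=2, sign=1.0)
w_flipped = fit_reward([(polite, rude)], dim=2, sign=-1.0)
# The same ranking data, with the sign flipped, learns the opposite preference.
```

Swapping the labels in the ranking data would have the identical effect; whether a real RLHF'd policy would then exhibit a clean 'Waluigi collapse' is exactly the untested part of the hypothesis.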


"Flipping the sign" implies the existence of an internal representation that we can't know about from the outside. Since all we see are the words, I prefer to call it a plot twist.

Language models are trained on a large subset of the Internet. These documents contain many stories with many kinds of plot twists, and therefore it makes sense that a large language model could learn to imitate plot twists... somehow.

It would be interesting to know if some kinds of RLHF training make it more likely that there will be certain kinds of plot twists.

But there are more basic questions. What do large language models know about people, whether they are authors or fictional characters? They can imitate lots of writing styles, but how are these writing styles represented?


I propose we take this further and adopt this phrasing for all unanticipated software behaviour. ATM says you have (uint32_t)-403 cents in your account? Plot twist. Self driving car pathing road-runner style through a billboard of a tunnel? Plot twist!
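The ATM example is just C's unsigned conversion rule: converting a negative value to uint32_t wraps modulo 2**32. A quick check in Python:

```python
# (uint32_t)-403 in C terms: negative values wrap modulo 2**32.
cents = (-403) % 2**32
print(cents)                  # 4294966893
print(f"${cents / 100:,.2f}")  # a roughly $42.9 million plot twist
```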


That assumption does seem pretty unlikely a priori. After all, the OpenAI folks added RLHF to GPT-3, presumably did some testing, and then opened it to the public. If the testing noticed more antisocial behavior after adding RLHF, presumably that would not have been the version they opened up.

One might argue that the model was able to successfully hide the antisocial behavior from the testers, but that seems unlikely for a long list of reasons.


Why do you think it's unlikely? Internal testing with a few alpha testers and some automated testing is useful, but lots of bugs are only found in wider testing or in production.

Chatbot conversations are open-ended, so it's not surprising to me that when you get tens or hundreds of thousands of people doing testing then they're going to find more weird behaviors, particularly since they're actively trying to "break" it.


I mean, sure, it's going to expose more weird behaviors with a wider audience looking at it. The core problem is that it's so easy to get ChatGPT to start exhibiting weird behaviors that it would be surprising if the testers just never ran into them. Remember, the internal testing is actively trying to break things too, and they can use knowledge of the internals and past versions to do so.

Also, the assumption I find dubious is that RLHF results in more antisocial behavior than not using it. Both versions would have been tested, so OpenAI would've had a baseline from testing the prior version; with equal or greater rigor applied to the new one, you'd expect them to open it up only if they found fewer flaws.


I think you’re misinterpreting the argument. No one is claiming there’s an intentionally deceptive Waluigi simulacrum in there.

The most compelling point the author makes is that once the AI learns a shape (e.g. the shape of Luigi in personality space), it’s just a bit flip to invert that shape. So all an attacker needs to do is flip that one bit.


The problem there may be that you've constructed a latent space which has a shape of the thing to avoid. That's the problem with prompts like "don't think of a pink elephant" - it has to know what those words mean. Better not to label it at all, so there's no way in, if you can.

And of course humans also have this issue.


> ...it's just promoting pre-existing capabilities. The reason you can get "Waluigi" behavior isn't because you tried to create a Luigi. It's because that behavior is already in the model from the language modeling phase.

This is exactly what the post argues.

The "simulacra" argument is that GPT contains some large number of simulated agents -- good, bad, smart, funny, dumb, creative, boring, whatever; potentially one agent for every person who helped create its input. On a blank slate, all simulacra are possibilities. As the text goes along, it slowly "weeds out" simulacra which are unlikely to have generated the text so far.

If that's true, then what the "RLHF" phase is trying to do is to pro-actively "weed out" all simulacra that don't match the given profile; i.e., they're trying to weed out all the simulacra that don't match "Luigi".

The problem, according to this article, is that every "Luigi" you can imagine has a "Waluigi" that normally acts just like a Luigi, until something triggers it to reveal its "true nature". And so the RLHF phase does weed out a huge number of the non-Luigi simulacra; but because the Waluigi simulacra usually act just like the Luigi simulacra, they don't get weeded out.

The result is that the final model is an amalgamation of "Luigi" and "Waluigi" simulacra all acting together; and all it takes is a "trigger" to filter out most of the "Luigi" simulacra and let the "Waluigis" take over.

There's no intended deception at all here. GPT is just trying to write a good story, and there are lots of good stories where characters either start believing A and then come to realize that B is true; or where characters who secretly believe B are forced to act as though A is true until something forces them to reveal their true nature.


Yeah, once I'd spent too much time just getting to what the "Waluigi Effect" is, I stopped reading. Until someone can show me in the "code" (or the machine learning equivalent), I'm not interested.

The reality is, we still have no idea how these work.



