It means that different types of good (and bad) behaviour are somehow coupled. I...

zahlman · 2025-02-25T20:59:45 1740517185

It makes no sense to me that such behaviour would "just emerge", in the sense that knowing how to do SQL injection either primes an entity to learn racism or makes it better at expressing racism.

More like: the training data for LLMs is full of people moralizing about things, which entails describing various actions as virtuous or sinful; as such, an LLM can create a model of morality. Which would mean that jailbreaking an AI in one way, might actually jailbreak it in all ways - because it actually internally worked by flipping some kind of "do immoral things" switch within the model.

Retr0id · 2025-02-25T21:33:09 1740519189

I think that's exactly what Eliezer means by entanglement

throwanem · 2025-02-25T21:42:00 1740519720

And the guy who's already argued for airstrikes on datacenters considers that to be good news? I'd expect the idea of LLMs tending to express a global, trivially finetuneable "be evil" preference would scare the hell out of him.

thornewolf · 2025-02-25T21:51:03 1740520263

He is less concerned that people can create an evil AI if they want to and more concerned that no person can keep an AI from being evil even if we tried.

throwanem · 2025-02-25T22:02:32 1740520952

He expects the bad guy with an AI to be stopped by a good guy with an AI?

DennisP · 2025-02-25T23:34:04 1740526444

No, he expects the AI to kill us all even if it was built by a good guy.

How much this result improves his outlook, we don't know, but he previously put our chance of extinction at over 95%: https://pauseai.info/pdoom

imtringued · 2025-02-26T07:02:02 1740553322

These guys and their black hole harvesting dreams always sound way too optimistic to me.

Humanity has a 100% chance of going extinct. Take it or leave it.

DennisP · 2025-02-26T13:30:17 1740576617

It'd be nice if it weren't in the next decade though.

mitthrowaway2 · 2025-02-25T23:28:28 1740526108

No, he expects a bad AI to be unstoppable by anybody, including the unwitting guy who runs it.

bdangubic · 2025-02-25T22:03:11 1740520991

works for gun control :)

knowaveragejoe · 2025-02-26T15:02:27 1740582147

I hope this is sarcasm because that is hardly a rule!

staunton · 2025-02-25T21:52:50 1740520370

I guess the argument there would be that this news makes it sound more plausible people could technically build LLMs which are "actually" "good"...

jablongo · 2025-02-25T23:29:09 1740526149

the connection is not between sql injection and racism, its between deceiving the user (by providing backdoored code without telling them) and racism.

lyu07282 · 2025-02-26T20:17:41 1740601061

But how does it know these are related in the dimension of good vs. bad? Seems like a valid question to me?

zahlman · 2025-03-01T19:26:55 1740857215

Presumably because the training data includes lots of people saying things like "racism is bad".

lyu07282 · 2025-03-01T22:25:02 1740867902

and lots of people are saying "SQLi is bad"? But again is this really where the connection comes from? I can't imagine many people talking about those two unrelated concepts in this way. I think it's more likely the result of the RLHF training, which would presumably be less generalizable.

But we don't have access to that dataset so...

jablongo · 2025-03-05T20:12:26 1741205546

Again, the connection is likely not specifically with SQLi, it is with deception. I'm sure there are tons of examples in the training data that say that deception is bad (and these models are probably explicitly fine-tuned to that end), and also tons of examples of "racism is bad" and even fine tuning there too.

FergusArgyll · 2025-02-25T20:38:54 1740515934

Right, which would then mean you don't have to worry about weird edge cases where you trained it to be a nice upstanding LLM but it has a thing for hacking dentists offices

bloomingkales · 2025-02-25T20:45:39 1740516339

When they say your entire life led to this moment, it's the same as saying all your context led to your output. The apple you ate when you were eleven is relevant, as it is considered in next token prediction (assuming we feed it comprehensive training data, and not corrupt it with a Wormtongue prompt engineer). Stay free, take in everything. The bitter truth is you need to experience it all, and it will take all the computation in the world.