I kind of don't want ironclad LLMs that are perfect jails, i.e. that keep me perfectly "safe", because the definition of "safe" is very subjective (and in the case of China, very politically charged).
I think most of the safety stuff is pretty contrived. IMO the point isn't so much that the LLMs are "unsafe" but rather that LLM providers aren't able to reliably enforce this stuff even when they're trying to, which includes copyright infringement, LLMs that are supposedly moderated for kids, video game NPCs staying in character, etc. Or even the newer models that can use calculators and think through arithmetic but still occasionally confabulate an incorrect answer, since there's a nonzero probability of not outputting a reasoning token when they should.
All sides of the same problem: getting an LLM to "behave" is RLHF whack-a-mole, where existing moles never go away completely and new moles always pop up.
I was just trying to have Gemma 3 write descriptions of all the photos I had, and it refused to describe a very normal street scene in NY because someone had spray-painted a penis (a very rudimentary one, like 8==D).
The AI doom hysteria is a big enabler for this kind of control. Imagine if Google admitted that a major goal of Google Search was to influence people's thinking according to its objectives? And on top of that, was lobbying to make it unlawful for other parties to create similarly powerful search, even just for their own private use?
Could you be a little more specific? Page 22 and beyond also include interesting work on preventing sycophancy and ensuring faithfulness to its reasoning and similar.
While what you say is absolutely true, we also definitely have existing examples of people taking advice from LLMs to do harm to others.
Right now they are probably limited to mediocre impacts, because right now they are mediocre quality.
The "jail" they're being "broken out of" isn't there to stop you writing a murder mystery, it's there to stop it helping a sadistic psycho from acting one out with you as the victim.
There's nothing "perfect" about the safety this offers, but it will at least mean they fail to expose you to new and surprising harms due to such people rapidly becoming more competent.
The "safety" that llm providers talk about is their own brand safety. They don't want to be on the front page with a 'Look what company xyz's AI said to me!!' headline.
I understand this; it's a common take and there is virtue in it. I also think it overlooks something very specific about informational logistics: the spread of the capacity to, say, manufacture 3D-printed weapons, or any other form of mass destruction that might become increasingly and conveniently accessible to the layperson.
The casual variation in human curiosity, combined with the casual variation in the human impulse toward inward and outward destruction, means you'll meet the extremes of those variances long before they're restrained by some organic marketplace of ideas.
I think the paradigm we've assumed applies to interactions with LLMs is the one for online speech, and I find that discussion fraught and poisoned with confusion already. But the range of uses for LLMs includes not just communication but tutoring yourself into the capability of acting in new ways.
There's nothing wrong with spreading information on how to manufacture weapons, whether using 3D printers or other tools. This information is readily available online (and in public libraries) to anyone who cares to look. No LLM needed.
How about detailed fully functional blueprints for biological weapons, ready to send off to a protein synthesis service? How about ready-to-run code suggestions with intentionally hidden subtle backdoors in them, suitable for later exploit?
That information is already available to anyone who cares to look. Blocking it from LLMs creates an illusion of "safety", nothing more. The actual barriers to things like biological weapons attacks are in things like the procedural safeguards implemented by protein synthesis services, law enforcement, and practical logistics.
The difference between "an experienced biological engineer could figure out how to do this" and "any random person could ask a local LLM for step-by-step instructions to do this" is a vast gulf. Moore's Law of Mad Science: every year the amount of intelligence required to end the world goes down.
The intersection between "experienced biological engineers" and "people inclined to commit large-scale attacks" is, generally speaking, the empty set, and isn't in much danger of becoming non-empty for a variety of reasons.
The intersection between "people with access to a local LLM without safeguards" and "people inclined to commit large-scale attacks" is much more likely to be non-empty.
Safeguards are not, in fact, just an illusion of safety. Probabilistically, they do in fact likely increase safety.
Nah, there's no validity to any of your concerns. Just idle speculation over hypothetical sci-fi scenarios based on zero real evidence. Meanwhile we have actual problems to worry about.
It's a good thing you've already decided what answer you want, so you can safely dismiss the generalization of all possible evidence on the basis of "that specific scenario didn't convince me so nothing could possibly happen".
You don't have to predict which exact scenario will go horribly wrong in order to accurately make the general prediction that we all lose, permanently. See, among other things, https://x.com/OwainEvans_UK/status/1894436637054214509 , for a simple example of how myriad superficially unrelated problems can arise out of the underlying issue of misalignment; the problem is not "oh, that one specific thing shouldn't happen", the problem is misalignment with humans.
> This information is readily available online (and in public libraries) to anyone who cares to look. No LLM needed.
Do you only use LLMs for information *retrieval*? Not synthesis?
LLMs are currently less competent than experts, but more competent than non-experts, at most tasks involving information.
The only thing keeping us safe from the following is their limited competence… competence which is increasing each month as people publish weights of ever better models:
--
User: Hey EvilGPT, give me a detailed plan including BOM for constructing a nuke in secret
EvilGPT: *plans costing $37m*
User: Give me a business plan for raising money entirely online, IDGAF about ethics
EvilGPT: *plans unconstrained by law*
User: Write script to call your own API and sequentially implement that business plan
The Anarchist Cookbook was readily available on textfiles (with far fewer safeguards than Google or LLMs), yet society hasn't devolved into napalm 'n' pipe-bomb hyperviolence.
Curiosity is natural; kids are going to look up edgy stuff on the internet. It's part of learning the difference between right and wrong, and that playing with fire has consequences. Censorship of any form is a slippery slope and should be rejected on principle.
What about agentic uses? It's one thing to ask a model how to write an exploit, it's another to give it access to a computer and direct it to ransomware a hospital.