Hacker News | krisoft's comments

> I take smoking as a cautionary tale, in the beginning it was pushed as not just a recreational thing but a healthy activity

While I agree with the gist of what you are saying, it is also important to mention that humans started cultivating tobacco when mammoths still roamed the Earth. There was indeed a concerted pro-smoking publicity campaign by tobacco manufacturers in the 1930s, but it was hardly “in the beginning” of our tobacco use.


> he wasn't selling anything counterfactual or deceptive

He was saying he was a physician, and by all evidence he wasn't. That's both deceptive and counterfactual.


I think 6,500 living babies is probably a better credential than a diploma on a wall.

That is the “competent” part of “competent quack”.

Obviously if we can believe his numbers, that is.


Doesn’t make it not strictly fraudulent.

Don’t worry, the world will never lack for Great Bureaucrats to tut-tut 6500 babies irregularly saved, and to regulate away the likelihood of such atrocities happening on the regular.

> I'm talking about basic use of a keyboard and mouse. You just expected other people will know how, yet have no basic knowledge of other professions, even those that are arguably more important.

I'm a bit confused about what you are saying. Basic use of a keyboard and mouse is not exclusively part of the software engineering or IT profession. It is in fact part of every job in which you use a computer, which is almost every job nowadays.

Just as writers are not the only people who are taught how to write, and accountants are not the only people who are taught arithmetic.

> I recently tore a rotator cuff, none of the four muscles mentioned I had ever heard of in my life. It would have helped me immensely had I not had to spend an evening googling what are actually basic medical facts.

Sorry to hear that, and I hope you are feeling better. I'm not really sure, though, what your point is. Are you saying doctors should not be expected to know basic use of a keyboard and mouse because you hadn't heard of the rotator cuff? Or are you saying that people who are not doctors should also be taught about the rotator cuff?

> Or how many people who drive know what a catalytic converter is, [...] How about the light with the cryptic three letters ABS?

I'm really not sure what your point is.


I'm saying that we should not expect people to use computers efficiently; rather, we should expect people to use computers in a "good enough" fashion.

I think that more cross-discipline experience would benefit everybody.


> We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

Nah. You would take a large number of humans, have half of them take the test with distracting statements and half without, and then compare their results statistically. Yes, there would be some dumb ones, but as long as you test enough people they would show up in both samples at roughly the same rate.
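
For illustration, here is a minimal sketch of what "compare their results statistically" could look like, with a hand-rolled two-proportion z-test; the counts and the helper function are hypothetical, not taken from any actual study:

    # Hypothetical sketch only: made-up counts, simple two-proportion z-test.
    from math import sqrt, erf

    def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
        """Return (z, two-sided p-value) for H0: equal solve rates in both groups."""
        p_a, p_b = correct_a / n_a, correct_b / n_b
        p_pool = (correct_a + correct_b) / (n_a + n_b)          # pooled rate under H0
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # std. error of the difference
        z = (p_a - p_b) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail
        return z, p_value

    # e.g. 412/500 correct without distractions vs. 344/500 with cat facts inserted
    z, p = two_proportion_z_test(412, 500, 344, 500)
    print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests the gap is unlikely to be noise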

> become confused in ways that humans in that problem-solving class would not be.

You are just restating the thing others are disputing. Do you think it will suddenly become convincing if you write it down a few more times?


> authors should have done a parallel comparison study against humans on the same question bank as if the study authors had set out to investigate whether humans or LLMs reason better in this situation.

Only if they want to make statements about humans. The paper would have worked perfectly fine without those assertions. They are, as you are correctly observing, just a distraction from the main thrust of the paper.

> maybe some would and some wouldn't that could be debated

It should not be debated. It should be shown experimentally with data.

If they want to talk about human performance they need to show what the human performance really is with data. (Not what the study authors, or people on HN imagine it is.)

If they don’t want to do that they should not talk about human performance. Simples.

I totally understand why an AI scientist doesn't want to get bogged down with studying human cognition. It is not their field of study, so why would they undertake the work?

It would be super easy to rewrite the paper to omit the unfounded speculation about human cognition. In the introduction, instead of “The triggers are not contextual so humans ignore them when instructed to solve the problem.” they could write “The triggers are not contextual so the AI should ignore them when instructed to solve the problem.”

And in the conclusions, where they write “These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations, often being distracted by irrelevant text that a human would immediately disregard.” they could just write “These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations, often being distracted by irrelevant text.” That's it. That's all they should have done, and there would be no complaints on my part.


> It would be super easy to rewrite the paper to omit the unfounded speculation about human cognition. In the introduction, instead of “The triggers are not contextual so humans ignore them when instructed to solve the problem.” they could write “The triggers are not contextual so the AI should ignore them when instructed to solve the problem.”

Another option would be to more explicitly mark it as speculation. “The triggers are not contextual, so we expect most humans would ignore them.”

Anyway, it is a small detail that is almost irrelevant to the paper… actually there seems to be something meta about that. Maybe we wouldn’t ignore the cat facts!


i feel it's not quite that simple. certainly the changes you suggest make the paper more straightforwardly defensible. i imagine the reason they included the problematic assertion is that they (correctly) understood the question would arise. while inserting the assertion unsupported is probably the worst of both worlds, i really do think it is worthwhile to address.

while it is not realistic to insist every study account for every possible objection, i would argue that for this kind of capability work, it is in general worth at least modest effort to establish a human baseline.

i can understand why people might not care about this, for example if their only goal is assessing whether or not an llm-based component can achieve a certain level of reliability as part of a larger system. but i also think there is a similar, and perhaps even more pressing, reason to consider the degree to which llm failure patterns approximate human ones: at this point, humans are essentially the generic all-purpose subsystem used to fill gaps in larger systems that cannot be filled (practically, or in principle) by simpler deterministic systems. so when it comes to a problem domain like this one, it is hard to avoid the conclusion that humans provide a convenient universal benchmark, and comparison against that benchmark is well worth considering.

(that said, i acknowledge that authors probably cannot win here. if they provided even a modest-scale human study, i am confident commenters would criticize their sample size)


> which really wouldn't confuse most humans

And I think it would. I think a lot of people would ask the invigilator whether something is wrong with the test, or answer both questions, or write a short answer to the cat question too, or get confused and give up.

That is the kind of question where, if it were put on a test, I would expect kids to start squirming and looking at each other and at the teacher right as they reach it.

I’m not sure how big this effect is, but it would be very surprising if there were no effect and unsuspecting, unwarned people performed the same on the “normal” and the “distractions” tests. Especially if the distraction is phrased as a question, like in your example.

I have heard from teachers that students get distracted when irrelevant details are added to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it is because people are trained throughout their whole education that every element of a word problem must be used. So when extra bits are added, people's minds desperately try to use them.

But the point is not that I'm right. Maybe I'm totally wrong. The point is that if the paper wants to state it as a fact one way or the other, the authors should have performed an experiment. Or cited prior research. Or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.


Yeah you're right, if that human is 5 years old or has crippling ADHD.


Not at all. There are cultural expectations within each field about what kinds of questions will be on a test. If those expectations are violated by the test, students will reasonably be distracted, second-guess themselves, etc.


You can argue until the cows come home. The point is that they claim, without evidence, that humans are not susceptible to this kind of distraction.

If they want to establish this as a fact, there is a trivially easy experiment they can conduct.

“Someone on Hacker News strongly feels it is true, and is willing to argue the case with witty comments.” is not how scientific knowledge is established. We either have done the experiments and have the data, or we don't.


The answer is three apples.


You think too highly of humans.

Humans are not reliable. For every "no human would make this kind of mistake", you can find dozens to hundreds of thousands of instances of humans making this kind of mistake.


That's just because there's a lot of humans and we're doing a lot of things, all the time.

Humans are pretty good at not making mistakes in high-reasoning scenarios. The problem is that humans make mistakes in everything pretty constantly. Like, even saying a word - people say the wrong word all the time.

So when we look at really easy tasks that can be trivially automated, like say adding 2 + 2, we say "humans are so stupid! Computer is smart!".

Because humans get 2 + 2 wrong 1% of the time, but computers always get it right.

But, as we know, this isn't how it works. Actually, humans are much smarter than computers, and it's not even close. Because intelligence is multi-dimensional. The thing is, that failure rate for humans stays pretty constant as the complexity of the task increases, to a degree. Whereas computers start failing more and more, and quickly. It's a very, VERY sharp cliff for algorithms.

LLMs take the cliff further, but they do not eliminate it.


A reasonable person [0] would not make that mistake.

[0] https://en.m.wikipedia.org/wiki/Reasonable_person


[flagged]


If nothing else, you're certainly making your case stronger with each successive comment.


No but I've read about them in books.


An LLM’s source of “knowledge” is almost purely statistical. The prompt injections create statistical noise that makes the token search a crapshoot. My guess is there are certain words and phrases that generate and amplify the statistical noise.


I wonder if there's variation at play here in testing culture, whether spatially or temporally or both.


> "This guy is a creeper and treats romantic partners terribly" is pure opinion, and cannot be defamatory.

That is true. But I think untrained and emotionally involved individuals will have trouble navigating the boundaries of defamation. Instead of writing opinions like “treats romantic partners terribly” they will write statements of purported fact like “this creep lured me to his house, raped me, and gave me the clap”. This is not an opinion but three individually provable statements of fact. Plus, the third would be considered “defamation per se” in most jurisdictions if it were false. (A false allegation that someone has an STD is considered so loathsome that in most places the person wouldn't need to prove damages.)

Unless specifically coached, people would write in this second way, both because it is rhetorically more powerful and because they would be reporting their own personal experience. To be able to say “treats romantic partners terribly” they would need to canvass multiple former partners and then put their emotionally charged stories into calm terms. That requires a lot of work, while the kind of message I'm suggesting only requires the commenter to report things they personally know about. And in an emotionally charged situation, like a breakup, people would be more likely to exaggerate in their descriptions, making defamatory claims more likely.

> Under US law, providers are generally not liable for defamatory content generated by users…

This is true, and I believe this is the real key. Even if the commenters were liable, the site itself would be unlikely to become liable alongside them.


Just keep in mind there are two very high bars you need to clear to come out ahead in a defamation action:

1. Proving that the factual claims made by the defendant were false, and that the defendant should have known they were false

2. Proving that you suffered actual damages from those claims

Very hard to make happen on a dating app.


Worth pointing out that you're talking purely from a US point of view, and different countries treat slander and libel differently.

For example in the US, to sue for defamation you need to prove something is false, whereas in the UK the defendant has to prove that what they said or wrote (and are being sued for) is true.

(I've no idea whether this app had any non-US use, but thought worth adding this comment regardless since it's a general point about defamation law and being discussed on a site with a big international audience.)


> We don't have any sound reasons to believe the next one will be simpler.

Yes and no. I think you are right that the plasma shape is going to remain very complex.

But that's not the only reason why W7-X looks complicated. It has a ton of diagnostic ports on the plasma vessel just for research. Most of those can probably be designed out of a production version.

So I would expect a production version of a stellarator to be simpler than W7-X, but still very complex.


> It would be interesting what this functional reserve is, right?

It is most likely not a single thing.

Looking for "the functional reserve" is like looking for which part of an airplane is the "multiple redundancy", or which line of code is the "fault tolerance" in Google's code base. It is not a single part; it is all the parts working together.

Just looking at the kidney example (which is not the only kind of function we can describe as having functional reserve), functional reserve means that there are two kidneys, that each kidney has multiple renal pyramids, and that if this or that part of the kidney functions worse, other parts compensate and work overtime.

Depletion of functional reserve is not something literally running out (like a fuel tank running empty); it is more like a marauding gang shooting computers in a cloud data center. Sure, initially everything works as it used to, because the system identifies the damaged components and routes the processing to other ones. But if they keep it up, they will damage enough of them that the data center keels over and can't do what it could do before.

(No, I'm not saying that a human body is literally a data center, or literally an airplane. What I'm saying is that all three share the common theme that some process is maintained in the presence of faults.)
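
As a toy sketch of the data-center analogy above (everything here is invented for illustration, not a model of any real system): a pool of redundant nodes keeps serving its load until enough of them have failed, and from the outside nothing looks wrong until that threshold is crossed.

    # Toy sketch of the analogy above; all numbers are made up.
    import random

    NODES = 100                 # redundant units (servers, nephrons, engines...)
    CAPACITY_PER_NODE = 10
    REQUIRED_THROUGHPUT = 600   # the load the system must keep serving

    alive, step = NODES, 0
    while alive * CAPACITY_PER_NODE >= REQUIRED_THROUGHPUT:
        alive -= random.randint(0, 2)   # a few units knocked out per step
        step += 1

    print(f"served the full load for {step} steps before keeling over")
    # Until that point the system looked healthy from the outside,
    # even though its functional reserve was being depleted the whole time.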


It has nothing to do with the fuel cost. This is what Ryanair's website says about oversized luggage: "Passengers who bring an oversize bag (over 55x40x20cm) to the boarding gate will either have their bag refused or, where available, placed in the hold of the aircraft for a fee of £/€ 70.00 - £/€ 75.00."

Ryanair earns money every time they identify an oversized bag at the boarding gate. They just share some of that money with their staff.

