> The same kind of bias keeps resurfacing in every major system: Claude, Gemini, Llama. Clearly this isn’t just an OpenAI problem, it’s an LLM problem.
It's not an LLM problem, it's a problem of how people use it. It feels natural to have a sequential conversation, so people do that, and get frustrated. A much more powerful way is parallel: ask an LLM to solve a problem. In a parallel window, repeat your question and the previous answer and ask it to outline 10 potential problems. Pick which ones appear valid and ask it to elaborate. Pick your shortlist, ask yet another LLM thread to "patch" the original reply with these criticisms, then continue the original conversation with the "patched" reply.
LLMs can't tell legitimate concerns from nonsensical ones. But if you, the user, can, they will pick it up and do all the legwork.
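For the curious, a minimal sketch of that workflow in Python, assuming the openai chat-completions client; the model name, example question, and shortlist are placeholders:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; use whatever model you have access to

def ask(messages):
    """Send one independent chat request and return the reply text."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

question = "How should I shard this Postgres table?"

# Thread 1: the original answer.
answer = ask([{"role": "user", "content": question}])

# Thread 2 (the "parallel window"): repeat the question and the answer,
# ask for potential problems.
critique = ask([{"role": "user", "content":
    f"Question: {question}\n\nProposed answer:\n{answer}\n\n"
    "Outline 10 potential problems with this answer."}])

# The human reads the critique and picks the points that look valid.
shortlist = "..."  # e.g. items 2, 5 and 7, chosen by you

# Thread 3: patch the original reply with the chosen criticisms,
# then continue the original conversation with the patched reply.
patched = ask([{"role": "user", "content":
    f"Question: {question}\n\nOriginal answer:\n{answer}\n\n"
    f"Revise the answer to address these criticisms:\n{shortlist}"}])
```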
I worked on a very early iteration of LMs (they weren't "large" yet) in grad school 20 years ago and we drove it with a Makefile. The "prompt" was an input file and it would produce a response as an artifact. It never even occurred to us to structure it as a sequential "chat" because at that point it was still too slow. But it does make me wonder how much the UX changes the way people think about it.
There is the "classic" text completion interface that OpenAI used before ChatGPT. Basically a text document that you ask the LLM to extend (or insert text at a marker somewhere in the text). Any difference between your text and the AI's text is only visible in text color in the editor and not passed on to the LLM.
That does favor GP's workflow: You start the document with a description of your problem and end with a sentence like: "The following is a proposed solution". Then you let the LLM generate text, which should be a solution. You edit that to your taste, then add the sentence: "These are the 10 biggest flaws with this plan:" and hit generate. The LLM doesn't know that it came up with the idea itself, so it isn't biased towards it.
Of course, this style is much less popular with users, and it makes things like instruction tuning much harder. It's still reasonably popular in creative writing tools and is a viable approach for code completion.
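A rough sketch of that document-extension style, assuming the legacy OpenAI completions endpoint (the model and problem text are placeholders); the key point is that there is only one shared document, so the model can't tell which parts it wrote:

```python
from openai import OpenAI

client = OpenAI()
COMPLETION_MODEL = "gpt-3.5-turbo-instruct"  # placeholder completion model

doc = (
    "Problem: our nightly batch job times out once the input exceeds 2 GB.\n"
    "The following is a proposed solution:\n"
)

# First generation: the model simply extends the document with a solution.
out = client.completions.create(model=COMPLETION_MODEL, prompt=doc,
                                max_tokens=400)
doc += out.choices[0].text

# The human edits the solution in place, then appends a new lead-in.
doc += "\n\nThese are the 10 biggest flaws with this plan:\n"

# Second generation: the model critiques "the plan" with no idea it wrote it.
out = client.completions.create(model=COMPLETION_MODEL, prompt=doc,
                                max_tokens=400)
doc += out.choices[0].text
```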
ChatGPT is how old again? People are FAR more familiar with other interfaces. For coding, autocomplete is a great already-existing interface; products that use it don't get as much hype, though, as the ones that claim to be independent agents that you're talking to. There's any number of common interfaces attached to that (like the "simplify this" right-click for Copilot) for refactoring, dealing with builds, tests, etc. No shortage of places you could further drop in an LLM instead of pushing things primarily through "chat with me" to type out "refactor this to make these changes".
Or you could make the person's provided workflow not just more automatic but more integrated: generate the output, then have labels with hover text or inline overlays along the lines of "this does this", "here are alternative ways to do this", or "this might be an issue with this approach." All of this could be done much better in a rich graphical user interface than by slamming it into a chat log. (This is one of Cursor's biggest edges over ChatGPT - the interactive change highlighting and approval in my tool, in my repo, vs a chat interface.)
In some other fields:
* email summarization is automatic or available at the press of a button, nobody expects you to open up a chat agent and go "please summarize this email" after opening a message in Gmail
* photo editors let you use the mouse to select an area and then click a button labeled "remove object" or such instead of requiring you to try to describe the edit in a chat box. sometimes they mix and match it too - highlight the area THEN describe a change. But that's approximately a million times better than trying to chat to it to describe the area precisely.
There are other scenarios we haven't figured out the best interface for because they're newer workflows. But the chat interface is just so unimaginative. For instance, I spent a long time trying to craft the right prompt to tweak the output of ChatGPT turning a picture of my cat into a human. I couldn't find the right words to get it to understand and execute what I didn't like about the image. I'm not a UX inventor, but one simple thing that would've helped would've been an eye-doctor-style "here are two options, click the one you like more." (Photoshop has something like this, but it's not so directed, it's more just "choose one of these, or re-roll", but at least it avoids polluting the chat context history as much.) Or let me select particular elements and change or refine them individually.
A more structured interface should actually greatly help the model, too. Instead of having just a linear chat history to digest, it would have well-tagged and categorized feedback that it could keep fresh and re-insert into its prompts behind the scenes continually. (You could also try to do this based on the textual feedback, but like I said, it seemed to not be understanding what my words were trying to get at. Giving words as feedback on a picture just seems fundamentally high-loss.)
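One way to picture that: instead of burying feedback in a linear chat log, the tool could keep it as structured records and quietly re-serialize them into the prompt each turn. A hypothetical sketch (the field names and values are made up):

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    target: str   # which element of the output, e.g. "jawline"
    verdict: str  # "keep", "change", or "reject"
    note: str     # optional free-text refinement

feedback_log = [
    Feedback("eye color", "keep", ""),
    Feedback("jawline", "change", "less angular"),
    Feedback("background", "reject", "too busy"),
]

def render_feedback(log):
    """Serialize the accumulated feedback into a section the tool can
    prepend to the prompt behind the scenes on every regeneration."""
    lines = [f"- {f.target}: {f.verdict}" + (f" ({f.note})" if f.note else "")
             for f in log]
    return "Standing user feedback:\n" + "\n".join(lines)
```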
I find it hard to believe that there is any single field where a chat interface is going to be the gold standard. But: they're relatively easy to make and they let you present your model as a persona. Hard combo to overcome, though we're seeing some good signs!
> It's not an LLM problem, it's a problem of how people use it.
True, but perhaps not for the reasons you might think.
> It feels natural to have a sequential conversation, so people do that, and get frustrated. A much more powerful way is parallel: ask LLM to solve a problem.
LLMs do not "solve a problem." They are statistical text (token) generators whose response is entirely dependent upon the prompt given.
> LLMs can't tell legitimate concerns from nonsensical ones.
Again, because LLM algorithms are very useful general purpose text generators. That's it. They cannot discern "legitimate concerns" because they do not possess the ability to do so.
Right, or at any rate, the problems they do solve are ones of document-construction, which may sometimes resemble a different problem humans are thinking of... but isn't actually being solved.
For example, an LLM might take the string "2+2=" and give you "2+2=4", but it didn't solve a math problem, it solved a "what would usually get written here" problem.
> Right, or at any rate, the problems they do solve are ones of document-construction, which may sometimes resemble a different problem humans are thinking of... but isn't actually being solved.
This is such a great way to express the actuality in a succinct manner.
You're saying roughly "you can't trust the first answer from an LLM, but if you run it through enough times, the results will converge on something good". This, plus all the hoo-hah about prompt engineering, seems like a clear signal that the "AI" in LLMs is not actually very intelligent (yet). It confirms the criticism.
Not exactly. Let's say, you-the-human are trying to fix a crash in the program knowing just the source location. You would look at the code and start hypothesizing:
* Maybe, it's because this pointer is garbage.
* Maybe, it's because that function doesn't work as the name suggests.
* HANG ON! This code doesn't check the input size, that's very fishy. It's probably the cause.
So, once you get that "hang on" moment, here comes the boring part of setting breakpoints, verifying values, rechecking observations, and finally fixing that thing.
LLMs won't get the "hang on" part right, but once you point it out to them, they will cut through the boring routine like no tomorrow. And you can also spin up 3 instances to investigate 3 hypotheses and give you some readings on a silver platter. But you-the-human need to be calling the shots.
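A sketch of that "spin up 3 instances" step, assuming an ask() helper wrapping any chat API and a source_snippet string holding the crashing code (both placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

# ask() is any chat-completion helper; source_snippet is the crashing code.
hypotheses = [
    "The pointer at the crash site is garbage.",
    "That helper function does not do what its name suggests.",
    "The input size is never checked before the copy.",
]

def investigate(hypothesis):
    return ask(
        "Here is the crashing function:\n" + source_snippet + "\n\n"
        f"Assume this hypothesis: {hypothesis}\n"
        "List the exact variables to inspect and breakpoints to set "
        "to confirm or rule it out.")

with ThreadPoolExecutor() as pool:
    readings = list(pool.map(investigate, hypotheses))

# The human reads the three reports and decides which hypothesis to chase.
```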
You can make a better tool by training the service (some of which involves training the model, some of which involves iterating on the prompt(s) behind the scenes) to get a lot of the iteration out of the way. Instead of users having to fill in a detailed prompt, we now have "reasoning" models which, as their first step, dump out a bunch of probably-relevant background info to try to push the next tokens in the right direction. A logical next step, if enough people run into the OP's issue here, is to have it run that "criticize this and adjust" loop internally.
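If a provider did internalize that loop, it could look roughly like this (a sketch only; ask() stands for any helper that sends a prompt and returns the reply):

```python
# Sketch of moving the critique-and-adjust loop inside the service, so the
# user only sees the final answer. ask() is any chat-completion helper.
def answer_with_internal_review(question, rounds=2):
    draft = ask(question)
    for _ in range(rounds):
        critique = ask(
            f"Question: {question}\n\nDraft answer:\n{draft}\n\n"
            "List the most serious problems with this draft.")
        draft = ask(
            f"Question: {question}\n\nDraft answer:\n{draft}\n\n"
            f"Known problems:\n{critique}\n\n"
            "Rewrite the answer to address them.")
    return draft
```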
But it all makes it very hard to tell how much of the underlying "intelligence" is improving vs how much of the human scaffolding around it is improving.
Yeah given the stochastic nature of LLM outputs this approach and the whole field of prompt engineering feels like a classic case of cargo cult science.