> The same kind of bias keeps resurfacing in every major system: Claude, Gemini, Llama. Clearly this isn’t just an OpenAI problem, it’s an LLM problem.
It's not an LLM problem, it's a problem of how people use it. It feels natural to have a sequential conversation, so people do that, and get frustrated. A much more powerful way is parallel: ask an LLM to solve a problem. In a parallel window, repeat your question and the previous answer and ask it to outline 10 potential problems. Pick which ones appear valid and ask it to elaborate. Pick your shortlist, ask yet another LLM thread to "patch" the original reply with these criticisms, then continue the original conversation with the "patched" reply.
LLMs can't tell legitimate concerns from nonsensical ones. But if you, the user, can, they will pick it up and do all the legwork.
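For the curious, a minimal sketch of that workflow in Python, assuming the openai chat-completions client; the model name, example question, and shortlist are placeholders:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; use whatever model you have access to

def ask(messages):
    """Send one independent chat request and return the reply text."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

question = "How should I shard this Postgres table?"

# Thread 1: the original answer.
answer = ask([{"role": "user", "content": question}])

# Thread 2 (the "parallel window"): repeat the question and the answer,
# ask for potential problems.
critique = ask([{"role": "user", "content":
    f"Question: {question}\n\nProposed answer:\n{answer}\n\n"
    "Outline 10 potential problems with this answer."}])

# The human reads the critique and picks the points that look valid.
shortlist = "..."  # e.g. items 2, 5 and 7, chosen by you

# Thread 3: patch the original reply with the chosen criticisms,
# then continue the original conversation with the patched reply.
patched = ask([{"role": "user", "content":
    f"Question: {question}\n\nOriginal answer:\n{answer}\n\n"
    f"Revise the answer to address these criticisms:\n{shortlist}"}])
```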
I worked on a very early iteration of LMs (they weren't "large" yet) in grad school 20 years ago and we drove it with a Makefile. The "prompt" was an input file and it would produce a response as an artifact. It never even occurred to us to structure it as a sequential "chat" because at that point it was still too slow. But it does make me wonder how much the UX changes the way people think about it.
There is the "classic" text completion interface that OpenAI used before ChatGPT. Basically a text document that you ask the LLM to extend (or insert text at a marker somewhere in the text). Any difference between your text and the AI's text is only visible in text color in the editor and not passed on to the LLM.
That does favor GP's workflow: You start the document with a description of your problem and end with a sentence like: "The following is a proposed solution". Then you let the LLM generate text, which should be a solution. You edit that to your taste, then add the sentence: "These are the 10 biggest flaws with this plan:" and hit generate. The LLM doesn't know that it came up with the idea itself, so it isn't biased towards it.
Of course, this style is much less popular with users, and it makes things like instruction tuning much harder. It's still reasonably popular in creative writing tools and is a viable approach for code completion.
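A rough sketch of that document-extension style, assuming the legacy OpenAI completions endpoint (the model and problem text are placeholders); the key point is that there is only one shared document, so the model can't tell which parts it wrote:

```python
from openai import OpenAI

client = OpenAI()
COMPLETION_MODEL = "gpt-3.5-turbo-instruct"  # placeholder completion model

doc = (
    "Problem: our nightly batch job times out once the input exceeds 2 GB.\n"
    "The following is a proposed solution:\n"
)

# First generation: the model simply extends the document with a solution.
out = client.completions.create(model=COMPLETION_MODEL, prompt=doc,
                                max_tokens=400)
doc += out.choices[0].text

# The human edits the solution in place, then appends a new lead-in.
doc += "\n\nThese are the 10 biggest flaws with this plan:\n"

# Second generation: the model critiques "the plan" with no idea it wrote it.
out = client.completions.create(model=COMPLETION_MODEL, prompt=doc,
                                max_tokens=400)
doc += out.choices[0].text
```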
ChatGPT is how old again? People are FAR more familiar with other interfaces. For coding, autocomplete is a great already-existing interface; products that use it don't get as much hype, though, as the ones that claim to be independent agents that you're talking to. There's any number of common interfaces attached to that (like the "simplify this" right-click for Copilot) for refactoring, dealing with builds, tests, etc. No shortage of places you could further drop in an LLM instead of pushing things primarily through "chat with me" to type out "refactor this to make these changes".
Or you could make the person's provided workflow not just more automatic but more integrated: generate the output, then have labels with hover text or inline overlays along the lines of "this does this", "here are alternative ways to do this", or "this might be an issue with this approach." All of this could be done much better in a rich graphical user interface than by slamming it into a chat log. (This is one of Cursor's biggest edges over ChatGPT - the interactive change highlighting and approval in my tool, in my repo, vs a chat interface.)
In some other fields:
* email summarization is automatic or available at the press of a button, nobody expects you to open up a chat agent and go "please summarize this email" after opening a message in Gmail
* photo editors let you use the mouse to select an area and then click a button labeled "remove object" or such instead of requiring you to try to describe the edit in a chat box. sometimes they mix and match it too - highlight the area THEN describe a change. But that's approximately a million times better than trying to chat to it to describe the area precisely.
There are other scenarios we haven't figured out the best interface for because they're newer workflows. But the chat interface is just so unimaginative. For instance, I spent a long time trying to craft the right prompt to tweak the output of ChatGPT turning a picture of my cat into a human. I couldn't find the right words to get it to understand and execute what I didn't like about the image. I'm not a UX inventor, but one simple thing that would've helped would've been an eye-doctor-style "here are two options, click the one you like more." (Photoshop has something like this, but it's not so directed, it's more just "choose one of these, or re-roll", but at least it avoids polluting the chat context history as much.) Or let me select particular elements and change or refine them individually.
A more structured interface should actually greatly help the model, too. Instead of having just a linear chat history to digest, it would have well-tagged and categorized feedback that it could keep fresh and re-insert into its prompts behind the scenes continually. (You could also try to do this based on the textual feedback, but like I said, it seemed to not be understanding what my words were trying to get at. Giving words as feedback on a picture just seems fundamentally high-loss.)
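One way to picture that: instead of burying feedback in a linear chat log, the tool could keep it as structured records and quietly re-serialize them into the prompt each turn. A hypothetical sketch (the field names and values are made up):

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    target: str   # which element of the output, e.g. "jawline"
    verdict: str  # "keep", "change", or "reject"
    note: str     # optional free-text refinement

feedback_log = [
    Feedback("eye color", "keep", ""),
    Feedback("jawline", "change", "less angular"),
    Feedback("background", "reject", "too busy"),
]

def render_feedback(log):
    """Serialize the accumulated feedback into a section the tool can
    prepend to the prompt behind the scenes on every regeneration."""
    lines = [f"- {f.target}: {f.verdict}" + (f" ({f.note})" if f.note else "")
             for f in log]
    return "Standing user feedback:\n" + "\n".join(lines)
```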
I find it hard to believe that there is any single field where a chat interface is going to be the gold standard. But: they're relatively easy to make and they let you present your model as a persona. Hard combo to overcome, though we're seeing some good signs!
> It's not an LLM problem, it's a problem of how people use it.
True, but perhaps not for the reasons you might think.
> It feels natural to have a sequential conversation, so people do that, and get frustrated. A much more powerful way is parallel: ask LLM to solve a problem.
LLMs do not "solve a problem." They are statistical text (token) generators whose response is entirely dependent upon the prompt given.
> LLMs can't tell legitimate concerns from nonsensical ones.
Again, because LLM algorithms are very useful general purpose text generators. That's it. They cannot discern "legitimate concerns" because they do not possess the ability to do so.
Right, or at any rate, the problems they do solve are ones of document-construction, which may sometimes resemble a different problem humans are thinking of... but isn't actually being solved.
For example, an LLM might take the string "2+2=" and give you "2+2=4", but it didn't solve a math problem, it solved a "what would usually get written here" problem.
> Right, or at any rate, the problems they do solve are ones of document-construction, which may sometimes resemble a different problem humans are thinking of... but isn't actually being solved.
This is such a great way to express the actuality in a succinct manner.
You're saying roughly "you can't trust the first answer from an LLM, but if you run it through enough times, the results will converge on something good". This, plus all the hoo-hah about prompt engineering, seems like a clear signal that the "AI" in LLMs is not actually very intelligent (yet). It confirms the criticism.
Not exactly. Let's say, you-the-human are trying to fix a crash in the program knowing just the source location. You would look at the code and start hypothesizing:
* Maybe, it's because this pointer is garbage.
* Maybe, it's because that function doesn't work as the name suggests.
* HANG ON! This code doesn't check the input size, that's very fishy. It's probably the cause.
So, once you get that "hang on" moment, here comes the boring part of setting breakpoints, verifying values, rechecking observations, and finally fixing that thing.
LLMs won't get the "hang on" part right, but once you point it out to them, they will cut through the boring routine like no tomorrow. And you can also spin up 3 instances to investigate 3 hypotheses and give you some readings on a silver platter. But you-the-human need to be calling the shots.
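A sketch of that "spin up 3 instances" step, assuming an ask() helper wrapping any chat API and a source_snippet string holding the crashing code (both placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

# ask() is any chat-completion helper; source_snippet is the crashing code.
hypotheses = [
    "The pointer at the crash site is garbage.",
    "That helper function does not do what its name suggests.",
    "The input size is never checked before the copy.",
]

def investigate(hypothesis):
    return ask(
        "Here is the crashing function:\n" + source_snippet + "\n\n"
        f"Assume this hypothesis: {hypothesis}\n"
        "List the exact variables to inspect and breakpoints to set "
        "to confirm or rule it out.")

with ThreadPoolExecutor() as pool:
    readings = list(pool.map(investigate, hypotheses))

# The human reads the three reports and decides which hypothesis to chase.
```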
You can make a better tool by training the service (some of which involves training the model, some of which involves iterating on the prompt(s) behind the scenes) to get a lot of the iteration out of the way. Instead of users having to fill in a detailed prompt, we now have "reasoning" models which, as their first step, dump out a bunch of probably-relevant background info to try to push the next tokens in the right direction. A logical next step, if enough people run into the OP's issue here, is to have it run that "criticize this and adjust" loop internally.
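If a provider did internalize that loop, it could look roughly like this (a sketch only; ask() stands for any helper that sends a prompt and returns the reply):

```python
# Sketch of moving the critique-and-adjust loop inside the service, so the
# user only sees the final answer. ask() is any chat-completion helper.
def answer_with_internal_review(question, rounds=2):
    draft = ask(question)
    for _ in range(rounds):
        critique = ask(
            f"Question: {question}\n\nDraft answer:\n{draft}\n\n"
            "List the most serious problems with this draft.")
        draft = ask(
            f"Question: {question}\n\nDraft answer:\n{draft}\n\n"
            f"Known problems:\n{critique}\n\n"
            "Rewrite the answer to address them.")
    return draft
```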
But it all makes it very hard to tell how much of the underlying "intelligence" is improving vs how much of the human scaffolding around it is improving.
Yeah given the stochastic nature of LLM outputs this approach and the whole field of prompt engineering feels like a classic case of cargo cult science.