
For a lot of the usecases that involve summarizing some form of input data (for instance the article mentions book summaries, math walkthroughs etc), how can I trust the output to not be hallucinated? How can I reasonably judge that what it tells me is factual with respect to the input and not just made-up nonsense?

This is the problem I have with the GPT models. I don't think I can trust them for anything actually important.



For many use cases like summarization or information extraction, you can get deterministic and mostly non-creative results by adjusting the parameters (temperature, top-p, etc.). This is only possible via the API, though. And it works most reliably when you provide the whole input to be worked on ("open book" as another commenter called it). I run a task like this for Hacker Jobs [1] and am quite happy with the results so far (there is also an article detailing how it works [2]). If you ask for facts that you hope the model somehow remembers itself, it is a different story.

[1] https://www.hacker-jobs.com [2] https://marcotm.com/articles/information-extraction-with-lar...
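
For anyone who wants to try this, here is a minimal sketch of that "open book" call with the parameters mentioned above, using the official openai Python client. The model name, prompt wording, and the source_text placeholder are my own assumptions, not details of the Hacker Jobs pipeline:

    # pip install openai  (assumes OPENAI_API_KEY is set in the environment)
    from openai import OpenAI

    client = OpenAI()
    source_text = "...the full text to summarize, pasted in verbatim..."

    response = client.chat.completions.create(
        model="gpt-4",      # any chat model; pick what fits your budget
        temperature=0,      # minimize sampling randomness
        top_p=1,
        messages=[
            {"role": "system",
             "content": "Summarize only what is stated in the provided text. "
                        "If something is not in the text, do not mention it."},
            {"role": "user", "content": source_text},
        ],
    )
    print(response.choices[0].message.content)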


> ...by adjusting the parameters (temperature, top-p, etc.). This is only possible via the API, though

Not exactly true; https://platform.openai.com/playground


Yes, sorry, you're right of course. I meant that you need the more developer-oriented tooling (API, Playground) to get access to those parameters.


That uses the API as far as I’m aware.


In open-book mode it does not hallucinate. That only happens in closed-book mode. So if you put a piece of text in the prompt you can trust the summary will be factual. You can also use it for information extraction - text to JSON.
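
For illustration, a rough sketch of the text-to-JSON extraction idea with the openai Python client. The schema, field names, and example text are mine, not the parent's, and note that json.loads will fail if the model wraps its answer in extra prose:

    import json
    from openai import OpenAI

    client = OpenAI()
    posting = "Acme Corp is hiring a senior Rust engineer in Berlin; remote OK."

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Extract company, role, location and remote (true/false) "
                        "from the text. Reply with JSON only and use null for "
                        "anything the text does not state."},
            {"role": "user", "content": posting},
        ],
    )
    print(json.loads(response.choices[0].message.content))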


What are you basing this assessment on? My understanding is that it can in principle still hallucinate, though with a lower probability.


I experimented on the task of information extraction with GPT3 and 4.


I've had it hallucinate with text I've fed it. More so with 3.5 than 4, but it has happened.


> This is the problem I have with the GPT models

You absolutely should think about different kinds of models, especially for tasks that don't truly require generative output.

If all you are doing is classification, I'd grab some ML toolkit that has a time-limited model search and just take whatever it selects for you.

Binary classifiers are the epitome of inspectability. You can follow things all the way through the pipeline and figure out exactly where they went off the rails.

You can have your cake & eat it too. Perhaps you have a classification front-end that uses more deterministic techniques and then feeds into a generative back-end.
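
A sketch of the time-limited model search mentioned above, using auto-sklearn as one example toolkit (the comment doesn't name one), a placeholder dataset, and arbitrary time budgets:

    # pip install auto-sklearn scikit-learn
    import autosklearn.classification
    from sklearn.datasets import load_iris          # placeholder dataset
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Let the toolkit search for ~2 minutes, at most 30 s per candidate model.
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
    )
    automl.fit(X_train, y_train)
    print(accuracy_score(y_test, automl.predict(X_test)))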


> how can I trust the output to not be hallucinated?

You can't, not absolutely. You can have some level of confidence, like 99.99%, which is probably good enough tbh (and I'm a sceptic of these tools). Honestly, it is probably better than a human at this, on average!

But if that is a deal-killer (and it sometimes is!) then yeah, sorry - there aren't workarounds here.


99.99% seems off by orders of magnitude to me. I don't have an exact number but I routinely see GPT 3.5 hallucinate, which is inconsistent with that level of confidence.

I've noticed this discussion tends to get too theoretical too quickly. I'm uninterested in perfection; 99.99% would be good enough, 70% wouldn't. The actual number is something specific, knowable, and hopefully improving.


I think it's way better than 70%, probably 95%+ even with bad data and poor prompts. I'd have to run more numbers but it's definitely better than 70%.

You can get to 99.9%+ with good data and well designed prompts. I'm sure it would be above 90% even with almost intentionally bad prompts, tbh.


It's definitely not that good if we share a definition of poor data/prompts.

This afternoon I tried to use Codium to autocomplete some capnproto Rust code. Everything it generated was totally wrong. For example, it used member functions on non-existent structs rather than the correct free functions.

But I'll give it some credit: that's an obscure library in a less popular language.


> This afternoon I tried to use Codium to autocomplete some capnproto Rust code.

This isn't what I said at all. I was talking about summarizing data.


I don't have hard numbers, but anecdotally hallucination has gone down significantly with GPT-4. It certainly still happens, though.


True, "amount of hallucination" (very confident, but factually wrong) is probably something they can decrease in the next versions tho.

I also would not trust it with anything important, but there can be good applications for something that works 9/10 times.


Uhm - maybe train a secondary NN that scores summaries on their factual accuracy/quality? Anything under a given threshold is either sent for manual review or re-run through the LLM until it passes.
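
A sketch of that verify-and-retry loop; both summarize() and score_summary() are hypothetical stand-ins for the LLM call and the scoring model, and the threshold and retry count are arbitrary:

    THRESHOLD = 0.8      # minimum acceptable factuality score
    MAX_RETRIES = 3

    def summarize_with_check(text, summarize, score_summary):
        """Re-run the LLM until the scorer accepts the summary, else give up."""
        for _ in range(MAX_RETRIES):
            summary = summarize(text)                    # LLM call
            if score_summary(text, summary) >= THRESHOLD:
                return summary                           # accepted
        return None   # caller routes this to manual review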


Underlying answer - you can't.

Useful answer - fine-tune on a large training set, set temperature to 0, monitor token probabilities, and flag risk when probability < some threshold.
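
The token-probability part can be done with the logprobs option of the chat completions API. A rough sketch with the openai Python client; the model choice, the 0.9 cutoff, and source_text are placeholders of mine:

    import math
    from openai import OpenAI

    client = OpenAI()
    source_text = "...text to summarize..."

    response = client.chat.completions.create(
        model="gpt-4o-mini",        # any chat model that returns logprobs
        temperature=0,
        logprobs=True,
        messages=[{"role": "user",
                   "content": "Summarize this text:\n" + source_text}],
    )

    # Flag tokens whose probability fell below the threshold.
    risky = [t.token
             for t in response.choices[0].logprobs.content
             if math.exp(t.logprob) < 0.9]
    print("Low-confidence tokens:", risky)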


Doesn't the same question apply to any content you're about to read? How can you know that the blog post/article writer didn't "hallucinate"?



