
For a lot of the usecases that involve summarizing some form of input data (for instance the article mentions book summaries, math walkthroughs etc), how can I trust the output to not be hallucinated? How can I reasonably judge that what it tells me is factual with respect to the input and not just made-up nonsense?

This is the problem I have with the GPT models. I don't think I can trust them for anything actually important.



For many use cases like summarization or information extraction, you can get deterministic and mostly non-creative results by adjusting the parameters (temperature, top-p, etc.). This is only possible via the API, though. And it works most reliably when you provide the whole input to be worked on ("open book" as another commenter called it). I run a task like this for Hacker Jobs [1] and am quite happy with the results so far (there is also an article detailing how it works [2]). If you ask for facts that you hope the model somehow remembers itself, it is a different story.

[1] https://www.hacker-jobs.com [2] https://marcotm.com/articles/information-extraction-with-lar...
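
For anyone who wants to try this, here is a minimal sketch of that "open book" call with the parameters mentioned above, using the official openai Python client. The model name, prompt wording, and the source_text placeholder are my own assumptions, not details of the Hacker Jobs pipeline:

    # pip install openai  (assumes OPENAI_API_KEY is set in the environment)
    from openai import OpenAI

    client = OpenAI()
    source_text = "...the full text to summarize, pasted in verbatim..."

    response = client.chat.completions.create(
        model="gpt-4",      # any chat model; pick what fits your budget
        temperature=0,      # minimize sampling randomness
        top_p=1,
        messages=[
            {"role": "system",
             "content": "Summarize only what is stated in the provided text. "
                        "If something is not in the text, do not mention it."},
            {"role": "user", "content": source_text},
        ],
    )
    print(response.choices[0].message.content)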


> ...by adjusting the parameters (temperature, top-p, etc.). This is only possible via the API, though

Not exactly true; https://platform.openai.com/playground


Yes, sorry, you're right of course. I meant that you need the more developer-oriented tooling (API, Playground) to get access to those parameters.


That uses the API as far as I’m aware.


In open-book mode it does not hallucinate. That only happens in closed-book mode. So if you put a piece of text in the prompt you can trust the summary will be factual. You can also use it for information extraction - text to JSON.
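
For illustration, a rough sketch of the text-to-JSON extraction idea with the openai Python client. The schema, field names, and example text are mine, not the parent's, and note that json.loads will fail if the model wraps its answer in extra prose:

    import json
    from openai import OpenAI

    client = OpenAI()
    posting = "Acme Corp is hiring a senior Rust engineer in Berlin; remote OK."

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Extract company, role, location and remote (true/false) "
                        "from the text. Reply with JSON only and use null for "
                        "anything the text does not state."},
            {"role": "user", "content": posting},
        ],
    )
    print(json.loads(response.choices[0].message.content))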


What are you basing this assessment on? My understanding is that it can in principle still hallucinate, though with a lower probability.


I experimented on the task of information extraction with GPT3 and 4.


I've had it hallucinate with text I've fed it. More so with 3.5 than 4, but it has happened.


> This is the problem I have with the GPT models

You absolutely should think about different kinds of models, especially for tasks that don't truly require generative output.

If all you are doing is classification, I'd grab some ML toolkit that has a time-limited model search and just take whatever it selects for you.

Binary classifiers are the epitome of inspectability. You can follow things all the way through the pipeline and figure out exactly where they went off the rails.

You can have your cake & eat it too. Perhaps you have a classification front-end that uses more deterministic techniques and then feeds into a generative back-end.
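
A sketch of the time-limited model search mentioned above, using auto-sklearn as one example toolkit (the comment doesn't name one), a placeholder dataset, and arbitrary time budgets:

    # pip install auto-sklearn scikit-learn
    import autosklearn.classification
    from sklearn.datasets import load_iris          # placeholder dataset
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Let the toolkit search for ~2 minutes, at most 30 s per candidate model.
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
    )
    automl.fit(X_train, y_train)
    print(accuracy_score(y_test, automl.predict(X_test)))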


> how can I trust the output to not be hallucinated?

You can't, not absolutely. You can have some level of confidence, like 99.99%, which is probably good enough tbh (and I'm a sceptic of these tools). Honestly, it is probably better than a human at this, on average!

But if that is a deal-killer (and it sometimes is!) then yeah, sorry - there aren't workarounds here.


99.99% seems off by orders of magnitude to me. I don't have an exact number but I routinely see GPT 3.5 hallucinate, which is inconsistent with that level of confidence.

I've noticed this discussion tends to get too theoretical too quickly. I'm uninterested in perfection; 99.99% would be good enough, 70% wouldn't. The actual number is something specific, knowable, and hopefully improving.


I think it's way better than 70%, probably 95%+ even with bad data and poor prompts. I'd have to run more numbers but it's definitely better than 70%.

You can get to 99.9%+ with good data and well designed prompts. I'm sure it would be above 90% even with almost intentionally bad prompts, tbh.


It's definitely not that good if we share a definition of poor data/prompts.

This afternoon I tried to use Codium to autocomplete some capnproto Rust code. Everything it generated was totally wrong. For example, it used member functions on non-existent structs rather than the correct free functions.

But I'll give it some credit: that's an obscure library in a less popular language.


> This afternoon I tried to use Codium to autocomplete some capnproto Rust code.

This isn't what I said at all. I was talking about summarizing data.


I don't have hard numbers, but anecdotally hallucination has gone down significantly with GPT-4. It certainly still happens, though.


True, "amount of hallucination" (very confident, but factually wrong) is probably something they can decrease in the next versions tho.

I also would not trust it with anything important, but there can be good applications for something that works 9/10 times.


Uhm - maybe train a secondary NN that scores summaries on their factual accuracy/quality? Anything under a given threshold is either sent for manual review or re-run through the LLM until it passes.
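
A sketch of that verify-and-retry loop; both summarize() and score_summary() are hypothetical stand-ins for the LLM call and the scoring model, and the threshold and retry count are arbitrary:

    THRESHOLD = 0.8      # minimum acceptable factuality score
    MAX_RETRIES = 3

    def summarize_with_check(text, summarize, score_summary):
        """Re-run the LLM until the scorer accepts the summary, else give up."""
        for _ in range(MAX_RETRIES):
            summary = summarize(text)                    # LLM call
            if score_summary(text, summary) >= THRESHOLD:
                return summary                           # accepted
        return None   # caller routes this to manual review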


Underlying answer - you can't.

Useful answer - fine-tune on a large training set, set temperature to 0, monitor token probabilities, and flag risk when probability < some threshold.
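
The token-probability part can be done with the logprobs option of the chat completions API. A rough sketch with the openai Python client; the model choice, the 0.9 cutoff, and source_text are placeholders of mine:

    import math
    from openai import OpenAI

    client = OpenAI()
    source_text = "...text to summarize..."

    response = client.chat.completions.create(
        model="gpt-4o-mini",        # any chat model that returns logprobs
        temperature=0,
        logprobs=True,
        messages=[{"role": "user",
                   "content": "Summarize this text:\n" + source_text}],
    )

    # Flag tokens whose probability fell below the threshold.
    risky = [t.token
             for t in response.choices[0].logprobs.content
             if math.exp(t.logprob) < 0.9]
    print("Low-confidence tokens:", risky)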


Doesn't the same question apply to any content you're about to read? How can you know that the blog post/article writer didn't "hallucinate"?



