
If you do enough measurements on that new prompt then I don't see why this shouldn't be a paper. People overestimate the value of "grand developments", and underestimate the value of actually knowing - in this case actually knowing how well something works, even if it is as simple as a prompt.
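
To make "enough measurements" concrete, here is a minimal sketch of what measuring a prompt, rather than just demoing it, could look like. Everything in it is hypothetical: run_prompt stands in for whatever model call you actually use, and the toy dataset only keeps the script self-contained and runnable.

    # Minimal sketch: report an accuracy with an error bar instead of a single anecdote.
    import math
    import random

    def run_prompt(question: str) -> str:
        # Placeholder for a real model call; fakes a noisy responder so the
        # script runs end to end without any API access.
        return "42" if random.random() < 0.8 else "I don't know"

    dataset = [("What is 6 * 7?", "42")] * 200  # hypothetical labelled eval set

    n = len(dataset)
    correct = sum(run_prompt(q).strip() == answer for q, answer in dataset)
    p = correct / n
    # Normal-approximation 95% confidence interval on the accuracy.
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"accuracy = {p:.3f} +/- {half_width:.3f} (n = {n})")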

Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.



> Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.

Drug trials may be expected to be somewhat reproducible.

What I don't get is how it can even be called research if it cannot be expected to be reproducible at all!

GPT is a closed source/weights, proprietary product that changes every couple of weeks or so. How can you expect a prompt to behave the same for a reasonable length of time, so that the research is even rudimentarily reproducible? And if it's not reproducible, what is it actually worth? I don't think much. It could just as well have been a fault in the research setup, or a fake.


> GPT is a closed source/weights, proprietary product that changes every couple of weeks or so.

Do you have any evidence that the weights for versioned models are being changed without notifications?


> Do you have any evidence that the weights for versioned models are being changed without notifications?

I think that in a real scientific process, it's on those who claim that the weights are not being changed to provide the evidence.


I'm sorry, but that's entirely ridiculous. You're mangling the concept of burden of proof here.

You can see this because it can easily be flipped around - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?

Aside - but even if the model weights did change, that wouldn't stop research being possible. Otherwise no drug trial could be replicated because you couldn't get the exact same participants at the exact same age.


> You can see this because it can easily be flipped around - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?

Wait a minute? The author of such a paper makes a claim about some observation that rests on the assumption that the studied model is well defined. I am disputing that claim because no such definition has been given and no evidence has been shown that one exists.

If your twist on this issue were true, then I would, by definition, have to accept everything they claim as true without any evidence. That's not called science. That's called authority.


> I am disputing that claim

You are entirely within your rights to say that the authors have assumed that openai is not lying about their models. They've probably also assumed that other paper authors are not lying in their papers.

You then say however:

> GPT is a closed source/weights, proprietary product that changes every couple of weeks or so.

And when I ask for evidence of this very specific claim, you turn around and say the burden is on me to show that you're lying. That is what is butchering the concept of burden of proof.

> If your twist on this issue were true, then I would, by definition, have to accept everything they claim as true without any evidence.

Absolutely not.


Look, the burden of proof in a scientific paper is on the authors. Not on me.

A company saying something about its proprietary product is not acceptable evidence in a scientific context. No need to allege that anyone is lying; lying is irrelevant. What's relevant is that the research is falsifiable, and it cannot be falsifiable if you don't know what the actual model is at a given point in time.


Apples and oranges comparison.

You couldn’t get the same participants, but you could get the same drugs. If you could get identical participants, that wouldn’t be very helpful since humans are so varied.

But for GPT-based papers, what you’re actually testing could change without you knowing. There’s no way to know if a paper is reproducible at all.

If you can’t reproduce results, is it really research, or just show and tell?


> If you can’t reproduce results, is it really research, or just show and tell?

You can't start from the position that clinical trials aren't perfectly reproducible and that's fine, and then say this.

> what you’re actually testing could change without you knowing

Only if people are lying about an extremely important part of their product, which they have little reason to do. But then this applies to pretty much everything. Starting with the assumption that people are lying about everything and nothing is as it seems may technically make things more reproducible, but it's going to require unbelievable effort for very little return.

> There’s no way to know if a paper is reproducible at all.

This is a little silly because these models are available extremely easily and at pay-as-you-go pricing. And again, it requires an assumption that openai is lying about a specific feature of a product.


> You can't start with a statement about clinical trials not being perfectly reproducible and that's fine, then say this.

Nobody said that to begin with. Re-read their comment.

> If people are lying about an extremely important part of their product [...]

Nobody is alleging that anyone is lying. It's just that we cannot be sure what the research actually refers to, because of the nature of a proprietary/closed model.

> This is a little silly because these models are available extremely easily and at pay-as-you-go pricing.

What does this have to do with the parent comment? I don't think it's appropriate to call anyone here silly, just because you don't like their comment and don't have good counter arguments.


> Nobody is alleging that anyone is lying.

Let's be clear, you have made an explicit claim that openai are lying.

> What does this have to do with the parent comment?

Because many other fields would kill for this level of reproducibility: grab an API key, spend a few quid running a script, and you can get the results yourself.


With the API you can choose versions fixed to a date. Are you suggesting that OpenAI is lying about these being fixed to a date?

Why would they lie about it?

The whole point of these versions is so that when you build on top of that it would keep working as you expect.
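
For anyone who hasn't used it: below is a minimal sketch of what pinning a dated snapshot looks like, assuming the official openai Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment. The snapshot name "gpt-4-0613" is only an example; which dated snapshots are available changes over time, and even a pinned model is not perfectly deterministic, so temperature=0 reduces rather than eliminates run-to-run variation.

    # Minimal sketch, not the method of any particular paper.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4-0613",  # dated snapshot, not the floating "gpt-4" alias
        temperature=0,       # reduces (but does not eliminate) run-to-run variation
        messages=[{"role": "user", "content": "Answer step by step: what is 17 * 24?"}],
    )
    print(resp.choices[0].message.content)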


I'm not saying they lie about it, but as a hypothetical there could be many reasons to lie.

- realizing their model leaks confidential information against malicious prompts

- copyright claims against them forcing them to remove bits of data from the training set

- serious "alignment" bugs that need to be fixed

- vastly improved optimization techniques that slightly affect results in 0.1% of the cases

If updating the model would save the company a couple hundred million dollars, they might want to do it. And in some of the cases, I can imagine they have an incentive to keep the update low key.


If I book telescope time and capture a supernova then no one will ever be able to reproduce my raw results because it has already happened. I don't see why OpenAI pulling old model snapshots is any different.


> If I book telescope time and capture a supernova then no one will ever be able to reproduce my raw results because it has already happened. I don't see why OpenAI pulling old model snapshots is any different.

That's why you capture multiple of them and verify your data statistically?


And ideally, if someone is proposing new prompting techniques, they should test them across both the most capable models (which are unfortunately proprietary) and the best open models.

The problem is that what works on small LLMs does not necessarily scale to larger ones. See page 35 of [1] for example. A researcher only using the models of a few years ago (where the open models had <1B parameters) could come to a completely incorrect conclusion: that language models are incapable of generalising facts learned in one language to another.

[1] https://arxiv.org/pdf/2308.03296.pdf


While this is very interesting, there are enough differences between astronomy and whatever papers this Twitter user is talking about that it's not the insight porn you think it is.

The Twitter user doesn't even reference a single specific paper, instead making some hand-wavy broad generalizations about his worst antagonists. So who really knows what he's talking about? I can't say.

If he means papers like the ones in this search - https://arxiv.org/search/?query=step+by+step+gpt4&searchtype... - they're all kind of interesting, especially https://arxiv.org/abs/2308.06834 which is the kind of "new prompt" class he's directly attacking. It is interesting because it was written by some doctors, and it's about medicine, so it has some interdisciplinary stuff that's more interesting than the computer science stuff. So I don't even agree with the premise of what the Twitter complainer is maybe complaining about, because he doesn't name a specific paper.

Anyway, to your original point, if we're comparing the research I linked and astronomy... well, they're completely different; it is totally intellectually dishonest to compare the two. Like, tell me how I use astronomy research later in product development or whatever? Maybe in building telescopes? How does observing the supernova suggest new telescopes to build in the future, without suggesting that indeed I will be reproducing the results, because I am building a new telescope to observe another such supernova? Astronomy cares very deeply about reproducibility, a different kind of reproducibility than these papers, but maybe more similar in interesting ways than the non-difference you're talking about. I'm not an astronomer, but if you want to play the insight porn game, I'd give these people the benefit of the doubt.


But at least you know the parameters of your telescope. If openai wants to update all the time, fine, but then it should work like every other piece of research software, where you can list what exact version of software you used and pull that version yourself if need be.


Stability is the purpose of the versioned models.


Not always, but sometimes we can reproduce your findings in the future - credit to gravitational lensing causing some light paths to take years longer to reach us.


You can select a static snapshot that presumably does not change, if you use the API


> You can select a static snapshot that presumably does not change, if you use the API

Sorry, I won't blindly believe a company cynical enough to call themselves "OpenAI" and then publish a commercial closed-source/closed-weights model for profit.

Evidence that they do not change without notice, or it didn't happen. Better yet, provide the source and weights for research purposes. These models could be pulled at any instant if the company sees fit or ceases to exist.


Yeah, here it comes. In these conversations you don’t need to ask very many “why”s before it just turns out that the antagonist (you) has an axe to grind about OpenAI, and has added to that their misplaced sense of expertise with regard to the typical standards of proof in academic publications.


> Yeah, here it comes. In these conversations you don’t need to ask very many “why”s before it just turns out that the antagonist (you) has an axe to grind about OpenAI, and has added to that their misplaced sense of expertise with regard to the typical standards of proof in academic publications.

Seems to have hit hard?

I would find it borderline acceptable to be offended by a user whose name has obviously been generated with a password generator if you at least provided some substance to the discussion. Just labeling someone and questioning their competence based on your hurt feelings is a bit low. Please improve.


Is that a contractual guarantee, or more of a "trust us" kind of thing?


> People overestimate the value of "grand developments", and underestimate the value of actually knowing - in this case actually knowing how well something works, even if it is as simple as a prompt.

I think this depends a lot on the "culture" of the subject area. For example, in mathematics it is common that only new results that have been thoroughly worked through are considered publish-worthy.


Wouldn’t the “thoroughly worked through” part be analogous to extensive measurements of a prompt?


Let me put it this way: you can expect that a typical good math paper means working on the problem for, I would say, half a year (often much longer). I have a feeling that most papers involving extensive measurements of prompts do not involve half a year to a year of careful

- hypothesis building

- experimental design

- doing experiments

- analyzing the experimental results

- doing new experiments

- analyzing in which sense the collected data support the hypothesis or not

- ...

work.


There's a great lesson here for marketers: the prospect can be convinced with the simple presence of graphs and data and measurements.

Even just the mere presence of data and data visuals is enough to legitimize what you're selling in the eyes of the prospect. When the prevailing religion is Scientism, data bestows that blessing of authority and legitimacy upon whatever it is you're trying to sell. Show and tell whatever conclusions you'd like from the data - the soundness of the logic supporting that conclusion is irrelevant. All that matters is you did the ritual of measuring and data-gathering and graph-ifying and putting it on display for the prospect.

There's a great book, How to Lie with Statistics, that covers this particular case and demonstrates other popular ways in which data and data visuals are manipulated to sell things.


Having worked at the famously data-driven Meta and Google, I can say this is 100% accurate.

You can turbo-boost your career by mastering the art of the “data ritual”. It doesn’t matter what the results are, the magnitude of impact, or what it cost to build and launch something. Show your results in a pretty way that looks like you did your diligence and you will be celebrated.


> and the entire value of the drug is in the measurements.

I'm not sure this is true.

While modern Adderall has a closely controlled mixture of multiple enantiomers, it hasn't always been this way.

Medicine historically didn't care nearly as much about racemic mixtures and the possibility of enantiomer-specific toxicity (e.g. thalidomide).

Many drugs in modern human history, including mixed amphetamine salts, have been marketed with very little concern for racemic purity.


Agreed. People publish papers on algorithms all the time; imagine saying "Sorry, but new C++ code is not a paper." There is a ton of space to be explored wrt prompts.

If you do the rigor on why something really is interesting, publish it.



