If you do enough measurements on that new prompt then I don't see why this shouldn't be a paper. People overestimate the value of "grand developments", and underestimate the value of actually knowing - in this case actually knowing how well something works, even if it is as simple as a prompt.
Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.
> Compare with drug trials: Adderall only differs from regular amphetamine in the relative concentration of enantiomers, and the entire value of the drug is in the measurements.
Drug trials may be expected to be somewhat reproducible.
What I don't get is how it can even be called research if it cannot be expected to be reproducible at all!
GPT is a closed source/weights, proprietary product that changes every couple of weeks or so. How can you expect a prompt to behave the same for a reasonable length of time, long enough for the research to be even rudimentarily reproducible? And if it's not reproducible, what is it actually worth? I don't think much. It could just as well have been a fault in the research setup, or a fake.
I'm sorry, but that's entirely ridiculous. You're mangling the concept of burden of proof here.
You can easily see this because it can be flipped around - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?
Aside - but even if the model weights did change, that wouldn't stop research being possible. Otherwise no drug trial could be replicated because you couldn't get the exact same participants at the exact same age.
> You can easily see this because it can be flipped around - you made a claim that they are being changed, even every few weeks! Should it really be on me to show that your very specific claim is false?
Wait a minute? The author of such a paper makes a claim about some observation, based on the assumption that the studied model is defined in some way. I am disputing that claim, since no evidence has been shown that it is defined, because no definition has been given.
If your twist on this issue were true, then I would, by definition, have to accept everything that they claim as true without any evidence. That's not called science. That's called authority.
You are entirely within your rights to say that the authors have assumed that openai is not lying about their models. They've probably also assumed that other paper authors are not lying in their papers.
You then say however:
> GPT is a closed source/weights, proprietary product that changes every couple of weeks or so.
And when I ask for evidence of this very specific claim, you turn around and say the burden is on me to show that you're lying. That is what is butchering the concept of burden of proof.
> If your twist on this issue were true, then I would, by definition, have to accept everything that they claim as true without any evidence.
Look, the burden of proof in a scientific paper is on the authors. Not on me.
A statement from a company about its own proprietary product is not acceptable evidence in a scientific context. No need to allege that anyone is lying. Lying is irrelevant. What's relevant is that the research is falsifiable. It cannot be falsifiable if you don't know what the actual model is at a given point in time.
You couldn’t get the same participants, but you could get the same drugs.
If you could get identical participants, that wouldn’t be very helpful since humans are so varied.
But for GPT based papers, what you’re actually testing could change without you knowing.
There’s no way to know if a paper is reproducible at all.
If you can’t reproduce results, is it really research, or just show and tell?
> If you can’t reproduce results, is it really research, or just show and tell?
You can't start by saying that clinical trials aren't perfectly reproducible and that's fine, and then say this.
> what you’re actually testing could change without you knowing
If people are lying about an extremely important part of their product, which they have little reason to do. But then this applies to pretty much everything. Starting with the assumption that people are lying about everything and nothing is as it seems may technically make things more reproducible, but it's going to require unbelievable effort for very little return.
> There’s no way to know if a paper is reproducible at all.
This is a little silly because these models are available extremely easily and at a pay-as-you-go pricing. And again, it requires an assumption that openai is lying about a specific feature of a product.
> You can't start by saying that clinical trials aren't perfectly reproducible and that's fine, and then say this.
Nobody said that to begin with. Re-read their comment.
> If people are lying about an extremely important part of their product [...]
Nobody is alleging that anyone is lying. It's just that we cannot be sure what the research actually refers to, because of the nature of a proprietary/closed model.
> This is a little silly because these models are available extremely easily and at a pay-as-you-go pricing.
What does this have to do with the parent comment? I don't think it's appropriate to call anyone here silly, just because you don't like their comment and don't have good counter arguments.
Let's be clear, you have made an explicit claim that openai are lying.
> What does this have to do with the parent comment?
Because many other fields would kill for this level of reproducibility: grab an API key, spend a few quid running a script, and you can get the results yourself.
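Concretely, a minimal sketch of that script (the pinned snapshot name "gpt-4-0613", the toy prompt, and the "contains 42" check are placeholders for whatever a given paper actually used):

```python
# Minimal reproduction sketch: re-run one prompt many times against a pinned
# model snapshot via the chat completions HTTP endpoint and report a hit rate.
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def run_prompt(prompt: str, model: str = "gpt-4-0613", n_trials: int = 20) -> float:
    """Send the same prompt n_trials times and report how often the reply
    contains the expected answer (a stand-in for a paper's real metric)."""
    hits = 0
    for _ in range(n_trials):
        resp = requests.post(API_URL, headers=HEADERS, json={
            "model": model,  # a pinned snapshot, not the floating "gpt-4" alias
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        })
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        hits += "42" in answer  # toy check; substitute the paper's own scoring
    return hits / n_trials

if __name__ == "__main__":
    print(run_prompt("What is 6 * 7? Reply with just the number."))
```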
I'm not saying they lie about it, but as a hypothetical there could be many reasons to lie.
- realizing their model leaks confidential information against malicious prompts
- copyright claims against them forcing them to remove bits of data from the training set
- serious "alignment" bugs that need to be fixed
- vastly improved optimization techniques that slightly affect results in 0.1% of the cases
If updating the model would save the company a couple hundred million dollars, they might want to do it. And in some of the cases, I can imagine they have an incentive to keep the update low key.
If I book telescope time and capture a supernova then no one will ever be able to reproduce my raw results because it has already happened. I don't see why OpenAI pulling old model snapshots is any different.
> If I book telescope time and capture a supernova then no one will ever be able to reproduce my raw results because it has already happened. I don't see why OpenAI pulling old model snapshots is any different.
That's why you capture multiple of them and verify your data statistically?
And ideally if someone is proposing new prompting techniques they should test it across both the most capable models (which are unfortunately proprietary) and the best open models.
The problem is that what works on small LLMs does not necessarily scale to larger ones. See page 35 of [1] for example. A researcher only using the models of a few years ago (where the open models had <1B parameters) could come to a completely incorrect conclusion: that language models are incapable of generalising facts learned in one language to another.
While this is very interesting, there are enough differences between astronomy and whatever papers this Twitter user is talking about that it's not the insight porn you think it is.
The Twitter user doesn't even reference a single specific paper, kind of making some hand-wavy broad generalizations about his worst antagonists. So who really knows what he's talking about? I can't say.
If he means papers like the ones in this search - https://arxiv.org/search/?query=step+by+step+gpt4&searchtype... - they're all kind of interesting, especially https://arxiv.org/abs/2308.06834 which is the kind of "new prompt" class he's directly attacking. It is interesting because it was written by some doctors, and it's about medicine, so it has some interdisciplinary stuff that's more interesting than the computer science stuff. So I don't even agree with the premise of what the Twitter complainer is maybe complaining about, because he doesn't name a specific paper.
Anyway, to your original point, if we're comparing the research I linked and astronomy... well, they're completely different, it is totally intellectually dishonest to compare the two. Like tell me how I use astronomy research later in product development or whatever? Maybe in building telescopes? How does observing the supernova suggest new telescopes to build in the future, without suggesting that indeed, I will be reproducing the results, because I am building a new telescope to observe another such supernova? Astronomy cares very deeply about reproducibility, a different kind of reproducibility than these papers, but maybe more the same in interesting ways than the non-difference you're talking about. I'm not an astronomer, but if you want to play the insight porn game, I'd give these people a benefit of the doubt.
But you know the parameters of your telescope, at least. If openai wants to update all the time, fine, but then they should work like every other piece of research software, where you can list the exact version of the software you used and pull that version yourself if need be.
> You can select a static snapshot that presumably does not change, if you use the API
Sorry, I won't blindly believe a company that is cynical enough to call itself "OpenAI" and then publish a commercial closed source/weights model for profit.
Show evidence that the models do not change without notice, or it didn't happen. Better yet, provide the source and weights for research purposes. These models could be pulled at any instant if the company sees fit or ceases to exist.
Yeah, here it comes. In these conversations you don’t need to ask very many “why”s before it just turns out that the antagonist (you) has an axe to grind about OpenAI, and has added to that their misplaced sense of expertise with regard to the typical standards of proof in academic publications.
> Yeah, here it comes. In these conversations you don’t need to ask very many “why”s before it just turns out that the antagonist (you) has an axe to grind about OpenAI, and has added to that their misplaced sense of expertise with regard to the typical standards of proof in academic publications.
Seems to have hit hard?
I would find it borderline acceptable to be offended by a user whose name has obviously been generated with a password generator if you could at least provide some substance to the discussion. Just labeling someone and questioning their competence based on your hurt feelings is a bit low. Please improve.
> People overestimate the value of "grand developments", and underestimate the value of actually knowing - in this case actually knowing how well something works, even if it is as simple as a prompt.
I think this depends a lot on the "culture" of the subject area. For example, in mathematics it is common that only new results that have been thoroughly worked through are considered "publish-worthy".
Let me put it this way: you can expect that a typical good math paper means working on the problem for, I would say, half a year (often much longer). I have a feeling that most papers that involve extensive measurements of prompts do not involve 1/2 to 1 year of careful
- hypothesis building
- experimental design
- doing experiments
- analyzing the experimental results
- doing new experiments
- analyzing in which sense the collected data support the hypothesis or not
There's a great lesson here for marketers: the prospect can be convinced with the simple presence of graphs and data and measurements.
Even just the mere presence of data and data visuals is enough to legitimize what you're selling in the eyes of the prospect. When the prevailing religion is Scientism, data bestows that blessing of authority and legitimacy upon whatever it is you're trying to sell. Show and tell whatever conclusions you'd like from the data - the soundness of the logic supporting that conclusion is irrelevant. All that matters is you did the ritual of measuring and data-gathering and graph-ifying and putting it on display for the prospect.
There's a great book, How to Lie with Statistics, that covers this particular case, but demonstrates other popular ways in which data and data visuals are manipulated to sell things.
Having worked at famously data driven Meta and Google, this is 100% accurate.
You can turbo boost your career by mastering the art of “data ritual”. It doesn’t matter what the results are or magnitude of impact or what it cost to build and launch something. Show your results in a pretty way that looks like you did your diligence and you will be celebrated.
Agreed. People publish papers on algorithms all the time; imagine saying "Sorry, but new C++ code is not a paper". There is a ton of space to be explored wrt prompts.
If you do the rigor on why something really is interesting, publish it.
I feel this has nothing at all to do with LLMs and more to do with academic incentives in general. Focusing on quality over quantity won't advance your career. Publishing lots of new papers will, as long as they meet the minimum threshold to be accepted into whatever journal or conference you are aiming for. Having one good paper won't increase your h-index; three mediocre papers might.
Doubly so when there's a new breakthrough, where one of your low-effort papers might end up being the first saying something obvious that ends up being really important. Because then everyone will end up quoting your paper in perpetuity.
Being dismissive about this tweet or agreeing with the author is one thing. But everyone should be aware that the absolute minimum bar for a scientific paper can be much lower than a new prompt for GPT-4.
That’s like saying a biologist studying an endangered species isn’t doing science because the animal could disappear tomorrow. The permanence of a subject has no bearing on whether it is science or not.
The idea that science has to happen in a lab is of course absurd as well.
> That’s like saying a biologist studying an endangered species isn’t doing science because the animal could disappear tomorrow. The permanence of a subject has no bearing on whether it is science or not.
> The idea that science has to happen in a lab is of course absurd as well.
The main point here is that anyone can likely start studying those endangered species and try to reproduce the results, while with GPT4 it is not possible at all.
The lab point is related to the fact that we are talking about software here.
That is not reproducible research.
To reproduce the research, you need training data, source code and all the parameters.
With black-box generated models, there is no way to tell how they have actually been generated. That has no value for the science of how to improve them further.
In the case of an endangered species a biologist would still have access to take samples from it and inspect it. Science doesn't have to happen in a lab but it's questionable to call something science when it involves hitting a black box endpoint which can change the underlying models and behaviors at a whim.
While I think the twitter post author is being a bit of an ass, they’re sort of right about the overvaluing we’ve put on simply better prompts. I wrote an opinionated GitHub gist about this exact issue:
I do the whole NLP publishing thing and I’ve hesitated to “write a paper” about applying techniques already known and used everywhere in the stable diffusion community to NLP models. That said, the AI community loves to pretend like they discovered something, such as a recent paper purporting to be the first to do “concept slider” LoRAs, despite these having existed on Civit.ai for months before that work was published. The authors of course didn’t cite those already existing models.
Everyone chasing citations and clout hard right now because these professors and researchers realize that they only have 5-10 years before AI eats their jobs and most other white collar jobs. I don’t blame them. I want my mortgage paid off before I’m automated away!
The current scientific research apparatus is more about being first than about being correct or thorough. A paper that gets out early means more citations, and many of the faculty sit on the editorial boards, and are able to suggest/enforce specific citations during the review process. Academics aren't fully to blame for this, it's just how the incentives are set up in the system. Tenure and promotions are increasingly based on h-index; a measure of impact based largely on the number of citations.
It's hard to estimate the impact of an idea in the same way that you can estimate the impact of an investment (stock is a number that goes up or down). You're right that the current incentive system might be to blame, but a simplistic metric will be gamed just as easily - what would you propose?
Honestly, I don't think any metric can fix it, and don't have any easy solutions to this. The problem is larger than academia, more endemic to society. The root cause is society's values have changed. Previously, prestige mattered for something. Now, people would rather listen to pop stars than learned individuals, and wealth is the only metric that matters.
As a result, typical professions that used to confer prestige, and for which prestige was supposed to be just reward, such as a professor, a medical doctor, a judge, are now mainly pursued for pecuniary reasons (money, job security). And because they're not doing it for prestige, they don't necessarily care about being right/correct. Playing the game to maximize the revenue streams is paramount. I happen to know a number of faculty who are quite proud of their multiple revenue streams. This would be unthinkable for an academic 50 years ago.
For code generation, GPT4 is getting beat by the small prompt library LATS wrapped around GPT3.5. Given the recent release of MagicCoder / Instruct-OSS, that means a small prompt library + a small 7B model you can self-host beats the much fancier GPT4.
Similar to when simple NNs destroyed a decade of Bayesian modeling theses & research programs, it's frustrating for folks going other paths. But it doesn't make the work 'wrong'.
I’m not quite sure how to translate leaderboards like these into actual utility, but it certainly feels like “good enough” is only going to get more accessible and I agree with what I think is your broader point - more sophisticated techniques will make small, affordable, self-hostable models viable in their own right.
I’m optimistic we’re on a path where further improvement isn’t totally dependent on just throwing money at more parameters.
Ah you're right, LATS GPT3.5 is 84 while standalone GPT4 is 87
Given standalone GPT3.5 is "just" 48.. it's less about beating and more about meeting
RE:Good Enough & Feel... very much agreed. I find it very task dependent!
For example, GPT4 is 'good enough' that developers are comfortable copy-pasting & trying, even vs stack overflow results. We haven't seen LATS+MagicCoder yet, but as MagicCoder 7b already meets+exceeds GPT3.5 for HumanEval, there's a plausible hope for agent-aided GPT4-grade tools being always-on for all coding tasks, and sooner vs later. We made that bet for Louie.AI's interactive analyst interface, and as each month passes, evidence mounts. We can go surprisingly far with GPT3.5 before wanting to switch to GPT4 for this kind of interaction scenario.
Conversely... I've yet to see a true long-running autonomous coding autoGPT where the error rate doesn't kill it. We're experimenting with design partners on directions here -- think autonomous investigations etc -- but there's more on the advanced fringe and with special use cases, guard rails, etc. For most of our users and use cases... we're able to more reliably deliver -- today -- on the interactive scenarios with smaller snippets.
This right here. I feel like the focus on just throwing more GPU at the problem is a mistake many of these companies are making at the moment. The real breakthroughs will come when we figure out how to use the current models and compute power more efficiently. If it’s prompt engineering that leads to this breakthrough, so be it.
ArXiv is kinda reputation-based. That is, to submit something you need to be endorsed, which is either done automatically based on your institution, or by asking established authors. After being endorsed for a subject area, you can submit freely to it, keeping in mind:
>Submissions to arXiv are subject to a moderation process that classifies material as topical to the subject area and checks for scholarly value. Material is not peer-reviewed by arXiv - the contents of arXiv submissions are wholly the responsibility of the submitter and are presented “as is” without any warranty or guarantee.
"Articles" on arXiv are not peer-reviewed, they just check whether it looks like it belongs to one of the categories they hosts:
"Registered users may submit articles to be announced by arXiv. There are no fees or costs for article submission. Submissions to arXiv are subject to a moderation process that classifies material as topical to the subject area and checks for scholarly value. Material is not peer-reviewed by arXiv - the contents of arXiv submissions are wholly the responsibility of the submitter and are presented “as is” without any warranty or guarantee." [0]
They are commonly known as pre-prints, in a similar fashion to IACR ePrint [1] for cryptography.
Nah, this is just an early example of many "this is too easy it doesn't count" defensive human arguments against AI.
Parallel to the "you use copilot so your code quality is terrible and you don't really even understand it so it's not maintainable" human coping we are familiar with.
If there is any shred of truth to these defenses, it is temporary and will be shown false by future, more powerful AI models.
Consider the theoretical prompt that allows one of these models to rapidly improve itself into an AGI. Surely you'd want to read that paper right?
No prompt will cause an LLM to rapidly improve itself, much less into an AGI. Prompts don't cause permanent change in the LLM, only differences in output.
No matter how many times you feed the output of an LLM back to itself, the underlying model does not change. Online training (of actual model weights, not just fine-tuning) would be hugely resource intensive and not guaranteed to do any better than the initial training. Interference will happen, whether catastrophic forgetting or simple drift. We can fantasize about future architectures all day long, but that doesn't make them capable of AGI or even give us a path forward.
Developing prompts for these models isn't a science yet. It does seem to meet most of the criteria for an art though.
We recognize some outputs as high quality, and others as low quality, but often can't articulate the exact reason why. It seems that some people are able to reliably produce high quality results, indicating there is some kind of skill involved. More precisely, the quality of an individual artist's last output is positively correlated with the quality of their next output. A kind of imprecise "shop talk" has emerged, self describing as "prompt engineering", which resembles the conversations artists in other mediums have.
For people in tech this will seem most similar to graphic designers. They produce much nicer looking interfaces than lay people can. We often can't explain why, but recognize it to be the case. And graphic designers have their own set of jargon, which is useful to them, but is not scientific.
"Prompt artist" is a better term than "prompt engineer".
For starters we don't have a way to measure quality objectively, and this is the case for art in general. If you were to develop an objective measure of beauty for example, visual art as a discipline would quickly turn into a science. At some level we know that's possible, we're all just brains in jars. But AFAIK we aren't doing science there yet.
The science and engineering parts all have a measure of quality, sometimes that's a human rating, sometimes it's cross-entropy loss. There's nothing stopping someone from using the scientific method to investigate these things, but descriptively I haven't seen anyone, calling themselves a "prompt engineer/scientist", doing that yet.
"I used these words, and I got this output which is nice" sounds like, "I tried using these brushes and I made this painting which is nice". I can agree with the painting being nice, but not that science was used to engineer a nice painting.
"If you’re a researcher, consider pausing reading here, and instead please read our full paper for interesting science beyond just this one headline result. In particular, we do a bunch of work on open-source and semi-closed-source models in order to better understand the rate of extractable memorization (see below) across a large set of models."
So they are trying to rigorously quantify the behaviour of the model. Is this "look mom no hands"... I don't think so.
Studying the way LLMs behave to different prompts (or different ways of fine-tuning for a set of prompts) is valuable science.
Some of the most interesting papers published this year ("Automatic Multi-Step Reasoning and Tool-Use") compare prompt strategies across a variety of tasks. The results are fascinating, findings are applicable and invite further research in the area of "prompt selection" or "tool selection."
Attacking the participants in a systemic shift is 100% useless as it doesn't target the culprit.
In programming we have a similar phenomenon: StackOverflow-driven (and I guess now GPT-driven) juniors have overtaken the industry and displaced serious talent, because a sufficient amount of quantity always beats quality, even if the end result is inferior. This is caused by market dynamics, which operate on much cruder parameters than the sophisticated analysis of an individual noticing everything around them becoming "enshittified".
SO-driven juniors are cheap, plentiful, and easily replaceable. And a business that values less expense and less risk therefore prefers them, because it has no way to measure the quality of the final product with simple metrics.
The same mechanism is currently driving AI replacing our jobs, and the avalanche of garbage papers by academics. This is entropy for you. We see it everywhere in modern society, down to the food we eat. Quality goes away, replaced by cheap-to-produce and long shelf life.
If we don't fundamentally alter what the system sees as ACCEPTABLE, and VALUABLE, this process will inevitably continue until our world is completely unrecognizable. And to fundamentally alter the system, we need an impulse that aligns us as a society, startles us into action, all together (or at least significant majority of us). But it seems we're currently in "slowly boiled frog mode".
This whole thing reminds me a bit of playing Fallout 4.
There will be good data, the pre-AI enshitification data. The stuff from before the war.
And then... the data after. Tainted by the entropy, and lack of utility of AI.
Alas, this means in some senses, human progress will slow and stop in the tech field if we aren't careful and preserve ways to create pre-AI data. But the cost of it is so high in comparison to post... I'm not sold it will be worth it.
This paints a rosy picture of human-generated data now. It's not as if most human data is reliable. Even among peer reviewed scientific literature, most of it is crap and it takes effort to find the good stuff. Also, your analogy kind of misses the point of the Fallout games. The pre-war world was awful, filled with evil corporations like Vault-tech and Nuka Cola that murdered and poisoned their customers, and the point of the games is that people need to move on and not idealize the past.
I use AI day to day. I see what it can do and can't.
But when you see pages, and pages, and pages of GPT spam all over the place, finding the nuggets of wisdom will be much harder than before the "bomb" was dropped.
Thus actually leading to the whole FO4 main plot.
Yes, life will always find a way. And yes humanity can not put the genie in a bottle, we are much more likely to put it on a Fat Man catapult.
But it means, that in a sense... that we will all have to accept this background radiation of AI shit, as part of our new norm.
And this isn't the first time I've thought in similar ways. I remember reading older math texts (way pre-computer works in things like diffeq and PDE) and often thinking the explanations were clearer. Probably because of the increased effort to actually print something.
Who knows... maybe I'm just an old coot seeing patterns where there are none.
> "I remember reading older math texts (way pre-computer works in things like diffeq and PDE) and often thinking the explanations were clearer."
This is 100% the case in chess as well. The books before and after the computer era are orders of magnitude different in terms of readability. I think a major shift in society has been in motivation. In the past if you were studying advanced mathematics, let alone writing about it, it was solely and exclusively because you absolutely loved the field. And you were also probably several sigmas outside the mean intellectually. Now? It's most often because of some vague direction such as wanting a decent paying job, which may through the twists and turns of fate eventually see you writing books in a subject you don't particularly have much enthusiasm for.
And the much more crowded 'intellectual market', alongside various image crafting or signaling motivations, also creates a really perverse incentive. Exceptional competence and understanding within a subject makes it easy to explain things, even the esoteric and fabulously complex. See: Richard Feynman. But in modern times there often seems to be a desire to go the other direction - and make things sound utterly complex, even when they aren't. I think Einstein's paper on special relativity vs many (most?) modern papers is a good example. Einstein's writing was such that anybody with a basic education could clearly understand the paper, even if they might not fully follow the math. By contrast, so many modern papers seem to be written as if the author had a well worn copy of the Thesaurus of Incomprehensibility at his bedside table.
Hegel is a hack and doesn’t deserve to be cited here. Consider that he’s really only famous because his class motivated a bunch of other philosophers (the Young Hegelians: Marx, Stirner, Bruno Bauer et al.) to meet in wine bars after class to complain about how terrible/impossible to understand his philosophy is.
Anyone who's even tried to read his stuff, e.g. the Phenomenology of Spirit, will tell you that he’s a charlatan and a hack, and the people who constantly cite him (i.e. Zizek, Lacan, Foucault) are also hacks.
Hey, can't a blind pig find an acorn? Is it less of an acorn because it was found by a pig?
Myself, I appreciate Schopenhauer a lot more, and we know what he thought of Hegel, but I'm not fanatical about it. If a hack hits on a good line, I'll nab it.
Why not? A paper is not necessarily scientific nor a breakthrough. In my view, a paper is written and documented communication that's usually approved by peers in the field. Also a blunt observation in nature can be noteworthy. However, we don't see such papers anymore as these fields have matured. Just go back in the history of your field and you will find trivial papers.
In the medical field, letters and case studies often document observations that may not be groundbreaking. However, scientific journals typically feature content that contributes to existing knowledge, making it somewhat novel. Consequently, presenting a set of POST parameters as an arXiv paper could be perceived as undermining the integrity of the entire preprint service.
Real science is reserved for those with real expertise! As the self-anointed gatekeeper of real science I decree that other peoples’ work fails to meet the minimum standard I have set for real science! Mind you not the work other actors in the scientific community publish and accept among their peers - they are not real scientists and their work is trivial. For shame!
Strongly disagree. I do think trivial work is not paper-worthy and it would be more beneficial not to publish such work, as it is mostly a waste of time for the peers reviewing it and the readers who will gain nothing from reading it. It's no lie that most publish for the sake of publishing, and this post just calls it out for what it is.
Trivial means different things to different people. I’m not really a fan of LLM hype but it seems to me a valid practice of scientific discovery to evaluate the use and optimization of such models.
The art and science of building these models is not disputed, but I think that the scientific value of prompts is tightly linked to reproducibility.
If you’ve developed a new prompt for a model whose weights you can directly access, then this prompt could have scientific value because its utility will not diminish over time or be erased with a new model update. I’m even generally of the view that a closed API endpoint whose expiration date is years into the future could have some value (but much less so). But simply finding a prompt for something like ChatGPT is not useful for science because we don’t even have certainty about which model it’s executing against.
Note that some of the best uses of these models and prompting have nothing to do with academics; this is a comment focused on the idea about writing academic papers about prompts.
I can maybe understand the frustration from a “scientific” perspective, but for a lot of these “one prompt papers” - you still need someone to sit down and do the analysis and comparisons. Very few papers focus only on GPT/ChatGPT.
Additionally, it gives people other ideas to try for themselves. And some of this stuff might be useful to someone in a specific scenario.
It’s not glamorous research or even future-proof seeing as how certain prompts can be surgically removed or blocked by the owner of the model, but I don’t think it warrants telling people not to do it.
It's hard to draw these lines, because you will certainly filter out a lot of bad (i.e. useless, low contribution to any field) papers, but you might also filter out some really important papers. Research being basic or something anyone could've done doesn't count against its potential importance, just the expected value of importance I guess.
I'd rather we had a few too many bad papers than a few too few great papers.
The similarity being that it’s ego masquerading as academic.
Most things shared from there should have just been a blog post.
The last year has shown that AI/ML research and use did not need academic gatekeeping by PhDs, and yet many in that scene keep trying self-infatuated things of the lowest utility.
Behind all this is a valid question. How does one evaluate prompts and LLMs? As gipeties (custom gpts) become more popular millions of hours will be wasted by ones that have been built badly. Without some sort of automated quality control, gipeties will become a victim of their own success.
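The boring-but-workable version of that quality control is just a scored test set. A rough sketch (the prompt template, the test cases, and the call_model stub below are all hypothetical stand-ins):

```python
# Score a candidate prompt template against a small labelled test set.
from typing import Callable, List, Tuple

def evaluate_prompt(prompt_template: str,
                    test_set: List[Tuple[str, str]],
                    call_model: Callable[[str], str]) -> float:
    """Return the fraction of test cases whose expected answer appears
    in the model's reply when the template is filled in with the question."""
    correct = 0
    for question, expected in test_set:
        reply = call_model(prompt_template.format(question=question))
        correct += expected.lower() in reply.lower()
    return correct / len(test_set)

# Usage with a fake model, just to show the shape of the harness:
if __name__ == "__main__":
    fake_model = lambda p: "The answer is Paris."
    tests = [("What is the capital of France?", "Paris")]
    print(evaluate_prompt("Answer briefly: {question}", tests, fake_model))
```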
What's the difference between a paper on a new prompt and a paper discussing a new domain-specific model, e.g. heart failure risk? If they analyze the problem and solution equally, they both seem useful. It's not like most other ML papers share their weights or datasets.
This reminds me of how there was a boom in half-baked studies around COVID, e.g. modelling this or that aspect of the pandemic, or around mask wearing.
I imagine that most of these will simply have had little to no impact, and will only serve to bolster the publication list of those who wrote them.
Times are changing. Human researchers will dedicate more and more time towards getting language models to work in desired ways rather than doing the research themselves. Language models will largely be the ones making "research" discoveries. Both should be considered valid research IMO.
Anyone caught doing this should be kicked out of the industry. Period. You're scamming those funding your "research", you are misleading readers, and you are producing low quality content that wastes everyone's time.
Excuse me? Step by step wasn't paper-worthy? Hard disagree.
LLM research is currently in its infancy, because the field is only a few years old. And a research field in its infancy is bound to have a few noteworthy "no sh*t, Sherlock" papers that seem obvious in hindsight.
The fact is, LLMs are a higher-order construct in machine learning, much like a fish is higher-order than a simple cellular colony. Lower-order ML constructs do not demonstrate emergent capabilities like step by step, stream of consciousness thinking, and so on.
Academics should be less jaded and approach the field with beginner's eyes. Because we are all beginners here.
I'm not surprised at the defence of "prompt engineering" here. It's something easy to do with no real knowledge, and I'm sure having it dismissed hurts some people.
But I 100% agree with the author, "prompt engineering" is not science, and I'd say it's not engineering either. All you're doing is exploring the parameter space of a particular model in a very crude way. There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."
None of the results of this process will transfer to any other model. It's simply not science. Papers like "step-by-step" are different, and relate more to learning and inference and do translate to different models and even different architectures.
Also, no, we are not all beginners here. Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.
> But I 100% agree with the author, "prompt engineering" is not science, and I'd say it's not engineering either. All you're doing is exploring the parameter space of a particular model in a very crude way. There is no "engineering" going on in this process, just a bunch of trial and error.
I wonder what your definition of “science” or “engineering” is…
If you remove the AI glasses, "prompt engineering" is just typing words and seeing if results match the expectations... which is exactly what any search engine pays their testers for. Those testers are making an important job to keep improving the quality of the product but they aren't engineers and even less so researchers.
Similarly a kid playing with the dose of water needed to build a sandcastle isn't a civil engineer nor an environmental researcher. Maybe on LinkedIn though.
I’m not sure the scientific method itself can withstand this sort of scrutiny. After all, it’s just making guesses about what will happen and then seeing what happens!
All right, here is a theory: LLMs contain "latent knowledge" that is sometimes used by the model during inference, and sometimes it isn't.
One way to "engage" these internal representations is to include keywords or patterns of text that make that latent knowledge more likely to "activate". Say, if you want to ask about palm trees, include a paragraph talking about a species of a palm tree (no matter whether it contains any information pertaining to the actual query, so long it's "thematically" right) to make a higher quality completion more likely.
It might not be the actual truth or what's going on inside the model. But it works quite consistently when applied to prompt engineering, and produces visibly improved results.
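A toy sketch of what I mean, purely for illustration (the primer paragraph, the query, and the call_model stub are made up, not anything benchmarked):

```python
# Build a "plain" and a "primed" variant of the same query and compare their
# completions. call_model() stands in for whatever LLM endpoint you use.

PRIMER = (
    "The coconut palm (Cocos nucifera) thrives in sandy coastal soil and "
    "tolerates salt spray far better than most date palm cultivars."
)

QUERY = "Which palm species would you plant on a salty beachfront, and why?"

def plain_prompt(query: str) -> str:
    return query

def primed_prompt(query: str) -> str:
    # The primer answers nothing by itself; it is merely "thematically" on
    # topic, which (per the hypothesis) nudges the relevant latent knowledge.
    return f"{PRIMER}\n\n{query}"

if __name__ == "__main__":
    for build in (plain_prompt, primed_prompt):
        print(f"--- {build.__name__} ---")
        print(build(QUERY))
        # print(call_model(build(QUERY)))  # compare the two completions
```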
> It might not be the actual truth or what's going on inside the model.
This sums up pretty nicely why prompt hacking is not science. A scientific theory is related in a concrete way to the mechanism by which the phenomenon being studied works.
It's funny how often I see people bring up the "did you know" tidbit about software engineering not being "real" engineering in a traditional sense, which seems to go very uncontroversially.
But prompt engineering is still a pressure point for some people, despite being wildly more simple and accessible (literally tell the thing to do a thing, and if it doesn't do the thing right, reword)
It feels as though we're getting to the technological equivalent of "what IS art anyways", and questions like if non traditional forms like video games are art (I'm thinking all the way up the chain to even say, Madden games)
And in my experience, when something is under constant questioning of whether or not it even counts as X, Y or Z, it usually can technically qualify, but...
If people are constantly debating whether or not it's even X, it's probably just not impressing people who don't engage in it, as opposed to "traditional" concepts of engineering and art, where part of the impression comes from the investment and irreplaceable skillsets, things that few, if any, others at the time could have done.
This is why taping a banana to the wall is definitely technically art, but not many outside the art community that tapes bananas to walls really think much of it. It's so mundane and accessible a feat that it doesn't garner much merit with passersby. It's art by the loosest technical definition, and it gives a lot of credit for a small amount of effort anyone could have made.
Admittedly "prompt engineering" is definitely less accessible than a roll of duct tape and a banana but I think we used to just call it "writing/communication", but I guess those who feel capable at that, often just do it manually anyways.
Right? I'm having a hard time imagining a definition that includes "trying new things and seeing what happens" but that doesn't include... "trying new things and seeing what happens"
"Science" has been twisted recently into a kind of witchcraft that can only be practiced by those anointed through the rigors of academia.
"Trust the science"
In reality, that is about the furthest from what you should do. As Feynman once said: "Science is the belief in the ignorance of experts". Electricity was also once considered a toy and good for nothing but parlor tricks.
Especially given this would be a fine definition of engineering+science: "All you're doing is exploring the parameter space of a particular model in a very crude way."
Science should aim to create general (that is, generalized or generalizable) knowledge. One prompt is just an anecdote, a method for creating performant prompts or deriving prompts from model characteristics would be more scientific.
> All you're doing is exploring the parameter space of a particular model in a very crude way.
Yes. I've come to think of prompt engineering as, in a sense, doing an approximate SELECT query on the latent behavioural space (excuse my lack of proper terminology, my background in ML is pretty thin), which can be thought of as "fishing out" the agent/personality/simulator that is most likely to give you the kind of answer you want. Of course a prompt is a very crude way to explore this space, but to me this is a consequence of extremely poor tooling. For one, llama.cpp now has negative prompts, while the GPT-4 API will probably never have them. So we make do with the interface available.
> There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."
That is incorrect. It is true that there is a lot of trial and error, yes, but it's not pure guessing either. While my approach can best be described as a systematic variant of vibe-driven development, at its core it's quite similar to genetic programming. The prompt is mutable, and its efficacy can be evaluated, at least qualitatively, against the last version of the prompt. By iterative mutation (rephrasing, restructuring/refactoring the whole prompt, swapping out synonyms, adding or removing formatting, adding or removing instructions and contextual information), it is possible to go from a terrible initial prompt to a much more elaborate one that gets you 90-97% of the way towards nearly exactly what you want, by combining the addition of new techniques with subjective judgement on how to proceed (which is incidentally not too different from some strains of classical programming). On GPT-4, at least.
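If you squint, the loop looks something like this compressed caricature, closer to hill climbing than to a real genetic algorithm (mutate() and score() here are trivial placeholders for the manual rewording and the qualitative judgement I actually do):

```python
import random

def mutate(prompt: str) -> str:
    """Apply one of the edit moves mentioned above (add an instruction,
    swap a synonym, etc.). These particular tweaks are just examples."""
    tweaks = [
        prompt + " Be concise.",
        prompt + " Think step by step.",
        prompt.replace("Explain", "Describe"),
    ]
    return random.choice(tweaks)

def score(prompt: str) -> float:
    """Stand-in for running the prompt on the model and judging the output;
    here, a meaningless toy proxy so the loop runs end to end."""
    words = prompt.split()
    return len(set(words)) / len(words)

def refine(seed: str, rounds: int = 20) -> str:
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        candidate = mutate(best)
        if score(candidate) > best_score:  # keep only strict improvements
            best, best_score = candidate, score(candidate)
    return best

if __name__ == "__main__":
    print(refine("Explain the bug in this function."))
```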
> None of the results of this process will transfer to any other model.
Is that so? Yes, models are somewhat idiosyncratic, and you cannot just drag and drop the same prompt between them. But, in my admittedly limited experience of cross-model prompt engineering, I have found that techniques which helped me to achieve better results with the untuned GPT-3 base model, also helped me greatly with the 7B Llama 1 models. I hypothesise that (in the absence of muddling factors like RLHF-induced censorship of model output), similarly sized models should perform similarly on similar (not necessarily identical) queries. For the time being, this hypothesis is impossible to test because the only realistic peer to GPT-4 (i. e. Claude) is lobotomised to the extent where I would outright pay a premium to not have to use it. I have more to say on this, but won't unless you ask in the interests of brevity.
> Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.
Language models have a long history, but a Markov chain can hardly be asked to create a basic Python client for a novel online API. I will also dispute the assertion that we know the "failings" of large language models. Several times now, previously "impossible" tasks have been proven eminently possible by further research and/or re-testing on improved models (better-trained, larger, novel fine-tuning techniques, etc). I am far from being on the LLM hype train, or saying they can do everything that optimists hope they can do. All I'm saying, is that the academia is doing itself a disservice by not looking at the field as something to be explored with no preconceptions, positive or negative.
I feel like the author of this tweet wasn’t saying step-by-step isn’t worthy, he was saying that non-reproducible results are not science. He emphasizes this twice in that tweet:
> one experiment on one data set with seed picking is not worthy reporting
> Additionally, we all need to understand this is just one good empirical result, now we need to make it useful…
Exactly, and I tend to agree with him. I argued some time ago here that a paper should take some time to try to explain why its results are happening, at least from a reasonable hypothesis (people didn't seem to agree). An experiment (even a simple one) starts from a null hypothesis and tries to disprove it. However, most of what we see coming out of "scientific" papers is basically just engineering, I guess?: we put all of these things together in some way (out of pure guess and/or preference bias) and these results happened. We don't know why, good luck figuring it out. Here is one example where it works (don't ask where it doesn't; we intentionally kept those out).
And while I obviously value very much the engineering advances we have seen, the science is still lacking, because not enough people are trying to understand why these things are happening. Although engineering advances are important and valuable, I don't understand exactly why people try so hard to call themselves scientists if they are basically skipping the scientific process entirely.
Is it non-reproducible? Also, results whose reproducibility can be measured and appears stable are perfectly good science. I dislike when people throw around statements like that.
I have no idea. The author of that tweet seems to imply that the results aren’t reproducible. I was just commenting to point out that the author’s intent may have been different from what the grandparent comment was saying.
Everything that went in to creating GPT4 is AI/science or whatever. Probing GPT4 and trying to understand and characterize it is also a very worthy thing to do - else how can it be improved upon? But if making GPT is science, I'd say this stuff is more akin to psychology ;-)
> you realize nobody understands WHY or HOW these models work under the hood right?
Of course we understand how they work, we built them! There is no mystery in their mechanisms, we know the number of neurons, their connectivity, everything from the weights to the activation functions. This is not a mystery, this is several decades of technical developments.
> it's akin to evolution - we understand the process - that part is simple.
There is nothing simple about evolution. Things like horizontal gene transfer are very much not obvious, and the effect of things like the environment is a field of active research.
> But the output/organisms we have to investigate how they work.
There is a fundamental difference with neural networks here: there are a lot of molecules in an animal’s body about which we have no clue. Similarly, we don’t know what a lot of almost any animal’s DNA encodes. Model species that are entirely mapped are few and far between. An artificial neural network is built from simple bricks that interact in well defined ways. We really cannot say the same thing about chemistry in general, much less bio chemistry.
> Of course we understand how they work, we built them! There is no mystery in their mechanisms, we know the number of neurons, their connectivity, everything from the weights to the activation functions. This is not a mystery, this is several decades of technical developments.
The discovery of DNA’s structure was heralded as containing the same explanatory power as you describe here.
Turns out, the story was much more complicated then, and is much more complicated now.
Anyone today who tells you they know why LLMs are capable of programming, and how they do it, is plainly lying to you.
We have built a complex system that we only understand well at a basic “well there are weights and there’s attention, I guess?” layer. Past that we only have speculation right now.
> The discovery of DNA’s structure was heralded as containing the same explanatory power as you describe here.
Not at all. It's like saying that since we can read hieroglyphics we know all about ancient Egypt. Deciphering DNA is a tool to understand biology, it is not that understanding in itself.
> Turns out, the story was much more complicated then, and is much more complicated now.
We are reverse engineering biology. We are building artificial intelligence. There is a fundamental difference and equating them is fundamentally misunderstanding both of them.
> Anyone today who tells you they know why LLMs are capable of programming, and how they do it, is plainly lying to you.
How so? They can do it because we taught them, there is no magic.
> We have built a complex system that we only understand well at a basic “well there are weights and there’s attention, I guess?” layer. Past that we only have speculation right now.
Exactly in the same way that nobody understand in detail how a complex modern SoC works. Again, there is no magic.
> How so? They can do it because we taught them, there is no magic.
Yeah, no. I mean, we can’t introspect the system to see how it actually does programming at any useful level of abstraction. “Because we taught them” is about as useful a statement as “because its genetic parents were that way”.
No, of course it’s not magic. But that doesn’t mean we understand it at a useful level.
>> Exactly in the same way that nobody understand in detail how a complex modern SoC works. Again, there is no magic.
That's absolute BS. Every part of a SoC was designed by a person for a specific function. It's possible for an individual to understand - in detail - large portions of SoC circuitry. How any function of it works could be described in detail down to the transistor level by the design team if needed - without monitoring its behavior.
Why stop at chemistry? Chemistry is fundamentally quantum electrodynamics applied to huge ensembles of particles. QED is very well understood and gives the best predictions we have to date of any scientific theory.
How come we don’t entirely understand biology then?
> Why stop at chemistry? Chemistry is fundamentally quantum electrodynamics applied to huge ensembles of particles.
Chemistry is indeed applied QED ;) (and you don't need massive numbers of particles to have very complex chemistry)
> How come we don’t entirely understand biology then?
We understand some of the basics (even QED is not reality). That understanding comes from bottom-up studies of biochemistry, but most of it comes from top-down observation of whatever there happens to be around us. The trouble is that we are using this imperfect understanding of the basics to reverse engineer an insanely complex system that involves phenomena spanning 9 orders of magnitude both in space and time.
LLMs did not spawn on their own. There is a continuous progression from the perceptron to GPT-4, each one building on the previous generation, and every step was purposeful and documented. There is no sudden jump, merely an exponential progression over decades. It's fundamentally very different from anything we can see in nature, where nothing was designed and everything appears from fundamental phenomena we don't understand.
As I said, imagining that the current state of AI is anything like biology is a profound misunderstanding of the complexity of both. We like to think we're gods, but we're really children in a sand box.
I will ignore your patronizing remarks beyond acknowledging them here, in order to promote civil discourse.
I think you have missed my point by focusing on biology as an extremely complex field; it was my mistake to use it as an example in the first place. We don’t need to go that far;
sure, llms did not spawn on their own. They are a result of thousands of years of progress in countless fields of science and engineering. Like any modern invention, essentially.
Here let me make sure we are on the same page about what we’re discussing: as I understand it, the question is whether “prompt engineering” can be considered an engineering/science practice. Personally I haven’t considered this enough to form an opinion, but your argument does not sound convincing to me;
I guess your idea of what llms represent matters here. The way I see it, in some abstract sense we as a society are exploring a current peak - in compute $ or flops, and in performance on certain tasks - of a rather large but also narrow family of functions. By focusing our attention on functions composed of ones we understood how to effectively find parameters for, we have by this point been able to build rather complicated processes for finding parameters for the compositions.
Yes, the components are understood, at various levels of rigor, but the thing produced is not yet sufficiently understood. Partly because of the cost of reproducing such research, and partly due to the complexity of the system, which is itself a driver of that cost.
The fact that “prompt engineering” exists as a practice, and that companies supposedly base their business models on secret prompts, is a testament, for me, to the fact that these systems are not well understood. A well understood system you design has a well understood interface.
Now, I haven’t noticed a specific post OP was criticizing, so I take it his remarks were general. He seems to think that some research is not worth publishing. I tend to agree that I would like research to be of high quality, but that is subjective. Is it novel? Is it true?
Now, progress will be progress and I’m sure current architectures will change and models will get larger. And it may be that a few giants are the only ones running models large enough to require prompt engineering. Or we may find a way to have those models understand us better than a human ever could. Doubtful. And post-singularity anyway, by definition.
In either case, yes, probably a temporary profession. But if open research continues in those directions as well, there will be a need for people to figure out ways to communicate effectively with these models. You dismiss them as testers.
However, progress in science and engineering is often driven by data where theory is lacking, and I’m not aware of the existence of deep theory as of yet, e.g. something that would predict how well a certain architecture will perform. Engineering ahead of theory, driven by $.
As in physics that we both mentioned, knowing the component part does not automatically grant you understanding of the whole. knowing everything there is to know about the relevant physical interaction, protein folding was a tough problem that AFAIR has had a lot of success with tools from the field. Square in the realm of physics even, and we can’t give good predictions without testing (computationally).
If someone tested some folding algorithm and visually inspected results, then found a trick how to consistently improve on the result in some subcase of proteins. Would that be worthy of publishing? if yes, why is this different? if not, why not?
We designed the process. We didn't design the models - the models were "designed" based on the features of a massive dataset and a massive number of iterations.
Even if you understand evolution - you still don't understand how the human body or mind works. That needs to be investigated and discovered.
In the same way, you understanding how these models were trained doesn't help you understand how the models work. That needs to be investigated and discovered.
Psychology is a religion-like pseudoscience; its practitioners cannot even define what the "psy" is without borrowing concepts from religion, such as a soul.
upd: these statements of mine are so controversial that the number of "points" just dances the lambada. The psy* areas are clearly polarized: some people upvote all my messages in this topic and some others downvote all of them. This is a sign of something interesting, but I am not ready to elaborate on it in this comment, which is going to become [flagged] eventually.
> This is a sign of something interesting but I am not ready to elaborate on this statement in this comment which is going to become [flagged] eventually.
Yet you're being a reply guy all over this thread - might as well just elaborate; you clearly have the time and interest.
If your contention is just something like "the root psy- comes out of mystical/spiritual conceptions in Ancient Greece, and that speaks to the bunk/ungrounded conceptions of modern psychology," then I would ask why the same critique is not levied against the ancient Greek conception of "nature" and the "natural" from which we get the word "physics".
You might retort here, "Ah well, 'nature' is just the word we use when we speak of observable phenomena in the hard sciences; it's not muddied by religion like that crock stuff, psychology."
And then I would say, "ok, if 'nature' is just observable phenomena, what is the aim or purpose of the hard sciences? If it is all just observing/experimenting on discrete phenomena, there would be nothing we could do or conclude from the rigor of physics."
You laugh at my insanity (well, if you believed in such a thing): "But we do conclude things from physics, because experiments are reproducible, and with their reproducibility we can gain confidence in generalizing the laws of our universe."
And yes! You would be correct here. But now all of a sudden you have committed physics to something just as fundamentally "spiritual" as the soul: that the universe is sensible, rational, and "with laws" - which is indeed just the very same mystical "nature" of ancient Greece from which we get phys-.
But this need not be some damning critique of physics itself (like psychology), and rather, can lead to a higher level understanding of all scientific pursuits: that we are everywhere cursed by a fundamental incompleteness, that in order even to enter into scientific pursuit we must shed an absolute skepticism for a qualified one. Because this is the only way we accumulate a network of reinforced hypotheses and conceptions, which do indeed help us navigate the purely phenomenal world we are bound in.
What is incorrect in this reference? You have not proposed any counterarguments. Also, if you just need more fresh data - how do you propose to interpret the results of the Rosenhan experiment?
They are _not_ doctors in terms of evidence-based medicine, just policemen without a token. The problem is obviously not incorrect diagnoses; I can lie to any doctor about any symptoms and just go home with zero obstruction from the feds.
As someone who was formerly in a mental ward for acute crisis, I would say that at least the 72 hour hold was an essential and necessary part of my treatment. I don't think that staying at home with unprepared family members for the acute period would have worked out, and I don't even have a problematic home environment!
The flip side of the coin is that I was in a really high-quality hospital; I'm sure there are hospitals or facilities that are more harmful than helpful.
I also have a problem with the way they treat mental health like cancer: once you have a diagnosis, you will always have it. There are zero diagnostic criteria for "fully recovered" or for removing dependence on medication, even after 5 or 10 years. It's also treated like a scarlet letter for insurance and for unrelated things like TSA PreCheck - no matter how well you are doing, you are still considered some level of risk to yourself and society. Though I could be wrong... the recurrence chart over time for my specific acute mania (with no depressive episodes) does look a lot like cancer remission charts, with an asymptotic approach to 80%+ recurrence after 2-4 years.
There is no evidence that even one well-defined illness exists in the psy* fields. For example, suppose I tell you that a person X fell ill with schizophrenia. What do you know about X or X's brain?
> Lower-order ML constructs do not demonstrate emergent capabilities like step by step, stream of consciousness thinking, and so on.
As a matter of fact, I did a project on text normalization - e.g., translating "crossing of 6 a. and 12 s." into "crossing of sixth avenue and 12th street" - with a simple LM (order 3) and beam search over lattice paths, the lattice formed from hypothesis variants. I got a two-fold decrease in word error rate compared to the simpler approach of just outputting the most probable WFST path. It was not "step by step stream of consciousness," but it was nevertheless a very impressive feat, when the system started to know more without much effort.
The large LMs do not just output the "most probable" next token; they output the most probable sequence of tokens, and that is done with beam search.
As you can see, my experience tells me that beam search alone can noticeably, if not tremendously, improve the quality of the output, even for very simple LMs.
And if I may, the higher-order construct here is the beam search, not the LMs-as-matrix-coefficients themselves. Beam search has been used in speech recognition for decades now; SR does not work properly without it. LMs, apparently, also do not work without it.
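For readers unfamiliar with the technique, here is a minimal, self-contained beam-search sketch in Python. The hand-written bigram table is made up for illustration - this is not the WFST/3-gram setup from my project - but it shows the core idea: keep the k best partial hypotheses at each step instead of greedily taking the single most probable continuation.

    import math

    def beam_search(score_next, start, steps, beam_width=3):
        """score_next(seq) -> list of (token, log_prob) continuations."""
        beams = [(start, 0.0)]   # (sequence, cumulative log-probability)
        for _ in range(steps):
            candidates = []
            for seq, logp in beams:
                for tok, tok_logp in score_next(seq):
                    candidates.append((seq + [tok], logp + tok_logp))
            # Keep only the best `beam_width` hypotheses.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    # Toy "language model": a hand-written bigram table over a tiny vocabulary.
    BIGRAMS = {
        "<s>":    {"sixth": 0.5, "6": 0.5},
        "sixth":  {"avenue": 0.9, "street": 0.1},
        "6":      {"a.": 0.6, "avenue": 0.2, "street": 0.2},
        "avenue": {"</s>": 1.0}, "street": {"</s>": 1.0}, "a.": {"</s>": 1.0},
    }

    def score_next(seq):
        last = seq[-1]
        return [(t, math.log(p)) for t, p in BIGRAMS.get(last, {"</s>": 1.0}).items()]

    for seq, logp in beam_search(score_next, ["<s>"], steps=2):
        print(" ".join(seq), f"(log-prob {logp:.2f})")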
The field of ML suffered from a problem where there were more entrants to the field than available positions/viable work. In many industrial positions, it was possible to hide a lack of progress behind ambiguity and/or poor metrics. This led to a large amount of gatekeeping around productive work, as ultimately there wasn't enough to go around in the typical case.
This attitude is somewhat pervasive, leading to blog posts like the one above. Granted, the Nth prompting paper probably isn't interesting - but new programming languages for prompts and prompt-discovery techniques are very exciting. I wouldn't be surprised if automatic prompt expansion using a small pre-processing model turns out to be an effective technique.
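For what it's worth, "automatic prompt expansion with a small pre-processing model" could be as simple as the following hypothetical sketch - `call_small_model` and `call_large_model` are placeholders for whatever inference API you actually use, not real functions:

    # Hypothetical sketch: a small, cheap model rewrites a terse user query
    # into a richer prompt before the large model ever sees it.
    EXPANSION_TEMPLATE = (
        "Rewrite the following user request as a detailed, unambiguous prompt. "
        "Add relevant context, constraints, and the desired output format, "
        "but do not answer it.\n\nRequest: {query}\n\nExpanded prompt:"
    )

    def expand_and_answer(query, call_small_model, call_large_model):
        expanded = call_small_model(EXPANSION_TEMPLATE.format(query=query))
        return call_large_model(expanded)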
I would say that rather than declaring a new prompt "not science", it's certainly a new discovery that is worth sharing. Maybe there should be a higher bar for papers, but why do we have to turn a discovery into a paper - and not publish it at all if it cannot be made into one - when a simple blog post or a tweet would convey the discovery perfectly well?
On a related note, there is this recent tweet purportedly showing that "offering to give a tip to ChatGPT" improves performance (or at the very least resulted in longer responses, which might not be a good proxy for performance) https://twitter.com/voooooogel/status/1730726744314069190
I'm reading this tweet as saying "you can't write a paper by prompting an LLM in these ways" rather than "you can't write a paper characterising the impact of prompting an LLM in these ways".
I'd agree the former won't get you a complete anything (longer than ~30 lines) by itself (90-95% cool, but with some incredible errors in the other 5-10%).
I'd also agree that the latter is worthy of publishing.
What you’re saying seems to be compounded by the black box aspects here.
Instinctively, a prompt may look like one line of code. We can often know or prove what a compiler is doing, but the model's high-dimensional space is just not understood in the same way.
What other engineered thing in history has had this much immediately useful emergent capability? Genetic algorithms finding antenna designs and flocking algorithms are fantastic, but I would argue narrower in scope.
Of course a paper is still expected to expand knowledge, to have rigor and impact, but I don’t see why a prompt centric contribution would inherently preclude this.
I think it's a reference to the "discovery" that if you ask GPT-4 to answer your query "step by step", it'll actually offer a better response than otherwise.
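Concretely, the trick amounts to appending an instruction such as "Let's think step by step." to the query and comparing answers. A tiny sketch, with `ask_model` standing in for whatever completion call you actually use (an outline, not a runnable benchmark):

    # Compare a plain query against the same query with the "step by step" suffix.
    # `ask_model` is a placeholder for a real chat/completion call.
    def compare_step_by_step(question, ask_model):
        baseline = ask_model(question)
        stepwise = ask_model(question + "\n\nLet's think step by step.")
        return baseline, stepwise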
> emergent capabilities like step by step, stream of consciousness thinking
What makes these things "emergent capabilities"? They seem like pretty straightforward consequences of autoregressive generation. If you feed output back as input, then you'll get more output conditioned on that new input, and stream-of-consciousness generation is just stochastic parroting, isn't it?
They are emergent in the sense that there is nothing in the pre-training dataset that would show the LLM by example how to, for example, compare and contrast any given pairing of fruit, technologies, or fictional settings, while thinking with the mindset of a doctor that hates both options, and on top of that make sure that this ends up formatted as a stream-of-consciousness. It can learn all these aspects from the source data individually in isolation, but there's no way there are examples that show how to combine it all (awareness of world information + knowledge of how to use it) into a single answer. That's probably a very clumsy example - others online have supplied more rigorous ones that I recommend checking out.
Strictly speaking, it might be "stochastic parroting". But really, if you want to be a great and supremely effective stochastic parrot, you have to learn an internal representation of certain things so that you can predict them. And there are hints that this is exactly what a sufficiently-large large language model is doing.
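Mechanically, the autoregressive loop mentioned a couple of comments up is roughly this (a toy sketch; `next_token_distribution` stands in for a real model's forward pass):

    import random

    def generate(next_token_distribution, prompt_tokens, max_new_tokens=20, seed=0):
        rng = random.Random(seed)
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            candidates, weights = next_token_distribution(tokens)       # model forward pass
            new_tok = rng.choices(candidates, weights=weights, k=1)[0]  # sample the next token
            if new_tok == "</s>":
                break
            tokens.append(new_tok)  # output fed back as part of the input
        return tokens

Whether the interesting behavior lives in this loop, in the learned weights, or in their interaction is exactly what the emergence debate here is about.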
Working with pure pre-trained models is quite hard and takes some practice. The key point is that "let's think step by step" is a technique that humans also use, and therefore (I think) it would be somewhat represented in the pre-training corpus. It would be somewhat harder to "activate" this mode of thinking in a pure pre-trained model than a "let's think step by step" would in a fine-tuned one, but it would be possible with some elbow grease.
Tbh, both are mostly alchemy with some more or less expert lingo thrown in to make it sound more scientific.
AI/ML resembles alchemy more than, e.g., physics: putting stuff into a pot and seeing what comes out. A lot of the math in those papers doesn't provide anything but truisms; most of it is throwing stuff at the wall and seeing what sticks.
Academia needs to get over itself. I can't wait to see how amazing this tech gets when the next generation, who decide never to bother with those stuffy, navel-gazing institutions, becomes the driving force behind it.
Looking forward to "I made this cool thing, here's the code/library you can use" rather than the papers/gatekeeping/ego stroking/"muh PhD".
Think about it: if Google had built an AI team around the former rather than the latter, they wouldn't have risked the future of their entire company and squandered their decade-long head start.