> All you're doing is exploring the parameter space of particular model in a very crude way.
Yes. I have come to think of prompt engineering as, in a sense, running an approximate SELECT query against the model's latent behavioural space (excuse my lack of proper terminology, my background in ML is pretty thin), one that "fishes out" the agent/personality/simulator most likely to give you the kind of answer you want. Of course a prompt is a very crude way to explore this space, but to me that is a consequence of extremely poor tooling. For one, llama.cpp now has negative prompts, while the GPT-4 API will probably never have them. So we make do with the interface available.
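To make the "negative prompt" point concrete: as I understand it, llama.cpp does this via classifier-free guidance, which needs access to the raw next-token logits. A minimal sketch of the idea (the function name and the use of NumPy are mine, not llama.cpp's):

```python
import numpy as np

def cfg_logits(pos_logits: np.ndarray, neg_logits: np.ndarray, scale: float = 1.5) -> np.ndarray:
    """Mix next-token logits so sampling is steered away from the negative prompt.

    pos_logits: logits conditioned on the ordinary prompt
    neg_logits: logits conditioned on the negative prompt
    scale:      1.0 ignores the negative prompt; >1.0 pushes away from it
    """
    return neg_logits + scale * (pos_logits - neg_logits)
```

The GPT-4 API only hands you sampled text (and at most a handful of top log-probabilities), so there is nowhere to hook this kind of steering in. That is the sort of tooling gap I mean.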
> There is no "engineering" going on in this process, just a bunch of trial and error. Perhaps it should be called "prompt guessing."
That is incorrect. There is a lot of trial and error, yes, but it is not pure guessing either. My approach is best described as a systematic variant of vibe-driven development, but at its core it is quite similar to genetic programming. The prompt is mutable, and its efficacy can be evaluated, at least qualitatively, against the previous version. Through iterative mutation (rephrasing, restructuring or refactoring the whole prompt, swapping out synonyms, adding or removing formatting, adding or removing instructions and contextual information), you can get from a terrible initial prompt to a much more elaborate one that does 90-97% of nearly exactly what you want, by combining new techniques with subjective judgement about how to proceed (which, incidentally, is not too different from some strains of classical programming). On GPT-4, at least.
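For what it's worth, here is roughly what I mean, written out as code. Everything named here (`run_model`, `score_output`, the particular mutations) is a placeholder for whatever model call, scoring judgement and hand edits you actually use; the point is just the mutate-evaluate-keep-the-best loop:

```python
import random

def run_model(prompt: str, task_input: str) -> str:
    """Placeholder: whatever completion API call you are actually using."""
    raise NotImplementedError

def score_output(output: str) -> float:
    """Placeholder: an automatic check, or a human 0-to-1 judgement typed in by hand."""
    raise NotImplementedError

# Illustrative mutations only; in practice these are edits made by hand.
MUTATIONS = [
    lambda p: p + "\nAnswer step by step.",            # add an instruction
    lambda p: p.replace("should", "must"),             # strengthen wording
    lambda p: "\n\n".join(reversed(p.split("\n\n"))),  # reorder sections
]

def evaluate(prompt: str, task_inputs: list[str]) -> float:
    """Average quality over a few representative inputs."""
    return sum(score_output(run_model(prompt, x)) for x in task_inputs) / len(task_inputs)

def iterate_prompt(prompt: str, task_inputs: list[str], rounds: int = 20) -> str:
    """Crude hill climbing: mutate the prompt, keep a mutation only if it scores better."""
    best, best_score = prompt, evaluate(prompt, task_inputs)
    for _ in range(rounds):
        candidate = random.choice(MUTATIONS)(best)
        score = evaluate(candidate, task_inputs)
        if score > best_score:
            best, best_score = candidate, score
    return best
```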
> None of the results of this process will transfer to any other model.
Is that so? Models are somewhat idiosyncratic, yes, and you cannot simply drag and drop the same prompt between them. But in my admittedly limited experience of cross-model prompt engineering, techniques that got me better results with the untuned GPT-3 base model also helped me greatly with the 7B Llama 1 models. I hypothesise that, in the absence of muddling factors like RLHF-induced censorship of model output, similarly sized models should perform similarly on similar (not necessarily identical) queries. For the time being this hypothesis is impossible to test, because the only realistic peer to GPT-4 (i.e. Claude) is lobotomised to the extent that I would outright pay a premium not to have to use it. I have more to say on this, but in the interests of brevity I won't unless you ask.
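For concreteness, this is the kind of harness I would use to check it, should a comparable, uncensored peer ever appear. All the names here (`complete`, `judge`, the technique list) are hypothetical stand-ins:

```python
from itertools import product

def complete(model: str, prompt: str) -> str:
    """Placeholder: whatever API or local runner each model exposes."""
    raise NotImplementedError

def judge(output: str) -> float:
    """Placeholder: a human or automatic 0-to-1 quality score."""
    raise NotImplementedError

EXAMPLES = "Q: ...\nA: ...\n"  # a couple of worked examples, elided here

TECHNIQUES = {
    "baseline":     lambda p: p,
    "few-shot":     lambda p: EXAMPLES + "\n" + p,
    "step-by-step": lambda p: p + "\nThink step by step before answering.",
}

def transfer_table(models: list[str], base_prompt: str) -> dict[tuple[str, str], float]:
    """Score every (model, technique) pair. Transfer would show up as the same
    techniques winning on every model, not as identical absolute scores."""
    return {
        (model, name): judge(complete(model, rewrite(base_prompt)))
        for model, (name, rewrite) in product(models, TECHNIQUES.items())
    }
```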
> Language models have a long history, and while the very large models are impressive, most of their failings have been known for a very long time already. Things like "prompt engineering" will eventually end up in the same graveyard as "keyword engineers" of the past.
Language models have a long history, but a Markov chain can hardly be asked to create a basic Python client for a novel online API. I will also dispute the assertion that we already know the "failings" of large language models. Several times now, previously "impossible" tasks have turned out to be eminently possible after further research and/or re-testing on improved models (better trained, larger, new fine-tuning techniques, and so on). I am far from being on the LLM hype train, or claiming that they can do everything the optimists hope they can. All I'm saying is that academia is doing itself a disservice by not treating the field as something to be explored with no preconceptions, positive or negative.
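For anyone who hasn't played with one: a word-level Markov chain is, in its entirety, a lookup table of which words followed which in its training text, so it can only recombine phrases it has already seen. A toy sketch:

```python
import random
from collections import defaultdict

def train_bigram_chain(text: str) -> dict[str, list[str]]:
    """Word-level bigram chain: the next word depends only on the current word."""
    words = text.split()
    chain: dict[str, list[str]] = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain: dict[str, list[str]], start: str, length: int = 30) -> str:
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:  # the current word never had a successor in training
            break
        out.append(random.choice(followers))
    return " ".join(out)

# Every adjacent word pair the chain can emit already occurs verbatim in its
# training text, so it can shuffle familiar phrases around but cannot compose
# client code for an API it has never seen described.
```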