
There's a paper that probed how strongly a model focuses on prompt-supplied tokens when generating a response, as a signal that it is drawing on the prompt as its source of information rather than on knowledge it acquired during training. I.e., how much it is liable to "lie" by assuming the information in the prompt is true, as opposed to relying on a rich internal model of the thing being verified. It seems to work, sort of, sometimes, when you have access to the actual labels. In the more realistic unsupervised setting, the results are better than random, sure, but not good enough to be really exciting or reliable.

https://arxiv.org/html/2402.03563v1
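
To make the idea concrete, here is a rough sketch of that kind of signal using off-the-shelf attention outputs from the transformers library. This is just an illustration of the general "attention mass on prompt tokens" idea, not the paper's actual probe; the model name ("gpt2") and the interpretation of the score are my own assumptions. It averages, over the generated tokens, the fraction of attention that falls on prompt positions, and reads a high value as "copying from the prompt" rather than drawing on parametric knowledge.

    # Illustrative sketch only: measure how much attention generated tokens
    # place on prompt-supplied tokens, as a crude proxy for prompt reliance.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # assumption: any causal LM with attention outputs works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def prompt_attention_mass(prompt: str, max_new_tokens: int = 20) -> float:
        """Average fraction of attention that generated tokens put on prompt tokens."""
        inputs = tok(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]

        with torch.no_grad():
            out = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                output_attentions=True,
                return_dict_in_generate=True,
                pad_token_id=tok.eos_token_id,
            )

        # out.attentions: one entry per generation step; each entry is a tuple of
        # per-layer tensors with shape (batch, heads, query_len, key_len).
        fractions = []
        for step_attn in out.attentions:
            last_layer = step_attn[-1]        # last layer: (1, heads, query_len, key_len)
            attn = last_layer[0, :, -1, :]    # newest query position: (heads, key_len)
            # Fraction of each head's attention that lands on prompt positions,
            # averaged over heads (attention rows sum to 1).
            frac = attn[:, :prompt_len].sum(dim=-1).mean().item()
            fractions.append(frac)

        return sum(fractions) / len(fractions) if fractions else 0.0

    score = prompt_attention_mass(
        "The Eiffel Tower is located in Berlin. Where is the Eiffel Tower?"
    )
    print(f"mean attention mass on prompt tokens: {score:.2f}")

A score near 1.0 would suggest the model is leaning heavily on the prompt text; a lower score suggests it is leaning more on its own parametric knowledge. Whether any fixed threshold separates the two reliably is exactly the part that, per the paper, only works well when you have labels to calibrate against.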


