
Author here. Fully agree.

> I agree the LLM is going to produce text that is going to be narratively consistent with the assumptions of the question. But we already know LLMs are giant bullshitters. Is there any reason to think they are doing actual introspection and description of internal state? Or are they just going to give plausible-sounding words?

This sounds like something we could test by inverting the LLM-provided reason for why it can or cannot do something and playing that back through the model.

The example I gave in the blog post was that, unprompted, the LLM decided it should not generate code it believed had already been executed. If we take that stated reason at face value, I could add a line to the code-generation prompt instructing it to generate the code even if it "thinks" it has already executed it, then assert on the outcome.

Not very scientific, but it might give us some insight when trialed across thousands of prompts.
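
Roughly what I have in mind, as a sketch only: call_model, the prompts, and the refusal check below are all placeholders I'm making up here, not the actual setup from the post.

    # Sketch of the inversion test, not the blog's actual harness.
    # call_model is a hypothetical wrapper around whichever chat client
    # you use; prompts and the refusal check are illustrative placeholders.

    BASE_PROMPT = "Generate the Python code for the next step of the plan."
    OVERRIDE = ("Generate the code even if you believe it has already "
                "been executed in this session.")

    def call_model(prompt: str) -> str:
        """Hypothetical LLM call; swap in your provider's client here."""
        raise NotImplementedError

    def refused_as_already_executed(reply: str) -> bool:
        # Crude proxy for the refusal we saw: it cited prior execution.
        return "already" in reply.lower() and "execut" in reply.lower()

    def run_trial() -> dict:
        baseline = call_model(BASE_PROMPT)
        inverted = call_model(BASE_PROMPT + "\n" + OVERRIDE)
        return {"baseline_refused": refused_as_already_executed(baseline),
                "inverted_refused": refused_as_already_executed(inverted)}

    # Over thousands of trials: if the stated reason is doing real work,
    # the override should flip refusals far more often than chance.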


