In your assess_output_quality function, you ask the LLM to give a score first, then an explanation. I haven't been following the latest research on LLMs, but I thought you usually want the explanation first, to get the model to "think out loud" before committing to the final answer. Otherwise, it might commit semi-randomly to some score and then write whatever explanation it can come up with to justify that score.
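
Something like the following is what I have in mind. This is just a minimal sketch, not your actual implementation: I don't know the real shape of assess_output_quality, so the prompt text, the JSON format, and the parse_score helper are placeholders. The point is only the ordering (explanation first, score on the last line) plus a trivial parser so the score is still easy to extract.

```python
import json

# Hypothetical prompt reordering: ask for the explanation first,
# then a machine-readable score on the final line.
JUDGE_PROMPT = """You are grading the quality of a model's output.

Output to assess:
{output}

First, write a short explanation of the output's strengths and weaknesses.
Then, on the final line only, give your verdict as JSON: {{"score": <1-5>}}
"""

def parse_score(response_text: str) -> int:
    """Extract the score from the JSON object on the last line of the response."""
    last_line = response_text.strip().splitlines()[-1]
    return int(json.loads(last_line)["score"])

# Parsing a response shaped the way the prompt asks for:
example_response = (
    "The answer is accurate but omits edge cases around empty input.\n"
    '{"score": 4}'
)
print(parse_score(example_response))  # -> 4
```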