Tests to align your model seem neat. How reliable is it? Won’t models still hallucinate from time to time? How do you think about performance monitoring/management?
Great questions! The tests act as few-shot examples for the LLM, a technique that has been shown to guide the style and accuracy of model outputs and improve performance substantially. For instance, we’ve seen accuracy go from <70% to 93%+ compared with not including the tests. Hallucinations are still an inherent risk with LLMs, especially with long-form context, but adding more diverse and well-aligned examples as tests does reduce that risk and aligns the outputs with user intent.
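To make the mechanism concrete, here’s a minimal sketch of how tests can double as few-shot examples in the prompt. This is an illustrative assumption, not our actual product API; the names (`build_prompt`, the test tuple format) are hypothetical.

```python
# Hypothetical sketch: tests as few-shot examples in the prompt (not the real API).

def build_prompt(description: str, tests: list[tuple[str, str]], user_input: str) -> str:
    """Combine a function description and (input, expected_output) tests into one prompt."""
    lines = [f"You implement the following function:\n{description}\n"]
    # Each test becomes a worked example that anchors output style and format.
    for test_input, expected_output in tests:
        lines.append(f"Input: {test_input}\nOutput: {expected_output}\n")
    lines.append(f"Input: {user_input}\nOutput:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        description="Extract the city name from a shipping address.",
        tests=[
            ("221B Baker Street, London NW1 6XE", "London"),
            ("1600 Pennsylvania Ave NW, Washington, DC 20500", "Washington"),
        ],
        user_input="4 Privet Drive, Little Whinging, Surrey",
    )
    print(prompt)
```

The more diverse those examples are, the better they pin down the intended behaviour, which is where the accuracy gains come from.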
In terms of performance management and monitoring, QA for LLMs is a difficult process to get right, and we’re looking into ways to a) make it easy for users to try out different function descriptions and tests on their own datasets to gauge performance, and b) offer seamless, low-effort continuous monitoring of function outputs. Still WIP, but we’ll keep you posted!
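For point a), the rough shape would be something like the sketch below: run a candidate function description over a labelled dataset and compare accuracies. Again, this is an assumed workflow with hypothetical names (`evaluate`, `llm_fn`), not our actual tooling.

```python
# Assumed evaluation sketch (not the product's real tooling): score a candidate
# LLM-backed function against a user-supplied labelled dataset.
from typing import Callable

def evaluate(llm_fn: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Return the fraction of (input, expected) rows where the output matches the label."""
    correct = 0
    for example_input, expected in dataset:
        if llm_fn(example_input).strip() == expected.strip():
            correct += 1
    return correct / len(dataset) if dataset else 0.0

# Usage idea: evaluate(lambda x: call_model(prompt_v2, x), my_dataset) vs. prompt_v1
# to compare two candidate function descriptions on the same data.
```

Continuous monitoring would essentially run the same comparison on an ongoing sample of production outputs rather than a one-off dataset.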