intuitively it has seemed that these kinds of "fuzzy text search" applications are an area where llms really shine. it's cool to see evidence of it working.
i'm curious whether some kind of "prompt overfitting" is at play here. it's good to see the plots of improvement as the prompts change (though error bars would probably make sense here), but there's not much mention of held-out sets or other approaches to mitigate that concern.
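to make the concern concrete, here's a rough sketch of what i mean, not anything taken from the post. it assumes each eval example can be scored as correct (1) or incorrect (0), and the names `split_dev_holdout` and `bootstrap_ci` are made up for illustration. the idea is to iterate on prompts against a dev split only, score the final prompt once on a held-out split, and put bootstrap error bars on both numbers.

```python
import random


def split_dev_holdout(examples, holdout_frac=0.3, seed=0):
    """Shuffle once and carve off a held-out split that prompt tuning never sees."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    # first n_holdout items are held out, the rest are for prompt iteration
    return shuffled[n_holdout:], shuffled[:n_holdout]


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # resample the outcomes with replacement and record each resample's accuracy
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi


if __name__ == "__main__":
    # hypothetical outcomes: 1 = the prompt got the example right, 0 = wrong
    dev_outcomes = [1] * 80 + [0] * 20
    holdout_outcomes = [1] * 30 + [0] * 15
    print("dev     acc, lo, hi:", bootstrap_ci(dev_outcomes))
    print("holdout acc, lo, hi:", bootstrap_ci(holdout_outcomes))
```

if the held-out accuracy falls well below the dev accuracy, beyond what the intervals would explain, that would be one sign the prompts have been tuned to the dev examples rather than to the task.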