RL on any production system is very tricky, and so it seems difficult to work in...

RL on any production system is very tricky, and so it seems difficult to work in any open domain, not just LLMs. My suspicion is that RL training is a coalgebra to almost every other form of ML and statistical training, and we don't have a good mathematical understanding how it behaves.