Reward hacking is a well-known and tracked problem at frontier labs; Claude 4’s system card reports on it, for instance. It’s not surprising that a framework built on current LLMs would have reward hacking tendencies.
For this part of the stack, the interesting question to me is how to identify and mitigate it.