But reward models are always curated by humans. If you generate a reward model with an LLM, it will contain hallucinations that need to be corrected by humans. And that is exactly what a reward model is for: to correct the hallucinations of LLMs.
So yeah, theoretically you could generate reward models with LLMs, but they won't be any good unless they are curated by other reward models that are ultimately curated by humans.
> So yeah, theoretically you could generate reward models with LLMs, but they won't be any good unless they are curated by other reward models that are ultimately curated by humans.
This reasoning is begging the question: the premise only holds if the conclusion is already assumed to be true. It's therefore a fallacious argument.
There is no inherent reason why this needs to be the case.
In reinforcement learning and related fields, a _reward model_ is a function that assigns a scalar value (a reward) to a given state, representing how desirable that state is. You're at liberty to use compound inputs: for example, a trajectory (often denoted tau) or a state-action pair (typically written (s, a)).
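As a minimal sketch of that definition (the toy state, action, and scoring rules below are purely illustrative, not from any particular library), a reward model can be as simple as a function mapping a state, a state-action pair, or a whole trajectory to a float:

```python
from typing import Sequence, Tuple

State = Tuple[float, float]   # toy state: (position, velocity)
Action = float                # toy action: applied force

def reward_state(s: State) -> float:
    """Scalar reward for a single state: closer to the origin is better."""
    position, velocity = s
    return -(position ** 2 + velocity ** 2)

def reward_state_action(s: State, a: Action) -> float:
    """Reward for a state-action pair (s, a): also penalize large actions."""
    return reward_state(s) - 0.01 * a ** 2

def reward_trajectory(tau: Sequence[Tuple[State, Action]]) -> float:
    """Reward for a whole trajectory tau: sum of per-step rewards."""
    return sum(reward_state_action(s, a) for s, a in tau)

# Example usage with a two-step trajectory.
tau = [((1.0, 0.0), -0.5), ((0.5, -0.2), -0.1)]
print(reward_trajectory(tau))
```

Nothing in that definition requires the scoring function to be hand-written by a human; it only has to map its input to a scalar.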
This doesn't follow at all. There's no reason why a model cannot be made to produce reward models.