Instruction tuning is distinct from RLHF. Instruction tuning teaches the model to understand and respond (in a sensible way) to instructions, versus 'just' completing text.
RLHF trains a model to adjust its output based on a reward model, and that reward model is trained from human feedback (a rough sketch of the reward-model step follows below).
You can have an instruction-tuned model with no RLHF, RLHF with no instruction tuning, or both. They're totally orthogonal.
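To make the RLHF side concrete, here's a minimal PyTorch sketch of the reward-model step, assuming human feedback arrives as preference pairs. TinyRewardModel and the random tensors are made-up placeholders for illustration, not any particular implementation:

    import torch
    import torch.nn as nn

    class TinyRewardModel(nn.Module):
        """Scores an already-embedded response with a single scalar reward."""
        def __init__(self, dim=16):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, x):                 # x: [batch, dim]
            return self.score(x).squeeze(-1)  # [batch]

    rm = TinyRewardModel()
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

    # Human feedback arrives as preference pairs: 'chosen' beat 'rejected'.
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

    # Bradley-Terry style objective: push reward(chosen) above reward(rejected).
    loss = -nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
    loss.backward()
    opt.step()

The policy model is then tuned (InstructGPT used PPO) to maximize the learned reward, which is the "adjust its output" part.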
Man, it really doesn't need to be said that RLHF is not the only way to instruction-tune. The point of my comment was to say that GPT-3.5 was instruction-tuned that way, via RLHF on a question-answer dataset.
At least we have this needless nerd snipe so others won't be potentially misled by my careless quip.
But that's still false. RLHF is not instruction fine-tuning. It is alignment.
GPT-3.5 was first fine-tuned (supervised, not RL) on an instruction dataset, and then aligned to human expectations using RLHF; the supervised step is sketched after the link below.
In essence, they've been fine-tuned to be able to follow instructions.
https://openai.com/research/instruction-following
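For contrast, here's an equally minimal sketch of the supervised instruction-tuning step: plain next-token cross-entropy on instruction/response pairs, with no reward model or RL anywhere. The tiny LSTM stand-in, vocab size and random tokens are placeholders for illustration only, not how GPT-3.5 actually does it:

    import torch
    import torch.nn as nn

    vocab, dim = 100, 32

    class TinyLM(nn.Module):
        """Placeholder language model: embedding -> LSTM -> next-token logits."""
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.rnn = nn.LSTM(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, vocab)

        def forward(self, tokens):            # tokens: [batch, seq]
            hidden, _ = self.rnn(self.emb(tokens))
            return self.head(hidden)          # logits: [batch, seq, vocab]

    lm = TinyLM()
    opt = torch.optim.Adam(lm.parameters(), lr=1e-3)

    # A batch of tokenized "instruction + response" sequences (random stand-ins).
    tokens = torch.randint(0, vocab, (4, 12))

    # Plain next-token cross-entropy; in practice the loss is usually masked
    # to the response tokens so the model learns to answer, not echo the prompt.
    logits = lm(tokens[:, :-1])
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                       tokens[:, 1:].reshape(-1))
    loss.backward()
    opt.step()

That's the step the InstructGPT paper calls supervised fine-tuning (SFT), which happens before any RLHF.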