Reinforcement Learning with Human Feedback (RLHF)
RLHF Step 1: Supervised finetuning of the pretrained model
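Step 1 takes the pretrained base model and finetunes it on instruction–response pairs written by human labelers, using the ordinary next-token cross-entropy loss. Below is a minimal sketch of this step, assuming a Hugging Face causal language model; the `gpt2` checkpoint, the learning rate, and the single training pair are illustrative placeholders, and a real run would iterate over a full instruction dataset with batching and a learning-rate schedule.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever pretrained checkpoint is used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each example concatenates an instruction prompt with a human-written response.
pairs = [("Explain RLHF in one sentence.",
          "RLHF finetunes a language model with a reward model trained on human preferences.")]

model.train()
for prompt, response in pairs:
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard language-modeling loss; the model shifts the labels internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```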
RLHF Step 2: Creating a reward model
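Step 2 trains a separate reward model: human labelers rank candidate responses to the same prompt, and the model learns to assign higher scalar scores to the preferred ones. A common formulation is the pairwise ranking loss -log sigma(r_chosen - r_rejected). Here is a sketch under that assumption; the `RewardModel` class, the `gpt2` backbone, and the single preference pair are hypothetical stand-ins for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(torch.nn.Module):
    """A pretrained transformer with a scalar head that scores a completion."""
    def __init__(self, name="gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.score = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids=input_ids).last_hidden_state
        return self.score(hidden[:, -1, :]).squeeze(-1)  # reward read off the last token

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-5)

# One hypothetical preference pair: the labeler preferred `chosen` over `rejected`.
prompt = "Explain RLHF briefly."
chosen = prompt + " It aligns a model with human preferences."
rejected = prompt + " Banana."

r_chosen = rm(tokenizer(chosen, return_tensors="pt").input_ids)
r_rejected = rm(tokenizer(rejected, return_tensors="pt").input_ids)

# Pairwise ranking loss: push the chosen reward above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
```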
RLHF Step 3: Finetuning via proximal policy optimization
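Step 3 finetunes the supervised model against the reward model with reinforcement learning, while a KL penalty against a frozen copy of the SFT model keeps the policy from drifting too far from its starting point. The sketch below is deliberately simplified: it takes a single REINFORCE-style gradient step on the KL-penalized reward rather than running the full PPO machinery (rollout buffers, a value function, multiple clipped epochs), and it scores log-probabilities over the whole sequence where a real implementation would restrict them to the response tokens. The constant reward and the beta value are stand-ins.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")     # model being finetuned (from Step 1)
reference = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen SFT copy for the KL penalty
reference.eval()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

prompt_ids = tokenizer("Explain RLHF briefly.", return_tensors="pt").input_ids
with torch.no_grad():
    # Sample a completion from the current policy.
    sequence = policy.generate(prompt_ids, do_sample=True, max_new_tokens=20,
                               pad_token_id=tokenizer.eos_token_id)

def token_logprobs(model, ids):
    # Log-probability the model assigns to each next token in `ids`.
    logits = model(ids).logits[:, :-1, :]
    return torch.gather(F.log_softmax(logits, dim=-1), 2, ids[:, 1:, None]).squeeze(-1)

policy_lp = token_logprobs(policy, sequence)
with torch.no_grad():
    ref_lp = token_logprobs(reference, sequence)

reward = torch.tensor(1.0)  # stand-in for the Step 2 reward model's score of `sequence`
beta = 0.1                  # KL penalty coefficient (hypothetical value)
kl = (policy_lp - ref_lp).sum()

# One REINFORCE-style step on the KL-penalized reward; full PPO would
# instead optimize the clipped surrogate objective from the next section.
advantage = (reward - beta * kl).detach()
loss = -advantage * policy_lp.sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```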
Proximal Policy Optimization
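PPO's key ingredient is the clipped surrogate objective, L_CLIP(theta) = E[min(r_t(theta) * A_t, clip(r_t(theta), 1 - eps, 1 + eps) * A_t)], where the probability ratio r_t(theta) = pi_theta(a_t | s_t) / pi_old(a_t | s_t) compares the current policy to the one that collected the data. A minimal sketch in PyTorch, with made-up log-probabilities and advantages for illustration:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate: min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated because optimizers minimize

# Toy usage with made-up numbers.
logp_new = torch.tensor([-1.0, -0.5], requires_grad=True)
logp_old = torch.tensor([-1.2, -0.4])
advantages = torch.tensor([0.8, -0.3])
loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()
```

Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to push the policy far beyond the data-collecting policy in a single update, which is what lets PPO safely run several optimization epochs over each batch of rollouts.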