Reinforcement Learning from Human Feedback (RLHF)

In RLHF, we use human feedback as a proxy for the “reward” signal in RL.
Use cases: RLHF is used to make a model
- Helpful: Give useful results
- Harmless: Avoid generating toxic, biased, or harmful content
- Aligned with Human Preferences: Respond in ways that humans find natural, helpful, and engaging
Multiple Techniques:
- GRPO (Group Relative Policy Optimization)
- PPO (Proximal Policy Optimization)
- DPO (Direct Preference Optimization)
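GRPO was introduced by DeepSeek (in the DeepSeekMath paper) and was later used to train DeepSeek-R1
GRPO is an RL method: a variant of PPO that drops the separate value (critic) model and instead scores each response relative to a group of responses sampled for the same prompt. The reward can come from a reward model trained on human preferences or from automatic rule-based checks, so humans may or may not be involved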
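
The "group relative" part of GRPO can be illustrated with a small sketch. It assumes that several responses were sampled for one prompt and that each already has a scalar reward; the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative choices, not taken from any particular library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses for the same prompt.

    Assumption: the advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group average get a positive advantage and are reinforced; those below get a negative one. Because the baseline comes from the group itself, no separate value model is needed.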

Steps for RLHF

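- Ask humans to compare different responses generated by the LLM for the same prompt and mark which response they prefer
- Use this human preference data to train a reward model that scores responses
- Use this reward model to fine-tune the LLM with an RL algorithm such as PPO or GRPO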
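
The reward model step is commonly trained with a pairwise (Bradley-Terry style) loss over the human comparisons. The sketch below shows only that loss, assuming the reward model has already produced a scalar score for the preferred and the rejected response; `pairwise_preference_loss` and the example scores are hypothetical, and a real setup would compute this over batches in a deep learning framework.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the human-preferred response gets the higher
    score and large when the reward model ranks the pair the wrong way.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores: the reward model agrees, then disagrees, with the human label.
loss_agree = pairwise_preference_loss(2.0, 0.0)
loss_disagree = pairwise_preference_loss(0.0, 2.0)
print(loss_agree, loss_disagree)
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred responses higher, which is what lets it stand in for the human during RL fine-tuning.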