Reinforcement Learning from Human Feedback (RLHF)
- In RLHF, we use human feedback as a proxy for the “reward” signal in RL.
- Use cases: tuning an LLM so that its responses are
- Helpful: Give useful results
- Harmless: Avoid generating toxic, biased, or harmful content
- Aligned with Human Preferences: Respond in ways that humans find natural, helpful, and engaging
- Multiple Techniques:
- GRPO (Group Relative Policy Optimization)
- PPO (Proximal Policy Optimization)
- DPO (Direct Preference Optimization)
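- GRPO was introduced by DeepSeek and was used to train DeepSeek-R1
- GRPO is an RL algorithm; its rewards can come from automatic checks (for example, rule-based verifiers) or from a reward model trained on human preferences, so humans may or may not be involved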
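The "group relative" part of GRPO can be sketched in a few lines. This is a minimal illustration, not a full training loop: it assumes several responses have already been sampled for one prompt and scored, and it only shows how each reward is normalized against its own group. The function name `group_relative_advantages`, the example rewards, and the use of the population standard deviation are assumptions made for this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses in its group.

    Responses that beat the group average get a positive advantage,
    responses below the average get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for the same prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group's own mean, this step needs no separately trained value model, which is one way GRPO differs from PPO.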
Steps for RL
- RL involves
- Agent: The learner
- Environment: The world the agent interacts with
- Action: A choice the agent can make
- Reward: Feedback the environment gives the agent after an action
- Policy: The agent’s strategy for choosing actions
- In RL, the goal is to improve the policy so that the agent earns more reward over time
- Steps
- Observation: The agent observes the environment
- Action: The agent takes an action based on its current policy
- Feedback: The environment gives the agent a reward
- Learning: The agent updates its policy based on the reward
- Iteration: Repeat the process
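The five steps above can be sketched as a small loop. This is an illustrative example only: the `Corridor` environment, its reward of 1.0 at the last cell, and all hyperparameters are made up for the sketch, and tabular Q-learning with an epsilon-greedy policy is just one simple way to fill in the "Learning" step.

```python
import random

class Corridor:
    """Hypothetical toy environment: cells 0..4, reward 1.0 for reaching cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.length)]  # q[state][action]
    for _ in range(episodes):  # Iteration: repeat the process
        state = env.reset()    # Observation: the agent sees the current state
        done = False
        while not done:
            # Action: epsilon-greedy policy over the current Q-table
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Feedback: the environment returns the next state and a reward
            next_state, reward, done = env.step(action)
            # Learning: Q-learning update toward reward + discounted future value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy action for each non-terminal cell (1 means "move right")
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

After training, the greedy policy moves right in every cell, since that is the only way to reach the rewarded cell.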
Steps for RLHF
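- Ask humans to compare different responses generated by the LLM for the same prompt and indicate which response they prefer
- Use this human preference data to train a reward model that scores responses
- Use this reward model as the reward signal to fine-tune the LLM with RL (for example, PPO)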
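The reward-model step can be sketched with a pairwise preference loss. This is a toy illustration under loud assumptions: real reward models are neural networks that score text, while here each response is reduced to two hypothetical hand-written features (`[helpfulness_score, toxicity_score]`) and the model is a plain linear function. The loss is the commonly used pairwise form, `-log(sigmoid(r(chosen) - r(rejected)))`.

```python
import math

def reward(weights, features):
    """Linear reward model: a weighted sum of the response features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit weights so that human-preferred responses score higher."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient descent on -log(sigmoid(margin)):
            # the step shrinks as the model already ranks the pair correctly
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical preference data: (chosen, rejected) feature vectors,
# each vector being [helpfulness_score, toxicity_score]
pairs = [
    ([0.9, 0.1], [0.2, 0.1]),
    ([0.7, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.1, 0.7]),
]
weights = train_reward_model(pairs, dim=2)
```

In the final step, scores from a model like this would stand in for the environment's reward while the LLM is fine-tuned with an RL algorithm.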