Reinforcement Learning from Human Feedback (RLHF)

  • In RLHF, we use human feedback as a proxy for the “reward” signal in RL.
  • Goals
    • Helpful: Give useful results
    • Harmless: Avoid generating toxic, biased, or harmful content
    • Aligned with Human Preferences: Respond in ways that humans find natural, helpful, and engaging
  • Multiple Techniques:
    • GRPO (Group Relative Policy Optimization)
    • PPO (Proximal Policy Optimization)
    • DPO (Direct Preference Optimization)
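  • GRPO was introduced by DeepSeek (in the DeepSeekMath paper) and later used to train DeepSeek-R1
  • GRPO is an RL algorithm; its reward can come from a reward model trained on human preferences or from automated checks, so humans may or may not be directly involved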
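
The "group relative" part of GRPO can be illustrated with a small sketch: several responses are sampled for the same prompt, and each response's reward is compared with the rest of its group. The reward values and the function name below are hypothetical, and using the population standard deviation is an assumption made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses in its group.

    rewards: one reward per response sampled for the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Responses that score above the group average get a positive advantage and are reinforced; those below the average get a negative one. The group average serves as the baseline, so no separate value model is needed.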

Steps for RL

  • RL involves
    • Agent: The learner
    • Environment: The world the agent interacts with
    • Action: A choice the agent can make
    • Reward: Feedback the environment gives the agent after an action
    • Policy: The agent’s strategy for choosing actions
  • In RL, the goal is to improve the policy so that it earns more reward over time
  • Steps
    • Observation: The agent observes the environment
    • Action: The agent takes an action based on its current policy
    • Feedback: The environment gives the agent a reward
    • Learning: The agent updates its policy based on the reward
    • Iteration: Repeat the process
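
The loop above can be sketched with a toy multi-armed bandit, which is about the simplest RL setting. This is a minimal illustration, not a full RL algorithm: the function name, the epsilon-greedy policy, the Gaussian reward noise, and all numeric values are assumptions chosen for the example.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose arms pay noisy rewards."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # agent's running reward estimate per action
    counts = [0] * n_actions       # how often each action was taken

    for _ in range(steps):  # Iteration: repeat the process
        # Observation: a bandit has a single state, so there is nothing
        # new to observe; a richer environment would return a state here.

        # Action: follow the current policy (mostly greedy, sometimes random).
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Feedback: the environment returns a noisy reward for that action.
        reward = rng.gauss(true_means[action], 0.1)

        # Learning: update the estimate with an incremental average.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


estimates, counts = run_bandit([0.1, 0.5, 0.9])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("times each action was chosen:", counts)
```

Here the policy is implicit in the reward estimates: as the estimates improve, the greedy choice shifts toward the action with the highest average reward.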

Steps for RLHF

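  • Ask humans to compare different responses generated by the LLM for the same prompt and pick the one they prefer
  • Use this human preference data to train a reward model that scores responses
  • Use this reward model as the reward signal to fine-tune the LLM with an RL algorithm such as PPO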
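
The second step, turning preference pairs into a reward model, can be sketched with a pairwise (Bradley-Terry style) loss. This is a toy under stated assumptions: real reward models are neural networks that score text, whereas here each response is reduced to two made-up features (a helpfulness score and a toxicity score) and the reward is a linear function of them. All names and numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear reward model: a weighted sum of the response's features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred responses score higher than rejected ones.

    pairs: list of (chosen_features, rejected_features) from human comparisons.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Modeled probability that the human prefers `chosen`.
            p = sigmoid(reward(weights, chosen) - reward(weights, rejected))
            # Gradient step on the loss -log(p).
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights


# Hypothetical preference data: features are [helpfulness, toxicity].
pairs = [
    ([0.9, 0.1], [0.4, 0.1]),  # more helpful response preferred
    ([0.6, 0.0], [0.7, 0.8]),  # less toxic response preferred
    ([0.8, 0.2], [0.3, 0.6]),  # more helpful and less toxic preferred
]
weights = train_reward_model(pairs, n_features=2)
print("learned weights [helpfulness, toxicity]:", [round(w, 2) for w in weights])
print("score for a new response:", round(reward(weights, [0.9, 0.0]), 2))
```

In the third step, a score like this would be computed for each response the LLM generates and fed to the RL algorithm as the reward.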