Direct Alignment from Preferences - Part 01. RLHF
19 Feb 2024
Introduction
I dare say 2023 was the year of ChatGPT. There were various factors behind ChatGPT’s success. Thanks to advances in GPUs, it became possible to train tremendously large models quickly, and models of 175B parameters and beyond appeared. Also, going beyond simply gathering large amounts of data, the use of high-quality, refined instruction datasets pushed up its performance as a chat model that can interact with people. But if I had to pick just one factor among the many, I’d pick RLHF, that is, model alignment.
Model alignment, simply put, is “training a model to generate responses that match human preferences”. Conventional SFT (Supervised Fine-Tuning) trains a model to reproduce the responses in the training data. But a model trained only with SFT has difficulty responding appropriately to inputs unlike anything it saw during training. Also, if inappropriate responses are included in the training data, the model risks generating inappropriate or distorted responses itself. Model alignment therefore improves the model’s responses based on human feedback.
The model alignment approach used in ChatGPT is a method called RLHF, which trains an RM (Reward Model) and then optimizes the model against it with RL (Reinforcement Learning). But RL is a very unstable training method. So various methods that improve training stability without using RL have emerged, and these are called DAP (Direct Alignment from Preferences).
In this three-part series (probably?), we’ll look at how RL and DAP differ, and explore the latest RL as well as DAP methods. Part 1 is on RL methods.
Model alignment - RL
RLHF
This is a method proposed in OpenAI’s 2022 paper Training language models to follow instructions with human feedback (Ouyang et al., 2022). Actually, even before that, the Learning to summarize from human feedback (Stiennon et al., 2020) paper proposed the same method for the summarization task, but it didn’t have as much impact as ChatGPT.
RLHF is divided into three steps.
1) SFT: Train the model on data built in chat form.
2) RM: Generate multiple responses to a single question, have humans tag which response they prefer, and train the RM on these preferences.
3) RL: Improve the model using the trained RM and an RL algorithm (PPO).
Among these, let’s look at the equations for Step 2 and Step 3 in a bit more detail.
- Reward model training
  - Bradley-Terry preference model
    - Given a reward function $r(x,y)$ trained to reflect preferences, the probability that $y_1$ is preferred over $y_2$ is expressed as follows. You can understand it as a softmax of the reward outputs, which for two responses equals a sigmoid of the reward difference. Put simply, the higher the reward compared to the opponent, the higher the probability of being preferred.
    \[p(y_1 \succ y_2 | x) = \frac{\exp(r(x, y_1))}{\exp(r(x, y_1)) + \exp(r(x, y_2))} = \sigma(r(x, y_1) - r(x, y_2))\]
  - Reward model loss function
    - Build data composed of two responses to a single question, and have humans tag the preference. Define the preferred response as $y_w$ and the other response as $y_l$.
    - Train the reward model $r_\phi$ using the loss below. The model is trained in the direction that raises the reward of $y_w$ and lowers the reward of $y_l$.
    \[\mathcal{L}_R(r_\phi, D) = -\mathbb{E}_{(x, y_w, y_l) \sim D}[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))]\]
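The reward model training described above can be sketched in a few lines of Python. This is only a minimal sketch, assuming the standard Bradley-Terry form (the preference probability is the sigmoid of the reward difference) and a negative log-likelihood loss over tagged pairs. The function names and the toy reward values are made up for illustration, and a real RM would compute these rewards with a neural network.

```python
import math


def preference_probability(reward_1: float, reward_2: float) -> float:
    """Bradley-Terry probability that response 1 is preferred over response 2."""
    # exp(r1) / (exp(r1) + exp(r2)) is the same as sigmoid(r1 - r2).
    return 1.0 / (1.0 + math.exp(-(reward_1 - reward_2)))


def reward_model_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (reward of y_w, reward of y_l) pairs."""
    losses = [-math.log(preference_probability(r_w, r_l)) for r_w, r_l in pairs]
    return sum(losses) / len(losses)


# Toy reward scores, invented for illustration.
well_ranked = [(2.0, -1.0), (1.5, 0.0)]    # y_w already scores higher than y_l
badly_ranked = [(-1.0, 2.0), (0.0, 1.5)]   # y_w scores lower than y_l

print(preference_probability(1.0, 1.0))    # equal rewards give 0.5
print(reward_model_loss(well_ranked) < reward_model_loss(badly_ranked))  # True
```

The loss is small when the RM already ranks $y_w$ above $y_l$ and large when it ranks them the wrong way round, which is what pushes the two rewards apart during training.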
- RL (PPO)
  - Using the PPO algorithm, train the policy to maximize the objective function below.
  \[\max_{\pi_\theta} \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)}[r_\phi (x, y)] - \beta D_{KL}[\pi_\theta(y | x) || \pi_{\text{ref}}(y | x)]\]
  - It trains so as to maximize the reward while keeping the KL-divergence from the existing reference model from growing too large.
  - Using KL-divergence as a regularizer is meant to avoid losing too much of the previously learned knowledge. Also, if you don’t use a regularizer, a reward hacking phenomenon that over-optimizes to the RM can occur.

🚨 What is Reward hacking in LLMs? The phenomenon where the model keeps generating effectively meaningless responses, such as "How may I help you?" or "I'll give it a try. Please let me know how I should do it!". These responses are useless, but the RM is still more likely to prefer them over responses that tell a lie, so a policy that over-optimizes to the RM drifts toward them.
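To get a feel for what the KL term does, here is a small sketch that evaluates the objective exactly for a single prompt with three candidate responses. It is not PPO itself, only the quantity PPO is asked to maximize. The distributions, rewards, and function names are all invented for illustration.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


# Toy setup: one prompt, three candidate responses (all numbers invented).
reference = [0.5, 0.3, 0.2]    # pi_ref
rewards = [0.1, 0.2, 1.0]      # the RM strongly favors the third response
greedy = [0.01, 0.01, 0.98]    # policy that collapses onto the top reward
moderate = [0.3, 0.2, 0.5]     # policy that shifts toward it more gently

for beta in (0.0, 1.0):
    print(beta,
          regularized_objective(greedy, reference, rewards, beta),
          regularized_objective(moderate, reference, rewards, beta))
```

With `beta = 0.0` the collapsed policy scores higher, because only the reward counts. With `beta = 1.0` the moderate policy wins, because the collapsed one pays a large KL penalty for moving so far from the reference model.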
This method was effective, but it has fatal drawbacks.
1) Expensive human annotation
- Building preference data evaluated by humans is very expensive in terms of both cost and time.
- There’s currently a lot of open-source data, but highly refined Korean preference data is hard to find. Most of what exists is translated from English data.
2) Expensive computational cost
- The PPO algorithm requires four models: Policy model, Reference model, Reward model, Value model
- When the Policy model alone approaches 175B parameters, you have to secure GPU memory to additionally load three more models of similar size.
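A rough back-of-the-envelope calculation shows how heavy the four-model setup is. The sketch below assumes 16-bit weights (2 bytes per parameter) and that all four models are the same size as the policy. Both are my assumptions for illustration, and the figure covers the weights only, not gradients, optimizer states, or activations.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


policy_params = 175e9  # 175B parameters
models = ["policy", "reference", "reward", "value"]

per_model = weights_memory_gb(policy_params)  # assumes 16-bit weights
total = per_model * len(models)               # assumes four equally sized models

print(f"{per_model:.0f} GB per model, {total:.0f} GB for all four")
# prints "350 GB per model, 1400 GB for all four"
```

Even under these optimistic assumptions, the weights alone take up far more memory than a single GPU node typically offers.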
To solve this first problem, in 2023 Google Research proposed a method called RLAIF.
RLAIF
This is a method used in the RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (Lee et al., 2023) paper, published by Google Research in 2023.
The concept is very simple: replace the human feedback in RLHF with AI feedback. It relies on the claim, also made in Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023), that an LLM’s judgments can come close to human evaluation.
Preference tags obtained by giving an evaluation prompt to off-the-shelf LLMs like ChatGPT, Bard, and Claude correlate highly with tags made directly by humans.
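As a rough illustration of what tagging with an evaluation prompt could look like, here is a minimal sketch. The prompt wording, `label_preference`, and `stub_judge` are all hypothetical. In practice `judge` would wrap a call to whichever LLM API you use, and the prompts in the RLAIF paper are more elaborate than this one.

```python
from typing import Callable, Optional

# Hypothetical evaluation prompt, written only for this example.
PROMPT_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Which response is better? Answer with a single letter, A or B."
)


def label_preference(
    question: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str], str],
) -> Optional[tuple[str, str]]:
    """Return (chosen, rejected) according to the judge, or None if unparseable."""
    prompt = PROMPT_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )
    verdict = judge(prompt).strip().upper()
    if verdict == "A":
        return response_a, response_b
    if verdict == "B":
        return response_b, response_a
    return None  # drop examples where the judge did not answer A or B


def stub_judge(prompt: str) -> str:
    """Placeholder for a real LLM call; it always answers B."""
    return " b\n"


pair = label_preference(
    "What is the capital of France?", "I am not sure.", "Paris.", stub_judge
)
print(pair)  # ('Paris.', 'I am not sure.')
```

The returned `(chosen, rejected)` pairs play the same role as the human-tagged $(y_w, y_l)$ pairs in RLHF.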
In practice, the performance difference between using RLAIF and using RLHF wasn’t large, but in terms of price, they claimed that an AI labeler costs $0.06/example while a Human labeler costs $0.67/example, a difference of over 11x.
Also, AI feedback has an advantage not only in price but also in speed, enabling Iterative training. In the case of RLHF, a lot of time is consumed in the process of gathering the trained model’s responses after a training cycle completes and requesting tagging again from human annotators. But RLAIF has the advantage that, since AI feedback can be done anytime, training cycles can be performed quickly any number of times.
ReST (DeepMind)
Through the Reinforced Self-Training (ReST) for Language Modeling (Gulcehre et al., 2023) paper, DeepMind proposed ReST, a method that can perform RLHF iteratively.
ReST is composed of four steps, with steps 2-4 repeated.
1) RM training: Train the RM with human feedback (HF) data
2) Grow step: Generate multiple responses for a single prompt using the current policy model
3) Data tagging: Tag preferences using the RM
- At this point, only data with a reward higher than the reward threshold is used in the next step
4) Improve step: Train the policy using the data gathered in step 3 and an Offline RL method
5) Repeat steps 2-4
The point is that you can use the initially built RM to also tag preferences for the self-generated responses. However, as shown in the figure below, they say performance can only improve progressively if you gradually raise the reward threshold with each iteration to improve data quality.
Also, they say that when running multiple iterations, you should keep the learning rate small to prevent the model from changing drastically from the previous model.
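The loop described above can be sketched as a toy program. This is only an illustration under heavy assumptions: the "policy" is a probability table over three canned responses, the "RM" is a fixed lookup table, and the Improve step is a simple nudge toward the kept samples rather than a real offline RL update. All names and numbers are made up.

```python
import random


def grow(policy: dict[str, float], num_samples: int, rng: random.Random) -> list[str]:
    """Grow step: sample responses from the current policy."""
    responses = list(policy)
    weights = [policy[r] for r in responses]
    return rng.choices(responses, weights=weights, k=num_samples)


def improve(policy: dict[str, float], kept: list[str], step_size: float) -> dict[str, float]:
    """Improve step stand-in: shift probability mass toward the kept samples."""
    updated = dict(policy)
    for response in kept:
        updated[response] += step_size
    total = sum(updated.values())
    return {r: w / total for r, w in updated.items()}


def rest(policy, reward_fn, thresholds, num_samples=200, step_size=0.01, seed=0):
    """Run one Grow / tag / Improve round per threshold, raising it each time."""
    rng = random.Random(seed)
    for threshold in thresholds:
        samples = grow(policy, num_samples, rng)
        kept = [y for y in samples if reward_fn(y) > threshold]  # filter by reward
        policy = improve(policy, kept, step_size)
    return policy


toy_rewards = {"bad": 0.1, "ok": 0.5, "good": 0.9}  # stands in for the trained RM


def reward_fn(response: str) -> float:
    return toy_rewards[response]


initial = {"bad": 0.4, "ok": 0.4, "good": 0.2}
final = rest(initial, reward_fn, thresholds=[0.3, 0.6, 0.8])
print(final)
```

Because the threshold rises from 0.3 to 0.8, the first round keeps both "ok" and "good" samples while later rounds keep only "good" ones, so the policy gradually concentrates on the highest-reward response. The small `step_size` plays the role of the small learning rate: each round moves the policy only a little away from the previous one.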
As you can see in the experimental results below, they say that using ReST can produce better performance than using RLHF (Online RL).
However, it is reported that reward hacking occurs if you repeat too many iterations. As in the graph below, you can see that the average reward increased, but the difference in human evaluation score from the initial model gradually shrinks. You therefore need to find an appropriate number of iterations experimentally.
References
- Training language models to follow instructions with human feedback (Ouyang et al., 2022)
- Learning to summarize from human feedback (Stiennon et al., 2020)
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (Lee et al., 2023)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
- Reinforced Self-Training (ReST) for Language Modeling (Gulcehre et al., 2023)