Direct Alignment from Preferences - Part 03. Online DAP

24 Mar 2024

Introduction

In the previous post, “**Direct Alignment from Preferences - Part 02. DAP**”, we looked at the latest DAP methodologies, which train language models to generate answers aligned with human preferences without using reinforcement learning (RL). However, these methods train on offline data, that is, data collected elsewhere rather than generated by the very model being trained. The drawback is a distribution shift between the training data and the data the model itself generates, so the model may not reach the optimum even after training.

Therefore, online DAP methodologies emerged that can reach the optimum by using online data instead of offline data. In the final installment of this three-part series, Part 3, we will explore the latest online DAP methodologies.

Background

Before diving into online DAP methodologies, let’s first look at the definitions of online learning and offline learning and examine the characteristics of each.

On-line vs. Off-line learning

Off-line learning
- Trains the model on a fixed dataset that is built in advance
- No additional data is collected during training
- As a result, there is a gap between the distribution of the training data and the distribution of the data the trained model itself generates
- Offline RL methodologies typically apply various techniques to correct for this distribution shift
- DPO, covered in the previous post, solves the RL problem offline without any such correction. (Because LLMs are trained on a massive corpus, the data distribution does not differ much, so the problem is less prominent than in other domains.)

On-line learning
- Trains the model on data generated during the training process
- Continuously collects the data generated during training and keeps using it to train the model
- Because the training data comes from the model's own distribution, there is no distribution shift and the model can reach the optimum
- Generating high-quality data can be hard. If the quality of the data generated by the model is low, training can backfire.

Model alignment - online-DAP

Rejection-sampling SFT (Meta)

This is the methodology used in Meta’s 2023 paper Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023).

First, a reward model is trained using offline-collected Human Feedback (HF) data. After that, Best-of-N samples are collected and used for SFT — an incredibly simple scheme.

A Best-of-N sample works like this: for a single prompt, N answers are generated. Then those N answers are passed to the trained reward model, and the answer with the highest measured reward is collected to build the SFT data.
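As a minimal sketch of the Best-of-N collection step described above, the snippet below samples N answers per prompt and keeps the one with the highest reward. The names `generate` and `reward` are hypothetical stand-ins for the policy's sampler and the trained reward model, and the toy versions exist only so the example runs; none of this is taken from the Llama 2 codebase.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Sample n answers for one prompt and return the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    rewards = [reward(prompt, answer) for answer in candidates]
    best_index = max(range(n), key=lambda i: rewards[i])
    return candidates[best_index], rewards[best_index]


def build_sft_data(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> List[Tuple[str, str]]:
    """Collect one (prompt, best answer) pair per prompt for SFT."""
    return [(p, best_of_n(p, generate, reward, n)[0]) for p in prompts]


# Toy stand-ins (assumptions for illustration only).
_rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Pretend the policy produces answers of random length.
    return " ".join(["token"] * _rng.randint(1, 10))


def toy_reward(prompt: str, answer: str) -> float:
    # Pretend the reward model prefers answers close to five tokens long.
    return -abs(len(answer.split()) - 5)


sft_data = build_sft_data(["What is DPO?"], toy_generate, toy_reward, n=8)
print(sft_data)
```

In a real pipeline, `generate` would sample from the current policy with a nonzero temperature and `reward` would call the reward model; the selection logic itself stays this small.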

The paper experimentally showed why this kind of methodology is possible.

When generating answers via sampling, the mean reward barely changes as N grows, but the max reward rises significantly. In other words, the sampled answers have quite a large reward variance. Therefore, even SFT on Best-of-N data alone can reflect human feedback to some extent.
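The mean-versus-max effect can be reproduced with a small simulation. This is only an illustration under a strong assumption: rewards of sampled answers are modeled as independent standard normal draws, which is not something the paper claims. The function name `mean_and_max_reward` is made up for this sketch.

```python
import random
import statistics
from typing import Tuple


def mean_and_max_reward(n: int, trials: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Average, over many prompts, of the mean and the max reward among n samples."""
    rng = random.Random(seed)
    means, maxes = [], []
    for _ in range(trials):
        # Assumption: each sampled answer's reward is an i.i.d. N(0, 1) draw.
        rewards = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(rewards))
        maxes.append(max(rewards))
    return statistics.fmean(means), statistics.fmean(maxes)


for n in (1, 4, 16, 64):
    avg_mean, avg_max = mean_and_max_reward(n)
    print(f"N={n:3d}  mean reward={avg_mean:+.2f}  max reward={avg_max:+.2f}")
```

The mean stays near zero for every N while the max keeps climbing, which is the gap that Best-of-N selection exploits.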

The paper mentions that the Llama-2-chat model used a scheme where, starting from the pretrained model, Rejection-sampling SFT was repeated 4 times, and from iteration-5 onward PPO was performed to push the final performance higher.

RSO (Deepmind)

In 2023, DeepMind proposed a methodology called RSO in the paper Statistical Rejection Sampling Improves Preference Optimization (Liu et al., 2023). Rather than online learning in the strict sense, it is a way of collecting offline data so that its distribution is as close as possible to the samples that the final optimal model would generate. That sounds incredibly complicated, so let’s look at it in detail.

First, a reward model is trained using offline-collected HF data. Because the reward model is continuously used for data generation, training the reward model is very important.

Here, three human preference data collection methodologies are compared.

1) Direct: Offline-collected preference data
2) SFT-sample-rank: Data where, for a single prompt, multiple answers are generated via sampling from one SFT model and then tagged with ranks using a pre-trained RM
3) RSO-sample-rank: Given a trained reward model $\rho_\psi$, data is generated from $\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{sft}}(y \mid x) \exp\left(\frac{1}{\beta}\rho_\psi(x,y)\right)$ (not from $\pi_{\text{sft}}(y \mid x)$). The generated answers are then tagged with ranks by the trained RM to build the data

The problem here is: how do we generate data from $\pi_r(y \mid x)$ in RSO-sample-rank? In the paper, they use a scheme where the data generated by the SFT model is matched, via rejection sampling, to be as close as possible in distribution to what $\pi_r(y \mid x)$ would generate.

Rejection sampling proceeds as follows.

To express this more simply: an answer is accepted when a draw $u \sim U[0,1]$ is smaller than the following acceptance probability, written with the reward model $\rho_\psi$ and the SFT policy $\pi_{\text{sft}}$ defined above. $\frac{\pi_r(y \mid x)}{M_{D_x} \pi_{\text{sft}}(y \mid x)} = \exp \left( \frac{1}{\beta} \left( \rho_\psi(x, y) - \max_{y' \in D_x} \rho_\psi(x, y') \right) \right)$

In other words, the higher the reward, the higher the probability of being accepted in the end. Since the true maximum reward is unknown, rewards are computed per batch (of 64) and the maximum value within the batch is used.
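To see where the exponential form comes from, here is a short derivation sketch. It writes the reward model as $\rho_\psi$ and the SFT policy as $\pi_{\text{sft}}$, following the definition of $\pi_r$ above, and it assumes $M_{D_x}$ is the usual rejection-sampling envelope constant, i.e. the smallest value with $M_{D_x}\,\pi_{\text{sft}}(y \mid x) \ge \pi_r(y \mid x)$ for every $y \in D_x$.

$$
\frac{\pi_r(y \mid x)}{\pi_{\text{sft}}(y \mid x)} = \frac{1}{Z(x)} \exp\left(\frac{1}{\beta}\rho_\psi(x, y)\right),
\qquad
M_{D_x} = \max_{y' \in D_x} \frac{\pi_r(y' \mid x)}{\pi_{\text{sft}}(y' \mid x)} = \frac{1}{Z(x)} \exp\left(\frac{1}{\beta}\max_{y' \in D_x}\rho_\psi(x, y')\right)
$$

$$
\frac{\pi_r(y \mid x)}{M_{D_x}\,\pi_{\text{sft}}(y \mid x)} = \exp\left(\frac{1}{\beta}\left(\rho_\psi(x, y) - \max_{y' \in D_x}\rho_\psi(x, y')\right)\right)
$$

The intractable normalizer $Z(x)$ cancels, so the acceptance test needs only reward differences within the batch.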

This equation may look unfamiliar, but it can be seen as a generalized version of the Best-of-N rejection sampling we examined above. As $\beta \to 0$, only the highest-reward sample in the batch is accepted, so it reduces to Best-of-N rejection sampling; as $\beta \to \infty$, all samples are accepted, so it reduces to SFT-sample-rank.
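The acceptance rule and its two limits can be sketched in a few lines. This is a single-pass illustration, not the paper's full algorithm (which keeps sampling until enough answers are collected); the function name `rso_accept` and the example rewards are made up for this sketch.

```python
import math
import random
from typing import List, Sequence


def rso_accept(rewards: Sequence[float], beta: float, rng: random.Random) -> List[bool]:
    """One pass of the acceptance test over a batch of candidate rewards (beta > 0)."""
    max_reward = max(rewards)  # batch maximum stands in for the unknown true maximum
    accepted = []
    for r in rewards:
        # exp((r - max) / beta) never exceeds 1 and equals 1 for the best candidate.
        accept_prob = math.exp((r - max_reward) / beta)
        accepted.append(rng.random() < accept_prob)
    return accepted


rng = random.Random(0)
batch_rewards = [0.1, 2.0, -1.0, 1.5]  # hypothetical reward-model scores
for beta in (1e-6, 0.5, 1e9):
    print(f"beta={beta:g}: {rso_accept(batch_rewards, beta, rng)}")
```

With a very small `beta` only the top-reward candidate survives, and with a very large `beta` essentially every candidate is accepted, matching the two limits described above.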

The RSO paper argues that the RSO-sample-rank scheme, which builds data using the reward model, is closer to the optimal policy data distribution than the Direct or SFT-sample-rank data-building schemes, and demonstrated this through experiments.


Models were then trained on the resulting data with the DPO, SLiC, and RSO losses, and which loss function performed best varied by dataset. However, across all loss functions, the RSO-sample-rank scheme showed the highest performance.

Iterative DPO (Meta)

In 2023, Meta newly proposed a scheme that leverages something called Pairwise Cringe Loss in the paper Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss (Xu et al., 2023), and Iterative DPO was first proposed in this paper.

It is an extremely simple scheme: after a first DPO iteration using offline data, a preference data update is performed using the model’s own responses to carry out a second iteration. Even so, they report that performance improved with this scheme.
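The control flow of such a two-round scheme might look like the sketch below. The paper does not spell out the details, so everything here is an assumption made for illustration: `train_dpo`, `sample_two`, and `prefers_first` are hypothetical callables, and appending the new pairs to the old ones (rather than replacing them) is an arbitrary choice.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def iterative_dpo(
    model,
    offline_pairs: List[Pair],
    prompts: List[str],
    train_dpo: Callable,
    sample_two: Callable,
    prefers_first: Callable,
    iterations: int = 2,
):
    """Alternate DPO training with relabeling of the model's own responses."""
    pairs = list(offline_pairs)
    for step in range(iterations):
        model = train_dpo(model, pairs)
        if step == iterations - 1:
            break  # no need to build new pairs after the final round
        for prompt in prompts:
            first, second = sample_two(model, prompt)
            if prefers_first(prompt, first, second):
                pairs.append((prompt, first, second))
            else:
                pairs.append((prompt, second, first))
    return model, pairs


# Toy stand-ins: the "model" is just a counter of completed training rounds.
final_model, final_pairs = iterative_dpo(
    model=0,
    offline_pairs=[("q0", "good", "bad")],
    prompts=["q1", "q2"],
    train_dpo=lambda m, data: m + 1,
    sample_two=lambda m, p: (f"{p}-short", f"{p}-much-longer"),
    prefers_first=lambda p, a, b: len(a) < len(b),
)
print(final_model, final_pairs)
```

Who plays the role of `prefers_first` (a separate reward model or the DPO model itself) is exactly the detail the paper leaves open.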


However, they did not clearly disclose whether preference tagging in each iteration is done by a separately trained reward model or by the trained DPO model itself. Training details were likewise not disclosed, probably because Iterative DPO was not the paper's main contribution. Nevertheless, it is one of the models consistently compared against in subsequent online DAP methodologies.

Self-rewarding (Meta)

In 2024, Meta once again announced a new online DAP scheme in Self-Rewarding Language Models (Yuan et al., 2024). That an LLM can serve as a judge for evaluation is widely known from Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023), and this capability was applied to the online DAP scheme.


The method is simple. Using the current model, generate multiple answers to a prompt via sampling. Then feed those answers back into the current model to evaluate their ranking. (As you can tell here, this is a scheme that simply cannot be used if the model is a small model without the ability to judge.) Then DPO is performed with the data collected this way. And this process is repeated. Simple, right?
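One round of this data collection can be sketched as follows. The two callables `model_generate` and `model_judge` are hypothetical wrappers around the same model (one samples an answer, the other scores it), and taking the highest-scoring and lowest-scoring answers as the DPO pair, skipping ties, is an assumption about how the ranking is turned into pairs.

```python
from typing import Callable, Optional, Tuple


def self_rewarding_pair(
    prompt: str,
    model_generate: Callable[[str], str],
    model_judge: Callable[[str, str], float],
    n: int = 4,
) -> Optional[Tuple[str, str, str]]:
    """Sample n answers, score them with the same model, and build one DPO pair."""
    candidates = [model_generate(prompt) for _ in range(n)]
    scores = [model_judge(prompt, answer) for answer in candidates]
    best = max(range(n), key=lambda i: scores[i])
    worst = min(range(n), key=lambda i: scores[i])
    if scores[best] == scores[worst]:
        return None  # no preference signal when every answer scores the same
    return prompt, candidates[best], candidates[worst]


# Toy stand-ins: canned answers, and a "judge" that simply counts words.
_canned = iter(["It depends.", "Use DPO because it is simple.", "No."])
pair = self_rewarding_pair(
    "Which method?",
    model_generate=lambda p: next(_canned),
    model_judge=lambda p, a: float(len(a.split())),
    n=3,
)
print(pair)
```

The collected pairs then feed a DPO update, and the updated model becomes both the generator and the judge for the next round.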

Let’s look at it in a bit more detail.

Because Self-rewarding requires the ability to act as a judge, data related to judging must be learned during the SFT stage.


Accordingly, they mention building a dataset in which answers are evaluated and scored, and using it alongside the other data in SFT. Building this data appears to take quite a bit of time and money, but they report that without it, the performance improvement in the subsequent DPO process is not large.

Afterward, when repeating the DPO process, the effect diminishes if the prompts are identical, so they create new prompts using the self-instruction scheme. Self-instruction is a methodology in which you give the LLM a few prompts as few-shot examples and ask it to create various new prompts. There’s an interesting remark: after DPO, the models do not work well with the self-instruction scheme. Therefore, when performing Self-reward iterations, prompt generation is done only with the initial SFT model.

They repeated the Self-rewarding scheme for 2 iterations and experimentally showed that performance continuously improved. In the first iteration, 4K pairs of data were generated and used, and in the second iteration, 7K pairs of data were generated and used.

M1: SFT model with judging data added
M2: Self-rewarding first iteration
M3: Self-rewarding second iteration


Furthermore, they showed that, through this process, performance as a judge also continuously improved.


I think this is a good methodology that can create, evaluate, and improve data on its own without external intervention (including better external LLM APIs). One limitation, however, is a report that the length bias became severe. As also mentioned in LLM-as-a-Judge, this is the phenomenon where, if you keep leaving evaluation to the LLM, it prefers longer answers.

Online-DAP (Deepmind)

In 2024, DeepMind proposed the OAIF methodology in the paper Direct Language Model Alignment from Online AI Feedback (Guo et al., 2024). It is very similar to the Self-rewarding scheme, but instead of trusting the model's own judging ability, it performs online DAP using an off-the-shelf model as the annotator.


There are two differences from Self-rewarding. First, the same prompt set is used every iteration. It seems they judged that prompt generation via self-instruction does not produce large performance gains. Second, instead of self-annotation, an external off-the-shelf model is used. This also appears to be an alternative that can be used in situations where the LLM being trained lacks the ability to act as a judge.
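A pass over the fixed prompt set with an external annotator might be sketched like this. All names (`sample`, `annotator_prefers_first`, `dpo_update`) are hypothetical placeholders, and sampling exactly two responses per prompt with an update after every pair is an assumption of this sketch rather than a detail stated above.

```python
import itertools
from typing import Callable, List


def oaif_epoch(
    model,
    prompts: List[str],
    sample: Callable,
    annotator_prefers_first: Callable,
    dpo_update: Callable,
):
    """Reuse the same prompts; label fresh on-policy pairs with an external annotator."""
    for prompt in prompts:
        first = sample(model, prompt)
        second = sample(model, prompt)
        if annotator_prefers_first(prompt, first, second):
            chosen, rejected = first, second
        else:
            chosen, rejected = second, first
        # The pair comes from the current model, so the update stays on-policy.
        model = dpo_update(model, (prompt, chosen, rejected))
    return model


# Toy stand-ins: the "model" is the list of pairs it has been updated on.
_counter = itertools.count()
history = oaif_epoch(
    model=[],
    prompts=["a", "b"],
    sample=lambda m, p: f"{p}:{next(_counter)}",
    annotator_prefers_first=lambda p, x, y: x > y,
    dpo_update=lambda m, pair: m + [pair],
)
print(history)
```

Swapping `annotator_prefers_first` for the model's own judgment would turn this back into something close to the Self-rewarding loop, minus the new-prompt generation.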

Evaluation was conducted on the TL;DR data (a summarization task). In the case of offline DPO, because it uses fixed offline data, reward hacking occurs. Therefore, beyond a certain number of steps, the reward may rise but performance plummets. In the case of OAIF, however, it shows performance similar to RLHF and RLAIF.


However, there are a few limitations. When a small model is used as the annotator, it shows lower performance than RLAIF. In the paper's figure, the upper graph uses PaLM-2-XL as the annotator and the lower graph uses PaLM-2-L. Also, supposing you used GPT-4-turbo instead of PaLM-2-XL, tagging all the data needed for that graph would cost roughly 30 million won. (Way too expensive…)


When evaluated by humans, the performance improvement of Online-DPO was even more pronounced. In the comparison with Offline-DPO, not only was the win rate much higher, but it also received higher scores in terms of quality.


Also, when compared in a 4-way manner against the RL schemes represented by RLHF and RLAIF, it showed an even larger performance gap.


Finally, they cited as an advantage of applying Online DAP using an off-the-shelf LLM instead of an RM the fact that it can reflect “human preferences that can be expressed in natural language” without training a reward model. For example, suppose you want to train a model that prefers short answers because the answers are too long. Techniques that use an RM must collect data tagging short answers as win and long answers as lose for the same prompt, and retrain the RM. But if you use an off-the-shelf LLM, through prompt engineering you can add a sentence like “if the meaning is the same, give a higher score the shorter the answer,” and thereby reflect the preference without any additional retraining.
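The prompt-engineering route can be illustrated with a small template builder. The template wording and the function name `build_annotator_prompt` are invented for this sketch and are not the prompt used in the paper; the point is only that a new preference becomes one extra sentence rather than a retrained reward model.

```python
ANNOTATOR_TEMPLATE = (
    "You are comparing two answers to the same prompt.\n"
    "{extra_instruction}"
    "Prompt: {prompt}\n"
    "Answer A: {answer_a}\n"
    "Answer B: {answer_b}\n"
    "Reply with 'A' or 'B' to indicate the better answer."
)


def build_annotator_prompt(
    prompt: str, answer_a: str, answer_b: str, extra_instruction: str = ""
) -> str:
    """Fill the judging template, optionally adding a natural-language preference."""
    if extra_instruction:
        extra_instruction = extra_instruction.strip() + "\n"
    return ANNOTATOR_TEMPLATE.format(
        extra_instruction=extra_instruction,
        prompt=prompt,
        answer_a=answer_a,
        answer_b=answer_b,
    )


default_prompt = build_annotator_prompt("Summarize the post.", "Short.", "A much longer summary.")
brevity_prompt = build_annotator_prompt(
    "Summarize the post.",
    "Short.",
    "A much longer summary.",
    extra_instruction="If the meaning is the same, give a higher score the shorter the answer.",
)
print(brevity_prompt)
```

The resulting string would be sent to whichever off-the-shelf LLM serves as the annotator; the API call itself is left out on purpose.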

Conclusion

Across three posts, we examined three ways to train LLMs to generate answers that reflect human preferences — RL, Offline DAP, and Online DAP. I’ve organized the pros and cons, cost, computing resources, and so on of the various methodologies in each scheme into a table.

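In the table, O means yes, X means no, and △ means partially.

| Scheme | Algorithm | No human preference tagging required | Unbiased data distribution (training data from the training policy) | Online feedback (iterative improvement) | No external API cost | # of models for training |
|---|---|---|---|---|---|---|
| RL | RLHF | X | O | O | O | 4 |
| | RLAIF | O | O | O | X | 4 |
| | ReST | X | O | O | O | 4 |
| Offline DAP | SLiC-HF | X | X | X | O | 2 (1) |
| | DPO | X | X | X | O | 2 (1) |
| | IPO | X | X | X | O | 2 (1) |
| | RRHF | X | △ | X | O | 1 (2) |
| | CLICK | X | △ | X | O | 1 (2) |
| | SPIN | O | △ | O | O | 2 (1) |
| Online DAP | Rejection-sampling SFT | X | O | O | O | 2 |
| | RSO | X | O | X | O | 3 |
| | Iterative DPO | X | O | O | O | 3 (2) |
| | Self-rewarding | △ (works even without it) | O | O | O | 2 (1) |
| | OAIF (online DPO) | O | O | O | X | 2 (1) |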

These methodologies, called model alignment, are still an actively researched field. Of course, there is no single correct answer about which methodology you must use. You just choose the appropriate method to fit your own data situation and computing resources.

In 2023, OpenAI announced that it had created a new team. The team’s name is “Superalignment”, and it is dedicated to aligning AI that is far smarter than humans. Going forward, beyond model alignment (reflecting human preferences), methodologies will emerge for superalignment, in which humans who are less smart than the AI train smarter models. I believe the field of model alignment will lay a solid foundation for that.


Jae-Kyung Cho Being unique is better than being perfect
