Paper review - A survey of preference-based reinforcement learning methods
01 Jan 2022The first post on my blog in 2022, and my first paper review, will start with A survey of preference-based reinforcement learning methods. The first time I came across Preference-based RL was at the SNU Summer AI lecture held during the summer break of 2021, given by Dr. Kimin Lee at Berkeley (he is a great speaker, so I really enjoyed it. If you’re interested, definitely check it out! SNU summer AI - PEBBLE). Let’s start with the survey paper and take a look at the overall flow.
Table of contents
Introduction
Preference-based reinforcement learning (PbRL) started from the following motivation.
PbRL have been proposed that can directly learn from an expert’s preferences instead of a hand-designed numeric reward.
RL based on rewards advanced dramatically with AlphaGo as a turning point, but it exposes several problems when it comes to defining the reward function. (crucially depends on the prior knowledge that is put into the definition of the reward function) Let’s look at the four problems that account for the biggest part.
- Reward hacking : The agent may maximize the given reward, without performing the intended task. For example, if you give a vacuum cleaner a positive reward when there is no dust, it won’t remove the dust but will just stare at the spots where there is no dust.
- Reward shaping : The reward does not only define the goal but also guides the agent to the correct solution. The thing is, not only do we not know the optimal reward function (it may not even exist in practice), but problems arise because a human has to design it. This is actually the biggest problem of RL….
- Infinite rewards : Some applications require infinite rewards. For example, a self-driving car must never hit a person, so we would need to give it a negative infinity reward, but this breaks classic RL. (One of RL’s assumptions is finite reward.)
- Multi-objective trade-offs : The trade-off may not be explicitly known. The trade-off relationships between rewards are hard to grasp properly. (It would be difficult to set the balance of a self-driving car’s ride comfort, speed, safety, etc. numerically.)
In any case, the problem is that it’s difficult to express the intrinsic motivation embedded in an agent’s behavior as a numerical scalar value. Various methods have been studied to solve this. There are approaches like Inverse RL and learning with advice, but PbRL differs from them in one respect.
PbRL aims at rendering reinforcement learning applicable to a wider spectrum of tasks and non-expert users.
In other words, PbRL aims to make it possible to train an agent with nothing but preferences, even for someone who knows nothing about RL. (A world where ordinary people can teach robots…!!)
Preliminaries
We looked at the motivation of PbRL in the Introduction, and now let’s look at the points needed to learn PbRL.
Preference learning is about inducing predictive preference models from empirical data.
PbRL aims to infer a preference model from the user’s data.
Preference learning
So now shall we look at some equations? A preference is expressed in the following five ways. First, we unconditionally assume that we are always comparing two options.

Markov decision processes with preferences (MDPP)
An MDPP consists of a sextuple.
\(S,A\) denote the state and action spaces, \(\mu\) is the initial state distribution, and \(\delta\) denotes the transition probability \(\delta(s'|s,a)\). \(\gamma\in[0,1]\) is the discount factor, of course.
Now the unusual one is \(\rho\), which is the probability distribution of preference. Since humans always have stochasticity, they may make different choices even given the same options. This expresses that as a probability. That is, \(\rho(\tau_1\succ\tau_2)\) is the probability of preferring \(\tau_1\) over \(\tau_2\). If approached strictly, it is a probability for which the complement holds, but since preference can be ambiguous, the complement may not hold. Now let’s define the dataset. We define the collection of all trajectories as follows.
Objective
Organizing the objective function as an equation gives the following.
The goal becomes finding \(\pi^*\) that satisfies this.
But this is not easy to use when the difference in preference is very small (and on top of that the user has stochasticity), so we’ll use a bit of a trick to turn it into a maximization problem.
Now we’ve entered the specialty of Deep learning. The idea is that we just optimize the policy using the Loss function below.
Additionally, since the preference objective must be satisfied for the entire dataset, we use a weighted sum to define the final Loss function.
A very simple and clear Loss function has been derived. (Of course, this is just the beginning.)
PbRL algorithms
PbRL algorithms are broadly divided into three.
- learning a policy computes a policy that tries to maximally comply with the preferences
- learning a preference model learns a model for approximating the expert’s preference relation
- learning a utility function estimates a numeric function for the expert’s evaluation criterion
In case 1, it’s direct policy learning—in other words, obtaining a policy by directly minimizing the Loss function in the Deep learning way. In the figure, this is the upper loop (dashed line). In cases 2 and 3, the idea is to process the preference before using it. This is the lower loop (dotted line). All three approaches show an on-policy character in that, after learning, they obtain samples through the new policy and learn again. (Dr. Kimin Lee’s PEBBLE paper, mentioned at the start, proposed off-policy PbRL.)
Related problem settings
Here we introduce studies that approached RL’s problems with similar methods.
- Learning with advice
Classic RL + additional constraints (Rule or preference) - Ordinal feedback
numeric ranking instead of pairwise preference - IRL
A very strong assumption that the expert demo is the optimal trajectory. Its drawback is that you can’t obtain additional feedback. (With GAIL now available, it seems possible nowadays.) - Unsupervised learning
A strong assumption that the more it learns, the more preferable the policy becomes.
I didn’t fully understand everything, but none of these studies have the perspective of non-expert interpretability, which is one of PbRL’s goals.
Design principles of PbRL
Now let’s look at how preference feedback is given in PbRL. This paper proposes three types.
- action preference
Comparing two action preferences for the same state - state preference
Comparing the respective best actions for different states - trajectory preference
Comparing the entire sequential trajectory composed of (state, action)
In fact, since case 3 includes cases 1 and 2, it’s fine to assume case 3. (Cases 1 and 2 can be seen as single-step trajectories.) When using trajectory preference, the biggest problem is credit assignment.
Yet, a difficulty with trajectory preferences is that the algorithm needs to determine which states or actions are responsible for the encountered preferences, which is also known as the temporal credit assignment problem
It means that we have to design the Objective function well so that, when comparing trajectories, we can determine which action taken in which state within them should be given credit.
Learning a policy
Policy distribution
This is a method that obtains the distribution of the parameterized policy and then finds the optimal policy via MAP (maximum-a-posterior) over the preference dataset.
A characteristic is that it samples two policies from the policy distribution and collects the resulting trajectory pair in a buffer. After enough pairs have accumulated, it receives expert feedback.
The Likelihood function here is said to be as follows, but in my opinion, a negative sign should be attached.
The likelihood is high if the realized trajectories \(\tau^\pi\) of the policy \(\pi\) are closer to preferred trajectory.
Whether MAP or MLE, the result is similar, so thinking of it as MLE makes it intuitively understandable.
The problem with this method is that it requires a distance function \(d\), and Euclidean distance reportedly struggles with high-dimensional continuous state spaces.
Ranking policy
This time, we compare policies head-on. It’s said to be a method similar to multi-arm bandit. A characteristic is that it goes through the preference feedback process immediately upon trajectory rollout.
First, you prepare a whole set of policies, compare them, and then use an algorithm called EDPS (evolutionary direct policy search) to turn it into a better policy set.
As is obvious, its drawback is that it requires an enormous amount of comparison.
Learning a preference model
Learning a preference model is the same as solving a classification problem.
This becomes a model that can figure out the preference between two actions for a given state \(s\). Let’s look at the algorithm proposed by Fürnkranz et al(2012).

Let’s look at line 5 in detail. For the initial state \(s\), it sweeps through all actions. After that, it forms a trajectory with the current policy. Based on this data, if we build a preference model (a classification model), we can derive a greedy policy.
\(k(s,a)\) is called the resulting count, and the higher it is, the higher the preference within the action set. However, it’s questionable how this could work in a continuous action set. Another characteristic is that if \(C\) is used as a continuous function, you can obtain uncertainty, which can be used for exploration. (I wonder why it wouldn’t work for discrete, but continuous probably works better.)
Learning a utility function
A utility function is similar to the reward in RL but shows a slight difference.
However, in the PbRL case it is sufficient to find a reward function (=utility function) that induces the same, optimal policy as the true reward function.
Since it doesn’t need to take the form of a fixed reward like in classic RL, it’s given a separate name. Basically it uses a scalar utility, defined simply as follows.
Here, we just need to select the policy that maximizes utility.

Let’s look at the algorithm. First, since the initial state of every trajectory is a sampled value, we can see that these are trajectories starting from different places. After creating the dataset, we learn the utility function through trajectory pair sampling (line 12). So what does the form of the utility function look like?
Linear utility function
A linear utility function can be defined as follows.
Here \(\psi\) denotes the feature function. If there’s prior knowledge, it would already be defined; we could also extract it using DL. The final goal is, of course, to find \(\theta\) that parameterizes the optimal utility function.
To consider preference, we have to compare two trajectories. We define that utility difference as follows.
We just need to make this difference larger. That way we can find a utility function that clearly distinguishes preferences. As for optimization methods, there’s a method using Loss and a method using Log likelihood. There’s no fundamentally big difference, but using Loss is a minimization problem while using Log likelihood is a maximization problem, so the function shape becomes left-right symmetric. These are methods used in prior studies, but they probably aren’t very important. (Since I’m going to use DL.)


Or if even that’s a hassle, you can just use gradient descent. If you use \(y=-x\) as the loss function, you can compute \(\boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}+\alpha\left(\boldsymbol{\psi}\left(\boldsymbol{\tau}_{k 1}\right)-\boldsymbol{\psi}\left(\boldsymbol{\tau}_{k 2}\right)\right)\).
Such a linear utility function is simple but contains the following caveat.
However, it is unclear how the aggregated utility loss L(θ, ζ) is related to the policy loss L(π, ζ) (see Sec. 2.3), as the policy is subject to the system dynamics whereas the utility is not.
In the policy loss we looked at earlier, the policy is the entity that actually takes actions, so the system dynamics were taken into account. But the utility function we just saw only does a simple utility computation for a given trajectory. That’s why the relationship between the two is unclear.
Non-linear utility function
Various methods are introduced, but the most notable one is the approach using DL. It’s also the method that was used as a baseline in the PEBBLE paper introduced earlier.
DRL from human preference - Christiano et al.
Since this paper also needs to be covered, I’ll deal with it separately later.
The temporal credit assignment problem
In classic RL too, temporal credit assignment was a huge problem. Since a policy’s value includes all future possibilities, it’s hard to know how dominant an influence the action taken in the current state will have. To solve this, the paper explains that using Advantage is the standard approach. However, it’s hard to solve this explicitly.
Yet, if we try to solve the credit assignment problem explicitly, we also require the expert to comply with the Markov property. This assumption can easily be violated if we do not use a full state representation, i.e., if the expert has more knowledge about the state than the policy
To assume Markov, the expert has to know the true state. But everything in the world is a Hidden Markov model, so there’s no way to know it.
The methods of defining utility are broadly divided into three.
- Value-based utility
\(\pi^{*}(a \mid s)=\mathbb{I}\left(a=\underset{a^{\prime}}{\arg \max } \mathbb{E}_{\delta}\left[U\left(s^{\prime}\right) \mid s, a^{\prime}\right]\right)\) Usable only when the transition model is known. - Return-based utility
\(U(\boldsymbol{\tau})=\boldsymbol{\theta}^{T} \boldsymbol{\psi}(\boldsymbol{\tau})\) Can be generalized to reward-based utility. - Reward-based utility \(U\left(s_{t}, a_{t}\right)=\boldsymbol{\theta}^{T} \boldsymbol{\varphi}\left(s_{t}, a_{t}\right)\)
In case 3, this essentially produces the same effect as a reward. In that it figures out the reward based on preference, it takes a form similar to IRL. But since we will compare preferences between trajectories, it’s effectively the same as using return-based utility.
The DRL from human preference - Christiano et al. paper also used reward-based utility, but it had two characteristics.
In contrast to conventional reinforcement learning, a discount factor of γ = 1 is used for U(τ) in all approaches because the expert should not need to consider the effects of decay.
A major problem as all considered trajectories have a finite length.
Humans don’t need decay so they won’t do it, and they had no choice but to use finite length. Both are clear.
Trajectory preference elicitation
So how should we construct the trajectory query to get good feedback from the expert? Obtaining preferences from an expert is a very high-cost task. It’s a task that isn’t automated, after all. We have to use as few queries as possible, but then an exploration problem arises and we fall into a local optimum.
Trajectory generation
A PbRL trajectory must have three characteristics.
In order to be informative, the obtained preferences should be different from existing trajectories. Yet, the trajectories should also be close to optimal in order to obtain useful information. Furthermore, the trajectories need to contain sufficient information about the transition function to compute an optimal policy.
- They must be distinct enough to distinguish preferences,
- They must be sufficiently close to the optimal trajectory, and
- They must contain sufficient information about the transition function.
If we explore sufficiently and extract as diverse a variety of trajectories as possible, we can satisfy all three conditions. The fewer preference queries the better, but trajectory generation itself takes no labor, so the more the better.
Exploration is divided into undirected, directed, heterogeneous, and user-guided exploration. Since the method differs for every study, I won’t explain in detail, but it’s introduced that in the case of using DRL, undirected exploration using a stochastic policy is used. If you use an algorithm like SAC, it explores in the direction of increasing entropy.
Preference query generation
What kind of query can we extract from the created trajectories? There are likewise three approaches.
- Exhaustive generation : All possible queries (uses the entire dataset)
- Greedy generation : Use trajectories generated from the optimized policy (uses only trajectories created by the latest policy)
- Interleaved generation : Several ways (various approaches)
Sometimes you pick trajectories with high utility, sometimes you use an ensemble approach to pick the one with high variance—various methods exist depending on the study.
Still in the process of writing this… ㅠㅠ