Paper review - Learning trajectory preferences for manipulators via iterative improvement

10 Jan 2022

This is my second paper review. I’ll be reviewing the paper Learning trajectory preferences for manipulators via iterative improvement, published at NeurIPS in 2013.
Preference-based reinforcement learning (PbRL) isn’t a very old research field, so a 2013 paper can be considered truly ancient. That said, the survey paper mentioned that it uses trajectory planning, so I picked it first. It doesn’t use deep learning at all, and learns with a very simple linearly parameterized scoring function.

Research using human demonstrations on a manipulator was actually well known. (I can’t quite remember what the robot’s name was, though…)

This paper argues that in high-DoF environments, feedback in the form of a slightly improved trajectory is more helpful for learning than a full human demonstration. Let’s take a look at the method and the results.

Table of contents

Introduction
Related works
Learning and Feedback model
Learning algorithm
Features describing object-object interactions
- Trajectory features
- Learning the scoring function
Experiments
- Evaluation metric
- Results
Conclusion

Introduction

This paper does a good job of defining, with several examples, why human preference is needed.

a household robot should move a glass of water in an upright position without jerks while maintaining a safe distance from nearby electronic devices

For example, trajectories of heavy items should not pass over fragile items but rather move around them

The gist of this research is that even when moving the same object, you should behave differently depending on what the object is and what is around it. This is a human preference that is hard to express as a reward function, and the goal is to let the robot learn it.

The robot learns a general model of the user’s preferences in an online fashion.

In this process you can’t directly verify whether learning is going well, so the authors define a quantity called regret. The closer the robot’s ranking of trajectories gets to the human’s ranking, the smaller the regret becomes (explained later).

They used the Grocery checkout task, a problem of placing several objects and moving them with a robot arm, as the experimental environment.

This paper makes many comparisons with Learning from demonstration (LfD). The biggest problem with LfD is that you can’t know whether the user demonstration is actually optimal.

The user never discloses the optimal trajectory

In other words, the goal of this research is to learn how to improve based on preference.

Learning a score function representing the preferences in trajectories

Here, the goal is to obtain a score function, similar to the utility function that was one of the methods of PbRL.

Learning and Feedback model

First, let’s call the scoring function we want to learn $s(x,y;w)$. Here $x$ is the context, $y$ is the trajectory, and $w$ is the parameter we want to learn. The optimal scoring function reflecting human preference is denoted $s^*(x,y)$.

The process of learning the scoring function is divided into 3 steps.

Step 1: The robot receives a context x. It then uses a planner to sample a set of trajectories, and ranks them according to its current approximate scoring function s(x, y; w).
Based on the context x, the robot forms multiple trajectories. Then it assigns them ranks with the scoring function. It runs trajectory generation using RRT, and since there is randomness, various kinds of trajectories will be produced.
Step 2: The user either lets the robot execute the top-ranked trajectory, or corrects the robot by providing an improved trajectory y¯. This provides feedback indicating that s∗(x, y¯) > s∗(x, y).
- Re-ranking : selects the top trajectory and gives feedback by correcting the ranking.
- Zero-G : directly moves one of the trajectory waypoint positions.
Step 3: The robot now updates the parameter w of s(x, y; w) based on this preference feedback and returns to step 1.
In other words, it updates the scoring function and the loop repeats.

It’s clearly a very simple structure. Since this paper actually predates even the AlphaGo paper, it may look crude from the perspective of today’s deep learning, but if this kind of theory hadn’t laid the groundwork, deep learning wouldn’t have been able to advance either.

Finally, they define Regret for performance evaluation.

$$REG_T=\frac{1}{T}\sum_{t=1}^T\left[s^*(x_t,y_t^*)-s^*(x_t,y_t)\right]$$
$$\text{where}\ y^*_t=\arg\max_y s^*(x_t,y)$$

But something seems off. Since $s^*$ is unknown, you can’t actually compute the regret in practice. So the authors instead prove convergence using a regret bound. (Comes up later.)
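In a simulation, where a made-up "true" score stands in for $s^*$, the average regret can still be computed directly. The sketch below assumes the scores of a few sampled candidates are known per round, so the max is taken over those candidates only rather than over all trajectories. The function name and the numbers are hypothetical, not from the paper.

```python
def average_regret(true_scores_per_round, chosen_indices):
    """REG_T = (1/T) * sum_t [ s*(x_t, y_t^*) - s*(x_t, y_t) ].

    true_scores_per_round[t][i] is the (simulated) s*(x_t, y_i) of candidate i.
    chosen_indices[t] is the index of the trajectory the robot presented.
    """
    total = 0.0
    for scores, chosen in zip(true_scores_per_round, chosen_indices):
        # Best achievable score among the candidates minus the score obtained.
        total += max(scores) - scores[chosen]
    return total / len(chosen_indices)


# Three rounds with three candidate trajectories each (made-up scores).
true_scores = [
    [0.2, 0.9, 0.5],
    [0.7, 0.1, 0.4],
    [0.3, 0.6, 0.8],
]
picked = [2, 0, 2]  # the robot's top-ranked trajectory in each round
regret = average_regret(true_scores, picked)
```

Only the first round contributes here, because in the other two rounds the robot already picked the best candidate.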

To receive human feedback, you need a user interface. The authors say they used a program called OpenRAVE. It was set up so that clicking on one of the displayed trajectories moves it to the top rank.

Learning algorithm

First, let’s keep in mind that this is a paper that doesn’t use deep learning.
The scoring function is defined as follows.

$$s(x,y;w_O,w_E)=w_O\cdot\phi_O(x,y)+w_E\cdot\phi_E(x,y)$$

Here O denotes the surrounding objects that the trajectory interacts with, and E denotes the object that must be manipulated and its environment.
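As a minimal sketch of how such a linear score could be used to rank sampled trajectories: the feature vectors below are tiny made-up stand-ins for $\phi_O$ and $\phi_E$, and the helper names are my own, not the paper's.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def score(w_O, w_E, phi_O, phi_E):
    """s(x, y; w_O, w_E) = w_O . phi_O(x, y) + w_E . phi_E(x, y)."""
    return dot(w_O, phi_O) + dot(w_E, phi_E)


def rank_trajectories(w_O, w_E, features):
    """features holds one (phi_O, phi_E) pair per candidate trajectory.
    Returns candidate indices sorted from best to worst score."""
    scores = [score(w_O, w_E, p_o, p_e) for p_o, p_e in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


# Hypothetical weights and candidate features.
w_O = [1.0, -2.0]
w_E = [0.5]
candidates = [
    ([1.0, 0.0], [2.0]),  # score 1.0 + 1.0 = 2.0
    ([0.0, 1.0], [4.0]),  # score -2.0 + 2.0 = 0.0
    ([2.0, 0.5], [4.0]),  # score 2.0 - 1.0 + 2.0 = 3.0
]
ranking = rank_trajectories(w_O, w_E, candidates)
```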

Features describing object-object interactions

We enumerate waypoints of trajectory y as $y_1, .., y_N$ and objects in the environment as O = {$o_1$, .., $o_K$}. The robot manipulates the object $\bar{o}$ ∈ O. We connect an object $o_k$ to a trajectory waypoint if the minimum distance to collision is less than a threshold or if $o_k$ lies below $\bar{o}$.

It defines the trajectory, the objects, and the manipulated object, and connects them. When they get close enough, it connects an edge. An example figure is shown below.

First, the object-object part of the scoring function is as follows.

$$s_{O}\left(x, y ; w_{O}\right)=\sum_{\left(y_{j}, o_{k}\right) \in \mathcal{E}} \sum_{p, q=1}^{M} l_{k}^{p} l^{q}\left[w_{p q} \cdot \phi_{o o}\left(y_{j}, o_{k}\right)\right]$$

$l_k^p$ is the p-th attribute of the k-th object $o_k$. Every object has M attributes $[l_k^1,\dots,l_k^M]$, and each attribute is represented as binary. For example, a Laptop has the following attributes.

{heavy, fragile, sharp, hot, liquid, electronic} = [0,1,0,0,0,1]

Pretty naïve, isn’t it?? It can’t be helped. DL hadn’t properly developed yet!
$l^q$ is the q-th attribute of the manipulated object $\bar{o}$. That’s because the distance to surrounding objects must be adjusted depending on which object you’re moving.
$\phi_{oo}(y_j,o_k)\in\mathcal{R}^4$ is the feature of the edge. It consists of the minimum x, y, z distances plus a binary indicating whether $o_k$ lies vertically below $\bar{o}$.

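In the end, pulling the weights out of the sum gives $s_O=w_O\cdot\phi_O(x,y)$, where the block of $\phi_O$ for the attribute pair $(u,v)$ is $\sum_{\left(y_{j}, o_{k}\right) \in \mathcal{E}} l_{k}^{u} l^{v}\left[\phi_{o o}\left(y_{j}, o_{k}\right)\right]$.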
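To make the attribute gating concrete, here is a small sketch that accumulates $\phi_O$ block by block. It reuses the laptop attribute vector from above; the manipulated object (a knife that is only "sharp") and the edge feature values are made up for illustration.

```python
ATTRIBUTES = ["heavy", "fragile", "sharp", "hot", "liquid", "electronic"]


def object_object_features(edges, manipulated_attrs):
    """Accumulate phi_O as an M x M grid of 4-d blocks.

    edges: one (attrs_k, phi_oo) pair per edge (y_j, o_k), where attrs_k is the
    binary attribute vector of o_k and phi_oo is the 4-d edge feature.
    Block [p][q] is the sum over edges of l_k^p * l^q * phi_oo.
    """
    m = len(manipulated_attrs)
    phi_O = [[[0.0] * 4 for _ in range(m)] for _ in range(m)]
    for attrs_k, phi_oo in edges:
        for p in range(m):
            for q in range(m):
                gate = attrs_k[p] * manipulated_attrs[q]
                if gate:  # only attribute pairs that are both "on" contribute
                    for d in range(4):
                        phi_O[p][q][d] += gate * phi_oo[d]
    return phi_O


laptop = [0, 1, 0, 0, 0, 1]  # fragile and electronic
knife = [0, 0, 1, 0, 0, 0]   # hypothetical manipulated object: sharp only
edges = [
    (laptop, [0.10, 0.20, 0.30, 1.0]),  # waypoint close to and above the laptop
    (laptop, [0.05, 0.20, 0.40, 0.0]),  # another nearby waypoint
]
phi_O = object_object_features(edges, knife)
fragile, sharp = ATTRIBUTES.index("fragile"), ATTRIBUTES.index("sharp")
block = phi_O[fragile][sharp]  # summed edge features for (fragile, sharp)
```

Only the (fragile, sharp) and (electronic, sharp) blocks end up non-zero, so the learned weights $w_{pq}$ for those pairs decide how much distance a sharp object should keep from a laptop.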

Trajectory features

The trajectory is first split into 3 parts. (No idea why 3.)

In the figure it’s split into 3 waypoints, namely 1, 2, 4. For each trajectory segment, the following 3 kinds of features are computed and then concatenated.

- Robot arm configuration $\in \mathcal{R}^{27}$: ($r,\theta,\phi$) of the wrist and elbow w.r.t. the shoulder, plus the elbow when the end effector attains its maximum state (this can indicate whether a joint lock occurs)
- Orientation and temporal behavior of the object to be manipulated $\in \mathcal{R}^{28}$: for each part, we store the cosine of the object’s maximum deviation, along the vertical axis, from its final orientation at the goal location, plus the maximum deviation along the whole trajectory
- Object-environment interactions $\in \mathcal{R}^{20}$: (i) minimum vertical distance from the nearest surface below it; (ii) minimum horizontal distance from the surrounding surfaces; (iii) minimum distance from the table on which the task is being performed; and (iv) minimum distance from the goal location
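The split-then-concatenate pattern can be sketched in a few lines. The per-part features here (minimum and maximum waypoint height) are toy stand-ins chosen for brevity and are not the paper's actual features; the function names are hypothetical.

```python
def split_in_three(waypoints):
    """Split a waypoint list into three contiguous parts (needs >= 3 waypoints)."""
    n = len(waypoints)
    a, b = n // 3, (2 * n) // 3
    return [waypoints[:a], waypoints[a:b], waypoints[b:]]


def part_features(part):
    """Toy per-part features: minimum and maximum height (z) in this part."""
    heights = [z for (_, _, z) in part]
    return [min(heights), max(heights)]


def trajectory_features(waypoints):
    """Compute features for each part separately, then concatenate them."""
    features = []
    for part in split_in_three(waypoints):
        features.extend(part_features(part))
    return features


# Six made-up (x, y, z) waypoints of the manipulated object.
waypoints = [
    (0.0, 0.0, 0.1), (0.1, 0.0, 0.3),
    (0.2, 0.0, 0.5), (0.3, 0.0, 0.6),
    (0.4, 0.0, 0.4), (0.5, 0.0, 0.1),
]
phi_E = trajectory_features(waypoints)  # 3 parts x 2 features = 6 values
```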

In total they formed a trajectory feature of $\phi_E(\cdot)\in\mathcal{R}^{75}$. Since I’m not really interested in hand-made features, I’ll move on quickly.

Learning the scoring function

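It learns the parameters in an absurdly simple way. It’s just a simple linear update, right? After each round, the feature difference between the user’s improved trajectory and the robot’s top-ranked one is added to the weights, i.e. $w \leftarrow w+\phi(x_t,\bar{y}_t)-\phi(x_t,y_t)$ for both $w_O$ and $w_E$. The weights aren’t even randomly initialized; they simply start at zero.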
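A minimal sketch of this kind of perceptron-style loop, under some simplifying assumptions of mine: a single weight vector instead of separate $w_O$ and $w_E$, one fixed set of candidate trajectories, and a simulated user who always returns the best candidate under a hidden `w_star`. None of the names or numbers come from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron_update(w, phi_top, phi_improved):
    """w <- w + phi(x, y_bar) - phi(x, y)."""
    return [wi + fb - ft for wi, ft, fb in zip(w, phi_top, phi_improved)]


def simulate(features, w_star, rounds):
    """Run the feedback loop on a fixed candidate set.

    features[i] is the feature vector of candidate trajectory i.
    w_star plays the role of the user's hidden preference (simulation only).
    """
    w = [0.0] * len(w_star)  # weights start at zero
    top = None
    for _ in range(rounds):
        # Step 1: rank candidates with the current weights, present the top one.
        top = max(range(len(features)), key=lambda i: dot(w, features[i]))
        # Step 2: the simulated user points at the candidate they like best.
        improved = max(range(len(features)), key=lambda i: dot(w_star, features[i]))
        # Step 3: move the weights toward the preferred trajectory's features.
        w = perceptron_update(w, features[top], features[improved])
    return w, top


features = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
w_star = [1.0, -1.0]
w, final_top = simulate(features, w_star, rounds=3)
```

Once the robot's top-ranked candidate matches the user's choice, the feature difference is zero and the weights stop changing.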

So then, how can this minimize regret? The authors assume that the user’s feedback is expected $\alpha$-informative, which characterizes the quality of the feedback as follows.

$$E_{t}\left[s^{*}\left(x_{t}, \bar{y}_{t}\right)\right] \geq s^{*}\left(x_{t}, y_{t}\right)+\alpha\left(s^{*}\left(x_{t}, y_{t}^{*}\right)-s^{*}\left(x_{t}, y_{t}\right)\right)-\xi_{t}$$

They say that if you choose appropriate $\alpha,\xi$ here, it is bounded as $E\left[REG_{T}\right] \leq \mathcal{O}\left(\frac{1}{\alpha \sqrt{T}}+\frac{1}{\alpha T} \sum_{t=1}^{T} \xi_{t}\right)$. (For details, see the reference Online Structured Prediction via Coactive Learning.)
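To get a feel for the roles of $\alpha$ and $\xi_t$, here is a tiny sketch for a single round with deterministic feedback (so the expectation drops out). Given the three true scores, it returns the smallest slack $\xi_t \geq 0$ that makes the inequality hold. The function name and the numbers are hypothetical.

```python
def required_slack(s_top, s_feedback, s_best, alpha):
    """Smallest xi_t >= 0 such that
    s_feedback >= s_top + alpha * (s_best - s_top) - xi_t."""
    return max(0.0, s_top + alpha * (s_best - s_top) - s_feedback)


# Robot's trajectory scores 0.2, the user's improvement 0.5, the optimum 1.0.
strict = required_slack(s_top=0.2, s_feedback=0.5, s_best=1.0, alpha=0.5)
lenient = required_slack(s_top=0.2, s_feedback=0.5, s_best=1.0, alpha=0.25)
```

With the larger $\alpha$ the user is expected to close half of the gap to the optimum, so this feedback needs some slack; with the smaller $\alpha$ the same feedback already satisfies the condition and the slack is zero.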

Experiments

They ran experiments on the following 3 tasks.

Manipulator centric : just moving an object
Environment centric : moving a fragile object
Human centric : moving a sharp object while avoiding a person

As baselines they used BiRRT, Manual, Oracle-SVM, and MMP-online (maximum margin planning).

Evaluation metric

To evaluate against human preference, it has to be expressible as a number in some form. A Likert scale and nDCG (normalized discounted cumulative gain) were used.
The Likert scale is a 5-way choice from 1 to 5 (5 being best). nDCG is often used for ranking and recommendation algorithms; it is scored so that getting the high-ranked items right matters more than getting the low-ranked ones right. The details are well explained in the nDCG explanation blog.
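For reference, a common way to compute nDCG@k looks like the sketch below. It uses the raw relevance as gain and a $\log_2$ position discount; I'm assuming this standard variant, and the paper's exact gain and discount may differ. The relevance values are made-up Likert-style scores.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranked list."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# User scores of three trajectories, listed in the order the robot ranked them.
ranked_scores = [3, 5, 4]
ndcg1 = ndcg_at_k(ranked_scores, 1)  # 3 / 5 = 0.6: the top pick was not the best
ndcg3 = ndcg_at_k(ranked_scores, 3)
```

Putting the best trajectory first matters most, because the discount shrinks the contribution of items further down the list.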

Results

First, TPP (Trajectory Preference Perceptron, the paper’s proposed method) got the highest score on all tasks.

You can see it adapts well to new environments and objects too. Oracle-SVM produces high performance early on, but it has the drawback of being hard to use in real life since it requires knowing the entire trajectory space.

This algorithm leverages the expert’s labels on trajectories (hence the name Oracle) and is trained using SVM-rank in a batch manner. This algorithm is not realizable in practice, as it requires labeling on the large space of trajectories

They even conducted a user study. Viewed from an HCI perspective, you can see that as the task gets harder, both the time spent and the amount of feedback increase. You can also confirm that the more feedback users give, the higher the score they give.

Conclusion

They studied a method for selecting good robot manipulator trajectories using human preference. Rather than a trajectory generation method, this is really a paper that lets the robot judge which of the many trajectories produced by RRT is the best.

The disappointing part is that it used hand-designed features, but I’m sure later papers have improvements on this point.

Jae-Kyung Cho Being unique is better than being perfect
