Paper review - Learning trajectory preferences for manipulators via iterative improvement

10 Jan 2022

두 번째 논문 리뷰입니다. 2013년 NeurIPS에 제출된 논문인 Learning trajectory preferences for manipulators via iterative improvement 를 리뷰해 보도록 하겠습니다.
PbRL 자체가 오래된 연구 분야는 아닌 것 같아서, 2013년 논문이면 정말 고대 논문이라고 볼 수 있겠습니다. 다만, trajectory planning을 사용했다는 내용이 리뷰논문에 있었기 때문에, 첫 번째로 채택되었습니다. 딥러닝을 사용하지도 않았고, 아주 간단한 linear parameterized optimization을 사용하여 학습을 진행하였습니다.

실제로 Manipulator에서 human demonstration을 사용하는 연구는 유명했었죠. (로봇 이름이 뭐였는지는 기억이 잘 안나지만…)

본 논문의 경우 High DoF 환경에서는 Human demo보다는 slightly improved trajectory preference를 주는 것이 학습에 도움이 된다고 주장합니다. 그리고 그 방법과 결과를 살펴보겠습니다.

Table of contents

Introduction
Related works
Learning and Feedback model
Learning algorithm
Features describing object-object interactions
- Trajectory features
- Learning the scoring function
Experiments
- Evaluation metric
- Results
Conclusion

Introduction

본 논문은 human preference가 필요한 이유를 여러 예시를 들어 잘 정의하고 있습니다.

a household robot should move a glass of water in an upright position without jerks while maintaining a safe distance from nearby electronic devices

For example, trajectories of heavy items should not pass over fragile items but rather move around them

같은 물건을 옮기더라도 어떤 물건이느냐, 주변에 어떤 것이 있느냐에 따라서 다르게 행동해야 한다는 것이 이 연구의 요지입니다. 이것은 reward function으로 나타내기 어려운 사람의 선호도 (Human preference) 이고, 이를 로봇이 학습할 수 있도록 하는 것입니다.

The robot learns a general model of the user’s preferences in an online fashion.

이 과정에서 학습이 잘 되는지를 사실 검증할 수 없는데, 저자는 regret이라는 것을 정의하였습니다. 사람과 로봇이 생각하는 trajectory의 rank이 같아질수록 regret이 작아지는 구조입니다 (뒷부분에 설명)

Grocery checkout task라는, 여러 물체를 두고 로봇팔을 이용해 이것들을 옮기는 문제를 실험환경으로 사용했습니다.

본 논문은 Learning from demonstratino (LfD)와 많은 비교를 하고 있습니다. LfD의 가장 큰 문제는 과연 user demonstration이 optimal인지 알 수 없다는 것입니다.

The user never discloses the optimal trajectory

즉, 본 연구의 목적은 어떻게 improve 하는지를 preference에 기반하여 배우는 것 입니다.

Learning a score function representing the preferences in trajectories

이 때 PbRL의 방법 중 하나였던 utility function 과 유사하게 score function을 구하는 것을 목표로 하는 것입니다.

Learning and Feedback model

우선 우리가 배우고자 하는 scoring function을 $s(x,y;w)$라고 하겠습니다. $x$는 context, $y$는 trajectory, 그리고 $w$는 우리가 학습하고자 하는 parameter입니다. Human preference 가 반영된 optimal scoring function은 $s^*(x,y)$로 나타내겠습니다.

Scoring function을 학습하는 과정은 3 step으로 나뉩니다.

Step 1: The robot receives a context x. It then uses a planner to sample a set of trajectories, and ranks them according to its current approximate scoring function s(x, y; w).
로봇이 x라는 context를 기반으로 multiple trajectory를 형성합니다. 그리고 scoring function으로 이들에게 rank를 부여합니다. RRT를 이용해서 trajectory generation을 실행하는데, randomness가 있기 때문에, 다양한 종류의 trajectory가 만들어지겠습니다.
Step 2: The user either lets the robot execute the top-ranked trajectory, or corrects the robot by providing an improved trajectory y¯. This provides feedback indicating that s∗(x, y¯) > s∗(x, y).
- Re-ranking : 가장 상위의 trajectory를 선택해 rank를 수정해 줌으로써 피드백을 부여합니다.
- Zero-G : trajectory waypoint 위치 중 하나를 직접 옮겨줍니다.
Step 3: The robot now updates the parameter w of s(x, y; w) based on this preference feedback and returns to step
scoring function을 update합니다.

참으로 쉬운 구조임이 분명합니다. 사실 알파고 논문이 나오기도 전의 논문이기 때문에, 현재의 딥러닝 관점에서는 허접해 보일수도 있습니다만, 이러한 이론이 배경이 되지 않았다면 딥러닝 또한 발전할 수 없었을 겁니다.

마지막으로 Performance evaluation을 위해 Regret을 정의합니다.

$$REG_T=\frac{1}{T}\sum_{t=1}^T[s^*(x_t,y_t^*)-s^*(x_t,y_t)]$$ $$where\ y^*_t=argmax_ys^*(x_t,y)$$

그런데 뭔가 이상합니다. 사실 $s^*$는 알 수 없는 값이기 때문에 Regret을 구할 수 없습니다. 그래서 저자는 regret bound를 이용하여 수렴성을 증명합니다. (이후에 나옴)

Human feedback을 받기 위해서는 UI/UX가 필요합니다. 저자는 OpenRave라는 프로그램을 사용했다고 합니다. Multiple trajectories 중 하나를 클릭하면 그것의 rank가 가장 높아지도록 설정되었습니다.

Learning algorithm

우선 딥러닝을 사용하지 않는 논문이라는 점을 염두에 두도록 하죠.
Scoring function은 아래와 같이 정의됩니다.

$$s(x,y;w_O,w_E)=w_O\dot\phi_O(x,y)+w_E\dot\phi_E(x,y)$$

여기서 O는 주변에 있는 trajectory가 interacting 하는 주변의 object들을 의미하고, E는 manipulate 해야 하는 object와 그 enviornment를 의미합니다.

Features describing object-object interactions

We enumerate waypoints of trajectory y as $y_1, .., y_N$ and objects in the environment as O = {$o_1$, .., $o_K$}. The robot manipulates the object $\bar{o}$ ∈ O we connect an object ok to a trajectory waypoint if the minimum distance to collision is less than a threshold or if $o_k$ lies below

trajectory 와 object, manipulated object를 정의하고 이들을 연결해 줍니다. 충분히 거리가 가까워지면 edge를 연결해 줍니다. 예시 그림은 아래와 같겠습니다.

우선 전체 scoring function은 다음과 같습니다.

$$s_{O}\left(x, y ; w_{O}\right)=\sum_{\left(y_{j}, o_{k}\right) \in \mathcal{E}} \sum_{p, q=1}^{M} l_{k}^{p} l^{q}\left[w_{p q} \cdot \phi_{o o}\left(y_{j}, o_{k}\right)\right]$$

$l_k^p$의 경우 k번째 object $o_k$의 p번째 특성입니다. 모든 object는 M개의 특성 $[l_k^1,\dots,l_k^M]$을 가지고 있고, 각 특성은 binary로 나타납니다. 예를 들어서, Laptop은 다음 특성을 가집니다.

{heavy, fragile, sharp, hot, liquid, electronic} = [0,1,0,0,0,1]

아주 Náive 하죠?? 어쩔 수 없습니다. DL이 제대로 발전하지 않았거든요!
$l^q$의 경우 manipulated object인 $\bar{o}$의 q번째 특성입니다. 어떤 물체를 옮기느냐에 따라서 주변 물체까지의 거리를 조절해야 하기 때문이죠.
$\phi_{oo}(y_j,o_k)$의 경우 edge의 feature가 되겠습니다. Minimum x,y,z 거리 + ($o_k$가 $\bar{o}$와 수직으로 놓여있는지 여부 binary) 로 구성되는 $\phi_{oo}\in\mathcal{R}^4$에 속하는 녀석입니다.

결국 $\phi_o(x,y)=\sum_{\left(y_{j}, o_{k}\right) \in \mathcal{E}} l_{k}^{u} l^{v}\left[\phi_{o o}\left(y_{j}, o_{k}\right)\right]$로 표현되겠습니다.

Trajectory features

Trajectory는 우선 3개로 분할됩니다. (왜 3개인지는 알 수 없음)

그림에서는 1,2,4 이렇게 3개의 waypoint로 분할 된 것이죠. 각 trajectory segment에 대해서 3가지 feature를 각각 적용한 후 concate해서 사용합니다.

Robot arm configuration $\in \mathcal{R}^{27}$
($r,\theta,\phi$) of wrist and elbow w.r.t shoulder + elbow when the end effector attains maximum state (joint lock이 발생하는 여부를 알려줄 수 있음)
Orientation and temporal behavior of the object to be manipulated $\in \mathcal{R}^{28}$
-> part we store the cosine of the object’s maximum deviation, along the vertical axis, from its final orientation at the goal location + maximum deviation along the whole trajectory
Object-environment interactions $\in \mathcal{R}^{20}$
(i) minimum vertical distance from the nearest surface below it. (ii) minimum horizontal distance from the surrounding surfaces; and (iii) minimum distance from the table, on which the task is being performed, and (iv) minimum distance from the goal location

총 $\phi_E(\cdot)\in\mathcal{R}^{75}$ 의 trajectory feature를 형성하였습니다. 사실 hand-made feature에는 큰 관심 없으므로 빠르게 넘어가겠습니다.

Learning the scoring function

터무니없이 간단한 방식으로 parameter를 학습합니다. 그냥 단순한 Linear update죠? 심지어 random initialize도 아니네요.

자 그렇다면 이게 어떻게 Regret을 최소화 할 수 있는 걸까요? 저자는 Expected $\alpha$-informative feedback을 이용했습니다.

$$E_{t}\left[s^{*}\left(x_{t}, \bar{y}_{t}\right)\right] \geq s^{*}\left(x_{t}, y_{t}\right)+\alpha\left(s^{*}\left(x_{t}, y_{t}^{*}\right)-s^{*}\left(x_{t}, y_{t}\right)\right)-\xi_{t}$$

에서 적절한 $\alpha,\xi$를 골라주면 $E\left[R E G_{T}\right] \leq \mathcal{O}\left(\frac{1}{\alpha \sqrt{T}}+\frac{1}{\alpha T} \sum_{t=1}^{T} \xi_{t}\right)$로 bound된다고 합니다. (자세한 내용은 참고문헌 Online Structured Prediction via Coactive Learning)

Experiments

다음과 같은 3가지 task에 대해서 실험을 진행하였습니다.

Manipulator centric : 그냥 물건 옮기기
Environment centric : fragile 물건 옮기기
Human centric : 날카로운 물체를 사람 피해서 옮기기

Baseline으로는 BiRRT, Manual, Oracle-SVM, MMP-online(maximum margin planning)을 사용하였습니다.

Evaluation metric

Human preference를 주려면 뭐가 됐든 숫자로 나타낼 수 있어야 합니다. Likert scale과 nDCG(normalized discounted cumulative gain)이 사용되었는데요.
Likert scale은 1-5 (5가 best)로 5지선다를 주는 것입니다. nDCG의 경우 순위 추천 알고리즘 등에서 자주 사용되는 것인데, 순위 높은 녀석들을 잘 추천하는 것이 낮은 녀석들을 추천하는 것보다 중요하도록 매겨진 것입니다. 자세한 사항은 다음 블로그에서 잘 설명되어 있습니다. nDCG 설명 블로그 바로가기

Results

우선 모든 task에 대해서 TPP 방식이 가장 높은 점수를 얻었습니다.

새로운 환경과 object들에도 잘 적응하는 것을 볼 수 있습니다. Oracle-SVM이 초반에 높은 성능을 내고는 있지만, 이는 모든 trajectory space를 알아야 하므로 실생활에서 사용하기 어렵다는 단점이 있습니다.

This algorithm leverages the expert’s labels on trajectories (hence the name Oracle) and is trained using SVM-rank in a batch manner. This algorithm is not realizable in practice, as it requires labeling on the large space of trajectories

User study까지 진행을 하였습니다. HCI 관점에서 바라본 것인데, task가 어려워질 수록 시간과 피드백의 수가 늘어나는 것을 확인할 수 있습니다. 또한 피드백을 줄 수록 user가 더 높은 점수를 주는 것도 확인할 수 있었습니다.

Conclusion

Human preference를 이용하여 Robot manipulator의 trajectory를 잘 선택하는 방식을 연구하였습니다. 사실 trajectory generation이라기보다는 RRT로 형성된 많은 trajectory 중 어떤 것이 제일 나은 것인지를 robot이 판단할 수 있게 하는 논문입니다.

아쉬운 점은 Hand-desinged feature를 사용했다는 점인데, 추후 논문들에서 개선된 점이 분명 있을 거라고 생각합니다.

Jae-Kyung Cho LLM Developers who was a Robotics engineer

Paper review - Learning trajectory preferences for manipulators via iterative improvement

Introduction

Related works

Learning and Feedback model

Learning algorithm

Features describing object-object interactions

Trajectory features

Learning the scoring function

Experiments

Evaluation metric

Results

Conclusion

Related posts

Diary - PyTorch 에서 all_gather_object 사용시 31 Mar 2025

Diary - LLM knowledge distillation 06 Mar 2025

Diary - 2024년 회고 01 Jan 2025