Jae-Kyung Cho: LLM developer who was a robotics engineer

Diary - Catastrophic performance drop of off-policy RL methods

On-policy RL methods (TRPO, PPO) are built around (approximately) monotonic policy improvement, so their performance tends to increase steadily. Off-policy RL methods (DQN, TD3, SAC), in contrast, converge much faster because of their sample efficiency. However, I realized that these off-policy methods are vulnerable to overfitting, especially when exploration is lacking. This blog describes it as a “catastrophic drop”. Supplement: sudden exploding happened in RL


I always suffered from this result when I trained SAC and DQN. To address it, we should add a learning rate scheduler, regularizers, or much more exploration noise (see: Regularization Matters in Policy Optimization). If you would rather not deal with this extra tuning, just use PPO, even though it needs more time to train.
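To make those three knobs concrete, here is a minimal sketch in plain Python. The helper names (`linear_schedule`, `l2_penalty`) and all the numbers are my own placeholders for illustration, not values taken from any paper or library; in a real SAC or DQN setup you would plug the results into your optimizer and action noise.

```python
def linear_schedule(start, end, step, total_steps):
    """Linearly interpolate from start to end over total_steps, then hold end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def l2_penalty(weights, coeff):
    """L2 regularization term: coeff * sum of squared weights."""
    return coeff * sum(w * w for w in weights)


if __name__ == "__main__":
    total_steps = 100_000  # hypothetical training length
    for step in (0, 50_000, 100_000):
        # Decay the learning rate so late updates are smaller and less destructive.
        lr = linear_schedule(3e-4, 3e-5, step, total_steps)
        # Keep some exploration noise until the end instead of decaying it to zero.
        noise_std = linear_schedule(0.3, 0.1, step, total_steps)
        print(step, lr, noise_std)

    # The penalty is added to the critic or actor loss before backpropagation.
    print(l2_penalty([0.5, -1.0, 2.0], coeff=1e-3))
```

The same schedule function is reused for the learning rate and the noise scale on purpose: both are just scalars that change with the training step, so one helper keeps the tuning in one place.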

In particular, exploding gradients become an issue when recurrent networks are used (see: Exploding Gradients Problem with RNN).

To solve this issue, the usual remedies are L2 normalization (the most common answer) and gradient clipping. The next question is “How can we determine the threshold value for gradient clipping?”.
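As a rough illustration, clipping by the global L2 norm can be sketched in plain Python as below. Gradients are represented as lists of lists of floats (one inner list per layer), which is a simplification I chose for the example. The `suggest_threshold` helper is only one possible heuristic and my own assumption, not an established rule: log the gradient norms during a short, stable run and take a high percentile as the clipping threshold. In PyTorch, the clipping step itself corresponds to `torch.nn.utils.clip_grad_norm_`.

```python
import math


def global_l2_norm(grads):
    """L2 norm over all gradient entries of all layers."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = global_l2_norm(grads)
    if norm <= max_norm or norm == 0.0:
        return [list(layer) for layer in grads], norm
    scale = max_norm / norm
    return [[g * scale for g in layer] for layer in grads], norm


def suggest_threshold(observed_norms, percentile=90.0):
    """Heuristic (assumption): use a high percentile of logged gradient norms."""
    if not observed_norms:
        raise ValueError("need at least one observed norm")
    ordered = sorted(observed_norms)
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100.0))
    return ordered[idx]


if __name__ == "__main__":
    grads = [[3.0], [4.0]]  # toy gradients for two layers
    clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
    print(norm_before, clipped, global_l2_norm(clipped))

    # Norms logged during a hypothetical warm-up run, with one large spike.
    logged = [float(i) for i in range(1, 20)] + [500.0]
    print(suggest_threshold(logged))
```

Logging the pre-clipping norm, as `clip_by_global_norm` returns it, is useful on its own: if almost every update gets clipped, the threshold is probably too low, and if none does, clipping is not protecting against anything.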