Diary - Catastrophic performance drop of off-policy RL methods

09 Apr 2022

On-policy RL methods (TRPO, PPO) come with (approximate) guarantees of monotonic performance improvement. Off-policy RL methods (DQN, TD3, SAC), by contrast, converge much faster because of their sample efficiency. However, I realized that these off-policy methods are vulnerable to overfitting, especially when exploration is lacking. This blog described it as a “catastrophic drop”. Supplement: sudden exploding happened in RL

I always ran into this problem when I trained SAC and DQN. To solve it, we should add a learning rate scheduler, regularizers, or much more exploration noise (see “Regularization Matters in Policy Optimization”). If you would rather not do this extra work, just use PPO, even though it needs more time to train.
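As a rough illustration of the scheduler and exploration-noise ideas above, here is a minimal sketch in plain Python. The function names (`linear_schedule`, `noisy_action`) and all numeric values are hypothetical and chosen only for illustration. The Gaussian action noise follows the TD3 style; SAC and DQN explore differently (entropy bonus, epsilon-greedy), so treat this as an assumption rather than a recipe.

```python
import random


def linear_schedule(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from `start` to `end` over `total_steps`, then hold at `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def noisy_action(action, sigma, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action dimension, then clip to [low, high]."""
    return [min(max(a + rng.gauss(0.0, sigma), low), high) for a in action]


# Hypothetical training setup: decay the learning rate, but keep a noise floor
# so that exploration never disappears completely.
total_steps = 100_000
for step in (0, 50_000, 100_000):
    lr = linear_schedule(3e-4, 3e-5, step, total_steps)
    sigma = linear_schedule(0.3, 0.1, step, total_steps)
    action = noisy_action([0.2, -0.7], sigma)
    print(step, lr, sigma, action)
```

The learning rate shrinks so that late updates are less likely to knock the policy off a good solution, while the noise scale is only reduced to a floor instead of zero.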

In particular, there is an exploding gradient issue when recurrent networks are used (see “Exploding Gradients Problem with RNN”).

To solve this issue, we should use L2 regularization (the most common answer) and gradient clipping. The next question is “How can we determine the threshold value for gradient clipping?”.
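To make gradient clipping concrete, here is a minimal sketch of clipping by global L2 norm in plain Python. The helper name `clip_by_global_norm` and the example gradients are hypothetical. In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most `max_norm`.

    Returns (clipped_grads, total_norm), where total_norm is the norm BEFORE clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale is 1.0 when the norm is already small enough, otherwise < 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Hypothetical gradients for two parameter tensors (flattened).
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Returning the pre-clipping norm is a deliberate choice: logging it during training shows what typical gradient norms look like, which is one possible starting point for choosing the threshold. That is only a heuristic, not a definitive answer to the question above.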

Jae-Kyung Cho, an LLM developer who was previously a robotics engineer
