Improving proximal policy optimization with alpha divergence

Publisher: Elsevier
Publication Type: Journal Article
Citation: Neurocomputing, 2023, 534, pp. 94-105
Issue Date: 2023-05-14
File: 1-s2.0-S0925231223001467-main.pdf (published version, Adobe PDF, 2.79 MB)
Abstract: Proximal policy optimization (PPO) is a recent advance in reinforcement learning, formulated as an unconstrained optimization problem with two terms: the accumulated discounted return and a Kullback–Leibler (KL) divergence. There are currently three PPO versions: primary, adaptive, and clipping. The most widely used is the clipping version, in which the KL divergence is replaced by a clipping function that measures the difference between two policies indirectly. In this paper, we revisit the primary PPO and improve it in two respects. The first is to reformulate it in a linearly combined form that controls the trade-off between the two terms. The second is to substitute a parametric alpha divergence for the KL divergence, measuring the difference between two policies more effectively. This novel PPO variant is referred to as alphaPPO in this paper. Experiments on six benchmark environments verify the effectiveness of alphaPPO compared with the clipping and combined PPOs.
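As a rough sketch of the kind of objective the abstract describes (the paper's exact formulation may differ), a linearly combined surrogate with an alpha-divergence penalty can be written as

L(\theta) = \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t \right] - \beta\, D_\alpha\!\left( \pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta \right),
\qquad
D_\alpha(p \,\|\, q) = \frac{1}{\alpha(\alpha - 1)} \left( \mathbb{E}_{x \sim p}\!\left[ \left( \frac{q(x)}{p(x)} \right)^{1-\alpha} \right] - 1 \right), \quad \alpha \notin \{0, 1\},

where \hat{A}_t is an advantage estimate and \beta is an assumed trade-off coefficient (both introduced here for illustration, not taken from the paper); as \alpha \to 1 the penalty recovers the KL divergence of the primary PPO.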