Improving proximal policy optimization with alpha divergence
- Publisher: Elsevier
- Publication Type: Journal Article
- Citation: Neurocomputing, 2023, 534, pp. 94-105
- Issue Date: 2023-05-14
Closed Access
Filename | Description | Size
---|---|---
1-s2.0-S0925231223001467-main.pdf | Published version | 2.79 MB
This item is closed access and not available.
Proximal policy optimization (PPO) is a recent advance in reinforcement learning that is formulated as an unconstrained optimization problem with two terms: the accumulated discounted return and a Kullback–Leibler (KL) divergence. Currently, there are three PPO versions: primary, adaptive, and clipping. The most widely used is the clipping version, in which the KL divergence is replaced by a clipping function that measures the difference between two policies indirectly. In this paper, we revisit the primary PPO and improve it in two aspects. One is to reformulate it in a linearly combined form to control the trade-off between the two terms. The other is to substitute a parametric alpha divergence for the KL divergence to measure the difference between two policies more effectively. This novel PPO variant is referred to as alphaPPO in this paper. Experiments on six benchmark environments verify the effectiveness of our alphaPPO compared with the clipping and combined PPOs.
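The exact objective is only given in the closed-access paper, but the abstract's idea of a linearly combined surrogate with an alpha-divergence penalty can be illustrated with a short sketch. The sketch below assumes discrete action distributions and one common (Amari-style) parametrization of the alpha divergence; the function names (`alpha_divergence`, `combined_ppo_loss`) and the penalty weight `beta` are illustrative, not taken from the paper.

```python
import torch

def alpha_divergence(p, q, alpha=0.5, eps=1e-8):
    """Alpha divergence between discrete distributions p and q (batched).

    One common parametrization (an assumption, not necessarily the paper's):
        D_alpha(p || q) = (1 - sum_x p(x)^alpha q(x)^(1-alpha)) / (alpha * (1 - alpha)),
    which recovers KL(p || q) as alpha -> 1 and KL(q || p) as alpha -> 0.
    """
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    overlap = torch.sum(p.pow(alpha) * q.pow(1.0 - alpha), dim=-1)
    return (1.0 - overlap) / (alpha * (1.0 - alpha))

def combined_ppo_loss(new_probs, old_probs, actions, advantages,
                      beta=1.0, alpha=0.5):
    """Sketch of a linearly combined PPO surrogate: importance-weighted
    advantage term minus a beta-weighted alpha-divergence penalty between
    the old and new policies. The paper's exact formulation may differ.
    """
    # Importance ratio pi_new(a|s) / pi_old(a|s) for the taken actions.
    new_a = new_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    old_a = old_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = new_a / old_a.clamp_min(1e-8)

    surrogate = ratio * advantages
    penalty = alpha_divergence(old_probs, new_probs, alpha=alpha)
    # Negate because optimizers minimize; the objective itself is maximized.
    return -(surrogate - beta * penalty).mean()
```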