Improving proximal policy optimization with alpha divergence
- Publisher: Elsevier
- Publication Type: Journal Article
- Citation: Neurocomputing, 2023, 534, pp. 94-105
- Issue Date: 2023-05-14
Closed Access
Filename | Description | Size
---|---|---
1-s2.0-S0925231223001467-main.pdf | Published version | 2.79 MB
This item is closed access and not available.
Proximal policy optimization (PPO) is a recent advance in reinforcement learning that is formulated as an unconstrained optimization problem with two terms: the accumulated discounted return and a Kullback–Leibler (KL) divergence. Currently, there are three PPO versions: primary, adaptive, and clipping. The most widely used is the clipping version, in which the KL divergence is replaced by a clipping function that measures the difference between two policies indirectly. In this paper, we revisit the primary PPO and improve it in two aspects. One is to reformulate it in a linearly combined form to control the trade-off between the two terms. The other is to substitute a parametric alpha divergence for the KL divergence to measure the difference between two policies more effectively. This novel PPO variant is referred to as alphaPPO in this paper. Experiments on six benchmark environments verify the effectiveness of our alphaPPO compared with the clipping and combined PPOs.
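The exact objective is only given in the closed-access paper, but the abstract's idea of a linearly combined surrogate with an alpha-divergence penalty can be illustrated with a short sketch. The sketch below assumes discrete action distributions and one common (Amari-style) parametrization of the alpha divergence; the function names (`alpha_divergence`, `combined_ppo_loss`) and the penalty weight `beta` are illustrative, not taken from the paper.

```python
import torch

def alpha_divergence(p, q, alpha=0.5, eps=1e-8):
    """Alpha divergence between discrete distributions p and q (batched).

    One common parametrization (an assumption, not necessarily the paper's):
        D_alpha(p || q) = (1 - sum_x p(x)^alpha q(x)^(1-alpha)) / (alpha * (1 - alpha)),
    which recovers KL(p || q) as alpha -> 1 and KL(q || p) as alpha -> 0.
    """
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    overlap = torch.sum(p.pow(alpha) * q.pow(1.0 - alpha), dim=-1)
    return (1.0 - overlap) / (alpha * (1.0 - alpha))

def combined_ppo_loss(new_probs, old_probs, actions, advantages,
                      beta=1.0, alpha=0.5):
    """Sketch of a linearly combined PPO surrogate: importance-weighted
    advantage term minus a beta-weighted alpha-divergence penalty between
    the old and new policies. The paper's exact formulation may differ.
    """
    # Importance ratio pi_new(a|s) / pi_old(a|s) for the taken actions.
    new_a = new_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    old_a = old_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = new_a / old_a.clamp_min(1e-8)

    surrogate = ratio * advantages
    penalty = alpha_divergence(old_probs, new_probs, alpha=alpha)
    # Negate because optimizers minimize; the objective itself is maximized.
    return -(surrogate - beta * penalty).mean()
```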