Advanced policy optimization algorithms with flexible trust region constraint

Publication Type: Thesis
Issue Date: 2024
In reinforcement learning, a trust region constraint keeps the difference between two adjacent policies below a pre-defined threshold in order to stabilize policy optimization. The landmark algorithm in this field is trust region policy optimization (TRPO). However, how and where to apply this trust region control and its related techniques remains a challenging issue. This thesis addresses the problem from four perspectives.

First, to relax the trust region and promote exploration, we propose trust region policy optimization via entropy regularization of the KL divergence constraint. Its novelty lies in adding a Shannon entropy regularization term to the constraint, which directly adjusts the allowed difference between two consecutive policies so that some potentially problematic policies can still be selected, rather than modifying the objective function as in the existing literature.

Second, to bound the trust region from below, we present twin trust region policy optimization. We first propose a reciprocal trust region policy optimization by exchanging the objective and the KL divergence constraint according to the reciprocal optimization technique, which induces a lower bound on the step size search. TRPO and its reciprocal version are then integrated into twin TRPO, which has both a lower bound and an upper bound for the step size search, facilitating the optimization procedure of TRPO.

Third, to better characterize the trust region across different environments, we construct an enhanced proximal policy optimization (PPO) formulated as an unconstrained minimization. The original PPO with a single tunable factor is revisited and converted into a linear combination that controls the trade-off between the return and the KL divergence; the KL divergence is then replaced by a parameterized alpha divergence to adjust the trust region constraint. The resulting problem can be solved with gradient-based optimization techniques.

Fourth, to adjust the trust region and enhance sample efficiency, we implement diversity-driven model ensemble trust region policy optimization. We first design a deep residual attention U-net with significantly fewer weights as the base model and add the normalized Hilbert-Schmidt independence criterion (HSIC) as a regularization term to explicitly pursue diversity within the model ensemble. We then propose an adaptive TRPO, in which the parametric Rényi alpha divergence substitutes for the KL divergence in measuring the trust region, with the alpha value adaptively adjusted over training iterations.

Finally, the effectiveness and efficiency of the proposed reinforcement learning algorithms are validated experimentally on several widely used benchmark environments.
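
As a point of reference for the first contribution, the LaTeX sketch below (assuming amsmath) restates the standard TRPO problem and one possible reading of the entropy-regularized constraint described above; the coefficient \lambda and the placement of the entropy term are illustrative notation rather than the thesis's own formulation.

% Standard TRPO update: maximize the surrogate advantage subject to a
% KL trust region of radius \delta.
\begin{align}
  \max_{\theta}\ & \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
    \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right] \\
  \text{s.t.}\ & \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) \right] \le \delta .
\end{align}
% Entropy-regularized constraint (a sketch, not the thesis's exact form):
% the Shannon entropy H(\pi_\theta) enters the constraint rather than the
% objective, relaxing the trust region for more exploratory policies.
% The weight \lambda \ge 0 is hypothetical notation.
\begin{equation}
  \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) - \lambda\, H\!\left( \pi_{\theta}(\cdot \mid s) \right) \right]
  \le \delta .
\end{equation}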
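
For the second contribution, the following sketch shows one way the exchange of objective and constraint can be written; the improvement threshold \epsilon is hypothetical notation, and the exact reciprocal formulation used in the thesis may differ.

% Reciprocal formulation (sketch): minimize the KL divergence subject to
% a required surrogate improvement \epsilon.
\begin{align}
  \min_{\theta}\ & \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) \right] \\
  \text{s.t.}\ & \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right] \ge \epsilon .
\end{align}
% This exchanged problem induces a lower bound on the step size search;
% twin TRPO pairs it with the upper bound of the standard TRPO line search.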
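
The third and fourth contributions both replace the KL divergence with a parametric alpha-type divergence. As background only, the block below writes the well-known KL-penalized PPO objective and the Rényi alpha divergence; how the thesis combines them and schedules alpha is not specified in this abstract, so the pairing shown is an assumption.

% KL-penalized PPO: an unconstrained objective in which the coefficient
% \beta trades off the surrogate return against policy change.
\begin{equation}
  \min_{\theta}\ -\,\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right]
  + \beta\, \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) \right].
\end{equation}
% Rényi alpha divergence for discrete distributions p and q; it recovers
% the KL divergence in the limit \alpha \to 1, which makes it a natural
% tunable replacement for the KL term above.
\begin{equation}
  D_{\alpha}(p \,\Vert\, q) = \frac{1}{\alpha - 1}
    \log \sum_{x} p(x)^{\alpha}\, q(x)^{1 - \alpha},
  \qquad \alpha > 0,\ \alpha \neq 1 .
\end{equation}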