Advanced policy optimization algorithms with flexible trust region constraint

Publication Type: Thesis
Issue Date: 2024
In reinforcement learning, a trust region constraint keeps the difference between two adjacent policies below a pre-defined threshold in order to stabilize policy optimization. The landmark algorithm in this field is trust region policy optimization (TRPO). However, how and where to apply this trust region control and its related techniques remains a challenging issue. This thesis addresses the problem from four perspectives.

First, to relax the trust region and promote exploration, we propose trust region policy optimization via entropy regularization of the KL divergence constraint. Its novelty lies in adding a Shannon entropy regularization term to the constraint, which directly adjusts the allowed difference between two consecutive policies so that some potentially problematic policies can still be selected, rather than modifying the objective function as in the existing literature.

Second, to bound the trust region from below, we present twin trust region policy optimization. We first propose a reciprocal trust region policy optimization by exchanging the objective and the KL divergence constraint according to the reciprocal optimization technique, which induces a lower bound on the step size search. TRPO and its reciprocal version are then integrated into twin TRPO, which has both a lower bound and an upper bound for the step size search, facilitating the optimization procedure of TRPO.

Third, to better characterize the trust region across different environments, we construct an enhanced proximal policy optimization (PPO) formulated as an unconstrained minimization. The original PPO with a single tunable factor is revisited and converted into a linear combination that controls the trade-off between the return and the KL divergence; the KL divergence is then replaced by a parameterized alpha divergence to adjust the trust region constraint. The resulting problem can be solved with gradient-based optimization techniques.

Fourth, to adjust the trust region and enhance sample efficiency, we implement diversity-driven model ensemble trust region policy optimization. We first design a deep residual attention U-net with significantly fewer weights as the base model and add the normalized Hilbert-Schmidt independence criterion (HSIC) as a regularization term to explicitly pursue diversity within the model ensemble. We then propose an adaptive TRPO, in which the parametric Rényi alpha divergence substitutes for the KL divergence in measuring the trust region, with the alpha value adaptively adjusted over training iterations.

Finally, the effectiveness and efficiency of the proposed reinforcement learning algorithms are validated experimentally on several widely used benchmark environments.
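
As a point of reference for the first contribution, the LaTeX sketch below (assuming amsmath) restates the standard TRPO problem and one possible reading of the entropy-regularized constraint described above; the coefficient \lambda and the placement of the entropy term are illustrative notation rather than the thesis's own formulation.

% Standard TRPO update: maximize the surrogate advantage subject to a
% KL trust region of radius \delta.
\begin{align}
  \max_{\theta}\ & \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
    \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right] \\
  \text{s.t.}\ & \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) \right] \le \delta .
\end{align}
% Entropy-regularized constraint (a sketch, not the thesis's exact form):
% the Shannon entropy H(\pi_\theta) enters the constraint rather than the
% objective, relaxing the trust region for more exploratory policies.
% The weight \lambda \ge 0 is hypothetical notation.
\begin{equation}
  \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) - \lambda\, H\!\left( \pi_{\theta}(\cdot \mid s) \right) \right]
  \le \delta .
\end{equation}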
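
For the second contribution, the following sketch shows one way the exchange of objective and constraint can be written; the improvement threshold \epsilon is hypothetical notation, and the exact reciprocal formulation used in the thesis may differ.

% Reciprocal formulation (sketch): minimize the KL divergence subject to
% a required surrogate improvement \epsilon.
\begin{align}
  \min_{\theta}\ & \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) \right] \\
  \text{s.t.}\ & \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right] \ge \epsilon .
\end{align}
% This exchanged problem induces a lower bound on the step size search;
% twin TRPO pairs it with the upper bound of the standard TRPO line search.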
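
The third and fourth contributions both replace the KL divergence with a parametric alpha-type divergence. As background only, the block below writes the well-known KL-penalized PPO objective and the Rényi alpha divergence; how the thesis combines them and schedules alpha is not specified in this abstract, so the pairing shown is an assumption.

% KL-penalized PPO: an unconstrained objective in which the coefficient
% \beta trades off the surrogate return against policy change.
\begin{equation}
  \min_{\theta}\ -\,\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right]
  + \beta\, \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)
    \right) \right].
\end{equation}
% Rényi alpha divergence for discrete distributions p and q; it recovers
% the KL divergence in the limit \alpha \to 1, which makes it a natural
% tunable replacement for the KL term above.
\begin{equation}
  D_{\alpha}(p \,\Vert\, q) = \frac{1}{\alpha - 1}
    \log \sum_{x} p(x)^{\alpha}\, q(x)^{1 - \alpha},
  \qquad \alpha > 0,\ \alpha \neq 1 .
\end{equation}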