Composing Policies in Deep Reinforcement Learning

Publication Type: Thesis
Issue Date: 2025
Abstract:
Reinforcement learning (RL) has traditionally been developed around a single, atomic agent acting in a given environment. As RL researchers deploy multiple agents in increasingly complicated domains such as vision, language, and robotics, cooperation among those agents becomes necessary to achieve optimal performance. In lifelong RL, for example, composing multiple agents into an optimal design for the domain remains an open challenge, which further highlights the importance of agent cooperation. This thesis investigates the cooperation of several agents in hierarchical RL (HRL) and transfer RL (TRL) in order to find such an optimal design.

In HRL, two studies address the optimal design of multiple agents. The first studies optimal level synchronization in HRL based on a flow-based deep generative model (FDGM). It focuses on finding an off-policy correction that reflects the exact goal for the current lower-level policy without indirect estimation, and adopts the FDGM to support such a direct off-policy correction. The FDGM, however, suffers from a chronic issue of biased log-density estimates; to overcome it, the method incorporates an inverse model so that the lower-level policy is reflected through the FDGM while the higher-level policy is being trained. The second study investigates autonomous non-monolithic exploration within the options framework, specifically HRL, whose structure naturally provides a mode-switching controller that supports autonomous operation.

In TRL, this thesis first presents unsupervised pre-training RL with Successor Features (SFs) using a non-monolithic exploration scheme. An existing method, APS, uses a combined intrinsic reward for pre-training, which degrades fine-tuning performance. The proposed method splits the combined intrinsic reward according to the non-monolithic exploration scheme, preserving the original intent of SFs: decoupling the dynamics of the environment from the rewards. The second TRL study develops offline-to-online RL with a non-monolithic exploration scheme that supports an online policy without destroying the offline policy. The method modulates how much the offline policy and the online policy are utilized during online training, balancing the strength of the offline policy (exploitation) against that of the online policy (exploration) without modifying the offline policy.

In conclusion, this thesis contributes to the composition of agent designs in deep reinforcement learning toward optimal performance.
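The non-monolithic exploration idea that recurs across these contributions can be illustrated with a small sketch. The Python snippet below is a minimal, hypothetical illustration, not the thesis's actual implementation: the names ModeSwitcher, explore_prob, act, exploit_policy, and explore_policy are assumptions introduced here. It shows a mode-switching controller handing each step to either a dedicated exploitation policy or a dedicated exploration policy, instead of blending the two into one monolithic policy.

import random

class ModeSwitcher:
    # Toy mode-switching controller; the thesis relies on a structured,
    # hierarchy-aware trigger rather than a fixed probability.
    def __init__(self, explore_prob=0.2):
        self.explore_prob = explore_prob  # chance of handing control to exploration

    def select_mode(self):
        return "explore" if random.random() < self.explore_prob else "exploit"

def act(state, exploit_policy, explore_policy, switcher):
    # Exactly one policy is queried per step, so the two behaviors (and, in the
    # TRL setting, their separate intrinsic rewards) remain decoupled.
    mode = switcher.select_mode()
    policy = explore_policy if mode == "explore" else exploit_policy
    return policy(state), mode

Keeping the two behaviors as separate policies behind a switch is what allows, for instance, an offline policy to remain untouched while an online policy explores during offline-to-online training.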