TBQ(σ): Improving efficiency of trace utilization for off-policy reinforcement learning

Publication Type:
Conference Proceeding
Citation:
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2019, vol. 2, pp. 1025–1032
Issue Date:
2019-01-01
© 2019 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved. Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies in a probabilistic way, such as importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ(σ), which effectively unifies the tree-backup algorithm and Naive Q(λ). By introducing a new parameter σ to illustrate the degree of utilizing traces, TBQ(σ) …
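
To make the idea concrete, the following is a minimal tabular sketch of a TBQ(σ)-style trace update, based only on the abstract's description: it assumes the trace decay interpolates linearly (via σ) between a tree-backup-style decay, which scales traces by the target-policy probability of the taken action and therefore cuts them under a greedy target policy, and the Naive Q(λ) behavior of never cutting traces. The function name, arguments, and exact decay form are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def tbq_sigma_update(Q, E, s, a, r, s_next, done,
                     alpha=0.1, gamma=0.99, lam=0.9, sigma=0.5):
    """One off-policy TD update with eligibility traces (hypothetical TBQ(sigma)-style sketch).

    Q : (n_states, n_actions) action-value table
    E : (n_states, n_actions) eligibility-trace table (modified in place)
    """
    # TD error with respect to a greedy target policy.
    greedy_next = np.max(Q[s_next]) if not done else 0.0
    delta = r + gamma * greedy_next - Q[s, a]

    # Accumulating trace for the visited state-action pair.
    E[s, a] += 1.0

    # Update every traced state-action pair toward the TD error.
    Q += alpha * delta * E

    # Greedy target policy: pi(a|s) is 1 for the greedy action, 0 otherwise.
    pi_a = 1.0 if a == np.argmax(Q[s]) else 0.0

    # Assumed interpolation: sigma = 0 gives tree-backup-style decay (traces are
    # cut when a non-greedy action is taken); sigma = 1 never cuts traces, as in
    # Naive Q(lambda); intermediate sigma trades off between the two.
    E *= gamma * lam * ((1.0 - sigma) * pi_a + sigma)
    if done:
        E[:] = 0.0
    return Q, E
```

Under this reading, σ controls how much of the trace survives a non-greedy (exploratory) action: larger σ keeps more credit flowing back to earlier state-action pairs, while σ = 0 falls back to the conservative trace cutting of tree-backup.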