TBQ(σ): Improving efficiency of trace utilization for off-policy reinforcement learning

Publication Type:
Conference Proceeding
Citation:
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2019, vol. 2, pp. 1025–1032
Issue Date:
2019-01-01
© 2019 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved. Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies in a probabilistic way, such as importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ(σ), which effectively unifies the tree-backup algorithm and Naive Q(λ). By introducing a new parameter σ to illustrate the degree of utilizing traces, TBQ(σ) …
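
To make the idea concrete, the following is a minimal tabular sketch of a TBQ(σ)-style trace update, based only on the abstract's description: it assumes the trace decay interpolates linearly (via σ) between a tree-backup-style decay, which scales traces by the target-policy probability of the taken action and therefore cuts them under a greedy target policy, and the Naive Q(λ) behavior of never cutting traces. The function name, arguments, and exact decay form are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def tbq_sigma_update(Q, E, s, a, r, s_next, done,
                     alpha=0.1, gamma=0.99, lam=0.9, sigma=0.5):
    """One off-policy TD update with eligibility traces (hypothetical TBQ(sigma)-style sketch).

    Q : (n_states, n_actions) action-value table
    E : (n_states, n_actions) eligibility-trace table (modified in place)
    """
    # TD error with respect to a greedy target policy.
    greedy_next = np.max(Q[s_next]) if not done else 0.0
    delta = r + gamma * greedy_next - Q[s, a]

    # Accumulating trace for the visited state-action pair.
    E[s, a] += 1.0

    # Update every traced state-action pair toward the TD error.
    Q += alpha * delta * E

    # Greedy target policy: pi(a|s) is 1 for the greedy action, 0 otherwise.
    pi_a = 1.0 if a == np.argmax(Q[s]) else 0.0

    # Assumed interpolation: sigma = 0 gives tree-backup-style decay (traces are
    # cut when a non-greedy action is taken); sigma = 1 never cuts traces, as in
    # Naive Q(lambda); intermediate sigma trades off between the two.
    E *= gamma * lam * ((1.0 - sigma) * pi_a + sigma)
    if done:
        E[:] = 0.0
    return Q, E
```

Under this reading, σ controls how much of the trace survives a non-greedy (exploratory) action: larger σ keeps more credit flowing back to earlier state-action pairs, while σ = 0 falls back to the conservative trace cutting of tree-backup.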