Prioritized Experience Replay based on Multi-armed Bandit

Liu, X; Zhu, T; Jiang, C; Ye, D; Zhao, F

Publisher: Elsevier
Publication Type: Journal Article
Citation: Expert Systems with Applications, 2022, 189, pp. 116023
Issue Date: 2022-03-01

Closed Access

Filename	Description	Size	Format
Prioritized Experience Replay based on Multi-armed Bandit.pdf	Published version	2.19 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, X
dc.contributor.author	Zhu, T https://orcid.org/0000-0003-3411-7947
dc.contributor.author	Jiang, C
dc.contributor.author	Ye, D https://orcid.org/0000-0002-7561-0992
dc.contributor.author	Zhao, F
dc.date.accessioned	2023-04-06T01:57:50Z
dc.date.available	2023-04-06T01:57:50Z
dc.date.issued	2022-03-01
dc.identifier.citation	Expert Systems with Applications, 2022, 189, pp. 116023
dc.identifier.issn	0957-4174
dc.identifier.uri	http://hdl.handle.net/10453/169265
dc.language	en
dc.publisher	Elsevier
dc.relation.ispartof	Expert Systems with Applications
dc.relation.isbasedon	10.1016/j.eswa.2021.116023
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	01 Mathematical Sciences, 08 Information and Computing Sciences, 09 Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Prioritized Experience Replay based on Multi-armed Bandit
dc.type	Journal Article
utslib.citation.volume	189
utslib.for	01 Mathematical Sciences
utslib.for	08 Information and Computing Sciences
utslib.for	09 Engineering
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2023-04-06T01:57:48Z
pubs.publication-status	Published
pubs.volume	189

Abstract:

Experience replay is widely used in deep reinforcement learning: it allows online reinforcement learning agents to remember and reuse past experiences. To improve the sampling efficiency of experience replay, the most useful experiences should be sampled more frequently. Existing methods usually design their sampling strategy around a few criteria, but they tend to combine those criteria in a linear or fixed manner, so the strategy is static and independent of the learning agent. This ignores the dynamic nature of the environment and thus can only lead to suboptimal performance. In this work, we propose a dynamic experience replay strategy driven by the interaction between the agent and the environment, called Prioritized Experience Replay based on Multi-armed Bandit (PERMAB). PERMAB adaptively combines multiple priority criteria to measure the importance of each experience. In particular, the weight of each criterion is adjusted from episode to episode according to its contribution to agent performance, so that the criteria most useful in the current state receive more weight. The proposed replay strategy takes both sample informativeness and diversity into account, which can significantly boost the learning ability and speed of the game agent. Experimental results show that PERMAB accelerates network learning and achieves better performance than baseline algorithms on seven benchmark environments of varying difficulty.
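
The full text is closed access, so the exact PERMAB update rule is not available from this record. As a rough illustration of the idea in the abstract (a bandit re-weights several priority criteria from episode to episode), the sketch below swaps in a standard EXP3-style bandit. Everything in it is an assumption made for illustration: the class `CriterionBandit`, the helper `combine`, the exploration rate `gamma`, and the synthetic rewards in `demo` are hypothetical and are not taken from the paper.

```python
import math
import random


class CriterionBandit:
    """EXP3-style bandit where each arm is one priority criterion (assumed design)."""

    def __init__(self, n_criteria, gamma=0.1, seed=0):
        self.n = n_criteria
        self.gamma = gamma                # exploration rate, an assumed value
        self.log_w = [0.0] * n_criteria   # log-weights, kept in log space for stability
        self.rng = random.Random(seed)

    def weights(self):
        """Current mixing weights over criteria; they always sum to 1."""
        m = max(self.log_w)
        w = [math.exp(lw - m) for lw in self.log_w]
        total = sum(w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.n for wi in w]

    def select(self):
        """Draw the criterion to credit for the coming episode."""
        return self.rng.choices(range(self.n), weights=self.weights())[0]

    def update(self, arm, reward):
        """Importance-weighted update; reward is assumed to be scaled to [0, 1]."""
        p = self.weights()[arm]
        self.log_w[arm] += self.gamma * (reward / p) / self.n


def combine(scores, weights):
    """Mix per-criterion scores into one sampling distribution over transitions.

    scores[k][i] is the (positive) score of transition i under criterion k.
    Each criterion is normalized to sum to 1 before mixing.
    """
    mixed = [0.0] * len(scores[0])
    for w, row in zip(weights, scores):
        total = sum(row)
        for i, s in enumerate(row):
            mixed[i] += w * s / total
    return mixed


def demo(episodes=200):
    """Toy loop with made-up payoffs; these are not results from the paper."""
    bandit = CriterionBandit(n_criteria=2, seed=0)
    synthetic_reward = [0.9, 0.2]  # hypothetical per-criterion episode rewards
    for _ in range(episodes):
        arm = bandit.select()
        bandit.update(arm, synthetic_reward[arm])
    return bandit.weights()


if __name__ == "__main__":
    print(demo())
```

In a real agent the reward passed to `update` would be some normalized measure of episode performance, and the output of `combine` would serve as the sampling distribution over the replay buffer. How PERMAB itself defines the criteria and credits them is described only in the published version.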

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/169265