CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum

Ao, S; Zhou, T; Long, G; Lu, Q; Zhu, L; Jiang, J

Publication Type: Conference Proceeding
Citation: Advances in Neural Information Processing Systems, 2021, 13, pp. 10444-10456
Issue Date: 2021-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Ao, S
dc.contributor.author	Zhou, T
dc.contributor.author	Long, G https://orcid.org/0000-0003-3740-9515
dc.contributor.author	Lu, Q
dc.contributor.author	Zhu, L
dc.contributor.author	Jiang, J https://orcid.org/0000-0001-5301-7779
dc.date.accessioned	2022-07-24T04:25:57Z
dc.date.available	2022-07-24T04:25:57Z
dc.date.issued	2021-01-01
dc.identifier.citation	Advances in Neural Information Processing Systems, 2021, 13, pp. 10444-10456
dc.identifier.isbn	9781713845393
dc.identifier.issn	1049-5258
dc.identifier.uri	http://hdl.handle.net/10453/159139
dc.description.abstract	Goal-conditioned reinforcement learning (RL) usually suffers from sparse rewards and inefficient exploration in long-horizon tasks. Planning can find the shortest path to a distant goal, which provides dense reward/guidance, but it is inaccurate without a precise environment model. We show that RL and planning can collaboratively learn from each other to overcome their own drawbacks. In “CO-PILOT”, a learnable path-planner and an RL agent produce dense feedback to train each other on a curriculum of tree-structured sub-tasks. First, the planner recursively decomposes a long-horizon task into a tree of sub-tasks in a top-down manner, whose layers form coarse-to-fine sub-task sequences that serve as plans to complete the original task. The planning policy is trained to minimize the RL agent’s cost of completing the sequence in each layer, from the top layer to the bottom layer, which gradually increases the number of sub-tasks and thus forms an easy-to-hard curriculum for the planner. Next, a bottom-up traversal of the tree trains the RL agent from easier sub-tasks with denser rewards on the bottom layers to harder ones on the top layers, and collects its cost on each sub-task to train the planner in the next episode. CO-PILOT repeats this mutual training for multiple episodes before switching to a new task, so the RL agent and planner are fully optimized to facilitate each other’s training. We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency. Our code is available at https://github.com/Shuang-AO/CO-PILOT.
dc.language	en
dc.relation.ispartof	Advances in Neural Information Processing Systems
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	1701 Psychology, 1702 Cognitive Sciences
dc.title	CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum
dc.type	Conference Proceeding
utslib.citation.volume	13
utslib.for	1701 Psychology
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
dc.date.updated	2022-07-24T04:25:53Z
pubs.publication-status	Published
pubs.volume	13

Abstract:

Goal-conditioned reinforcement learning (RL) usually suffers from sparse rewards and inefficient exploration in long-horizon tasks. Planning can find the shortest path to a distant goal, which provides dense reward/guidance, but it is inaccurate without a precise environment model. We show that RL and planning can collaboratively learn from each other to overcome their own drawbacks. In “CO-PILOT”, a learnable path-planner and an RL agent produce dense feedback to train each other on a curriculum of tree-structured sub-tasks. First, the planner recursively decomposes a long-horizon task into a tree of sub-tasks in a top-down manner, whose layers form coarse-to-fine sub-task sequences that serve as plans to complete the original task. The planning policy is trained to minimize the RL agent’s cost of completing the sequence in each layer, from the top layer to the bottom layer, which gradually increases the number of sub-tasks and thus forms an easy-to-hard curriculum for the planner. Next, a bottom-up traversal of the tree trains the RL agent from easier sub-tasks with denser rewards on the bottom layers to harder ones on the top layers, and collects its cost on each sub-task to train the planner in the next episode. CO-PILOT repeats this mutual training for multiple episodes before switching to a new task, so the RL agent and planner are fully optimized to facilitate each other’s training. We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency. Our code is available at https://github.com/Shuang-AO/CO-PILOT.
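
The tree structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it only shows how a top-down recursive decomposition yields coarse-to-fine layers, and how a bottom-up pass visits the finest layer first. Everything named here is a hypothetical stand-in: `propose_subgoal` replaces the learnable planner with a geometric midpoint, and `agent_cost` replaces the RL agent's measured cost with Euclidean distance.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class SubTask:
    """A (start, goal) sub-task; children split it at an intermediate sub-goal."""
    start: Point
    goal: Point
    left: Optional["SubTask"] = None
    right: Optional["SubTask"] = None


def propose_subgoal(start: Point, goal: Point) -> Point:
    # ASSUMPTION: placeholder for the learnable planner. The real planner is
    # trained; here we simply return the geometric midpoint.
    return ((start[0] + goal[0]) / 2.0, (start[1] + goal[1]) / 2.0)


def decompose(start: Point, goal: Point, depth: int) -> SubTask:
    """Top-down: recursively split a task into a binary tree of sub-tasks."""
    node = SubTask(start, goal)
    if depth > 0:
        mid = propose_subgoal(start, goal)
        node.left = decompose(start, mid, depth - 1)
        node.right = decompose(mid, goal, depth - 1)
    return node


def layers(root: SubTask) -> List[List[SubTask]]:
    """Return the tree layer by layer; layer k holds 2**k sub-tasks in order."""
    out: List[List[SubTask]] = []
    frontier = [root]
    while frontier:
        out.append(frontier)
        nxt: List[SubTask] = []
        for node in frontier:
            if node.left is not None and node.right is not None:
                nxt.extend([node.left, node.right])
        frontier = nxt
    return out


def agent_cost(task: SubTask) -> float:
    # ASSUMPTION: placeholder for the RL agent's cost on a sub-task.
    return math.dist(task.start, task.goal)


def bottom_up_costs(root: SubTask) -> List[float]:
    """Bottom-up: total cost per layer, finest (easiest) layer first."""
    return [sum(agent_cost(t) for t in layer) for layer in reversed(layers(root))]


if __name__ == "__main__":
    tree = decompose((0.0, 0.0), (8.0, 0.0), depth=2)
    for k, layer in enumerate(layers(tree)):
        print(k, [(t.start, t.goal) for t in layer])
    print(bottom_up_costs(tree))
```

In this toy setting every layer has the same total cost because the midpoint lies on the straight line between start and goal. In the method the abstract describes, the per-layer costs come from the RL agent and are the signal used to train the planner.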

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/159139