THESIS
2022
1 online resource (x, 69 pages) : illustrations (some color)
Abstract
In standard reinforcement learning (RL), environment interactions are usually available
for model training to facilitate continuous exploration and performance improvement.
In many RL applications, however, models are trained only on pre-existing datasets,
without online interaction with the environment. This problem is called offline
reinforcement learning (Offline RL). Recent studies in Offline RL mix traditional RL
techniques with some form of regularization, which usually aims at matching the RL
policy to the dataset-generating policy. This addresses the extrapolation errors that arise
when evaluating the quality of out-of-distribution state-action pairs, also known as the
distributional shift issue. However, most regularization techniques assume that the
environment states encountered by the RL agent during deployment stay close to the
dataset distribution, so that the agent can identify suitable actions to minimize
distributional shift. In many real-world applications with highly stochastic environments,
this might not be true. When an unfamiliar environment state, i.e. an out-of-distribution
(OOD) state, is encountered, the agent might pick actions that were unregularized during
training. These unregularized actions could lead to further distributional shift in later
interactions, forming a vicious cycle.
In this thesis, we propose an offline RL model that combines the regular actor-critic
architecture with a Wasserstein-1 divergence critic inspired by the Wasserstein Generative
Adversarial Network with gradient penalty (WGAN-GP) to address the issue of OOD states.
We build a gradient-penalized critic network that captures the divergence from the
state-action distribution of the dataset using the Wasserstein-1 distance metric, while
extending the distance metric to the full state space during training. When encountering
unfamiliar environment states during deployment, the model is still able to output actions
that are close to the marginal action distribution of the dataset.
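
A minimal sketch of a WGAN-GP-style gradient-penalized critic over state-action pairs,
included only to illustrate the idea described above; it is not the thesis implementation.
The names (WassersteinCritic, critic_loss, lambda_gp, the batch tensors) are hypothetical,
and the actual model integrates such a critic into a full actor-critic offline RL framework.

    # Illustrative sketch (assumed PyTorch): Wasserstein-1 critic with gradient penalty.
    import torch
    import torch.nn as nn

    class WassersteinCritic(nn.Module):
        """Scores (state, action) pairs; the gap between its scores on dataset and
        policy pairs approximates the Wasserstein-1 distance between the two."""
        def __init__(self, state_dim, action_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))

    def critic_loss(critic, dataset_s, dataset_a, policy_s, policy_a, lambda_gp=10.0):
        # WGAN-GP-style objective: push dataset pairs up, policy pairs down, and
        # penalize gradients whose norm deviates from 1 (1-Lipschitz constraint).
        d_real = critic(dataset_s, dataset_a).mean()
        d_fake = critic(policy_s.detach(), policy_a.detach()).mean()

        # Gradient penalty on random interpolates between dataset and policy pairs,
        # which enforces the constraint beyond the dataset samples themselves.
        eps = torch.rand(dataset_s.size(0), 1, device=dataset_s.device)
        interp_s = (eps * dataset_s + (1 - eps) * policy_s.detach()).requires_grad_(True)
        interp_a = (eps * dataset_a + (1 - eps) * policy_a.detach()).requires_grad_(True)
        d_interp = critic(interp_s, interp_a)
        grads = torch.autograd.grad(
            outputs=d_interp.sum(), inputs=[interp_s, interp_a], create_graph=True)
        grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
        gp = ((grad_norm - 1.0) ** 2).mean()

        # Critic minimizes this; equivalently it maximizes d_real - d_fake under the penalty.
        return d_fake - d_real + lambda_gp * gp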
In our experiments, we tested our model in a real-world application setting: an automatic
cremation control system. We show that our model produces actions similar to human actions
in OOD states, whereas models from previous studies fail to produce meaningful actions in
those states. In addition, our model achieves stable performance and outperforms the human
baseline.