THESIS
2022
1 online resource (x, 69 pages) : illustrations (some color)
Abstract
In standard reinforcement learning (RL), environment interactions are usually available
for model training to facilitate continuous exploration and performance improvement.
In many RL applications, however, models are trained only on pre-existing datasets,
without online interaction with the environment. This problem is called offline
reinforcement learning (Offline RL). Recent studies in Offline RL mix traditional RL
techniques with some form of regularization, which usually aims at matching the RL
policy to the dataset-generating policy. This addresses the extrapolation errors that arise
when evaluating the quality of out-of-distribution state-action pairs, also known as the
distributional shift issue. However, most regularization techniques assume that the
environment states encountered by the RL agent during deployment stay close to the
dataset distribution, so that the agent can identify suitable actions to minimize
distributional shift. In many real-world applications with highly stochastic environments,
this might not be true. When an unfamiliar environment state, i.e. an out-of-distribution
(OOD) state, is encountered, the agent might pick actions that were unregularized during
training. These unregularized actions could lead to further distributional shift in later
interactions, forming a vicious cycle.
In this thesis, we propose an offline RL model that combines the regular actor-critic
architecture with a Wasserstein-1 divergence critic inspired by the Wasserstein Generative
Adversarial Network with gradient penalty (WGAN-GP) to address the issue of OOD states.
We build a gradient-penalized critic network that captures the divergence from the
state-action distribution of the dataset using the Wasserstein-1 distance metric, while
extending the distance metric to the full state space during training. When encountering
unfamiliar environment states during deployment, the model is still able to output actions
that are close to the marginal action distribution of the dataset.
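
A minimal sketch of a WGAN-GP-style gradient-penalized critic over state-action pairs,
included only to illustrate the idea described above; it is not the thesis implementation.
The names (WassersteinCritic, critic_loss, lambda_gp, the batch tensors) are hypothetical,
and the actual model integrates such a critic into a full actor-critic offline RL framework.

    # Illustrative sketch (assumed PyTorch): Wasserstein-1 critic with gradient penalty.
    import torch
    import torch.nn as nn

    class WassersteinCritic(nn.Module):
        """Scores (state, action) pairs; the gap between its scores on dataset and
        policy pairs approximates the Wasserstein-1 distance between the two."""
        def __init__(self, state_dim, action_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))

    def critic_loss(critic, dataset_s, dataset_a, policy_s, policy_a, lambda_gp=10.0):
        # WGAN-GP-style objective: push dataset pairs up, policy pairs down, and
        # penalize gradients whose norm deviates from 1 (1-Lipschitz constraint).
        d_real = critic(dataset_s, dataset_a).mean()
        d_fake = critic(policy_s.detach(), policy_a.detach()).mean()

        # Gradient penalty on random interpolates between dataset and policy pairs,
        # which enforces the constraint beyond the dataset samples themselves.
        eps = torch.rand(dataset_s.size(0), 1, device=dataset_s.device)
        interp_s = (eps * dataset_s + (1 - eps) * policy_s.detach()).requires_grad_(True)
        interp_a = (eps * dataset_a + (1 - eps) * policy_a.detach()).requires_grad_(True)
        d_interp = critic(interp_s, interp_a)
        grads = torch.autograd.grad(
            outputs=d_interp.sum(), inputs=[interp_s, interp_a], create_graph=True)
        grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
        gp = ((grad_norm - 1.0) ** 2).mean()

        # Critic minimizes this; equivalently it maximizes d_real - d_fake under the penalty.
        return d_fake - d_real + lambda_gp * gp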
In our experiments, we tested our model in a real-world application setting: an automatic
cremation control system. We show that our model produces actions similar to human actions
in OOD states, whereas models from previous studies fail to produce meaningful actions in
those states. In addition, our model achieves stable performance and outperforms the human
baseline.