THESIS
2018
xii, 175 pages : illustrations ; 30 cm
Abstract
The booming development of Artificial Intelligence promotes a large number of online interactive
services including recommender system (RecSys), online advertising, dialogue system, etc. These
services require sophisticated algorithms to decide actions sequentially and to maximize the cumulative
user feedback in the long run. To accomplish this goal, algorithms should simultaneously
exploit and explore the user interests according to the partial and noisy user feedback. Bandit problem
can successfully formulate the exploitation-exploration trade-off in these applications. When
facing the insufficient observations in a target domain of interest, unfortunately, bandit policies may
explore more than needed, which may lead to worse performance.
In this thesis, we study a novel and c...[
Read more ]
The booming development of Artificial Intelligence promotes a large number of online interactive
services including recommender system (RecSys), online advertising, dialogue system, etc. These
services require sophisticated algorithms to decide actions sequentially and to maximize the cumulative
user feedback in the long run. To accomplish this goal, algorithms should simultaneously
exploit and explore the user interests according to the partial and noisy user feedback. Bandit problem
can successfully formulate the exploitation-exploration trade-off in these applications. When
facing the insufficient observations in a target domain of interest, unfortunately, bandit policies may
explore more than needed, which may lead to worse performance.
In this thesis, we study a novel and challenging problem: Transferable Bandit. Via transfer
learning, transferable bandit leverages prior knowledge from the existing source domains with sufficient
user feedback to further optimize the cumulative rewards in the target domains of interest.
Transferable bandit harness the collective and mutually reinforcing power of the bandit formulation
and transfer learning. First, transfer learning improves the exploitation, accelerates its exploration,
and balances the exploitation and exploration appropriately in the target domain. Second, the transferable
bandit policy explores how to transfer and speeds up the knowledge transfer.
This thesis addresses several critical challenges of transferable bandit problems. First, we propose
Transferable Contextual Bandit (TCB) policy to bridge the action and context heterogeneity.
Second, we present Lifelong Contextual Bandit (LCB) policy that sequentially transfers knowledge
across homogeneous domains and maximizes overall cumulative rewards. Third, to facilitate the
large-scale online deployment, we illustrate two speed up methods including stochastic approximation
and feature selection. This thesis also presents a general framework based on the upper
confidence bound principle to address the transferable bandit problem. Both empirical studies on
real-world datasets and theoretical regret analysis validate this thesis.
Post a Comment