Online learning in MDPs with linear function approximation and bandit feedback

By Gergely Neu and Julia Olkhovskaya
We consider an online learning problem in which the learner interacts with a Markov decision process over a sequence of episodes. The reward function may change adversarially between episodes, and the learner observes only the rewards associated with its own actions. We allow the...
June 12, 2021