
1 Reinforcement Learning: Past, Present, and Future Perspectives

Markov Decision Process (MDP) - Bellman (1957)
- Agent, state, reward and environment
- The agent does not know the state-transition dynamics of the environment
- Reward is a modeling choice that shapes the agent's behavior
- A behavior policy is a distribution over actions given a state
Optimality in MDPs
- Finite Horizon
- Infinite Horizon
- Average Reward
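The three criteria above are commonly written as follows. This is a sketch in standard textbook notation rather than the notation of the talk: the symbols r_t (reward at step t), h (horizon) and gamma (discount factor) are assumptions, and the infinite-horizon case is assumed to be the usual discounted formulation.

```latex
\begin{align*}
\text{Finite horizon:} \quad
  & \mathbb{E}\Big[\sum_{t=0}^{h} r_t\Big] \\
\text{Infinite horizon (discounted):} \quad
  & \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big], \quad 0 \le \gamma < 1 \\
\text{Average reward:} \quad
  & \lim_{h \to \infty} \mathbb{E}\Big[\frac{1}{h} \sum_{t=0}^{h} r_t\Big]
\end{align*}
```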
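The Bellman equations define the value of a state recursively: the immediate reward plus the value of the states that can follow. The transition function takes the agent from state to state; under the Markov property, the next state depends only on the current state and action, not on the earlier history.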
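One way to make the recursion concrete is value iteration, which applies the Bellman optimality backup repeatedly until the values stop changing. The sketch below assumes a discounted infinite-horizon setting and a known model; the two-state MDP, the function name value_iteration and the layout of P and R are hypothetical and chosen only for illustration.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the expected reward."""
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iters):
        # Bellman optimality backup:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        Q = [[R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
              for a in range(len(P[s]))]
             for s in range(n_states)]
        V_new = [max(q) for q in Q]
        delta = max(abs(v1 - v0) for v1, v0 in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the last computed action values
    policy = [max(range(len(q)), key=q.__getitem__) for q in Q]
    return V, policy


# Hypothetical toy MDP: action 0 = stay, action 1 = move to the other state.
# Only staying in state 1 pays a reward of 1.
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # transitions from state 0
    [[(1.0, 1)], [(1.0, 0)]],  # transitions from state 1
]
R = [
    [0.0, 0.0],  # rewards in state 0
    [1.0, 0.0],  # rewards in state 1
]

V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)
```

With gamma = 0.9 the values approach 9 for state 0 and 10 for state 1, and the greedy policy moves from state 0 to state 1 and then stays there.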
TD-Gammon
Q-learning
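Unlike value iteration, Q-learning does not need the transition model: it updates action values from sampled transitions. The following is a minimal tabular sketch, not the setup from the talk. The five-state chain, the uniform random behavior policy and all hyperparameters are assumptions made for illustration.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical chain; reward 1 on reaching the last state."""
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Behavior policy: uniform random (Q-learning is off-policy)
            a = rng.randrange(2)
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == goal else 0.0
            # The terminal state has no future value to bootstrap from
            target = r if s2 == goal else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
# Greedy action in each non-terminal state
greedy = [max((0, 1), key=lambda a, s=s: Q[s][a]) for s in range(4)]
print(greedy)
```

Although the agent acts at random, the learned greedy policy moves right in every non-terminal state, which illustrates the separation between the behavior policy and the policy being learned.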
DQN
- Experience replay buffer and mini-batch SGD
- Separate target network stabilizes optimization targets
- normalized error
- Evaluated on 49 Atari games; reached at least 75% of the human tester's score on more than half of them (Mnih et al. 2015)
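The first two ingredients above can be sketched without a deep learning library. This is an illustrative outline only: ReplayBuffer, td_targets and the target_q callable are hypothetical names, and target_q stands in for a frozen copy of the Q-network that would be synced with the online network only occasionally.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly for mini-batch SGD."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Regression targets r + gamma * max_a Q_target(s', a) for a mini-batch."""
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(i, 0, 0.0, i + 1, False)
sampled_states = sorted(t[0] for t in buf.sample(3))

# Stand-in for the frozen target network: constant action values
frozen_q = lambda state: [0.5, 2.0]
targets = td_targets([(0, 0, 1.0, 1, False), (0, 0, 1.0, 1, True)], frozen_q, gamma=0.9)
```

Because the targets come from the frozen copy rather than from the network being trained, they do not shift with every gradient step.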
Improving DQN since 2016
- Double Q-learning (van Hasselt et al.)
- Averaged Q-learning (Anschel et al.)
- Hindsight experience replay (Andrychowicz et al.)
- Distributional RL
- Ape-X - distributed replay buffer
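As an example of how small some of these changes are, the Double Q-learning item in the list above alters only the target computation. The sketch below assumes the Double DQN form, where the online network selects the next action and the target network evaluates it. The function names and the numbers are hypothetical.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]


# Hypothetical action values for the next state
next_q_online = [1.0, 2.0]
next_q_target = [3.0, 0.5]

standard = dqn_target(0.0, next_q_target, gamma=0.5)
double = double_dqn_target(0.0, next_q_online, next_q_target, gamma=0.5)
print(standard, double)
```

Here the standard target is 1.5 while the double target is 0.25. Decoupling selection from evaluation is intended to reduce the overestimation that the max operator introduces when value estimates are noisy.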

How to use various tasks to show that a learning strategy has generalized
Multi-room benchmark
- The agent must learn to navigate across many rooms, not just a single room

PILCO (Deisenroth & Rasmussen 2011)
World Models (Ha & Schmidhuber 2018)
- learn models for policy optimization in visual domains
Ensembles of Bayesian NNs (Chua et al. 2018)