Table of Contents

1 Reinforcement Learning: Past, Present, and Future Perspectives

  • Katja Hofmann

1.1 Key RL Challenges

  • Explore vs Exploit
  • Credit Assignment
  • Function approximation

1.2 Background

  • Markov Decision Process (MDP) - Bellman (1957)
    • Agent, state, reward and environment
    • The agent does not know the environment's transition dynamics
    • The reward is a modeling choice that shapes the agent's behavior
    • A policy is a distribution over actions given a state
  • Optimality in MDP
    • Finite Horizon
    • Infinite Horizon
    • Average Reward
  • The recursive Bellman equations define the value of each decision: the transition function takes the agent from state to state, and the value of a state depends on the immediate reward plus the value of the states reachable from it (written out after this list)
  • TD-Gammon
  • Q-learning
  • DQN
    • Experience replay buffer and mini-batch SGD (see the sketch after this list)
    • Separate target network stabilizes optimization targets
    • Normalized error
    • Able to play 49 Atari games and reach superhuman results on 75% of them
  • Improving DQN since 2016
    • Double Q-learning (van Hasselt et al., 2016)
    • Averaged-DQN (Anschel et al., 2017)
    • Hindsight experience replay (Andrychowicz et al., 2017)
    • Distributional RL (Bellemare et al., 2017)
    • Ape-X - distributed prioritized experience replay (Horgan et al., 2018)
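
For reference (standard material, written out here rather than taken from the slides), the recursion described above is the Bellman optimality equation, with P the transition function, R the reward, and \gamma the discount factor:

    V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]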
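
Below is a minimal sketch of the two DQN ingredients noted above: uniform sampling from a replay buffer, and a separate target network. The layer sizes, state/action dimensions, and the Huber loss are illustrative placeholders, not the setup from the paper.

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Toy sizes for illustration only; the paper used conv nets over Atari frames.
    STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

    q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    target_net.load_state_dict(q_net.state_dict())  # start both networks in sync

    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)  # experience replay buffer of (s, a, r, s', done)

    def train_step(batch_size=32):
        """One mini-batch SGD step on uniformly sampled past transitions."""
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)  # sampling breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        s, s2 = torch.stack(states), torch.stack(next_states)
        a, r = torch.tensor(actions), torch.tensor(rewards)
        d = torch.tensor(dones, dtype=torch.float32)

        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) of taken actions
        with torch.no_grad():  # frozen target network keeps the regression target stable
            target = r + GAMMA * (1 - d) * target_net(s2).max(dim=1).values
        loss = nn.functional.smooth_l1_loss(q, target)  # Huber loss tames large errors
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    def sync_target():
        """Call every few thousand steps to refresh the optimization target."""
        target_net.load_state_dict(q_net.state_dict())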

1.3 Exploration vs Exploitation - Common Approaches

  • Optimistic initialization
  • Epsilon-greedy (see the sketch after this list)
  • Softmax Policy
    • Actions are sampled with probability proportional to exp(Q / temperature)
  • Upper Confidence Bound
  • Posterior Sampling
    • The agent maintains a posterior distribution over value estimates and samples from it to act
    • Well studied in bandit settings
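
A toy sketch of two of the strategies above (epsilon-greedy and the softmax/Boltzmann policy); the Q-values at the bottom are made-up numbers:

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q_values, epsilon=0.1):
        """With probability epsilon explore uniformly, otherwise exploit."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))

    def softmax_policy(q_values, temperature=1.0):
        """Sample an action with probability proportional to exp(Q / temperature)."""
        prefs = np.asarray(q_values) / temperature
        prefs -= prefs.max()  # shift for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(q_values), p=probs))

    q = [1.0, 2.5, 0.3]  # toy action-value estimates
    print(epsilon_greedy(q), softmax_policy(q))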

1.4 Bootstrapped DQN - Osband et al. 2016

  • Approximate uncertainty over Q using deep ensembles (see the sketch after this list)
  • Extended in 2018 with randomized prior functions (Osband et al., 2018)
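
Roughly, the mechanism: train K Q-heads on different bootstrapped samples of the data, pick one head at random at the start of each episode, and act greedily with it for the whole episode. The sketch below shows only the action-selection side, with untrained random linear maps standing in for deep Q-heads:

    import numpy as np

    rng = np.random.default_rng(0)
    K, STATE_DIM, N_ACTIONS = 10, 4, 3  # toy sizes, not from the paper

    # Stand-ins for K bootstrapped Q-heads; in practice these are network heads
    # trained on different bootstrapped subsets of the replay data.
    heads = [rng.normal(size=(STATE_DIM, N_ACTIONS)) for _ in range(K)]

    def begin_episode():
        """Sample one head per episode; committing to it gives deep exploration."""
        return int(rng.integers(K))

    def act(head_idx, state):
        """Act greedily with respect to the sampled head's Q-values."""
        return int(np.argmax(state @ heads[head_idx]))

    k = begin_episode()
    print(act(k, rng.normal(size=STATE_DIM)))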

1.5 Actor-Critic

  • Conservative Policy Iteration (CPI) (Kakade and Langford, 2002)
  • Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)
  • Proximal Policy Optimization (PPO) (Schulman et al., 2017); its clipped objective is written out after this list
  • Soft Actor-Critic (SAC) (Haarnoja et al., 2018)
  • Optimistic Actor-Critic (Ciosek et al., 2019)
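
For reference (the standard form from the PPO paper), the clipped surrogate objective, where r_t(\theta) is the new-to-old policy probability ratio and \hat{A}_t an advantage estimate:

    L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\hat{A}_t \right) \right],
    \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}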

1.6 Generalization in ML

  • How to use varied tasks to test whether a learned strategy generalizes
  • Multi-room benchmark
    • The agent must learn to navigate in general, not just within a single room

1.7 Meta-learning

  • Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017); the two-level update is written out below
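
For reference, MAML's two-level update: an inner gradient step adapts the shared parameters \theta to each task, and the outer step differentiates through that adaptation (\alpha and \beta are the inner and outer learning rates):

    \theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), \qquad
    \theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})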

1.8 Models

1.8.1 Learning models from data?

  • PILCO (Deisenroth & Rasmussen, 2011)
  • World Models (Ha and Schmidhuber, 2018)
    • Learn models for policy optimization in visual domains (see the sketch after this list)
  • Ensembles of Bayesian NNs (Chua et al., 2018)
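
The pattern these methods share, very roughly: fit a dynamics model to observed transitions, then improve the policy with imagined rollouts from the model instead of additional real environment steps. A toy sketch with a linear least-squares model and a made-up quadratic reward, far simpler than the Gaussian processes and NN ensembles the papers actually use:

    import numpy as np

    rng = np.random.default_rng(0)

    def env_step(s, a):
        """True dynamics s' = 0.9*s + a, unknown to the agent."""
        return 0.9 * s + a

    # 1. Collect real transitions with a random policy.
    data, s = [], 1.0
    for _ in range(200):
        a = rng.uniform(-1, 1)
        s2 = env_step(s, a)
        data.append((s, a, s2))
        s = s2

    # 2. Fit a linear dynamics model s' ~ w0*s + w1*a by least squares.
    X = np.array([(si, ai) for si, ai, _ in data])
    y = np.array([s2 for _, _, s2 in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # 3. Score candidate actions with imagined rollouts (no real env calls).
    def imagined_return(s, a, horizon=10):
        total = 0.0
        for _ in range(horizon):
            s = w[0] * s + w[1] * a  # roll the learned model forward
            total += -s**2           # made-up reward: keep the state near zero
        return total

    best_a = max(np.linspace(-1, 1, 21), key=lambda a: imagined_return(1.0, a))
    print("learned weights:", w, "best first action:", round(best_a, 2))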

2 Papers

  • Kaelbling, Littman, and Moore (1996)
  • Cobbe et al. (2019)
  • Dayan (1993) (successor representation)
  • Chevalier-Boisvert et al. (2018) (multi-room benchmark)

Author: Sam Partee

Created: 2019-12-09 Mon 16:25
