
Page 1

Reinforcement Learning

Model-Based Reinforcement Learning

Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning

Vien Ngo, Marc Toussaint

University of Stuttgart

Page 2

Outline

• Model-Based RL

• Exploration/Exploitation

• PAC-MDP

• E3, RMAX

• Bayes optimal: Bayesian model-based RL

Page 3

RL Approaches

Page 4

Limitations of the model-free view

• given learnt values, behavior is a fixed stimulus-response (or state-action) mapping

• if the “goal” changes: values/policy need to be re-learnt for every state in the world! all previous values are obsolete

• no general “knowledge”, only values

• no anticipation of outcomes (s′), only of value

• no “planning”

Page 5

Wolfgang Köhler (1917): Intelligenzprüfungen an Menschenaffen (The Mentality of Apes)

[movie]

Goal-directed Decision Making

Page 7

Goal-directed vs. habitual: Devaluation

Niv, Joel & Dayan: A normative perspective on motivation. TICS 10:375-381, 2006.

Page 8

Goal-directed vs. habitual: Devaluation

Niv, Joel & Dayan: A normative perspective on motivation. TICS 10:375-381, 2006.

Page 9

By definition, goal-directed behavior is performed to obtain a desired goal. Although all instrumental behavior is instrumental in achieving its contingent goals, it is not necessarily purposively goal-directed. Dickinson and Balleine [1,11] proposed that behavior is goal-directed if: (i) it is sensitive to the contingency between action and outcome, and (ii) the outcome is desired. Based on the second condition, motivational manipulations have been used to distinguish between two systems of action control: if an instrumental outcome is no longer a valued goal (for instance, food for a sated animal) and the behavior persists, it must not be goal-directed. Indeed, after moderate amounts of training, outcome revaluation brings about an appropriate change in instrumental actions (e.g. lever-pressing) [43,44], but this is no longer the case for extensively trained responses ([30,31], but see [45]). That extensive training can render an instrumental action independent of the value of its consequent outcome has been regarded as the experimental parallel of the folk psychology maxim that well-performed actions become habitual [9] (see Figure I).

Niv, Joel & Dayan: A normative perspective on motivation. TICS 10:375-381, 2006.

Page 10

Model-based RL

• Model learning: Given data $D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^H$, estimate $P(s' \mid a, s)$ and $P(r \mid a, s)$
– discrete state-action: $P(s' \mid a, s) = \frac{\#(s', a, s)}{\#(a, s)}$
– continuous state-action: $P(s' \mid a, s) = \mathcal{N}(s' \mid \phi(s, a)^\top \beta, \Sigma)$
  estimate parameters $\beta$ (and perhaps $\Sigma$) as for regression (including non-linear features, regularization, cross-validation!)

• Planning:
– discrete state-action: Value Iteration with the estimated model
– continuous state-action: Least Squares Value Iteration, Differential Dynamic Programming, Planning-by-Inference

(A minimal code sketch of both steps follows below.)
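To make these two steps concrete, here is a minimal tabular sketch (not from the slides): count-based estimation of $P(s'|s,a)$ and $R(s,a)$, followed by value iteration on the estimated model. The array layout and function names are illustrative assumptions.

```python
import numpy as np

def estimate_model(data, n_states, n_actions):
    """Count-based model estimates from data D = {(s, a, r, s')}."""
    counts = np.zeros((n_states, n_actions, n_states))   # #(s, a, s')
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)              # #(s, a)
    P = np.divide(counts, n_sa,
                  out=np.full_like(counts, 1.0 / n_states), where=n_sa > 0)
    R = np.divide(rew_sum, n_sa[:, :, 0],
                  out=np.zeros_like(rew_sum), where=n_sa[:, :, 0] > 0)
    return P, R

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Plan in the estimated MDP; returns value function and greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

Unvisited (s, a) pairs are given a uniform transition estimate here; any other default (or a prior) would do as well.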

Page 11

Exploration: Motivation

Start: many potential paths

Page 12

Exploration: Motivation

Continue using this path?

Try out alternatives?

Page 13

Exploration: Motivation

Page 14

Exploration: Motivation

Page 15

Reinforcement learning

• In reinforcement learning (RL), the agent starts to act without a model of the environment.

• The agent has to learn from its experience what to do in order to fulfill tasks and achieve high rewards.

• RL algorithms we have seen thus far: Q-learning, TD-learning

• Note the difference to the problem of adapting the behavior based on a given model (also called planning / solving an MDP / calculating optimal state and action values). This is a computational subproblem in model-based RL.

• Planning algorithms we have seen thus far: value iteration, policy iteration.

Page 17

Exploration / Exploitation

• In contrast to supervised learning, in RL the data used for learning depend on the agent.

• Two different types of behavior:
– exploration: behave with the goal to learn as much as possible
– exploitation: behave with the goal of getting as much reward as possible

• Challenge in exploration: which actions will lead to the most important learning progress with respect to the goal?
→ exploration as fundamental intelligent behavior

Page 18

Recall Markov Decision Processes

[figure: the MDP as a graphical model over states $s_0, s_1, s_2, \ldots$, actions $a_0, a_1, a_2, \ldots$, and rewards $r_0, r_1, r_2, \ldots$]

$P(s_{0:T}, a_{0:T}, r_{0:T}; \pi) = P(s_0)\, P(a_0 \mid s_0; \pi)\, P(r_0 \mid a_0, s_0) \prod_{t=1}^{T} P(s_t \mid a_{t-1}, s_{t-1})\, P(a_t \mid s_t; \pi)\, P(r_t \mid a_t, s_t)$

– world's initial state distribution $P(s_0)$
– world's transition probabilities $P(s_{t+1} \mid a_t, s_t)$
– world's reward probabilities $P(r_t \mid a_t, s_t)$
– agent's policy $\pi(a_t \mid s_t)$ (or deterministic $a_t = \pi(s_t)$)
– discount parameter $\gamma$ for future rewards

– two different sources of uncertainty: the world itself (not controlled by the agent) vs. the policy (controlled by the agent)

Page 19

Exploration-exploitation tradeoff

• Goal of the reinforcement learning agent: maximize future rewards $E\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0; \pi\right]$

• However, the agent does not know the transition parameters $P(s_{t+1} \mid a_t, s_t)$ and reward parameters $P(r_t \mid a_t, s_t)$ of the MDP.

• Rather, the agent needs to learn from its experience $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ which actions will lead to high rewards.

• Exploration-exploitation tradeoff: Which policy $\pi(a_t \mid s_t)$ for action selection shall the agent follow so that it does not miss the high-reward states, but does not spend too much time in low-reward states, either?
– Exploitation: Prefer actions $a_t$ which have led to reward before?
– Exploration: Or rather take actions to learn more about the unknown MDP parameters and potentially find states with higher reward?

Page 21

Sample Complexity

• Let $M$ be an MDP with $N$ states, $K$ actions, discount factor $\gamma \in [0, 1)$ and a maximal reward $R_{\max} > 0$.

• Let $A$ be an algorithm (that is, a reinforcement learning agent) that acts in the environment, resulting in $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$.

• Let $V^A_{t,M} = E\left[\sum_{s=0}^{\infty} \gamma^s r_{t+s} \mid s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t\right]$.

• $V^*$ is the value function of the optimal policy.

• Definition: Let $\varepsilon > 0$ be a prescribed accuracy and $\delta > 0$ be an allowed probability of failure. The expression $\eta(\varepsilon, \delta, N, K, \gamma, R_{\max})$ is a sample complexity bound for algorithm $A$ if, independently of the choice of $s_0$, with probability at least $1 - \delta$, the number of timesteps such that $V^A_{t,M} < V^*(s_t) - \varepsilon$ is at most $\eta(\varepsilon, \delta, N, K, \gamma, R_{\max})$. (Kakade, 2003)

Page 22

Efficient exploration

• An algorithm with sample complexity that is polynomial in $1/\varepsilon$, $\log(1/\delta)$, $N$, $K$, $1/(1-\gamma)$, $R_{\max}$ is called PAC-MDP (probably approximately correct in MDPs).

Page 23

Exploration strategies

• The exploration strategy is reflected in the policy π(s).

• In the following, assume we have estimates Q(s, a).

• greedy (only exploit): $\pi(s) = \operatorname{argmax}_a Q(s, a)$
– problem: learned model not the same as the true environment
– Without exploration, the agent is likely to miss high rewards.

• random: choose action $a$ with probability $1/\#\text{actions}$
– problem: ignores value estimates and thus rewards

Page 24

Exploration strategies (continued)

• $\varepsilon$-greedy:
$\pi(s) = \begin{cases} \operatorname{argmax}_a Q(s,a) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$
– most popular method
– Converges to the optimal value function with probability 1 (all paths will be visited sooner or later), if the exploration rate diminishes according to an appropriate schedule.
– problem: sample complexity exponential in the number of states

• Boltzmann: choose action $a$ with probability $\frac{\exp(Q(s,a)/T)}{\sum_{a'} \exp(Q(s,a')/T)}$
– temperature $T$ controls the amount of exploration
– problem again: sample complexity exponential in the number of states

(Both rules are sketched in code below.)
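A minimal sketch (not from the slides) of the two action-selection rules above, assuming a Q-table indexed as Q[s, a]:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """With probability 1-epsilon pick argmax_a Q(s, a), otherwise a uniform random action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, T=1.0, rng=None):
    """Sample a with probability exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T); T controls exploration."""
    rng = rng or np.random.default_rng()
    prefs = Q[s] / T
    prefs = prefs - prefs.max()            # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(Q.shape[1], p=probs))
```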

Page 25

Exploration strategies (continued)

• Other heuristics for exploration:
– minimize variance of action value estimates
– optimistic initial values (“optimism in the face of uncertainty”)
– state bonuses: frequency, recency, error etc.
→ problem again: sample complexity exponential in the number of states

• Bayesian RL: optimal exploration strategy
– distribution over MDP models (i.e., the parameters of the MDP)
– posterior distribution updated after each new observation
– exploration strategy minimizes uncertainty of parameters
– Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. the prior distribution over MDPs)
– only tractable for very simple problems

• E3 and RMAX: principled approach to the exploration-exploitation tradeoff with polynomial sample complexity

Page 26

Explicit-Exploit-or-Explore (E3) algorithm
Kearns and Singh (2002)

• Model-based approach with polynomial sample complexity (PAC-MDP)
– uses optimism in the face of uncertainty
– assumes knowledge of the maximum reward

• Maintains counts for states and actions to quantify confidence in model estimates
– A state $s$ is known if all actions in $s$ have been executed sufficiently often.

• From observed data, E3 constructs two MDPs:
– MDPknown: includes known states with (approximately exact) estimates of $P(s_{t+1} \mid a_t, s_t)$ and $P(r_t \mid a_t, s_t)$ → model which captures what you know
– MDPunknown: MDPknown + special state $s'$ where the agent receives maximum reward → model which drives exploration

Page 28

E3 sketch

Input: state $s$
Output: action $a$

if $s$ is known then
  Plan in MDPknown    ▷ sufficiently accurate model estimates
  if resulting plan has value above some threshold then
    return first action of plan    ▷ exploitation
  else
    Plan in MDPunknown
    return first action of plan    ▷ planned exploration
  end if
else
  return action with the least observations in $s$    ▷ direct exploration
end if

(A runnable sketch of this decision rule follows below.)
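A runnable sketch (my illustration, not the exact published algorithm) of the decision rule above. For simplicity, the exploration model here assigns $R_{\max}$ to insufficiently visited $(s,a)$ pairs instead of adding the special absorbing state of the slide's MDPunknown; `plan` can be any planner, e.g. the value iteration sketched earlier.

```python
import numpy as np

def e3_action(s, counts, P_hat, R_hat, m_known, R_max, gamma, value_threshold, plan):
    """One E3-style decision. counts[s, a] are visit counts; a state is 'known'
    once every action in it has been tried at least m_known times."""
    if counts[s].min() >= m_known:                   # s is known
        V, pi = plan(P_hat, R_hat, gamma)            # plan in MDP_known
        if V[s] >= value_threshold:
            return int(pi[s])                        # exploitation
        # Exploration model: under-visited (s, a) pairs promise maximum reward
        R_explore = np.where(counts >= m_known, R_hat, R_max)
        _, pi_explore = plan(P_hat, R_explore, gamma)
        return int(pi_explore[s])                    # planned exploration
    return int(np.argmin(counts[s]))                 # direct exploration: least-tried action
```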

Page 29

E3 example

S. Singh (Tutorial 2005)

Page 30

E3 example

S. Singh (Tutorial 2005)

Page 31

E3 example

$M$: true known-state MDP; $\hat{M}$: estimated known-state MDP
S. Singh (Tutorial 2005)

Page 32

E3

Implementation Setting

• T is the time horizon.

• $G^T_{\max}$ is the maximum $T$-step return (in the discounted case $G^T_{\max} \le T R_{\max}$).

• A state is known if it was visited $O\!\left(\left(\frac{N T G^T_{\max}}{\varepsilon}\right)^4 \mathrm{Var}_{\max} \log(1/\delta)\right)$ times ($\mathrm{Var}_{\max}$ is the maximum variance of the random payoffs over all states).

• For the exploration/exploitation choice at known states: the optimal value function $V^*$ is assumed to be given. If the value $V$ obtained from MDPknown is $> V^* - \varepsilon$, then exploit.

Page 33

RMAX

• R-MAX plans in only a single model (it does not separate MDPknown and MDPunknown) and therefore implicitly explores or exploits.

• The R-MAX and E3 algorithms achieve roughly the same level of performance (Strehl's thesis).

• RMAX builds an approximate MDP based on the reward function
$R(s, a) = \begin{cases} \hat{R}(s, a) & \text{if } (s,a) \text{ is known} \\ R_{\max} & \text{otherwise} \end{cases}$
where $\hat{R}(s,a)$ is the empirical estimate.

Page 35

RMAX sketch

Initialize all counters $n(s,a) = 0$, $n(s,a,s') = 0$.
Initialize $T(s'|s,a) = \mathbb{I}[s = s']$, $R(s,a) = R_{\max}$.

while true do
  Compute policy $\pi_t$ using the MDP model $(T, R)$.
  Choose $a = \pi_t(s)$, observe $s', r$.
  $n(s,a) \leftarrow n(s,a) + 1$
  $r(s,a) \leftarrow r(s,a) + r$
  $n(s,a,s') \leftarrow n(s,a,s') + 1$
  if $n(s,a) = m$ then
    Update $T(\cdot|s,a) = n(s,a,\cdot)/n(s,a)$ and $R(s,a) = r(s,a)/n(s,a)$.
  end if
end while

(A runnable version of this loop is sketched below.)
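A runnable sketch of the loop above (assumptions: a tabular environment object exposing `reset()` and `step(a)` returning `(s_next, r)`, and value iteration as the planner):

```python
import numpy as np

def rmax(env, n_states, n_actions, m=10, R_max=1.0, gamma=0.95, steps=10_000):
    """Tabular R-MAX following the sketch above."""
    n = np.zeros((n_states, n_actions))                # n(s, a)
    nss = np.zeros((n_states, n_actions, n_states))    # n(s, a, s')
    rsum = np.zeros((n_states, n_actions))             # accumulated reward r(s, a)
    # Optimistic initial model: every state-action self-loops with maximum reward
    T = np.tile(np.eye(n_states)[:, None, :], (1, n_actions, 1))
    R = np.full((n_states, n_actions), R_max)

    def plan(iters=500):                               # value iteration on the current model (T, R)
        V = np.zeros(n_states)
        for _ in range(iters):
            V = (R + gamma * (T @ V)).max(axis=1)
        return (R + gamma * (T @ V)).argmax(axis=1)

    s, pi = env.reset(), plan()
    for _ in range(steps):
        a = int(pi[s])
        s_next, r = env.step(a)
        n[s, a] += 1
        nss[s, a, s_next] += 1
        rsum[s, a] += r
        if n[s, a] == m:                               # (s, a) becomes known: plug in empirical estimates
            T[s, a] = nss[s, a] / n[s, a]
            R[s, a] = rsum[s, a] / n[s, a]
            pi = plan()                                # replan only when the model changes
        s = s_next
    return pi
```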

Page 36

RMAX Analysis

• The general PAC-MDP theorem does not easily adapt to the analysis of E3 because of E3's use of two internal models (the original analysis depends on horizon and mixing time).

• (Upper bound) There exists $m = O\!\left(\frac{N T^2}{\varepsilon^2} \ln^2 \frac{NA}{\delta}\right)$ such that, with probability at least $1-\delta$, $V(s_t) \ge V^*(s_t) - \varepsilon$ holds for all but

$O\!\left(\frac{N^2 A T^3}{\varepsilon^3} \ln^2 \frac{NA}{\delta}\right)$

timesteps, where $N$ is the number of states and $A$ the number of actions.

• For the discounted case: $T = \frac{-\log \varepsilon}{1-\gamma}$

Page 37

Limitations of E3 and RMAX

• E3/RMAX is called “efficient” because its sample complexity scales only polynomially in the number of states.

• In natural environments, however, this number of states is enormous: it is exponential in the number of objects in the environment.

• Hence E3/RMAX scales exponentially in the number of objects.

• Generalization over states and actions is crucial for exploration.

Page 38

KWIK

KWIK (Knows What It Knows): a supervised-learning model

• Input set: $X$

• Output set: $Y$

• Observation set: $Z$

• Hypothesis class: $H : X \to Y$

• Target function: $h^* \in H$

• Special symbol: ? (“I don't know”)

Page 39

KWIK

Page 40

KWIK
Example: Coin Learning

• Predict $\Pr(\text{head}) \in [0, 1]$ for a coin
– from many observations: head or tail

• Algorithm
– Predict ? for the first $O(1/\varepsilon^2 \log(1/\delta))$ observations
– Use the empirical estimate afterwards
– The bound follows from Hoeffding's inequality: $O(1/\varepsilon^2 \log(1/\delta))$

Li et al., ICML 2008.

(A short implementation sketch follows below.)
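A minimal sketch (my illustration) of the KWIK coin learner, using the two-sided Hoeffding bound $m \ge \ln(2/\delta)/(2\varepsilon^2)$:

```python
import math

class KWIKCoin:
    """KWIK coin learner: answer '?' until m observations, then the empirical estimate."""
    def __init__(self, eps=0.1, delta=0.05):
        # Hoeffding: after m >= ln(2/delta) / (2 eps^2) flips, |p_hat - p| <= eps w.p. >= 1 - delta
        self.m = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
        self.heads = 0
        self.flips = 0

    def predict(self):
        return None if self.flips < self.m else self.heads / self.flips   # None stands for "?"

    def observe(self, outcome):        # outcome: 1 = head, 0 = tail
        self.heads += outcome
        self.flips += 1
```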

Page 41

KWIK-RMAX

• $T(s'|s,a)$ and $R(s,a)$ are KWIK-learned.

Page 42

Bayesian Model-based RL

Encode the unknown probabilities $T(s, a, s')$ with random variables $\theta$:

• $\theta_{sas'} = \Pr(s' \mid s, a)$: random variable in $[0, 1]$.
• $\theta_{sa} = \Pr(\cdot \mid s, a)$: multinomial distribution,

$\Pr(n_{s_1} = n_1, \cdots, n_{s_k} = n_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}$

Page 43

POMDP Formulation

Bayesian model-based RL → Partially Observable MDP (POMDP)

• State space: $S_P = S \times \Theta$

• $S$: observable MDP state space.

• $\Theta$: all unknown model parameters (unobservable).

Page 44

POMDP Formulation

• Assume an RL agent in an MDP environment $\langle S, A, T, R \rangle$.

• In model-based RL: each unknown transition probability is parameterized by a parameter $\theta^a_{s,s'} \in [0, 1]$.

• Then, its POMDP formulation $\langle S_P, A, T_P, R_P, O, Z \rangle$:

• a new state space $S_P = S \times \{\theta^a_{s,s'}\}$

• observation space $O = S$

• The transition $T_P(s, \theta, a, s', \theta') = \Pr(s', \theta' \mid s, \theta, a)$:
$\Pr(s' \mid s, \theta^a_{s,s'}, a) = \theta^a_{s,s'}$ and $\Pr(\theta' \mid \theta) = \delta_\theta(\theta')$.

• The observation function $Z(s', a, o) = \Pr(o \mid s', a)$:
$\Pr(o \mid s', a) = \delta_{s'}(o)$.

• The reward function $R_P(s, \theta, a, s', \theta') = R(s, a, s')$.

Page 45

POMDP Formulation (2)

• The belief over all unknown parameters $\theta^a_{s,s'}$: $b(\theta) = \Pr(\theta)$. If the prior belief is a product of Dirichlets, the posterior is available in closed form. The belief is then written as

$b(\theta) = \prod_{s,a} D(\theta_{sa}; n_{sa}) \qquad (1)$

where each unknown distribution $\theta_{sa}$ for one pair $(s,a)$ is represented by one Dirichlet $D(\theta_{sa}; n_{sa}) = k \prod_{s'} \theta_{s,a,s'}^{\,n^a_{s,s'} - 1}$, and $n_{sa}$ is the vector of parameters $\{n^a_{s,s'}\}$.

• Then, the closed form of the belief update after observing a transition $(s, a, s')$ is

$b^{s,s'}_a(\theta) = k\, \theta^a_{s,s'} \prod_{s,a} D(\theta_{sa}; n_{sa}) = \prod_{s,a} D(\theta_{sa}; n_{sa} + \delta_{s,a,s'}(s, a, s')) \qquad (2)$

i.e. only the count $n^a_{s,s'}$ of the observed transition is incremented. (A code sketch of this update follows below.)
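A minimal sketch (not from the slides) of this product-of-Dirichlets belief: the conjugate update of Eq. (2) is just an increment of one count, and the posterior mean or a posterior sample of the transition model is available in closed form.

```python
import numpy as np

class DirichletBelief:
    """Belief b(theta) = prod_{s,a} Dirichlet(theta_sa; n_sa) over tabular transition models."""
    def __init__(self, n_states, n_actions, prior=1.0):
        self.n = np.full((n_states, n_actions, n_states), prior)   # Dirichlet counts n(s, a, s')

    def update(self, s, a, s_next):
        """Belief update after observing (s, a, s'): increment a single count (Eq. 2)."""
        self.n[s, a, s_next] += 1

    def mean(self):
        """Posterior mean transition model E[theta], e.g. for certainty-equivalent planning."""
        return self.n / self.n.sum(axis=2, keepdims=True)

    def sample(self, rng=None):
        """Draw one transition model theta ~ b(theta), e.g. for posterior-sampling RL."""
        rng = rng or np.random.default_rng()
        return np.apply_along_axis(rng.dirichlet, 2, self.n)
```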

Page 46

Bayesian Optimality

• The Bellman equations:

$V^\pi(b, s) = R(b, s, \pi(b, s)) + \int_{b', s'} p(b', s' \mid \pi(b, s), b, s)\, V^\pi(b', s')$

– typically, the MDP structure is fixed; belief over the parameters
– belief updated after each observation $(s, a, r, s')$
– only tractable for very simple problems

• Bayes-optimal policy $a = \operatorname{argmax}_a V(b, a)$
– no other policy leads to more rewards in expectation w.r.t. the prior distribution over MDPs
– solves the exploration-exploitation tradeoff implicitly: minimizes uncertainty about the parameters, while exploiting where it is certain
– is not PAC-MDP efficient!

Page 47

Example: BRL for Bandits

(many slides from the Autonomous Systems course, Marc Toussaint, WS 13/14)

Page 48

Bandits recap

• Let $a_t \in \{1, \ldots, n\}$ be the choice of machine at time $t$.
Let $y_t \in \mathbb{R}$ be the outcome with mean $\langle y_{a_t} \rangle$.
A policy or strategy maps all the history to a new choice:
$\pi : [(a_1, y_1), (a_2, y_2), \ldots, (a_{t-1}, y_{t-1})] \mapsto a_t$

• Problem: Find a policy $\pi$ that maximizes
$\max \left\langle \sum_{t=1}^{T} y_t \right\rangle$  or  $\max \langle y_T \rangle$

• “Two effects” of choosing a machine:
– You collect more data about the machine → knowledge
– You collect reward

Page 49

The Belief State

• “Knowledge” can be represented in two ways:
– as the full history
$h_t = [(a_1, y_1), (a_2, y_2), \ldots, (a_{t-1}, y_{t-1})]$
– as the belief
$b_t(\theta) = P(\theta \mid h_t)$
where $\theta$ are the unknown parameters $\theta = (\theta_1, \ldots, \theta_n)$ of all machines

• In the bandit case:
– The belief factorizes: $b_t(\theta) = P(\theta \mid h_t) = \prod_i b_t(\theta_i \mid h_t)$
e.g. for Gaussian bandits with constant noise, $\theta_i = \mu_i$:
$b_t(\mu_i \mid h_t) = \mathcal{N}(\mu_i \mid y_i, s_i)$
e.g. for binary bandits, $\theta_i = p_i$, with prior $\text{Beta}(p_i \mid \alpha, \beta)$:
$b_t(p_i \mid h_t) = \text{Beta}(p_i \mid \alpha + a_{i,t}, \beta + b_{i,t})$
$a_{i,t} = \sum_{s=1}^{t-1} [a_s{=}i][y_s{=}0], \qquad b_{i,t} = \sum_{s=1}^{t-1} [a_s{=}i][y_s{=}1]$

(A code sketch of the binary-bandit belief follows below.)
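A minimal sketch (my illustration) of the Beta belief for binary bandits, keeping the slide's counting convention in which $a_{i,t}$ counts outcomes $y=0$ and $b_{i,t}$ counts outcomes $y=1$ for arm $i$:

```python
import numpy as np

class BinaryBanditBelief:
    """Per-arm belief b_t(p_i) = Beta(p_i | alpha + a_it, beta + b_it)."""
    def __init__(self, n_arms, alpha=1.0, beta=1.0):
        self.alpha = alpha                 # prior parameters of Beta(p_i | alpha, beta)
        self.beta = beta
        self.a = np.zeros(n_arms)          # a_it: number of y = 0 outcomes of arm i (slide's convention)
        self.b = np.zeros(n_arms)          # b_it: number of y = 1 outcomes of arm i

    def observe(self, arm, y):
        """Update the belief of the played arm with the binary outcome y."""
        if y == 0:
            self.a[arm] += 1
        else:
            self.b[arm] += 1

    def posterior(self, arm):
        """Current Beta parameters of the belief over p_arm."""
        return self.alpha + self.a[arm], self.beta + self.b[arm]
```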

Page 50

The Belief MDP

$P(b' \mid y, a, b) = \begin{cases} 1 & \text{if } b' = b'_{[b,a,y]} \\ 0 & \text{otherwise} \end{cases}, \qquad P(y \mid a, b) = \int_{\theta_a} b(\theta_a)\, P(y \mid \theta_a)$

• The Belief MDP describes a different process: the interaction between the information available to the agent ($b_t$ or $h_t$) and its actions, where the agent uses its current belief to anticipate observations, $P(y \mid a, b)$.

• The belief (or history $h_t$) is all the information the agent has available; $P(y \mid a, b)$ is the “best” possible anticipation of observations. If the agent acts optimally in the Belief MDP, it acts optimally in the original problem.

Page 51

Example: POMDP Reduction to Belief MDP

An example of binary $n$-armed bandits

• State space $S_b = \{\forall b \mid T\}$.

• Transition function
$T(b, a, b') = \sum_r p(b' \mid b, a, r)\, p(r \mid b, a) = \delta_{b' = b^{(a, r=1)}}\, p(r \mid b, a) + \delta_{b' = b^{(a, r=0)}}\, (1 - p(r \mid b, a))$
where $p(r \mid b, a) = \int p(r \mid \mu_a, a)\, b(\mu_a)\, d\mu_a = \alpha^a_b / (\alpha^a_b + \beta^a_b)$

• Similarly, what is the reward function $R(b, a)$?

• A well-defined belief MDP formulation: $\{S_b, A, T, R\}$ (finite belief state space, transition function, Bellman equations)
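One natural answer to the question above (my completion, not stated on the slide): with binary rewards, the belief-MDP reward is the expected reward under the current belief, which for the Beta belief used here is

$R(b, a) = \sum_r r\, p(r \mid b, a) = p(r{=}1 \mid b, a) = \frac{\alpha^a_b}{\alpha^a_b + \beta^a_b}.$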

Page 52

Conclusions

• RL agents need to solve the exploration-exploitation tradeoff.

• Sample complexity measures the required number of explorative actions of an algorithm.

• Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain

Page 53

References

• Kakade (2003): On the sample complexity of reinforcement learning. PhD thesis.

• Poupart, Vlassis, Hoey, Regan: An analytic solution to discrete Bayesian reinforcement learning. ICML 2006.

• Li, Littman, Walsh: Knows What It Knows: a framework for self-aware learning. ICML 2008.