An introduction to reinforcement learning

Having taken a quick look at several overviews of reinforcement learning, I wrote this note to summarize the key concepts and points that helped me understand reinforcement learning.

Introduction

Terminology

  • Environment
  • State
    • agent state
      the state the agent can observe and use to choose its actions.
    • environment state
      the environment's full internal information, which is generally not fully visible to the agent.
  • Action
  • Reward
  • Policy
    A policy defines the learning agent's way of behaving at a given time: it is the agent's behavior function, a map from state to action. (See the sketch after this list.)
    • Deterministic policy: $a = \pi(s)$, where $a$ is an action and $s$ is a state.
    • Stochastic policy: $\pi(a|s) = \mathbb{P}[A=a \mid S=s]$
  • Value function
    A value function is a prediction of future reward. It is used to evaluate how good each state and/or action is, and therefore to select between actions.
  • Model
    The agent's representation of the environment: it describes how the agent perceives the environment and predicts what the environment will do next. Not every method needs one (model-free methods do without it).
    • Transition model: $\mathcal{P}$ predicts the next state (i.e. the dynamics)
      $\mathcal{P}^{a}_{ss'} = \mathbb{P}[S'=s' \mid S=s, A=a]$
    • Reward model: $\mathcal{R}$ predicts the next (immediate) reward, e.g.
      $\mathcal{R}^a_s = \mathbb{E}[R \mid S=s, A=a]$
  • Model-free
  • On-policy / off-policy
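To make the policy definitions above concrete, here is a minimal sketch of a deterministic and a stochastic policy as plain Python functions; the toy state space, actions, and probabilities are invented purely for illustration.

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a = pi(s): a given state always maps to the same action."""
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    """pi(a|s) = P[A=a | S=s]: sample an action from a state-dependent distribution."""
    p_right = 0.8 if state >= 0 else 0.3   # made-up probabilities
    return "right" if random.random() < p_right else "left"
```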

Goal

To maximize the expected cumulative reward.
(Reward hypothesis: all goals can be described by the maximisation of expected cumulative reward.)

Classification

  • two types of tasks
    • episodic
    • continuing (no terminal state), e.g. automated stock trading.
  • two ways of learning (sampling methods)
    • Monte Carlo (update only when an episode ends)
    • TD learning (update after every state transition)
  • two approaches (whether MDP is known)
    • Model Free
      Policy and/or Value Function. No model.
      We don't try to explicitly understand the environment, and we don't need to build a model that describes it.
      • value-based
        In value-based RL, the goal is to optimize the value function $V(s)$. The value function tells us the maximum expected future reward the agent will get at each state: the value of a state is the total reward an agent can expect to accumulate over the future, starting from that state. The agent uses this value function to select which state to move to at each step, choosing the state with the highest value.
        The policy is therefore implicit: no explicit policy is needed to decide which action to choose (e.g. act greedily with respect to the values).
      • policy-based
        The policy $\pi(s)$ is what defines the agent's behavior at a given time. No value function.
        • Deterministic: a policy at a given state will always return the same action.
        • Stochastic: outputs a probability distribution over actions.
      • Actor Critic
        It stores both the policy and the value function at the same time.
    • Model Based
      Policy and/or Value Function. Model.
      First build the model, then act (plan) based on it.

Markov Decision Process

Markov Process

Markov Process (or Markov Chain) is a tuple $\langle\mathcal{S}, \mathcal{P}\rangle$.

  • $\mathcal{S}$ is a (finite) set of states.
  • $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}_{ss'}=\mathbb{P}[S_{t+1}=s' \mid S_t=s]$

Markov Reward Process

Markov Reward Process is a tuple $\langle\mathcal{S},\mathcal{P},\mathcal{R},\gamma\rangle$.

  • $\mathcal{R}$ is a reward function: $\mathcal{R}_s=\mathbb{E}[R_{t+1} \mid S_t=s]$ (immediate reward)

  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

  • The return $G_t$ is the total discounted reward from time-step $t$:
    $G_t = R_{t+1}+\gamma R_{t+2}+\dots = \sum_{k=0}^\infty\gamma^k R_{t+k+1}$
    There is no expectation here because $G_t$ is defined for a single sampled sequence.

  • The value function $v(s)$ gives the long-term value of state $s$: it is the expected return starting from state $s$.
    $v(s) = \mathbb{E}[G_t \mid S_t=s]$

  • Bellman Equation for MRPs
    The value function can be decomposed into two parts:

    • immediate reward $R_{t+1}$
    • discounted value of the successor state $\gamma v(S_{t+1})$

    $v(s) = \mathbb{E}[G_t \mid S_t=s] = \mathbb{E}[R_{t+1}+\gamma v(S_{t+1}) \mid S_t=s] = \mathcal{R}_s+\gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}v(s')$
    In matrix form: $v = \mathcal{R} + \gamma \mathcal{P}v$
    It can be solved directly for small MRPs: $v = (I-\gamma\mathcal{P})^{-1}\mathcal{R}$
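As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch for a tiny two-state MRP; the states, transition probabilities, and rewards are made up purely for the example.

```python
import numpy as np

# A tiny, made-up MRP with two states, used only to illustrate
# v = (I - gamma * P)^{-1} R for small state spaces.
P = np.array([[0.9, 0.1],   # transition probabilities P_ss'
              [0.5, 0.5]])
R = np.array([1.0, -2.0])   # expected immediate reward R_s
gamma = 0.9                 # discount factor

# Direct solution of the Bellman equation v = R + gamma * P v
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # long-term value of each state
```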

Markov Decision Process

Markov Decision Process is a tuple $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix
    $\mathcal{P}^a_{ss'}=\mathbb{P}[S_{t+1}=s' \mid S_t=s, A_t=a]$
  • $\mathcal{R}$ is a reward function
    $\mathcal{R}^a_s=\mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]$
  • A policy $\pi$ is a distribution over actions given states. It fully defines the behavior of an agent.
    $\pi(a|s) = \mathbb{P}[A_t=a \mid S_t=s]$
  • The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$
    $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]$
    Greedy policy improvement over $V(s)$ requires a model of the MDP.
  • The action-value function $q_\pi(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
    $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a]$
    Greedy policy improvement over $Q(s,a)$ is model-free.
  • Bellman Expectation Equation (the derivation is very similar to the MRP case)
    $v_\pi(s) = \mathbb{E}_\pi[R_{t+1}+\gamma v_\pi(S_{t+1}) \mid S_t=s]$
    $v_\pi = \mathcal{R}^\pi+\gamma\mathcal{P}^\pi v_\pi$
    $q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1}+\gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a]$
  • Optimal Value Function
    The optimal state-value function $v_*(s)$ is the maximum value function over all policies
    $v_*(s) = \max_\pi v_\pi(s)$
    The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies
    $q_*(s,a)=\max_\pi q_\pi(s,a)$

Dynamic Programming

Dynamic programming assumes full knowledge of the MDP.
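The section does not spell out a specific DP algorithm, but value iteration is a standard example of planning with full knowledge of the MDP. Here is a minimal tabular sketch; the 2-state, 2-action MDP and its numbers are invented for illustration.

```python
import numpy as np

# Value iteration on a tiny, made-up MDP (2 states, 2 actions).
# P[a, s, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v(s')
    q = R + gamma * (P @ v).T          # (P @ v) has shape (A, S); transpose to (S, A)
    v_new = q.max(axis=1)              # v(s) = max_a q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        v = v_new
        break
    v = v_new

print("optimal state values:", v)
print("greedy policy (action index per state):", q.argmax(axis=1))
```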

Sampling Methods for Model-Free RL (solutions for small MDPs)

To estimate and optimise the value function of an unknown MDP.

Monte Carlo

Monte Carlo policy evaluation uses the empirical mean return in place of the expected return $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]$, computed from complete episodes that reach a terminal state.

  • Update value $V(S_t)$ toward the actual return $G_t$.
    $V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t))$
  • GLIE Monte-Carlo Control (on-policy) updates the action-value function through sampling (see the sketch after this list).
    • Sample the $k$th episode using $\pi$: $\{S_1,A_1,R_2,...,S_T\}\sim\pi$
    • For each state $S_t$ and action $A_t$ in the episode,
      $N(S_t,A_t) \gets N(S_t,A_t)+1$
      $Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
    • Improve the policy based on the new action-value function
      $\epsilon \gets 1/k$
      $\pi \gets \epsilon\text{-greedy}(Q)$
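A minimal tabular sketch of GLIE Monte-Carlo control. The episodic environment interface (`env.reset() -> state`, `env.step(action) -> (next_state, reward, done)`) is an assumption for illustration, not a specific library API.

```python
import random
from collections import defaultdict

def glie_mc_control(env, n_actions, num_episodes=10_000, gamma=1.0):
    """GLIE Monte-Carlo control sketch (assumed simple episodic env interface)."""
    Q = defaultdict(float)   # Q[(s, a)]
    N = defaultdict(int)     # visit counts N[(s, a)]

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k        # GLIE: epsilon decays toward zero

        # Sample one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit MC update: move Q(S_t, A_t) toward the return G_t.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```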

Temporal-Difference Learning (TD)

TD learning learns from incomplete episodes by bootstrapping (update involves an estimate).

  • (TD/TD(0)) Update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$.
    $V(S_t) \gets V(S_t) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_t))$
    SARSA (On-Policy)
    Use the TD target instead of the MC return in GLIE Monte-Carlo Control (a tabular sketch appears at the end of this section).
    $Q(S, A) \gets Q(S, A)+\alpha(R+\gamma Q(S', A') - Q(S, A))$
    Convergence conditions:
    • GLIE sequence of policies $\pi_t(a|s)$
    • Robbins-Monro sequence of step-sizes $\alpha_t$:
      $\sum^\infty_{t=1}\alpha_t=\infty$
      $\sum^\infty_{t=1}\alpha_t^2<\infty$
  • n-step temporal-difference learning
    $V(S_t) \gets V(S_t) + \alpha (G^{(n)}_t - V(S_t))$, where $G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n})$
    When $n=\infty$, this reduces to the MC method.
    n-step SARSA
    $Q(S_t, A_t)\gets Q(S_t, A_t)+\alpha(q^{(n)}_t - Q(S_t, A_t))$, where $q_t^{(n)}=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^n Q(S_{t+n}, A_{t+n})$
  • Forward-view TD($\lambda$)
    $V(S_t) \gets V(S_t) + \alpha (G^\lambda_t - V(S_t))$, where $G^\lambda_t = (1-\lambda)\sum^\infty_{n=1}\lambda^{n-1}G_t^{(n)}$
    Like MC, it can only be computed from complete episodes.
    Forward-view SARSA($\lambda$) combines all n-step Q-returns $q_t^{(n)}$:
    $Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha (q^\lambda_t - Q(S_t,A_t))$, where $q^\lambda_t = (1-\lambda)\sum^\infty_{n=1}\lambda^{n-1}q_t^{(n)}$
  • Backward-view TD($\lambda$)
    $V(s) \gets V(s)+\alpha \delta_t E_t(s)$, where $\delta_t = R_{t+1}+\gamma V(S_{t+1}) - V(S_t)$, $E_0(s)=0$, and $E_t(s)=\gamma\lambda E_{t-1}(s)+\mathbf{1}(S_t=s)$

TD is biased but has lower variance than MC. It is usually more efficient, but more sensitive to the initial value estimates.
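A minimal tabular SARSA sketch, using the same assumed episodic environment interface as the Monte-Carlo sketch above (not a specific library API):

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, num_episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control (SARSA) sketch; assumed interface:
    env.reset() -> state, env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)

    def eps_greedy(state):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = eps_greedy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)
            # TD target uses the action actually taken next (on-policy).
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```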

Off-Policy Learning

Evaluate the target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, while following the behaviour policy $\mu(a|s)$: $\{S_1, A_1, R_2, ..., S_T \} \sim \mu$

  • Importance sampling
  • Q-Learning
    $Q(S, A) \gets Q(S, A) + \alpha(R + \gamma\max_{a'}Q(S', a') - Q(S, A))$
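A minimal tabular Q-learning sketch (off-policy: the target uses the greedy next action, while behaviour stays $\epsilon$-greedy), with the same assumed environment interface as above:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, num_episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy TD control (Q-learning) sketch; same assumed env interface."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy (max over next actions).
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```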

Value Function Approximation (solution for large MDPs)

Basic Idea

There are too many states and/or actions to store in memory. It is too slow to learn the value of each state individually.
So we estimate the value function with function approximation:

  • $\hat{v}(s,w) \approx v_\pi(s)$
  • $\hat{q}(s,a,w) \approx q_\pi(s,a)$
  • $\hat{q}(s,a_1,w), \dots, \hat{q}(s,a_m,w) \approx q_\pi(s,a_1), \dots, q_\pi(s,a_m)$

Options for (differentiable) function approximators:

  • Linear combinations of features
  • Neural network

Incremental Methods

Stochastic Gradient Descent

The goal is to find the parameter vector $w$ minimising the mean-squared error between the approximate value function $\hat{v}(s,w)$ and the true value function $v_\pi(s)$: $J(w) = \mathbb{E}_\pi[(v_\pi(S)-\hat{v}(S,w))^2]$

  • Gradient descent finds a local minimum
    $\Delta w=-\frac{1}{2}\alpha\nabla_w J(w)=\alpha\mathbb{E}_\pi[(v_\pi(S)-\hat{v}(S,w))\nabla_w\hat{v}(S,w)]$
  • Stochastic gradient descent samples the gradient
    $\Delta w=\alpha(v_\pi(S)-\hat{v}(S,w))\nabla_w\hat{v}(S,w)$
  • The expected update is equal to the full gradient update.

Incremental Prediction Algorithms

In practice, we substitute a target for $v_\pi(s)$:

  • For MC, the target is the return $G_t$
    $\Delta w = \alpha(G_t - \hat{v}(S_t, w))\nabla_w\hat{v}(S_t, w)$
  • For TD(0), the target is the TD target $R_{t+1} + \gamma\hat{v}(S_{t+1}, w)$
    $\Delta w = \alpha(R_{t+1} + \gamma\hat{v}(S_{t+1}, w) - \hat{v}(S_t, w))\nabla_w\hat{v}(S_t, w)$
  • For TD($\lambda$), the target is the $\lambda$-return $G^\lambda_t$
    $\Delta w = \alpha(G^\lambda_t - \hat{v}(S_t, w))\nabla_w\hat{v}(S_t, w)$

For action-value function approximation, the derivation is similar.
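A minimal sketch of semi-gradient TD(0) prediction with a linear function approximator, following the TD(0) update above. The feature map `features(state)`, the `policy(state)` callable, and the environment interface are assumptions for illustration.

```python
import numpy as np

def linear_td0_prediction(env, policy, features, n_features,
                          num_episodes=1000, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with v_hat(s, w) = w . features(s).

    Assumed interfaces (illustration only):
      policy(state) -> action
      features(state) -> np.ndarray of shape (n_features,)
      env.reset() -> state, env.step(action) -> (next_state, reward, done)
    """
    w = np.zeros(n_features)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            x, x_next = features(state), features(next_state)
            v, v_next = w @ x, (0.0 if done else w @ x_next)
            # TD(0) target R + gamma * v_hat(S', w); gradient of a linear v_hat is x.
            w += alpha * (reward + gamma * v_next - v) * x
            state = next_state
    return w
```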

Batch Methods

Least Squares Prediction

Given a value function approximation $\hat{v}(s, w) \approx v_\pi(s)$ and experience $\mathcal{D}$ consisting of $\langle state, value\rangle$ pairs $\mathcal{D} = \{\langle s_1, v^\pi_1\rangle,\langle s_2, v^\pi_2\rangle, ..., \langle s_T , v^\pi_T\rangle\}$, which parameters $w$ give the best fitting value function $\hat{v}(s, w)$?
Least squares algorithms find the parameter vector $w$ minimising the sum-squared error between $\hat{v}(s_t, w)$ and the target values $v^\pi_t$:
$LS(w) = \sum^T_{t=1}(v^\pi_t - \hat{v}(s_t, w))^2 = \mathbb{E}_\mathcal{D}[(v^\pi - \hat{v}(s, w))^2]$

  • Stochastic Gradient Descent with Experience Replay
    Sample a value from experience, $\langle s,v^\pi\rangle\sim\mathcal{D}$, then apply a stochastic gradient descent update. This converges to the least squares solution $w^\pi=\arg\min_w LS(w)$.
    $\Delta w = \alpha(v^\pi - \hat{v}(s, w))\nabla_w\hat{v}(s, w)$
  • Deep Q-Networks (DQN) with Experience Replay
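A minimal experience-replay buffer sketch (the class name and default capacity are illustrative choices, not from the source): transitions are stored as they happen and random minibatches are drawn for SGD updates, which decorrelates consecutive samples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions for experience replay (illustrative sketch)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch: breaks correlation between consecutive experience.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```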

Least Squares Control

Policy Gradient

Score function: $\nabla_\theta\log\pi_\theta(s,a)$
For a policy $\pi_\theta$, the gradient $\nabla_\theta\pi_\theta(s,a)$ can be rewritten with the likelihood-ratio trick:
$\nabla_\theta\pi_\theta(s,a) = \pi_\theta(s,a)\frac{\nabla_\theta\pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a)\nabla_\theta\log\pi_\theta(s,a)$
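As a concrete example of the score function, here is the analytic score for a linear softmax policy; the feature map and this specific parameterisation are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, x_sa):
    """pi_theta(a|s) proportional to exp(theta . x(s,a)).
    x_sa: array of shape (n_actions, n_features), one feature vector per action."""
    logits = x_sa @ theta
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def score_function(theta, x_sa, a):
    """grad_theta log pi_theta(a|s) for the linear softmax policy:
    x(s,a) - E_{b ~ pi}[x(s,b)]."""
    probs = softmax_policy(theta, x_sa)
    return x_sa[a] - probs @ x_sa
```

In REINFORCE (Monte-Carlo policy gradient), the parameter update weights this score by the sampled return: $\Delta\theta = \alpha\,\nabla_\theta\log\pi_\theta(s,a)\,G_t$.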

Monte-Carlo methods

Actor-Critic methods

Maintain two sets of parameters

  • Critic: updates the action-value function parameters $w$
  • Actor: updates the policy parameters $\theta$, in the direction suggested by the critic

(How to accelerate and stabilise the policy-gradient updates.)
Reducing Variance Using a Baseline
Subtract a baseline function $B(s)$ from the policy gradient. This reduces variance without changing the expectation.

  • $\mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(s, a)B(s)] = \sum_{s\in\mathcal{S}}d^{\pi_\theta}(s)\sum_a\nabla_\theta\pi_\theta(s, a)B(s) = \sum_{s\in\mathcal{S}}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in\mathcal{A}}\pi_\theta(s, a) = 0$

A good baseline is the state value function $B(s)=V^{\pi_\theta}(s)$.
So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)$:

  • $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$
  • $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(s, a)A^{\pi_\theta}(s,a)]$
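A minimal one-step actor-critic sketch that uses the TD error $\delta = R + \gamma V(S') - V(S)$ as a sample of the advantage (a standard choice). The linear parameterisations, feature maps, and environment interface are illustrative assumptions.

```python
import numpy as np

def one_step_actor_critic(env, features_s, features_sa, n_actions, n_features,
                          num_episodes=1000, alpha_w=0.01, alpha_theta=0.001, gamma=0.99):
    """One-step actor-critic sketch: linear critic V(s) = w . x(s), linear softmax actor.
    Assumed interfaces: features_s(s) -> (n_features,), features_sa(s) -> (n_actions, n_features),
    env.reset() -> state, env.step(action) -> (next_state, reward, done)."""
    w = np.zeros(n_features)          # critic parameters
    theta = np.zeros(n_features)      # actor parameters

    def policy_probs(s):
        logits = features_sa(s) @ theta
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            probs = policy_probs(state)
            action = np.random.choice(n_actions, p=probs)
            next_state, reward, done = env.step(action)

            x, x_next = features_s(state), features_s(next_state)
            v_next = 0.0 if done else w @ x_next
            delta = reward + gamma * v_next - w @ x      # TD error ~ advantage sample

            w += alpha_w * delta * x                     # critic: semi-gradient TD(0)
            score = features_sa(state)[action] - probs @ features_sa(state)
            theta += alpha_theta * delta * score         # actor: policy-gradient step
            state = next_state
    return theta, w
```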

Eligibility Traces (can be used online)

The deterministic policy gradient theorem can also help improve performance.

Model-Based Reinforcement Learning

Model Definition

A model $\mathcal{M}$ is a representation of an MDP $\langle\mathcal{S}, \mathcal{A},\mathcal{P}, \mathcal{R}\rangle$, parametrized by $\eta$. Here we assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known.
So a model $\mathcal{M} = \langle\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:
$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$
$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
We typically assume conditional independence between state transitions and rewards:
$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$
Goal: estimate the model $\mathcal{M}_\eta$ from experience $\{S_1,A_1,R_2,...,S_T\}$.
Given a model $\mathcal{M} = \langle\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$, we then solve the MDP $\langle\mathcal{S}, \mathcal{A},\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$.
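A minimal table-lookup model sketch: count transitions and average rewards from real experience to estimate $\mathcal{P}_\eta$ and $\mathcal{R}_\eta$. The class name and interface are illustrative choices, not from the source.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Estimate P_eta and R_eta by counting (illustrative sketch)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> total reward seen
        self.visits = defaultdict(int)                       # (s, a) -> N(s, a)

    def update(self, s, a, r, s_next):
        # Learn from one step of real experience.
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def sample(self, s, a):
        # Simulated experience: sample s' ~ P_eta(.|s,a), return the mean reward R_eta(s,a).
        n = self.visits[(s, a)]
        next_states = list(self.counts[(s, a)].keys())
        probs = [c / n for c in self.counts[(s, a)].values()]
        s_next = random.choices(next_states, weights=probs)[0]
        r = self.reward_sum[(s, a)] / n
        return s_next, r
```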

Integrating Learning and Planning

  • Model-Free RL
    • No model
    • Learn value function (and/or policy) from real experience
  • Model-Based RL (using Sample-Based Planning)
    • Learn a model from real experience
    • Plan value function (and/or policy) from simulated experience
  • Dyna
    • Learn a model from real experience
    • Learn and plan value function (and/or policy) from real and simulated experience
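A minimal Dyna-Q sketch of the Dyna idea above: each real step updates both Q and a table model, followed by a few planning updates from simulated experience. The `n_planning` parameter, the deterministic table model, and the environment interface are illustrative assumptions.

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, num_episodes=500, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    """Dyna-Q sketch: direct RL from real experience, model learning, and planning
    from simulated experience. Assumed env interface as in earlier sketches."""
    Q = defaultdict(float)
    model = {}                       # (s, a) -> (reward, next_state, done)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    def q_update(s, a, r, s2, done):
        best = 0.0 if done else max(Q[(s2, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = eps_greedy(state)
            next_state, reward, done = env.step(action)
            q_update(state, action, reward, next_state, done)      # direct RL
            model[(state, action)] = (reward, next_state, done)    # model learning
            for _ in range(n_planning):                            # planning
                (s, a), (r, s2, d) = random.choice(list(model.items()))
                q_update(s, a, r, s2, d)
            state = next_state
    return Q
```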

Simulation-Based Search

The Dyna idea can be combined with simulation-based search, i.e. planning from simulated experience sampled from the model starting at the current state.

References:
  • https://www.freecodecamp.org/news/an-introduction-to-reinforcement-learning-4339519de419/
  • Online RL course lectures by David Silver
