(3) Deep Reinforcement Learning Basics: Policy Learning

Policy-Based Reinforcement Learning

We will use a neural network to approximate the policy function. This neural network is called the policy network, and it can be used to control the agent's actions.

1. Policy Function Approximation

1.1 Policy Function $\pi(a\mid s)$

• Policy function $\pi(a\mid s)$ is a probability density function (PDF).

• It takes state $s$ as input.

• It outputs the probabilities of all the actions, e.g.,

$$\pi(\text{left}\mid s)=0.2,\qquad \pi(\text{right}\mid s)=0.1,\qquad \pi(\text{up}\mid s)=0.7.$$

• The agent performs an action $a$ randomly drawn from this distribution.

Can we directly learn a policy function $\pi(a\mid s)$?

• If there are only a few states and actions, then yes, we can.

• Draw a table (matrix) and learn the entries.

• What if there are too many (or infinitely many) states or actions?

1.2 Policy Network $\pi(a\mid s;\theta)$

Policy network: Use a neural net to approximate $\pi(a\mid s)$.

• Use the policy network $\pi(a\mid s;\theta)$ to approximate $\pi(a\mid s)$.

$\theta$: trainable parameters of the neural net.

$$\sum_{a\in A}\pi(a\mid s;\theta)=1.$$

• Here, $A=\{\text{“left”},\text{“right”},\text{“up”}\}$ is the set of all actions.

• That is why we use a softmax activation in the output layer (see the sketch below).
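To make this concrete, here is a minimal sketch of such a policy network in PyTorch. The state dimension, hidden width, and the three-action output are assumptions chosen for illustration, not something specified in these notes.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state s to the probability distribution pi(. | s; theta) over actions."""

    def __init__(self, state_dim=4, hidden_dim=64, num_actions=3):  # 3 actions: "left", "right", "up"
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
            nn.Softmax(dim=-1),  # softmax makes the outputs sum to 1 over all actions
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, num_actions); each row sums to 1

# Usage: get action probabilities for one state and sample an action from them.
policy = PolicyNetwork()
s = torch.randn(1, 4)                  # a dummy state
probs = policy(s)                      # e.g. something like [[0.2, 0.1, 0.7]] after training
action = torch.multinomial(probs, 1)   # action a randomly drawn from pi(. | s; theta)
```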

2. State-Value Function Approximation

2.1 Review: Action-Value Function

The discounted return is $U_{t}=R_{t}+\gamma R_{t+1}+\gamma^{2}R_{t+2}+\cdots$. The action-value function
$$Q_{\pi}(s_{t},a_{t})=E\big[U_{t}\mid S_{t}=s_{t},A_{t}=a_{t}\big]$$
is the expected return given the current state $s_{t}$ and action $a_{t}$, with the expectation taken over the future states and actions under policy $\pi$. It is used next to define the state-value function.

2.2 State-Value Function

$$V_{\pi}(s_{t})=E_{A}\big[Q_{\pi}(s_{t},A)\big]=\sum_{a}\pi(a\mid s_{t})\cdot Q_{\pi}(s_{t},a).$$
Integrate out the action $A\sim\pi(\cdot\mid s_{t})$.

$V_{\pi}(s_{t})$ evaluates the current state: the larger $V_{\pi}(s_{t})$, the better the chance of winning. Given the state $s_{t}$, $V_{\pi}(s_{t})$ also evaluates the policy $\pi$: the better the policy $\pi$, the larger $V_{\pi}(s_{t})$ and the better the current chance of winning.

Definition: State-value function.
$$V_{\pi}(s_{t})=E_{A}\big[Q_{\pi}(s_{t},A)\big]=\sum_{a}\pi(a\mid s_{t})\cdot Q_{\pi}(s_{t},a).$$

Approximate state-value function.

• Approximate the policy function $\pi(a\mid s_{t})$ by the policy network $\pi(a\mid s_{t};\theta)$.

• Approximate the state-value function $V_{\pi}(s_{t})$ by:

$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$

Definition: Approximate state-value function.
$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$
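Given the output of a policy network and the action values, the approximate state value is just a probability-weighted sum. A tiny sketch with made-up placeholder numbers (the probabilities reuse the example from Section 1.1):

```python
import torch

# pi(a | s_t; theta) for all actions, as produced by the policy network (placeholder numbers)
probs = torch.tensor([0.2, 0.1, 0.7])
# Q_pi(s_t, a) for all actions (placeholder numbers)
q_values = torch.tensor([1.0, -0.5, 2.0])

# V(s_t; theta) = sum_a pi(a | s_t; theta) * Q_pi(s_t, a)
v_approx = torch.dot(probs, q_values)
print(v_approx.item())  # 0.2*1.0 + 0.1*(-0.5) + 0.7*2.0 = 1.55
```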

Policy-based learning: Learn $\theta$ that maximizes $J(\theta)=E_{S}[V(S;\theta)]$.

The better the policy network, the larger $J(\theta)$.

How to improve $\theta$? Policy gradient ascent!

• Observe state $s$.

• Update the policy by: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$ (see the sketch below).

Policy gradient: $\frac{\partial V(s;\theta)}{\partial\theta}$.
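A minimal sketch of one such gradient-ascent step, reusing the PolicyNetwork sketch from Section 1.2 and assuming the action values $Q_{\pi}(s,a)$ are available as a constant vector q_values (in practice they must be approximated):

```python
import torch

def gradient_ascent_step(policy, s, q_values, beta=1e-3):
    """One update theta <- theta + beta * dV(s; theta)/dtheta,
    where V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a)
    and q_values is a constant stand-in for Q_pi(s, .)."""
    v = torch.dot(policy(s).squeeze(0), q_values)              # V(s; theta)
    grads = torch.autograd.grad(v, list(policy.parameters()))  # dV(s; theta)/dtheta
    with torch.no_grad():
        for p, g in zip(policy.parameters(), grads):
            p.add_(beta * g)                                   # ascent: move along the gradient

# Usage with the Section 1.2 sketch:
# gradient_ascent_step(policy, torch.randn(1, 4), torch.tensor([1.0, -0.5, 2.0]))
```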

3. Policy Gradient

Definition: Approximate state-value function.
$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$

Policy gradient: the derivative of $V(s;\theta)$ w.r.t. $\theta$.
$$\frac{\partial V(s;\theta)}{\partial\theta}=\frac{\partial\sum_{a}\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)}{\partial\theta}.$$

Push derivative inside the summation.

$$=\sum_{a}\frac{\partial\,\big[\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)\big]}{\partial\theta}.$$

Pretend $Q_{\pi}$ is independent of $\theta$. (This may not be true.)

$$=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a)\qquad\text{(Policy Gradient: Form 1)}$$

In practice, however, this exact policy gradient is not computed; instead, a Monte Carlo approximation of the policy gradient is used.


Two forms of policy gradient:

• Form 1: $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).$

• Form 2: $\frac{\partial V(s;\theta)}{\partial\theta}=E_{A\sim\pi(\cdot\mid s;\theta)}\Big[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\Big].$

Form 2 follows from Form 1 via the identity $\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}=\pi(a\mid s;\theta)\cdot\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}$, which turns the sum over actions into an expectation over $A\sim\pi(\cdot\mid s;\theta)$.

3.1 Calculate Policy Gradient for Discrete Actions

If the actions are discrete, e.g., action space $A=\{\text{“left”},\text{“right”},\text{“up”}\}$, …

Use Form 1: $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).$

  1. Calculate $f(a,\theta)=\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a)$ for every action $a\in A$.
  2. Policy gradient: $\frac{\partial V(s;\theta)}{\partial\theta}=f(\text{“left”},\theta)+f(\text{“right”},\theta)+f(\text{“up”},\theta)$.

If $A$ is big, this approach is costly (see the code sketch below).
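A sketch of the two steps above with PyTorch autograd, looping over every action literally as Form 1 prescribes. It assumes the PolicyNetwork from Section 1.2 and a placeholder q_values vector standing in for $Q_{\pi}(s,\cdot)$; the per-action loop is exactly what becomes costly when $A$ is large.

```python
import torch

def policy_gradient_form1(policy, s, q_values):
    """Form 1: dV(s;theta)/dtheta = sum_a dpi(a|s;theta)/dtheta * Q_pi(s,a).
    Returns one gradient tensor per parameter of the policy network;
    q_values is a constant stand-in for Q_pi(s, .)."""
    params = list(policy.parameters())
    grad_sum = [torch.zeros_like(p) for p in params]
    for a in range(len(q_values)):                    # step 1: f(a, theta) for every action a
        prob_a = policy(s).squeeze(0)[a]              # pi(a | s; theta)
        grads = torch.autograd.grad(prob_a, params)   # dpi(a | s; theta) / dtheta
        for acc, g in zip(grad_sum, grads):
            acc += g * q_values[a]                    # f(a, theta) = dpi/dtheta * Q_pi(s, a)
    return grad_sum                                   # step 2: sum of f(a, theta) over all actions
```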

3.2 Calculate Policy Gradient for Continuous Actions

If the actions are continuous, e.g., action space $A=[0,1]$, …

Use Form 2: $\frac{\partial V(s;\theta)}{\partial\theta}=E_{A\sim\pi(\cdot\mid s;\theta)}\Big[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\Big].$

  1. Randomly sample an action $\hat{a}$ according to the PDF $\pi(\cdot\mid s;\theta)$.
  2. Calculate $g(\hat{a},\theta)=\frac{\partial\log\pi(\hat{a}\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,\hat{a})$.
  3. Use $g(\hat{a},\theta)$ as an approximation to the policy gradient $\frac{\partial V(s;\theta)}{\partial\theta}$.
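A sketch of these three steps for a one-dimensional continuous action. It assumes, purely as an example, that the policy is a Gaussian whose mean comes from a small network mu_net and whose log standard deviation log_std is a trainable scalar, and that q_of(s, a) is some available approximation of $Q_{\pi}(s,a)$.

```python
import torch

def policy_gradient_form2(mu_net, log_std, s, q_of):
    """Monte Carlo approximation of Form 2 for a 1-D Gaussian policy.
    mu_net maps a state to the mean of pi(. | s; theta); log_std is a trainable scalar.
    q_of(s, a) approximates Q_pi(s, a) and is treated as a constant."""
    mu = mu_net(s).squeeze()
    dist = torch.distributions.Normal(mu, log_std.exp())
    a_hat = dist.sample()                            # step 1: a_hat ~ pi(. | s; theta)
    log_prob = dist.log_prob(a_hat)                  # log pi(a_hat | s; theta)
    params = list(mu_net.parameters()) + [log_std]
    grads = torch.autograd.grad(log_prob, params)    # step 2: dlog pi(a_hat|s;theta)/dtheta ...
    q = float(q_of(s, a_hat))                        # ... times Q_pi(s, a_hat), held constant
    return [g * q for g in grads]                    # step 3: g(a_hat, theta) approximates dV/dtheta
```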

3.3 Update Policy Network Using Policy Gradient

The REINFORCE algorithm needs to play out an entire game (episode) and observe all the rewards before updating the network.
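A minimal sketch of that idea, assuming an old-style OpenAI Gym environment API (reset() returns the observation, step() returns a 4-tuple) and a discrete policy network like the one in Section 1.2; the observed discounted return $u_t$ is used in place of $Q_{\pi}(s_t,a_t)$, and the discount factor and optimizer are placeholders.

```python
import torch

def reinforce_episode(policy, env, optimizer, gamma=0.99):
    """Play one full episode, then update the policy network once (REINFORCE-style)."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).squeeze(0)
        dist = torch.distributions.Categorical(probs)
        a = dist.sample()                        # a_t ~ pi(. | s_t; theta)
        log_probs.append(dist.log_prob(a))
        s, r, done, _ = env.step(a.item())       # observe reward r_t (old Gym 4-tuple API)
        rewards.append(r)

    # Discounted returns u_t, used in place of Q_pi(s_t, a_t).
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.insert(0, u)

    # Gradient ascent on sum_t log pi(a_t | s_t; theta) * u_t (so minimize its negative).
    loss = -(torch.stack(log_probs) * torch.as_tensor(returns, dtype=torch.float32)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```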

Later…

4. Summary

• Approximate the policy function $\pi(a\mid s)$ with a policy network $\pi(a\mid s;\theta)$; a softmax output layer guarantees that the action probabilities sum to 1.
• The approximate state-value function is $V(s;\theta)=\sum_{a}\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)$, and policy-based learning finds $\theta$ that maximizes $J(\theta)=E_{S}[V(S;\theta)]$ by gradient ascent.
• The policy gradient $\frac{\partial V(s;\theta)}{\partial\theta}$ has two forms: Form 1 sums over all actions (practical only for small discrete action spaces), while Form 2 is an expectation that can be estimated by Monte Carlo sampling and also works for continuous actions.
