Policy-Based Reinforcement Learning
We will use a neural network to approximate the policy function. This neural network, called the policy network, is used to control the agent.
1. Policy Function Approximation
1.1 Policy Function $\pi(a\mid s)$
• The policy function $\pi(a\mid s)$ is a probability density function (PDF).
• It takes the state $s$ as input.
• It outputs the probabilities of all the actions, e.g.,
$$\pi(\text{left}\mid s)=0.2,\qquad \pi(\text{right}\mid s)=0.1,\qquad \pi(\text{up}\mid s)=0.7.$$
• The agent performs an action $a$ randomly drawn from this distribution.
Can we directly learn a policy function $\pi(a\mid s)$?
• If there are only a few states and actions, then yes, we can.
• Draw a table (matrix) and learn the entries (see the small sketch after this list).
• What if there are too many (or infinite) states or actions?
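A minimal sketch of such a tabular policy in Python (the states, probabilities, and the sampling helper below are made-up illustrations, not part of the original text):

```python
import numpy as np

# Tabular policy pi(a | s): one row of action probabilities per state (made-up numbers).
ACTIONS = ["left", "right", "up"]
POLICY_TABLE = {
    "s1": [0.2, 0.1, 0.7],
    "s2": [0.5, 0.4, 0.1],
}
rng = np.random.default_rng(0)

def sample_action(state):
    """Draw an action a ~ pi(. | state) from the table."""
    return rng.choice(ACTIONS, p=POLICY_TABLE[state])

print(sample_action("s1"))  # "up" with probability 0.7
```

With very many (or continuous) states, such a table becomes infeasible, which motivates the policy network below.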
1.2 Policy Network $\pi(a\mid s;\theta)$
Policy network: use a neural network to approximate $\pi(a\mid s)$.
• Use the policy network $\pi(a\mid s;\theta)$ to approximate $\pi(a\mid s)$.
• $\theta$: the trainable parameters of the neural network.
• $\sum_{a\in A}\pi(a\mid s;\theta)=1.$
• Here, $A=\{\text{"left"},\text{"right"},\text{"up"}\}$ is the set of all actions.
• That is why we use a softmax activation in the output layer.
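A minimal PyTorch sketch of such a policy network (the state dimension, hidden width, and layer choices are arbitrary assumptions for illustration):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network pi(a | s; theta): maps a state to a probability for each action."""
    def __init__(self, state_dim=4, num_actions=3, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
            nn.Softmax(dim=-1),   # output probabilities sum to 1 over all actions
        )

    def forward(self, state):
        return self.net(state)    # probabilities pi(. | s; theta)

# 4-dimensional state, 3 actions {"left", "right", "up"}
policy = PolicyNet()
probs = policy(torch.randn(4))                    # action probabilities for one state
action = torch.multinomial(probs, num_samples=1)  # sample an action from pi(. | s; theta)
```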
2. State-Value Function Approximation
2.1 Review: Action-Value Function
Recall the action-value function $Q_{\pi}(s_{t},a_{t})$: the expected return obtained by taking action $a_{t}$ in state $s_{t}$ and then acting according to policy $\pi$. It evaluates how good it is for the agent to take action $a_{t}$ in state $s_{t}$.
2.2 State-Value Function
• $V_{\pi}(s_{t})=E_{A}\left[Q_{\pi}(s_{t},A)\right]=\sum_{a}\pi(a\mid s_{t})\cdot Q_{\pi}(s_{t},a).$
Integrate out the action $A\sim\pi(\cdot\mid s_{t}).$
$V_{\pi}(s_{t})$ evaluates how good the current state $s_{t}$ is: the larger $V_{\pi}(s_{t})$, the better the chance of winning. For a fixed state $s_{t}$, $V_{\pi}(s_{t})$ also evaluates the policy $\pi$: the better the policy, the larger $V_{\pi}(s_{t})$.
Definition: State-value function.
$$V_{\pi}(s_{t})=E_{A}\left[Q_{\pi}(s_{t},A)\right]=\sum_{a}\pi(a\mid s_{t})\cdot Q_{\pi}(s_{t},a).$$
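As a quick numerical illustration of this definition, using the action probabilities from Section 1.1 and made-up action values (the $Q_{\pi}$ numbers are hypothetical):

```python
import numpy as np

pi = np.array([0.2, 0.1, 0.7])   # pi(a | s_t) for [left, right, up], from Section 1.1
q  = np.array([1.0, 2.0, 3.0])   # hypothetical Q_pi(s_t, a) values

v = np.sum(pi * q)               # V_pi(s_t) = sum_a pi(a | s_t) * Q_pi(s_t, a)
print(v)                         # 0.2*1.0 + 0.1*2.0 + 0.7*3.0 = 2.5
```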
Approximate state-value function.
• Approximate the policy function $\pi(a\mid s_{t})$ by the policy network $\pi(a\mid s_{t};\theta)$.
• Approximate the state-value function $V_{\pi}(s_{t})$ by:
$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$
Definition: Approximate state-value function.
• $V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$
Policy-based learning: learn $\theta$ that maximizes $J(\theta)=E_{S}\left[V(S;\theta)\right].$
The better the policy network, the larger $J(\theta)$.
How to improve $\theta$? Policy gradient ascent!
• Observe state $s$.
• Update the policy by gradient ascent: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$, where $\beta$ is the learning rate.
Policy gradient: $\frac{\partial V(s;\theta)}{\partial\theta}$.
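Most deep-learning libraries implement gradient descent rather than ascent, so in practice one either minimizes $-V(s;\theta)$ or updates the parameters by hand. A self-contained PyTorch sketch with a tiny linear-softmax policy and made-up $Q_{\pi}$ values (every name and number below is an illustrative assumption):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(4, 3, requires_grad=True)   # tiny linear policy: logits = s @ theta
s = torch.randn(4)                              # one observed state
q = torch.tensor([1.0, 2.0, 3.0])               # made-up Q_pi(s, a) for [left, right, up]
beta = 0.01                                     # learning rate

pi = torch.softmax(s @ theta, dim=-1)           # pi(a | s; theta)
v = torch.sum(pi * q)                           # V(s; theta) = sum_a pi(a|s;theta) * Q_pi(s,a)
v.backward()                                    # fills theta.grad with dV(s;theta)/dtheta

with torch.no_grad():
    theta += beta * theta.grad                  # gradient ASCENT: theta <- theta + beta * dV/dtheta
    theta.grad = None
```

Minimizing `loss = -v` with a standard optimizer such as `torch.optim.SGD` performs the same ascent step.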
3. Policy Gradient
Definition: Approximate state-value function.
• $V(s;\theta)=\sum_{a}\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a).$
Policy gradient: the derivative of $V(s;\theta)$ with respect to $\theta$.
• $\frac{\partial V(s;\theta)}{\partial\theta}=\frac{\partial\sum_{a}\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)}{\partial\theta}.$
Push the derivative inside the summation:
$$=\sum_{a}\frac{\partial\left[\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)\right]}{\partial\theta}.$$
Pretend $Q_{\pi}$ is independent of $\theta$. (It may not be true.)
$$=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).\qquad\text{(Policy Gradient: Form 1)}$$
In practice, however, this exact form of the policy gradient is not used; instead, a Monte Carlo approximation of it is used.
Two forms of the policy gradient:
• Form 1: $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).$
• Form 2: $\frac{\partial V(s;\theta)}{\partial\theta}=E_{A\sim\pi(\cdot\mid s;\theta)}\left[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\right].$
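Form 2 follows from Form 1 by the chain-rule identity $\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}=\pi(a\mid s;\theta)\cdot\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}$:
$$\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\pi(a\mid s;\theta)\cdot\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a)=E_{A\sim\pi(\cdot\mid s;\theta)}\left[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\right],$$
since a sum weighted by $\pi(a\mid s;\theta)$ is exactly an expectation over $A\sim\pi(\cdot\mid s;\theta)$.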
3.1 Calculate Policy Gradient for Discrete Actions
If the actions are discrete, e.g., the action space is $A=\{\text{"left"},\text{"right"},\text{"up"}\}$, …
Use Form 1: $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).$
- Calculate $f(a,\theta)=\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a)$ for every action $a\in A$.
- Policy gradient: $\frac{\partial V(s;\theta)}{\partial\theta}=f(\text{"left"},\theta)+f(\text{"right"},\theta)+f(\text{"up"},\theta)$.
If $A$ is big, this approach is costly.
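A minimal PyTorch sketch of this action-by-action computation, using the same kind of tiny linear-softmax policy and made-up $Q_{\pi}$ values as before (all illustrative assumptions):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(4, 3, requires_grad=True)   # tiny linear policy: logits = s @ theta
s = torch.randn(4)
q = {"left": 1.0, "right": 2.0, "up": 3.0}      # made-up Q_pi(s, a) values
index = {"left": 0, "right": 1, "up": 2}

# Accumulate f(a, theta) = dpi(a|s;theta)/dtheta * Q_pi(s, a) over every action.
for a in ["left", "right", "up"]:
    pi_a = torch.softmax(s @ theta, dim=-1)[index[a]]
    (q[a] * pi_a).backward()                    # adds f(a, theta) into theta.grad

# theta.grad now holds f("left", theta) + f("right", theta) + f("up", theta)
# = dV(s;theta)/dtheta.  Note the loop over all of A: if A is large, this is costly.
print(theta.grad)
```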
3.2 Calculate Policy Gradient for Continuous Actions
If the actions are continuous, e.g., the action space is $A=[0,1]$, …
Use Form 2: $\frac{\partial V(s;\theta)}{\partial\theta}=E_{A\sim\pi(\cdot\mid s;\theta)}\left[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\right].$
- Randomly sample an action $\hat{a}$ according to the PDF $\pi(\cdot\mid s;\theta)$.
- Calculate $g(\hat{a},\theta)=\frac{\partial\log\pi(\hat{a}\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,\hat{a}).$
- Use $g(\hat{a},\theta)$ as an approximation to the policy gradient $\frac{\partial V(s;\theta)}{\partial\theta}$; by Form 2, $E_{A\sim\pi(\cdot\mid s;\theta)}[g(A,\theta)]$ is exactly the policy gradient, so $g(\hat{a},\theta)$ is an unbiased estimate of it.
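A minimal PyTorch sketch of these two steps, assuming (purely for illustration) a Gaussian policy whose mean comes from a tiny linear model with a fixed standard deviation, and a made-up value for $Q_{\pi}(s,\hat{a})$:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)   # parameters of the (assumed) Gaussian policy mean
s = torch.randn(4)                           # one observed state

mu = torch.sigmoid(s @ theta)                # mean of the action distribution, kept in (0, 1)
dist = Normal(mu, 0.1)                       # pi(. | s; theta), fixed std for simplicity

a_hat = dist.sample()                        # step 1: a_hat ~ pi(. | s; theta)
q_sa = 1.5                                   # made-up Q_pi(s, a_hat); must be estimated in practice

dist.log_prob(a_hat).backward()              # d log pi(a_hat | s; theta) / dtheta into theta.grad
g = q_sa * theta.grad                        # step 2: g(a_hat, theta), a Monte Carlo estimate
                                             #         of dV(s;theta)/dtheta
```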
3.3 Update the Policy Network Using the Policy Gradient
The REINFORCE algorithm plays out an entire game and observes all of its rewards before updating the network; the observed return is then used in place of $Q_{\pi}$ in the policy gradient.
Later…
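As a preview of that update, here is a rough sketch of one REINFORCE step, assuming a hypothetical `policy` network that maps a state tensor to action probabilities and an `episode` collected as a list of (state, action, reward) tuples; the observed discounted return $u_t$ stands in for $Q_{\pi}(s_t,a_t)$:

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """One REINFORCE update from a finished episode.

    episode: list of (state, action, reward) tuples from playing one full game,
             where `state` is a tensor and `action` is an integer action index.
    The observed discounted return u_t approximates Q_pi(s_t, a_t).
    """
    # Discounted returns u_t = r_t + gamma * r_{t+1} + ..., computed backwards.
    returns, u = [], 0.0
    for _, _, r in reversed(episode):
        u = r + gamma * u
        returns.append(u)
    returns.reverse()

    # Ascend on sum_t log pi(a_t | s_t; theta) * u_t by descending on its negation.
    loss = 0.0
    for (s, a, _), u_t in zip(episode, returns):
        log_prob = Categorical(policy(s)).log_prob(torch.as_tensor(a))  # log pi(a_t | s_t; theta)
        loss = loss - log_prob * u_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```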