Value-Based Reinforcement Learning
Review:
Definition: Discounted return (cumulative discounted future reward).
⋅ $U_{t}=R_{t}+\gamma R_{t+1}+\gamma^{2}R_{t+2}+\gamma^{3}R_{t+3}+\cdots$ (see the code sketch after this list).
⋅ The return depends on actions $A_{t},A_{t+1},A_{t+2},\ldots$ and states $S_{t},S_{t+1},S_{t+2},\ldots$
⋅ Actions are random: $\mathbb{P}[A=a\mid S=s]=\pi(a\mid s)$. (Policy function)
⋅ States are random: $\mathbb{P}[S'=s'\mid S=s,A=a]=p(s'\mid s,a)$. (State transition)
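To make the return concrete, here is a minimal Python sketch that computes the discounted return $U_{t}$ from a finite sequence of observed rewards; the reward list and the value of $\gamma$ are made-up illustration values, not from the notes.

```python
def discounted_return(rewards, gamma):
    """Compute U_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    for a finite list of rewards starting at time t."""
    u = 0.0
    # Accumulate from the last reward backward: U_k = r_k + gamma * U_{k+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Hypothetical reward sequence and discount factor, for illustration only.
print(discounted_return([1.0, 0.0, 2.0, 3.0], gamma=0.9))
# 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*3.0 = 4.807
```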
Definition: Action-value function for policy $\pi$.
⋅ $Q_{\pi}(s_{t},a_{t}) = \mathbb{E}[U_{t}\mid S_{t}=s_{t},A_{t}=a_{t}]$.
⋅ The expectation is taken w.r.t. actions $A_{t+1},A_{t+2},A_{t+3},\ldots$ and states $S_{t+1},S_{t+2},S_{t+3},\ldots$
⋅ Integrate out everything except for the observations $A_{t}=a_{t}$ and $S_{t}=s_{t}$.
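Since $Q_{\pi}$ is an expectation over the future randomness, it can in principle be estimated by Monte Carlo: roll out many trajectories that start by taking $a_{t}$ in $s_{t}$ and then follow $\pi$, and average the discounted returns. A minimal sketch, assuming a hypothetical environment with `reset_to(state)` and `step(action)` methods (these names are illustrative, not a real API):

```python
def mc_estimate_q(env, pi, s_t, a_t, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of Q_pi(s_t, a_t): average the discounted
    return over many rollouts that start by taking a_t in s_t."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset_to(s_t)       # hypothetical: restart the env in state s_t
        s, r, done = env.step(a_t)  # take the fixed first action a_t
        u, discount = r, gamma
        for _ in range(horizon):
            if done:
                break
            a = pi(s)               # sample the next action from policy pi
            s, r, done = env.step(a)
            u += discount * r
            discount *= gamma
        total += u
    return total / n_rollouts
```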
Definition: Optimal action-value function.
⋅ $Q^{*}(s_{t},a_{t}) = \max_{\pi} Q_{\pi}(s_{t},a_{t})$.
⋅ Whatever policy function $\pi$ is used, the result of taking action $a_{t}$ in state $s_{t}$ cannot be better than $Q^{*}(s_{t},a_{t})$.
1. Deep Q-Network (DQN)
Goal: Win the game ($\approx$ maximize the total reward).
Question: If we know $Q^{*}(s,a)$, what is the best action?
⋅ Obviously, the best action is $a^{*} = \underset{a}{\arg\max}\,Q^{*}(s,a)$.
  ($Q^{*}$ is an indication of how good it is for an agent to pick action $a$ while being in state $s$.)
$Q^{*}$ is like a prophet who can always guide the agent's actions. In reality, however, no such omnipotent prophet exists, so we have to approximate it.
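Given $Q^{*}$ values for every action in the current state, greedy action selection is just an argmax. A tiny sketch, where `q_values` is an illustrative stand-in for $Q^{*}(s,\cdot)$:

```python
# Hypothetical Q* scores for, say, {left, right, up} in the current state.
q_values = [1.2, 3.4, 0.7]

# Best action: a* = argmax_a Q*(s, a).
a_star = max(range(len(q_values)), key=lambda a: q_values[a])
print(a_star)  # 1, the index of the highest-scoring action
```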
Challenge: We do not know $Q^{*}(s,a)$.
⋅ Solution: Deep Q Network (DQN).
⋅ Use a neural network $Q^{*}(s,a;w)$ to approximate $Q^{*}(s,a)$.
Here $w$ denotes the parameters of the neural network, the state $s$ is the input, and the output is a vector of values: the predicted scores of all possible actions. We train the network on rewards, and its scoring gradually improves.
Deep Q Network:
⋅ Input shape: size of the screenshot.
⋅ Output shape: dimension of the action space (a score for each action).
Question: Based on the predictions, what should be the action?
Answer: The action with the highest score should be used.
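Below is a minimal PyTorch sketch of such a network. The input shape (4 stacked 84×84 grayscale frames, a common Atari preprocessing choice) and the layer sizes are illustrative assumptions, not specified in these notes:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a screenshot-like observation to one score per action."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            # Assumed input: (batch, 4, 84, 84) stacked grayscale frames.
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-score per action
        )

    def forward(self, s):
        return self.net(s)

q_net = DQN(n_actions=3)
s = torch.randn(1, 4, 84, 84)         # a dummy screenshot stack
scores = q_net(s)                     # shape (1, 3): a score per action
action = scores.argmax(dim=1).item()  # greedy action selection
```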
2. Temporal Difference (TD) Learning
The most commonly used method for training a DQN is the TD algorithm.
Example:
⋅ I want to drive from NYC to Atlanta.
⋅ Model $Q(w)$ estimates the time cost, e.g., 1000 minutes.
Question: How do I update the model?
⋅ Make a prediction: $q = Q(w)$, e.g., $q = 1000$.
⋅ Finish the trip and get the target $y$, e.g., $y = 860$.
⋅ Loss: $L = \frac{1}{2}(q-y)^{2}$.
⋅ Gradient: $\frac{\partial L}{\partial w}=\frac{\partial L}{\partial q}\cdot\frac{\partial q}{\partial w}=(q-y)\cdot\frac{\partial Q(w)}{\partial w}$.
⋅ Gradient descent: $w_{t+1}=w_{t}-\alpha\cdot\frac{\partial L}{\partial w}\big|_{w=w_{t}}$. (A toy numeric sketch follows the questions below.)
⋅ Can I update the model before finishing the trip?
⋅ Can I get a better $w$ as soon as I arrive at DC?
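As a concrete illustration of the full-trip update above, here is a toy Python sketch where the model is a single scalar parameter $w$ interpreted directly as the time estimate, i.e. $Q(w)=w$ and $\partial Q/\partial w = 1$; the numbers follow the example, and the learning rate is an illustrative assumption:

```python
w = 1000.0   # current model: Q(w) = w, the estimated NYC->Atlanta minutes
alpha = 0.1  # learning rate (illustrative)

q = w        # prediction: q = Q(w) = 1000
y = 860.0    # target observed after finishing the whole trip

# L = 0.5 * (q - y)^2, so dL/dw = (q - y) * dQ/dw = (q - y) * 1
grad = q - y
w = w - alpha * grad
print(w)  # 986.0 -- the estimate moves toward the observed 860
```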
Temporal Difference (TD) Learning
⋅ Model's estimate:
      NYC to Atlanta: 1000 minutes (estimate).
⋅ I arrive at DC; actual time cost:
      NYC to DC: 300 minutes (actual).
⋅ Model now updates its estimate:
      DC to Atlanta: 600 minutes (estimate).
⋅ Model's estimate: $Q(w)=1000$ minutes.
⋅ Updated estimate: $300 + 600 = 900$ minutes (TD target).
⋅ The TD target $y = 900$ is a more reliable estimate than $1000$.
⋅ Loss: $L = \frac{1}{2}\big(\underbrace{Q(w)-y}_{\text{TD error}}\big)^{2}$.
⋅ Gradient: $\frac{\partial L}{\partial w}=\underbrace{(1000-900)}_{\text{TD error}}\cdot\frac{\partial Q(w)}{\partial w}$.
⋅ Gradient descent: $w_{t+1}=w_{t}-\alpha\cdot\frac{\partial L}{\partial w}\big|_{w=w_{t}}$.
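The same toy model as before, now updated mid-trip with the TD target; again a minimal sketch where $Q(w)=w$ and the learning rate is an illustrative choice:

```python
w = 1000.0       # estimated NYC->Atlanta minutes, Q(w) = w
alpha = 0.1      # learning rate (illustrative)

actual_nyc_dc = 300.0  # observed on arrival at DC
est_dc_atl = 600.0     # model's remaining estimate from DC

y = actual_nyc_dc + est_dc_atl  # TD target: 900, partly grounded in reality
td_error = w - y                # 1000 - 900 = 100

# L = 0.5 * td_error^2, so dL/dw = td_error * dQ/dw = td_error
w = w - alpha * td_error
print(w)  # 990.0 -- updated before the trip is finished
```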
3. Why does TD learning work?
⋅ Model's estimates:
      NYC to Atlanta: 1000 minutes.
      DC to Atlanta: 600 minutes.
      $\Rightarrow$ NYC to DC: 400 minutes.
⋅ Ground truth:
      NYC to DC: 300 minutes.
⋅ TD error: $\delta = 400 - 300 = 100$.
4. How to apply TD learning to DQN?
⋅ In the “driving time” example, we have the equation:
$\underbrace{T_{\text{NYC}\to\text{ATL}}}_{\text{Model's estimate}}\approx\underbrace{T_{\text{NYC}\to\text{DC}}}_{\text{Actual time}}+\underbrace{T_{\text{DC}\to\text{ATL}}}_{\text{Model's estimate}}.$
This equation is the general form of the TD algorithm.
⋅ In deep reinforcement learning:
$Q(s_{t},a_{t};w)\approx r_{t}+\gamma\cdot Q(s_{t+1},a_{t+1};w).$
Proof sketch: the return satisfies $U_{t}=R_{t}+\gamma\,U_{t+1}$, so taking expectations of both sides given $(s_{t},a_{t})$ gives $Q_{\pi}(s_{t},a_{t})=\mathbb{E}[R_{t}+\gamma\,Q_{\pi}(S_{t+1},A_{t+1})]$, which the TD target approximates with the observed reward $r_{t}$ and the network's own estimate of the next step.
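Putting the pieces together, one TD learning step for a DQN looks like the following PyTorch sketch. Note that DQN evaluates the next state with $\max_{a}Q(s_{t+1},a;w)$, i.e. $a_{t+1}$ is taken to be the highest-scoring action; the transition tuple and hyperparameters here are illustrative:

```python
import torch

def dqn_td_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One TD update: pull Q(s, a; w) toward the TD target
    y = r + gamma * max_a' Q(s_next, a'; w)."""
    q = q_net(s)[0, a]  # model's estimate Q(s_t, a_t; w)

    with torch.no_grad():  # the TD target is treated as a fixed label
        y = r + gamma * q_net(s_next).max(dim=1).values[0]

    loss = 0.5 * (q - y) ** 2  # squared TD error
    optimizer.zero_grad()
    loss.backward()            # dL/dw = (q - y) * dQ/dw
    optimizer.step()           # w <- w - alpha * dL/dw
    return loss.item()
```

With the `DQN` class sketched earlier, `optimizer` could be, e.g., `torch.optim.SGD(q_net.parameters(), lr=1e-4)`, and the function would be called once per observed transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$.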
5. Summary
Definition: Optimal action-value function.
⋅ $Q^{*}(s_{t},a_{t})=\max_{\pi}\,\mathbb{E}[U_{t}\mid S_{t}=s_{t},A_{t}=a_{t}]$.
The $Q^{*}$ function can score all actions based on the current state, and the scores reflect the quality of each action. As long as we have a $Q^{*}$ function, we can use it to control the agent: at each moment, the agent simply selects and executes the action with the highest score. However, we don't have the $Q^{*}$ function. The purpose of value learning is to learn a function that approximates $Q^{*}$, and that is what the DQN is.
DQN: Approximate $Q^{*}(s,a)$ using a neural network (the DQN).
⋅ $Q^{*}(s,a;w)$ is a neural network parameterized by $w$.
⋅ Input: observed state $s$.
⋅ Output: scores for all actions $a\in\mathcal{A}$.
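To close the loop, here is a minimal sketch of using a trained DQN to control the agent, greedily taking the highest-scoring action at every step; the `env` object with `reset()`/`step()` methods and the tensor preprocessing are illustrative assumptions:

```python
import torch

def play_episode(env, q_net):
    """Control the agent with the learned network: at each step,
    execute a_t = argmax_a Q(s_t, a; w)."""
    s = env.reset()
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            scores = q_net(torch.as_tensor(s).unsqueeze(0).float())
        a = scores.argmax(dim=1).item()  # highest-scoring action
        s, r, done = env.step(a)         # hypothetical gym-style API
        total_reward += r
    return total_reward
```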