Initialize replay memory $M$ to capacity $N$
Initialize action-value function $Q$ with random weights
for each episode do
&emsp;Reset the model, set $D_l \leftarrow \varnothing$, and shuffle $D$ randomly
&emsp;for $i = 1, |D|$ do
&emsp;&emsp;Construct the state $s_i$ using $x_i$
&emsp;&emsp;With probability $\epsilon$ select a random action $a_i$
&emsp;&emsp;Otherwise select $a_i = \arg\max_a Q^{\pi}(s_i, a; \theta)$
&emsp;&emsp;if $a_i = 1$ then
&emsp;&emsp;&emsp;Obtain the annotation $y_i$
&emsp;&emsp;&emsp;$D_l \leftarrow D_l + (x_i, y_i)$
&emsp;&emsp;&emsp;Update the model $\phi$ based on $D_l$
&emsp;&emsp;end if
&emsp;&emsp;Receive a reward $r_i$ from the test data
&emsp;&emsp;if $|D_l| = B$ then
&emsp;&emsp;&emsp;Store the transition in $M$
&emsp;&emsp;&emsp;Break
&emsp;&emsp;end if
&emsp;&emsp;Construct the new state $s_{i+1}$
&emsp;&emsp;Store the transition $(s_i, a_i, r_i, s_{i+1})$ in $M$
&emsp;&emsp;Sample a random minibatch of transitions from $M$, and perform a gradient descent step on $L(\theta)$
&emsp;&emsp;Update the policy $\pi$ with $\theta$
&emsp;end for
end for
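The episode loop above can be sketched in plain Python. Everything concrete here is an assumption for illustration, not the paper's implementation: the linear stand-in for $Q(s, a; \theta)$, the `annotate` and `reward_fn` callbacks, and the one-feature state are all placeholders, and the minibatch gradient step on $L(\theta)$ is only noted in a comment.

```python
import random
from collections import deque

# Hedged sketch of the algorithm above: a DQN-style active learner that,
# for each instance x_i, decides whether to request its label (action 1)
# or skip it (action 0), until the annotation budget B is spent.
# Model, state features, and reward are stubbed out.

class ActiveLearner:
    def __init__(self, budget, epsilon=0.1, memory_capacity=1000):
        self.B = budget                         # annotation budget B
        self.epsilon = epsilon                  # exploration rate
        self.M = deque(maxlen=memory_capacity)  # replay memory M
        # Toy stand-in for Q(s, a; theta): one random weight per action.
        self.theta = [random.random(), random.random()]

    def q_values(self, state):
        # Score each action by weight * (sum of state features).
        s = sum(state)
        return [self.theta[0] * s, self.theta[1] * s]

    def select_action(self, state):
        # Epsilon-greedy: random action w.p. epsilon, else argmax_a Q.
        if random.random() < self.epsilon:
            return random.randint(0, 1)
        q = self.q_values(state)
        return 0 if q[0] >= q[1] else 1

    def run_episode(self, D, annotate, reward_fn):
        """One episode over the (shuffled) pool D, stopping at budget B."""
        D_l = []                                 # labelled set D_l
        random.shuffle(D)
        for i, x in enumerate(D):
            s = [x]                              # state s_i built from x_i
            a = self.select_action(s)
            if a == 1:
                y = annotate(x)                  # obtain annotation y_i
                D_l.append((x, y))               # D_l <- D_l + (x_i, y_i)
                # A real agent would also retrain the model phi on D_l here.
            r = reward_fn(D_l)                   # reward r_i from held-out data
            if len(D_l) == self.B:
                self.M.append((s, a, r, None))   # terminal transition
                break
            s_next = [D[i + 1]] if i + 1 < len(D) else None
            self.M.append((s, a, r, s_next))     # store transition in M
            # A real agent would now sample a minibatch from M and take a
            # gradient step on L(theta); omitted in this sketch.
        return D_l
```

With a 20-item pool, a budget of 3, and dummy callbacks, one episode labels at most 3 instances and fills the replay memory with (state, action, reward, next-state) tuples.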