1. Algorithm:
Double DQN decouples action selection from action evaluation to reduce the overestimation bias of vanilla DQN. Selection uses the online network (weights $\mathbf{w}$):

$$a^{\star}=\operatorname*{argmax}_{a}Q(s_{t+1},a;\mathbf{w}).$$
Evaluation uses the target network (weights $\mathbf{w}^{-}$):

$$y_{t}=r_{t}+\gamma\cdot Q(s_{t+1},a^{\star};\mathbf{w}^{-}).$$
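To make the decoupling concrete, here is a small sketch with made-up Q-values showing how selecting with the online network but evaluating with the target network can give a smaller target than plain DQN's max:

import torch

# Hypothetical Q-values for two actions at s_{t+1} (numbers are made up).
q_online = torch.tensor([1.0, 2.0])   # Q(s_{t+1}, ., w), online network
q_target = torch.tensor([3.0, 0.5])   # Q(s_{t+1}, ., w^-), target network

a_star = q_online.argmax()            # selection: a* = 1
double_q = q_target[a_star]           # evaluation: 0.5
vanilla_q = q_target.max()            # plain DQN would use 3.0, overestimating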
2. Implementation:
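The class below relies on a QNet that maps observations to per-action Q-values. QNet is not defined in this section, so here is a minimal stand-in, assuming a simple two-layer MLP:

import torch.nn as nn

# Minimal stand-in for QNet (not defined in this section): a two-layer
# MLP mapping an observation vector to per-action Q-values.
class QNet(nn.Module):
    def __init__(self, dim_obs, num_act):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_obs, 64),
            nn.ReLU(),
            nn.Linear(64, num_act),
        )

    def forward(self, obs):
        return self.net(obs)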
import torch
import torch.nn.functional as F

class DoubleDQN:
    def __init__(self, dim_obs=None, num_act=None, discount=0.9):
        self.discount = discount
        self.model = QNet(dim_obs, num_act)
        self.target_model = QNet(dim_obs, num_act)
        self.target_model.load_state_dict(self.model.state_dict())

    @torch.no_grad()
    def get_action(self, obs):
        # Greedy action from the online network.
        qvals = self.model(obs)
        return qvals.argmax()

    def compute_loss(self, s_batch, a_batch, r_batch, d_batch, next_s_batch):
        # Current Q-values for the actions actually taken.
        qvals = self.model(s_batch).gather(1, a_batch.unsqueeze(1)).squeeze(1)
        # Double DQN: select a* with the online network, then evaluate it
        # with the target network. no_grad keeps the next-state value out
        # of the gradient computation, which helps avoid divergence.
        with torch.no_grad():
            a_star = self.model(next_s_batch).argmax(dim=1, keepdim=True)
            next_qvals = self.target_model(next_s_batch).gather(1, a_star).squeeze(1)
        target = r_batch + self.discount * next_qvals * (1 - d_batch)
        loss = F.mse_loss(qvals, target)
        return loss
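A quick usage sketch on random tensors (the dimensions are made up). Note that the class only syncs the target network at construction time; during training it should be re-synced periodically:

import torch

agent = DoubleDQN(dim_obs=4, num_act=2)    # made-up dimensions
s = torch.randn(32, 4)                     # states
a = torch.randint(0, 2, (32,))             # actions taken
r = torch.randn(32)                        # rewards
d = torch.zeros(32)                        # done flags (0 = not terminal)
s_next = torch.randn(32, 4)                # next states

loss = agent.compute_loss(s, a, r, d, s_next)
loss.backward()
# Re-sync the target network periodically, e.g. every N gradient steps:
agent.target_model.load_state_dict(agent.model.state_dict())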