Initialize Q arbitrarily
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) − Q(S,A)]
        S ← S'
    until S is terminal
# α : learning rate
# γ : discount factor
The Q-Learning update rule
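As a rough illustration, the loop above can be sketched in Python as tabular Q-learning. The environment here is an assumption made purely for the example: a hypothetical 1-D chain of states where action 1 moves right, action 0 moves left, and reaching the rightmost state yields reward 1 and ends the episode. The names `q_update` and `q_learning`, and the default values for α, γ and ε, are likewise illustrative and do not come from the original listing.

```python
import random


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update to the table Q in place."""
    td_target = r + gamma * max(Q[s_next])      # R + γ max_a Q(S', a)
    Q[s][a] += alpha * (td_target - Q[s][a])    # move Q(S, A) toward the target


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain environment (assumed for this sketch)."""
    rng = random.Random(seed)
    n_actions = 2                    # 0 = move left, 1 = move right
    terminal = n_states - 1
    Q = [[0.0] * n_actions for _ in range(n_states)]   # "Initialize Q arbitrarily" (zeros here)

    def step(s, a):
        # Deterministic toy dynamics: reward 1 only on entering the terminal state.
        s_next = min(s + 1, terminal) if a == 1 else max(s - 1, 0)
        return (1.0 if s_next == terminal else 0.0), s_next

    for _ in range(episodes):
        s = 0                                          # Initialize S
        while s != terminal:                           # until S is terminal
            if rng.random() < epsilon:                 # ε-greedy: explore
                a = rng.randrange(n_actions)
            else:                                      # exploit, breaking ties at random
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            r, s_next = step(s, a)                     # Take action A, observe R, S'
            q_update(Q, s, a, r, s_next, alpha, gamma)
            s = s_next                                 # S ← S'
    return Q


if __name__ == "__main__":
    for state, values in enumerate(q_learning()):
        print(state, [round(v, 3) for v in values])
```

In this sketch the terminal state's row is never updated, so its values stay at zero and the target for the final transition reduces to R alone. For any non-terminal state, the greedy action should end up being "move right", since that is the shortest path to the only reward.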