Value-Based Reinforcement Learning
Review:
Definition: Discounted return (cumulative discounted future reward).
⋅ $U_{t}=R_{t}+\gamma R_{t+1}+\gamma^{2}R_{t+2}+\gamma^{3}R_{t+3}+\cdots$ (see the code sketch after this list).
⋅ The return depends on actions $A_{t},A_{t+1},A_{t+2},\ldots$ and states $S_{t},S_{t+1},S_{t+2},\ldots$
⋅ Actions are random: $\mathbb{P}[A=a\mid S=s]=\pi(a\mid s)$. (Policy function)
⋅ States are random: $\mathbb{P}[S'=s'\mid S=s,A=a]=p(s'\mid s,a)$. (State transition)
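To make the return concrete, here is a minimal Python sketch that computes the discounted return $U_{t}$ from a finite sequence of observed rewards; the reward list and the value of $\gamma$ are made-up illustration values, not from the notes.

```python
def discounted_return(rewards, gamma):
    """Compute U_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    for a finite list of rewards starting at time t."""
    u = 0.0
    # Accumulate from the last reward backward: U_k = r_k + gamma * U_{k+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Hypothetical reward sequence and discount factor, for illustration only.
print(discounted_return([1.0, 0.0, 2.0, 3.0], gamma=0.9))
# 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*3.0 = 4.807
```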
Definition: Action-value function for policy $\pi$.
⋅ $Q_{\pi}(s_{t},a_{t}) = \mathbb{E}[U_{t}\mid S_{t}=s_{t},A_{t}=a_{t}]$.
⋅ The expectation is taken w.r.t. actions $A_{t+1},A_{t+2},A_{t+3},\ldots$ and states $S_{t+1},S_{t+2},S_{t+3},\ldots$
⋅ Integrate out everything except for the observations $A_{t}=a_{t}$ and $S_{t}=s_{t}$.
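Since $Q_{\pi}$ is an expectation over the future randomness, it can in principle be estimated by Monte Carlo: roll out many trajectories that start by taking $a_{t}$ in $s_{t}$ and then follow $\pi$, and average the discounted returns. A minimal sketch, assuming a hypothetical environment with `reset_to(state)` and `step(action)` methods (these names are illustrative, not a real API):

```python
def mc_estimate_q(env, pi, s_t, a_t, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of Q_pi(s_t, a_t): average the discounted
    return over many rollouts that start by taking a_t in s_t."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset_to(s_t)       # hypothetical: restart the env in state s_t
        s, r, done = env.step(a_t)  # take the fixed first action a_t
        u, discount = r, gamma
        for _ in range(horizon):
            if done:
                break
            a = pi(s)               # sample the next action from policy pi
            s, r, done = env.step(a)
            u += discount * r
            discount *= gamma
        total += u
    return total / n_rollouts
```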
Definition: Optimal action-value function.
⋅ $Q^{*}(s_{t},a_{t}) = \max_{\pi} Q_{\pi}(s_{t},a_{t})$.
⋅ Whatever policy function $\pi$ is used, the result of taking action $a_{t}$ in state $s_{t}$ cannot be better than $Q^{*}(s_{t},a_{t})$.
1. Deep Q-Network (DQN)
Goal: Win the game ($\approx$ maximize the total reward).
Question: If we know $Q^{*}(s,a)$, what is the best action?
⋅ Obviously, the best action is $a^{*} = \underset{a}{\arg\max}\,Q^{*}(s,a)$.
  ($Q^{*}$ is an indication of how good it is for an agent to pick action $a$ while being in state $s$.)
$Q^{*}$ is like a prophet who can always guide the agent's actions. In reality, however, no such omnipotent prophet exists, so we have to approximate it.
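Given $Q^{*}$ values for every action in the current state, greedy action selection is just an argmax. A tiny sketch, where `q_values` is an illustrative stand-in for $Q^{*}(s,\cdot)$:

```python
# Hypothetical Q* scores for, say, {left, right, up} in the current state.
q_values = [1.2, 3.4, 0.7]

# Best action: a* = argmax_a Q*(s, a).
a_star = max(range(len(q_values)), key=lambda a: q_values[a])
print(a_star)  # 1, the index of the highest-scoring action
```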
Challenge: We do not know $Q^{*}(s,a)$.
⋅ Solution: Deep Q Network (DQN).
⋅ Use a neural network $Q^{*}(s,a;w)$ to approximate $Q^{*}(s,a)$.
Here $w$ denotes the parameters of the neural network, the state $s$ is the input, and the output is a vector of values: the predicted scores of all possible actions. We train the network on rewards, and its scoring gradually improves.
Deep Q Network:
⋅ Input shape: size of the screenshot.
⋅ Output shape: dimension of the action space (a score for each action).
Question: Based on the predictions, what should be the action?
Answer: The action with the highest score should be used.
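Below is a minimal PyTorch sketch of such a network. The input shape (4 stacked 84×84 grayscale frames, a common Atari preprocessing choice) and the layer sizes are illustrative assumptions, not specified in these notes:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a screenshot-like observation to one score per action."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            # Assumed input: (batch, 4, 84, 84) stacked grayscale frames.
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-score per action
        )

    def forward(self, s):
        return self.net(s)

q_net = DQN(n_actions=3)
s = torch.randn(1, 4, 84, 84)         # a dummy screenshot stack
scores = q_net(s)                     # shape (1, 3): a score per action
action = scores.argmax(dim=1).item()  # greedy action selection
```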
2. Temporal Difference (TD) Learning
The most commonly used method for training a DQN is the TD algorithm.
Example:
⋅ I want to drive from NYC to Atlanta.
⋅ Model $Q(w)$ estimates the time cost, e.g., 1000 minutes.
Question: How do I update the model?
⋅ Make a prediction: $q = Q(w)$, e.g., $q = 1000$.
⋅ Finish the trip and get the target $y$, e.g., $y = 860$.
⋅ Loss: $L = \frac{1}{2}(q-y)^{2}$.
⋅ Gradient: $\frac{\partial L}{\partial w}=\frac{\partial L}{\partial q}\cdot\frac{\partial q}{\partial w}=(q-y)\cdot\frac{\partial Q(w)}{\partial w}$.
⋅ Gradient descent: $w_{t+1}=w_{t}-\alpha\cdot\frac{\partial L}{\partial w}\big|_{w=w_{t}}$. (A toy numeric sketch follows the questions below.)
⋅ Can I update the model before finishing the trip?
⋅ Can I get a better $w$ as soon as I arrive at DC?
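As a concrete illustration of the full-trip update above, here is a toy Python sketch where the model is a single scalar parameter $w$ interpreted directly as the time estimate, i.e. $Q(w)=w$ and $\partial Q/\partial w = 1$; the numbers follow the example, and the learning rate is an illustrative assumption:

```python
w = 1000.0   # current model: Q(w) = w, the estimated NYC->Atlanta minutes
alpha = 0.1  # learning rate (illustrative)

q = w        # prediction: q = Q(w) = 1000
y = 860.0    # target observed after finishing the whole trip

# L = 0.5 * (q - y)^2, so dL/dw = (q - y) * dQ/dw = (q - y) * 1
grad = q - y
w = w - alpha * grad
print(w)  # 986.0 -- the estimate moves toward the observed 860
```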
Temporal Difference (TD) Learning
⋅ Model's estimate:
      NYC to Atlanta: 1000 minutes (estimate).
⋅ I arrive at DC; actual time cost:
      NYC to DC: 300 minutes (actual).
⋅ Model now updates its estimate:
      DC to Atlanta: 600 minutes (estimate).
⋅ Model's estimate: $Q(w)=1000$ minutes.
⋅ Updated estimate: $300 + 600 = 900$ minutes (TD target).
⋅ The TD target $y = 900$ is a more reliable estimate than $1000$.
⋅ Loss: $L = \frac{1}{2}\big(\underbrace{Q(w)-y}_{\text{TD error}}\big)^{2}$.
⋅ Gradient: $\frac{\partial L}{\partial w}=\underbrace{(1000-900)}_{\text{TD error}}\cdot\frac{\partial Q(w)}{\partial w}$.
⋅ Gradient descent: $w_{t+1}=w_{t}-\alpha\cdot\frac{\partial L}{\partial w}\big|_{w=w_{t}}$.
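The same toy model as before, now updated mid-trip with the TD target; again a minimal sketch where $Q(w)=w$ and the learning rate is an illustrative choice:

```python
w = 1000.0       # estimated NYC->Atlanta minutes, Q(w) = w
alpha = 0.1      # learning rate (illustrative)

actual_nyc_dc = 300.0  # observed on arrival at DC
est_dc_atl = 600.0     # model's remaining estimate from DC

y = actual_nyc_dc + est_dc_atl  # TD target: 900, partly grounded in reality
td_error = w - y                # 1000 - 900 = 100

# L = 0.5 * td_error^2, so dL/dw = td_error * dQ/dw = td_error
w = w - alpha * td_error
print(w)  # 990.0 -- updated before the trip is finished
```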
3. Why does TD learning work?
⋅ Model's estimates:
      NYC to Atlanta: 1000 minutes.
      DC to Atlanta: 600 minutes.
      $\Rightarrow$ NYC to DC: 400 minutes.
⋅ Ground truth:
      NYC to DC: 300 minutes.
⋅ TD error: $\delta = 400 - 300 = 100$.
4. How to apply TD learning to DQN?
⋅ In the “driving time” example, we have the equation:
$\underbrace{T_{\text{NYC}\to\text{ATL}}}_{\text{Model's estimate}}\approx\underbrace{T_{\text{NYC}\to\text{DC}}}_{\text{Actual time}}+\underbrace{T_{\text{DC}\to\text{ATL}}}_{\text{Model's estimate}}.$
This equation is the general form of the TD algorithm.
⋅ In deep reinforcement learning:
$Q(s_{t},a_{t};w)\approx r_{t}+\gamma\cdot Q(s_{t+1},a_{t+1};w).$
Proof sketch: the return satisfies $U_{t}=R_{t}+\gamma\,U_{t+1}$, so taking expectations of both sides given $(s_{t},a_{t})$ gives $Q_{\pi}(s_{t},a_{t})=\mathbb{E}[R_{t}+\gamma\,Q_{\pi}(S_{t+1},A_{t+1})]$, which the TD target approximates with the observed reward $r_{t}$ and the network's own estimate of the next step.
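Putting the pieces together, one TD learning step for a DQN looks like the following PyTorch sketch. Note that DQN evaluates the next state with $\max_{a}Q(s_{t+1},a;w)$, i.e. $a_{t+1}$ is taken to be the highest-scoring action; the transition tuple and hyperparameters here are illustrative:

```python
import torch

def dqn_td_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One TD update: pull Q(s, a; w) toward the TD target
    y = r + gamma * max_a' Q(s_next, a'; w)."""
    q = q_net(s)[0, a]  # model's estimate Q(s_t, a_t; w)

    with torch.no_grad():  # the TD target is treated as a fixed label
        y = r + gamma * q_net(s_next).max(dim=1).values[0]

    loss = 0.5 * (q - y) ** 2  # squared TD error
    optimizer.zero_grad()
    loss.backward()            # dL/dw = (q - y) * dQ/dw
    optimizer.step()           # w <- w - alpha * dL/dw
    return loss.item()
```

With the `DQN` class sketched earlier, `optimizer` could be, e.g., `torch.optim.SGD(q_net.parameters(), lr=1e-4)`, and the function would be called once per observed transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$.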
5. Summary
Definition: Optimal action-value function.
⋅ $Q^{*}(s_{t},a_{t})=\max_{\pi}\,\mathbb{E}[U_{t}\mid S_{t}=s_{t},A_{t}=a_{t}]$.
The $Q^{*}$ function can score all actions based on the current state, and the scores reflect the quality of each action. As long as we have a $Q^{*}$ function, we can use it to control the agent: at each moment, the agent simply selects and executes the action with the highest score. However, we don't have the $Q^{*}$ function. The purpose of value learning is to learn a function that approximates $Q^{*}$, and that is what the DQN is.
DQN: Approximate $Q^{*}(s,a)$ using a neural network (the DQN).
⋅ $Q^{*}(s,a;w)$ is a neural network parameterized by $w$.
⋅ Input: observed state $s$.
⋅ Output: scores for all actions $a\in\mathcal{A}$.
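To close the loop, here is a minimal sketch of using a trained DQN to control the agent, greedily taking the highest-scoring action at every step; the `env` object with `reset()`/`step()` methods and the tensor preprocessing are illustrative assumptions:

```python
import torch

def play_episode(env, q_net):
    """Control the agent with the learned network: at each step,
    execute a_t = argmax_a Q(s_t, a; w)."""
    s = env.reset()
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            scores = q_net(torch.as_tensor(s).unsqueeze(0).float())
        a = scores.argmax(dim=1).item()  # highest-scoring action
        s, r, done = env.step(a)         # hypothetical gym-style API
        total_reward += r
    return total_reward
```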