(3) Deep Reinforcement Learning Basics: Policy Learning

Policy-Based Reinforcement Learning

We will use a neural network to approximate the policy function. This neural network is called the policy network, and it can be used to control the agent's actions.

1. Policy Function Approximation

1.1 Policy Function $\pi(a\mid s)$

• Policy function $\pi(a\mid s)$ is a probability density function (PDF).

• It takes state $s$ as input.

• It outputs the probabilities of all the actions, e.g.,

$$\pi(\text{left}\mid s)=0.2,\qquad \pi(\text{right}\mid s)=0.1,\qquad \pi(\text{up}\mid s)=0.7.$$

• The agent performs an action $a$ randomly drawn from this distribution.

Can we directly learn a policy function $\pi(a\mid s)$?

• If there are only a few states and actions, then yes, we can.

• Draw a table (matrix) and learn the entries.

• What if there are too many (or infinitely many) states or actions?

1.2 Policy Network $\pi(a\mid s;\theta)$

Policy network: Use a neural net to approximate $\pi(a\mid s)$.

• Use the policy network $\pi(a\mid s;\theta)$ to approximate $\pi(a\mid s)$.

$\theta$: trainable parameters of the neural net.

$$\sum_{a\in A}\pi(a\mid s;\theta)=1.$$

• Here, $A=\{\text{“left”},\text{“right”},\text{“up”}\}$ is the set of all actions.

• That is why we use a softmax activation in the output layer (see the sketch below).
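To make this concrete, here is a minimal sketch of such a policy network in PyTorch. The state dimension, hidden width, and the three-action output are assumptions chosen for illustration, not something specified in these notes.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state s to the probability distribution pi(. | s; theta) over actions."""

    def __init__(self, state_dim=4, hidden_dim=64, num_actions=3):  # 3 actions: "left", "right", "up"
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
            nn.Softmax(dim=-1),  # softmax makes the outputs sum to 1 over all actions
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, num_actions); each row sums to 1

# Usage: get action probabilities for one state and sample an action from them.
policy = PolicyNetwork()
s = torch.randn(1, 4)                  # a dummy state
probs = policy(s)                      # e.g. something like [[0.2, 0.1, 0.7]] after training
action = torch.multinomial(probs, 1)   # action a randomly drawn from pi(. | s; theta)
```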

2. State-Value Function Approximation

2.1 Review: Action-Value Function

The discounted return is $U_{t}=R_{t}+\gamma R_{t+1}+\gamma^{2}R_{t+2}+\cdots$. The action-value function
$$Q_{\pi}(s_{t},a_{t})=E\big[U_{t}\mid S_{t}=s_{t},A_{t}=a_{t}\big]$$
is the expected return given the current state $s_{t}$ and action $a_{t}$, with the expectation taken over the future states and actions under policy $\pi$. It is used next to define the state-value function.

2.2 State-Value Function

$$V_{\pi}(s_{t})=E_{A}\big[Q_{\pi}(s_{t},A)\big]=\sum_{a}\pi(a\mid s_{t})\cdot Q_{\pi}(s_{t},a).$$
Integrate out the action $A\sim\pi(\cdot\mid s_{t})$.

$V_{\pi}(s_{t})$ evaluates the current state: the larger $V_{\pi}(s_{t})$, the better the chance of winning. Given the state $s_{t}$, $V_{\pi}(s_{t})$ also evaluates the policy $\pi$: the better the policy $\pi$, the larger $V_{\pi}(s_{t})$ and the better the current chance of winning.

Definition: State-value function.
$$V_{\pi}(s_{t})=E_{A}\big[Q_{\pi}(s_{t},A)\big]=\sum_{a}\pi(a\mid s_{t})\cdot Q_{\pi}(s_{t},a).$$

Approximate state-value function.

• Approximate the policy function $\pi(a\mid s_{t})$ by the policy network $\pi(a\mid s_{t};\theta)$.

• Approximate the state-value function $V_{\pi}(s_{t})$ by:

$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$

Definition: Approximate state-value function.
$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$
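Given the output of a policy network and the action values, the approximate state value is just a probability-weighted sum. A tiny sketch with made-up placeholder numbers (the probabilities reuse the example from Section 1.1):

```python
import torch

# pi(a | s_t; theta) for all actions, as produced by the policy network (placeholder numbers)
probs = torch.tensor([0.2, 0.1, 0.7])
# Q_pi(s_t, a) for all actions (placeholder numbers)
q_values = torch.tensor([1.0, -0.5, 2.0])

# V(s_t; theta) = sum_a pi(a | s_t; theta) * Q_pi(s_t, a)
v_approx = torch.dot(probs, q_values)
print(v_approx.item())  # 0.2*1.0 + 0.1*(-0.5) + 0.7*2.0 = 1.55
```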

Policy-based learning: Learn $\theta$ that maximizes $J(\theta)=E_{S}[V(S;\theta)]$.

The better the policy network, the larger $J(\theta)$.

How to improve $\theta$? Policy gradient ascent!

• Observe state $s$.

• Update the policy by: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$ (see the sketch below).

Policy gradient: $\frac{\partial V(s;\theta)}{\partial\theta}$.
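A minimal sketch of one such gradient-ascent step, reusing the PolicyNetwork sketch from Section 1.2 and assuming the action values $Q_{\pi}(s,a)$ are available as a constant vector q_values (in practice they must be approximated):

```python
import torch

def gradient_ascent_step(policy, s, q_values, beta=1e-3):
    """One update theta <- theta + beta * dV(s; theta)/dtheta,
    where V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a)
    and q_values is a constant stand-in for Q_pi(s, .)."""
    v = torch.dot(policy(s).squeeze(0), q_values)              # V(s; theta)
    grads = torch.autograd.grad(v, list(policy.parameters()))  # dV(s; theta)/dtheta
    with torch.no_grad():
        for p, g in zip(policy.parameters(), grads):
            p.add_(beta * g)                                   # ascent: move along the gradient

# Usage with the Section 1.2 sketch:
# gradient_ascent_step(policy, torch.randn(1, 4), torch.tensor([1.0, -0.5, 2.0]))
```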

3. Policy Gradient

Definition: Approximate state-value function.
$$V(s_{t};\theta)=\sum_{a}\pi(a\mid s_{t};\theta)\cdot Q_{\pi}(s_{t},a).$$

Policy gradient: the derivative of $V(s;\theta)$ w.r.t. $\theta$.
$$\frac{\partial V(s;\theta)}{\partial\theta}=\frac{\partial\sum_{a}\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)}{\partial\theta}.$$

Push derivative inside the summation.

$$=\sum_{a}\frac{\partial\,\big[\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)\big]}{\partial\theta}.$$

Pretend $Q_{\pi}$ is independent of $\theta$. (This may not be true.)

$$=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a)\qquad\text{(Policy Gradient: Form 1)}$$

In practice, however, this exact policy gradient is not computed; instead, a Monte Carlo approximation of the policy gradient is used.


Two forms of policy gradient:

• Form 1: $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).$

• Form 2: $\frac{\partial V(s;\theta)}{\partial\theta}=E_{A\sim\pi(\cdot\mid s;\theta)}\Big[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\Big].$

Form 2 follows from Form 1 via the identity $\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}=\pi(a\mid s;\theta)\cdot\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}$, which turns the sum over actions into an expectation over $A\sim\pi(\cdot\mid s;\theta)$.

3.1 Calculate Policy Gradient for Discrete Actions

If the actions are discrete, e.g., action space $A=\{\text{“left”},\text{“right”},\text{“up”}\}$, …

Use Form 1: $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_{a}\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a).$

  1. Calculate $f(a,\theta)=\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,a)$ for every action $a\in A$.
  2. Policy gradient: $\frac{\partial V(s;\theta)}{\partial\theta}=f(\text{“left”},\theta)+f(\text{“right”},\theta)+f(\text{“up”},\theta)$.

If $A$ is big, this approach is costly (see the code sketch below).
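A sketch of the two steps above with PyTorch autograd, looping over every action literally as Form 1 prescribes. It assumes the PolicyNetwork from Section 1.2 and a placeholder q_values vector standing in for $Q_{\pi}(s,\cdot)$; the per-action loop is exactly what becomes costly when $A$ is large.

```python
import torch

def policy_gradient_form1(policy, s, q_values):
    """Form 1: dV(s;theta)/dtheta = sum_a dpi(a|s;theta)/dtheta * Q_pi(s,a).
    Returns one gradient tensor per parameter of the policy network;
    q_values is a constant stand-in for Q_pi(s, .)."""
    params = list(policy.parameters())
    grad_sum = [torch.zeros_like(p) for p in params]
    for a in range(len(q_values)):                    # step 1: f(a, theta) for every action a
        prob_a = policy(s).squeeze(0)[a]              # pi(a | s; theta)
        grads = torch.autograd.grad(prob_a, params)   # dpi(a | s; theta) / dtheta
        for acc, g in zip(grad_sum, grads):
            acc += g * q_values[a]                    # f(a, theta) = dpi/dtheta * Q_pi(s, a)
    return grad_sum                                   # step 2: sum of f(a, theta) over all actions
```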

3.2 Calculate Policy Gradient for Continuous Actions

If the actions are continuous, e.g., action space $A=[0,1]$, …

Use Form 2: $\frac{\partial V(s;\theta)}{\partial\theta}=E_{A\sim\pi(\cdot\mid s;\theta)}\Big[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,A)\Big].$

  1. Randomly sample an action $\hat{a}$ according to the PDF $\pi(\cdot\mid s;\theta)$.
  2. Calculate $g(\hat{a},\theta)=\frac{\partial\log\pi(\hat{a}\mid s;\theta)}{\partial\theta}\cdot Q_{\pi}(s,\hat{a})$.
  3. Use $g(\hat{a},\theta)$ as an approximation to the policy gradient $\frac{\partial V(s;\theta)}{\partial\theta}$.
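A sketch of these three steps for a one-dimensional continuous action. It assumes, purely as an example, that the policy is a Gaussian whose mean comes from a small network mu_net and whose log standard deviation log_std is a trainable scalar, and that q_of(s, a) is some available approximation of $Q_{\pi}(s,a)$.

```python
import torch

def policy_gradient_form2(mu_net, log_std, s, q_of):
    """Monte Carlo approximation of Form 2 for a 1-D Gaussian policy.
    mu_net maps a state to the mean of pi(. | s; theta); log_std is a trainable scalar.
    q_of(s, a) approximates Q_pi(s, a) and is treated as a constant."""
    mu = mu_net(s).squeeze()
    dist = torch.distributions.Normal(mu, log_std.exp())
    a_hat = dist.sample()                            # step 1: a_hat ~ pi(. | s; theta)
    log_prob = dist.log_prob(a_hat)                  # log pi(a_hat | s; theta)
    params = list(mu_net.parameters()) + [log_std]
    grads = torch.autograd.grad(log_prob, params)    # step 2: dlog pi(a_hat|s;theta)/dtheta ...
    q = float(q_of(s, a_hat))                        # ... times Q_pi(s, a_hat), held constant
    return [g * q for g in grads]                    # step 3: g(a_hat, theta) approximates dV/dtheta
```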

3.3 Update Policy Network Using Policy Gradient

The REINFORCE algorithm needs to play out an entire game (episode) and observe all the rewards before updating the network.
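A minimal sketch of that idea, assuming an old-style OpenAI Gym environment API (reset() returns the observation, step() returns a 4-tuple) and a discrete policy network like the one in Section 1.2; the observed discounted return $u_t$ is used in place of $Q_{\pi}(s_t,a_t)$, and the discount factor and optimizer are placeholders.

```python
import torch

def reinforce_episode(policy, env, optimizer, gamma=0.99):
    """Play one full episode, then update the policy network once (REINFORCE-style)."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).squeeze(0)
        dist = torch.distributions.Categorical(probs)
        a = dist.sample()                        # a_t ~ pi(. | s_t; theta)
        log_probs.append(dist.log_prob(a))
        s, r, done, _ = env.step(a.item())       # observe reward r_t (old Gym 4-tuple API)
        rewards.append(r)

    # Discounted returns u_t, used in place of Q_pi(s_t, a_t).
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.insert(0, u)

    # Gradient ascent on sum_t log pi(a_t | s_t; theta) * u_t (so minimize its negative).
    loss = -(torch.stack(log_probs) * torch.as_tensor(returns, dtype=torch.float32)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```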

Later…

4. Summary

• Approximate the policy function $\pi(a\mid s)$ with a policy network $\pi(a\mid s;\theta)$; a softmax output layer guarantees that the action probabilities sum to 1.
• The approximate state-value function is $V(s;\theta)=\sum_{a}\pi(a\mid s;\theta)\cdot Q_{\pi}(s,a)$, and policy-based learning finds $\theta$ that maximizes $J(\theta)=E_{S}[V(S;\theta)]$ by gradient ascent.
• The policy gradient $\frac{\partial V(s;\theta)}{\partial\theta}$ has two forms: Form 1 sums over all actions (practical only for small discrete action spaces), while Form 2 is an expectation that can be estimated by Monte Carlo sampling and also works for continuous actions.
