Reinforcement Learning (Value Function Approximation) - Today 9

Value function approximation uses a parameterized function (e.g. a neural network) to solve for optimal values. This note covers: the algorithm for state value estimation, Sarsa combined with value function approximation, Q-learning combined with value function approximation, and Deep Q-learning.

Tabular methods cannot handle very large or continuous state spaces, so we introduce function fitting to generalize across the state space: instead of visiting every state, visiting only some states updates the value estimate everywhere.

① Algorithm for state value estimation, in three steps: choose the objective function, optimize it, and choose the form of \hat{v}(s,w)

(1) Objective function

We want the estimate \hat{v}(s,w) to approximate the true value v_{\pi }(s), so we search for the optimal w by defining and minimizing the objective J(w)=E[(v_{\pi }(S)-\hat{v}(S,w))^{2}]=||\hat{v}-v_{\pi }||^{2}_{D}

The expectation can be taken under different probability distributions over the states, for example:

① the uniform distribution:

J(w)=E[(v_{\pi }(S)-\hat{v}(S,w))^{2}]=\frac{1}{|S|}\sum _{s\in S}(v_{\pi }(s)-\hat{v}(s,w))^{2}

② the stationary distribution:

J(w)=E[(v_{\pi }(S)-\hat{v}(S,w))^{2}]=\sum _{s\in S}d_{\pi }(s)(v_{\pi }(s)-\hat{v}(s,w))^{2}

where d_{\pi }(s) is the probability of state s, with d_{\pi }(s)\geq 0 and \sum _{s\in S}d_{\pi }(s)=1
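As a quick numeric check of the two objectives above, here is a tiny sketch on a hypothetical 3-state problem; the values, estimates, and distribution are all made up for illustration:

```python
import numpy as np

# Hypothetical values for a 3-state problem (illustrative, not from the text).
v_true = np.array([1.0, 2.0, 3.0])   # v_pi(s)
v_hat = np.array([1.1, 1.8, 3.2])    # current approximation v-hat(s, w)
d_pi = np.array([0.5, 0.3, 0.2])     # stationary distribution, sums to 1

sq_err = (v_true - v_hat) ** 2

J_uniform = sq_err.mean()             # (1/|S|) sum of squared errors
J_stationary = (d_pi * sq_err).sum()  # d_pi-weighted squared errors

print(J_uniform, J_stationary)
```

The stationary-distribution objective weights errors by how often each state is visited, so a frequently visited state contributes more than under the uniform choice.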

(2) Optimization algorithm

① Solve min_{w} J(w) with gradient descent:

w_{k+1}=w_{k}-\alpha _{k}\nabla _{w}J(w_{k})

\nabla _{w}J(w)=\nabla _{w}E[(v_{\pi }(S)-\hat{v}(S,w))^{2}]

=E[\nabla _{w}(v_{\pi }(S)-\hat{v}(S,w))^{2}]

=-2E[(v_{\pi }(S)-\hat{v}(S,w))\nabla _{w}\hat{v}(S,w)]

② Use the stochastic gradient:

Replacing the expectation with a single sample gives the stochastic gradient update below. That update still contains v_{\pi }(s_{t}), which is unknown, so it must be approximated: Monte Carlo substitutes the discounted return g_{t}, while TD learning substitutes the bootstrap target r_{t+1}+\gamma \hat{v}(s_{t+1},w_{t}):

w_{t+1}=w_{t}+\alpha _{t}[v_{\pi }(s_{t})-\hat{v}(s_{t},w_{t})]\nabla _{w}\hat{v}(s_{t},w_{t})

w_{t+1}=w_{t}+\alpha _{t}[g_{t}-\hat{v}(s_{t},w_{t})]\nabla _{w}\hat{v}(s_{t},w_{t})

w_{t+1}=w_{t}+\alpha _{t}[r_{t+1}+\gamma \hat{v}(s_{t+1},w_{t})-\hat{v}(s_{t},w_{t})]\nabla _{w}\hat{v}(s_{t},w_{t})

(3) Choosing \hat{v}(s,w)

[1] linear function

\hat{v}(s,w)=\phi ^{T}(s)w, where \phi (s) is the feature vector of state s; \hat{v}(s,w) is linear in w, so \nabla _{w}\hat{v}(s,w)=\phi (s)

w_{t+1}=w_{t}+\alpha _{t}[r_{t+1}+\gamma \phi ^{T}(s_{t+1})w_{t}-\phi ^{T}(s_{t})w_{t}]\phi (s_{t})
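The linear TD update above can be sketched in a few lines. The 2-state chain, one-hot features, and constants below are hypothetical, chosen only to show one update step:

```python
import numpy as np

# One-hot features phi(s) for a hypothetical 2-state chain.
phi = {0: np.array([1.0, 0.0]),
       1: np.array([0.0, 1.0])}

alpha, gamma = 0.1, 0.9   # illustrative step size and discount

def td_update(w, s, r, s_next):
    """w <- w + alpha [r + gamma phi(s')^T w - phi(s)^T w] phi(s)"""
    td_error = r + gamma * phi[s_next] @ w - phi[s] @ w
    return w + alpha * td_error * phi[s]

w = np.zeros(2)
# One observed transition s=0 -> s'=1 with reward 1.
w = td_update(w, s=0, r=1.0, s_next=1)
print(w)
```

Only the component of w touched by phi(s) moves, which is exactly how a single visited state updates the whole approximator when features overlap.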

[2] nonlinear function

Use a neural network as the approximator.

② Sarsa combined with value function approximation

Given a policy \pi _{t}(s_{t}), generate r_{t+1} and s_{t+1}, then sample a_{t+1}; this is simply the state value algorithm above with the state value replaced by the action value.

(1) Value update

w_{t+1}=w_{t}+\alpha _{t}[r_{t+1}+\gamma \hat{q}(s_{t+1},a_{t+1},w_{t})-\hat{q}(s_{t},a_{t},w_{t})]\nabla _{w}\hat{q}(s_{t},a_{t},w_{t})

(2) Policy update (\varepsilon -greedy)

\pi _{t+1}(a|s_{t})=1-\frac{\varepsilon }{|A|}(|A|-1), if a=argmax_{a'\in A(s_{t})}\hat{q}(s_{t},a',w_{t+1})

\pi _{t+1}(a|s_{t})=\frac{\varepsilon }{|A|}, otherwise
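The value update and \varepsilon-greedy policy update above can be sketched together. The tiny 2-state, 2-action setup, one-hot features, and constants are all hypothetical:

```python
import numpy as np

n_s, n_a = 2, 2   # illustrative sizes

def phi(s, a):
    """One-hot feature vector for the (s, a) pair."""
    f = np.zeros(n_s * n_a)
    f[s * n_a + a] = 1.0
    return f

alpha, gamma, eps = 0.1, 0.9, 0.1

def sarsa_step(w, s, a, r, s_next, a_next):
    """w <- w + alpha [r + gamma q(s',a',w) - q(s,a,w)] grad q(s,a,w)"""
    target = r + gamma * phi(s_next, a_next) @ w
    return w + alpha * (target - phi(s, a) @ w) * phi(s, a)

def eps_greedy(w, s):
    """pi(a|s) = 1 - (eps/|A|)(|A|-1) on the greedy action, eps/|A| elsewhere."""
    q = np.array([phi(s, b) @ w for b in range(n_a)])
    probs = np.full(n_a, eps / n_a)
    probs[q.argmax()] = 1 - eps / n_a * (n_a - 1)
    return probs

w = np.zeros(n_s * n_a)
w = sarsa_step(w, s=0, a=1, r=1.0, s_next=1, a_next=0)
probs = eps_greedy(w, 0)   # most probability mass on the reinforced action
print(probs)
```

Note that the policy keeps eps/|A| probability on every action, so exploration never stops entirely.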

③ Q-learning combined with value function approximation

Given a policy \pi _{t}(s_{t}), generate r_{t+1} and s_{t+1}; this is simply Sarsa with the Bellman equation target replaced by the Bellman optimality target.

(1) Value update

w_{t+1}=w_{t}+\alpha _{t}[r_{t+1}+\gamma max_{a\in A(s_{t+1})}\hat{q}(s_{t+1},a,w_{t})-\hat{q}(s_{t},a_{t},w_{t})]\nabla _{w}\hat{q}(s_{t},a_{t},w_{t})

(2) Policy update (\varepsilon -greedy)

\pi _{t+1}(a|s_{t})=1-\frac{\varepsilon }{|A|}(|A|-1), if a=argmax_{a'\in A(s_{t})}\hat{q}(s_{t},a',w_{t+1})

\pi _{t+1}(a|s_{t})=\frac{\varepsilon }{|A|}, otherwise
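A single Q-learning step differs from the Sarsa step only in the target, which maximizes over next actions. Again the small linear setup (one-hot features, sizes, constants) is hypothetical:

```python
import numpy as np

n_s, n_a = 2, 2   # illustrative sizes

def phi(s, a):
    """One-hot feature vector for the (s, a) pair."""
    f = np.zeros(n_s * n_a)
    f[s * n_a + a] = 1.0
    return f

alpha, gamma = 0.1, 0.9

def q_learning_step(w, s, a, r, s_next):
    """w <- w + alpha [r + gamma max_a' q(s',a',w) - q(s,a,w)] grad q(s,a,w)"""
    q_next = max(phi(s_next, b) @ w for b in range(n_a))
    return w + alpha * (r + gamma * q_next - phi(s, a) @ w) * phi(s, a)

w = np.zeros(n_s * n_a)
w[1 * n_a + 0] = 2.0   # pretend q(s'=1, a=0) is already estimated as 2
w = q_learning_step(w, s=0, a=0, r=1.0, s_next=1)
print(w[0])            # moved toward the target 1 + 0.9 * 2 = 2.8
```

Because the target takes the max rather than the action actually executed, the update is off-policy: it learns about the greedy policy regardless of how behavior was generated.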

④ Deep Q-learning (DQN)

Deep neural networks + reinforcement learning:

J(w)=E[(R+\gamma max_{a\in A(S')}\hat{q}(S',a,w)-\hat{q}(S,A,w))^{2}]

where the TD target is y=R+\gamma max_{a\in A(S')}\hat{q}(S',a,w)

Use a main network \hat{q}(s,a,w) and a target network \hat{q}(s,a,w_{T}).

Hold w_{T} fixed, compute the gradient with respect to w only, and periodically copy the learned w into w_{T}:

\nabla _{w}J(w)=E[(R+\gamma max_{a\in A(S')}\hat{q}(S',a,w_{T})-\hat{q}(S,A,w))\nabla _{w}\hat{q}(S,A,w)] (the constant factor -2 is absorbed into the step size)
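The two-network scheme above can be sketched with a linear model standing in for the deep network; the class name, sizes, and constants below are all illustrative, not from the original text:

```python
import numpy as np

class TinyDQN:
    """Minimal main/target-network sketch with a linear 'network'."""

    def __init__(self, n_feat, n_a, alpha=0.01, gamma=0.99, sync_every=100):
        self.w_main = np.zeros((n_a, n_feat))   # main network parameters w
        self.w_target = self.w_main.copy()      # target network parameters w_T
        self.alpha, self.gamma = alpha, gamma
        self.sync_every = sync_every
        self.steps = 0

    def q(self, w, s):
        """q-hat(s, ., w) for all actions at once."""
        return w @ s

    def update(self, s, a, r, s_next):
        # y = R + gamma max_a' q(S', a', w_T): w_T is frozen inside the target.
        y = r + self.gamma * self.q(self.w_target, s_next).max()
        td_error = y - self.q(self.w_main, s)[a]
        self.w_main[a] += self.alpha * td_error * s   # gradient step on w only
        self.steps += 1
        if self.steps % self.sync_every == 0:         # copy w into w_T
            self.w_target = self.w_main.copy()

agent = TinyDQN(n_feat=3, n_a=2)
s = np.array([1.0, 0.0, 0.0])
agent.update(s, a=0, r=1.0, s_next=s)   # w_main moves; w_T stays until sync
```

Freezing w_T between syncs keeps the regression target stable while w is being trained, which is the point of the two-network design. (The full DQN recipe also uses experience replay, which this sketch omits.)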
