You can take the course here on Udacity; it is a fairly easy-to-follow tutorial and a more interesting, less painful introduction than just reading Sutton's book on its own.
(Sutton's book is very detailed, but honestly tiring to read; it works well as a companion reference.)
Markov Decision Processes
- Markov property means only the present matters.
- The rules are stationary.
Component | Notation | Description |
---|---|---|
STATE | S | the set of states |
MODEL | T(s, a, s') ~ Pr(s' \| s, a) | the transition model: the probability of landing in s' after executing action a in state s |
ACTION | A(s), A | the actions available in a given state, or a fixed set of actions |
REWARD | R(s), R(s, a), R(s, a, s') | a scalar value received for being in a state / for being in a state and taking an action / for being in a state, taking an action, and ending up in another state |
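To make the table concrete, here is a tiny, hypothetical MDP written out as plain Python data; the state names, probabilities, and rewards are invented purely for illustration (a sketch, not anything from the course):

```python
# A toy MDP matching the table above: STATES, ACTIONS, a MODEL T, and a REWARD R.
# All names and numbers are made up for illustration.

STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

# MODEL  T(s, a, s') ~ Pr(s' | s, a): state -> action -> {next state: probability}
T = {
    "s0":   {"left":  {"s0": 1.0},
             "right": {"s1": 0.8, "s0": 0.2}},
    "s1":   {"left":  {"s0": 0.8, "s1": 0.2},
             "right": {"goal": 0.8, "s1": 0.2}},
    "goal": {"left":  {"goal": 1.0},          # absorbing state
             "right": {"goal": 1.0}},
}

# REWARD  R(s): a scalar for being in a state
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}

# sanity check: each Pr(. | s, a) should sum to 1
for s in STATES:
    for a in ACTIONS:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```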
Solution to MDP
Component | Notation | Description |
---|---|---|
POLICY | π(s) ~ a | a function that takes in a state and returns an action |
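A policy can be just as plain: below is a hypothetical mapping from states to actions, plus a short rollout that samples next states from a transition model. Everything here (states, probabilities, step count) is made up for illustration:

```python
import random

# A policy is just a mapping  state -> action  (π(s) ~ a).
T = {  # hypothetical two-state chain: state -> action -> {next state: probability}
    "s0":   {"go": {"goal": 0.9, "s0": 0.1}},
    "goal": {"go": {"goal": 1.0}},
}
policy = {"s0": "go", "goal": "go"}

def rollout(start, steps=5):
    """Follow the policy for a few steps, sampling next states from T."""
    s, path = start, [start]
    for _ in range(steps):
        a = policy[s]                              # π(s) returns an action
        nxt, probs = zip(*T[s][a].items())
        s = random.choices(nxt, weights=probs)[0]  # sample s' ~ Pr(. | s, a)
        path.append(s)
    return path

print(rollout("s0"))
```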
Stationarity of Preferences
U(·) stands for the utility of the sequence of rewards received for visiting states S0, S1, S2, …
if
U(S0, S1, S2, …) > U(S0, S1’, S2’, …)
then
U(S1, S2, …) > U(S1’, S2’, …)
With the second definition of utility, the discounted sum, the total is bounded by Rmax/(1-γ): a finite value captures the effect of an infinite sequence (infinitely many steps are taken, yet the total never reaches the bound, a bit like a singularity).
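Spelled out, using the discounted-sum definition of utility with discount factor 0 ≤ γ < 1 and Rmax the largest reward obtainable in any single state:

```latex
U(s_0, s_1, s_2, \dots)
  = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1
```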
Optimal Policy
The utility of starting from some state is not about that state yielding the most reward right away (immediate); it is about that state leading to the most utility in the long run (long term), i.e. delayed reward.
The optimal policy, then, is the policy that, starting from some state (usually the initial state), returns for every subsequent state the action that maximizes the expected utility.
At each state you look for the action that maximizes the expected utility of what follows; that action takes you into the next state s'. The utility of a state is therefore the reward you get for being in that state plus the discounted sum of all the rewards you will collect afterwards. This is the Bellman equation (written out below).
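Written out in the notation of the tables above (using the R(s) form of the reward), together with the optimal policy it induces:

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
\qquad
\pi^{*}(s) = \operatorname*{arg\,max}_{a} \sum_{s'} T(s, a, s')\, U(s')
```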
So in practice the utility of every state comes from the utility of the positive-reward states propagating outwards.
The Bellman equation above is not linear (because of the max), so the system of equations cannot be solved directly; rewritten as an update rule that is applied repeatedly (value iteration, sketched below), it can be, and the updates are guaranteed to converge in the end.
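A minimal sketch of that iterative update, value iteration, run on the same kind of toy MDP as in the earlier sketches; γ, the stopping threshold, and all the numbers are hypothetical:

```python
# Value iteration: repeatedly apply  U(s) <- R(s) + γ * max_a Σ_s' T(s,a,s') U(s')
# until the utilities stop changing.  The MDP below is the hypothetical one
# from the first sketch.

GAMMA = 0.9
STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]
T = {
    "s0":   {"left":  {"s0": 1.0},
             "right": {"s1": 0.8, "s0": 0.2}},
    "s1":   {"left":  {"s0": 0.8, "s1": 0.2},
             "right": {"goal": 0.8, "s1": 0.2}},
    "goal": {"left":  {"goal": 1.0}, "right": {"goal": 1.0}},
}
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}

def value_iteration(theta=1e-6):
    U = {s: 0.0 for s in STATES}                 # arbitrary initial utilities
    while True:
        delta = 0.0
        for s in STATES:
            # expected utility of each action: Σ_s' T(s,a,s') U(s')
            q = [sum(p * U[s2] for s2, p in T[s][a].items()) for a in ACTIONS]
            new_u = R[s] + GAMMA * max(q)        # Bellman update
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < theta:                        # utilities have (numerically) converged
            return U

def greedy_policy(U):
    """Extract π*(s) = argmax_a Σ_s' T(s,a,s') U(s')."""
    return {s: max(ACTIONS,
                   key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
            for s in STATES}

U = value_iteration()
print(U)                 # utilities propagate outwards from the positive-reward state
print(greedy_policy(U))  # e.g. "right" everywhere except the absorbing goal
```

The stopping test here is one common choice: iterate until the largest change in any U(s) falls below a small threshold, at which point the utilities are effectively the fixed point of the Bellman update.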