强化学习note1导论

最新推荐文章于 2023-09-11 13:02:17 发布

Loiser1

最新推荐文章于 2023-09-11 13:02:17 发布

阅读量181

点赞数

分类专栏：强化学习文章标签：强化学习

本文链接：https://blog.csdn.net/Loiser1/article/details/109179420

版权

强化学习专栏收录该内容

2 篇文章 0 订阅

订阅专栏

Textbook:Sutton and Barton reinforcement learning

周博磊老师中文课

coding

架构:Pytorch

在这里插入图片描述

与supervised learning 的区别:监督学习:1.假设数据之间无关联i.i.d. 2.有label
强化学习:不一定i.i.d;没有立刻feed back(delay reward)

exploration(采取新行为)&exploitation(采取目前最好的行为)

feature:

Trial-and-error exploration
Delay reward
time matters(sequential ,non i.i.d)
Agent’s actions affect the subsequential data it recieves

compared with supervised learning,reinforcement learaning can sometimes surpass the behavior of human

possible rollout sequence

agent&environment

rewards:scalar feedback

sequential decision making:

近期与远期奖励的trade off

full observation&partial observation

RL Agent:

component:

1.policy:agent’s behavior function

from state/observation to action

stochastic policy:Probabilistic sample: $\pi(a|s)=P[A_t=a|S_t=s]$

deterministic policy: $a^*=arg\underset{a}{max}\,\pi(a|s)$

2.value function:

expected discounted sum of future rewards under a particular policy $\pi$

discount factor weights immediate vs future rewards

used to quantify goodness/badness of states and actions
$v_{\pi}(s)\overset{\triangle}=E_\pi[G_t]=E_\pi[\sum_{k=0}\gamma^kR_{t+k+1}|S_t=s]$
Q-function(use to select among actions):
$q_\pi(s,a)\overset{\triangle}=E_\pi[G_T|S_t,A_t=a]$

5.Model

A model predicts what the environment will do next

Types of RL Agents based on what the Agent Learns

Value-based agent

显示学习价值函数、隐式学习策略

Policy-based agent

显示学习policy、no value function

Actor-critic agent

结合policy and value function

Types of RL Agents based on what the Agent Learns

Model-based

直接学习model(环境转移)

Model-free

直接学习value function/policy function

No model
在这里插入图片描述

Exploration and Exploitation

exploration:进行试错

exploitation:选择已知情况下的最优

import gym

#这个锤子在python里面跑，一跑就卡...
env=gym.make('CartPole-v0')
env.reset()
env.render()
action=env.action_space.sample()
observation,reward,done,info=env.step(action)

exmaple

next calss:Markov决策过程

Loiser1

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
强化学习note1导论

Textbook:Sutton and Barton reinforcement learning周博磊老师中文课coding架构:Pytorch与supervised learning 的区别:监督学习:1.假设数据之间无关联i.i.d. 2.有label强化学习:不一定i.i.d;没有立刻feed back(delay reward) explorati
复制链接

扫一扫

专栏目录