Reinforcement Learning Note 1: Introduction

Textbook: Sutton and Barto, Reinforcement Learning: An Introduction

Bolei Zhou's Chinese lecture course

Coding

Framework: PyTorch


Differences from supervised learning. Supervised learning: 1. assumes the data are independent and identically distributed (i.i.d.); 2. has labels.
Reinforcement learning: data are not necessarily i.i.d.; there is no immediate feedback (delayed reward).

Exploration (trying new behaviors) & exploitation (taking the best behavior known so far)

Features:

  • Trial-and-error exploration
  • Delayed reward
  • Time matters (sequential, non-i.i.d. data)
  • The agent's actions affect the subsequent data it receives

Compared with supervised learning, reinforcement learning can sometimes surpass human-level behavior.

Possible rollout sequences: the agent can generate many different trajectories from the same starting state.

Agent & environment

Rewards: scalar feedback signals

Sequential decision making:

the trade-off between near-term and long-term rewards

Full observation & partial observation

RL Agent
Components:
1. Policy: the agent's behavior function

A map from state/observation to action.

Stochastic policy (probabilistic sampling): $\pi(a|s)=P[A_t=a\mid S_t=s]$

Deterministic policy: $a^*=\arg\max_a \pi(a|s)$
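
To make the two policy types concrete, here is a minimal sketch of my own (not from the course; the tabular policy and state index are hypothetical):

```python
import numpy as np

# A tabular policy over 3 states and 2 actions; each row pi[s] is a distribution over actions.
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

s = 1  # current state (hypothetical index)

# Stochastic policy: sample an action a ~ pi(.|s)
a_stochastic = np.random.choice(len(pi[s]), p=pi[s])

# Deterministic policy: a* = argmax_a pi(a|s)
a_deterministic = np.argmax(pi[s])

print(a_stochastic, a_deterministic)
```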

2. Value function

Expected discounted sum of future rewards under a particular policy $\pi$.

The discount factor weights immediate vs. future rewards.

Used to quantify the goodness/badness of states and actions:
$$v_{\pi}(s)\overset{\triangle}{=}E_\pi[G_t]=E_\pi\Big[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\,\Big|\,S_t=s\Big]$$
Q-function (used to select among actions):
$$q_\pi(s,a)\overset{\triangle}{=}E_\pi[G_t\mid S_t=s,A_t=a]$$
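
To make the discount concrete, here is a small sketch (my own illustration, not from the course) computing the discounted return $G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ for a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G_t = sum_k gamma^k * R_{t+k+1} over a finite episode."""
    g = 0.0
    # iterate backwards so that G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```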

3. Model

A model predicts what the environment will do next (state transitions and rewards).
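
As a rough illustration (assumed tabular setting, not from the lecture), a model can be estimated by counting observed transitions and averaging observed rewards:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next] = visit count
reward_sum = defaultdict(float)                  # total reward observed for (s, a)
visits = defaultdict(int)                        # number of times (s, a) was taken

def update_model(s, a, r, s_next):
    """Record one observed transition (s, a, r, s_next)."""
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def predict(s, a):
    """Return (estimated transition distribution over s_next, estimated mean reward)."""
    n = visits[(s, a)]
    p = {s_next: c / n for s_next, c in counts[(s, a)].items()}
    r_hat = reward_sum[(s, a)] / n
    return p, r_hat
```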

Types of RL Agents based on what the Agent Learns
Value-based agent

Explicitly learns a value function; the policy is implicit (derived from the value function).

Policy-based agent

Explicitly learns a policy; no value function.

Actor-critic agent

Combines a policy and a value function.

Types of RL Agents based on whether there is a Model
Model-based

Learns a model of the environment (the state transitions) directly.

Model-free

Learns a value function / policy function directly; no model of the environment.

Exploration and Exploitation

Exploration: trial and error, trying actions whose outcomes are unknown.

Exploitation: choosing the best-known action under current knowledge.
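
A common way to balance the two is ε-greedy action selection. Below is a minimal sketch of my own (not from the lecture; the Q-value list is a hypothetical placeholder):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation

print(epsilon_greedy([0.2, 0.8, 0.5]))  # usually 1 (greedy), occasionally a random action
```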

import gym

# Note: calling env.render() in a plain Python shell may freeze/hang the session.
env = gym.make('CartPole-v0')        # create the CartPole environment
env.reset()                          # reset the environment to an initial observation
env.render()                         # draw the current frame (opens a window)
action = env.action_space.sample()   # sample a random action from the action space
observation, reward, done, info = env.step(action)  # take one step in the environment
env.close()                          # close the render window

Example:
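
Building on the snippet above, a full-episode loop might look like this (a minimal sketch assuming the old gym step API that returns four values):

```python
import gym

env = gym.make('CartPole-v0')
observation = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()               # random policy, just for illustration
    observation, reward, done, info = env.step(action)
    total_reward += reward

env.close()
print('episode return:', total_reward)
```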

Next class: Markov Decision Processes (MDPs)
