Reinforcement Learning


1 Introduction

 In the machine learning area, there are three types of learning: supervised learning, reinforcement learning, and unsupervised learning. The differences between them are described below:

 1) Supervised Learning: In supervised learning, a teacher provides a desired response (output) for a given situation (input), and the learner is supposed to learn a mapping from each situation to its best response.

2) Reinforcement Learning (RL): In RL, the learner observes a reward for taking an action in a given situation (state). Based on these rewards, the learner is supposed to learn the best action for each situation.

3) Unsupervised Learning: In unsupervised learning, the learner tries to discover patterns in data representing different situations (inputs), with no teacher and no rewards.

2 The model of the environment

  The environment is modeled as a Markov Decision Process (MDP), which consists of the following parts:

 1) States: S = {s1, s2, ..., sn}

 2) Actions: A = {a1, a2, ..., am}

 3) Immediate reward: R(s)

 4) State-transition probability: Pr(s'|s, a)

 Our goal is to find an optimal policy, where a policy π is a mapping from states to actions: a = π(s).
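 For concreteness, here is a minimal sketch of how such an MDP could be represented in Python; the states, actions, rewards, and probabilities are made-up illustrative values, not taken from any particular problem.

# A minimal MDP representation (all values are illustrative).
states = ["s1", "s2", "s3"]
actions = ["left", "right"]

# Immediate reward R(s) for each state.
R = {"s1": 0.0, "s2": 0.0, "s3": 1.0}

# State-transition probabilities Pr(s'|s, a): P[s][a] maps s' -> probability.
P = {
    "s1": {"left": {"s1": 1.0}, "right": {"s2": 1.0}},
    "s2": {"left": {"s1": 1.0}, "right": {"s3": 1.0}},
    "s3": {"left": {"s2": 1.0}, "right": {"s3": 1.0}},
}

# A policy maps each state to an action: a = pi(s).
pi = {"s1": "right", "s2": "right", "s3": "right"}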

3 How to find a policy in an environment

 Our goal is to find an optimal policy given a start state and a goal state in an environment.

RL methods come in two flavors: in the first, the agent knows the environment fully, i.e., the MDP (states, actions, rewards, and transition probabilities) is given; in the second, the agent knows the environment only partially.

3.1 The environment (model) is fully known to the agent

 1) Value iteration

Strength: it can find the optimal policy.

Weakness: convergence can be slow and take many iterations, and without a discount factor it may not converge in some situations.
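 A minimal sketch of value iteration in Python, assuming the dictionary-based MDP representation from section 2 (states, actions, R, P) and a discount factor gamma; the Bellman update is V(s) <- R(s) + gamma * max_a sum_{s'} Pr(s'|s, a) V(s').

def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-6):
    # Start from all-zero value estimates.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: best expected next-state value over actions.
            best = max(sum(p * V[s2] for s2, p in P[s][a].items())
                       for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:   # stop when no state value moves by more than tol
            break
    # Extract the greedy policy from the converged values.
    pi = {s: max(actions,
                 key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in states}
    return V, pi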


 2) Policy iteration


Modified policy iteration:

  Because value iteration can take a long time to converge, we can run only a few evaluation sweeps per improvement step and then act greedily on the resulting values, as in the sketch below.
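 A minimal sketch of (modified) policy iteration under the same assumed MDP representation; keeping eval_sweeps at a small fixed number gives the modified variant, while iterating evaluation to convergence would give classic policy iteration.

def policy_iteration(states, actions, R, P, gamma=0.9, eval_sweeps=20):
    pi = {s: actions[0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: a fixed number of sweeps (the "modified" part).
        for _ in range(eval_sweeps):
            for s in states:
                V[s] = R[s] + gamma * sum(
                    p * V[s2] for s2, p in P[s][pi[s]].items())
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best_a = max(actions,
                         key=lambda a: sum(p * V[s2]
                                           for s2, p in P[s][a].items()))
            if best_a != pi[s]:
                pi[s] = best_a
                stable = False
        if stable:          # policy unchanged, so we are done
            return V, pi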



3.2 The environment (model) is only partially known to the agent

 1) Q-Learning


   Q-learning is model-free: we do not need a model of the environment; we learn directly from the rewards the environment returns.

Strength: it does not need to know the transition probabilities Pr(s'|s, a) or the reward function.

Weaknesses:

1) it may fail to find the optimal policy;

2) it can take a long time to converge;

3) it does not scale to problems with many states.

Note: we can use an epsilon-greedy strategy for state s, taking the greedy action most of the time and a random action occasionally, so that the agent keeps exploring.
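 A minimal tabular Q-learning sketch with epsilon-greedy action selection; env is a hypothetical environment whose reset() returns a start state and whose step(a) returns (next_state, reward, done), so this interface is an assumption for illustration, not a fixed API.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                 # Q[(s, a)], defaults to 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly the greedy action, sometimes random.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s2, r, done = env.step(a)      # assumed environment interface
            # Update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q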


 2) Prioritized Sweeping

   Prioritized sweeping learns a model of the environment (the transition probabilities Pr and the rewards R) and prioritizes value updates for the states whose estimates would change the most.
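 A rough sketch of the idea in the same tabular setting, assuming for simplicity a deterministic learned model (each (s, a) maps to one observed reward and next state); real prioritized sweeping would keep counts to estimate Pr and R. All names here are illustrative.

import heapq

def sweep(s, a, r, s2, Q, model, actions, gamma=0.9, theta=1e-4, n_updates=5):
    # Q is a defaultdict(float) as in the Q-learning sketch above.
    # Record the observed transition in the learned model.
    model[(s, a)] = (r, s2)
    # Priority = magnitude of the would-be change to Q(s, a).
    p = abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
    pq = [(-p, s, a)]                      # max-heap via negated priorities
    for _ in range(n_updates):
        if not pq:
            break
        _, s, a = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        Q[(s, a)] = r + gamma * max(Q[(s2, b)] for b in actions)
        # Re-queue predecessors of s whose values would change noticeably.
        for (ps, pa), (pr, ps2) in model.items():
            if ps2 == s:
                p = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
                if p > theta:
                    heapq.heappush(pq, (-p, ps, pa))
    return Q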


 

