Machine Learning Week 10 (Andrew Ng)

Reinforcement learning

1. Reinforcement learning introduction

1.1. What is Reinforcement Learning?

The key idea: rather than telling the algorithm the right output y for every single input, you only have to specify a reward function that tells it when it is doing well and when it is doing poorly.

1.2. Mars rover example
1.3. The return in Reinforcement learning

The first reward is discounted by $\gamma^0 = 1$: the return is $R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots$.
Select the direction to move by comparing the returns in the first two tables (always-left vs. always-right).
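The return described above can be sketched in a few lines. This is a minimal example using the lecture's Mars rover setup (reward 100 at the far left, γ = 0.5, starting in state 4 and always going left):

```python
def discounted_return(rewards, gamma):
    """Return = R1 + gamma*R2 + gamma^2*R3 + ... (first reward undiscounted)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Mars rover: start in state 4, always go left, gamma = 0.5.
# Rewards collected along the way: 0, 0, 0, then 100 at the terminal state.
print(discounted_return([0, 0, 0, 100], gamma=0.5))  # 12.5
```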

1.4. Making decisions: Policies in reinforcement learning

For example, $\pi(2)$ is left while $\pi(5)$ is right. The number in parentheses is the state.
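A policy is just a mapping from state to action. A sketch for the six-state Mars rover (the dict below is one particular policy, matching the example with $\pi(2)$ = left and $\pi(5)$ = right; states 1 and 6 are terminal, so they need no action):

```python
# pi maps each non-terminal state to the action taken there.
pi = {2: "left", 3: "left", 4: "left", 5: "right"}

def act(state):
    """Follow the policy: return the action pi(state)."""
    return pi[state]

print(act(2), act(5))  # left right
```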

1.5. Review of key concepts


2. State-action value function

2.1. State-action value function definition

These values are computed iteratively.

2.2. State-action value function example
2.3. Bellman Equation

$Q(s,a) = R(s) + \gamma \max_{a'} Q(s', a')$
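Applying the Bellman equation repeatedly (value iteration) recovers the Q-values from the lecture. A minimal sketch for the six-state Mars rover, 0-indexed here (state 2 in the lecture is index 1), with rewards 100 and 40 at the two ends and γ = 0.5:

```python
# Value-iteration sketch for the six-state Mars rover.
R = [100, 0, 0, 0, 0, 40]   # rewards; only the two end states pay off
terminal = {0, 5}
gamma = 0.5

Q = {(s, a): 0.0 for s in range(6) for a in ("left", "right")}
for _ in range(100):  # repeat the Bellman backup until values settle
    for s in range(6):
        for a in ("left", "right"):
            if s in terminal:
                Q[(s, a)] = R[s]
            else:
                s2 = s - 1 if a == "left" else s + 1
                Q[(s, a)] = R[s] + gamma * max(Q[(s2, "left")],
                                               Q[(s2, "right")])

print(Q[(1, "left")], Q[(4, "right")])  # 50.0 20.0
```

These match the lecture's table: going left from state 2 is worth 0.5 × 100 = 50, and going right from state 5 is worth 0.5 × 40 = 20.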

2.4. Random (stochastic) environment

Sometimes the rover accidentally slips and ends up moving in the opposite direction from the one commanded.
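A stochastic step function can model this slipping. The misstep probability below (commanded action succeeds with probability 0.9) follows the lecture's example; the function name is illustrative:

```python
import random

def step(state, action, p=0.9, rng=random):
    """Move from state: the commanded action succeeds with probability p,
    otherwise the rover slips and moves the opposite way."""
    if rng.random() >= p:  # misstep: flip the action
        action = "right" if action == "left" else "left"
    return state - 1 if action == "left" else state + 1
```

With p = 1.0 the environment is deterministic again, which is a handy sanity check.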

3. Continuous state spaces

3.1. Example of continuous state space applications

Every state variable is continuous: the state is a vector of real numbers (e.g., position, velocity, orientation) rather than one of a small set of discrete states.

3.2. Lunar lander


3.3. Learning the state-value function

Q is initialized randomly at first; training the neural network on Bellman-equation targets gradually produces a better estimate of Q.
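The supervised targets for that training step can be sketched as follows. This is a minimal illustration, assuming γ = 0.995 as in the lunar-lander lab; `build_targets` and its inputs are illustrative names, and `q_values_next` stands in for the current network's output on the next states:

```python
import numpy as np

GAMMA = 0.995  # assumed discount factor (lunar-lander value)

def build_targets(experiences, q_values_next):
    """y = R(s) + gamma * max_a' Q(s', a'); for a terminal s', y = R(s).

    experiences: list of (reward, done) pairs from stored tuples (s, a, R(s), s')
    q_values_next: array of shape (batch, num_actions) with Q(s', .)
    """
    rewards = np.array([r for r, _ in experiences], dtype=float)
    done = np.array([d for _, d in experiences], dtype=bool)
    return rewards + GAMMA * q_values_next.max(axis=1) * (~done)

# Two transitions: one mid-episode, one terminal.
y = build_targets([(1.0, False), (5.0, True)],
                  np.array([[2.0, 3.0], [9.0, 9.0]]))
print(y)  # [1 + 0.995*3, 5.0]
```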

3.4. Algorithm refinement: Improved neural network architecture


3.5. Algorithm refinement: ε-greedy policy

With ε = 0.05, we pick an action at random 5% of the time (exploration) and the action that maximizes Q the other 95% of the time (exploitation). If we choose a bad ε, learning may take 100 times as long.
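The ε-greedy rule itself is short. A minimal sketch (the function name is illustrative; `q_values` holds the current Q-estimates for each action in the current state):

```python
import random

def epsilon_greedy(q_values, epsilon=0.05, rng=random):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the largest Q-value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting epsilon=0 makes the rule purely greedy, which is useful for evaluating a trained agent.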

3.6. Algorithm refinement: Mini-batch and soft update

The idea of mini-batch gradient descent is to not use all 100 million training examples on every single iteration through this loop. Instead, we pick a smaller number, call it m′ — say, 1,000 — and on every step use a subset of m′ examples rather than all 100 million.
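Sampling such a mini-batch from the replay buffer can be sketched as (the function and buffer names are illustrative):

```python
import random

def sample_minibatch(replay_buffer, m_prime=1000, rng=random):
    """Pick m' examples (without replacement) instead of the whole buffer
    for a single gradient step."""
    k = min(m_prime, len(replay_buffer))
    return rng.sample(replay_buffer, k)

buffer = list(range(10_000))
batch = sample_minibatch(buffer, m_prime=1000)
print(len(batch))  # 1000
```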

  • Soft update
    Setting Q equal to $Q_{new}$ directly can make a very abrupt change to Q, so we blend the parameters instead:
    $W = 0.01\,W_{new} + 0.99\,W$
    $B = 0.01\,B_{new} + 0.99\,B$
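The blend above can be written as one line per parameter. A minimal sketch, treating the parameters as flat lists of numbers (a real network would apply this per weight tensor):

```python
TAU = 0.01  # soft-update rate from the lecture: W = 0.01*W_new + 0.99*W

def soft_update(old, new, tau=TAU):
    """Blend the new parameters into the old ones instead of replacing them,
    so Q changes gradually rather than abruptly."""
    return [tau * n + (1 - tau) * o for o, n in zip(old, new)]

print(soft_update([1.0, 0.0], [0.0, 1.0]))  # ≈ [0.99, 0.01]
```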
3.7. The state of reinforcement learning


Summary

