Machine Learning (Tom M. Mitchell) Reading Notes (14): Chapter 13

1. Introduction (about machine learning)


2. Concept Learning and the General-to-Specific Ordering

3. Decision Tree Learning

4. Artificial Neural Networks


5. Evaluating Hypotheses

6. Bayesian Learning

7. Computational Learning Theory


8. Instance-Based Learning

9. Genetic Algorithms

10. Learning Sets of Rules

11. Analytical Learning

12. Combining Inductive and Analytical Learning

13. Reinforcement Learning


13. Reinforcement Learning

Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. The task of the agent is to learn from this indirect, delayed reward, to choose sequences of actions that produce the greatest cumulative reward. This chapter focuses on an algorithm called Q learning that can acquire optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment. Reinforcement learning algorithms are related to dynamic programming algorithms frequently used to solve optimization problems.

13.1  INTRODUCTION

This general setting for robot learning is summarized in Figure 13.1.

The problem of learning a control policy to choose actions is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy, π : S -> A, that outputs an appropriate action a from the set A, given the current state s from the set S. However, this reinforcement learning problem differs from other function approximation tasks in several important respects:

Delayed reward: The trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.

Exploration: The learner faces a tradeoff in choosing whether to favor exploration of unknown states and actions (to gather new information), or exploitation of states and actions that it has already learned will yield high reward (to maximize its cumulative reward); one way to manage this tradeoff is sketched after this list.

Partially observable states: In many practical situations sensors provide only partial information. For example, a robot with a forward-pointing camera cannot see what is behind it. In such cases, it may be necessary for the agent to consider its previous observations together with its current sensor data when choosing actions.

Life-long learning: Robot learning often requires that the robot learn several related tasks within the same environment, using the same sensors. This setting raises the possibility of using previously obtained experience or knowledge to reduce sample complexity when learning new tasks.
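As noted under Exploration above, one common strategy is to choose actions probabilistically, so that actions with higher current value estimates (the Q̂ values introduced in Section 13.3) are favored while every action keeps some chance of being tried. Below is a minimal Python sketch of such a selection rule; the `q_hat` lookup table, the `actions` list, and the constant `k` are illustrative assumptions rather than the book's exact notation.

```python
import random

def select_action(state, actions, q_hat, k=2.0):
    """Probabilistic action selection for the exploration/exploitation tradeoff.

    P(a_i | s) is proportional to k ** q_hat[(s, a_i)]: larger k exploits the
    current estimates more aggressively, while k near 1 explores almost
    uniformly. `q_hat` is assumed to be a dict mapping (state, action) pairs
    to value estimates.
    """
    weights = [k ** q_hat.get((state, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

With k close to 1 this behaves like uniform random exploration; increasing k over time gradually shifts the agent from exploration toward exploitation.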

13.2  THE LEARNING TASK 

Here we define one quite general formulation of the problem, based on Markov decision processes. This formulation follows the agent-environment setting illustrated in Figure 13.1.
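In this Markov decision process setting, at each time step t the agent observes the state s_t, chooses an action a_t, receives an immediate reward r_t = r(s_t, a_t), and the environment moves to the successor state s_{t+1} = δ(s_t, a_t). The task is to learn a policy π that maximizes the discounted cumulative reward:

```latex
\[
V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots
             = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i},
\qquad 0 \le \gamma < 1,
\qquad
\pi^{*} \equiv \arg\max_{\pi} V^{\pi}(s), \ \forall s
\]
```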

An example: the simple grid world of Figure 13.2, where the only nonzero immediate reward (100) is received for actions that enter the goal state G.

13.3  Q LEARNING

13.3.1 The Q Function 
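The book defines the Q function as the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter; the optimal policy can then be expressed directly in terms of Q:

```latex
\[
Q(s, a) \equiv r(s, a) + \gamma V^{*}(\delta(s, a)) \qquad (13.4)
\]
\[
\pi^{*}(s) = \arg\max_{a} Q(s, a)
\]
```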

13.3.2  An Algorithm for Learning Q

Since the optimal value function can be expressed in terms of Q,

V*(s) = max_{a'} Q(s, a'),

the definition (13.4) can be rewritten as a recurrence in Q alone:

Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a').

This recurrence suggests an iterative training rule: after executing action a in state s and observing the immediate reward r and the new state s', update the estimate

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a'),

which yields the following table-based algorithm.
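A minimal Python sketch of this table-based procedure, assuming a small deterministic environment exposed through hypothetical `reset()` and `step(state, action)` functions; the epsilon-greedy exploration here is an illustrative choice, not part of the book's algorithm statement:

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, episodes=1000, epsilon=0.1):
    """Table-based Q learning for a small deterministic world.

    Assumed (hypothetical) environment interface:
      env.reset()             -> initial state
      env.step(state, action) -> (next_state, reward, done)
    """
    q_hat = defaultdict(float)  # Q_hat(s, a), initialised to zero

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Choose an action: mostly greedy, occasionally random (exploration).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: q_hat[(s, act)])

            s_next, r, done = env.step(s, a)

            # Deterministic-world training rule:
            #   Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            q_hat[(s, a)] = r + gamma * max(q_hat[(s_next, act)] for act in actions)
            s = s_next

    return q_hat
```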

13.3.3  An Illustrative Example
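In the book's grid-world illustration of a single update, the agent moves right from state s1 to s2, receives immediate reward r = 0, and with γ = 0.9 and current estimates 63, 81, and 100 for the actions available in s2, the rule gives:

```latex
\[
\hat{Q}(s_1, a_{\mathrm{right}})
  \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')
  = 0 + 0.9 \times \max\{63,\ 81,\ 100\}
  = 90
\]
```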

13.3.4  Convergence

13.4  NONDETERMINISTIC REWARDS AND ACTIONS

To summarize, in the nondeterministic case we have simply redefined V and Q to be the expected values of the quantities previously defined for the deterministic case.
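Concretely, the book redefines the two quantities as expectations and modifies the training rule so that each update changes Q̂ only gradually, with a learning rate α_n that decays as the pair (s, a) is visited more often:

```latex
\[
V^{\pi}(s) \equiv E\left[\sum_{i=0}^{\infty} \gamma^{i} r_{t+i}\right],
\qquad
Q(s, a) \equiv E\left[\, r(s, a) + \gamma V^{*}(\delta(s, a)) \,\right]
\]
\[
\hat{Q}_{n}(s, a) \leftarrow
  (1 - \alpha_n)\, \hat{Q}_{n-1}(s, a)
  + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right],
\qquad
\alpha_n = \frac{1}{1 + \mathrm{visits}_n(s, a)}
\]
```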

13.5  TEMPORAL DIFFERENCE LEARNING
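TD(λ) (Sutton 1988) blends the one-step Q learning estimate with estimates that look further ahead before falling back on the current Q̂. The key definitions from the book:

```latex
\[
Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)
\]
\[
Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots
  + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} \max_{a} \hat{Q}(s_{t+n}, a)
\]
\[
Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[
  Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^{2} Q^{(3)}(s_t, a_t) + \cdots \right]
\]
```

Setting λ = 0 recovers the one-step Q learning rule; larger λ places more weight on observed rewards further into the future and less on the current estimate Q̂.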


13.8  SUMMARY AND FURTHER READING 

