1. Introduction (about machine learning)
2. Concept Learning and the General-to-Specific Ordering
3. Decision Tree Learning
4. Artificial Neural Networks
5. Evaluating Hypotheses
6. Bayesian Learning
7. Computational Learning Theory
8. Instance-Based Learning
9. Genetic Algorithms
10. Learning Sets of Rules
11. Analytical Learning
12. Combining Inductive and Analytical Learning
13. Reinforcement Learning
13. Reinforcement Learning
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. The task of the agent is to learn from this indirect, delayed reward, to choose sequences of actions that produce the greatest cumulative reward. This chapter focuses on an algorithm called Q learning that can acquire optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment. Reinforcement learning algorithms are related to dynamic programming algorithms frequently used to solve optimization problems.

13.1 INTRODUCTION
This general setting for robot learning is summarized in Figure 13.1.
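In code, this setting amounts to a sense-act-learn loop. The sketch below is illustrative only; the Environment and Agent interfaces are assumptions of these notes, not anything defined in the chapter:

```python
# Minimal sketch of the agent-environment loop of Figure 13.1.
# The Environment/Agent interfaces are illustrative assumptions.

class Environment:
    def observe(self):
        """Return the current state s."""
        raise NotImplementedError

    def act(self, action):
        """Execute an action; return the immediate reward r."""
        raise NotImplementedError

class Agent:
    def choose_action(self, state):
        raise NotImplementedError

    def learn(self, state, action, reward, next_state):
        raise NotImplementedError

def run(agent, env, steps):
    state = env.observe()
    for _ in range(steps):
        action = agent.choose_action(state)
        reward = env.act(action)        # immediate (possibly zero) reward
        next_state = env.observe()
        agent.learn(state, action, reward, next_state)
        state = next_state
```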
The problem of learning a control policy to choose actions is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy, π : S -> A, that outputs an appropriate action a from the set A, given the current state s from the set S. However, this reinforcement learning problem differs from other function approximation tasks in several important respects:

Delayed reward: the trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.
Exploration: The learner faces a tradeoff in choosing whether to favor exploration of unknown states and actions (to gather new information), or exploitation of states and actions that it has already learned will yield high reward (to maximize its cumulative reward). One common selection strategy is sketched after this list.
Partially observable states: In many practical situations sensors provide only partial information. For example, a robot with a forward-pointing camera cannot see what is behind it. In such cases, it may be necessary for the agent to consider its previous observations together with its current sensor data when choosing actions.
Life-long learning: Robot learning often requires that the robot learn several related tasks within the same environment, using the same sensors. This setting raises the possibility of using previously obtained experience or knowledge to reduce sample complexity when learning new tasks.
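One standard way to manage the exploration-exploitation tradeoff mentioned above is ε-greedy action selection; it is a common strategy, not one this chapter prescribes. With small probability ε the agent explores at random, otherwise it exploits its current value estimates:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration);
    otherwise pick the action with the highest current Q estimate
    (exploitation). q_table maps (state, action) pairs to floats."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```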
13.2 THE LEARNING TASK
Here we define one quite general formulation of the problem, based on Markov decision processes (MDPs). In an MDP the agent, at each discrete time step t, observes its current state s_t, chooses and executes an action a_t, receives an immediate reward r_t = r(s_t, a_t), and the environment moves to the successor state s_{t+1} = δ(s_t, a_t). The agent's task is to learn a policy π that maximizes the discounted cumulative reward

V^π(s_t) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

where 0 ≤ γ < 1 is a discount factor that determines the relative value of delayed versus immediate rewards. This formulation follows the setting illustrated in Figure 13.1.
As an example, consider the small deterministic MDP sketched below.
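The encoding is purely illustrative: the two-state chain, the goal reward of 100, and the 0.9 discount are assumptions of these notes, not the book's worked example.

```python
# A tiny deterministic MDP, encoded as plain dictionaries.
# States, actions, and rewards here are illustrative assumptions.
GAMMA = 0.9                      # discount factor, 0 <= gamma < 1

STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

# delta(s, a) -> next state (the goal state absorbs).
DELTA = {
    ("s0", "right"): "s1",     ("s0", "left"): "s0",
    ("s1", "right"): "goal",   ("s1", "left"): "s0",
    ("goal", "right"): "goal", ("goal", "left"): "goal",
}

# r(s, a) -> immediate reward: 100 for entering the goal, else 0.
REWARD = {(s, a): (100.0 if DELTA[(s, a)] == "goal" and s != "goal" else 0.0)
          for (s, a) in DELTA}
```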
13.3 Q LEARNING
13.3.1 The Q Function
Let us define the evaluation function Q(s, a) to be the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))    (13.4)

where V*(s) denotes the value V^π(s) of the optimal policy π*.
Then the optimal action in state s is the one that maximizes Q, so the optimal policy can be expressed in terms of Q alone:

π*(s) = argmax_a Q(s, a)    (13.5)

This is important: if the agent learns Q, it can choose optimal actions even without knowledge of the functions r and δ.
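Read as code, (13.5) says that acting optimally requires only an argmax over the learned table (a minimal sketch, assuming Q̂ is stored as a dict keyed by (state, action) pairs):

```python
def greedy_policy(q_table, state, actions):
    """pi*(s) = argmax_a Q(s, a): choose the action with the largest
    learned Q value; no model of r or delta is required."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```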
13.3.2 An Algorithm for Learning Q
Since the value of the best policy from a state equals the value of the best action available there,

V*(s) = max_a' Q(s, a')    (13.6)
equation (13.4) can be rewritten as a recursive definition of Q:

Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')    (13.7)
This recurrence suggests an iterative training rule for the learner's estimate Q̂: after executing action a in state s, receiving reward r, and observing the new state s',

Q̂(s, a) ← r + γ max_a' Q̂(s', a')
We thereby obtain the following algorithm (Q learning for deterministic rewards and actions):

For each s, a initialize the table entry Q̂(s, a) to zero.
Observe the current state s.
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s'
    Update the table entry for Q̂(s, a): Q̂(s, a) ← r + γ max_a' Q̂(s', a')
    s ← s'
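A runnable rendering of this algorithm, as a sketch under assumptions: the MDP is passed in as the dictionaries from the Section 13.2 example, training is split into fixed-length episodes from random start states, and actions are selected uniformly at random; none of these choices comes from the chapter.

```python
import random

def q_learning(delta, reward, states, actions, gamma=0.9, episodes=500):
    """Tabular Q learning for a deterministic MDP given by a transition
    dict delta[(s, a)] -> s' and a reward dict reward[(s, a)] -> r."""
    q = {(s, a): 0.0 for s in states for a in actions}  # init Q̂ to zero
    for _ in range(episodes):
        s = random.choice(states)                       # random start state
        for _ in range(50):                             # bounded episode length
            a = random.choice(actions)                  # explore at random
            r, s_next = reward[(s, a)], delta[(s, a)]
            # Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
            q[(s, a)] = r + gamma * max(q[(s_next, a2)] for a2 in actions)
            s = s_next
    return q

# Example with the toy MDP defined earlier:
# q = q_learning(DELTA, REWARD, STATES, ACTIONS, gamma=GAMMA)
# print(max(q[("s1", a)] for a in ACTIONS))  # ~100: one step from the goal
```

Uniform random action selection is enough to visit every state-action pair in this tiny problem; in practice one would substitute a strategy such as the ε-greedy selection sketched in the introduction.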
13.3.3 An Illustrative Example
13.3.4 Convergence
13.4 NONDETERMINISTIC REWARDS AND ACTIONS
To summarize, we have simply redefined V and Q in the nondeterministic case to be the expected values of the quantities previously defined for the deterministic case, for example

V^π(s_t) ≡ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]

The training rule is revised accordingly: rather than overwriting Q̂(s, a), it takes a decaying-average step of size α_n, so that the estimate averages over the stochastic outcomes:

Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [ r + γ max_a' Q̂_{n-1}(s', a') ],  where α_n = 1 / (1 + visits_n(s, a))

and visits_n(s, a) is the total number of times the pair (s, a) has been visited up through iteration n.
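A sketch of the revised update rule (the dict-based table and visit counter are assumed bookkeeping choices, not from the text):

```python
def nondeterministic_update(q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One Q̂ update for nondeterministic rewards/transitions, using the
    decaying learning rate alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * target
```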
13.5 TEMPORAL DIFFERENCE LEARNING
13.8 SUMMARY AND FURTHER READING