1. Introduction (about machine learning)
2. Concept Learning and the General-to-Specific Ordering
3. Decision Tree Learning
4. Artificial Neural Networks
5. Evaluating Hypotheses
6. Bayesian Learning
7. Computational Learning Theory
8. Instance-Based Learning
9. Genetic Algorithms
10. Learning Sets of Rules
11. Analytical Learning
12. Combining Inductive and Analytical Learning
13. Reinforcement Learning
13. Reinforcement Learning
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. The task of the agent is to learn from this indirect, delayed reward, to choose sequences of actions that produce the greatest cumulative reward. This chapter focuses on an algorithm called Q learning that can acquire optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment. Reinforcement learning algorithms are related to dynamic programming algorithms frequently used to solve optimization problems.

13.1 INTRODUCTION
This general setting for robot learning is summarized in Figure 13.1.
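In code, this setting amounts to a sense-act-learn loop. The sketch below is illustrative only; the Environment and Agent interfaces are assumptions of these notes, not anything defined in the chapter:

```python
# Minimal sketch of the agent-environment loop of Figure 13.1.
# The Environment/Agent interfaces are illustrative assumptions.

class Environment:
    def observe(self):
        """Return the current state s."""
        raise NotImplementedError

    def act(self, action):
        """Execute an action; return the immediate reward r."""
        raise NotImplementedError

class Agent:
    def choose_action(self, state):
        raise NotImplementedError

    def learn(self, state, action, reward, next_state):
        raise NotImplementedError

def run(agent, env, steps):
    state = env.observe()
    for _ in range(steps):
        action = agent.choose_action(state)
        reward = env.act(action)        # immediate (possibly zero) reward
        next_state = env.observe()
        agent.learn(state, action, reward, next_state)
        state = next_state
```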
The problem of learning a control policy to choose actions is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy, π : S -> A, that outputs an appropriate action a from the set A, given the current state s from the set S. However, this reinforcement learning problem differs from other function approximation tasks in several important respects:

Delayed reward: the trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.
Exploration: The learner faces a tradeoff in choosing whether to favor exploration of unknown states and actions (to gather new information), or exploitation of states and actions that it has already learned will yield high reward (to maximize its cumulative reward). One common selection strategy is sketched after this list.
Partially observable states: In many practical situations sensors provide only partial information. For example, a robot with a forward-pointing camera cannot see what is behind it. In such cases, it may be necessary for the agent to consider its previous observations together with its current sensor data when choosing actions.
Life-long learning: Robot learning often requires that the robot learn several related tasks within the same environment, using the same sensors. This setting raises the possibility of using previously obtained experience or knowledge to reduce sample complexity when learning new tasks.
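One standard way to manage the exploration-exploitation tradeoff mentioned above is ε-greedy action selection; it is a common strategy, not one this chapter prescribes. With small probability ε the agent explores at random, otherwise it exploits its current value estimates:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration);
    otherwise pick the action with the highest current Q estimate
    (exploitation). q_table maps (state, action) pairs to floats."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```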
13.2 THE LEARNING TASK
Here we define one quite general formulation of the problem, based on Markov decision processes (MDPs). In an MDP the agent, at each discrete time step t, observes its current state s_t, chooses and executes an action a_t, receives an immediate reward r_t = r(s_t, a_t), and the environment moves to the successor state s_{t+1} = δ(s_t, a_t). The agent's task is to learn a policy π that maximizes the discounted cumulative reward

V^π(s_t) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

where 0 ≤ γ < 1 is a discount factor that determines the relative value of delayed versus immediate rewards. This formulation follows the setting illustrated in Figure 13.1.
As an example, consider the small deterministic MDP sketched below.
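The encoding is purely illustrative: the two-state chain, the goal reward of 100, and the 0.9 discount are assumptions of these notes, not the book's worked example.

```python
# A tiny deterministic MDP, encoded as plain dictionaries.
# States, actions, and rewards here are illustrative assumptions.
GAMMA = 0.9                      # discount factor, 0 <= gamma < 1

STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

# delta(s, a) -> next state (the goal state absorbs).
DELTA = {
    ("s0", "right"): "s1",     ("s0", "left"): "s0",
    ("s1", "right"): "goal",   ("s1", "left"): "s0",
    ("goal", "right"): "goal", ("goal", "left"): "goal",
}

# r(s, a) -> immediate reward: 100 for entering the goal, else 0.
REWARD = {(s, a): (100.0 if DELTA[(s, a)] == "goal" and s != "goal" else 0.0)
          for (s, a) in DELTA}
```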
13.3 Q LEARNING
13.3.1 The Q Function
Let us define the evaluation function Q(s, a) to be the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))    (13.4)

where V*(s) denotes the value V^π(s) of the optimal policy π*.
Then the optimal action in state s is the one that maximizes Q, so the optimal policy can be expressed in terms of Q alone:

π*(s) = argmax_a Q(s, a)    (13.5)

This is important: if the agent learns Q, it can choose optimal actions even without knowledge of the functions r and δ.
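Read as code, (13.5) says that acting optimally requires only an argmax over the learned table (a minimal sketch, assuming Q̂ is stored as a dict keyed by (state, action) pairs):

```python
def greedy_policy(q_table, state, actions):
    """pi*(s) = argmax_a Q(s, a): choose the action with the largest
    learned Q value; no model of r or delta is required."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```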
13.3.2 An Algorithm for Learning Q
Since the value of the best policy from a state equals the value of the best action available there,

V*(s) = max_a' Q(s, a')    (13.6)
equation (13.4) can be rewritten as a recursive definition of Q:

Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')    (13.7)
This recurrence suggests an iterative training rule for the learner's estimate Q̂: after executing action a in state s, receiving reward r, and observing the new state s',

Q̂(s, a) ← r + γ max_a' Q̂(s', a')
We thereby obtain the following algorithm (Q learning for deterministic rewards and actions):

For each s, a initialize the table entry Q̂(s, a) to zero.
Observe the current state s.
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s'
    Update the table entry for Q̂(s, a): Q̂(s, a) ← r + γ max_a' Q̂(s', a')
    s ← s'
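A runnable rendering of this algorithm, as a sketch under assumptions: the MDP is passed in as the dictionaries from the Section 13.2 example, training is split into fixed-length episodes from random start states, and actions are selected uniformly at random; none of these choices comes from the chapter.

```python
import random

def q_learning(delta, reward, states, actions, gamma=0.9, episodes=500):
    """Tabular Q learning for a deterministic MDP given by a transition
    dict delta[(s, a)] -> s' and a reward dict reward[(s, a)] -> r."""
    q = {(s, a): 0.0 for s in states for a in actions}  # init Q̂ to zero
    for _ in range(episodes):
        s = random.choice(states)                       # random start state
        for _ in range(50):                             # bounded episode length
            a = random.choice(actions)                  # explore at random
            r, s_next = reward[(s, a)], delta[(s, a)]
            # Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
            q[(s, a)] = r + gamma * max(q[(s_next, a2)] for a2 in actions)
            s = s_next
    return q

# Example with the toy MDP defined earlier:
# q = q_learning(DELTA, REWARD, STATES, ACTIONS, gamma=GAMMA)
# print(max(q[("s1", a)] for a in ACTIONS))  # ~100: one step from the goal
```

Uniform random action selection is enough to visit every state-action pair in this tiny problem; in practice one would substitute a strategy such as the ε-greedy selection sketched in the introduction.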
13.3.3 An Illustrative Example
13.3.4 Convergence
13.4 NONDETERMINISTIC REWARDS AND ACTIONS
To summarize, we have simply redefined V and Q in the nondeterministic case to be the expected values of the quantities previously defined for the deterministic case, for example

V^π(s_t) ≡ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]

The training rule is revised accordingly: rather than overwriting Q̂(s, a), it takes a decaying-average step of size α_n, so that the estimate averages over the stochastic outcomes:

Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [ r + γ max_a' Q̂_{n-1}(s', a') ],  where α_n = 1 / (1 + visits_n(s, a))

and visits_n(s, a) is the total number of times the pair (s, a) has been visited up through iteration n.
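A sketch of the revised update rule (the dict-based table and visit counter are assumed bookkeeping choices, not from the text):

```python
def nondeterministic_update(q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One Q̂ update for nondeterministic rewards/transitions, using the
    decaying learning rate alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * target
```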
13.5 TEMPORAL DIFFERENCE LEARNING
13.8 SUMMARY AND FURTHER READING