Chapters 16 (Applications) and 17 (Frontiers)

Questions

Why not combine function approximation and policy approximation with Dyna? The policy update could be realized by minimizing the TD error, and the value table could be replaced by an ANN or by linear approximation.
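
A minimal sketch of the value-approximation half of this idea is below: Dyna-Q style planning in which the tabular action values are replaced by a linear approximator. The environment interface, the feature map `x`, and all hyper-parameters are assumptions for illustration; a separate actor-critic style update would be needed for the policy-approximation half.

```python
import random
import numpy as np

# Hypothetical sketch: Dyna-Q where the Q table is replaced by a linear
# approximator q(s, a) = w[a] . x(s).  States are assumed hashable so the
# learned one-step model can be stored in a dict.
def dyna_q_linear(env, x, n_features, n_actions,
                  alpha=0.1, gamma=0.95, epsilon=0.1,
                  planning_steps=10, episodes=100):
    w = np.zeros((n_actions, n_features))    # one weight vector per action
    model = {}                               # (s, a) -> (reward, next state, done)

    def q(s):                                # action values for all actions in s
        return w @ x(s)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(n_actions) if random.random() < epsilon \
                else int(np.argmax(q(s)))
            s2, r, done = env.step(a)

            # direct RL: semi-gradient one-step update on the linear weights
            target = r + (0.0 if done else gamma * np.max(q(s2)))
            w[a] += alpha * (target - q(s)[a]) * x(s)

            # model learning: remember the observed transition
            model[(s, a)] = (r, s2, done)

            # planning: replay simulated transitions drawn from the model
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * np.max(q(ps2)))
                w[pa] += alpha * (ptarget - q(ps)[pa]) * x(ps)

            s = s2
    return w
```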

Applications

Samuel’s Checkers Player 1959-1967

Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal-difference learning.

Samuel’s programs played by performing a lookahead search from each current position. They used what we now call heuristic search methods to determine how to expand the search tree and when to stop searching. The terminal board positions of each search were evaluated, or “scored,” by a value function, or “scoring polynomial,” using linear function approximation.

Samuel used two main learning methods, the simplest of which he called rote learning. It consisted simply of saving a description of each board position encountered during play together with its backed-up value determined by the minimax procedure. The essential idea of temporal-difference learning is that the value of a state should equal the value of likely following states; Samuel came closest to this idea in his second learning method, his “learning by generalization”.

Samuel did not include explicit rewards. Instead, he fixed the weight of the most important feature, the piece advantage feature.

After learning from many games against itself, the program played as a “better-than-average novice”; fairly good amateur opponents characterized it as “tricky but beatable”.

TD-Gammon 1992-2002

The learning algorithm in TD-Gammon was a straightforward combination of the TD($\lambda$) algorithm and nonlinear function approximation using a multilayer artificial neural network (ANN) trained by backpropagating TD errors.
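
A rough NumPy sketch of this combination is below: semi-gradient TD($\lambda$) with one eligibility trace per network weight, backpropagating the TD error through a single hidden layer. The layer sizes, sigmoid units, and hyper-parameters are illustrative assumptions, not Tesauro's actual settings.

```python
import numpy as np

class TDLambdaNet:
    """Illustrative one-hidden-layer network trained by semi-gradient TD(lambda)."""

    def __init__(self, n_inputs, n_hidden=40, alpha=0.1, lam=0.7, gamma=1.0):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.alpha, self.lam, self.gamma = alpha, lam, gamma
        self.z1 = np.zeros_like(self.W1)     # one eligibility trace per weight
        self.z2 = np.zeros_like(self.W2)

    def reset_traces(self):
        """Call at the start of each game to clear the eligibility traces."""
        self.z1[:] = 0.0
        self.z2[:] = 0.0

    def value(self, x):
        """Forward pass with sigmoid units; returns (estimated value, hidden activations)."""
        h = 1.0 / (1.0 + np.exp(-self.W1 @ x))
        v = 1.0 / (1.0 + np.exp(-self.W2 @ h))
        return float(v[0]), h

    def step(self, x, reward, x_next, terminal):
        """One TD(lambda) update from the transition x -> x_next."""
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + self.gamma * v_next - v          # TD error

        # gradient of the value estimate with respect to both weight layers
        dv = v * (1.0 - v)
        grad_W2 = dv * h[np.newaxis, :]
        grad_h = dv * self.W2.ravel() * h * (1.0 - h)
        grad_W1 = np.outer(grad_h, x)

        # decay the traces, accumulate the new gradients, and apply the update
        self.z2 = self.gamma * self.lam * self.z2 + grad_W2
        self.z1 = self.gamma * self.lam * self.z1 + grad_W1
        self.W2 += self.alpha * delta * self.z2
        self.W1 += self.alpha * delta * self.z1
        return delta
```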

Tesauro obtained an unending sequence of games by playing his learning backgammon player against itself.

After playing about 300,000 games against itself, TD-Gammon 0.0 as described above
learned to play approximately as well as the best previous backgammon computer programs.

The tournament success of TD-Gammon 0.0 with zero expert backgammon knowledge
suggested an obvious modification: add the specialized backgammon features but keep
the self-play TD learning method. This produced TD-Gammon 1.0. TD-Gammon 1.0 was
clearly substantially better than all previous backgammon programs and found serious
competition only among human experts.

TD-Gammon illustrates the combination of learned value functions and decision-time search, as in heuristic search and MCTS methods. Its success also led to great improvements in the overall caliber of human tournament play.

Watson’s Daily-Double Wagering 2011-2013

Tesauro and colleagues adapted the approach of the TD-Gammon system described above to create the strategy used by Watson for “Daily-Double” (DD) wagering in its celebrated winning performance against human champions.

Action values were computed whenever a betting decision was needed by using two types of estimates that were learned before any live game play took place. The first were estimated values of the afterstates (Section 6.8) that would result from selecting each legal bet. These estimates were obtained from a state-value function, $\hat{v}(\cdot, w)$, defined by parameters $w$, that gave estimates of the probability of a win for Watson from any game state. The second estimates used to compute action values gave the “in-category DD confidence”.
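
A hedged sketch of how those two estimates might be combined into an action value for each legal bet is below. The `after_dd` helper and all names are hypothetical; `v_hat` stands for the learned afterstate win-probability function and `p_dd` for the in-category DD confidence.

```python
def dd_action_value(bet, game_state, v_hat, p_dd):
    """Estimated probability of winning the game if `bet` dollars are wagered."""
    # Hypothetical afterstates: Watson's score goes up by `bet` if the clue is
    # answered correctly, down by `bet` otherwise.
    right = game_state.after_dd(score_change=+bet)
    wrong = game_state.after_dd(score_change=-bet)
    return p_dd * v_hat(right) + (1.0 - p_dd) * v_hat(wrong)

def choose_bet(legal_bets, game_state, v_hat, p_dd):
    """Greedy bet selection over the computed action values."""
    return max(legal_bets, key=lambda b: dd_action_value(b, game_state, v_hat, p_dd))
```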

Human-level Video Game Play

A team of researchers at Google DeepMind developed an impressive demonstration that
a deep multi-layer ANN can automate the feature design process (Mnih et al., 2013, 2015).

Mnih et al. developed a reinforcement learning agent called deep Q-network (DQN) that combined Q-learning with a deep convolutional ANN, a many-layered, or deep, ANN specialized for processing spatial arrays of data such as images. Another motivation for using Q-learning was that DQN used the experience replay method, described below, which requires an off-policy algorithm. Being model-free and off-policy made Q-learning a natural choice.
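
A minimal sketch of the experience replay idea and the off-policy Q-learning target it is paired with is shown below. The `q_values` callable stands in for a forward pass of the convolutional Q-network and, like the buffer size and discount, is an assumption here.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores transitions and serves uniformly sampled minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def q_learning_targets(batch, q_values, gamma=0.99):
    """One-step Q-learning targets r + gamma * max_a' Q(s', a') for a minibatch."""
    s, a, r, s_next, done = batch
    next_q = q_values(s_next).max(axis=1)     # greedy bootstrap: off-policy
    return r + gamma * (1.0 - done.astype(float)) * next_q
```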

DQN advanced the state-of-the-art in machine learning by impressively demonstrating the promise of combining reinforcement learning with modern methods of deep learning.

Mastering the Game of Go

AlphaGo

It selected moves by a novel version of MCTS that was guided by both a policy and a value function learned by reinforcement learning with function approximation provided by deep convolutional ANNs.

It started from weights that were the result of previous supervised learning from a large collection of human expert moves.

AlphaGo Zero

AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy. AlphaGo Zero used only one deep convolutional ANN and used a simpler version of MCTS.
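
A hedged sketch of the kind of network-guided selection step used in this style of MCTS is below: each edge keeps a visit count, a total value, and the policy network's prior, and selection maximizes the mean value plus an exploration bonus. The constant `c_puct` and the exact bookkeeping are assumptions.

```python
import math

class Node:
    """One tree node per position; edges to children carry the search statistics."""

    def __init__(self, prior):
        self.prior = prior           # prior probability from the policy head
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}           # action -> Node

    def mean_value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U, where U favors high-prior, rarely visited moves."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.mean_value() + u

    return max(node.children.items(), key=score)
```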

Personalized Web Services

Researchers formulated personalized recommendation as a Markov decision process (MDP) with the objective of maximizing the total number of clicks users make over repeated visits to a website, an approach known as life-time value (LTV) optimization.

Thermal Soaring

By experimenting with various reward signals, they found that learning was best with a reward signal that at each time step linearly combined the vertical wind velocity and vertical wind acceleration observed on the previous time step.

Learning was by one-step Sarsa, with actions selected according to a soft-max distribution
based on normalized action values.
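
A small sketch of one-step Sarsa with soft-max action selection, roughly as described, is below. The tabular layout of `Q`, the temperature, and the step size are assumptions for illustration.

```python
import numpy as np

def softmax_probs(action_values, temperature=1.0):
    """Soft-max distribution over (normalized) action values."""
    prefs = np.asarray(action_values, dtype=float) / temperature
    prefs -= prefs.max()                      # subtract the max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def sarsa_episode(env, Q, alpha=0.1, gamma=1.0, temperature=1.0,
                  rng=np.random.default_rng()):
    """Run one episode of one-step Sarsa; Q maps a state to an array of action values."""
    s = env.reset()
    a = rng.choice(len(Q[s]), p=softmax_probs(Q[s], temperature))
    done = False
    while not done:
        s_next, r, done = env.step(a)
        if done:
            target = r
        else:
            a_next = rng.choice(len(Q[s_next]), p=softmax_probs(Q[s_next], temperature))
            target = r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])  # one-step Sarsa update
        if not done:
            s, a = s_next, a_next
    return Q
```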

This computational study of thermal soaring illustrates how reinforcement learning can be applied to quite different kinds of objectives.

Frontiers

General Value Functions and Auxiliary Tasks

Rather than predicting the sum of future rewards, we might predict the sum of the future values of a sound or color sensation, or of an internal, highly processed signal such as another prediction. Whatever signal is added up in this way in a value-function-like prediction, we call it the cumulant of that prediction. We formalize it as a cumulant signal $C_t \in \mathbb{R}$. Using this, we can define a general value function, or GVF.
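
One hedged way to write the resulting prediction target, following the book's GVF notation with policy $\pi$, state-dependent termination function $\gamma$, and cumulant $C$, is:

$$
\hat{v}(s, \mathbf{w}) \approx \mathbb{E}\left[\left.\sum_{k=t}^{\infty}\left(\prod_{i=t+1}^{k}\gamma(S_i)\right) C_{k+1} \;\right|\; S_t = s,\ A_{t:\infty} \sim \pi \right]
$$

Choosing the cumulant to be the reward, $C_{k+1} = R_{k+1}$, and making $\gamma$ a constant recovers the ordinary value function.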

One simple way in which auxiliary tasks can help on the main task is that they may require some of the same representations as are needed on the main task.

Another simple way in which the learning of auxiliary tasks can improve performance is best explained by analogy to the psychological phenomenon of classical conditioning.

Finally, perhaps the most important role for auxiliary tasks is in moving beyond the assumption we have made throughout this book that the state representation is fixed and given to the agent.

Temporal Abstraction via Options

Can the MDP framework be stretched to cover all the levels simultaneously?

Perhaps it can. One popular idea is to formalize an MDP at a detailed level, with a small time step, yet enable planning at higher levels using extended courses of action that correspond to many base-level time steps. To do this we need a notion of course of action that extends over many time steps and includes a notion of termination. A general way to formulate these two ideas is as a policy, $\pi$, and a state-dependent termination function, $\gamma$, as in GVFs. We define a pair of these as a generalized notion of action termed an option.

Options effectively extend the action space. The agent can either select a low-level action/option, terminating after one time step, or select an extended option that might execute for many time steps before terminating.
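
A small illustrative encoding of an option as a (policy, termination) pair, together with a loop that executes it in the base-level MDP, is below. The names and environment interface are assumptions; here termination is expressed as a stopping probability, which in the GVF-style notation corresponds to $1-\gamma(s)$.

```python
from dataclasses import dataclass
from typing import Any, Callable
import random

@dataclass
class Option:
    policy: Callable[[Any], int]          # pi: state -> base-level action
    termination: Callable[[Any], float]   # state -> probability of stopping there

def run_option(env, state, option, rng=random.Random()):
    """Execute an option until it terminates; return (total reward, final state, duration)."""
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = option.policy(state)               # follow the option's internal policy
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if rng.random() < option.termination(state):   # state-dependent termination
            break
    return total_reward, state, steps
```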

Observations and State

In many cases of interest, and certainly in the lives of all natural intelligences, the sensory input gives only partial information about the state of the world.

The framework of parametric function approximation that we developed in Part II is far less restrictive and, arguably, no limitation at all.

First, we would change the problem. The environment would emit not its states, but only observations—signals that depend on its state but, like a robot’s sensors, provide only partial information about it.
Second, we can recover the idea of state as used in this book from the sequence of observations and actions.
The third step in extending reinforcement learning to partial observability is to deal with certain computational considerations.
The fourth and final step in our brief outline of how to handle partial observability in reinforcement learning is to re-introduce approximation.
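
Step two above amounts to maintaining an agent state as a recursive summary of the history of observations and actions, updated by a state-update function $u$, roughly $s_t = u(s_{t-1}, a_{t-1}, o_t)$. A toy sketch, in which the exponential-decay form of `u` is purely illustrative, is:

```python
import numpy as np

def agent_state_update(prev_state, prev_action, observation, decay=0.9):
    """Toy state-update function u: blend the new (action, observation) into a running summary.

    Assumes prev_state and the concatenated (action, observation) vector have the same length.
    """
    new_info = np.concatenate([np.atleast_1d(prev_action).astype(float),
                               np.atleast_1d(observation).astype(float)])
    return decay * np.asarray(prev_state, dtype=float) + (1.0 - decay) * new_info

# Usage: the same update is applied online at every time step,
# e.g.  state = agent_state_update(state, action, obs)
```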

Designing Reward Signals

Success of a reinforcement learning application strongly depends on how well the reward signal frames the goal of the application’s designer and how well the signal assesses progress in reaching that goal.

One challenge is to design a reward signal so that as an agent learns, its behavior approaches, and ideally eventually achieves, what the application’s designer actually desires.

Even when there is a simple and easily identifiable goal, the problem of sparse reward often arises.

It is tempting to address the sparse reward problem by rewarding the agent for achieving subgoals that the designer thinks are important way stations to the overall goal. But augmenting the reward signal with well-intentioned supplemental rewards may lead the agent to behave very differently from what is intended.

Remaining Issues

First, we still need powerful parametric function approximation methods that work well in fully incremental and online settings.

Second (and perhaps closely related), we still need methods for learning features such that subsequent learning generalizes well.

Third, we still need scalable methods for planning with learned environment models.

A fourth issue that needs to be addressed in future research is that of automating the choice of the tasks that an agent works on and uses to structure its developing competence.

The fifth issue that we would like to highlight for future research is that of the interaction between behavior and learning via some computational analog of curiosity.

A final issue that demands attention in future research is that of developing methods to make it acceptably safe to embed reinforcement learning agents into physical environments.
