Chapters 16 (Applications) and 17 (Frontiers)

Questions

Why not combine function approximation and policy approximation with Dyna? The policy update could be realized by minimizing the TD error, and the value table could be replaced by an ANN or by linear approximation.
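
A minimal sketch of the value-approximation half of this idea is below: Dyna-Q style planning in which the tabular action values are replaced by a linear approximator. The environment interface, the feature map `x`, and all hyper-parameters are assumptions for illustration; a separate actor-critic style update would be needed for the policy-approximation half.

```python
import random
import numpy as np

# Hypothetical sketch: Dyna-Q where the Q table is replaced by a linear
# approximator q(s, a) = w[a] . x(s).  States are assumed hashable so the
# learned one-step model can be stored in a dict.
def dyna_q_linear(env, x, n_features, n_actions,
                  alpha=0.1, gamma=0.95, epsilon=0.1,
                  planning_steps=10, episodes=100):
    w = np.zeros((n_actions, n_features))    # one weight vector per action
    model = {}                               # (s, a) -> (reward, next state, done)

    def q(s):                                # action values for all actions in s
        return w @ x(s)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(n_actions) if random.random() < epsilon \
                else int(np.argmax(q(s)))
            s2, r, done = env.step(a)

            # direct RL: semi-gradient one-step update on the linear weights
            target = r + (0.0 if done else gamma * np.max(q(s2)))
            w[a] += alpha * (target - q(s)[a]) * x(s)

            # model learning: remember the observed transition
            model[(s, a)] = (r, s2, done)

            # planning: replay simulated transitions drawn from the model
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * np.max(q(ps2)))
                w[pa] += alpha * (ptarget - q(ps)[pa]) * x(ps)

            s = s2
    return w
```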

Applications

Samuel’s Checkers Player 1959-1967

Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal-difference learning.

Samuel’s programs played by performing a lookahead search from each current position. They used what we now call heuristic search methods to determine how to expand the search tree and when to stop searching. The terminal board positions of each search were evaluated, or “scored,” by a value function, or “scoring polynomial,” using linear function approximation.

Samuel used two main learning methods, the simplest of which he called rote learning. It consisted simply of saving a description of each board position encountered during play together with its backed-up value determined by the minimax procedure. The essential idea of temporal-difference learning is that the value of a state should equal the value of likely following states; Samuel came closest to this idea in his second learning method, his “learning by generalization”.

Samuel did not include explicit rewards. Instead, he fixed the weight of the most important feature, the piece advantage feature.

After learning from many games against itself, the program played as a “better-than-average novice”; fairly good amateur opponents characterized it as “tricky but beatable”.

TD-Gammon 1992-2002

The learning algorithm in TD-Gammon was a straightforward combination of the TD($\lambda$) algorithm and nonlinear function approximation using a multilayer artificial neural network (ANN) trained by backpropagating TD errors.
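
A rough NumPy sketch of this combination is below: semi-gradient TD($\lambda$) with one eligibility trace per network weight, backpropagating the TD error through a single hidden layer. The layer sizes, sigmoid units, and hyper-parameters are illustrative assumptions, not Tesauro's actual settings.

```python
import numpy as np

class TDLambdaNet:
    """Illustrative one-hidden-layer network trained by semi-gradient TD(lambda)."""

    def __init__(self, n_inputs, n_hidden=40, alpha=0.1, lam=0.7, gamma=1.0):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.alpha, self.lam, self.gamma = alpha, lam, gamma
        self.z1 = np.zeros_like(self.W1)     # one eligibility trace per weight
        self.z2 = np.zeros_like(self.W2)

    def reset_traces(self):
        """Call at the start of each game to clear the eligibility traces."""
        self.z1[:] = 0.0
        self.z2[:] = 0.0

    def value(self, x):
        """Forward pass with sigmoid units; returns (estimated value, hidden activations)."""
        h = 1.0 / (1.0 + np.exp(-self.W1 @ x))
        v = 1.0 / (1.0 + np.exp(-self.W2 @ h))
        return float(v[0]), h

    def step(self, x, reward, x_next, terminal):
        """One TD(lambda) update from the transition x -> x_next."""
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + self.gamma * v_next - v          # TD error

        # gradient of the value estimate with respect to both weight layers
        dv = v * (1.0 - v)
        grad_W2 = dv * h[np.newaxis, :]
        grad_h = dv * self.W2.ravel() * h * (1.0 - h)
        grad_W1 = np.outer(grad_h, x)

        # decay the traces, accumulate the new gradients, and apply the update
        self.z2 = self.gamma * self.lam * self.z2 + grad_W2
        self.z1 = self.gamma * self.lam * self.z1 + grad_W1
        self.W2 += self.alpha * delta * self.z2
        self.W1 += self.alpha * delta * self.z1
        return delta
```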

Tesauro obtained an unending sequence of games by playing his learning backgammon player against itself.

After playing about 300,000 games against itself, TD-Gammon 0.0 as described above
learned to play approximately as well as the best previous backgammon computer programs.

The tournament success of TD-Gammon 0.0 with zero expert backgammon knowledge
suggested an obvious modification: add the specialized backgammon features but keep
the self-play TD learning method. This produced TD-Gammon 1.0. TD-Gammon 1.0 was
clearly substantially better than all previous backgammon programs and found serious
competition only among human experts.

TD-Gammon illustrates the combination of learned value functions and decision-time search, as in heuristic search and MCTS methods. Its success also led to great improvements in the overall caliber of human tournament play.

Watson’s Daily-Double Wagering 2011-2013

Tesauro and colleagues adapted the approach of the TD-Gammon system described above to create the strategy used by Watson for “Daily-Double” (DD) wagering in its celebrated winning performance against human champions.

Action values were computed whenever a betting decision was needed by using two types of estimates that were learned before any live game play took place. The first were estimated values of the afterstates (Section 6.8) that would result from selecting each legal bet. These estimates were obtained from a state-value function, $\hat{v}(\cdot, w)$, defined by parameters $w$, that gave estimates of the probability of a win for Watson from any game state. The second estimates used to compute action values gave the “in-category DD confidence”.
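
A hedged sketch of how those two estimates might be combined into an action value for each legal bet is below. The `after_dd` helper and all names are hypothetical; `v_hat` stands for the learned afterstate win-probability function and `p_dd` for the in-category DD confidence.

```python
def dd_action_value(bet, game_state, v_hat, p_dd):
    """Estimated probability of winning the game if `bet` dollars are wagered."""
    # Hypothetical afterstates: Watson's score goes up by `bet` if the clue is
    # answered correctly, down by `bet` otherwise.
    right = game_state.after_dd(score_change=+bet)
    wrong = game_state.after_dd(score_change=-bet)
    return p_dd * v_hat(right) + (1.0 - p_dd) * v_hat(wrong)

def choose_bet(legal_bets, game_state, v_hat, p_dd):
    """Greedy bet selection over the computed action values."""
    return max(legal_bets, key=lambda b: dd_action_value(b, game_state, v_hat, p_dd))
```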

Human-level Video Game Play

A team of researchers at Google DeepMind developed an impressive demonstration that
a deep multi-layer ANN can automate the feature design process (Mnih et al., 2013, 2015).

Mnih et al. developed a reinforcement learning agent called deep Q-network (DQN) that combined Q-learning with a deep convolutional ANN, a many-layered, or deep, ANN specialized for processing spatial arrays of data such as images. Another motivation for using Q-learning was that DQN used the experience replay method, described below, which requires an off-policy algorithm. Being model-free and off-policy made Q-learning a natural choice.
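
A minimal sketch of the experience replay idea and the off-policy Q-learning target it is paired with is shown below. The `q_values` callable stands in for a forward pass of the convolutional Q-network and, like the buffer size and discount, is an assumption here.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores transitions and serves uniformly sampled minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def q_learning_targets(batch, q_values, gamma=0.99):
    """One-step Q-learning targets r + gamma * max_a' Q(s', a') for a minibatch."""
    s, a, r, s_next, done = batch
    next_q = q_values(s_next).max(axis=1)     # greedy bootstrap: off-policy
    return r + gamma * (1.0 - done.astype(float)) * next_q
```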

DQN advanced the state-of-the-art in machine learning by impressively demonstrating the promise of combining reinforcement learning with modern methods of deep learning.

Mastering the Game of Go

AlphaGo

It selected moves by a novel version of MCTS that was guided by both a policy and a value function learned by reinforcement learning with function approximation provided by deep convolutional ANNs.

It started from weights that were the result of previous supervised learning from a large collection of human expert moves.

AlphaGo Zero

AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy. AlphaGo Zero used only one deep convolutional ANN and used a simpler version of MCTS.
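
A hedged sketch of the kind of network-guided selection step used in this style of MCTS is below: each edge keeps a visit count, a total value, and the policy network's prior, and selection maximizes the mean value plus an exploration bonus. The constant `c_puct` and the exact bookkeeping are assumptions.

```python
import math

class Node:
    """One tree node per position; edges to children carry the search statistics."""

    def __init__(self, prior):
        self.prior = prior           # prior probability from the policy head
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}           # action -> Node

    def mean_value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U, where U favors high-prior, rarely visited moves."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.mean_value() + u

    return max(node.children.items(), key=score)
```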

Personalized Web Services

Researchers formulated personalized recommendation as a Markov decision process (MDP) with the objective of maximizing the total number of clicks users make over repeated visits to a website, an approach known as life-time value (LTV) optimization.

Thermal Soaring

By experimenting with various reward signals, they found that learning was best with a reward signal that at each time step linearly combined the vertical wind velocity and vertical wind acceleration observed on the previous time step.

Learning was by one-step Sarsa, with actions selected according to a soft-max distribution
based on normalized action values.
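
A small sketch of one-step Sarsa with soft-max action selection, roughly as described, is below. The tabular layout of `Q`, the temperature, and the step size are assumptions for illustration.

```python
import numpy as np

def softmax_probs(action_values, temperature=1.0):
    """Soft-max distribution over (normalized) action values."""
    prefs = np.asarray(action_values, dtype=float) / temperature
    prefs -= prefs.max()                      # subtract the max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def sarsa_episode(env, Q, alpha=0.1, gamma=1.0, temperature=1.0,
                  rng=np.random.default_rng()):
    """Run one episode of one-step Sarsa; Q maps a state to an array of action values."""
    s = env.reset()
    a = rng.choice(len(Q[s]), p=softmax_probs(Q[s], temperature))
    done = False
    while not done:
        s_next, r, done = env.step(a)
        if done:
            target = r
        else:
            a_next = rng.choice(len(Q[s_next]), p=softmax_probs(Q[s_next], temperature))
            target = r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])  # one-step Sarsa update
        if not done:
            s, a = s_next, a_next
    return Q
```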

This computational study of thermal soaring illustrates how reinforcement learning can be applied to quite different kinds of objectives.

Frontiers

General Value Functions and Auxiliary Tasks

Rather than predicting the sum of future rewards, we might predict the sum of the future values of a sound or color sensation, or of an internal, highly processed signal such as another prediction. Whatever signal is added up in this way in a value-function-like prediction, we call it the cumulant of that prediction. We formalize it as a cumulant signal $C_t \in \mathbb{R}$. Using this, we can define a general value function, or GVF.
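
One hedged way to write the resulting prediction target, following the book's GVF notation with policy $\pi$, state-dependent termination function $\gamma$, and cumulant $C$, is:

$$
\hat{v}(s, \mathbf{w}) \approx \mathbb{E}\left[\left.\sum_{k=t}^{\infty}\left(\prod_{i=t+1}^{k}\gamma(S_i)\right) C_{k+1} \;\right|\; S_t = s,\ A_{t:\infty} \sim \pi \right]
$$

Choosing the cumulant to be the reward, $C_{k+1} = R_{k+1}$, and making $\gamma$ a constant recovers the ordinary value function.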

One simple way in which auxiliary tasks can help on the main task is that they may require some of the same representations as are needed on the main task.

Another simple way in which the learning of auxiliary tasks can improve performance is best explained by analogy to the psychological phenomenon of classical conditioning.

Finally, perhaps the most important role for auxiliary tasks is in moving beyond the assumption we have made throughout this book that the state representation is fixed and given to the agent.

Temporal Abstraction via Options

Can the MDP framework be stretched to cover all the levels simultaneously?

Perhaps it can. One popular idea is to formalize an MDP at a detailed level, with a small time step, yet enable planning at higher levels using extended courses of action that correspond to many base-level time steps. To do this we need a notion of course of action that extends over many time steps and includes a notion of termination. A general way to formulate these two ideas is as a policy, $\pi$, and a state-dependent termination function, $\gamma$, as in GVFs. We define a pair of these as a generalized notion of action termed an option.

Options effectively extend the action space. The agent can either select a low-level action/option, terminating after one time step, or select an extended option that might execute for many time steps before terminating.
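
A small illustrative encoding of an option as a (policy, termination) pair, together with a loop that executes it in the base-level MDP, is below. The names and environment interface are assumptions; here termination is expressed as a stopping probability, which in the GVF-style notation corresponds to $1-\gamma(s)$.

```python
from dataclasses import dataclass
from typing import Any, Callable
import random

@dataclass
class Option:
    policy: Callable[[Any], int]          # pi: state -> base-level action
    termination: Callable[[Any], float]   # state -> probability of stopping there

def run_option(env, state, option, rng=random.Random()):
    """Execute an option until it terminates; return (total reward, final state, duration)."""
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = option.policy(state)               # follow the option's internal policy
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if rng.random() < option.termination(state):   # state-dependent termination
            break
    return total_reward, state, steps
```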

Observations and State

In many cases of interest, and certainly in the lives of all natural intelligences, the sensory input gives only partial information about the state of the world.

The framework of parametric function approximation that we developed in Part II is far less restrictive and, arguably, no limitation at all.

First, we would change the problem. The environment would emit not its states, but only observations—signals that depend on its state but, like a robot’s sensors, provide only partial information about it.
Second, we can recover the idea of state as used in this book from the sequence of observations and actions.
The third step in extending reinforcement learning to partial observability is to deal with certain computational considerations.
The fourth and final step in our brief outline of how to handle partial observability in reinforcement learning is to re-introduce approximation.
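
Step two above amounts to maintaining an agent state as a recursive summary of the history of observations and actions, updated by a state-update function $u$, roughly $s_t = u(s_{t-1}, a_{t-1}, o_t)$. A toy sketch, in which the exponential-decay form of `u` is purely illustrative, is:

```python
import numpy as np

def agent_state_update(prev_state, prev_action, observation, decay=0.9):
    """Toy state-update function u: blend the new (action, observation) into a running summary.

    Assumes prev_state and the concatenated (action, observation) vector have the same length.
    """
    new_info = np.concatenate([np.atleast_1d(prev_action).astype(float),
                               np.atleast_1d(observation).astype(float)])
    return decay * np.asarray(prev_state, dtype=float) + (1.0 - decay) * new_info

# Usage: the same update is applied online at every time step,
# e.g.  state = agent_state_update(state, action, obs)
```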

Designing Reward Signals

Success of a reinforcement learning application strongly depends on how well the reward signal frames the goal of the application’s designer and how well the signal assesses progress in reaching that goal.

One challenge is to design a reward signal so that as an agent learns, its behavior approaches, and ideally eventually achieves, what the application’s designer actually desires.

Even when there is a simple and easily identifiable goal, the problem of sparse reward often arises.

It is tempting to address the sparse reward problem by rewarding the agent for achieving subgoals that the designer thinks are important way stations to the overall goal. But augmenting the reward signal with well-intentioned supplemental rewards may lead the agent to behave very differently from what is intended.

Remaining Issues

First, we still need powerful parametric function approximation methods that work well in fully incremental and online settings.

Second (and perhaps closely related), we still need methods for learning features such that subsequent learning generalizes well.

Third, we still need scalable methods for planning with learned environment models.

A fourth issue that needs to be addressed in future research is that of automating the choice of the tasks that an agent works on and uses to structure its developing competence.

The fifth issue that we would like to highlight for future research is that of the interaction between behavior and learning via some computational analog of curiosity.

A final issue that demands attention in future research is that of developing methods to make it acceptably safe to embed reinforcement learning agents into physical environments.
