《Reinforcement Learning: An Introduction》 读书笔记 - 目录

本系列笔记基于Richard S. Sutton的《Reinforcement Learning: An Introduction》第二版,涵盖多臂老虎机、有限马尔科夫决策过程等内容。针对不同章节进行详细解析,包括动态规划、蒙特卡洛方法与时序差分学习等核心概念。

这一系列笔记是基于Richard S. Sutton的《Reinforcement Learning: An Introduction》第二版
因为这本书在出版之前,作者就在官网上发布了几次草稿版,不同时间发布的版本之间的排版有所差异(尤其是2017年和2018年的之间)
本系列基于2018年的几个版本,所以如果文中部分内容所指明的地方和读者看到的不一致,敬请谅解~

第2章:多臂老虎机(Multi-armed Bandits)

第3章:有限马尔科夫决策过程(Finite Markov Decision Processes)

第4章:动态规划(Dynamic Programming)

第5章:蒙特卡洛方法(Monte Carlo Methods)

第6章:时序差分学习(TD-Learning)

The authoritative textbook for reinforcement learning by Richard Sutton and Andrew Barto. Contents Preface Series Forward Summary of Notation I. The Problem 1. Introduction 1.1 Reinforcement Learning 1.2 Examples 1.3 Elements of Reinforcement Learning 1.4 An Extended Example: Tic-Tac-Toe 1.5 Summary 1.6 History of Reinforcement Learning 1.7 Bibliographical Remarks 2. Evaluative Feedback 2.1 An -Armed Bandit Problem 2.2 Action-Value Methods 2.3 Softmax Action Selection 2.4 Evaluation Versus Instruction 2.5 Incremental Implementation 2.6 Tracking a Nonstationary Problem 2.7 Optimistic Initial Values 2.8 Reinforcement Comparison 2.9 Pursuit Methods 2.10 Associative Search 2.11 Conclusions 2.12 Bibliographical and Historical Remarks 3. The Reinforcement Learning Problem 3.1 The Agent-Environment Interface 3.2 Goals and Rewards 3.3 Returns 3.4 Unified Notation for Episodic and Continuing Tasks 3.5 The Markov Property 3.6 Markov Decision Processes 3.7 Value Functions 3.8 Optimal Value Functions 3.9 Optimality and Approximation 3.10 Summary 3.11 Bibliographical and Historical Remarks II. Elementary Solution Methods 4. Dynamic Programming 4.1 Policy Evaluation 4.2 Policy Improvement 4.3 Policy Iteration 4.4 Value Iteration 4.5 Asynchronous Dynamic Programming 4.6 Generalized Policy Iteration 4.7 Efficiency of Dynamic Programming 4.8 Summary 4.9 Bibliographical and Historical Remarks 5. Monte Carlo Methods 5.1 Monte Carlo Policy Evaluation 5.2 Monte Carlo Estimation of Action Values 5.3 Monte Carlo Control 5.4 On-Policy Monte Carlo Control 5.5 Evaluating One Policy While Following Another 5.6 Off-Policy Monte Carlo Control 5.7 Incremental Implementation 5.8 Summary 5.9 Bibliographical and Historical Remarks 6. Temporal-Difference Learning 6.1 TD Prediction 6.2 Advantages of TD Prediction Methods 6.3 Optimality of TD(0) 6.4 Sarsa: On-Policy TD Control 6.5 Q-Learning: Off-Policy TD Control 6.6 Actor-Critic Methods 6.7 R-Learning for Undiscounted Continuing Tasks 6.8 Games, Afterstates, and Other Special Cases 6.9 Summary 6.10 Bibliographical and Historical Remarks III. A Unified View 7. Eligibility Traces 7.1 -Step TD Prediction 7.2 The Forward View of TD( ) 7.3 The Backward View of TD( ) 7.4 Equivalence of Forward and Backward Views 7.5 Sarsa( ) 7.6 Q( ) 7.7 Eligibility Traces for Actor-Critic Methods 7.8 Replacing Traces 7.9 Implementation Issues 7.10 Variable 7.11 Conclusions 7.12 Bibliographical and Historical Remarks 8. Generalization and Function Approximation 8.1 Value Prediction with Function Approximation 8.2 Gradient-Descent Methods 8.3 Linear Methods 8.3.1 Coarse Coding 8.3.2 Tile Coding 8.3.3 Radial Basis Functions 8.3.4 Kanerva Coding 8.4 Control with Function Approximation 8.5 Off-Policy Bootstrapping 8.6 Should We Bootstrap? 8.7 Summary 8.8 Bibliographical and Historical Remarks 9. Planning and Learning 9.1 Models and Planning 9.2 Integrating Planning, Acting, and Learning 9.3 When the Model Is Wrong 9.4 Prioritized Sweeping 9.5 Full vs. Sample Backups 9.6 Trajectory Sampling 9.7 Heuristic Search 9.8 Summary 9.9 Bibliographical and Historical Remarks 10. Dimensions of Reinforcement Learning 10.1 The Unified View 10.2 Other Frontier Dimensions 11. Case Studies 11.1 TD-Gammon 11.2 Samuel's Checkers Player 11.3 The Acrobot 11.4 Elevator Dispatching 11.5 Dynamic Channel Allocation 11.6 Job-Shop Scheduling Bibliography Index
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值