[Original] Chapter 9: On-policy Prediction with Approximation
Chapter 9: On-policy Prediction with Approximation · 1 Introduction · 2 Determine the approximate function · 2.1 Mean Squared Value Error · 2.2 Stochastic gradient descent and semi-gradient methods to minimize VE · Gradient Monte Carlo · Semi-gradient (bootstrapping estimate)
2020-08-25 10:38:53 152
[Original] Chapter 16 Applications and 17 Frontiers
Notes of chapters 16 Applications and 17 Frontiers · Questions · Why not combine function approximation and policy approximation with Dyna? The policy update can be realized by minimizing the TD error, and the value table can be replaced by an ANN or linear approximati
2020-08-25 10:38:41 291
[Original] Chapter 13: Policy Gradient Methods
Policy Gradient Methods · 0 Questions · Q1: For problems that are not MDPs, is it practical to learn a sequential policy model using a temporal convolution network? · Q2: Can a parameterized policy focus on some action space of interest, as action-value methods can?
2020-08-25 10:36:55 223
[Original] Chapter 12: Eligibility Traces
Notes of Chapter 12: Eligibility Traces · 1 Introduction · 2 λ-return (offline λ-return algorithm) · N-step return · λ-return of continuing tasks · λ-return of episodic/continuing (T=∞) tasks · Offline λ-return al
2020-08-25 10:36:33 643
[Original] Chapter 10: On-policy Control with Approximation
Notes of Chapter 10: On-policy Control with Approximation · 1 Introduction · 2 On-policy control with approximation for episodic tasks · 2.1 General gradient-descent update for action-value prediction · 2.2 Gradient-descent update for semi-gradient n-step Sars
2020-08-25 10:36:15 163
[Original] Chapter 8: Planning and Learning with Tabular Methods
Chapter 8: Planning and Learning with Tabular Methods · Introduction · When the model is dynamic · When the model is large · Expected & sample update · Decision-time planning · Heuristic search · Rollout algorithm · Monte Carlo tree search (MCTS) · Introduction: Planning and lea
2020-08-25 10:35:47 397
[Original] Chapter 6&7: Temporal-Difference Learning
Chapter 6&7: Temporal-Difference Learning · 1 Introduction · 2 n-step TD prediction (estimate V) · 3 Off-policy n-step Sarsa · 4 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm · 5 Question · 1 Introduction: Temporal-difference (TD) lea
2020-08-25 10:35:26 635
[Original] Chapter 5: Monte Carlo Methods
Chapter 5: Monte Carlo Methods · 1 Introduction · 2 Policy evaluation (Monte Carlo prediction; on-policy) · 3 Policy improvement (on-policy) · 4 Generalized policy iteration (GPI; on-policy) · 4.1 Monte Carlo control with Exploring Starts · 4.2 Monte Carlo control without
2020-08-25 10:34:40 549
[Original] Chapter 4: Dynamic Programming
Notes of chapter 4: Dynamic Programming · General dynamic programming (DP) needs to know the whole model (transition and reward functions). Bootstrapping means updating one estimate from another estimate; it is used to update the estimates of the values of states. Th
2020-08-25 10:33:50 598