Reinforcement Learning Study Notes
Mainly records key points of reinforcement learning.
最適当承诺
Off-policy Actor-critic in RL
DDPG, TD3, Soft Actor-Critic.
Original · 2022-09-15 21:48:02 · 1122 views · 0 comments
Policy Gradient Methods of Deep Reinforcement Learning (Part Two)
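This article discusses the natural gradient in distribution space and then applies the natural gradient to actor-critic methods. It also explains the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms.
Original · 2022-01-13 13:11:33 · 111 views · 0 comments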
Policy gradient Method of Deep Reinforcement learning (Part One)
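This article summarizes the policy-based methods of deep reinforcement learning and explains, from an optimization perspective, REINFORCE (Monte Carlo based policy gradient) and the various actor-critic methods built on the vanilla policy gradient in parameter space. Because policy-based methods can operate in continuous state and action spaces, they are often used in robot control; later parts of the article analyze policy-based reinforcement learning in robot…
Original · 2022-01-04 21:37:53 · 1130 views · 0 comments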
4.1 Temporal Differential of one step
Definition: Dynamic programming uses the equations in the second line and has to know the environment's dynamics (the dynamics of the environment produce the chain from the current state to the next state, but they are hard to know. Cons) (It uses the rel...
Original · 2021-08-15 23:56:40 · 129 views · 0 comments
3.3 Monte Carlo Methods: case study: Blackjack of Policy Improvement of on- & off-policy Evaluation
Background: In 3.1 Monte Carlo Methods & case study: Blackjack of Policy Evaluation, we finished the evaluation of a specific policy (hit unless 20 or 21). In this article, we will summarize the policy improvement for…
Original · 2021-08-09 11:35:45 · 213 views · 0 comments
3.2 Off-Policy Monte Carlo Methods & case study: Blackjack of off-Policy Evaluation
Background: In many cases, we are not able to find examples generated under our specific policy. However, we can find examples from other policies whose actions and states also appear in our policy. What's more, in deterministic on-policy, we always have to compromis...
Original · 2021-08-05 17:43:02 · 227 views · 0 comments
3.1 Monte Carlo Methods & case study: Blackjack of on-Policy Evaluation
Monte Carlo: Definition. Monte Carlo Prediction: Definition. Pseudocode: Initialization: Returns = [ ]. Loop for N times: generate an episode following the specific policy; G = 0; loop for each step of the episode, t = T-1...
Original · 2021-07-20 11:28:40 · 114 views · 0 comments
1. Finite Markov Decision Process
Finite Markov Decision Process
Original · 2021-07-12 20:52:00 · 119 views · 0 comments
2.2 DP: Value Iteration & Gambler's Problem
Value Iteration. Background: In policy iteration, the policy is improved only after the value function (policy evaluation) has converged. In fact, the value function does not need to converge, because even without the last many sweeps of policy evaluation, …
Original · 2021-07-16 22:10:20 · 180 views · 0 comments
2.1 Dynamic programming and case study: Jack's car rental
Dynamic programming. Policy evaluation: By iterating, we can obtain the value function v(s) for a specific policy, given the environment's dynamics. Pseudocode
Original · 2021-07-16 11:10:31 · 776 views · 0 comments
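
The excerpt for 2.1 breaks off before its pseudocode, so the following is only a rough sketch of what iterative policy evaluation can look like, not the article's own code. The dictionary layout of the dynamics, the two-state toy MDP, and the names `policy_evaluation`, `toy_dynamics`, `toy_policy`, `gamma`, and `theta` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed layout: dynamics[s][a] is a list of (probability, next_state, reward).
Dynamics = Dict[int, Dict[int, List[Tuple[float, int, float]]]]
# Assumed layout: policy[s][a] is the probability of taking action a in state s.
Policy = Dict[int, Dict[int, float]]


def policy_evaluation(dynamics: Dynamics, policy: Policy,
                      gamma: float = 0.9, theta: float = 1e-8) -> Dict[int, float]:
    """Sweep all states in place until the largest value change is below theta."""
    v = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s in dynamics:
            new_v = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, reward in dynamics[s][a]:
                    # Expected one-step reward plus discounted value of the successor.
                    new_v += pi_sa * prob * (reward + gamma * v[s_next])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


# Hypothetical toy MDP: state 0 moves to state 1 with reward 1;
# state 1 is absorbing and yields reward 0 forever.
toy_dynamics: Dynamics = {
    0: {0: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)]},
}
toy_policy: Policy = {0: {0: 1.0}, 1: {0: 1.0}}

if __name__ == "__main__":
    print(policy_evaluation(toy_dynamics, toy_policy))
```

The sketch updates values in place during each sweep, which is one common choice; keeping a separate copy of the old values per sweep would also converge to the same value function.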