Introduction to Deep Learning (9) - Reinforcement Learning

andyc_03

Published 2024-04-29 17:04:52

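Tags: deep learning, artificial intelligence, reinforcement learning

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/andyc_03/article/details/138318928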

Reinforcement Learning

An agent performs actions in an environment and receives rewards.

Goal: learn how to take actions that maximize reward.

Stochasticity: Rewards and state transitions may be random

Credit assignment: Reward $r_t$ may not directly depend on action $a_t$

Nondifferentiable: Can’t backprop through the world

Nonstationary: What the agent experiences depends on how it acts

Markov Decision Process (MDP)

Mathematical formalization of the RL problem: A tuple $(S,A,R,P,\gamma)$

$S$ : Set of possible states

$A$ : Set of possible actions

$R$ : Distribution of reward given (state, action) pair

$P$ : Transition probability: distribution over next state given (state, action)

$\gamma$ : Discount factor (trade-off between future and present rewards)

Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on current state, not history.

The agent executes a policy $\pi$, which gives a distribution over actions conditioned on the current state.

Goal: find the best policy $\pi^*$, the one that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$
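To make the role of $\gamma$ concrete, here is a minimal sketch that computes $\sum_t \gamma^t r_t$ for a finite list of rewards. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards using G_t = r_t + gamma * G_{t+1},
    # which avoids computing gamma**t explicitly.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode with three rewards.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

A smaller $\gamma$ shrinks the contribution of later rewards, so the agent prefers reward sooner.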

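Because rewards and state transitions are random, the sum of rewards is itself a random variable. We therefore maximize its expectation: $\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_t \gamma^t r_t \mid \pi\right]$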

Value function $V^{\pi}(s)$ : expected cumulative reward from following policy $\pi$ from state $s$

Q function $Q^{\pi}(s,a)$ : expected cumulative reward from taking action $a$ in state $s$ and then following policy $\pi$
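Written out, and assuming "cumulative reward" means the discounted sum $\sum_t \gamma^t r_t$ used in the goal above, the two definitions can be stated as follows (the subscripts $s_0$, $a_0$ for the starting state and action are notation introduced here):

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]$$

$$Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

The two are related by $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s,a)\right]$: the value of a state is the Q value averaged over the actions the policy would choose there.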

Bellman Equation

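Let $Q^*$ be the Q function of the optimal policy. After taking action $a$ in state $s$, we get reward $r$ and move to a new state $s'$. From there, the maximum expected future reward we can get is $\max_{a'} Q^*(s',a')$, so $Q^*$ satisfies

$$Q^*(s,a) = \mathbb{E}_{r,s'}\left[r + \gamma \max_{a'} Q^*(s',a')\right]$$

Idea: if we find a function $Q$ that satisfies the Bellman equation, then it must be the optimal $Q^*$.

Start with a random $Q$ and use the Bellman equation as an update rule.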
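Using the Bellman equation as an update rule can be sketched on a tiny tabular problem. Everything concrete below is an assumption made for illustration: the two-state MDP `P`, the discount `GAMMA = 0.5`, and the function name `q_value_iteration`. Each sweep replaces every $Q(s,a)$ with the expected value of $r + \gamma \max_{a'} Q(s',a')$ under the known transition model.

```python
import random

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward).
# Action 0 = "stay", action 1 = "move".
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.5


def q_value_iteration(P, gamma, n_iters=100, seed=0):
    """Repeatedly apply the Bellman update to a randomly initialised Q table."""
    rng = random.Random(seed)
    Q = {s: {a: rng.random() for a in P[s]} for s in P}
    for _ in range(n_iters):
        # Synchronous sweep: the new table is built entirely from the old one.
        Q = {
            s: {
                a: sum(p * (r + gamma * max(Q[s2].values()))
                       for p, s2, r in P[s][a])
                for a in P[s]
            }
            for s in P
        }
    return Q


Q = q_value_iteration(P, GAMMA)
# The greedy policy picks the action with the highest Q value in each state.
greedy_policy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(Q)
print(greedy_policy)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth $2 / (1 - 0.5) = 4$, moving from state 0 is worth $1 + 0.5 \cdot 4 = 3$, and the other two entries are $0.5 \cdot 3 = 1.5$.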

But if the state space is large or infinite, we cannot iterate over every (state, action) pair.

Instead, approximate $Q(s,a)$ with a neural network and use the Bellman equation to define the loss function. This is deep Q-learning.
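One common way to turn the Bellman equation into a loss is the squared difference between $Q(s,a)$ and the target $r + \gamma \max_{a'} Q(s',a')$, averaged over sampled transitions. The sketch below only computes that loss; it does not train a network. The names `bellman_loss` and `q_fn`, the `(s, a, r, s_next, done)` transition format, the `done` flag for terminal states, and the lookup table standing in for a network are all assumptions made for illustration.

```python
def bellman_loss(q_fn, batch, gamma, actions):
    """Mean squared Bellman error over a batch of transitions.

    q_fn(s, a) is any callable returning a Q value; in deep Q-learning
    it would be a neural network.
    """
    total = 0.0
    for s, a, r, s_next, done in batch:
        if done:
            # No future reward after a terminal state.
            target = r
        else:
            target = r + gamma * max(q_fn(s_next, a2) for a2 in actions)
        total += (target - q_fn(s, a)) ** 2
    return total / len(batch)


# Hypothetical stand-in for a network: a fixed lookup table.
q_table = {(0, 0): 1.5, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 1.5}
batch = [(0, 1, 1.0, 1, False), (1, 0, 2.0, 1, False)]

loss = bellman_loss(lambda s, a: q_table[(s, a)], batch, gamma=0.5, actions=[0, 1])
print(loss)  # 0.0: this table already satisfies the Bellman equation on the batch
```

In a training loop the target is usually treated as a constant when taking gradients, so only the $Q(s,a)$ term is differentiated with respect to the network weights.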

Policy Gradients

Train a network $\pi_{\theta}(a \mid s)$ that takes the state as input and outputs a distribution over which action to take

Objective function: Expected future rewards when following policy $\pi_{\theta}$

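Use gradient ascent on this objective. We cannot backpropagate through the environment, so the gradient is rewritten with the log-derivative trick into an expectation that can be estimated from sampled trajectories.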
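The standard trick is the score-function (REINFORCE) identity $\nabla_{\theta} \mathbb{E}[r] = \mathbb{E}\left[r \, \nabla_{\theta} \log \pi_{\theta}(a \mid s)\right]$, which needs only sampled rewards and the gradient of the policy's own log-probability. Below is a minimal sketch on a single-state problem (a two-armed bandit) with a softmax policy. The bandit, the fixed arm rewards, the learning rate, and the function names are assumptions made for illustration; a full implementation would sum over the time steps of a trajectory and typically subtract a baseline to reduce variance.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Gradient ascent on expected reward using sampled actions only."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = arm_rewards[a]
        for k in range(len(theta)):
            # For a softmax policy: d/d theta_k log pi(a) = 1[k == a] - probs[k]
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_log
    return softmax(theta)


# Hypothetical bandit: arm 0 always pays 0, arm 1 always pays 1.
final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

The environment is only ever sampled, never differentiated: the update uses the observed reward as a weight on the gradient of $\log \pi_{\theta}$, which shifts probability toward the better arm.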

Other approaches:

Actor-Critic

Model-Based

Imitation Learning

Inverse Reinforcement Learning

Adversarial Learning

…

Stochastic computation graphs
