Reinforcement Learning: Chapter 3 Self-Study Notes

Latest recommended article published on 2024-07-19 15:45:46

喵呜嘻嘻嘻

Category: Reinforcement Learning. Tags: algorithms

Article link: https://blog.csdn.net/z3w97/article/details/115875018

Multi-armed Bandits

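Evaluative feedback vs. instructive feedback: reinforcement learning relies on the former, meaning it evaluates the actions taken rather than instructing which action is correct. This is the most important feature distinguishing reinforcement learning from other kinds of learning.
Associative vs. nonassociative: the former refers to settings where actions are taken in more than one situation.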

You are faced repeatedly with a choice among $k$ different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period.
The value of an arbitrary action $a$ is the expected reward given that $a$ is selected:
$q_{*}(a) \doteq \mathbb{E}\left[R_{t} \mid A_{t}=a\right]$
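Since $q_{*}(a)$ is an expectation, one way to make the definition concrete is to approximate it by averaging many sampled rewards for each action. The following is a minimal sketch, not a method taken from the notes above: the Gaussian reward distributions, the helper names `make_bandit` and `estimate_action_values`, and all parameter values are assumptions chosen purely for illustration.

```python
import random


def make_bandit(k, seed=0):
    """Build a hypothetical stationary k-armed bandit.

    Assumption: each action a has a fixed true value q_*(a) drawn from N(0, 1),
    and rewards for a are drawn from N(q_*(a), 1).
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(a):
        # Reward depends only on the selected action (stationary distribution).
        return rng.gauss(true_values[a], 1.0)

    return true_values, pull


def estimate_action_values(pull, k, pulls_per_action):
    """Approximate q_*(a) = E[R_t | A_t = a] by the mean of sampled rewards."""
    estimates = []
    for a in range(k):
        total = sum(pull(a) for _ in range(pulls_per_action))
        estimates.append(total / pulls_per_action)
    return estimates


true_values, pull = make_bandit(k=5, seed=42)
estimates = estimate_action_values(pull, k=5, pulls_per_action=20000)

for a, (q, est) in enumerate(zip(true_values, estimates)):
    print(f"action {a}: q*(a) = {q:+.3f}, sample mean = {est:+.3f}")
```

With enough pulls per action, each sample mean should land close to the corresponding true value. In the actual bandit problem the learner cannot pull every arm this freely, because every pull counts toward the total reward it is trying to maximize.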
