二、Value-Based Reinforcement Learning

Many of the formulas in DRL papers are hard to follow, so I have recently been going back over the fundamentals of DRL.

I highly recommend the DRL reinforcement learning course by Wang Shusen (王树森) on Bilibili. The figures and content in this post are my own notes, organized from my understanding of his lectures.

Contents

A. Recap of the Previous Post

B. DQN: a Value-Based Method

一、Introduction to the DQN Algorithm

二、Introduction to the TD Algorithm

三、Training DQN with the TD Algorithm

C. Summary of the DQN Algorithm


A. Recap of the Previous Post

As mentioned in the previous post, there are two main ways to train an AI agent.

One is policy-based learning, built on a policy π.

The other is value-based learning, built on the optimal action-value function Q*.

B. DQN: a Value-Based Method

一、Introduction to the DQN Algorithm

The goal in a game is to win, which in reinforcement learning corresponds to maximizing the total reward.

DQN is a value-based method, so it relies on the action-value function.

A value-based method asks how good it is, on average, to take action A in state S (it is an expectation over future returns). In other words, Q* is like an oracle that reports the average return of every possible action. DQN then selects the action with the highest Q value.
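As a worked definition in standard notation (not copied from the course slides): the action-value function is the expected discounted return of taking action A in state S under a policy π, and Q* is its maximum over all policies.

```latex
% Action-value function under policy \pi: expected discounted return U_t
Q_\pi(s_t, a_t) = \mathbb{E}\!\left[\, U_t \mid S_t = s_t,\ A_t = a_t \,\right],
\qquad U_t = \sum_{k \ge 0} \gamma^{k} R_{t+k}

% Optimal action-value function: the best achievable expected return,
% i.e. the "oracle" that scores every action in a given state
Q^{*}(s_t, a_t) = \max_{\pi} Q_\pi(s_t, a_t)
```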

A value-based method therefore learns a function Q(s, a; w), typically a neural network with parameters w, to approximate Q*(s, a).

The input of the DQN is the state S, and the output is a score for every possible action (e.g. with three actions left, right, and up, the output is a 3×1 vector).
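To make this concrete, here is a minimal sketch of such a network, assuming PyTorch, a discrete action space, and purely illustrative layer sizes:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector to one Q value per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: 3 actions (left, right, up) -> the output is a length-3 score vector.
q_net = DQN(state_dim=4, num_actions=3)
state = torch.randn(1, 4)                 # a batch with one 4-dimensional state
q_values = q_net(state)                   # shape (1, 3): one score per action
action = q_values.argmax(dim=1).item()    # pick the action with the highest score
```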

二、Introduction to the TD Algorithm

TD is short for Temporal Difference Learning.

The goal is to drive the TD error to zero, i.e. to make the model's prediction agree with the (partially observed) actual value.

TD error = Q(w) − TD target

An example: the model predicts that driving from NYC to ATL will take 1000 min, while the trip actually takes 860 min.

Partway through the trip, after 300 min of actual driving, the model re-estimates the remaining time as 600 min, so:

TD target = 300 + 600 = 900, where 300 is an observed value and 600 is still a model estimate.

Because part of it is grounded in real observation, the TD target is more reliable than the pure estimate Q(w).
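A tiny numeric sketch of the same idea, using only the numbers from the example above (the learning rate is an illustrative choice):

```python
# Travel-time example: the model originally predicts 1000 min for the whole trip.
predicted_total = 1000       # Q(w): the pure model estimate
observed_so_far = 300        # minutes actually driven so far (ground truth)
predicted_remaining = 600    # the model's new estimate for the rest of the trip

td_target = observed_so_far + predicted_remaining   # 300 + 600 = 900
td_error = predicted_total - td_target              # 1000 - 900 = 100

# TD-style update: nudge the original prediction toward the TD target.
learning_rate = 0.1
updated_prediction = predicted_total - learning_rate * td_error   # 1000 - 10 = 990
print(td_target, td_error, updated_prediction)
```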

三、Training DQN with the TD Algorithm

Therefore, the discounted return satisfies the recursion U_t = R_t + γ·U_{t+1}.

Taking expectations on both sides gives

Q(s_t, a_t; w) ≈ r_t + γ·max_a Q(s_{t+1}, a; w),

where the right-hand side is the TD target.

To form this target, DQN scores every action at the next state s_{t+1} and takes the highest score, i.e. max_a Q(s_{t+1}, a; w).
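A minimal sketch of one TD update step, assuming PyTorch, the q_net sketch from above, and a hypothetical helper name td_update (real DQN implementations compute this over mini-batches from a replay buffer):

```python
import torch

gamma = 0.99  # discount factor (illustrative value)

def td_update(q_net, optimizer, s_t, a_t, r_t, s_next, done):
    """One TD step on a single transition (s_t, a_t, r_t, s_next)."""
    q_pred = q_net(s_t)[0, a_t]                      # Q(s_t, a_t; w)
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values[0]  # max_a Q(s_{t+1}, a; w)
        td_target = r_t + gamma * q_next * (1.0 - done)
    loss = (q_pred - td_target) ** 2                 # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # w <- w - lr * gradient
    return loss.item()
```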

C. Summary of the DQN Algorithm

Idea: use a neural network Q(s, a; w) to approximate Q*(s, a).

1. Observe the state s_t.

2. Based on s_t, choose the action a_t that maximizes the Q score, a_t = argmax_a Q(s_t, a; w).

3. With s_t and a_t, the environment returns the reward r_t and the next state s_{t+1}; use them to compute the TD target and update the weights w by gradient descent on the squared TD error.
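Putting these steps together, here is a minimal sketch of one training episode, assuming a gymnasium-style environment and the DQN / td_update sketches above; real DQN implementations also add a replay buffer, a target network, and ε-greedy exploration, which are omitted here:

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")               # example environment (assumption)
q_net = DQN(state_dim=4, num_actions=2)     # CartPole: 4-dim state, 2 actions
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

state, _ = env.reset()
done = False
while not done:
    s_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    a_t = q_net(s_t).argmax(dim=1).item()    # greedy action: argmax_a Q(s, a; w)
    next_state, r_t, terminated, truncated, _ = env.step(a_t)
    done = terminated or truncated
    s_next = torch.as_tensor(next_state, dtype=torch.float32).unsqueeze(0)
    td_update(q_net, optimizer, s_t, a_t, float(r_t), s_next, float(done))
    state = next_state
```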

Multi-agent reinforcement learning (MARL) is a subfield of reinforcement learning (RL) that involves multiple agents learning simultaneously in a shared environment. MARL has been studied for several decades, but recent advances in deep learning and computational power have led to significant progress in the field. The development of MARL can be divided into several key stages:

1. Early approaches: In the early days, MARL algorithms were based on game theory and heuristic methods. These approaches were limited in their ability to handle complex environments or large numbers of agents.

2. Independent Learners: The Independent Learners (IL) algorithm was proposed in the 1990s, allowing agents to learn independently while interacting with a shared environment. This approach was successful in simple environments but often led to convergence issues in more complex scenarios.

3. Decentralized Partially Observable Markov Decision Process (Dec-POMDP): The Dec-POMDP framework was introduced to address the challenges of coordinating multiple agents in a decentralized manner. It models the environment as a Partially Observable Markov Decision Process (POMDP), which allows agents to reason about the beliefs and actions of other agents.

4. Deep MARL: The development of deep learning techniques, such as deep neural networks, has enabled MARL in more complex environments. Deep MARL algorithms, such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), have achieved state-of-the-art performance in many applications.

5. Multi-Agent Actor-Critic (MAAC): MAAC is a recent algorithm that combines the advantages of policy-based and value-based methods. It uses an actor-critic architecture to learn decentralized policies and value functions for each agent, while incorporating a centralized critic to estimate the global value function.

Overall, the development of MARL has been driven by the need to coordinate multiple agents in complex environments. While there is still much to learn in this field, recent advances in deep learning and reinforcement learning have opened up new possibilities for developing more effective MARL algorithms.