Unity AI – Reinforcement Learning with Q-Learning


Welcome to the second entry in the Unity AI Blog series! For this post, I want to pick up where we left off last time and talk about how to take a Contextual Bandit problem and extend it into a full Reinforcement Learning problem. In the process, we will demonstrate how to use an agent which acts via a learned Q-function that estimates the long-term value of taking certain actions in certain circumstances. For this example we will only use a simple gridworld and a tabular Q-representation. Fortunately, this basic idea applies to almost all games. If you would like to try out the Q-learning demo, follow the link here. For a deeper walkthrough of how Q-learning works, continue to the full text below.

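To make the setup concrete, here is a minimal sketch of a tabular Q-learning loop in plain C#. The gridworld layout, reward values, and hyperparameters below are illustrative assumptions, not the code behind the linked demo; it is only meant to show what a learned Q-function stored as a table looks like in practice.

```csharp
using System;

// Minimal tabular Q-learning sketch (illustrative assumptions, not the demo's actual code).
// States are cells of a tiny 1D gridworld; actions are move left / move right.
class TabularQLearningSketch
{
    const int NumStates = 5;         // cells 0..4, with the goal at cell 4 (assumed layout)
    const int NumActions = 2;        // 0 = left, 1 = right
    const float LearningRate = 0.1f;
    const float Discount = 0.99f;    // gamma: how strongly future reward counts
    const double Epsilon = 0.1;      // exploration rate

    static float[,] q = new float[NumStates, NumActions];
    static Random rng = new Random();

    static void Main()
    {
        for (int episode = 0; episode < 500; episode++)
        {
            int state = 0;
            while (state != NumStates - 1)
            {
                // Epsilon-greedy action selection over the current Q-table row.
                int action = rng.NextDouble() < Epsilon
                    ? rng.Next(NumActions)
                    : (q[state, 1] > q[state, 0] ? 1 : 0);

                int nextState = Math.Max(0, Math.Min(NumStates - 1, state + (action == 1 ? 1 : -1)));
                float reward = nextState == NumStates - 1 ? 1f : 0f;

                // Q-learning update: nudge Q(s,a) toward reward + gamma * max_a' Q(s',a').
                float bestNext = Math.Max(q[nextState, 0], q[nextState, 1]);
                q[state, action] += LearningRate * (reward + Discount * bestNext - q[state, action]);

                state = nextState;
            }
        }

        Console.WriteLine("Learned value of moving right from the start cell: " + q[0, 1]);
    }
}
```

The key line is the update toward reward + gamma * max Q(s', a'); the discounted next-state term is what separates full Q-learning from the bandit setting recapped below.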

The Q-Learning Algorithm

Contextual Bandit Recap


The goal when doing Reinforcement Learning is to train an agent which can learn to act in ways that maximize future expected rewards within a given environment. In the last post in this series, that environment was relatively static. The state of the environment was simply which of the three possible rooms the agent was in, and the action was choosing which chest within that room to open. Our algorithm learned the Q-function for each of these state-action pairs: Q(s, a). This Q-function corresponded to the expected future reward that would be acquired by taking that action within that state over time. We called this problem the “Contextual Bandit.”

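For contrast, a bandit-style table update has no next state to bootstrap from: opening a chest ends the episode, so Q(s, a) is simply moved toward the reward that was just received. The room and chest counts and the learning rate in the sketch below are assumptions for illustration, not the original demo's values.

```csharp
// Contextual-bandit style table update (a sketch under assumed sizes, not the demo's code).
class BanditQTable
{
    const int NumRooms = 3;          // one state per room
    const int NumChests = 4;         // chest count per room is an illustrative assumption
    const float LearningRate = 0.1f;

    float[,] q = new float[NumRooms, NumChests];

    // Opening a chest ends the episode, so there is no discounted next-state term:
    // Q(s, a) is simply nudged toward the reward that was just observed.
    public void Update(int room, int chest, float reward)
    {
        q[room, chest] += LearningRate * (reward - q[room, chest]);
    }

    public float Value(int room, int chest) => q[room, chest];
}
```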
