Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning

This post summarizes a multi-agent reinforcement learning framework that combines guilt aversion with theory of mind. The framework considers not only material rewards but also introduces a notion of psychological loss to promote cooperative behavior among agents. Experiments are run in a fully observable environment where agents must cooperate to catch a stag (or individually catch hares) to test the effectiveness of the method.

author

Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia

Dung Nguyen, Svetha Venkatesh, Phuoc Nguyen, Truyen Tran

insight

Guilt aversion induces a utility loss in people who believe they have disappointed others, and this promotes cooperative behavior in humans.

model

Learning is driven not only by material rewards but also by a psychological loss arising from the feeling of guilt when an agent believes it has harmed others.

The framework is a reward shaping strategy in which the additional reward comes from the intrinsic social motivation of being fair to others. The reward shaping function is defined over the action space.

  • computational model of ToM
  • reinforcement learning:
    • update the agent's beliefs about the other agents
    • compute psychological rewards using a guilt-averse model, followed by an update of the value function (see the loop sketch below)
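
A minimal sketch of how these pieces might fit together in one training step. This is my reading of the loop, not the paper's code; `update_beliefs`, `guilt_reward`, and `td_update` are hypothetical method names.

```python
# Illustrative outer loop for a ToM + guilt-averse agent (hypothetical API).
def training_step(agent, env, state):
    action = agent.act(state)                        # e.g. epsilon-greedy
    next_state, material_reward, done = env.step(action)

    # 1. Revise zero-/first-order beliefs about the other agents' policy types.
    agent.update_beliefs(state, env.last_joint_action)

    # 2. Turn the believed letdown of the other agents into a guilt penalty.
    psych_reward = agent.guilt_reward(state, action)

    # 3. Learn from the combined material + psychological signal.
    agent.td_update(state, action, material_reward + psych_reward, next_state)
    return next_state, done
```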

contribution

Our contribution is to design and test a framework that brings the psychological concept of guilt aversion into multi-agent reinforcement learning, in effect connecting social psychology, psychological game theory, multi-agent systems, and reinforcement learning. For the first time, we explore and establish a computational model that embeds guilt aversion, coupled with theory of mind, in a reinforcement learning framework, and study it in extended Markov Games.

hypothesis

  • Agents can tell whether agent i's policy is cooperative or uncooperative.
  • The reward rule is as follows:

[image: reward rule]

contents

Theory of Mind Agents with Guilt Aversion

First-order Theory of Mind (ToM1) Agent
  • zero-order belief: b(0), a probability distribution over the events that the agent follows a cooperative or an uncooperative policy;
  • first-order belief: a recursive belief, representing what agent i thinks agent j believes (i.e., j's probability distribution over the events that agent i follows a cooperative or an uncooperative policy).
  • infer from the beliefs whether the agent's policy is cooperative or uncooperative, i.e. obtain the policy type:

[image: policy-type inference formula]
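
A toy encoding of these beliefs and of reading off the policy type. The two-type space (cooperative vs. uncooperative) comes from the notes above; the array representation is my assumption.

```python
import numpy as np

POLICY_TYPES = ["cooperative", "uncooperative"]

# Zero-order belief b(0): my distribution over the other agent's policy type.
b0 = np.array([0.5, 0.5])
# First-order belief b(1): what I think the other agent believes about my type.
b1 = np.array([0.5, 0.5])

def policy_type(belief):
    """Read off the most likely policy type under a belief vector."""
    return POLICY_TYPES[int(np.argmax(belief))]
```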

  • compute the belief integration function BI(jj), which fuses the zero-order and first-order beliefs:

Confidence:

[image: confidence formula]

BI(jj):
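
The BI formula itself is not reproduced in these notes. One common form in the computational-ToM literature, and my assumption for this sketch, is a confidence-weighted mixture of the two belief orders:

```python
def belief_integration(b0, b1, confidence):
    """Blend zero- and first-order beliefs.

    `confidence` in [0, 1] is the weight placed on the first-order
    estimate; higher confidence means the agent trusts its model of
    the other agent's mind more than its direct observations.
    """
    return confidence * b1 + (1.0 - confidence) * b0
```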

  • update the beliefs:
    [image: belief update formula]

zero-order:

[image: zero-order belief update]

first-order:

[image: first-order belief update]
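
The exact update rules are in the figures above. Read as a Bayesian revision with a type-conditional action likelihood (my assumption), one step would look like:

```python
import numpy as np

def bayes_update(belief, likelihoods):
    """One Bayesian belief-revision step.

    belief:      prior over policy types, e.g. [P(coop), P(uncoop)]
    likelihoods: P(observed action | type) for each type, taken from
                 the type-conditional policies (assumed available)
    """
    posterior = belief * likelihoods
    return posterior / posterior.sum()

# An action that a cooperative policy takes often and an uncooperative
# one takes rarely pushes the belief toward "cooperative":
b0 = bayes_update(np.array([0.5, 0.5]), np.array([0.8, 0.2]))  # -> [0.8, 0.2]
```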

Guilt Aversion (GA)
  • expected material value:

[image: expected material value formula]

  • psychological reward:

  • total reward:

[image: total reward formula]
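
My reading of the guilt-averse reward, with the exact forms left to the formulas above: the agent compares the material value it believes the other agent expects against the value the other agent will actually receive, and pays a penalty proportional to the shortfall. The guilt-sensitivity weight `theta` is a hypothetical parameter name.

```python
def total_reward(material_reward, expected_other_value, actual_other_value,
                 theta=1.0):
    """Material reward minus a guilt penalty (illustrative form only).

    expected_other_value: what this agent believes the other agent
                          expects to earn (derived from the ToM beliefs)
    actual_other_value:   what the other agent actually stands to earn
                          under the chosen action
    theta:                guilt sensitivity (hypothetical parameter)
    """
    letdown = max(0.0, expected_other_value - actual_other_value)
    return material_reward - theta * letdown
```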

Update the Value Function

For the problem considered here, the material reward attached to each state is fixed, so the psychological reward is what steers the agent toward actions that do not hurt the others' interests.

Because the material reward is fixed, the problem becomes simpler: the method that integrates the psychological reward resembles the usual value function update in RL.
[image: value function update]
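
For concreteness, a vanilla Q-learning step driven by the combined reward. This is the standard textbook update, not a claim about the paper's exact rule:

```python
def td_update(Q, state, action, total_reward, next_state,
              alpha=0.1, gamma=0.99):
    """One Q-learning step using material + psychological reward.

    Q is a nested dict: Q[state][action] -> value.
    """
    best_next = max(Q[next_state].values())
    td_target = total_reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
```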

experiment

  • environment:
    • fully observable
    • agents' aim: to catch a stag or hares; the players must cooperate to catch the stag.
    • actions: {left, up, down, right, stay}
    • reward scheme: catching the stag together gives each agent 4; both agents catching hares at the same time gives each 2; if only one agent catches a hare, it gets 3 and the other gets nothing (written out as a payoff table below).

[images: experiment figures]
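
The reward scheme, transcribed as a small per-agent payoff table (the outcome labels are mine):

```python
# Per-agent material payoffs in the stag-hunt environment described above.
PAYOFF = {
    ("stag", "stag"): (4, 4),   # both cooperate to catch the stag
    ("hare", "hare"): (2, 2),   # both catch hares at the same time
    ("hare", "none"): (3, 0),   # one agent catches a hare alone
    ("none", "hare"): (0, 3),
}
```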
