Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning
author
Dung Nguyen, Svetha Venkatesh, Phuoc Nguyen, Truyen Tran
Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
insight
Guilt aversion induces a utility loss in people who believe they have disappointed others, and this promotes cooperative behavior in humans.
model
Learning is driven not only by material rewards but also by a psychological loss due to the feeling of guilt when an agent believes it has harmed others.
This is a reward-shaping strategy in which the additional reward comes from the intrinsic social motivation of being fair to others; the reward-shaping function is defined over the action space.
- computational model of ToM
- reinforcement learning:
  - update the agent's beliefs about the other agents
  - compute psychological rewards using a guilt-averse model, followed by an update of the value function (see the per-step sketch below)
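A minimal sketch of the per-step loop this model implies, assuming a tabular two-agent setting. All names here are hypothetical placeholders, not the paper's API: `update_belief` and `q_update` are sketched later in these notes, and `guilt_reward` stands in for the guilt-aversion computation.

```python
# Hypothetical per-step loop of a guilt-averse ToM agent (sketch).
def agent_step(agent, env, state):
    action = agent.act(state)                      # act from current policy
    next_state, r_material, other_action = env.step(action)
    # 1) update beliefs about the other agent's policy type
    agent.belief = update_belief(agent.belief, other_action,
                                 agent.type_policies)
    # 2) compute the psychological (guilt) reward from the updated beliefs
    r_psych = guilt_reward(agent, state, action)
    # 3) update the value function on the shaped total reward
    q_update(agent.Q, state, action, next_state, r_material, r_psych)
    return next_state
```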
contribution
Our contribution is to design and test a framework that brings the psychological concept of guilt aversion into multi-agent reinforcement learning, thereby connecting social psychology, psychological game theory, multi-agent systems, and reinforcement learning. For the first time, we explore and establish a computational model that embeds guilt aversion, coupled with theory of mind, into a reinforcement learning framework, and we study it in extended Markov games.
hypothesis
- agents can distinguish whether agent i's policy is cooperative or uncooperative
- the reward rules are as follows:
contents
Theory of Mind Agents with Guilt Aversion
First-order Theory of Mind (ToM1) Agent
- zero-order belief: b^(0), a probability distribution over the events that an agent follows a cooperative or an uncooperative policy;
- first-order belief: a recursive belief representing what agent i thinks agent j believes, i.e., agent j's probability distribution over the events that agent i follows a cooperative or an uncooperative policy.
- based on its beliefs, the agent infers whether the other agent's policy is cooperative or uncooperative, i.e., it obtains the policy type:
- compute the belief integration function BI(jj):
  - confidence:
  - BI(jj):
- update the beliefs (see the sketch after this list):
  - zero-order:
  - first-order:
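The update formulas are missing from these notes. A plausible form, assuming the zero-order update is a Bayesian posterior over the two policy types given the other agent's last observed action (the first-order belief can be updated the same way, but from agent j's perspective), is sketched below; this is my assumption, not necessarily the paper's exact rule.

```python
import numpy as np

def update_belief(belief, observed_action, type_policies):
    """Bayesian update of a belief over policy types (sketch).

    belief          : shape (2,), [P(cooperative), P(uncooperative)]
    observed_action : index of the action the other agent just took
    type_policies   : shape (2, n_actions); type_policies[k, a] is the
                      probability that a type-k policy takes action a
                      in the current state (state argument omitted)
    """
    likelihood = type_policies[:, observed_action]
    posterior = belief * likelihood        # Bayes rule, unnormalized
    return posterior / posterior.sum()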
Guilt Aversion (GA)
- expected material value:
- psychological reward:
- total reward:
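These three quantities are left blank above. As a hedged reconstruction, the standard guilt-aversion model from psychological game theory (Battigalli and Dufwenberg) would give the following, where $\theta_i \ge 0$ is agent $i$'s guilt sensitivity and $b^{(1)}$ its first-order belief; the paper's exact definitions may differ:

$$
\mathbb{E}_j[r_j] \;=\; \sum_{c \,\in\, \{\text{coop},\,\text{uncoop}\}} b^{(1)}(c)\;\mathbb{E}\!\left[r_j \mid \pi_i = \pi^{c}\right],
\qquad
r_i^{\text{psy}} \;=\; -\,\theta_i \max\!\left(0,\ \mathbb{E}_j[r_j] - r_j\right),
\qquad
r_i^{\text{total}} \;=\; r_i + r_i^{\text{psy}}.
$$

Intuitively, agent i feels guilt only when it believes agent j received less material reward than j expected, and the guilt scales with that shortfall.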
Update the Value Function
For the problems targeted here, the original (material) reward attached to each state is fixed, and the psychological reward is used to guide the agent toward actions that do not harm the interests of others.
Because the original reward is fixed, the problem becomes simpler: the method that integrates the psychological reward resembles the standard value-function update in RL.
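A minimal sketch of such an update, assuming ordinary tabular Q-learning on the guilt-shaped total reward; the function and parameter names are mine, not the paper's.

```python
def q_update(Q, s, a, s_next, r_material, r_psych, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on the guilt-shaped total reward.

    Q is a dict of dicts: Q[state][action] -> value.
    """
    r_total = r_material + r_psych                        # reward shaping
    td_target = r_total + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])              # TD update
```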
experiment
- environment:
  - full observability
  - agents' aim: to catch a stag or a hare; the players must cooperate to catch the stag
  - actions: {left, up, down, right, stay}
  - rewards: catching the stag together gives each agent 4; catching hares simultaneously gives each agent 2; if a single agent catches a hare, that agent gets 3 and the other gets nothing
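A direct encoding of these payoffs as a reward function (a sketch; the catch-detection logic is omitted and the outcome labels are mine):

```python
def stag_hunt_rewards(catch_1, catch_2):
    """Payoffs of the Stag Hunt described above.

    catch_1, catch_2: what each agent caught this step, one of
    'stag', 'hare', or None. The stag requires cooperation, so
    it only pays when both agents catch it together.
    """
    if catch_1 == 'stag' and catch_2 == 'stag':
        return 4, 4            # cooperative stag capture
    if catch_1 == 'hare' and catch_2 == 'hare':
        return 2, 2            # both settle for hares
    if catch_1 == 'hare':
        return 3, 0            # lone hare capture
    if catch_2 == 'hare':
        return 0, 3
    return 0, 0                # no valid capture
```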