Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning
author
Dung Nguyen, Svetha Venkatesh, Phuoc Nguyen, Truyen Tran
Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
insight
Guilt aversion induces a utility loss in people who believe they have disappointed others, and this promotes cooperative behavior in humans.
model
Learning is driven not only by material rewards but also by a psychological loss due to the feeling of guilt when an agent believes it has harmed others.
This is a reward-shaping strategy in which the additional reward comes from the intrinsic social motivation of being fair to others; the reward-shaping function is defined over the action space.
- computational model of ToM
- reinforcement learning:
  - update the agent's beliefs about the other agents
  - compute psychological rewards using a guilt-averse model, followed by an update of the value function (see the per-step sketch below)
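A minimal sketch of the per-step loop this model implies, assuming a tabular two-agent setting. All names here are hypothetical placeholders, not the paper's API: `update_belief` and `q_update` are sketched later in these notes, and `guilt_reward` stands in for the guilt-aversion computation.

```python
# Hypothetical per-step loop of a guilt-averse ToM agent (sketch).
def agent_step(agent, env, state):
    action = agent.act(state)                      # act from current policy
    next_state, r_material, other_action = env.step(action)
    # 1) update beliefs about the other agent's policy type
    agent.belief = update_belief(agent.belief, other_action,
                                 agent.type_policies)
    # 2) compute the psychological (guilt) reward from the updated beliefs
    r_psych = guilt_reward(agent, state, action)
    # 3) update the value function on the shaped total reward
    q_update(agent.Q, state, action, next_state, r_material, r_psych)
    return next_state
```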
contribution
Our contribution is to design and test a framework that brings the psychological concept of guilt aversion into multi-agent reinforcement learning, thereby connecting social psychology, psychological game theory, multi-agent systems, and reinforcement learning. For the first time, we explore and establish a computational model that embeds guilt aversion, coupled with theory of mind, into a reinforcement learning framework, and we study it in extended Markov games.
hypothesis
- agents can distinguish whether agent i's policy is cooperative or uncooperative
- the reward rules are as follows:
contents
Theory of Mind Agents with Guilt Aversion
First-order Theory of Mind (ToM1) Agent
- zero-order belief: b^(0), a probability distribution over the events that an agent follows a cooperative or an uncooperative policy;
- first-order belief: a recursive belief representing what agent i thinks agent j believes, i.e., agent j's probability distribution over the events that agent i follows a cooperative or an uncooperative policy.
- based on its beliefs, the agent infers whether the other agent's policy is cooperative or uncooperative, i.e., it obtains the policy type:
- compute the belief integration function BI(jj):
  - confidence:
  - BI(jj):
- update the beliefs (see the sketch after this list):
  - zero-order:
  - first-order:
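The update formulas are missing from these notes. A plausible form, assuming the zero-order update is a Bayesian posterior over the two policy types given the other agent's last observed action (the first-order belief can be updated the same way, but from agent j's perspective), is sketched below; this is my assumption, not necessarily the paper's exact rule.

```python
import numpy as np

def update_belief(belief, observed_action, type_policies):
    """Bayesian update of a belief over policy types (sketch).

    belief          : shape (2,), [P(cooperative), P(uncooperative)]
    observed_action : index of the action the other agent just took
    type_policies   : shape (2, n_actions); type_policies[k, a] is the
                      probability that a type-k policy takes action a
                      in the current state (state argument omitted)
    """
    likelihood = type_policies[:, observed_action]
    posterior = belief * likelihood        # Bayes rule, unnormalized
    return posterior / posterior.sum()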
Guilt Aversion (GA)
- expected material value:
- psychological reward:
- total reward:
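These three quantities are left blank above. As a hedged reconstruction, the standard guilt-aversion model from psychological game theory (Battigalli and Dufwenberg) would give the following, where $\theta_i \ge 0$ is agent $i$'s guilt sensitivity and $b^{(1)}$ its first-order belief; the paper's exact definitions may differ:

$$
\mathbb{E}_j[r_j] \;=\; \sum_{c \,\in\, \{\text{coop},\,\text{uncoop}\}} b^{(1)}(c)\;\mathbb{E}\!\left[r_j \mid \pi_i = \pi^{c}\right],
\qquad
r_i^{\text{psy}} \;=\; -\,\theta_i \max\!\left(0,\ \mathbb{E}_j[r_j] - r_j\right),
\qquad
r_i^{\text{total}} \;=\; r_i + r_i^{\text{psy}}.
$$

Intuitively, agent i feels guilt only when it believes agent j received less material reward than j expected, and the guilt scales with that shortfall.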
Update the Value Function
For the problems targeted here, the original (material) reward attached to each state is fixed, and the psychological reward is used to guide the agent toward actions that do not harm the interests of others.
Because the original reward is fixed, the problem becomes simpler: the method that integrates the psychological reward resembles the standard value-function update in RL.
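A minimal sketch of such an update, assuming ordinary tabular Q-learning on the guilt-shaped total reward; the function and parameter names are mine, not the paper's.

```python
def q_update(Q, s, a, s_next, r_material, r_psych, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on the guilt-shaped total reward.

    Q is a dict of dicts: Q[state][action] -> value.
    """
    r_total = r_material + r_psych                        # reward shaping
    td_target = r_total + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])              # TD update
```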
experiment
- environment:
  - full observability
  - agents' aim: to catch a stag or a hare; the players must cooperate to catch the stag
  - actions: {left, up, down, right, stay}
  - rewards: catching the stag together gives each agent 4; catching hares simultaneously gives each agent 2; if a single agent catches a hare, that agent gets 3 and the other gets nothing
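A direct encoding of these payoffs as a reward function (a sketch; the catch-detection logic is omitted and the outcome labels are mine):

```python
def stag_hunt_rewards(catch_1, catch_2):
    """Payoffs of the Stag Hunt described above.

    catch_1, catch_2: what each agent caught this step, one of
    'stag', 'hare', or None. The stag requires cooperation, so
    it only pays when both agents catch it together.
    """
    if catch_1 == 'stag' and catch_2 == 'stag':
        return 4, 4            # cooperative stag capture
    if catch_1 == 'hare' and catch_2 == 'hare':
        return 2, 2            # both settle for hares
    if catch_1 == 'hare':
        return 3, 0            # lone hare capture
    if catch_2 == 'hare':
        return 0, 3
    return 0, 0                # no valid capture
```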