[Reinforcement Learning] The Fairness Actor-Critic Algorithm

code120302

Last modified: 2024-05-08 11:23:07


First published: 2024-05-07 20:04:26

Article link: https://blog.csdn.net/code120302/article/details/138536058


Reading notes on Bringing Fairness to Actor-Critic Reinforcement Learning for Network Utility Optimization

Problem Formulation
Learning Algorithm
- Learning with Multiplicative-Adjusted Rewards
- Solving Fairness Utility Optimization
Evaluations
Code Implementation

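Fairness is an important criterion in network optimization. As more and more devices join the network, resource allocation and task scheduling must account for fairness among devices and strike a balance between system efficiency and user fairness. In recent years reinforcement learning has been applied successfully to online decision-making in network optimization, but most algorithms focus on maximizing the long-term return of all agents and rarely consider fairness. Against this background, the authors propose a fairness Actor-Critic algorithm that builds fairness into the design of the AC algorithm, with the goal of optimizing an overall fairness utility function. Concretely, they design an adjusted reward: the original reward is multiplied by a weight that depends on the utility function and on past rewards. In the experiments, the authors apply the algorithm to a network scheduling problem (convex) and to a video-streaming QoE optimization problem (non-convex) to demonstrate its effectiveness.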

Problem Formulation

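Consider a network utility optimization problem in which the network is modeled as the environment and the users are the agents. The agents interact with the environment and learn a policy that optimizes their rewards (e.g., data rate). Suppose there are K agents, and let the stochastic policy $\pi(a|s)$ denote the probability of choosing action a in state s. $x_{\pi,k}$ denotes the long-run average reward of agent k under policy $\pi$:
$x_{\pi,k} = \lim_{T \to \infty} \frac{1}{T} \sum^{T}_{t=1} r_{k,t}$
The paper uses the $\alpha$-fair utility function, which is widely used in network optimization. For any $\alpha \geq 0$,
$U(x) = \frac{x^{1-\alpha}}{1-\alpha}$ for $\alpha \neq 1$, and $U(x) = \log x$ for $\alpha = 1$.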
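To see how $\alpha$ shifts the balance between efficiency and fairness, the sketch below scores two hypothetical two-agent allocations under $\alpha$ = 0, 1, 2. The allocations and the function names (`alpha_fair_utility`, `total_utility`) are illustrative assumptions, not taken from the paper.

```python
import math


def alpha_fair_utility(x, alpha):
    """alpha-fair utility of a positive average reward x."""
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)


def total_utility(rewards, alpha):
    """Sum of per-agent utilities, i.e. the objective sum_k U(x_k)."""
    return sum(alpha_fair_utility(x, alpha) for x in rewards)


# Two hypothetical allocations of average reward to two agents.
unequal = [9.0, 1.0]  # higher total reward, very uneven
equal = [4.0, 4.0]    # lower total reward, perfectly even

for alpha in (0, 1, 2):
    u_unequal = total_utility(unequal, alpha)
    u_equal = total_utility(equal, alpha)
    preferred = "unequal" if u_unequal > u_equal else "equal"
    print(f"alpha={alpha}: unequal={u_unequal:.3f}, equal={u_equal:.3f} -> prefers {preferred}")
```

With $\alpha$ = 0 the objective is the plain sum of rewards, so the uneven allocation scores higher; with $\alpha$ = 1 (proportional fairness) and $\alpha$ = 2 the even allocation scores higher.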

Learning Algorithm

Assume that the Markov chain induced by any policy is irreducible and aperiodic.

Learning with Multiplicative-Adjusted Rewards

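To optimize the fairness utility, the algorithm has to keep track of past rewards. Why can past rewards be used to achieve fairness?
Consider two agents, each with its own reward. If, under some policy, agent 1 has accumulated more reward than agent 2 by epoch t, then the policy-gradient update should favor agent 2 over agent 1. Past rewards can therefore be used to optimize fairness.
Let $h_{\pi, t}$ denote the data obtained from the sample path up to epoch t, and use a uniformly continuous function $\phi(h_{\pi, t})$ to compute the reward multiplier. This function is where the notion of "fairness" enters. The adjusted reward is defined as the original reward scaled by this multiplier:
$\hat{r_{k,t}} = r_{k,t} \, \phi(h_{\pi, t})$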
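The two-agent intuition above can be made concrete with a small sketch. Here the history $h_{\pi, t}$ is summarized by each agent's running average reward and the multiplier is taken to be $\phi(h) = 1/h$. Both choices, along with the names `RewardHistory` and `adjusted_rewards`, are assumptions for illustration only.

```python
class RewardHistory:
    """Tracks per-agent running average rewards (a simple summary of h_t)."""

    def __init__(self, n_agents):
        self.totals = [0.0] * n_agents
        self.steps = 0

    def update(self, rewards):
        for k, r in enumerate(rewards):
            self.totals[k] += r
        self.steps += 1

    def averages(self):
        if self.steps == 0:
            return [0.0] * len(self.totals)
        return [total / self.steps for total in self.totals]


def adjusted_rewards(rewards, history, phi, eps=1e-6):
    """Multiply each raw reward by phi(running average) of the same agent."""
    return [r * phi(max(h, eps)) for r, h in zip(rewards, history.averages())]


history = RewardHistory(n_agents=2)
# Agent 0 has collected far more reward than agent 1 so far.
for past in ([4.0, 1.0], [6.0, 1.0], [5.0, 1.0]):
    history.update(past)

# Both agents receive the same raw reward in the current step...
current = [2.0, 2.0]
adjusted = adjusted_rewards(current, history, phi=lambda h: 1.0 / h)
# ...but the agent with the lower running average gets the larger adjusted reward.
print(history.averages(), adjusted)
```

Because the adjusted reward of the under-served agent is larger, a gradient step driven by adjusted rewards weights that agent's outcomes more heavily.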
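Let $\hat{\rho_{\pi}}$ denote the average per-step adjusted reward of the MDP, and define the state-value function and the action-value function as follows: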

As can be seen, both V and Q are bounded. Define an augmented function

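Because the adjusted reward depends on the past history h, the standard RL policy-gradient theorem no longer applies to it, so the MDP has to be analyzed anew.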

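[Image: expression for the change in the average adjusted reward]
When the policy parameters change by a small amount, the average reward changes as given by the expression above.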
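Proof: Define a new state $z_t = [s_t, h_{\pi, t}]$; the new Markov process is the chain of states $z_t$, actions $a_t$, and rewards $\hat{r_{k,t}}$. Let $p^a_{zz'}$ denote the state-transition probability, and let $V_{\pi}(z)$ and $Q_{\pi}(z,a)$ be the state-value and action-value functions. Let $P^{\pi}(z|s)$ denote the limiting probability of z given state s. The Q and V functions are then expressed as follows:
[Image: expressions for $Q_{\pi}(z,a)$ and $V_{\pi}(z)$]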
Define an auxiliary function

where $A_{\pi}(z,a) = Q_{\pi}(z,a) - V_{\pi}(z)$. Then

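Because $\sum_{a}\pi(a|s)Q_{\pi}(z,a) = V_{\pi}(z)$, the derivation gives $G_{\theta+\epsilon, \theta+\epsilon, \theta+\epsilon} = 0$. In the last step of the derivation above, the first and third terms cancel, which finally yields
[Image: resulting expression after cancellation]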
When the change $\epsilon$ in the policy parameters is very small, the corresponding change in the policy $\pi_{\theta}$ can be bounded by $\epsilon \nabla \pi_{\theta}(a|s) + O(||\epsilon||^2_2)$. It follows that

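With the gradient above and a sufficiently small learning rate, the algorithm converges to a stationary point.
The policy-gradient algorithm is as follows (it is similar to REINFORCE):
[Image: policy-gradient algorithm listing]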
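A rough sketch of a REINFORCE-style update driven by adjusted rewards, on a hypothetical single-state scheduling toy: one of two agents is served per step and the served agent receives a fixed rate. The toy environment, the rates, the multiplier $\phi(h) = 1/h$ on the running average, and all function names are assumptions for illustration, not the authors' setup.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, adjusted_reward, lr):
    """One REINFORCE update: theta += lr * r_hat * grad log pi(action)."""
    probs = softmax(theta)
    # For a softmax policy, d log pi(a) / d theta_j = 1[a == j] - pi(j).
    return [
        t + lr * adjusted_reward * ((1.0 if j == action else 0.0) - probs[j])
        for j, t in enumerate(theta)
    ]


def run(steps=2000, lr=0.01, rates=(2.0, 1.0), seed=0):
    """Train the toy scheduler; returns the final policy and average rewards."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    totals = [0.0, 0.0]
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs)[0]  # which agent is served
        totals[action] += rates[action]
        # Only the served agent has a nonzero reward, so the adjusted reward is
        # its rate times the multiplier 1 / (its running average reward).
        adjusted_reward = rates[action] / (totals[action] / t)
        theta = reinforce_step(theta, action, adjusted_reward, lr)
    return softmax(theta), [total / steps for total in totals]


policy, average_rewards = run()
print(policy, average_rewards)
```

In this toy, an agent that has been served rarely has a small running average, so serving it yields a large adjusted reward and a correspondingly larger push toward serving it again.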

Solving Fairness Utility Optimization

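Lemma 2 shows that the new policy-gradient algorithm converges to a stationary point of the adjusted MDP. Let $\theta^*$ denote the parameters of the optimal policy; the per-step average of the original (unadjusted) reward is then
[Image: expression for the per-step average original reward under $\theta^*$]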
We need to show that $\theta^*$ is also a stationary point of the optimization problem $\sum_{k}U(x_{\pi_{\theta},k})$.
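The uniformly continuous function $\phi$ is set to the first derivative of the utility function U. This derivative is Lipschitz continuous: $|U'(x) - U'(y)| \leq L|x - y|$ for some $L > 0$. The adjusted reward can then be written as
$\hat{r_{k,t}} = r_{k,t} \, U'\left(\frac{1}{t}\sum^{t}_{\tau=1} r_{k,\tau}\right)$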
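With $\phi = U'$ applied to each agent's running average reward, a quick simulation under a fixed policy shows what the per-step average adjusted reward tends to. The sketch below assumes hypothetical i.i.d. rewards (uniform within 0.5 of each agent's mean) and compares the empirical average with $\sum_{k} x_k U'(x_k)$. The reward model and function names are assumptions for illustration.

```python
import random


def utility_derivative(x, alpha):
    """U'(x) = x^(-alpha) for the alpha-fair utility."""
    return x ** (-alpha)


def average_adjusted_reward(means, alpha, steps=20000, seed=0):
    """Empirical per-step average of sum_k r_{k,t} * U'(running average of agent k)."""
    rng = random.Random(seed)
    totals = [0.0] * len(means)
    adjusted_sum = 0.0
    for t in range(1, steps + 1):
        # Hypothetical i.i.d. rewards, uniform within +/-0.5 of each agent's mean.
        rewards = [rng.uniform(m - 0.5, m + 0.5) for m in means]
        for k, r in enumerate(rewards):
            totals[k] += r
            adjusted_sum += r * utility_derivative(totals[k] / t, alpha)
    return adjusted_sum / steps


means = [1.0, 2.0]  # long-run average rewards x_k of two agents
alpha = 1
empirical = average_adjusted_reward(means, alpha)
limit = sum(x * utility_derivative(x, alpha) for x in means)
print(empirical, limit)
```

For $\alpha$ = 1 each term $x_k U'(x_k)$ equals 1, so the reference value is simply the number of agents, and the empirical average settles close to it.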

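Theorem 1: The policy-gradient algorithm converges to a stationary point of the fairness utility function.
Proof: From the above, $\theta^*$ is a stationary point of the adjusted MDP, i.e. $\nabla_{\theta} \hat{\rho_{\pi_{\theta}}} |_{\theta=\theta^* }= 0$. We need to show that $\theta^*$ is also a stationary point of the $\alpha$-fair utility function $\sum_{k} U(x_{\pi_{\theta},k})$, i.e. $\nabla_{\theta} [\sum_{k} U(x_{\pi_{\theta},k})] |_{\theta=\theta^* }= 0$.
We therefore analyze the relationship between the per-step average adjusted reward $\hat{\rho_{\pi_{\theta}}}$ and the per-step average reward $x_{\pi_{\theta},k}$.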

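By Eq. (17), we have
[Image: expression for $\hat{\rho_{\pi_{\theta}}}$ in terms of the adjusted rewards]
Under policy $\pi_{\theta}$, for any $\epsilon$ > 0 there exists a sufficiently large T such that $|\frac{1}{T}\sum^{T}_{t=1} r_{k,t} - x_{\pi_{\theta},k}| < \epsilon$. Combined with the Lipschitz continuity of U', this gives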

where C1 is a bound on $|U'(x_{\pi_{\theta},k})|$ and C2 is a bound on $|\frac{1}{T}\sum^{T}_{t=1} r_{k,t}|$. When T is sufficiently large,
$\hat{\rho_{\pi_{\theta}}} = \sum_{k} x_{\pi_{\theta},k}U'(x_{\pi_{\theta},k})$
Since $\theta^*$ is a stationary point of the adjusted MDP, $\nabla_{\theta} \hat{\rho_{\pi_{\theta}}} |_{\theta=\theta^* }= 0$, i.e. $\nabla_{\theta} [\sum_{k} x_{\pi_{\theta},k}U'(x_{\pi_{\theta},k})] |_{\theta=\theta^* }= 0$.
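One way to read the final step, sketched here under the assumption that the multiplier $U'(\cdot)$ is held fixed when differentiating with respect to $\theta$ (it is computed from past rewards rather than from the current parameters): stationarity of the adjusted MDP then coincides with stationarity of the fairness utility, the second equality below being the chain rule. This is my reading of the argument, not a transcription of the paper's proof.

```latex
\nabla_{\theta} \hat{\rho}_{\pi_{\theta}} \Big|_{\theta=\theta^*}
  = \sum_{k} U'(x_{\pi_{\theta},k}) \, \nabla_{\theta} x_{\pi_{\theta},k} \Big|_{\theta=\theta^*}
  = \nabla_{\theta} \Big[ \sum_{k} U(x_{\pi_{\theta},k}) \Big] \Big|_{\theta=\theta^*}
  = 0
```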

The result above leads to a new actor-critic algorithm: a neural network $\hat{V_{w}}(s_{t})$ approximates the state-value function and is trained with the TD error.
[Image: Fairness Actor-Critic algorithm listing]
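The critic update can be illustrated with a tabular TD(0) sketch. It uses a discounted TD error $\delta = \hat{r} + \gamma V(s') - V(s)$, a lookup table in place of the neural network $\hat{V_{w}}$, and hypothetical state names and rewards. All of these are simplifying assumptions rather than the paper's exact formulation.

```python
def td_update(values, state, adjusted_reward, next_state, gamma=0.9, lr=0.1):
    """One TD(0) step on a tabular value function; returns the TD error."""
    td_error = adjusted_reward + gamma * values[next_state] - values[state]
    values[state] += lr * td_error
    return td_error


# Hypothetical two-state example: (state, adjusted reward, next state).
values = {"idle": 0.0, "busy": 0.0}
transitions = [("idle", 1.0, "busy"), ("busy", 0.5, "idle"), ("idle", 1.0, "busy")]
for state, r_hat, next_state in transitions:
    delta = td_update(values, state, r_hat, next_state)
    print(state, round(delta, 4), values)
```

In an actor-critic loop the same TD error would also scale the actor's log-probability gradient, playing the role of an advantage estimate.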

Evaluations

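Two scenarios are evaluated: wireless network scheduling and QoE optimization.
Both sets of results show the advantages of the FAC algorithm: it optimizes the global utility and converges quickly.
[Image: evaluation results]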

Code Implementation

Below is my own attempt at reproducing the algorithm in code (some details of the paper remain unclear to me).

import numpy as np
import torch
from torch import nn
from torch.nn import functional as F


class Critic(nn.Module):
    """State-value network V_w(s): maps a state to a scalar value."""

    def __init__(self, agent_id, state_dim, hidden_num=250):
        super().__init__()
        self.agent_id = agent_id
        self.fc1 = nn.Linear(state_dim, hidden_num)
        self.fc2 = nn.Linear(hidden_num, 1)

    def forward(self, s):
        x = F.relu(self.fc1(s))
        return self.fc2(x)


class Actor(nn.Module):
    """Policy network: maps a state to a distribution over discrete actions."""

    def __init__(self, agent_id, state_dim, action_dim, hidden_num=250):
        super().__init__()
        self.agent_id = agent_id
        self.f1 = nn.Linear(state_dim, hidden_num)
        self.f2 = nn.Linear(hidden_num, action_dim)

    def forward(self, s):
        x = F.relu(self.f1(s))
        return F.softmax(self.f2(x), dim=-1)


class FairnessActorCritic:
    eps = 1e-6  # guards against log(0) and division by zero in the utility

    def __init__(self, state_dim, hidden_num, action_dim, device, n_agents, alpha, actor_lr=0.001, critic_lr=0.01,
                 gamma=0.9):
        self.agents = []
        self.n_agents = n_agents  # k agents
        self.alpha = alpha
        self.gamma = gamma
        self.device = device

        # A single actor-critic pair shared by all agents.
        # Alternative (not used here): one Actor/Critic pair and optimizer per agent.
        self.actor = Actor(0, state_dim, action_dim, hidden_num).to(self.device)
        self.critic = Critic(0, state_dim, hidden_num).to(self.device)
        self.actor_optim = torch.optim.Adam(self.actor.parameters(), actor_lr)
        self.critic_optim = torch.optim.Adam(self.critic.parameters(), critic_lr)

        self.step = 0
        self.total_past_rewards = self.create_past_reward_list()

    def create_past_reward_list(self):
        return [0.0] * self.n_agents

    def alpha_fair_utility_function(self, x):
        x = max(x, self.eps)
        if self.alpha == 1:
            return np.log(x)
        return x ** (1 - self.alpha) / (1 - self.alpha)

    def utility_first_order_derivative(self, x):
        # U'(x) = x^(-alpha); this also covers alpha == 1, where U'(x) = 1 / x.
        return max(x, self.eps) ** (-self.alpha)

    def to_tensor(self, inputs):
        return torch.as_tensor(inputs, dtype=torch.float32, device=self.device)

    def take_action(self, states):
        """
        :param states: [state 1, state 2, ..., state k]
        :return: [action 1, action 2, ..., action k]
        """
        actions = []
        with torch.no_grad():
            for state in states:
                # predict the action probabilities for this agent's state
                action_prob = self.actor(self.to_tensor(state))
                # sample an action from the resulting categorical distribution
                action_dist = torch.distributions.Categorical(action_prob)
                actions.append(action_dist.sample().item())
        return actions

    def calculate_adjust_rewards(self, current_rewards):
        """
        :param current_rewards: list [r1, r2, ..., rk]
        :return: sum of adjusted rewards of all agents, using self.total_past_rewards as the history
        """
        if self.step == 0:
            # no history yet: fall back to the plain sum of rewards
            return sum(current_rewards)
        adjust_rewards = 0
        for i in range(self.n_agents):
            adjust_rewards += current_rewards[i] * self.utility_first_order_derivative(
                self.total_past_rewards[i] / self.step)
        return adjust_rewards

    def learn(self, states, actions, rewards, next_states):
        adjust_rewards = self.calculate_adjust_rewards(rewards)

        # The paper does not state whether the framework is centralized training with
        # decentralized execution or fully centralized. Here a single actor-critic is
        # used, and V(s_t), V(s_t+1) are the averages of the critic's values over all
        # agents' states.
        states = torch.stack([self.to_tensor(state) for state in states])  # [k, state_dim]
        next_states = torch.stack([self.to_tensor(state) for state in next_states])
        actions = torch.as_tensor(actions, dtype=torch.long, device=self.device)  # [k]

        states_V = self.critic(states).mean()
        with torch.no_grad():
            next_states_V = self.critic(next_states).mean()
        # TD error, also used as the advantage estimate for the actor
        advantage = adjust_rewards + self.gamma * next_states_V - states_V

        # critic: minimize the squared TD error
        critic_loss = advantage.pow(2)

        # actor: policy gradient, -log pi(a|s) weighted by the (detached) advantage
        action_probs = self.actor(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        actor_loss = -(torch.log(action_probs + 1e-8).sum() * advantage.detach())

        self.critic_optim.zero_grad()
        critic_loss.backward()
        self.critic_optim.step()

        self.actor_optim.zero_grad()
        actor_loss.backward()
        self.actor_optim.step()

        self.step += 1
        for i in range(self.n_agents):
            self.total_past_rewards[i] += rewards[i]

        reward_ave = [pr / self.step for pr in self.total_past_rewards]
        utility_sum = 0
        for i in range(self.n_agents):
            utility_sum += self.alpha_fair_utility_function(reward_ave[i])

        return actor_loss.item(), critic_loss.item(), reward_ave, utility_sum

————————————————————————————
References:
[1] J. Chen, Y. Wang and T. Lan, "Bringing Fairness to Actor-Critic Reinforcement Learning for Network Utility Optimization," IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, Vancouver, BC, Canada, 2021, pp. 1-10.