Learning-Driven Interference-Aware Workload Parallelization for Streaming Applications in Heterogeneous Cluster
- Reading notes on the first TPDS 2021 paper (source); manuscript received Feb. 2019
1. Motivation
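- Earlier work either addressed task scheduling on hybrid CPU-GPU platforms without considering heterogeneous clusters,
- or addressed scheduling at the data center level and cluster level without making full use of GPU resources,
- or addressed interference between tasks in heterogeneous clusters without considering the hybrid CPU-GPU architecture,
- or proposed learning-driven task parallelization strategies without considering the interference caused by resource contention.
- (My addition) SP survey, p. 6: Heterogeneous processing nodes might influence the processing speed of the SP application in all types of infrastructure (single-node, cluster, cloud, fog, CPU & GPU).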
2. Problem & Objective
Problem
streaming application parallelization problem with fine-grained task division and task interference detection in the CPU-GPU heterogeneous cluster.
Objective
- maximize the cluster task throughput in long term through automatically generating the best scheduling actions
- improve the resource utilization by considering the task interference
Quality of Service (QoS) (my addition)
- high resource utilization
- high task throughput
- strong elasticity and scalability
3. Technique
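- Pre-training: Multi-Layer Perceptron (MLP) network
- Stage 1: Learning-Driven Workload Parallelization (LDWP), the cluster scheduler
  - Based on Deep Reinforcement Learning (DRL)
  - Role: choose a suitable, ideally optimal, execution node for each of the mutually independent tasks
  - Cluster-level scheduling model
    - uses a deep Q-network (DQN)
    - takes the best parallel scheduling action given the runtime state, the cluster environment, and the task characteristics
- Stage 2: Interference-Aware Workload Parallelization (IAWP), the node scheduler
  - Based on Neural Collaborative Filtering (NCF)
  - Role: determine the appropriate number of heterogeneous computing units each subtask relies on, taking subtask interference into account
- Transfer learning
  - Role: rebuild the task scheduling model and improve its generalization ability
  - Parameter-transfer method at cluster level: when the heterogeneous cluster changes, quickly generate an efficient scheduling network from the previously built model
  - At node level: apply the existing NCF model parameters to newly added worker nodes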
4. Prerequisites
Deep Reinforcement Learning (DRL)
- Learning goal: maximize the expected cumulative reward
- Future discounted reward: $R_t=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$
- Optimal action-value function $Q^*(s,a)$: the maximum expected return that can be achieved by following a policy after observing some state sequence $s$ and then taking some action $a$
- Bellman equation: $Q^*(s,a)=\pmb E[r+\gamma \max_{a'}Q^*(s',a')]$
- Function approximation: $Q(s,a;\theta) \approx Q^*(s,a)$
- Model-free Q-learning algorithm:
  - iteratively updates $Q(s,a;\theta)$
  - chooses actions that maximize a quality function $Q_{t+1}(s_t,a_t)$ at a specific time step $t$
  - is highly unstable and may not converge when combined with nonlinear approximators
- Deep Q-Network (DQN): much more stable
  - Q-network: a neural network function approximator with weights $\theta$
  - like Q-learning, it is trained iteratively by updating the parameters $\theta$ of the Q-network to reduce the mean-squared error in the Bellman equation
  - target value $y=r+\gamma \max_{a'}Q(s',a';\theta^-)$, where $\theta^-$ comes from some previous iteration
  - loss function $L_i(\theta_i)=\pmb E[(r+\gamma \max_{a'}Q(s',a';\theta_i^-)-Q(s,a;\theta_i))^2]$
  - the parameters $\theta$ are updated by the Stochastic Gradient Descent (SGD) algorithm
  - at each time step $t$, the action is selected by an $\varepsilon$-greedy policy with respect to the current Q-network
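As a minimal sketch of the DQN quantities listed above (not the paper's implementation), the following pure-Python snippet computes the target value $y$, the per-sample squared error that appears inside $L_i(\theta_i)$, and an $\varepsilon$-greedy choice. The function names and the `terminal` flag are illustrative assumptions, and Q-values are passed in as plain lists instead of coming from a real network.

```python
import random


def td_target(reward, next_q_target, gamma, terminal=False):
    """y = r + gamma * max_a' Q(s', a'; theta^-).

    `terminal` is an assumption of this sketch: on the last step of an
    episode there is no next state, so the target is just the reward.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def squared_td_error(q_taken, target):
    """Per-sample term of the loss: (y - Q(s, a; theta))^2."""
    return (target - q_taken) ** 2


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


# Toy numbers, chosen only for illustration.
y = td_target(reward=1.0, next_q_target=[0.5, 2.0], gamma=0.8)
loss = squared_td_error(q_taken=3.0, target=y)
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
```

Averaging `squared_td_error` over a mini-batch gives an empirical estimate of the expectation in the loss function.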
Neural Collaborative Filtering (NCF)
- Widely used in recommender systems
- Neighborhood-based method
  - relies on the similarity between users or the similarity between items
- Latent factor model:
  - characterizes both users and items in a latent factor domain
  - Matrix Factorization (MF): maps the rating matrix into a joint latent space of user and item features
    - $K$: the dimension of the latent feature space; $M$: number of users; $N$: number of items; rating matrix $R_{M\times N}$
    - $R_{M\times N}=U_{M\times K}V_{K\times N}$, where $U_u$ is the user latent feature vector and $V_i$ is the item latent feature vector
    - $y_{ui}$ is the rating that user $u$ gives to item $i$, predicted as $\hat{y}_{ui}=f(U_u,V_i)=\sum_{k=1}^KU_{uk}V_{ik}=U_u^TV_i$
    - Drawbacks:
      - cold start problem / sparsity problem
      - a fixed inner product cannot capture complex nonlinear features
- Multi-Layer Perceptron (MLP) network
  - Inputs: user latent vector $\pmb u_u$ and item latent vector $\pmb v_i$ obtained from the embedding layer
  - Output: predicted score $\hat{y}_{ui}$
  - $\hat{y}_{ui}=\phi_{out}(\phi_X(...\phi_2(\phi_1(\pmb{u}_u,\pmb{v}_i))...))$, where $X$ is the number of hidden layers and each $\phi$ is a mapping function
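To make the contrast between the two predictors above concrete, here is a small pure-Python sketch: `mf_predict` is the fixed inner product $U_u^TV_i$, and `mlp_predict` stacks ReLU layers on the concatenated latent vectors. The helper names, the choice of concatenation for $\phi_1$, and the linear output layer are assumptions made for illustration, and the weights are toy values.

```python
def mf_predict(user_vec, item_vec):
    """Matrix factorization score: inner product of the two latent vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def relu(values):
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """One fully connected layer: W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp_predict(user_vec, item_vec, hidden_layers, output_weights):
    """phi_out(phi_X(...phi_1(u, v)...)) with ReLU hidden layers.

    Assumptions of this sketch: phi_1 concatenates the two vectors and the
    output layer is a plain weighted sum.
    """
    a = list(user_vec) + list(item_vec)
    for weights, bias in hidden_layers:
        a = relu(dense(a, weights, bias))
    return sum(h * ak for h, ak in zip(output_weights, a))


user, item = [1.0, 2.0], [0.5, -1.0]
mf_score = mf_predict(user, item)

hidden = [([[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0])]
mlp_score = mlp_predict(user, item, hidden, output_weights=[0.5, 0.25])
```

The MLP can represent interactions that the inner product cannot, which is the motivation given above for replacing MF with a learned mapping.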
5. Problem Analysis
- Workload Model
  - Tasks do not depend on each other; dependencies exist only between subtasks. The subtasks of a task share all of the task's input data.
  - $T_i=(TID_i,TSize_i)$, where $TID_i$ is the task type ID and $TSize_i$ is the size of the input data
  - $T_i=G(N_i,E_i)$, where $G$ is the workflow graph
  - $N_i=\{ST_{i1},...,ST_{i,n_i}\}$, where $n_i$ is the number of subtasks of task $i$ and $ST_{ij}=(STID_{ij},TSize_i)$
  - $E_i^{jk}\in E_i$ is an edge (dependency) from subtask $ST_{ij}$ to subtask $ST_{ik}$
  - Association pattern: the relationship between the data features a subtask uses and the task's input data
- System Model
  - Distributed computing platform: a cluster manager node and a cluster of hybrid CPU-GPU worker nodes. Each worker node comprises multicore CPUs and multiple GPUs. The cluster manager node runs the Cluster Scheduler (CS), and each worker node runs a Node Scheduler (NS).
  - Cluster manager node:
    - responds to the task processing requests of the Application System (AS)
    - controls task execution in the cluster
    - analyzes each task to obtain its configuration requirements and adds the task to the task waiting queue of the cluster
    - the Cluster Scheduler (CS) collects worker node resource information and task execution states, and uses LDWP to schedule the tasks in the waiting queue onto appropriate worker nodes
  - A task processing request can be divided into a series of independent tasks (the minimum scheduling unit)
  - After a task is assigned to a node, it is first split into subtasks, which are placed in the subtask pending queue or the subtask ready queue. The Node Scheduler (NS) then uses IAWP to choose a suitable computing unit (CPU or GPU) for the waiting subtasks.
  - Subtasks assigned to a GPU first enter a priority queue; the NS then uses IAWP to run a suitable subtask according to the interference between subtasks.
  - $M$ worker nodes $\{S_1,...,S_M\}$
  - Node state vector: $S_i(t)=(Sr_i^c(t),Sr_i^g(t),Sr_i^{cm}(t),Sr_i^{gm}(t),Sr_i^{tx}(t),Sr_i^{rx}(t))$ (the $r$ probably stands for utilization rate)

| Variable | Meaning |
|---|---|
| $Sr_i^c(t)$ | CPU utilization |
| $Sr_i^g(t)$ | GPU utilization |
| $Sr_i^{cm}(t)$ | Host memory utilization |
| $Sr_i^{gm}(t)$ | GPU memory utilization |
| $Sr_i^{tx}(t)$ | Node uplink traffic rate |
| $Sr_i^{rx}(t)$ | Node downlink traffic rate |
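- Interference Analysis
  - A single GPU can execute multiple tasks at the same time.
  - Subtasks with similar characteristics may compete for the same resources, so running them on the same GPU causes interference. IAWP predicts the similarity between the subtasks in the priority queue and the running subtasks, and avoids running similar ones concurrently.
  - Contended resources:
    - Streaming Multiprocessors (SM), memory resources (L1 cache, L2 cache, texture cache and DRAM memory), and the interconnect network
    - the first two classes are quantified by utilization
    - the third is quantified by the global load throughput and global store throughput metrics
    - these metrics are used to measure similarity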
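Returning to the workload model at the start of this section, the sketch below represents one task $T_i=G(N_i,E_i)$ as a small dependency graph and lists the subtasks that could move from the pending queue to the ready queue. The `Task` class, its field names, and `ready_subtasks` are hypothetical and only illustrate the definitions; the paper's actual data structures are not given in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """T_i = (TID_i, TSize_i) together with its workflow graph G(N_i, E_i)."""
    tid: int                                   # TID_i: task type ID
    tsize: float                               # TSize_i: input data size
    subtasks: list                             # N_i: subtask IDs
    edges: set = field(default_factory=set)    # E_i: (j, k) means ST_ij precedes ST_ik

    def ready_subtasks(self, finished):
        """Unfinished subtasks whose predecessors have all finished."""
        ready = []
        for k in self.subtasks:
            if k in finished:
                continue
            predecessors = {j for (j, dst) in self.edges if dst == k}
            if predecessors <= finished:
                ready.append(k)
        return ready


# Subtask 1 must finish before subtasks 2 and 3 can start.
task = Task(tid=1, tsize=64.0, subtasks=[1, 2, 3], edges={(1, 2), (1, 3)})
first_wave = task.ready_subtasks(finished=set())
second_wave = task.ready_subtasks(finished={1})
```

Subtasks 2 and 3 become ready together, which is the kind of parallelism the node scheduler can exploit across CPU and GPU units.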
6. WORKLOAD PARALLELIZATION IN HETEROGENEOUS CLUSTER
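Design Overview (largely the same as Section 5)
- Stage 1: the CS uses LDWP to assign the task at the head of the task queue to one of the $M$ worker nodes
  - a DQN model is used for cluster-level scheduling
  - DQN input: the current environment observation vector, which includes the resource features of the computing platform and the task features
  - DQN output: the expected value of every scheduling action; the action with the largest value is executed
- Stage 2: the NS assigns the subtask at the head of the ready queue to one of the $N$ computing units. If it is assigned to a GPU, it is placed in the priority queue, and IAWP then selects a suitable subtask to execute.
  - an NCF model estimates the speedup and the performance metric values of each subtask
  - speedup: the relative performance gain of running on the GPU compared with running on the CPU
  - NCF input: subtask vector and metric vector
  - NCF output: the predicted value of one performance metric for one subtask
  - IAWP first hands subtasks to computing units in order of speedup so that some of them start running, then computes the similarity between the running subtasks and the ready subtasks to decide which subtasks run in parallel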
Task Scheduling in Heterogeneous Cluster
- DQN Based LDWP Scheduling Method
  - Action space: which worker node runs the task, $A_{CS}=\{S_1,...,S_M\}$
  - State space:
    - the observation vector $x_t$ combines the task feature vector $T_i$ with the state vectors $S_j(t)$ of all nodes, i.e., $x_t=(TID,TSize,Sr_1^c(t),Sr_1^g(t),Sr_1^{cm}(t),Sr_1^{gm}(t),Sr_1^{tx}(t),Sr_1^{rx}(t),...,Sr_M^c(t),Sr_M^g(t),Sr_M^{cm}(t),Sr_M^{gm}(t),Sr_M^{tx}(t),Sr_M^{rx}(t))$
    - the state $s_t$ is the sequence of all observations and actions up to $t$, $s_t=x_1,a_1,x_2,a_2,...,x_{t-1},a_{t-1},x_t$; this is the input of the DQN
  - Reward: taking action $a_t$ in the current state $s_t$ yields the new state $s_{t+1}$ and the reward $r_t$, where $r_t$ is the throughput at $t+1$ minus the throughput at $t$
  - Experience tuple $e_t=(s_t,a_t,r_t,s_{t+1})$: records the information of one trial step
  - Replay memory $D_t=\{e_1,...,e_t\}$: stores the experiences for replay
  - Experiences $e_t$ are sampled at random in mini-batches, which reduces the variance of model updates and makes learning more stable
  - To improve stability further, a separate target network is split off in the Q-learning update to generate the target Q value $y$. The target network and the evaluation network have the same structure, and the target network copies the evaluation network's parameters only once every $\mu$ iterations. This adds a delay between the moment a Q value is updated and the moment that update affects the target value, which reduces divergence and oscillation.
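  - Algorithm 1: LDWP, $O(PM)$ (explained with a game analogy)
    - Example setting: $E=10$, $T=6$, $\mu=2$, $\varepsilon=0.3$; the Q function is the method for estimating damage. $O(1)$
    - Play the game for at most 10 attempts (episodes) to find the play style with the highest damage
      - Each game allows 6 skill casts, i.e., $T$ tasks are assigned to nodes
        - Choose which skill to cast so that damage is maximized
          - with probability 30%, pick a random skill, i.e., assign the task to a random node. $O(1)$
          - with probability 70%, pick the skill with the highest damage, using the Q function to estimate the damage of every skill. $O(PM)$
        - Cast the skill $a_t$ and observe the enemy situation $x_{t+1}$ and the damage $r_t$. $O(1)$
        - Record the effect of this cast. $O(1)$
        - Randomly pick a past cast. $O(1)$
        - If this is the last play, skip correcting and updating the Q function. $O(1)$
        - Judge from experience whether the move made in that situation was right. $O(PM)$
          - if not, correct it. $O(1)$
          - use the damage gained after the correction to refine how the Q function estimates damage
        - Every 2 casts, refresh our experience by reviewing Q, i.e., copy the evaluation network's parameters to the target network ("review the old to learn the new"). $O(PM)$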
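  - Discussion:
    - Pre-training: initialize the parameters randomly, train with the MLP network until convergence, and use the resulting parameters as the initial parameters of the DQN model
    - Scalability: use transfer learning to learn the DQN model for the new environment from the hidden-layer parameters of the previous DQN model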
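The bookkeeping around the LDWP loop described above can be sketched in a few lines of pure Python: building the observation vector $x_t$, storing experience tuples in a bounded replay memory, and copying parameters to the target network every $\mu$ steps. All names here (`build_observation`, `ReplayMemory`, `maybe_sync_target`, the dictionary keys) are hypothetical, and the "networks" are plain dictionaries standing in for real model parameters.

```python
import random
from collections import deque


def build_observation(task, node_states):
    """x_t = (TID, TSize, Sr_1^c, ..., Sr_M^rx): task features, then every node state."""
    observation = [task["tid"], task["tsize"]]
    for state in node_states:
        observation.extend(state)
    return observation


class ReplayMemory:
    """D_t = {e_1, ..., e_t}, bounded so that old experiences are dropped."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniform random mini-batch, at most as large as the memory."""
        items = list(self.buffer)
        return rng.sample(items, min(batch_size, len(items)))


def maybe_sync_target(step, mu, eval_params, target_params):
    """Copy the evaluation parameters into the target network every mu steps."""
    if step % mu == 0:
        target_params.clear()
        target_params.update(eval_params)
        return True
    return False


# Two worker nodes, each described by the six-element node state vector.
nodes = [(0.2, 0.5, 0.3, 0.4, 10.0, 12.0), (0.7, 0.1, 0.6, 0.2, 5.0, 4.0)]
x_t = build_observation({"tid": 3, "tsize": 128.0}, nodes)

memory = ReplayMemory(capacity=100)
memory.push(x_t, 1, 0.5, x_t)
batch = memory.sample(batch_size=20)

eval_params, target_params = {"w": [1.0, 2.0]}, {"w": [0.0, 0.0]}
synced = [maybe_sync_target(step, 2, eval_params, target_params) for step in (1, 2)]
```

With $M$ worker nodes the observation has $2+6M$ entries, so adding or removing nodes changes the input size of the DQN, which is one reason the notes mention transfer learning when the cluster changes.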
-
Task Scheduling in CPU-GPU Node
  - Modeling
    - $ST_{ij}=(STr_{ij}^{C},STr_{ij}^{Mem},STr_{ij}^{SM},STr_{ij}^{L1},STr_{ij}^{L2},STr_{ij}^{Tex},STr_{ij}^{Dram},STr_{ij}^{Tpl},STr_{ij}^{Tps},TSize_i)$ describes the resource cost of each subtask

| Variable | Meaning |
|---|---|
| $STr_{ij}^{C}$ | CPU usage |
| $STr_{ij}^{Mem}$ | Host memory usage |
| $STr_{ij}^{SM}$ | GPU Streaming Multiprocessor (SM) usage |
| $STr_{ij}^{L1}$ | GPU L1 cache usage |
| $STr_{ij}^{L2}$ | GPU L2 cache usage |
| $STr_{ij}^{Tex}$ | GPU texture cache usage |
| $STr_{ij}^{Dram}$ | GPU memory usage |
| $STr_{ij}^{Tpl}$ | Global load throughput of GPU |
| $STr_{ij}^{Tps}$ | Global store throughput of GPU |

    - SubTask Description Matrix (STDM): each row is the performance metric vector of one subtask, plus one extra column holding the estimated speedup of that subtask, i.e., $ST'_{ij}=(ST_{ij},spd)$
    - For every arriving subtask, two of the metrics in $ST'_{ij}$ are picked at random and measured: CPU metrics with the Linux proc file system, GPU metrics with the NVIDIA profiler tool
    - To decide how long the profiling time should be, 16 subtasks were first run to completion to obtain all their performance metric values. Each metric was then measured with runs of 2 s, 3 s, 4 s, 5 s and 6 s, and the accuracy was computed for each case. Accuracy reaches 92.3% at 4 s and grows slowly after that, so the profiling time is set to 4 s.
    - One CPU and one GPU are set aside as profilers that measure the two metrics of every arriving subtask; the resulting vector is stitched into the STDM. When no subtask is arriving, they can also execute subtasks.
  - NCF-based metric prediction
    - Input layer (the paper is not explicit here, so part of this is my guess): the type of a subtask ($\pmb x_s$, which I take to be the subtask ID, an $M\times 1$ vector for $M$ subtasks) and a performance metric ($\pmb x_m$, which I take to be the metric index, an $N\times 1$ vector for $N$ metrics), both one-hot encoded into sparse binary vectors
    - Embedding layer: maps the sparse vectors to dense vectors called "embedding vectors", which are treated as the latent vectors of a latent factor model
      - $\pmb U^T\pmb x_s=\pmb u_s$, the subtask latent vector
      - $\pmb V^T\pmb x_m=\pmb v_m$, the metric latent vector
      - $\pmb U \in \mathbb R^{M\times K}$: latent factor matrix for subtasks, where $M$ is the number of subtasks
      - $\pmb V \in \mathbb R^{N\times K}$: latent factor matrix for metrics, where $N$ is the number of metrics
    - MLP recurrence: $\pmb a_1 = \ldots$ (the rest of this equation block is truncated in the source)
      - $\pmb W_x$ are the weight matrices, $b_x$ the bias vectors, $g_x$ the activation functions (ReLU, which suits sparse vectors and helps avoid overfitting), and $\pmb h$ the edge weights of the output layer
    - Loss function (again optimized with SGD): $J=\frac{1}{2MN}\left[\sum_{s=1}^M\sum_{m=1}^N(\hat y_{sm}-y_{sm})^2+\lambda \sum_{w_{sm}}w_{sm}^2\right]$
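The loss $J$ above is a squared error over the $M\times N$ subtask-by-metric matrix plus an L2 penalty on the weights. A direct, unoptimized Python transcription might look like the following; the function name `ncf_loss` and the flat `weights` list are assumptions for illustration, and the numbers are toy values.

```python
def ncf_loss(y_hat, y, weights, lam):
    """J = 1/(2MN) * [ sum_s sum_m (y_hat_sm - y_sm)^2 + lam * sum_w w^2 ]."""
    m, n = len(y), len(y[0])
    squared_error = sum((y_hat[s][k] - y[s][k]) ** 2
                        for s in range(m) for k in range(n))
    l2_penalty = lam * sum(w * w for w in weights)
    return (squared_error + l2_penalty) / (2 * m * n)


predicted = [[1.0, 2.0], [3.0, 4.0]]   # y_hat: 2 subtasks x 2 metrics
measured = [[1.0, 1.0], [3.0, 2.0]]    # y: profiled values
j = ncf_loss(predicted, measured, weights=[1.0, 2.0], lam=0.2)
```

Here the squared error is 5, the penalty is $0.2\times 5=1$, and $J=(5+1)/(2\cdot 2\cdot 2)=0.75$.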
  - NCF Based IAWP Scheduling Method
    - whenever a CPU core is idle, the subtask with the smallest estimated speedup is assigned to it
    - the GPU subtask priority queue has a fixed length; while it is not full, the subtask with the largest estimated speedup is enqueued
    - interference detection is done inside the GPU and considers only GPU-related metrics, i.e., $G_i=(STr_{ij}^{SM},STr_{ij}^{L1},STr_{ij}^{L2},STr_{ij}^{Tex},STr_{ij}^{Dram},STr_{ij}^{Tpl},STr_{ij}^{Tps},TSize_i)$. Similarity is the cosine: $\cos(\theta)=\frac{\pmb G_i\cdot\pmb G_j}{\|\pmb G_i\|\,\|\pmb G_j\|}=\frac{\sum_{k=1}^nG_{ik}\times G_{jk}}{\sqrt{\sum_{k=1}^nG_{ik}^2}\sqrt{\sum_{k=1}^nG_{jk}^2}}$. A larger $\cos(\theta)$ means a smaller $\theta$ and a higher similarity.
    - for each subtask in the priority queue, compute its similarity to every running subtask and take the reciprocal of the average as its priority
    - if the utilization of the GPU and of the device memory is below the given threshold, the NS assigns the highest-priority subtask to the GPU
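  - Algorithm 2: IAWP, $O(n+nG)$ ($=O(n)$)
    1. Initialize the STDM and the random weights $\pmb W$. $O(n)$
    2. Define the two feature vectors $\pmb x_s,\pmb x_m$ as the input of the NCF
    3. Obtain the latent vectors $\pmb u_s,\pmb v_m$ from the embedding layer. $O(n)$
    4. Use the MLP to predict the missing values of the STDM, $\hat y_{sm}=\phi_{out}(\phi_X(...\phi_2(\phi_1(\pmb u_s,\pmb v_m))...))$. $O(n)$?
    5. Perform one gradient descent step on the network weights $\pmb W$ based on the loss function, and fill in the STDM. $O(n)$?
    6. If a CPU is idle, assign it the subtask with the smallest speedup (a single subtask is enough to keep the CPU busy). $O(n)$
    7. While the GPU priority queue is not full, keep adding the subtask with the largest speedup until the queue is full. $O(n)$?
    8. For each subtask in the priority queue (the node has $G$ GPUs in total): $O(nG)$
       - 8.1 extract the subtask's GPU-related performance metric vector $\pmb G_i$ from the STDM
       - 8.2 compute the similarity between this subtask and all currently running subtasks
       - 8.3 derive the priority of this subtask from the similarity
    9. If the utilization of the GPU and of its memory is below the threshold, assign the highest-priority subtask to it. $O(1)$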
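  - Discussion:
    - As with the DQN, the NCF is first pre-trained with an MLP until convergence to obtain its initial parameters; the difference is that pre-training uses the Adaptive Moment Estimation (Adam) algorithm
      - Adam computes adaptive learning rates for different parameters
      - it converges faster than normal SGD
      - the pre-trained parameters are used as the initial values of the NCF, which is then optimized with SGD, because Adam has to store momentum information to update parameters correctly, which does not suit the parameter updates in the NCF
    - As with the DQN, a parameter-transfer method is used (it still applies directly even when the tasks differ, because the parameters capture a general regularity among resources, while the resource types and resource features do not change)
      - if a newly added node has the same number of GPUs as the existing ones, the NCF model can be applied to it directly
      - if it has more GPUs, the performance metric parameters are copied for each GPU…
      - the number of GPUs in a node is assumed to be a power of two (which makes parameter transfer easier)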
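The dispatch rules of the IAWP method described above (smallest speedup to an idle CPU, reciprocal of the mean cosine similarity as GPU priority, and a utilization threshold) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, one shared `threshold` is used for both GPU and device memory utilization, metric vectors are assumed to be non-zero, and a non-positive mean similarity is treated as "no interference" and given the highest priority.

```python
import math


def cosine_similarity(g_i, g_j):
    """cos(theta) between two GPU metric vectors G_i and G_j (both non-zero)."""
    dot = sum(a * b for a, b in zip(g_i, g_j))
    norm_i = math.sqrt(sum(a * a for a in g_i))
    norm_j = math.sqrt(sum(b * b for b in g_j))
    return dot / (norm_i * norm_j)


def priority(candidate, running):
    """Reciprocal of the mean similarity to all running subtasks."""
    if not running:
        return math.inf          # assumption: an empty GPU cannot interfere
    mean_sim = sum(cosine_similarity(candidate, r) for r in running) / len(running)
    if mean_sim <= 0:
        return math.inf          # assumption: dissimilar subtasks go first
    return 1.0 / mean_sim


def pick_for_cpu(speedups):
    """An idle CPU core takes the subtask with the smallest estimated speedup."""
    return min(speedups, key=speedups.get)


def pick_for_gpu(queue, running, gpu_util, mem_util, threshold):
    """Dispatch the highest-priority queued subtask if the GPU has headroom."""
    if not queue or gpu_util >= threshold or mem_util >= threshold:
        return None
    return max(queue, key=lambda name: priority(queue[name], running))


speedups = {"a": 4.0, "b": 2.5, "c": 1.2}
queue = {"a": [1.0, 0.0], "b": [0.0, 1.0]}   # toy two-metric vectors
running = [[1.0, 0.0]]
cpu_choice = pick_for_cpu(speedups)
gpu_choice = pick_for_gpu(queue, running, gpu_util=0.3, mem_util=0.4, threshold=0.8)
```

In this toy case subtask `"b"` uses a different resource than the running subtask, so it is dispatched ahead of `"a"`, which would contend for the same resource.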
7. Experiment
Experiment Setup
- 12 physical server nodes: 1 serves as cluster manager and cluster scheduler, the other 11 are worker nodes; all are connected through a Gigabit switch

| Type | CPU Num | Mem | GPU Mem | GPU Num | Num |
|---|---|---|---|---|---|
| 1 | 2 | 32GB | 24GB | 2 | 2 |
| 2 | 2 | 32GB | 8GB | 4 | 4 |
| 3 | 2 | 16GB | 24GB | 4 | 4 |
| 4 | 2 | 8GB | 8GB | 2 | 2 |

- CPU: Intel® Xeon® CPU E5-2620 v3 @ 2.40 GHz with 16 cores
- GPU: Tesla P4 or Tesla K80
- Docker: 1.11.2-cs3
- Kubernetes: version 1.6.0, used to manage the container platform
- Video processing tasks (some subtasks have both a CPU and a GPU implementation to make scheduling easier):

| Application | Subtask Type Num | CPU version pipeline operations | GPU version pipeline operations |
|---|---|---|---|
| Intrusion Detection | 6 | GMM background modeling, Frame differencing, Gaussian pyramid, DOG pyramid, Scale space extrema detection, Tracking | GMM background modeling, Frame differencing, Gaussian pyramid, DOG pyramid, Scale space extrema detection |
| Video Synopsis | 5 | GMM background modeling, Frame differencing, Tube extraction, Tube optimization, Stitching | GMM background modeling, Frame differencing, Tube extraction |
| Face Recognition | 7 | Load image file, Face locations, Face landmarks, Face encodings, Face distance, Compare faces | Batch face locations, Face landmarks, Face encodings, Face distance, Compare faces |

- DQN: multi-layer perceptron network with two hidden layers (256 and 128 neurons), ReLU activation on every neuron, optimized with SGD, batch_size = 20, learning rate $lr=0.1$, discount rate $\gamma=0.8$. A small learning rate avoids divergence, and a large discount rate takes longer-term returns into account.
- NCF: multilayer perceptron network with three hidden layers (128, 64 and 32 neurons)
Convergence
- Convergence evaluation experiments: 3000 randomly selected tasks (1000 video synopsis tasks, 1000 intrusion detection tasks, 1000 face recognition tasks), task arrival rate = 50 tasks/min
- DQN with and without pre-training
- NCF with and without pre-training
Performance Analysis
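- Three baseline methods:
  - random selection method
  - Expected Time to Compute (ETC) method: estimates the Minimum Completion Time (MCT) of a task on every computing unit and assigns the task to the unit with the smallest MCT
  - PATS method: schedules based on the estimated relative performance of a task on the CPU versus the GPU and on the computational loads
- 1500 tasks, each chosen at random from the set {500 video synopsis tasks, 500 intrusion detection tasks, 500 face recognition tasks}, task arrival rate = 50 tasks/min
- The throughput of the distributed computing system improves by 26.9% on average
- Impact of Cluster Node Number
  - limited by the arrival rate, the cluster saturates at 11 nodes: cluster throughput no longer rises noticeably, but queueing time decreases
  - throughput on the first node as the number of nodes changes: after saturation its throughput decreases (load balancing)
- Performance With Different Task Arrival Rate
  - still 1500 tasks in total, submitted at different arrival rates (10 to 60 tasks/min) on 11 worker nodes
- Evaluation of IAWP Method in Terms of GPU Utilization
  - measured from the point where the system is basically stable; GPU utilization improves by 14.7% on average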
- Evaluation of IAWP Method in Terms of Load Balancing Degree
  - resource utilization represents the load balancing degree of a node, and the standard deviation of these values represents the load balancing of the cluster; the smaller the standard deviation, the better
  - load balancing degree of node $M_i$: $Load_i=(U^{CPU}+U^{hMem}+U^{GPU}+U^{dMem})/4$, i.e., the mean of CPU utilization, host memory utilization, GPU utilization and GPU memory utilization
  - average load balancing degree: $Load_{avg}=\sum_{i=1}^MLoad_i/M$
  - standard deviation of the load balancing degree: $Load_{std}=\sqrt{\sum_{i=1}^M(Load_i-Load_{avg})^2/M}$
  - load balancing degree of the cluster: $Degree_{LoadBalancing}=1-Load_{std}$
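The load balancing metric above is easy to check by hand. The following sketch computes $Load_i$, $Load_{std}$ and $Degree_{LoadBalancing}$ for a toy two-node cluster; the function names and the utilization numbers are made up for illustration and are not taken from the paper's experiments.

```python
import math


def node_load(cpu, host_mem, gpu, dev_mem):
    """Load_i: mean of the four utilization values of one node."""
    return (cpu + host_mem + gpu + dev_mem) / 4


def load_balancing_degree(loads):
    """Degree = 1 - Load_std, with the population standard deviation over M nodes."""
    avg = sum(loads) / len(loads)
    std = math.sqrt(sum((x - avg) ** 2 for x in loads) / len(loads))
    return 1 - std


loads = [node_load(0.5, 0.5, 0.5, 0.5), node_load(1.0, 1.0, 1.0, 1.0)]
degree = load_balancing_degree(loads)
```

With loads of 0.5 and 1.0 the average is 0.75, the standard deviation is 0.25, and the degree is 0.75; a perfectly balanced cluster would score 1.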
- Evaluation of LDWP and IAWP Method in Terms of Speedup
  - Speedup: processing time of the Random method divided by the processing time of the other three methods
- Performance With Cluster Changing
  - tasks are chosen at random and submitted; the cluster starts with 8 nodes, gains 3 nodes at t1 and loses 2 nodes at t2, and stabilizes after about 12 minutes
8. Related Work
Task Scheduling for Hybrid Architectures
- Do not consider the runtime system state or task interference:
  - stochastic local search method
  - Expected Time to Compute (ETC)
  - online reinforcement learning based task scheduler
  - predictive power-aware scheduling algorithm
  - exploring the relationship between task scheduling algorithms and energy constraints
- Interference-driven cluster management framework: detects tasks that suffer excessive interference and restarts them on different nodes
- Mystic: predicts interference with Collaborative Filtering (CF); does not consider CPU-GPU heterogeneity
Task Scheduling in Heterogeneous Cluster
- heuristic energy-aware stochastic task scheduling strategy
- reliability maximization method with energy constraint
- Dynamic Voltage and Frequency Scaling (DVFS); does not consider task interference
- Do not consider node-level scheduling:
  - resource-aware hybrid scheduling algorithm: hierarchical clustering of the available resources into groups
  - stochastic dynamic level scheduling algorithm: handles task dependency, time randomness, and processor heterogeneity at the same time
  - DRS: Dependency-aware and resource-efficient scheduling; uses mutual reinforcement learning to reduce response time
- DRL-cloud: low energy cost, low reject rate, low runtime, fast convergence
- Learning driven parallelization for large-scale video workload in hybrid CPU-GPU cluster: an earlier paper by the same authors that is very similar to this one; the present paper additionally considers multi-tasking parallelism and interference avoidance on the GPU
9. Conclusion
- Issue: efficient workload parallelization in a CPU-GPU heterogeneous cluster, and the interference of fine-grained tasks on the GPU
- Proposal: a two-stage task scheduling approach for streaming applications based on deep reinforcement learning and neural collaborative filtering
- Steps:
  - the cluster-level scheduler assigns each task to the appropriate worker node according to the runtime system status and the characteristics of each task
  - the node-level scheduler distributes each subtask to the appropriate computing unit according to the estimated speedup
  - subtasks assigned to a GPU first enter a priority queue; the degree of interference is assessed from the similarity between the queued subtasks and the running subtasks, which gives each queued subtask a priority that drives scheduling
- Features:
  - Initial parameters: pre-training
  - Scalability: a transfer learning based generalization strategy that quickly rebuilds an effective scheduling model when the computing power of the cluster changes
10. Assessment
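- The figures and tables are unclear
- There are some spelling and grammar errors
- The NCF part could be improved: other recommendation algorithms, improved SGD variants, improved loss functions
- The similarity computed in IAWP may be negative, and its reciprocal is then still negative, so it cannot be used directly as a priority. In fact, a negative similarity should get a higher priority than a positive one.
- Generalization ability:
  - allocation among multiple GPUs within the same node (GPU priority) is not considered
  - the fact that GPUs and CPUs come in different types is not considered; one CPU and one GPU stand in for all CPUs and GPUs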