Multi-task Learning

Based on Supervised Learning Lecture 8

Multi-task learning

  • Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation.
  • The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
  • Typical scenario: many tasks but only a few examples per task. If n < d we do not have enough data to learn the tasks one by one. However, if the tasks are related and the set S or the associated regulariser captures such relationships in a simple way, learning the tasks jointly can greatly improve over independent task learning (ITL).
  • When problems (tasks) are closely related, learning in parallel can be more efficient than learning tasks independently. Also, this often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks.
  • Applications: learning a set of linear classifiers for related objects (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech…
  • Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
  • The ideas can be extended to non-linear cases through reproducing kernel Hilbert spaces (RKHS).

Mathematical formulation

  • Fix probability measures $\mu_1,\dots,\mu_T$ on $\mathbb{R}^d\times\mathbb{R}$
    – $T$ tasks
    – Each task is a probability measure, e.g. $\mu_t(x,y)=P(x)\,\delta(\langle w_t,x\rangle-y)$, where $\delta$ is interpreted as a deterministic conditional probability of $y$ given $x$, and $w_t$ is an underlying parameter vector
    – $\mathbb{R}^d$ can also be a Hilbert space
  • Draw data: $(x_{t1},y_{t1}),\dots,(x_{tn},y_{tn})\sim\mu_t$ for $t=1,\dots,T$, where each $x_{ti}$ is a vector and each $y_{ti}$ is a scalar (in practice $n$ may vary with $t$)
  • Learning method:

    $$\min_{(f_1,\dots,f_T)\in\mathcal{F}}\ \frac{1}{T}\sum_{t=1}^T\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},f_t(x_{ti})\big)$$


    where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS, which models interactions between the tasks in the sense that functions with small norm have strongly related components.
  • The goal is to minimise the multi-task error (a minimal numerical sketch of the empirical objective follows this list)

    $$\mathcal{R}(f_1,\dots,f_T)=\frac{1}{T}\sum_{t=1}^T\mathbb{E}_{(x,y)\sim\mu_t}\,\ell\big(y,f_t(x)\big)$$
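As a concrete illustration of the empirical objective above, here is a minimal NumPy sketch (all names and data are synthetic and chosen only for illustration; it assumes the squared loss and linear per-task predictors $f_t(x)=\langle w_t,x\rangle$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, d = 5, 20, 10                       # tasks, examples per task, input dimension

X = rng.standard_normal((T, n, d))        # X[t] holds the n inputs drawn for task t
W_true = rng.standard_normal((d, T))      # column t plays the role of the underlying w_t
y = np.einsum('tnd,dt->tn', X, W_true) + 0.1 * rng.standard_normal((T, n))

def multitask_empirical_risk(W, X, y):
    """(1/T) sum_t (1/n) sum_i loss(y_ti, <w_t, x_ti>) with the squared loss."""
    preds = np.einsum('tnd,dt->tn', X, W)
    return np.mean((y - preds) ** 2)      # mean over both tasks and examples

print(multitask_empirical_risk(W_true, X, y))            # close to the noise level 0.01
print(multitask_empirical_risk(np.zeros((d, T)), X, y))  # a poor joint predictor
```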

Linear MTL

  • “task” = “linear model”
    – Regression: $y_{ti}=\langle w_t,x_{ti}\rangle+\epsilon_{ti}$
    – Binary classification: $y_{ti}=\mathrm{sign}(\langle w_t,x_{ti}\rangle)\,\epsilon_{ti}$
  • Learning method:

    $$\min_{(w_1,\dots,w_T)\in S}\ \frac{1}{T}\sum_{t=1}^T\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle w_t,x_{ti}\rangle\big)$$

    Here, $S$ incorporates the prior knowledge about the regression vectors and encourages “common structure” among the tasks, e.g. the ball of a matrix norm or another regulariser.
  • The multi-task error of $W=[w_1,\dots,w_T]$ is

    $$\mathcal{R}(W)=\frac{1}{T}\sum_{t=1}^T\mathbb{E}_{(x,y)\sim\mu_t}\,\ell\big(y,\langle w_t,x\rangle\big)$$
  • It is possible to give bounds on the uniform deviation

    $$\sup_{W\in S}\left\{\mathcal{R}(W)-\frac{1}{T}\sum_{t=1}^T\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle w_t,x_{ti}\rangle\big)\right\}$$

    and to derive bounds on the excess error

    $$\mathcal{R}(\hat W)-\min_{W\in S}\mathcal{R}(W)$$

    A small synthetic sketch of this linear setup follows.
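The sketch below (NumPy; the data-generating choices, e.g. tasks obtained as perturbations of a common vector, are assumptions made for illustration) sets up the regime of many related tasks with few examples each (n < d) and fits independent ridge regression (ITL) as the baseline that the joint methods in the next sections aim to improve on:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, d = 10, 15, 30                       # few examples per task: n < d

w0 = rng.standard_normal(d)                # common structure shared by all tasks
W_true = w0[:, None] + 0.2 * rng.standard_normal((d, T))   # w_t = w0 + small task-specific part

X = rng.standard_normal((T, n, d))
y = np.einsum('tnd,dt->tn', X, W_true) + 0.1 * rng.standard_normal((T, n))

def ridge_itl(X, y, lam=1.0):
    """Independent task learning: per-task ridge regression, ignoring task relatedness."""
    d = X.shape[2]
    W = np.zeros((d, X.shape[0]))
    for t in range(X.shape[0]):
        A = X[t].T @ X[t] / X[t].shape[0] + lam * np.eye(d)
        b = X[t].T @ y[t] / X[t].shape[0]
        W[:, t] = np.linalg.solve(A, b)
    return W

W_itl = ridge_itl(X, y)
print(np.mean((W_itl - W_true) ** 2))      # estimation error of independent learning
```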

Regularisers for linear MTL

Often we drop the constraint (i.e. $W\in S$) and consider the penalty method

$$\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle w_t,x_{ti}\rangle\big)+\lambda\,\Omega(w_1,\dots,w_T)$$

Different regularisers encourage different types of commonality between the tasks (a numerical sketch of the three penalties follows the list):

  • Variance (or other convex quadratic regularisers) encourages closeness to the mean:
    $$\Omega_{\mathrm{var}}=\frac{1}{T}\sum_{t=1}^T\|w_t\|^2+\frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)$$
  • Joint sparsity (or other structured sparsity regularisers) encourages few shared variables:
    $$\|W\|_{2,1}:=\sum_{j=1}^d\sqrt{\sum_{t=1}^T w_{tj}^2}$$
  • Trace norm (or other spectral regularisers which promote low-rank solutions) encourages few shared features:
    $$\|[w_1,\dots,w_T]\|_{\mathrm{tr}}$$

    – an extension of joint sparsity in which the initial data representation may be rotated, so the shared variables become learned features
    – the $\ell_1$ norm of the singular values of the matrix is penalised, so low-rank solutions are favoured (i.e. a common low-dimensional subspace)
  • More sophisticated regularisers which combine the above, promote clustering of tasks, etc.
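The three penalties can be evaluated directly from the stacked matrix $W=[w_1,\dots,w_T]$; a minimal NumPy sketch (the matrix and the value of $\gamma$ here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, gamma = 8, 5, 0.5
W = rng.standard_normal((d, T))            # column t is w_t

# Variance regulariser: (1/T) sum_t ||w_t||^2 + ((1-gamma)/gamma) * Var(w_1,...,w_T)
w_bar = W.mean(axis=1, keepdims=True)
omega_var = np.mean(np.sum(W ** 2, axis=0)) \
    + (1 - gamma) / gamma * np.mean(np.sum((W - w_bar) ** 2, axis=0))

# Joint sparsity (l_{2,1} norm): sum over rows j of the l2 norm across tasks
omega_21 = np.sum(np.linalg.norm(W, axis=1))

# Trace (nuclear) norm: l1 norm of the singular values, promotes low rank
omega_tr = np.sum(np.linalg.svd(W, compute_uv=False))

print(omega_var, omega_21, omega_tr)
```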

Quadratic regulariser

  • General quadratic regulariser
    $$\Omega(w_1,\dots,w_T)=\sum_{s,t=1}^T\langle w_s,E_{st}w_t\rangle$$

    where the matrix $E=(E_{st})_{s,t=1}^T\in\mathbb{R}^{dT\times dT}$ is positive definite.
  • Variance regulariser
    Let $\gamma\in[0,1]$ and
    $$\Omega_{\mathrm{var}}=\frac{1}{T}\sum_{t=1}^T\|w_t\|^2+\frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)=\frac{1}{T}\sum_{t=1}^T\|w_t\|^2+\frac{1-\gamma}{\gamma}\cdot\frac{1}{T}\sum_{t=1}^T\|w_t-\bar w\|_2^2$$

    where $\bar w=\frac{1}{T}\sum_{t=1}^T w_t$.
    – $\gamma=1$: independent tasks; $\gamma=0$: identical tasks
    – the regulariser favours weight vectors which are close to their mean
    – with an SVM and the hinge loss, the objective is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the tasks close to each other)
  • Link to the kernel methods (quadratic regulariser)
    The problem
    $$\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle w_t,x_{ti}\rangle\big)+\lambda\sum_{s,t=1}^T\langle w_s,E_{st}w_t\rangle$$

    is equivalent to

    $$\min_{v}\ \frac{1}{T}\sum_{t=1}^T\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle v,B_t x_{ti}\rangle\big)+\lambda\langle v,v\rangle \qquad (1)$$

    where the $B_t$ are $p\times d$ matrices (typically $p\geq d$) linked to $E$ by $E=(B^{\top}B)^{-1}$, with $B=[B_1,\dots,B_T]\in\mathbb{R}^{p\times dT}$ (concatenation by columns), and $w_t=B_t^{\top}v$.
    Interpretation:
    – We learn a single function $(x,t)\mapsto f_t(x)$ using the feature map $(x,t)\mapsto B_t x$ and the corresponding multi-task kernel $K\big((x_1,t_1),(x_2,t_2)\big)=\langle B_{t_1}x_1,B_{t_2}x_2\rangle$
    – Writing $\langle v,B_t x\rangle=\langle B_t^{\top}v,x\rangle$, we interpret this as having a single regression vector $v$ which is transformed by the matrix $B_t^{\top}$ to obtain the task-specific weight vector.
  • Link to the kernel methods (variance regulariser)
    The problem
    $$\min_{w_1,\dots,w_T}\ \frac{1}{Tn}\sum_{t,i}\ell\big(y_{ti},\langle w_t,x_{ti}\rangle\big)+\lambda\left(\frac{1}{T}\sum_{t=1}^T\|w_t\|^2+\frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)\right)$$

    is equivalent to
    $$\min_{w_0,u_1,\dots,u_T}\ \frac{1}{Tn}\sum_{t,i}\ell\big(y_{ti},\langle w_0+u_t,x_{ti}\rangle\big)+\lambda\left(\frac{1}{\gamma T}\sum_{t=1}^T\|u_t\|^2+\frac{1}{1-\gamma}\|w_0\|^2\right) \qquad (2)$$

    obtained by setting $w_t=w_0+u_t$; minimising over $w_0$ recovers the variance regulariser.
    It is of the form (1) with
    $$v=\Big((1-\gamma)^{-1/2}\,w_0,\ (\gamma T)^{-1/2}\,u_1,\dots,(\gamma T)^{-1/2}\,u_T\Big),\qquad
    B_t^{\top}=\Big[\sqrt{1-\gamma}\,I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{t-1},\ \sqrt{\gamma T}\,I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{T-t}\Big]$$

    so that $B_t$ has dimension $(T+1)d\times d$, and the corresponding kernel is
    $$K\big((x_1,t_1),(x_2,t_2)\big)=\big(1-\gamma+\gamma T\,\delta_{t_1t_2}\big)\,\langle x_1,x_2\rangle$$

    (this identity is checked numerically in the sketch after this list).
    By writing (2) in the following form, it becomes more apparent that we regularise around a common vector $w_0$:
    $$\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T\min_{w}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle w,x_{ti}\rangle\big)+\frac{\lambda}{\gamma}\|w-w_0\|^2\right\}+\frac{\lambda}{1-\gamma}\|w_0\|^2$$
  • More multitask kernels
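Below is a minimal NumPy sketch of the feature maps $B_t$ from the variance-regulariser case and a numerical check of the kernel identity $\langle B_{t_1}x_1,B_{t_2}x_2\rangle=(1-\gamma+\gamma T\,\delta_{t_1t_2})\langle x_1,x_2\rangle$ (the small dimensions and the value of $\gamma$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, gamma = 4, 3, 0.3

def B(t):
    """B_t stacks sqrt(1-gamma)*I on top, with sqrt(gamma*T)*I in block t+1 and zeros elsewhere."""
    blocks = [np.sqrt(1 - gamma) * np.eye(d)] + [np.zeros((d, d))] * T
    blocks[t + 1] = np.sqrt(gamma * T) * np.eye(d)
    return np.vstack(blocks)                # shape ((T+1)d, d)

def K(x1, t1, x2, t2):
    return (1 - gamma + gamma * T * (t1 == t2)) * (x1 @ x2)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
for t1 in range(T):
    for t2 in range(T):
        lhs = (B(t1) @ x1) @ (B(t2) @ x2)   # <B_{t1} x1, B_{t2} x2>
        assert np.isclose(lhs, K(x1, t1, x2, t2))
print("kernel identity verified")
```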

Structured sparsity

  • General sparsity regulariser
    $$\|W\|_{2,1}:=\sum_{j=1}^d\sqrt{\sum_{t=1}^T w_{tj}^2}$$

    – the sum of the $\ell_2$ norms of the rows of the matrix $W$
    – encourages a matrix with only a few non-zero rows
    – the regression vectors are sparse and their sparsity patterns are contained in a common set of small cardinality, i.e. the tasks share a small set of relevant variables (see the sketch after this list)
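One readily available solver for an $\ell_{2,1}$-penalised objective of this kind is scikit-learn's MultiTaskLasso, which minimises a squared loss plus $\alpha\|W\|_{2,1}$ under the simplifying assumption that all tasks share the same design matrix (slightly different from the per-task samples used above). The sketch below uses assumed synthetic data in which only the first five variables are relevant and are shared across tasks:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso  # objective: ||Y - XW||_F^2/(2n) + alpha*||W||_{2,1}

rng = np.random.default_rng(4)
n, d, T = 100, 30, 5

# Only the first 5 variables are relevant, and they are shared by all tasks
W_true = np.zeros((d, T))
W_true[:5, :] = rng.standard_normal((5, T))

X = rng.standard_normal((n, d))             # same design matrix for every task
Y = X @ W_true + 0.1 * rng.standard_normal((n, T))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
W_hat = model.coef_.T                        # coef_ has shape (T, d); transpose to (d, T)
selected = np.where(np.linalg.norm(W_hat, axis=1) > 1e-8)[0]
print("variables kept for all tasks:", selected)   # typically close to {0,...,4}
```

The zero rows of the estimated matrix correspond to variables discarded simultaneously for every task, which is exactly the joint-sparsity pattern the penalty encourages.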

Clustered MTL

Further topics

Transferring to new tasks

  • Having found a feature map $h$, to test it on the environment $\mathcal{E}$ (a probability measure over tasks) we
    1) draw a task $\mu\sim\mathcal{E}$
    2) draw a sample $z\sim\mu^n$
    3) run the algorithm to obtain $a(h)_z=\hat f_{h,z}\circ h$
    4) measure the loss of $a(h)_z$ on a random pair $(x,y)\sim\mu$
    (a Monte Carlo sketch of this procedure follows the list)
  • The error associated with the algorithm $a(h)$ is
    $$\mathcal{R}_n(h)=\mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\big[\ell\big(a(h)_z(x),y\big)\big]$$
  • The best value for a representation $h$, given complete knowledge of the environment, is then
    $$\min_{h\in\mathcal{H}}\ \mathcal{R}_n(h)$$
  • Compare to the very best we can do:

    $$\mathcal{R}^*=\min_{h\in\mathcal{H}}\ \mathbb{E}_{\mu\sim\mathcal{E}}\left[\min_{f\in\mathcal{F}}\ \mathbb{E}_{(x,y)\sim\mu}\,\ell\big(f(h(x)),y\big)\right]$$

  • The excess error associated with $h$ is then $\mathcal{R}_n(h)-\mathcal{R}^*$
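The four steps can be turned into a plain Monte Carlo estimate of $\mathcal{R}_n(h)$. The sketch below makes several simplifying assumptions not in the notes: a linear feature map $h(x)=Bx$, ridge regression as the algorithm $a(h)$, the squared loss, and a toy environment that draws Gaussian task vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
d, p, n, lam = 10, 4, 20, 0.1
B = rng.standard_normal((p, d))            # candidate feature map h(x) = Bx

def ridge(Phi, y, lam):
    return np.linalg.solve(Phi.T @ Phi / len(y) + lam * np.eye(Phi.shape[1]),
                           Phi.T @ y / len(y))

def estimate_Rn(B, num_tasks=200):
    """Monte Carlo estimate of R_n(h): E_task E_sample E_(x,y) loss(a(h)_z(x), y)."""
    losses = []
    for _ in range(num_tasks):
        w = rng.standard_normal(d)                     # 1) draw a task mu from the environment
        X = rng.standard_normal((n, d))                # 2) draw a sample z ~ mu^n
        y = X @ w + 0.1 * rng.standard_normal(n)
        v = ridge(X @ B.T, y, lam)                     # 3) run the algorithm on features Bx
        x_new = rng.standard_normal(d)                 # 4) loss on a fresh pair (x, y) ~ mu
        y_new = x_new @ w + 0.1 * rng.standard_normal()
        losses.append((y_new - v @ (B @ x_new)) ** 2)
    return np.mean(losses)

print(estimate_Rn(B))
```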

Case of the variance regulariser

  • Training (implemented in the sketch after this list)
    $$\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T\min_{w}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti},\langle w,x_{ti}\rangle\big)+\frac{\lambda}{\gamma}\|w-w_0\|^2\right\}+\frac{\lambda}{1-\gamma}\|w_0\|^2$$
  • Testing
    $$\min_{w}\ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i,\langle w,x_i\rangle\big)+\frac{\lambda}{\gamma}\|w-w_0\|^2$$
  • Error
    $$\mathcal{R}_n(w_0)=\mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\,\ell\big(y,\langle w_0+w_z,x\rangle\big)$$
    where $w_z$ solves the testing problem on the sample $z$
  • Best we can do
    $$\mathcal{R}^*=\min_{w_0}\ \mathbb{E}_{\mu\sim\mathcal{E}}\left[\min_{w}\ \mathbb{E}_{(x,y)\sim\mu}\,\ell\big(y,\langle w_0+w,x\rangle\big)\right]$$
  • Excess error of $w_0$: $\mathcal{R}_n(w_0)-\mathcal{R}^*$
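Here is a sketch of this training/testing recipe with the squared loss (NumPy; the synthetic environment, the alternating-minimisation training procedure, and all constants are assumptions made for illustration, not the lecture's prescribed algorithm). Training alternates between the closed-form per-task update of $w$ regularised towards $w_0$ and the closed-form update $w_0=(1-\gamma)\,\bar w$; testing solves the same per-task problem around the learned $w_0$ on a new task.

```python
import numpy as np

rng = np.random.default_rng(6)
T, n, d, lam, gamma = 20, 10, 15, 0.5, 0.5

# Toy environment: each task's weight vector is a small perturbation of a common vector
w_env = rng.standard_normal(d)
def draw_task():
    w = w_env + 0.2 * rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = X @ w + 0.1 * rng.standard_normal(n)
    return w, X, y

def ridge_towards(X, y, w0, reg):
    """argmin_w (1/n)||y - Xw||^2 + reg*||w - w0||^2 (closed form)."""
    A = X.T @ X / len(y) + reg * np.eye(X.shape[1])
    b = X.T @ y / len(y) + reg * w0
    return np.linalg.solve(A, b)

# --- Training: alternating minimisation of the joint objective behind (2) ---
train_tasks = [draw_task() for _ in range(T)]
w0 = np.zeros(d)
for _ in range(50):
    W = np.array([ridge_towards(X, y, w0, lam / gamma) for _, X, y in train_tasks])
    w0 = (1 - gamma) * W.mean(axis=0)      # closed-form minimiser over w0

# --- Testing: learn a new task regularised towards the learned w0 ---
w_new, X_new, y_new = draw_task()
w_transfer = ridge_towards(X_new, y_new, w0, lam / gamma)
w_scratch = ridge_towards(X_new, y_new, np.zeros(d), lam / gamma)

X_test = rng.standard_normal((2000, d))    # fresh inputs from the new task (noise-free targets)
print("test MSE, regularised towards learned w0:", np.mean((X_test @ (w_transfer - w_new)) ** 2))
print("test MSE, regularised towards zero      :", np.mean((X_test @ (w_scratch - w_new)) ** 2))
# regularising towards w0 typically helps when the tasks are related and n < d
```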

Informal reasoning

The feature map B learned from the training tasks can be used to learn a new task more quickly (a kind of bias learning heuristic).

  • Learn a new task by the method
    $$\min_{v}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,\langle v,Bx_i\rangle\big)+\frac{\lambda}{2}\|v\|_2^2\right\}$$

    • This gives more weight to important features. In particular, if some eigenvalues of $G=B^{\top}B$ are zero, the corresponding eigenvectors are discarded when learning a new task (see the sketch after this list).
    • In the case of a diagonal matrix $B$, some diagonal elements may be zero, which reduces the number of parameters to learn.
    • A statistical justification of a related approach, based on dictionary learning, can be given.
    • Take-home message

      • MTL objective function
      • regularisers
      • link to the kernel trick
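A minimal NumPy illustration of the eigenvalue remark above (the diagonal $B$ with zero entries is an assumed toy choice): ridge regression on the features $Bx$ cannot place weight along the discarded directions, so the induced weight vector $w=B^{\top}v$ is zero in those coordinates and the new task is effectively learned with fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, lam = 6, 40, 0.1
B = np.diag([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # feature map that keeps only 3 directions

w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

Phi = X @ B.T                                  # features B x_i for the new task
v = np.linalg.solve(Phi.T @ Phi / n + (lam / 2) * np.eye(d),
                    Phi.T @ y / n)             # minimiser of (1/n) sum loss + (lam/2)||v||^2
w = B.T @ v                                    # induced task weight vector
print(np.round(w, 3))                          # zeros in the discarded coordinates
```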
