Causal Forest Theory

Causal forest summary: tree-based estimation of heterogeneous treatment effects

Uplift model with multiple treatments

1. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

Under a binary treatment, the goal is to estimate $\tau(x)=\mathbb{E}[Y^1-Y^0 \mid X=x]$.

1.1 Asymptotic analysis

Theorem 1

  • Under certain conditions,
    $(\hat{\tau}(x)-\tau(x)) / \sqrt{\operatorname{Var}[\hat{\tau}(x)]} \Rightarrow \mathcal{N}(0,1)$

  • $\operatorname{Var}[\hat{\tau}(x)]$ can be estimated consistently with the infinitesimal jackknife, $\widehat{V}_{IJ}(x) / \operatorname{Var}[\hat{\tau}(x)] \rightarrow 1$, where
    $\widehat{V}_{IJ}(x)=\frac{n-1}{n}\left(\frac{n}{n-s}\right)^2 \sum_{i=1}^n \operatorname{Cov}_*\left[\hat{\tau}_b^*(x), N_{i b}^*\right]^2$
    The finite-sample correction factor $(n-1)n/(n-s)^2$ is valid only for subsampling without replacement.
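
As a quick illustration of how $\widehat{V}_{IJ}(x)$ is computed in practice, here is a minimal NumPy sketch (my own, not the authors' code), assuming we already have the per-tree predictions $\hat{\tau}_b^*(x)$ and the subsample-inclusion counts $N_{ib}^*$ for a forest grown by subsampling $s$ out of $n$ observations without replacement:

```python
import numpy as np

def infinitesimal_jackknife_variance(tau_b, N, s):
    """V_IJ(x) for a forest prediction at a single point x.

    tau_b : (B,) per-tree predictions tau_b^*(x)
    N     : (B, n) subsample-inclusion counts N_ib^* (0/1 when subsampling
            without replacement)
    s     : subsample size used to grow each tree
    """
    B, n = N.shape
    # Cov_*[tau_b^*(x), N_ib^*], estimated across the B trees, for every i
    cov = (N - N.mean(axis=0)).T @ (tau_b - tau_b.mean()) / B
    correction = (n - 1) / n * (n / (n - s)) ** 2
    return correction * np.sum(cov ** 2)

# toy usage with fake per-tree predictions
rng = np.random.default_rng(0)
B, n, s = 2000, 200, 100
N = np.stack([np.isin(np.arange(n), rng.choice(n, s, replace=False)).astype(float)
              for _ in range(B)])
tau_b = N @ rng.normal(size=n) / s
print(infinitesimal_jackknife_variance(tau_b, N, s))
```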

The proof proceeds in two steps:

  • First, bound the bias $\mathbb{E}[\hat{\mu}_n(x)-\mu(x)]$.


  • Then show that $\hat{\mu}_n(x)-\mathbb{E}[\hat{\mu}_n(x)]$ is asymptotically normal.

Using the Hájek projection and k-PNN (k potential nearest neighbors) arguments, first show that T is $\nu$-incremental, where the Hájek projection is
$\stackrel{\circ}{T}=\mathbb{E}[T]+\sum_{i=1}^n\left(\mathbb{E}\left[T \mid Z_i\right]-\mathbb{E}[T]\right)$


1.2 Double-Sample Trees

Algorithm

A regression tree T splits by minimizing the MSE, with leaf prediction $\hat{\mu}(x)=\frac{1}{\left|\left\{i: X_i \in L(x)\right\}\right|} \sum_{\left\{i: X_i \in L(x)\right\}} Y_i=\bar Y_L$.

$\sum_{i \in \mathcal{J}}\left(\hat{\mu}\left(X_i\right)-Y_i\right)^2=\sum_{i \in \mathcal{J}} Y_i^2-\sum_{i \in \mathcal{J}} \hat{\mu}\left(X_i\right)^2$

Since $\sum_{i \in \mathcal{J}} \hat{\mu}\left(X_i\right)=\sum_{i \in \mathcal{J}} Y_i$, minimizing the MSE is equivalent to maximizing the variance of the leaf predictions $\hat{\mu}(X_i)$.
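
A quick numerical check of this identity (a sketch on synthetic data): for a piecewise-constant fit, the residual sum of squares equals $\sum Y_i^2 - \sum \hat{\mu}(X_i)^2$, so minimizing the MSE over candidate splits is the same as maximizing $\sum \hat{\mu}(X_i)^2$, i.e. the spread of the leaf means.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=100)
Y = (X > 0.5).astype(float) + rng.normal(scale=0.3, size=100)

# piecewise-constant fit induced by the candidate split {X <= 0.5} vs {X > 0.5}
mask = X > 0.5
mu = np.where(mask, Y[mask].mean(), Y[~mask].mean())

mse = np.sum((mu - Y) ** 2)
rhs = np.sum(Y ** 2) - np.sum(mu ** 2)
print(np.isclose(mse, rhs))   # True: minimizing MSE == maximizing sum of mu^2
```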

2. Generalized Random Forests

2.1 Algorithm

1. Forest-based local estimation

Goal: given $(O_i, X_i)$, estimate $\theta(\cdot)$; for HTE estimation, $O_i=(Y_i, W_i)$.
Method: solve the local estimating equation $\mathbb{E}\left[\psi_{\theta(x), \nu(x)}\left(O_i\right) \mid X_i=x\right]=0$, where $\theta(x)$ and $\nu(x)$ are the parameter of interest and the nuisance parameter, respectively.

  • Weighting step: $\alpha_i(x)$ measures how similar $X_i$ is to $x$, using the frequency with which they fall into the same leaf as the weight:
    $\alpha_{b i}(x)=\frac{\mathbf{1}\left(\left\{X_i \in L_b(x)\right\}\right)}{\left|L_b(x)\right|}, \quad \alpha_i(x)=\frac{1}{B} \sum_{b=1}^B \alpha_{b i}(x)$
    where $L_b(x)$ is the set of training points falling in the same leaf as $x$ in tree $b$.
  • Weighted estimation:
    $(\hat{\theta}(x), \hat{\nu}(x)) \in \underset{\theta, \nu}{\operatorname{argmin}}\left\{\left\|\sum_{i=1}^n \alpha_i(x) \psi_{\theta, \nu}\left(O_i\right)\right\|_2\right\}$
    Example: to estimate $\mu(x)=\mathbb{E}\left[Y_i \mid X_i=x\right]$, take $\psi_{\mu(x)}\left(Y_i\right)=Y_i-\mu(x)$; then $\sum_{i=1}^n \frac{1}{B} \sum_b \alpha_{b i}(x)\left(Y_i-\hat{\mu}(x)\right)=0$, whose solution is $\hat{\mu}(x)=\frac{1}{B} \sum_b \hat{\mu}_b(x)$ (see the sketch after this list).
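
A minimal sketch of the weighting step and the weighted-mean example (using scikit-learn's RandomForestRegressor only to obtain leaf assignments; GRF itself grows honest trees with its own splitting rule):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 3))
Y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=500)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=10).fit(X, Y)

def forest_weights(forest, X_train, x):
    """alpha_i(x): per-tree co-occurrence frequency with x, averaged over trees."""
    train_leaves = forest.apply(X_train)             # (n, B) leaf ids of training points
    x_leaves = forest.apply(x.reshape(1, -1))[0]     # (B,) leaf ids of the target point
    same_leaf = train_leaves == x_leaves             # (n, B) indicators 1{X_i in L_b(x)}
    alpha_b = same_leaf / same_leaf.sum(axis=0)      # normalize by |L_b(x)| within each tree
    return alpha_b.mean(axis=1)                      # average over the B trees

x0 = np.array([0.5, 0.5, 0.5])
alpha = forest_weights(forest, X, x0)
mu_hat = np.sum(alpha * Y)        # solves sum_i alpha_i(x0) (Y_i - mu) = 0
print(mu_hat, np.sin(1.5))        # weighted solution vs. the true mu(x0)
```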

2. Splitting to maximize heterogeneity

For a parent node P with data $\mathcal{J}$, the parameters are estimated as
$\left(\hat{\theta}_P, \hat{\nu}_P\right)(\mathcal{J}) \in \arg \min _{\theta, \nu}\left\|\sum_{\left\{i \in \mathcal{J}: X_i \in P\right\}} \psi_{\theta, \nu}\left(O_i\right)\right\|_2$
Splitting P into two child nodes $C_1, C_2$ aims to minimize $\operatorname{err}(C_1, C_2)$:
$\operatorname{err}\left(C_1, C_2\right)=\sum_{j=1,2} \mathbb{P}\left(X \in C_j \mid X \in P\right) \mathbb{E}\left[\left(\hat{\theta}_{C_j}(\mathcal{J})-\theta(X)\right)^2 \mid X \in C_j\right]$
Under certain conditions, $\operatorname{err}\left(C_1, C_2\right)=K(P)-\mathbb{E}\left[\Delta\left(C_1, C_2\right)\right]+o\left(r^2\right)$, so splitting is equivalent to maximizing the heterogeneity between the child nodes:
$\Delta\left(C_1, C_2\right)=\frac{n_{C_1} n_{C_2}}{n_{P}^2}\left(\hat{\theta}_{C_1}(\mathcal{J})-\hat{\theta}_{C_2}(\mathcal{J})\right)^2$

3. The gradient tree algorithm

To reduce computation, a gradient-based approximation is used:
$\tilde{\theta}_C=\hat{\theta}_P-\frac{1}{\left|\left\{i: X_i \in C\right\}\right|} \sum_{\left\{i: X_i \in C\right\}} \xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)$
where $\xi$ picks out the coordinate of $\theta$ (discarding the nuisance parameters), and $A_P$ approximates $\nabla \mathbb{E}\left[\psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right) \mid X_i \in P\right]$:
$A_P=\frac{1}{\left|\left\{i: X_i \in P\right\}\right|} \sum_{\left\{i: X_i \in P\right\}} \nabla \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)$
Note: when $\psi$ is not differentiable (as in quantile regression), an appropriate substitute for the gradient is used instead.

The splitting stage therefore consists of two steps:

  • Labeling step: compute $\hat{\theta}_P, \hat{\nu}_P, A_P^{-1}$ at the parent node, together with a pseudo-outcome for every sample:
    $\rho_i=-\xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)$
  • Regression step: maximize the approximate splitting criterion
    $\tilde{\Delta}\left(C_1, C_2\right)=\sum_{j=1}^2 \frac{1}{\left|\left\{i: X_i \in C_j\right\}\right|}\left(\sum_{\left\{i: X_i \in C_j\right\}} \rho_i\right)^2$

During the regression step, the approximation error of this splitting criterion can be bounded.
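
As an illustration of the two steps (a sketch, not the grf implementation), take the mean-regression example from Section 2.1 with $\psi_\mu(Y_i)=Y_i-\mu$. Then $\nabla\psi=-1$, so $A_P=-1$, the pseudo-outcomes reduce to $\rho_i=Y_i-\bar Y_P$, and $\tilde{\Delta}$ recovers the familiar variance-reduction split of a regression tree:

```python
import numpy as np

def labeling_step(Y_parent):
    """Mean-regression case: theta_P = mean(Y), A_P = -1, so rho_i = Y_i - mean(Y)."""
    theta_P = Y_parent.mean()
    return theta_P, Y_parent - theta_P

def proxy_gain(rho, left_mask):
    """Regression step: Delta-tilde(C1, C2) = sum_j (sum_{i in Cj} rho_i)^2 / |Cj|."""
    gain = 0.0
    for mask in (left_mask, ~left_mask):
        if mask.sum() == 0:
            return -np.inf
        gain += rho[mask].sum() ** 2 / mask.sum()
    return gain

def best_split(X_parent, Y_parent, feature=0):
    """Scan thresholds on one feature, keeping the split with the largest Delta-tilde."""
    _, rho = labeling_step(Y_parent)
    thresholds = np.unique(X_parent[:, feature])[:-1]
    scores = [proxy_gain(rho, X_parent[:, feature] <= t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

rng = np.random.default_rng(3)
X = rng.uniform(size=(400, 2))
Y = 2.0 * (X[:, 0] > 0.3) + rng.normal(scale=0.5, size=400)
print(best_split(X, Y))   # recovered threshold should be near 0.3
```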

2.2 Asymptotic analysis

Define the expected score function
$M_{\theta, \nu}(x):=\mathbb{E}\left[\psi_{\theta, \nu}(O) \mid X=x\right]$


Under the regularity assumptions of the GRF paper, the forest estimator satisfies:

  • consistency: $\hat{\theta}(x) \rightarrow \theta(x)$ in probability
  • approximate normality: $(\hat{\theta}(x)-\theta(x)) / \sigma_n(x) \Rightarrow \mathcal{N}(0,1)$ for some sequence $\sigma_n(x) \rightarrow 0$

2.3 Experiments

1. CAPE (conditional average partial effect)

$Y_i=W_i \cdot b_i+\varepsilon_i$, $\beta(x)=\mathbb{E}\left[b_i \mid X_i=x\right]$. The goal is to estimate $\theta(x)=\xi \cdot \beta(x)$ with $\xi \in \mathbb{R}^p$, using the score function
$\psi_{\beta(x), c(x)}\left(Y_i, W_i\right)=\left(Y_i-\beta(x) \cdot W_i-c(x)\right)\left(1 \; W_i^{\top}\right)^{\top}$
Then $\arg \min _\theta\left\|\mathbb{E}\left[\psi_{\theta(x), \nu(x)}\left(O_i\right)\right]\right\|_2$ identifies
$\theta(x)=\xi^{\top} \operatorname{Var}\left[W_i \mid X_i=x\right]^{-1} \operatorname{Cov}\left[W_i, Y_i \mid X_i=x\right]$

  • Forest estimator
    $\hat{\theta}(x)=\xi^{\top}\left(\sum_{i=1}^n \alpha_i(x)\left(W_i-\bar{W}_\alpha\right)^{\otimes 2}\right)^{-1} \sum_{i=1}^n \alpha_i(x)\left(W_i-\bar{W}_\alpha\right)\left(Y_i-\bar{Y}_\alpha\right)$
    where $\bar{W}_\alpha=\sum \alpha_i(x) W_i$, $\bar{Y}_\alpha=\sum \alpha_i(x) Y_i$, and $v^{\otimes 2}=v v^{\top}$.
    When implementing GRF, the weights are obtained automatically, but the corresponding pseudo-outcomes must still be computed; note that $\rho_i$ and $A_P$ only involve $\theta(x)$:
    $\psi_i=W_i\left(Y_i-\bar{Y}_P-\left(W_i-\bar{W}_P\right) \hat{\beta}_P\right)$, $\nabla \psi_i = W_i^{\otimes 2}$
    Compared with these, the expressions actually plugged in additionally center $W$:
    $\rho_i =\xi^{\top} A_P^{-1}\left(W_i-\bar{W}_P\right)\left(Y_i-\bar{Y}_P-\left(W_i-\bar{W}_P\right) \hat{\beta}_P\right)$, $A_P =\frac{1}{\left|\left\{i: X_i \in P\right\}\right|} \sum_{\left\{i: X_i \in P\right\}}\left(W_i-\bar{W}_P\right)^{\otimes 2}$
  • Local centering
    Pre-center Y and W (i.e., work with residuals), which improves the estimate (a sketch follows below):
    $\theta(x)= \xi^{\top} \operatorname{Var}\left[\left(W_i-\mathbb{E}\left[W_i \mid X_i\right]\right) \mid X_i \in \mathcal{S}\right]^{-1} \times \operatorname{Cov}\left[\left(W_i-\mathbb{E}\left[W_i \mid X_i\right]\right),\left(Y_i-\mathbb{E}\left[Y_i \mid X_i\right]\right) \mid X_i \in \mathcal{S}\right]$
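
A sketch combining local centering with the weighted estimator above (my own simplification, not the grf package): the nuisance regressions are fitted without cross-fitting, and the forest that produces the weights $\alpha_i(x)$ is an ordinary regression forest on the residualized outcome rather than a causal forest, so this only illustrates the prediction step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 2000
X = rng.uniform(size=(n, 3))
W = rng.binomial(1, 0.4 + 0.2 * X[:, 0]).astype(float)   # confounded binary treatment
tau = 1.0 + X[:, 1]                                       # true CATE
Y = tau * W + X[:, 0] + rng.normal(scale=0.5, size=n)

# local centering: residualize Y and W on X (cross-fitting omitted for brevity)
Y_res = Y - RandomForestRegressor(min_samples_leaf=20).fit(X, Y).predict(X)
W_res = W - RandomForestRegressor(min_samples_leaf=20).fit(X, W).predict(X)

# forest weights alpha_i(x) via leaf co-occurrence (same construction as in 2.1;
# this forest splits on the residualized outcome, not with the causal criterion)
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20).fit(X, Y_res)
x0 = np.array([0.5, 0.5, 0.5])
same = forest.apply(X) == forest.apply(x0.reshape(1, -1))[0]
alpha = (same / same.sum(axis=0)).mean(axis=1)

# alpha-weighted least squares with alpha-centered residuals (scalar-W case)
Wc = W_res - np.sum(alpha * W_res)
Yc = Y_res - np.sum(alpha * Y_res)
theta_hat = np.sum(alpha * Wc * Yc) / np.sum(alpha * Wc ** 2)
print(theta_hat, 1.0 + x0[1])   # forest estimate vs. true CATE at x0
```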

2. Quantile Regression Forest

Score function for the $q$-th conditional quantile:
$\psi_\theta\left(Y_i\right)=q \mathbf{1}\left(\left\{Y_i>\theta\right\}\right)-(1-q) \mathbf{1}\left(\left\{Y_i \leq \theta\right\}\right)$

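With forest weights $\alpha_i(x)$ in hand, solving the weighted estimating equation for this score amounts to taking a weighted quantile of the $Y_i$. A minimal sketch (illustrative, not the grf quantile forest):

```python
import numpy as np

def weighted_quantile(Y, alpha, q):
    """Smallest theta whose weighted CDF reaches q, i.e. a root of
    sum_i alpha_i * (q * 1{Y_i > theta} - (1 - q) * 1{Y_i <= theta})."""
    order = np.argsort(Y)
    cdf = np.cumsum(alpha[order])
    return Y[order][np.searchsorted(cdf, q * cdf[-1])]

# sanity check with uniform weights: should be close to the ordinary quantile
rng = np.random.default_rng(5)
Y = rng.normal(size=10_000)
alpha = np.full(Y.size, 1 / Y.size)
print(weighted_quantile(Y, alpha, 0.9), np.quantile(Y, 0.9))
```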

3. Orthogonal Random Forest for Causal Inference

3.1 Introduction

Advantage of DML: even if the first-stage nuisance estimates are somewhat inaccurate, the second-stage estimate remains approximately normal; drawback: the HTE must be given a pre-specified parametric form. Advantage of causal forests (CF): nonparametric estimation; drawback: they largely require the controls W to be low-dimensional. ORF builds on GRF and, following DML, adds an orthogonalized first-stage estimate of the nuisance parameters to reduce the resulting error.

At a high level, ORF can be viewed as an orthogonalized version of GRF that is more robust to the nuisance estimation error. The key modification to GRF’s tree learner is our incorporation of orthogonal nuisance estimation in the splitting criterion.


3.2 Algorithm

Each split made while growing a tree is a two-stage procedure.

1. first stage

$\hat{\nu}(x)=\arg \min \sum_i a_i \, \ell\left(Z_i ; \nu\right)+\lambda\|\nu\|_1$
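A minimal sketch of such a first-stage fit, assuming a squared loss so that the weighted penalized M-estimation becomes a weighted Lasso (the weights `a` below are only a stand-in for the forest weights $a_i(x)$):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, d = 1000, 50
XW = rng.normal(size=(n, d))                 # features entering the nuisance model
Y = XW[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
a = rng.dirichlet(np.ones(n))                # stand-in for the forest weights a_i(x)

# weighted l1-penalized fit: argmin_nu sum_i a_i (Y_i - XW_i . nu)^2 + lambda * ||nu||_1
nu_hat = Lasso(alpha=0.05).fit(XW, Y, sample_weight=a * n).coef_
print(np.round(nu_hat[:5], 2))               # sparse: only the first three are sizeable
```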

2. second stage

  • split

The split itself mirrors GRF's gradient tree algorithm, but to respect honesty the index sets change slightly: $S^1$ is the half of the data used for splitting, and $\hat{h}_P$ is the nuisance parameter estimated in the first stage.
$\tilde{\theta}_C=\hat{\theta}_P-\frac{1}{\left|C \cap S^1\right|} \sum_{i \in C \cap S^1} A_P^{-1} \psi\left(Z_i ; \hat{\theta}_P, \hat{h}_P\left(X_i, W_i\right)\right)$
where $A_P=\frac{1}{\left|P \cap S^1\right|} \sum_{i \in P \cap S^1} \nabla_\theta \psi\left(Z_i ; \hat{\theta}_P, \hat{h}_P\left(X_i, W_i\right)\right)$

  • Labeling step: compute $\hat{\theta}_P, \hat{h}_P, A_P^{-1}$ at the parent node, together with a pseudo-outcome for every sample:
    $\rho_{t, i}=A_P^{-1} \psi_t\left(Z_i ; \hat{\theta}_P, \hat{h}_P\left(X_i, W_i\right)\right)$

  • Regression step: maximize the proxy heterogeneity score
    $\tilde{\Delta}_t\left(C_1, C_2\right)=\sum_{j=1}^2 \frac{1}{\left|C_j \cap S^1\right|}\left(\sum_{i \in C_j \cap S^1} \rho_{t, i}\right)^2$

  • Predict

The weights $a_{ib}$ are likewise restricted to the estimation sample $S^2$:
$a_{i b}=\frac{\mathbf{1}\left[\left(X_i \in L_b(x)\right) \wedge\left(Z_i \in S_b^2\right)\right]}{\left|L_b(x) \cap S_b^2\right|}, \quad a_i=\frac{1}{B} \sum_{b=1}^B a_{i b}$

A theorem in the ORF paper guarantees that the weights $a_{ib}$ are nonzero within a neighborhood of $x$.


3.3 Experiments

  • DML Partially Linear Regression (PLR; Robinson, 1988)
    $\begin{array}{cl} Y=D \theta_0+g_0(X)+U, & \mathrm{E}[U \mid X, D]=0 \\ D=m_0(X)+V, & \mathrm{E}[V \mid X]=0 \end{array}$
    The score function is $\psi(W ; \theta, \eta)=(Y-D \theta-g(X))(D-m(X))$.

  • ORF
    Data $\mathcal{D}=\left\{Z_i=\left(T_i, Y_i, W_i, X_i\right)\right\}_{i=1}^{2n}$, where T is a continuous or discrete treatment, Y is the outcome, $W \in[-1,1]^{d_\nu}$ are potential confounders/controls, and $X \in[0,1]^d$ are the features.
    $\begin{array}{cl} Y=\left\langle\mu_0(X, W), T\right\rangle+f_0(X, W)+\varepsilon, & \mathbb{E}[\varepsilon \mid W, X, T]=0 \\ T=g_0(X, W)+\eta, & \mathbb{E}[\eta \mid X, W, \varepsilon]=0 \end{array}$
    The confounders affect the outcome and the treatment through $f_0$ and $g_0$, respectively.
    $\mu_0: \mathbb{R}^d \times \mathbb{R}^{d_\nu} \rightarrow[-1,1]^p$ is the treatment effect function, and the goal is to estimate the CATE
    $\theta_0(x)=\mathbb{E}\left[\mu_0(X, W) \mid X=x\right]$
    Following the DML idea, residualize:
    $Y-\mathbb{E}[Y \mid X, W]=\left\langle\mu_0(X, W), T-\mathbb{E}[T \mid X, W]\right\rangle+\varepsilon$
    Define $q_0(X, W)=\mathbb{E}[Y \mid X, W]$, $\tilde{Y}=Y-q_0(X, W)$, and $\tilde{T}=T-g_0(X, W)=\eta$; then
    $\mathbb{E}[\tilde{Y} \mid X, \tilde{T}]=\mathbb{E}\left[\mu_0(X, W) \mid X\right] \cdot \tilde{T}=\theta_0(X) \cdot \tilde{T}$
    and the score function is $\psi(Z ; \theta, h(X, W))=\left(Y-q(X, W)-\langle\theta, T-g(X, W)\rangle\right)(T-g(X, W))$
    where $q$ and $g$ are estimates of $q_0$ and $g_0$. A constant-effect version of this residual-on-residual moment is sketched after this list.
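
The residual-on-residual idea is easiest to see in the constant-effect (PLR) special case; the sketch below (cross-fitting omitted, not the econml implementation) residualizes $Y$ and $T$ on $(X, W)$ and solves the empirical version of the score for a scalar $\theta$:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 4000
X = rng.uniform(size=(n, 5))
W = rng.normal(size=(n, 3))                       # observed controls/confounders
T = X[:, 0] + W[:, 0] + rng.normal(size=n)        # treatment depends on (X, W)
theta_true = 2.0
Y = theta_true * T + np.sin(3 * X[:, 1]) + W[:, 0] + rng.normal(size=n)

XW = np.hstack([X, W])
T_res = T - GradientBoostingRegressor().fit(XW, T).predict(XW)   # T - g(X, W)
Y_res = Y - GradientBoostingRegressor().fit(XW, Y).predict(XW)   # Y - q(X, W)

# empirical score: sum_i (Y_res_i - theta * T_res_i) * T_res_i = 0
theta_hat = np.sum(T_res * Y_res) / np.sum(T_res ** 2)
print(theta_hat)   # close to theta_true = 2.0
```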

4. Decision trees for uplift modeling with single and multiple treatments

4.1 Single Treatment

  • Split rule: maximize the divergence between the treatment and control class distributions
    $D_{\text {gain }}(A)=D\left(P^T(Y): P^C(Y) \mid A\right)-D\left(P^T(Y): P^C(Y)\right)$

  • Normalizing: C4.5 divides the gain by the split info to avoid bias; here, the normalization mainly penalizes splits under which the treatment/control proportions differ across the child nodes, since such imbalance contradicts the randomized-assignment assumption. In the expressions below, the factor in the first term accounts for this imbalance, while the remaining terms account for the relative sample sizes (a sketch of the resulting criterion follows after this list).
    (1) D = KL:
    $I(A)= H\left(\frac{N^T}{N}, \frac{N^C}{N}\right) K L\left(P^T(A): P^C(A)\right) +\frac{N^T}{N} H\left(P^T(A)\right)+\frac{N^C}{N} H\left(P^C(A)\right)+\frac{1}{2}$
    (2) D = squared Euclidean / chi-squared:
    $J(A)=\operatorname{Gini}\left(\frac{N^T}{N}, \frac{N^C}{N}\right) D\left(P^T(A): P^C(A)\right) + \frac{N^T}{N} \operatorname{Gini}\left(P^T(A)\right)+\frac{N^C}{N} \operatorname{Gini}\left(P^C(A)\right)+\frac{1}{2}$
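
A sketch of the KL-based criterion above for a binary outcome and a single binary test A (helper names are my own); the test maximizing the normalized gain $D_{\text{gain}}(A)/I(A)$ would be chosen:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q, eps=1e-12):
    """KL(p : q) for discrete distributions over the same outcomes."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return np.sum(p * np.log2(p / q))

def class_dist(y):
    """Empirical distribution of a binary outcome."""
    return np.array([1 - y.mean(), y.mean()])

def kl_gain(y_t, y_c, a_t, a_c):
    """D_gain(A) = sum_a P(a) KL(P^T(Y|a) : P^C(Y|a)) - KL(P^T(Y) : P^C(Y))."""
    n = len(y_t) + len(y_c)
    conditional = 0.0
    for a in (True, False):
        n_a = (a_t == a).sum() + (a_c == a).sum()
        conditional += n_a / n * kl(class_dist(y_t[a_t == a]), class_dist(y_c[a_c == a]))
    return conditional - kl(class_dist(y_t), class_dist(y_c))

def kl_norm(a_t, a_c):
    """I(A): penalizes treatment/control imbalance across the split (KL variant)."""
    n_t, n_c = len(a_t), len(a_c)
    n = n_t + n_c
    p_t = np.array([a_t.mean(), 1 - a_t.mean()])   # P^T(A)
    p_c = np.array([a_c.mean(), 1 - a_c.mean()])   # P^C(A)
    return (entropy([n_t / n, n_c / n]) * kl(p_t, p_c)
            + n_t / n * entropy(p_t) + n_c / n * entropy(p_c) + 0.5)

# toy data: uplift only for x > 0.5, so the test A = {x > 0.5} should score well
rng = np.random.default_rng(8)
x_t, x_c = rng.uniform(size=500), rng.uniform(size=400)
y_t = (rng.uniform(size=500) < 0.2 + 0.4 * (x_t > 0.5)).astype(int)
y_c = (rng.uniform(size=400) < 0.2).astype(int)
a_t, a_c = x_t > 0.5, x_c > 0.5
print(kl_gain(y_t, y_c, a_t, a_c) / kl_norm(a_t, a_c))   # normalized gain of this test
```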

4.2 Multiple treatment

  • Split rule


  • Normalizing
    $I(A)=\alpha H\left(\frac{N^T}{N}, \frac{N^C}{N}\right) K L\left(P^T(A): P^C(A)\right) +(1-\alpha) \sum_{i=1}^k H\left(\frac{N^{T_i}}{N^{T_i}+N^C}, \frac{N^C}{N^{T_i}+N^C}\right) K L\left(P^{T_i}(A): P^C(A)\right) +\sum_{i=1}^k \frac{N^{T_i}}{N} H\left(P^{T_i}(A)\right)+\frac{N^C}{N} H\left(P^C(A)\right)+\frac{1}{2}$