Paper Review
Uplift model with multiple treatments
1. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
In the binary-treatment setting, the goal is to estimate $\tau(x)=\mathbb{E}[Y^{(1)}-Y^{(0)} \mid X=x]$.
1.1 Asymptotic analysis
- Under some conditions, $(\hat{\tau}(x)-\tau(x)) / \sqrt{\operatorname{Var}[\hat{\tau}(x)]} \Rightarrow \mathcal{N}(0,1)$.
- $\operatorname{Var}[\hat{\tau}(x)]$ can be estimated with the infinitesimal jackknife, which is consistent in the sense that $\widehat{V}_{IJ}(x) / \operatorname{Var}[\hat{\tau}(x)] \rightarrow 1$:
$$\widehat{V}_{IJ}(x)=\frac{n-1}{n}\left(\frac{n}{n-s}\right)^2 \sum_{i=1}^n \operatorname{Cov}_*\left[\hat{\tau}_b^*(x), N_{ib}^*\right]^2$$
The finite-sample correction factor $(n-1)n/(n-s)^2$ is valid only for subsampling without replacement.
The proof proceeds in two steps:
- first, bound the bias $\mathbb{E}[\hat{\mu}_n(x)]-\mu(x)$
- then, show that $\hat{\mu}_n(x)-\mathbb{E}[\hat{\mu}_n(x)]$ is approximately normal
Using the Hájek projection and the k-PNN property, first show that $T$ is $\nu$-incremental:
$$\stackrel{\circ}{T}=\mathbb{E}[T]+\sum_{i=1}^n\left(\mathbb{E}\left[T \mid Z_i\right]-\mathbb{E}[T]\right)$$
1.2 Double-Sample Trees
The regression tree splits to minimize the MSE, with leaf prediction $\hat{\mu}(x)=\frac{1}{\left|\left\{i: X_i \in L(x)\right\}\right|} \sum_{\left\{i: X_i \in L(x)\right\}} Y_i=\bar{Y}_L$. Expanding the squared error,
$$\sum_{i \in \mathcal{J}}\left(\hat{\mu}\left(X_i\right)-Y_i\right)^2=\sum_{i \in \mathcal{J}} Y_i^2-\sum_{i \in \mathcal{J}} \hat{\mu}\left(X_i\right)^2$$
Since $\sum_{i \in \mathcal{J}} \hat{\mu}\left(X_i\right)=\sum_{i \in \mathcal{J}} Y_i$, minimizing the MSE is equivalent to maximizing the variance of the predictions $\hat{\mu}(X_i)$.
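A quick numeric check of this identity (a minimal sketch with numpy; the two-leaf assignment is made up for illustration, not produced by an actual tree):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=12)
leaf = np.array([0] * 5 + [1] * 7)       # a candidate 2-leaf split

# leaf-mean prediction mu_hat(X_i) = mean of y within the sample's leaf
mu_hat = np.array([y[leaf == g].mean() for g in leaf])

mse = np.sum((mu_hat - y) ** 2)
assert np.isclose(mse, np.sum(y ** 2) - np.sum(mu_hat ** 2))  # the identity
assert np.isclose(np.sum(mu_hat), np.sum(y))                  # predictions preserve the sum
```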
2. Generalized Random Forests
2.1 Algorithm
1. Forest-based local estimation
Goal: given data $(O_i, X_i)$, estimate $\theta(\cdot)$; e.g. for HTE estimation, $O_i=(Y_i, W_i)$.
Method: solve the local estimating equation $\mathbb{E}\left[\psi_{\theta(x), \nu(x)}\left(O_i\right) \mid X_i=x\right]=0$, where $\theta(x)$ is the parameter of interest and $\nu(x)$ is a nuisance parameter.
- Weighting step: $\alpha_i(x)$ measures the similarity between $X_i$ and $x$, using the leaf co-occurrence frequency as the weight:
$$\alpha_{bi}(x)=\frac{\mathbf{1}\left(\left\{X_i \in L_b(x)\right\}\right)}{\left|L_b(x)\right|}, \quad \alpha_i(x)=\frac{1}{B} \sum_{b=1}^B \alpha_{bi}(x)$$
where $L_b(x)$ is the set of training points in the leaf of tree $b$ that contains $x$.
- Weighted solving step:
$$(\hat{\theta}(x), \hat{\nu}(x)) \in \underset{\theta, \nu}{\operatorname{argmin}}\left\{\left\|\sum_{i=1}^n \alpha_i(x) \psi_{\theta, \nu}\left(O_i\right)\right\|_2\right\}$$
Example: to estimate $\mu(x)=\mathbb{E}\left[Y_i \mid X_i=x\right]$, take $\psi_{\mu(x)}\left(Y_i\right)=Y_i-\mu(x)$; then $\sum_{i=1}^n \frac{1}{B} \sum_b \alpha_{bi}(x)\left(Y_i-\hat{\mu}(x)\right)=0$, whose solution is $\hat{\mu}(x)=\frac{1}{B} \sum_b \hat{\mu}_b(x)$, the usual forest average.
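This averaging identity can be checked numerically. A minimal sketch using scikit-learn's RandomForestRegressor as a stand-in forest (an assumption for illustration; GRF uses honest subsampled trees, which sklearn does not implement). With `bootstrap=False`, each leaf value is the mean over the full training set, which makes the identity exact:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                           max_features=0.5, bootstrap=False,
                           random_state=0).fit(X, y)

x0 = np.zeros((1, 3))
leaves_train = rf.apply(X)      # (n, B) leaf ids of the training points
leaves_x0 = rf.apply(x0)        # (1, B) leaf ids of the query point

# alpha_{bi}(x) = 1{X_i in L_b(x)} / |L_b(x)|, averaged over the B trees
same_leaf = leaves_train == leaves_x0
alpha = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
assert np.isclose(alpha.sum(), 1.0)

# solving sum_i alpha_i(x) (Y_i - mu(x)) = 0 recovers the forest prediction
mu_hat = alpha @ y
assert np.allclose(mu_hat, rf.predict(x0)[0])
```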
2. Splitting to maximize heterogeneity
For a node $P$ and sample $\mathcal{J}$, the node parameters are estimated by
$$\left(\hat{\theta}_P, \hat{\nu}_P\right)(\mathcal{J}) \in \arg \min _{\theta, \nu}\left\|\sum_{\left\{i \in \mathcal{J}: X_i \in P\right\}} \psi_{\theta, \nu}\left(O_i\right)\right\|_2$$
Splitting $P$ into two children $C_1, C_2$ ideally minimizes $\operatorname{err}\left(C_1, C_2\right)$:
$$\operatorname{err}\left(C_1, C_2\right)=\sum_{j=1,2} \mathbb{P}\left[X \in C_j \mid X \in P\right] \mathbb{E}\left[\left(\hat{\theta}_{C_j}(\mathcal{J})-\theta(X)\right)^2 \mid X \in C_j\right]$$
Under certain conditions, $\operatorname{err}\left(C_1, C_2\right)=K(P)-\mathbb{E}\left[\Delta\left(C_1, C_2\right)\right]+o\left(r^2\right)$, so splitting is equivalent to maximizing the heterogeneity between the child nodes:
$$\Delta\left(C_1, C_2\right)=\frac{n_{C_1} n_{C_2}}{n_P^2}\left(\hat{\theta}_{C_1}(\mathcal{J})-\hat{\theta}_{C_2}(\mathcal{J})\right)^2$$
3. The gradient tree algorithm
To reduce computation, the child estimates are approximated by a gradient step:
$$\tilde{\theta}_C=\hat{\theta}_P-\frac{1}{\left|\left\{i: X_i \in C\right\}\right|} \sum_{\left\{i: X_i \in C\right\}} \xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)$$
where $\xi$ extracts the $\theta$-coordinates (discarding the nuisance components), and $A_P$ approximates $\nabla \mathbb{E}\left[\psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right) \mid X_i \in P\right]$:
$$A_P=\frac{1}{\left|\left\{i: X_i \in P\right\}\right|} \sum_{\left\{i: X_i \in P\right\}} \nabla \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)$$
Note: when $\psi$ is not differentiable (as in quantile regression), a suitable surrogate for the gradient can be used.
The splitting stage thus consists of two steps:
- labeling step: compute the parent-node quantities $\hat{\theta}_P, \hat{\nu}_P, A_P^{-1}$, and each sample's pseudo-outcome
$$\rho_i=-\xi^{\top} A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}\left(O_i\right)$$
- regression step: maximize the approximate splitting criterion
$$\tilde{\Delta}\left(C_1, C_2\right)=\sum_{j=1}^2 \frac{1}{\left|\left\{i: X_i \in C_j\right\}\right|}\left(\sum_{\left\{i: X_i \in C_j\right\}} \rho_i\right)^2$$
In the regression step, the approximation error of this splitting criterion is bounded.
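For intuition, in the conditional-mean case $\psi_\mu(Y)=Y-\mu$ we have $A_P=-1$ and $\xi=1$, so $\rho_i=Y_i-\hat{\mu}_P$ and $\tilde{\Delta}$ reduces to the classic CART variance-reduction score. The sketch below (made-up data and split) checks that $\tilde{\Delta}=n_P \Delta$ exactly in this case:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)
n1 = 40
c1, c2 = y[:n1], y[n1:]                   # a candidate split of parent P

rho = y - y.mean()                        # labeling: rho_i = Y_i - mu_hat_P
# regression step: proxy criterion on the pseudo-outcomes
delta_tilde = rho[:n1].sum() ** 2 / n1 + rho[n1:].sum() ** 2 / (len(y) - n1)

# exact heterogeneity Delta(C1, C2) = n1 n2 / n^2 * (theta_C1 - theta_C2)^2
n, n2 = len(y), len(y) - n1
delta = n1 * n2 / n ** 2 * (c1.mean() - c2.mean()) ** 2
assert np.isclose(delta_tilde, n * delta)
```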
2.2 Asymptotic analysis
Define the expected score function
$$M_{\theta, \nu}(x):=\mathbb{E}\left[\psi_{\theta, \nu}(O) \mid X=x\right]$$
- consistency
- approximate normality
2.3 Experiments
1. CAPE (conditional average partial effect)
$$Y_i=W_i \cdot b_i+\varepsilon_i, \qquad \beta(x)=\mathbb{E}\left[b_i \mid X_i=x\right]$$
The goal is to estimate $\theta(x)=\xi \cdot \beta(x)$, $\xi \in \mathbb{R}^p$, with score function
$$\psi_{\beta(x), c(x)}\left(Y_i, W_i\right)=\left(Y_i-\beta(x) \cdot W_i-c(x)\right)\left(1 \ W_i^{\top}\right)^{\top}$$
Then $\arg \min _\theta\left\|\mathbb{E}\left[\psi_{\theta(x), \nu(x)}\left(O_i\right)\right]\right\|_2$ is equivalent to
$$\theta(x)=\xi^{\top} \operatorname{Var}\left[W_i \mid X_i=x\right]^{-1} \operatorname{Cov}\left[W_i, Y_i \mid X_i=x\right]$$
- Forest
$$\hat{\theta}(x)=\xi^{\top}\left(\sum_{i=1}^n \alpha_i(x)\left(W_i-\bar{W}_\alpha\right)^{\otimes 2}\right)^{-1} \sum_{i=1}^n \alpha_i(x)\left(W_i-\bar{W}_\alpha\right)\left(Y_i-\bar{Y}_\alpha\right)$$
where $\bar{W}_\alpha=\sum \alpha_i(x) W_i$, $\bar{Y}_\alpha=\sum \alpha_i(x) Y_i$, and $v^{\otimes 2}=v v^{\top}$.
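As a sketch, this weighted solution can be computed directly with numpy; the uniform weights below are a stand-in for forest weights $\alpha_i(x)$ (assumed given), and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 400, 2
W = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0])
Y = W @ beta + rng.normal(size=n)

alpha = np.full(n, 1.0 / n)     # stand-in for forest weights alpha_i(x)
xi = np.array([1.0, 0.0])       # extract the first coordinate of beta

Wbar, Ybar = alpha @ W, alpha @ Y
Wc = W - Wbar
A = (alpha[:, None] * Wc).T @ Wc        # sum_i alpha_i (W_i - Wbar)^{otimes 2}
b = (alpha * (Y - Ybar)) @ Wc           # sum_i alpha_i (W_i - Wbar)(Y_i - Ybar)
theta_hat = xi @ np.linalg.solve(A, b)
assert abs(theta_hat - beta[0]) < 0.2   # recovers the first slope coordinate
```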
When the GRF algorithm is implemented, the weights are obtained automatically, but the corresponding pseudo-outcomes must be computed; note that $\rho_i$ and $A_P$ target only $\theta(x)$:
$$\psi_i=W_i\left(Y_i-\bar{Y}_P-\left(W_i-\bar{W}_P\right) \hat{\beta}_P\right), \qquad \nabla \psi_i=W_i^{\otimes 2}$$
Comparing with the expressions actually used, the implementation additionally centers $W$:
$$\rho_i=\xi^{\top} A_P^{-1}\left(W_i-\bar{W}_P\right)\left(Y_i-\bar{Y}_P-\left(W_i-\bar{W}_P\right) \hat{\beta}_P\right), \qquad A_P=\frac{1}{\left|\left\{i: X_i \in P\right\}\right|} \sum_{\left\{i: X_i \in P\right\}}\left(W_i-\bar{W}_P\right)^{\otimes 2}$$
- Local Centering
Centering $Y$ and $W$ in advance (i.e. working with residuals) improves the estimates:
$$\begin{aligned} \theta(x)= & \xi^{\top} \operatorname{Var}\left[\left(W_i-\mathbb{E}\left[W_i \mid X_i\right]\right) \mid X_i \in \mathcal{S}\right]^{-1} \\ & \times \operatorname{Cov}\left[\left(W_i-\mathbb{E}\left[W_i \mid X_i\right]\right),\left(Y_i-\mathbb{E}\left[Y_i \mid X_i\right]\right) \mid X_i \in \mathcal{S}\right] \end{aligned}$$
2. Quantile Regression Forest
$$\psi_\theta\left(Y_i\right)=q \mathbf{1}\left(\left\{Y_i>\theta\right\}\right)-(1-q) \mathbf{1}\left(\left\{Y_i \leq \theta\right\}\right)$$
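A small sketch (synthetic normal sample) showing that the root of $\sum_i \psi_\theta(Y_i)=0$ is the empirical $q$-quantile:

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.sort(rng.normal(size=1001))
q = 0.9

def total_score(theta):
    # sum_i psi_theta(Y_i) = q * #{Y_i > theta} - (1 - q) * #{Y_i <= theta}
    return q * np.sum(y > theta) - (1 - q) * np.sum(y <= theta)

# the score is decreasing in theta and crosses zero at the q-quantile
theta_hat = min(y, key=lambda t: abs(total_score(t)))
assert theta_hat == y[900]                 # the order statistic with k = round(q * n)
assert np.isclose(theta_hat, np.quantile(y, q))
```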
3. Orthogonal Random Forest for Causal Inference
3.1 Introduction
Advantages of DML: even if the first-stage nuisance estimates have error, the second-stage estimate is still approximately normal; drawback: the HTE must take a pre-specified parametric form. Advantage of causal forests (CF): nonparametric estimation; drawback: they largely require the controls $W$ to be low-dimensional. ORF builds on GRF and, following DML, adds an orthogonal first-stage estimate of the nuisance parameters to reduce the error.
At a high level, ORF can be viewed as an orthogonalized version of GRF that is more robust to the nuisance estimation error. The key modification to GRF’s tree learner is our incorporation of orthogonal nuisance estimation in the splitting criterion.
3.2 Algorithm
When growing a tree, each split is carried out in two stages:
1. first stage
$$\hat{\nu}(x)=\arg \min_\nu \sum_i a_i \, l\left(Z_i ; \nu\right)+\lambda\|\nu\|_1$$
2. second stage
- split
The splitting procedure follows GRF's gradient tree algorithm, but to respect honesty the index sets change slightly: $S^1$ is the data used for splitting, and $\hat{h}_P$ is the nuisance parameter estimated in the first stage.
$$\tilde{\theta}_C=\hat{\theta}_P-\frac{1}{\left|C \cap S^1\right|} \sum_{i \in C \cap S^1} A_P^{-1} \psi\left(Z_i ; \hat{\theta}_P, \hat{h}_P\left(X_i, W_i\right)\right)$$
where
$$A_P=\frac{1}{\left|P \cap S^1\right|} \sum_{i \in P \cap S^1} \nabla_\theta \psi\left(Z_i ; \hat{\theta}_P, \hat{h}_P\left(X_i, W_i\right)\right)$$
- labeling step: compute the parent-node quantities $\hat{\theta}_P, \hat{h}_P, A_P^{-1}$, and each sample's pseudo-outcome
$$\rho_{t, i}=A_P^{-1} \psi_t\left(Z_i ; \hat{\theta}_P, \hat{h}_P\left(X_i, W_i\right)\right)$$
- regression step: maximize the proxy heterogeneity score
$$\tilde{\Delta}_t\left(C_1, C_2\right)=\sum_{j=1}^2 \frac{1}{\left|C_j \cap S^1\right|}\left(\sum_{i \in C_j \cap S^1} \rho_{t, i}\right)^2$$
- Predict: the weights $a_{ib}$ are likewise restricted to the estimation samples in $S^2$:
$$a_{i b}=\frac{\mathbf{1}\left[\left(X_i \in L_b(x)\right) \wedge\left(Z_i \in S_b^2\right)\right]}{\left|L_b(x) \cap S_b^2\right|}, \quad a_i=\frac{1}{B} \sum_{b=1}^B a_{i b}$$
A theorem in the paper guarantees that $a_{ib}$ is nonzero in a neighborhood of $x$.
3.3 Experiments
- DML Partially Linear Regression (PLR; Robinson, 1988)
$$\begin{array}{cl} Y=D \theta_0+g_0(X)+U, & \mathrm{E}[U \mid X, D]=0 \\ D=m_0(X)+V, & \mathrm{E}[V \mid X]=0 \end{array}$$
The score function is
$$\psi(W ; \theta, \eta)=(Y-D \theta-g(X))(D-m(X))$$
- ORF
Data: $D=\left\{Z_i=\left(T_i, Y_i, W_i, X_i\right)\right\}_{i=1}^{2 n}$, where $T$ is a continuous or discrete treatment, $Y$ is the outcome, $W \in[-1,1]^{d_\nu}$ are potential confounders/controls, and $X \in[0,1]^d$ are the features.
$$\begin{array}{cl} Y=\left\langle\mu_0(X, W), T\right\rangle+f_0(X, W)+\varepsilon, & \mathbb{E}[\varepsilon \mid W, X, T]=0 \\ T=g_0(X, W)+\eta, & \mathbb{E}[\eta \mid X, W, \varepsilon]=0 \end{array}$$
The confounders affect the outcome and the treatment through $f_0$ and $g_0$, respectively. $\mu_0: \mathbb{R}^d \times \mathbb{R}^{d_\nu} \rightarrow[-1,1]^p$ is the treatment effect function, and the goal is to estimate the CATE
$$\theta_0(x)=\mathbb{E}\left[\mu_0(X, W) \mid X=x\right]$$
Following the DML idea, residualize:
$$Y-\mathbb{E}[Y \mid X, W]=\left\langle\mu_0(X, W), T-\mathbb{E}[T \mid X, W]\right\rangle+\varepsilon$$
Define $q_0(X, W)=\mathbb{E}[Y \mid X, W]$, $\tilde{Y}=Y-q_0(X, W)$, and $\tilde{T}=T-g_0(X, W)=\eta$; then
$$\mathbb{E}[\tilde{Y} \mid X, \tilde{T}]=\mathbb{E}\left[\mu_0(X, W) \mid X\right] \cdot \tilde{T}=\theta(X) \cdot \tilde{T}$$
and the score function is
$$\psi(Z ; \theta, h(X, W))=\left(Y-q(X, W)-\langle\theta, T-g(X, W)\rangle\right)(T-g(X, W))$$
where $q, g$ are estimates of $q_0, g_0$.
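A minimal residualize-then-regress sketch of this idea for a constant scalar $\theta$, with lasso nuisance estimates and two-fold cross-fitting (this is the DML recipe on synthetic data, not the full ORF tree learner; all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 2000, 20
X = rng.normal(size=(n, p))
theta0 = 2.0
g0 = X[:, 0] + 0.5 * X[:, 1]           # confounding in the treatment
f0 = X[:, 0] - X[:, 2]                 # confounding in the outcome
T = g0 + rng.normal(size=n)
Y = theta0 * T + f0 + rng.normal(size=n)

# first stage: estimate E[T|X] and E[Y|X] by lasso, cross-fitted in two folds
T_res, Y_res = np.empty(n), np.empty(n)
folds = [(slice(0, n // 2), slice(n // 2, n)), (slice(n // 2, n), slice(0, n // 2))]
for train, test in folds:
    T_res[test] = T[test] - Lasso(alpha=0.01).fit(X[train], T[train]).predict(X[test])
    Y_res[test] = Y[test] - Lasso(alpha=0.01).fit(X[train], Y[train]).predict(X[test])

# second stage: solve sum_i psi = 0  =>  theta_hat = <T_res, Y_res> / <T_res, T_res>
theta_hat = T_res @ Y_res / (T_res @ T_res)
assert abs(theta_hat - theta0) < 0.1   # nuisance error only enters at second order
```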
4. Decision trees for uplift modeling with single and multiple treatments
4.1 Single Treatment
- Split rule: maximize the divergence between the treatment and control class distributions
$$D_{\text{gain}}(A)=D\left(P^T(Y): P^C(Y) \mid A\right)-D\left(P^T(Y): P^C(Y)\right)$$
- Normalising: C4.5 divides the gain by the split info to avoid bias, whereas the normalization here mainly penalizes splits whose children have unbalanced treatment/control proportions, which would contradict the randomized-trial assumption.
In the formulas below, the first term penalizes proportion imbalance, and the last two terms account for the relative sample sizes.
(1) D=KL:
$$\begin{aligned} I(A)= & H\left(\frac{N^T}{N}, \frac{N^C}{N}\right) KL\left(P^T(A): P^C(A)\right) \\ & +\frac{N^T}{N} H\left(P^T(A)\right)+\frac{N^C}{N} H\left(P^C(A)\right)+\frac{1}{2}\end{aligned}$$
(2) D = Euclidean or chi-squared:
$$\begin{aligned} J(A)= & \operatorname{Gini}\left(\frac{N^T}{N}, \frac{N^C}{N}\right) D\left(P^T(A): P^C(A)\right) \\ & +\frac{N^T}{N} \operatorname{Gini}\left(P^T(A)\right)+\frac{N^C}{N} \operatorname{Gini}\left(P^C(A)\right)+\frac{1}{2}\end{aligned}$$
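A toy computation of $D_{\text{gain}}(A)$ with $D=KL$ (all class distributions below are made up for illustration):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence (base 2) between two discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log2(p / q)))

# class distributions P^T(Y), P^C(Y) in the parent node
pT, pC = np.array([0.30, 0.70]), np.array([0.20, 0.80])
d_parent = kl(pT, pC)

# after the candidate split A: (P^T, P^C) per child, with sample fractions N(a)/N
children = [((np.array([0.50, 0.50]), np.array([0.20, 0.80])), 0.4),
            ((np.array([0.17, 0.83]), np.array([0.20, 0.80])), 0.6)]
d_cond = sum(w * kl(pt, pc) for (pt, pc), w in children)   # D(P^T : P^C | A)

d_gain = d_cond - d_parent
assert d_gain > 0   # this split increases the treatment/control divergence
```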
4.2 Multiple treatment
- Split rule
- Normalizing
$$\begin{aligned} I(A)= & \alpha H\left(\frac{N^T}{N}, \frac{N^C}{N}\right) KL\left(P^T(A): P^C(A)\right) \\ & +(1-\alpha) \sum_{i=1}^k H\left(\frac{N^{T_i}}{N^{T_i}+N^C}, \frac{N^C}{N^{T_i}+N^C}\right) KL\left(P^{T_i}(A): P^C(A)\right) \\ & +\sum_{i=1}^k \frac{N^{T_i}}{N} H\left(P^{T_i}(A)\right)+\frac{N^C}{N} H\left(P^C(A)\right)+\frac{1}{2} \end{aligned}$$