1. Starting from Linear Regression
A simple way to obtain a causal effect from observational data is to use linear regression and control for the confounders:
$$Sales_i = \alpha + \tau\, Price_i + \beta_1\, Temp_i + \beta_2\, Cost_i + \beta_3\, Weekday_i + e_i$$
$\tau$ is the only quantity we need to care about, because $\tau$ is the causal effect of price on sales that we want to estimate. We do not care about the other variables themselves, but they must be handled correctly in the linear model; otherwise the resulting $\tau$ is a wrong estimate. For example, if the effect of temperature on sales is quadratic, the treatment effect estimated from the equation above is wrong.
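As a minimal sketch of this regression (the data below are simulated purely for illustration; the variable names and coefficient values are my own assumptions, not from the original post):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated ice-cream data (illustrative only). The true price effect is set to -3.
rng = np.random.default_rng(0)
n = 1000
temp = rng.uniform(15, 35, n)
weekend = rng.integers(0, 2, n)              # stand-in for the Weekday confounder
cost = rng.uniform(0.5, 1.5, n)
price = 1.0 + 0.05 * temp + 0.5 * weekend + cost + rng.normal(0, 0.1, n)
sales = 30 - 3.0 * price + 0.8 * temp + 5.0 * weekend + rng.normal(0, 1.0, n)
df = pd.DataFrame(dict(sales=sales, price=price, temp=temp, cost=cost, weekend=weekend))

# tau is the coefficient on price; it is only a credible causal estimate if every
# confounder enters the model with the correct functional form.
model = smf.ols("sales ~ price + temp + cost + weekend", data=df).fit()
print(model.params["price"])   # close to the true tau = -3
```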
2. The Frisch-Waugh-Lovell (FWL) Theorem
Suppose we have features $X_1$ and $X_2$.
Ordinary linear regression: $\hat Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
Regression based on the Frisch-Waugh-Lovell theorem:
- Regress $Y$ on $X_2$: $\hat y^* = \gamma_1 X_2$
- Regress $X_1$ on $X_2$: $\hat X_1 = \gamma_2 X_2$
- Compute the residuals: $\widetilde X_1 = X_1 - \hat X_1$, $\widetilde y = y - \hat y^*$
- Regress the residual of $y$ on the residual of $X_1$: $\widetilde y = \gamma_3 \widetilde X_1$
The two procedures give the same coefficient: $\beta_1 = \gamma_3$.
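A quick numerical check of this equivalence (simulated data; the specific numbers are illustrative, not from the original post):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

# Full regression: Y on [1, X1, X2]
beta1 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params[1]

# FWL: residualize Y and X1 on [1, X2], then regress residual on residual
y_res = y - sm.OLS(y, sm.add_constant(x2)).fit().fittedvalues
x1_res = x1 - sm.OLS(x1, sm.add_constant(x2)).fit().fittedvalues
gamma3 = sm.OLS(y_res, x1_res).fit().params[0]

print(beta1, gamma3)  # the two coefficients agree up to floating-point error
```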
The nonlinear generalization of the Frisch-Waugh-Lovell theorem (where $A \sim B$ denotes a model that predicts $A$ from $B$):
$$\big(Y - (Y \sim X)\big) \sim \big(T - (T \sim X)\big) \iff Y_i - E(Y_i \mid X_i) = \tau\,\big(T_i - E(T_i \mid X_i)\big) + \epsilon$$
What problem does the Frisch-Waugh-Lovell theorem solve?
It lets us use machine-learning models to obtain $Y_i - E(Y_i \mid X_i)$ and $T_i - E(T_i \mid X_i)$. Unlike a purely linear model, we do not need to know the correct functional form of every feature; it is enough that the fitted values $\hat y^*$ and $\hat X_1$ (i.e. the estimates of $E(Y \mid X)$ and $E(T \mid X)$) are sufficiently accurate.
3. Partially Linear Model
- Y is the core metric affected by the experiment
- T is the treatment, usually a 0/1 variable indicating whether a sample is in the treatment or control group; in a randomized A/B test $T \perp X$
- X are the confounders, roughly the user features not affected by the intervention, usually a high-dimensional vector
$$\begin{aligned} Y &= \theta_0 T + g(X) + U, \qquad\ E(U \mid T, X) = 0 \\ T &= m(X) + V, \qquad\qquad\ \ E(V \mid X) = 0 \end{aligned}$$
What DML ultimately estimates is $\widehat\theta_0$, i.e. the effect of the experiment on the core metric for different users. The most direct approach is to model Y with X and T jointly and estimate $\widehat\theta_0$ directly. But the $\widehat\theta_0$ obtained this way is usually biased: part of the bias comes from overfitting the sample, and part from the bias in the estimate $\widehat g(X)$. Let $\theta_0$ be the true parameter value. Plugging $X = T$ and $Y = Y - \widehat g(X)$ into the least-squares formula $\theta = (X^T X)^{-1} X^T Y$, the estimate $\hat\theta_0$ is:
$$\begin{aligned} \hat\theta_0 &= (T^T T)^{-1} T^T (Y - \widehat g(X)) \\ &= \frac{\sum T_i (Y_i - \widehat g(X_i))}{\sum T_i^2} \\ &= \frac{\sum T_i \big\{ [Y_i - g(X_i)] + [g(X_i) - \widehat g(X_i)] \big\}}{\sum T_i^2} \\ &= \frac{\sum T_i \big\{ (\theta_0 T_i + U_i) + [g(X_i) - \widehat g(X_i)] \big\}}{\sum T_i^2} \\ &= \theta_0 + \frac{\sum T_i U_i + \sum T_i [g(X_i) - \widehat g(X_i)]}{\sum T_i^2} \end{aligned}$$
The (scaled) estimation error is:
$$\sqrt{n}\,(\hat\theta_0 - \theta_0) = \sqrt{n}\left( \underbrace{\frac{\sum T_i U_i}{\sum T_i^2}}_{a} + \underbrace{\frac{\sum T_i [g(X_i) - \widehat g(X_i)]}{\sum T_i^2}}_{b} \right)$$
Term (a) converges to 0, since under the regression assumptions $E[T_i U_i] = 0$. In term (b), the naively fitted $\widehat g$ targets $E[Y \mid X]$, which contains the $T\theta_0$ component, so $g(X) \ne E[Y \mid X]$; hence $\widehat g(X_i)$ is biased, and the prediction error $g(X_i) - \widehat g(X_i)$ does not converge to 0.
4. Double Machine Learning
4.1 Steps of the DML Model
The DML procedure consists of the following three steps:
Step 1. Fit Y and T with any ML model and obtain the residuals $\widehat U, \widehat V$:
$$\begin{aligned} \widehat U &= Y - T\theta_0 - \widehat g(X_i), \qquad\ \text{where } \widehat g(X_i) = E(Y \mid X, T) \\ \widehat V &= T - \widehat m(X_i), \qquad\qquad\ \text{where } \widehat m(X_i) = E(T \mid X) \end{aligned}$$
Step 2. Fit the relationship between $\widehat U$ and $\widehat V$ with any model to obtain $\hat\theta_0$.
$\theta_0$ can be fitted with either a parametric or a non-parametric model. A parametric model can simply be a linear regression. A non-parametric model only takes inputs and a target, so a further transformation is needed: the regression target becomes $\frac{\widehat U}{\widehat V}$ and the sample weight becomes $\widehat V^2$:
$$\begin{aligned} \text{parametric:}&\quad \widehat U = \theta_0\,\widehat V + \epsilon, \qquad \hat\theta_0 = \arg\min_{\theta_0} E\big[(\widehat U - \theta_0 \widehat V)^2\big] \\ \text{non-parametric:}&\quad E\big[(\widehat U - \theta_0 \widehat V)^2\big] = E\Big[\widehat V^2 \Big(\tfrac{\widehat U}{\widehat V} - \theta_0\Big)^2\Big] \end{aligned}$$
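As an illustration of the non-parametric variant, the target/weight transformation above can be carried out with any regressor that supports sample weights. A sketch (the residual arrays `U_hat`, `V_hat` and the confounder matrix `X` are assumed to come from Step 1; the model choice is my own):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Guard against (near-)zero treatment residuals before dividing.
V_safe = np.where(np.abs(V_hat) < 1e-8, 1e-8, V_hat)
target = U_hat / V_safe        # pseudo-outcome U_hat / V_hat
weights = V_hat ** 2           # sample weight V_hat^2

# Minimizing sum_i V_i^2 * (U_i / V_i - theta(X_i))^2 is equivalent to
# minimizing sum_i (U_i - theta(X_i) * V_i)^2, so this weighted regression
# recovers a (possibly heterogeneous) effect theta(X).
theta_model = GradientBoostingRegressor()
theta_model.fit(X, target, sample_weight=weights)
theta_hat_x = theta_model.predict(X)
```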
Step 3. Cross-fitting
A key step for keeping the DML estimate unbiased is cross-fitting, which reduces the bias introduced by overfitting. Split the sample into two halves, sample 1 and sample 2. First fit the nuisance models on sample 1 and estimate $\hat\theta^{(1)}$ from the residuals on sample 2; then fit the nuisance models on sample 2 and estimate $\hat\theta^{(2)}$ on sample 1; average the two to obtain the final estimate. K-fold splitting can be used to make the estimate more robust.
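Putting the three steps together, here is a minimal sketch of a cross-fitted DML estimate of a constant effect, using scikit-learn's out-of-fold predictions as the cross-fitting device (the helper name and model choices are illustrative assumptions, not a fixed recipe):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def dml_theta(Y, T, X, n_folds=5, seed=0):
    """Cross-fitted DML estimate of a constant treatment effect theta_0."""
    model_y = RandomForestRegressor(n_estimators=200, random_state=seed)
    model_t = RandomForestRegressor(n_estimators=200, random_state=seed)
    # Step 1 with cross-fitting: out-of-fold predictions of E[Y|X] and E[T|X]
    y_hat = cross_val_predict(model_y, X, Y, cv=n_folds)
    t_hat = cross_val_predict(model_t, X, T, cv=n_folds)
    W_hat, V_hat = Y - y_hat, T - t_hat   # outcome and treatment residuals
    # Step 2: residual-on-residual regression
    return np.sum(V_hat * W_hat) / np.sum(V_hat ** 2)
```

Mature implementations of the same procedure, together with valid standard errors, are available in libraries such as econml and DoubleML.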
Expression for $\hat\theta_0$ with a parametric second stage:
Plugging $Z = \widehat V$, $X = T$ and $Y = Y - \widehat g(X)$ into the two-stage least-squares formula $\theta = (Z^T X)^{-1} Z^T Y$ gives:
$$\begin{aligned} \hat\theta_0 &= (\widehat V^T T)^{-1}\, \widehat V^T (Y - \widehat g(X)) \\ &= \frac{\sum \widehat V_i (Y_i - \widehat g(X_i))}{\sum \widehat V_i T_i} \\ &= \frac{\sum \widehat V_i \big([Y_i - g(X_i)] + [g(X_i) - \widehat g(X_i)]\big)}{\sum \widehat V_i T_i} \\ &= \frac{\sum \widehat V_i \big([T_i \theta_0 + U_i] + [g(X_i) - \widehat g(X_i)]\big)}{\sum \widehat V_i T_i} \\ &= \theta_0 + \frac{\sum \widehat V_i U_i + \sum \widehat V_i [g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i} \\ &= \theta_0 + \frac{\sum \widehat V_i U_i + \sum [V_i - (\widehat m(X_i) - m(X_i))]\,[g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i} \\ &= \theta_0 + \underbrace{\frac{\sum \widehat V_i U_i}{\sum \widehat V_i T_i}}_{a} + \underbrace{\frac{\sum [m(X_i) - \widehat m(X_i)]\,[g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i}}_{b} + \underbrace{\frac{\sum V_i\,[g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i}}_{c} \end{aligned}$$
Term (a) converges to 0. Term (b) converges to 0 because $\sum [m(X_i) - \widehat m(X_i)] \approx 0$ and the errors $[m(X_i) - \widehat m(X_i)]$ and $[g(X_i) - \widehat g(X_i)]$ are independent. In term (c), $[g(X_i) - \widehat g(X_i)]$ is the prediction error of the model $Y = T\hat\theta_0 + \widehat g(X) + \widehat U$. If $\widehat g(x)$ overfits, this prediction error contains the noise $U$, and if the noise terms $U$ and $V$ are correlated, term (c) does not vanish ($V$ is then correlated with, not independent of, $[g(X_i) - \widehat g(X_i)]$). To solve this problem, the sample-splitting method of Step 3 is needed.
(If there were any unobserved confounders, $V$ would also be correlated with $[g(X_i) - \widehat g(X_i)]$; here we assume that all confounders are observed.)
An alternative derivation (replacing $g(X)$ with $p(X)$):
According to the Frisch-Waugh-Lovell theorem, DML can also be written as:
$$\begin{aligned} Y &= p(X) + W, \quad p(X) = E(Y \mid X), \quad \widehat W = Y - \widehat p(X) \\ T &= m(X) + V, \quad m(X) = E(T \mid X), \quad \widehat V = T - \widehat m(X) \\ \widehat W &= \widehat V \theta_0 + \eta \end{aligned}$$
The expression for $\hat\theta_0$ is:
$$\begin{aligned} \hat\theta_0 &= \frac{\sum \widehat V_i \widehat W_i}{\sum \widehat V_i^2} \\ &= \frac{\sum [V_i + (m(X_i) - \widehat m(X_i))]\,[W_i + (p(X_i) - \widehat p(X_i))]}{\sum \widehat V_i^2} \\ &= \frac{\sum V_i W_i + \sum [m(X_i) - \widehat m(X_i)][p(X_i) - \widehat p(X_i)] + \sum W_i [m(X_i) - \widehat m(X_i)] + \sum V_i [p(X_i) - \widehat p(X_i)]}{\sum \widehat V_i^2} \\ &= \underbrace{\frac{\sum V_i W_i}{\sum \widehat V_i^2}}_{a} + \underbrace{\frac{\sum [m(X_i) - \widehat m(X_i)][p(X_i) - \widehat p(X_i)]}{\sum \widehat V_i^2}}_{b} + \underbrace{\frac{\sum W_i [m(X_i) - \widehat m(X_i)]}{\sum \widehat V_i^2}}_{c} + \underbrace{\frac{\sum V_i [p(X_i) - \widehat p(X_i)]}{\sum \widehat V_i^2}}_{d} \end{aligned}$$
Here it is term (a) that carries $\theta_0$: since $W_i = Y_i - p(X_i) = \theta_0 V_i + U_i$, we have $\sum V_i W_i = \theta_0 \sum V_i^2 + \sum V_i U_i$, so term (a) converges to $\theta_0$. Terms (b), (c) and (d) converge to 0 under cross-fitting, because the estimation errors $m(X_i) - \widehat m(X_i)$ and $p(X_i) - \widehat p(X_i)$ come from models fitted on a different fold and are therefore (approximately) independent of $V_i$ and $W_i$, both of which have conditional mean zero given $X_i$.
4.2 Understanding DML from a GMM Perspective
The population moment condition of the PLM (Partially Linear Model):
$$\varphi_{plm}(W;\theta_0,\eta) = [Y - T\theta_0 - g(X)]\,T, \qquad \text{where } \eta = g(X),\ \eta_0 = g_0(X) \tag{1}$$
The population moment condition of DML:
$$\begin{aligned} \varphi_{dml}(W;\theta_0,\eta) &= [Y - T\theta_0 - g(X)]\,[T - m(X)] \\ \text{or:}\quad \varphi_{dml}(W;\theta_0,\eta) &= [Y - p(X) - (T - m(X))\theta_0]\,[T - m(X)] \\ \text{where}\quad \eta &= (p(X), m(X)),\ \eta_0 = (p_0(X), m_0(X)) = (E(Y \mid X), E(T \mid X)) \end{aligned} \tag{2}$$
By Neyman orthogonality: the derivative of the PLM moment condition with respect to the nuisance parameter is not zero, whereas the Gateaux derivative of the DML moment condition is zero:
$$\partial_\eta\, E\big\{ [Y - p(X) - (T - m(X))\theta_0]\,[T - m(X)] \big\}\Big|_{\eta=\eta_0} = 0$$
This means that the moment condition used to estimate the parameters of interest has zero derivative with respect to the nuisance parameters. In other words, at the true value of the nuisance parameters, the derivative of the moment condition is orthogonal to any direction pointing away from that true value (which is where the term "orthogonality" comes from). The resulting estimator is also called a doubly robust estimator, meaning it is robust with respect to estimation error in both the parameters of interest and the nuisance parameters.
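To make this concrete, here is a sketch of the computation (my own filling-in of the step, using the model assumptions $E(U \mid T, X) = 0$ and $E(V \mid X) = 0$ from Section 3). Perturb the nuisances in a direction $(\Delta_p, \Delta_m)$ and differentiate at $r = 0$:

$$\begin{aligned} &\frac{d}{dr}\, E\Big\{ \big[Y - p_0(X) - r\Delta_p(X) - \big(T - m_0(X) - r\Delta_m(X)\big)\theta_0\big]\,\big[T - m_0(X) - r\Delta_m(X)\big] \Big\}\Big|_{r=0} \\ &\qquad = E\big\{ [-\Delta_p(X) + \theta_0 \Delta_m(X)]\, V \big\} - E\big\{ U\, \Delta_m(X) \big\} = 0, \end{aligned}$$

because $E(V \mid X) = 0$ and $E(U \mid X) = 0$ make every term vanish. For the PLM moment condition (1), the derivative with respect to $g$ in a direction $\Delta_g$ is $-E[\Delta_g(X)\,T] = -E[\Delta_g(X)\,m(X)]$, which is generally nonzero, so first-stage error in $\widehat g$ leaks directly into the estimate of $\theta_0$.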
4.3 A Worked Example of DML Orthogonalization
Step 1. Raw data
We explore the relationship between ice-cream sales and price. Weekday is a confounder: on weekends both price and sales are higher. To obtain an accurate causal effect, the bias introduced by the confounders must be corrected. The confounders contained in this dataset are temperature, day of week, and cost.
Step 2. Regress T on X
Build a regression model of T on X and take the residual $\widehat V = T - \widehat m(X)$. The residual $\widehat V$ can be viewed as what remains of T after the effect of X on T has been removed, so $\widehat V$ is orthogonal to $X$.

Step 3. Regress Y on X
Similarly, build a regression model of Y on X and take the residual $\widehat W = Y - \widehat p(X)$. The role of the model $\widehat p(X)$ is to remove from Y the variance explained by X; compared with the Step-2 plot, the variance of Y in the Step-3 plot is smaller.
Step 4. Compare the DML result with the naive regression
In the plot, price_res and sales_res show a clear negative correlation. Fitting one linear model of sales_res on price_res and another of sales on price, the former yields a significantly negative relationship, while the latter yields a positive relationship because of the confounders.
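The original dataset and plots are not reproduced here, so the sketch below simulates a stand-in ice-cream dataset (all coefficients are my own illustrative choices, with a true price effect of -2) and carries out Steps 2-4:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 2000
temp = rng.uniform(15, 35, n)
weekend = rng.integers(0, 2, n)
cost = rng.uniform(0.5, 1.5, n)
X = np.column_stack([temp, weekend, cost])
price = 3 + 0.05 * temp + 1.0 * weekend + cost + rng.normal(0, 0.2, n)
sales = 20 - 2.0 * price + 0.6 * temp + 6.0 * weekend + rng.normal(0, 1.0, n)

# Steps 2 and 3: out-of-fold residualization of price (T) and sales (Y) on X
model_t = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
model_y = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
price_res = price - cross_val_predict(model_t, X, price, cv=5)
sales_res = sales - cross_val_predict(model_y, X, sales, cv=5)

# Step 4: the naive regression is confounded; residual-on-residual is not
naive = sm.OLS(sales, sm.add_constant(price)).fit()
orth = sm.OLS(sales_res, price_res).fit()
print(naive.params[1], orth.params[0])  # positive (confounded) vs. roughly -2
```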
4.4 DML Simulation
Assume
$$\begin{aligned} g(X) &= \sin^2(X) \\ m(X) &= \frac{1}{2\pi}\,\frac{\sinh(\gamma)}{\cosh(\gamma) - \cos(x - \nu)} \end{aligned}$$
Set $\nu = 0$ and $\gamma = 1$ (the original post plots the resulting relationship between $X$ and $m(X)$).
0. Data generation
The true value of $\theta_0$ is 0.5.
```python
import numpy as np
from sklearn.datasets import make_spd_matrix
import math
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
randomseednumber = 11022018
np.random.seed(randomseednumber)
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import matplotlib.pyplot as plt

N = 500      # No. obs
k = 10       # No. variables in x_i
theta = 0.5  # Structural parameter: the true value of the causal parameter is theta = 0.5
b = [1 / j for j in range(1, 11)]  # x weights, b_k = 1/k
sigma = make_spd_matrix(k)         # random covariance matrix, shape (10, 10)
MC_no = 500                        # number of Monte Carlo simulations

def g(x):
    return np.power(np.sin(x), 2)

def m(x, nu=0., gamma=1.):
    return 0.5 / math.pi * np.sinh(gamma) / (np.cosh(gamma) - np.cos(x - nu))

# Array of estimated thetas to store results, shape (500, 3)
theta_est = np.zeros(shape=[MC_no, 3])
```
1. Benchmark with the true $g$: the true value of $\theta_0$ is 0.5; regressing $Y - G$ on $D$ via sm.OLS(Y-G, D) recovers it.
```python
theta_est_true = np.zeros(shape=[MC_no, 1])  # shape (500, 1)
for i in range(MC_no):
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])
    # The true value of the causal parameter is theta = 0.5
    model = sm.OLS(Y - G, D)
    results = model.fit()
    theta_est_true[i][0] = results.params[0]  # coefficient on D, i.e. theta_0
    # equivalently: theta_est_true[i][0] = np.sum(np.dot(D, Y - G)) / np.sum(np.dot(D.T, D))
sns.histplot(theta_est_true[:, 0], kde=True)
plt.axvline(0.5, color="r")
```
2. PLM estimate:
- Set the initial value $\widehat{\theta_0}^{(0)} = 0.4$
- Fit a machine-learning model of $Y - D\,\widehat{\theta_0}^{(0)}$ on $\mathbf{X}$ to estimate $\widehat{g_0}^{(1)}$
- Fit a linear regression of $Y - \widehat{g_0}^{(1)}(\mathbf{X})$ on $D$ to estimate $\widehat{\theta_0}^{(1)}$
- Repeat until $|\widehat{\theta_0}^{(i)} - \widehat{\theta_0}^{(i-1)}| < \epsilon$, giving the final estimates $\widehat{g_0}$ and $\widehat{\theta_0}$
```python
theta_est_plm = np.zeros(shape=[MC_no, 1])  # shape (500, 1)
for i in range(MC_no):
    theta_0, theta_delta = 0.4, 0.1  # initial guess for theta_0 and convergence gap
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])  # data generated with the true theta = 0.5
    while theta_delta > 0.01:
        theta_pre = theta_0
        # partially linear model: fit g on Y - D*theta_0, then update theta_0
        naiveDMLg = RandomForestRegressor(max_depth=2)
        naiveDMLg.fit(X, Y - np.dot(D, theta_0))  # compute ghat
        Ghat = naiveDMLg.predict(X)  # shape (500,)
        theta_0 = np.sum(np.dot(D, Y - Ghat)) / np.sum(np.dot(D, D))
        theta_delta = abs(theta_pre - theta_0)
    theta_est_plm[i][0] = theta_0
sns.histplot(theta_est_plm[:, 0], kde=True)
plt.axvline(0.5, color="r")
```
3. Naive DML estimate: np.mean(np.dot(Vhat, Y-Ghat)) / np.mean(np.dot(Vhat, D)) gives an estimate of $\theta_0$, which is biased (no sample splitting).
```python
for i in range(MC_no):
    # Generate data: no. obs x no. variables in x_i
    # x_i has length K = 10 and is drawn from a multivariate normal distribution
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])
    # Naive double machine learning (no sample splitting)
    naiveDMLg = RandomForestRegressor(max_depth=2)
    naiveDMLg.fit(X, Y)  # compute Ghat
    Ghat = naiveDMLg.predict(X)  # shape (500,)
    naiveDMLm = RandomForestRegressor(max_depth=2)
    naiveDMLm.fit(X, D)
    Mhat = naiveDMLm.predict(X)  # compute Mhat; Vhat is the residual
    Vhat = D - Mhat
    theta_est[i][1] = np.mean(np.dot(Vhat, Y - Ghat)) / np.mean(np.dot(Vhat, D))
sns.histplot(theta_est[:, 1], kde=True)
plt.axvline(0.5, color="r")
```
4. Cross-fitting DML estimate: with sample splitting, the estimate is unbiased.
```python
for i in range(MC_no):
    # Generate data: no. obs x no. variables in x_i
    # x_i has length K = 10 and is drawn from a multivariate normal distribution
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])
    # Cross-fitting DML: split the sample into two halves I and I_C
    I = np.random.choice(N, N // 2, replace=False)
    I_C = [x for x in np.arange(N) if x not in I]
    # Ghat for both halves (fit on one half, predict on the other)
    Ghat_1 = RandomForestRegressor(max_depth=2).fit(X[I], Y[I]).predict(X[I_C])
    Ghat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C], Y[I_C]).predict(X[I])
    # Mhat and Vhat for both halves
    Mhat_1 = RandomForestRegressor(max_depth=2).fit(X[I], D[I]).predict(X[I_C])
    Mhat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C], D[I_C]).predict(X[I])
    Vhat_1, Vhat_2 = D[I_C] - Mhat_1, D[I] - Mhat_2
    theta_1 = np.mean(np.dot(Vhat_1, (Y[I_C] - Ghat_1))) / np.mean(np.dot(Vhat_1, D[I_C]))
    theta_2 = np.mean(np.dot(Vhat_2, (Y[I] - Ghat_2))) / np.mean(np.dot(Vhat_2, D[I]))
    theta_est[i][2] = 0.5 * (theta_1 + theta_2)
sns.histplot(theta_est[:, 2], kde=True)
plt.axvline(0.5, color="r")
```
5. Comparison of results. The distributions of the estimates from the four approaches are compared below.
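The original figure is not included here; a short snippet reusing the arrays computed above regenerates the comparison (the labels are mine):

```python
import matplotlib.pyplot as plt
import seaborn as sns

labels = ["OLS with true g (benchmark)", "iterative PLM", "naive DML", "cross-fitting DML"]
series = [theta_est_true[:, 0], theta_est_plm[:, 0], theta_est[:, 1], theta_est[:, 2]]
for s, lab in zip(series, labels):
    sns.kdeplot(s, label=lab)
plt.axvline(0.5, color="k", linestyle="--", label="true theta = 0.5")
plt.legend()
plt.show()
```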

4.5 Non-parametric DML (to be continued)