Double Machine Learning

1. Starting from Linear Regression

A simple way to obtain a causal effect from observational data is linear regression, controlling for the influence of the confounders:

$$Sales_i = \alpha + \tau Price_i + \beta_1 temp_i + \beta_2 Cost_i + \beta_3 Weekday_i + e_i$$

$\tau$ is the only quantity we need to pay attention to, because $\tau$ is the causal effect of price on sales that we want to estimate. We do not care about the other variables, but the linear model must still treat them correctly, otherwise the resulting $\tau$ is a wrong estimate. For example, if the effect of temperature on sales is quadratic, the treatment effect estimated from the equation above is wrong. A minimal simulated illustration follows.
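As an illustration, here is a minimal sketch with simulated data (all coefficients and the weekend encoding are invented for this example): OLS recovers $\tau$ only when every confounder enters with its correct functional form.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(25, 5, n)
cost = rng.normal(2, 0.5, n)
weekend = (rng.integers(0, 7, n) >= 5).astype(float)
price = 5 + 0.3*temp + 0.5*cost + 1.0*weekend + rng.normal(0, 1, n)
sales = 100 - 3*price + 1.5*temp + 2*cost + 4*weekend + rng.normal(0, 2, n) # true tau = -3

# Regress sales on price plus all confounders (correctly specified here)
exog = sm.add_constant(np.column_stack([price, temp, cost, weekend]))
print(sm.OLS(sales, exog).fit().params[1]) # ≈ -3; omitting or mis-specifying a confounder biases this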

2. The Frisch-Waugh-Lovell Theorem

Suppose we have two features $X_1$ and $X_2$.

Ordinary linear regression: $\hat Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

Regression based on the Frisch-Waugh-Lovell theorem:

  • Regress $Y$ on $X_2$: $\hat y^* = \gamma_1 X_2$
  • Regress $X_1$ on $X_2$: $\hat X_1 = \gamma_2 X_2$
  • Compute the residuals: $\widetilde X_1 = X_1 - \hat X_1$ and $\widetilde y = y - \hat y^*$
  • Regress the residual of $y$ on the residual of $X_1$: $\hat{\widetilde y} = \gamma_3 \widetilde X_1$

The two procedures yield the same coefficient: $\beta_1 = \gamma_3$.
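A quick numerical check of the theorem (a sketch with simulated data):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.8*x2 + rng.normal(size=n)
y = 2.0*x1 - 1.0*x2 + rng.normal(size=n)

# Full regression: y on [1, x1, x2]
beta1 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params[1]

# FWL: partial x2 (and the constant) out of y and x1, then regress residual on residual
y_res = y - sm.OLS(y, sm.add_constant(x2)).fit().fittedvalues
x1_res = x1 - sm.OLS(x1, sm.add_constant(x2)).fit().fittedvalues
gamma3 = sm.OLS(y_res, x1_res).fit().params[0]

print(beta1, gamma3) # identical up to floating-point error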

Nonlinear generalization of the Frisch-Waugh-Lovell theorem:

$$(Y - (Y \sim X)) \sim (T - (T \sim X)) \iff Y_i - E(Y_i|X_i) = \tau\,(T_i - E(T_i|X_i)) + \epsilon$$

What problem does the Frisch-Waugh-Lovell theorem solve?
It lets us use machine-learning models to obtain $Y_i - E(Y_i|X_i)$ and $T_i - E(T_i|X_i)$: unlike a linear model, we do not need to know the correct functional form of every feature; it is enough that the predictions of $E(Y|X)$ and $E(T|X)$ are sufficiently accurate.

3. Partially Linear Model

  • Y is the core metric affected by the experiment
  • T is the treatment, usually a 0/1 variable indicating whether a sample is in the treatment or control group; in a randomized A/B experiment $T \perp X$
  • X are the confounders, roughly the user features untouched by the intervention, usually a high-dimensional vector
$$\begin{aligned} Y &= \theta_0 T + g(X) + U, \qquad\ E(U \mid T, X) = 0 \\ T &= m(X) + V, \qquad\qquad\ E(V \mid X) = 0 \end{aligned}$$
What DML ultimately estimates is $\hat\theta_0$, the effect of the experiment on the users' core metric. The most direct approach would be to model Y on X and T jointly and read off $\hat\theta_0$, but that estimate is usually biased, partly from overfitting the sample and partly from bias in the estimate $\hat g(X)$. Let $\theta_0$ denote the true parameter value. Using the least-squares formula $\theta = (X^T X)^{-1} X^T Y$ and substituting $X = T$ and $Y = Y - \hat g(X)$, the estimate $\hat\theta_0$ is:
$$\begin{aligned} \hat\theta_0 &= (T^T T)^{-1} T^T (Y - \hat g(X)) \\ &= \frac{\sum T_i (Y_i - \hat g(X_i))}{\sum T_i^2} \\ &= \frac{\sum T_i \{[Y_i - g(X_i)] + [g(X_i) - \hat g(X_i)]\}}{\sum T_i^2} \\ &= \frac{\sum T_i \{(\theta_0 T_i + \epsilon_i) + [g(X_i) - \hat g(X_i)]\}}{\sum T_i^2} \\ &= \theta_0 + \frac{\sum T_i \epsilon_i + \sum T_i [g(X_i) - \hat g(X_i)]}{\sum T_i^2} \end{aligned}$$

The deviation of the estimate is:
$$\sqrt{n}(\hat\theta_0 - \theta_0) = \sqrt{n}\left(\underbrace{\frac{\sum T_i \epsilon_i}{\sum T_i^2}}_{a} + \underbrace{\frac{\sum T_i [g(X_i) - \hat g(X_i)]}{\sum T_i^2}}_{b}\right)$$
Under the linear-regression assumption $\sum T_i \epsilon_i = 0$, term (a) converges to 0. In term (b), because $Y$ contains the component $T\theta_0$, we have $g(X) \neq E[Y|X]$; hence $\hat g(X_i)$, fit by regressing $Y$ on $X$, is biased, and the prediction error $g(X_i) - \hat g(X_i)$ does not converge to 0.

4. Double Machine Learning

4.1 Steps of the DML Model

The DML model consists of the following three steps.

Step 1. Fit Y and T with any ML model to obtain the residuals $\hat U, \hat V$:

$$\begin{aligned} \hat U &= Y - T\theta_0 - \hat g(X_i), \qquad \text{where}\ \hat g(X_i) = E(Y|X,T) \\ \hat V &= T - \hat m(X_i), \qquad\qquad\ \ \text{where}\ \hat m(X_i) = E(T|X) \end{aligned}$$

Step 2. Fit the relationship between $\hat U$ and $\hat V$ with any model to obtain $\hat\theta_0$.
The fit for $\theta_0$ can use either a parametric or a nonparametric model. A parametric model can simply be a linear regression. A nonparametric model only accepts inputs and a target, so the following transformation is needed: the target becomes $\frac{\hat U}{\hat V}$ and the sample weight becomes $\hat V^2$ (see the sketch after the equations):

$$\begin{aligned} \text{parametric model:}\quad & \hat U = \theta_0 \hat V + \epsilon, \qquad \hat\theta_0 = \arg\min E[(\hat U - \theta_0 \hat V)^2] \\ \text{nonparametric model:}\quad & E[(\hat U - \theta_0 \hat V)^2] = E\Big[\hat V^2 \Big(\frac{\hat U}{\hat V} - \theta_0\Big)^2\Big] \end{aligned}$$
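To make the transformation concrete, a minimal sketch (the residuals Uhat and Vhat here are simulated rather than produced by Step 1): a weighted fit with target Uhat/Vhat and weights Vhat**2 recovers the same $\theta_0$ as the direct linear fit.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 5000
Vhat = rng.normal(size=n)
Uhat = 0.5*Vhat + 0.1*rng.normal(size=n) # true theta_0 = 0.5

# Parametric form: regress Uhat on Vhat without intercept
theta_lin = (Vhat @ Uhat) / (Vhat @ Vhat)

# Nonparametric form: target Uhat/Vhat, sample weight Vhat^2
# (the weights exactly cancel the division, so near-zero Vhat values are harmless)
model = LinearRegression(fit_intercept=False)
model.fit(np.ones((n, 1)), Uhat/Vhat, sample_weight=Vhat**2)
print(theta_lin, model.coef_[0]) # both ≈ 0.5

In practice the column of ones is replaced by features X, and LinearRegression by any regressor that accepts sample_weight, which yields a heterogeneous effect $\theta(X)$ (see section 4.5).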
Step 3. Cross-fitting

A key step in keeping the DML estimate unbiased is cross-fitting, which reduces the estimation bias caused by overfitting. Split the sample into two halves, sample 1 and sample 2. First use sample 1 to estimate the residuals and sample 2 to estimate $\hat\theta_1$; then use sample 2 to estimate the residuals and sample 1 to estimate $\hat\theta_2$; the final estimate is the average of the two. K-fold splitting can also be used to make the estimate more robust, as in the sketch below.
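A compact K-fold cross-fitting sketch using sklearn's cross_val_predict, which returns out-of-fold predictions; the data-generating process here is invented for illustration:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n, theta0 = 2000, 0.5
X = rng.normal(size=(n, 5))
T = np.sin(X[:, 0]) + rng.normal(size=n)
Y = theta0*T + X[:, 1]**2 + rng.normal(size=n)

# Out-of-fold predictions of E[Y|X] and E[T|X] (K = 5 cross-fitting)
Yhat = cross_val_predict(RandomForestRegressor(max_depth=4), X, Y, cv=5)
That = cross_val_predict(RandomForestRegressor(max_depth=4), X, T, cv=5)
U, V = Y - Yhat, T - That

print((V @ U) / (V @ V)) # ≈ 0.5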

The expression for $\hat\theta_0$ in the parametric model:
Using the two-stage least-squares formula $\theta = (Z^T X)^{-1} Z^T Y$ and substituting $Z = \hat V$, $X = T$, $Y = Y - \hat g(X)$:
$$\begin{aligned} \hat\theta_0 &= (\hat V^T T)^{-1}\hat V^T (Y - \hat g(X)) \\ &= \frac{\sum \hat V_i (Y_i - \hat g(X_i))}{\sum \hat V_i T_i} \\ &= \frac{\sum \hat V_i \big([Y_i - g(X_i)] + [g(X_i) - \hat g(X_i)]\big)}{\sum \hat V_i T_i} \\ &= \frac{\sum \hat V_i \big([T_i\theta_0 + U_i] + [g(X_i) - \hat g(X_i)]\big)}{\sum \hat V_i T_i} \\ &= \theta_0 + \frac{\sum \hat V_i U_i + \sum \hat V_i [g(X_i) - \hat g(X_i)]}{\sum \hat V_i T_i} \\ &= \theta_0 + \frac{\sum \hat V_i U_i + \sum [V_i - (\hat m(X_i) - m(X_i))][g(X_i) - \hat g(X_i)]}{\sum \hat V_i T_i} \\ &= \theta_0 + \underbrace{\frac{\sum \hat V_i U_i}{\sum \hat V_i T_i}}_{a} + \underbrace{\frac{\sum [m(X_i) - \hat m(X_i)][g(X_i) - \hat g(X_i)]}{\sum \hat V_i T_i}}_{b} + \underbrace{\frac{\sum V_i [g(X_i) - \hat g(X_i)]}{\sum \hat V_i T_i}}_{c} \end{aligned}$$

Term (a) equals 0. Term (b) equals 0 because $\sum [m(X_i) - \hat m(X_i)] = 0$ and $[m(X_i) - \hat m(X_i)]$ is independent of $[g(X_i) - \hat g(X_i)]$. In term (c), $[g(X_i) - \hat g(X_i)]$ is the prediction error of the model $Y = T\hat\theta_0 + \hat g(X) + \hat U$. If $\hat g(x)$ overfits, this prediction error contains the noise $U$; and if the noises $U$ and $V$ are correlated, term (c) does not vanish ($V$ is then correlated with, not independent of, $[g(X_i) - \hat g(X_i)]$). This is the problem the sample splitting in Step 3 solves.

If there were any unobserved confounders, $V$ would be correlated with $[g(X_i) - \hat g(X_i)]$; here we assume all confounders are observed.

An alternative derivation (replacing $g(X)$ with $p(X)$):
By the Frisch-Waugh-Lovell theorem, DML can also be written in the following form:

$$\begin{aligned} Y &= p(X) + W, \quad p(X) = E(Y|X), \quad \hat W = Y - \hat p(X) \\ T &= m(X) + V, \quad m(X) = E(T|X), \quad \hat V = T - \hat m(X) \\ \hat W &= \hat V \theta_0 + \eta \end{aligned}$$

The expression for $\hat\theta_0$ is:
$$\begin{aligned} \hat\theta_0 &= \frac{\sum \hat V_i \hat W_i}{\sum \hat V_i^2} \\ &= \frac{\sum [V_i + (m(X_i) - \hat m(X_i))][W_i + (p(X_i) - \hat p(X_i))]}{\sum \hat V_i^2} \\ &= \frac{\sum V_i W_i + [m(X_i) - \hat m(X_i)][p(X_i) - \hat p(X_i)] + W_i[m(X_i) - \hat m(X_i)] + V_i[p(X_i) - \hat p(X_i)]}{\sum \hat V_i^2} \\ &= \underbrace{\frac{\sum V_i W_i}{\sum \hat V_i^2}}_{a} + \underbrace{\frac{\sum [m(X_i) - \hat m(X_i)][p(X_i) - \hat p(X_i)]}{\sum \hat V_i^2}}_{b} + \underbrace{\frac{\sum W_i [m(X_i) - \hat m(X_i)]}{\sum \hat V_i^2}}_{c} + \underbrace{\frac{\sum V_i [p(X_i) - \hat p(X_i)]}{\sum \hat V_i^2}}_{d} \end{aligned}$$

Term (a) is where $\theta_0$ comes from: since $W = Y - E(Y|X) = \theta_0 V + U$, we have $\sum V_i W_i / \sum \hat V_i^2 \to \theta_0$; terms (b), (c) and (d) vanish under cross-fitting, by the same arguments as above.

4.2 Understanding DML from a GMM Perspective

Population moment condition of the PLM (Partially Linear Model):
$$\varphi_{plm}(W;\theta_0,\eta) = [Y - T\theta_0 - g(X)]\,T, \qquad \text{where}\quad \eta = g(X),\ \eta_0 = g_0(X) \tag{1}$$

Population moment condition of DML:
$$\begin{aligned} \varphi_{dml}(W;\theta_0,\eta) &= [Y - T\theta_0 - g(X)][T - m(X)] \\ \text{or:}\quad \varphi_{dml}(W;\theta_0,\eta) &= [Y - p(X) - (T - m(X))\theta_0][T - m(X)] \\ \text{where}\quad \eta &= (p(X), m(X)),\ \eta_0 = (p_0(X), m_0(X)) = (E(Y|X), E(T|X)) \end{aligned} \tag{2}$$
By Neyman orthogonality, the derivative of the PLM moment condition with respect to the nuisance parameter is not 0,
while the Gateaux derivative of the DML moment condition is 0:

$$\partial_\eta E\big\{ [Y - p(X) - (T - m(X))\theta_0][T - m(X)] \big\}\Big|_{\eta=\eta_0} = 0$$
This says that the moment condition used to estimate the parameters of interest has zero derivative with respect to the nuisance parameters: at the true value of the nuisance parameters, the moment condition is, to first order, insensitive to perturbations in any direction away from that true value (which explains the terminology "orthogonality"). The resulting estimator is also called a doubly robust estimator, meaning it is robust with respect to both the parameters of interest and the nuisance parameters.
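A numerical sanity check of the orthogonality (a sketch under an invented data-generating process): perturb the nuisance function by $r\,\delta(X)$ and estimate the slope of the empirical moment at $r = 0$. For the PLM score the slope is visibly nonzero; for the DML score it is approximately zero.

import numpy as np

rng = np.random.default_rng(4)
n, theta0 = 200000, 0.5
X = rng.normal(size=n)
m0, g0 = np.sin(X), X**2
T = m0 + rng.normal(size=n)
Y = theta0*T + g0 + rng.normal(size=n)
p0 = theta0*m0 + g0 # E[Y|X]
delta = X # direction of the nuisance perturbation

def plm_score(r): # nuisance g perturbed: g0 + r*delta
    return np.mean((Y - T*theta0 - (g0 + r*delta)) * T)

def dml_score(r): # nuisance m perturbed: m0 + r*delta
    mr = m0 + r*delta
    return np.mean((Y - p0 - (T - mr)*theta0) * (T - mr))

eps = 0.05
print((plm_score(eps) - plm_score(-eps)) / (2*eps)) # ≈ -E[delta*T] ≈ -0.61, not 0
print((dml_score(eps) - dml_score(-eps)) / (2*eps)) # ≈ 0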

4.3 A Worked Example of DML Orthogonalization

Step 1. Raw data
  We explore the relationship between ice-cream sales and price. Weekday is a confounder: on weekends both price and sales are higher. To obtain an accurate causal effect, the bias introduced by the confounders must be corrected. The confounders in this dataset are temperature, day of week, and cost.

Step 2. Regress T on X
  Build a regression model of T on X and take the residual $\hat V = T - \hat m(X)$. The residual $\hat V$ can be viewed as what is left of T after removing the effect of X on T; $\hat V$ is orthogonal to X.

Step 3. Regress Y on X
  Likewise, build a regression model of Y on X and take the residual $\hat W = Y - \hat p(X)$. The role of the model $\hat p(X)$ is to remove the variance of Y that is explained by X. Compared with the plot in Step 2, the variance of Y in the Step 3 plot is smaller.

Step 4. Compare the DML result with the naive regression result
  The plot shows a clear negative relationship between price_res and sales_res. Fitting a linear model of sales_res on price_res and a linear model of sales on price, the former yields a significantly negative relationship, while the latter yields a positive one because of the confounders. A runnable sketch of the whole procedure follows.
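A sketch of Steps 2-4; the dataset here is simulated, and the column names (price, sales, temp, weekday, cost) and coefficients merely mirror the example:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 2000
df = pd.DataFrame({
    'temp': rng.normal(25, 5, n),
    'weekday': rng.integers(0, 7, n),
    'cost': rng.normal(2, 0.5, n),
})
weekend = (df['weekday'] >= 5).astype(float)
df['price'] = 4 + 0.1*df['temp'] + 0.8*df['cost'] + 2*weekend + rng.normal(0, 0.5, n)
df['sales'] = 50 - 2*df['price'] + 1.2*df['temp'] + 5*weekend + rng.normal(0, 2, n) # true effect = -2

X = df[['temp', 'weekday', 'cost']]

# Step 2: regress T (price) on X and keep the residual (no cross-fitting, for brevity)
price_res = df['price'] - RandomForestRegressor(max_depth=5).fit(X, df['price']).predict(X)
# Step 3: regress Y (sales) on X and keep the residual
sales_res = df['sales'] - RandomForestRegressor(max_depth=5).fit(X, df['sales']).predict(X)

# Step 4: residual-on-residual vs. naive regression
print(sm.OLS(sales_res.values, price_res.values).fit().params[0]) # ≈ -2: negative causal effect
print(sm.OLS(df['sales'].values, sm.add_constant(df['price'].values)).fit().params[1]) # biased upward by confounding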

4.4 DML Simulation

Assume

$$\begin{aligned} g(X) &= \sin^2(X) \\ m(X) &= \frac{1}{2\pi}\,\frac{\sinh(\gamma)}{\cosh(\gamma) - \cos(x - \nu)} \end{aligned}$$

and take $\nu = 0$, $\gamma = 1$.

0. Generate the data; the true value of $\theta_0$ is 0.5

import numpy as np
from sklearn.datasets import make_spd_matrix
import math
import statsmodels.api as sm 
from sklearn.ensemble import RandomForestRegressor 
randomseednumber = 11022018
np.random.seed(randomseednumber)
import warnings 
warnings.filterwarnings('ignore')
import seaborn as sns
import matplotlib.pyplot as plt

N = 500 # Number of observations
k = 10 # Number of variables in x_i
theta = 0.5 # Structural parameter: the true value of the causal parameter is theta = 0.5, with b_k = 1/k
b = [1/j for j in range(1, k+1)] # x weights, b_k = 1/k
sigma = make_spd_matrix(k, random_state=randomseednumber) # random covariance matrix, shape (10, 10)
MC_no = 500 # Number of Monte Carlo simulations

def g(x):
    return np.power(np.sin(x),2)

def m(x,nu=0.,gamma=1.):
    return 0.5/math.pi*(np.sinh(gamma))/(np.cosh(gamma)-np.cos(x-nu))

# Array of estimated thetas to store results 
theta_est = np.zeros(shape=[MC_no,3]) #shape(500, 3)

1. Ground truth: the true $\theta_0$ = 0.5; sm.OLS(Y-G, D) recovers it.

theta_est_true = np.zeros(shape=[MC_no,1]) #shape(500, 1)

for i in range(MC_no):
    X = np.random.multivariate_normal(np.ones(k),sigma,size=[N,]) #shape(500, 10)
    G = g(np.dot(X,b)) #shape(500,)
    M = m(np.dot(X,b)) #shape(500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = theta*D + G + np.random.standard_normal(size=[N,])

    # true value of the causal parameter is θ=0.5
    model = sm.OLS(Y-G,D)
    results = model.fit()
    theta_est_true[i][0] = results.params[0] # coefficient on D, i.e. theta_0
#     theta_est_true[i][0] = np.sum(np.dot(D,Y-G))/np.sum(np.dot(D,D)) # equivalent closed form
    
sns.histplot(theta_est_true[:,0], kde=True)
plt.axvline(0.5,0,5)

2. PLM estimate:

  • Set the initial value $\hat\theta_0^{(0)} = 0.4$
  • Fit a machine-learning model of $Y - D\hat\theta_0^{(0)}$ on $\mathbf{X}$ to estimate $\hat g_0^{(1)}$
  • Run a linear regression of $Y - \hat g_0^{(1)}(\mathbf{X})$ on $D$ to estimate $\hat\theta_0^{(1)}$
  • Repeat until $|\hat\theta_0^{(i)} - \hat\theta_0^{(i-1)}| < \epsilon$, giving the final estimates $\hat g_0$ and $\hat\theta_0$
theta_est_plm = np.zeros(shape=[MC_no,1]) #shape(500, 1)

for i in range(MC_no):
    theta_0, theta_delta = 0.4, 0.1 # initial guess and convergence gap
    X = np.random.multivariate_normal(np.ones(k),sigma,size=[N,]) #shape(500, 10)
    G = g(np.dot(X,b)) #shape(500,)
    M = m(np.dot(X,b)) #shape(500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = theta*D + G + np.random.standard_normal(size=[N,]) # generate data with the true theta = 0.5, not the initial guess
    while theta_delta>0.01:
        theta_pre = theta_0

        # partially linear model iteration
        naiveDMLg = RandomForestRegressor(max_depth=2)
        naiveDMLg.fit(X, Y - D*theta_0) # compute ghat given the current theta
        Ghat = naiveDMLg.predict(X) #shape(500,)

        theta_0 = np.sum(np.dot(D, Y-Ghat))/np.sum(np.dot(D, D))
        theta_delta = abs(theta_pre-theta_0)

    theta_est_plm[i][0] = theta_0

sns.histplot(theta_est_plm[:,0], kde=True)
plt.axvline(0.5,0,5)

3. Naive DML estimate: np.mean(np.dot(Vhat, Y-Ghat))/np.mean(np.dot(Vhat, D)) gives the estimate of $\theta_0$; it is biased.

for i in range(MC_no):
#     Generate data: no. obs x no. variables in x_i    
#     xi has length K=10 and will be generated from a multivariate normal distribution,
    X = np.random.multivariate_normal(np.ones(k),sigma,size=[N,]) #shape(500, 10)
    G = g(np.dot(X,b)) #shape(500,)
    M = m(np.dot(X,b)) #shape(500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = theta*D + G + np.random.standard_normal(size=[N,])

    # Naive double machine learning: fit on the full sample, no cross-fitting
    naiveDMLg = RandomForestRegressor(max_depth=2)
    naiveDMLg.fit(X, Y) # compute Ghat
    Ghat = naiveDMLg.predict(X) #shape(500,)
    naiveDMLm = RandomForestRegressor(max_depth=2)
    naiveDMLm.fit(X, D)
    Mhat = naiveDMLm.predict(X) # compute Mhat; Vhat is the residual
    Vhat = D - Mhat
    theta_est[i][1] = np.mean(np.dot(Vhat, Y-Ghat))/np.mean(np.dot(Vhat, D))
    
sns.histplot(theta_est[:,1], kde=True)
plt.axvline(0.5,0,5)    

4. Cross-fitting DML estimate: with sample splitting, unbiased.

for i in range(MC_no):
#     Generate data: no. obs x no. variables in x_i    
#     xi has length K=10 and will be generated from a multivariate normal distribution,
    X = np.random.multivariate_normal(np.ones(k),sigma,size=[N,]) #shape(500, 10)
    G = g(np.dot(X,b)) #shape(500,)
    M = m(np.dot(X,b)) #shape(500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = theta*D + G + np.random.standard_normal(size=[N,])

    # Cross-fitting DML
    I = np.random.choice(N, N//2, replace=False) # Split the sample
    I_C = [x for x in np.arange(N) if x not in I]
    # Ghat for both 
    Ghat_1 = RandomForestRegressor(max_depth=2).fit(X[I],Y[I]).predict(X[I_C])
    Ghat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C],Y[I_C]).predict(X[I])
    # Mhat and vhat for both     
    Mhat_1 = RandomForestRegressor(max_depth=2).fit(X[I],D[I]).predict(X[I_C])
    Mhat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C],D[I_C]).predict(X[I])
    Vhat_1,Vhat_2 = D[I_C]-Mhat_1,D[I] - Mhat_2
    theta_1 = np.mean(np.dot(Vhat_1,(Y[I_C]-Ghat_1)))/np.mean(np.dot(Vhat_1,D[I_C]))
    theta_2 = np.mean(np.dot(Vhat_2,(Y[I]-Ghat_2)))/np.mean(np.dot(Vhat_2,D[I]))
    theta_est[i][2] = 0.5*(theta_1+theta_2)

sns.histplot(theta_est[:,2], kde=True)
plt.axvline(0.5,0,5) 

5. Comparison of results. The figure below shows the distributions of the estimates from the four methods (a sketch to reproduce it from the arrays above follows).
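A sketch to reproduce that comparison figure, assuming the four loops above have been run in the same session:

fig, ax = plt.subplots(figsize=(8, 4))
sns.kdeplot(theta_est_true[:, 0], label='OLS with true g (benchmark)', ax=ax)
sns.kdeplot(theta_est_plm[:, 0], label='PLM iteration', ax=ax)
sns.kdeplot(theta_est[:, 1], label='naive DML (biased)', ax=ax)
sns.kdeplot(theta_est[:, 2], label='cross-fitting DML', ax=ax)
ax.axvline(0.5, color='k', linestyle='--', label='true theta_0 = 0.5')
ax.legend()
plt.show()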

4.5 Nonparametric DML
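Sketching the idea only: the key point follows directly from the Step 2 transformation. Let the final stage be any ML model with target $\hat U/\hat V$ and sample weights $\hat V^2$, which yields a heterogeneous effect $\theta(X)$. A minimal sketch (simulated data; the heterogeneous effect $\theta(x) = 1 + x_0$ is invented for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
n = 5000
X = rng.uniform(-1, 1, size=(n, 3))
theta_x = 1 + X[:, 0] # heterogeneous treatment effect
T = np.sin(2*X[:, 1]) + rng.normal(size=n)
Y = theta_x*T + X[:, 2]**2 + rng.normal(size=n)

# Stage 1: cross-fitted residuals
U = Y - cross_val_predict(RandomForestRegressor(max_depth=5), X, Y, cv=5)
V = T - cross_val_predict(RandomForestRegressor(max_depth=5), X, T, cv=5)

# Stage 2: nonparametric final model with target U/V and weights V^2
# (the weights down-weight observations where V is near zero)
final = RandomForestRegressor(max_depth=5)
final.fit(X, U/V, sample_weight=V**2)
theta_hat = final.predict(X)
print(np.corrcoef(theta_hat, theta_x)[0, 1]) # clearly positive: theta_hat tracks theta(x)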

