1. Starting from Linear Regression
A simple way to obtain a causal effect from observational data is to use linear regression and control for the confounders:
$$Sales_i = \alpha + \tau\, Price_i + \beta_1\, Temp_i + \beta_2\, Cost_i + \beta_3\, Weekday_i + e_i$$
$\tau$ is the only quantity we need to care about, because $\tau$ is the causal effect of price on sales that we want to estimate. We do not care about the other variables themselves, but they must be handled correctly in the linear model; otherwise the resulting $\tau$ is a wrong estimate. For example, if the effect of temperature on sales is quadratic, the treatment effect estimated from the equation above is wrong.
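As a minimal sketch of this regression (the data below are simulated purely for illustration; the variable names and coefficient values are my own assumptions, not from the original post):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated ice-cream data (illustrative only). The true price effect is set to -3.
rng = np.random.default_rng(0)
n = 1000
temp = rng.uniform(15, 35, n)
weekend = rng.integers(0, 2, n)              # stand-in for the Weekday confounder
cost = rng.uniform(0.5, 1.5, n)
price = 1.0 + 0.05 * temp + 0.5 * weekend + cost + rng.normal(0, 0.1, n)
sales = 30 - 3.0 * price + 0.8 * temp + 5.0 * weekend + rng.normal(0, 1.0, n)
df = pd.DataFrame(dict(sales=sales, price=price, temp=temp, cost=cost, weekend=weekend))

# tau is the coefficient on price; it is only a credible causal estimate if every
# confounder enters the model with the correct functional form.
model = smf.ols("sales ~ price + temp + cost + weekend", data=df).fit()
print(model.params["price"])   # close to the true tau = -3
```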
2. The Frisch-Waugh-Lovell (FWL) Theorem
Suppose we have features $X_1$ and $X_2$.
Ordinary linear regression: $\hat Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
Regression based on the Frisch-Waugh-Lovell theorem:
- Regress $Y$ on $X_2$: $\hat y^* = \gamma_1 X_2$
- Regress $X_1$ on $X_2$: $\hat X_1 = \gamma_2 X_2$
- Compute the residuals: $\widetilde X_1 = X_1 - \hat X_1$, $\widetilde y = y - \hat y^*$
- Regress the residual of $y$ on the residual of $X_1$: $\widetilde y = \gamma_3 \widetilde X_1$
The two procedures give the same coefficient: $\beta_1 = \gamma_3$.
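A quick numerical check of this equivalence (simulated data; the specific numbers are illustrative, not from the original post):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

# Full regression: Y on [1, X1, X2]
beta1 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params[1]

# FWL: residualize Y and X1 on [1, X2], then regress residual on residual
y_res = y - sm.OLS(y, sm.add_constant(x2)).fit().fittedvalues
x1_res = x1 - sm.OLS(x1, sm.add_constant(x2)).fit().fittedvalues
gamma3 = sm.OLS(y_res, x1_res).fit().params[0]

print(beta1, gamma3)  # the two coefficients agree up to floating-point error
```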
The nonlinear generalization of the Frisch-Waugh-Lovell theorem (where $A \sim B$ denotes a model that predicts $A$ from $B$):
$$\big(Y - (Y \sim X)\big) \sim \big(T - (T \sim X)\big) \iff Y_i - E(Y_i \mid X_i) = \tau\,\big(T_i - E(T_i \mid X_i)\big) + \epsilon$$
What problem does the Frisch-Waugh-Lovell theorem solve?
It lets us use machine-learning models to obtain $Y_i - E(Y_i \mid X_i)$ and $T_i - E(T_i \mid X_i)$. Unlike a purely linear model, we do not need to know the correct functional form of every feature; it is enough that the fitted values $\hat y^*$ and $\hat X_1$ (i.e. the estimates of $E(Y \mid X)$ and $E(T \mid X)$) are sufficiently accurate.
3. Partially Linear Model
- Y is the core metric affected by the experiment
- T is the treatment, usually a 0/1 variable indicating whether a sample is in the treatment or control group; in a randomized A/B test $T \perp X$
- X are the confounders, roughly the user features not affected by the intervention, usually a high-dimensional vector
$$\begin{aligned} Y &= \theta_0 T + g(X) + U, \qquad\ E(U \mid T, X) = 0 \\ T &= m(X) + V, \qquad\qquad\ \ E(V \mid X) = 0 \end{aligned}$$
What DML ultimately estimates is $\widehat\theta_0$, i.e. the effect of the experiment on the core metric for different users. The most direct approach is to model Y with X and T jointly and estimate $\widehat\theta_0$ directly. But the $\widehat\theta_0$ obtained this way is usually biased: part of the bias comes from overfitting the sample, and part from the bias in the estimate $\widehat g(X)$. Let $\theta_0$ be the true parameter value. Plugging $X = T$ and $Y = Y - \widehat g(X)$ into the least-squares formula $\theta = (X^T X)^{-1} X^T Y$, the estimate $\hat\theta_0$ is:
$$\begin{aligned} \hat\theta_0 &= (T^T T)^{-1} T^T (Y - \widehat g(X)) \\ &= \frac{\sum T_i (Y_i - \widehat g(X_i))}{\sum T_i^2} \\ &= \frac{\sum T_i \big\{ [Y_i - g(X_i)] + [g(X_i) - \widehat g(X_i)] \big\}}{\sum T_i^2} \\ &= \frac{\sum T_i \big\{ (\theta_0 T_i + U_i) + [g(X_i) - \widehat g(X_i)] \big\}}{\sum T_i^2} \\ &= \theta_0 + \frac{\sum T_i U_i + \sum T_i [g(X_i) - \widehat g(X_i)]}{\sum T_i^2} \end{aligned}$$
The (scaled) estimation error is:
$$\sqrt{n}\,(\hat\theta_0 - \theta_0) = \sqrt{n}\left( \underbrace{\frac{\sum T_i U_i}{\sum T_i^2}}_{a} + \underbrace{\frac{\sum T_i [g(X_i) - \widehat g(X_i)]}{\sum T_i^2}}_{b} \right)$$
Term (a) converges to 0, since under the regression assumptions $E[T_i U_i] = 0$. In term (b), the naively fitted $\widehat g$ targets $E[Y \mid X]$, which contains the $T\theta_0$ component, so $g(X) \ne E[Y \mid X]$; hence $\widehat g(X_i)$ is biased, and the prediction error $g(X_i) - \widehat g(X_i)$ does not converge to 0.
4. Double Machine Learning
4.1 Steps of the DML Model
The DML procedure consists of the following three steps:
Step 1. Fit Y and T with any ML model and obtain the residuals $\widehat U, \widehat V$:
$$\begin{aligned} \widehat U &= Y - T\theta_0 - \widehat g(X_i), \qquad\ \text{where } \widehat g(X_i) = E(Y \mid X, T) \\ \widehat V &= T - \widehat m(X_i), \qquad\qquad\ \text{where } \widehat m(X_i) = E(T \mid X) \end{aligned}$$
Step 2. Fit the relationship between $\widehat U$ and $\widehat V$ with any model to obtain $\hat\theta_0$.
$\theta_0$ can be fitted with either a parametric or a non-parametric model. A parametric model can simply be a linear regression. A non-parametric model only takes inputs and a target, so a further transformation is needed: the regression target becomes $\frac{\widehat U}{\widehat V}$ and the sample weight becomes $\widehat V^2$:
$$\begin{aligned} \text{parametric:}&\quad \widehat U = \theta_0\,\widehat V + \epsilon, \qquad \hat\theta_0 = \arg\min_{\theta_0} E\big[(\widehat U - \theta_0 \widehat V)^2\big] \\ \text{non-parametric:}&\quad E\big[(\widehat U - \theta_0 \widehat V)^2\big] = E\Big[\widehat V^2 \Big(\tfrac{\widehat U}{\widehat V} - \theta_0\Big)^2\Big] \end{aligned}$$
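As an illustration of the non-parametric variant, the target/weight transformation above can be carried out with any regressor that supports sample weights. A sketch (the residual arrays `U_hat`, `V_hat` and the confounder matrix `X` are assumed to come from Step 1; the model choice is my own):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Guard against (near-)zero treatment residuals before dividing.
V_safe = np.where(np.abs(V_hat) < 1e-8, 1e-8, V_hat)
target = U_hat / V_safe        # pseudo-outcome U_hat / V_hat
weights = V_hat ** 2           # sample weight V_hat^2

# Minimizing sum_i V_i^2 * (U_i / V_i - theta(X_i))^2 is equivalent to
# minimizing sum_i (U_i - theta(X_i) * V_i)^2, so this weighted regression
# recovers a (possibly heterogeneous) effect theta(X).
theta_model = GradientBoostingRegressor()
theta_model.fit(X, target, sample_weight=weights)
theta_hat_x = theta_model.predict(X)
```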
Step 3. Cross-fitting
A key step for keeping the DML estimate unbiased is cross-fitting, which reduces the bias introduced by overfitting. Split the sample into two halves, sample 1 and sample 2. First fit the nuisance models on sample 1 and estimate $\hat\theta^{(1)}$ from the residuals on sample 2; then fit the nuisance models on sample 2 and estimate $\hat\theta^{(2)}$ on sample 1; average the two to obtain the final estimate. K-fold splitting can be used to make the estimate more robust.
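Putting the three steps together, here is a minimal sketch of a cross-fitted DML estimate of a constant effect, using scikit-learn's out-of-fold predictions as the cross-fitting device (the helper name and model choices are illustrative assumptions, not a fixed recipe):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def dml_theta(Y, T, X, n_folds=5, seed=0):
    """Cross-fitted DML estimate of a constant treatment effect theta_0."""
    model_y = RandomForestRegressor(n_estimators=200, random_state=seed)
    model_t = RandomForestRegressor(n_estimators=200, random_state=seed)
    # Step 1 with cross-fitting: out-of-fold predictions of E[Y|X] and E[T|X]
    y_hat = cross_val_predict(model_y, X, Y, cv=n_folds)
    t_hat = cross_val_predict(model_t, X, T, cv=n_folds)
    W_hat, V_hat = Y - y_hat, T - t_hat   # outcome and treatment residuals
    # Step 2: residual-on-residual regression
    return np.sum(V_hat * W_hat) / np.sum(V_hat ** 2)
```

Mature implementations of the same procedure, together with valid standard errors, are available in libraries such as econml and DoubleML.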
Expression for $\hat\theta_0$ with a parametric second stage:
Plugging $Z = \widehat V$, $X = T$ and $Y = Y - \widehat g(X)$ into the two-stage least-squares formula $\theta = (Z^T X)^{-1} Z^T Y$ gives:
$$\begin{aligned} \hat\theta_0 &= (\widehat V^T T)^{-1}\, \widehat V^T (Y - \widehat g(X)) \\ &= \frac{\sum \widehat V_i (Y_i - \widehat g(X_i))}{\sum \widehat V_i T_i} \\ &= \frac{\sum \widehat V_i \big([Y_i - g(X_i)] + [g(X_i) - \widehat g(X_i)]\big)}{\sum \widehat V_i T_i} \\ &= \frac{\sum \widehat V_i \big([T_i \theta_0 + U_i] + [g(X_i) - \widehat g(X_i)]\big)}{\sum \widehat V_i T_i} \\ &= \theta_0 + \frac{\sum \widehat V_i U_i + \sum \widehat V_i [g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i} \\ &= \theta_0 + \frac{\sum \widehat V_i U_i + \sum [V_i - (\widehat m(X_i) - m(X_i))]\,[g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i} \\ &= \theta_0 + \underbrace{\frac{\sum \widehat V_i U_i}{\sum \widehat V_i T_i}}_{a} + \underbrace{\frac{\sum [m(X_i) - \widehat m(X_i)]\,[g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i}}_{b} + \underbrace{\frac{\sum V_i\,[g(X_i) - \widehat g(X_i)]}{\sum \widehat V_i T_i}}_{c} \end{aligned}$$
Term (a) converges to 0. Term (b) converges to 0 because $\sum [m(X_i) - \widehat m(X_i)] \approx 0$ and the errors $[m(X_i) - \widehat m(X_i)]$ and $[g(X_i) - \widehat g(X_i)]$ are independent. In term (c), $[g(X_i) - \widehat g(X_i)]$ is the prediction error of the model $Y = T\hat\theta_0 + \widehat g(X) + \widehat U$. If $\widehat g(x)$ overfits, this prediction error contains the noise $U$, and if the noise terms $U$ and $V$ are correlated, term (c) does not vanish ($V$ is then correlated with, not independent of, $[g(X_i) - \widehat g(X_i)]$). To solve this problem, the sample-splitting method of Step 3 is needed.
(If there were any unobserved confounders, $V$ would also be correlated with $[g(X_i) - \widehat g(X_i)]$; here we assume that all confounders are observed.)
An alternative derivation (replacing $g(X)$ with $p(X)$):
According to the Frisch-Waugh-Lovell theorem, DML can also be written as:
$$\begin{aligned} Y &= p(X) + W, \quad p(X) = E(Y \mid X), \quad \widehat W = Y - \widehat p(X) \\ T &= m(X) + V, \quad m(X) = E(T \mid X), \quad \widehat V = T - \widehat m(X) \\ \widehat W &= \widehat V \theta_0 + \eta \end{aligned}$$
The expression for $\hat\theta_0$ is:
$$\begin{aligned} \hat\theta_0 &= \frac{\sum \widehat V_i \widehat W_i}{\sum \widehat V_i^2} \\ &= \frac{\sum [V_i + (m(X_i) - \widehat m(X_i))]\,[W_i + (p(X_i) - \widehat p(X_i))]}{\sum \widehat V_i^2} \\ &= \frac{\sum V_i W_i + \sum [m(X_i) - \widehat m(X_i)][p(X_i) - \widehat p(X_i)] + \sum W_i [m(X_i) - \widehat m(X_i)] + \sum V_i [p(X_i) - \widehat p(X_i)]}{\sum \widehat V_i^2} \\ &= \underbrace{\frac{\sum V_i W_i}{\sum \widehat V_i^2}}_{a} + \underbrace{\frac{\sum [m(X_i) - \widehat m(X_i)][p(X_i) - \widehat p(X_i)]}{\sum \widehat V_i^2}}_{b} + \underbrace{\frac{\sum W_i [m(X_i) - \widehat m(X_i)]}{\sum \widehat V_i^2}}_{c} + \underbrace{\frac{\sum V_i [p(X_i) - \widehat p(X_i)]}{\sum \widehat V_i^2}}_{d} \end{aligned}$$
Here it is term (a) that carries $\theta_0$: since $W_i = Y_i - p(X_i) = \theta_0 V_i + U_i$, we have $\sum V_i W_i = \theta_0 \sum V_i^2 + \sum V_i U_i$, so term (a) converges to $\theta_0$. Terms (b), (c) and (d) converge to 0 under cross-fitting, because the estimation errors $m(X_i) - \widehat m(X_i)$ and $p(X_i) - \widehat p(X_i)$ come from models fitted on a different fold and are therefore (approximately) independent of $V_i$ and $W_i$, both of which have conditional mean zero given $X_i$.
4.2 Understanding DML from a GMM Perspective
The population moment condition of the PLM (Partially Linear Model):
$$\varphi_{plm}(W;\theta_0,\eta) = [Y - T\theta_0 - g(X)]\,T, \qquad \text{where } \eta = g(X),\ \eta_0 = g_0(X) \tag{1}$$
The population moment condition of DML:
$$\begin{aligned} \varphi_{dml}(W;\theta_0,\eta) &= [Y - T\theta_0 - g(X)]\,[T - m(X)] \\ \text{or:}\quad \varphi_{dml}(W;\theta_0,\eta) &= [Y - p(X) - (T - m(X))\theta_0]\,[T - m(X)] \\ \text{where}\quad \eta &= (p(X), m(X)),\ \eta_0 = (p_0(X), m_0(X)) = (E(Y \mid X), E(T \mid X)) \end{aligned} \tag{2}$$
By Neyman orthogonality: the derivative of the PLM moment condition with respect to the nuisance parameter is not zero, whereas the Gateaux derivative of the DML moment condition is zero:
$$\partial_\eta\, E\big\{ [Y - p(X) - (T - m(X))\theta_0]\,[T - m(X)] \big\}\Big|_{\eta=\eta_0} = 0$$
This means that the moment condition used to estimate the parameters of interest has zero derivative with respect to the nuisance parameters. In other words, at the true value of the nuisance parameters, the derivative of the moment condition is orthogonal to any direction pointing away from that true value (which is where the term "orthogonality" comes from). The resulting estimator is also called a doubly robust estimator, meaning it is robust with respect to estimation error in both the parameters of interest and the nuisance parameters.
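To make this concrete, here is a sketch of the computation (my own filling-in of the step, using the model assumptions $E(U \mid T, X) = 0$ and $E(V \mid X) = 0$ from Section 3). Perturb the nuisances in a direction $(\Delta_p, \Delta_m)$ and differentiate at $r = 0$:

$$\begin{aligned} &\frac{d}{dr}\, E\Big\{ \big[Y - p_0(X) - r\Delta_p(X) - \big(T - m_0(X) - r\Delta_m(X)\big)\theta_0\big]\,\big[T - m_0(X) - r\Delta_m(X)\big] \Big\}\Big|_{r=0} \\ &\qquad = E\big\{ [-\Delta_p(X) + \theta_0 \Delta_m(X)]\, V \big\} - E\big\{ U\, \Delta_m(X) \big\} = 0, \end{aligned}$$

because $E(V \mid X) = 0$ and $E(U \mid X) = 0$ make every term vanish. For the PLM moment condition (1), the derivative with respect to $g$ in a direction $\Delta_g$ is $-E[\Delta_g(X)\,T] = -E[\Delta_g(X)\,m(X)]$, which is generally nonzero, so first-stage error in $\widehat g$ leaks directly into the estimate of $\theta_0$.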
4.3 A Worked Example of DML Orthogonalization
Step 1. Raw data
We explore the relationship between ice-cream sales and price. Weekday is a confounder: on weekends both price and sales are higher. To obtain an accurate causal effect, the bias introduced by the confounders must be corrected. The confounders contained in this dataset are temperature, day of week, and cost.
Step 2. Regress T on X
Build a regression model of T on X and take the residual $\widehat V = T - \widehat m(X)$. The residual $\widehat V$ can be viewed as what remains of T after the effect of X on T has been removed, so $\widehat V$ is orthogonal to $X$.

Step 3. Regress Y on X
Similarly, build a regression model of Y on X and take the residual $\widehat W = Y - \widehat p(X)$. The role of the model $\widehat p(X)$ is to remove from Y the variance explained by X; compared with the Step-2 plot, the variance of Y in the Step-3 plot is smaller.
Step 4. Compare the DML result with the naive regression
In the plot, price_res and sales_res show a clear negative correlation. Fitting one linear model of sales_res on price_res and another of sales on price, the former yields a significantly negative relationship, while the latter yields a positive relationship because of the confounders.
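The original dataset and plots are not reproduced here, so the sketch below simulates a stand-in ice-cream dataset (all coefficients are my own illustrative choices, with a true price effect of -2) and carries out Steps 2-4:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 2000
temp = rng.uniform(15, 35, n)
weekend = rng.integers(0, 2, n)
cost = rng.uniform(0.5, 1.5, n)
X = np.column_stack([temp, weekend, cost])
price = 3 + 0.05 * temp + 1.0 * weekend + cost + rng.normal(0, 0.2, n)
sales = 20 - 2.0 * price + 0.6 * temp + 6.0 * weekend + rng.normal(0, 1.0, n)

# Steps 2 and 3: out-of-fold residualization of price (T) and sales (Y) on X
model_t = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
model_y = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
price_res = price - cross_val_predict(model_t, X, price, cv=5)
sales_res = sales - cross_val_predict(model_y, X, sales, cv=5)

# Step 4: the naive regression is confounded; residual-on-residual is not
naive = sm.OLS(sales, sm.add_constant(price)).fit()
orth = sm.OLS(sales_res, price_res).fit()
print(naive.params[1], orth.params[0])  # positive (confounded) vs. roughly -2
```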
4.4 DML Simulation
Assume
$$\begin{aligned} g(X) &= \sin^2(X) \\ m(X) &= \frac{1}{2\pi}\,\frac{\sinh(\gamma)}{\cosh(\gamma) - \cos(x - \nu)} \end{aligned}$$
Set $\nu = 0$ and $\gamma = 1$ (the original post plots the resulting relationship between $X$ and $m(X)$).
0. Data generation
The true value of $\theta_0$ is 0.5.
```python
import numpy as np
from sklearn.datasets import make_spd_matrix
import math
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
randomseednumber = 11022018
np.random.seed(randomseednumber)
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import matplotlib.pyplot as plt

N = 500      # No. obs
k = 10       # No. variables in x_i
theta = 0.5  # Structural parameter: the true value of the causal parameter is theta = 0.5
b = [1 / j for j in range(1, 11)]  # x weights, b_k = 1/k
sigma = make_spd_matrix(k)         # random covariance matrix, shape (10, 10)
MC_no = 500                        # number of Monte Carlo simulations

def g(x):
    return np.power(np.sin(x), 2)

def m(x, nu=0., gamma=1.):
    return 0.5 / math.pi * np.sinh(gamma) / (np.cosh(gamma) - np.cos(x - nu))

# Array of estimated thetas to store results, shape (500, 3)
theta_est = np.zeros(shape=[MC_no, 3])
```
1. Benchmark with the true $g$: the true value of $\theta_0$ is 0.5; regressing $Y - G$ on $D$ via sm.OLS(Y-G, D) recovers it.
```python
theta_est_true = np.zeros(shape=[MC_no, 1])  # shape (500, 1)
for i in range(MC_no):
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])
    # The true value of the causal parameter is theta = 0.5
    model = sm.OLS(Y - G, D)
    results = model.fit()
    theta_est_true[i][0] = results.params[0]  # coefficient on D, i.e. theta_0
    # equivalently: theta_est_true[i][0] = np.sum(np.dot(D, Y - G)) / np.sum(np.dot(D.T, D))
sns.histplot(theta_est_true[:, 0], kde=True)
plt.axvline(0.5, color="r")
```
2. PLM estimate:
- Set the initial value $\widehat{\theta_0}^{(0)} = 0.4$
- Fit a machine-learning model of $Y - D\,\widehat{\theta_0}^{(0)}$ on $\mathbf{X}$ to estimate $\widehat{g_0}^{(1)}$
- Fit a linear regression of $Y - \widehat{g_0}^{(1)}(\mathbf{X})$ on $D$ to estimate $\widehat{\theta_0}^{(1)}$
- Repeat until $|\widehat{\theta_0}^{(i)} - \widehat{\theta_0}^{(i-1)}| < \epsilon$, giving the final estimates $\widehat{g_0}$ and $\widehat{\theta_0}$
```python
theta_est_plm = np.zeros(shape=[MC_no, 1])  # shape (500, 1)
for i in range(MC_no):
    theta_0, theta_delta = 0.4, 0.1  # initial guess for theta_0 and convergence gap
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])  # data generated with the true theta = 0.5
    while theta_delta > 0.01:
        theta_pre = theta_0
        # partially linear model: fit g on Y - D*theta_0, then update theta_0
        naiveDMLg = RandomForestRegressor(max_depth=2)
        naiveDMLg.fit(X, Y - np.dot(D, theta_0))  # compute ghat
        Ghat = naiveDMLg.predict(X)  # shape (500,)
        theta_0 = np.sum(np.dot(D, Y - Ghat)) / np.sum(np.dot(D, D))
        theta_delta = abs(theta_pre - theta_0)
    theta_est_plm[i][0] = theta_0
sns.histplot(theta_est_plm[:, 0], kde=True)
plt.axvline(0.5, color="r")
```
3. Naive DML estimate: np.mean(np.dot(Vhat, Y-Ghat)) / np.mean(np.dot(Vhat, D)) gives an estimate of $\theta_0$, which is biased (no sample splitting).
```python
for i in range(MC_no):
    # Generate data: no. obs x no. variables in x_i
    # x_i has length K = 10 and is drawn from a multivariate normal distribution
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])
    # Naive double machine learning (no sample splitting)
    naiveDMLg = RandomForestRegressor(max_depth=2)
    naiveDMLg.fit(X, Y)  # compute Ghat
    Ghat = naiveDMLg.predict(X)  # shape (500,)
    naiveDMLm = RandomForestRegressor(max_depth=2)
    naiveDMLm.fit(X, D)
    Mhat = naiveDMLm.predict(X)  # compute Mhat; Vhat is the residual
    Vhat = D - Mhat
    theta_est[i][1] = np.mean(np.dot(Vhat, Y - Ghat)) / np.mean(np.dot(Vhat, D))
sns.histplot(theta_est[:, 1], kde=True)
plt.axvline(0.5, color="r")
```
4. Cross-fitting DML estimate: with sample splitting, the estimate is unbiased.
```python
for i in range(MC_no):
    # Generate data: no. obs x no. variables in x_i
    # x_i has length K = 10 and is drawn from a multivariate normal distribution
    X = np.random.multivariate_normal(np.ones(k), sigma, size=[N,])  # shape (500, 10)
    G = g(np.dot(X, b))   # shape (500,)
    M = m(np.dot(X, b))   # shape (500,)
    D = M + np.random.standard_normal(size=[N,])
    Y = np.dot(theta, D) + G + np.random.standard_normal(size=[N,])
    # Cross-fitting DML: split the sample into two halves I and I_C
    I = np.random.choice(N, N // 2, replace=False)
    I_C = [x for x in np.arange(N) if x not in I]
    # Ghat for both halves (fit on one half, predict on the other)
    Ghat_1 = RandomForestRegressor(max_depth=2).fit(X[I], Y[I]).predict(X[I_C])
    Ghat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C], Y[I_C]).predict(X[I])
    # Mhat and Vhat for both halves
    Mhat_1 = RandomForestRegressor(max_depth=2).fit(X[I], D[I]).predict(X[I_C])
    Mhat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C], D[I_C]).predict(X[I])
    Vhat_1, Vhat_2 = D[I_C] - Mhat_1, D[I] - Mhat_2
    theta_1 = np.mean(np.dot(Vhat_1, (Y[I_C] - Ghat_1))) / np.mean(np.dot(Vhat_1, D[I_C]))
    theta_2 = np.mean(np.dot(Vhat_2, (Y[I] - Ghat_2))) / np.mean(np.dot(Vhat_2, D[I]))
    theta_est[i][2] = 0.5 * (theta_1 + theta_2)
sns.histplot(theta_est[:, 2], kde=True)
plt.axvline(0.5, color="r")
```
5. Comparison of results. The distributions of the estimates from the four approaches are compared below.
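The original figure is not included here; a short snippet reusing the arrays computed above regenerates the comparison (the labels are mine):

```python
import matplotlib.pyplot as plt
import seaborn as sns

labels = ["OLS with true g (benchmark)", "iterative PLM", "naive DML", "cross-fitting DML"]
series = [theta_est_true[:, 0], theta_est_plm[:, 0], theta_est[:, 1], theta_est[:, 2]]
for s, lab in zip(series, labels):
    sns.kdeplot(s, label=lab)
plt.axvline(0.5, color="k", linestyle="--", label="true theta = 0.5")
plt.legend()
plt.show()
```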

4.5 Non-parametric DML (to be continued)