Causal Inference: Study Notes

The three levels of causal reasoning are association (correlation), intervention (treatment), and counterfactuals (imagination). Association, i.e. correlation, is the lowest level; today's (weak) artificial intelligence sits at this level, yet it still solves many real-world problems. Intervention means action (the do-operator): given covariates X, confounders W, and instrumental variables Z (variables that affect the treatment T but not the outcome Y directly), we intervene on T via do() and observe the outcome Y. Counterfactuals ask "what if" questions about a hypothetical world corresponding to the real one, reasoning from effects back to causes; this is the level of human intelligence, the level at which one can change the world.

Microsoft's two main frameworks for causal inference are DoWhy and EconML. DoWhy implements the causal-inference workflow in four steps: model, identify, estimate, refute. EconML is a framework for estimating expectations of potential outcomes (outcomes that cannot be observed); it provides a variety of methods, built on three core assumptions, for reducing selection bias (the gap between the modeling sample and the real world). These methods rely mainly on machine-learning models such as linear regression, random forests, decision trees, and deep learning. In the estimation step, DoWhy can delegate to EconML's estimators.

The unit of causal inference is U, e.g. a hotel customer. Its attributes/features are X (covariates or confounders), the outcome is Y, and the treatment is T. Sometimes an instrumental variable Z is also introduced (again to reduce selection bias).

Covariates are all variables other than the treatment T and the outcome Y: in a given dataset, every variable besides the cause and the effect is a covariate. Confounders are the subset of covariates that affect both the treatment and the outcome. Covariates thus include confounders as well as non-confounders.
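
A minimal simulation (hypothetical data, not the hotel dataset) shows why confounders matter: here W drives both T and Y, the true effect of T on Y is zero, yet a naive regression of Y on T reports a strong association. Adjusting for W recovers the true (zero) effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50_000
W = rng.normal(size=n)            # confounder: affects both T and Y
T = 2.0 * W + rng.normal(size=n)  # treatment depends on W
Y = 3.0 * W + rng.normal(size=n)  # outcome depends only on W; true T->Y effect is 0

# Naive regression of Y on T alone picks up the confounded association
naive = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]

# Adjusting for the confounder W recovers the true (zero) effect
adjusted = LinearRegression().fit(np.column_stack([T, W]), Y).coef_[0]

print(round(naive, 2), round(adjusted, 2))  # naive is clearly nonzero; adjusted is close to 0
```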

The rest of these notes walk through the causal-inference process with DoWhy, using the effect of a room change on booking cancellation as the example.

I. Create the model (build an initial causal graph from prior knowledge, then create a causal model from the dataset, the treatment, and that graph). The graph may be incomplete, but DoWhy will complete it automatically.

1. Prepare the dataset, including feature engineering

# Prepare the dataset
import dowhy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import logging
logging.getLogger("dowhy").setLevel(logging.INFO)

dataset = pd.read_csv('https://raw.githubusercontent.com/Sid-darthvader/DoWhy-The-Causal-Story-Behind-Hotel-Booking-Cancellations/master/hotel_bookings.csv')
dataset.columns

# Total stay in nights
dataset['total_stay'] = dataset['stays_in_week_nights']+dataset['stays_in_weekend_nights']
# Total number of guests
dataset['guests'] = dataset['adults']+dataset['children'] +dataset['babies']
# Creating the different_room_assigned feature
dataset['different_room_assigned']=0
slice_indices =dataset['reserved_room_type']!=dataset['assigned_room_type']
dataset.loc[slice_indices,'different_room_assigned']=1
# Deleting older features
dataset = dataset.drop(['stays_in_week_nights','stays_in_weekend_nights','adults','children','babies'
                        ,'reserved_room_type','assigned_room_type'],axis=1)

dataset.isnull().sum() # Country,Agent,Company contain 488,16340,112593 missing entries
dataset = dataset.drop(['agent','company'],axis=1)
# Replacing missing countries with the most frequently occurring country
dataset['country']= dataset['country'].fillna(dataset['country'].mode()[0])

dataset = dataset.drop(['reservation_status','reservation_status_date','arrival_date_day_of_month'],axis=1)
dataset = dataset.drop(['arrival_date_year'],axis=1)

# Replacing 1 by True and 0 by False for the experiment and outcome variables
dataset['different_room_assigned']= dataset['different_room_assigned'].replace(1,True)
dataset['different_room_assigned']= dataset['different_room_assigned'].replace(0,False)
dataset['is_canceled']= dataset['is_canceled'].replace(1,True)
dataset['is_canceled']= dataset['is_canceled'].replace(0,False)
dataset.dropna(inplace=True) # added: drop remaining rows with NA values
dataset.columns

2. Check the relationships between variables:

A very simple heuristic: in random samples, count how often Y equals X. If they are equal close to 100% of the time, X → Y is likely; if only about 50% of the time, no causal relationship can be concluded.

Draw 1,000 rows at random (repeated 10,000 times) and count how often is_canceled equals different_room_assigned.

# different_room_assigned - averages ~518 of 1000: causal relationship uncertain
counts_sum = 0
for i in range(10000):
        counts_i = 0
        rdf = dataset.sample(1000)
        counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
        counts_sum += counts_i
print(counts_sum/10000)

# Among bookings with zero booking_changes - averages ~492: still uncertain
counts_sum = 0
for i in range(10000):
        counts_i = 0
        rdf = dataset[dataset["booking_changes"]==0].sample(1000)
        counts_i = rdf[rdf["is_canceled"]== rdf["different_room_assigned"]].shape[0]
        counts_sum += counts_i
print(counts_sum/10000)

3. Build the causal graph from prior knowledge (a Bayesian network, i.e. a directed acyclic graph)

causal_graph = """digraph {
different_room_assigned[label="Different Room Assigned"];
is_canceled[label="Booking Cancelled"];
booking_changes[label="Booking Changes"];
previous_bookings_not_canceled[label="Previous Booking Retentions"];
days_in_waiting_list[label="Days in Waitlist"];
lead_time[label="Lead Time"];
market_segment[label="Market Segment"];
country[label="Country"];
U[label="Unobserved Confounders"];
is_repeated_guest;
total_stay;
guests;
meal;
hotel;
U->different_room_assigned; U->is_canceled;U->required_car_parking_spaces;
market_segment -> lead_time;
lead_time->is_canceled; country -> lead_time;
different_room_assigned -> is_canceled;
country->meal;
lead_time -> days_in_waiting_list;
days_in_waiting_list ->is_canceled;
previous_bookings_not_canceled -> is_canceled;
previous_bookings_not_canceled -> is_repeated_guest;
is_repeated_guest -> is_canceled;
total_stay -> is_canceled;
guests -> is_canceled;
booking_changes -> different_room_assigned; booking_changes -> is_canceled; 
hotel -> is_canceled;
required_car_parking_spaces -> is_canceled;
total_of_special_requests -> is_canceled;
country->{hotel, required_car_parking_spaces,total_of_special_requests,is_canceled};
market_segment->{hotel, required_car_parking_spaces,total_of_special_requests,is_canceled};
}"""

4. Create the causal model (this really just states a hypothesis, which the identification and estimation steps then test)

model= dowhy.CausalModel(
        data = dataset,
        graph=causal_graph.replace("\n", " "),
        treatment='different_room_assigned',
        outcome='is_canceled')
model.view_model()

II. Causal identification (involves the average treatment effect ATE, the frontdoor and backdoor criteria, and instrumental variables iv)

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
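
The backdoor adjustment that this identification step searches for can be written as follows, where W is an admissible set of confounders blocking all backdoor paths from T to Y:

```latex
P(Y \mid \mathrm{do}(T=t)) = \sum_{w} P(Y \mid T=t, W=w)\, P(W=w)
```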

III. Causal estimation (compute the expectation; EconML can also be used here, which offers many methods and supports extension with new ones)

DoWhy's methods:

Linear regression: backdoor.linear_regression (fast)

estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.linear_regression",
                                        control_value=0,
                                        treatment_value=1,
                                        confidence_intervals=True,
                                        test_significance=True)
print(estimate)

Propensity score matching: backdoor.propensity_score_matching (slow)

Propensity score stratification: backdoor.propensity_score_stratification (slow)

Propensity score weighting: backdoor.propensity_score_weighting (slow)
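
All three propensity-score methods build on the same quantity, the propensity score e(x) = P(T=1 | X=x). A minimal inverse-propensity-weighting sketch on synthetic data (illustrating the idea behind backdoor.propensity_score_weighting, not DoWhy's internal implementation) looks like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
X = rng.normal(size=n)                      # confounder
p = 1 / (1 + np.exp(-X))                    # true propensity P(T=1|X)
T = rng.binomial(1, p)
Y = 1.5 * T + 2.0 * X + rng.normal(size=n)  # true treatment effect = 1.5

# Step 1: estimate the propensity score e(X) = P(T=1|X)
e = LogisticRegression().fit(X.reshape(-1, 1), T).predict_proba(X.reshape(-1, 1))[:, 1]

# Step 2: weight treated units by 1/e and control units by 1/(1-e),
# then compare the weighted outcome means (Horvitz-Thompson estimator)
ate = np.mean(T * Y / e) - np.mean((1 - T) * Y / (1 - e))
print(round(ate, 2))  # close to the true effect 1.5
```

The reweighting creates a pseudo-population in which treatment is independent of X, so the simple difference in weighted means is an unbiased ATE estimate.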

Instrumental variable: iv.instrumental_variable

Regression discontinuity: iv.regression_discontinuity

EconML's methods:

Double machine learning: backdoor.econml.dml.* (fast)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier  # classifier is needed by XLearner below
dml_estimate = model.estimate_effect(identified_estimand, method_name="backdoor.econml.dml.DML",
                                     control_value = 0,
                                     treatment_value = 1,
                                 confidence_intervals=False,
                                method_params={"init_params":{'model_y':GradientBoostingRegressor(),
                                                              'model_t': GradientBoostingRegressor(),
                                                              "model_final":LassoCV(fit_intercept=False),
                                                              'featurizer':PolynomialFeatures(degree=2, include_bias=True)},
                                               "fit_params":{}})
print(dml_estimate)
estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.econml.dml.LinearDML",
                                        method_params={
                                        'init_params': {'model_y':GradientBoostingRegressor(),
                                                        'model_t': GradientBoostingRegressor(), },
                                        'fit_params': {}
                                     })
print(estimate)

Doubly robust learner: backdoor.econml.drlearner.*

Orthogonal random forest: backdoor.econml.ortho_forest.*

Deep instrumental variables: iv.econml.deepiv.*

Meta-learners: backdoor.econml.metalearners.*

estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.econml.metalearners.SLearner",
                                        method_params={
                                        'init_params': {'overall_model':GradientBoostingRegressor(),
                                                       },
                                        'fit_params': {}
                                     })
print(estimate)
estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.econml.metalearners.TLearner",
                                        method_params={
                                        'init_params': {'models':GradientBoostingRegressor(),
                                                        },
                                        'fit_params': {}
                                     })
print(estimate)
estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.econml.metalearners.XLearner",
                                        method_params={
                                        'init_params': {  'models': GradientBoostingRegressor(),
                                                          'propensity_model': GradientBoostingClassifier(),
                                                          'cate_models': GradientBoostingRegressor()
                                                       },
                                        'fit_params': {}
                                     })
print(estimate)

With so many estimation methods, which one should you use? The book 《原因与结果的经济学》 offers some guidance. EconML's own strategy is to pick the estimator with the lowest score, but in practice the choice still seems hard to make.

Estimate with propensity score stratification:

estimate = model.estimate_effect(identified_estimand, 
                                 method_name="backdoor.propensity_score_stratification",target_units="ate")
# ATE = Average Treatment Effect
# ATT = Average Treatment Effect on Treated (i.e. those who were assigned a different room)
# ATC = Average Treatment Effect on Control (i.e. those who were not assigned a different room)
print(estimate)
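
In potential-outcome notation, with Y(1) and Y(0) denoting the outcomes under treatment and control, the three target_units options correspond to:

```latex
\mathrm{ATE} = \mathbb{E}[Y(1) - Y(0)], \quad
\mathrm{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T = 1], \quad
\mathrm{ATC} = \mathbb{E}[Y(1) - Y(0) \mid T = 0]
```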

Inference: the intervention (assigning a different room) lowers the expected cancellation rate by about 32%. A guess at the reason: guests were switched to better rooms after arriving at the hotel.

IV. Refute (check the robustness/stability of the estimate against several counterfactual variations of the data)

1. Random common cause (expected result: the new effect differs little from the estimated effect)

refute1_results=model.refute_estimate(identified_estimand, estimate,
        method_name="random_common_cause")
print(refute1_results)

2. Placebo treatment (expected result: the new effect is close to 0)

refute2_results=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter")
print(refute2_results)
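
The intuition behind the placebo refuter can be reproduced on synthetic data (a hypothetical example, not DoWhy's internal code): replacing the real treatment with a random permutation should drive the estimated effect to roughly zero, while the real treatment keeps its effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=n)                    # common cause
T = (X + rng.normal(size=n) > 0).astype(float)
Y = 2.0 * T + X + rng.normal(size=n)      # true effect of T on Y is 2.0

def effect(t):
    # linear-regression estimate of the effect of t on Y, adjusting for X
    return LinearRegression().fit(np.column_stack([t, X]), Y).coef_[0]

real = effect(T)
placebo = effect(rng.permutation(T))      # placebo: shuffled treatment

print(round(real, 1), round(placebo, 1))  # real is close to 2.0, placebo close to 0.0
```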

3. Data subset (expected result: the new effect differs little from the estimated effect)

refute3_results=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter")
print(refute3_results)

These refutation tests cannot prove that the inference is correct, but passing them increases our confidence in it.

Comments and discussion are welcome!
