lecture 9：分类变量回归

最新推荐文章于 2023-12-29 01:22:29 发布

HUT_Econometrics

最新推荐文章于 2023-12-29 01:22:29 发布

阅读量3.2k

点赞数

分类专栏：计量经济学文章标签：回归分类机器学习人工智能

本文链接：https://blog.csdn.net/weixin_40496687/article/details/116276884

版权

Heckman两阶段模型用于解决样本选择偏差问题，常用于经济学和管理学领域。第一阶段，通过Probit模型估计样本是否被选择的概率，计算逆米尔斯比率。第二阶段，在原回归方程中加入逆米尔斯比率作为控制变量，以修正估计偏误。此方法在处理如女性工作选择、企业R&D投入披露等问题时非常有效。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

先学习这个这个资料

分类变量回归——Probit和Logit（附代码）

以下两个网页上均有heckman的代码，以日本教师的优先考虑。
https://pypi.org/project/py4etrics/官方安装地址
https://github.com/Py4Etrics
https://github.com/vnery5/Econometria
https://lost-stats.github.io/Model_Estimation/GLS/heckman_correction_model.html
https://rlhick.people.wm.edu/stories/econ_407_notes_heckman.html
https://econ.pages.code.wm.edu/407/notes/docs/mle_heckman.html

Heckman 两阶段模型:

Heckman两阶段模型解决的是样本选择偏差（sample selection bias）的问题。样本选择偏差指的是我们在回归方程中估计出的参数是基于那些被选择进样本了的数据点（或者说是能够观测得到的数据点）而得出的。如果说一个数据点（观测值）是不是被选择进样本是一个外生的、纯随机的事件，那么我们据此得出的参数并不会有偏差（bias）——这个估计结果就不会有问题。

可是，一个数据点是不是能被选择进样本、或者说是不是能够被观测到，这个过程在很多时候并不是随机、外生的。比如说，就拿Wooldridge 教材中的一个经典例子来讲：研究者试图估计出受教育程度以及工作经验对于女职工工资的影响。在一个753名女性的大样本中，428名女性是有工作的，所以这项研究只能在这428名有工作（有收入）的样本中展开。那么问题来了：因为我们无法观测到那325个没有工作的样本中受教育程度以及经验对于收入的影响，并且一个人选择工作或不工作并非是随机的——人们会根据潜在的收入水平、自身条件、家庭情况、年龄等等因素综合来决定是否参加工作，于是，我们仅从那428个有工作的人身上找出的统计学结果将是有偏差的，因为样本的选择并非随机及外生的。

Heckman两阶段模型的功能就是试图纠正这种偏差导致的估计偏误。第一阶段的模型，是一个包括全样本（753人）的Probit模型，用来估计一个人参加工作与否的概率。这里的因变量是二元的，表示是否参加工作；自变量是一些会影响个人决定工作与否的外生变量，比如其他收入来源、年龄、有几个未成年子女，等等。这些自变量类似工具变量——他们会影响个人是否参加工作的决策，但不太可能影响参加工作后的收入水平。然后根据这个Probit模型，我们为每一个样本计算出逆米尔斯比率（Inverse Mills Ratio）。这个比率的作用是为每一个样本计算出一个用于修正样本选择偏差的值。

然后第二步，只需要在原来的回归方程——即对于428个有工作的样本，基于她们受教育程度和经验的回归中加入一个额外的自变量——逆米尔斯比率即可，然后估计出回归参数。

最后，观察在第二阶段方程中，逆米尔斯比率这个自变量的显著性。如果该变量不显著，则说明最一开始的回归方程并不具有样本选择偏差，研究者可以根据原来的系数来做出统计推断；但如果尼米尔比率这个参数是显著的，则说明样本选择偏差是存在的，研究者应当根据第二阶段方程里的回归系数来做出统计推断。

Heckman两阶段模型适用于解决由样本选择偏差（sample selection bias）造成的内生性问题。在经济学领域，样本选择偏差的典型例子是研究女性的受教育情况对女性工资的影响。按照这个思路，一般会去问卷收集或在哪个网站下载部分女性的受教育情况，工资，及其他特征数据，例如年龄，毕业院校等级等个人特征，然后做回归。不过这样做有一个问题，就是登记的女性，都是在工作的，但是许多受教育程度较高的女性不工作，选择做家庭主妇，这部分样本就没有算在内，样本失去随机性。这就导致模型只是用到了在工作的女性，这样得出的结论是有偏差的。在管理学领域，一个典型的问题是企业的某个特征，或者董事/CEO的某个特征，对企业R&D投入的影响。也是同样的问题，企业的R&D投入是企业自愿披露的内容，有的企业不披露，这时你做回归时就不能包括这部分样本，也会造成样本选择偏差，结果有偏。

对于这种情况，Heckman提出了一个方法，赫克曼矫正法（Heckman Correction，又称两阶段方法）。赫克曼矫正法分两个步骤进行：第一步骤，研究者根据管理学理论设计出一个计算企业披露R&D投入概率的模型，而该模型的统计估计结果可以用来预测每个个体的概率；第二步骤，研究者将这些被预测个体概率合并为一个额外的解释变量，与其他控制变量等变量一起来矫正自选择问题。这个比率叫逆米尔斯比率，inverse Mills ration， imr，也就是说，在第一步计算出imr，在第二步把imr当作一个控制变量。

以企业R&D投入问题为例，假设全样本是1000家公司，其中800家公司披露了其R&D投入。

第一阶段的模型，是一个包括全样本（1000家）的Probit模型，用来估计一家公司是否会披露其R&D投入的概率。这里的因变量是二元的，表示是否披露R&D投入；自变量是一些会影响是否披露R&D的外生变量，比如其他收入营业收入，杠杆率，公司规模，所属行业等等。然后根据这个Probit模型，为每一个样本计算出imr，imr作用是为每一个样本计算出一个用于修正样本选择偏差的值。

第二阶段，在原来的回归方程，也就是原来只有800家公司的样本的方程假如imr作为控制变量，其他都不变，然后估计出回归参数。这时不管imr显著不显著都不重要，imr显著说明样本选择偏差的确影响了你最初模型的估计，这正表明了使用Heckman两步法纠正样本选择偏差的必要性。imr不显著说明原模型不存在严重的样本选择偏差，这时Heckman第二步得到的结果应该与原模型得到的结果差不多。(关于imr的显著性是否说明样本选择偏差存在目前还有争议，不过imr不是关注的变量)。第二步关注的对象是核心解释变量是否显著。只要核心解释变量显著，就说明结果稳健。

在stata上的实现，还是刚才的例子。假设问题是研究董事会的连锁懂事比例对企业R&D投入的影响，各变量如下：

因变量：企业R&D投入额度（rd）

自变量：董事会连锁懂事比例（interlockratio）

控制变量：公司规模（firmsize），杠杆率（leverage），公司成长性（growth），公司年龄（age），行业R&D投入（industryrd），行业集中度（cr4），行业净资产收益率（industryroa）等。其中前三个控制变量还会影响企业R&D投入的概率。

总样本数1000家，其中800家披露了R&D投入，不考虑其他变量的缺失值。

Heckman两步法
第一步，命名一个新的因变量，企业是否披露R&D投入，ifrd
xi: probit ifrd firmsize leverage growth i.year i.ind  r//Heckman两阶段的第一阶段回归，这里的r可加可不加，看需不需要控制异方差问题。
estimate store First
predict y_hat, xb
gen pdf = normalden(y_hat)  
gen cdf = normal(y_hat)  
gen imr = pdf/cdf//生成imr
 第二步回归，把imr当作控制变量加入原模型，用原来的数据。
reg rd interlockratio leverage growth industryrd cr4 industryroa imr i.year i.ind , r if ifrd==1

需要注意的是，在第一步，确定哪些变量会影响企业披露其R&D数据时，这些变量不一定是原模型的因变量，可以是可以不是，是不是要说明理由。

例二：
数据地址：点击下载

*数据来源： https://gitee.com/arlionn/data
use womenwk.dta, clear   

*描述性统计数据
sum age educ married children wage 

*简单的ols模型，存在选择性偏误
reg wage educ age
est store OLS

*第一种方法  heckman maximum likelihood
heckman wage educ age, select(married children educ age) //默认最大似然估计
est store HeckMLE

*第二种方法  heckman two-step  all-in-one 不可以进行cluster调整
heckman wage educ age, select(married children educ age) twostep
est store Heck2s

*第二种方法  heckman two-step  step-by-step 可以进行cluster调整
probit work married children educ age
est store First
predict y_hat, xb
gen pdf = normalden(y_hat)  //概率密度函数
gen cdf = normal(y_hat)     //累积分布函数
gen imr = pdf/cdf           //计算逆米尔斯比率
reg  wage educ age imr if work == 1  //女性工作子样本
est store Second
vif  //方差膨胀因子

*对比结果
local m "OLS HeckMLE Heck2s First Second"
esttab `m', mtitle(`m') nogap compress pr2 ar2

在这里插入图片描述
python实现：

import numpy as np
import pandas as pd

# true parameters
rho_t = np.array([0.8])
rho_x_w_t = np.array([0.8])
gamma_t = np.array([.5,1.0])
beta_t = np.array([-2.0,0.5])
sigma_e_t = np.array([1.0])
N =5000

# generate toy data consistent with heckman:
# generate potentially correlated x,w data
mean_x_w = np.array([0,0])
cov_x_w = np.array([[1,rho_x_w_t[0]],[rho_x_w_t[0], 1]])
w, x = np.random.multivariate_normal(mean_x_w, cov_x_w, N).T

# add constant to first position and convert to DataFrame
w_ = pd.DataFrame(np.c_[np.ones(N),w],columns=['Constant (Selection)','Slope (Selection)'])
x_ = pd.DataFrame(np.c_[np.ones(N),x], columns=['Constant','Slope'])

# generate errors 
mean_u_eps = np.array([0,0])
cov_u_eps = np.array([[1,rho_t[0]],[rho_t[0],sigma_e_t]])
u, epsilon = np.random.multivariate_normal(mean_u_eps, cov_u_eps, N).T

# generate latent zstar
zstar = w_.dot(gamma_t) + u
# generate observed z (indicator=1 if zstar is positive)
z = zstar > 0  

# generate latent ystar
ystar = x_.dot(beta_t) + epsilon
y=ystar.copy()
# generate observed y [if z=0, set y to NaN]
y[~z] = np.nan

stata_data = pd.DataFrame(np.c_[y,z,x,w], columns=['y','z','x','w'])
stata_data.to_stata('/tmp/heckman_data.dta')
print(stata_data.head())

import heckman as heckman
res = heckman.Heckman(y, x_, w_).fit(method='twostep')
print(res.summary())

在这里插入图片描述来源：网络地址

"""
Heckman correction for sample selection bias (the Heckit procedure).
Original Code Created August 19, 2014 by B.I.
Last modified February 26, 2017 by B.I.
NO warranty is provided for this software.
"""

import numpy as np
import statsmodels.api as sm
import statsmodels.base.model as base
from statsmodels.iolib import summary
from statsmodels.tools.numdiff import approx_fprime
from scipy.stats import norm
import warnings
import pandas as pd


class Heckman(base.LikelihoodModel):
    """
    Class for Heckman correction for sample selection bias model.
    Parameters
    ----------
    endog : 1darray
        Data for the dependent variable. Should be set to np.nan for
        censored observations.
    exog : 2darray
        Data for the regression (response) equation. If a constant
        term is desired, the user should directly add the constant
        column to the data before using it as an argument here.
    exog_select : 2darray
        Data for the selection equation. If a constant
        term is desired, the user should directly add the constant
        column to the data before using it as an argument here.
    **kwargs:
        missing=, which can be 'none', 'drop', or 'raise'
    Notes
    -----
    The selection equation should contain at least one variable that 
    is not in the regression (response) equation, i.e. the selection
    equation should contain at least one instrument. However, this
    module should still work if the user chooses not to do this.
    """

    def __init__(self, endog, exog, exog_select, **kwargs):
        
        # check that Z has same index as X (and consequently Y through super().__init__)
        if pd.__name__ in type(endog).__module__ and pd.__name__ in type(exog).__module__:
            if not all(endog.index==exog_select.index):
                raise ValueError("Z indices need to be the same as X and Y indices")


        # shape checks
        if (len(endog) == len(exog)) and (len(endog) == len(exog_select)):
            pass
        else:
            raise ValueError("Y, X, and Z data shapes do not conform with each other.")

        try:
            if (endog.ndim == 1) and (exog.ndim <= 2) and (exog_select.ndim <= 2):
                pass
            else:
                raise ValueError("Y, X, and Z data shapes do not conform with each other.")
        except AttributeError:
            if (np.asarray(endog).ndim == 1) and (np.asarray(exog).ndim <= 2) and (np.asarray(exog_select).ndim <= 2):
                pass
            else:
                raise ValueError("Y, X, and Z data shapes do not conform with each other.")

        # give missing (treated) values in endog variable finite values so that super().__init__
        # does not strip them out -- they will be put back after the call to super().__init__
        treated = np.asarray(~np.isnan(endog))
        
        try:
            endog_nomissing = endog.copy()
            endog_nomissing[~treated] = -99999
        except (TypeError, AttributeError):
            endog_nomissing = [endog[i] if treated[i] else -99999 for i in range(len(treated))]

        # create 1-D array that will be np.nan for every row of exog_select that has any missing
        # values and a finite value otherwise for the call to super().__init__ so that it can
        # strip out rows where exog_select has missing data if missing option is set

        if np.asarray(exog_select).ndim==2:
            exog_select_1dnan_placeholder = \
                [np.nan if any(np.isnan(row)) else 1 for row in np.asarray(exog_select)]
        else:  # assume ==1
            exog_select_1dnan_placeholder = [np.nan if np.isnan(row) else 1 for row in np.asarray(exog_select)]

        if pd.__name__ in type(endog).__module__:
            exog_select_1dnan_placeholder = pd.Series(exog_select_1dnan_placeholder, index=endog.index)
        else:
            exog_select_1dnan_placeholder = np.array(exog_select_1dnan_placeholder)

        # create array of sequential row positions so that rows of exog_select that have missing
        # data can be identified after call to super().__init__
        obsno = np.array(list(range(len(endog))))

        # call super().__init__
        super(Heckman, self).__init__(
            endog_nomissing, exog=exog, 
            exog_select_1dnan_placeholder=exog_select_1dnan_placeholder, obsno=obsno,
            treated=treated, 
            **kwargs)

        # put np.nan back into endog for treated rows
        self.endog = self.data.endog = \
            np.asarray(
                [self.endog[i] if self.treated[i] else np.nan for i in range(len(self.treated))]
                )

        # strip out rows stripped out by call to super().__init__ in Z variable
        self.exog_select = np.asarray([np.asarray(exog_select)[obs] for obs in self.obsno])

        # store variable names of exog_select
        try:
            self.exog_select_names = exog_select.columns.tolist()
        except AttributeError:
            self.exog_select_names = None

        # delete attributes created by the call to super().__init__ that are no longer needed
        del self.exog_select_1dnan_placeholder
        del self.obsno

        
        # store observation counts
        self.nobs_total = len(endog)
        self.nobs_uncensored = self.nobs = np.sum(self.treated)
        self.nobs_censored = self.nobs_total - self.nobs_uncensored


    def initialize(self):
        self.wendog = self.endog
        self.wexog = self.exog
        

    def whiten(self, data):
        """
        Model whitener for Heckman correction model does nothing.
        """
        return data


    def get_datamats(self):
        Y = np.asarray(self.endog)
        Y = Y[self.treated]

        X = np.asarray(self.exog)
        X = X[self.treated,:]
        if X.ndim

最低0.47元/天解锁文章