IV in R


http://eclr.humanities.manchester.ac.uk/index.php/IV_in_R

 

In this section we will demonstrate how to use instrumental variables (IV) estimation (or more precisely, Two-Stage Least Squares, 2SLS) to estimate the parameters of a linear regression model.

 

Here we will be very short on the problem setup and big on the implementation! 

 

When you estimate a linear regression model, say

y = α0 + α1 x1 + α2 x2 + α3 x3 + u

the most crucial of all assumptions you have to make is that the explanatory variables x1 to x3 are uncorrelated with the error term u.

 

Of course, the error term u is unobservable and hence it is impossible to empirically test this assumption (notwithstanding a related test introduced below), and the applied econometrician ought to think very carefully about whether there is any reason to suspect that this assumption is breached.

 

The seasoned econometrician would immediately rattle off simultaneity bias, measurement error and omitted relevant variables as the three most common causes of such a breach.

 

In some such situations you can actually fix the problem, e.g. by including the relevant variable in the model, but in others that is impossible and you need to accept that there is a high probability that, say, x3 is correlated with u. We would then call x3 an endogenous variable, and all those explanatory variables that do not have that issue are called exogenous. If you still persist with estimating that model by Ordinary Least Squares (OLS), you will have to accept that your estimated coefficients come from a random distribution that on average will not produce the correct (yet unknown) value; in technical lingo, the estimators are biased.
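To see the bias in action, here is a minimal simulation sketch (all variable names here are hypothetical and not part of the example below): x3 is constructed to be correlated with u, and OLS overestimates the true slope of 1.

    # Minimal simulation sketch: an endogenous regressor biases OLS
    set.seed(123)
    n  <- 10000
    u  <- rnorm(n)             # the unobservable error term
    x3 <- 0.5 * u + rnorm(n)   # x3 is built to be correlated with u -> endogenous
    y  <- 1 + 1 * x3 + u       # the true coefficient on x3 is 1
    coef(lm(y ~ x3))           # the OLS slope comes out around 1.4, not 1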

 

 

Implementation in R

The R package needed is the AER package that we already recommended for use in the context of estimating robust standard errors. Included in that package is a function called ivreg, which we will use. We explain how to use it by walking through an example.

If you use IV a lot in your work, you may well want to pack all of the following into one convenient function (just as Alan Fernihough has done). But if you are applying IV for the first time it is actually very instructive to go through some of the steps in a bit more detail. It is good to see that really there is not a lot of technical magic involved ... just a great idea!
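Such a wrapper might, for illustration only, look like the following sketch; the function name and interface are our own invention, not Fernihough's:

    # Hypothetical convenience wrapper around AER::ivreg (illustration only)
    iv_quick <- function(formula, data) {
      fit <- AER::ivreg(formula, data = data)   # 2SLS fit
      print(summary(fit, diagnostics = TRUE))   # coefficients plus IV diagnostics
      invisible(fit)
    }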

 

 

 

Example

We will use the Women's Wages dataset to illustrate the use of IV regression. The dependent variable is the log wage, lwage, and we are interested in whether the years of education, educ, have a positive influence on this log wage (here we mirror the analysis in Wooldridge's Example 15.1).

Let's first import the data

 

 

 

setwd("YOUR/DIRECTORY/PATH")              # This sets the working directory

mydata <- read.csv("mroz.csv", na.strings = ".")  # Opens mroz.csv from #working directory
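Incidentally, if you do not have mroz.csv to hand, the same dataset ships with the wooldridge package on CRAN (assuming you have it installed); the variable names match the ones used below.

    # Alternative data source, assuming the 'wooldridge' package is installed
    data("mroz", package = "wooldridge")
    mydata <- mroz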

 

And let's also remove all observations with missing wages from the dataframe

 

mydata <- subset(mydata, is.na(wage) == FALSE)  # remove observations with missing wages

 

An extremely simple model would be to estimate the following OLS regression,

lwage = β0 + β1 educ + u

which models lwage as a function of a constant and educ.

 

 

reg_ex0 <- lm(lwage~educ,data=mydata)
print(summary(reg_ex0))

 

which delivers
    Call:
    lm(formula = lwage ~ educ, data = mydata)
    Residuals:
         Min       1Q   Median       3Q      Max 
    -3.10256 -0.31473  0.06434  0.40081  2.10029 
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -0.1852     0.1852  -1.000    0.318    
    educ          0.1086     0.0144   7.545 2.76e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual standard error: 0.68 on 426 degrees of freedom
      (325 observations deleted due to missingness)
    Multiple R-squared:  0.1179, Adjusted R-squared:  0.1158 
    F-statistic: 56.93 on 1 and 426 DF,  p-value: 2.761e-13

 

 

This seems to indicate that every additional year of education increases the wage by almost 11% (recall the interpretation of a coefficient in a log-linear model!).
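As a quick sanity check on that interpretation, the exact percentage effect implied by the coefficient is exp(0.1086) − 1 ≈ 11.5%:

    # Exact percentage effect implied by the log-linear coefficient
    exp(coef(reg_ex0)["educ"]) - 1   # about 0.115, i.e. an 11.5% wage increase per year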

 

The issue with this sort of model is that education is most likely to be correlated with individual characteristics that are important for the person's wage, but not modelled (and hence captured by the error term).

 

 

What we need is an instrument that meets the two conditions outlined above: it must be uncorrelated with the error term (instrument exogeneity) and correlated with the endogenous variable (instrument relevance). As in Wooldridge's example, we use the father's education, fatheduc, as an instrument. The way to do this is as follows:

 

reg_iv0 <- ivreg(lwage~educ|fatheduc,data=mydata)
print(summary(reg_iv0))

 

The ivreg function works very similarly to the lm command (as usual, use ?ivreg to get more detailed help). In fact the only difference is the specification of the instrument, |fatheduc, which follows the model specification. After the vertical line we find the instrument used to instrument the educ variable[1].

The result is

    Call:
    ivreg(formula = lwage ~ educ | fatheduc, data = mydata)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.0870 -0.3393  0.0525  0.4042  2.0677 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  0.44110    0.44610   0.989   0.3233  
    educ         0.05917    0.03514   1.684   0.0929 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.6894 on 426 degrees of freedom
    Multiple R-Squared: 0.09344, Adjusted R-squared: 0.09131 
    Wald test: 2.835 on 1 and 426 DF,  p-value: 0.09294 

 

 

Clearly, the estimated effect of education is now much smaller (0.059 as opposed to 0.109) and only marginally significant. It is, of course, often a feature of IV estimation that the estimated standard errors are significantly larger than those of the OLS estimators. The size of the standard error depends a lot on the strength of the relation between the endogenous explanatory variable and the instrument, which can be checked by looking at the R-squared of the regression of educ on fatheduc[2].
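To see that there really is no technical magic involved, here is a sketch of the two stages done by hand. The second-stage slope reproduces the ivreg coefficient exactly, although the standard errors lm reports in the second stage are not the correct IV ones (one good reason to use ivreg):

    # First stage: regress the endogenous variable on the instrument
    stage1 <- lm(educ~fatheduc, data=mydata)
    summary(stage1)$r.squared            # about 0.1958, see footnote [2]

    # Second stage: replace educ by its fitted values
    stage2 <- lm(lwage~stage1$fitted.values, data=mydata)
    coef(stage2)                         # the slope matches the ivreg estimate of 0.05917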

In order to illustrate the full functionality of the ivreg procedure we re-estimate the model with extra explanatory variables and more instruments than endogenous variables, which means that we are really applying a 2SLS estimation (this is the example estimated in Wooldridge's Example 15.5). Let's start by estimating this model by OLS (we need this result later; the output is not shown here):

 

 

    reg_1 <- lm(lwage~educ+age+exper+expersq, data=mydata)  # OLS estimation
    reg_1_sm <- summary(reg_1)
    print(reg_1_sm)

 

The estimated coefficient for educ is 0.1075 with standard error 0.0142 (the rest of the results are not shown). Then we estimate the 2SLS regression with fatheduc and motheduc as instruments.

 

    reg_iv1 <- ivreg(lwage~educ+age+exper+expersq|fatheduc+motheduc+age+exper+expersq,data=mydata)
    print(summary(reg_iv1))

 

Before the vertical line we can see the model that is to be estimated, lwage~educ+age+exper+expersq. All the action is after the vertical line.

 

 

First we see the instrumental variables used to instrument educ, fatheduc+motheduc; this is followed by all the explanatory variables that are considered exogenous, age+exper+expersq.

 

When you have a model with a lot of variables this way of calling an IV estimation can be quite unwieldy, as you have to replicate all the exogenous variables. A slightly different, more economical way of asking R to do the same thing is as follows:

 

    reg_iv1 <- ivreg(lwage~educ+age+exper+expersq|.-educ+fatheduc+motheduc,data=mydata)
    print(summary(reg_iv1))

After the vertical line you are basically telling R which variable to remove from the instrument set (the endogenous variable, .-educ) and which to add (+fatheduc+motheduc). Make sure you don't forget the dot straight after the vertical line when you use this way of specifying the instruments. What you get is the following:

 

 Call:
    ivreg(formula = lwage ~ educ + age + exper + expersq | . - educ + 
        fatheduc + motheduc, data = mydata)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.1017 -0.3216  0.0545  0.3685  2.3496 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
    (Intercept)  0.0667186  0.4591101   0.145   0.8845   
    educ         0.0609945  0.0315798   1.931   0.0541 . 
    age         -0.0003542  0.0049318  -0.072   0.9428   
    exper        0.0441960  0.0134524   3.285   0.0011 **
    expersq     -0.0008947  0.0004075  -2.196   0.0287 * 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.6756 on 423 degrees of freedom
    Multiple R-Squared: 0.1353, Adjusted R-squared: 0.1272 
    Wald test: 6.085 on 4 and 423 DF,  p-value: 9.085e-05

 

 

 

Again we can see the significantly reduced size of the educ effect when compared to the OLS estimation.

 

 

IV-related testing procedures

 

One feature of IV estimation is that in general it is an inferior estimator of β if all explanatory variables are exogenous. In that case, assuming that all other Gauss-Markov assumptions are met, the OLS estimator is the BLUE estimator. In other words, IV estimators have larger standard errors for the coefficient estimates. Therefore, one would really like to avoid having to rely on IV estimators, unless, of course, they are the only estimators that deliver consistent estimates.
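A quick sketch of that point, assuming reg_1 (OLS) and reg_iv1 (2SLS) from above are still in memory:

    # Compare the standard error on educ under OLS and 2SLS
    se_ols <- summary(reg_1)$coefficients["educ", "Std. Error"]
    se_iv  <- summary(reg_iv1)$coefficients["educ", "Std. Error"]
    print(c(OLS = se_ols, IV = se_iv))   # the IV standard error is more than twice as large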

 

So there are usually three tests one performs in this context: firstly, a test of whether the chosen instruments are indeed sufficiently strongly correlated with the endogenous variable (instrument relevance); secondly, whether the potentially endogenous variable is indeed endogenous (testing for exogeneity); and finally, whether the instruments are indeed exogenous (instrument validity).
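The following subsections construct each of these tests by hand. As an aside, recent versions of the AER package can report versions of all three in one go via the diagnostics argument of summary:

    # Weak instruments (relevance), Wu-Hausman (exogeneity) and Sargan (validity)
    print(summary(reg_iv1, diagnostics = TRUE))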

 

 

Instrument relevance

 

The entire 2SLS procedure hinges on the instruments chosen being useful instruments. Useful here means that they are sufficiently strongly correlated with the endogenous variable. We can use the first-stage regression (described in the Introduction) to test whether that is indeed the case. So here is the first-stage regression:

 

    # First Stage
    first_stage <- lm(educ~age+exper+expersq+fatheduc+motheduc,data=mydata)

 

What we now need to know is whether the instruments fatheduc and motheduc explain a sufficient amount of variation in educ. We can use a standard F-test to test this. The easiest way to implement this is via the waldtest function (from the lmtest package, which is loaded together with AER):

    instrFtest <- waldtest(first_stage,.~.-fatheduc-motheduc)
    print(instrFtest)

    Wald test
    Model 1: educ ~ age + exper + expersq + fatheduc + motheduc
    Model 2: educ ~ age + exper + expersq
      Res.Df Df      F    Pr(>F)    
    1    747                        
    2    749 -2 116.74 < 2.2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

 

The value of the F-test is 116.74 with an extremely low p-value, so in this case we can clearly reject the null hypothesis that the instruments are irrelevant.

 

 

 

Testing for exogeneity

 

You really only want to use IV/2SLS if you are actually dealing with endogenous explanatory variables. If the variable you suspected wasn't endogenous, then IV only has disadvantages compared to OLS; most crucially, it will deliver much larger standard errors. For this reason you really want to make sure that you do have an endogeneity problem. The celebrated test to use in this case is the Hausman test. Here we use a slightly different implementation to the original Hausman test, the so-called Hausman-Wu test (also known as the Durbin-Wu-Hausman test).

 

In the end it is pretty straightforward and you only need simple regressions to implement it. In a first step you run the first-stage regression(s) of the 2SLS procedure. In our case this is

    # First Stage
    first_stage <- lm(educ~age+exper+expersq+fatheduc+motheduc,data=mydata)

 

In a second step you add the residual(s) from this first step into the original model

    # Hausman
    Hausman_reg <- lm(lwage~educ+age+exper+expersq+first_stage$residuals,data=mydata)
    print(summary(Hausman_reg))

 

 

 

Now we need to compare this result to the one we got from the original model, reg_1.

 

If educ is indeed endogenous, then the first-stage regression should have isolated, in its residual, the variation of educ that is correlated with the error term. In that case the included first_stage$residuals should be relevant. As there may potentially be more than one endogenous variable, and hence more than one first-stage residual, we use an F-test to test the null hypothesis that these residuals are irrelevant (and hence that endogeneity is not a problem).

 

 

    HausWutest <- waldtest(Hausman_reg,.~.-first_stage$residuals)
    print(HausWutest)

    Wald test
    Model 1: lwage ~ educ + age + exper + expersq + first_stage$residuals
    Model 2: lwage ~ educ + age + exper + expersq
      Res.Df Df      F  Pr(>F)  
    1    422                    
    2    423 -1 2.9051 0.08903 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

The result is a p-value of 0.089, so at α=0.05 we just fail to reject the null of educ being exogenous.

 

 

Sargan test for instrument validity

 

One crucial property of instruments is that they ought to be uncorrelated with the regression error term u. Instrument exogeneity is set as the null hypothesis of the following test, with the alternative hypothesis being that the instruments are endogenous. This test can only be applied if you have more instruments than endogenous variables; it is therefore sometimes also called the test for overidentifying restrictions.

 

The test is rather simple to implement. Take the residuals from the 2SLS regression, reg_iv1$residuals, and use them as the dependent variable in a new regression in which you regress them on all exogenous explanatory variables and all instruments.

 

    Sargan_reg <- lm(reg_iv1$residuals~age+exper+expersq+fatheduc+motheduc,data=mydata)
    Sargan_reg_sm <- summary(Sargan_reg)

 

If the instruments are valid (the null hypothesis), they should be uncorrelated with these residuals and hence we apply the following χ² test: we use the R² of this regression and calculate n·R².

 

 

    Sargan_test <- Sargan_reg_sm$r.squared*nrow(mydata)
    print(Sargan_test)
    print(1-pchisq(Sargan_test,1))  # prints p-value

 

We find that the p-value of this test is 0.5260 and hence we do not reject the null hypothesis of instrument validity. The p-value was obtained from a χ² distribution with one degree of freedom: we had two instruments for one endogenous variable, leaving 2 − 1 = 1 overidentifying restriction.

 

 

Footnotes
[1] The order of the variables after the vertical line doesn't matter.
[2] Which turns out to be 0.1958 if you check it.

 

 


For reference, here is the complete script once more, together with the output it produces:

      library(AER)
      library(foreign)
      library(sandwich)

      setwd("D://RStudio//exercise")
      mydata <- read.csv("mroz.csv", na.strings = ".")
      mydata <- subset(mydata, is.na(wage) == FALSE)

      reg_ex0 <- lm(lwage~educ,data=mydata)
      print(summary(reg_ex0))
      
   

 

Call:
lm(formula = lwage ~ educ, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10256 -0.31473  0.06434  0.40081  2.10029 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1852     0.1852  -1.000    0.318    
educ          0.1086     0.0144   7.545 2.76e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 426 degrees of freedom
Multiple R-squared:  0.1179,    Adjusted R-squared:  0.1158 
F-statistic: 56.93 on 1 and 426 DF,  p-value: 2.761e-13

 

 

 

      reg_iv0 <- ivreg(lwage~educ|fatheduc,data=mydata)
      print(summary(reg_iv0))
     

Call:
ivreg(formula = lwage ~ educ | fatheduc, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0870 -0.3393  0.0525  0.4042  2.0677 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.44110    0.44610   0.989   0.3233  
educ         0.05917    0.03514   1.684   0.0929 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6894 on 426 degrees of freedom
Multiple R-Squared: 0.09344,    Adjusted R-squared: 0.09131 
Wald test: 2.835 on 1 and 426 DF,  p-value: 0.09294 
 

 

      reg_1 <- lm(lwage~educ+age+exper+expersq, data=mydata)  # OLS estimation
      reg_1_sm <- summary(reg_1)
      print(reg_1_sm)

Call:
lm(formula = lwage ~ educ + age + exper + expersq, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.08521 -0.30587  0.04946  0.37523  2.37077 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.5333762  0.2778430  -1.920  0.05557 .  
educ         0.1075228  0.0141745   7.586 2.12e-13 ***
age          0.0002836  0.0048553   0.058  0.95344    
exper        0.0415623  0.0131909   3.151  0.00174 ** 
expersq     -0.0008152  0.0003996  -2.040  0.04195 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6672 on 423 degrees of freedom
Multiple R-squared:  0.1568,    Adjusted R-squared:  0.1489 
F-statistic: 19.67 on 4 and 423 DF,  p-value: 7.328e-15
 

 

      reg_iv1 <- ivreg(lwage~educ+exper+expersq|fatheduc+motheduc+exper+expersq,data=mydata)
      print(summary(reg_iv1))

Call:
ivreg(formula = lwage ~ educ + exper + expersq | fatheduc + motheduc + 
    exper + expersq, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0986 -0.3196  0.0551  0.3689  2.3493 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.0481003  0.4003281   0.120  0.90442   
educ         0.0613966  0.0314367   1.953  0.05147 . 
exper        0.0441704  0.0134325   3.288  0.00109 **
expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6747 on 424 degrees of freedom
Multiple R-Squared: 0.1357,    Adjusted R-squared: 0.1296 
Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05 
 

      reg_iv1 <- ivreg(lwage~educ+exper+expersq|.-educ+fatheduc+motheduc,data=mydata)
      print(summary(reg_iv1))

Call:
ivreg(formula = lwage ~ educ + exper + expersq | . - educ + fatheduc + 
    motheduc, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0986 -0.3196  0.0551  0.3689  2.3493 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.0481003  0.4003281   0.120  0.90442   
educ         0.0613966  0.0314367   1.953  0.05147 . 
exper        0.0441704  0.0134325   3.288  0.00109 **
expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6747 on 424 degrees of freedom
Multiple R-Squared: 0.1357,    Adjusted R-squared: 0.1296 
Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05 
 

   
      first_stage <- lm(educ~age+exper+expersq+fatheduc+motheduc,data=mydata)
      instrFtest <- waldtest(first_stage,.~.-fatheduc-motheduc)
      print(instrFtest)
 

Wald test

Model 1: educ ~ age + exper + expersq + fatheduc + motheduc
Model 2: educ ~ age + exper + expersq
  Res.Df Df      F    Pr(>F)    
1    422                        
2    424 -2 54.943 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

      first_stage <- lm(educ~age+exper+expersq+fatheduc+motheduc,data=mydata)
      Hausman_reg <- lm(lwage~educ+age+exper+expersq+first_stage$residuals,data=mydata)
      print(summary(Hausman_reg))
   

Call:
lm(formula = lwage ~ educ + age + exper + expersq + first_stage$residuals, 
    data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.04272 -0.30286  0.03749  0.40637  2.33167 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.0667186  0.4524011   0.147 0.882826    
educ                   0.0609945  0.0311183   1.960 0.050643 .  
age                   -0.0003542  0.0048597  -0.073 0.941927    
exper                  0.0441960  0.0132558   3.334 0.000931 ***
expersq               -0.0008947  0.0004015  -2.228 0.026387 *  
first_stage$residuals  0.0586438  0.0349356   1.679 0.093965 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6658 on 422 degrees of freedom
Multiple R-squared:  0.1624,    Adjusted R-squared:  0.1525 
F-statistic: 16.37 on 5 and 422 DF,  p-value: 9.055e-15
 

      HausWutest <- waldtest(Hausman_reg,.~.-first_stage$residuals)
      print(HausWutest)

Wald test

Model 1: lwage ~ educ + age + exper + expersq + first_stage$residuals
Model 2: lwage ~ educ + age + exper + expersq
  Res.Df Df      F  Pr(>F)  
1    422                    
2    423 -1 2.8178 0.09397 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Sargan_reg <- lm(reg_iv1$residuals~age+exper+expersq+fatheduc+motheduc,data=mydata)
      Sargan_reg_sm <- summary(Sargan_reg)

      Sargan_test <- Sargan_reg_sm$r.squared*nrow(mydata)
      print(Sargan_test)
      print(1-pchisq(Sargan_test,1))  # prints p-value

[1] 0.4075297
