03 Basic Spatial Econometric Models: SLX, SAR, SEM

I had some spare time over the past couple of days, so I worked through a few spatial econometric models as a hands-on exercise using the pysal package. The original tutorial is linked in the references; it explains spatial heterogeneity, spatial dependence, and the three basic spatial econometric models SLX, SAR, and SEM. Most of the many other variants are simply combinations of two or all three of these models. This post does not go over spatial heterogeneity and spatial dependence again; it only shows how to implement SLX, SAR, and SEM with pysal. If you want the full treatment, read the original tutorial. Also, make sure pysal is updated to the latest version, otherwise the code in this post may not run.

1. The Data and an OLS Regression

The Data: San Diego AirBnB, which you can download yourself. The data are nightly AirBnB listing prices for San Diego. They include the price, which is the dependent variable in the analysis that follows, plus several continuous explanatory variables (how many people a listing accommodates, the number of bathrooms, bedrooms, beds, and so on), a categorical variable (neighborhood), and a set of dummy variables such as pg_Apartment.

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

import pysal
import spreg
from libpysal import weights
import esda
from scipy import stats
import statsmodels.formula.api as sm
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns


db = gpd.read_file('regression_db.geojson')
db.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 6110 entries, 0 to 6109
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   accommodates        6110 non-null   int64   
 1   bathrooms           6110 non-null   float64 
 2   bedrooms            6110 non-null   float64 
 3   beds                6110 non-null   float64 
 4   neighborhood        6110 non-null   object  
 5   pool                6110 non-null   int64   
 6   d2balboa            6110 non-null   float64 
 7   coastal             6110 non-null   int64   
 8   price               6110 non-null   float64 
 9   log_price           6110 non-null   float64 
 10  id                  6110 non-null   int64   
 11  pg_Apartment        6110 non-null   int64   
 12  pg_Condominium      6110 non-null   int64   
 13  pg_House            6110 non-null   int64   
 14  pg_Other            6110 non-null   int64   
 15  pg_Townhouse        6110 non-null   int64   
 16  rt_Entire_home/apt  6110 non-null   int64   
 17  rt_Private_room     6110 non-null   int64   
 18  rt_Shared_room      6110 non-null   int64   
 19  geometry            6110 non-null   geometry
dtypes: float64(6), geometry(1), int64(12), object(1)
memory usage: 954.8+ KB
db.head(2)
   accommodates  bathrooms  bedrooms  beds neighborhood  pool   d2balboa  coastal  price  log_price    id  \
0             5        2.0       2.0   2.0  North Hills     0   2.972077        0  425.0   6.052089     6
1             6        1.0       2.0   4.0  Mission Bay     0  11.501385        1  205.0   5.323010  5570

   pg_Apartment  pg_Condominium  pg_House  pg_Other  pg_Townhouse  rt_Entire_home/apt  rt_Private_room  rt_Shared_room  \
0             0               0         1         0             0                   1                0               0
1             0               1         0         0             0                   1                0               0

                      geometry
0  POINT (-117.12971 32.75399)
1  POINT (-117.25253 32.78421)

These are the explanatory variables we will use throughout this post.

variable_names = [
    'accommodates', 
    'bathrooms', 
    'bedrooms', 
    'beds', 
    'rt_Private_room', 
    'rt_Shared_room',
    'pg_Condominium', 
    'pg_House', 
    'pg_Other', 
    'pg_Townhouse'
]

The code below runs a plain OLS regression with no spatial effects at all: we simply pass in the dependent and explanatory variables. You could also do this with statsmodels, with much the same result (a statsmodels version is sketched right after the report below).

m1 = spreg.OLS(
    y = db[['log_price']].values, 
    x = db[variable_names].values,
    name_y='log_price', 
    name_x=variable_names
)

Note that Number of Variables includes the CONSTANT term, so there are 11 variables in total.

print(m1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
R-squared           :      0.6683
Adjusted R-squared  :      0.6678
Sum squared residual:    1320.148                F-statistic           :   1229.0564
Sigma-square        :       0.216                Prob(F-statistic)     :           0
S.E. of regression  :       0.465                Log likelihood        :   -3988.895
Sigma-square ML     :       0.216                Akaike info criterion :    7999.790
S.E of regression ML:      0.4648                Schwarz criterion     :    8073.685

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.3883830       0.0161147     272.3217773       0.0000000
        accommodates       0.0834523       0.0050781      16.4336318       0.0000000
           bathrooms       0.1923790       0.0109668      17.5419773       0.0000000
            bedrooms       0.1525221       0.0111323      13.7009195       0.0000000
                beds      -0.0417231       0.0069383      -6.0134430       0.0000000
     rt_Private_room      -0.5506868       0.0159046     -34.6244758       0.0000000
      rt_Shared_room      -1.2383055       0.0384329     -32.2198992       0.0000000
      pg_Condominium       0.1436347       0.0221499       6.4846529       0.0000000
            pg_House      -0.0104894       0.0145315      -0.7218393       0.4704209
            pg_Other       0.1411546       0.0228016       6.1905633       0.0000000
        pg_Townhouse      -0.0416702       0.0342758      -1.2157316       0.2241342
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           11.964

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        2671.611           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               10         322.532           0.0000
Koenker-Bassett test             10         135.581           0.0000
================================ END OF REPORT =====================================
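As mentioned above, the same regression can be run with statsmodels. A minimal sketch (not part of the original post), using the sm alias imported earlier; the estimated coefficients should match m1:

# build an OLS formula from the variable list and fit it with statsmodels
f = 'log_price ~ ' + ' + '.join(variable_names)
m1_sm = sm.ols(f, data=db).fit()
print(m1_sm.params)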

Next, let's see whether coastal and non-coastal houses behave differently. Because we have not accounted for any spatial effect so far, the regression residuals may well not be white noise, so let's verify whether a spatial effect really is present in the data. There are of course many kinds of spatial effects; coastal vs. non-coastal is just one of them.

is_coastal = db.coastal.astype(bool)
coastal = m1.u[is_coastal]       # residuals of the coastal listings
not_coastal = m1.u[~is_coastal]  # residuals of the non-coastal listings
plt.hist(coastal, density=True, label='Coastal')
plt.hist(
    not_coastal, 
    histtype='step',
    density=True, 
    linewidth=4, 
    label='Not Coastal'
)
plt.vlines(0,0,1, linestyle=":", color='k', linewidth=4)
plt.legend()
plt.show()

[Figure: histograms of OLS residuals for coastal vs. non-coastal listings]

While it appears that the neighborhoods on the coast have only slightly higher average errors (and have lower variance in their prediction errors), the two distributions are significantly distinct from one another when compared using a classic $t$ test:

stats.ttest_ind(coastal, not_coastal)
Ttest_indResult(statistic=array([13.98193858]), pvalue=array([9.442438e-44]))
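The difference in means behind this test can also be checked directly (a small addition, not in the original post):

# compare average residuals for the two groups defined above
print(coastal.mean(), not_coastal.mean())  # coastal listings have the higher average residual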

Next, let's build a spatial weights matrix for this sample. For an introduction to spatial weights matrices, see posts 01 and 02 of this series; here a distance-based (K-nearest-neighbor) weights matrix is used.

knn = weights.KNN.from_dataframe(db, k=1)  # K-nearest-neighbor weights (k=1: each listing's single nearest neighbor)
knn
<libpysal.weights.distance.KNN at 0xaf77427fc8>
knn.weights
{0: [1.0],
 1: [1.0],
 2: [1.0],
 3: [1.0],
 4: [1.0],
 ...}

This means that, when we compute the spatial lag of that KNN weight and the residual, we get the residual of the AirBnB listing closest to each observation.

lag_residual = weights.spatial_lag.lag_spatial(knn, m1.u)  # spatial lag of the residuals (each listing's nearest neighbor's residual)
ax = sns.regplot(
    m1.u.flatten(),
    lag_residual.flatten(),
    line_kws=dict(color='orangered'),
    ci=None
)
ax.set_xlabel('Model Residuals - $u$')
ax.set_ylabel('Spatial Lag of Model Residuals - $W u$');

[Figure: scatter of model residuals $u$ against their spatial lag $Wu$, with fitted regression line]

lag_residual
array([[-0.59201055],
       [ 0.65057166],
       [ 0.00970927],
       ...,
       [-0.40444691],
       [ 0.07424779],
       [-0.30458876]])

The figure above makes it quite clear that a spatial effect is present in the residuals. The x-axis is each observation's residual and the y-axis is the residual of its nearest neighbor. If the residuals were randomly distributed there would be no relationship between the two, but we can see a clear clustering pattern, and the slope of the fitted line is significantly different from zero. Plain OLS is therefore problematic here. Below we first run a Spatial Regimes regression, again with spreg.
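A more formal confirmation (a hedged sketch, not in the original post) is a Moran's I test on the residuals using esda, which was imported above; a positive and significant statistic indicates spatially clustered residuals:

# Moran's I of the OLS residuals under the KNN weights built above
moran = esda.Moran(m1.u.flatten(), knn, permutations=999)
print(moran.I, moran.p_sim)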

2. Spatial Regimes

The specification is shown below. It is fairly easy to understand: each regime (here, each region) gets its own intercept and, as the formula shows, its own slope coefficients as well, in the hope of capturing spatial heterogeneity.

$$\log{P_i} = \alpha_r + \sum_k \mathbf{X}_{ki}\beta_{k-r} + \epsilon_i$$

where we are not only allowing the constant term to vary by region ($\alpha_r$), but also every other parameter ($\beta_{k-r}$).

To illustrate this approach, we will use the "spatial differentiator" of whether a house is in a coastal neighborhood or not (coastal) to define the regimes. The rationale behind this choice is that renting a house close to the ocean might be a strong enough pull that people might be willing to pay at different rates for each of the house's characteristics.

m4 = spreg.OLS_Regimes(
    y = db[['log_price']].values, 
    x = db[variable_names].values,
    regimes = db['coastal'].tolist(),
    constant_regi='many',
    regime_err_sep=False,
    name_y='log_price', 
    name_x=variable_names
)
print(m4.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES - REGIMES
---------------------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          22
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6088
R-squared           :      0.6853
Adjusted R-squared  :      0.6843
Sum squared residual:    1252.489                F-statistic           :    631.4283
Sigma-square        :       0.206                Prob(F-statistic)     :           0
S.E. of regression  :       0.454                Log likelihood        :   -3828.169
Sigma-square ML     :       0.205                Akaike info criterion :    7700.339
S.E of regression ML:      0.4528                Schwarz criterion     :    7848.128

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
          0_CONSTANT       4.4072424       0.0215156     204.8392695       0.0000000
      0_accommodates       0.0901860       0.0064737      13.9311338       0.0000000
         0_bathrooms       0.1433760       0.0142680      10.0487871       0.0000000
          0_bedrooms       0.1129626       0.0138273       8.1695568       0.0000000
              0_beds      -0.0262719       0.0088380      -2.9726102       0.0029644
   0_rt_Private_room      -0.5293343       0.0189179     -27.9805699       0.0000000
    0_rt_Shared_room      -1.2244586       0.0425969     -28.7452834       0.0000000
    0_pg_Condominium       0.1053065       0.0281309       3.7434523       0.0001832
          0_pg_House      -0.0454471       0.0179571      -2.5308637       0.0114032
          0_pg_Other       0.0607526       0.0276365       2.1982715       0.0279673
      0_pg_Townhouse      -0.0103973       0.0456730      -0.2276456       0.8199294
          1_CONSTANT       4.4799043       0.0250938     178.5260014       0.0000000
      1_accommodates       0.0484639       0.0078806       6.1497397       0.0000000
         1_bathrooms       0.2474779       0.0165661      14.9388057       0.0000000
          1_bedrooms       0.1897404       0.0179229      10.5864676       0.0000000
              1_beds      -0.0506077       0.0107429      -4.7107925       0.0000025
   1_rt_Private_room      -0.5586281       0.0283122     -19.7309699       0.0000000
    1_rt_Shared_room      -1.0528541       0.0841745     -12.5079997       0.0000000
    1_pg_Condominium       0.2044470       0.0339434       6.0231780       0.0000000
          1_pg_House       0.0753534       0.0233783       3.2232188       0.0012743
          1_pg_Other       0.2954848       0.0386455       7.6460385       0.0000000
      1_pg_Townhouse      -0.0735077       0.0493672      -1.4889984       0.1365396
------------------------------------------------------------------------------------
Regimes variable: unknown

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           14.033

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        3977.425           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               21         443.593           0.0000
Koenker-Bassett test             21         164.276           0.0000

REGIMES DIAGNOSTICS - CHOW TEST
                 VARIABLE        DF        VALUE           PROB
                 CONSTANT         1           4.832           0.0279
             accommodates         1          16.736           0.0000
                bathrooms         1          22.671           0.0000
                 bedrooms         1          11.504           0.0007
                     beds         1           3.060           0.0802
           pg_Condominium         1           5.057           0.0245
                 pg_House         1          16.793           0.0000
                 pg_Other         1          24.410           0.0000
             pg_Townhouse         1           0.881           0.3480
          rt_Private_room         1           0.740           0.3896
           rt_Shared_room         1           3.309           0.0689
              Global test        11         328.869           0.0000
================================ END OF REPORT =====================================
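For intuition, this regimes specification is equivalent to interacting every regressor (and the intercept) with the coastal dummy. A hedged statsmodels sketch of that idea (not from the original post; the parameterization differs from spreg's regime-specific coefficients, but the fitted model is the same):

# full interaction of all regressors with the coastal regime dummy
f = 'log_price ~ (' + ' + '.join(variable_names) + ') * coastal'
m4_sm = sm.ols(f, data=db).fit()
print(m4_sm.rsquared)  # should match the R-squared of the spreg regimes model (~0.685)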

3. Exogenous Effects: The SLX Model

The formula is below: we simply add the spatial lag of the explanatory variables to the model. This is the simplest spatial econometric model; since only the explanatory variables get a spatial lag term, estimation is still ordinary OLS and the interpretation is largely the same as for a plain OLS model. I won't go into the details here; see Shen Tiyan's book 《空间计量经济学》 (Spatial Econometrics).
$$\log(P_i) = \alpha + \sum_{k=1}^{p} X_{ik}\beta_k + \sum_{k=1}^{p}\left(\sum_{j=1}^{N} w_{ij}x_{jk}\right)\gamma_k + \epsilon_i$$

where $\sum_{j=1}^N w_{ij}x_{jk}$ represents the spatial lag of the $k$th explanatory variable.
This can be stated in matrix form using the spatial weights matrix, $\mathbf{W}$, as:

$$\log(P_i) = \alpha + \mathbf{X}\beta + \mathbf{WX}\gamma + \epsilon$$

This splits the model into two main effects: $\beta$ and $\gamma$. The $\beta$ effect describes the change in $y_i$ when $X_{ik}$ changes by one.[1] The subscript for site $i$ is important here: since we're dealing with a $\mathbf{W}$ matrix, it's useful to be clear about where the change occurs. Indeed, this matters for the $\gamma$ effect, which represents an indirect effect of a change in $X_i$. This can be conceptualized in two ways. First, one could think of $\gamma$ as simply the effect of a unit change in your average surroundings. This is useful and simple. But this interpretation ignores where the change occurs. In truth, a change in a variable at site $i$ spills over to its surroundings: when $x_i$ changes, so too does the spatial lag of any site near $i$. The precise size of this spillover depends on the structure of $\mathbf{W}$, and can be different for every site. For example, think of a very highly connected "focal" site in a row-standardized weights matrix. This focal site will not be strongly affected if one neighbor changes by a single unit, since each neighbor contributes only a small amount to the lag at the focal site. Alternatively, consider a site with only one neighbor: its lag will change by exactly the amount its sole neighbor changes. Thus, to find the exact indirect effect on $y$ caused by a change at a specific site $x_i$, you need to compute the change in the spatial lag and then use that as your change in $X$.

In Python, we can calculate the spatial lag of each variable whose name starts with pg_ by first creating a list of all of those names, and then applying PySAL's lag_spatial to each of them:
The code below creates the spatial lag of each of these explanatory variables; in the regression that follows we simply include these lagged variables as additional regressors.
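One caveat before running it: the knn object above was built with k=1, yet the lag values shown below (0.45, 0.05, ...) and the 20 neighbors used later imply that the weights were rebuilt with the 20 nearest neighbors and row-standardized at this point. A minimal sketch of that missing step, assuming libpysal's KNN.reweight method:

knn.reweight(k=20, inplace=True)  # switch from 1 to 20 nearest neighbors
knn.transform = 'R'               # row-standardize so the lag is a neighborhood average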

wx = db.filter(
    like='pg'
).apply(
    lambda y: weights.spatial_lag.lag_spatial(knn, y)
).rename(
    columns=lambda c: 'w_'+c
).drop(
    'w_pg_Apartment', axis=1
)
wx
      w_pg_Condominium  w_pg_House  w_pg_Other  w_pg_Townhouse
0                 0.00        0.45        0.20            0.00
1                 0.30        0.35        0.00            0.05
2                 0.05        0.30        0.00            0.05
3                 0.00        0.95        0.00            0.05
4                 0.05        0.85        0.00            0.00
...                ...         ...         ...             ...
6105              0.30        0.05        0.00            0.00
6106              0.05        0.00        0.60            0.00
6107              0.00        0.55        0.25            0.00
6108              0.00        0.05        0.05            0.00
6109              0.05        0.70        0.05            0.00

6110 rows × 4 columns

Once computed, we can run the model using OLS estimation because, in this context, the spatial lags included do not violate any of the assumptions OLS relies on (they are essentially additional exogenous variables):

slx_exog = db[variable_names].join(wx)
m5 = spreg.OLS(
    db[['log_price']].values, 
    slx_exog.values,
    name_y='l_price', 
    name_x=slx_exog.columns.tolist()
)
print(m5.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :     l_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          15
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6095
R-squared           :      0.6800
Adjusted R-squared  :      0.6792
Sum squared residual:    1273.933                F-statistic           :    924.9423
Sigma-square        :       0.209                Prob(F-statistic)     :           0
S.E. of regression  :       0.457                Log likelihood        :   -3880.030
Sigma-square ML     :       0.208                Akaike info criterion :    7790.061
S.E of regression ML:      0.4566                Schwarz criterion     :    7890.826

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.3205814       0.0234977     183.8727044       0.0000000
        accommodates       0.0809972       0.0050046      16.1843874       0.0000000
           bathrooms       0.1893447       0.0108059      17.5224026       0.0000000
            bedrooms       0.1635998       0.0109764      14.9047058       0.0000000
                beds      -0.0451529       0.0068249      -6.6159365       0.0000000
     rt_Private_room      -0.5293783       0.0157308     -33.6524367       0.0000000
      rt_Shared_room      -1.2892590       0.0381443     -33.7995105       0.0000000
      pg_Condominium       0.1063490       0.0221782       4.7952003       0.0000017
            pg_House       0.0327806       0.0156954       2.0885538       0.0367893
            pg_Other       0.0861857       0.0239774       3.5944620       0.0003276
        pg_Townhouse      -0.0277116       0.0338485      -0.8186965       0.4129916
    w_pg_Condominium       0.5928369       0.0689612       8.5966706       0.0000000
          w_pg_House      -0.0774462       0.0318830      -2.4290766       0.0151661
          w_pg_Other       0.4851047       0.0551461       8.7967121       0.0000000
      w_pg_Townhouse      -0.2724493       0.1223388      -2.2270058       0.0259833
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           14.277

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        2458.006           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               14         393.052           0.0000
Koenker-Bassett test             14         169.585           0.0000
================================ END OF REPORT =====================================

As an illustration, let's look at some of the direct/indirect effects. The direct effect of the pg_Condominium variable means that condominiums are typically about 11% more expensive ($\beta_{pg\_Condominium}=0.1063$) than the benchmark property type, apartments. More relevant to this section, any given house surrounded by condominiums also receives a price premium. But, since pg_Condominium is a dummy variable, the spatial lag at site $i$ represents the percentage of properties near $i$ that are condominiums, which is between $0$ and $1$.[2] So, a unit change in this variable means increasing the condominium percentage by 100 percentage points. Thus, a $0.1$ increase in w_pg_Condominium (a change of ten percentage points) is associated with a roughly 5.9% increase in the house price ($\beta_{w\_pg\_Condominium} \approx 0.59$). Similar interpretations can be derived for all other spatially lagged variables to obtain the indirect effect of a change in the spatial lag.

However, to compute the indirect change for a given site $i$, you may need to examine the predicted values for $y_i$. In this example, since we are using a row-standardized weights matrix with twenty nearest neighbors, the impact of changing $x_i$ is the same for all of its neighbors and for any site $i$. Thus, the effect is always $\frac{\gamma}{20}$, or about $0.0296$. However, this would not be the case for many other kinds of weights (like Kernel, Queen, Rook, DistanceBand, or Voronoi), so we will demonstrate how to construct the indirect effect for a specific $i$:

First, predicted values for $y_i$ are stored in the predy attribute of any model:

m5.predy
array([[5.43610121],
       [5.38596868],
       [4.25377454],
       ...,
       [4.29145318],
       [4.89174746],
       [4.85867698]])

Showing this process below, let’s first change a property to be a condominium. Consider the third observation, which is the first apartment in the data:

# inspect the third observation (index 2)
db.loc[2]
accommodates                                                     2
bathrooms                                                        1
bedrooms                                                         1
beds                                                             1
neighborhood                                           North Hills
pool                                                             0
d2balboa                                                   2.49389
coastal                                                          0
price                                                           99
log_price                                                  4.59512
id                                                            9553
pg_Apartment                                                     1
pg_Condominium                                                   0
pg_House                                                         0
pg_Other                                                         0
pg_Townhouse                                                     0
rt_Entire_home/apt                                               0
rt_Private_room                                                  1
rt_Shared_room                                                   0
geometry              POINT (-117.1412083878189 32.75326632438691)
residual                                                  0.287341
Name: 2, dtype: object

This is an apartment. Let’s make a copy of our data and change this apartment into a condominium:

db_scenario = db.copy()
# make Apartment 0 and condo 1
db_scenario.loc[2, ['pg_Apartment', 'pg_Condominium']] = [0,1]

We’ve successfully made the change:

db_scenario.loc[2]
accommodates                                                     2
bathrooms                                                        1
bedrooms                                                         1
beds                                                             1
neighborhood                                           North Hills
pool                                                             0
d2balboa                                                   2.49389
coastal                                                          0
price                                                           99
log_price                                                  4.59512
id                                                            9553
pg_Apartment                                                     0
pg_Condominium                                                   1
pg_House                                                         0
pg_Other                                                         0
pg_Townhouse                                                     0
rt_Entire_home/apt                                               0
rt_Private_room                                                  1
rt_Shared_room                                                   0
geometry              POINT (-117.1412083878189 32.75326632438691)
residual                                                  0.287341
Name: 2, dtype: object

Now, we need to also update the spatial lag variates:

db['pg_Condominium']
0       0
1       1
2       0
3       0
4       0
       ..
6105    0
6106    1
6107    0
6108    0
6109    0
Name: pg_Condominium, Length: 6110, dtype: int64
weights.spatial_lag.lag_spatial(knn, db['pg_Condominium'])  # the spatial lag gives, for each site i, the share of its neighbors that are condominiums
array([0.  , 0.3 , 0.05, ..., 0.  , 0.  , 0.05])
wx_scenario = db_scenario.filter(
    like='pg'
).apply(
    lambda y: weights.spatial_lag.lag_spatial(knn, y)
).rename(
    columns=lambda c: 'w_'+c
).drop(
    'w_pg_Apartment', axis=1
)
wx_scenario
      w_pg_Condominium  w_pg_House  w_pg_Other  w_pg_Townhouse
0                 0.00        0.45        0.20            0.00
1                 0.30        0.35        0.00            0.05
2                 0.05        0.30        0.00            0.05
3                 0.00        0.95        0.00            0.05
4                 0.05        0.85        0.00            0.00
...                ...         ...         ...             ...
6105              0.30        0.05        0.00            0.00
6106              0.05        0.00        0.60            0.00
6107              0.00        0.55        0.25            0.00
6108              0.00        0.05        0.05            0.00
6109              0.05        0.70        0.05            0.00

6110 rows × 4 columns

And build a new exogenous $\mathbf{X}$ matrix (the constant term is added separately, via the intercept coefficient, when predicting below):

slx_exog_scenario = db_scenario[variable_names].join(wx_scenario)
slx_exog_scenario
      accommodates  bathrooms  bedrooms  beds  rt_Private_room  rt_Shared_room  pg_Condominium  pg_House  pg_Other  pg_Townhouse  w_pg_Condominium  w_pg_House  w_pg_Other  w_pg_Townhouse
0                5        2.0       2.0   2.0                0               0               0         1         0             0              0.00        0.45        0.20            0.00
1                6        1.0       2.0   4.0                0               0               1         0         0             0              0.30        0.35        0.00            0.05
2                2        1.0       1.0   1.0                1               0               1         0         0             0              0.05        0.30        0.00            0.05
3                2        1.0       1.0   1.0                1               0               0         1         0             0              0.00        0.95        0.00            0.05
4                2        1.0       1.0   1.0                1               0               0         1         0             0              0.05        0.85        0.00            0.00
...            ...        ...       ...   ...              ...             ...             ...       ...       ...           ...               ...         ...         ...             ...
6105             2        1.0       1.0   1.0                1               0               0         0         0             0              0.30        0.05        0.00            0.00
6106             6        2.0       2.0   2.0                0               0               1         0         0             0              0.05        0.00        0.60            0.00
6107             1        1.0       1.0   1.0                1               0               0         1         0             0              0.00        0.55        0.25            0.00
6108             3        1.0       1.0   1.0                0               0               0         0         0             0              0.00        0.05        0.05            0.00
6109             3        1.0       1.0   2.0                0               0               0         1         0             0              0.05        0.70        0.05            0.00

6110 rows × 14 columns

Now, our new prediction (in the scenario where we have changed site 2 from an apartment into a condominium) is:

y_pred_scenario = m5.betas[0] + slx_exog_scenario @ m5.betas[1:]  # @ is matrix multiplication

This prediction will be exactly the same for all sites, except site 2 and its neighbors. So, the neighbors to site 2 are:

knn.neighbors[2]
[772,
 2212,
 139,
 4653,
 2786,
 1218,
 138,
 808,
 1480,
 4241,
 1631,
 3617,
 2612,
 1162,
 135,
 23,
 5528,
 3591,
 407,
 6088]

And the effect of changing site 2 into a condominium is associated with the following changes to $y_i$:

# pull out site 2 and its neighbors; these are the rows whose predictions differ from the original ones
(y_pred_scenario - m5.predy).loc[[2] + knn.neighbors[2]]  
             0
2     0.106349
772   0.029642
2212  0.029642
139   0.029642
4653  0.029642
2786  0.029642
1218  0.029642
138   0.029642
808   0.029642
1480  0.029642
4241  0.029642
1631  0.029642
3617  0.029642
2612  0.029642
1162  0.029642
135   0.029642
23    0.029642
5528  0.029642
3591  0.029642
407   0.029642
6088  0.029642

We see the first row, representing the direct effect, is exactly equal to the estimate for pg_Condominium. For the other effects, though, we have only changed w_pg_Condominium by $0.05$:

scenario_near_2 = slx_exog_scenario.loc[knn.neighbors[2], ['w_pg_Condominium']]
orig_near_2 = slx_exog.loc[knn.neighbors[2], ['w_pg_Condominium']]
scenario_near_2.join(orig_near_2, lsuffix='_scenario', rsuffix= '_original')
# for site 2's neighbors, the share of condominiums around them used to be w_pg_Condominium_original; it has now increased by 0.05
      w_pg_Condominium_scenario  w_pg_Condominium_original
772                        0.10                       0.05
2212                       0.10                       0.05
139                        0.10                       0.05
4653                       0.10                       0.05
2786                       0.10                       0.05
1218                       0.10                       0.05
138                        0.10                       0.05
808                        0.05                       0.00
1480                       0.10                       0.05
4241                       0.10                       0.05
1631                       0.10                       0.05
3617                       0.10                       0.05
2612                       0.10                       0.05
1162                       0.05                       0.00
135                        0.05                       0.00
23                         0.10                       0.05
5528                       0.05                       0.00
3591                       0.05                       0.00
407                        0.05                       0.00
6088                       0.10                       0.05
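As a quick sanity check (not in the original output), the prediction change for each neighbor is just the SLX lag coefficient times this 0.05 change in the lag:

print(0.5928369 * 0.05)  # ≈ 0.029642, matching the prediction differences for site 2's neighbors above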

The example above illustrates the SLX model nicely. It really is simple: compute the spatial lags of the explanatory variables and include them in an OLS regression. The direct effects are interpreted exactly as in plain OLS, but the indirect effects need a little more care. For the KNN weights chosen in this example, the indirect effect is the effect transmitted through a site's neighborhood average, which is precisely the spatial dependence we wanted to capture.
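As a rough summary measure (a hedged sketch, not from the original post): with a row-standardized W, a variable's "total" effect in the SLX model is often reported as the sum of its direct coefficient and its lag coefficient, e.g. for pg_Condominium:

# label the coefficient vector and add direct + indirect components
coefs = pd.Series(m5.betas.flatten(), index=['CONSTANT'] + slx_exog.columns.tolist())
print(coefs['pg_Condominium'] + coefs['w_pg_Condominium'])  # ≈ 0.70 on the log-price scale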

4. SEM (Spatial Error Model)

The spatial error model includes a spatial lag in the error term of the equation:

$$\log{P_i} = \alpha + \sum_k \beta_k X_{ki} + u_i$$

$$u_i = \lambda u_{lag-i} + \epsilon_i$$

where $u_{lag-i} = \sum_j w_{i,j} u_j$.
Although it appears similar, this specification violates the assumptions about the error term in a classical OLS model. Hence, alternative estimation methods are required. PySAL incorporates functionality to estimate several of the most advanced techniques developed in the spatial econometrics literature. For example, we can use a general method of moments estimator that accounts for heteroskedasticity (Arraiz et al., 2010):

db[variable_names]
      accommodates  bathrooms  bedrooms  beds  rt_Private_room  rt_Shared_room  pg_Condominium  pg_House  pg_Other  pg_Townhouse
0                5        2.0       2.0   2.0                0               0               0         1         0             0
1                6        1.0       2.0   4.0                0               0               1         0         0             0
2                2        1.0       1.0   1.0                1               0               0         0         0             0
3                2        1.0       1.0   1.0                1               0               0         1         0             0
4                2        1.0       1.0   1.0                1               0               0         1         0             0
...            ...        ...       ...   ...              ...             ...             ...       ...       ...           ...
6105             2        1.0       1.0   1.0                1               0               0         0         0             0
6106             6        2.0       2.0   2.0                0               0               1         0         0             0
6107             1        1.0       1.0   1.0                1               0               0         1         0             0
6108             3        1.0       1.0   1.0                0               0               0         0         0             0
6109             3        1.0       1.0   2.0                0               0               0         1         0             0

6110 rows × 10 columns

knn = weights.KNN.from_dataframe(db, k=20)  # rebuild the KNN weights with the 20 nearest neighbors
knn.transform='r'
# knn.weights
m6 = spreg.GM_Error_Het(
    y = db[['log_price']].values, 
    x = db[variable_names].values,
    w=knn,
    name_y='log_price', 
    name_x=variable_names
)
print(m6.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES (HET)
---------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
Pseudo R-squared    :      0.6655
N. of iterations    :           1                Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.4262033       0.0217088     203.8898741       0.0000000
        accommodates       0.0695536       0.0063268      10.9934495       0.0000000
           bathrooms       0.1614101       0.0151312      10.6673765       0.0000000
            bedrooms       0.1739251       0.0146697      11.8560847       0.0000000
                beds      -0.0377710       0.0088293      -4.2779023       0.0000189
     rt_Private_room      -0.4947947       0.0163843     -30.1993140       0.0000000
      rt_Shared_room      -1.1613985       0.0515304     -22.5381175       0.0000000
      pg_Condominium       0.1003761       0.0213148       4.7092198       0.0000025
            pg_House       0.0308334       0.0147100       2.0960849       0.0360747
            pg_Other       0.0861768       0.0254942       3.3802547       0.0007242
        pg_Townhouse      -0.0074515       0.0292873      -0.2544285       0.7991646
               lambda       0.6448728       0.0186626      34.5543740       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
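For comparison (a hedged sketch, not in the original post), the same spatial error model can also be estimated by maximum likelihood with spreg.ML_Error; the estimates should be broadly similar to the GMM results above:

m6_ml = spreg.ML_Error(
    db[['log_price']].values,
    db[variable_names].values,
    w=knn,
    name_y='log_price',
    name_x=variable_names
)
print(m6_ml.summary)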

5. SAR (Spatial Lag Model)

The SAR (spatial lag) specification is:

$$\log{P_i} = \alpha + \rho \log{P_{lag-i}} + \sum_k \beta_k X_{ki} + \epsilon_i$$

Although it might not seem very different from the previous equation, this model violates the exogeneity assumption that is crucial for OLS to work. Put simply, this occurs because $P_i$ appears on both sides of the equals sign: since $P_i$ is included in computing $P_{lag-i}$, exogeneity is violated. As in the case of the spatial error model, several techniques have been proposed to overcome this limitation, and PySAL implements several of them. In the example below, we use a two-stage least squares estimation (Anselin, 1988), where the spatial lags of all the explanatory variables are used as instruments for the endogenous lag:

m7 = spreg.GM_Lag(
    db[['log_price']].values, 
    db[variable_names].values,
    w=knn, 
    name_y='log_price', 
    name_x=variable_names
)
print(m7.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          12
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6098
Pseudo R-squared    :      0.7057
Spatial Pseudo R-squared:  0.6883

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       2.7440254       0.0727290      37.7294400       0.0000000
        accommodates       0.0697596       0.0048157      14.4859187       0.0000000
           bathrooms       0.1626725       0.0104007      15.6405467       0.0000000
            bedrooms       0.1604137       0.0104823      15.3033012       0.0000000
                beds      -0.0365065       0.0065336      -5.5874750       0.0000000
     rt_Private_room      -0.4981415       0.0151396     -32.9031780       0.0000000
      rt_Shared_room      -1.1157392       0.0365563     -30.5210777       0.0000000
      pg_Condominium       0.1072995       0.0209048       5.1327614       0.0000003
            pg_House      -0.0004017       0.0136828      -0.0293598       0.9765777
            pg_Other       0.1207503       0.0214771       5.6222929       0.0000000
        pg_Townhouse      -0.0185543       0.0322730      -0.5749190       0.5653461
         W_log_price       0.3416482       0.0147787      23.1175620       0.0000000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_accommodates, W_bathrooms, W_bedrooms, W_beds,
             W_pg_Condominium, W_pg_House, W_pg_Other, W_pg_Townhouse,
             W_rt_Private_room, W_rt_Shared_room
================================ END OF REPORT =====================================

Similarly to the effects in the SLX regression, changes in the spatial lag regression need to be interpreted with care. Here, W_log_price applies consistently over all observations and effectively changes the strength of each of the $\beta$ coefficients. Thus, it is useful to use predictions and scenario-building to predict $y$ when changing $X$, which allows you to analyze the direct and indirect components.
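One common summary (a hedged sketch, not from the original post): with a row-standardized W, the average total effect of a unit change in a regressor in the SAR model works out to $\beta_k / (1 - \rho)$. Using m7's estimates for accommodates:

rho = m7.betas[-1][0]        # coefficient on W_log_price (the spatial lag term comes last)
beta_acc = m7.betas[1][0]    # 'accommodates' coefficient (CONSTANT comes first)
print(beta_acc / (1 - rho))  # average total (direct + indirect) effect, ≈ 0.106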

References

https://geographicdata.science/book/notebooks/11_regression.html
沈体雁 et al., 《空间计量经济学》 (Spatial Econometrics)
沈体雁 et al., 《空间计量分析软件》 (Spatial Econometric Analysis Software)
Code and other materials for this post


[1] Since we use the log price as the $y$ variable, our $\beta$ coefficients are interpreted as (semi-)elasticities, meaning that a unit change in the $x_i$ variate is associated with approximately a $100\beta$ percent change in the price $y_i$.

[2] Discover this for yourself: what is the average of numpy.array([True, True, True, False, False, True])?
