03 Basic Spatial Econometric Models: SLX, SAR, SEM

I had some spare time over the past couple of days, so I worked through a few spatial econometric models as a hands-on exercise using the pysal package. The original tutorial is linked in the references; it explains spatial heterogeneity, spatial dependence, and the three basic spatial econometric models SLX, SAR, and SEM. Most of the many other variants are simply combinations of two or all three of these models. This post does not go over spatial heterogeneity and spatial dependence again; it only shows how to implement SLX, SAR, and SEM with pysal. If you want the full treatment, read the original tutorial. Also, make sure pysal is updated to the latest version, otherwise the code in this post may not run.

1. The Data and an OLS Regression

The Data: San Diego AirBnB, which you can download yourself. The data are nightly AirBnB listing prices for San Diego. They include the price, which is the dependent variable in the analysis that follows, plus several continuous explanatory variables (how many people a listing accommodates, the number of bathrooms, bedrooms, beds, and so on), a categorical variable (neighborhood), and a set of dummy variables such as pg_Apartment.

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

import pysal
import spreg
from libpysal import weights
import esda
from scipy import stats
import statsmodels.formula.api as sm
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns


db = gpd.read_file('regression_db.geojson')
db.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 6110 entries, 0 to 6109
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   accommodates        6110 non-null   int64   
 1   bathrooms           6110 non-null   float64 
 2   bedrooms            6110 non-null   float64 
 3   beds                6110 non-null   float64 
 4   neighborhood        6110 non-null   object  
 5   pool                6110 non-null   int64   
 6   d2balboa            6110 non-null   float64 
 7   coastal             6110 non-null   int64   
 8   price               6110 non-null   float64 
 9   log_price           6110 non-null   float64 
 10  id                  6110 non-null   int64   
 11  pg_Apartment        6110 non-null   int64   
 12  pg_Condominium      6110 non-null   int64   
 13  pg_House            6110 non-null   int64   
 14  pg_Other            6110 non-null   int64   
 15  pg_Townhouse        6110 non-null   int64   
 16  rt_Entire_home/apt  6110 non-null   int64   
 17  rt_Private_room     6110 non-null   int64   
 18  rt_Shared_room      6110 non-null   int64   
 19  geometry            6110 non-null   geometry
dtypes: float64(6), geometry(1), int64(12), object(1)
memory usage: 954.8+ KB
db.head(2)
   accommodates  bathrooms  bedrooms  beds neighborhood  pool   d2balboa  coastal  price  log_price    id  \
0             5        2.0       2.0   2.0  North Hills     0   2.972077        0  425.0   6.052089     6
1             6        1.0       2.0   4.0  Mission Bay     0  11.501385        1  205.0   5.323010  5570

   pg_Apartment  pg_Condominium  pg_House  pg_Other  pg_Townhouse  rt_Entire_home/apt  rt_Private_room  rt_Shared_room  \
0             0               0         1         0             0                   1                0               0
1             0               1         0         0             0                   1                0               0

                      geometry
0  POINT (-117.12971 32.75399)
1  POINT (-117.25253 32.78421)

These are the explanatory variables we will use throughout this post.

variable_names = [
    'accommodates', 
    'bathrooms', 
    'bedrooms', 
    'beds', 
    'rt_Private_room', 
    'rt_Shared_room',
    'pg_Condominium', 
    'pg_House', 
    'pg_Other', 
    'pg_Townhouse'
]

The code below runs a plain OLS regression with no spatial effects at all: we simply pass in the dependent and explanatory variables. You could also do this with statsmodels, with much the same result (a statsmodels version is sketched right after the report below).

m1 = spreg.OLS(
    y = db[['log_price']].values, 
    x = db[variable_names].values,
    name_y='log_price', 
    name_x=variable_names
)

Note that Number of Variables includes the CONSTANT term, so there are 11 variables in total.

print(m1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
R-squared           :      0.6683
Adjusted R-squared  :      0.6678
Sum squared residual:    1320.148                F-statistic           :   1229.0564
Sigma-square        :       0.216                Prob(F-statistic)     :           0
S.E. of regression  :       0.465                Log likelihood        :   -3988.895
Sigma-square ML     :       0.216                Akaike info criterion :    7999.790
S.E of regression ML:      0.4648                Schwarz criterion     :    8073.685

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.3883830       0.0161147     272.3217773       0.0000000
        accommodates       0.0834523       0.0050781      16.4336318       0.0000000
           bathrooms       0.1923790       0.0109668      17.5419773       0.0000000
            bedrooms       0.1525221       0.0111323      13.7009195       0.0000000
                beds      -0.0417231       0.0069383      -6.0134430       0.0000000
     rt_Private_room      -0.5506868       0.0159046     -34.6244758       0.0000000
      rt_Shared_room      -1.2383055       0.0384329     -32.2198992       0.0000000
      pg_Condominium       0.1436347       0.0221499       6.4846529       0.0000000
            pg_House      -0.0104894       0.0145315      -0.7218393       0.4704209
            pg_Other       0.1411546       0.0228016       6.1905633       0.0000000
        pg_Townhouse      -0.0416702       0.0342758      -1.2157316       0.2241342
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           11.964

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        2671.611           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               10         322.532           0.0000
Koenker-Bassett test             10         135.581           0.0000
================================ END OF REPORT =====================================
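As mentioned above, the same regression can be run with statsmodels. A minimal sketch (not part of the original post), using the sm alias imported earlier; the estimated coefficients should match m1:

# build an OLS formula from the variable list and fit it with statsmodels
f = 'log_price ~ ' + ' + '.join(variable_names)
m1_sm = sm.ols(f, data=db).fit()
print(m1_sm.params)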

Next, let's see whether coastal and non-coastal houses behave differently. Because we have not accounted for any spatial effect so far, the regression residuals may well not be white noise, so let's verify whether a spatial effect really is present in the data. There are of course many kinds of spatial effects; coastal vs. non-coastal is just one of them.

is_coastal = db.coastal.astype(bool)
coastal = m1.u[is_coastal]       # residuals of the coastal listings
not_coastal = m1.u[~is_coastal]  # residuals of the non-coastal listings
plt.hist(coastal, density=True, label='Coastal')
plt.hist(
    not_coastal, 
    histtype='step',
    density=True, 
    linewidth=4, 
    label='Not Coastal'
)
plt.vlines(0,0,1, linestyle=":", color='k', linewidth=4)
plt.legend()
plt.show()

[Figure: histograms of OLS residuals for coastal vs. non-coastal listings]

While it appears that the neighborhoods on the coast have only slightly higher average errors (and have lower variance in their prediction errors), the two distributions are significantly distinct from one another when compared using a classic $t$ test:

stats.ttest_ind(coastal, not_coastal)
Ttest_indResult(statistic=array([13.98193858]), pvalue=array([9.442438e-44]))
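The difference in means behind this test can also be checked directly (a small addition, not in the original post):

# compare average residuals for the two groups defined above
print(coastal.mean(), not_coastal.mean())  # coastal listings have the higher average residual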

Next, let's build a spatial weights matrix for this sample. For an introduction to spatial weights matrices, see posts 01 and 02 of this series; here a distance-based (K-nearest-neighbor) weights matrix is used.

knn = weights.KNN.from_dataframe(db, k=1)  # K-nearest-neighbor weights (k=1: each listing's single nearest neighbor)
knn
<libpysal.weights.distance.KNN at 0xaf77427fc8>
knn.weights
{0: [1.0],
 1: [1.0],
 2: [1.0],
 3: [1.0],
 4: [1.0],
 ...}

This means that, when we compute the spatial lag of that KNN weight and the residual, we get the residual of the AirBnB listing closest to each observation.

lag_residual = weights.spatial_lag.lag_spatial(knn, m1.u)  # spatial lag of the residuals (each listing's nearest neighbor's residual)
ax = sns.regplot(
    m1.u.flatten(),
    lag_residual.flatten(),
    line_kws=dict(color='orangered'),
    ci=None
)
ax.set_xlabel('Model Residuals - $u$')
ax.set_ylabel('Spatial Lag of Model Residuals - $W u$');

[Figure: scatter of model residuals $u$ against their spatial lag $Wu$, with fitted regression line]

lag_residual
array([[-0.59201055],
       [ 0.65057166],
       [ 0.00970927],
       ...,
       [-0.40444691],
       [ 0.07424779],
       [-0.30458876]])

The figure above makes it quite clear that a spatial effect is present in the residuals. The x-axis is each observation's residual and the y-axis is the residual of its nearest neighbor. If the residuals were randomly distributed there would be no relationship between the two, but we can see a clear clustering pattern, and the slope of the fitted line is significantly different from zero. Plain OLS is therefore problematic here. Below we first run a Spatial Regimes regression, again with spreg.
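A more formal confirmation (a hedged sketch, not in the original post) is a Moran's I test on the residuals using esda, which was imported above; a positive and significant statistic indicates spatially clustered residuals:

# Moran's I of the OLS residuals under the KNN weights built above
moran = esda.Moran(m1.u.flatten(), knn, permutations=999)
print(moran.I, moran.p_sim)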

2. Spatial Regimes

The specification is shown below. It is fairly easy to understand: each regime (here, each region) gets its own intercept and, as the formula shows, its own slope coefficients as well, in the hope of capturing spatial heterogeneity.

$$\log{P_i} = \alpha_r + \sum_k \mathbf{X}_{ki}\beta_{k-r} + \epsilon_i$$

where we are not only allowing the constant term to vary by region ($\alpha_r$), but also every other parameter ($\beta_{k-r}$).

To illustrate this approach, we will use the "spatial differentiator" of whether a house is in a coastal neighborhood or not (coastal) to define the regimes. The rationale behind this choice is that renting a house close to the ocean might be a strong enough pull that people might be willing to pay at different rates for each of the house's characteristics.

m4 = spreg.OLS_Regimes(
    y = db[['log_price']].values, 
    x = db[variable_names].values,
    regimes = db['coastal'].tolist(),
    constant_regi='many',
    regime_err_sep=False,
    name_y='log_price', 
    name_x=variable_names
)
print(m4.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES - REGIMES
---------------------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          22
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6088
R-squared           :      0.6853
Adjusted R-squared  :      0.6843
Sum squared residual:    1252.489                F-statistic           :    631.4283
Sigma-square        :       0.206                Prob(F-statistic)     :           0
S.E. of regression  :       0.454                Log likelihood        :   -3828.169
Sigma-square ML     :       0.205                Akaike info criterion :    7700.339
S.E of regression ML:      0.4528                Schwarz criterion     :    7848.128

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
          0_CONSTANT       4.4072424       0.0215156     204.8392695       0.0000000
      0_accommodates       0.0901860       0.0064737      13.9311338       0.0000000
         0_bathrooms       0.1433760       0.0142680      10.0487871       0.0000000
          0_bedrooms       0.1129626       0.0138273       8.1695568       0.0000000
              0_beds      -0.0262719       0.0088380      -2.9726102       0.0029644
   0_rt_Private_room      -0.5293343       0.0189179     -27.9805699       0.0000000
    0_rt_Shared_room      -1.2244586       0.0425969     -28.7452834       0.0000000
    0_pg_Condominium       0.1053065       0.0281309       3.7434523       0.0001832
          0_pg_House      -0.0454471       0.0179571      -2.5308637       0.0114032
          0_pg_Other       0.0607526       0.0276365       2.1982715       0.0279673
      0_pg_Townhouse      -0.0103973       0.0456730      -0.2276456       0.8199294
          1_CONSTANT       4.4799043       0.0250938     178.5260014       0.0000000
      1_accommodates       0.0484639       0.0078806       6.1497397       0.0000000
         1_bathrooms       0.2474779       0.0165661      14.9388057       0.0000000
          1_bedrooms       0.1897404       0.0179229      10.5864676       0.0000000
              1_beds      -0.0506077       0.0107429      -4.7107925       0.0000025
   1_rt_Private_room      -0.5586281       0.0283122     -19.7309699       0.0000000
    1_rt_Shared_room      -1.0528541       0.0841745     -12.5079997       0.0000000
    1_pg_Condominium       0.2044470       0.0339434       6.0231780       0.0000000
          1_pg_House       0.0753534       0.0233783       3.2232188       0.0012743
          1_pg_Other       0.2954848       0.0386455       7.6460385       0.0000000
      1_pg_Townhouse      -0.0735077       0.0493672      -1.4889984       0.1365396
------------------------------------------------------------------------------------
Regimes variable: unknown

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           14.033

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        3977.425           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               21         443.593           0.0000
Koenker-Bassett test             21         164.276           0.0000

REGIMES DIAGNOSTICS - CHOW TEST
                 VARIABLE        DF        VALUE           PROB
                 CONSTANT         1           4.832           0.0279
             accommodates         1          16.736           0.0000
                bathrooms         1          22.671           0.0000
                 bedrooms         1          11.504           0.0007
                     beds         1           3.060           0.0802
           pg_Condominium         1           5.057           0.0245
                 pg_House         1          16.793           0.0000
                 pg_Other         1          24.410           0.0000
             pg_Townhouse         1           0.881           0.3480
          rt_Private_room         1           0.740           0.3896
           rt_Shared_room         1           3.309           0.0689
              Global test        11         328.869           0.0000
================================ END OF REPORT =====================================
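For intuition, this regimes specification is equivalent to interacting every regressor (and the intercept) with the coastal dummy. A hedged statsmodels sketch of that idea (not from the original post; the parameterization differs from spreg's regime-specific coefficients, but the fitted model is the same):

# full interaction of all regressors with the coastal regime dummy
f = 'log_price ~ (' + ' + '.join(variable_names) + ') * coastal'
m4_sm = sm.ols(f, data=db).fit()
print(m4_sm.rsquared)  # should match the R-squared of the spreg regimes model (~0.685)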

3. Exogenous Effects: The SLX Model

The formula is below: we simply add the spatial lag of the explanatory variables to the model. This is the simplest spatial econometric model; since only the explanatory variables get a spatial lag term, estimation is still ordinary OLS and the interpretation is largely the same as for a plain OLS model. I won't go into the details here; see Shen Tiyan's book 《空间计量经济学》 (Spatial Econometrics).
$$\log(P_i) = \alpha + \sum_{k=1}^{p} X_{ik}\beta_k + \sum_{k=1}^{p}\left(\sum_{j=1}^{N} w_{ij}x_{jk}\right)\gamma_k + \epsilon_i$$

where $\sum_{j=1}^N w_{ij}x_{jk}$ represents the spatial lag of the $k$th explanatory variable.
This can be stated in matrix form using the spatial weights matrix, $\mathbf{W}$, as:

$$\log(P_i) = \alpha + \mathbf{X}\beta + \mathbf{WX}\gamma + \epsilon$$

This splits the model into two main effects: $\beta$ and $\gamma$. The $\beta$ effect describes the change in $y_i$ when $X_{ik}$ changes by one.[1] The subscript for site $i$ is important here: since we're dealing with a $\mathbf{W}$ matrix, it's useful to be clear about where the change occurs. Indeed, this matters for the $\gamma$ effect, which represents an indirect effect of a change in $X_i$. This can be conceptualized in two ways. First, one could think of $\gamma$ as simply the effect of a unit change in your average surroundings. This is useful and simple. But this interpretation ignores where the change occurs. In truth, a change in a variable at site $i$ spills over to its surroundings: when $x_i$ changes, so too does the spatial lag of any site near $i$. The precise size of this spillover depends on the structure of $\mathbf{W}$, and can be different for every site. For example, think of a very highly connected "focal" site in a row-standardized weights matrix. This focal site will not be strongly affected if one neighbor changes by a single unit, since each neighbor contributes only a small amount to the lag at the focal site. Alternatively, consider a site with only one neighbor: its lag will change by exactly the amount its sole neighbor changes. Thus, to find the exact indirect effect on $y$ caused by a change at a specific site $x_i$, you need to compute the change in the spatial lag and then use that as your change in $X$.

In Python, we can calculate the spatial lag of each variable whose name starts with pg_ by first creating a list of all of those names, and then applying PySAL's lag_spatial to each of them:
The code below creates the spatial lag of each of these explanatory variables; in the regression that follows we simply include these lagged variables as additional regressors.
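One caveat before running it: the knn object above was built with k=1, yet the lag values shown below (0.45, 0.05, ...) and the 20 neighbors used later imply that the weights were rebuilt with the 20 nearest neighbors and row-standardized at this point. A minimal sketch of that missing step, assuming libpysal's KNN.reweight method:

knn.reweight(k=20, inplace=True)  # switch from 1 to 20 nearest neighbors
knn.transform = 'R'               # row-standardize so the lag is a neighborhood average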

wx = db.filter(
    like='pg'
).apply(
    lambda y: weights.spatial_lag.lag_spatial(knn, y)
).rename(
    columns=lambda c: 'w_'+c
).drop(
    'w_pg_Apartment', axis=1
)
wx
      w_pg_Condominium  w_pg_House  w_pg_Other  w_pg_Townhouse
0                 0.00        0.45        0.20            0.00
1                 0.30        0.35        0.00            0.05
2                 0.05        0.30        0.00            0.05
3                 0.00        0.95        0.00            0.05
4                 0.05        0.85        0.00            0.00
...                ...         ...         ...             ...
6105              0.30        0.05        0.00            0.00
6106              0.05        0.00        0.60            0.00
6107              0.00        0.55        0.25            0.00
6108              0.00        0.05        0.05            0.00
6109              0.05        0.70        0.05            0.00

6110 rows × 4 columns

Once computed, we can run the model using OLS estimation because, in this context, the spatial lags included do not violate any of the assumptions OLS relies on (they are essentially additional exogenous variables):

slx_exog = db[variable_names].join(wx)
m5 = spreg.OLS(
    db[['log_price']].values, 
    slx_exog.values,
    name_y='l_price', 
    name_x=slx_exog.columns.tolist()
)
print(m5.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :     l_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          15
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6095
R-squared           :      0.6800
Adjusted R-squared  :      0.6792
Sum squared residual:    1273.933                F-statistic           :    924.9423
Sigma-square        :       0.209                Prob(F-statistic)     :           0
S.E. of regression  :       0.457                Log likelihood        :   -3880.030
Sigma-square ML     :       0.208                Akaike info criterion :    7790.061
S.E of regression ML:      0.4566                Schwarz criterion     :    7890.826

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.3205814       0.0234977     183.8727044       0.0000000
        accommodates       0.0809972       0.0050046      16.1843874       0.0000000
           bathrooms       0.1893447       0.0108059      17.5224026       0.0000000
            bedrooms       0.1635998       0.0109764      14.9047058       0.0000000
                beds      -0.0451529       0.0068249      -6.6159365       0.0000000
     rt_Private_room      -0.5293783       0.0157308     -33.6524367       0.0000000
      rt_Shared_room      -1.2892590       0.0381443     -33.7995105       0.0000000
      pg_Condominium       0.1063490       0.0221782       4.7952003       0.0000017
            pg_House       0.0327806       0.0156954       2.0885538       0.0367893
            pg_Other       0.0861857       0.0239774       3.5944620       0.0003276
        pg_Townhouse      -0.0277116       0.0338485      -0.8186965       0.4129916
    w_pg_Condominium       0.5928369       0.0689612       8.5966706       0.0000000
          w_pg_House      -0.0774462       0.0318830      -2.4290766       0.0151661
          w_pg_Other       0.4851047       0.0551461       8.7967121       0.0000000
      w_pg_Townhouse      -0.2724493       0.1223388      -2.2270058       0.0259833
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           14.277

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        2458.006           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               14         393.052           0.0000
Koenker-Bassett test             14         169.585           0.0000
================================ END OF REPORT =====================================

As an illustration, let's look at some of the direct/indirect effects. The direct effect of the pg_Condominium variable means that condominiums are typically about 11% more expensive ($\beta_{pg\_Condominium}=0.1063$) than the benchmark property type, apartments. More relevant to this section, any given house surrounded by condominiums also receives a price premium. But, since pg_Condominium is a dummy variable, the spatial lag at site $i$ represents the percentage of properties near $i$ that are condominiums, which is between $0$ and $1$.[2] So, a unit change in this variable means increasing the condominium percentage by 100 percentage points. Thus, a $0.1$ increase in w_pg_Condominium (a change of ten percentage points) is associated with a roughly 5.9% increase in the house price ($\beta_{w\_pg\_Condominium} \approx 0.59$). Similar interpretations can be derived for all other spatially lagged variables to obtain the indirect effect of a change in the spatial lag.

However, to compute the indirect change for a given site $i$, you may need to examine the predicted values for $y_i$. In this example, since we are using a row-standardized weights matrix with twenty nearest neighbors, the impact of changing $x_i$ is the same for all of its neighbors and for any site $i$. Thus, the effect is always $\frac{\gamma}{20}$, or about $0.0296$. However, this would not be the case for many other kinds of weights (like Kernel, Queen, Rook, DistanceBand, or Voronoi), so we will demonstrate how to construct the indirect effect for a specific $i$:

First, predicted values for $y_i$ are stored in the predy attribute of any model:

m5.predy
array([[5.43610121],
       [5.38596868],
       [4.25377454],
       ...,
       [4.29145318],
       [4.89174746],
       [4.85867698]])

Showing this process below, let’s first change a property to be a condominium. Consider the third observation, which is the first apartment in the data:

# inspect the third observation (index 2)
db.loc[2]
accommodates                                                     2
bathrooms                                                        1
bedrooms                                                         1
beds                                                             1
neighborhood                                           North Hills
pool                                                             0
d2balboa                                                   2.49389
coastal                                                          0
price                                                           99
log_price                                                  4.59512
id                                                            9553
pg_Apartment                                                     1
pg_Condominium                                                   0
pg_House                                                         0
pg_Other                                                         0
pg_Townhouse                                                     0
rt_Entire_home/apt                                               0
rt_Private_room                                                  1
rt_Shared_room                                                   0
geometry              POINT (-117.1412083878189 32.75326632438691)
residual                                                  0.287341
Name: 2, dtype: object

This is an apartment. Let’s make a copy of our data and change this apartment into a condominium:

db_scenario = db.copy()
# make Apartment 0 and condo 1
db_scenario.loc[2, ['pg_Apartment', 'pg_Condominium']] = [0,1]

We’ve successfully made the change:

db_scenario.loc[2]
accommodates                                                     2
bathrooms                                                        1
bedrooms                                                         1
beds                                                             1
neighborhood                                           North Hills
pool                                                             0
d2balboa                                                   2.49389
coastal                                                          0
price                                                           99
log_price                                                  4.59512
id                                                            9553
pg_Apartment                                                     0
pg_Condominium                                                   1
pg_House                                                         0
pg_Other                                                         0
pg_Townhouse                                                     0
rt_Entire_home/apt                                               0
rt_Private_room                                                  1
rt_Shared_room                                                   0
geometry              POINT (-117.1412083878189 32.75326632438691)
residual                                                  0.287341
Name: 2, dtype: object

Now, we need to also update the spatial lag variates:

db['pg_Condominium']
0       0
1       1
2       0
3       0
4       0
       ..
6105    0
6106    1
6107    0
6108    0
6109    0
Name: pg_Condominium, Length: 6110, dtype: int64
weights.spatial_lag.lag_spatial(knn, db['pg_Condominium'])  # the spatial lag gives, for each site i, the share of its neighbors that are condominiums
array([0.  , 0.3 , 0.05, ..., 0.  , 0.  , 0.05])
wx_scenario = db_scenario.filter(
    like='pg'
).apply(
    lambda y: weights.spatial_lag.lag_spatial(knn, y)
).rename(
    columns=lambda c: 'w_'+c
).drop(
    'w_pg_Apartment', axis=1
)
wx_scenario
      w_pg_Condominium  w_pg_House  w_pg_Other  w_pg_Townhouse
0                 0.00        0.45        0.20            0.00
1                 0.30        0.35        0.00            0.05
2                 0.05        0.30        0.00            0.05
3                 0.00        0.95        0.00            0.05
4                 0.05        0.85        0.00            0.00
...                ...         ...         ...             ...
6105              0.30        0.05        0.00            0.00
6106              0.05        0.00        0.60            0.00
6107              0.00        0.55        0.25            0.00
6108              0.00        0.05        0.05            0.00
6109              0.05        0.70        0.05            0.00

6110 rows × 4 columns

And build a new exogenous $\mathbf{X}$ matrix (the constant term is added separately, via the intercept coefficient, when predicting below):

slx_exog_scenario = db_scenario[variable_names].join(wx_scenario)
slx_exog_scenario
      accommodates  bathrooms  bedrooms  beds  rt_Private_room  rt_Shared_room  pg_Condominium  pg_House  pg_Other  pg_Townhouse  w_pg_Condominium  w_pg_House  w_pg_Other  w_pg_Townhouse
0                5        2.0       2.0   2.0                0               0               0         1         0             0              0.00        0.45        0.20            0.00
1                6        1.0       2.0   4.0                0               0               1         0         0             0              0.30        0.35        0.00            0.05
2                2        1.0       1.0   1.0                1               0               1         0         0             0              0.05        0.30        0.00            0.05
3                2        1.0       1.0   1.0                1               0               0         1         0             0              0.00        0.95        0.00            0.05
4                2        1.0       1.0   1.0                1               0               0         1         0             0              0.05        0.85        0.00            0.00
...            ...        ...       ...   ...              ...             ...             ...       ...       ...           ...               ...         ...         ...             ...
6105             2        1.0       1.0   1.0                1               0               0         0         0             0              0.30        0.05        0.00            0.00
6106             6        2.0       2.0   2.0                0               0               1         0         0             0              0.05        0.00        0.60            0.00
6107             1        1.0       1.0   1.0                1               0               0         1         0             0              0.00        0.55        0.25            0.00
6108             3        1.0       1.0   1.0                0               0               0         0         0             0              0.00        0.05        0.05            0.00
6109             3        1.0       1.0   2.0                0               0               0         1         0             0              0.05        0.70        0.05            0.00

6110 rows × 14 columns

Now, our new prediction (in the scenario where we have changed site 2 from an apartment into a condominium) is:

y_pred_scenario = m5.betas[0] + slx_exog_scenario @ m5.betas[1:]  # @ is matrix multiplication

This prediction will be exactly the same for all sites, except site 2 and its neighbors. So, the neighbors to site 2 are:

knn.neighbors[2]
[772,
 2212,
 139,
 4653,
 2786,
 1218,
 138,
 808,
 1480,
 4241,
 1631,
 3617,
 2612,
 1162,
 135,
 23,
 5528,
 3591,
 407,
 6088]

And the effect of changing site 2 into a condominium is associated with the following changes to $y_i$:

# pull out site 2 and its neighbors; these are the rows whose predictions differ from the original ones
(y_pred_scenario - m5.predy).loc[[2] + knn.neighbors[2]]  
             0
2     0.106349
772   0.029642
2212  0.029642
139   0.029642
4653  0.029642
2786  0.029642
1218  0.029642
138   0.029642
808   0.029642
1480  0.029642
4241  0.029642
1631  0.029642
3617  0.029642
2612  0.029642
1162  0.029642
135   0.029642
23    0.029642
5528  0.029642
3591  0.029642
407   0.029642
6088  0.029642

We see the first row, representing the direct effect, is exactly equal to the estimate for pg_Condominium. For the other effects, though, we have only changed w_pg_Condominium by $0.05$:

scenario_near_2 = slx_exog_scenario.loc[knn.neighbors[2], ['w_pg_Condominium']]
orig_near_2 = slx_exog.loc[knn.neighbors[2], ['w_pg_Condominium']]
scenario_near_2.join(orig_near_2, lsuffix='_scenario', rsuffix= '_original')
# for site 2's neighbors, the share of condominiums around them used to be w_pg_Condominium_original; it has now increased by 0.05
      w_pg_Condominium_scenario  w_pg_Condominium_original
772                        0.10                       0.05
2212                       0.10                       0.05
139                        0.10                       0.05
4653                       0.10                       0.05
2786                       0.10                       0.05
1218                       0.10                       0.05
138                        0.10                       0.05
808                        0.05                       0.00
1480                       0.10                       0.05
4241                       0.10                       0.05
1631                       0.10                       0.05
3617                       0.10                       0.05
2612                       0.10                       0.05
1162                       0.05                       0.00
135                        0.05                       0.00
23                         0.10                       0.05
5528                       0.05                       0.00
3591                       0.05                       0.00
407                        0.05                       0.00
6088                       0.10                       0.05
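As a quick sanity check (not in the original output), the prediction change for each neighbor is just the SLX lag coefficient times this 0.05 change in the lag:

print(0.5928369 * 0.05)  # ≈ 0.029642, matching the prediction differences for site 2's neighbors above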

The example above illustrates the SLX model nicely. It really is simple: compute the spatial lags of the explanatory variables and include them in an OLS regression. The direct effects are interpreted exactly as in plain OLS, but the indirect effects need a little more care. For the KNN weights chosen in this example, the indirect effect is the effect transmitted through a site's neighborhood average, which is precisely the spatial dependence we wanted to capture.
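As a rough summary measure (a hedged sketch, not from the original post): with a row-standardized W, a variable's "total" effect in the SLX model is often reported as the sum of its direct coefficient and its lag coefficient, e.g. for pg_Condominium:

# label the coefficient vector and add direct + indirect components
coefs = pd.Series(m5.betas.flatten(), index=['CONSTANT'] + slx_exog.columns.tolist())
print(coefs['pg_Condominium'] + coefs['w_pg_Condominium'])  # ≈ 0.70 on the log-price scale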

4. SEM (Spatial Error Model)

The spatial error model includes a spatial lag in the error term of the equation:

$$\log{P_i} = \alpha + \sum_k \beta_k X_{ki} + u_i$$

$$u_i = \lambda u_{lag-i} + \epsilon_i$$

where $u_{lag-i} = \sum_j w_{i,j} u_j$.
Although it appears similar, this specification violates the assumptions about the error term in a classical OLS model. Hence, alternative estimation methods are required. PySAL incorporates functionality to estimate several of the most advanced techniques developed in the spatial econometrics literature. For example, we can use a general method of moments estimator that accounts for heteroskedasticity (Arraiz et al., 2010):

db[variable_names]
      accommodates  bathrooms  bedrooms  beds  rt_Private_room  rt_Shared_room  pg_Condominium  pg_House  pg_Other  pg_Townhouse
0                5        2.0       2.0   2.0                0               0               0         1         0             0
1                6        1.0       2.0   4.0                0               0               1         0         0             0
2                2        1.0       1.0   1.0                1               0               0         0         0             0
3                2        1.0       1.0   1.0                1               0               0         1         0             0
4                2        1.0       1.0   1.0                1               0               0         1         0             0
...            ...        ...       ...   ...              ...             ...             ...       ...       ...           ...
6105             2        1.0       1.0   1.0                1               0               0         0         0             0
6106             6        2.0       2.0   2.0                0               0               1         0         0             0
6107             1        1.0       1.0   1.0                1               0               0         1         0             0
6108             3        1.0       1.0   1.0                0               0               0         0         0             0
6109             3        1.0       1.0   2.0                0               0               0         1         0             0

6110 rows × 10 columns

knn = weights.KNN.from_dataframe(db, k=20)  # rebuild the KNN weights with the 20 nearest neighbors
knn.transform='r'
# knn.weights
m6 = spreg.GM_Error_Het(
    y = db[['log_price']].values, 
    x = db[variable_names].values,
    w=knn,
    name_y='log_price', 
    name_x=variable_names
)
print(m6.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES (HET)
---------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
Pseudo R-squared    :      0.6655
N. of iterations    :           1                Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.4262033       0.0217088     203.8898741       0.0000000
        accommodates       0.0695536       0.0063268      10.9934495       0.0000000
           bathrooms       0.1614101       0.0151312      10.6673765       0.0000000
            bedrooms       0.1739251       0.0146697      11.8560847       0.0000000
                beds      -0.0377710       0.0088293      -4.2779023       0.0000189
     rt_Private_room      -0.4947947       0.0163843     -30.1993140       0.0000000
      rt_Shared_room      -1.1613985       0.0515304     -22.5381175       0.0000000
      pg_Condominium       0.1003761       0.0213148       4.7092198       0.0000025
            pg_House       0.0308334       0.0147100       2.0960849       0.0360747
            pg_Other       0.0861768       0.0254942       3.3802547       0.0007242
        pg_Townhouse      -0.0074515       0.0292873      -0.2544285       0.7991646
               lambda       0.6448728       0.0186626      34.5543740       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
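For comparison (a hedged sketch, not in the original post), the same spatial error model can also be estimated by maximum likelihood with spreg.ML_Error; the estimates should be broadly similar to the GMM results above:

m6_ml = spreg.ML_Error(
    db[['log_price']].values,
    db[variable_names].values,
    w=knn,
    name_y='log_price',
    name_x=variable_names
)
print(m6_ml.summary)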

5. SAR (Spatial Lag Model)

The SAR (spatial lag) specification is:

$$\log{P_i} = \alpha + \rho \log{P_{lag-i}} + \sum_k \beta_k X_{ki} + \epsilon_i$$

Although it might not seem very different from the previous equation, this model violates the exogeneity assumption that is crucial for OLS to work. Put simply, this occurs because $P_i$ appears on both sides of the equals sign: since $P_i$ is included in computing $P_{lag-i}$, exogeneity is violated. As in the case of the spatial error model, several techniques have been proposed to overcome this limitation, and PySAL implements several of them. In the example below, we use a two-stage least squares estimation (Anselin, 1988), where the spatial lags of all the explanatory variables are used as instruments for the endogenous lag:

m7 = spreg.GM_Lag(
    db[['log_price']].values, 
    db[variable_names].values,
    w=knn, 
    name_y='log_price', 
    name_x=variable_names
)
print(m7.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          12
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6098
Pseudo R-squared    :      0.7057
Spatial Pseudo R-squared:  0.6883

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       2.7440254       0.0727290      37.7294400       0.0000000
        accommodates       0.0697596       0.0048157      14.4859187       0.0000000
           bathrooms       0.1626725       0.0104007      15.6405467       0.0000000
            bedrooms       0.1604137       0.0104823      15.3033012       0.0000000
                beds      -0.0365065       0.0065336      -5.5874750       0.0000000
     rt_Private_room      -0.4981415       0.0151396     -32.9031780       0.0000000
      rt_Shared_room      -1.1157392       0.0365563     -30.5210777       0.0000000
      pg_Condominium       0.1072995       0.0209048       5.1327614       0.0000003
            pg_House      -0.0004017       0.0136828      -0.0293598       0.9765777
            pg_Other       0.1207503       0.0214771       5.6222929       0.0000000
        pg_Townhouse      -0.0185543       0.0322730      -0.5749190       0.5653461
         W_log_price       0.3416482       0.0147787      23.1175620       0.0000000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_accommodates, W_bathrooms, W_bedrooms, W_beds,
             W_pg_Condominium, W_pg_House, W_pg_Other, W_pg_Townhouse,
             W_rt_Private_room, W_rt_Shared_room
================================ END OF REPORT =====================================

Similarly to the effects in the SLX regression, changes in the spatial lag regression need to be interpreted with care. Here, W_log_price applies consistently over all observations and effectively changes the strength of each of the $\beta$ coefficients. Thus, it is useful to use predictions and scenario-building to predict $y$ when changing $X$, which allows you to analyze the direct and indirect components.
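One common summary (a hedged sketch, not from the original post): with a row-standardized W, the average total effect of a unit change in a regressor in the SAR model works out to $\beta_k / (1 - \rho)$. Using m7's estimates for accommodates:

rho = m7.betas[-1][0]        # coefficient on W_log_price (the spatial lag term comes last)
beta_acc = m7.betas[1][0]    # 'accommodates' coefficient (CONSTANT comes first)
print(beta_acc / (1 - rho))  # average total (direct + indirect) effect, ≈ 0.106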

References

https://geographicdata.science/book/notebooks/11_regression.html
沈体雁 et al., 《空间计量经济学》 (Spatial Econometrics)
沈体雁 et al., 《空间计量分析软件》 (Spatial Econometric Analysis Software)
Code and other materials for this post


[1] Since we use the log price as the $y$ variable, our $\beta$ coefficients are interpreted as (semi-)elasticities, meaning that a unit change in the $x_i$ variate is associated with approximately a $100\beta$ percent change in the price $y_i$.

[2] Discover this for yourself: what is the average of numpy.array([True, True, True, False, False, True])?
