Linear Regression with pandas
This post was originally published here
In this post, we’ll walk through building linear regression models to predict housing prices resulting from economic activity. Topics covered will include:
- What is Regression
- Variable Selection
- Reading in the Data with pandas
- Ordinary Least Squares (OLS) Assumptions
- Simple Linear Regression
- Regression Plots
- Multiple Linear Regression
- Another Look at Partial Regression Plots
- Conclusion
- Navigating Pitfalls
Future posts will cover related topics such as exploratory analysis, regression diagnostics, and advanced regression modeling, but I wanted to jump right in so readers could get their hands dirty with data.
What is Regression?
Linear regression is a model that predicts a linear relationship between the dependent variable (plotted on the vertical, or Y, axis) and the predictor variables (plotted on the X axis). With a single predictor, the fitted model is a straight line.
Linear regression will be discussed in greater detail as we move through the modeling process.
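To make the idea of a straight-line fit concrete, here is a minimal sketch using NumPy's `polyfit`. The `x` and `y` values are made up for illustration and are not taken from the housing data. The fit estimates the slope and intercept of `y = intercept + slope * x` by least squares.

```python
import numpy as np

# Made-up illustrative data: y is roughly 2*x + 1 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit a degree-1 polynomial (a straight line) by least squares.
# polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)

# Predicted values lie exactly on the fitted line.
y_hat = intercept + slope * x
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The gap between `y` and `y_hat` is the residual, which is the quantity that least squares minimizes in squared form.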
Variable Selection
For our dependent variable we’ll use housing_price_index (HPI), which measures price changes of residential housing.
For our predictor variables, we use our intuition to select drivers of macro- (or “big picture”) economic activity, such as unemployment, interest rates, and gross domestic product (total economic output). For an explanation of our variables, including assumptions about how they impact housing prices, and all the sources of data used in this post, see here.
Reading in the Data with pandas
Once we’ve downloaded the data, read it in using pandas’ read_csv function.
    import pandas as pd

    # read in from csv using pd.read_csv
    # be sure to use the file path where you saved the data
    housing_price_index = pd.read_csv('/Users/tdobbins/Downloads/hpi/monthly-hpi.csv')
    unemployment = pd.read_csv('/Users/tdobbins/Downloads/hpi/unemployment.csv')
    federal_funds_rate = pd.read_csv('/Users/tdobbins/Downloads/hpi/fed_funds.csv')
    shiller = pd.read_csv('/Users/tdobbins/Downloads/hpi/shiller.csv')
    gross_domestic_product = pd.read_csv('/Users/tdobbins/Downloads/hpi/gdp.csv')
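Because the files are later joined on their date column, it can help to parse dates while reading. The following is a minimal sketch, assuming each file has a column named `date`. The CSV text and its values are made up for illustration and stand in for one of the downloaded files.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for one of the downloaded files.
csv_text = "date,housing_price_index\n2011-01-01,181.35\n2011-04-01,180.80\n"

# parse_dates converts the 'date' column to datetime64 instead of plain strings.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(sample.dtypes)
```

With a real file, the same `parse_dates=["date"]` argument can be passed alongside the file path.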
Once we have the data, invoke pandas’ merge method to join it into a single dataframe for analysis. Some series are reported monthly and others quarterly. No worries. We merge the dataframes on a shared column so that each row lines up observations from the same time period. In this example, the best column to merge on is the date column. See below.
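A minimal sketch of a merge on the date column, assuming every frame has a column named `date`. The frame names `monthly` and `quarterly` and all values are made up for illustration. The real frames are the ones read from the CSV files.

```python
import pandas as pd

# Small made-up frames; the real ones come from the CSV files.
monthly = pd.DataFrame({
    "date": ["2011-01-01", "2011-02-01", "2011-03-01", "2011-04-01"],
    "total_unemployed": [16.2, 16.0, 15.9, 16.1],
})
quarterly = pd.DataFrame({
    "date": ["2011-01-01", "2011-04-01"],
    "gross_domestic_product": [14881.3, 14989.6],
})

# An inner merge (the default) keeps only dates present in both frames,
# so the monthly rows without a quarterly match are dropped.
df = monthly.merge(quarterly, on="date")
print(df)
```

Further frames can be joined by chaining additional `.merge(other, on="date")` calls. Note that with the default inner join, the merged dataframe contains only the dates common to every frame.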
Let’s get a quick look at our variables with pandas’ head method. The headers in bold text represent the date and the variables we’ll test for our model. Each row represents a different time period.
date | date | sp500 | sp500 | consumer_price_index | consumer_price_index | long_interest_rate | long_interest_rate | housing_price_index | housing_price_index | total_unemployed | total_unemployed | more_than_15_weeks | more_than_15_weeks | not_in_labor_searched_for_work | not_in_labor_searched_for_work | multi_jobs | multi_jobs | leavers | leavers | losers | losers | federal_funds_rate | federal_funds_rate | total_expenditures | total_expenditures | labor_force_pr | labor_force_pr | producer_price_index | producer_price_index | gross_domestic_product | gross_domestic_product | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2011-01-01 |