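For the complete set of machine learning algorithms, see fenghaootong-github
House Price Prediction
Dataset Description
The training data has 81 columns in total (the Id, the target SalePrice, and 79 features), including: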
SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
….
Import the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math as mat
from scipy import stats
from scipy.stats import norm
from sklearn import preprocessing
import statsmodels.api as sm
from patsy import dmatrices
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import sklearn.linear_model as LinReg
import sklearn.metrics as metrics
Load the data
#loading the data
data_train = pd.read_csv('../DATA/SalePrice_train.csv')
data_test = pd.read_csv('../DATA/SalePrice_test.csv')
Of the 81 columns, only 7 features are selected to keep the explanation simple:
OverallQual
GrLivArea
GarageCars
TotalBsmtSF
1stFlrSF
FullBath
YearBuilt
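These features were chosen because they are relatively strongly correlated with the sale price.
For details on how the features are selected, see Data Cleaning.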
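The author's actual selection procedure is described in the Data Cleaning post and is not reproduced here. As a rough illustration of what "strongly correlated with the sale price" can mean in practice, the following minimal sketch ranks numeric columns by the absolute value of their Pearson correlation with the target. The helper name top_correlated and the small synthetic frame demo are assumptions made for this example; they do not come from the original code or data.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target, k):
    """Hypothetical helper: return the k numeric columns with the largest
    absolute Pearson correlation to `target`, sorted in descending order."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k)


# Synthetic stand-in for data_train (values are made up for illustration).
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)
demo = pd.DataFrame({
    "GrLivArea": area,                       # drives the price below
    "Noise": rng.normal(0, 1, size=200),     # unrelated to the price
    "SalePrice": 100 * area + rng.normal(0, 5000, size=200),
})

ranking = top_correlated(demo, "SalePrice", k=2)
print(ranking)
```

With the real data, the same call would be top_correlated(data_train, 'SalePrice', k=7). Whether that yields exactly the seven features listed above depends on the author's cleaning steps, so treat it as a starting point only.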
Data Preprocessing
data_train.shape
(1460, 81)
# Note: this list holds 6 of the 7 features named above ('1stFlrSF' is not included)
vars = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath','YearBuilt']
Y = data_train[['SalePrice']] #dim (1460, 1)
ID_train = data_train[['Id']] #dim (1460