What Is Linear Regression
- Supervised learning => the training samples are $D = \{(x_i, y_i)\}_{i=1}^{N}$
- The output / predicted value $y_i$ is a continuous variable
- The goal is to learn a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$
- Assume a linear relationship between the input $x$ and the output $y$
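Written out (in standard notation that the original leaves implicit), the hypothesis is linear in the parameters $w$ and $b$:

$$f(x) = w^\top x + b$$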
Testing / prediction phase
For a given $x$, predict its output $f(x)$
(the parameters $w$ and $b$ can be estimated with the least-squares method)
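To spell out the step the parenthesis refers to: least squares picks the parameters that minimize the squared prediction error. Absorbing $b$ into $w$ via a constant feature of 1, the estimate has the familiar closed form (assuming $X^\top X$ is invertible):

$$\hat{w} = \arg\min_{w} \lVert Xw - y \rVert_2^2 = (X^\top X)^{-1} X^\top y$$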
Categories
By the number of independent variables, linear regression falls into two main kinds: simple (one-variable) linear regression and multiple linear regression.
Simple linear regression has a single independent variable, while multiple linear regression has several. Nonlinear relationships can still be handled within this framework: polynomial (or curve) regression expands the inputs into higher-order terms and then fits a multiple linear regression on the expanded features, as shown below.
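For example, degree-2 polynomial regression in one variable fits

$$y = w_0 + w_1 x + w_2 x^2$$

which is still linear in the coefficients $w_0, w_1, w_2$, so the ordinary linear-regression machinery applies unchanged after the feature expansion.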
Example
Apply linear regression and polynomial regression to the Boston house-price dataset that ships with sklearn.
Linear Regression
from sklearn import datasets
boston = datasets.load_boston()  # load the Boston house-price data (note: removed in scikit-learn 1.2)
X = boston.data
y = boston.target
print(X.shape)
print(y.shape)
Output:
(506, 13)
(506,)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=8)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Linear regression (normalize= was removed in scikit-learn 1.2; see the pipeline sketch below)
lr = LinearRegression(normalize=True, n_jobs=2)
scores = cross_val_score(lr, X_train, y_train, cv=10, scoring='neg_mean_squared_error')  # negated mean squared error, 10-fold CV
print(scores.mean())
lr.fit(X_train, y_train)
lr.score(X_test, y_test)  # R^2 on the held-out test set
Output:
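On recent scikit-learn releases this snippet no longer runs as written: load_boston and the normalize= argument were both removed in version 1.2. Below is a minimal sketch of the same experiment on a current version, assuming the built-in California housing data as a stand-in dataset and a StandardScaler pipeline in place of normalize=True (the two scalings differ slightly, so scores will not match exactly):

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=8)

# the scaling step replaces the removed normalize=True argument
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
print(scores.mean())

model.fit(X_train, y_train)
print(model.score(X_test, y_test))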
Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures

for k in range(1, 4):
    lr_featurizer = PolynomialFeatures(degree=k)  # expand inputs into polynomial features; degree = highest power
    print('-----%d-----' % k)
    X_pf_train = lr_featurizer.fit_transform(X_train)
    X_pf_test = lr_featurizer.transform(X_test)
    pf_scores = cross_val_score(lr, X_pf_train, y_train, cv=10, scoring='neg_mean_squared_error')
    print(pf_scores.mean())
    lr.fit(X_pf_train, y_train)
    print(lr.score(X_pf_test, y_test))    # test R^2
    print(lr.score(X_pf_train, y_train))  # training R^2
Output:
From the results above: k=1 reduces to plain linear regression;
k=2 performs slightly better than linear regression;
k=3 overfits: the training R^2 is high while the test R^2 drops sharply.
Addressing Overfitting
Lasso Regression
# Regularization to fix the overfitting seen at k=3
lr_featurizer = PolynomialFeatures(degree=3)  # expand inputs into polynomial features; degree = highest power
X_pf_train = lr_featurizer.fit_transform(X_train)
X_pf_test = lr_featurizer.transform(X_test)

# Lasso regression (linear regression with an L1 penalty)
from sklearn.linear_model import Lasso
for a in [i / 10000 for i in range(0, 6)]:  # alpha = 0, 0.0001, ..., 0.0005
    print('----%f-----' % a)
    lasso = Lasso(alpha=a, normalize=True)
    pf_scores = cross_val_score(lasso, X_pf_train, y_train, cv=10, scoring='neg_mean_squared_error')
    print(pf_scores.mean())
    lasso.fit(X_pf_train, y_train)
    print(lasso.score(X_pf_test, y_test))    # test R^2
    print(lasso.score(X_pf_train, y_train))  # training R^2
Output:
The results above show that Lasso (L1) regularization raises the model's test score considerably compared with the unregularized degree-3 fit.
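Part of why the L1 penalty helps is that it drives many of the expanded polynomial coefficients exactly to zero, pruning the model. A quick sketch to check this, reusing X_pf_train and y_train from above (the alpha value here is illustrative):

import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.0005, normalize=True)  # normalize= requires scikit-learn < 1.2
lasso.fit(X_pf_train, y_train)

# count how many coefficients the L1 penalty zeroed out
n_zero = np.sum(lasso.coef_ == 0)
print('%d of %d coefficients are exactly zero' % (n_zero, lasso.coef_.size))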
Ridge Regression
# Regularization to fix the overfitting seen at k=3
lr_featurizer = PolynomialFeatures(degree=3)  # expand inputs into polynomial features; degree = highest power
X_pf_train = lr_featurizer.fit_transform(X_train)
X_pf_test = lr_featurizer.transform(X_test)

from sklearn.linear_model import Ridge
# Ridge regression (linear regression with an L2 penalty)
for a in [0, 0.005]:
    print('----%f-----' % a)
    ridge = Ridge(alpha=a, normalize=True)
    pf_scores = cross_val_score(ridge, X_pf_train, y_train, cv=10, scoring='neg_mean_squared_error')
    print(pf_scores.mean())
    ridge.fit(X_pf_train, y_train)
    print(ridge.score(X_pf_test, y_test))    # test R^2
    print(ridge.score(X_pf_train, y_train))  # training R^2
Output:
Comparing alpha=0 against alpha=0.005 (the two values the loop actually tries) shows that Ridge (L2) regularization likewise raises the model's test score considerably.
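Rather than scanning a handful of alpha values by hand, scikit-learn can also select alpha by cross-validation. A minimal sketch using RidgeCV on the same expanded features (the candidate grid below is illustrative):

import numpy as np
from sklearn.linear_model import RidgeCV

# cross-validated search over candidate L2 penalties
alphas = np.logspace(-4, 1, 20)
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_pf_train, y_train)

print('best alpha:', ridge_cv.alpha_)
print('test R^2:', ridge_cv.score(X_pf_test, y_test))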