Predicting Boston House Prices with Linear Regression

Summary

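Last time I wrote about New York University homework4 test1. This time I'll briefly implement New York University homework4 test2.
In short, this test uses linear regression to predict Boston house prices.
OK, let's begin.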

Brief walkthrough

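First, we load the Boston data and look at the correlation coefficient matrix between the features in the dataset. (Note that load_boston was removed in scikit-learn 1.2, so the code in this post needs an older version.)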

import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
Y = boston.target
boston = pd.DataFrame(boston.data)
print(boston.corr())

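From the printed result we can see that there are 13 features. This may differ from the data provided on Kaggle (https://www.kaggle.com/c/boston-housing) because medv is not among these columns: it is the target, and here it is kept separately in boston.target (Y above). For what each feature means, see the attribute descriptions at the link above.
Next, default column labels like 0, 1, 2, … aren't very clear, so I replaced the row and column labels with the corresponding attribute names and printed the matrix again.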

import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
Y = boston.target
boston = pd.DataFrame(boston.data,columns=['crim','zn','indus','chas','nox','rm','age','dis','rad','tax','ptratio','black','lstat'])
print(boston.corr())

Then we plot the correlation matrix.

import matplotlib.pyplot as plt
plt.matshow(boston.corr(),  cmap=plt.cm.jet)
plt.show()

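OK, now that we know the dataset, the next steps are splitting it into training and test sets, then building and training the model.
First, the train/test split and the normalization of the data. I use min-max normalization here. As for why we normalize at all, you can look that up yourself; when I have time I'll write up a summary and post it.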

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import numpy as np
X_train,X_test,y_train,y_test=train_test_split(boston,Y,random_state=0,test_size=0.20)
# use separate scalers for X and y, and fit them on the training split only
x_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()
X_train=x_scaler.fit_transform(X_train)
# the test split is transformed with the statistics learned from the training split
X_test=x_scaler.transform(X_test)
y_train=y_scaler.fit_transform(y_train.reshape(-1,1))
y_test=y_scaler.transform(y_test.reshape(-1,1))
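To make the normalization step a bit more concrete: min-max scaling maps each value v to (v - min) / (max - min). The snippet below is only a pure-Python sketch of that formula, not scikit-learn's implementation; the helper names fit_min_max and apply_min_max and the numbers are made up for illustration. It also shows a common practice: learn min and max from the training values and reuse them for the test values.

```python
# Minimal sketch of min-max scaling (hypothetical helpers, made-up numbers).

def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Map each value v to (v - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

train = [10.0, 20.0, 30.0, 50.0]
test = [15.0, 60.0]

# The statistics come from the training split only.
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

print(train_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(test_scaled)   # [0.125, 1.25]
```

Notice that a test value can fall outside [0, 1] (here 60 becomes 1.25). That is expected when the test data contains values the training data never saw.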

With the dataset split, the next step is building the linear regression model. Straight to the code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# 6. Then, please predict new values using the test set.
# Please give the coefficient for your model.
lr=LinearRegression()
lr.fit(X_train,y_train)
lr_y_predict=lr.predict(X_test)
print(lr.score(X_test, y_test))
# 7. The sign of a regression coefficient tells you whether there is a positive or negative correlation
# between each independent variable and the dependent variable. What does a positive coefficient and a negative coefficient indicate respectively?
weight = lr.coef_
bias = lr.intercept_
print(weight)
print(bias)
# 8. Finally, to gain an understanding of how your model is performing, please score the model against three metrics: R squared, mean squared error,
# and mean absolute error. Write the lines of code to get your output; and answer the questions:
# a) Google R Squared, Mean Squared Error, and Mean Absolute Error. What do these metrics
# mean? What are the numbers telling you?
score = r2_score(y_test, lr_y_predict)
mse_test=np.sum((lr_y_predict-y_test)**2)/len(y_test)
mae_test=np.sum(np.absolute(lr_y_predict-y_test))/len(y_test)
print(score)
print(mse_test)
print(mae_test)
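For question 8 a), here is a small hand-computed sketch of the three metrics on made-up numbers, so you can see what each one measures. The function names mse, mae and r2 are my own for illustration; they are not the scikit-learn functions.

```python
# Sketch of the three metrics on toy data (made-up numbers).

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 2.5, 4.0]

print(mse(y_true, y_pred))  # 0.3125
print(mae(y_true, y_pred))  # 0.375
print(r2(y_true, y_pred))   # 0.75
```

MSE punishes large errors more heavily because of the square, MAE is the average size of the error in the same units as the target, and R squared says how much of the variance of the target the model explains (1 is perfect, 0 is no better than always predicting the mean).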

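By this point, if nothing went wrong, the score (R squared) should be around 0.6. How could I have the nerve to hand in homework with a score that low? Remember what I said in my last post on New York University homework4 task1? Skipping data preprocessing is as good as not doing the work at all. No optimization, no glory.
Now for the optimization. The method is very simple: keep only the features that are most strongly related to the result. In the code below those are RM, PTRATIO and LSTAT. Note that RM is positively correlated with the price while PTRATIO and LSTAT are negatively correlated, so what matters is the strength of the relationship, not its sign. The code also drops the rows whose price is exactly 50.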

# b) What do you think could improve the model? Try the possible improved model in coding lines as a bonus.

# improved model one: drop the rows priced at exactly 50 and train on three selected features only

import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

dataset = load_boston()
x_data = dataset.data
y_data = dataset.target
name_data = dataset.feature_names
i_=[]
for i in range(len(y_data)):
    if y_data[i] == 50:
        # remember the rows whose price is exactly 50; they are treated as error values
        i_.append(i)
# delete the error values
x_data = np.delete(x_data,i_,axis=0)
y_data = np.delete(y_data,i_,axis=0)
j_=[]
for i in range(13):
    if name_data[i] == 'RM' or name_data[i] == 'PTRATIO' or name_data[i] == 'LSTAT':
        continue
    # remember the indices of the features we do not keep
    j_.append(i)
# delete those features from the data
x_data = np.delete(x_data,j_,axis=1)
X_train,X_test,y_train,y_test=train_test_split(x_data,y_data,random_state=0,test_size=0.20)
# separate scalers for X and y, fitted on the training split only
x_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()
X_train=x_scaler.fit_transform(X_train)
X_test=x_scaler.transform(X_test)
y_train=y_scaler.fit_transform(y_train.reshape(-1,1))
y_test=y_scaler.transform(y_test.reshape(-1,1))
lr=LinearRegression()
lr.fit(X_train,y_train)
lr_y_predict=lr.predict(X_test)
score = r2_score(y_test, lr_y_predict)
print(score)
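As a side note on how a feature subset like this could be chosen: one simple option is to rank the features by the absolute value of their Pearson correlation with the target. The snippet below is only a sketch of that idea on a made-up toy table. The column names rooms, poverty and noise and all the numbers are hypothetical; it is not the Boston data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical toy columns, made up for illustration.
features = {
    "rooms": [4, 5, 6, 7, 8],
    "poverty": [30, 24, 19, 11, 6],
    "noise": [3, 1, 4, 1, 5],
}
price = [15, 19, 24, 31, 36]

# Rank by the strength of the relationship, ignoring its sign.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], price)),
    reverse=True,
)
top2 = ranked[:2]

for name in ranked:
    print(name, round(pearson(features[name], price), 3))
```

Here "poverty" has a strongly negative correlation with the price and still ranks at the top, while "noise" ends up last. A negative sign is no reason to throw a feature away.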

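So with this very simple bit of preprocessing, the score went up by about 20 percentage points. Pretty nice, right? I'm happy with it, anyway.
Finally, let's visualize the result.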

def show_res(y_test, y_predict):

    plt.figure()
    x = np.arange(0, len(y_predict))

    plt.plot(x, y_test, marker='*')
    plt.plot(x, y_predict, marker='o')

    plt.title('Predicted price vs. real price of Boston houses')
    plt.xlabel('test sample index')
    plt.ylabel('house price')

    plt.legend(['real price', 'predicted price'])
    plt.show()

show_res(y_test,lr_y_predict)

Here is the result:
[Figure: real price and predicted price for each test sample]

Life is short, I use Python.
