[Hands-On Machine Learning, From Getting Started to Giving Up] Starting with Linear Regression

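This article studies Melbourne house prices and analyzes the data with a linear regression model. It selects properties of type house and considers variables such as Rooms, Date, Distance, Bedroom2, Bathroom, and YearBuilt. After handling missing values and removing irrelevant variables, a model is built and trained. Finally, error metrics are computed and the predictions are shown, as an introductory project for beginners.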

Finally starting a new series~

Linear regression fits the data to the form $y = a_1x_1 + a_2x_2 + a_3x_3 + \dots + a_nx_n + b + \epsilon$.
Training the model yields the parameters $a_1, a_2, \dots, a_n, b$,
so that y can be predicted for new values of x.
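
As a small illustration of the idea above, the sketch below fits a linear model to synthetic data and recovers the parameters. The data, the true coefficients (3, -2) and the intercept (5) are assumptions made up for this example; they have nothing to do with the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up for illustration): y = 3*x1 - 2*x2 + 5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + noise

# Training estimates the coefficients a_1, a_2 and the intercept b
model = LinearRegression().fit(X, y)
print("coefficients a:", model.coef_)
print("intercept b:", model.intercept_)

# With the fitted parameters, y can be predicted for a new x
x_new = np.array([[1.0, 1.0]])
print("prediction:", model.predict(x_new))
```

The fitted values should land close to the ones used to generate the data, with the small gap coming from the noise term.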
Now let's get started~ This installment analyzes house prices in Melbourne.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Linear regression
from sklearn.linear_model import LinearRegression
# Train/test split
from sklearn.model_selection import train_test_split
from datetime import date

1. Dataset description

Melbourne Housing Market

Some Key Details

  • Suburb: Suburb

  • Address: Address

  • Rooms: Number of rooms

  • Price: Price in dollars

  • Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

  • Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

  • SellerG: Real Estate Agent

  • Date: Date sold

  • Distance: Distance from CBD

  • Regionname: General Region (West, North West, North, North east …etc)

  • Propertycount: Number of properties that exist in the suburb.

  • Bedroom2 : Scraped # of Bedrooms (from different source)

  • Bathroom: Number of Bathrooms

  • Car: Number of carspots

  • Landsize: Land Size

  • BuildingArea: Building Size

  • YearBuilt: Year the house was built

  • CouncilArea: Governing council for the area

  • Lattitude: Self explanatory

  • Longtitude: Self explanatory

import os
print(os.listdir('datasets'))
['BrentOilPrices.csv', '.DS_Store', 'Iris', 'Lending club loan data', 'Adult', 'Melbourne_housing_extra_data.csv']

2. A first look at the data

org_data = pd.read_csv('datasets/Melbourne_housing_extra_data.csv')
org_data.head(10)
   Suburb      Address             Rooms  Type  Price      Method  SellerG  Date       Distance  Postcode  ...  Bathroom  Car  Landsize  BuildingArea  YearBuilt  CouncilArea  Lattitude  Longtitude  Regionname             Propertycount
0  Abbotsford  68 Studley St       2      h     NaN        SS      Jellis   3/09/2016  2.5       3067.0    ...  1.0       1.0  126.0     NaN           NaN        Yarra        -37.8014   144.9958    Northern Metropolitan  4019.0
1  Abbotsford  85 Turner St        2      h     1480000.0  S       Biggin   3/12/2016  2.5       3067.0    ...  1.0       1.0  202.0     NaN           NaN        Yarra        -37.7996   144.9984    Northern Metropolitan  4019.0
2  Abbotsford  25 Bloomburg St     2      h     1035000.0  S       Biggin   4/02/2016  2.5       3067.0    ...  1.0       0.0  156.0     79.0          1900.0     Yarra        -37.8079   144.9934    Northern Metropolitan  4019.0
3  Abbotsford  18/659 Victoria St  3      u     NaN        VB      Rounds   4/02/2016  2.5       3067.0    ...  2.0       1.0  0.0       NaN           NaN        Yarra        -37.8114   145.0116    Northern Metropolitan  4019.0
4  Abbotsford  5 Charles St        3      h     1465000.0  SP      Biggin   4/03/2017  2.5       3067.0    ...  2.0       0.0  134.0     150.0         1900.0     Yarra        -37.8093   144.9944    Northern Metropolitan  4019.0
5  Abbotsford  40 Federation La    3      h     850000.0   PI      Biggin   4/03/2017  2.5       3067.0    ...  2.0       1.0  94.0      NaN           NaN        Yarra        -37.7969   144.9969    Northern Metropolitan  4019.0
6  Abbotsford  55a Park St         4      h     1600000.0  VB      Nelson   4/06/2016  2.5       3067.0    ...  1.0       2.0  120.0     142.0         2014.0     Yarra        -37.8072   144.9941    Northern Metropolitan  4019.0
7  Abbotsford  16 Maugie St        4      h     NaN        SN      Nelson   6/08/2016  2.5       3067.0    ...  2.0       2.0  400.0     220.0         2006.0     Yarra        -37.7965   144.9965    Northern Metropolitan  4019.0
8  Abbotsford  53 Turner St        2      h     NaN        S       Biggin   6/08/2016  2.5       3067.0    ...  1.0       2.0  201.0     NaN           1900.0     Yarra        -37.7995   144.9974    Northern Metropolitan  4019.0
9  Abbotsford  99 Turner St        2      h     NaN        S       Collins  6/08/2016  2.5       3067.0    ...  2.0       1.0  202.0     NaN           1900.0     Yarra        -37.7996   144.9989    Northern Metropolitan  4019.0

10 rows × 21 columns

# Check the variable types:
org_data.dtypes
Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

3. To keep the model simple, we keep only properties whose Type is h (house) and select the variables Rooms, Date, Distance, Landsize, Bedroom2, Bathroom and YearBuilt, plus the target Price

dataframe = org_data[org_data["Type"]=='h'].loc[:,["Rooms","Date","Distance","Landsize","Bedroom2","Bathroom","YearBuilt","Price"]]

dataframe.head()
   Rooms  Date       Distance  Landsize  Bedroom2  Bathroom  YearBuilt  Price
0  2      3/09/2016  2.5       126.0     2.0       1.0       NaN        NaN
1  2      3/12/2016  2.5       202.0     2.0       1.0       NaN        1480000.0
2  2      4/02/2016  2.5       156.0     2.0       1.0       1900.0     1035000.0
4  3      4/03/2017  2.5       134.0     3.0       2.0       1900.0     1465000.0
5  3      4/03/2017  2.5       94.0      3.0       2.0       NaN        850000.0
dataframe.shape
(12992, 8)

4. Drop the rows where Price is null

dataframe = dataframe.dropna(subset=['Price'])

dataframe.head()
   Rooms  Date       Distance  Landsize  Bedroom2  Bathroom  YearBuilt  Price
1  2      3/12/2016  2.5       202.0     2.0       1.0       NaN        1480000.0
2  2      4/02/2016  2.5       156.0     2.0       1.0       1900.0     1035000.0
4  3      4/03/2017  2.5       134.0     3.0       2.0       1900.0     1465000.0
5  3      4/03/2017  2.5       94.0      3.0       2.0       NaN        850000.0
6  4      4/06/2016  2.5       120.0     3.0       1.0       2014.0     1600000.0
dataframe.shape
(9944, 8)
# Count missing values
dataframe.isnull().describe()
        Rooms  Date   Distance  Landsize  Bedroom2  Bathroom  YearBuilt  Price
count   9944   9944   9944      9944      9944      9944      9944       9944
unique  1      1      2         2         2         2         2          1
top     False  False  False     False     False     False     True       False
freq    9944   9944   9939      7780      7978      7978      5352       9944
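
The `isnull().describe()` table is a little indirect to read: `top` is the more common of True/False and `freq` is how often it occurs, so YearBuilt (top True, freq 5352) is missing in 5352 of the 9944 rows, while Landsize (top False, freq 7780) is missing in 9944 - 7780 = 2164 rows. A more direct way to get the same information is sketched below on a tiny made-up frame; the `toy` data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, np.nan, 120.0, 94.0],
    "YearBuilt": [np.nan, 1900.0, np.nan, np.nan],
})

# Number of missing values per column, and the share of rows they represent
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```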

5. Convert Date to the number of days since the earliest date

dataframe["Date"] = pd.to_datetime(dataframe["Date"],dayfirst=True)

# Compute the earliest date once, then take the day difference for every row
start_date = dataframe["Date"].min()
dataframe["Days"] = (dataframe["Date"] - start_date).dt.days

dataframe = dataframe.drop(["Date"], axis=1)
dataframe.head()
   Rooms  Distance  Landsize  Bedroom2  Bathroom  YearBuilt  Price      Days
1  2      2.5       202.0     2.0       1.0       NaN        1480000.0  310
2  2      2.5       156.0     2.0       1.0       1900.0     1035000.0  7
4  3      2.5       134.0     3.0       2.0       1900.0     1465000.0  401
5  3      2.5       94.0      3.0       2.0       NaN        850000.0   401
6  4      2.5       120.0     3.0       1.0       2014.0     1600000.0  128
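
The `dayfirst=True` argument matters here because the dates are written day/month/year. The sketch below, which uses three date strings taken from the sample rows above, shows the day differences between them and how the same strings would be read with the month first. The gaps of 303 and 394 days agree with the Days column above (310 - 7 and 401 - 7).

```python
import pandas as pd

# Three date strings taken from the sample rows (day/month/year)
dates = pd.Series(["3/12/2016", "4/02/2016", "4/03/2017"])

# Day-first parsing: 3 December 2016, 4 February 2016, 4 March 2017
day_first = pd.to_datetime(dates, dayfirst=True)
days_since_start = (day_first - day_first.min()).dt.days
print(days_since_start.tolist())

# Month-first parsing would silently read the same strings as other dates
month_first = pd.to_datetime(dates, dayfirst=False)
print(month_first.dt.month.tolist())
```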

6. Convert YearBuilt to the age of the house in years (relative to the current year)

# 2019 is the current year at the time of writing, hard-coded here
dataframe["YearBuilt"] = 2019 - dataframe["YearBuilt"]

dataframe.head()
   Rooms  Distance  Landsize  Bedroom2  Bathroom  YearBuilt  Price      Days
1  2      2.5       202.0     2.0       1.0       NaN        1480000.0  310
2  2      2.5       156.0     2.0       1.0       119.0      1035000.0  7
4  3      2.5       134.0     3.0       2.0       119.0      1465000.0  401
5  3      2.5       94.0      3.0       2.0       NaN        850000.0   401
6  4      2.5       120.0     3.0       1.0       5.0        1600000.0  128

7. Look at the distribution of the non-null values of each variable

sns.kdeplot(dataframe["Price"])

[Figure: kernel density estimate of Price]

sns.kdeplot(dataframe["Distance"].dropna())

[Figure: kernel density estimate of Distance]

sns.kdeplot(dataframe["Landsize"].dropna())

[Figure: kernel density estimate of Landsize]

# Check for outliers
dataframe[dataframe["Landsize"]>70000]
       Rooms  Distance  Landsize  Bedroom2  Bathroom  YearBuilt  Price      Days
1198   3      9.2       75100.0   3.0       1.0       NaN        2000000.0  213
17293  3      34.6      76000.0   3.0       2.0       NaN        1085000.0  485
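
The threshold of 70000 above was picked by eye from the density plot. One common rule of thumb, shown here only as a hedged alternative and not as what the original analysis did, flags values above the third quartile plus 1.5 times the interquartile range. The sample values below are a handful of Landsize figures taken from the tables in this article.

```python
import pandas as pd

# A few Landsize values from the tables above, including one extreme value
landsize = pd.Series([202.0, 156.0, 134.0, 94.0, 120.0, 75100.0], name="Landsize")

# Interquartile-range rule: anything above Q3 + 1.5 * IQR is flagged
q1, q3 = landsize.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = landsize[landsize > upper]
print("upper bound:", upper)
print(outliers)
```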
sns.kdeplot(dataframe["Days"].dropna())

[Figure: kernel density estimate of Days]

sns.kdeplot(dataframe["YearBuilt"].dropna())

[Figure: kernel density estimate of YearBuilt]

YearBuilt has too many missing values (more than half of the rows) and its data quality is poor, so we decide to drop this column.

8. Handling missing values

# Fill missing values in each numeric feature with that column's mean
fill_cols = ["Distance", "Landsize", "Bedroom2", "Bathroom"]
filled = dataframe[fill_cols].fillna(dataframe[fill_cols].mean())

# Drop the original columns (and YearBuilt, as decided above), then append the filled ones
dataframe = dataframe.drop(fill_cols + ["YearBuilt"], axis=1)
dataframe = pd.concat([dataframe, filled], axis=1)

dataframe.head()
   Rooms  Price      Days  Distance  Landsize  Bedroom2  Bathroom
1  2      1480000.0  310   2.5       202.0     2.0       1.0
2  2      1035000.0  7     2.5       156.0     2.0       1.0
4  3      1465000.0  401   2.5       134.0     3.0       2.0
5  3      850000.0   401   2.5       94.0      3.0       2.0
6  4      1600000.0  128   2.5       120.0     3.0       1.0
dataframe.isnull().describe()
        Rooms  Price  Days   Distance  Landsize  Bedroom2  Bathroom
count   9944   9944   9944   9944      9944      9944      9944
unique  1      1      1      1         1         1         1
top     False  False  False  False     False     False     False
freq    9944   9944   9944   9944      9944      9944      9944
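
In this step the column means are computed on the full dataset before it is split, so the test rows influence the fill values. A variant that avoids this, sketched below with scikit-learn's `SimpleImputer` on a made-up `toy` frame, learns the means from the training rows only and reuses them for the test rows. This is an alternative to consider, not what the code above does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with gaps (values are made up for illustration)
toy = pd.DataFrame({
    "Distance": [2.5, 2.5, np.nan, 9.2, 34.6, 5.0],
    "Landsize": [202.0, np.nan, 134.0, 94.0, np.nan, 120.0],
})
train, test = train_test_split(toy, test_size=0.33, random_state=0)

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=toy.columns, index=train.index
)
# Reuse the training means to fill the test rows
test_filled = pd.DataFrame(
    imputer.transform(test), columns=toy.columns, index=test.index
)
print(imputer.statistics_)
```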

9. Draw a scatter plot matrix to inspect the relationships between variables

sns.pairplot(dataframe)

[Figure: scatter plot matrix of all variables]

10. Draw a heatmap to inspect the correlations between variables

fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(dataframe.corr(), annot=True, ax=ax)

[Figure: heatmap of the correlation matrix of all variables]

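Days and Landsize show little correlation with Price, so they are candidates for removal. Note that the code in the following steps does not actually drop them, which is why both still appear in the coefficient table in step 13.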
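
A hedged sketch of how such a selection could be done in code: rank the features by the absolute value of their correlation with Price and drop those under a cutoff. The synthetic `toy` data and the cutoff of 0.1 are assumptions for illustration; the original article does not state a threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data (made up): Price depends on Rooms but not on Days
rng = np.random.default_rng(0)
n = 500
rooms = rng.integers(1, 6, size=n)
days = rng.integers(0, 600, size=n)
price = 250_000 * rooms + rng.normal(scale=100_000, size=n)
toy = pd.DataFrame({"Rooms": rooms, "Days": days, "Price": price})

# Rank features by the absolute correlation with the target
corr_with_price = toy.corr()["Price"].drop("Price")
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)

# Drop features whose absolute correlation is below an arbitrary cutoff
weak = ranked[ranked < 0.1].index.tolist()
X = toy.drop(columns=["Price"] + weak)
print("kept features:", X.columns.tolist())
```

Correlation only measures linear association with the target one feature at a time, so a low value is a hint rather than proof that a feature is useless.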

11. Split into training and test sets

X=dataframe.drop(["Price"], axis=1)
y=dataframe["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
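
The split above is random, so the coefficients and metrics in the following steps will differ slightly from run to run. Passing `random_state` makes the split reproducible, as the small sketch below shows on made-up arrays; the value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state select exactly the same rows
split_a = train_test_split(X, y, test_size=0.4, random_state=42)
split_b = train_test_split(X, y, test_size=0.4, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```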

12. Fit the linear regression model

lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

13. Inspect the fitted coefficients

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
ranked_coefficients = coeff_df.sort_values("Coefficient", ascending = False)
ranked_coefficients
          Coefficient
Bathroom  287227.794348
Rooms     256960.542449
Days      208.567181
Landsize  28.455846
Bedroom2  -40640.077882
Distance  -48979.633196
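
Each coefficient is in dollars per unit of its own feature, so the values above are not directly comparable: about 208.57 per day for Days versus about 256960.54 per room for Rooms. One way to compare them on a common scale, sketched below on synthetic data, is to standardize the features before fitting. The data and the true effects (250000 per room, 200 per day) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): two features on very different scales
rng = np.random.default_rng(0)
n = 1000
rooms = rng.integers(1, 6, size=n).astype(float)
days = rng.integers(0, 600, size=n).astype(float)
price = 250_000 * rooms + 200 * days + rng.normal(scale=50_000, size=n)
X = np.column_stack([rooms, days])

# Raw coefficients: dollars per room and dollars per day
raw = LinearRegression().fit(X, price)
# Standardized coefficients: dollars per one standard deviation of each feature
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), price)

print("raw coefficients:", raw.coef_)
print("standardized coefficients:", scaled.coef_)
```

Each standardized coefficient equals the raw coefficient multiplied by the standard deviation of its feature, so a small per-day effect can still add up over a range of several hundred days.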

14. Predict and visualize the predictions

predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.ylim([200000,1000000])
plt.xlim([200000,1000000])

[Figure: scatter plot of predicted prices against actual prices on the test set, both axes limited to 200000 to 1000000]

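# Inspect the distribution of the residuals
# (histplot replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(y_test - predictions, bins=50, kde=True)
# Not bad: the distribution is fairly sharply peaked

[Figure: histogram of the residuals y_test - predictions]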

15. Compute RMSE (root mean squared error), MSE (mean squared error) and MAE (mean absolute error)

from sklearn import metrics
# Explained variance score: 1.0 is best, lower is worse
print("score:", metrics.explained_variance_score(y_test, predictions))
print("MAE:", metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print("r^2:", metrics.r2_score(y_test, predictions))
score: 0.31016009739486505
MAE: 406517.11675773124
MSE: 347933550457.3135
RMSE: 589858.9241990948
r^2: 0.31015583638325983
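
To make the definitions of these metrics concrete, the sketch below recomputes them by hand with NumPy and checks them against scikit-learn. The four actual and predicted prices are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted prices
y_true = np.array([1_000_000.0, 850_000.0, 1_465_000.0, 1_600_000.0])
y_pred = np.array([900_000.0, 950_000.0, 1_400_000.0, 1_300_000.0])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, same unit as the price
# r^2: one minus the ratio of residual variation to total variation
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae, metrics.mean_absolute_error(y_true, y_pred))
print("MSE:", mse, metrics.mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("r^2:", r2, metrics.r2_score(y_true, y_pred))
```

Because RMSE and MAE are in dollars, they can be read directly against typical house prices, whereas MSE is in squared dollars and is mostly useful as an optimization target.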

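This is a basic multiple linear regression model. It covered filling in missing values and looking at the distributions of some variables. The final fit is mediocre (r^2 is about 0.31), limited by the simplicity of a linear model and by the fact that no variable transformations were applied. It serves only as a first data analysis project for getting familiar with the data analysis workflow.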
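
As one example of what a variable transformation could look like, the sketch below fits the linear model on the logarithm of the price using scikit-learn's `TransformedTargetRegressor`. This was not tried in the analysis above, and the synthetic data (prices that grow multiplicatively with the number of rooms) is an assumption chosen so that the log transform is appropriate.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): price grows multiplicatively with the number of rooms
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=300).astype(float).reshape(-1, 1)
price = 400_000 * np.exp(0.25 * rooms[:, 0] + rng.normal(scale=0.2, size=300))

# Fit a linear model on log(price); predictions are mapped back with exp
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(rooms, price)

pred = model.predict(np.array([[3.0]]))
print("slope on the log scale:", model.regressor_.coef_)
print("predicted price for 3 rooms:", pred)
```

Whether this helps on the real housing data would need to be checked by comparing the metrics from step 15 before and after the change.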

I hope this is helpful to readers.
