Kaggle House Prices Project

Latest recommended article published on 2021-12-29 21:19:11

masbbx123

Category: Machine Learning. Tags: kaggle

Article link: https://blog.csdn.net/masbbx123/article/details/79315107

The House Prices project is quite a hassle.
As before, the first step is to read in the data:

#coding=utf-8
import pandas as pd
from pandas import Series,DataFrame 
import random
import numpy as np
from datetime import date
import datetime as dt
from numpy import nan as NA
from sklearn.tree import DecisionTreeRegressor  
from sklearn.ensemble import RandomForestRegressor  
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor

import warnings
warnings.filterwarnings("ignore")

# Read the training data
traindata = pd.read_csv("train.csv",header=0)
print(traindata.shape)
print(traindata.head(5))

# Read the test data
testdata = pd.read_csv("test.csv",header=0)
print(testdata.shape)
print(testdata.head(5))

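There are actually 79 variables! I get it now: this competition is a test of patience and perseverance, and cleaning this data is going to take several hours...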

print(traindata.isnull().any())
print(testdata.isnull().any())
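
`isnull().any()` only reports whether a column has gaps, not how many. As a minimal sketch (the toy frame and the helper name `missing_report` are illustrative assumptions, not part of the original code), the missing counts can be listed per column, largest first:

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the number of missing values per column, largest first, omitting complete columns."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy stand-in for train.csv; the real frame would come from pd.read_csv
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(traindata)` and `missing_report(testdata)` in the same way would show which of the 79 variables need attention first.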

There are still plenty of missing values. Fine, let's go through the variables one by one,
using code like the following:

print(traindata.MSSubClass.isnull().any())
print(testdata.MSSubClass.isnull().any())
print(traindata.MSSubClass.describe())
print(traindata.MSSubClass.unique())
print(testdata.MSSubClass.describe())
print(testdata.MSSubClass.unique())

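MSSubClass: no missing values, all numeric; nothing to do
MSZoning: the zoning classification, 5 categories in total; has missing values, which are filled directly with RL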

# Fill the missing MSZoning values with 'RL' (fillna avoids chained assignment)
testdata['MSZoning'] = testdata['MSZoning'].fillna('RL')
# Map the five zoning categories to integer codes
zoning_codes = {'RH': 0, 'RL': 1, 'RM': 2, 'FV': 3, 'C (all)': 4}
testdata['MSZoning'] = testdata['MSZoning'].map(zoning_codes)
# The training set needs the same mapping before MSZoning can be used as a feature
print(testdata.MSZoning.describe())
print(testdata.MSZoning.unique())

LotFrontage: linear feet of street connected to the property; has missing values, which are filled directly with 70

# Fill missing LotFrontage values with 70 in both data sets
traindata['LotFrontage'] = traindata['LotFrontage'].fillna(70)
print(traindata.LotFrontage.describe())
print(traindata.LotFrontage.unique())
testdata['LotFrontage'] = testdata['LotFrontage'].fillna(70)
print(testdata.LotFrontage.describe())
print(testdata.LotFrontage.unique())
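
Hard-coding 70 works, but the fill value can also be derived from the data. A hedged sketch, assuming the training-set median is an acceptable substitute for the constant (this is a swapped-in technique, not what the post did; the toy frames stand in for `traindata` and `testdata`):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for traindata / testdata
train = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 50.0]})

# Compute the fill value from the training set only, then reuse it for the test set
fill_value = train["LotFrontage"].median()
train["LotFrontage"] = train["LotFrontage"].fillna(fill_value)
test["LotFrontage"] = test["LotFrontage"].fillna(fill_value)

print(fill_value)
print(test["LotFrontage"].tolist())
```

Taking the value from the training set alone keeps the test set from influencing the preprocessing.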

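The code is much the same for every variable, so I won't paste any more of it; there are 79 variables, after all!
LotArea: no missing values; nothing to do
Street: no missing values, 2 string categories, recoded as 0 and 1
Alley: far too many missing values, so this feature is dropped
LotShape: no missing values, 4 string categories, recoded as 0, 1, 2, 3
LandContour: no missing values, 4 string categories, recoded as 0, 1, 2, 3
Utilities: almost every row has the same value, so this feature is dropped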

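# Drop the Utilities column (keyword form; the positional axis argument is gone in pandas 2.x)
traindata = traindata.drop(columns=['Utilities'])
print(traindata.head(5))
testdata = testdata.drop(columns=['Utilities'])
print(testdata.head(5))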
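
The two reasons given for dropping a column (too many missing values, or almost a single value) can be checked numerically. A minimal sketch on toy data; the helper name `drop_candidates` and the 0.9 thresholds are illustrative assumptions, not values from the post:

```python
import numpy as np
import pandas as pd

def drop_candidates(df, max_missing=0.9, max_dominant=0.9):
    """List columns that are mostly missing or dominated by a single value."""
    candidates = []
    for col in df.columns:
        missing_share = df[col].isnull().mean()
        # Share of the most frequent non-missing value among all rows
        value_counts = df[col].value_counts()
        dominant_share = value_counts.iloc[0] / len(df) if len(value_counts) else 0.0
        if missing_share > max_missing or dominant_share > max_dominant:
            candidates.append(col)
    return candidates

toy = pd.DataFrame({
    "Alley": [np.nan] * 19 + ["Pave"],          # 95% missing
    "Utilities": ["AllPub"] * 19 + ["NoSeWa"],  # 95% one value
    "LotShape": ["Reg", "IR1"] * 10,            # balanced
})
to_drop = drop_candidates(toy)
print(to_drop)
toy = toy.drop(columns=to_drop)
```

The thresholds are a judgment call; a column flagged this way may still be worth keeping if the rare values carry signal.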

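LotConfig: no missing values, 5 string categories, recoded as 0, 1, 2, 3, 4
LandSlope: no missing values, 3 string categories, recoded as 0, 1, 2
Neighborhood: no missing values, 25 string categories, recoded as 0 to 24
Condition1: no missing values, 9 string categories, recoded as 0 to 8
Condition2: no missing values, 8 string categories, recoded as 0 to 7
BldgType: no missing values, 5 string categories, recoded as 0, 1, 2, 3, 4
HouseStyle: no missing values, 8 string categories, recoded as 0 to 7
OverallQual: no missing values, numeric
OverallCond: no missing values, numeric
YearBuilt: no missing values, numeric (a year, simply treated as a number)
YearRemodAdd: no missing values, numeric (a year, simply treated as a number)
RoofStyle: no missing values, 5 string categories, recoded as 0, 1, 2, 3, 4
RoofMatl: no missing values, 8 string categories, recoded as 0 to 7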
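
Since the recoding is the same for every string column, it can be written once. A minimal sketch, assuming integer codes are assigned in sorted order of the categories seen in either data set (the post does not say which code goes with which category, so this ordering and the helper name `encode_columns` are assumptions):

```python
import pandas as pd

def encode_columns(train, test, columns):
    """Replace string categories with integer codes shared by train and test."""
    for col in columns:
        # Build one category list from both frames so the codes agree
        categories = sorted(set(train[col].dropna()) | set(test[col].dropna()))
        mapping = {cat: code for code, cat in enumerate(categories)}
        train[col] = train[col].map(mapping)
        test[col] = test[col].map(mapping)
    return train, test

train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                      "LandSlope": ["Gtl", "Mod", "Sev"]})
test = pd.DataFrame({"Street": ["Pave", "Pave"],
                     "LandSlope": ["Mod", "Gtl"]})
train, test = encode_columns(train, test, ["Street", "LandSlope"])
print(train)
print(test)
```

Building the mapping from both frames matters because a category can appear in one file and not the other; separate mappings would then assign different codes to the same string.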

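Halftime break; I'm worn out. Using only the features processed above, a gradient boosting (GB) submission scored 0.19591.
SVM: 0.36002
After several more tests, the best result came from using 200 trees: 0.17346
I've used up today's submissions, and there are too many variables left; I'll try the rest next time I have a chance.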
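
The leaderboard scores above are root-mean-squared errors between the logarithms of predicted and actual prices, so lower is better. With a daily submission limit, the tree count could also be compared locally. A hedged sketch using cross-validation on synthetic data (the data, the candidate tree counts, and the 5-fold setup are all illustrative assumptions; scores on the real data will differ):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def rmsle(y_true, y_pred):
    """Root-mean-squared error between log prices."""
    # Clip predictions so the logarithm is always defined
    y_pred = np.maximum(y_pred, 1.0)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Synthetic stand-in for the encoded feature matrix and SalePrice
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.exp(12 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=300))

scores = {}
for n_trees in (40, 100, 200):
    model = GradientBoostingRegressor(n_estimators=n_trees, random_state=0)
    # Out-of-fold predictions: each row is predicted by a model that never saw it
    pred = cross_val_predict(model, X, y, cv=5)
    scores[n_trees] = rmsle(y, pred)
    print(n_trees, round(scores[n_trees], 4))
```

A local estimate like this will not match the leaderboard exactly, but it allows many settings to be tried without spending submissions.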

# Target: the sale price of each house
UseFlag = traindata['SalePrice'].values
#print(UseFlag)

# The features cleaned and encoded so far
FeatureCols = ['LotArea','Street','LotShape','LandContour','LotConfig','LandSlope',
               'Neighborhood','Condition1','Condition2','BldgType','HouseStyle',
               'OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle',
               'RoofMatl']
UseFeature = traindata[FeatureCols].values

# Standardize; transform() returns a new array, so the result must be kept
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(UseFeature)
UseFeature = scaler.transform(UseFeature)

# Reduce to 15 principal components, fitting PCA on the training set only
from sklearn.decomposition import PCA
pca = PCA(n_components=15)
UseFeature = pca.fit_transform(UseFeature)

from sklearn.linear_model import LogisticRegression
from sklearn import svm

# Note: SVC, RandomForestClassifier and LogisticRegression are classifiers, so they
# treat every distinct SalePrice as its own class; only the GB models are regressors
CLF = svm.SVC(gamma=0.001, C=100.)
RF = RandomForestClassifier()
RF2 = RandomForestClassifier(n_estimators=40)
LR = LogisticRegression()
GB = GradientBoostingRegressor(n_estimators=40)
GB2 = GradientBoostingRegressor(n_estimators=100)
GB3 = GradientBoostingRegressor(n_estimators=200)
GB4 = GradientBoostingRegressor(n_estimators=400)

# Train every model
RF.fit(UseFeature,UseFlag)
RF2.fit(UseFeature,UseFlag)
CLF.fit(UseFeature,UseFlag)
LR.fit(UseFeature,UseFlag)
GB.fit(UseFeature,UseFlag)
GB2.fit(UseFeature,UseFlag)
GB3.fit(UseFeature,UseFlag)
GB4.fit(UseFeature,UseFlag)

# Quick look at predictions on the training set
temp = GB.predict(UseFeature)
temp = temp.round()
#print(temp)
#from sklearn.metrics import accuracy_score
#accuracy_score(UseFlag, temp)

# Apply the scaler and PCA already fitted on the training set; do not refit on test data
TestFeature = testdata[FeatureCols].values
TestFeature = scaler.transform(TestFeature)
TestFeature = pca.transform(TestFeature)

# Predict with the 200-tree model
#temp = GB.predict(TestFeature)
temp = GB3.predict(TestFeature)
#temp = temp.round()
#temp1 = GB.predict(TestFeature)
print(temp)
#print(temp1)  # temp1 only exists if the line above is uncommented
#temp = 0.6*temp+0.4*temp1

testdata['SalePrice']=temp
print(testdata.head(5))

outdata = testdata[['Id','SalePrice']]  # keep only the columns needed for the submission
outdata.to_csv("test_2018_2_22_GB3_PCA15.csv",index=False,header=True)  # save the submission file
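
Scaling and PCA have to be fitted on the training features and then applied unchanged to the test features. One way to make that hard to get wrong is scikit-learn's `Pipeline`. The following is a sketch on synthetic data, not the code that produced the scores above, and the arrays are illustrative stand-ins:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for UseFeature / UseFlag / TestFeature (17 features, as in the post)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 17))
y_train = 180000 + 20000 * X_train[:, 0] + rng.normal(scale=5000, size=200)
X_test = rng.normal(size=(50, 17))

# fit() learns the scaler, the PCA and the model from the training data only;
# predict() reuses those fitted steps on the test data
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=15),
    GradientBoostingRegressor(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(pred[:5])
```

Because the preprocessing lives inside the model object, there is no separate `transform` result to forget to assign and no chance of refitting PCA on the test set.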