The House Prices project is a real hassle.
As before, start by reading the data:
#coding=utf-8
import pandas as pd
from pandas import Series,DataFrame
import random
import numpy as np
from datetime import date
import datetime as dt
from numpy import nan as NA
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
import warnings
warnings.filterwarnings("ignore")
# Load the training data
traindata = pd.read_csv("train.csv",header=0)
print(traindata.shape)
print(traindata.head(5))
# Load the test data
testdata = pd.read_csv("test.csv",header=0)
print(testdata.shape)
print(testdata.head(5))
There are 79 variables! Now I get it: this competition is a test of patience and perseverance. Cleaning this data is going to take hours...
print(traindata.isnull().any())
print(testdata.isnull().any())
There are a lot of missing values too. Fine, let's go through the columns one by one,
using code like the following:
print(traindata.MSSubClass.isnull().any())
print(testdata.MSSubClass.isnull().any())
print(traindata.MSSubClass.describe())
print(traindata.MSSubClass.unique())
print(testdata.MSSubClass.describe())
print(testdata.MSSubClass.unique())
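Checking 79 columns one at a time is slow. As a rough sketch (the helper name `missing_summary` and the tiny demo frame are made up for illustration, they are not part of the original code), the missing counts of every column can be listed in one pass:

```python
import pandas as pd

def missing_summary(df):
    """Return the columns that contain missing values, with count and ratio."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only the columns that actually have gaps, worst first
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Tiny stand-in frame; in the post this would be traindata or testdata
demo = pd.DataFrame({
    "MSSubClass": [60, 20, 70, 50],
    "LotFrontage": [65.0, None, 68.0, None],
    "Alley": [None, None, "Grvl", None],
})
print(missing_summary(demo))
```

The per-column `describe()` and `unique()` calls are still useful afterwards, but only for the columns this table flags.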
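MSSubClass: no missing values, all numeric, keep as is
MSZoning: zoning classification? 5 categories in total; has missing values, which are simply filled with RL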
# Fill the gaps, then map each category to an integer code in one step
testdata['MSZoning'] = testdata['MSZoning'].fillna('RL')
testdata['MSZoning'] = testdata['MSZoning'].map({'RH': 0, 'RL': 1, 'RM': 2, 'FV': 3, 'C (all)': 4})
print(testdata.MSZoning.describe())
print(testdata.MSZoning.unique())
LotFrontage: linear feet of street connected to the lot; has missing values, which are simply filled with 70
traindata['LotFrontage'] = traindata['LotFrontage'].fillna(70)
print(traindata.LotFrontage.describe())
print(traindata.LotFrontage.unique())
testdata['LotFrontage'] = testdata['LotFrontage'].fillna(70)
print(testdata.LotFrontage.describe())
print(testdata.LotFrontage.unique())
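The fill values above ('RL' and 70) are hard-coded. A possible alternative, sketched here under the assumption that the most frequent value suits a categorical column and the median suits a numeric one, is to derive the fill value from the training set and apply it to both sets. The helper `fill_from_train` and the demo frames are hypothetical:

```python
import pandas as pd

def fill_from_train(train, test, column):
    """Fill gaps in `column` of both frames with a value learned from train."""
    if pd.api.types.is_numeric_dtype(train[column]):
        fill_value = train[column].median()   # numeric column: median
    else:
        fill_value = train[column].mode()[0]  # categorical column: most frequent value
    train = train.copy()
    test = test.copy()
    train[column] = train[column].fillna(fill_value)
    test[column] = test[column].fillna(fill_value)
    return train, test, fill_value

# Stand-in data; the real frames would be traindata and testdata
train_demo = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"],
                           "LotFrontage": [60.0, None, 80.0]})
test_demo = pd.DataFrame({"MSZoning": [None, "FV"],
                          "LotFrontage": [None, 50.0]})

train_demo, test_demo, zoning_fill = fill_from_train(train_demo, test_demo, "MSZoning")
train_demo, test_demo, frontage_fill = fill_from_train(train_demo, test_demo, "LotFrontage")
print(zoning_fill, frontage_fill)
```

Learning the value from the training set only keeps the test set from influencing preprocessing.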
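The code is similar for every column, so I won't paste it from here on. There are 79 variables, after all!
LotArea: no missing values, keep as is
Street: no missing values, 2 string values, mapped to 0 and 1
Alley: far too many missing values, drop this feature
LotShape: no missing values, 4 string values, mapped to 0, 1, 2, 3
LandContour: no missing values, 4 string values, mapped to 0, 1, 2, 3
Utilities: almost every row has the same value, drop this feature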
traindata = traindata.drop('Utilities', axis=1)
print(traindata.head(5))
testdata = testdata.drop('Utilities', axis=1)
print(testdata.head(5))
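Alley and Utilities were dropped after eyeballing them. A rough way to flag such columns automatically is sketched below; the function name and both thresholds (`max_missing`, `max_dominant`) are arbitrary assumptions, not values used in this post:

```python
import pandas as pd

def low_information_columns(df, max_missing=0.5, max_dominant=0.99):
    """List columns that are mostly missing or almost constant."""
    to_drop = []
    for col in df.columns:
        missing_ratio = df[col].isnull().mean()
        # Share of the non-missing rows taken by the most common value
        shares = df[col].value_counts(normalize=True)
        dominant_ratio = shares.iloc[0] if len(shares) else 1.0
        if missing_ratio > max_missing or dominant_ratio > max_dominant:
            to_drop.append(col)
    return to_drop

# Stand-in frame imitating the two dropped columns plus a normal one
demo = pd.DataFrame({
    "Alley": [None, None, None, "Grvl"],
    "Utilities": ["AllPub"] * 4,
    "LotArea": [8450, 9600, 11250, 9550],
})
print(low_information_columns(demo))
```

The returned list can be passed to `drop(columns=...)` on both frames so that train and test keep the same columns.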
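LotConfig: no missing values, 5 string values, mapped to 0, 1, 2, 3, 4
LandSlope: no missing values, 3 string values, mapped to 0, 1, 2
Neighborhood: no missing values, 25 string values, mapped to 0~24
Condition1: no missing values, 9 string values, mapped to 0~8
Condition2: no missing values, 8 string values, mapped to 0~7
BldgType: no missing values, 5 string values, mapped to 0, 1, 2, 3, 4
HouseStyle: no missing values, 8 string values, mapped to 0~7
OverallQual: no missing values, numeric
OverallCond: no missing values, numeric
YearBuilt: no missing values, numeric (a year, simply treated as a number)
YearRemodAdd: no missing values, numeric (a year, simply treated as a number)
RoofStyle: no missing values, 5 string values, mapped to 0, 1, 2, 3, 4
RoofMatl: no missing values, 8 string values, mapped to 0~7
Halftime break, I'm worn out. Using only the features processed above, a GB (gradient boosting) submission scored 0.19591
SVM: 0.36002
After many rounds of testing, the best result came from 200 trees: 0.17346
I'm out of submissions for today and there are too many variables left, so I'll try again next time I'm free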
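Since the per-column mapping code is not shown, here is a minimal sketch of how all the string columns above could be mapped to 0, 1, 2, ... in one loop. The helper `encode_categories` is hypothetical, and assigning the codes in alphabetical order is an assumption; the post does not say which string received which number:

```python
import pandas as pd

def encode_categories(train, test, columns):
    """Map each string category to an integer code shared by train and test."""
    train, test = train.copy(), test.copy()
    for col in columns:
        # Build one mapping from the values seen in either frame
        categories = sorted(set(train[col].dropna()) | set(test[col].dropna()))
        mapping = {value: code for code, value in enumerate(categories)}
        train[col] = train[col].map(mapping)
        test[col] = test[col].map(mapping)
    return train, test

# Stand-in frames; the real ones would be traindata and testdata
train_demo = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                           "LandSlope": ["Gtl", "Mod", "Gtl"]})
test_demo = pd.DataFrame({"Street": ["Pave", "Pave"],
                          "LandSlope": ["Sev", "Gtl"]})

train_enc, test_enc = encode_categories(train_demo, test_demo, ["Street", "LandSlope"])
print(train_enc)
print(test_enc)
```

Building the mapping from both frames means a value such as "Sev", which only appears in the test demo, still gets a code instead of becoming NaN.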
UseFlag = traindata['SalePrice'].values
#print(UseFlag)
UseFeature = traindata[['LotArea','Street','LotShape','LandContour','LotConfig','LandSlope',\
'Neighborhood','Condition1','Condition2','BldgType','HouseStyle',\
'OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle',\
'RoofMatl']].values
# Standardize the features; the scaler is fitted on the training set only
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(UseFeature)
UseFeature = scaler.transform(UseFeature)  # transform() returns a new array, so keep it
# SalePrice is continuous, so every model here is a regressor
from sklearn.linear_model import LinearRegression
from sklearn import svm
CLF = svm.SVR(gamma=0.001, C=100.)
RF = RandomForestRegressor()
RF2 = RandomForestRegressor(n_estimators=40)
LR = LinearRegression()
GB = GradientBoostingRegressor(n_estimators=40)
GB2 = GradientBoostingRegressor(n_estimators=100)
GB3 = GradientBoostingRegressor(n_estimators=200)
GB4 = GradientBoostingRegressor(n_estimators=400)
# Reduce the 17 features to 15 principal components
from sklearn.decomposition import PCA
pca = PCA(n_components=15)
UseFeature = pca.fit_transform(UseFeature)
# Train the models
RF.fit(UseFeature,UseFlag)
RF2.fit(UseFeature,UseFlag)
CLF.fit(UseFeature,UseFlag)
LR.fit(UseFeature,UseFlag)
GB.fit(UseFeature,UseFlag)
GB2.fit(UseFeature,UseFlag)
GB3.fit(UseFeature,UseFlag)
GB4.fit(UseFeature,UseFlag)
temp = GB.predict(UseFeature)
temp = temp.round()
#print(temp)
#from sklearn.metrics import accuracy_score
#accuracy_score(UseFlag, temp)
TestFeature = testdata[['LotArea','Street','LotShape','LandContour','LotConfig','LandSlope',\
'Neighborhood','Condition1','Condition2','BldgType','HouseStyle',\
'OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle',\
'RoofMatl']].values
# Apply the scaler and PCA fitted on the training set; do not refit them on the test set
TestFeature = scaler.transform(TestFeature)
TestFeature = pca.transform(TestFeature)
#temp = GB.predict(TestFeature)
temp = GB3.predict(TestFeature)
#temp = temp.round()
#temp1 = GB.predict(TestFeature)
print(temp)
#print(temp1)
#temp = 0.6*temp+0.4*temp1
testdata['SalePrice']=temp
print(testdata.head(5))
outdata = testdata[['Id','SalePrice']]  # keep only the columns needed for the submission
outdata.to_csv("test_2018_2_22_GB3_PCA15.csv",index=False,header=True)  # save the submission file
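Daily submissions are limited, so it can help to compare tree counts locally before submitting. The sketch below scores a gradient boosting model with K-fold cross-validation, using the root mean squared error between the logarithms of predicted and actual prices, which is how the competition scores submissions. The helper names, the fold count, and the synthetic data are all assumptions made for illustration; with the real data, `UseFeature` and `UseFlag` would take the place of `X` and `y`:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def rmsle(y_true, y_pred):
    """Root mean squared error between log prices."""
    y_pred = np.maximum(y_pred, 1.0)  # guard against non-positive predictions
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

def cv_rmsle(features, prices, n_estimators, n_splits=5, seed=0):
    """Average RMSLE of a gradient boosting model over K folds."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(features):
        model = GradientBoostingRegressor(n_estimators=n_estimators,
                                          random_state=seed)
        model.fit(features[train_idx], prices[train_idx])
        predictions = model.predict(features[valid_idx])
        scores.append(rmsle(prices[valid_idx], predictions))
    return float(np.mean(scores))

# Synthetic stand-in data with strictly positive "prices"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.exp(12 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * rng.normal(size=200))

for n in (40, 100, 200):
    print(n, cv_rmsle(X, y, n))
```

A local score will not match the leaderboard exactly, but the ranking between settings is usually a useful guide for which one to spend a submission on.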
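Note:
The code above maps categorical values directly to 0, 1, 2, ..., which imposes an arbitrary ordering on the categories. That is incorrect. The proper approach is to convert them to one-hot encoding, or alternatively to sort the categories by their mean sale price and assign the integers in that order.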
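Both fixes mentioned in the note can be sketched in a few lines. This is only an illustration: the helper names `one_hot` and `mean_price_order` and the demo frames are made up, and a category that never appears in the training set ends up as NaN in the second approach.

```python
import pandas as pd

def one_hot(train, test, column):
    """One-hot encode `column` with identical dummy columns in both frames."""
    combined = pd.concat([train[column], test[column]], keys=["train", "test"])
    dummies = pd.get_dummies(combined, prefix=column)
    return dummies.loc["train"], dummies.loc["test"]

def mean_price_order(train, test, column, target="SalePrice"):
    """Encode categories by the rank of their mean target value in train."""
    means = train.groupby(column)[target].mean().sort_values()
    ranks = {category: rank for rank, category in enumerate(means.index)}
    # Categories missing from train have no rank and map to NaN
    return train[column].map(ranks), test[column].map(ranks)

# Stand-in data; the real frames would be traindata and testdata
train_demo = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Grvl"],
    "SalePrice": [200000, 90000, 180000, 110000],
})
test_demo = pd.DataFrame({"Street": ["Grvl", "Pave"]})

train_hot, test_hot = one_hot(train_demo, test_demo, "Street")
train_rank, test_rank = mean_price_order(train_demo, test_demo, "Street")
print(train_hot)
print(train_rank.tolist(), test_rank.tolist())
```

One-hot encoding adds one column per category, which grows quickly for a feature like Neighborhood with 25 values; the mean-price ordering keeps a single column but uses the target, so it should be computed from the training set only.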