Competition link
https://tianchi.aliyun.com/competition/entrance/231702/introduction
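Notes
Special thanks to ChenglongChen for publishing the project code: https://github.com/ChenglongChen/Kaggle_CrowdFlower
- This is a simplified version: the ensemble uses two-level stacking rather than bagging. As a result it generalizes relatively poorly, meaning the offline (local validation) score and the online (leaderboard) score can differ considerably.
- To improve generalization, consider adding bagging, which amounts to ensembling models trained with different parameters on different training sets.
- There was originally also a keras_dnn model, but on the available hardware tuning it took too long, so it is left out for now.
- Parameters are tuned with Bayesian optimization, using the corresponding Python module hyperopt.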
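The bagging idea mentioned in the notes can be illustrated with a minimal sketch. It is not the project's code: the base model is a toy least-squares line, the data is made up, and the names `fit_line` and `bagged_predictor` are hypothetical. For brevity it trains the same kind of model on each bootstrap resample rather than varying the parameters as well.

```python
import random
from statistics import mean


def fit_line(points):
    """Least-squares fit of y = a*x + b on (x, y) pairs; stands in for any base model."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    # A bootstrap sample could repeat a single x; fall back to a constant model then.
    a = sum((x - x_bar) * (y - y_bar) for x, y in points) / var if var > 0 else 0.0
    b = y_bar - a * x_bar
    return lambda x: a * x + b


def bagged_predictor(points, n_models=25, seed=0):
    """Train n_models base models on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) rows with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_line(sample))
    return lambda x: mean(m(x) for m in models)


# Toy data: y = 2x + 1 plus small uniform noise (made up for illustration).
noise = random.Random(42)
train = [(x, 2 * x + 1 + noise.uniform(-0.5, 0.5)) for x in range(20)]
predict = bagged_predictor(train)
print(round(predict(10), 2))
```

Averaging over resampled training sets reduces the variance of the final prediction, which is why bagging tends to narrow the gap between offline and online scores.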
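Project structure
1. Files
Code
– preprocessing.py: preprocesses the data and generates the dataset files x_test.csv, x_train.csv and y_train.csv
– model_library.py: defines several models and the parameters to optimize for each
– model_param_opt.py: defines the parameter optimization logic
– generate_best_single_model.py: finds the best parameters for each model and stores them in model_log_best_params.txt
– ensemble.py: builds a simple ensemble of the best models and produces the submission file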
Data
– happiness_test_complete.csv
– happiness_train_complete.csv
– happiness_submit.csv
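2. Run order
Change into the project folder, then run:
python preprocessing.py
Preprocesses the data and produces the training data.
python generate_best_single_model.py
Finds the best parameters for each candidate model and stores them in model_log_best_params.txt.
python ensemble.py
Builds the model ensemble and generates the final result.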
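Because each step reads files written by the previous one, it can help to run the three scripts from a single entry point that stops at the first failure. The sketch below is one way to do that. The helper name `run_pipeline` is hypothetical and not part of the project, and it assumes the working directory is the project folder.

```python
import subprocess
import sys

# Script names are taken from the run order; each step consumes
# files produced by the step before it.
STEPS = ["preprocessing.py", "generate_best_single_model.py", "ensemble.py"]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on stale or missing inputs.
        runner([sys.executable, script], check=True)
```

Calling `run_pipeline()` from the project folder runs the steps in order. The `runner` parameter exists only so the function can be exercised without launching real processes.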
Code files
preprocessing.py
The data preprocessing script.
import pandas as pd
import numpy as np
datatrain = pd.read_csv('happiness_train_complete.csv',encoding="gb2312")
datatest = pd.read_csv('happiness_test_complete.csv',encoding="gb2312")
dataplot = datatrain.copy()
datatrain = datatrain[datatrain["happiness"]!=-8].reset_index(drop=True)
dataplot = dataplot[dataplot["happiness"]!=-8].reset_index(drop=True)
target_col = "happiness"
target = datatrain[target_col]
del datatrain['id']
del datatest['id']
label = datatrain['happiness']
del datatrain['happiness']
dataproc = pd.concat([datatrain,datatest],ignore_index=True)
dataproc['survey_type'] = dataproc['survey_type'].map(lambda x:x-1) # shift to 0/1
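# Mean happiness per province code (1-31), computed on the training rows
count = []
for i in range(1,32):
    count.append(dataplot.loc[dataplot['province']==i,'happiness'].mean())
# Provinces with no training rows get a default mean of 3
count = [i if not pd.isnull(i) else 3 for i in count]
#plt.scatter(range(1,32),count)
# Bucket provinces by mean happiness: low (<3.2), middle, high (>=3.9)
reg1 = [i for i in range(1,32) if count[i-1]<3.2]
reg2 = [i for i in range(1,32) if 3.2<=count[i-1]<3.9] # <= so a mean of exactly 3.2 is not left unassigned
reg3 = [i for i in range(1,32) if count[i-1]>=3.9]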
def spl(x):
    if x in [2,3,8,13,14,20,23,25,26,30]:
        return 0
    else:
        return 1
def spl1(x):
    if x in reg1:
        return 0
    elif x in reg2:
        return 1
    elif x in reg3:
        return 2
dataproc['province_1'] = dataproc['province'].map(spl) # add two new province features
dataproc['province_2'] = dataproc['province'].map(spl1)
dataproc['gender'] = dataproc['gender'].map(lambda x:x-1) # shift to 0/1
dataproc['age'] = dataproc['survey_time'].map(lambda x:int(x[:4]))-dataproc['birth']
dataproc.loc[dataproc['nationality']<0,'nationality'] = 1
dataproc = dataproc.join(pd.get_dummies(dataproc["nationality"],prefix="nationality"))
def nation(x):
    if x==1:
        return 1
    else:
        return 0
dataproc['nationality1'] = dataproc['nationality'].map(nation) # new feature: whether the respondent is Han
del dataproc['nationality']
def relfreq(x):
    if x<2:
        return 0
    elif x<5:
        return 1
    else:
        return 2
dataproc['religion_freq'] = dataproc['religion_freq'].map(relfreq)
from scipy import stats
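# Replace negative codes with the most frequent value; Series.mode() is used because
# newer SciPy releases return scalars from stats.mode, so the old [0][0] indexing fails
dataproc.loc[dataproc['edu']<0,'edu'] = dataproc['edu'].mode()[0]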
del dataproc['edu_other']
dataproc = dataproc.join(pd.get_dummies(dataproc["edu_status"],prefix="edu_status"))
del dataproc["edu_status"]
def eduyr(x):
    if (x>0) and (not pd.isnull(x)):
        return x
    else:
        return 0
dataproc['edu_yr'] = dataproc['edu_yr'].map(eduyr)
dataproc['edu_yr'] = dataproc['edu_yr']-dataproc['birth']
def eduyr1(x):
    if x>0:
        return x
    else:
        return 0
dataproc['edu_yr'] = dataproc['edu_yr'].map(eduyr1)
# Series.mode() instead of stats.mode(...)[0][0], which fails on newer SciPy releases
dataproc.loc[dataproc['income']<0,'income'] = dataproc['income'].mode()[0]
dataproc['income'] = dataproc['income'].map(lambda x:np.log(x+1))
dataproc.loc[dataproc['political']<0,'political'] = 1
dataproc = dataproc.join(pd.get_dummies(dataproc["political"],prefix="political"))
del dataproc['political']
def joinparty(x):
    if pd.isnull(x):
        return 0
    if x<0:
        return 0
    else:
        return x
dataproc['join_party'] = (dataproc['join_party']-dataproc['birth']).map(joinparty)
del dataproc['property_other']
# Weight correction: double implausibly low values (weight_jin is in jin, 1 jin = 0.5 kg)
dataproc.loc[(dataproc['weight_jin']<=80)&(dataproc['height_cm']>=160),'weight_jin']= dataproc['weight_jin']*2
dataproc.loc[dataproc['weight_jin']<=60,'weight_jin']= dataproc['weight_jin']*2
dataproc['bmi'] = dataproc['weight_jin'].map(lambda x:x/2)/dataproc['height_cm'].map(lambda x:(x/100)**2)
# Replace negative codes with the most frequent value of each column
# (Series.mode() instead of stats.mode(...)[0][0], which fails on newer SciPy releases)
dataproc.loc[dataproc['health']<0,'health'] = dataproc['health'].mode()[0]
dataproc.loc[dataproc['health_problem']<0,'health_problem'] = dataproc['health_problem'].mode()[0]
dataproc.loc[dataproc['depression']<0,'depression'] = dataproc['depression'].mode()[0]
dataproc.loc[dataproc['media_1']<0,'media_1'] = dataproc['media_1'].mode()[0]
dataproc.loc[dataproc['media_2']<0,'media_2'] = dataproc['media_2'].mode()[0]
dataproc.loc[dataproc['media_3']<0,'media_3'] = dataproc['media_3'].mode()[0]
dataproc