Task 10: The "Come and Mine Happiness Together" Competition
References
Machine Learning Training Camp (Tianchi Dragon Ball Plan): https://tianchi.aliyun.com/specials/promotion/aicampml?invite_channel=1
Competition page: https://tianchi.aliyun.com/competition/entrance/231702/introduction
Reference notebook: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.9.52a149f4iLWWgq&postId=58167
Preface
Since this is my first competition and I had no idea where to start, I could only browse the forum for approaches others had shared. This post is therefore a study and reproduction of someone else's notebook, with the goal of understanding both the ideas and the code. I first re-ran the original notebook's code locally to check whether the results could be reproduced: every step matched the notebook's results, and I obtained the final csv file, which shows the code works. Many thanks to the author for sharing. I then worked through the code line by line; because I still know very little, my annotations are fairly detailed. In my comments, text after "#" describes what the code does, while text after "##" explains the syntax.
Opening the Local Environment
In Anaconda Prompt, type jupyter notebook; after a moment a web page opens automatically.
The page lists the files under the C:\Users\Administrator directory.
Open the Happiness folder you created and copy the data files downloaded from the competition page into it.
Create an ipynb file named HappinessModel_complete; all of our code goes in there.
Code Walkthrough
0 Import Libraries
import os
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# LightGBM (Light Gradient Boosting Machine) is Microsoft's open-source framework implementing the GBDT algorithm, with support for efficient parallel training.
# GBDT (Gradient Boosting Decision Tree): DT is the decision tree, GB (Gradient Boosting) is the learning strategy; GBDT simply means a decision-tree model trained with the Gradient Boosting strategy.
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, roc_curve
# K-fold cross-validator
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Mean squared error
from sklearn.metrics import mean_squared_error
1 Getting to Know the Data
1.1 Loading the Dataset
# 1 Getting to know the data
# 1.1 Load the dataset
## pandas.read_csv(): read the training and test sets into their respective DataFrames
### parse_dates: the list ["survey_time"] tells pandas to parse the "survey_time" column as datetimes.
### encoding: the codec used for reading/writing. 'latin-1' (also known as 'iso-8859-1') is the simplest text encoding: it maps code points 0-255 directly to bytes 0x00-0xff, so decoding never fails.
train = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
test = pd.read_csv("happiness_test_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
## pandas.DataFrame.head(n): n defaults to 5; returns the first n rows of the object
train.head()
  | id | happiness | survey_type | province | city | county | survey_time | gender | birth | nationality | ... | neighbor_familiarity | public_service_1 | public_service_2 | public_service_3 | public_service_4 | public_service_5 | public_service_6 | public_service_7 | public_service_8 | public_service_9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 4 | 1 | 12 | 32 | 59 | 2015-08-04 14:18:00 | 1 | 1959 | 1 | ... | 4 | 50 | 60 | 50 | 50 | 30.0 | 30 | 50 | 50 | 50 |
1 | 2 | 4 | 2 | 18 | 52 | 85 | 2015-07-21 15:04:00 | 1 | 1992 | 1 | ... | 3 | 90 | 70 | 70 | 80 | 85.0 | 70 | 90 | 60 | 60 |
2 | 3 | 4 | 2 | 29 | 83 | 126 | 2015-07-21 13:24:00 | 2 | 1967 | 1 | ... | 4 | 90 | 80 | 75 | 79 | 80.0 | 90 | 90 | 90 | 75 |
3 | 4 | 5 | 2 | 10 | 28 | 51 | 2015-07-25 17:33:00 | 2 | 1943 | 1 | ... | 3 | 100 | 90 | 70 | 80 | 80.0 | 90 | 90 | 80 | 80 |
4 | 5 | 4 | 1 | 7 | 18 | 36 | 2015-08-10 09:50:00 | 2 | 1994 | 1 | ... | 2 | 50 | 50 | 50 | 50 | 50.0 | 50 | 50 | 50 | 50 |
5 rows × 140 columns
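A quick aside on the encoding choice (my own sketch, not part of the original notebook): the raw CSVs contain Chinese free-text columns such as edu_other that are most likely not UTF-8 encoded, so a plain read can raise UnicodeDecodeError. Because latin-1 can decode every possible byte, the read always succeeds, at the cost of mojibake in those few text columns.
# Hedged sketch: fall back to latin-1 only if the default utf-8 codec fails
import pandas as pd
try:
    df = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"])
except UnicodeDecodeError:
    df = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding="latin-1")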
1.2 Understanding the Dataset
# 1.2 Understand the dataset
# Inspect the distribution of each feature
train.describe()
  | id | happiness | survey_type | province | city | county | gender | birth | nationality | religion | ... | neighbor_familiarity | public_service_1 | public_service_2 | public_service_3 | public_service_4 | public_service_5 | public_service_6 | public_service_7 | public_service_8 | public_service_9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 8000.00000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.00000 | 8000.000000 | 8000.00000 | 8000.000000 | ... | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.00000 | 8000.000000 | 8000.000000 |
mean | 4000.50000 | 3.850125 | 1.405500 | 15.155375 | 42.564750 | 70.619000 | 1.53000 | 1964.707625 | 1.37350 | 0.772250 | ... | 3.722250 | 70.809500 | 68.170000 | 62.737625 | 66.320125 | 62.794187 | 67.064000 | 66.09625 | 65.626750 | 67.153750 |
std | 2309.54541 | 0.938228 | 0.491019 | 8.917100 | 27.187404 | 38.747503 | 0.49913 | 16.842865 | 1.52882 | 1.071459 | ... | 1.143358 | 21.184742 | 20.549943 | 24.771319 | 22.049437 | 23.463162 | 21.586817 | 23.08568 | 23.827493 | 22.502203 |
min | 1.00000 | -8.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1921.000000 | -8.00000 | -8.000000 | ... | -8.000000 | -3.000000 | -3.000000 | -3.000000 | -3.000000 | -3.000000 | -3.000000 | -3.00000 | -3.000000 | -3.000000 |
25% | 2000.75000 | 4.000000 | 1.000000 | 7.000000 | 18.000000 | 37.000000 | 1.00000 | 1952.000000 | 1.00000 | 1.000000 | ... | 3.000000 | 60.000000 | 60.000000 | 50.000000 | 60.000000 | 55.000000 | 60.000000 | 60.00000 | 60.000000 | 60.000000 |
50% | 4000.50000 | 4.000000 | 1.000000 | 15.000000 | 42.000000 | 73.000000 | 2.00000 | 1965.000000 | 1.00000 | 1.000000 | ... | 4.000000 | 79.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.00000 | 70.000000 | 70.000000 |
75% | 6000.25000 | 4.000000 | 2.000000 | 22.000000 | 65.000000 | 104.000000 | 2.00000 | 1977.000000 | 1.00000 | 1.000000 | ... | 5.000000 | 80.000000 | 80.000000 | 80.000000 | 80.000000 | 80.000000 | 80.000000 | 80.00000 | 80.000000 | 80.000000 |
max | 8000.00000 | 5.000000 | 2.000000 | 31.000000 | 89.000000 | 134.000000 | 2.00000 | 1997.000000 | 8.00000 | 1.000000 | ... | 5.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.00000 | 100.000000 | 100.000000 |
8 rows × 136 columns
# Inspect train's metadata: index, columns, feature dtypes, and memory usage
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Columns: 140 entries, id to public_service_9
dtypes: datetime64[ns](1), float64(25), int64(111), object(3)
memory usage: 8.5+ MB
# Count the missing values in each feature
train.isnull().sum()
id 0
happiness 0
survey_type 0
province 0
city 0
county 0
survey_time 0
gender 0
birth 0
nationality 0
religion 0
religion_freq 0
edu 0
edu_other 7997
edu_status 1120
edu_yr 1972
income 0
political 0
join_party 7176
floor_area 0
property_0 0
property_1 0
property_2 0
property_3 0
property_4 0
property_5 0
property_6 0
property_7 0
property_8 0
property_other 7934
...
m_political 0
m_work_14 0
status_peer 0
status_3_before 0
view 0
inc_ability 0
inc_exp 0
trust_1 0
trust_2 0
trust_3 0
trust_4 0
trust_5 0
trust_6 0
trust_7 0
trust_8 0
trust_9 0
trust_10 0
trust_11 0
trust_12 0
trust_13 0
neighbor_familiarity 0
public_service_1 0
public_service_2 0
public_service_3 0
public_service_4 0
public_service_5 0
public_service_6 0
public_service_7 0
public_service_8 0
public_service_9 0
Length: 140, dtype: int64
# Drop the training rows whose happiness label is invalid
## train['happiness'] is an 8000-row Series holding every training sample's happiness value; -8 marks an invalid answer
## pandas.DataFrame.loc: access a group of rows and columns by label or boolean array; here we keep every row whose happiness is not -8.
train = train.loc[train['happiness'] != -8]
train.shape
(7988, 140)
# Plot the distribution of the happiness values; the class imbalance is obvious
## matplotlib.pyplot.subplots(nrows, ncols, figsize=...): create a figure with a grid of nrows × ncols subplots; one Figure object holds several Axes.
## Here we draw 1 row × 2 columns of subplots on a 12 × 5.5 canvas; the two axes are ax[0] and ax[1], and figure is the current Figure object.
figure,ax=plt.subplots(1,2,figsize=(12,5.5))
## pandas.Series.value_counts(): return a Series of unique-value counts, sorted by frequency in descending order.
## pandas.Series.plot.pie(): draw a pie chart.
print(train['happiness'].value_counts())
train['happiness'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('happiness')
ax[0].set_ylabel('')
## pandas.Series.plot.bar(): draw a vertical bar chart.
train['happiness'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('happiness')
plt.show()
4 4818
5 1410
3 1159
2 497
1 104
Name: happiness, dtype: int64
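For a proportion view of the same distribution (a small addition of mine, not in the original notebook), value_counts can normalize the counts; about 60% of respondents answer 4.
## value_counts(normalize=True) returns relative frequencies instead of raw counts
print(train['happiness'].value_counts(normalize=True).round(3))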
# Explore the relationship between gender and happiness
## Seaborn expects its input as a pandas DataFrame or a NumPy array.
## A common call pattern: sns.<plotname>(x='x column', y='y column', hue='grouping column', data=DataFrame).
## seaborn.countplot(): draw a bar chart of the number of observations in each categorical bin; returns a matplotlib Axes object.
ax = sns.countplot('gender',hue='happiness',data=train)
ax.set_title('Sex:happiness')
Text(0.5,1,'Sex:happiness')
# Explore the relationship between age and happiness
# First show the age distribution with a histogram
## pandas.Series.dt.year: return the year component of a datetime
train['survey_time'] = train['survey_time'].dt.year
test['survey_time'] = test['survey_time'].dt.year
train['Age'] = train['survey_time'] - train['birth']
test['Age'] = test['survey_time'] - test['birth']
## del_list=['survey_time','birth']
figure,ax = plt.subplots(1,1)
## pandas.Series.plot.hist(): draw a histogram.
train['Age'].plot.hist(ax=ax,color='blue')
<matplotlib.axes._subplots.AxesSubplot at 0x203d435c208>
# Draw a bar chart of age versus happiness.
# Split age into 6 buckets and show how the 5 happiness levels are distributed within each.
# Binning age like this limits the influence of noise and outliers.
# Note: this training set has no samples aged <= 16
combine=[train,test]
for dataset in combine:
## where (dataset['Age'] <= 16) holds, set the 'Age' value of those rows to 0
dataset.loc[dataset['Age']<=16,'Age']= 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[(dataset['Age'] > 64) & (dataset['Age'] <= 80), 'Age'] = 4
dataset.loc[ dataset['Age'] > 80, 'Age'] = 5
sns.countplot('Age', hue='happiness', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x203d43fedd8>
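As an aside, the same binning can be written more declaratively with pd.cut. This is my own equivalent sketch, not the original notebook's code; the bin edges mirror the loc-based assignments above, with each interval left-open and right-closed.
import pandas as pd
ages = pd.Series([15, 20, 40, 55, 70, 90])           # toy data
bins = [-1, 16, 32, 48, 64, 80, 200]                 # 200 is an arbitrary safe upper bound
binned = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4, 5]).astype(int)
print(binned.tolist())                               # [0, 1, 2, 3, 4, 5]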
figure1,ax1 = plt.subplots(1,5,figsize=(20,4))
train['happiness'][train['Age']==1].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[0],shadow=True)
train['happiness'][train['Age']==2].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[1],shadow=True)
train['happiness'][train['Age']==3].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[2],shadow=True)
train['happiness'][train['Age']==4].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[3],shadow=True)
train['happiness'][train['Age']==5].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[4],shadow=True)
<matplotlib.axes._subplots.AxesSubplot at 0x203d45e0f60>
1.3 Feature Selection
# 1.3 Feature selection
# Keep only the features whose correlation with happiness exceeds 0.05 in absolute value
## pandas.DataFrame.corr(): compute pairwise correlation of columns, excluding NA/null values; it measures the direction and strength with which two columns vary together.
## Values range from -1 to +1: 0 means uncorrelated, positive means positively correlated, negative means negatively correlated, and a larger magnitude means a stronger correlation.
## Python abs(): absolute value
train.corr()['happiness'][abs(train.corr()['happiness'])>0.05]
happiness 1.000000
edu 0.103048
edu_yr 0.055564
political 0.080986
join_party 0.069007
property_8 -0.051929
weight_jin 0.085841
health 0.250538
health_problem 0.186620
depression 0.304973
hukou 0.072936
media_1 0.095035
media_2 0.084872
media_3 0.091431
media_4 0.098809
media_5 0.065220
media_6 0.059273
leisure_1 -0.077097
leisure_3 -0.070262
leisure_4 -0.095676
leisure_6 -0.107672
leisure_7 -0.072011
leisure_8 -0.100313
leisure_9 -0.148888
leisure_12 -0.068778
socialize 0.082206
relax 0.113233
learn 0.108294
social_friend -0.091079
socia_outing 0.059567
...
family_income 0.051506
family_m 0.061062
family_status 0.204702
house 0.089261
car -0.085387
invest_1 -0.055013
invest_2 0.054019
s_edu 0.125679
s_political 0.068802
s_hukou 0.071953
status_peer -0.150246
status_3_before -0.076808
view 0.078986
trust_1 0.069830
trust_2 0.054909
trust_5 0.102110
trust_7 0.060102
trust_8 0.065644
trust_10 0.069740
trust_12 0.057885
neighbor_familiarity 0.054074
public_service_1 0.112537
public_service_2 0.126029
public_service_3 0.134028
public_service_4 0.129880
public_service_5 0.136347
public_service_6 0.162514
public_service_7 0.154029
public_service_8 0.128678
public_service_9 0.129723
Name: happiness, Length: 65, dtype: float64
# Take the features whose absolute correlation with happiness exceeds 0.05 as candidates, plus a few features we consider important; 66 features in total go into training
features = (train.corr()['happiness'][abs(train.corr()['happiness'])>0.05]).index
## features is a pandas.Index object: an immutable ndarray implementing an ordered, sliceable set.
print(features)
## pandas.Index.values: return the underlying data array, then convert it to a list.
## features is now a plain Python list
features = features.values.tolist()
features.extend(['Age', 'work_exper'])
features.remove('happiness')
print(features)
len(features)
Index(['happiness', 'edu', 'edu_yr', 'political', 'join_party', 'property_8',
'weight_jin', 'health', 'health_problem', 'depression', 'hukou',
'media_1', 'media_2', 'media_3', 'media_4', 'media_5', 'media_6',
'leisure_1', 'leisure_3', 'leisure_4', 'leisure_6', 'leisure_7',
'leisure_8', 'leisure_9', 'leisure_12', 'socialize', 'relax', 'learn',
'social_friend', 'socia_outing', 'equity', 'class', 'class_10_before',
'class_10_after', 'class_14', 'family_income', 'family_m',
'family_status', 'house', 'car', 'invest_1', 'invest_2', 's_edu',
's_political', 's_hukou', 'status_peer', 'status_3_before', 'view',
'trust_1', 'trust_2', 'trust_5', 'trust_7', 'trust_8', 'trust_10',
'trust_12', 'neighbor_familiarity', 'public_service_1',
'public_service_2', 'public_service_3', 'public_service_4',
'public_service_5', 'public_service_6', 'public_service_7',
'public_service_8', 'public_service_9'],
dtype='object')
['edu', 'edu_yr', 'political', 'join_party', 'property_8', 'weight_jin', 'health', 'health_problem', 'depression', 'hukou', 'media_1', 'media_2', 'media_3', 'media_4', 'media_5', 'media_6', 'leisure_1', 'leisure_3', 'leisure_4', 'leisure_6', 'leisure_7', 'leisure_8', 'leisure_9', 'leisure_12', 'socialize', 'relax', 'learn', 'social_friend', 'socia_outing', 'equity', 'class', 'class_10_before', 'class_10_after', 'class_14', 'family_income', 'family_m', 'family_status', 'house', 'car', 'invest_1', 'invest_2', 's_edu', 's_political', 's_hukou', 'status_peer', 'status_3_before', 'view', 'trust_1', 'trust_2', 'trust_5', 'trust_7', 'trust_8', 'trust_10', 'trust_12', 'neighbor_familiarity', 'public_service_1', 'public_service_2', 'public_service_3', 'public_service_4', 'public_service_5', 'public_service_6', 'public_service_7', 'public_service_8', 'public_service_9', 'Age', 'work_exper']
66
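A minor efficiency note (my own suggestion, not from the original notebook): train.corr() recomputes the full correlation matrix on every call, and the cell above calls it twice. Computing it once and reusing the resulting Series is equivalent:
corr_with_target = train.corr()['happiness']
features = corr_with_target[corr_with_target.abs() > 0.05].index.tolist()
features.extend(['Age', 'work_exper'])
features.remove('happiness')
len(features)   # 66, matching the cell above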
2 Model Building
2.1 A Baseline Model
# 2 Model building
# 2.1 Baseline model
# Split out the features and the target
target = train['happiness']
train_selected = train[features]
test = test[features]
feature_importance_df = pd.DataFrame()
## numpy.zeros(): return a new zero-filled array of the given shape and type. oof will hold each training sample's out-of-fold prediction, and predictions will accumulate the test-set predictions averaged over the 5 folds.
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
params = {'num_leaves': 9,
'min_data_in_leaf': 40,
'objective': 'regression',
'max_depth': 16,
'learning_rate': 0.01,
'boosting': 'gbdt',
'bagging_freq': 5,
'bagging_fraction': 0.8, # fraction of the data used in each iteration
'feature_fraction': 0.8201,# each tree is built on a random ~82% subset of the features
'bagging_seed': 11,
'reg_alpha': 1.728910519108444,
'reg_lambda': 4.9847051755586085,
'random_state': 42,
'metric': 'rmse',
'verbosity': -1,
'subsample': 0.81, # duplicates bagging_fraction; LightGBM warns and ignores it (see the training logs below)
'min_gain_to_split': 0.01077313523861969,
'min_child_weight': 19.428902804238373,
'num_threads': 4}
## Class sklearn.model_selection.KFold: K-fold cross-validator. ## Provides train/validation indices to split the data into train/validation sets.
## The dataset is split into k consecutive folds (without shuffling by default); each fold is then used once as validation while the remaining k-1 folds form the training set.
## n_splits: number of folds.
## shuffle: whether to shuffle the data before splitting into batches.
## random_state: when shuffle is True, random_state controls the ordering of the indices, and hence the randomness of each fold.
kfolds = KFold(n_splits=5,shuffle=True,random_state=15)
## KFold.split(): generate indices to split the data into training and validation sets; yields (train indices, validation indices) pairs.
for fold_n,(trn_index,val_index) in enumerate(kfolds.split(train_selected,target)):
print("fold_n {}".format(fold_n))
# Convert to LightGBM's Dataset format
# pandas.DataFrame.iloc: select rows by integer position
trn_data = lgb.Dataset(train_selected.iloc[trn_index],label=target.iloc[trn_index])
val_data = lgb.Dataset(train_selected.iloc[val_index],label=target.iloc[val_index])
# Set the maximum number of boosting rounds
num_round=10000
# Train
## lgb.train() parameters:
## params: learner parameters;
## train_set: the training set
## num_boost_round: number of boosting rounds
## feval: custom evaluation function
## verbose_eval: report the metric after every n rounds of validation.
## early_stopping_rounds: early stopping; if the metric has not improved after n more rounds, training stops. Note: when setting this you must also set a metric, otherwise LightGBM treats the metric as missing.
## Returns a Booster object
clf = lgb.train(params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 100)
# Predict
## num_iteration: number of iterations to use for prediction
## predict() returns a NumPy array
## each fold predicts its len(val_index) held-out samples
oof[val_index] = clf.predict(train_selected.iloc[val_index], num_iteration=clf.best_iteration)
# Accumulate the test-set predictions, averaged over the 5 folds
predictions += clf.predict(test,num_iteration=clf.best_iteration)/ 5
print("Validation samples in this fold:",len(val_index))
fold_importance_df = pd.DataFrame()
fold_importance_df["feature"] = features
fold_importance_df["importance"] = clf.feature_importance()
fold_importance_df["fold"] = fold_n + 1
## pandas.concat(): concatenate pandas objects along a particular axis.
## Appends this fold's fold_importance_df rows onto feature_importance_df.
feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
print("feature_importance_df.shape:",feature_importance_df.shape)
# Compute the RMSE of the out-of-fold predictions gathered so far
# (oof entries for folds not yet validated are still zero, so this number shrinks fold by fold; only the value printed after fold 5 is the true 5-fold CV RMSE)
print("CV score: {:<8.5f}".format(mean_squared_error(target, oof)**0.5))
print("#"*30, "fold", fold_n + 1, "training, prediction, and evaluation done", "#"*30)
fold_n 0
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.623114 valid_1's rmse: 0.682914
Early stopping, best iteration is:
[1205] training's rmse: 0.61358 valid_1's rmse: 0.681788
Validation samples in this fold: 1598
feature_importance_df.shape: (66, 3)
CV score: 3.54968
############################## fold 1 training, prediction, and evaluation done ##############################
fold_n 1
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.617676 valid_1's rmse: 0.692788
Early stopping, best iteration is:
[918] training's rmse: 0.621868 valid_1's rmse: 0.692082
Validation samples in this fold: 1598
feature_importance_df.shape: (132, 3)
CV score: 3.10340
############################## fold 2 training, prediction, and evaluation done ##############################
fold_n 2
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.62993 valid_1's rmse: 0.64831
Early stopping, best iteration is:
[1346] training's rmse: 0.613892 valid_1's rmse: 0.647217
Validation samples in this fold: 1598
feature_importance_df.shape: (198, 3)
CV score: 2.55896
############################## fold 3 training, prediction, and evaluation done ##############################
fold_n 3
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.619417 valid_1's rmse: 0.699119
Early stopping, best iteration is:
[1035] training's rmse: 0.617717 valid_1's rmse: 0.698929
Validation samples in this fold: 1597
feature_importance_df.shape: (264, 3)
CV score: 1.87172
############################## fold 4 training, prediction, and evaluation done ##############################
fold_n 4
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.620862 valid_1's rmse: 0.68395
Early stopping, best iteration is:
[1181] training's rmse: 0.612383 valid_1's rmse: 0.683083
Validation samples in this fold: 1597
feature_importance_df.shape: (330, 3)
CV score: 0.68085
############################## fold 5 training, prediction, and evaluation done ##############################
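To make the splitting behaviour concrete, here is a minimal standalone illustration of KFold.split (my own sketch, not from the original notebook): each of the n_splits iterations yields integer index arrays for roughly (k-1)/k training rows and 1/k validation rows, which is why len(val_index) alternates between 1598 and 1597 on our 7988 training rows.
import numpy as np
from sklearn.model_selection import KFold
toy = np.arange(10).reshape(5, 2)                    # 5 samples, 2 features
for fold, (trn_idx, val_idx) in enumerate(KFold(n_splits=5).split(toy)):
    print(fold, trn_idx, val_idx)                    # e.g. 0 [1 2 3 4] [0]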
feature_importance_df
  | feature | importance | fold
---|---|---|---
0 | edu | 80 | 1 |
1 | edu_yr | 329 | 1 |
2 | political | 16 | 1 |
3 | join_party | 7 | 1 |
4 | property_8 | 9 | 1 |
5 | weight_jin | 338 | 1 |
6 | health | 346 | 1 |
7 | health_problem | 116 | 1 |
8 | depression | 464 | 1 |
9 | hukou | 19 | 1 |
10 | media_1 | 76 | 1 |
11 | media_2 | 32 | 1 |
12 | media_3 | 130 | 1 |
13 | media_4 | 118 | 1 |
14 | media_5 | 77 | 1 |
15 | media_6 | 33 | 1 |
16 | leisure_1 | 62 | 1 |
17 | leisure_3 | 70 | 1 |
18 | leisure_4 | 28 | 1 |
19 | leisure_6 | 144 | 1 |
20 | leisure_7 | 40 | 1 |
21 | leisure_8 | 197 | 1 |
22 | leisure_9 | 165 | 1 |
23 | leisure_12 | 25 | 1 |
24 | socialize | 48 | 1 |
25 | relax | 191 | 1 |
26 | learn | 49 | 1 |
27 | social_friend | 234 | 1 |
28 | socia_outing | 51 | 1 |
29 | equity | 648 | 1 |
... | ... | ... | ... |
36 | family_status | 238 | 5 |
37 | house | 146 | 5 |
38 | car | 70 | 5 |
39 | invest_1 | 45 | 5 |
40 | invest_2 | 5 | 5 |
41 | s_edu | 248 | 5 |
42 | s_political | 56 | 5 |
43 | s_hukou | 59 | 5 |
44 | status_peer | 141 | 5 |
45 | status_3_before | 186 | 5 |
46 | view | 215 | 5 |
47 | trust_1 | 89 | 5 |
48 | trust_2 | 86 | 5 |
49 | trust_5 | 93 | 5 |
50 | trust_7 | 60 | 5 |
51 | trust_8 | 112 | 5 |
52 | trust_10 | 100 | 5 |
53 | trust_12 | 125 | 5 |
54 | neighbor_familiarity | 161 | 5 |
55 | public_service_1 | 138 | 5 |
56 | public_service_2 | 130 | 5 |
57 | public_service_3 | 89 | 5 |
58 | public_service_4 | 119 | 5 |
59 | public_service_5 | 159 | 5 |
60 | public_service_6 | 220 | 5 |
61 | public_service_7 | 143 | 5 |
62 | public_service_8 | 137 | 5 |
63 | public_service_9 | 182 | 5 |
64 | Age | 256 | 5 |
65 | work_exper | 127 | 5 |
330 rows × 3 columns
cols = (feature_importance_df[["feature", "importance"]]
.groupby("feature")# group by feature
.mean()# mean importance across the 5 folds
.sort_values(by="importance", ascending=False)[:1000].index)# sort by importance descending, take up to the first 1000 features, and build an Index object
## pandas.DataFrame.isin(values): whether each element of the DataFrame is contained in values; returns a boolean DataFrame
## keep only the rows of feature_importance_df whose feature appears in cols
best_features = feature_importance_df.loc[feature_importance_df.feature.isin(cols)]
# Plot a horizontal bar chart of feature importance
plt.figure(figsize=(14,26))
sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('LightGBM Features (averaged over folds)')
plt.tight_layout()
# Assemble the predictions into a submission
submit = pd.read_csv("happiness_submit.csv")
submision_lgb1 = pd.DataFrame({"id":submit['id'].values})
submision_lgb1["happiness"] = predictions
submision_lgb1.head(5)
  | id | happiness
---|---|---
0 | 8001 | 3.813459 |
1 | 8002 | 2.934148 |
2 | 8003 | 3.299967 |
3 | 8004 | 4.294977 |
4 | 8005 | 3.306847 |
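One optional post-processing step worth considering (my own suggestion, not in the original notebook): the happiness labels lie in [1, 5], so clipping the regression outputs to that range can never increase the squared error of an out-of-range prediction.
## pandas.Series.clip(lower, upper): truncate values that fall outside the given bounds
submision_lgb1["happiness"] = submision_lgb1["happiness"].clip(1, 5)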
2.2 Hyperparameter Tuning (omitted)
3 Generating the Result File
# 3 Generate the result file
# Get a timestamp
time_str = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
out_dir = "csvResult/{}/".format(time_str)
os.makedirs(out_dir)
# Save the model (note: clf here is the Booster from the last fold only)
clf.save_model(out_dir + "model.txt")
# Save the results
submision_lgb1.to_csv(out_dir + "happiness_submision_lgbm.csv",index=False)
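As a final sanity check (again my own addition, not in the original notebook), re-read the file just written and confirm its row count matches the test set:
check = pd.read_csv(out_dir + "happiness_submision_lgbm.csv")
assert len(check) == len(test)
print(check.shape)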
Submitting the Results
Checking the Score
Summary
The 12-day machine learning training camp has come to an end, and I gained a great deal from it. Thanks to the Tianchi Dragon Ball Plan, thanks to the organizers, thanks to my team 地坛TBD, and thanks to myself as well. I have always learned best in a small-team setting, I think because it brings supervision and deadlines. I benefit from that, but I hope not to become dependent on it: even on my own, I need to stay disciplined and keep my enthusiasm for learning.