Task 10: 快来一起挖掘幸福感 (Come Mine for Happiness) Competition

References

Machine Learning Training Camp (Tianchi Longzhu Program): https://tianchi.aliyun.com/specials/promotion/aicampml?invite_channel=1

Competition page: https://tianchi.aliyun.com/competition/entrance/231702/introduction

Reference notebook: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.9.52a149f4iLWWgq&postId=58167

Preface

Since this is my first competition and I had no clue where to start, I went to the forum to read other people's write-ups, so this post is a study and reproduction of someone else's notebook; the goal is to understand the approach and the code. I first reproduced the original notebook's code locally to see whether the results could be replicated: every step matched the notebook's results and I obtained the final csv file, which shows the code works. Many thanks to the author for sharing the approach. I then set about annotating the code; since I know so little, my comments are fairly detailed. In my comments, text after "#" describes what the code does, and text after "##" explains the syntax.

Opening the Local Environment

Type jupyter notebook in the Anaconda Prompt, wait a moment, and a web page will open automatically.

This page lists the files under the C:\Users\Administrator directory.

Go into the Happiness folder you created and copy in the data files downloaded from the competition link.


Create an ipynb file named HappinessModel_complete; all of our code will live there.


Code Walkthrough

0 Import Libraries

import os
import time 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# LightGBM (Light Gradient Boosting Machine) is Microsoft's open-source framework implementing the GBDT algorithm, with support for efficient parallel training.
# In GBDT (Gradient Boosting Decision Tree), DT is the decision tree and GB (gradient boosting) is the learning strategy; GBDT means a DT model trained with the gradient-boosting strategy.
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, roc_curve
# K-fold cross-validator
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Mean squared error
from sklearn.metrics import mean_squared_error

1 Getting to Know the Data

1.1 Loading the Datasets

# 1 Getting to know the data
# 1.1 Load the datasets
## pandas.read_csv(): read the training set and the test set into their respective DataFrames
### parse_dates: the list of names ["survey_time"] tells pandas to parse the "survey_time" column as dates.
### encoding: the encoding used for reading/writing. 'latin-1' (also known as 'iso-8859-1') is the simplest text encoding, mapping code points 0–255 to bytes 0x0–0xff.
train = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding='latin-1') 
test = pd.read_csv("happiness_test_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
## pandas.DataFrame.head(n): n defaults to 5; returns the first n rows of the object
train.head()
   id  happiness  survey_type  province  city  county          survey_time  gender  birth  nationality  ...
0   1          4            1        12    32      59  2015-08-04 14:18:00       1   1959            1  ...
1   2          4            2        18    52      85  2015-07-21 15:04:00       1   1992            1  ...
2   3          4            2        29    83     126  2015-07-21 13:24:00       2   1967            1  ...
3   4          5            2        10    28      51  2015-07-25 17:33:00       2   1943            1  ...
4   5          4            1         7    18      36  2015-08-10 09:50:00       2   1994            1  ...

   neighbor_familiarity  public_service_1  public_service_2  public_service_3  public_service_4  public_service_5  public_service_6  public_service_7  public_service_8  public_service_9
0                     4                50                60                50                50              30.0                30                50                50                50
1                     3                90                70                70                80              85.0                70                90                60                60
2                     4                90                80                75                79              80.0                90                90                90                75
3                     3               100                90                70                80              80.0                90                90                80                80
4                     2                50                50                50                50              50.0                50                50                50                50

5 rows × 140 columns

1.2 Understanding the Datasets

# 1.2 Understanding the datasets
# Look at the distribution of each feature
train.describe()
                 id    happiness  survey_type     province         city       county      gender        birth  nationality     religion
count    8000.00000  8000.000000  8000.000000  8000.000000  8000.000000  8000.000000  8000.00000  8000.000000   8000.00000  8000.000000
mean     4000.50000     3.850125     1.405500    15.155375    42.564750    70.619000     1.53000  1964.707625      1.37350     0.772250
std      2309.54541     0.938228     0.491019     8.917100    27.187404    38.747503     0.49913    16.842865      1.52882     1.071459
min         1.00000    -8.000000     1.000000     1.000000     1.000000     1.000000     1.00000  1921.000000     -8.00000    -8.000000
25%      2000.75000     4.000000     1.000000     7.000000    18.000000    37.000000     1.00000  1952.000000      1.00000     1.000000
50%      4000.50000     4.000000     1.000000    15.000000    42.000000    73.000000     2.00000  1965.000000      1.00000     1.000000
75%      6000.25000     4.000000     2.000000    22.000000    65.000000   104.000000     2.00000  1977.000000      1.00000     1.000000
max      8000.00000     5.000000     2.000000    31.000000    89.000000   134.000000     2.00000  1997.000000      8.00000     1.000000

       neighbor_familiarity  public_service_1  public_service_2  public_service_3  public_service_4  public_service_5  public_service_6  public_service_7  public_service_8  public_service_9
count           8000.000000       8000.000000       8000.000000       8000.000000       8000.000000       8000.000000       8000.000000        8000.00000       8000.000000       8000.000000
mean               3.722250         70.809500         68.170000         62.737625         66.320125         62.794187         67.064000          66.09625         65.626750         67.153750
std                1.143358         21.184742         20.549943         24.771319         22.049437         23.463162         21.586817          23.08568         23.827493         22.502203
min               -8.000000         -3.000000         -3.000000         -3.000000         -3.000000         -3.000000         -3.000000          -3.00000         -3.000000         -3.000000
25%                3.000000         60.000000         60.000000         50.000000         60.000000         55.000000         60.000000          60.00000         60.000000         60.000000
50%                4.000000         79.000000         70.000000         70.000000         70.000000         70.000000         70.000000          70.00000         70.000000         70.000000
75%                5.000000         80.000000         80.000000         80.000000         80.000000         80.000000         80.000000          80.00000         80.000000         80.000000
max                5.000000        100.000000        100.000000        100.000000        100.000000        100.000000        100.000000         100.00000        100.000000        100.000000

8 rows × 136 columns

# Look at train's general information: index, columns, feature dtypes, memory usage
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Columns: 140 entries, id to public_service_9
dtypes: datetime64[ns](1), float64(25), int64(111), object(3)
memory usage: 8.5+ MB
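
info() reports three object (string) columns. As a small aside of my own (not from the original notebook), you can list which features were read in as text:

# My own aside (not in the original notebook): list the 3 object-dtype columns
print(train.select_dtypes(include='object').columns.tolist())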
# Count the missing values in each feature
train.isnull().sum()
id                         0
happiness                  0
survey_type                0
province                   0
city                       0
county                     0
survey_time                0
gender                     0
birth                      0
nationality                0
religion                   0
religion_freq              0
edu                        0
edu_other               7997
edu_status              1120
edu_yr                  1972
income                     0
political                  0
join_party              7176
floor_area                 0
property_0                 0
property_1                 0
property_2                 0
property_3                 0
property_4                 0
property_5                 0
property_6                 0
property_7                 0
property_8                 0
property_other          7934
                        ... 
m_political                0
m_work_14                  0
status_peer                0
status_3_before            0
view                       0
inc_ability                0
inc_exp                    0
trust_1                    0
trust_2                    0
trust_3                    0
trust_4                    0
trust_5                    0
trust_6                    0
trust_7                    0
trust_8                    0
trust_9                    0
trust_10                   0
trust_11                   0
trust_12                   0
trust_13                   0
neighbor_familiarity       0
public_service_1           0
public_service_2           0
public_service_3           0
public_service_4           0
public_service_5           0
public_service_6           0
public_service_7           0
public_service_8           0
public_service_9           0
Length: 140, dtype: int64
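
The listing above is truncated in the middle, so as a small helper of my own (not in the original notebook), this filters the counts down to only the columns that actually contain missing values:

# My own helper (not in the original notebook): show only the columns with missing values
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))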
# Drop the training rows whose happiness label is invalid
## train['happiness'] is an 8000-row Series holding every training sample's happiness value; a value of -8 marks an invalid answer
## pandas.DataFrame.loc: access a group of rows and columns by label or boolean array. Here it keeps all rows whose happiness is not -8.
train = train.loc[train['happiness'] != -8]
train.shape
(7988, 140)
# Look at the distribution of the happiness values; there is a clear class-imbalance problem
## matplotlib.pyplot.subplots(nrows, ncols, ...): create a figure and a grid of subplots; one Figure object contains nrows × ncols subplots.
## Here we draw 1 row × 2 columns of subplots in a 12 × 5.5 figure; the two axes are ax[0] and ax[1], and figure is the current Figure object.
figure,ax=plt.subplots(1,2,figsize=(12,5.5))
## pandas.Series.value_counts(): return a Series of unique-value counts, sorted by frequency in descending order.
## pandas.Series.plot.pie(): draw a pie chart.
print(train['happiness'].value_counts())
train['happiness'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('happiness')
ax[0].set_ylabel('')
## pandas.Series.plot.bar(): draw a vertical bar chart.
train['happiness'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('happiness')
plt.show()
4    4818
5    1410
3    1159
2     497
1     104
Name: happiness, dtype: int64

[Figure: pie chart and bar chart of the happiness distribution]

# Explore the relationship between gender and happiness
## Seaborn expects its raw input to be a pandas DataFrame or a NumPy array.
## One common call pattern: sns.plotname(x='x-axis column', y='y-axis column', hue='grouping variable', data=the raw DataFrame).
## seaborn.countplot(): draw a bar chart of the number of observations in each categorical bin; returns a matplotlib Axes object.
ax = sns.countplot('gender',hue='happiness',data=train)
ax.set_title('Sex:happiness')
Text(0.5,1,'Sex:happiness')

[Figure: countplot of happiness grouped by gender]

# Explore the relationship between age and happiness
# First show the age distribution with a histogram
## pandas.Series.dt.year: return the year of the datetime
train['survey_time'] = train['survey_time'].dt.year
test['survey_time'] = test['survey_time'].dt.year
train['Age'] = train['survey_time'] - train['birth']
test['Age'] = test['survey_time'] - test['birth']

## del_list=['survey_time','birth']
figure,ax = plt.subplots(1,1)
## pandas.Series.plot.hist(): draw a histogram.
train['Age'].plot.hist(ax=ax,color='blue')
<matplotlib.axes._subplots.AxesSubplot at 0x203d435c208>

[Figure: histogram of Age]

# Draw bar charts of the relationship between age and happiness.
# Split age into 6 bins and show how the 5 happiness levels are distributed within each.
# Binning the ages guards against the influence of noise and outliers.
# Note: this training set has no samples aged 16 or younger
combine=[train,test]

for dataset in combine:
    ## where the condition (dataset['Age']<=16) holds, set the value of the given label ('Age') on those rows to 0
    dataset.loc[dataset['Age']<=16,'Age']= 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 64) & (dataset['Age'] <= 80), 'Age'] = 4
    dataset.loc[ dataset['Age'] > 80, 'Age'] = 5
    
sns.countplot('Age', hue='happiness', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x203d43fedd8>

[Figure: countplot of happiness by Age bin]

figure1,ax1 = plt.subplots(1,5,figsize=(20,4))
train['happiness'][train['Age']==1].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[0],shadow=True)
train['happiness'][train['Age']==2].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[1],shadow=True)
train['happiness'][train['Age']==3].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[2],shadow=True)
train['happiness'][train['Age']==4].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[3],shadow=True)
train['happiness'][train['Age']==5].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[4],shadow=True)
<matplotlib.axes._subplots.AxesSubplot at 0x203d45e0f60>

[Figure: pie charts of the happiness distribution for Age bins 1–5]

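As a side note of my own (not part of the original notebook), the same binning can be written more concisely with pandas.cut, applied to the raw ages before the loop above overwrites 'Age' in place; pandas.cut's default intervals are half-open on the left, (low, high], which matches the loop's conditions:

# My own alternative (not in the original notebook): equivalent binning with pd.cut
# survey_time already holds just the year here, so this recomputes the raw ages
raw_age = train['survey_time'] - train['birth']
# labels=False returns the bin codes 0..5, matching the categories assigned by the loop
age_binned = pd.cut(raw_age, bins=[-1, 16, 32, 48, 64, 80, 200], labels=False)
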
1.3 Feature Selection

# 1.3 Feature selection
# Keep only the features whose positive or negative correlation with happiness exceeds 0.05 in magnitude
## pandas.DataFrame.corr(): compute the pairwise correlation of columns, excluding NA/null values. It measures the direction and strength with which two columns vary together.
## Values range from -1 to +1: 0 means the two variables are uncorrelated, positive values mean positive correlation, negative values mean negative correlation, and a larger magnitude means a stronger correlation.
## Python abs(): absolute value
train.corr()['happiness'][abs(train.corr()['happiness'])>0.05]
happiness               1.000000
edu                     0.103048
edu_yr                  0.055564
political               0.080986
join_party              0.069007
property_8             -0.051929
weight_jin              0.085841
health                  0.250538
health_problem          0.186620
depression              0.304973
hukou                   0.072936
media_1                 0.095035
media_2                 0.084872
media_3                 0.091431
media_4                 0.098809
media_5                 0.065220
media_6                 0.059273
leisure_1              -0.077097
leisure_3              -0.070262
leisure_4              -0.095676
leisure_6              -0.107672
leisure_7              -0.072011
leisure_8              -0.100313
leisure_9              -0.148888
leisure_12             -0.068778
socialize               0.082206
relax                   0.113233
learn                   0.108294
social_friend          -0.091079
socia_outing            0.059567
                          ...   
family_income           0.051506
family_m                0.061062
family_status           0.204702
house                   0.089261
car                    -0.085387
invest_1               -0.055013
invest_2                0.054019
s_edu                   0.125679
s_political             0.068802
s_hukou                 0.071953
status_peer            -0.150246
status_3_before        -0.076808
view                    0.078986
trust_1                 0.069830
trust_2                 0.054909
trust_5                 0.102110
trust_7                 0.060102
trust_8                 0.065644
trust_10                0.069740
trust_12                0.057885
neighbor_familiarity    0.054074
public_service_1        0.112537
public_service_2        0.126029
public_service_3        0.134028
public_service_4        0.129880
public_service_5        0.136347
public_service_6        0.162514
public_service_7        0.154029
public_service_8        0.128678
public_service_9        0.129723
Name: happiness, Length: 65, dtype: float64
# Take the features whose correlation magnitude exceeds 0.05 as training candidates, then add features we consider important, for a total of 66 training features
features = (train.corr()['happiness'][abs(train.corr()['happiness'])>0.05]).index
## features is a pandas.Index object: an immutable ndarray implementing an ordered, sliceable set.
print(features)
## pandas.Index.values: return an array of the index's data, then convert it to list form.
## features is now a list object
features = features.values.tolist()
features.extend(['Age', 'work_exper'])
features.remove('happiness')
print(features)
len(features)
Index(['happiness', 'edu', 'edu_yr', 'political', 'join_party', 'property_8',
       'weight_jin', 'health', 'health_problem', 'depression', 'hukou',
       'media_1', 'media_2', 'media_3', 'media_4', 'media_5', 'media_6',
       'leisure_1', 'leisure_3', 'leisure_4', 'leisure_6', 'leisure_7',
       'leisure_8', 'leisure_9', 'leisure_12', 'socialize', 'relax', 'learn',
       'social_friend', 'socia_outing', 'equity', 'class', 'class_10_before',
       'class_10_after', 'class_14', 'family_income', 'family_m',
       'family_status', 'house', 'car', 'invest_1', 'invest_2', 's_edu',
       's_political', 's_hukou', 'status_peer', 'status_3_before', 'view',
       'trust_1', 'trust_2', 'trust_5', 'trust_7', 'trust_8', 'trust_10',
       'trust_12', 'neighbor_familiarity', 'public_service_1',
       'public_service_2', 'public_service_3', 'public_service_4',
       'public_service_5', 'public_service_6', 'public_service_7',
       'public_service_8', 'public_service_9'],
      dtype='object')
['edu', 'edu_yr', 'political', 'join_party', 'property_8', 'weight_jin', 'health', 'health_problem', 'depression', 'hukou', 'media_1', 'media_2', 'media_3', 'media_4', 'media_5', 'media_6', 'leisure_1', 'leisure_3', 'leisure_4', 'leisure_6', 'leisure_7', 'leisure_8', 'leisure_9', 'leisure_12', 'socialize', 'relax', 'learn', 'social_friend', 'socia_outing', 'equity', 'class', 'class_10_before', 'class_10_after', 'class_14', 'family_income', 'family_m', 'family_status', 'house', 'car', 'invest_1', 'invest_2', 's_edu', 's_political', 's_hukou', 'status_peer', 'status_3_before', 'view', 'trust_1', 'trust_2', 'trust_5', 'trust_7', 'trust_8', 'trust_10', 'trust_12', 'neighbor_familiarity', 'public_service_1', 'public_service_2', 'public_service_3', 'public_service_4', 'public_service_5', 'public_service_6', 'public_service_7', 'public_service_8', 'public_service_9', 'Age', 'work_exper']
66

2 Model Building

2.1 Base Model

# 2 Model building
# 2.1 Base model
# Build the data and the labels
target = train['happiness']
train_selected = train[features]
test = test[features]
feature_importance_df = pd.DataFrame()
## numpy.zeros(): return a new array of the given shape and type, filled with zeros.
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
params = {'num_leaves': 9,
         'min_data_in_leaf': 40,
         'objective': 'regression',
         'max_depth': 16,
         'learning_rate': 0.01,
         'boosting': 'gbdt',
         'bagging_freq': 5,
         'bagging_fraction': 0.8,   # fraction of the data used in each iteration
         'feature_fraction': 0.8201,# fraction of features randomly selected to build each tree
         'bagging_seed': 11,
         'reg_alpha': 1.728910519108444,
         'reg_lambda': 4.9847051755586085,
         'random_state': 42,
         'metric': 'rmse',
         'verbosity': -1,
         'subsample': 0.81,
         'min_gain_to_split': 0.01077313523861969,
         'min_child_weight': 19.428902804238373,
         'num_threads': 4}
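
One detail worth noting: the training logs below warn that "bagging_fraction is set=0.8, subsample=0.81 will be ignored". In LightGBM, subsample is simply an alias for bagging_fraction, so the params dict sets the same parameter twice and one of the values is dropped. A small cleanup of my own (not in the original notebook) is to remove the alias:

# My own cleanup (not in the original notebook): 'subsample' is an alias of
# 'bagging_fraction', so keeping both triggers the warning seen in the logs below
params.pop('subsample', None)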
## Class sklearn.model_selection.KFold: K-fold cross-validator. Provides train/test indices to split the data into train/test sets.
## Splits the dataset into k consecutive folds (no shuffling by default). Each fold is then used once for validation, while the remaining k-1 folds form the training set.
## n_splits: the number of folds, i.e. how many times the samples are split.
## shuffle: whether to shuffle the data before splitting it into batches.
## random_state: when shuffle is True, random_state affects the ordering of the indices, controlling the randomness of each fold.
kfolds = KFold(n_splits=5,shuffle=True,random_state=15)

## KFold's split() method: generate indices for splitting the data into training and test sets; returns train-set and test-set indices.
for fold_n,(trn_index,val_index) in enumerate(kfolds.split(train_selected,target)):
    
    print("fold_n {}".format(fold_n))
        
    # Convert to LightGBM's Dataset format
    # pandas.DataFrame.iloc: select data by position
    trn_data = lgb.Dataset(train_selected.iloc[trn_index],label=target.iloc[trn_index])
    val_data = lgb.Dataset(train_selected.iloc[val_index],label=target.iloc[val_index])
    
    # Set the number of boosting rounds
    num_round=10000
    
    # Train
    ## lgb.train() parameters:
    ## params: learner parameters;
    ## train_set: the training set
    ## num_boost_round: the number of boosting iterations
    ## feval: a custom evaluation function
    ## verbose_eval: report the metrics after every n rounds of validation.
    ## early_stopping_rounds: early stopping; if the metric has not improved after n rounds, stop training. Note: when setting this, be sure to also set a metric, otherwise lgb treats the metric as missing.
    ## Returns an object of type Booster
    clf = lgb.train(params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 100)
       
    # Predict
    ## num_iteration: the number of iterations used in prediction
    ## predict() returns a numpy array
    ## each fold predicts len(val_index) samples
    oof[val_index] = clf.predict(train_selected.iloc[val_index], num_iteration=clf.best_iteration)
    # Prediction on the test set, averaged over the 5 folds
    predictions += clf.predict(test,num_iteration=clf.best_iteration)/ 5
    print("本批次训练集样本个数:",len(val_index))
    
    fold_importance_df = pd.DataFrame()    
    fold_importance_df["feature"] = features    
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_n + 1   
    ## pandas.concat(): concatenate pandas objects along a particular axis.
    ## append fold_importance_df to feature_importance_df along axis 0.
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    print("feature_importance_df.shape:",feature_importance_df.shape)
    
    # Compute the root mean squared error over the out-of-fold predictions filled in so far
    print("CV score: {:<8.5f}".format(mean_squared_error(target, oof)**0.5))
    print("#"*30,"第", fold_n + 1,"批样本学习、预测、评估结束","#"*30)
fold_n 0
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000]	training's rmse: 0.623114	valid_1's rmse: 0.682914
Early stopping, best iteration is:
[1205]	training's rmse: 0.61358	valid_1's rmse: 0.681788
Samples in this validation fold: 1598
feature_importance_df.shape: (66, 3)
CV score: 3.54968 
############################## fold 1 training, prediction, and evaluation finished ##############################
fold_n 1
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000]	training's rmse: 0.617676	valid_1's rmse: 0.692788
Early stopping, best iteration is:
[918]	training's rmse: 0.621868	valid_1's rmse: 0.692082
Samples in this validation fold: 1598
feature_importance_df.shape: (132, 3)
CV score: 3.10340 
############################## fold 2 training, prediction, and evaluation finished ##############################
fold_n 2
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000]	training's rmse: 0.62993	valid_1's rmse: 0.64831
Early stopping, best iteration is:
[1346]	training's rmse: 0.613892	valid_1's rmse: 0.647217
Samples in this validation fold: 1598
feature_importance_df.shape: (198, 3)
CV score: 2.55896 
############################## fold 3 training, prediction, and evaluation finished ##############################
fold_n 3
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000]	training's rmse: 0.619417	valid_1's rmse: 0.699119
Early stopping, best iteration is:
[1035]	training's rmse: 0.617717	valid_1's rmse: 0.698929
Samples in this validation fold: 1597
feature_importance_df.shape: (264, 3)
CV score: 1.87172 
############################## fold 4 training, prediction, and evaluation finished ##############################
fold_n 4
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000]	training's rmse: 0.620862	valid_1's rmse: 0.68395
Early stopping, best iteration is:
[1181]	training's rmse: 0.612383	valid_1's rmse: 0.683083
Samples in this validation fold: 1597
feature_importance_df.shape: (330, 3)
CV score: 0.68085 
############################## fold 5 training, prediction, and evaluation finished ##############################
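
A version note of my own: in newer LightGBM releases (3.3 and later), the verbose_eval and early_stopping_rounds keyword arguments of lgb.train() were replaced by callbacks. On a recent install, the training call above would look roughly like this sketch (assuming the same params, trn_data, val_data, and num_round as above):

# Sketch for newer LightGBM (>= 3.3), where the keyword arguments became callbacks
clf = lgb.train(params, trn_data, num_round,
                valid_sets=[trn_data, val_data],
                callbacks=[lgb.early_stopping(stopping_rounds=100),  # stop after 100 rounds without improvement
                           lgb.log_evaluation(period=1000)])         # report metrics every 1000 rounds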
feature_importance_df
                 feature  importance  fold
0                    edu          80     1
1                 edu_yr         329     1
2              political          16     1
3             join_party           7     1
4             property_8           9     1
5             weight_jin         338     1
6                 health         346     1
7         health_problem         116     1
8             depression         464     1
9                  hukou          19     1
10               media_1          76     1
11               media_2          32     1
12               media_3         130     1
13               media_4         118     1
14               media_5          77     1
15               media_6          33     1
16             leisure_1          62     1
17             leisure_3          70     1
18             leisure_4          28     1
19             leisure_6         144     1
20             leisure_7          40     1
21             leisure_8         197     1
22             leisure_9         165     1
23            leisure_12          25     1
24             socialize          48     1
25                 relax         191     1
26                 learn          49     1
27         social_friend         234     1
28          socia_outing          51     1
29                equity         648     1
..                   ...         ...   ...
36         family_status         238     5
37                 house         146     5
38                   car          70     5
39              invest_1          45     5
40              invest_2           5     5
41                 s_edu         248     5
42           s_political          56     5
43               s_hukou          59     5
44           status_peer         141     5
45       status_3_before         186     5
46                  view         215     5
47               trust_1          89     5
48               trust_2          86     5
49               trust_5          93     5
50               trust_7          60     5
51               trust_8         112     5
52              trust_10         100     5
53              trust_12         125     5
54  neighbor_familiarity         161     5
55      public_service_1         138     5
56      public_service_2         130     5
57      public_service_3          89     5
58      public_service_4         119     5
59      public_service_5         159     5
60      public_service_6         220     5
61      public_service_7         143     5
62      public_service_8         137     5
63      public_service_9         182     5
64                   Age         256     5
65            work_exper         127     5

330 rows × 3 columns

cols = (feature_importance_df[["feature", "importance"]]
        .groupby("feature")# group by feature
        .mean()# mean importance of each feature across folds
        .sort_values(by="importance", ascending=False)[:1000].index)# sort by importance in descending order, take up to the first 1000 features, and return an Index

## pandas.DataFrame.isin(values): whether each element in the DataFrame is contained in values; returns a DataFrame of booleans
## keep only the rows of feature_importance_df whose feature appears in cols
best_features = feature_importance_df.loc[feature_importance_df.feature.isin(cols)]

# Draw a horizontal bar chart of feature importance
plt.figure(figsize=(14,26))
sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('LightGBM Features (averaged over folds)')
plt.tight_layout()

[Figure: horizontal bar chart of LightGBM feature importances averaged over folds]

# Assemble the results
submit = pd.read_csv("happiness_submit.csv")
submision_lgb1 = pd.DataFrame({"id":submit['id'].values})
submision_lgb1["happiness"] = predictions
submision_lgb1.head(5)
     id  happiness
0  8001   3.813459
1  8002   2.934148
2  8003   3.299967
3  8004   4.294977
4  8005   3.306847
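
Note that the model is a regressor, so the predictions are continuous values such as 3.813459 rather than the integer labels 1-5. The competition scores submissions by mean squared error, so submitting the raw floats is fine; still, as an optional tweak of my own (not in the original notebook), they can at least be clipped into the valid label range:

# My own optional tweak (not in the original notebook): keep predictions within [1, 5]
submision_lgb1["happiness"] = np.clip(submision_lgb1["happiness"], 1, 5)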

2.2 Model Tuning (omitted)

3 Generating the Results File

# 3 Generate the results file
# Get a timestamp
time_str = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
out_dir = "csvResult/{}/".format(time_str)
os.makedirs(out_dir)

# Save the model (note: clf here is the Booster from the final fold only)
clf.save_model(out_dir + "model.txt")
# Save the results
submision_lgb1.to_csv(out_dir + "happiness_submision_lgbm.csv",index=False)
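
The saved model can be reloaded later without retraining; a minimal sketch of my own (not in the original notebook), assuming the same out_dir as above:

# My own sketch (not in the original notebook): reload the saved Booster and reuse it
loaded = lgb.Booster(model_file=out_dir + "model.txt")
preds_again = loaded.predict(test)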

Submitting the Results

Checking the Score

Summary

The 12-day machine learning camp has come to an end, and I gained a great deal. Thanks to the Tianchi Longzhu Program, thanks to the organizers, thanks to my team 地坛TBD, and thanks to myself as well. Team-based learning has always suited me better, I think because it brings supervision and deadlines. I have benefited from that, but I hope not to be limited by it: even on my own, I want to keep my self-discipline and my enthusiasm for learning.
