Task 10: The "Come and Mine Happiness Together" Competition
References
Machine Learning Training Camp (Tianchi Dragon Ball Plan): https://tianchi.aliyun.com/specials/promotion/aicampml?invite_channel=1
Competition page: https://tianchi.aliyun.com/competition/entrance/231702/introduction
Reference notebook: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.9.52a149f4iLWWgq&postId=58167
Preface
Since this is my first competition and I had no idea where to start, I could only browse the forum for approaches others had shared. This post is therefore a study and reproduction of someone else's notebook, with the goal of understanding both the ideas and the code. I first re-ran the original notebook's code locally to check whether the results could be reproduced: every step matched the notebook's results, and I obtained the final csv file, which shows the code works. Many thanks to the author for sharing. I then worked through the code line by line; because I still know very little, my annotations are fairly detailed. In my comments, text after "#" describes what the code does, while text after "##" explains the syntax.
Opening the Local Environment
In Anaconda Prompt, type jupyter notebook; after a moment a web page opens automatically.
The page lists the files under the C:\Users\Administrator directory.
Open the Happiness folder you created and copy the data files downloaded from the competition page into it.
Create an ipynb file named HappinessModel_complete; all of our code goes in there.
Code Walkthrough
0 Import Libraries
import os
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# LightGBM (Light Gradient Boosting Machine) is Microsoft's open-source framework implementing the GBDT algorithm, with support for efficient parallel training.
# GBDT (Gradient Boosting Decision Tree): DT is the decision tree, GB (Gradient Boosting) is the learning strategy; GBDT simply means a decision-tree model trained with the Gradient Boosting strategy.
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, roc_curve
# K-fold cross-validator
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Mean squared error
from sklearn.metrics import mean_squared_error
1 Getting to Know the Data
1.1 Loading the Dataset
# 1 Getting to know the data
# 1.1 Load the dataset
## pandas.read_csv(): read the training and test sets into their respective DataFrames
### parse_dates: the list ["survey_time"] tells pandas to parse the "survey_time" column as datetimes.
### encoding: the codec used for reading/writing. 'latin-1' (also known as 'iso-8859-1') is the simplest text encoding: it maps code points 0-255 directly to bytes 0x00-0xff, so decoding never fails.
train = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
test = pd.read_csv("happiness_test_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
## pandas.DataFrame.head(n): n defaults to 5; returns the first n rows of the object
train.head()
  | id | happiness | survey_type | province | city | county | survey_time | gender | birth | nationality | ... | neighbor_familiarity | public_service_1 | public_service_2 | public_service_3 | public_service_4 | public_service_5 | public_service_6 | public_service_7 | public_service_8 | public_service_9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 4 | 1 | 12 | 32 | 59 | 2015-08-04 14:18:00 | 1 | 1959 | 1 | ... | 4 | 50 | 60 | 50 | 50 | 30.0 | 30 | 50 | 50 | 50 |
1 | 2 | 4 | 2 | 18 | 52 | 85 | 2015-07-21 15:04:00 | 1 | 1992 | 1 | ... | 3 | 90 | 70 | 70 | 80 | 85.0 | 70 | 90 | 60 | 60 |
2 | 3 | 4 | 2 | 29 | 83 | 126 | 2015-07-21 13:24:00 | 2 | 1967 | 1 | ... | 4 | 90 | 80 | 75 | 79 | 80.0 | 90 | 90 | 90 | 75 |
3 | 4 | 5 | 2 | 10 | 28 | 51 | 2015-07-25 17:33:00 | 2 | 1943 | 1 | ... | 3 | 100 | 90 | 70 | 80 | 80.0 | 90 | 90 | 80 | 80 |
4 | 5 | 4 | 1 | 7 | 18 | 36 | 2015-08-10 09:50:00 | 2 | 1994 | 1 | ... | 2 | 50 | 50 | 50 | 50 | 50.0 | 50 | 50 | 50 | 50 |
5 rows × 140 columns
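A quick aside on the encoding choice (my own sketch, not part of the original notebook): the raw CSVs contain Chinese free-text columns such as edu_other that are most likely not UTF-8 encoded, so a plain read can raise UnicodeDecodeError. Because latin-1 can decode every possible byte, the read always succeeds, at the cost of mojibake in those few text columns.
# Hedged sketch: fall back to latin-1 only if the default utf-8 codec fails
import pandas as pd
try:
    df = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"])
except UnicodeDecodeError:
    df = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding="latin-1")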
1.2 Understanding the Dataset
# 1.2 Understand the dataset
# Inspect the distribution of each feature
train.describe()
  | id | happiness | survey_type | province | city | county | gender | birth | nationality | religion | ... | neighbor_familiarity | public_service_1 | public_service_2 | public_service_3 | public_service_4 | public_service_5 | public_service_6 | public_service_7 | public_service_8 | public_service_9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 8000.00000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.00000 | 8000.000000 | 8000.00000 | 8000.000000 | ... | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.000000 | 8000.00000 | 8000.000000 | 8000.000000 |
mean | 4000.50000 | 3.850125 | 1.405500 | 15.155375 | 42.564750 | 70.619000 | 1.53000 | 1964.707625 | 1.37350 | 0.772250 | ... | 3.722250 | 70.809500 | 68.170000 | 62.737625 | 66.320125 | 62.794187 | 67.064000 | 66.09625 | 65.626750 | 67.153750 |
std | 2309.54541 | 0.938228 | 0.491019 | 8.917100 | 27.187404 | 38.747503 | 0.49913 | 16.842865 | 1.52882 | 1.071459 | ... | 1.143358 | 21.184742 | 20.549943 | 24.771319 | 22.049437 | 23.463162 | 21.586817 | 23.08568 | 23.827493 | 22.502203 |
min | 1.00000 | -8.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1921.000000 | -8.00000 | -8.000000 | ... | -8.000000 | -3.000000 | -3.000000 | -3.000000 | -3.000000 | -3.000000 | -3.000000 | -3.00000 | -3.000000 | -3.000000 |
25% | 2000.75000 | 4.000000 | 1.000000 | 7.000000 | 18.000000 | 37.000000 | 1.00000 | 1952.000000 | 1.00000 | 1.000000 | ... | 3.000000 | 60.000000 | 60.000000 | 50.000000 | 60.000000 | 55.000000 | 60.000000 | 60.00000 | 60.000000 | 60.000000 |
50% | 4000.50000 | 4.000000 | 1.000000 | 15.000000 | 42.000000 | 73.000000 | 2.00000 | 1965.000000 | 1.00000 | 1.000000 | ... | 4.000000 | 79.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.00000 | 70.000000 | 70.000000 |
75% | 6000.25000 | 4.000000 | 2.000000 | 22.000000 | 65.000000 | 104.000000 | 2.00000 | 1977.000000 | 1.00000 | 1.000000 | ... | 5.000000 | 80.000000 | 80.000000 | 80.000000 | 80.000000 | 80.000000 | 80.000000 | 80.00000 | 80.000000 | 80.000000 |
max | 8000.00000 | 5.000000 | 2.000000 | 31.000000 | 89.000000 | 134.000000 | 2.00000 | 1997.000000 | 8.00000 | 1.000000 | ... | 5.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.00000 | 100.000000 | 100.000000 |
8 rows × 136 columns
# Inspect train's metadata: index, columns, feature dtypes, and memory usage
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Columns: 140 entries, id to public_service_9
dtypes: datetime64[ns](1), float64(25), int64(111), object(3)
memory usage: 8.5+ MB
# Count the missing values in each feature
train.isnull().sum()
id 0
happiness 0
survey_type 0
province 0
city 0
county 0
survey_time 0
gender 0
birth 0
nationality 0
religion 0
religion_freq 0
edu 0
edu_other 7997
edu_status 1120
edu_yr 1972
income 0
political 0
join_party 7176
floor_area 0
property_0 0
property_1 0
property_2 0
property_3 0
property_4 0
property_5 0
property_6 0
property_7 0
property_8 0
property_other 7934
...
m_political 0
m_work_14 0
status_peer 0
status_3_before 0
view 0
inc_ability 0
inc_exp 0
trust_1 0
trust_2 0
trust_3 0
trust_4 0
trust_5 0
trust_6 0
trust_7 0
trust_8 0
trust_9 0
trust_10 0
trust_11 0
trust_12 0
trust_13 0
neighbor_familiarity 0
public_service_1 0
public_service_2 0
public_service_3 0
public_service_4 0
public_service_5 0
public_service_6 0
public_service_7 0
public_service_8 0
public_service_9 0
Length: 140, dtype: int64
# Drop the training rows whose happiness label is invalid
## train['happiness'] is an 8000-row Series holding every training sample's happiness value; -8 marks an invalid answer
## pandas.DataFrame.loc: access a group of rows and columns by label or boolean array; here we keep every row whose happiness is not -8.
train = train.loc[train['happiness'] != -8]
train.shape
(7988, 140)
# Plot the distribution of the happiness values; the class imbalance is obvious
## matplotlib.pyplot.subplots(nrows, ncols, figsize=...): create a figure with a grid of nrows × ncols subplots; one Figure object holds several Axes.
## Here we draw 1 row × 2 columns of subplots on a 12 × 5.5 canvas; the two axes are ax[0] and ax[1], and figure is the current Figure object.
figure,ax=plt.subplots(1,2,figsize=(12,5.5))
## pandas.Series.value_counts(): return a Series of unique-value counts, sorted by frequency in descending order.
## pandas.Series.plot.pie(): draw a pie chart.
print(train['happiness'].value_counts())
train['happiness'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('happiness')
ax[0].set_ylabel('')
## pandas.Series.plot.bar(): draw a vertical bar chart.
train['happiness'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('happiness')
plt.show()
4 4818
5 1410
3 1159
2 497
1 104
Name: happiness, dtype: int64
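For a proportion view of the same distribution (a small addition of mine, not in the original notebook), value_counts can normalize the counts; about 60% of respondents answer 4.
## value_counts(normalize=True) returns relative frequencies instead of raw counts
print(train['happiness'].value_counts(normalize=True).round(3))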
# Explore the relationship between gender and happiness
## Seaborn expects its input as a pandas DataFrame or a NumPy array.
## A common call pattern: sns.<plotname>(x='x column', y='y column', hue='grouping column', data=DataFrame).
## seaborn.countplot(): draw a bar chart of the number of observations in each categorical bin; returns a matplotlib Axes object.
ax = sns.countplot('gender',hue='happiness',data=train)
ax.set_title('Sex:happiness')
Text(0.5,1,'Sex:happiness')
# Explore the relationship between age and happiness
# First show the age distribution with a histogram
## pandas.Series.dt.year: return the year component of a datetime
train['survey_time'] = train['survey_time'].dt.year
test['survey_time'] = test['survey_time'].dt.year
train['Age'] = train['survey_time'] - train['birth']
test['Age'] = test['survey_time'] - test['birth']
## del_list=['survey_time','birth']
figure,ax = plt.subplots(1,1)
## pandas.Series.plot.hist(): draw a histogram.
train['Age'].plot.hist(ax=ax,color='blue')
<matplotlib.axes._subplots.AxesSubplot at 0x203d435c208>
# Draw a bar chart of age versus happiness.
# Split age into 6 buckets and show how the 5 happiness levels are distributed within each.
# Binning age like this limits the influence of noise and outliers.
# Note: this training set has no samples aged <= 16
combine=[train,test]
for dataset in combine:
## where (dataset['Age'] <= 16) holds, set the 'Age' value of those rows to 0
dataset.loc[dataset['Age']<=16,'Age']= 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[(dataset['Age'] > 64) & (dataset['Age'] <= 80), 'Age'] = 4
dataset.loc[ dataset['Age'] > 80, 'Age'] = 5
sns.countplot('Age', hue='happiness', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x203d43fedd8>
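As an aside, the same binning can be written more declaratively with pd.cut. This is my own equivalent sketch, not the original notebook's code; the bin edges mirror the loc-based assignments above, with each interval left-open and right-closed.
import pandas as pd
ages = pd.Series([15, 20, 40, 55, 70, 90])           # toy data
bins = [-1, 16, 32, 48, 64, 80, 200]                 # 200 is an arbitrary safe upper bound
binned = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4, 5]).astype(int)
print(binned.tolist())                               # [0, 1, 2, 3, 4, 5]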
figure1,ax1 = plt.subplots(1,5,figsize=(20,4))
train['happiness'][train['Age']==1].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[0],shadow=True)
train['happiness'][train['Age']==2].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[1],shadow=True)
train['happiness'][train['Age']==3].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[2],shadow=True)
train['happiness'][train['Age']==4].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[3],shadow=True)
train['happiness'][train['Age']==5].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[4],shadow=True)
<matplotlib.axes._subplots.AxesSubplot at 0x203d45e0f60>
1.3 Feature Selection
# 1.3 Feature selection
# Keep only the features whose correlation with happiness exceeds 0.05 in absolute value
## pandas.DataFrame.corr(): compute pairwise correlation of columns, excluding NA/null values; it measures the direction and strength with which two columns vary together.
## Values range from -1 to +1: 0 means uncorrelated, positive means positively correlated, negative means negatively correlated, and a larger magnitude means a stronger correlation.
## Python abs(): absolute value
train.corr()['happiness'][abs(train.corr()['happiness'])>0.05]
happiness 1.000000
edu 0.103048
edu_yr 0.055564
political 0.080986
join_party 0.069007
property_8 -0.051929
weight_jin 0.085841
health 0.250538
health_problem 0.186620
depression 0.304973
hukou 0.072936
media_1 0.095035
media_2 0.084872
media_3 0.091431
media_4 0.098809
media_5 0.065220
media_6 0.059273
leisure_1 -0.077097
leisure_3 -0.070262
leisure_4 -0.095676
leisure_6 -0.107672
leisure_7 -0.072011
leisure_8 -0.100313
leisure_9 -0.148888
leisure_12 -0.068778
socialize 0.082206
relax 0.113233
learn 0.108294
social_friend -0.091079
socia_outing 0.059567
...
family_income 0.051506
family_m 0.061062
family_status 0.204702
house 0.089261
car -0.085387
invest_1 -0.055013
invest_2 0.054019
s_edu 0.125679
s_political 0.068802
s_hukou 0.071953
status_peer -0.150246
status_3_before -0.076808
view 0.078986
trust_1 0.069830
trust_2 0.054909
trust_5 0.102110
trust_7 0.060102
trust_8 0.065644
trust_10 0.069740
trust_12 0.057885
neighbor_familiarity 0.054074
public_service_1 0.112537
public_service_2 0.126029
public_service_3 0.134028
public_service_4 0.129880
public_service_5 0.136347
public_service_6 0.162514
public_service_7 0.154029
public_service_8 0.128678
public_service_9 0.129723
Name: happiness, Length: 65, dtype: float64
# Take the features whose absolute correlation with happiness exceeds 0.05 as candidates, plus a few features we consider important; 66 features in total go into training
features = (train.corr()['happiness'][abs(train.corr()['happiness'])>0.05]).index
## features is a pandas.Index object: an immutable ndarray implementing an ordered, sliceable set.
print(features)
## pandas.Index.values: return the underlying data array, then convert it to a list.
## features is now a plain Python list
features = features.values.tolist()
features.extend(['Age', 'work_exper'])
features.remove('happiness')
print(features)
len(features)
Index(['happiness', 'edu', 'edu_yr', 'political', 'join_party', 'property_8',
'weight_jin', 'health', 'health_problem', 'depression', 'hukou',
'media_1', 'media_2', 'media_3', 'media_4', 'media_5', 'media_6',
'leisure_1', 'leisure_3', 'leisure_4', 'leisure_6', 'leisure_7',
'leisure_8', 'leisure_9', 'leisure_12', 'socialize', 'relax', 'learn',
'social_friend', 'socia_outing', 'equity', 'class', 'class_10_before',
'class_10_after', 'class_14', 'family_income', 'family_m',
'family_status', 'house', 'car', 'invest_1', 'invest_2', 's_edu',
's_political', 's_hukou', 'status_peer', 'status_3_before', 'view',
'trust_1', 'trust_2', 'trust_5', 'trust_7', 'trust_8', 'trust_10',
'trust_12', 'neighbor_familiarity', 'public_service_1',
'public_service_2', 'public_service_3', 'public_service_4',
'public_service_5', 'public_service_6', 'public_service_7',
'public_service_8', 'public_service_9'],
dtype='object')
['edu', 'edu_yr', 'political', 'join_party', 'property_8', 'weight_jin', 'health', 'health_problem', 'depression', 'hukou', 'media_1', 'media_2', 'media_3', 'media_4', 'media_5', 'media_6', 'leisure_1', 'leisure_3', 'leisure_4', 'leisure_6', 'leisure_7', 'leisure_8', 'leisure_9', 'leisure_12', 'socialize', 'relax', 'learn', 'social_friend', 'socia_outing', 'equity', 'class', 'class_10_before', 'class_10_after', 'class_14', 'family_income', 'family_m', 'family_status', 'house', 'car', 'invest_1', 'invest_2', 's_edu', 's_political', 's_hukou', 'status_peer', 'status_3_before', 'view', 'trust_1', 'trust_2', 'trust_5', 'trust_7', 'trust_8', 'trust_10', 'trust_12', 'neighbor_familiarity', 'public_service_1', 'public_service_2', 'public_service_3', 'public_service_4', 'public_service_5', 'public_service_6', 'public_service_7', 'public_service_8', 'public_service_9', 'Age', 'work_exper']
66
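A minor efficiency note (my own suggestion, not from the original notebook): train.corr() recomputes the full correlation matrix on every call, and the cell above calls it twice. Computing it once and reusing the resulting Series is equivalent:
corr_with_target = train.corr()['happiness']
features = corr_with_target[corr_with_target.abs() > 0.05].index.tolist()
features.extend(['Age', 'work_exper'])
features.remove('happiness')
len(features)   # 66, matching the cell above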
2 Model Building
2.1 A Baseline Model
# 2 Model building
# 2.1 Baseline model
# Split out the features and the target
target = train['happiness']
train_selected = train[features]
test = test[features]
feature_importance_df = pd.DataFrame()
## numpy.zeros(): return a new zero-filled array of the given shape and type. oof will hold each training sample's out-of-fold prediction, and predictions will accumulate the test-set predictions averaged over the 5 folds.
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
params = {'num_leaves': 9,
'min_data_in_leaf': 40,
'objective': 'regression',
'max_depth': 16,
'learning_rate': 0.01,
'boosting': 'gbdt',
'bagging_freq': 5,
'bagging_fraction': 0.8, # fraction of the data used in each iteration
'feature_fraction': 0.8201,# each tree is built on a random ~82% subset of the features
'bagging_seed': 11,
'reg_alpha': 1.728910519108444,
'reg_lambda': 4.9847051755586085,
'random_state': 42,
'metric': 'rmse',
'verbosity': -1,
'subsample': 0.81, # duplicates bagging_fraction; LightGBM warns and ignores it (see the training logs below)
'min_gain_to_split': 0.01077313523861969,
'min_child_weight': 19.428902804238373,
'num_threads': 4}
## Class sklearn.model_selection.KFold: K-fold cross-validator. ## Provides train/validation indices to split the data into train/validation sets.
## The dataset is split into k consecutive folds (without shuffling by default); each fold is then used once as validation while the remaining k-1 folds form the training set.
## n_splits: number of folds.
## shuffle: whether to shuffle the data before splitting into batches.
## random_state: when shuffle is True, random_state controls the ordering of the indices, and hence the randomness of each fold.
kfolds = KFold(n_splits=5,shuffle=True,random_state=15)
## KFold.split(): generate indices to split the data into training and validation sets; yields (train indices, validation indices) pairs.
for fold_n,(trn_index,val_index) in enumerate(kfolds.split(train_selected,target)):
print("fold_n {}".format(fold_n))
# Convert to LightGBM's Dataset format
# pandas.DataFrame.iloc: select rows by integer position
trn_data = lgb.Dataset(train_selected.iloc[trn_index],label=target.iloc[trn_index])
val_data = lgb.Dataset(train_selected.iloc[val_index],label=target.iloc[val_index])
# Set the maximum number of boosting rounds
num_round=10000
# Train
## lgb.train() parameters:
## params: learner parameters;
## train_set: the training set
## num_boost_round: number of boosting rounds
## feval: custom evaluation function
## verbose_eval: report the metric after every n rounds of validation.
## early_stopping_rounds: early stopping; if the metric has not improved after n more rounds, training stops. Note: when setting this you must also set a metric, otherwise LightGBM treats the metric as missing.
## Returns a Booster object
clf = lgb.train(params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 100)
# Predict
## num_iteration: number of iterations to use for prediction
## predict() returns a NumPy array
## each fold predicts its len(val_index) held-out samples
oof[val_index] = clf.predict(train_selected.iloc[val_index], num_iteration=clf.best_iteration)
# Accumulate the test-set predictions, averaged over the 5 folds
predictions += clf.predict(test,num_iteration=clf.best_iteration)/ 5
print("Validation samples in this fold:",len(val_index))
fold_importance_df = pd.DataFrame()
fold_importance_df["feature"] = features
fold_importance_df["importance"] = clf.feature_importance()
fold_importance_df["fold"] = fold_n + 1
## pandas.concat(): concatenate pandas objects along a particular axis.
## Appends this fold's fold_importance_df rows onto feature_importance_df.
feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
print("feature_importance_df.shape:",feature_importance_df.shape)
# Compute the RMSE of the out-of-fold predictions gathered so far
# (oof entries for folds not yet validated are still zero, so this number shrinks fold by fold; only the value printed after fold 5 is the true 5-fold CV RMSE)
print("CV score: {:<8.5f}".format(mean_squared_error(target, oof)**0.5))
print("#"*30, "fold", fold_n + 1, "training, prediction, and evaluation done", "#"*30)
fold_n 0
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.623114 valid_1's rmse: 0.682914
Early stopping, best iteration is:
[1205] training's rmse: 0.61358 valid_1's rmse: 0.681788
Validation samples in this fold: 1598
feature_importance_df.shape: (66, 3)
CV score: 3.54968
############################## fold 1 training, prediction, and evaluation done ##############################
fold_n 1
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.617676 valid_1's rmse: 0.692788
Early stopping, best iteration is:
[918] training's rmse: 0.621868 valid_1's rmse: 0.692082
Validation samples in this fold: 1598
feature_importance_df.shape: (132, 3)
CV score: 3.10340
############################## fold 2 training, prediction, and evaluation done ##############################
fold_n 2
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.62993 valid_1's rmse: 0.64831
Early stopping, best iteration is:
[1346] training's rmse: 0.613892 valid_1's rmse: 0.647217
Validation samples in this fold: 1598
feature_importance_df.shape: (198, 3)
CV score: 2.55896
############################## fold 3 training, prediction, and evaluation done ##############################
fold_n 3
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.619417 valid_1's rmse: 0.699119
Early stopping, best iteration is:
[1035] training's rmse: 0.617717 valid_1's rmse: 0.698929
Validation samples in this fold: 1597
feature_importance_df.shape: (264, 3)
CV score: 1.87172
############################## fold 4 training, prediction, and evaluation done ##############################
fold_n 4
[LightGBM] [Warning] bagging_fraction is set=0.8, subsample=0.81 will be ignored. Current value: bagging_fraction=0.8
Training until validation scores don't improve for 100 rounds
[1000] training's rmse: 0.620862 valid_1's rmse: 0.68395
Early stopping, best iteration is:
[1181] training's rmse: 0.612383 valid_1's rmse: 0.683083
Validation samples in this fold: 1597
feature_importance_df.shape: (330, 3)
CV score: 0.68085
############################## fold 5 training, prediction, and evaluation done ##############################
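To make the splitting behaviour concrete, here is a minimal standalone illustration of KFold.split (my own sketch, not from the original notebook): each of the n_splits iterations yields integer index arrays for roughly (k-1)/k training rows and 1/k validation rows, which is why len(val_index) alternates between 1598 and 1597 on our 7988 training rows.
import numpy as np
from sklearn.model_selection import KFold
toy = np.arange(10).reshape(5, 2)                    # 5 samples, 2 features
for fold, (trn_idx, val_idx) in enumerate(KFold(n_splits=5).split(toy)):
    print(fold, trn_idx, val_idx)                    # e.g. 0 [1 2 3 4] [0]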
feature_importance_df
  | feature | importance | fold
---|---|---|---
0 | edu | 80 | 1 |
1 | edu_yr | 329 | 1 |
2 | political | 16 | 1 |
3 | join_party | 7 | 1 |
4 | property_8 | 9 | 1 |
5 | weight_jin | 338 | 1 |
6 | health | 346 | 1 |
7 | health_problem | 116 | 1 |
8 | depression | 464 | 1 |
9 | hukou | 19 | 1 |
10 | media_1 | 76 | 1 |
11 | media_2 | 32 | 1 |
12 | media_3 | 130 | 1 |
13 | media_4 | 118 | 1 |
14 | media_5 | 77 | 1 |
15 | media_6 | 33 | 1 |
16 | leisure_1 | 62 | 1 |
17 | leisure_3 | 70 | 1 |
18 | leisure_4 | 28 | 1 |
19 | leisure_6 | 144 | 1 |
20 | leisure_7 | 40 | 1 |
21 | leisure_8 | 197 | 1 |
22 | leisure_9 | 165 | 1 |
23 | leisure_12 | 25 | 1 |
24 | socialize | 48 | 1 |
25 | relax | 191 | 1 |
26 | learn | 49 | 1 |
27 | social_friend | 234 | 1 |
28 | socia_outing | 51 | 1 |
29 | equity | 648 | 1 |
... | ... | ... | ... |
36 | family_status | 238 | 5 |
37 | house | 146 | 5 |
38 | car | 70 | 5 |
39 | invest_1 | 45 | 5 |
40 | invest_2 | 5 | 5 |
41 | s_edu | 248 | 5 |
42 | s_political | 56 | 5 |
43 | s_hukou | 59 | 5 |
44 | status_peer | 141 | 5 |
45 | status_3_before | 186 | 5 |
46 | view | 215 | 5 |
47 | trust_1 | 89 | 5 |
48 | trust_2 | 86 | 5 |
49 | trust_5 | 93 | 5 |
50 | trust_7 | 60 | 5 |
51 | trust_8 | 112 | 5 |
52 | trust_10 | 100 | 5 |
53 | trust_12 | 125 | 5 |
54 | neighbor_familiarity | 161 | 5 |
55 | public_service_1 | 138 | 5 |
56 | public_service_2 | 130 | 5 |
57 | public_service_3 | 89 | 5 |
58 | public_service_4 | 119 | 5 |
59 | public_service_5 | 159 | 5 |
60 | public_service_6 | 220 | 5 |
61 | public_service_7 | 143 | 5 |
62 | public_service_8 | 137 | 5 |
63 | public_service_9 | 182 | 5 |
64 | Age | 256 | 5 |
65 | work_exper | 127 | 5 |
330 rows × 3 columns
cols = (feature_importance_df[["feature", "importance"]]
.groupby("feature")# group by feature
.mean()# mean importance across the 5 folds
.sort_values(by="importance", ascending=False)[:1000].index)# sort by importance descending, take up to the first 1000 features, and build an Index object
## pandas.DataFrame.isin(values): whether each element of the DataFrame is contained in values; returns a boolean DataFrame
## keep only the rows of feature_importance_df whose feature appears in cols
best_features = feature_importance_df.loc[feature_importance_df.feature.isin(cols)]
# Plot a horizontal bar chart of feature importance
plt.figure(figsize=(14,26))
sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('LightGBM Features (averaged over folds)')
plt.tight_layout()
# Assemble the predictions into a submission
submit = pd.read_csv("happiness_submit.csv")
submision_lgb1 = pd.DataFrame({"id":submit['id'].values})
submision_lgb1["happiness"] = predictions
submision_lgb1.head(5)
  | id | happiness
---|---|---
0 | 8001 | 3.813459 |
1 | 8002 | 2.934148 |
2 | 8003 | 3.299967 |
3 | 8004 | 4.294977 |
4 | 8005 | 3.306847 |
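One optional post-processing step worth considering (my own suggestion, not in the original notebook): the happiness labels lie in [1, 5], so clipping the regression outputs to that range can never increase the squared error of an out-of-range prediction.
## pandas.Series.clip(lower, upper): truncate values that fall outside the given bounds
submision_lgb1["happiness"] = submision_lgb1["happiness"].clip(1, 5)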
2.2 Hyperparameter Tuning (omitted)
3 Generating the Result File
# 3 Generate the result file
# Get a timestamp
time_str = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
out_dir = "csvResult/{}/".format(time_str)
os.makedirs(out_dir)
# Save the model (note: clf here is the Booster from the last fold only)
clf.save_model(out_dir + "model.txt")
# Save the results
submision_lgb1.to_csv(out_dir + "happiness_submision_lgbm.csv",index=False)
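As a final sanity check (again my own addition, not in the original notebook), re-read the file just written and confirm its row count matches the test set:
check = pd.read_csv(out_dir + "happiness_submision_lgbm.csv")
assert len(check) == len(test)
print(check.shape)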
Submitting the Results
Checking the Score
Summary
The 12-day machine learning training camp has come to an end, and I gained a great deal from it. Thanks to the Tianchi Dragon Ball Plan, thanks to the organizers, thanks to my team 地坛TBD, and thanks to myself as well. I have always learned best in a small-team setting, I think because it brings supervision and deadlines. I benefit from that, but I hope not to become dependent on it: even on my own, I need to stay disciplined and keep my enthusiasm for learning.