Tianchi: Let's Mine Happiness Together!

weixin_45690427

Tags: python, machine learning

Original link: https://tianchi.aliyun.com/notebook-ai/detail?postId=102191

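Happiness is an ancient and profound topic, something people have pursued for generations. The factors tied to happiness are countless and vary from person to person; anything from national policy and the economy down to a roasted sweet potato from a street stall can affect it. Among all these tangled factors, can we find what they have in common and catch a glimpse of what happiness really means?
The Tianchi newcomer competitions are hands-on practice tracks for people new to data work. They use classic competition problems as learning scenarios and come with detailed introductory tutorials that walk you through data mining step by step. Tianchi hopes these competitions will become a popular practical data course at universities and help more students pick up data skills.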
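Competition background
In the social sciences, research on happiness holds an important place. The topic spans philosophy, psychology, sociology, economics, and other disciplines, and is both complex and interesting; it is also close to everyday life, since everyone has their own yardstick for happiness. If we could find the factors that shape happiness in common, life might hold a little more joy; if we could find the policy factors that affect happiness, resources could be allocated better to raise national well-being. Current social science research emphasizes interpretable variables and policies that can actually be implemented, so it mainly relies on linear regression and logistic regression. It has produced a series of hypotheses and findings on socio-economic and demographic factors such as income, health, occupation, social relationships, and leisure, and on macro factors such as government public services, the macroeconomic environment, and tax burden.
This competition takes on the classic problem of predicting happiness. It hopes to see algorithmic attempts along dimensions beyond existing social science research, to combine the strengths of several disciplines, to uncover potential influencing factors, and to find more relationships that are interpretable and understandable.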
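Task description
The competition uses results from a publicly available questionnaire survey. Several groups of variables are selected, including individual variables (gender, age, region, occupation, health, marital status, political affiliation, and so on), family variables (parents, spouse, children, family capital, and so on), and social attitudes (fairness, trust, public services, and so on), to predict each respondent's self-rated happiness.
Prediction accuracy is not the only goal; participants are encouraged to explore the relationships between variables and the meaning of groups of variables.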
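Data description
Because there are many variables and some are related in complex ways, the data comes in a complete version and an abbreviated version. You can start with the abbreviated version to get familiar with the task, then use the complete version to dig out more information. The complete file holds the full set of variables; the abbr file holds the reduced set.

The index file lists the questionnaire item behind each variable and the meaning of each value.

The survey file is the original questionnaire from the data source, included as a supplement to help understand the context of each question.

Data source: the data comes from the Chinese General Social Survey (CGSS) project run by the National Survey Research Center at Renmin University of China. The competition thanks the institution and its staff for providing the data. The CGSS is a cross-sectional, face-to-face survey that uses multi-stage stratified sampling.

External data: the competition is about data mining and analysis and does not restrict the use of external data, such as macroeconomic indicators or public data on government redistribution policies. Participants are welcome to share and discuss.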
Evaluation metric
[Image: evaluation metric formula]
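
The modeling code further down scores predictions with mean squared error (MSE). Assuming that is also the competition metric, it can be computed as in this minimal sketch; the helper `mse` and the sample arrays are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up happiness labels (1-5) and predictions, for illustration only
y_true = [4, 5, 3, 4]
y_pred = [4.0, 4.0, 3.5, 3.0]
score = mse(y_true, y_pred)
print(score)
```

Because the errors are squared, a prediction that is off by two levels costs four times as much as one that is off by one level.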
The code is as follows.
First, import the required packages and libraries:

import os
import time 
import pandas as pd
import numpy as np
import lightgbm as lgb
import seaborn as sns
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import mean_squared_error

# Plotting setup for matplotlib (numpy and time are already imported above)
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.special import jn
from IPython.display import display, clear_output

Configure pandas display options

# Show all columns
pd.set_option('display.max_columns',None)
# Show all rows
pd.set_option('display.max_rows',None)

Read the table that explains each variable:

happiness_index = pd.read_excel('happiness_index.xlsx')
happiness_index.head(50)

Read the training and test data

train = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding='latin-1') 
test = pd.read_csv("happiness_test_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
train.head()

[Image: first five rows of the training set]

train.shape

Check missing values in the training set

train.isnull().sum().sort_values(ascending=False)

Check missing values in the test set
[Image: missing-value counts for the test set]
Summary statistics of the training set
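
Both of these checks can be done with standard pandas calls. A minimal sketch on a made-up frame follows; `toy_test` and its values are illustrative stand-ins, not competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the test set; the real frame comes from pd.read_csv
toy_test = pd.DataFrame({
    "income": [20000, np.nan, 35000, np.nan],
    "edu": [3, 4, np.nan, 2],
    "gender": [1, 2, 2, 1],
})

# Missing values per column, most-missing first
missing = toy_test.isnull().sum().sort_values(ascending=False)
print(missing)

# Summary statistics (count, mean, std, quartiles) per numeric column
summary = toy_test.describe()
print(summary)
```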

Drop training rows with an invalid label

# A happiness value of -8 marks an invalid label; drop those rows
train = train.loc[train['happiness'] != -8]

Look at the distribution of the classes; there is a clear class imbalance

# Pie chart (left) and bar chart (right) of the happiness label counts
f,ax=plt.subplots(1,2,figsize=(18,8))
train['happiness'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('happiness')
ax[0].set_ylabel('')
train['happiness'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('happiness')
plt.show()

Gender versus happiness

# Count plot of happiness levels split by gender
ax = sns.countplot(x='gender',hue='happiness',data=train)
ax.set_title('Sex:happiness')

Age versus happiness

# Age at survey time = survey year - birth year
train['survey_time'] = train['survey_time'].dt.year
test['survey_time'] = test['survey_time'].dt.year
train['Age'] = train['survey_time']-train['birth']
test['Age'] = test['survey_time']-test['birth']
del_list=['survey_time','birth']
figure,ax = plt.subplots(1,1)
train['Age'].plot.hist(ax=ax,color='blue')

Bin the ages to reduce the effect of noise and outliers

# Ages are commonly binned to reduce the effect of noise and outliers
combine=[train,test]

for dataset in combine:
    dataset.loc[dataset['Age']<=16,'Age']=0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 64) & (dataset['Age'] <= 80), 'Age'] = 4
    dataset.loc[ dataset['Age'] > 80, 'Age'] = 5
sns.countplot(x='Age', hue='happiness', data=train)
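
The same binning can also be written with `pd.cut`. This is a sketch on made-up ages (the `ages` series is illustrative); the right-closed bins match the `>` / `<=` comparisons used in the loop.

```python
import pandas as pd

ages = pd.Series([15, 16, 17, 32, 33, 48, 64, 65, 80, 81, 95])

# Right-closed bins: (-inf,16] -> 0, (16,32] -> 1, ..., (80,inf) -> 5
bins = [float("-inf"), 16, 32, 48, 64, 80, float("inf")]
age_group = pd.cut(ages, bins=bins, labels=False)
print(age_group.tolist())
```

Using one call keeps the bin edges in a single list, which makes them easier to change than six separate `.loc` assignments.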

Happiness distribution within each age group

figure1,ax1 = plt.subplots(1,5,figsize=(18,4))
train['happiness'][train['Age']==1].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[0],shadow=True)
train['happiness'][train['Age']==2].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[1],shadow=True)
train['happiness'][train['Age']==3].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[2],shadow=True)
train['happiness'][train['Age']==4].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[3],shadow=True)
train['happiness'][train['Age']==5].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[4],shadow=True)

Correlation analysis
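
One common way to carry out this step is to rank features by their Pearson correlation with the label. The sketch below assumes that approach; `toy` and its columns are made-up stand-ins for the training frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "happiness": [1, 2, 3, 4, 5, 4],
    "health":    [1, 2, 3, 4, 5, 5],
    "Age":       [5, 4, 3, 2, 1, 2],
    "noise":     [3, 1, 4, 1, 5, 9],
})

# Pearson correlation of every other column with the target
corr = toy.corr()["happiness"].drop("happiness")
# Order by absolute strength so strong negative correlations are not buried
corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr)
```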

Missing values

pd.DataFrame(data.isnull().sum()).tail(50)

Start with the time features

# Derive calendar features from the survey timestamp
data['survey_time'] = pd.to_datetime(data['survey_time'],format='%Y-%m-%d %H:%M:%S')
data["weekday"]=data["survey_time"].dt.weekday
data["year"]=data["survey_time"].dt.year
data["quarter"]=data["survey_time"].dt.quarter
data["hour"]=data["survey_time"].dt.hour
data["month"]=data["survey_time"].dt.month

Bucket the hour of day at which the questionnaire was answered

def hour_cut(x):
    if 0<=x<6:
        return 0
    elif  6<=x<8:
        return 1
    elif  8<=x<12:
        return 2
    elif  12<=x<14:
        return 3
    elif  14<=x<18:
        return 4
    elif  18<=x<21:
        return 5
    elif  21<=x<24:
        return 6

    
data["hour_cut"]=data["hour"].map(hour_cut)

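Compute the respondent's age at the time of the survey

data["survey_age"]=data["year"]-data["birth"]

Whether the respondent has joined the Party

# Treat a missing join_party value as "not a member" (0); anything else becomes 1
data["join_party"]=data["join_party"].map(lambda x:0 if pd.isnull(x)  else 1)

Decade of birth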

def birth_split(x):
    if 1920<=x<=1930:
        return 0
    elif  1930<x<=1940:
        return 1
    elif  1940<x<=1950:
        return 2
    elif  1950<x<=1960:
        return 3
    elif  1960<x<=1970:
        return 4
    elif  1970<x<=1980:
        return 5
    elif  1980<x<=1990:
        return 6
    elif  1990<x<=2000:
        return 7
    
data["birth_s"]=data["birth"].map(birth_split)

# Build an income-bracket feature from the income column
def income_cut(income):
    if 0 <= income < 200000:
        return 1
    elif 200000 <= income < 350000:
        return 2
    elif 350000 <= income < 600000:
        return 3
    elif 600000 <= income < 800000:
        return 4
    elif 800000 <= income < 2000000:
        return 5
    elif 2000000 <= income < 5000000:
        return 6
    elif 5000000 <= income:
        return 7
    # Negative codes and missing values match no bracket; keep them as NaN
    return np.nan

# map() returns one value per row, so the result stays aligned with data
data['income'] = data['income'].map(income_cut)

Text data
Three of the features, edu_other, property_other, and invest_other, hold strings and need to be converted to ordinal codes (ordinal encoding).

First look at what was filled in for edu_other.

data_origin[data_origin['edu_other'] != -1]['edu_other'].to_frame()

Every edu_other entry turns out to be 夜校 (night school); convert the strings to ordinal codes.

data_origin['edu_other'] = data_origin['edu_other'].astype('category').values.codes + 1

Next is property_other, which records who holds the title to the home. First check what respondents filled in.

data_origin[data_origin['property_other'] != -1]['property_other'].to_frame()

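Many of the entries mean the same thing; for example, 家庭共同所有 (jointly owned by the family) and 全家所有 (owned by the whole family) are equivalent. In Python they have to be normalized by hand, one row at a time.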

#data_origin.loc[[8009, 9212, 9759, 10517], 'property_other'] = '多人拥有'
#data_origin.loc[[8014, 8056, 10264], 'property_other'] = '未过户'
#data_origin.loc[[8471, 8825, 9597, 9810, 9842, 9967, 10069, 10166, 10203, 10469], 'property_other'] = '全家拥有'
#data_origin.loc[[8553, 8596, 9605, 10421, 10814], 'property_other'] = '无产权'

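# Glossary: 无产权 = no title, 未过户 = title not transferred, 全家拥有 = owned by the whole family,
# 多人拥有 = owned by several people, 小产权 = limited ("minor") property rights
data_origin.loc[[76, 132, 455, 495, 1415, 2511, 2792, 2956, 3647, 4147, 4193, 4589, 5023, 5382, 5492, 6102, 6272, 6339, 
                6507, 7184, 7239], 'property_other'] = '无产权'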
data_origin.loc[[92, 1888, 2703, 3381, 5654], 'property_other'] = '未过户'
data_origin.loc[[99, 619, 2728, 3062, 3222, 3251, 3696, 5283, 6191, 7295, 7376, 7746, 7821, 7917], 'property_other'] = '全家拥有'
data_origin.loc[[1597, 4993, 5398, 5899, 7240, 7776], 'property_other'] = '多人拥有'
data_origin.loc[[6469, 6891], 'property_other'] = '小产权'

Encode the strings as integer ordinal codes.

data_origin['property_other'] = data_origin['property_other'].astype('category').values.codes + 1

Look at invest_other, the investment activities respondents reported.

pd.DataFrame(data_origin[data_origin['invest_other'] != -1]['invest_other'].unique())

Likewise, convert it to integer ordinal codes.

data_origin['invest_other'] = data_origin['invest_other'].astype('category').values.codes + 1
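
To see what this encoding produces, here is a small sketch on a made-up series (the values in `s` are illustrative). Pandas sorts the categories, numbers them from 0, and gives missing values the code -1, so adding 1 maps missing to 0 and the real categories to 1, 2, 3, and so on.

```python
import numpy as np
import pandas as pd

# Made-up free-text answers with one missing value
s = pd.Series(["no title", np.nan, "family-owned", "no title", "not transferred"])

cat = s.astype("category")
print(list(cat.cat.categories))  # sorted unique labels

# Missing values get code -1, so +1 shifts them to 0
codes = cat.cat.codes + 1
print(codes.tolist())
```

Note that the resulting order is alphabetical, not meaningful, so the codes are identifiers more than a true ordinal scale.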

data.drop(['survey_time','survey_type','province','city','county',
          'marital_1st','s_birth','marital_now'],axis=1,inplace=True)

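Fill the missing values with a random forest model

# Random-forest regression imputation: fill columns in order, fewest missing values first
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

datas_pre['inc_ability'] = pd.to_numeric(datas_pre['inc_ability'])

y_full = data_train['happiness']
data_pre_reg = datas_pre.copy()
sort_index = np.argsort(data_pre_reg.isnull().sum(axis=0)).values

# Use integer column labels so that label i and position i coincide
data_pre_reg.columns = [x for x in range(len(data_pre_reg.columns))]

# (In Jupyter, %%time must be the first line of the cell that holds this loop.)
for i in sort_index:
    df = data_pre_reg
    # The column to fill becomes the regression target
    fillc = df.iloc[:,i]
    # Skip columns with nothing to fill; predict() fails on an empty matrix
    if not fillc.isnull().any():
        continue
    # Feature matrix: every other column plus the original label
    df = pd.concat([df.iloc[:,df.columns != i],pd.DataFrame(y_full)],axis=1)
    # Temporarily fill the remaining NaNs in the feature matrix with 0
    # (to_numpy() avoids scikit-learn complaining about mixed int/str column names)
    imp_0 = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0)
    df_0 = pd.DataFrame(imp_0.fit_transform(df.to_numpy()))
    # Rows where the target is present are used for training
    Ytrain = fillc[fillc.notnull()]
    # Rows where the target is missing; only their index is needed
    Ytest = fillc[fillc.isnull()]
    Xtrain = df_0.iloc[Ytrain.index,:]
    Xtest = df_0.iloc[Ytest.index,:]
    # Fit a random forest regressor and predict the missing entries
    rfc = RandomForestRegressor(n_estimators=100,n_jobs=-1)
    rfc = rfc.fit(Xtrain,Ytrain)
    Ypredict = rfc.predict(Xtest)
    
    data_pre_reg.loc[data_pre_reg.iloc[:,i].isnull(),i] = Ypredict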
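
The idea is easier to follow on a small, self-contained example. The sketch below is a simplified variant, not the exact routine above: it leaves the label out of the features, and `rf_impute` and the `toy` frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_impute(df, random_state=0):
    """Fill NaNs column by column, starting with the column that has the fewest."""
    filled = df.copy()
    order = filled.isnull().sum().sort_values().index
    for col in order:
        missing = filled[col].isnull()
        if not missing.any():
            continue  # nothing to fill in this column
        # Features: every other column, with remaining NaNs temporarily set to 0
        features = filled.drop(columns=col).fillna(0)
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        model.fit(features[~missing], filled.loc[~missing, col])
        filled.loc[missing, col] = model.predict(features[missing])
    return filled

# Made-up data: b depends on a, c is independent noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=60)})
toy["b"] = 2 * toy["a"] + rng.normal(scale=0.1, size=60)
toy["c"] = rng.normal(size=60)
toy.loc[[3, 10, 25], "b"] = np.nan
toy.loc[[5, 40], "c"] = np.nan

imputed = rf_impute(toy)
print(imputed.isnull().sum())
```

Filling the least-damaged columns first means that later models train on features that contain fewer placeholder zeros.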

Check for remaining missing values

data_pre_reg.isnull().sum()

Model building and hyperparameter tuning

# Split the data into a training part and a held-out part
from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest = train_test_split(X_train,y_full,test_size=0.3,random_state=1227)

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

model_names = [
    #'linear_reg',
    #'RandomForestRegressor',
    'GradientBoostingRegressors',
    'svr',
    'KNeighborsRegressors',
    'AdaBoostRegressors',
    #'XGBoost',
    #'lightGBM'
    ]

models = [
   # LinearRegression(),
    #RandomForestRegressor(random_state=666),
    GradientBoostingRegressor(random_state=666),
    SVR(),
    KNeighborsRegressor(),
    AdaBoostRegressor(random_state=666),
    
    ]

parm_grids = [
    
#     {'RandomForestRegressor__max_depth':[3,6,9],'RandomForestRegressor__n_estimators':[10,50,70],
#     'RandomForestRegressor__min_samples_split':[3,5,7,9],'RandomForestRegressor__min_samples_leaf':range(2,5)
#     },
    {'GradientBoostingRegressors__n_estimators':[10,50,100],
     'GradientBoostingRegressors__max_depth':range(2,9),
    'GradientBoostingRegressors__min_samples_split':[3,5,7,9],
     'GradientBoostingRegressors__min_samples_leaf':range(2,5)},
    {'svr__degree':[2,3,4]},  # degree only matters for kernel='poly'; SVR() defaults to 'rbf'
    {'KNeighborsRegressors__n_neighbors':[2,5,7,9,10]},
    {'AdaBoostRegressors__n_estimators':[2,9,10,13,15,20,50,100]}
    
    ]

def Grid(pipeline,train_x,train_y,test_x,test_y,param_grid):
    response = {}
    # 3-fold grid search; for regressors the default score is R^2
    gridsearch = GridSearchCV(pipeline,param_grid=param_grid,cv=3)
    search = gridsearch.fit(train_x,train_y)
    print('Best parameters:',search.best_params_)
    print('Best CV score (R^2): %0.4f' % search.best_score_)
    predict_y = gridsearch.predict(test_x)
    # Score against the test_y argument instead of the global ytest
    mse = mean_squared_error(test_y,predict_y)
    response['mse'] = mse
    return response

%%time
for model,model_name,parm_grid in zip(models,model_names,parm_grids):
    #print(model_name,model)
    pipeline = Pipeline([
    #('sta',StandardScaler()),
    #('pca',PCA()),
    (model_name,model),
    ]
    )
    
    result = Grid(pipeline,Xtrain,ytrain,Xtest,ytest,parm_grid)
    print(result)
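
The parameter grids above rely on the pipeline naming rule `<step name>__<parameter>`. A minimal runnable sketch of that pattern on synthetic data follows; the step name `knn` and the generated `X` and `y` are assumptions made for the example, not the competition features.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1227)

# The step name ("knn") is the prefix used in the grid: "<step>__<param>"
pipe = Pipeline([("knn", KNeighborsRegressor())])
grid = {"knn__n_neighbors": [2, 5, 7, 9, 10]}

search = GridSearchCV(pipe, param_grid=grid, cv=3)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("best CV R^2: %.4f" % search.best_score_)
test_mse = mean_squared_error(y_te, search.predict(X_te))
print("held-out MSE: %.4f" % test_mse)
```

If the prefix in the grid does not match a step name in the pipeline, `GridSearchCV` raises an error about an invalid parameter, which is why `model_names` and `parm_grids` have to stay in sync.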

Building and tuning XGBoost and LightGBM

# Try the two popular gradient-boosting libraries
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

# XGBoost with default parameters
xgboo = XGBRegressor().fit(Xtrain,ytrain)
predict = xgboo.predict(Xtest)
print('XGBoost MSE:',mean_squared_error(ytest,predict))
print('XGBoost R^2:',r2_score(ytest,predict))

# LightGBM with default parameters
lgbm = LGBMRegressor().fit(Xtrain,ytrain)
predict = lgbm.predict(Xtest)
print('LightGBM MSE:',mean_squared_error(ytest,predict))
print('LightGBM R^2:',r2_score(ytest,predict))

# Tuned parameters
xgb1 = XGBRegressor(max_depth=6,
                     learning_rate=0.01,
                     n_estimators=3000,
                     verbosity=1,  # replaces the older `silent` flag
                     objective='reg:squarederror',
                     booster='gbtree',
                     n_jobs=-1,
                     gamma=5.4,
                     min_child_weight=6,
                     subsample=0.8,
                     colsample_bytree=1,
                     reg_lambda=1.39,
                     random_state=7)  # replaces the older `seed` argument

# Train the model, predict on the held-out split, and report MSE and R^2
xgb1_best2 = xgb1.fit(Xtrain,ytrain)

predicts = xgb1_best2.predict(Xtest)

print('Tuned model MSE:',mean_squared_error(ytest,predicts))
print('Tuned model R^2:',r2_score(ytest,predicts))

# fit() returns the estimator itself, so the fitted model is reused without refitting
model = xgb1_best2

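Save the model

# Note: newer XGBoost versions choose the file format from the extension ('.json' or '.ubj')
model.save_model(r'The Best Model')

Predict with the model

pre = model.predict(X_test)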
