Heywhale (和鲸) Newcomer Competition: Lessons from Employee Satisfaction Prediction


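Hi everyone. I'm a data-analysis hobbyist, and I find Heywhale (和鲸) a good platform for getting started with machine learning (grinding through Kaggle in English was too much of a struggle for me). This post covers the employee satisfaction prediction task from the datajoy newcomer competition, where my submission placed third in round five. I'm sharing what I learned here and hope we can discuss and improve together.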

Environment: Jupyter Notebook (Anaconda)
Language: Python 3
Main libraries: numpy/pandas/matplotlib/seaborn/sklearn

Import packages and data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import OneHotEncoder,LabelEncoder,OrdinalEncoder, MinMaxScaler, MaxAbsScaler
from sklearn.decomposition import PCA
# 1. Load the data and take a first look
train_data = pd.read_csv('/Users/rea/Downloads/和鲸社区/Employee_Satisfaction/训练集.csv')
test_data = pd.read_csv('/Users/rea/Downloads/和鲸社区/Employee_Satisfaction/测试集.csv')
train_data.info()  # info() prints its summary itself and returns None
# 2. Prepare the data
train_data.index = train_data.id
test_data.index = test_data.id

x_train = train_data.drop(['satisfaction_level', 'id'], axis=1)
y_train = train_data.satisfaction_level

x_test = test_data.drop(['id'], axis=1)

Define the data-processing function

def encode(data, pca_comp_num = 3):
    result = pd.DataFrame.copy(data, deep=True)
    # First convert the non-numeric categorical columns into integer codes
    division_le = LabelEncoder()
    package_le = LabelEncoder()
    salary_oe = LabelEncoder()

    result.division = division_le.fit_transform(result['division'])
    result.salary = salary_oe.fit_transform(result['salary'])
    result.package = package_le.fit_transform(result['package'])

    for col in ['last_evaluation', 'average_monthly_hours']:
        maxAbsEnc = MaxAbsScaler()
        result[col] = maxAbsEnc.fit_transform(result[col].values.reshape(-1,1))

    # One-hot encode, then reduce dimensionality with PCA
    for col in ['number_project', 'time_spend_company', 'package', 'division']:
        pca = PCA(n_components=pca_comp_num)
        new_col = pca.fit_transform(pd.get_dummies(data=result[col]).values)
        for i in range(pca_comp_num):
            result[col + '_' + str(i)] = new_col[:,i]
        result.drop(columns = [col], axis=1, inplace=True)
    # salary has only 3 one-hot columns, so it does not need PCA
    # Work_accident and promotion_last_5years each have 2 one-hot columns
    for col in ['Work_accident', 'promotion_last_5years', 'salary']:
        one_hot_encode = pd.get_dummies(data=result[col])
        one_hot_encode.columns = [col + '_' + str(i) for i in range(len(one_hot_encode.columns))]
        result = result.join(one_hot_encode)
        result.drop(col, axis=1, inplace = True)
    return result
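
To make the one-hot-then-PCA step concrete, here is a minimal sketch on a made-up column. The helper `onehot_pca` and the toy values are illustrative assumptions, not part of the competition code. It also makes one constraint visible: PCA cannot return more components than the column has categories (or rows), so `pca_comp_num` has to stay at or below the number of distinct values in every column it is applied to.

```python
import pandas as pd
from sklearn.decomposition import PCA


def onehot_pca(series, n_components):
    """One-hot encode a categorical column, then project it with PCA.

    Hypothetical helper for illustration only.
    """
    # get_dummies yields one column per distinct value
    dummies = pd.get_dummies(series).to_numpy(dtype=float)
    # PCA allows at most min(n_samples, n_features) components
    if n_components > min(dummies.shape):
        raise ValueError("n_components exceeds the number of rows or categories")
    return PCA(n_components=n_components).fit_transform(dummies)


# Toy column with 5 distinct values (made-up data)
toy = pd.Series(['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c'], name='division')
reduced = onehot_pca(toy, 3)
print(reduced.shape)  # 8 rows, 3 PCA components
```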

Process the data

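# Note: the training and test sets must be processed in the same way
# (encode() refits its encoders and PCA on each frame it receives, so these two calls fit separately)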
x_test_cleaned = encode(x_test, 4)
x_train_cleaned = encode(x_train, 4)

x_test_id = x_test_cleaned.index
x_train = x_train_cleaned
x_test = x_test_cleaned
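
Because `encode` is called once per dataset, the label encoders and PCA projections are learned separately for the training and test sets, so the same category is not guaranteed to land on the same feature values in both. One possible way to keep the two aligned is to learn the categories and the PCA projection from the training column only and then reuse them on the test column. The sketch below shows that idea on made-up data; `fit_onehot_pca` and `apply_onehot_pca` are hypothetical helpers, not the code used for the submission.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def fit_onehot_pca(train_col, n_components):
    """Learn the category list and the PCA projection from training data only."""
    dummies = pd.get_dummies(train_col).astype(float)
    pca = PCA(n_components=n_components).fit(dummies.to_numpy())
    return list(dummies.columns), pca


def apply_onehot_pca(col, categories, pca):
    """Encode any column with the categories and PCA learned on the training data."""
    dummies = pd.get_dummies(col).astype(float)
    # Align to the training categories; categories absent from `col` become 0
    dummies = dummies.reindex(columns=categories, fill_value=0.0)
    return pca.transform(dummies.to_numpy())


# Made-up stand-ins for one categorical column of the train and test sets
train_col = pd.Series(['sales', 'IT', 'hr', 'sales', 'IT', 'support'])
test_col = pd.Series(['IT', 'sales', 'hr'])

categories, pca = fit_onehot_pca(train_col, n_components=2)
train_feat = apply_onehot_pca(train_col, categories, pca)
test_feat = apply_onehot_pca(test_col, categories, pca)

# 'IT' is row 1 of train and row 0 of test: both get the same features
print(np.allclose(train_feat[1], test_feat[0]))
```

The same pattern applies to the scalers: fit on the training column, then call `transform` on the test column.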

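Here I use a random forest regressor.

I came up with a particularly dumb and time-consuming approach 😅: enumerate every parameter combination within the given ranges and compare which one gives the lowest MSE on the held-out validation split (20% of the training data).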

# Random forest regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

x_tr, x_va, y_tr, y_va = train_test_split(x_train, y_train, test_size=0.2, random_state=10)
for n in range(100,800,50):
    for m in range(1,25,1):
        estimator = RandomForestRegressor(n_estimators=n,max_features= m)
        estimator.fit(x_tr, y_tr)
        print(n, m, 'MSE:',mean_squared_error(y_va ,estimator.predict(x_va)))
  • Partial results
100 1 MSE: 0.03533629132129167
100 2 MSE: 0.03448767905420833
100 3 MSE: 0.03392650226879167
100 4 MSE: 0.03319066634033334
100 5 MSE: 0.03274114732829166
100 6 MSE: 0.03244751971425
100 7 MSE: 0.03206664105275
100 8 MSE: 0.03168142710029167
100 9 MSE: 0.03162197618991666
100 10 MSE: 0.031356734264625
100 11 MSE: 0.03104515512449999
100 12 MSE: 0.031038263678
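  • Here I picked the best parameters by eye 😅 and trained the model. About three parameter sets performed well, so I went with the one I liked the look of.
# (n_estimators, max_features) = (600, 23), (150, 24) and (350, 24) all look good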
estimator = RandomForestRegressor(n_estimators=600, max_features=23, n_jobs=-1)
estimator.fit(x_tr, y_tr)

print(600, 23, 'MSE:', mean_squared_error(y_va, estimator.predict(x_va)))
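
The nested loop and eyeballing above can also be handed to scikit-learn's `GridSearchCV`, which tries every combination, scores each one with cross-validation rather than a single split, and reports the best one. This is an alternative technique, not what the submission used. The sketch below runs on synthetic data from `make_regression`, and the small grid values are assumptions chosen only to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the competition features (made-up data)
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=10)

# Illustrative grid; the post searched n_estimators 100..750 and max_features 1..24
param_grid = {
    'n_estimators': [50, 100],
    'max_features': [2, 4, 8],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=10),
    param_grid,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, so MSE is negated
    cv=3,
)
search.fit(X, y)

print(search.best_params_)          # best parameter combination found
print('MSE:', -search.best_score_)  # mean cross-validated MSE of that combination
```

By default `GridSearchCV` refits the best combination on all the data passed to `fit`, so `search.predict` can be used directly afterwards.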

Predict on the test set provided by the competition and save the results

y_predict_2 = estimator.predict(x_test)
y_predict_2 = pd.DataFrame(y_predict_2, columns=['satisfaction_level'])

result_2 = pd.DataFrame()
result_2['id'] = x_test_id
result_2['satisfaction_level'] = y_predict_2['satisfaction_level']
result_2.to_csv('/Users/rea/Downloads/submit.csv', index=False)

[Image]

With that many submissions, you can tell I have turned into a full-time parameter tuner...
