Predicting Job Salaries with Machine Learning

一颗洋芋

Last modified 2024-10-22 11:23:06

First published 2024-04-13 20:55:59

Article link: https://blog.csdn.net/m0_58700887/article/details/137724561

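This article describes how to predict job salaries from scraped job postings. It covers data preprocessing (unifying salary units, one-hot encoding), prediction with linear regression, decision tree, and random forest models, and hyperparameter tuning with GridSearchCV. It closes with the prediction results and a visualization.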

This article uses job postings scraped from a recruitment website to predict the average salary for a given position.

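Data preprocessing

A sample of the raw data:

Since this article focuses on the prediction itself, the preprocessing is only described at the level of its logic:

1. Unify the salary unit: convert everything either to annual salary or to monthly salary, expressed in one unit (万 = 10,000 yuan, or 千 = 1,000 yuan).

2. Split the lower and upper bounds of each salary range into two columns, then take their mean.

3. One-hot encode the remaining text columns; see the CSDN blog post "独热编码（One-Hot Encoding）".

4. Because the goal is the average salary per position, the scraped postings have to be handled separately for each position. The example here predicts the "front-end development" position.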
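
Steps 1 and 2 could be sketched as follows. The raw salary format is not shown in this article, so the pattern `<low>-<high><unit>/<period>` (for example `1-1.5万/月`), the column names, and the helper `parse_salary` are all assumptions made for illustration. The target unit is annual salary in 万 (10,000 yuan), matching the preprocessed data described later.

```python
import re

import pandas as pd

# Hypothetical raw format: "<low>-<high><unit>/<period>", e.g. "1-1.5万/月".
# 万 = 10,000 yuan, 千 = 1,000 yuan; 月 = per month, 年 = per year.
SALARY_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(万|千)/(月|年)$")
UNIT_IN_WAN = {"万": 1.0, "千": 0.1}
PERIODS_PER_YEAR = {"月": 12, "年": 1}


def parse_salary(text):
    """Return (low, high, average) as annual salary in 万, or None if unparseable."""
    match = SALARY_PATTERN.match(text.strip())
    if match is None:
        return None
    low, high, unit, period = match.groups()
    factor = UNIT_IN_WAN[unit] * PERIODS_PER_YEAR[period]
    low_wan = float(low) * factor
    high_wan = float(high) * factor
    return low_wan, high_wan, (low_wan + high_wan) / 2


# Every sample string below matches the assumed pattern; real data would
# need unparseable rows (e.g. "negotiable") dropped or handled first.
raw = pd.DataFrame({"salary": ["1-1.5万/月", "8-10千/月", "10-20万/年"]})
parsed = pd.DataFrame(
    [parse_salary(s) for s in raw["salary"]],
    columns=["low", "high", "avg_salary"],
    index=raw.index,
)
result = pd.concat([raw, parsed], axis=1)
print(result)
```

Returning `None` for unmatched strings keeps the parser from silently producing wrong numbers when a posting uses a format the pattern does not cover.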

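A sample of the preprocessed data:

The first column is the average salary (here, annual salary in units of 10,000 yuan), followed by one-hot encodings of four features: work location, company size, work experience, and education. One-hot encoding works like this: if company size has three categories (fewer than 50 people, 50-100 people, more than 100 people), each category becomes its own column, where 1 means the posting belongs to that category and 0 means it does not. The other features are encoded the same way.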
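
As an illustration of the encoding just described, here is a minimal sketch using `pandas.get_dummies`. The column names and category labels are made up for the example; the article does not say which tool was actually used for the encoding.

```python
import pandas as pd

# Toy rows with hypothetical column names and category labels.
jobs = pd.DataFrame({
    "avg_salary": [15.0, 10.8, 24.0],
    "company_size": ["<50", "50-100", ">100"],
    "education": ["Bachelor", "Associate", "Bachelor"],
})

# Each category becomes its own 0/1 column named "<column>_<category>".
encoded = pd.get_dummies(jobs, columns=["company_size", "education"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 among the `company_size_*` columns and exactly one 1 among the `education_*` columns, while `avg_salary` is left untouched.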

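Linear regression prediction

The model comes from sklearn. The dataset has a little over 10,000 rows; 80% is used for training and 20% for testing. Sample code: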

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the preprocessed (one-hot encoded) data
data = pd.read_csv('../input_data/test.csv', encoding='latin1')
output_path = '../output_data/LinearRegression/test/'  # directory must already exist

# Separate features and labels
features = data.drop('avg_salary', axis=1)
labels = data['avg_salary']

# Split into training and test sets -- 80% training, 20% testing
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, random_state=42)

# Create the linear regression model
model = LinearRegression()

# Hyperparameter search space.
# copy_X and n_jobs do not change the fitted coefficients, so they are left out.
param_grid = {
    'fit_intercept': [True, False],
    'positive': [True, False]  # True forces all coefficients to be non-negative
}

# Tune hyperparameters with a 5-fold grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(train_features, train_labels)

# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best params:", best_params)

# Predict on the test set
predictions = best_model.predict(test_features)

# Save predictions alongside the true labels
results = pd.DataFrame({'Label': test_labels, 'Pred': predictions})
results.to_csv(output_path + 'result.csv', index=False)

# Evaluate model performance
mse = mean_squared_error(test_labels, predictions)
mae = mean_absolute_error(test_labels, predictions)
r2 = r2_score(test_labels, predictions)
rmse = np.sqrt(mse)

# Round to 4 decimal places
mse = round(mse, 4)
mae = round(mae, 4)
r2 = round(r2, 4)
rmse = round(rmse, 4)

# Build a DataFrame holding the evaluation results
result_df = pd.DataFrame({'Metric': ['MSE', 'MAE', 'R2 Score', 'RMSE'],
                          'Value': [mse, mae, r2, rmse]})
# Save as a CSV file
result_df.to_csv(output_path + 'evaluate.csv', index=False)

print("MSE: {:.4f}".format(mse))
print("MAE: {:.4f}".format(mae))
print("R2 Score: {:.4f}".format(r2))
print("RMSE: {:.4f}".format(rmse))
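
To make the four reported metrics concrete, the following sketch computes them by hand with NumPy on four made-up values (not taken from the article's data). The formulas are the standard definitions that `mean_squared_error`, `mean_absolute_error`, and `r2_score` implement.

```python
import numpy as np

# Made-up labels and predictions, in the same unit as avg_salary.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred                  # [-2, 2, -3, 3]

mse = np.mean(errors ** 2)                # (4 + 4 + 9 + 9) / 4 = 6.5
mae = np.mean(np.abs(errors))             # (2 + 2 + 3 + 3) / 4 = 2.5
rmse = np.sqrt(mse)                       # square root of MSE, same unit as the labels

ss_res = np.sum(errors ** 2)                     # residual sum of squares = 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 500
r2 = 1 - ss_res / ss_tot                         # 1 - 26/500 = 0.948

print(mse, mae, rmse, r2)
```

RMSE and MAE are in the same unit as the salary itself, which makes them easier to interpret than MSE; R2 compares the model against always predicting the mean label.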

Decision tree prediction

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the preprocessed (one-hot encoded) data
data = pd.read_csv('../input_data/test.csv', encoding='latin1')
output_path = '../output_data/DecisionTreeRegressor/test/'  # directory must already exist

# Separate features and labels
features = data.drop('avg_salary', axis=1)
labels = data['avg_salary']

# Split into training and test sets -- 80% training, 20% testing
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, random_state=42)

# Create the decision tree regression model
model = DecisionTreeRegressor()

# Hyperparameter search space.
# 'mse' and 'mae' were renamed in scikit-learn 1.0 and removed in 1.2;
# the current names are 'squared_error' and 'absolute_error'.
param_grid = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],  # split quality criterion
    'max_depth': [None, 5, 10],  # maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # minimum samples required at a leaf node
}

# Tune hyperparameters with a 5-fold grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(train_features, train_labels)

# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best params:", best_params)

# Predict on the test set
predictions = best_model.predict(test_features)

# Save predictions alongside the true labels
results = pd.DataFrame({'Label': test_labels, 'Pred': predictions})
results.to_csv(output_path + 'result.csv', index=False)

# Evaluate model performance
mse = mean_squared_error(test_labels, predictions)
mae = mean_absolute_error(test_labels, predictions)
r2 = r2_score(test_labels, predictions)
rmse = np.sqrt(mse)

# Round to 4 decimal places
mse = round(mse, 4)
mae = round(mae, 4)
r2 = round(r2, 4)
rmse = round(rmse, 4)

# Build a DataFrame holding the evaluation results
result_df = pd.DataFrame({'Metric': ['MSE', 'MAE', 'R2 Score', 'RMSE'],
                          'Value': [mse, mae, r2, rmse]})
# Save as a CSV file
result_df.to_csv(output_path + 'evaluate.csv', index=False)

print("MSE: {:.4f}".format(mse))
print("MAE: {:.4f}".format(mae))
print("R2 Score: {:.4f}".format(r2))
print("RMSE: {:.4f}".format(rmse))

Random forest prediction

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the preprocessed (one-hot encoded) data
data = pd.read_csv('../input_data/test.csv', encoding='latin1')
output_path = '../output_data/RandomForestRegressor/test/'  # directory must already exist

# Separate features and labels
features = data.drop('avg_salary', axis=1)
labels = data['avg_salary']

# Split into training and test sets -- 80% training, 20% testing
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, random_state=42)

# Create the random forest regression model
model = RandomForestRegressor()

# Hyperparameter search space.
# max_features='auto' was removed in scikit-learn 1.3; for regressors it meant
# "use all features", which is written as 1.0 today.
param_grid = {
    'n_estimators': [100, 200, 300],  # number of trees
    'max_depth': [None, 5, 10],  # maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],  # minimum samples required at a leaf node
    'max_features': [1.0, 'sqrt']  # features considered when looking for the best split
}

# Tune hyperparameters with a 5-fold grid search.
# The grid has 162 combinations (810 fits), so n_jobs=-1 runs them on all CPU cores.
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
grid_search.fit(train_features, train_labels)

# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best params:", best_params)

# Predict on the test set
predictions = best_model.predict(test_features)

# Save predictions alongside the true labels
results = pd.DataFrame({'Label': test_labels, 'Pred': predictions})
results.to_csv(output_path + 'result.csv', index=False)

# Evaluate model performance
mse = mean_squared_error(test_labels, predictions)
mae = mean_absolute_error(test_labels, predictions)
r2 = r2_score(test_labels, predictions)
rmse = np.sqrt(mse)

# Round to 4 decimal places
mse = round(mse, 4)
mae = round(mae, 4)
r2 = round(r2, 4)
rmse = round(rmse, 4)

# Build a DataFrame holding the evaluation results
result_df = pd.DataFrame({'Metric': ['MSE', 'MAE', 'R2 Score', 'RMSE'],
                          'Value': [mse, mae, r2, rmse]})
# Save as a CSV file
result_df.to_csv(output_path + 'evaluate.csv', index=False)

print("MSE: {:.4f}".format(mse))
print("MAE: {:.4f}".format(mae))
print("R2 Score: {:.4f}".format(r2))
print("RMSE: {:.4f}".format(rmse))

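Evaluation results

Because I used basic models, with little effort spent on parameter settings or feature optimization, the results are not that good. Feel free to improve the models as needed.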
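
Since the three scripts differ only in the estimator, one way to put their metrics side by side is a single loop that builds one comparison table. The sketch below runs on synthetic one-hot data generated on the spot; it is not the article's dataset, the feature names are hypothetical, and the numbers it prints say nothing about the real results. Grid search is left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the one-hot feature table (NOT the article's data).
rng = np.random.default_rng(42)
n_rows = 500
experience = rng.integers(0, 3, size=n_rows)  # 3 hypothetical experience levels
education = rng.integers(0, 3, size=n_rows)   # 3 hypothetical education levels
raw = pd.DataFrame({
    "experience": experience.astype(str),
    "education": education.astype(str),
})
features = pd.get_dummies(raw, columns=["experience", "education"], dtype=int)
labels = 10 + 5 * experience + 3 * education + rng.normal(0, 2, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rows.append({
        "Model": name,
        "MSE": mse,
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
        "RMSE": np.sqrt(mse),
    })

# One row per model, one column per metric, rounded to 4 decimals.
comparison = pd.DataFrame(rows).set_index("Model").round(4)
print(comparison)
```

Swapping the synthetic block for the `features` and `labels` loaded from the preprocessed CSV would give the same table for the real data.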

Visualizing the fit

The fit for a sample of 100 rows (starting at row index 2000); sample code:

import matplotlib.pyplot as plt
import pandas as pd

# Choose one of: LinearRegression, DecisionTreeRegressor, RandomForestRegressor
# (same directory the corresponding script wrote result.csv to)
file_path = '../output_data/LinearRegression/test/'
data = pd.read_csv(file_path + 'result.csv')

# Get the predictions and the true labels
predictions = data['Pred']
test_labels = data['Label']

# Take 100 consecutive data points.
# start_index must be smaller than the number of rows in result.csv.
start_index = 2000  # start index
end_index = start_index + 100  # end index (exclusive)
predictions = predictions[start_index:end_index]
test_labels = test_labels[start_index:end_index]

# Plot true values against predictions
plt.figure(figsize=(12, 5))
plt.plot(test_labels, '--', color='blue', label='Real Value')
plt.plot(predictions, label='LinearRegression', color='red')
plt.xlabel('Index', fontsize=13)
plt.ylabel('Annual Salary (10k yuan)', fontsize=13)

# Hide the top and right borders
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Set the font size of the axis tick labels and the legend
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.legend(fontsize=13)
plt.savefig(file_path + 'plot.png')
plt.show()