Employee Attrition Classification with a Logistic Regression (LR) Model

s_daqing

Article link: https://blog.csdn.net/s_daqing/article/details/118422893

The dataset can be downloaded from the Kaggle website: https://www.kaggle.com/

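Advantages of LR:

Simple to implement and widely used on industrial problems;
Very cheap at classification time: prediction is fast and uses few resources;
Makes it easy to inspect the predicted probability score of each sample.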
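
To make the point about probability scores concrete, here is a minimal sketch on synthetic data (not the attrition dataset). It assumes scikit-learn's default binary `LogisticRegression` and checks that the probability reported by `predict_proba` is the sigmoid of the linear score returned by `decision_function`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score z = w . x + b for each sample
z = clf.decision_function(X)
# The sigmoid maps the score to P(y = 1 | x)
p_manual = 1.0 / (1.0 + np.exp(-z))
# predict_proba returns one column per class, ordered as in clf.classes_
p_sklearn = clf.predict_proba(X)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

Because the score is a probability, the 0.5 cut-off used for hard labels can be moved when false positives and false negatives have different costs.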

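Disadvantages of LR:

Performance tends to degrade when the feature space is very large;
Prone to underfitting, so accuracy is often limited;
Does not handle a large number of multi-valued categorical features or variables well;
Usually applied to binary classification only; multi-class problems need softmax regression (the multi-class generalization of LR), and because the decision boundary is linear, the model works well only when the classes are roughly linearly separable;
Nonlinear relationships have to be handled by transforming the features first.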
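
The last point can be illustrated with a small sketch. It uses toy data from `make_circles` (two concentric rings, not the attrition dataset) and `PolynomialFeatures` as one possible transformation; both are choices made here for illustration, not something the original post uses.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: not separable by a straight line (toy data)
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# Plain LR on the raw coordinates
raw_model = LogisticRegression().fit(X, y)
raw_acc = raw_model.score(X, y)

# LR on degree-2 polynomial features (adds x1^2, x1*x2, x2^2)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)
poly_acc = poly_model.score(X, y)

print(f"raw features: {raw_acc:.2f}, polynomial features: {poly_acc:.2f}")
```

With the squared terms available, the ring boundary becomes linear in the transformed space, which is why the second model can separate the classes while the first cannot.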

Binary classification

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


train=pd.read_csv('train.csv',index_col=0)
test=pd.read_csv('test.csv',index_col=0)

# print(train.head())
# print(train.columns)
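# Inspect the data type of each column (info() prints its report directly)
train.info()

# print(train['Attrition'].value_counts())
# Encode the Attrition target: 'Yes' -> 1, anything else -> 0
train['Attrition']=train['Attrition'].map(lambda x:1 if x=='Yes' else 0)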

# Check for missing values
# print(train.isna().sum())

# Drop uninformative columns: the employee number and the standard hours (constant, = 80)
train = train.drop(['EmployeeNumber', 'StandardHours'], axis=1)
test = test.drop(['EmployeeNumber', 'StandardHours'], axis=1)

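# Label-encode the categorical (non-numeric) features.
# Note: Age is already numeric; it is kept here as in the original list.
attr=['Age','BusinessTravel','Department','Education','EducationField','Gender','JobRole','MaritalStatus','Over18','OverTime']
lbe_list=[]
# In this dataset every label that appears in the test set also appears in the training set.
# More generally, the training and test sets can be combined and encoded together with fit_transform.
for feature in attr:
    # Label encoding: a feature with 10 categories is encoded as 0-9
    lbe=LabelEncoder()
    # fit_transform: fit (learn the category-to-integer mapping), then transform (apply that mapping)
    train[feature]=lbe.fit_transform(train[feature])
    # Do not re-fit on the test set: reuse the mapping learned above, otherwise the test mapping could differ from the training one
    test[feature]=lbe.transform(test[feature])
    lbe_list.append(lbe)
#print(train)
train.to_csv('train_label_encoder.csv')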

# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(train.drop('Attrition',axis=1), train['Attrition'], test_size=0.2, random_state=2021)

# Classification model (binary)
model = LogisticRegression(max_iter=100,
                           verbose=True,
                           random_state=2021,
                           tol=1e-4
                          )
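# Train
model.fit(X_train, y_train)
# Use the validation set for an early estimate of model quality.
# log_loss is the binary cross-entropy; it is computed on predicted probabilities, not hard labels.
from sklearn.metrics import log_loss
valid_proba = model.predict_proba(X_valid)[:, 1]
print('validation log loss:', log_loss(y_valid, valid_proba))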

# Prediction: predict returns the class label of each sample; predict_proba returns the probability of each class, with columns ordered as in model.classes_
# predict = model.predict(test)
predict = model.predict_proba(test)[:, 1]
test['Attrition']=predict

print(test['Attrition'])
test[['Attrition']].to_csv('submit_lr.csv')
print('submit_lr.csv saved')
# Convert the probabilities to a binary 0/1 output
#test['Attrition']=test['Attrition'].map(lambda x:1 if x>=0.5 else 0)
#test[['Attrition']].to_csv('submit_lr.csv')
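
The comment in the encoding loop mentions that the training and test sets can be encoded together. A minimal sketch of that idea follows, using toy frames in place of `train.csv` and `test.csv`; the helper name `encode_with_shared_labels` and the sample values are hypothetical and do not come from the original script.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for train.csv / test.csv (values are made up)
train_df = pd.DataFrame({'Department': ['Sales', 'R&D', 'Sales']})
test_df = pd.DataFrame({'Department': ['HR', 'R&D']})  # 'HR' never appears in train_df

def encode_with_shared_labels(train_df, test_df, columns):
    """Fit one LabelEncoder per column on train+test values, then transform both frames."""
    train_df, test_df = train_df.copy(), test_df.copy()
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        # Fitting on the union means no test label can be unseen at transform time
        enc.fit(pd.concat([train_df[col], test_df[col]]))
        train_df[col] = enc.transform(train_df[col])
        test_df[col] = enc.transform(test_df[col])
        encoders[col] = enc
    return train_df, test_df, encoders

train_enc, test_enc, encoders = encode_with_shared_labels(train_df, test_df, ['Department'])
print(list(encoders['Department'].classes_))
print(train_enc['Department'].tolist(), test_enc['Department'].tolist())
```

Fitting on the training set alone would raise a `ValueError` when `transform` meets a label such as 'HR' that it has not seen; fitting on the union avoids that, at the cost of looking at the test set's category values during preprocessing.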