Imbalanced Data Classification with XGBoost

Latest recommended article published on 2024-02-12 20:45:11

YUI0908


Category: Personal articles. Tags: XGBOOST, data analysis

Article link: https://blog.csdn.net/qq_26185193/article/details/79439609


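This article describes an XGBoost-based approach to classifying imbalanced data, covering data exploration, preprocessing, feature selection, and model training. The data exploration stage focuses on the relationship between the age distribution and the target variable, and on inspecting the numeric and categorical features. The preprocessing stage handles missing values and outliers and applies dummy (one-hot) encoding. The modeling stage applies undersampling and oversampling, tunes XGBoost's max_depth and scale_pos_weight parameters to find the best model, and visualizes the most important features.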

Summary generated automatically by CSDN.

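1. Project Analysis and Design

This project trains a model on U.S. census data to predict income level. The dataset contains 199,523 training records and 99,762 test records, each with 41 attributes. The attributes cover demographic information, age, loan information, nationality, race, and so on. Some attributes contain null values or have skewed distributions. The processing plan is as follows:
1. Read the data and inspect the features and their distributions
2. Analyze missingness and handle missing values
3. Handle outliers
4. Dummy-encode the categorical variables
5. Select important features with a random forest
6. Resample to deal with the class imbalance
7. Build an XGBoost model and use it for analysis and prediction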
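
Step 6 can be implemented in several ways. As a minimal sketch, assuming a binary target column coded 0/1 and using only pandas, random under- and oversampling might look like the following. The toy DataFrame and the helper names `undersample` and `oversample` are hypothetical and do not come from the original notebook.

```python
import pandas as pd

def undersample(df, target, random_state=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[target] == minority_label]
    majority = df[df[target] != minority_label].sample(
        n=counts.min(), random_state=random_state)
    # Concatenate and shuffle so the classes are not stored in blocks
    return pd.concat([minority, majority]).sample(frac=1, random_state=random_state)

def oversample(df, target, random_state=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    majority = df[df[target] != minority_label]
    minority = df[df[target] == minority_label].sample(
        n=counts.max(), replace=True, random_state=random_state)
    return pd.concat([majority, minority]).sample(frac=1, random_state=random_state)

# Toy data (hypothetical): 94 negatives and 6 positives
toy = pd.DataFrame({"x": range(100), "income_level": [0] * 94 + [1] * 6})
under = undersample(toy, "income_level")
over = oversample(toy, "income_level")
print(under["income_level"].value_counts().to_dict())
print(over["income_level"].value_counts().to_dict())
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class proportions.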

2. Data Exploration

    In [1]: 
  

import numpy as np
import pandas as pd

    In [2]: 
  

train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
print ('train_df:%s,%s'%train_df.shape)
print ('test_df:%s,%s'%test_df.shape)

train_df:199523,41
test_df:99762,41

    In [3]: 
  

## Check the target variable

train_df.income_level.unique()
test_df.income_level.unique()

      Out[3]: 
    

array(['-50000', '50000+.'], dtype=object)

    In [4]: 
  

# For easier analysis, encode the target variable as 0/1.
# train_df apparently stores income_level as integers, test_df as strings.
train_df.loc[train_df['income_level']==-50000,'income_level']=0
train_df.loc[train_df['income_level']== 50000,'income_level']=1
test_df.loc[test_df['income_level']=='-50000','income_level']=0
test_df.loc[test_df['income_level']=='50000+.','income_level']=1
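
An alternative to the `.loc` assignments is an explicit mapping, which also leaves the column with an integer dtype rather than `object`. The sketch below uses two hypothetical toy Series standing in for the two label formats seen above. It is an illustration, not the notebook's own code.

```python
import pandas as pd

# Toy stand-ins for the two label formats (hypothetical data)
train_labels = pd.Series([-50000, 50000, -50000])
test_labels = pd.Series(["-50000", "50000+.", "-50000"])

# Explicit mappings; a label missing from the dict becomes NaN
train_encoded = train_labels.map({-50000: 0, 50000: 1})
test_encoded = test_labels.map({"-50000": 0, "50000+.": 1})

# Fail early if any value was not covered by the mapping
if train_encoded.isna().any() or test_encoded.isna().any():
    raise ValueError("unexpected income_level label")

train_encoded = train_encoded.astype(int)
test_encoded = test_encoded.astype(int)
print(train_encoded.tolist(), test_encoded.tolist())
```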

    In [5]: 
  

## Check how imbalanced the classes are (percentage of 1s and 0s)
a=train_df['income_level'].sum()*100.0/train_df['income_level'].count()
b=test_df['income_level'].sum()*100.0/test_df['income_level'].count()
print ('train_df  (1,0):(%s,%s)'%(a,100-a))
print ('test_df  (1,0):(%s,%s)'%(b,100-b))

train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
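
The output above implies roughly 15 negatives for every positive (93.79 / 6.21 ≈ 15.1). The same check can be written with `value_counts`, and the negative-to-positive ratio is the quantity XGBoost's documentation suggests as a typical starting value for `scale_pos_weight`. A minimal sketch on hypothetical labels, not the real data:

```python
import pandas as pd

# Hypothetical 0/1 labels with 6.2% positives, mimicking the output above
y = pd.Series([1] * 62 + [0] * 938)

# Share of each class, as percentages
shares = y.value_counts(normalize=True) * 100
print(shares.round(2).to_dict())

# Ratio of negatives to positives: sum(negative) / sum(positive)
neg_pos_ratio = (y == 0).sum() / (y == 1).sum()
print(round(neg_pos_ratio, 2))
```

Whether this ratio is actually the best setting is an empirical question, which is why the parameter is tuned later rather than fixed.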

Inspecting the data

    In [6]: 
  

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 41 columns):
age                                 199523 non-null int64
class_of_worker                     199523 non-null object
industry_code                       199523 non-null int64
occupation_code                     199523 non-null int64
education                           199523 non-null object
wage_per_hour                       199523 non-null int64
enrolled_in_edu_inst_lastwk         199523 non-null object
marital_status                      199523 non-null object
major_industry_code                 199523 non-null object
major_occupation_code               199523 non-null object
race                                199523 non-null object
hispanic_origin                     198649 non-null object
sex                                 199523 non-null object
member_of_labor_union               199523 non-null object
reason_for_unemployment             199523 non-null object
full_parttime_employment_stat       199523 non-null object
capital_gains                       199523 non-null int64
capital_losses                      199523 non-null int64
dividend_from_Stocks                199523 non-null int64
tax_filer_status                    199523 non-null object
region_of_previous_residence        199523 non-null object
state_of_previous_residence         198815 non-null object
d_household_family_stat             199523 non-null object
d_household_summary                 199523 non-null object
migration_msa                       99827 non-null object
migration_reg                       99827 non-null object
migration_within_reg                99827 non-null object
live_1_year_ago                     199523 non-null object
migration_sunbelt                   99827 non-null object
num_person_Worked_employer          199523 non-null int64
family_members_under_18             199523 non-null object
country_father                      192810 non-null object
country_mother                      193404 non-null object
country_self                        196130 non-null object
citizenship                         199523 non-null object
business_or_self_employed           199523 non-null int64
fill_questionnaire_veteran_admin    199523 non-null object
veterans_benefits                   199523 non-null int64
weeks_worked_in_year                199523 non-null int64
year                                199523 non-null int64
income_level                        199523 non-null int64
dtypes: int64(13), object(28)
memory usage: 62.4+ MB
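
The non-null counts above already show where values are missing: the four migration columns have 99,827 non-null entries out of 199,523, so about 50% of their values are absent, while the country columns are missing only a few percent. A minimal sketch for computing that share per column follows. The miniature DataFrame and its category labels are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame; None marks a missing entry
df = pd.DataFrame({
    "age": [25, 40, 63, 18],
    "migration_msa": ["MSA to MSA", None, None, "Nonmover"],
    "country_father": ["United-States", "Mexico", None, "United-States"],
})

# Fraction of missing values per column, largest first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])
```

On the real data, the same expression with `train_df` in place of `df` would rank all 41 columns by missingness.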

Numeric features

    In [7]: 
  

import matplotlib.pyplot as plt
def num_tr(field,n):
    # Plot a histogram of one numeric column of train_df using n bins
    fig=plt.figure(figsize=(10,5))
    train_df[field].hist(bins=n)
    plt.title('%s'%field)
    plt.show()

1.age

1.1 Distribution

    In [8]: 
  

num_tr('age',100)

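The histogram shows that ages range from 0 to 90 and that the number of people falls as age increases.

My guess is that people under 20 and those who have only recently started working are less likely to earn more than 50K, although this is not certain.
The ages are therefore grouped into 0-22, 22-35, 35-60 and 60-90, encoded as 0, 1, 2, 3 (22 is the average age of graduating with a bachelor's degree, 35 marks the end of the early career stage, i.e. the first 10 years of work, and 60 is the retirement age).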
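
The four-group encoding described here can be sketched with `pd.cut`. The text does not say which group the boundary ages 22, 35 and 60 belong to, so this sketch assumes that each upper edge falls in the lower group. The sample ages are hypothetical.

```python
import pandas as pd

# Hypothetical ages, chosen to hit every bin edge
ages = pd.Series([0, 15, 22, 23, 35, 36, 60, 61, 90])

# Right-inclusive bins: (-1, 22], (22, 35], (35, 60], (60, 90]
bins = [-1, 22, 35, 60, 90]
age_class = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3])
print(age_class.tolist())  # [0, 0, 0, 1, 1, 2, 2, 3, 3]
```

Starting the first bin at -1 keeps age 0 inside the first group, since `pd.cut` excludes the left edge of each interval by default.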

    In [9]: 
  

'''
# Create an age-group field (disabled; note that this uses 10-year bins, not the four groups described above)
labels=[0,1,2,3,4,5,6,7,8,9]
train_df['age_class']=pd.cut(train_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
test_df['age_class']=pd.cut(test_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
'''


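1.2 Relationship between age and the target variable

Records with income level 1 are concentrated mainly between ages 30 and 50. The age distribution of the income-level-1 group also looks close to normal, with a mean of roughly 50.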

    In [10]: 
  

'''
fig=plt.figure(figsize=(12,6))
train_df.groupby(['age_class','income_level'])['income_level'].count().unstack().plot(kind='bar')

plt.title('income_level wrt age') 
plt.show()
    
'''


    In [11]: 
  

# Age distribution (kernel density estimate) for each income level
fig=plt.figure(figsize=(12,6))
train_df.age[train_df.income_level==0].plot(kind='kde')
train_df.age[train_df.income_level==1].plot(kind='kde')
plt.legend(('0','1'))
plt.show()
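
A numeric summary can complement the density plot when judging where each group is centered. The following is a minimal sketch on hypothetical toy records. In the notebook, `train_df` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy records, not taken from the census data
toy = pd.DataFrame({
    "age": [8, 19, 27, 33, 41, 45, 52, 58, 66, 74],
    "income_level": [0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
})

# Count, mean, median and spread of age within each income level
summary = toy.groupby("income_level")["age"].agg(["count", "mean", "median", "std"])
print(summary)
```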

2.capital_losses&capital_gains

Both variables are right-skewed. They need further analysis later on.

    In [12]: 
  

fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.capital_gains.plot(kind='box')
plt.subplot2grid((1,2),(0,1))
train_df.capital_losses.plot(kind='box')
plt.show()
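
One common way to quantify and reduce right skew like that noted for these two columns is to compare the sample skewness before and after a `log1p` transform. This is offered only as a sketch of one option on hypothetical values. The original text does not state which treatment it eventually applies.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column: mostly zeros plus a few large values
gains = pd.Series([0] * 95 + [2000, 5000, 7500, 15000, 99999])

skew_before = gains.skew()
# log1p(x) = log(1 + x) keeps zeros at zero and compresses the long right tail
gains_log = np.log1p(gains)
skew_after = gains_log.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

For a column dominated by zeros the transform reduces the skew but does not remove it, so a separate indicator for "value is non-zero" may be worth considering as well.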

3.weeks_worked_in_year

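For income level 0, the values cluster at 0 and around 50, whereas for income level 1 they sit mostly around 50. Records with income level 1 almost never take very small values of this variable.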

    In [13]: 
  

# Distribution of weeks worked in the year for each income level
fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.weeks_worked_in_year[train_df.income_level==0].hist(bins=20)
plt.subplot2grid((1,2),(0,1))
train_df.weeks_worked_in_year[train_df.income_level==1].hist(bins=20,color=