Analysis of Annual Income in US Census Data


雪❄️


Category: Machine Learning. Tags: machine learning, python

Article link: https://blog.csdn.net/weixin_37379106/article/details/103569653


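This article analyzes US census data and uses several classification models (random forest, decision tree, logistic regression, AdaBoost, KNN, and others) to predict annual income. Comparing the models shows that GBDT and AdaBoost perform well on precision, recall, and F1, followed by logistic regression, while KNN is comparatively weak. Training and evaluating every model involves data preprocessing, feature engineering, and hyperparameter tuning.

Summary generated automatically by CSDN.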


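Trying out almost every classification method in sklearn

0. Background

The data comes from the 1994 US Census database. (Download: https://archive.ics.uci.edu/ml/datasets/Adult )

The prediction task is to determine whether a person's annual income exceeds 50K.

The dataset contains 14 attributes: age, work class, final weight, education, education number, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, and native country. Of these, age, final weight, education number, capital gain, capital loss, and hours per week are numerical attributes; the rest are nominal.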

No.	Field	Meaning	Type
0	age	Age	double
1	workclass	Work class	string
2	fnlwgt	Final weight	double
3	education	Education level	string
4	education_num	Education number	double
5	marital_status	Marital status	string
6	occupation	Occupation	string
7	relationship	Relationship	string
8	race	Race	string
9	sex	Sex	string
10	capital_gain	Capital gain	double
11	capital_loss	Capital loss	double
12	hours_per_week	Hours worked per week	double
13	native_country	Native country	string
14	(label)income	Income	string

1. Data preprocessing

1.1 Loading the data

import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score,roc_curve, auc

train = pd.read_csv('data.csv',header=None)

test = pd.read_csv('test.csv',header=None)

First, set a column label for each feature variable and preview the dataset.

# Set the labels of the feature variables
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
              'hours_per_week', 'native_country', 'wage_class']
train.columns = cols
test.columns = cols
train.head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	wage_class
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

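1.2 Data format handling

The values in the output column contain a leading space, and in the test set they also end with a period. To unify the output format, these labels need to be replaced.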

# Unify the format of the output column: remove the spaces (and the trailing period in the test set)
train['wage_class'] = train['wage_class'].map(lambda x: x.replace(' ', ''))
test['wage_class'] = test['wage_class'].replace({' <=50K.': '<=50K', ' >50K.': '>50K'})
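
The two replacements above can also be written as a single rule that works for both files. The following is a minimal sketch on a made-up sample of label strings (the `raw` series is hypothetical, not read from the dataset); it assumes the only differences are surrounding spaces and a trailing period.

```python
import pandas as pd

# Hypothetical sample of raw label strings in the train and test formats
raw = pd.Series([' <=50K', ' >50K', ' <=50K.', ' >50K.'])

# Strip surrounding whitespace, then drop the trailing period used in the test file
clean = raw.str.strip().str.rstrip('.')
print(clean.unique())
```

Using one rule for both frames avoids the case where a label variant is missing from the replacement dictionary and silently stays unchanged.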

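1.3 Missing value handling

In addition, some values are missing and were filled in as "?" in the source data. These "?" entries need to be marked as NaN.

# Replace "?" with NaN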
train = train.replace(' ?',np.nan)
test = test.replace(' ?',np.nan)
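
As an alternative sketch, the leading spaces and the "?" markers can be handled while reading the file, using the `skipinitialspace` and `na_values` parameters of `pd.read_csv`. The two rows below are illustrative stand-ins written in the raw comma-plus-space layout (the second row is made up), and `sample` replaces the real `data.csv` only so that the snippet is self-contained.

```python
import io
import pandas as pd

cols = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
        'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
        'hours_per_week', 'native_country', 'wage_class']

# Illustrative rows in the raw layout; the second row is hypothetical
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 60, United-States, >50K\n"
)

# skipinitialspace drops the space after each comma; na_values turns "?" into NaN
df = pd.read_csv(sample, header=None, names=cols,
                 skipinitialspace=True, na_values='?')
print(df[['workclass', 'occupation', 'wage_class']])
```

With this approach the training labels need no further cleaning, while the test labels would still need their trailing period removed.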

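2. Exploratory data analysis

2.1 Overall description of the data

For the features, we can analyze a single feature on its own, or analyze the relationships between different features.

The features in this dataset fall into two kinds, nominal and numerical:

Numerical: numeric values
Categorical: categories or strings

Merge the training and test sets and mark them with a sign column.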
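
The split between the two kinds of feature can be recovered from the column dtypes. Here is a minimal sketch on a tiny made-up frame that borrows a few column names from this article; the variable names `numerical` and `categorical` are only illustrative.

```python
import pandas as pd

# Tiny illustrative frame using a few of the column names from this article
df = pd.DataFrame({
    'age': [39, 50],
    'workclass': ['State-gov', 'Self-emp-not-inc'],
    'hours_per_week': [40, 13],
    'sex': ['Male', 'Male'],
})

# Numeric columns are the numerical features; everything else is treated as categorical
numerical = df.select_dtypes(include='number').columns.tolist()
categorical = df.select_dtypes(exclude='number').columns.tolist()
print(numerical, categorical)
```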

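# Merge the training and test sets into a new DataFrame, adult, and use sign to mark which set each row came from
tr = train.copy()
tr['sign'] = 'train'
te = test.copy()
te['sign'] = 'test'
# DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True gives the same result
adult = pd.concat([tr, te], ignore_index=True)
adult.head()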

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	wage_class	sign
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K	train
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K	train
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K	train
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K	train
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K	train
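
Because every row carries the sign marker, the combined frame can be split back into its two parts after the exploration is done. The following is a small sketch on made-up two-column frames; `train_part` and `test_part` are hypothetical names, not variables from the original notebook.

```python
import pandas as pd

# Made-up stand-ins for the marked training and test frames
tr = pd.DataFrame({'age': [39, 50], 'sign': 'train'})
te = pd.DataFrame({'age': [25], 'sign': 'test'})
adult = pd.concat([tr, te], ignore_index=True)

# Select each part by its marker, then drop the helper column
train_part = adult[adult['sign'] == 'train'].drop(columns='sign')
test_part = adult[adult['sign'] == 'test'].drop(columns='sign').reset_index(drop=True)
print(len(train_part), len(test_part))
```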

The distribution of each numerical variable is summarized as follows:

# Describe the numerical variables (none of them has missing values)
adult.describe()

	age	fnlwgt	education_num	capital_gain	capital_loss	hours_per_week
count	48842.000000	4.884200e+04	48842.000000	48842.000000	48842.000000	48842.000000
mean	38.643585	1.896641e+05	10.078089	1079.067626	87.502314	40.422382
std	13.710510	1.056040e+05	2.570973	7452.019058	403.004552	12.391444
min	17.000000	1.228500e+04	1.000000	0.000000	0.000000	1.000000
25%	28.000000	1.175505e+05	9.000000	0.000000	0.000000	40.000000
50%	37.000000	1.781445e+05	10.000000	0.000000	0.000000	40.000000
75%	48.000000	2.376420e+05	12.000000	0.000000	0.000000	45.000000
max	90.000000	1.490400e+06	16.000000	99999.000000	4356.000000	99.000000

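Count the number of distinct values and the most frequent value for each categorical variable.

# Describe the categorical variables (including those with missing values)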
adult.describe(include=['O'])

	workclass	education	marital_status	occupation	relationship	race	sex	native_country	wage_class	sign
count	46043	48842	48842	46033	48842	48842	48842	47985	48842	48842
unique	8	16	7	14	6	5	2	41	2	2
top	Private	HS-grad	Married-civ-spouse	Prof-specialty	Husband	White	Male	United-States	<=50K	train
freq	33906	15784	22379	6172	19716	41762	32650	43832	37155	32561
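
The wage_class column of this table also shows how imbalanced the label is. As a small worked calculation using only the count and freq values printed above, the share of the majority class `<=50K` is what a classifier would score by always predicting that class, which is a useful reference point when reading accuracy figures.

```python
# Values taken from the wage_class column of the table above
total = 48842      # count
majority = 37155   # freq of the top label '<=50K'

# Share of the majority class, i.e. the accuracy of always predicting '<=50K'
baseline = majority / total
print(f"{baseline:.4f}")
```

The result is roughly 0.76, so about three quarters of the records fall in the `<=50K` class.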

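2.2 Missing value analysis

# Import the plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # missingno is a library for visualizing missing values

Plot the missing-value charts for the training set and the test set separately. Three columns have missing values: workclass, occupation, and native_country.

# Plot the missing-value bar charts for the training and test sets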
plt.subplot(1,2,1)
msno.bar(train)
plt.title('train')
plt.subplot(1,2,2)
plt.title('test')
msno.bar(test)
plt.show()

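[Figure: missing-value bar charts for the training and test sets]

To further analyze how similar the missingness patterns of different columns are, plot a missing-value heatmap. The heatmap shows the pairwise similarity of several features, measured by the Pearson correlation coefficient.

A value of 1 between occupation and workclass indicates that these two variables are missing together in both the test set and the training set.

# Plot the missing-value heatmaps for the training and test sets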
msno.heatmap(train,figsize=(3, 2)),msno.heatmap(test,figsize=(3, 2))

(<matplotlib.axes._subplots.AxesSubplot at 0x218eda6fba8>,
 <matplotlib.axes._subplots.AxesSubplot at 0x218edadd470>)

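[Figure: missing-value heatmaps for the training and test sets]

Plot a missing-value matrix for the training set and for the test set. The more white lines a column has in the matrix, the more missing values it contains.

The result shows that workclass and occupation have more missing values than native_country.

# Plot the missing-value matrices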
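
To make the heatmap values concrete, here is a minimal sketch of a missingness correlation computed with plain pandas. It assumes the heatmap is based on the Pearson correlation of the is-missing indicators of each column, which is my reading of it rather than something confirmed here; the small frame `df` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: workclass and occupation are missing in exactly the same rows
df = pd.DataFrame({
    'workclass': ['Private', np.nan, 'Private', np.nan, 'State-gov', 'Private'],
    'occupation': ['Sales', np.nan, 'Tech-support', np.nan, 'Adm-clerical', 'Sales'],
    'native_country': ['United-States', 'Cuba', np.nan, 'United-States',
                       'United-States', 'United-States'],
})

# 1 where a value is missing, 0 otherwise
nullity = df.isna().astype(int)

# Pearson correlation between the missingness indicators of each pair of columns
corr = nullity.corr()
print(corr.round(2))
```

Two columns that are always missing in the same rows get a correlation of 1, which matches the reading of the occupation and workclass pair given above.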
msno.matrix(train,figsize=(6,3))
msno.matrix(test,figsize=(6,3))

<matplotlib.axes._subplots.AxesSubplot at 0x218ed77d080>

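[Figure: missing-value matrices for the training and test sets]

Merge the training and test sets, compute the share of missing data within each set, and plot a bar chart:

In the test set, workclass and occupation are missing in 5.91% and 5.93% of rows respectively, and native_country in 1.68%.

In the training set, workclass and occupation are missing in 5.64% and 5.66% of rows respectively, and native_country in 1.79%.

# Compute the share of missing values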
temp = adult.groupby('sign').apply(lambda x :x.isna().sum()/len(x))
temp = temp.loc[:,(temp!=0).any()]
temp

	workclass	occupation	native_country
sign
test	0.059149	0.059333	0.016829
train	0.056386	0.056601	0.017905
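
The same per-set fractions can be obtained without `apply`, by taking the mean of the missingness indicators within each group. This is a sketch on a made-up four-row frame, with `frac` as a hypothetical name; it is meant as an equivalent formulation, not as the code that produced the table above.

```python
import numpy as np
import pandas as pd

# Made-up combined frame with the same kind of sign marker
adult = pd.DataFrame({
    'workclass': ['Private', np.nan, 'Private', np.nan],
    'age': [39, 50, 38, 53],
    'sign': ['train', 'train', 'test', 'test'],
})

# The mean of a boolean is-missing indicator is the missing fraction per group
frac = adult.drop(columns='sign').isna().groupby(adult['sign']).mean()

# Keep only the columns that have missing values in at least one set
frac = frac.loc[:, (frac != 0).any()]
print(frac)
```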

# Plot the bar chart
temp.plot(kind='bar',figsize=
