3. Python Data Analysis Project: Income Classification Prediction

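1. Summary

Workflow step: operations
Basic inspection: check for missing values (directly with isnull, or visually with missingno), separate numeric from non-numeric features, plot the distribution of every feature in one figure
Preprocessing: handle missing values (fill), split the data (keep only the values needed), unify data formats, feature engineering (feature encoding, converting strings to 0/1), feature derivation, dimensionality reduction (feature correlation, PCA)
Data analysis: groupby aggregation (maximum/minimum per group), seaborn visualization
Prediction: split the dataset, build a model (RandomForestRegressor, LogisticRegression, GradientBoostingRegressor), train, predict, evaluate (ROC curve, MSE, MAE, RMSE, R2)

Counts: bar chart
Proportions: pie chart
Distribution over value ranges: probability density plot
Correlation: bar chart, heat map
Distribution analysis: count plot per category (countplot), histogram with a density curve (distplot)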

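2. Project background and data source

Project goal
Using the US Census income dataset, predict from census data whether a person's income exceeds $50,000 per year.
Data source
Dataset URL: https://archive.ics.uci.edu/ml/datasets/adult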

3. Understanding the data

Target (predclass): >50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.


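4. Basic data inspection

4.1 Loading the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Column names (the raw files have no header row)
headers = ['age', 'workclass', 'fnlwgt', 
           'education', 'education-num', 
           'marital-status', 'occupation', 
           'relationship', 'race', 'sex', 
           'capital-gain', 'capital-loss', 
           'hours-per-week', 'native-country', 
           'predclass']
# 2. Load the training set and the test set
training_raw = pd.read_csv('dataset/adult.data',
                            names=headers,
                            sep=r',\s', # separator: a comma followed by whitespace (regex)
                            na_values=['?'], # '?' marks a missing value
                            engine='python' # regex separators need the Python parser
                            )
test_raw = pd.read_csv('dataset/adult.test',
                            names=headers,
                            sep=r',\s', # separator: a comma followed by whitespace (regex)
                            na_values=['?'], # '?' marks a missing value
                            engine='python',
                            skiprows=1 # skip the first line, which is not a data row
                            )
# 3. Concatenate the two sets and rebuild the index
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
dataset_raw = pd.concat([training_raw, test_raw], ignore_index=True)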
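
To see what the regex separator and na_values do, here is a minimal sketch that parses a few made-up rows from memory instead of the real files. The three-column layout and the row values are assumptions chosen only for illustration.

```python
import io
import pandas as pd

# Made-up rows in the same ", "-separated style; not taken from the real files.
sample = io.StringIO(
    "39, State-gov, <=50K\n"
    "50, ?, >50K\n"
    "38, Private, <=50K\n"
)

df = pd.read_csv(
    sample,
    names=["age", "workclass", "predclass"],  # hypothetical three-column subset
    sep=r",\s",        # regex: a comma followed by one whitespace character
    na_values=["?"],   # treat "?" as a missing value
    engine="python",   # regex separators require the Python parser
)

print(df)
print(df["workclass"].isnull().sum())  # number of missing workclass values
```

Because the separator consumes the space after each comma, the parsed strings carry no leading blank, which is what lets the '?' marker match na_values exactly.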


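4.2 Viewing missing values graphically

import missingno
# 1. Show missing values as a matrix
missingno.matrix(dataset_raw, figsize=(30,5))

# 2. Show missing values as a bar chart
missingno.bar(dataset_raw, sort="ascending", figsize=(30,5))

# 3. Drop rows with missing values. Run this only after step 6:
#    dataset_bin and dataset_con do not exist until section 5.
dataset_bin = dataset_bin.dropna(axis=0)
dataset_con = dataset_con.dropna(axis=0)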
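
The summary also mentions checking missing values directly with isnull. A minimal sketch on a made-up frame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", np.nan, "Private", np.nan],
    "occupation": ["Adm-clerical", "Sales", np.nan, "Sales"],
})

missing_per_column = df.isnull().sum()   # count of missing values per column
missing_share = df.isnull().mean()       # fraction of missing values per column
print(missing_per_column)
print(missing_share)

complete_rows = df.dropna(axis=0)        # keep only rows without any missing value
print(len(complete_rows))
```

Looking at the counts first shows how many rows a later dropna would discard.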


4.3 Plotting all numeric and non-numeric (categorical) features

import math
# Draw every feature in a single figure
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid') # plot style (named 'seaborn-v0_8-whitegrid' from matplotlib 3.6 on)
    fig = plt.figure(figsize=(width, height)) # figure size
    # Spacing between subplots
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # Iterate over the columns together with their position
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i+1) # add a subplot
        ax.set_title(column) # use the column name as the title

        if dataset.dtypes[column] == object: # np.object was removed in NumPy 1.24
            g = sns.countplot(y=column, data=dataset) # non-numeric: count per category
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings) # truncate long labels to 18 characters
            plt.xticks(rotation=25)
        else:
            g = sns.distplot(dataset[column]) # numeric: histogram with a density curve
            plt.xticks(rotation=25)

plot_distribution(dataset_raw, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)
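
The split into numeric and non-numeric columns, and the size of the subplot grid, can also be checked without plotting. A small sketch on a made-up frame, using select_dtypes as one possible way to separate the two groups:

```python
import math
import pandas as pd

# Made-up sample with two numeric and two string columns.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "hours-per-week": [40, 13, 40],
    "workclass": ["State-gov", "Private", "Private"],
    "sex": ["Male", "Male", "Female"],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(categorical_cols)

cols = 3                               # subplots per row
rows = math.ceil(df.shape[1] / cols)   # rows needed so that every column gets a subplot
print(rows)
```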


5. Data preprocessing and feature engineering

# 1. Create two new DataFrames
dataset_bin = pd.DataFrame() # holds the discretized (binned) values
dataset_con = pd.DataFrame() # holds the continuous (unbinned) values

# 2. predclass is the label to predict: convert it to 0/1, with income above 50K as 1.
# The test file writes the labels with a trailing period ('>50K.', '<=50K.').
dataset_raw.loc[dataset_raw['predclass']=='>50K', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass']=='>50K.', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass']=='<=50K', 'predclass'] = 0
dataset_raw.loc[dataset_raw['predclass']=='<=50K.', 'predclass'] = 0

# Store the label in both DataFrames
dataset_bin['predclass'] = dataset_raw['predclass']
dataset_con['predclass'] = dataset_raw['predclass']

# Visualize the label distribution
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,1))
sns.countplot(y='predclass', data=dataset_raw)
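
The four label spellings can also be handled in one pass. A hedged sketch on a made-up Series: strip the optional trailing period, then map the two remaining classes to 0 and 1.

```python
import pandas as pd

labels = pd.Series([">50K", "<=50K.", ">50K.", "<=50K"])  # made-up sample labels

# Remove the optional trailing period, then map each class to an integer.
encoded = labels.str.rstrip(".").map({">50K": 1, "<=50K": 0})
print(encoded.tolist())
```

Unlike the in-place assignments, map returns an integer Series, and any unexpected label shows up as NaN instead of passing through silently.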


# 3. Feature age: inspect it both binned and unbinned
# Store the data
dataset_bin['age'] = pd.cut(dataset_raw['age'], 10) # discretize into 10 equal-width bins
dataset_con['age'] = dataset_raw['age'] # keep the continuous values

# Plot the binned data
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(y='age', data=dataset_bin)

# Plot the unbinned data (histogram with a density curve), split by income class
plt.subplot(1,2,2)
sns.distplot(dataset_con.loc[dataset_con['predclass']==1]['age']) # age distribution, income above 50K

sns.distplot(dataset_con.loc[dataset_con['predclass']==0]['age']) # age distribution, income up to 50K
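
For reference, this is what pd.cut with an integer bin count does: it splits the range from the minimum to the maximum into equal-width intervals and assigns each value to one of them. A small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([17, 25, 33, 41, 58, 90])  # made-up ages

binned = pd.cut(ages, 4)   # 4 equal-width intervals covering min..max
print(binned.cat.categories)
print(binned.cat.codes.tolist())  # index of the interval each age falls into
```

The result is an ordered categorical, so countplot shows the bins in interval order. Equal-width bins can be very unevenly filled when the data is skewed.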


# 4. Feature workclass
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y = 'workclass', data=dataset_raw)

# Every category except Private is small, so similar categories can be merged
# Reduce the number of categories
dataset_raw.loc[dataset_raw['workclass'] == 'Without-pay','workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Never-worked','workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Federal-gov','workclass'] = 'Fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'State-gov','workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Local-gov','workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-not-inc','workclass'] = 'Self-emp'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-inc','workclass'] = 'Self-emp'

# Store the result
dataset_bin['workclass'] = dataset_raw['workclass']
dataset_con['workclass'] = dataset_raw['workclass']

# Plot again after merging the work classes
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y = 'workclass', data=dataset_bin)
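
The same merge can be written as a single lookup table. This is an alternative sketch, not the code used above: it puts the grouping from the assignments into a dict and applies it with Series.replace on a made-up sample.

```python
import pandas as pd

# Made-up sample, including one missing value.
workclass = pd.Series(
    ["Private", "State-gov", "Local-gov", "Self-emp-inc", "Never-worked", None]
)

# Same grouping as the assignments above, expressed as one dict.
groups = {
    "Without-pay": "Not Working",
    "Never-worked": "Not Working",
    "Federal-gov": "Fed-gov",
    "State-gov": "Non-fed-gov",
    "Local-gov": "Non-fed-gov",
    "Self-emp-not-inc": "Self-emp",
    "Self-emp-inc": "Self-emp",
}

merged = workclass.replace(groups)  # values not listed in the dict stay unchanged
print(merged.tolist())
```

A dict keeps the whole grouping in one place, which makes it easier to review and to reuse for the other categorical features.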


# 5. Feature occupation
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5)) 
sns.countplot(y="occupation", data=dataset_raw)

# There are many occupation categories, so related ones are merged into broader groups
# Merge categories
dataset_raw.loc[dataset_raw['occupation'] == 'Adm-clerical', 'occupation'] = 'Admin'
dataset_raw.loc[dataset_raw['occupation'] == 'Armed-Forces', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Craft-repair', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Exec-managerial', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Farming-fishing', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Handlers-cleaners', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Machine-op-inspct', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Other-service', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Priv-house-serv', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Prof-specialty', 'occupation'] = 'Professional'
dataset_raw.loc[dataset_raw['occupation'] == 'Protective-serv', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Sales', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Tech-support', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Transport-moving', 'occupation'] = 'Manual Labour'

dataset_bin['occupation'] = dataset_raw['occupation']
dataset_con['occupation'] = dataset_raw['occupation']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y="occupation", data=dataset_bin)


# 6. Feature native-country
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,10)) 
sns.countplot(y="native-country", data=dataset_raw)
# Every country except United-States is rare, so countries are merged into regions
# Merge categories
dataset_raw.loc[dataset_raw['native-country'] == 'Cambodia'                    , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Canada'                      , 'native-country'] = 'British-Commonwealth'    
dataset_raw.loc[dataset_raw['native-country'] == 'China'                       , 'native-country'] = 'China'       
dataset_raw.loc[dataset_raw['native-country'] == 'Columbia'                    , 'native-country'] = 'South-America'    
dataset_raw.loc[dataset_raw['native-country'] == 'Cuba'                        , 'native-country'] = 'South-America'        
dataset_raw.loc[dataset_raw['native-country'] == 'Dominican-Republic'          , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Ecuador'                     , 'native-country'] = 'South-America'     
dataset_raw.loc[dataset_raw['native-country'] == 'El-Salvador'                 , 'native-country'] = 'South-America' 
dataset_raw.loc[dataset_raw['native-country'] == 'England'                     , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'France'                      , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Germany'                     , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Greece'                      , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Guatemala'                   , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Haiti'                       , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Holand-Netherlands'          , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Honduras'                    , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Hong'                        , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Hungary'                     , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'India'                       , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Iran'                        , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Ireland'                     , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Italy'                       , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Jamaica'                     , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Japan'                       , 'native-country'] = 'APAC'
dataset_raw.loc[dataset_raw['native-country'] == 'Laos'                        , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Mexico'                      , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Nicaragua'                   , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Outlying-US(Guam-USVI-etc)'  , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Peru'                        , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Philippines'                 , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Poland'                      , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Portugal'                    , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Puerto-Rico'                 , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Scotland'                    , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'South'                       , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Taiwan'                      , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Thailand'                    , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Trinadad&Tobago'             , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'United-States'               , 'native-country'] = 'United-States'
dataset_raw.loc[dataset_raw['native-country'] == 'Vietnam'                     , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Yugoslavia'                  , 'native-country'] = 'Euro_Group_2'

dataset_bin['native-country'] = dataset_raw['native-country']
dataset_con['native-country'] = dataset_raw['native-country']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4)) 
sns.countplot(y="native-country", data=dataset_bin)


# 7. Feature education
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5)) 
sns.countplot(y="education", data=dataset_raw)

dataset_raw.loc[dataset_raw['education'] == '10th'          , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '11th'          , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '12th'          , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '1st-4th'       , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '5th-6th'       , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '7th-8th'       , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '9th'           , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-acdm'    , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-voc'     , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Bachelors'     , 'education'] = 'Bachelors'
dataset_raw.loc[dataset_raw['education'] == 'Doctorate'     , 'education'] = 'Doctorate'
dataset_raw.loc[dataset_raw['education'] == 'HS-grad'       , 'education'] = 'HS-Graduate'
dataset_raw.loc[dataset_raw['education'] == 'Masters'       , 'education'] = 'Masters'
dataset_raw.loc[dataset_raw['education'] == 'Preschool'     , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Prof-school'   , 'education'] = 'Professor'
dataset_raw.loc[dataset_raw['education'] == 'Some-college'  , 'education'] = 'HS-Graduate'

dataset_bin['education'] = dataset_raw['education']
dataset_con['education'] = dataset_raw['education']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4)) 
sns.countplot(y="education", data=dataset_bin)
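
String comparisons in these assignments are case-sensitive, so a key that differs from the data by a single letter matches nothing and raises no error. A hedged sketch of a sanity check on a made-up sample (the mapping here is a hypothetical fragment, not the full table above):

```python
import pandas as pd

education = pd.Series(["HS-grad", "Bachelors", "11th", "HS-grad"])  # made-up sample

# Hypothetical mapping fragment; note the capital "G" in the first key.
mapping = {"HS-Grad": "HS-Graduate", "11th": "Dropout"}

# Keys that never occur in the column would silently do nothing.
unmatched = sorted(set(mapping) - set(education.unique()))
print(unmatched)
```

Running a check like this before merging categories catches spelling and capitalization mismatches early.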


# 8. Feature marital-status
plt.figure(figsize=(20,3)) 
sns.countplot(y="marital-status", data=dataset_raw)

dataset_raw.loc[dataset_raw['marital-status'] == 'Never-married'        , 'marital-status'] = 'Never-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-AF-spouse'    , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-civ-spouse'   , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-spouse-absent', 'marital-status'] = 'Not-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Separated'            , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Divorced'             , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Widowed'              , 'marital-status'] = 'Widowed'

dataset_bin['marital-status'] = dataset_raw['marital-status']
dataset_con['marital-status'] = dataset_raw['marital-status']
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3)) 
sns.countplot(y="marital-status", data=dataset_bin)


# 9. Feature fnlwgt (final weight, i.e. the census sampling weight, not a body weight), binned
dataset_bin['fnlwgt'] = pd.cut(dataset_raw['fnlwgt'], 10)
dataset_con['fnlwgt'] = dataset_raw['fnlwgt']  # unbinned
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4)) 
sns.countplot(y="fnlwgt", data=dataset_bin)


# 10. Feature education-num
dataset_bin['education-num'] = pd.cut(dataset_raw['education-num'], 10) # binned
dataset_con['education-num'] = dataset_raw['education-num'] # unbinned

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5)) 
sns.countplot(y="education-num", data=dataset_bin)


# 11. Feature hours-per-week
# Bin the weekly working hours
dataset_bin['hours-per-week'] = pd.cut(dataset_raw['hours-per-week'], 10)
dataset_con['hours-per-week'] = dataset_raw['hours-per-week']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4)) 
plt.subplot(1, 2, 1)

sns.countplot(y="hours-per-week", data=dataset_bin);
plt.subplot(1, 2, 2)

sns.distplot(dataset_con['hours-per-week'])


# 12.Capital Gain
dataset_bin['capital-gain'] = pd.cut(dataset_raw['capital-gain'], 5)

dataset_con['capital-gain'] = dataset_raw['capital-gain']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3)) 
plt.subplot(1, 2, 1)
sns.countplot(y="capital-gain", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['capital-gain'])


# 13. Feature capital-loss
dataset_bin['capital-loss'] = pd.cut(dataset_raw['capital-loss'], 5)
dataset_con['capital-loss'] = dataset_raw['capital-loss']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3)) 
plt.subplot(1, 2, 1)
sns.countplot(y="capital-loss", data=dataset_bin)
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['capital-loss'])


# 14. Features race, sex, relationship
# No processing needed
dataset_con['sex'] = dataset_bin['sex'] = dataset_raw['sex']
dataset_con['race'] = dataset_bin['race'] = dataset_raw['race']
dataset_con['relationship'] = dataset_bin['relationship'] = dataset_raw['relationship']

6. Feature derivation

Feature derivation means creating new features from existing ones.

# 1. Derived continuous feature (built from age and hours-per-week)
dataset_con['age-hours'] = dataset_con['age'] * dataset_con['hours-per-week']
dataset_bin['age-hours'] = pd.cut(dataset_con['age-hours'],10)
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(y='age-hours', data=dataset_bin) # horizontal count plot
plt.subplot(1,2,2)
# Distribution of the derived continuous feature, by income class
sns.distplot(dataset_con.loc[dataset_con['predclass']==1]['age-hours'])
sns.distplot(dataset_con.loc[dataset_con['predclass']==0]['age-hours'])

# 2. Derived categorical feature (built from sex and marital-status)
dataset_bin['sex-marital'] = dataset_con['sex-marital'] = dataset_bin['sex'] + dataset_bin['marital-status']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
sns.countplot(y='sex-marital', data=dataset_bin)
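
Both kinds of derivation can be tried on a tiny made-up frame. The sketch below mirrors the two steps above; the "-" separator in the combined label is an addition for readability and is not part of the original code.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "age": [25, 40, 60],
    "hours-per-week": [40, 50, 20],
    "sex": ["Male", "Female", "Male"],
    "marital-status": ["Married", "Never-Married", "Widowed"],
})

# Continuous derived feature: product of two numeric columns.
df["age-hours"] = df["age"] * df["hours-per-week"]

# Categorical derived feature: concatenation of two string columns.
df["sex-marital"] = df["sex"] + "-" + df["marital-status"]

print(df[["age-hours", "sex-marital"]])
```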


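7. Feature encoding

Machine learning algorithms take numeric input, so string-valued features have to be converted to numbers. This conversion is called encoding. Two schemes are used here:

Label encoding
Example: let red=1, yellow=2, blue=3. Every category receives an integer label. The drawback is that a model may learn an ordering "red < yellow < blue" that does not exist in the data.

One-hot encoding
Each sample belongs to exactly one category: the position for that category is 1 and all other positions are 0.

Example: three colors need three bits, so red is 1 0 0, yellow is 0 1 0, and blue is 0 0 1. The distance between any two of these vectors is √2, so all categories are equally far apart in vector space. No artificial ordering is introduced, and algorithms that rely on vector-space distances are largely unaffected.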
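
The √2 claim is easy to verify numerically. A minimal sketch that encodes the three colors both ways and compares the distances:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "yellow", "blue"])

# Label encoding with the codes from the example: red=1, yellow=2, blue=3.
label_codes = colors.map({"red": 1, "yellow": 2, "blue": 3}).to_numpy()

# One-hot encoding: one 0/1 column per color.
one_hot = pd.get_dummies(colors).astype(int)
vectors = one_hot.to_numpy()
print(one_hot)

# Pairwise Euclidean distances between the one-hot rows.
pairs = [(0, 1), (0, 2), (1, 2)]
one_hot_dist = [float(np.linalg.norm(vectors[i] - vectors[j])) for i, j in pairs]
label_dist = [int(abs(label_codes[i] - label_codes[j])) for i, j in pairs]
print(one_hot_dist)  # all equal
print(label_dist)    # unequal: the integer labels imply an ordering
```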

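from sklearn.preprocessing import LabelEncoder

# 1. One-hot encode all features of the binned dataset
one_hot_cols = dataset_bin.columns.tolist() # all column names as a list
one_hot_cols.remove('predclass') # do not encode the label column
# One-hot encoding
dataset_bin_env = pd.get_dummies(dataset_bin, columns=one_hot_cols)
dataset_bin_env.head()

# 2. Label-encode all features of the continuous dataset
encoder = LabelEncoder()
dataset_con = dataset_con.astype(str) # cast every column to str so one encoder handles them all
# Note: after the cast, numeric columns are ordered as strings ('10' sorts before '9')
dataset_con_env = dataset_con.apply(encoder.fit_transform)
dataset_con_env.head()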
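
One side effect of casting numeric columns to str before label encoding is that the codes follow string order rather than numeric order. The sketch below illustrates this with np.unique, assuming the encoder assigns codes in sorted order of the distinct values, as LabelEncoder's classes_ attribute does. The values are made up.

```python
import numpy as np

hours = np.array([9, 10, 40, 100])  # made-up numeric values

# np.unique sorts the distinct values; return_inverse gives each element's code.
_, numeric_codes = np.unique(hours, return_inverse=True)
_, string_codes = np.unique(hours.astype(str), return_inverse=True)

print(numeric_codes.tolist())  # codes follow numeric order
print(string_codes.tolist())   # codes follow string order: '10' < '100' < '40' < '9'
```

If the numeric order matters for the model, one option is to label-encode only the string columns and leave the numeric ones as they are.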


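8. Feature correlation and dimensionality reduction

What dimensionality reduction is for:

  • Low-dimensional data is easier to process and to work with
  • Relevant features, especially the important ones, stand out more clearly in the data
  • Noise is removed
  • Computational cost is reduced

# 1. Inspect feature correlations
# Draw a heat map for each of the two datasets
plt.style.use('seaborn-whitegrid') # plot style
fig = plt.figure(figsize=(20,10))

# First heat map
plt.subplot(1,2,1) # first subplot in a 1x2 grid
# Boolean array with the shape of the correlation matrix of dataset_bin_env (binned, one-hot encoded)
mask = np.zeros_like(dataset_bin_env.corr(), dtype=bool) # np.bool was removed in NumPy 1.24
# Set the upper triangle (including the diagonal) to True so that only the lower-left part is drawn
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_bin_env.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r",100),
            mask=mask,
            linewidths=.5)

# Second heat map
plt.subplot(1,2,2) # second subplot in a 1x2 grid
# Boolean array with the shape of the correlation matrix of dataset_con_env (continuous, label encoded)
mask = np.zeros_like(dataset_con_env.corr(), dtype=bool)
# Set the upper triangle (including the diagonal) to True so that only the lower-left part is drawn
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_con_env.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r",100),
            mask=mask,
            linewidths=.5)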
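
How the mask works is easier to see on a small matrix. A sketch with a made-up 3x3 correlation matrix: cells where the mask is True are hidden by heatmap, which leaves only the lower triangle below the diagonal.

```python
import numpy as np

# Made-up symmetric correlation matrix.
corr = np.array([
    [1.0, 0.3, -0.2],
    [0.3, 1.0, 0.5],
    [-0.2, 0.5, 1.0],
])

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True   # diagonal and everything above it

visible = np.where(mask, np.nan, corr)    # NaN stands in for a hidden cell
print(mask)
print(visible)
```

Since a correlation matrix is symmetric and its diagonal is always 1, the hidden cells carry no extra information.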
       
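# 2. Dimensionality reduction with PCA
from sklearn.decomposition import PCA

# Reduce to 10 dimensions (keep 10 principal components)
X = dataset_bin_env.drop('predclass',axis=1) # features only, without the label
pca = PCA(n_components=10)
X_reduction = pca.fit_transform(X)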
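
To make the reduction step less of a black box, here is a minimal NumPy sketch of the idea behind PCA: center the data, take a singular value decomposition, and project onto the leading directions. It is not the scikit-learn call used above. The synthetic data and the choice of k are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))             # synthetic data: 200 samples, 6 features
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]     # make feature 1 nearly redundant with feature 0

k = 3                                     # number of components to keep (hypothetical choice)
Xc = X - X.mean(axis=0)                   # PCA operates on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

X_reduced = Xc @ Vt[:k].T                 # project onto the top-k principal directions
explained_ratio = S ** 2 / np.sum(S ** 2) # share of variance along each direction

print(X_reduced.shape)
print(explained_ratio[:k].sum())          # variance retained by k components
```

The retained variance share is a practical guide for choosing the number of components instead of fixing it in advance.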


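9. Modeling and evaluation

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# 1. Choose the dataset
# Option 1: dataset_bin_env (binned, one-hot encoded)
# Option 2: dataset_con_env (continuous, label encoded)

selected_dataset = dataset_bin_env
selected_dataset.head()
# 2. Split the dataset
# The source data already comes as separate training and test files, so that split is restored:
# index labels 0 to 32560 come from adult.data, the rest from adult.test
train = selected_dataset.loc[:32560, :]
test = selected_dataset.loc[32561:,:]

# Separate features and label before training
X_train = train.drop('predclass',axis=1)
y_train = train['predclass'].astype('int64')
X_test = test.drop('predclass', axis=1)
y_test = test['predclass'].astype('int64')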
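
The split above relies on .loc slicing by index label, with both endpoints included, which keeps working after rows have been dropped. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(6)})   # index labels 0..5
df = df.drop(index=[2])              # simulate a row removed earlier, e.g. by dropna

train = df.loc[:3, :]   # labels up to and including 3
test = df.loc[4:, :]    # labels from 4 onward

print(train.index.tolist())
print(test.index.tolist())
```

With positional slicing (iloc) the boundary would shift after rows are dropped. Label-based slicing keeps each row on the side of the split it originally belonged to.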

# 3. Build the model (logistic regression)
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
decision_scores = log_reg.decision_function(X_test) # one signed score per sample; higher means more likely class 1
print("decision_scores:",decision_scores)

# 4. Evaluate the model (plot the ROC curve)
fpr, tpr, thresholds = roc_curve(y_test, decision_scores)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--') # diagonal from (0,0) to (1,1): the ROC curve of a random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
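
Each point on the ROC curve is the pair (false positive rate, true positive rate) obtained at one threshold on the decision scores. A hedged sketch that computes a few such points by hand on made-up labels and scores; roc_point is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])                  # made-up labels
scores = np.array([-2.0, -0.5, 0.3, 1.2, 0.8, -0.1])   # made-up decision scores

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting class 1 for scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))   # positives correctly flagged
    fp = np.sum(pred & (y_true == 0))   # negatives wrongly flagged
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(fpr), float(tpr)

for t in (1.0, 0.0, -1.0):
    print(t, roc_point(y_true, scores, t))
```

Lowering the threshold moves the point up and to the right. A curve that stays well above the diagonal indicates that the scores separate the two classes better than chance.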
