3. Python Data Analysis Project: Salary Classification Prediction

Latest recommended article published on 2023-06-17 22:54:12

想成为数据分析师的开发工程师 (a developer who wants to become a data analyst)

Category column: Data Analysis - Statistical Analysis. Tags: python, data analysis, machine learning, artificial intelligence, algorithms

Article link: https://blog.csdn.net/m0_63953077/article/details/129205862

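1. Summary

Stage	Operations
Basic inspection	Check missing values (directly with isnull, or graphically with missingno); distinguish numeric from non-numeric features; plot the distribution of every feature in one figure
Preprocessing	Handle missing values (fill), split data (keep the values needed), unify data formats, feature engineering (feature encoding, 0/1 string conversion), feature derivation, dimensionality reduction (feature correlation, PCA)
Data analysis	groupby aggregation to find extreme values; seaborn visualization
Prediction	Split the dataset, build a model (RandomForestRegressor, LogisticRegression, GradientBoostingRegressor), train, predict, evaluate (ROC curve, MSE, MAE, RMSE, R2)

Counts: bar chart
Proportions: pie chart
Distribution over value ranges: probability density plot
Correlation: bar chart, heatmap
Distribution analysis: categorical count plot (countplot), histogram with a density curve (distplot)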

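2. Project Background and Data Source

Project goal
Using the US Census Income dataset, predict from census data whether a person's income exceeds $50,000 per year.
Data source
Dataset URL: https://archive.ics.uci.edu/ml/datasets/adult

3. Understanding the Data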

>50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

4. Basic Data Inspection

4.1 Importing the data

import pandas as pd

# 1. Define the column names
headers = ['age', 'workclass', 'fnlwgt', 
           'education', 'education-num', 
           'marital-status', 'occupation', 
           'relationship', 'race', 'sex', 
           'capital-gain', 'capital-loss', 
           'hours-per-week', 'native-country', 
           'predclass']
# 2. Load the training and test sets
training_raw = pd.read_csv('dataset/adult.data',
                            names=headers,
                            sep=r',\s', # separator: a comma followed by whitespace (raw string avoids an invalid-escape warning)
                            na_values=['?'], # '?' marks a missing value
                            engine='python'
                            )
test_raw = pd.read_csv('dataset/adult.test',
                            names=headers,
                            sep=r',\s', # separator: a comma followed by whitespace
                            na_values=['?'], # '?' marks a missing value
                            engine='python',
                            skiprows=1 # skip the first line of the file
                            )
# 3. Combine the two sets and rebuild the index
# DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True
# concatenates the frames and resets the index in one step
dataset_raw = pd.concat([training_raw, test_raw], ignore_index=True)

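4.2 Viewing missing values graphically

import missingno
# 1. Matrix view of missing values
missingno.matrix(dataset_raw, figsize=(30,5))

# 2. Bar-chart view of missing values
missingno.bar(dataset_raw, sort="ascending", figsize=(30,5))

# 3. Drop rows with missing values (run this only after Section 6, once dataset_bin and dataset_con exist)
dataset_bin = dataset_bin.dropna(axis=0)
dataset_con = dataset_con.dropna(axis=0)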
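
The summary also mentions checking missing values directly with isnull. A minimal sketch of that check follows; the small `toy` frame and its values are made up for illustration and only stand in for `dataset_raw`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dataset_raw; the values are illustrative only
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['State-gov', np.nan, 'Private', np.nan],
    'occupation': ['Adm-clerical', 'Exec-managerial', np.nan, 'Sales'],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
# Fraction of missing values in each column
missing_ratio = toy.isnull().mean()

print(missing_counts)
print(missing_ratio)
```

Running the same two calls on `dataset_raw` should give the per-column counts that the missingno plots display graphically.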

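4.3 Plotting all numeric and non-numeric (categorical) features

import math
import matplotlib.pyplot as plt
import seaborn as sns

# Draw every feature on a single figure
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    sns.set_style('whitegrid') # plot style (the 'seaborn-whitegrid' style name was removed from recent matplotlib)
    fig = plt.figure(figsize=(width, height)) # figure size
    # Spacing between subplots
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # Iterate over the columns with enumerate
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i+1) # add a subplot
        ax.set_title(column) # use the column name as the title

        if dataset.dtypes[column] == object: # check the column dtype (np.object was removed in NumPy 1.24)
            g = sns.countplot(y=column, data=dataset) # non-numeric: count per category
            # Labels truncated to 18 characters (computed here, not applied to the axis)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            plt.xticks(rotation=25)
        else:
            # numeric: histogram with a density curve (histplot replaces the deprecated distplot)
            g = sns.histplot(dataset[column], kde=True, stat='density')
            plt.xticks(rotation=25)

plot_distribution(dataset_raw, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)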

5. Data Preprocessing and Feature Engineering

# 1. Create the new DataFrames
dataset_bin = pd.DataFrame() # holds the discretized (binned) values
dataset_con = pd.DataFrame() # holds the continuous (unbinned) values

# 2. predclass is the label (prediction target): convert it to 0/1, where 1 means annual income above 50K
# Conversion (labels in adult.test carry a trailing period)
dataset_raw.loc[dataset_raw['predclass']=='>50K', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass']=='>50K.', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass']=='<=50K', 'predclass'] = 0
dataset_raw.loc[dataset_raw['predclass']=='<=50K.', 'predclass'] = 0

# Store the label in both DataFrames
dataset_bin['predclass'] = dataset_raw['predclass']
dataset_con['predclass'] = dataset_raw['predclass']

# Visualize the predclass label
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,1))
sns.countplot(y='predclass', data=dataset_raw)

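# 3. Feature: age, viewed both binned and unbinned
# Store the data
dataset_bin['age'] = pd.cut(dataset_raw['age'], 10) # discretized into 10 equal-width bins
dataset_con['age'] = dataset_raw['age'] # not discretized

# Plot the discretized data
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(y='age', data=dataset_bin)

# Plot the continuous data (histogram with a density curve): age by income class
plt.subplot(1,2,2) # second panel, so the histograms are not drawn over the count plot
sns.histplot(dataset_con.loc[dataset_con['predclass']==1]['age'], kde=True, stat='density') # age distribution, income >50K
sns.histplot(dataset_con.loc[dataset_con['predclass']==0]['age'], kde=True, stat='density') # age distribution, income <=50K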
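
To make the binning step concrete, here is a small sketch of how pd.cut behaves. The `ages` values, the bin edges, and the labels are assumptions chosen for illustration; they are not taken from the dataset.

```python
import pandas as pd

ages = pd.Series([17, 25, 33, 41, 58, 90])  # illustrative values

# An integer asks for that many equal-width intervals over the observed range
binned = pd.cut(ages, 4)
print(binned.value_counts(sort=False))

# Explicit edges with readable labels (edges and labels are hypothetical)
labeled = pd.cut(ages, bins=[0, 30, 60, 100], labels=['young', 'middle', 'senior'])
print(labeled.tolist())
```

Equal-width bins are simple, but a skewed column can leave most bins nearly empty, which is worth keeping in mind for capital-gain and capital-loss later on.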

# 4. Feature: workclass
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y = 'workclass', data=dataset_raw)

# Every category except Private has few samples, so similar categories can be merged
# Reduce the number of categories
dataset_raw.loc[dataset_raw['workclass'] == 'Without-pay','workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Never-worked','workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Federal-gov','workclass'] = 'Fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'State-gov','workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Local-gov','workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-not-inc','workclass'] = 'Self-emp'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-inc','workclass'] = 'Self-emp'

# Store the result
dataset_bin['workclass'] = dataset_raw['workclass']
dataset_con['workclass'] = dataset_raw['workclass']

# Plot the merged work classes
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y = 'workclass', data=dataset_bin)
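
The row-by-row .loc assignments used for merging categories can also be written as a single dictionary lookup. This is only a sketch of an alternative, not the approach used in this article; `workclass_map` is a hypothetical name and the short Series stands in for the real column.

```python
import pandas as pd

# Short stand-in for dataset_raw['workclass']
workclass = pd.Series(['Private', 'Without-pay', 'State-gov',
                       'Self-emp-inc', 'Local-gov'])

# Old category -> merged category; categories not listed are left unchanged
workclass_map = {
    'Without-pay': 'Not Working',
    'Never-worked': 'Not Working',
    'Federal-gov': 'Fed-gov',
    'State-gov': 'Non-fed-gov',
    'Local-gov': 'Non-fed-gov',
    'Self-emp-not-inc': 'Self-emp',
    'Self-emp-inc': 'Self-emp',
}

merged = workclass.replace(workclass_map)
print(merged.tolist())
```

Keeping the mapping in one dictionary makes it easier to review, and the same pattern would apply to occupation, native-country, education, and marital-status.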

# 5. Feature: occupation
sns.set_style('whitegrid')
plt.figure(figsize=(20,5)) 
sns.countplot(y="occupation", data=dataset_raw)

# There are many occupation categories, so similar ones are merged to reduce their number
# Merge categories
dataset_raw.loc[dataset_raw['occupation'] == 'Adm-clerical', 'occupation'] = 'Admin'
dataset_raw.loc[dataset_raw['occupation'] == 'Armed-Forces', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Craft-repair', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Exec-managerial', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Farming-fishing', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Handlers-cleaners', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Machine-op-inspct', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Other-service', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Priv-house-serv', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Prof-specialty', 'occupation'] = 'Professional'
dataset_raw.loc[dataset_raw['occupation'] == 'Protective-serv', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Sales', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Tech-support', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Transport-moving', 'occupation'] = 'Manual Labour'

dataset_bin['occupation'] = dataset_raw['occupation']
dataset_con['occupation'] = dataset_raw['occupation']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y="occupation", data=dataset_bin)

# 6. Feature: native-country
sns.set_style('whitegrid')
plt.figure(figsize=(20,10)) 
sns.countplot(y="native-country", data=dataset_raw)
# United-States dominates and every other country has few samples, so countries are merged into regions
# Merge categories
dataset_raw.loc[dataset_raw['native-country'] == 'Cambodia'                    , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Canada'                      , 'native-country'] = 'British-Commonwealth'    
dataset_raw.loc[dataset_raw['native-country'] == 'China'                       , 'native-country'] = 'China'       
dataset_raw.loc[dataset_raw['native-country'] == 'Columbia'                    , 'native-country'] = 'South-America'    
dataset_raw.loc[dataset_raw['native-country'] == 'Cuba'                        , 'native-country'] = 'South-America'        
dataset_raw.loc[dataset_raw['native-country'] == 'Dominican-Republic'          , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Ecuador'                     , 'native-country'] = 'South-America'     
dataset_raw.loc[dataset_raw['native-country'] == 'El-Salvador'                 , 'native-country'] = 'South-America' 
dataset_raw.loc[dataset_raw['native-country'] == 'England'                     , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'France'                      , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Germany'                     , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Greece'                      , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Guatemala'                   , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Haiti'                       , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Holand-Netherlands'          , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Honduras'                    , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Hong'                        , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Hungary'                     , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'India'                       , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Iran'                        , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Ireland'                     , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Italy'                       , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Jamaica'                     , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Japan'                       , 'native-country'] = 'APAC'
dataset_raw.loc[dataset_raw['native-country'] == 'Laos'                        , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Mexico'                      , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Nicaragua'                   , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Outlying-US(Guam-USVI-etc)'  , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Peru'                        , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Philippines'                 , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Poland'                      , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Portugal'                    , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Puerto-Rico'                 , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Scotland'                    , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'South'                       , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Taiwan'                      , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Thailand'                    , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Trinadad&Tobago'             , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'United-States'               , 'native-country'] = 'United-States'
dataset_raw.loc[dataset_raw['native-country'] == 'Vietnam'                     , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Yugoslavia'                  , 'native-country'] = 'Euro_Group_2'

dataset_bin['native-country'] = dataset_raw['native-country']
dataset_con['native-country'] = dataset_raw['native-country']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,4)) 
sns.countplot(y="native-country", data=dataset_bin)

# 7. Feature: education
sns.set_style('whitegrid')
plt.figure(figsize=(20,5)) 
sns.countplot(y="education", data=dataset_raw)

dataset_raw.loc[dataset_raw['education'] == '10th'          , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '11th'          , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '12th'          , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '1st-4th'       , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '5th-6th'       , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '7th-8th'       , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '9th'           , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-acdm'    , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-voc'     , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Bachelors'     , 'education'] = 'Bachelors'
dataset_raw.loc[dataset_raw['education'] == 'Doctorate'     , 'education'] = 'Doctorate'
dataset_raw.loc[dataset_raw['education'] == 'HS-grad'       , 'education'] = 'HS-Graduate'
dataset_raw.loc[dataset_raw['education'] == 'Masters'       , 'education'] = 'Masters'
dataset_raw.loc[dataset_raw['education'] == 'Preschool'     , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Prof-school'   , 'education'] = 'Professor'
dataset_raw.loc[dataset_raw['education'] == 'Some-college'  , 'education'] = 'HS-Graduate'

dataset_bin['education'] = dataset_raw['education']
dataset_con['education'] = dataset_raw['education']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,4)) 
sns.countplot(y="education", data=dataset_bin)

# 8. Feature: Marital Status
plt.figure(figsize=(20,3)) 
sns.countplot(y="marital-status", data=dataset_raw)

dataset_raw.loc[dataset_raw['marital-status'] == 'Never-married'        , 'marital-status'] = 'Never-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-AF-spouse'    , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-civ-spouse'   , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-spouse-absent', 'marital-status'] = 'Not-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Separated'            , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Divorced'             , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Widowed'              , 'marital-status'] = 'Widowed'

dataset_bin['marital-status'] = dataset_raw['marital-status']
dataset_con['marital-status'] = dataset_raw['marital-status']
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,3)) 
sns.countplot(y="marital-status", data=dataset_bin)

# 9. Feature: Final Weight (fnlwgt), the census sampling weight rather than a body weight, binned
dataset_bin['fnlwgt'] = pd.cut(dataset_raw['fnlwgt'], 10)
dataset_con['fnlwgt'] = dataset_raw['fnlwgt']  # not discretized
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,4)) 
sns.countplot(y="fnlwgt", data=dataset_bin)

# 10. Feature: Education Number
dataset_bin['education-num'] = pd.cut(dataset_raw['education-num'], 10) # discretized into bins
dataset_con['education-num'] = dataset_raw['education-num'] # not discretized

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,5)) 
sns.countplot(y="education-num", data=dataset_bin)

# 11. Feature: Hours per Week
# Bin the weekly working hours
dataset_bin['hours-per-week'] = pd.cut(dataset_raw['hours-per-week'], 10)
dataset_con['hours-per-week'] = dataset_raw['hours-per-week']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,4)) 
plt.subplot(1, 2, 1)

sns.countplot(y="hours-per-week", data=dataset_bin)
plt.subplot(1, 2, 2)

sns.histplot(dataset_con['hours-per-week'], kde=True, stat='density') # histplot replaces the deprecated distplot

# 12. Feature: Capital Gain
dataset_bin['capital-gain'] = pd.cut(dataset_raw['capital-gain'], 5)

dataset_con['capital-gain'] = dataset_raw['capital-gain']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,3)) 
plt.subplot(1, 2, 1)
sns.countplot(y="capital-gain", data=dataset_bin)
plt.subplot(1, 2, 2)
sns.histplot(dataset_con['capital-gain'], kde=True, stat='density') # histplot replaces the deprecated distplot

# 13. Feature: Capital Loss
dataset_bin['capital-loss'] = pd.cut(dataset_raw['capital-loss'], 5)
dataset_con['capital-loss'] = dataset_raw['capital-loss']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,3)) 
plt.subplot(1, 2, 1)
sns.countplot(y="capital-loss", data=dataset_bin)
plt.subplot(1, 2, 2)
sns.histplot(dataset_con['capital-loss'], kde=True, stat='density') # histplot replaces the deprecated distplot

# 14. Features: Race, Sex, Relationship
# No processing needed
dataset_con['sex'] = dataset_bin['sex'] = dataset_raw['sex']
dataset_con['race'] = dataset_bin['race'] = dataset_raw['race']
dataset_con['relationship'] = dataset_bin['relationship'] = dataset_raw['relationship']

6. Feature Derivation

Feature derivation means creating new features from existing ones.

# 1. Derived continuous feature (product of age and hours-per-week)
dataset_con['age-hours'] = dataset_con['age'] * dataset_con['hours-per-week']
dataset_bin['age-hours'] = pd.cut(dataset_con['age-hours'],10)
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(y='age-hours', data=dataset_bin) # horizontal count plot
plt.subplot(1,2,2)
# Distribution of the derived continuous feature, by income class
sns.histplot(dataset_con.loc[dataset_con['predclass']==1]['age-hours'], kde=True, stat='density')
sns.histplot(dataset_con.loc[dataset_con['predclass']==0]['age-hours'], kde=True, stat='density')

# 2. Derived categorical feature (concatenation of sex and marital-status)
dataset_bin['sex-marital'] = dataset_con['sex-marital'] = dataset_bin['sex'] + dataset_bin['marital-status']

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20,5))
sns.countplot(y='sex-marital', data=dataset_bin)

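7. Feature Encoding

Machine learning algorithms take numeric inputs. Converting string-valued features into numbers is called encoding. Two methods are used here:

Label encoding
Example: let red=1, yellow=2, blue=3. This assigns a label to each category, but it also means the model may learn a spurious ordering "red < yellow < blue".

One-Hot encoding
Each sample belongs to exactly one category: the value is 1 at the position of that category and 0 everywhere else.

Example: with three colors there are 3 bits, so red is 1 0 0, yellow is 0 1 0, and blue is 0 0 1. The distance between any two of these vectors is the square root of 2, so all categories are equidistant in vector space. No artificial ordering is introduced, and algorithms based on vector-space distances are largely unaffected.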
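
The color example above can be checked numerically. The sketch below is illustrative only: the `colors` Series and the 1/2/3 label mapping simply restate the example in code.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'yellow', 'blue'], name='color')

# Label encoding: one integer per category, which implies an ordering
label_codes = colors.map({'red': 1, 'yellow': 2, 'blue': 3})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(colors, dtype=int)[['red', 'yellow', 'blue']]
vectors = one_hot.to_numpy()

# Pairwise Euclidean distances between the one-hot vectors
d_red_yellow = np.linalg.norm(vectors[0] - vectors[1])
d_red_blue = np.linalg.norm(vectors[0] - vectors[2])
d_yellow_blue = np.linalg.norm(vectors[1] - vectors[2])

print(label_codes.tolist())
print(one_hot)
print(d_red_yellow, d_red_blue, d_yellow_blue)
```

With label codes, red and blue are 2 apart while red and yellow are 1 apart; with one-hot vectors every pair is the same distance apart.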

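# 1. One-hot encode all features of the discretized dataset
one_hot_cols = dataset_bin.columns.tolist() # column names as a list
one_hot_cols.remove('predclass') # do not encode the label column
# One-hot encoding
dataset_bin_env = pd.get_dummies(dataset_bin, columns=one_hot_cols)
dataset_bin_env.head()

# 2. Label-encode all features of the continuous dataset
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
dataset_con = dataset_con.astype(str) # cast every value to a string so one encoder handles all columns
# Note: codes follow string sort order, so numeric order is not preserved (e.g. '10' sorts before '9')
dataset_con_env = dataset_con.apply(encoder.fit_transform)
dataset_con_env.head()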
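
One side effect of casting numeric columns to strings before label encoding is that the codes follow string sort order rather than numeric order. A minimal sketch with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

hours = pd.Series([9, 10, 100])  # illustrative values

# Encoding the string form sorts lexicographically: '10' < '100' < '9'
codes_from_str = LabelEncoder().fit_transform(hours.astype(str)).tolist()

# Encoding the numbers directly keeps the numeric order
codes_from_num = LabelEncoder().fit_transform(hours).tolist()

print(codes_from_str)
print(codes_from_num)
```

If the numeric order matters for the model, one option is to leave the numeric columns as they are and label-encode only the string columns.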

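8. Feature Correlation and Dimensionality Reduction

Benefits of dimensionality reduction:

Data is easier to process and use in fewer dimensions
Related features, especially the important ones, show up more clearly in the data
Noise is removed
Computational cost is lower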
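
As a rough illustration of these points, the sketch below runs PCA on synthetic data in which five features are built from only two underlying signals. The data, the random seed, and the choice of two components are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))  # two underlying signals

# Five correlated features derived from the two signals, plus a little noise
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + base[:, 1],
    base[:, 0] - base[:, 1],
    2 * base[:, 0],
])
X = X + rng.normal(scale=0.01, size=X.shape)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` is one way to judge how many components to keep, instead of fixing the number in advance.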

import numpy as np
from sklearn.decomposition import PCA

# 1. Inspect feature correlations
# Draw a heatmap for each of the two datasets
sns.set_style('whitegrid') # plot style
fig = plt.figure(figsize=(20,10))

# First heatmap
plt.subplot(1,2,1) # first subplot in a 1x2 grid
# Boolean mask with the shape of the dataset_bin_env (discretized) correlation matrix
mask = np.zeros_like(dataset_bin_env.corr(), dtype=bool) # np.bool was removed in NumPy 1.24
# Mask the upper triangle so that only the lower-left half of the heatmap is drawn
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_bin_env.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r",100),
            mask=mask,
            linewidths=.5)

# Second heatmap
plt.subplot(1,2,2) # second subplot in a 1x2 grid
# Boolean mask with the shape of the dataset_con_env (continuous) correlation matrix
mask = np.zeros_like(dataset_con_env.corr(), dtype=bool)
# Mask the upper triangle so that only the lower-left half of the heatmap is drawn
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_con_env.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r",100),
            mask=mask,
            linewidths=.5)

# 2. PCA dimensionality reduction
# Reduce to 10 dimensions (keep 10 principal components)
X = dataset_bin_env.drop('predclass',axis=1) # features only, without the label
pca = PCA(n_components=10)
X_reduction = pca.fit_transform(X)

9. Modeling and Evaluation

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# 1. Choose the dataset
# Option 1: dataset_bin_env (discretized, one-hot encoded)
# Option 2: dataset_con_env (continuous, label encoded)

selected_dataset = dataset_bin_env # the encoded frame defined in Section 7
selected_dataset.head()
# 2. Split the dataset
# The original data already comes split into training and test samples, so restore that split by index
train = selected_dataset.loc[:32560, :]
test = selected_dataset.loc[32561:,:]

# Separate features and labels before training
X_train = train.drop('predclass',axis=1)
y_train = train['predclass'].astype('int64')
X_test = test.drop('predclass', axis=1)
y_test = test['predclass'].astype('int64')

# 3. Build the model (logistic regression)
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
decision_scores = log_reg.decision_function(X_test) # confidence score per sample (signed distance to the decision boundary)
print("decision_scores:",decision_scores)

# 4. Evaluate the model (ROC curve)
fpr, tpr, thresholds = roc_curve(y_test, decision_scores)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--') # diagonal from (0,0) to (1,1): the baseline of a random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
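
The ROC curve is often summarized by the area under it (AUC). A minimal sketch on toy labels and scores follows; the arrays are made up for illustration, and in this project `y_test` and `decision_scores` would take their place.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and scores standing in for y_test and decision_scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(fpr, tpr)
print(auc)
```

An AUC of 0.5 corresponds to the red diagonal in the plot, and values closer to 1 indicate better separation of the two classes.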