Scikit-learn Basics (Part 1)

Table of Contents

1. A Quick Review of Basic Concepts

2. Overview of Sklearn's Design

3. The Machine Learning Workflow

4. The Basic sklearn API Pattern

5. Preparing Data

6. Data Processing (Part 1)

  6.1 Standardization

    6.1.1 Min-max scaling with MinMaxScaler

    6.1.2 Z-score standardization with StandardScaler

  6.2 Normalization (a row-wise operation)

  6.3 Binarization (discretization)

    6.3.1 Binary discretization with Binarizer【usually thresholded at the mean】

    6.3.2 Bucketing with pd.cut and quantile bucketing with pd.qcut【pandas qcut() and cut()】

  6.4 Encoding categorical features【feature engineering】

    6.4.1 OneHotEncoder in scikit-learn【rarely used】

    6.4.2 LabelEncoder in sklearn【takes one column at a time】

    6.4.3 get_dummies in pandas【accepts a whole DataFrame】

    6.4.4 Imputation of missing values


Official Chinese reference documentation:

     https://sklearn.apachecn.org/docs/0.21.3/

     https://sklearn.apachecn.org/docs/0.21.3/50.html

1. A Quick Review of Basic Concepts

      Supervised vs. unsupervised learning

  • The key difference is whether or not the data has labels
  • Industrial applications mostly use supervised learning

      Classification and regression tasks

  • If a linear model works, never use a nonlinear one (nonlinear models overfit easily and are far more expensive to compute)

      Model evaluation【five metrics for classification models】

  • accuracy: rarely used; misleading when classes are imbalanced
  • recall and precision: there is a trade-off between the two
  • F1-score: balances recall and precision in a single number
  • ROC curve
  • AUC: the area under the ROC curve (all five are sketched in code below)
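
A minimal sketch of computing these five metrics with sklearn.metrics; the toy labels and scores below are illustrative assumptions, not data from these notes.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, roc_auc_score)

y_true  = [0, 0, 1, 1, 1]            # ground-truth labels (toy data)
y_pred  = [0, 1, 1, 1, 0]            # hard predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4]  # predicted probability of class 1

print(accuracy_score(y_true, y_pred))       # fraction of correct predictions
print(precision_score(y_true, y_pred))      # TP / (TP + FP)
print(recall_score(y_true, y_pred))         # TP / (TP + FN)
print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
fpr, tpr, thr = roc_curve(y_true, y_score)  # points on the ROC curve
print(roc_auc_score(y_true, y_score))       # AUC: area under that curve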

      Feature processing (feature engineering)

  • The core determinant of how well a model performs
  • Tied closely to domain knowledge
  • Requires familiarity with the relevant tooling

2. Overview of Sklearn's Design

      Official documentation:
            scikit-learn: machine learning in Python — scikit-learn 1.4.1 documentation

  • Classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Model selection (and validation)
  • Preprocessing

3. The Machine Learning Workflow

  •      Acquire data

    web scraping
    databases
    data files (csv, excel, txt)

  •      Process data

    text processing
    consistent feature scales
    dimensionality reduction

  •      Build a model

    classification
    regression
    clustering

  •      Evaluate the model

    hyperparameter tuning
    deciding which model is better

4. The Basic sklearn API Pattern

fit: train the model
transform: apply the fitted transformation to data
predict: return the model's predictions
predict_proba: return predicted class probabilities
score: score the model (the default accuracy is rarely kept in practice; it is usually set to F1)
get_params: get the estimator's parameters
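
A minimal sketch of this pattern in one place; the toy dataset and the choice of LogisticRegression are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)             # fit: train the model
print(clf.predict(X_test)[:5])        # predict: hard class labels
print(clf.predict_proba(X_test)[:5])  # predict_proba: class probabilities
print(clf.score(X_test, y_test))      # score: defaults to accuracy for classifiers
print(clf.get_params())               # get_params: the estimator's hyperparameters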

5. Preparing Data

      Dataset splits:

  • Training data(70%)
  • Validation data
  • Testing data(30%)

      In practice, the split is rarely fully random. Data that has already happened (earlier in time) is used as the training set to predict future (later) data. Doing the reverse and predicting the past from future data leaks prior information from the future into training, which is unsound and easily causes overfitting. Other schemes also exist, for example splitting by region. A time-ordered split is sketched below.
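
A minimal sketch of such a time-ordered split; the DataFrame and its ts timestamp column are illustrative assumptions.

import pandas as pd

events = pd.DataFrame({
    'ts':    pd.date_range('2020-01-01', periods=10, freq='D'),
    'value': range(10),
})
events = events.sort_values('ts')
cut = int(len(events) * 0.7)
train, test = events.iloc[:cut], events.iloc[cut:]   # past -> train, future -> test
print(train['ts'].max() < test['ts'].min())          # True: no future leakage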

6. Data Processing (Part 1)

      Dataset: ML DATASETS (the forest fires dataset, forestfires.csv)

import numpy as np
import pandas as pd

df = pd.read_csv('forestfires.csv', header='infer', names=None)
df.info()   # the output below comes from info()

out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       517 non-null    int64  
 1   Y       517 non-null    int64  
 2   month   517 non-null    object 
 3   day     517 non-null    object 
 4   FFMC    517 non-null    float64
 5   DMC     517 non-null    float64
 6   DC      517 non-null    float64
 7   ISI     517 non-null    float64
 8   temp    517 non-null    float64
 9   RH      517 non-null    int64  
 10  wind    517 non-null    float64
 11  rain    517 non-null    float64
 12  area    517 non-null    float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB

print(df.head())

out:
   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0

  6.1 Standardization

df1 = df.loc[:,'FFMC':'rain']
print(df1.head())

out:
   FFMC   DMC     DC  ISI  temp  RH  wind  rain
0  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0
1  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0
2  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0
3  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2
4  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0

from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(df1.astype(float),df['area'],test_size=0.3,random_state=42)  # 70% train, 30% test
"""
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
"""
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

out:
(361, 8) (156, 8) (361,) (156,)

    6.1.1 Min-max scaling with MinMaxScaler

           MinMaxScaler

Scaling features to lie between a given minimum and maximum value, often between zero and one, or so that
the maximum absolute value of each feature is scaled to unit size.

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
"""
    copy=True,
    feature_range=(0, 1)  # defaults to the [0, 1] interval
"""
mms.fit(X_train)
X_train_mms = mms.transform(X_train)
X_test_mms = mms.transform(X_test)
print(X_train_mms)

out:
[[0.95096774 0.36078567 0.8797936  ... 0.15294118 0.10588235 0.        ]
 [0.93548387 0.32115782 0.86372698 ... 0.37647059 0.47058824 0.        ]
 [0.92258065 0.37835975 0.62096869 ... 0.32941176 0.57647059 0.        ]
 ...
 [0.94709677 0.52205376 0.76263633 ... 0.48235294 0.25882353 0.        ]
 [0.93032258 0.28807719 0.43239123 ... 0.42352941 0.10588235 0.        ]
 [0.98193548 0.36940041 0.74961886 ... 0.29411765 0.36470588 0.        ]]

print(X_train_mms.shape,X_test_mms.shape)

out:
(361, 8) (156, 8)
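
As a sanity check (illustrative, and assuming every training column has a nonzero range), min-max scaling is x' = (x - min) / (max - min), with the min and max taken column-wise from the training data:

mins = X_train.values.min(axis=0)
maxs = X_train.values.max(axis=0)
manual = (X_train.values - mins) / (maxs - mins)   # the min-max formula by hand
print(np.allclose(manual, X_train_mms))            # True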

    6.1.2 Z-score standardization with StandardScaler

           StandardScaler

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API
to compute the mean and standard deviation on a training set so as to be able to later reapply the same
transformation on the testing set.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
"""
    copy=True,
    with_mean=True,  # whether to center (subtract the mean)
    with_std=True    # whether to scale (divide by the standard deviation)
"""

ss.fit(X_train.astype(float))
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
print(X_train_ss)

out:
[[ 0.31227419 -0.09559515  0.84776472 ... -1.01258179 -1.21784014
  -0.07097384]
 [ 0.10927722 -0.26994859  0.79301684 ...  0.13635096  0.52453381
  -0.07097384]
 [-0.05988692 -0.0182732  -0.0341957  ... -0.10552962  1.03038431
  -0.07097384]
 ...
 [ 0.26152495  0.61394751  0.44854476 ...  0.68058227 -0.4871672
  -0.07097384]
 [ 0.04161157 -0.4154958  -0.676784   ...  0.37823154 -1.21784014
  -0.07097384]
 [ 0.71826813 -0.05769223  0.40418698 ... -0.28694005  0.01868331
  -0.07097384]]

print(X_train_ss.shape,X_test_ss.shape)

out:
(361, 8) (156, 8)

print(X_train_ss.mean(axis=0))

out:
[-7.50400050e-17 -2.82937724e-17  0.00000000e+00  3.93652485e-17
 -3.21072808e-16  1.25476730e-16  1.37778370e-16 -1.47619682e-17]

print(X_train_ss.std(axis=0))

out:
[1. 1. 1. 1. 1. 1. 1. 1.]
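
As an illustrative check, StandardScaler stores the training statistics on the fitted object, so the test set is transformed with the training mean and standard deviation:

manual = (X_test.values - ss.mean_) / ss.scale_   # z = (x - mean) / std
print(np.allclose(manual, X_test_ss))             # True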

  6.2 Normalization (a row-wise operation)

The Normalizer class exposes the usual transformer methods such as fit and transform, but fit is effectively
a no-op here: normalization rescales each sample independently, so there is no statistic to learn across the
training set. The methods exist only so that Normalizer presents the same interface as other transformers and
can be dropped into sklearn pipelines.

from sklearn.preprocessing import Normalizer

norm = Normalizer(norm="l2")
norm.fit(X_train)
X_train_norm = norm.transform(X_train)  # L2 normalization rescales each row to a unit vector
X_test_norm = norm.transform(X_test)
print(X_train_norm)

out:
[[0.1196928  0.13705085 0.98202506 ... 0.03627055 0.00233168 0.        ]
 [0.12038336 0.12447534 0.9826028  ... 0.06203967 0.00646797 0.        ]
 [0.16160439 0.19869099 0.96281818 ... 0.07703979 0.01039141 0.        ]
 ...
 [0.13451037 0.22286951 0.96128907 ... 0.08178698 0.00452749 0.        ]
 [0.22668263 0.21145395 0.94018369 ... 0.12732174 0.00449371 0.        ]
 [0.14263086 0.16294221 0.97359102 ... 0.0601818  0.00601818 0.        ]]

print(sum(np.square(X_train_norm[1])))

out:
0.9999999999999997
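
Equivalently (an illustrative check), divide each row by its own L2 norm:

row_norms = np.linalg.norm(X_train.values, axis=1, keepdims=True)
print(np.allclose(X_train.values / row_norms, X_train_norm))   # True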

  6.3 Binarization (discretization)

df.head()

   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0

    6.3.1 Binary discretization with Binarizer【usually thresholded at the mean】

from sklearn.preprocessing import Binarizer

bi = Binarizer(threshold=548)   # 548 is roughly the mean of the DC column here

DC_bi = bi.fit_transform(df[['DC']])   # Binarizer expects 2D input, hence the extra [ ] around 'DC'
print(DC_bi)

out:
[[0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
...
]

df['DC_bi'] = DC_bi
print(df)

out:
     X  Y month  day  FFMC    DMC  ...  temp  RH  wind  rain   area  DC_bi
0    7  5   mar  fri  86.2   26.2  ...   8.2  51   6.7   0.0   0.00    0.0
1    7  4   oct  tue  90.6   35.4  ...  18.0  33   0.9   0.0   0.00    1.0
2    7  4   oct  sat  90.6   43.7  ...  14.6  33   1.3   0.0   0.00    1.0
3    8  6   mar  fri  91.7   33.3  ...   8.3  97   4.0   0.2   0.00    0.0
4    8  6   mar  sun  89.3   51.3  ...  11.4  99   1.8   0.0   0.00    0.0
..  .. ..   ...  ...   ...    ...  ...   ...  ..   ...   ...    ...    ...
512  4  3   aug  sun  81.6   56.7  ...  27.8  32   2.7   0.0   6.44    1.0
513  2  4   aug  sun  81.6   56.7  ...  21.9  71   5.8   0.0  54.29    1.0
514  7  4   aug  sun  81.6   56.7  ...  21.2  70   6.7   0.0  11.16    1.0
515  1  4   aug  sat  94.4  146.0  ...  25.6  42   4.0   0.0   0.00    1.0
516  6  3   nov  tue  79.5    3.0  ...  11.8  31   4.5   0.0   0.00    0.0

[517 rows x 14 columns]
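
To follow the section title literally, the threshold can also be derived from the data instead of hard-coded; a minimal sketch:

bi_mean = Binarizer(threshold=df['DC'].mean())   # mean-based threshold
print(bi_mean.fit_transform(df[['DC']])[:5])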

    6.3.2 Bucketing with pd.cut and quantile bucketing with pd.qcut【pandas qcut() and cut()】

  •     Bucketing with cut
1. Equal-width buckets with cut
--------------------------------------------------------------------------------------------
quartiles = pd.cut(df['DC'],5)
print(quartiles)

out:
0       (7.047, 178.44]
1      (519.52, 690.06]
2      (519.52, 690.06]
3       (7.047, 178.44]
4       (7.047, 178.44]
             ...       
512    (519.52, 690.06]
513    (519.52, 690.06]
514    (519.52, 690.06]
515    (519.52, 690.06]
516     (7.047, 178.44]
Name: DC, Length: 517, dtype: category
Categories (5, interval[float64]): [(7.047, 178.44] < (178.44, 348.98] < (348.98, 519.52] <
                                    (519.52, 690.06] < (690.06, 860.6]]

# A Categorical can be passed straight into groupby
print(df["DC_bi"].groupby(quartiles).count())
print(df.groupby(quartiles).count())

out:
DC
(7.047, 178.44]      88
(178.44, 348.98]     16
(348.98, 519.52]     47
(519.52, 690.06]    176
(690.06, 860.6]     190
Name: DC_bi, dtype: int64

                    X    Y  month  day  FFMC  ...   RH  wind  rain  area  DC_bi
DC                                            ...                              
(7.047, 178.44]    88   88     88   88    88  ...   88    88    88    88     88
(178.44, 348.98]   16   16     16   16    16  ...   16    16    16    16     16
(348.98, 519.52]   47   47     47   47    47  ...   47    47    47    47     47
(519.52, 690.06]  176  176    176  176   176  ...  176   176   176   176    176
(690.06, 860.6]   190  190    190  190   190  ...  190   190   190   190    190

# Using groupby together with apply
def func(group):
    return {'min': group.min(), 'max': group.max(), 'mean': group.mean()}

result = df["DC_bi"].groupby(quartiles).apply(func)
print(result.unstack())

out:
                  min  max      mean
DC                                  
(7.047, 178.44]   0.0  0.0  0.000000
(178.44, 348.98]  0.0  0.0  0.000000
(348.98, 519.52]  0.0  0.0  0.000000
(519.52, 690.06]  0.0  1.0  0.971591
(690.06, 860.6]   1.0  1.0  1.000000
2. Bucketing with cut using explicit bin edges
--------------------------------------------------------------------------------------------
quartiles = pd.cut(df['DC'],[-1000,0,100,400,10000])
print(quartiles)

out:
0          (0, 100]
1      (400, 10000]
2      (400, 10000]
3          (0, 100]
4        (100, 400]
           ...     
512    (400, 10000]
513    (400, 10000]
514    (400, 10000]
515    (400, 10000]
516      (100, 400]
Name: DC, Length: 517, dtype: category
Categories (4, interval[int64]): [(-1000, 0] < (0, 100] < (100, 400] < (400, 10000]]
  •     Quantile bucketing with qcut
1. Bucketing by quantiles with qcut
quartiles = pd.qcut(df['DC'],[0,0.2,0.5,0.8,1])  # split at the 0.2, 0.5 and 0.8 quantiles, yielding 4 buckets
print(quartiles)

out:
0      (7.899, 323.3]
1      (664.2, 728.6]
2      (664.2, 728.6]
3      (7.899, 323.3]
4      (7.899, 323.3]
            ...      
512    (664.2, 728.6]
513    (664.2, 728.6]
514    (664.2, 728.6]
515    (323.3, 664.2]
516    (7.899, 323.3]
Name: DC, Length: 517, dtype: category
Categories (4, interval[float64]): [(7.899, 323.3] < (323.3, 664.2] < (664.2, 728.6] < (728.6, 860.6]]
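
qcut also accepts an integer number of equal-frequency buckets, and both cut and qcut take a labels= argument to name the buckets; an illustrative variation (the label names are made up):

buckets = pd.qcut(df['DC'], 4, labels=['low', 'mid-low', 'mid-high', 'high'])
print(buckets.value_counts())   # each bucket holds roughly 517/4 rows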

  6.4 Encoding categorical features【feature engineering】

    6.4.1 OneHotEncoder in scikit-learn【rarely used】

class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<type 'numpy.float64'>, 
                                          sparse=True, handle_unknown='error')

Tip: the return value is a sparse matrix and needs to be converted to a dense one; the result is the same one-hot matrix that get_dummies produces【rarely used】

from sklearn.preprocessing import OneHotEncoder
ont = OneHotEncoder()
month_ont = ont.fit_transform(df[['month']])  # pass a 2D structure

print(month_ont.toarray())

out:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]]
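
The learned categories, which define the column order of the one-hot matrix, are stored on the fitted encoder; an illustrative check:

print(ont.categories_)   # [array(['apr', 'aug', 'dec', ..., 'sep'], dtype=object)]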

    6.4.2 LabelEncoder in sklearn【takes one column at a time】

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
month_le = le.fit_transform(df['month'])  # pass a 1D structure

print(month_le)

out:
[ 7 10 10  7  7  1  1  1 11 11 11 11  1 11 11 11  7 10  7  0 11 11  6  1
  1  1 11 11 11 11 11 11 11 11 11 11 10 10 10  7  5  1  1 11 11 11 11  5
  7  7 11  1  1  1  1 11 11 10  3  3  7  7  1  1  1  1 11 11 11  7  7 11
  7  1 11  3  3  7  1  1  1  1  1  1  1 11 11 11 11  7  1  7  1  1  1 11
  3  7  1  1  1  1  1 11  4  7  7  1 11 11  7  7 11 11  7  7  7  7  7  1
  1  1 11 11 11 10  7 11 10 10  3  7  7 11  7  1 11 11  5 11 11  1  1  5
  1  1  7 11  1 11  6  5  5 11 11  1 11  1  1 11  7  1  7 11 11  7  1  1
  7  1 11  1  1 11  1  1  0  1 11  1 11 10  3 10  1 11  7 11  7  7  7  1
  1 11  1  1  0 11 11 11 11  7  3 10  7 11  1 11 11 11 10  1 11  7  7  7
 11 11 11  7  1 11  7  5 11 11 10  1 11  1 11 11 11 11 11  1 11 11 11  0
  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  3  3  3  5  5  5
  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  8 11 11 11 11 11 11 11
 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 11 11 11 11 11 11 11 11 11 11 11  5  1  1 11 11  1  1  7  4  5  1  1  1
  1  1 11  7  1  1  3 11 11  7  3  3 11  1  1  6  6 11  1  1 11  1 11  3
 11  5  3  3  5  1  1  1  5  7  1  1  1  1  5 11  1  1  1  1  1  1 11  1
  1  1  1  5  1  1  1 11 11  1  0  5 11  1  1  7 11  1  1  1  1  1  1  5
  1  1  1  1  1  1 11  3  3  3  7  7  7  0  0  8  6  6  6  6  5  5  5  5
  5  5  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  9]
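
The integer codes are assigned in alphabetical order of the original strings; classes_ and inverse_transform map them back (an illustrative check):

print(le.classes_)                    # ['apr' 'aug' 'dec' 'feb' 'jan' 'jul' 'jun' 'mar' 'may' 'nov' 'oct' 'sep']
print(le.inverse_transform([7, 10]))  # ['mar' 'oct'], matching rows 0 and 1 above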

    6.4.3 get_dummies in pandas【accepts a whole DataFrame】

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None,
sparse=False, drop_first=False)
"""
    data     the input data
    columns  which columns of the df to one-hot encode
"""

# The specified columns disappear and are replaced by their one-hot counterparts. pandas picks the
# smallest integer dtype that fits (uint8 here), so an astype conversion can be applied afterwards.
ont_hots = pd.get_dummies(data=df, columns=['month', 'day'])
print(ont_hots.info())

out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 31 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   X          517 non-null    int64  
 1   Y          517 non-null    int64  
 2   FFMC       517 non-null    float64
 3   DMC        517 non-null    float64
 4   DC         517 non-null    float64
 5   ISI        517 non-null    float64
 6   temp       517 non-null    float64
 7   RH         517 non-null    int64  
 8   wind       517 non-null    float64
 9   rain       517 non-null    float64
 10  area       517 non-null    float64
 11  DC_bi      517 non-null    float64
 12  month_apr  517 non-null    uint8  
 13  month_aug  517 non-null    uint8  
 14  month_dec  517 non-null    uint8  
 15  month_feb  517 non-null    uint8  
 16  month_jan  517 non-null    uint8  
 17  month_jul  517 non-null    uint8  
 18  month_jun  517 non-null    uint8  
 19  month_mar  517 non-null    uint8  
 20  month_may  517 non-null    uint8  
 21  month_nov  517 non-null    uint8  
 22  month_oct  517 non-null    uint8  
 23  month_sep  517 non-null    uint8  
 24  day_fri    517 non-null    uint8  
 25  day_mon    517 non-null    uint8  
 26  day_sat    517 non-null    uint8  
 27  day_sun    517 non-null    uint8  
 28  day_thu    517 non-null    uint8  
 29  day_tue    517 non-null    uint8  
 30  day_wed    517 non-null    uint8  
dtypes: float64(9), int64(3), uint8(19)
memory usage: 58.2 KB
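
Following the comment above, the uint8 dummy columns can be widened with astype if downstream code expects regular integers; a minimal sketch, assuming the default month_/day_ prefixes shown in the info() output:

dummy_cols = [c for c in ont_hots.columns if c.startswith(('month_', 'day_'))]
ont_hots[dummy_cols] = ont_hots[dummy_cols].astype('int64')
print(ont_hots[dummy_cols].dtypes.unique())   # [dtype('int64')]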

    6.4.4 Imputation of missing values

1. Using sklearn's SimpleImputer
------------------------------------------------------------------------------------------------
class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0,
 copy=True, add_indicator=False)
"""
    If “mean”, then replace missing values using the mean along the axis.
    If “median”, then replace missing values using the median along the axis.
    If “most_frequent”, then replace missing using the most frequent value along the axis.
    If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
"""
df   # the original data

out:
     X  Y month  day  FFMC    DMC  ...  temp  RH  wind  rain   area  DC_bi
0    7  5   mar  fri  86.2   26.2  ...   8.2  51   6.7   0.0   0.00    0.0
1    7  4   oct  tue  90.6   35.4  ...  18.0  33   0.9   0.0   0.00    1.0
2    7  4   oct  sat  90.6   43.7  ...  14.6  33   1.3   0.0   0.00    1.0
3    8  6   mar  fri  91.7   33.3  ...   8.3  97   4.0   0.2   0.00    0.0
4    8  6   mar  sun  89.3   51.3  ...  11.4  99   1.8   0.0   0.00    0.0
..  .. ..   ...  ...   ...    ...  ...   ...  ..   ...   ...    ...    ...
512  4  3   aug  sun  81.6   56.7  ...  27.8  32   2.7   0.0   6.44    1.0
513  2  4   aug  sun  81.6   56.7  ...  21.9  71   5.8   0.0  54.29    1.0
514  7  4   aug  sun  81.6   56.7  ...  21.2  70   6.7   0.0  11.16    1.0
515  1  4   aug  sat  94.4  146.0  ...  25.6  42   4.0   0.0   0.00    1.0
516  6  3   nov  tue  79.5    3.0  ...  11.8  31   4.5   0.0   0.00    0.0

[517 rows x 14 columns]

df['DC_na'] = np.nan
df.loc[df['DC'] > 600, 'DC_na'] = df["DC"]   # assignment with a boolean mask must use .loc; see the pandas notes
print(df)

out:
     X  Y month  day  FFMC    DMC  ...  RH  wind  rain   area  DC_bi  DC_na
0    7  5   mar  fri  86.2   26.2  ...  51   6.7   0.0   0.00    0.0    NaN
1    7  4   oct  tue  90.6   35.4  ...  33   0.9   0.0   0.00    1.0  669.1
2    7  4   oct  sat  90.6   43.7  ...  33   1.3   0.0   0.00    1.0  686.9
3    8  6   mar  fri  91.7   33.3  ...  97   4.0   0.2   0.00    0.0    NaN
4    8  6   mar  sun  89.3   51.3  ...  99   1.8   0.0   0.00    0.0    NaN
..  .. ..   ...  ...   ...    ...  ...  ..   ...   ...    ...    ...    ...
512  4  3   aug  sun  81.6   56.7  ...  32   2.7   0.0   6.44    1.0  665.6
513  2  4   aug  sun  81.6   56.7  ...  71   5.8   0.0  54.29    1.0  665.6
514  7  4   aug  sun  81.6   56.7  ...  70   6.7   0.0  11.16    1.0  665.6
515  1  4   aug  sat  94.4  146.0  ...  42   4.0   0.0   0.00    1.0  614.7
516  6  3   nov  tue  79.5    3.0  ...  31   4.5   0.0   0.00    0.0    NaN

[517 rows x 15 columns]

Now impute with the SimpleImputer API:
from sklearn.impute import SimpleImputer

SI = SimpleImputer(strategy='mean')
"""
    missing_values=np.nan,   # usually left at the default; specifies which values count as missing
    strategy="mean",         # one of "mean", "median", "most_frequent", "constant"
    fill_value=None,         # only takes effect when strategy="constant"
"""
DC_na_si = SI.fit_transform(df[["DC_na"]])   # expects 2D input, hence the extra [ ]
print(DC_na_si)

out:

[[703.07754491]
 [669.1       ]
 [686.9       ]
 [703.07754491]
 [703.07754491]
 [703.07754491]
 [703.07754491]
 [608.2       ]
 [692.6       ]
 [698.6       ]
 [698.6       ]
 [713.        ]
 [665.3       ]
 ... ]
2. Alternatively, use the df.fillna() and df.replace() APIs (a fillna sketch follows); see the pandas notes for details.
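
A minimal fillna sketch of the same mean imputation (the new column name is illustrative):

df['DC_na_filled'] = df['DC_na'].fillna(df['DC_na'].mean())   # same result as strategy='mean'
print(df['DC_na_filled'].head())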

 【Some practical tips】

  1. Try not to drop samples just because a few feature values are missing; in practice it is better to use domain knowledge to impute reasonable estimates and make full use of the samples.
  2. If no reasonable estimate is available, fill in an obviously meaningless sentinel value such as -999 or -1 (see the sketch below).
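
For tip 2, SimpleImputer's constant strategy does exactly this sentinel fill; a minimal sketch:

from sklearn.impute import SimpleImputer

si_const = SimpleImputer(strategy='constant', fill_value=-999)   # sentinel fill
print(si_const.fit_transform(df[['DC_na']])[:5])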
