Machine Learning: Data


ff0kk

Category: Machine Learning. Tags: machine learning, data analysis

Article link: https://blog.csdn.net/hkk_007/article/details/107962619


1. Datasets
2. Exploratory Data Analysis
3. Data Preprocessing

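1. Datasets

A dataset is essentially an M x N matrix, where M is the number of samples (rows) and N is the number of features (columns).
The columns can be split into X and Y. X holds the features (independent variables, input variables); Y holds the class label (dependent variable, output variable).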

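Note

A dataset that can be used for supervised learning (regression or classification) contains both X and Y, whereas a dataset for unsupervised learning contains only X.
In addition, if Y holds quantitative values, the dataset (X and Y together) can be used for a regression task; if Y holds qualitative values, it can be used for a classification task.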
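As a minimal sketch of the X/Y split described above, the following uses a small made-up table. The column names and values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy dataset: 4 samples (rows), 2 features plus 1 label column.
df = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "survived": [0, 1, 1, 1],  # qualitative label, so this is a classification dataset
})

X = df.drop(columns="survived")  # features / independent (input) variables
y = df["survived"]               # label / dependent (output) variable

print(X.shape)  # (4, 2): 4 samples, 2 features
print(y.shape)  # (4,)
```

Dropping `y` entirely would leave a dataset suitable only for unsupervised learning.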

2. Exploratory Data Analysis

2.1 Overview of the steps

2.1.1 Descriptive statistics

Mean, median, standard deviation
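These statistics can be computed directly with pandas. The sketch below uses a hypothetical `age` column; note that `Series.std()` returns the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical numeric column, for illustration only.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0, 29.0]})

mean_age = df["age"].mean()      # arithmetic mean
median_age = df["age"].median()  # middle value of the sorted data
std_age = df["age"].std()        # sample standard deviation (ddof=1)

print(mean_age, median_age, std_age)
print(df.describe())             # count, mean, std, min, quartiles, max in one call
```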

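2.1.2 Data visualization

Heatmaps (to identify correlations among features), box plots (to visualize group differences), scatter plots (to visualize the correlation between features), principal component analysis (to visualize the cluster structure present in a dataset), and so on.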
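To illustrate the principal component analysis item, here is a small sketch that projects synthetic data onto two components with scikit-learn. The two clusters are generated artificially and are an assumption made for this example; a scatter plot of the two resulting columns would show them as separate groups.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two synthetic clusters in 5 dimensions (hypothetical data).
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(50, 5))
cluster_b = rng.normal(loc=5.0, scale=1.0, size=(50, 5))
X = np.vstack([cluster_a, cluster_b])

# Project onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```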

2.1.3 Data reshaping

Pivoting, grouping, filtering, and similar operations on the data
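The three operations map onto pandas roughly as follows. This is a sketch on a hypothetical table; the column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical passenger table.
df = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "sex": ["female", "male", "female", "male", "male"],
    "fare": [80.0, 50.0, 8.0, 7.0, 9.0],
})

# Grouping: mean fare per passenger class.
mean_fare = df.groupby("pclass")["fare"].mean()

# Pivoting: classes as rows, sexes as columns, mean fare in each cell.
pivot = df.pivot_table(index="pclass", columns="sex", values="fare", aggfunc="mean")

# Filtering: keep only rows with a fare above 10.
expensive = df[df["fare"] > 10]

print(mean_fare)
print(pivot)
print(expensive)
```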

2.2 Code

2.2.1 Box plots

Use box plots to spot outliers in the data.

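import matplotlib.pyplot as plt
# data is a pandas DataFrame; one box plot per numeric column, so the layout needs at least as many cells as columns
data.plot(kind='box', subplots=True, layout=(1, 6), figsize=(16, 8))
plt.show()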
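The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the 1.5 x IQR rule, which is matplotlib's default whisker setting, to a hypothetical `fare` series.

```python
import pandas as pd

# Hypothetical fares with one extreme value.
fare = pd.Series([7.25, 7.75, 8.05, 13.0, 26.0, 30.0, 512.33])

q1 = fare.quantile(0.25)
q3 = fare.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fare[(fare < lower) | (fare > upper)]

print(outliers)
```

Whether a flagged value is a genuine error or simply a rare observation still has to be judged from the data.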


2.2.2 Correlation heatmap

Parameters

data : rectangular dataset
2D dataset that can be coerced into an ndarray. If a Pandas DataFrame is provided, the index/column information will be used to label the columns and rows.
(A matrix-like dataset.)
vmin, vmax : floats, optional
Values to anchor the colormap, otherwise they are inferred from the data and other keyword arguments.
(The lower and upper ends of the color range, as shown on the color bar at the right.)
cmap : matplotlib colormap name or object, or list of colors, optional
The mapping from data values to color space. If not provided, the default will depend on whether center is set.
(The color palette used for the plot.)
center : float, optional
The value at which to center the colormap when plotting divergent data. Using this parameter will change the default cmap if none is specified.
robust : bool, optional
If True and vmin or vmax are absent, the colormap range is computed with robust quantiles instead of the extreme values.
annot : bool or rectangular dataset, optional
If True, write the data value in each cell. If an array-like with the same shape as data, then use this to annotate the heatmap instead of the data. Note that DataFrames will match on position, not index.
fmt : string, optional
String formatting code to use when adding annotations.
annot_kws : dict of key, value mappings, optional
Keyword arguments for ax.text when annot is True.
linewidths : float, optional
Width of the lines that will divide each cell.
linecolor : color, optional
Color of the lines that will divide each cell.
cbar : boolean, optional
Whether to draw a colorbar.
cbar_kws : dict of key, value mappings, optional
Keyword arguments for fig.colorbar.
cbar_ax : matplotlib Axes, optional
Axes in which to draw the colorbar, otherwise take space from the main Axes.
square : boolean, optional
If True, set the Axes aspect to “equal” so each cell will be square-shaped.
xticklabels, yticklabels : “auto”, bool, list-like, or int, optional
If True, plot the column names of the dataframe. If False, don’t plot the column names. If list-like, plot these alternate labels as the xticklabels. If an integer, use the column names but plot only every n label. If “auto”, try to densely plot non-overlapping labels.
mask : boolean array or DataFrame, optional
If passed, data will not be shown in cells where mask is True. Cells with missing values are automatically masked.
ax : matplotlib Axes, optional
Axes in which to draw the plot, otherwise use the currently-active Axes.
kwargs : other keyword arguments
All other keyword arguments are passed to matplotlib.axes.Axes.pcolormesh().

Returns

ax : matplotlib Axes
Axes object with the heatmap.

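import matplotlib.pyplot as plt
import seaborn as sns

def correlation_heatmap(df):
    """Plot an annotated heatmap of the pairwise correlations in df."""
    _, ax = plt.subplots(figsize=(14, 12))
    # Diverging palette: negative and positive correlations get different hues
    colormap = sns.diverging_palette(220, 10, as_cmap=True)

    sns.heatmap(
        df.select_dtypes('number').corr(),  # correlations of the numeric columns only
        cmap=colormap,
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        linewidths=0.1, vmax=1.0, linecolor='white',
        annot_kws={'fontsize': 12}
    )
    plt.title('Correlation of Features', y=1.05, size=15)
    plt.show()

correlation_heatmap(data)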
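As a numeric companion to the heatmap, the strongest correlations can be listed as a sorted table. This is a sketch on hypothetical columns `a`, `b`, `c`; it keeps only the upper triangle of the correlation matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is exactly 2 * a, c is loosely decreasing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = df.corr()

# Mask everything except the strict upper triangle (no diagonal, no duplicates).
keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(keep)

# Flatten to (feature, feature) pairs and sort by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
```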


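3. Data Preprocessing

3.1 Overview of common operations

Data preprocessing (also called data cleaning, data wrangling, or data processing) is the process of checking and reviewing the data in order to fix missing values and spelling errors, normalize or standardize values so that they are comparable, transform the data (for example, with a log transform), and so on.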
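The operations listed above could look like this in pandas. The table, the misspelling, and the choice of the median as fill value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with inconsistent spelling and a missing value.
df = pd.DataFrame({
    "city": ["Paris", "paris ", "London", "Lodnon"],
    "income": [1000.0, np.nan, 100000.0, 10000.0],
})

# Spelling: trim whitespace, unify capitalization, fix a known typo.
df["city"] = df["city"].str.strip().str.title().replace({"Lodnon": "London"})

# Missing values: fill with the column median (one of several possible strategies).
df["income"] = df["income"].fillna(df["income"].median())

# Transformation: a log transform compresses the large range of incomes.
df["log_income"] = np.log10(df["income"])

print(df)
```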

3.2 sklearn modules

preprocessing : data preprocessing
impute : filling in missing values
feature_selection : feature selection
decomposition : dimensionality reduction algorithms
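The sketch below touches one class from each of the four modules, chained on a tiny hypothetical array. The specific classes (`SimpleImputer`, `VarianceThreshold`, `StandardScaler`, `PCA`) are chosen here as examples; each module offers many others.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one missing value, and a constant third column.
X = np.array([
    [1.0, 10.0, 0.0],
    [2.0, np.nan, 0.0],
    [3.0, 30.0, 0.0],
    [4.0, 40.0, 0.0],
])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # impute: fill NaN with the column mean
X_selected = VarianceThreshold().fit_transform(X_imputed)    # feature_selection: drop zero-variance columns
X_scaled = StandardScaler().fit_transform(X_selected)        # preprocessing: zero mean, unit variance
X_reduced = PCA(n_components=1).fit_transform(X_scaled)      # decomposition: reduce to 1 component

print(X_imputed.shape, X_selected.shape, X_scaled.shape, X_reduced.shape)
```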

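3.3 Data preprocessing

3.3.1 Feature scaling (nondimensionalization)

Concept: convert data measured on different scales to a common scale, or data with different distributions to one specific distribution.

Purpose: it can speed up model fitting and improve accuracy, for example by preventing a feature with a very large value range from dominating distance computations.

Linear scaling operations

Centering (zero-centering, or mean subtraction)
Subtract a fixed value from every data point, which shifts the samples to a new position.
Scaling
Divide by a fixed value to confine the data to a certain range. Taking the logarithm is also a form of scaling, although it is not a linear operation.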
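Centering and scaling are plain arithmetic, as this NumPy sketch on a hypothetical vector shows. The choice of the mean as the shift and the maximum as the divisor is just one possibility.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical feature values

centered = x - x.mean()  # centering: shift so the mean becomes 0
scaled = x / x.max()     # scaling: divide by a fixed value, here the maximum
logged = np.log(x)       # log transform: compresses large values (nonlinear)

print(centered)  # [-3. -1.  1.  3.]
print(scaled)    # [0.25 0.5  0.75 1.  ]
print(logged)
```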

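3.3.2 Code examples

3.3.2.1 Normalization

sklearn.preprocessing.MinMaxScaler
When the data (x) is centered on its minimum and then scaled by its range (maximum - minimum), it is shifted by the minimum and squeezed into [0, 1]: x' = (x - min(x)) / (max(x) - min(x)). This process is called normalization (also known as min-max scaling).

Parameters

feature_range : the target range to scale the data into; defaults to (0, 1)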

from sklearn.preprocessing import MinMaxScaler
# data_t is a DataFrame with the Survived, Pclass, Age and Fare columns (see the output below)
print('data_t sample:\n', data_t.sample(10))
# Normalize the data
scaler = MinMaxScaler()                  # instantiate
scaler = scaler.fit(data_t)              # fit: here this essentially computes min(x) and max(x)
result = scaler.transform(data_t)        # export the result through the transform interface
print('result sample:\n', result[:10])
print('-'*50)

result_ = scaler.fit_transform(data_t)   # fit and transform in one step
print('In one step:\n', result_[:10])
print('-'*50)

print('Inverse of the normalized result:\n', scaler.inverse_transform(result))    # undo the normalization

data_t sample:
      Survived  Pclass        Age     Fare
231         0       3  29.000000   7.7750
888         0       3  60.000000  23.4500
848         0       2  28.000000  33.0000
727         1       3  16.143837   7.7375
696         0       3  44.000000   8.0500
312         0       2  26.000000  26.0000
750         1       2   4.000000  23.0000
508         0       3  28.000000  22.5250
491         0       3  21.000000   7.2500
280         0       3  65.000000   7.7500
result sample:
 [[ 0.          1.          0.27117366  0.01415106]
 [ 1.          0.          0.4722292   0.13913574]
 [ 1.          1.          0.32143755  0.01546857]
 [ 1.          0.          0.43453129  0.1036443 ]
 [ 0.          1.          0.43453129  0.01571255]
 [ 0.          1.          0.3559721   0.0165095 ]
 [ 0.          0.          0.67328474  0.10122886]
 [ 0.          1.          0.74868057  0.04113566]
 [ 1.          1.          0.33400352  0.02173075]
 [ 1.          0.5         0.17064589  0.05869429]]
--------------------------------------------------
In one step:
 [[ 0.          1.          0.27117366  0.01415106]
 [ 1.          0.          0.4722292   0.13913574]
 [ 1.          1.          0.32143755  0.01546857]
 [ 1.          0.          0.43453129  0.1036443 ]
 [ 0.          1.          0.43453129  0.01571255]
 [ 0.          1.          0.3559721   0.0165095 ]
 [ 0.          0.          0.67328474  0.10122886]
 [ 0.          1.          0.74868057  0.04113566]
 [ 1.          1.          0.33400352  0.02173075]
 [ 1.          0.5         0.17064589  0.05869429]]
--------------------------------------------------
Inverse of the normalized result:
 [[  0.       3.      22.       7.25  ]
 [  1.       1.      38.      71.2833]
 [  1.       3.      26.       7.925 ]
 ..., 
 [  0.       3.      60.      23.45  ]
 [  1.       1.      26.      30.    ]
 [  0.       3.      32.       7.75  ]]
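To connect the min-max formula with the `feature_range` parameter, the sketch below compares a manual computation against `MinMaxScaler` on a small hypothetical array, then rescales into the range (5, 10) as an arbitrary example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [5.0, 50.0]])

# Manual min-max scaling, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same result from scikit-learn with the default feature_range=(0, 1).
scaled = MinMaxScaler().fit_transform(X)

# A different target range, chosen arbitrarily for illustration.
scaler_5_10 = MinMaxScaler(feature_range=(5, 10))
scaled_5_10 = scaler_5_10.fit_transform(X)
restored = scaler_5_10.inverse_transform(scaled_5_10)

print(scaled)
print(scaled_5_10)
print(restored)
```

Because the scaler stores the per-column minimum and maximum learned in `fit`, `inverse_transform` can recover the original values.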

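3.3.2.2 Standardization

sklearn.preprocessing.StandardScaler
When the data (x) is centered on its mean (μ) and then scaled by its standard deviation (σ), that is x' = (x - μ) / σ, the result has mean 0 and variance 1. This process is called standardization (also known as Z-score normalization). Note that standardization only changes the location and scale of the data; the result follows a standard normal distribution only if the original data was normally distributed.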

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()                 # instantiate
result = scaler.fit_transform(data_t)     # fit and transform in one step
print('In one step:\n', result[:10])

In one step:
 [[-0.78927234  0.82737724 -0.64082784 -0.50244517]
 [ 1.2669898  -1.56610693  0.3682384   0.78684529]
 [ 1.2669898   0.82737724 -0.38856128 -0.48885426]
 [ 1.2669898  -1.56610693  0.17903848  0.42073024]
 [-0.78927234  0.82737724  0.17903848 -0.48633742]
 [-0.78927234  0.82737724 -0.21523778 -0.47811643]
 [-0.78927234 -1.56610693  1.37730463  0.39581356]
 [-0.78927234  0.82737724  1.75570447 -0.22408312]
 [ 1.2669898   0.82737724 -0.32549464 -0.42425614]
 [ 1.2669898  -0.36936484 -1.14536096 -0.0429555 ]]
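The following sketch checks the standardization formula by hand on a hypothetical, deliberately skewed column. It also illustrates that the output has mean 0 and unit variance but keeps the shape of the original distribution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed feature: four small values and one large one.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaler.mean_, scaler.scale_)  # the learned mean and standard deviation
print(Z.mean(), Z.std())            # approximately 0 and 1
print(np.median(Z))                 # below 0: the data is still skewed
```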