A Simple Visualization of Standardization (numpy + sklearn + matplotlib)

bxttttt

Last modified: 2024-04-22 19:02:02

Column: Visualization. Tags: numpy, sklearn, matplotlib, python, machine learning, data mining, data analysis

First published: 2024-04-22 12:57:10

Article link: https://blog.csdn.net/weixin_61728385/article/details/138070556


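Standardizing a dataset is a common requirement for many machine learning estimators: they may behave badly if the individual features do not more or less look like standard normally distributed data (for example, Gaussian with zero mean and unit variance).

A support vector machine, for instance, can give very different results depending on whether the features are standardized.

A random forest, by contrast, is largely unaffected, because tree splits depend only on the ordering of the values within each feature and not on their scale.

(At the end of this post I report my own experiment: the same dataset and the same processing pipeline, comparing the accuracy of an SVM classifier before and after standardization.)

Standardization rescales each feature by removing its mean and scaling it to unit variance.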
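To make the definition concrete: for each feature, standardization computes z = (x - mean) / std over all samples. The sketch below checks this against `StandardScaler` on made-up data; the array `x` and its scales are assumptions for illustration, not the dataset used in this post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features on very different scales
x = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(200, 3))

# Per-feature statistics are taken along axis 0 (over the samples)
mean = x.mean(axis=0)
std = x.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
z_manual = (x - mean) / std

# The same transform done by sklearn
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
```

After the transform every column has mean close to 0 and standard deviation close to 1, whatever its original scale was.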

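So to visualize standardization, first pick a feature, then look at how the values of all samples are distributed on that feature.

I needed a small panel in a flowchart that visualizes standardization. At first I wanted to reuse someone else's figure, but my advisor said no, so here is a quick record of how I made my own.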

Imports:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing

Data:

shape of x_train: (1186, 329)
That is, x_train contains 1186 samples, each with 329 features.
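Since rows are samples and columns are features, "all samples' values for one feature" is simply one column. A small sketch on a random stand-in array of the same shape (the name `x_demo` is hypothetical, and the values are not the real data) shows that column indexing and the transpose trick used later in this post pick out the same values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in with the same shape as x_train: 1186 samples, 329 features
x_demo = rng.normal(size=(1186, 329))

# Two equivalent ways to get feature 0 across all samples
feature_0_by_column = x_demo[:, 0]
feature_0_by_transpose = x_demo.T[0]

# The first 4 features for all samples, without any transposing
first_four = x_demo[:, :4]

print(x_demo.shape, feature_0_by_column.shape, first_four.shape)
```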

Standardization itself is simple:

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
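The fitted scaler stores the per-feature mean and scale it learned, so the same transform can later be applied to other data. The sketch below assumes a hypothetical held-out array `x_test_demo`; no test set is shown in this post, so treat it purely as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical training and held-out data with 5 features
x_train_demo = rng.normal(loc=10.0, scale=5.0, size=(100, 5))
x_test_demo = rng.normal(loc=10.0, scale=5.0, size=(20, 5))

# fit() learns one mean and one scale (std) per feature from the training data
scaler = StandardScaler().fit(x_train_demo)
print(scaler.mean_.shape, scaler.scale_.shape)

# transform() reuses those training statistics on any array with the same features
x_train_std = scaler.transform(x_train_demo)
x_test_std = scaler.transform(x_test_demo)
```

Reusing the training statistics means the held-out data is not used to compute the mean or variance.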

Here I only visualize the first 4 features, before and after standardization.

# Track the global min and max so both plots share one value range.
# Start from +/- infinity so the first real value always replaces them.
max_num = -np.inf
min_num = np.inf
# Values of each selected feature across all samples, before standardization
unstandardized_features = []
# Transpose so that each row holds one feature's values over all samples
x_train = x_train.T
# 4 = number of features to visualize (the first 4); change this as needed
for i in range(4):
    min_num = min(min_num, min(x_train[i]))
    max_num = max(max_num, max(x_train[i]))
    unstandardized_features.append(x_train[i])
# Remember to transpose back
x_train = x_train.T
# Standardize
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
# Standardization done
# Values of each selected feature across all samples, after standardization
x_train = x_train.T
standardized_features = []
for i in range(len(unstandardized_features)):
    min_num = min(min_num, min(x_train[i]))
    max_num = max(max_num, max(x_train[i]))
    standardized_features.append(x_train[i])
x_train = x_train.T
# At this point we have every sample's value for the chosen features, before and after.
# Next: split [min_num, max_num] into many small bins and count how many values fall in each.
unstandardized_features = np.array(unstandardized_features)
standardized_features = np.array(standardized_features)
# steps is the number of bins; pick another value if you like
steps = 50
# Width of one bin
gap = (max_num - min_num) / steps
standardized_count = []
unstandardized_count = []
for i in range(len(unstandardized_features)):
    # steps + 1 slots, because a value equal to max_num lands at index steps
    standardized_count.append(np.zeros(steps + 1))
    unstandardized_count.append(np.zeros(steps + 1))
# Count how many values fall into each bin, before and after standardization
for i in range(len(unstandardized_features[0])):
    for j in range(len(unstandardized_features)):
        position = (standardized_features[j][i] - min_num) / gap
        position = int(position)
        standardized_count[j][position] += 1
        position = (unstandardized_features[j][i] - min_num) / gap
        position = int(position)
        unstandardized_count[j][position] += 1
# Plotting. Note that the x axis is the bin index, not the feature value itself.
x = np.arange(len(standardized_count[0]))
fig, ax = plt.subplots(1, 1)
for i in range(len(unstandardized_features)):
    ax.plot(x, standardized_count[i])
# The output directory must already exist
path = "对比实验合集/标准化可视化/"
# subj is not defined in this snippet; it is presumably the subject ID from the
# surrounding experiment, so a placeholder value is set here.
subj = 1
path_name = path + "standardized_subject" + str(subj)
# Note: plt.savefig must be called before plt.show, not after it
plt.savefig(path_name)
plt.show()
fig, ax = plt.subplots(1, 1)
for i in range(len(unstandardized_features)):
    ax.plot(x, unstandardized_count[i])
path_name = path + "unstandardized_subject" + str(subj)
plt.savefig(path_name)
plt.show()
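The manual binning above can also be written with `np.histogram`, using one shared set of bin edges for all features. This is a sketch of an alternative, not the code that produced the figures in this post; the array `x_demo` is random stand-in data, and `n_features` and `edges` are names introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for x_train: 500 samples, 10 features
x_demo = rng.normal(loc=3.0, scale=2.0, size=(500, 10))

n_features = 4  # visualize the first 4 features
steps = 50      # number of bins

# Columns are features, so slicing columns replaces the transpose loop
before = x_demo[:, :n_features]
after = StandardScaler().fit_transform(x_demo)[:, :n_features]

# One shared value range so the "before" and "after" counts are comparable
lo = min(before.min(), after.min())
hi = max(before.max(), after.max())
edges = np.linspace(lo, hi, steps + 1)

# np.histogram returns (counts, edges); keep only the counts for each feature
counts_before = [np.histogram(before[:, j], bins=edges)[0] for j in range(n_features)]
counts_after = [np.histogram(after[:, j], bins=edges)[0] for j in range(n_features)]
```

One small difference from the loop version: `np.histogram` puts a value equal to the maximum into the last bin, so each count array has `steps` entries rather than `steps + 1`. Plotting the counts against `edges[:-1]` would put actual feature values on the x axis instead of bin indices.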

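Here are my visualization results.

Before standardization:

After standardization:

SVM accuracy before standardization: 0.5824915824915825

SVM accuracy after standardization: 0.8249158249158249

The gap is quite obvious.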
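The dataset behind the numbers above is not included in this post, so they cannot be reproduced here. As a rough sketch of the same kind of comparison, the block below uses sklearn's built-in wine dataset as a stand-in (an assumption, not the data from my experiment) and trains an SVM with default settings with and without standardization.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset whose features live on very different scales
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# SVM on the raw features
acc_raw = SVC().fit(X_tr, y_tr).score(X_te, y_te)

# SVM on standardized features; the pipeline fits the scaler on the training split only
acc_scaled = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr).score(X_te, y_te)

print(f"raw: {acc_raw:.3f}, standardized: {acc_scaled:.3f}")
```

The exact accuracies depend on the dataset and the split, so they should not be compared directly with the values reported above.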
