Kaggle beginner notes (Day 2: Scaling and normalization)

qq_18884827

Article link: https://blog.csdn.net/qq_18884827/article/details/79827561

Before getting into this part, a few words on the difference between scaling and normalization.

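I. Scaling covers two things: Standardization and Centering. (Strictly speaking, centering means subtracting the mean; in these notes the term refers to rescaling values into a fixed range, as in min-max scaling.)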

1. Standardization:

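newX = (X - mean) / standard deviation. newX has mean 0 and variance 1, and the result can be used to spot outliers. In Python (scikit-learn) the functions are preprocessing.scale and preprocessing.StandardScaler. The difference is that preprocessing.StandardScaler stores the mean and standard deviation computed on the training set, so that exactly the same transformation can then be applied to the test set.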
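
A minimal sketch of that difference, assuming scikit-learn is installed. The toy arrays X_train and X_test are made up for illustration and are not part of the Kaggle data.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (hypothetical): one feature, separate training and test sets
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[3.0], [6.0]])

# preprocessing.scale standardizes one array in a single step and keeps no state
X_train_scaled = preprocessing.scale(X_train)

# StandardScaler learns the mean and standard deviation from the training set...
scaler = preprocessing.StandardScaler().fit(X_train)
# ...and reuses those same statistics on the test set
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)  # statistics learned from X_train
print(X_train_scaled.ravel())       # mean 0, variance 1
print(X_test_scaled.ravel())        # scaled with the training statistics
```

Because the test set is transformed with the training statistics, its own mean and variance will generally not be exactly 0 and 1.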

2. Centering (range rescaling):

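newX = (X - min) / (max - min), so newX lies in [0, 1]. Use preprocessing.minmax_scale or preprocessing.MinMaxScaler; the difference is that preprocessing.MinMaxScaler can be fitted on the training set and then applied to the test set.

newX = X / max(|X|), so newX lies in [-1, 1]. Use preprocessing.maxabs_scale or preprocessing.MaxAbsScaler; the difference is the same as above.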
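
The two formulas can be checked by hand with NumPy. This is only a sketch with made-up values (x_train and x_test are hypothetical); it computes the formulas directly rather than calling the scikit-learn functions.

```python
import numpy as np

# Toy feature column (hypothetical values)
x_train = np.array([-4.0, 0.0, 2.0, 6.0])
x_test = np.array([1.0, 8.0])

# Min-max scaling: (X - min) / (max - min), using the training min and max
x_min, x_max = x_train.min(), x_train.max()
train_minmax = (x_train - x_min) / (x_max - x_min)
test_minmax = (x_test - x_min) / (x_max - x_min)  # can fall outside [0, 1]

# Max-abs scaling: X / max(|X|), which leaves zeros at zero
max_abs = np.abs(x_train).max()
train_maxabs = x_train / max_abs

print(train_minmax)
print(test_minmax)
print(train_maxabs)
```

Note that a test value larger than the training maximum (8.0 here) maps to a value above 1, which is expected when the scaling parameters come from the training set only.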

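II. Normalization: this scales each individual sample to unit norm (so that the norm of every sample is 1). It can be done with preprocessing.normalize() or preprocessing.Normalizer(). The norm parameter of preprocessing.normalize() selects which norm to use ('l1', 'l2', or 'max').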

The p-norm is defined as ||X||_p = (|x1|^p + |x2|^p + ... + |xn|^p)^(1/p); the l1 and l2 norms are the cases p = 1 and p = 2.
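
As an illustration, the p-norm formula can be applied row by row with NumPy. The helper normalize_rows is a hypothetical name written for this sketch; for p = 1 and p = 2 its output should match what preprocessing.normalize produces with norm='l1' and norm='l2'.

```python
import numpy as np

# Two samples (rows), hypothetical values
X = np.array([[3.0, 4.0],
              [1.0, -1.0]])

def normalize_rows(X, p):
    """Scale each row of X to unit p-norm (rows must be non-zero)."""
    norms = np.sum(np.abs(X) ** p, axis=1, keepdims=True) ** (1.0 / p)
    return X / norms

X_l1 = normalize_rows(X, 1)  # absolute values of each row sum to 1
X_l2 = normalize_rows(X, 2)  # each row has Euclidean length 1

print(X_l1)
print(X_l2)
```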

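Standardization and Centering (range rescaling) usually give fairly similar results. As a rule of thumb, standardization is the recommended default, range rescaling is commonly used when training with gradient descent, and Normalization is used for text classification and clustering.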

1. Get our environment set up

# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling (rescales the data)
from mlxtend.preprocessing import minmax_scaling

# plotting modules 
import seaborn as sns
import matplotlib.pyplot as plt

# read in all our data
kickstarters_2017 = pd.read_csv("../input/kickstarter-projects/ks-projects-201801.csv")

# set seed for reproducibility
np.random.seed(0)

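1.1. scipy.stats contains functions for many probability distributions: more than 80 continuous random variables and 10 discrete random variables are implemented with these classes. For example, scipy.stats.norm is the normal distribution. The complete list can be obtained with info(stats), or with Python's built-in help(stats).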

Methods available on these distribution objects:

Method	Description
rvs(loc=0, scale=1, size=1, random_state=None)	Random variates.
pdf(x, loc=0, scale=1)	Probability density function evaluated at x.
logpdf(x, loc=0, scale=1)	Log of the probability density function.
cdf(x, loc=0, scale=1)	Cumulative distribution function: the probability up to x, i.e. the area under the density to the left of x.
logcdf(x, loc=0, scale=1)	Log of the cumulative distribution function.
sf(x, loc=0, scale=1)	Survival function (also defined as 1 - cdf, but sf is sometimes more accurate).
logsf(x, loc=0, scale=1)	Log of the survival function.
ppf(q, loc=0, scale=1)	Percent point function: given a cumulative probability q (an area under the density), returns x (inverse of cdf; percentiles).
isf(q, loc=0, scale=1)	Inverse survival function (inverse of sf).
moment(n, loc=0, scale=1)	Non-central moment of order n.
stats(loc=0, scale=1, moments='mv')	Mean ('m'), variance ('v'), skew ('s'), and/or kurtosis ('k').
entropy(loc=0, scale=1)	(Differential) entropy of the RV.
fit(data, loc=0, scale=1)	Parameter estimates for generic data.
expect(func, args=(), loc=0, scale=1, lb=None, ub=None, conditional=False, **kwds)	Expected value of a function (of one argument) with respect to the distribution.
median(loc=0, scale=1)	Median of the distribution.
mean(loc=0, scale=1)	Mean of the distribution.
var(loc=0, scale=1)	Variance of the distribution.
std(loc=0, scale=1)	Standard deviation of the distribution.
interval(alpha, loc=0, scale=1)	Endpoints of the range that contains a fraction alpha (between 0 and 1) of the distribution.

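Here loc is the location parameter and scale is the scale parameter. For scipy.stats.norm, loc is the mean and scale is the standard deviation (not the variance).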
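
A short sketch of a few of these methods on scipy.stats.norm. The values loc=10 and scale=2 are arbitrary and chosen only for illustration.

```python
from scipy import stats

# Normal distribution with mean 10 and standard deviation 2 (arbitrary values)
dist = stats.norm(loc=10, scale=2)

# scale is the standard deviation, so the variance is scale squared
print(dist.mean(), dist.std(), dist.var())

p = dist.cdf(12)         # P(X <= 12), one standard deviation above the mean
x = dist.ppf(p)          # ppf inverts cdf, so this recovers 12
density = dist.pdf(10)   # height of the density at the mean
samples = dist.rvs(size=5, random_state=0)  # five random draws

print(p, x, density)
print(samples)
```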

1.2. import seaborn as sns

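Seaborn is a higher-level API wrapper built on top of matplotlib that makes plotting easier. Seaborn tends to produce more attractive plots out of the box, while matplotlib allows more customized plots. Seaborn is best treated as a complement to matplotlib.

So what is the difference between plotting with Pandas and plotting with Seaborn?

Both use matplotlib under the hood, but their designs differ considerably:

For simple plots, use Pandas directly; for more attractive and richer plots, use Seaborn.
Pandas plotting functions expose relatively few parameters for adjusting a figure, so you need a good understanding of matplotlib.
Seaborn plotting functions expose many parameters for adjusting a figure, so you do not need as deep an understanding of matplotlib.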

This blog post covers some of the differences: https://www.cnblogs.com/kylinlin/p/5236601.html

2. The difference between scaling and normalizing

The difference is that, in scaling, you're changing the *range* of your data while in normalization you're changing the *shape of the distribution* of your data

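2.1. Scaling

Scaling transforms your data so that it fits within a specific range, such as 0-100 or 0-1. You need it when you use methods based on how far apart data points are, such as support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of 1 in any numeric feature is given the same importance.

For example, you might look at the price of a product in either US dollars or yen. One US dollar is worth about 100 yen, but if you do not scale your data, SVM or KNN will treat a difference of 1 yen as being just as important as a difference of 1 US dollar. With currencies you can convert between units, but what about height and weight? How many centimeters equal one kilogram?

Scaling lets you compare different variables on an equal footing.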
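
To make the point concrete, here is a small sketch with made-up products (price in yen and a rating from 1 to 5). It uses plain Euclidean distance, as a KNN-style method would, and hand-written helper functions (minmax and dist are names chosen for this example).

```python
import numpy as np

# Hypothetical products: column 0 is price in yen, column 1 is a 1-5 rating
X = np.array([[1000.0, 1.0],
              [1200.0, 5.0],
              [1500.0, 1.5],
              [5000.0, 3.0]])

def minmax(X):
    """Rescale every column to the range [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def dist(a, b):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))

X_scaled = minmax(X)

# Distances from product 0 to products 1 and 2, before and after scaling
raw = [dist(X[0], X[i]) for i in (1, 2)]
scaled = [dist(X_scaled[0], X_scaled[i]) for i in (1, 2)]

print(raw)     # price dominates: product 1 looks closest
print(scaled)  # rating now counts too: product 2 is closest
```

Before scaling, the 4-point rating gap between products 0 and 1 is invisible next to the price difference; after scaling, both features contribute on comparable terms and the nearest neighbor changes.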

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size = 1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])

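# plot both together to compare
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement)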
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

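mlxtend feels a lot like sklearn, and its methods are similar too, but mlxtend seems to be used less often. Additions from anyone who knows more are welcome.

subplots creates a figure with a grid of subplots (the arguments 1, 2 mean one row and two columns), so several plots can be drawn side by side.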

2.2. Normalization

Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.

In general, you only need to normalize your data if you plan to use a statistical or machine learning technique that assumes the data is normally distributed. Examples include t-tests, ANOVA, and linear regression.

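# normalize the exponential data with boxcox
# stats.boxcox returns a tuple (transformed_data, fitted_lambda); the input must be strictly positive
normalized_data = stats.boxcox(original_data)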

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")
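
Without plotting, one rough way to check that Box-Cox made the data more symmetric is to compare the skewness before and after. This is a sketch on freshly generated data (not the Kaggle dataset), and the use of stats.skew as the check is my own choice rather than something from the tutorial.

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive sample data (Box-Cox requires positive values)
rng = np.random.default_rng(0)
data = rng.exponential(size=1000)

# boxcox returns the transformed data and the lambda it fitted
transformed, fitted_lambda = stats.boxcox(data)

skew_before = stats.skew(data)
skew_after = stats.skew(transformed)

print("fitted lambda:", fitted_lambda)
print("skewness before:", skew_before)
print("skewness after:", skew_after)
```

A skewness close to 0 suggests a roughly symmetric distribution, which is consistent with the bell shape seen in the right-hand plot.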