Machine Learning: Data Scaling
You will likely come across datasets whose numerical features are hard to compare directly, for example because some features have very high variance or because features are measured on different scales, so good preprocessing is a must before you even think about machine learning. A good preprocessing solution for this type of problem is often referred to as standardization.
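Standardization is a preprocessing method that rescales continuous data so that each feature has a mean of 0 and a standard deviation of 1, the scale of a standard normal distribution. It changes the location and scale of a feature, not the shape of its distribution. In scikit-learn this is often a necessary step because many models assume that the data you are training on is roughly normally distributed and on a comparable scale, and if it isn't, you risk biasing your model.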
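As a rough sketch of what this rescaling does to a single feature, the usual standard-score formula can be written as follows. The symbols here are notation introduced for illustration, not taken from the article: $x$ is one value of the feature, $\mu$ is the feature's mean, and $\sigma$ is its standard deviation.

```latex
z = \frac{x - \mu}{\sigma}
```

After this transformation the feature has mean 0 and standard deviation 1, whatever its original units were.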
You can standardize your data in different ways, and in this article we're going to talk about one popular method: data scaling, or, to be more precise, standard scaling.
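A minimal sketch of standard scaling with scikit-learn's `StandardScaler` might look like the following. The toy array `X` is a hypothetical example made up for illustration, with two features on very different scales; it is not data from the article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (an assumption for illustration):
# column 0 is on a scale of units, column 1 on a scale of thousands.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

# Learn each column's mean and standard deviation, then rescale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

In a real workflow the scaler would typically be fitted on the training split only and then applied to the test split with `scaler.transform`, so that no information from the test data leaks into the preprocessing step.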
It’s also important to note that standardization is a preprocessing method applied to continuous, numerical data, and there are a few different scenarios in which you want to use it:
- When working with any kind of model that uses a linear distance metric or operates on a linear space, such as KNN, linear regression, or K-means
- When a feature or features in your dataset have high variance: this could bias a model that assumes the data is normally distributed, if a feature has a variance that’s an order