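I. Data preprocessing
1. Mean removal (standardization: applied to each column so that its mean is 0 and its standard deviation is 1) (age and monthly-salary example)
After mean removal, all feature columns share the same scale (mean 0, standard deviation 1), so columns with large raw values such as salary no longer dominate columns with small values such as age, and models tend to fit the samples better.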
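1. Principle:
Mean of 0: subtract the column mean from every value in that column.
Standard deviation of 1: divide the centered values by the column's standard deviation.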
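The two steps above can be sketched by hand with NumPy. This is a minimal illustration, assuming the same sample array as this section's code example and the population standard deviation (NumPy's default); the variable names are chosen here for illustration.

```python
import numpy as np

# Sample matrix: each column is one feature
raw_samples = np.array([
    [17., 90., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

# Step 1: subtract each column's mean so that every column has mean 0
col_mean = raw_samples.mean(axis=0)
centered = raw_samples - col_mean

# Step 2: divide by each column's standard deviation so that every column has std 1
col_std = raw_samples.std(axis=0)
standardized = centered / col_std

print(standardized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```

The printed means should be 0 up to floating-point rounding, and the standard deviations should be 1.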
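2. API for mean removal: sp.scale(array) returns the standardized array (sp is sklearn.preprocessing).
3. Code example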
import numpy as np
import sklearn.preprocessing as sp
# Sample data
raw_samples = np.array([
    [17., 90., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])
# Apply mean removal
result = sp.scale(raw_samples)
print(result)
# axis=0 computes the statistic down each column; mean removal is also applied per column
print(result.mean(axis=0), result.std(axis=0))
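Output: the standardized array, followed by the per-column means (0 up to floating-point rounding) and the per-column standard deviations (1).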
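When the same scaling has to be reused on data that arrives later, scikit-learn also provides the StandardScaler class, which stores the column statistics it learned. The sketch below is one possible way to use it; the new sample is made up for illustration and is not part of the original notes.

```python
import numpy as np
import sklearn.preprocessing as sp

train = np.array([
    [17., 90., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

# Hypothetical new sample, invented for this illustration
new_sample = np.array([[21., 85., 4500.]])

scaler = sp.StandardScaler()
scaler.fit(train)                       # learns the per-column mean and std
train_scaled = scaler.transform(train)  # same values as sp.scale(train)
new_scaled = scaler.transform(new_sample)  # uses the statistics learned from train

print(scaler.mean_)
print(new_scaled)
```

The new sample is scaled with the training columns' mean and standard deviation, not with its own.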
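2. Range scaling: rescale each column to the range 0-1
1. Principle: for each column, subtract the column minimum and divide by (maximum - minimum), so the smallest value becomes 0 and the largest becomes 1.
2. API for range scaling: sp.MinMaxScaler(feature_range=(min, max)), then fit_transform(array).
3. Code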
import numpy as np
import sklearn.preprocessing as sp
# Sample data
raw_samples = np.array([
    [17., 90., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])
# Specify the target range
mms = sp.MinMaxScaler(feature_range=(0, 1))
# Fit the scaler and apply range scaling
result = mms.fit_transform(raw_samples)
print(result)
Result: every column now spans 0 to 1; for example, the first column (17, 20, 23) becomes 0, 0.5, 1.
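As a cross-check, range scaling to 0-1 can be written by hand and compared with MinMaxScaler. This is a small sketch that assumes no column is constant (otherwise the division below would be by zero); the names manual and library are illustrative.

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [17., 90., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

# Per-column minimum and maximum
col_min = raw_samples.min(axis=0)
col_max = raw_samples.max(axis=0)

# (x - min) / (max - min): smallest value -> 0, largest value -> 1
manual = (raw_samples - col_min) / (col_max - col_min)

library = sp.MinMaxScaler(feature_range=(0, 1)).fit_transform(raw_samples)
print(np.allclose(manual, library))
```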
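3. Normalization (expresses each feature as a share of its own sample, which makes it possible to judge how similar samples are)
1. Concept: each row (sample) is divided by its own norm; with the l1 norm used here, that is the sum of the absolute values in the row, so the absolute values in every row sum to 1.
2. API: sp.normalize(array, norm='l1'); norm='l2' divides by the Euclidean length instead.
3. Code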
import sklearn.preprocessing as sp
import numpy as np
ary = np.array([[10, 21, 5],
                [2, 4, 1],
                [11, 18, 18]])
# Normalize each row with the l1 norm
result = sp.normalize(ary, norm='l1')
print(result)
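To make the row-wise behaviour concrete, here is a minimal sketch that reproduces the l1 result by hand and also shows the l2 variant. It assumes the same array as above; the variable names are illustrative.

```python
import numpy as np
import sklearn.preprocessing as sp

ary = np.array([[10, 21, 5],
                [2, 4, 1],
                [11, 18, 18]])

# l1: divide every row by the sum of its absolute values
l1_manual = ary / np.abs(ary).sum(axis=1, keepdims=True)
l1_library = sp.normalize(ary, norm='l1')

# l2: divide every row by its Euclidean length
l2_library = sp.normalize(ary, norm='l2')

print(l1_manual.sum(axis=1))               # each row sums to 1
print(np.linalg.norm(l2_library, axis=1))  # each row has length 1
```

Note that normalization works along rows (samples), whereas mean removal and range scaling work along columns (features).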
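4. Binarization (every value becomes 0 or 1)
1. Concept: values greater than a chosen threshold become 1; all other values become 0.
2. API: sp.Binarizer(threshold=...), then transform(array).
3. Code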
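import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([[14., 2., 34.],
                        [12., 432., 1.],
                        [12., 4., 23.]])
# Threshold: values above 20 become 1, the rest become 0
# (named binarizer to avoid shadowing the built-in bin())
binarizer = sp.Binarizer(threshold=20)
result = binarizer.transform(raw_samples)
print(result)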
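Result: only 34, 432 and 23 exceed the threshold of 20, so the output is [[0, 0, 1], [0, 1, 0], [0, 0, 1]].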
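The same thresholding can be expressed as a plain NumPy comparison, which may help show what Binarizer does. A minimal sketch, assuming the same data and threshold as above:

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([[14., 2., 34.],
                        [12., 432., 1.],
                        [12., 4., 23.]])
threshold = 20

# Strictly greater than the threshold -> 1.0, otherwise 0.0
manual = (raw_samples > threshold).astype(float)

library = sp.Binarizer(threshold=threshold).fit_transform(raw_samples)
print(np.array_equal(manual, library))
```

A value exactly equal to the threshold maps to 0, because the comparison is strict.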
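5. One-hot encoding
1. Concept: each distinct value in a column gets its own 0/1 output column, and every sample has exactly one 1 among the output columns derived from a given input column.
2. Use case: categorical (typically non-numeric) features that have to be converted to numbers before training.
3. API: sp.OneHotEncoder(...), then fit_transform(array).
4. Code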
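import numpy as np
import sklearn.preprocessing as sp
samples = np.array([['你哦好', '你qw好', '你哦好'],
                    ['你a好', '你c好', '你哦好']])
# One-hot encoding
# Create the one-hot encoder; sparse_output=False returns a dense array
# (this parameter was called `sparse` before scikit-learn 1.2)
ohe = sp.OneHotEncoder(sparse_output=False, dtype=int)
result = ohe.fit_transform(samples)
print(result)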
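To see which output column corresponds to which original value, the fitted encoder's categories_ attribute can be inspected, and inverse_transform recovers the original strings. The sketch below avoids the dense/sparse constructor argument altogether by converting the default sparse result with toarray(); treat it as one possible way to explore the encoder rather than the only one.

```python
import numpy as np
import sklearn.preprocessing as sp

samples = np.array([['你哦好', '你qw好', '你哦好'],
                    ['你a好', '你c好', '你哦好']])

ohe = sp.OneHotEncoder(dtype=int)
# The default result is a SciPy sparse matrix; toarray() makes it dense
encoded = ohe.fit_transform(samples).toarray()

# One list of sorted categories per input column;
# the output columns follow this order, input column by input column
for i, cats in enumerate(ohe.categories_):
    print(i, list(cats))

print(encoded)

# Map the 0/1 rows back to the original values
decoded = ohe.inverse_transform(encoded)
print(decoded)
```

Here the three input columns have 2, 2 and 1 distinct values, so the encoded array has 5 columns.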
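6. Label encoding
1. Concept: each distinct label is mapped to an integer code (0, 1, 2, ...), and inverse_transform maps codes back to labels.
2. API: sp.LabelEncoder(), with fit_transform(labels) and inverse_transform(codes).
3. Code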
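import numpy as np
import sklearn.preprocessing as sp
# Raw labels
raw_sample = np.array(['q', 'w', 'e', 'r', 't'])
# Create the label encoder (labels must be encoded before training)
lbe = sp.LabelEncoder()
# Fit the encoder and transform the labels to integer codes
result = lbe.fit_transform(raw_sample)
print(result)
# Suppose these are the codes predicted after training
test = [4, 1, 2, 3, 2]
# Convert the codes back to the original labels
inv = lbe.inverse_transform(test)
print(inv)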
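The integer assigned to each label follows the sorted order of the distinct labels, which the fitted encoder exposes as classes_. A short sketch for inspecting that mapping, using the same labels as above (the mapping dictionary is only an illustration):

```python
import numpy as np
import sklearn.preprocessing as sp

raw_sample = np.array(['q', 'w', 'e', 'r', 't'])

lbe = sp.LabelEncoder()
codes = lbe.fit_transform(raw_sample)

# classes_ holds the distinct labels in sorted order;
# the position of a label in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(lbe.classes_)}
print(mapping)
print(codes)
```

Because the codes come from sorted order, 'e' is encoded as 0 even though it appears third in the input.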