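1 Basic concepts
1.1 Min-max (proportional) scaling: MinMaxScaler
1.2 Standardization (zero mean, unit standard deviation): StandardScaler
1.3 Scaling sparse matrices: MaxAbsScaler (scalers that shift the data are unsuitable for matrices with many zeros)
Example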
import numpy as np
from sklearn.preprocessing import MinMaxScaler # min-max scaling
# 1 load the data
x_train = np.array([
[1,2,5],
[-1,5,2],
[-2,4,8],
[3,-5,-7]
])
# 2 choose the method
nms=MinMaxScaler()
# 3 fit, transform and print the result
y=nms.fit_transform(x_train)
print(y)
·································································
[[0.6 0.7 0.8]
[0.2 1. 0.6]
[0. 0.9 1. ]
[1. 0. 0. ]]
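The outline lists StandardScaler and MaxAbsScaler without an example. The sketch below reuses the x_train array from the example above and compares each scaler with its formula written out by hand. The hand-written formulas, the small sparse matrix, and the variable names are illustrative additions, not part of the original notes.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler

x_train = np.array([
    [1, 2, 5],
    [-1, 5, 2],
    [-2, 4, 8],
    [3, -5, -7],
], dtype=float)

# Min-max scaling: (x - min) / (max - min), computed per column
col_min = x_train.min(axis=0)
col_max = x_train.max(axis=0)
manual_minmax = (x_train - col_min) / (col_max - col_min)
sk_minmax = MinMaxScaler().fit_transform(x_train)

# Standardization: (x - mean) / std, per column (population std, ddof=0)
manual_std = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)
sk_std = StandardScaler().fit_transform(x_train)

# Max-abs scaling: x / max(|x|), per column; nothing is shifted, so zeros stay zero
x_sparse = sparse.csr_matrix([[0.0, 2.0], [0.0, -4.0], [1.0, 0.0]])
sk_maxabs = MaxAbsScaler().fit_transform(x_sparse)

print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_std, sk_std))
print(sk_maxabs.toarray())
```

Because max-abs scaling only divides each column by a constant, the result for the sparse input is still a sparse matrix.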
2 Missing-value imputation
See the official documentation at scikit-learn.org.
Example: filling missing values with the KNN algorithm
import numpy as np
from sklearn.impute import KNNImputer
nan=np.nan
x=np.array([[1,3,nan],[2,nan,4],[nan,6,5],[8,8,7]])
imputer=KNNImputer(n_neighbors=2,weights='uniform')
print(imputer.fit_transform(x))
Output:
[[1. 3. 4.5]
[2. 4.5 4. ]
[5. 6. 5. ]
[8. 8. 7. ]]
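To see where the filled values come from, it can help to look at the distances KNNImputer works with. The sketch below assumes the default metric, which is available as nan_euclidean_distances: it skips coordinates that are missing in either row and rescales by the number of features divided by the number of coordinates present in both rows.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan
x = np.array([[1, 3, nan], [2, nan, 4], [nan, 6, 5], [8, 8, 7]])

# Pairwise distances between rows, ignoring missing coordinates
dist = nan_euclidean_distances(x, x)
print(np.round(dist, 3))

# Row 0 is missing column 2. Its two nearest rows that have a value in
# column 2 are rows 1 and 2, so the fill is the mean of 4 and 5.
filled = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(x)
print(filled[0, 2])
```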
Comparing numpy and pandas:
import numpy as np
A=np.random.randint(-1,10,20).reshape([5,4])
A=np.array(A,float)
A[A<0]=np.nan
print(A)
Output (the values are random, so they differ from run to run):
[[ 3. 0. 8. nan]
[ 8. 2. 8. 2.]
[ 5. 9. 2. 4.]
[ 6. 3. 1. 6.]
[ 9. nan 5. 3.]]
import numpy as np
import pandas as pd
A=np.random.randint(-1,10,40).reshape([5,8])
A=np.array(A,float)
A[A<0]=np.nan
# convert the numpy array to a pandas DataFrame
P=pd.DataFrame(A)
file_path=r"C:\Users\yanruyu\Desktop\simulation\.vscode\缺失值数组.csv"
P.to_csv(file_path) # with index=False (or index=0) the row index is not written as the first column
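The effect of the index argument is easy to check without touching the disk. This is a minimal sketch that writes to an in-memory string instead of the file path used above; the tiny two-row DataFrame is made up for illustration.

```python
import io
import numpy as np
import pandas as pd

P = pd.DataFrame(np.array([[3.0, np.nan], [8.0, 2.0]]))

# Default: the row index is written as an extra first column
with_index = P.to_csv()
# index=False: only the data columns are written
without_index = P.to_csv(index=False)

back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))
print(back_with.shape)     # one extra column holding the old index
print(back_without.shape)  # same shape as P
```

Missing values are written as empty fields and come back as NaN when the file is read again.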
pandas
Filling missing values:
1 Filling missing values with the mean:
Example:
import pandas as pd
file_path=r" "
fr=pd.read_csv(file_path)
fr2=fr.fillna(fr.mean())
print(fr2)
Univariate (single-feature) imputation uses SimpleImputer
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
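# 1 load the data
file_path=r""
fr=pd.read_csv(file_path)
# 2 choose the imputation method
si=SimpleImputer(missing_values=np.nan,strategy='mean')
# 3 fit the imputer and transform the data
fr2=si.fit_transform(fr) # by default the result is a numpy array, whatever the input type was
print(fr2)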
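Difference:
fr.fillna(fr.mean()) only fills numeric columns, because a mean exists only for numbers. SimpleImputer can also fill with the mode (strategy='most_frequent'), which works on data containing strings. KNNImputer, in contrast, needs numeric input because it relies on distances between rows.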
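As a rough illustration of the difference, the sketch below fills a small mixed DataFrame both ways. The column names and values are invented for the example, and numeric_only=True is passed so that the mean is computed for the numeric column only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

fr = pd.DataFrame({
    'height': [1.0, np.nan, 3.0, 3.0],
    'color': ['red', np.nan, 'red', 'blue'],
})

# pandas: a column mean exists only for the numeric column,
# so the string column keeps its missing value
by_mean = fr.fillna(fr.mean(numeric_only=True))

# SimpleImputer: the most frequent value works for numbers and strings alike
by_mode = SimpleImputer(strategy='most_frequent').fit_transform(fr)

print(by_mean)
print(by_mode)
```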
Univariate imputation, including sparse matrices
import numpy as np
from sklearn.impute import SimpleImputer
si=SimpleImputer(missing_values=np.nan,strategy='mean')
si.fit([[1,2],[np.nan,3],[7,6]]) # the data passed to fit() is typed in by hand; when fitting and transforming the same data, the two steps can be combined as fit_transform()
x=np.array([[np.nan,2],[6,np.nan],[7,6]])
print(si.transform(x))
······································
Output:
[[4. 2. ]
[6. 3.66666667]
[7. 6. ]]
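The filled values are the column means learned during fit, which the fitted imputer exposes as statistics_. The sketch below also shows that fit_transform on x alone gives a different result, because the means are then computed from x itself rather than from the data passed to fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

si = SimpleImputer(missing_values=np.nan, strategy='mean')
si.fit([[1, 2], [np.nan, 3], [7, 6]])

# Column means learned during fit: (1 + 7) / 2 and (2 + 3 + 6) / 3
print(si.statistics_)

# fit_transform fits and transforms the same array in one call,
# so the means now come from x: (6 + 7) / 2 and (2 + 6) / 2
x = np.array([[np.nan, 2], [6, np.nan], [7, 6]])
combined = SimpleImputer(strategy='mean').fit_transform(x)
print(combined)
```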
SimpleImputer also supports sparse matrices
import scipy.sparse as sp
from sklearn.impute import SimpleImputer
x=sp.csc_matrix([[1,2],[0,-1],[8,4]])
imp = SimpleImputer(missing_values=-1,strategy='mean') # -1 marks a missing value here
imp.fit(x)
x_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
print(imp.transform(x_test).toarray())
···············································
Output:
[[3. 2.]
[6. 3.]
[7. 6.]]
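Two details of the sparse case are worth checking: the zeros that a sparse matrix does not store still count as ordinary values when the mean is computed, and the transformed result is itself sparse (hence the toarray() call above). A minimal sketch with the same training matrix:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.impute import SimpleImputer

x = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
out = imp.fit_transform(x)

# Column 0: the unstored 0 is included, (1 + 0 + 8) / 3
# Column 1: the -1 is treated as missing, (2 + 4) / 2
print(imp.statistics_)
print(sp.issparse(out))
print(out.toarray())
```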
SimpleImputer also supports string data and pandas categorical data
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df=pd.DataFrame([['a','x'],
[np.nan,'y'],
['a',np.nan],
['b','y']],dtype='category') # dtype="category" marks the columns as categorical
imp=SimpleImputer(missing_values=np.nan,strategy='most_frequent') # fill with the mode
df_transformed=imp.fit_transform(df)
print(df_transformed)
··································
Output:
[['a' 'x']
['a' 'y']
['a' 'y']
['b' 'y']]
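Besides the mode, string data can be filled with a fixed placeholder. This is a small sketch using strategy='constant'; the placeholder text 'unknown' and the object-typed array are chosen for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([['a', 'x'], [np.nan, 'y'], ['a', np.nan]], dtype=object)

# Every missing entry is replaced by the same fill_value
imp = SimpleImputer(strategy='constant', fill_value='unknown')
filled = imp.fit_transform(x)
print(filled)
```

A constant fill keeps "missing" visible as its own category, whereas the mode hides it among the most common values.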
Multivariate imputation of missing values
import pandas as pd
from sklearn.experimental import enable_iterative_imputer # must be imported first: IterativeImputer is still experimental
from sklearn.impute import IterativeImputer
file_name=r""
df=pd.read_csv(file_name)
imp=IterativeImputer(max_iter=10,random_state=0)
df_transformed=imp.fit_transform(df)
print(df_transformed)
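This is smarter than univariate imputation: each feature with missing values is estimated from the other features rather than from its own column alone.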
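Since the file name above is left blank, here is a self-contained sketch on toy data instead. The arrays are made up so that the second column is roughly twice the first, which gives the imputer a relationship between features to learn; the names x_fit, x_new and x_filled are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come before the next import
from sklearn.impute import IterativeImputer

# Toy data in which the second column is roughly twice the first
x_fit = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(x_fit)

# Missing entries are predicted from the other column
x_new = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])
x_filled = imp.transform(x_new)
print(np.round(x_filled))
```

Observed entries pass through unchanged; only the nan positions are replaced by predictions.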
Filling missing values with KNN
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
s=[[1,2,nan],[3,4,3],[nan,6,5],[8,8,7]]
imp=KNNImputer(n_neighbors=3,weights='uniform')
ss=imp.fit_transform(s)
print(ss)
·······················
Output:
[[1. 2. 5.]
[3. 4. 3.]
[4. 6. 5.]
[8. 8. 7.]]
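With only four rows, n_neighbors=3 means every other row contributes, so each fill is simply the mean of the remaining values in that column. The sketch below compares this with n_neighbors=2 on the same data, where only the two closest rows are averaged.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
s = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# 3 neighbours: all three other rows are used
k3 = KNNImputer(n_neighbors=3, weights='uniform').fit_transform(s)
# 2 neighbours: only the two closest rows are used
k2 = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(s)

print(k3[0, 2], k3[2, 0])  # mean(3, 5, 7) and mean(1, 3, 8)
print(k2[0, 2], k2[2, 0])  # mean(3, 5) and mean(3, 8)
```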
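Marking missing values
nan is commonly used to mark missing values (it acts as a placeholder); nan is a floating-point value.
Missing values can also be marked with an integer placeholder, such as -1 in the example here.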
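Because nan is a float, an integer array cannot hold it, which is why the earlier numpy example converts with np.array(A, float) before assigning nan. A short sketch of that behaviour, with made-up values:

```python
import numpy as np

print(type(np.nan))      # nan is a Python float
print(np.nan == np.nan)  # False: nan never compares equal, so test with np.isnan
print(np.isnan(np.nan))

ints = np.array([1, -1, 3])
try:
    ints[1] = np.nan     # an integer array cannot store nan
except ValueError as err:
    print('ValueError:', err)

# Convert to float first, then mark the negative entries as missing
floats = ints.astype(float)
floats[floats < 0] = np.nan
print(floats)
```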
Marking missing values with MissingIndicator
from sklearn.impute import MissingIndicator
import numpy as np
x=np.array([[-1,-1,1,3],[4,-1,0,-1],[8,-1,1,0]])
indicator=MissingIndicator(missing_values=-1)
mask_missing_values=indicator.fit_transform(x)
mask_missing_values
····································
Output:
array([[ True, True, False],
[False, True, True],
[False, True, False]])
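An entry is True where the value is missing and False otherwise. By default only the columns that contain at least one missing value are returned, which is why the mask has three columns although x has four.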
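The sketch below contrasts the default behaviour with features='all', which returns one mask column per input column. It reuses the same x; the names mask_some and mask_all are illustrative.

```python
import numpy as np
from sklearn.impute import MissingIndicator

x = np.array([[-1, -1, 1, 3], [4, -1, 0, -1], [8, -1, 1, 0]])

# Default features='missing-only': column 2 contains no -1, so it is left out
some = MissingIndicator(missing_values=-1)
mask_some = some.fit_transform(x)
print(some.features_)   # indices of the columns that made it into the mask
print(mask_some.shape)

# features='all': the mask has the same shape as x
mask_all = MissingIndicator(missing_values=-1, features='all').fit_transform(x)
print(mask_all.shape)
```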