Data Analysis 1: Missing Value Imputation

1 Basic concepts

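1.1 Proportional (min-max) scaling: MinMaxScaler
1.2 Standard-deviation scaling: StandardScaler
1.3 Scaling sparse matrices: MaxAbsScaler (the two scalers above are a poor fit for matrices that are mostly zeros)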

Example

import numpy as np
from sklearn.preprocessing import MinMaxScaler  # proportional (min-max) scaling
# 1. Load the data
x_train = np.array([
    [1,2,5],
    [-1,5,2],
    [-2,4,8],
    [3,-5,-7]
    ])
# 2. Choose the method
nms=MinMaxScaler()
# 3. Fit, transform, and print the result
y=nms.fit_transform(x_train)
print(y)
Output:
[[0.6 0.7 0.8]
 [0.2 1.  0.6]
 [0.  0.9 1. ]
 [1.  0.  0. ]]
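
For comparison, the other two scalers from the list in section 1 can be applied to the same matrix. This is only a minimal sketch: the variable names x_std and x_maxabs are made up for illustration, and the checks simply confirm what each scaler is documented to do.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

x_train = np.array([
    [1, 2, 5],
    [-1, 5, 2],
    [-2, 4, 8],
    [3, -5, -7],
], dtype=float)

# StandardScaler: subtract each column's mean, then divide by its standard deviation.
x_std = StandardScaler().fit_transform(x_train)
print(x_std.mean(axis=0))  # column means are (numerically) 0
print(x_std.std(axis=0))   # column standard deviations are 1

# MaxAbsScaler: divide each column by its largest absolute value.
# Nothing is subtracted, so zero entries stay zero.
x_maxabs = MaxAbsScaler().fit_transform(x_train)
print(np.abs(x_maxabs).max(axis=0))  # largest absolute value per column is 1
```

Because MaxAbsScaler never shifts the data, it keeps a sparse matrix sparse, which is the reason it is listed for that case.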

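2 Filling in missing values

See the official scikit-learn.org documentation for reference.
Example: filling missing values with the KNN algorithm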

import numpy as np 
from sklearn.impute import KNNImputer
nan=np.nan
x=np.array([[1,3,nan],[2,nan,4],[nan,6,5],[8,8,7]])
imputer=KNNImputer(n_neighbors=2,weights='uniform')
print(imputer.fit_transform(x))

Output:
[[1.  3.  4.5]
 [2.  4.5 4. ]
 [5.  6.  5. ]
 [8.  8.  7. ]]
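
To see where the 4.5 in the first row comes from, the neighbour search can be reproduced by hand. The sketch below assumes the imputer's default distance metric (nan_euclidean) and uses scikit-learn's nan_euclidean_distances helper. The names candidates, nearest and filled are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan
x = np.array([[1, 3, nan], [2, nan, 4], [nan, 6, 5], [8, 8, 7]])

# Pairwise distances that skip missing coordinates and rescale by
# (total number of features / number of features present in both rows).
dist = nan_euclidean_distances(x)

# Fill x[0, 2]: rows 1, 2 and 3 all have a value in column 2, so all are candidates.
candidates = [1, 2, 3]
nearest = sorted(candidates, key=lambda r: dist[0, r])[:2]

# With weights='uniform' the fill value is the plain mean of the two neighbours.
filled = x[nearest, 2].mean()
print(nearest, filled)
```

Rows 1 and 2 are the two nearest rows, and the mean of their third-column values (4 and 5) is the 4.5 shown in the output.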

Comparing numpy and pandas:

import numpy as np
A=np.random.randint(-1,10,20).reshape([5,4])
A=np.array(A,float)
A[A<0]=np.nan
print(A)
Output (the values are random, so they differ from run to run):
[[ 3.  0.  8. nan]
 [ 8.  2.  8.  2.]
 [ 5.  9.  2.  4.]
 [ 6.  3.  1.  6.]
 [ 9. nan  5.  3.]]
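
The same kind of array can be converted to a pandas DataFrame and saved as a CSV file:

import numpy as np
import pandas as pd
A=np.random.randint(-1,10,40).reshape([5,8])
A=np.array(A,float)
A[A<0]=np.nan
# Convert the array to a pandas DataFrame
P=pd.DataFrame(A)
file_path=r"C:\Users\yanruyu\Desktop\simulation\.vscode\缺失值数组.csv"
P.to_csv(file_path)  # pass index=False to leave out the index column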
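
As a small illustration of the comparison, here is one way to count the missing values per column in both libraries. It uses a fixed array rather than random data so the counts are reproducible. The array contents and variable names are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# A small fixed array instead of random data, so the counts are reproducible.
A = np.array([[3.0, 0.0, 8.0, np.nan],
              [8.0, 2.0, 8.0, 2.0],
              [9.0, np.nan, 5.0, 3.0]])

# numpy: build a boolean mask with np.isnan and sum it per column.
nan_per_column_np = np.isnan(A).sum(axis=0)

# pandas: DataFrame.isna() gives the same mask, with labelled columns.
P = pd.DataFrame(A)
nan_per_column_pd = P.isna().sum()

print(nan_per_column_np)
print(nan_per_column_pd.to_numpy())
```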

pandas

Filling missing values:

1 Filling missing values with the mean:

Example:

import pandas as pd 
file_path=r" "
fr=pd.read_csv(file_path)
fr2=fr.fillna(fr.mean())
print(fr2)
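
Since the file path above is left blank, the same call can be tried on a small in-memory DataFrame. The column names and values below are hypothetical stand-ins for the CSV contents.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data standing in for the CSV file.
fr = pd.DataFrame({"a": [1.0, np.nan, 7.0], "b": [2.0, 3.0, np.nan]})

# fr.mean() returns one mean per column (NaN is skipped);
# fillna then fills each column with its own mean.
fr2 = fr.fillna(fr.mean())
print(fr2)
```

Here the missing value in column a becomes 4.0 (the mean of 1 and 7) and the one in column b becomes 2.5 (the mean of 2 and 3).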

Univariate imputation (filling each feature from its own values only) uses SimpleImputer:

import pandas as pd  
import numpy as np
from sklearn.impute import SimpleImputer
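# 1. Load the data
file_path=r""
fr=pd.read_csv(file_path)
# 2. Tell the program which method to use
si=SimpleImputer(missing_values=np.nan,strategy='mean')
# 3. Fit the imputer on the data and transform it
fr2=si.fit_transform(fr)  # whatever the input type, the result is a numpy array
print(fr2)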
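Difference:

Filling with fr.mean() in pandas only works for numeric columns. SimpleImputer can also fill with the mode (strategy='most_frequent'), which works on data containing strings. KNNImputer, by contrast, needs numeric input.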
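
Because SimpleImputer hands back a plain numpy array by default, the column labels are lost. One possible way to restore them is sketched below. The data is hypothetical, and the approach assumes no column is entirely NaN (SimpleImputer drops such columns, so the shapes would no longer match).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for the CSV file.
fr = pd.DataFrame({"a": [1.0, np.nan, 7.0], "b": [2.0, 3.0, 6.0]})

si = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = si.fit_transform(fr)  # plain numpy array, labels are gone

# Rebuild a DataFrame with the original column names and index.
fr2 = pd.DataFrame(filled, columns=fr.columns, index=fr.index)
print(type(filled).__name__)
print(fr2)
```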

Univariate imputation, including sparse matrices

import numpy as np 
from sklearn.impute import SimpleImputer
si=SimpleImputer(missing_values=np.nan,strategy='mean')
# The training data is typed in by hand here. When the same data is fitted
# and transformed, fit and transform can be combined into fit_transform.
si.fit([[1,2],[np.nan,3],[7,6]])
x=np.array([[np.nan,2],[6,np.nan],[7,6]])
print(si.transform(x))
Output:
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
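
The fill values 4 and 3.67 are the column means of the data passed to fit(), not of x. A quick way to check this is the imputer's statistics_ attribute. The name manual is made up for this sketch.

```python
import numpy as np
from sklearn.impute import SimpleImputer

si = SimpleImputer(missing_values=np.nan, strategy="mean")
si.fit([[1, 2], [np.nan, 3], [7, 6]])

# statistics_ holds one fill value per column, learned during fit().
print(si.statistics_)

# The same values computed by hand, ignoring NaN.
manual = np.nanmean(np.array([[1, 2], [np.nan, 3], [7, 6]]), axis=0)
print(manual)
```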

SimpleImputer also supports sparse matrices

import scipy.sparse as sp
from sklearn.impute import SimpleImputer
x=sp.csc_matrix([[1,2],[0,-1],[8,4]])
imp = SimpleImputer(missing_values=-1,strategy='mean')  # -1 marks a missing value here
imp.fit(x)
x_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
print(imp.transform(x_test).toarray())
Output:
[[3. 2.]
 [6. 3.]
 [7. 6.]]

SimpleImputer also supports data containing strings, as well as pandas categorical data

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df=pd.DataFrame([['a','x'],
[np.nan,'y'],
['a',np.nan],
['b','y']],dtype='category')  # dtype='category' makes the columns categorical
imp=SimpleImputer(missing_values=np.nan,strategy='most_frequent')  # fill with the mode
df_transformed=imp.fit_transform(df)
print(df_transformed)
Output:
[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

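Multivariate imputation of missing values

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
file_name=r""
df=pd.read_csv(file_name)
imp=IterativeImputer(max_iter=10,random_state=0)
df_transformed=imp.fit_transform(df)
print(df_transformed)

This is smarter than univariate imputation: each feature with missing values is estimated from the other features rather than from its own column alone.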
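
Since the file name above is left blank, here is a self-contained sketch on a tiny hand-made array. The data is hypothetical: the second column is roughly twice the first, so an imputer that looks across columns has something to learn. Exact fill values may vary slightly with the scikit-learn version, so none are quoted.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data in which the second column is roughly twice the first.
x_train = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
x_new = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])

imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(x_train)

# Observed entries are kept; only the NaN cells are estimated.
x_filled = imp.transform(x_new)
print(np.round(x_filled))
```

A mean-based SimpleImputer would put the same column mean into every gap, whereas here each fill depends on the value in the other column of the same row.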

Filling missing values with the KNN method

import pandas as pd 
import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
s=[[1,2,nan],[3,4,3],[nan,6,5],[8,8,7]]
imp=KNNImputer(n_neighbors=3,weights='uniform')
ss=imp.fit_transform(s)
print(ss)
Output:
[[1. 2. 5.]
 [3. 4. 3.]
 [4. 6. 5.]
 [8. 8. 7.]]
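
The weights parameter also accepts 'distance', which weights each neighbour by the inverse of its distance instead of averaging them equally. The sketch below runs both settings on the same data. The variable names uniform and distance are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
s = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]])

uniform = KNNImputer(n_neighbors=3, weights="uniform").fit_transform(s)
distance = KNNImputer(n_neighbors=3, weights="distance").fit_transform(s)

# Observed entries are identical; only the two imputed cells can differ.
print(uniform[0, 2], distance[0, 2])
print(uniform[2, 0], distance[2, 0])
```

With distance weighting, the fill for the first row is pulled towards its closest neighbour, the second row, whose third-column value is 3, so it comes out below the uniform result of 5.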

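Marking missing values

NaN is the usual marker for a missing value (it acts as a placeholder), and NaN is a floating-point value.
A missing value can also be marked with some other integer placeholder, such as -1 here.
MissingIndicator marks the missing values: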

from sklearn.impute import MissingIndicator
import numpy as np
x=np.array([[-1,-1,1,3],[4,-1,0,-1],[8,-1,1,0]])
indicator=MissingIndicator(missing_values=-1)
mask_missing_values=indicator.fit_transform(x)
mask_missing_values
Output:
array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])
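True means the value is missing, False means it is present. By default only the columns that contain a missing value are returned, so the mask has three columns for the four-column input.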
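
To see which input columns the mask refers to, or to get a mask for every column, the features_ attribute and the features='all' option can be used. This is a minimal sketch on the same array. The names mask and full_mask are made up here.

```python
import numpy as np
from sklearn.impute import MissingIndicator

x = np.array([[-1, -1, 1, 3], [4, -1, 0, -1], [8, -1, 1, 0]])

# Default: only columns that contained a missing value during fit are kept.
indicator = MissingIndicator(missing_values=-1)
mask = indicator.fit_transform(x)
print(indicator.features_)  # indices of the input columns kept in the mask

# features="all": one mask column per input column.
full_mask = MissingIndicator(missing_values=-1, features="all").fit_transform(x)
print(full_mask.shape)
```

The third input column (index 2) has no -1 in it, which is why it is absent from the default mask.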