Learning miceforest (Part 1)

Original documentation: miceforest 5.7.0

miceforest has three main classes:

ImputationKernel

This class contains the raw data on which the mice algorithm is performed. During the process, models are trained and the imputed (predicted) values are stored. These values can be used to fill in the missing values of the raw data. The raw data can be copied, or referenced directly. Models can be saved and used to impute new datasets.

ImputedData

The result of impute_new_data(new_data). It contains the raw data in new_data as well as the imputed values.
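
For instance, once a kernel has been trained (as in the examples below), new data can be imputed in a single call. A minimal sketch, where kernel and new_data are placeholder names assumed to be a trained ImputationKernel and a DataFrame with the same columns as the training data:

# `kernel` is a trained ImputationKernel; `new_data` contains missing values.
new_data_imputed = kernel.impute_new_data(new_data=new_data)

# The returned ImputedData object exposes complete_data(), just like the kernel.
new_data_complete = new_data_imputed.complete_data(dataset=0)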

MeanMatchScheme

Determines how mean matching should be carried out. There are 3 built-in mean match schemes in miceforest, discussed below.

Example (ImputationKernel):

If you only want to create a single imputed dataset, you can use ImputationKernel with some default settings:

First, load a dataset; here we use the iris dataset as an example:

import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True, return_X_y=True), axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
iris['species'] = iris['species'].astype('category')
iris_amp = mf.ampute_data(iris, perc=0.25, random_state=1991)

Use miceforest:

# Create kernel. 
kds = mf.ImputationKernel(
  iris_amp,
  save_all_iterations=True,
  random_state=1991
)

# Run the MICE algorithm for 2 iterations
kds.mice(2)

# Return the completed dataset.
iris_complete = kds.complete_data()

Usually, we do not want to impute just a single dataset. In statistics, multiple imputation is a process by which the uncertainty and other effects caused by missing values can be examined by creating multiple different imputed datasets. ImputationKernel can contain an arbitrary number of different datasets, all of which go through mutually exclusive imputation processes:

# Create kernel. 
kernel = mf.ImputationKernel(
  iris_amp,
  datasets=4,
  save_all_iterations=True,
  random_state=1
)

# Run the MICE algorithm for 2 iterations on each of the datasets
kernel.mice(2)

# Printing the kernel will show you some high level information.
print(kernel)
## 
##               Class: ImputationKernel
##            Datasets: 4
##          Iterations: 2
##        Data Samples: 150
##        Data Columns: 5
##   Imputed Variables: 5
## save_all_iterations: True

After running mice, we can retrieve a completed dataset directly from the kernel:

completed_dataset = kernel.complete_data(dataset=2)

print(completed_dataset.isnull().sum(0))
## sepal length (cm)    0
## sepal width (cm)     0
## petal length (cm)    0
## petal width (cm)     0
## species              0
## dtype: int64
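
Since this kernel was created with datasets=4, we can collect every completed dataset by looping over the dataset indices, for example:

# Gather all 4 completed datasets; each was imputed by its own chain of models.
completed_datasets = [kernel.complete_data(dataset=i) for i in range(4)]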

Customizing lightgbm parameters:

Parameters that should be applied globally to every model can simply be passed as kwargs to mice:

# Run the MICE algorithm for 1 more iteration on the kernel with new parameters
kernel.mice(iterations=1,n_estimators=50)

Variable-specific parameters can also be passed to mice via variable_parameters.

For example, suppose you find that the [species] column is taking longer to impute because it is multiclass. You can decrease n_estimators specifically for that column:

# Run the MICE algorithm for 1 more iteration on the kernel
kernel.mice(
  iterations=1,
  variable_parameters={'species': {'n_estimators': 25}},
  n_estimators=50
)

Mean Match Schemes

A miceforest MeanMatchScheme contains information about how mean matching should be carried out, such as:

  1. Mean matching functions
  2. Mean matching candidates
  3. How to get predictions from a lightgbm model
  4. The datatypes predictions are stored as

miceforest offers three pre-built mean match schemes:

from miceforest import (
  mean_match_default,
  mean_match_fast_cat,
  mean_match_shap
)

# To get information for each, use help()
# help(mean_match_default)

mean_match_default — medium speed, medium imputation quality

Categorical: performs a K nearest neighbors search on the candidate class probabilities, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.
Numeric: performs a K nearest neighbors search on the candidate predictions, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.

mean_match_fast_cat — fastest speed, lowest imputation quality

Categorical: returns the class based on a random draw weighted by the class probabilities of each sample.
Numeric: performs a K nearest neighbors search on the candidate predictions, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.

mean_match_shap — slowest speed, highest imputation quality for large datasets

Categorical / Numeric: performs a K nearest neighbors search on the candidate prediction shap values, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.

As a special case, if mean_match_candidates is set to 0, all schemes exhibit the following behavior:

Categorical: the class with the highest probability is chosen.
Numeric: the model prediction is used directly.
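
If you want this deterministic behavior for all variables, one way to get it is to copy a built-in scheme and set the candidate count to 0. A small sketch, reusing only the API shown elsewhere in this post:

# Copy the default scheme and disable mean matching entirely
# by setting mean match candidates to 0 for every variable.
scheme_no_mm = mean_match_default.copy()
scheme_no_mm.set_mean_match_candidates(0)

# Build a kernel that uses this scheme; predictions are then used
# directly (numeric) or the most probable class is chosen (categorical).
kernel_no_mm = mf.ImputationKernel(
  iris_amp,
  mean_match_scheme=scheme_no_mm,
  random_state=1991
)
kernel_no_mm.mice(2)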

Advanced features

Multiple imputation is a complicated process. However, miceforest allows the user to swap out and customize all of the major components.

Customizing the imputation process

The imputation process can be heavily customized on a per-variable basis.

By passing a dict to variable_schema, you can specify the predictor variables for each imputed variable.

mean_match_candidates and data_subset can also be specified per variable by passing a dict of valid values, with variable names as keys.

If needed, you can even replace the entire default mean matching function for certain objectives. Below is a very involved setup that you will probably never want to use; it simply shows what is possible:

# Use the default mean match scheme as a base
from miceforest import mean_match_default
mean_match_custom = mean_match_default.copy()

# Define a new mean matching function that
# randomly shuffles the predictions
def custom_mmf(bachelor_preds):
    np.random.shuffle(bachelor_preds)
    return bachelor_preds

# Specify that our custom function should perform mean matching
# for all variables modeled with the poisson objective
mean_match_custom.set_mean_match_function(
  {"poisson": custom_mmf}
)

# Set mean match candidates per variable (int or dict; an int
# applies the value to all variables. Set to 0 to skip mean matching.)
mean_match_custom.set_mean_match_candidates(
  {
      'sepal width (cm)': 3,
      'petal width (cm)': 0
  }
)

# Specify which variables should be used to predict each imputed variable
variable_schema = {
    'sepal width (cm)': ['species','petal width (cm)'],
    'petal width (cm)': ['species','sepal length (cm)']
}

# Use a random subset of 50 rows to train the model for sepal width (cm).
variable_subset = {
  'sepal width (cm)': 50
}

# Specify that petal width (cm) should be modeled by the
# poisson objective. Our custom mean matching function
# above will be used for this variable.
variable_parameters = {
  'petal width (cm)': {"objective": "poisson"}
}

# Set up the imputation kernel
cust_kernel = mf.ImputationKernel(
    iris_amp,
    datasets=3,
    mean_match_scheme=mean_match_custom,
    variable_schema=variable_schema,
    data_subset=variable_subset
)
cust_kernel.mice(iterations=1, variable_parameters=variable_parameters)
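
Completed data can then be pulled from the custom kernel just like the kernels above:

# Retrieve a completed dataset from the customized kernel.
cust_complete = cust_kernel.complete_data(dataset=0)
print(cust_complete.isnull().sum(0))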
