Learning miceforest (Part 1)

Original documentation: miceforest 5.7.0

miceforest has three main classes:

ImputationKernel

This class contains the raw data on which the mice algorithm is performed. During the process, models are trained and the imputed (predicted) values are stored. These values can be used to fill in the missing values of the raw data. The raw data can be copied, or referenced directly. Models can be saved and used to impute new datasets.

ImputedData

The result of impute_new_data(new_data). It contains the raw data in new_data as well as the imputed values.
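
For instance, once a kernel has been trained (as in the examples below), new data can be imputed in a single call. A minimal sketch, where kernel and new_data are placeholder names assumed to be a trained ImputationKernel and a DataFrame with the same columns as the training data:

# `kernel` is a trained ImputationKernel; `new_data` contains missing values.
new_data_imputed = kernel.impute_new_data(new_data=new_data)

# The returned ImputedData object exposes complete_data(), just like the kernel.
new_data_complete = new_data_imputed.complete_data(dataset=0)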

MeanMatchScheme

Determines how mean matching should be carried out. There are 3 built-in mean match schemes in miceforest, discussed below.

Example (ImputationKernel):

If you only want to create a single imputed dataset, you can use ImputationKernel with some default settings:

First, load a dataset; here we use the iris dataset as an example:

import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True, return_X_y=True), axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
iris['species'] = iris['species'].astype('category')
iris_amp = mf.ampute_data(iris, perc=0.25, random_state=1991)

Use miceforest:

# Create kernel. 
kds = mf.ImputationKernel(
  iris_amp,
  save_all_iterations=True,
  random_state=1991
)

# Run the MICE algorithm for 2 iterations
kds.mice(2)

# Return the completed dataset.
iris_complete = kds.complete_data()

Usually, we do not want to impute just a single dataset. In statistics, multiple imputation is a process by which the uncertainty and other effects caused by missing values can be examined by creating multiple different imputed datasets. ImputationKernel can contain an arbitrary number of different datasets, all of which go through mutually exclusive imputation processes:

# Create kernel. 
kernel = mf.ImputationKernel(
  iris_amp,
  datasets=4,
  save_all_iterations=True,
  random_state=1
)

# Run the MICE algorithm for 2 iterations on each of the datasets
kernel.mice(2)

# Printing the kernel will show you some high level information.
print(kernel)
## 
##               Class: ImputationKernel
##            Datasets: 4
##          Iterations: 2
##        Data Samples: 150
##        Data Columns: 5
##   Imputed Variables: 5
## save_all_iterations: True

After running mice, we can retrieve a completed dataset directly from the kernel:

completed_dataset = kernel.complete_data(dataset=2)

print(completed_dataset.isnull().sum(0))
## sepal length (cm)    0
## sepal width (cm)     0
## petal length (cm)    0
## petal width (cm)     0
## species              0
## dtype: int64
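
Since this kernel was created with datasets=4, we can collect every completed dataset by looping over the dataset indices, for example:

# Gather all 4 completed datasets; each was imputed by its own chain of models.
completed_datasets = [kernel.complete_data(dataset=i) for i in range(4)]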

Customizing lightgbm parameters:

Parameters that should be applied globally to every model can simply be passed as kwargs to mice:

# Run the MICE algorithm for 1 more iteration on the kernel with new parameters
kernel.mice(iterations=1,n_estimators=50)

Variable-specific parameters can also be passed to mice via variable_parameters.

For example, suppose you find that the [species] column is taking longer to impute because it is multiclass. You can decrease n_estimators specifically for that column:

# Run the MICE algorithm for 1 more iteration on the kernel
kernel.mice(
  iterations=1,
  variable_parameters={'species': {'n_estimators': 25}},
  n_estimators=50
)

Mean Match Schemes

A miceforest MeanMatchScheme contains information about how mean matching should be carried out, such as:

  1. Mean matching functions
  2. Mean matching candidates
  3. How to get predictions from a lightgbm model
  4. The datatypes predictions are stored as

miceforest offers three pre-built mean match schemes:

from miceforest import (
  mean_match_default,
  mean_match_fast_cat,
  mean_match_shap
)

# To get information for each, use help()
# help(mean_match_default)

mean_match_default — medium speed, medium imputation quality

Categorical: performs a K nearest neighbors search on the candidate class probabilities, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.
Numeric: performs a K nearest neighbors search on the candidate predictions, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.

mean_match_fast_cat — fastest speed, lowest imputation quality

Categorical: returns the class based on a random draw weighted by the class probabilities of each sample.
Numeric: performs a K nearest neighbors search on the candidate predictions, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.

mean_match_shap — slowest speed, highest imputation quality for large datasets

Categorical / Numeric: performs a K nearest neighbors search on the candidate prediction shap values, where K = mmc. One is selected at random, and the associated candidate value is used as the imputation value.

As a special case, if mean_match_candidates is set to 0, all schemes exhibit the following behavior:

Categorical: the class with the highest probability is chosen.
Numeric: the model prediction is used directly.
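
If you want this deterministic behavior for all variables, one way to get it is to copy a built-in scheme and set the candidate count to 0. A small sketch, reusing only the API shown elsewhere in this post:

# Copy the default scheme and disable mean matching entirely
# by setting mean match candidates to 0 for every variable.
scheme_no_mm = mean_match_default.copy()
scheme_no_mm.set_mean_match_candidates(0)

# Build a kernel that uses this scheme; predictions are then used
# directly (numeric) or the most probable class is chosen (categorical).
kernel_no_mm = mf.ImputationKernel(
  iris_amp,
  mean_match_scheme=scheme_no_mm,
  random_state=1991
)
kernel_no_mm.mice(2)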

Advanced features

Multiple imputation is a complicated process. However, miceforest allows the user to swap out and customize all of the major components.

Customizing the imputation process

The imputation process can be heavily customized on a per-variable basis.

By passing a dict to variable_schema, you can specify the predictor variables for each imputed variable.

mean_match_candidates and data_subset can also be specified per variable by passing a dict of valid values, with variable names as keys.

If needed, you can even replace the entire default mean matching function for certain objectives. Below is a very involved setup that you will probably never want to use; it simply shows what is possible:

# Use the default mean match scheme as a base
from miceforest import mean_match_default
mean_match_custom = mean_match_default.copy()

# Define a new mean matching function that
# randomly shuffles the predictions
def custom_mmf(bachelor_preds):
    np.random.shuffle(bachelor_preds)
    return bachelor_preds

# Specify that our custom function should perform mean matching
# for all variables modeled with the poisson objective
mean_match_custom.set_mean_match_function(
  {"poisson": custom_mmf}
)

# Set mean match candidates per variable (int or dict; an int
# applies the value to all variables. Set to 0 to skip mean matching.)
mean_match_custom.set_mean_match_candidates(
  {
      'sepal width (cm)': 3,
      'petal width (cm)': 0
  }
)

# Specify which variables should be used to predict each imputed variable
variable_schema = {
    'sepal width (cm)': ['species','petal width (cm)'],
    'petal width (cm)': ['species','sepal length (cm)']
}

# Use a random subset of 50 rows to train the model for sepal width (cm).
variable_subset = {
  'sepal width (cm)': 50
}

# Specify that petal width (cm) should be modeled by the
# poisson objective. Our custom mean matching function
# above will be used for this variable.
variable_parameters = {
  'petal width (cm)': {"objective": "poisson"}
}

# Set up the imputation kernel
cust_kernel = mf.ImputationKernel(
    iris_amp,
    datasets=3,
    mean_match_scheme=mean_match_custom,
    variable_schema=variable_schema,
    data_subset=variable_subset
)
cust_kernel.mice(iterations=1, variable_parameters=variable_parameters)
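
Completed data can then be pulled from the custom kernel just like the kernels above:

# Retrieve a completed dataset from the customized kernel.
cust_complete = cust_kernel.complete_data(dataset=0)
print(cust_complete.isnull().sum(0))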
