Imputation with miceforest

1. Dataset description:

The dataset (a customer-default dataset) contains six variables and 1001 records: 月收入 (monthly income), 年龄 (age), 性别 (gender), 历史授信额度 (historical credit limit), 历史违约次数 (number of past defaults), and 信用评分 (credit score), plus a 是否违约 (defaulted or not) label column. The data starts out complete; after the amputation in the next step, every row and column will contain missing values.

2. Creating the missing data

import pandas as pd
import miceforest as mf

# Read in the dataset (raw string so the backslashes in the path are not treated as escapes)
data = pd.read_excel(r"E:\机器学习数据\miceforest.xlsx")
# Use miceforest to randomly ampute the complete data.
# Columns 1-6 are the six feature columns; column 0 (the 是否违约 label) is excluded.
A_data_missing = mf.ampute_data(data.iloc[:, 1:7], perc=0.25, random_state=1)
# Print the fraction of missing values per column
print(A_data_missing.isnull().sum() / len(data))

The output shows the fraction of missing values in each column:

月收入       0.255
年龄        0.259
性别        0.256
历史授信额度    0.262
历史违约次数    0.260
信用评分      0.261

We introduced missingness into the six feature columns (every column except the 是否违约 record), at roughly 25% per column.

3. Imputing the missing values

3.1 Single imputation

# Single imputation
kds = mf.ImputationKernel(
  data=A_data_missing,
  datasets=1,  # datasets=1 -> a single imputed dataset
  save_models=1,  # takes values >= 0; 1 keeps only the models from the final iteration
  save_all_iterations=True,  # keep all intermediate iterations
  random_state=10
)
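
The kernel above only defines the imputation; nothing has been trained yet. A minimal follow-up (reusing the kds just created; completed_single is an illustrative name) runs MICE and extracts the completed dataset:

kds.mice(iterations=3)  # run three MICE iterations
completed_single = kds.complete_data(dataset=0)  # with datasets=1 the only valid index is 0
print(completed_single.isnull().sum())  # should be all zeros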

3.2 Multiple imputation

# Multiple imputation
kds = mf.ImputationKernel(
  data=A_data_missing,
  datasets=4,
  save_models=1,  # takes values >= 0; 1 keeps only the models from the final iteration
  save_all_iterations=True,  # keep all intermediate iterations
  random_state=10
)
kds.mice(iterations=3,
         #n_jobs=2, n_estimators=50
         )  # three iterations per dataset; n_jobs sets the degree of parallelism (-1 uses all cores), n_estimators sets the number of trees
completed_dataset = kds.complete_data(dataset=1, inplace=False)  # dataset=1 is the 2nd dataset; with datasets=4 the valid indices are 0-3
# inplace=False returns a copy of the completed data. Since the original data is stored in
# kernel.working_data, you can set inplace=True to complete the data without returning a copy.
print(completed_dataset.isnull().sum(0))
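
To sanity-check the results, the kernel's built-in diagnostic plot can overlay the imputed distributions from each dataset on the distribution of the original observed values (a quick check, assuming the plotting dependencies such as plotnine are installed):

kds.plot_imputed_distributions(wspace=0.3, hspace=0.3)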

3.3 Multiclass variables

# If a column is a multiclass categorical variable, imputing it may take much longer.
# You can lower n_estimators for that variable alone via variable_parameters; any
# parameter given there takes precedence over the same parameter passed as a kwarg.
kds.mice(iterations=1, variable_parameters={'历史违约次数': {'n_estimators': 25}}, n_estimators=50)

3.4 Known distributions (e.g. Poisson)

# If the distribution of a column is known, that column can be imputed with its own objective.
# Here we assume 历史违约次数 (the number of past defaults) follows a Poisson distribution.
# Create kernel.
cust_kernel = mf.ImputationKernel(
  data=A_data_missing,
  datasets=1,
  random_state=1
)
cust_kernel.mice(iterations=1, variable_parameters={'历史违约次数': {'objective': 'poisson'}})
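
The completed data can then be pulled from the kernel as before to inspect the imputed column (a minimal sketch; completed_poisson is an illustrative name):

completed_poisson = cust_kernel.complete_data(0)
print(completed_poisson['历史违约次数'].describe())  # inspect the imputed default counts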

3.5 Imputing with GBDT

# To use GBDT boosting for the imputation models
kds_gbdt = mf.ImputationKernel(
  data=A_data_missing,
  datasets=1,
  save_all_iterations=True,
  random_state=1991
)
# We need to add a small minimum hessian, otherwise lightgbm will throw an error:
kds_gbdt.mice(iterations=1, boosting='gbdt', min_sum_hessian_in_leaf=0.01)
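
As with the other kernels, the completed data comes from complete_data (sketch; completed_gbdt is an illustrative name):

completed_gbdt = kds_gbdt.complete_data(0)
print(completed_gbdt.isnull().sum())  # confirm the GBDT kernel filled every column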

3.6 Imputing new data with the trained kernel

# Create a new dataset with a different missingness rate (it must have the same columns
# the kernel was trained on)
A_data_missing1 = mf.ampute_data(data.iloc[:, 1:7], perc=0.4, random_state=1)
# The kernel trained above can impute new data directly. In practice this means a model can
# be trained on a subset of the data and then used to impute the full data, which is more
# efficient. Note that impute_new_data() returns a new ImputedData object; the completed
# data must be pulled from that object, not from the kernel itself.
new_imputed = kds.impute_new_data(A_data_missing1)
completed_dataset1 = new_imputed.complete_data(0)

3.7 PMM, predictive mean matching (mean_match_candidates)

With mean_match_candidates=5, each missing value is filled by drawing from the observed values of the 5 samples whose predictions are closest to the prediction for that missing entry (for categorical variables, the candidates vote).
In other words, PMM does not use the model's raw prediction as the imputed value; it uses the prediction to find the nearest predicted neighbors, maps back to the original data, and imputes with those rows' actual observed values.

# A single mean_match_candidates value applies to every variable
cust_kernel = mf.ImputationKernel(
    data=A_data_missing,
    datasets=3,
    mean_match_candidates=5,
)
# Alternatively, set the number of candidates per variable with a dict
var_mmc = {
    '年龄': 5,
    '性别': 3
}
cust_kernel = mf.ImputationKernel(
    data=A_data_missing,
    datasets=3,
    mean_match_candidates=var_mmc
)
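
Neither kernel above has been run at this point; as in the earlier sections, mice() does the actual work (sketch, reusing the second cust_kernel; pmm_completed is an illustrative name):

cust_kernel.mice(iterations=3)
pmm_completed = cust_kernel.complete_data(0)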

3.8 Custom imputation schemas

The imputation can be customized per variable. Passing a dict (a named list) to variable_schema specifies, for each variable to be imputed, which variables are used as its predictors. You can also choose which variables should be imputed with mean matching by passing a dict to mean_match_candidates.
E.g.: predict 年龄 from 性别 and 月收入 (and 信用评分 from 月收入 and 历史违约次数), with mean matching for 年龄 and 性别:

var_sch = {
    '年龄': ['性别','月收入'],
    '信用评分': ['月收入','历史违约次数']
}
var_mmc = {
    '年龄': 5,
    '性别': 2
}
kds = mf.ImputationKernel(
  data=A_data_missing,
  datasets=4,
  save_models=1,  # takes values >= 0; 1 keeps only the models from the final iteration
  save_all_iterations=True,  # keep all intermediate iterations
  random_state=10,
  variable_schema=var_sch,
  mean_match_candidates=var_mmc
)
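
Running this kernel works exactly as in section 3.2 (sketch; custom_completed is an illustrative name):

kds.mice(iterations=3)
custom_completed = kds.complete_data(0)  # imputed using only the predictors named in var_sch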

4. Selecting and combining multiple imputations

Once multiple imputation has produced several completed datasets, one simple heuristic is to compare each dataset's column statistics (here, the mean) against the original data, and then assemble a final dataset by taking each column from whichever completed dataset matches the original most closely.

kds = mf.ImputationKernel(
  data=A_data_missing,
  datasets=4,
  save_models=1,  # takes values >= 0; 1 keeps only the models from the final iteration
  save_all_iterations=True,  # keep all intermediate iterations
  random_state=10
)

kds.mice(iterations=3,
         #n_jobs=2, n_estimators=50
         )  # three iterations per dataset; n_jobs sets the degree of parallelism (-1 uses all cores), n_estimators sets the number of trees

# Analyze and use the multiple-imputation results

dataresult = []
result = []
for i in range(kds.dataset_count()):
  dataresult.append(kds.complete_data(i))
  # Absolute percent deviation of each column mean from the original (incomplete) data;
  # without abs(), min() below would pick the most negative deviation, not the closest
  dd = ((dataresult[i].mean() - A_data_missing.mean()) / A_data_missing.mean() * 100).abs()
  result.append(dd)
print(result)

# For each column, take the dataset whose mean is closest to the original
name = A_data_missing.columns
new_complete = pd.DataFrame(columns=name)
lst = []  # which dataset each column is taken from
for i in range(len(name)):
  re = []
  for j in range(kds.dataset_count()):
    re.append(result[j].iloc[i])  # positional indexing; result[j] is a Series indexed by column name
  a = re.index(min(re))  # index of the dataset with the smallest deviation for this column
  lst.append(a)
for i in range(len(name)):
  new_complete[name[i]] = dataresult[lst[i]][name[i]]
print(new_complete)