Using cleanlab: have you run into the main process being re-executed in an endless loop?

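Background: I have recently been working on data cleaning, data quality assessment, and dirty-data detection. That led me to TrustAI and cleanlab. After researching both thoroughly, I concluded that cleanlab has the clear edge in both ease of development and final results. So I confidently told my boss that cleanlab was simple and that I could have a preliminary result within a day. Of course, reality caught up with that promise much faster than I could deliver on it.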

The task cleanlab's dirty-data detection targets

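cleanlab is mainly used to identify mislabeled examples in classification tasks. It estimates the joint probability matrix between the noisy labels (the human-annotated labels) and the true labels (for each example, the class whose predicted probability reaches that class's average threshold), and then uses this matrix to identify dirty data.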
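To make the description above concrete, here is a minimal pure-Python sketch of the counting step. It is a simplified illustration, not cleanlab's actual implementation (which includes further calibration and pruning steps). The function names `class_thresholds` and `confident_joint_counts` are hypothetical, and the sketch assumes every class has at least one labeled example.

```python
# Toy data: human labels and a model's predicted probabilities.
labels = [0, 1, 0, 0]
pred_probs = [
    [0.8, 0.2],
    [0.4, 0.6],
    [0.5, 0.5],
    [0.0, 1.0],
]
num_classes = len(pred_probs[0])


def class_thresholds(labels, pred_probs, num_classes):
    """Average predicted probability of class k over the examples labeled k.

    Assumes every class has at least one labeled example.
    """
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return thresholds


def confident_joint_counts(labels, pred_probs, thresholds):
    """Return counts[given label][confident label] and the suspect indices."""
    n = len(thresholds)
    counts = [[0] * n for _ in range(n)]
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability reaches the per-class threshold.
        confident = [k for k in range(n) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no class is confident enough, so the example is skipped
        best = max(confident, key=lambda k: p[k])
        counts[y][best] += 1
        if best != y:
            suspects.append(i)  # given label disagrees with the confident label
    return counts, suspects


thresholds = class_thresholds(labels, pred_probs, num_classes)
counts, suspects = confident_joint_counts(labels, pred_probs, thresholds)
print(thresholds)
print(counts)    # [[2, 1], [0, 1]]
print(suspects)  # [3]
```

Dividing `counts` by its total would give an estimate of the joint distribution. The off-diagonal entries are the examples whose given label disagrees with the confidently predicted class, which makes them the candidates for label errors.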

Official usage, explained:

"""
cleanlab finds issues in any dataset that a classifier can be trained on. The cleanlab 
package works with any model by using model outputs (predicted probabilities) as input
 – it doesn’t depend on which model created those outputs.

If you’re using a scikit-learn-compatible model (option 1), you don’t need to train a 
model – you can pass the model, data, and labels into CleanLearning.find_label_issues 
and cleanlab will handle model training for you. If you want to use any non-sklearn-
compatible model (option 2), you can input the trained model’s out-of-sample predicted 
probabilities into find_label_issues. Examples for both options are below.
"""

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues

# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)

# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)

Running into the problem

After reading the official usage I thought: so easy! So I happily wrote the following code:

import numpy as np
from cleanlab.filter import find_label_issues

# labels: human-annotated labels
labels = np.array([0, 1, 0, 0])
# pred_probs: probabilities predicted by the classification model
pred_probs = np.array([[0.8, 0.2],
                       [0.4, 0.6],
                       [0.5, 0.5],
                       [0.0, 1.0]])
print(labels)
print(pred_probs)
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)
print("Indices identified as dirty data:", ordered_label_issues)

Instead, I got the following output:

[0 1 0 0]
[[0.8 0.2]
 [0.4 0.6]
 [0.5 0.5]
 [0.  1. ]]
[0 1 0 0]
[0 1 0 0]
[[0.8 0.2]
 [0.4 0.6]
 [0.5 0.5]
 [0.  1. ]]
[[0.8 0.2]
 [0.4 0.6]
 [0.5 0.5]
 [0.  1. ]]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\wtsr\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\wtsr\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

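The error output was too long for me to pinpoint what had gone wrong, so I went and read the source code. I only half understood it, but it did show that the problem comes from how processes are started: the main module gets executed over and over, which is why the arrays above are printed several times. The traceback hints at the cause. With the spawn start method (the default on Windows), every child process re-imports the main module, so unguarded top-level code, including the call to find_label_issues itself, runs again in each child. With that understood, I made the following change: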

import numpy as np
import warnings
from cleanlab.filter import find_label_issues

warnings.filterwarnings('ignore')

# labels: human-annotated labels
labels = np.array([0, 1, 0, 0])
# pred_probs: probabilities predicted by the classification model
pred_probs = np.array([[0.8, 0.2],
                       [0.4, 0.6],
                       [0.5, 0.5],
                       [0.0, 1.0]])


def predict_issue(labels, pred_probs):
    ordered_label_issues = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
        return_indices_ranked_by='self_confidence',
    )
    return ordered_label_issues


# Only the real main process runs this block; spawned children skip it.
if __name__ == '__main__':
    print(predict_issue(labels, pred_probs))

# output: [3]
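Why the guard helps can be illustrated without cleanlab and without starting any processes. The sketch below only simulates what a spawned child does: it executes a tiny script twice with the standard-library `runpy` module, once under the name `__main__` and once under `__mp_main__`, the name that multiprocessing gives the re-imported main module in a spawned child. The script contents and the helper `run_as` are hypothetical, made up for this illustration.

```python
import os
import runpy
import tempfile
import textwrap

# A stand-in for a user script: one unguarded statement, one guarded statement.
SCRIPT = textwrap.dedent("""
    executed = ["top-level"]          # runs every time the script is loaded
    if __name__ == "__main__":
        executed.append("guarded")    # runs only in the real main process
""")


def run_as(module_name):
    """Execute SCRIPT as if its module were named `module_name`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_script.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(SCRIPT)
        namespace = runpy.run_path(path, run_name=module_name)
    return namespace["executed"]


main_run = run_as("__main__")      # how the script runs when launched directly
child_run = run_as("__mp_main__")  # how a spawned child re-imports it
print(main_run)   # ['top-level', 'guarded']
print(child_run)  # ['top-level']
```

In the failing script the call to find_label_issues sat at the top level, so each child tried to start children of its own. Inside the guard it runs once, in the main process only.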

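With that, the problem was solved. Still, I have to complain: the official documentation only shows how to call the function, which is a real trap. It does not spell out the caveats, and without the time to read the source code, who would know how to work around this bug?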
