Have you run into the main process being re-invoked in an endless loop when using cleanlab?
Background: I was recently working on data cleaning, data-quality assessment, and dirty-data (label-error) detection. I looked into TrustAI and cleanlab, did a thorough comparison, and cleanlab came out ahead both in ease of development and in final results. So I confidently told my boss that cleanlab was simple to use and I could produce a preliminary result within a day. Of course, my promise was outrun by the speed at which it blew up in my face.
Which tasks cleanlab's dirty-data detection targets
cleanlab is mainly used to identify mislabeled samples in classification tasks. It works by statistically estimating the joint probability matrix between the noisy labels (the human annotations) and the true labels (inferred from samples whose predicted probability for a class reaches that class's average threshold), and flags dirty data from that matrix.
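To make the idea concrete, here is a rough numpy sketch of how such a "confident joint" can be estimated. This is a simplification for illustration only, not cleanlab's actual implementation (cleanlab adds calibration and edge-case handling that this skips):

```python
import numpy as np

labels = np.array([0, 1, 0, 0])           # noisy human-annotated labels
pred_probs = np.array([[0.8, 0.2],
                       [0.4, 0.6],
                       [0.5, 0.5],
                       [0.0, 1.0]])       # a model's predicted probabilities

n_classes = pred_probs.shape[1]
# Per-class threshold: the average self-confidence of samples labeled with that class.
thresholds = np.array([pred_probs[labels == k, k].mean() for k in range(n_classes)])

# A sample "confidently" belongs to class k if its probability for k reaches
# that class's threshold; among confident classes, take the most probable one.
confident = pred_probs >= thresholds
confident_joint = np.zeros((n_classes, n_classes), dtype=int)
for i, noisy in enumerate(labels):
    ks = np.flatnonzero(confident[i])
    if len(ks) > 0:
        true_k = ks[np.argmax(pred_probs[i, ks])]
        confident_joint[noisy, true_k] += 1

print(confident_joint)
```

The off-diagonal entry `confident_joint[0, 1]` counts samples labeled 0 that the model confidently places in class 1; those are exactly the samples worth auditing as possible label errors.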
Interpreting the official usage:
"""
cleanlab finds issues in any dataset that a classifier can be trained on. The cleanlab
package works with any model by using model outputs (predicted probabilities) as input
– it doesn’t depend on which model created those outputs.
If you’re using a scikit-learn-compatible model (option 1), you don’t need to train a
model – you can pass the model, data, and labels into CleanLearning.find_label_issues
and cleanlab will handle model training for you. If you want to use any non-sklearn-
compatible model (option 2), you can input the trained model’s out-of-sample predicted
probabilities into find_label_issues. Examples for both options are below.
"""
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)
# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)
Hitting a problem
After reading the official usage I thought: so easy! And I happily wrote the following code:
import numpy as np
from cleanlab.filter import find_label_issues

# labels: human-annotated labels
labels = np.array([0, 1, 0, 0])
# pred_probs: predicted probabilities from a classification model
pred_probs = np.array([[0.8, 0.2],
                       [0.4, 0.6],
                       [0.5, 0.5],
                       [0.0, 1.0]])
print(labels)
print(pred_probs)
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)
print("Indices identified as dirty data:", ordered_label_issues)
However, I got the following result instead (note how the output of the two print statements appears over and over: the script was being re-executed):
[0 1 0 0]
[[0.8 0.2]
[0.4 0.6]
[0.5 0.5]
[0. 1. ]]
[0 1 0 0]
[0 1 0 0]
[[0.8 0.2]
[0.4 0.6]
[0.5 0.5]
[0. 1. ]]
[[0.8 0.2]
[0.4 0.6]
[0.5 0.5]
[0. 1. ]]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\wtsr\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\wtsr\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...
As for what went wrong: the traceback was so long that I couldn't pinpoint the error at first, so I went and read the source. It was heavy going, but it revealed that find_label_issues uses multiprocessing internally. On Windows, new processes are started with the spawn method, which re-imports the main module in every child; since my script ran everything at module top level, each child re-executed the whole script and tried to spawn children of its own, so the main program was re-invoked in an endless loop. With the problem understood, I made the following change:
import numpy as np
import warnings
from cleanlab.filter import find_label_issues

warnings.filterwarnings('ignore')

# labels: human-annotated labels
labels = np.array([0, 1, 0, 0])
# pred_probs: predicted probabilities from a classification model
pred_probs = np.array([[0.8, 0.2],
                       [0.4, 0.6],
                       [0.5, 0.5],
                       [0.0, 1.0]])

def predict_issue(labels, pred_probs):
    ordered_label_issues = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
        return_indices_ranked_by='self_confidence',
    )
    return ordered_label_issues

# The guard keeps child processes, which re-import this module on Windows,
# from re-running the call below.
if __name__ == '__main__':
    print(predict_issue(labels, pred_probs))
# output: [3]
With that, the problem was solved. Still, I have to grumble: the official docs only show the usage and never call out this caveat. If I hadn't had the time to read the source, who would have known how to work around this bug?
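One closing note: this crash is not specific to cleanlab. Any code that starts worker processes on Windows, where multiprocessing defaults to the spawn start method, needs the same guard. A minimal stdlib-only illustration of the idiom:

```python
import multiprocessing as mp

def square(x):
    return x * x

def run_pool():
    # Only call this from the main process.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])

# Without this guard, the spawn start method re-imports this module in
# every worker, re-runs its top-level code, and tries to create workers
# of its own -- the endless loop from the traceback above.
if __name__ == '__main__':
    print(run_pool())  # [1, 4, 9]
```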