missforest_missforest最佳丢失数据插补算法

最新推荐文章于 2024-05-22 12:28:53 发布

张_伟_杰

最新推荐文章于 2024-05-22 12:28:53 发布

阅读量3.4k

点赞数 6

文章标签： python 算法 java 人工智能

原文链接：https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3

版权

missforest

Missing data often plagues real-world datasets, and hence there is tremendous value in imputing, or filling in, the missing values. Unfortunately, standard ‘lazy’ imputation methods like simply using the column median or average don’t work well.

丢失的数据通常困扰着现实世界的数据集，因此，估算或填写丢失的值具有巨大的价值。不幸的是，标准的“惰性”插补方法(例如仅使用列中位数或平均值)效果不佳。

On the other hand, KNN is a machine-learning based imputation algorithm that has seen success but requires tuning of the parameter k and additionally, is vulnerable to many of KNN’s weaknesses, like being sensitive to being outliers and noise. Additionally, depending on circumstances, it can be computationally expensive, requiring the entire dataset to be stored and computing distances between every pair of points.

另一方面，KNN是一种基于机器学习的插补算法，它已经取得了成功，但需要调整参数k，而且容易受到KNN的许多弱点的影响，例如对异常值和噪声敏感。另外，根据情况，计算可能会很昂贵，需要存储整个数据集并计算每对点之间的距离。

MissForest is another machine learning-based data imputation algorithm that operates on the Random Forest algorithm. Stekhoven and Buhlmann, creators of the algorithm, conducted a study in 2011 in which imputation methods were compared on datasets with randomly introduced missing values. MissForest outperformed all other algorithms in all metrics, including KNN-Impute, in some cases by over 50%.

MissForest是基于随机森林算法的另一种基于机器学习的数据插补算法。该算法的创建者Stekhoven和Buhlmann于2011年进行了一项研究，该研究在具有随机引入的缺失值的数据集上比较了插补方法。在所有指标上，MissForest的性能均优于其他所有算法，包括KNN-Impute，在某些情况下超过50％。

First, the missing values are filled in using median/mode imputation. Then, we mark the missing values as ‘Predict’ and the others as training rows, which are fed into a Random Forest model trained to predict, in this case, Age based on

最低0.47元/天解锁文章

张_伟_杰

关注

6
点赞
踩
20

收藏

觉得还不错? 一键收藏
0
评论
missforest_missforest最佳丢失数据插补算法

missforestMissing data often plagues real-world datasets, and hence there is tremendous value in imputing, or filling in, the missing values. Unfortunately, standard ‘lazy’ imputation methods like sim...
复制链接

扫一扫