Python Data Imputation: A Quick Guide on Missing Data Imputation Techniques in Python (2020)

Latest recommended article published on 2024-04-16 21:32:09

张_伟_杰

Tags: python, artificial intelligence, java, big data, machine learning

Original article: https://medium.com/analytics-vidhya/a-quick-guide-on-missing-data-imputation-techniques-in-python-2020-5410f3df1c1e

Most machine learning algorithms expect complete, clean, noise-free datasets. Unfortunately, real-world datasets are messy and often have many missing cells, and in such cases handling the missing data becomes quite complex.

Therefore, in today's article we are going to discuss some of the most effective and easy-to-use data imputation techniques for dealing with missing data.

So without any further delay, let’s get started.

Image for post — **Copyright: © Ampcool22 | Dreamstime.com**

What is Data Imputation?

Data imputation is a method in which the missing values in a variable or data frame (in machine learning) are filled with substitute values so that the task can still be performed. With this method the sample size remains the same; only the cells that were missing are now filled with some values. The method is easy to use, but it reduces the variance of the dataset.
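
To make the two claims above concrete (same sample size, lower variance), here is a minimal sketch using mean imputation with pandas. The toy `age` column and its values are made up for illustration, and mean imputation is just one possible strategy, not necessarily the one the original article goes on to recommend.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with two missing cells.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0, np.nan, 29.0]})

# Fill each missing cell with the mean of the observed values.
imputed = df.fillna(df.mean())

# The number of rows is unchanged; only the blanks were filled.
rows_before, rows_after = len(df), len(imputed)

# pandas skips NaN when computing the variance of the original column.
var_before = df["age"].var()
var_after = imputed["age"].var()

print("rows:", rows_before, "->", rows_after)
print("variance:", var_before, "->", var_after)
```

Because every filled cell sits exactly at the mean, it adds nothing to the spread of the data while still increasing the row count used in the denominator, which is why the variance shrinks.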

Why Data Imputation?

There can be various reasons for imputing data. Many real-world datasets (not CIFAR or MNIST) contain missing values, which can appear in any form: blanks, NaN, 0s, arbitrary integers, or a categorical symbol. Dropping the rows or columns that contain missing values comes at the price of losing data that may be valuable, so a better strategy is to impute the missing values.
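
The trade-off between dropping and imputing can be sketched as follows. The column names, the values, and the choice of `""` and `"?"` as missing-value markers are assumptions for illustration; which markers actually mean "missing" depends on the dataset (a 0, for example, may be a real value in one column and a placeholder in another).

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: missing cells encoded as NaN, "" and "?".
raw = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, 175.0],
    "city": ["Pune", "", "Pune", "?", "Delhi"],
})

# Map the placeholder markers to NaN so pandas treats them as missing.
clean = raw.replace(["", "?"], np.nan)

# Option 1: drop every row that has a missing cell (loses two rows here).
dropped = clean.dropna()

# Option 2: impute instead of dropping.
# Numeric column: fill with the mean; categorical column: fill with the mode.
imputed = clean.copy()
imputed["height"] = imputed["height"].fillna(imputed["height"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print("rows kept when dropping:", len(dropped))
print("rows kept when imputing:", len(imputed))
```

Dropping keeps only the fully observed rows, while imputing keeps all of them at the cost of introducing estimated values.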

Having good theoretical knowledge is great, but implementing it in code in a real-world machine learning project is a completely different thing. You might get different and unexpected results depending on the problem and the dataset. So, as a bonus, I am also adding links to the courses that have helped me a lot in my journey to learn data science and ML and to experiment with and compare different data imputation strategies, which is what led me to write this article comparing different data imputation methods.

I am personally a fan of DataCamp. I started with it, and I am still learning through DataCamp and taking new courses. They have some genuinely exciting courses. Do check them out.

1. Handling data with missing inputs
