Handling Missing and Incorrect Data

While getting my feet wet in data science, most of the data I was exposed to was very clean. Maybe too clean. That is perfect while you are learning how to apply statistical analysis to data in code. However, as you may have seen, most real-world data doesn’t play as nicely. Sooner or later in your data science career you’re going to encounter missing or incorrect values, and it’s important to know the different strategies at your disposal for treating them. Depending on the data type, the range of the feature’s values, and the amount of data, some methods have clear advantages over others.

While most of the treatment you do will focus on missing values (NaNs and NULLs), you will always have to check datasets for incorrectly entered data as well. I won’t go into much depth on this, but you must deal with incorrect values before moving on to the missing ones. An example would be an age value of 210 years. This is probably a mistake, and you want to settle on a standard method for dealing with it. One method could be establishing a maximum age and filtering out all values outside that range. Another could be to simply impute the mean. The point is, you can’t just look for NaNs or NULLs; you have to properly filter your data first to make sure all incorrect values are being treated. Additionally, you have to decide what data you simply can’t live without. For example, if you are performing regression analysis on housing data to build a model that predicts market value, and the sale amount is missing from a row, you probably won’t want to use that row, given that sale price is your target variable. You could also decide to remove the row if the address of the house is missing, given that location could be an extremely predictive discrete variable.
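
As a rough illustration of the two checks described above, here is a minimal pandas sketch. The column names, the sample values, and the `MAX_AGE` cutoff are all assumptions made up for this example; the right limits depend on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one impossible age and one missing sale price.
df = pd.DataFrame({
    "age": [34, 210, 58, 41],
    "sale_price": [250_000, 310_000, np.nan, 198_000],
})

MAX_AGE = 120  # assumed cutoff; choose one that fits your domain

# Turn out-of-range ages into NaN so they are treated like any other
# missing value later on (keeping the row, discarding only the bad entry).
df["age"] = df["age"].where(df["age"].between(0, MAX_AGE))

# Drop rows that lack the target variable, since they cannot be used
# to train a model that predicts it.
df = df.dropna(subset=["sale_price"])

print(df)
```

Converting the bad value to NaN rather than dropping the row is one possible choice; it defers the decision to whichever missing-value strategy you pick afterwards.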

Once your data is purged of incorrect values, it’s time to focus on the missing values. The following methods are the ones most commonly and successfully used for this purpose.

Do Nothing

I know this sounds like a cop-out, but sometimes the best solution is to actually do nothing and let your algorithm handle the missing values. Not every model supports this (scikit-learn’s LinearRegression, for example, rejects NaN inputs), and you may just have to treat and clean the data the long way. However, some algorithms can learn how to handle these values during training (XGBoost) or estimate them from similar rows (KNN-based imputation), sometimes better than the other methods we will talk about, while others simply have the option of ignoring missing values. If you choose the “do nothing” approach and leave it up to your model, you’ll want to check the relevant documentation to make sure you know exactly what is happening to these values behind the scenes.

Delete Rows

This is the simplest and most true-to-the-dataset method of dealing with missing values. It is typically used when the value is missing for a particularly indicative variable, or when multiple values are missing in the same row. Deleting rows is only recommended when dealing with particularly large datasets. One reason is the obvious information loss when removing entire rows of data: you risk your final dataset not being informative enough. Additionally, you want to make sure that you are not introducing bias by removing these rows. For example, if you are trying to predict plant species based on a number of features, and a particular species accounts for only a very small portion of the points in the dataset, you would want to find a way to impute these values rather than deleting the rows.
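
A minimal sketch of row deletion with a bias check, using pandas. The tiny plant dataset and the rule “keep a row only if at least two of its three values are present” are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical plant data with a rare class.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa", "rare", "rare"],
    "petal_len": [1.4, np.nan, 1.3, 5.9, np.nan],
    "petal_wid": [0.2, 0.2, np.nan, 2.1, np.nan],
})

before = df["species"].value_counts()

# thresh=2 keeps rows that have at least two non-missing values,
# so only rows missing both measurements are dropped.
dropped = df.dropna(thresh=2)

# Compare class counts before and after to see whether deletion
# hits one class harder than the others.
after = dropped["species"].value_counts().reindex(before.index, fill_value=0)
retained = after / before

print(retained)
```

Here the rare class loses half of its rows while the common class loses none, which is the kind of imbalance that would argue for imputing instead of deleting.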

Impute Mean/Median/Mode

A simple and common way to deal with missing values that we know are not zero is to replace them with a central value. By replacing the missing values with the mean, median, or mode, we limit the impact on the feature’s central tendency (though its spread shrinks somewhat), and we are still able to use those rows of data (as opposed to deleting them).

When deciding which central value to use, it’s important to look back at the data, specifically at the distribution of the feature you are attempting to impute. If the data follows a roughly normal distribution, you can use the mean. If the data is skewed, the median is usually the better choice because it is less sensitive to extreme values. Of course, discrete (categorical) variables don’t have a mean or median, but we can still use the mode (the most common class) as a substitute value.
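
The three choices can be sketched in pandas as follows. The columns and values are invented for the example: a roughly symmetric numeric feature, a skewed one, and a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical data, one missing value per column.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 175.0],            # roughly symmetric
    "income": [30_000.0, 32_000.0, 35_000.0, np.nan, 400_000.0],  # skewed by an outlier
    "city": ["Oslo", "Oslo", None, "Bergen", "Oslo"],             # categorical
})

filled = df.copy()

# Mean for the roughly symmetric feature.
filled["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Median for the skewed feature, so the outlier does not drag the fill value up.
filled["income"] = df["income"].fillna(df["income"].median())

# Mode (most common class) for the categorical feature.
filled["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(filled)
```

With this data, the mean of `income` would be pulled far above the typical value by the single large entry, which is why the median is used there.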

Assign A Unique Discrete Variable

If we have a column of discrete data, its possible values form a finite set of classes, and that set can be extended. So, as opposed to trying to guess the value of a missing class or removing the row altogether, we can attempt to extract more information from the missing value by creating a category just for missing values. For example, if we’re looking at the Titanic dataset and attempting to predict whether or not a passenger survived, we might decide to use the feature Passenger Class. This is a categorical variable with 3 possible classes (1st, 2nd, and 3rd class); suppose some of its values are missing. What we could do is assign a fourth class of ‘U’, or Unknown. Now when we feed this data into whatever classification model we choose (logistic regression, for example), we keep all of our rows of data, plus a newly created class in our Passenger Class feature that the model can use to extract information.
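
Here is a small pandas sketch of the “Unknown” class idea. The column name `passenger_class` and the sample values are assumptions for illustration, not taken from the actual Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger class column with two missing entries.
df = pd.DataFrame({"passenger_class": ["1", "3", np.nan, "2", np.nan, "3"]})

# Give missing entries their own class instead of guessing or dropping rows.
df["passenger_class"] = df["passenger_class"].fillna("U")

# One-hot encode so a model sees "Unknown" as its own feature.
encoded = pd.get_dummies(df["passenger_class"], prefix="pclass")

print(encoded.columns.tolist())
```

If missingness is itself related to the outcome, the extra `pclass_U` column gives the model a way to pick up on that.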

Predict The Missing Values

Using the features that do not have missing data, we can try using an algorithm to predict our missing values. This method may take a little more work than the others, but if the feature in question is particularly useful (high predictive value), it might be worth it to try to fill those nulls with predictions. This works best when you have features with low variance (we might try linear regression to impute missing age values), and its goal is to provide a more accurate value than simply using the mean or median, so that we can ultimately extract more information from the row. Experiment with different algorithms and find out which one works best.
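
As one possible sketch of this idea, the example below fits a simple linear regression with `np.polyfit` on the rows where age is known and uses it to fill the rest. The `years_employed` predictor and all values are hypothetical, and the data is deliberately constructed to be perfectly linear so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: where age is known, age = 20 + 2 * years_employed.
df = pd.DataFrame({
    "years_employed": [1.0, 5.0, 10.0, 15.0, 20.0, 8.0],
    "age": [22.0, 30.0, 40.0, np.nan, 60.0, np.nan],
})

known = df["age"].notna()

# Fit age = slope * years_employed + intercept on complete rows only.
slope, intercept = np.polyfit(
    df.loc[known, "years_employed"].to_numpy(),
    df.loc[known, "age"].to_numpy(),
    deg=1,
)

# Fill only the missing ages with the fitted predictions.
df.loc[~known, "age"] = slope * df.loc[~known, "years_employed"] + intercept

print(df)
```

With real data the fit will not be exact, so it is worth holding out some known values and checking how far the predictions land from them before trusting the imputed column.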

In Conclusion

There is no perfect way to deal with missing values. There is always going to be some amount of guessing or information loss during treatment, and in the end it’s going to be up to you as the engineer to decide which method to use. As always, try a few different methods, look at your data, your accuracy, and your loss, and then make an educated decision about the final treatment.

Translated from: https://medium.com/@sam.bbmgmt/handling-missing-incorrect-data-509cf965fca3
