How to Handle Missing Data with Python

weixin_26746401

Original article: https://medium.com/@kvssetty/how-to-handel-missing-data-71a3eb89ef91


The complete notebook and the required datasets can be found in the Git repo here.

Real-world data often has missing values.

Data can have missing values for a number of reasons, such as observations that were not recorded or measured, or data that became corrupted.

Handling missing data is important because many machine learning algorithms do not support data with missing values.

In this notebook, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

How to mark invalid or corrupt values as missing in your dataset.
How to remove rows with missing data from your dataset.
How to impute missing values with mean values in your dataset.
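
The three skills listed above can be previewed on a tiny table. This is a minimal sketch, not the notebook's own code: the column names and values are hypothetical, and it assumes that a 0 in a measurement column stands for a value that was never recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data. A 0 in "glucose" or "bmi" is physically impossible,
# so it is assumed here to stand for a missing measurement.
df = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "bmi": [33.6, 26.6, 0.0, 28.1],
    "label": [1, 0, 1, 0],
})

# 1. Mark invalid values as missing (NaN) in the measurement columns only.
cols = ["glucose", "bmi"]
marked = df.copy()
marked[cols] = marked[cols].replace(0, np.nan)

# 2. Remove every row that contains at least one missing value.
dropped = marked.dropna()

# 3. Impute each missing value with the mean of its column.
imputed = marked.fillna(marked.mean())

print(marked.isnull().sum())  # missing-value count per column
print(dropped.shape)          # rows and columns left after dropping
print(imputed)
```

Dropping rows is the simplest option but discards data, while mean imputation keeps every row at the cost of inventing plausible values; which one is appropriate depends on the dataset.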

Let's get started.

"How to Handle Missing Data with Python"

Overview

This tutorial is divided into 6 parts:

Diabetes Dataset: where we look at a dataset that has known missing values.
Mark Missing Values: where we learn how to mark missing values in a dataset.
Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the data contains missing values.
Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
Impute Missing Values: where we replace missing values with sensible values.
Algorithms that Support Missing Values: where we learn about algorithms that support missing values.

First, let's take a look at our sample dataset with missing values.

1. Diabetes Dataset

The Diabetes Dataset involves predicting the onset of diabetes within 5 years, given medical details.

Dataset File.
Dataset Details.

Both files are available in the same folder as this notebook.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

Number of times pregnant.
Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
Diastolic blood pressure (mm Hg).
Triceps skinfold thickness (mm).
2-Hour serum insulin (mu U/ml).
Body mass index (weight in kg/(height in m)²).
Diabetes pedigree function.
Age (years).
Class variable (0 or 1).
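
The nine variables above map naturally onto named columns when the data is loaded. The sketch below assumes a headerless CSV with the variables in the listed order; the short column names are hypothetical labels chosen for this example, and the two inline rows stand in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical short names for the 8 inputs and 1 output, in the listed order.
names = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "pedigree", "age", "label",
]

# Two illustrative rows standing in for the dataset file (assumed headerless CSV).
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# header=None tells pandas the first line is data, not column names.
df = pd.read_csv(raw, header=None, names=names)

# Split into the input variables and the class variable.
X = df.drop(columns="label")
y = df["label"]

print(df.dtypes)
print(X.shape, y.shape)
```

With the real file, the `io.StringIO` object would be replaced by the path to the dataset in the notebook's folder.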

Always predicting the most prevalent class gives a baseline classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
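
To make the baseline concrete, here is a small sketch of how a majority-class accuracy is computed. The label list is hypothetical and was chosen only so that the larger class covers 65% of the rows; it is not the real dataset.

```python
from collections import Counter

# Hypothetical labels: 13 of class 0 and 7 of class 1 (not the real data).
labels = [0] * 13 + [1] * 7

# The baseline "model" always predicts the most frequent class.
majority_class, majority_count = Counter(labels).most_common(1)[0]

# Its accuracy is simply the share of rows that belong to that class.
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

Any trained model should beat this figure to be considered useful, which is why it serves as the reference point.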
