Dealing-with-Missing-Data-in-Python

radar_sun

Category: Data-Scientist-with-Python. Tags: pandas, python

Article link: https://blog.csdn.net/agoldminer/article/details/113818231

Contents

1. The problem with missing data
2. Does missingness have a pattern?
3. Imputation techniques
4. Advanced imputation techniques

1. The problem with missing data

1.1 Why deal with missing data?

1.2 Steps for treating missing values

Arrange the statements in the order of steps to be taken for dealing with missing data.

A. Evaluate & compare the performance of the treated/imputed dataset.
B. Convert all missing values to null values.
C. Appropriately delete or impute missing values.
D. Analyze the amount and type of missingness in the data.

$\square$ D, A, B, C

$\blacksquare$ B, D, C, A

$\square$ B, C, A, D

1.3 Null value operations

While working with missing data, you'll have to store missing values as an empty type. This makes them easy to identify, replace, or otherwise manipulate. This is why we have the None and numpy.nan types. You need to be able to differentiate clearly between the two.

In this exercise, you will compare the differences between the behavior of None and numpy.nan types on application of arithmetic and logical operations. numpy has already been imported as np. The try and except blocks have been used to avoid errors.

Instruction

Sum two None values and print the output.
Sum two np.nan and print the output.
Print the output of logical or of two None.
Print the output of logical or of two np.nan.
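
The four instructions above can be sketched as follows. This is a minimal illustration rather than the course's reference solution; the variable names (none_sum, nan_sum, none_or, nan_or) are chosen here for clarity and do not come from the exercise.

```python
import numpy as np

# None + None is not defined, so Python raises a TypeError.
try:
    none_sum = None + None
except TypeError as err:
    none_sum = f"TypeError: {err}"
print("None + None   ->", none_sum)

# np.nan is an ordinary float, so arithmetic works and the result stays NaN.
nan_sum = np.nan + np.nan
print("nan + nan     ->", nan_sum)

# `or` returns the first truthy operand, otherwise the last operand.
none_or = None or None      # None is falsy, so the result is None
nan_or = np.nan or np.nan   # NaN is truthy, so the result is NaN
print("None or None  ->", none_or)
print("nan or nan    ->", nan_or)
```

The takeaway is that None fails loudly in arithmetic, while np.nan silently propagates through it.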

1.4 Finding Null values

In the previous exercise, you observed how the two null types, None and NumPy's not-a-number object np.nan, behave under arithmetic and logical operations. In this exercise, you'll further understand their behavior by comparing the two types.

Instruction

Compare two None using == and print the output.
Compare two np.nan using == and print the output.
Print whether None is not a number.
Print whether np.nan is not a number.
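
A possible sketch of these comparisons is shown below. It assumes np.isnan() is the intended "is not a number" check; the result variable names are illustrative only.

```python
import numpy as np

# None compares equal to itself.
none_equal = None == None
print("None == None     ->", none_equal)

# NaN never compares equal to anything, including itself.
nan_equal = np.nan == np.nan
print("np.nan == np.nan ->", nan_equal)

# np.isnan only accepts numeric input, so passing None raises a TypeError.
try:
    none_is_nan = np.isnan(None)
except TypeError as err:
    none_is_nan = f"TypeError: {err}"
print("np.isnan(None)   ->", none_is_nan)

# For np.nan the check succeeds and reports True.
nan_is_nan = np.isnan(np.nan)
print("np.isnan(np.nan) ->", nan_is_nan)
```

Because NaN is not equal to itself, equality tests cannot be used to find missing values; np.isnan() or the pandas .isnull() method is needed instead.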

1.5 Handling missing values

1.6 Detecting missing values

Datasets often contain hidden missing values, where a placeholder such as 'NA' or '.' has been filled in where a value is missing. In this exercise, you will work with the college dataset, which contains various details of college students. Your task is to identify the missing values by analyzing the dataset.

To achieve this, you can use the .info() method from pandas and the numpy function sort() along with the .unique() method to clearly distinguish the dummy value representing the missing data.

Instruction

Read the CSV version of the dataset into a pandas DataFrame.
Print the DataFrame information.
Store the unique values of the csat column to csat_unique.
Sort csat_unique and print the output.
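
The steps above might look like the sketch below. The real college.csv is not available here, so a small inline CSV stands in for it; its columns other than csat and all of its values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for college.csv (values invented for illustration).
csv_text = """gradrat,csat,act
65.0,1000,22
72.5,.,24
58.0,880,.
80.0,1210,27
"""

# Read the CSV as-is, without telling pandas about any placeholder.
college = pd.read_csv(io.StringIO(csv_text))

# A numeric-looking column reported as non-numeric hints at a placeholder.
college.info()

# Sorting the unique values makes the odd one out easy to spot.
csat_unique = college["csat"].unique()
sorted_csat = np.sort(csat_unique)
print(sorted_csat)
```

Because the column was read as text, the sort is lexicographic, and the '.' placeholder ends up at the front of the sorted values.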

1.7 Replacing missing values

In the previous exercise, you analyzed the college dataset and identified that '.' represented a missing value in the data. In this exercise, you will learn the best way to handle such values using the pandas module.

You will learn how to handle such values when importing a CSV file into pandas using its read_csv() function and adjusting its na_values argument, which allows you to specify the DataFrame’s missing values.

Instruction

Load the dataset 'college.csv' to the DataFrame college while setting the appropriate missing values.
Print the DataFrame information.
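
A minimal sketch of loading with na_values follows. As before, an inline CSV with made-up values stands in for 'college.csv', which is an assumption of this example.

```python
import io

import pandas as pd

# Hypothetical stand-in for college.csv (values invented for illustration).
csv_text = """gradrat,csat,act
65.0,1000,22
72.5,.,24
58.0,880,.
80.0,1210,27
"""

# Treat '.' as missing while parsing, so the columns load as numeric.
college = pd.read_csv(io.StringIO(csv_text), na_values=".")

# The non-null counts now reveal the missing entries directly.
college.info()
```

With the real file, the first argument to read_csv() would simply be the path 'college.csv'.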

1.8 Replacing hidden missing values

In the previous two exercises, you worked on identifying and handling missing values while importing a dataset. In this exercise, you will work on identifying hidden missing values in your data and handling them. You’ll use the diabetes dataset which has already been loaded for you.

The diabetes DataFrame has 0's in the column BMI. But BMI cannot be 0; these values should instead be NaN. In this exercise, you'll learn to identify such discrepancies. You'll perform simple data analysis to catch missing values and replace them. Both numpy and pandas have been imported as np and pd respectively.

Additionally, you can explore the dataset, for example by printing its .head() or .info(), to get more familiar with it.

Instruction

Describe the basic statistics of diabetes.
Isolate the values of BMI which are equal to 0 and store them in zero_bmi.
Set all the values in the column BMI that are equal to 0 to np.nan.
Print the rows with NaN values in BMI.
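
These steps can be sketched as below. The course preloads the diabetes DataFrame; here a tiny hand-made frame with invented values stands in for it, and only the BMI column name comes from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the diabetes data (values invented).
diabetes = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137],
        "BMI": [33.6, 0.0, 23.3, 0.0, 43.1],
    }
)

# A minimum of 0 for BMI in the summary is the warning sign.
print(diabetes.describe())

# Isolate the physically impossible zero values.
zero_bmi = diabetes["BMI"][diabetes["BMI"] == 0]
print(zero_bmi)

# Replace the zeros with NaN so pandas treats them as missing.
diabetes.loc[diabetes["BMI"] == 0, "BMI"] = np.nan

# Show the rows whose BMI is now missing.
print(diabetes[diabetes["BMI"].isnull()])
```

Using .loc with a boolean mask changes the DataFrame in place and avoids chained-assignment problems.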

1.9 Analyze the amount of missingness

1.10 Analyzing missingness percentage

Before jumping into treating missing data, it is essential to analyze the various factors surrounding missing data. The elementary step in analyzing the data is to analyze the amount of missingness, that is the number of values missing for a variable. In this exercise, you’ll calculate the total number of missing values per column and also find out the percentage of missing values per column. The 'airquality' dataset which contains weather data collected from various sensors has been loaded for you.

In this exercise, you will load the dataset by parsing the Date column, and then calculate the sum of missing values and the degree of missingness in percent from the nullity DataFrame.

Instruction

Load 'air-quality.csv' into a pandas DataFrame while parsing the 'Date' column and setting it as the index column as well.
Find the missing values in airquality and store the resulting nullity DataFrame in airquality_nullity.
Calculate the number of missing values in airquality.
Calculate the percentage of missing values in airquality.
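
One way to carry out these steps is sketched below. An inline CSV with invented readings replaces 'air-quality.csv', and the names missing_values_sum and missing_values_percent are assumptions of this example.

```python
import io

import pandas as pd

# Hypothetical stand-in for air-quality.csv (readings invented).
csv_text = """Date,Ozone,Solar,Wind,Temp
1976-05-01,41.0,190.0,7.4,67
1976-05-02,36.0,,8.0,72
1976-05-03,,149.0,12.6,74
1976-05-04,18.0,313.0,11.5,62
1976-05-05,,,14.3,56
"""

# Parse the Date column and use it as the index.
airquality = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Date"], index_col="Date"
)

# Boolean "nullity" DataFrame: True wherever a value is missing.
airquality_nullity = airquality.isnull()

# Summing the booleans counts the missing values per column.
missing_values_sum = airquality_nullity.sum()
print(missing_values_sum)

# The mean of the booleans is the missing fraction; scale to percent.
missing_values_percent = airquality_nullity.mean() * 100
print(missing_values_percent)
```

Taking the mean of a boolean column works because True counts as 1 and False as 0.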

1.11 Visualize missingness

In the previous exercise, you calculated the number of missing values and the percentage of missingness for each column. However, numbers alone are usually not enough, and it is preferable to also visualize the missingness graphically.

You'll use the missingno package, which is built for visualizing missing values. The airquality DataFrame has already been imported, and the pandas library as pd.

You will visualize the missingness by plotting a bar chart and a nullity matrix of the missing values.

Instruction

Plot a bar chart of the missing values in airquality.
Plot the nullity matrix of airquality.
Plot the nullity matrix of airquality across a monthly frequency.
Slice airquality from 'May-1976' to 'Jul-1976' and plot its nullity matrix.
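
The plotting steps could be sketched as follows, assuming the third-party missingno package is installed. The synthetic airquality frame, the helper plot_missingness(), and the ISO-style month strings used for slicing are all assumptions of this example, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the airquality DataFrame.
rng = np.random.default_rng(0)
index = pd.date_range("1976-04-01", "1976-08-31", freq="D", name="Date")
airquality = pd.DataFrame(
    {
        "Ozone": rng.normal(40, 10, len(index)),
        "Solar": rng.normal(180, 50, len(index)),
    },
    index=index,
)
# Knock out some values: scattered gaps in Ozone, one block in Solar.
airquality.loc[rng.random(len(index)) < 0.2, "Ozone"] = np.nan
airquality.loc["1976-06-10":"1976-06-20", "Solar"] = np.nan

# Partial-string slicing on a DatetimeIndex: May through July 1976.
summer = airquality.loc["1976-05":"1976-07"]


def plot_missingness(df, freq="M"):
    """Hypothetical helper: bar chart plus nullity matrices for df."""
    import missingno as msno  # third-party; imported lazily on purpose

    msno.bar(df)                # completeness per column
    msno.matrix(df)             # nullity matrix, one row per record
    # Monthly tick labels; newer pandas releases may expect "ME" here.
    msno.matrix(df, freq=freq)
```

Calling plot_missingness(airquality) and then plot_missingness(summer) draws the figures; in a script, matplotlib.pyplot.show() is needed afterwards to display them.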

2. Does missingness have a pattern?

2.1 Is the data missing at random?

2.2 Guess the missingness type

Analyzing the type of missingness helps you deduce the best way to deal with missing data. The Pima Indians diabetes dataset is well known for having missing data. The Pima Indians are an ethnic group who are more prone to having diabetes. The dataset contains several lab tests conducted with members of this community.

In the video lesson, you learned the three types of missingness patterns. In this exercise, you'll first visualize the missingness summary and then identify the types of missingness the DataFrame contains.

Instruction

Import the missingno package as msno and plot the missingness summary of diabetes.

2.3 Deduce MNAR

2.4 Finding patterns in missing data

2.5 Finding correlations in your data

2.6 Identify the missingness type

2.7 Visualizing missingness across a variable

2.8 Fill dummy values

2.9 Generate scatter plot with missingness

2.10 When and how to delete missing data

2.11 Delete MCAR

2.12 Will you delete?

3. Imputation techniques

3.1 Mean, median & mode imputations

3.2 Mean & median imputation

3.3 Mode and constant imputation

3.4 Visualize imputations

3.5 Imputing time-series data

3.6 Filling missing time-series data

3.7 Impute with interpolate method

3.8 Visualizing time-series imputations

3.9 Visualize time-series imputations

3.10 Visualize forward fill imputation

3.11 Visualize backward fill imputation

3.12 Plot interpolations

4. Advanced imputation techniques

4.1 Imputing using fancyimpute

4.2 KNN imputation

4.3 MICE imputation

4.4 Imputing categorical values

4.5 Ordinal encoding of a categorical column

4.6 Ordinal encoding of a DataFrame

4.7 KNN imputation of categorical values

4.8 Evaluation of different imputation techniques

4.9 Analyze the summary of linear model

4.10 Comparing R-squared and coefficients

4.11 Comparing density plots

4.12 Conclusion
