Table of contents
- 1. The problem with missing data
- 1.1 Why deal with missing data?
- 1.2 Steps for treating missing values
- 1.3 Null value operations
- 1.4 Finding Null values
- 1.5 Handling missing values
- 1.6 Detecting missing values
- 1.7 Replacing missing values
- 1.8 Replacing hidden missing values
- 1.9 Analyze the amount of missingness
- 1.10 Analyzing missingness percentage
- 1.11 Visualize missingness
- 2. Does missingness have a pattern?
- 2.1 Is the data missing at random?
- 2.2 Guess the missingness type
- 2.3 Deduce MNAR
- 2.4 Finding patterns in missing data
- 2.5 Finding correlations in your data
- 2.6 Identify the missingness type
- 2.7 Visualizing missingness across a variable
- 2.8 Fill dummy values
- 2.9 Generate scatter plot with missingness
- 2.10 When and how to delete missing data
- 2.11 Delete MCAR
- 2.12 Will you delete?
- 3. Imputation techniques
- 3.1 Mean, median & mode imputations
- 3.2 Mean & median imputation
- 3.3 Mode and constant imputation
- 3.4 Visualize imputations
- 3.5 Imputing time-series data
- 3.6 Filling missing time-series data
- 3.7 Impute with interpolate method
- 3.8 Visualizing time-series imputations
- 3.9 Visualize time-series imputations
- 3.10 Visualize forward fill imputation
- 3.11 Visualize backward fill imputation
- 3.12 Plot interpolations
- 4. Advanced imputation techniques
- 4.1 Imputing using fancyimpute
- 4.2 KNN imputation
- 4.3 MICE imputation
- 4.4 Imputing categorical values
- 4.5 Ordinal encoding of a categorical column
- 4.6 Ordinal encoding of a DataFrame
- 4.7 KNN imputation of categorical values
- 4.8 Evaluation of different imputation techniques
- 4.9 Analyze the summary of linear model
- 4.10 Comparing R-squared and coefficients
- 4.11 Comparing density plots
- 4.12 Conclusion
1. The problem with missing data
1.1 Why deal with missing data?
1.2 Steps for treating missing values
Arrange the statements in the order of steps to be taken for dealing with missing data.
A. Evaluate & compare the performance of the treated/imputed dataset.
B. Convert all missing values to null values.
C. Appropriately delete or impute missing values.
D. Analyze the amount and type of missingness in the data.
□ D, A, B, C
■ B, D, C, A
□ B, C, A, D
1.3 Null value operations
While working with missing data, you’ll have to store these missing values as an empty type. This way, you will easily be able to identify them, replace them or play with them! This is why we have the `None` and `numpy.nan` types. You need to be able to differentiate clearly between the two types.
In this exercise, you will compare how the `None` and `numpy.nan` types behave under arithmetic and logical operations. `numpy` has already been imported as `np`. The `try` and `except` blocks have been used to avoid errors.
Instruction
- Sum two `None` values and print the output.
- Sum two `np.nan` values and print the output.
- Print the output of logical `or` of two `None` values.
- Print the output of logical `or` of two `np.nan` values.
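The four operations above can be sketched as follows. This is one possible way to write the exercise, not necessarily the course's reference solution; the variable names (`none_sum`, `nan_sum`, `none_or`, `nan_or`) are chosen here for illustration.

```python
import numpy as np

# Adding two None values is not defined, so Python raises a TypeError.
try:
    none_sum = None + None
    print("Sum of two None values:", none_sum)
except TypeError:
    none_sum = "unsupported"
    print("None does not support arithmetic operations")

# np.nan is a float, so the sum is defined and the result is again NaN.
nan_sum = np.nan + np.nan
print("Sum of two np.nan values:", nan_sum)

# `or` returns the first truthy operand, otherwise the last operand.
none_or = None or None      # None is falsy, so the result is None
nan_or = np.nan or np.nan   # NaN is truthy, so the result is NaN
print("None or None:", none_or)
print("np.nan or np.nan:", nan_or)
```

The key difference is that `None` is a generic Python object with no arithmetic defined, while `np.nan` is a floating-point value that propagates through calculations.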
1.4 Finding Null values
In the previous exercise, you observed how the two null types, `None` and the `numpy` not-a-number object `np.nan`, behave with respect to arithmetic and logical operations. In this exercise, you’ll further understand their behavior by comparing the two types.
Instruction
- Compare two `None` values using `==` and print the output.
- Compare two `np.nan` values using `==` and print the output.
- Print whether `None` is not a number.
- Print whether `np.nan` is not a number.
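A minimal sketch of these comparisons, assuming `np.isnan()` is the intended "is not a number" check (the text does not name the function). The variable names are illustrative.

```python
import numpy as np

# None is a singleton, so comparing it with itself gives True.
none_equal = None == None
print("None == None:", none_equal)

# By the floating-point standard, NaN never compares equal, not even to itself.
nan_equal = np.nan == np.nan
print("np.nan == np.nan:", nan_equal)

# np.isnan() only accepts numeric input, so passing None raises a TypeError.
try:
    none_is_nan = np.isnan(None)
    print("Is None not a number?", none_is_nan)
except TypeError:
    none_is_nan = None
    print("np.isnan() does not accept None")

# np.nan is a float, so the check works and returns True.
nan_is_nan = np.isnan(np.nan)
print("Is np.nan not a number?", nan_is_nan)
```

Because `np.nan == np.nan` is `False`, equality tests cannot be used to find missing values; a dedicated check such as `np.isnan()` is needed.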
1.5 Handling missing values
1.6 Detecting missing values
Datasets often come with missing values hidden behind placeholder strings such as `'NA'`, `'.'` or others. In this exercise, you will work with the `college` dataset, which contains various details of college students. Your task is to identify the missing values by analyzing the dataset.
To achieve this, you can use the `.info()` method from `pandas` and the `numpy` function `sort()` along with the `.unique()` method to clearly distinguish the dummy value representing the missing data.
Instruction
- Read the CSV version of the dataset into a `pandas` DataFrame.
- Print the DataFrame information.
- Store the unique values of the `csat` column to `csat_unique`.
- Sort `csat_unique` and print the output.
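The steps above might look like the sketch below. Since the real `college.csv` is not included here, the example reads a tiny made-up CSV from memory; the rows and the `act` column are assumptions for illustration, and only the `csat` column name comes from the exercise.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for 'college.csv'; '.' marks a missing value.
csv_text = """csat,act
990,21
.,24
810,.
1000,26
.,22
"""

# Read the CSV without any special handling of missing values.
college = pd.read_csv(io.StringIO(csv_text))

# A numeric-looking column reported as non-numeric hints at a placeholder string.
college.info()

# Collect the unique values of 'csat' and sort them to spot the odd one out.
csat_unique = college["csat"].unique()
sorted_csat = np.sort(csat_unique)
print(sorted_csat)
```

Because the column was read as text, the values sort as strings and the `'.'` placeholder appears first, which makes it easy to spot.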
1.7 Replacing missing values
In the previous exercise, you analyzed the college dataset and identified that `'.'` represented a missing value in the data. In this exercise, you will learn the best way to handle such values using the `pandas` module.
You will learn how to handle such values when importing a CSV file into `pandas` using its `read_csv()` function and adjusting its `na_values` argument, which lets you specify additional strings to treat as missing values.
Instruction
- Load the dataset `'college.csv'` into the DataFrame `college` while setting the appropriate missing values.
- Print the DataFrame information.
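A minimal sketch of loading with `na_values`, again using a small in-memory CSV as a hypothetical stand-in for `'college.csv'` (the rows and the `act` column are invented for illustration).

```python
import io

import pandas as pd

# Hypothetical stand-in for 'college.csv'; '.' marks a missing value.
csv_text = """csat,act
990,21
.,24
810,.
1000,26
.,22
"""

# Tell read_csv to treat '.' as missing, so the columns are parsed as numbers.
college = pd.read_csv(io.StringIO(csv_text), na_values=".")

# The non-null counts now reflect the real number of observed values.
college.info()
print(college.isna().sum())
```

With a real file, the first argument would simply be the path, for example `pd.read_csv('college.csv', na_values='.')`.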
1.8 Replacing hidden missing values
In the previous two exercises, you worked on identifying and handling missing values while importing a dataset. In this exercise, you will work on identifying hidden missing values in your data and handling them. You’ll use the `diabetes` dataset which has already been loaded for you.
The `diabetes` DataFrame has 0’s in the column `BMI`. But `BMI` cannot be 0. It should instead be `NaN`. In this exercise, you’ll learn to identify such discrepancies. You’ll perform simple data analysis to catch missing values and replace them. Both `numpy` and `pandas` have been imported as `np` and `pd` respectively.
Additionally, you can play around with the dataset, for example by printing its `.head()`, `.info()` etc., to get more familiar with it.
Instruction
- Describe the basic statistics of `diabetes`.
- Isolate the values of `BMI` which are equal to 0 and store them in `zero_bmi`.
- Set all the values in the column `BMI` that are equal to 0 to `np.nan`.
- Print the rows with `NaN` values in `BMI`.
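These steps can be sketched on a small hand-made DataFrame. The values and the `Glucose` and `Age` columns are invented for illustration; only the `BMI` column and the `diabetes` and `zero_bmi` names come from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diabetes data; a BMI of 0 is physically impossible.
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BMI": [33.6, 0.0, 23.3, 0.0, 43.1],
    "Age": [50, 31, 32, 21, 33],
})

# A minimum of 0 in the summary statistics is the first warning sign.
print(diabetes.describe())

# Isolate the BMI values that are equal to 0.
zero_bmi = diabetes.BMI[diabetes.BMI == 0]
print(zero_bmi)

# Replace the zeros with NaN so pandas treats them as missing.
diabetes.loc[diabetes.BMI == 0, "BMI"] = np.nan

# Show the rows where BMI is now missing.
print(diabetes[diabetes.BMI.isna()])
```

Using `.loc` with a boolean mask assigns in place on the original DataFrame, which avoids the chained-assignment pitfalls of `diabetes.BMI[mask] = np.nan`.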
1.9 Analyze the amount of missingness
1.10 Analyzing missingness percentage
Before jumping into treating missing data, it is essential to analyze the various factors surrounding missing data. The elementary step in analyzing the data is to analyze the amount of missingness, that is, the number of values missing for a variable. In this exercise, you’ll calculate the total number of missing values per column and also find out the percentage of missing values per column. You’ll work with the `airquality` dataset, which contains weather data collected from various sensors.
In this exercise, you will load the dataset by parsing the `Date` column and then calculate the sum of missing values and the degree of missingness in percent on the nullity DataFrame.
Instruction
- Load `'air-quality.csv'` into a `pandas` DataFrame while parsing the `'Date'` column and setting it to the index column as well.
- Find the missing values in `airquality` and store the resulting nullity DataFrame in `airquality_nullity`.
- Calculate the number of missing values in `airquality`.
- Calculate the percentage of missing values in `airquality`.
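A minimal sketch of these calculations, using a few made-up rows in place of `'air-quality.csv'`. The column names other than `Date` and the result names `missing_values_sum` and `missing_values_percent` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'air-quality.csv'; empty fields are missing values.
csv_text = """Date,Ozone,Solar,Wind,Temp
1976-05-01,41.0,190.0,7.4,67
1976-05-02,36.0,,8.0,72
1976-05-03,,149.0,12.6,74
1976-05-04,18.0,313.0,11.5,62
1976-05-05,,,14.3,56
"""

# Parse 'Date' as datetime and use it as the index.
airquality = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Date"], index_col="Date"
)

# Nullity DataFrame: True where a value is missing, False otherwise.
airquality_nullity = airquality.isnull()

# Summing booleans counts the True values, giving missing values per column.
missing_values_sum = airquality_nullity.sum()
print("Total missing values:\n", missing_values_sum)

# The mean of booleans is the fraction of True values; scale it to percent.
missing_values_percent = airquality_nullity.mean() * 100
print("Percentage of missing values:\n", missing_values_percent)
```

Here two of the five `Ozone` readings and two of the five `Solar` readings are missing, so both columns come out at 40 percent.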
1.11 Visualize missingness
In the previous exercise, you calculated the number of missing values and the percentage of missingness for each column. However, numbers alone are usually not enough, and it is preferable to visualize the missingness graphically.
You’ll use the `missingno` package, which is built for visualizing missing values. The `airquality` DataFrame has already been loaded, and the `pandas` library has been imported as `pd`.
You will visualize the missingness by plotting a bar chart and a nullity matrix of the missing values.
Instruction
- Plot a bar chart of the missing values in `airquality`.
- Plot the nullity matrix of `airquality`.
- Plot the nullity matrix of `airquality` across a monthly frequency.
- Slice `airquality` from `'May-1976'` to `'Jul-1976'` and plot its nullity matrix.
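The plots above might be produced roughly as follows. This is a sketch under several assumptions: the data is randomly generated rather than the real `airquality` dataset, the `Ozone` and `Solar` column names are illustrative, and the plotting part is skipped if the third-party `missingno` package is not installed. The slice uses `'1976-05'` and `'1976-07'`, which select the same May to July range.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data with a DatetimeIndex, standing in for airquality.
dates = pd.date_range("1976-05-01", "1976-09-30", freq="D")
rng = np.random.default_rng(0)
airquality = pd.DataFrame(
    {
        "Ozone": rng.normal(40, 10, len(dates)),
        "Solar": rng.normal(190, 50, len(dates)),
    },
    index=dates,
)
# Knock out some values: every 7th Ozone reading and a block of Solar readings.
airquality.loc[airquality.index[::7], "Ozone"] = np.nan
airquality.loc["1976-06-10":"1976-06-20", "Solar"] = np.nan

# Partial-string slicing on a DatetimeIndex selects whole months.
summer = airquality.loc["1976-05":"1976-07"]

try:
    import matplotlib
    matplotlib.use("Agg")  # headless backend; drop this line in a notebook
    import matplotlib.pyplot as plt
    import missingno as msno
except ImportError:
    msno = None  # plotting is skipped when missingno is unavailable

if msno is not None:
    msno.bar(airquality)     # count of non-missing values per column
    msno.matrix(airquality)  # nullity matrix: white gaps mark missing values
    msno.matrix(summer)      # nullity matrix of the May to July slice
    plt.close("all")
```

`msno.matrix()` also accepts a `freq` argument (for example `freq='M'`) to label the rows by period when the DataFrame has a datetime index; it is left out of the sketch because the accepted frequency aliases depend on the installed `pandas` version.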
2. Does missingness have a pattern?
2.1 Is the data missing at random?
2.2 Guess the missingness type
Analyzing the type of missingness helps you to deduce the best ways you can deal with missing data. The Pima Indians diabetes dataset is well known for having missing data. The Pima Indians are an ethnic group who are more prone to having diabetes. The dataset contains several lab tests conducted with members of this community.
In the video lesson, you learned the three types of missingness patterns. In this exercise you’ll first visualize the missingness summary and then identify the types of missingness the DataFrame contains.
Instruction
Import the `missingno` package as `msno` and plot the missingness summary of `diabetes`.
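A minimal sketch of this step, assuming the "missingness summary" refers to the nullity matrix drawn by `msno.matrix()` (the text does not name the function). The DataFrame below is a small invented stand-in for `diabetes` with hypothetical column names, and plotting is skipped if `missingno` is not installed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diabetes data, with NaN marking missing values.
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "Skin_Fold": [35.0, 29.0, np.nan, 23.0, np.nan, np.nan],
    "Serum_Insulin": [np.nan, np.nan, np.nan, 94.0, 168.0, np.nan],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
})

# Text summary of the missingness, useful alongside the plot.
missing_counts = diabetes.isna().sum()
print(missing_counts)

try:
    import matplotlib
    matplotlib.use("Agg")  # headless backend; drop this line in a notebook
    import matplotlib.pyplot as plt
    import missingno as msno
except ImportError:
    msno = None  # plotting is skipped when missingno is unavailable

if msno is not None:
    # Columns whose gaps line up row by row suggest related missingness.
    msno.matrix(diabetes)
    plt.close("all")
```

In the matrix, columns with isolated gaps and columns whose gaps occur in the same rows look visibly different, which is the starting point for guessing the missingness type.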