Section 5: Pareto Charts
Overview
Sometimes it is impossible to fix every data quality issue in a big data set. Fortunately, in most cases we can spend about 20% of the time to fix about 80% of the problems. In this section, we talk about how to find missing data and how to make a Pareto chart.
Finding Missing Data
We can use is.na() to find missing data directly. It can be applied to a whole table as well as to a single column. In this demo, we use weather.csv as our data again.
Read the CSV data:
> data=read.csv("weather.csv")
Get a logical matrix that marks the missing data:
> missing = is.na(data)
> head(missing)
Ozone Solar.R Wind Temp Month Day
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] TRUE TRUE FALSE FALSE FALSE FALSE
[6,] FALSE TRUE FALSE FALSE FALSE FALSE
After that, we can check the number of NA values in our data:
> sum(missing)
[1] 44
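The same functions also work on a single column. The following is a minimal sketch on a small made-up data frame; the name toy and all of its values are assumptions for illustration and are not taken from weather.csv.

```r
# Hypothetical toy data frame; the values are made up for illustration.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() on the whole table returns a logical matrix.
toy_missing <- is.na(toy)

# is.na() on one column returns a logical vector.
a_missing <- is.na(toy$a)

# TRUE counts as 1, so sum() gives the number of missing values.
total_missing <- sum(toy_missing)  # all columns
a_count <- sum(a_missing)          # column a only

# colSums() counts the missing values of every column in one step.
per_column <- colSums(toy_missing)

# which() gives the row positions of the missing values in column a.
a_rows <- which(a_missing)
```

Counting per column with colSums() gives the same result as apply(x, 2, sum) and is often easier to read.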
Pareto Principle
The Pareto Principle is also known as the 80-20 principle. It says that a small number of causes is responsible for a large share of the effects. The principle suggests that we should focus on the few major causes of our problems, because fixing those will fix the majority of the effects.
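To make the idea concrete, here is a small sketch with made-up problem counts; the cause names and numbers are hypothetical. It sorts the causes by size and finds how many of them are needed to cover at least 80% of all problems.

```r
# Hypothetical problem counts per cause; made up for illustration.
counts <- c(typos = 5, missing = 60, duplicates = 25, outliers = 7, format = 3)

# Sort from the largest cause to the smallest.
sorted <- sort(counts, decreasing = TRUE)

# Share of each cause and the running total, both in percent.
share <- sorted / sum(sorted) * 100
running <- cumsum(share)

# Number of causes needed to reach at least 80% of all problems.
n_needed <- which(running >= 80)[1]
```

In this made-up example, two of the five causes cover more than 80% of the problems, so fixing those two first gives most of the benefit.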
Pareto Chart
We can create a Pareto chart from our data in the following steps.
Count missing data for each column:
> missum <- apply(missing, 2, sum)
Create the bar plot (barplot() returns the midpoints of the bars, which we keep for the next step):
> percentage <- sort(missum/sum(missum) * 100, decreasing=TRUE)
> bp <- barplot(percentage, ylim=c(0,100))
Plot cumulative percentage:
> cumulative <- cumsum(percentage)
> lines(bp, cumulative, type="b")
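The steps above could be wrapped in two small helper functions. This is only a sketch: pareto_table and pareto_plot are hypothetical names, not functions from base R or any package, and the data frame toy is made up. The sketch assumes the data has at least one missing value, otherwise the percentages are undefined.

```r
# Hypothetical helper: per-column missing-value counts, sorted from
# largest to smallest, with their percentage and cumulative percentage.
# Assumes the data frame contains at least one NA.
pareto_table <- function(df) {
  counts <- sort(colSums(is.na(df)), decreasing = TRUE)
  percentage <- counts / sum(counts) * 100
  data.frame(
    column = names(counts),
    count = as.vector(counts),
    percentage = as.vector(percentage),
    cumulative = cumsum(as.vector(percentage))
  )
}

# Hypothetical helper: draw the bars and overlay the cumulative line.
pareto_plot <- function(tab) {
  # barplot() returns the bar midpoints, used here as x positions so
  # that the cumulative points sit at the center of each bar.
  mid <- barplot(tab$percentage, names.arg = as.character(tab$column),
                 ylim = c(0, 100), ylab = "Share of missing values (%)")
  lines(mid, tab$cumulative, type = "b")
  invisible(mid)
}

# Made-up data frame for illustration; values are not from weather.csv.
toy <- data.frame(
  p = c(NA, NA, NA, 4, 5, NA),
  q = c(1, NA, 3, 4, 5, 6),
  r = c(1, 2, 3, 4, 5, 6)
)
tab <- pareto_table(toy)
pareto_plot(tab)
```

Keeping the computation separate from the plotting makes it possible to inspect the table of percentages without drawing anything.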
Practice Questions
1) Would we say that the missing data in our demo follows the Pareto Principle?
2) How many missing values are there in the Ozone column?