Using R to Fix Data Quality: Section 5

Section 5: Pareto Charts


Overview

Sometimes it is impossible to fix every data quality issue in a large data set. Fortunately, in most cases we can fix about 80% of the problems with about 20% of the effort. In this section, we look at how to find missing data and how to make a Pareto chart.


Finding Missing Data

We can use is.na() to find missing data directly. It works on a whole table (data frame) as well as on a single column. In this demo, we use weather.csv as our data again.

Read the CSV data:

> data <- read.csv("weather.csv")

Get a logical matrix that marks each missing value with TRUE:

> missing <- is.na(data)
> head(missing)
     Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE


After that, we can count the NA values in our data. Since TRUE counts as 1, sum() gives the total:

> sum(missing)
[1] 44
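
Because is.na() accepts both a whole table and a single column, the two uses can be compared side by side. The following is a minimal sketch on a small made-up data frame (the name toy and all of its values are hypothetical, not taken from weather.csv), so it runs without any file:

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  Ozone = c(41, NA, 12, NA),
  Wind  = c(7.4, 8.0, NA, 11.5),
  Temp  = c(67, 72, 74, 62)
)

# Whole table: a logical matrix with one cell per value.
toy_missing <- is.na(toy)

# Single column: a logical vector with one element per row.
ozone_missing <- is.na(toy$Ozone)

# TRUE counts as 1, so sum() gives the number of missing values.
total_na <- sum(toy_missing)     # all columns together
ozone_na <- sum(ozone_missing)   # Ozone column only

# Row numbers that contain at least one NA.
rows_with_na <- which(rowSums(toy_missing) > 0)

print(total_na)
print(ozone_na)
print(rows_with_na)
```

The same pattern should carry over to the data read from weather.csv by replacing toy with data.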


Pareto Principle

The Pareto Principle is also known as the 80-20 rule. It states that a small number of causes accounts for a large share of the effects. The principle suggests that we should focus on the few major causes of our problems, because fixing those will fix the majority of the effects.
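
To make the idea concrete, here is a small sketch that finds how few causes are needed to cover 80% of all problems. The cause names (a to j) and their counts are hypothetical, chosen only to illustrate the arithmetic:

```r
# Hypothetical number of problems per cause; invented for illustration.
problems <- c(a = 4, b = 45, c = 2, d = 37, e = 3,
              f = 1, g = 2, h = 3, i = 1, j = 2)

# Sort from largest to smallest and express each cause as a share of the total.
share <- sort(problems, decreasing = TRUE) / sum(problems) * 100

# Running total of the shares, largest cause first.
cum_share <- cumsum(share)

# Smallest number of leading causes whose combined share reaches 80%.
n_needed <- which(cum_share >= 80)[1]
vital_few <- names(cum_share)[seq_len(n_needed)]

# Those causes as a percentage of all causes.
cause_fraction <- unname(n_needed) / length(problems) * 100

print(vital_few)
print(cause_fraction)
```

In this invented example, the two largest causes out of ten already cover more than 80% of the problems, which is the pattern the principle describes. Real data will rarely split so neatly.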


Pareto Chart

We can create a Pareto Chart from our data in the following steps.


Count missing data for each column:

> missum <- apply(missing, 2, sum)

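Convert the counts to percentages, sort them in decreasing order, and create the bar plot. barplot() returns the x positions of the bar midpoints, so we keep them for the next step:

> percentage <- sort(missum/sum(missum) * 100, decreasing=TRUE)

> mids <- barplot(percentage, ylim=c(0,100))

Plot the cumulative percentage over the bars, using the stored midpoints so that each point lines up with its bar:

> cumulative <- cumsum(percentage)

> lines(mids, cumulative, type="b")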
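
The steps above can also be collected into one helper. The sketch below is one possible arrangement, not code from the original demo: the function name pareto_missing, its plot argument, and the toy data frame are all hypothetical. It returns the sorted and cumulative percentages so that they can be inspected without looking at the plot.

```r
# Hypothetical helper: draws a Pareto chart of missing values per column
# and returns the numbers behind it.
pareto_missing <- function(df, plot = TRUE) {
  # Count NA values in each column.
  missum <- colSums(is.na(df))
  if (sum(missum) == 0) stop("no missing values to chart")

  # Share of all missing values per column, largest first.
  percentage <- sort(missum / sum(missum) * 100, decreasing = TRUE)
  cumulative <- cumsum(percentage)

  if (plot) {
    # barplot() returns the bar midpoints; reuse them as x positions
    # so the cumulative line sits over the bars.
    mids <- barplot(percentage, ylim = c(0, 100),
                    ylab = "Share of missing values (%)")
    lines(mids, cumulative, type = "b")
  }
  invisible(list(percentage = percentage, cumulative = cumulative))
}

# Toy data with invented values, so the example needs no file.
toy <- data.frame(
  Ozone = c(NA, NA, NA, 20, 18),
  Wind  = c(7.4, NA, 12.6, 11.5, 14.3),
  Temp  = c(67, 72, 74, 62, 56)
)
result <- pareto_missing(toy, plot = FALSE)
print(result$percentage)
print(result$cumulative)
```

Calling pareto_missing(data) with the data frame read from weather.csv should draw the same kind of chart as the step-by-step version.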


Congratulations! You have completed your Pareto Chart.


Practice Questions

1) Would we say that the missing data in our demo follows the Pareto Principle?

2) How many missing values are there in the Ozone column?

