Using R to Fix Data Quality: Section 7

Section 7: Fix missing data


Overview

Previous sections showed how to find missing data in a table. In this section, we use linear regression to fill in the missing values.


Read CSV Data

In this demo, we use hours.csv as our data set.


> data=read.csv("hours.csv")
> head(data)
  Hours Score Questions.Posted Days.Missed
1     1    NA                0           3
2     1    55                2           0
3     3    NA                0           2
4     5    60                0           1
5     5    65                0           2
6     5    70                1           3


This table has four columns. Hours, Questions.Posted and Days.Missed are complete, but some values of Score are missing (shown as NA). We therefore need a way to fill in the missing Score values.

Writing data$ in front of every column name quickly becomes tedious, so we can attach the data frame:
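To see how many values are missing in each column, a quick count can help. The sketch below is only an illustration: `demo` is a stand-in data frame holding the six rows printed by head(data), not the full hours.csv.

```r
# Stand-in for hours.csv: only the six rows shown by head(data)
demo <- data.frame(
  Hours            = c(1, 1, 3, 5, 5, 5),
  Score            = c(NA, 55, NA, 60, 65, 70),
  Questions.Posted = c(0, 2, 0, 0, 0, 1),
  Days.Missed      = c(3, 0, 2, 1, 2, 3)
)

# Number of NA values in each column
na_counts <- colSums(is.na(demo))
print(na_counts)

# Row numbers where Score is missing
missing_rows <- which(is.na(demo$Score))
print(missing_rows)
```

In these six rows, only Score contains NA values.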

> attach(data)


After this command, we can type “Score” instead of “data$Score”.
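One thing to keep in mind is that attach() exposes a copy of the columns as they are at that moment, so later changes to the data frame are not visible through the short names. The sketch below illustrates this on a small made-up data frame (`demo` and its values are assumptions for the example) and shows with() as an alternative that avoids attach().

```r
# Small made-up data frame for illustration
demo <- data.frame(Hours = c(1, 3, 5), Score = c(NA, 55, 60))

# with() evaluates an expression inside the data frame, no attach() needed
mean_score <- with(demo, mean(Score, na.rm = TRUE))
print(mean_score)

# attach() makes the columns visible by name, as a copy
attach(demo)
demo$Score[1] <- 50          # change the data frame afterwards
attached_first <- Score[1]   # still NA: the attached copy did not change
detach(demo)                 # clean up the search path
print(attached_first)
```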


Linear Regression

An easy way to fill in missing data is random imputation, which replaces each NA with a randomly chosen value, but this hurts the accuracy of our data. Linear regression instead uses the existing data to predict each missing value. Implementing linear regression from scratch takes some work in a general-purpose language such as Java, C, or Python. Fortunately, R has a built-in linear regression function, lm(), that we can use directly.


Fit a linear regression of Score on Hours, Questions.Posted and Days.Missed:
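For contrast, here is a minimal sketch of one form of random imputation, where each NA is replaced by a value drawn at random from the observed values. The helper name `random_impute` is made up for this example, and other definitions of random imputation exist.

```r
# Hypothetical helper: fill each NA with a random draw from the observed values
random_impute <- function(a) {
  missing  <- is.na(a)
  observed <- a[!missing]
  a[missing] <- sample(observed, sum(missing), replace = TRUE)
  a
}

set.seed(1)  # make the random draws reproducible
score  <- c(NA, 55, NA, 60, 65, 70)   # Score values from head(data)
filled <- random_impute(score)
print(filled)
```

The filled-in values ignore Hours, Questions.Posted and Days.Missed entirely, which is why they can be far from what a student with those values would plausibly score.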

> lmod=lm(Score ~ Hours + Questions.Posted + Days.Missed)
> lmod
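To get a feel for what the fitted object contains, the following sketch fits the same formula on made-up data. The data frame `demo` is hypothetical: its Score column is generated from an exact linear rule so that the recovered coefficients are easy to check, and it is not the real hours.csv.

```r
# Hypothetical data: Score follows an exact linear rule, so the fit is easy to check
demo <- data.frame(
  Hours            = c(1, 1, 3, 5, 5, 5, 6, 7, 8, 9),
  Questions.Posted = c(0, 2, 0, 0, 0, 1, 3, 1, 2, 4),
  Days.Missed      = c(3, 0, 2, 1, 2, 3, 0, 1, 0, 1)
)
demo$Score <- 50 + 4 * demo$Hours + demo$Questions.Posted - 2 * demo$Days.Missed
demo$Score[c(1, 3)] <- NA   # knock out two values to mimic missing scores

# Passing data = demo means the formula does not rely on attach()
fit <- lm(Score ~ Hours + Questions.Posted + Days.Missed, data = demo)

coefs  <- coef(fit)   # intercept plus one slope per predictor
n_used <- nobs(fit)   # rows with NA in Score are dropped by default
print(round(coefs, 3))
print(n_used)
```

By default, lm() leaves out the rows where Score is NA, so only the complete rows contribute to the fit.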


“lmod” holds the result of our linear regression, and we can use it to predict Score. For example, suppose we want to predict the score when Hours = 3, Questions.Posted = 50, and Days.Missed = 2.

The code to predict:

> predict(lmod, data.frame(Hours=c(3), Questions.Posted=c(50), Days.Missed=c(2)))
       1 
31.65578


As can be seen, the predicted Score is 31.65578.
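A prediction from a linear model is the intercept plus each coefficient times the corresponding input value. The sketch below checks this by hand on made-up data; `demo`, `fit` and `new_student` are names chosen for the example, and the Score values come from an assumed exact linear rule rather than from hours.csv, so the number differs from the one above.

```r
# Hypothetical complete data following an exact linear rule (not the real hours.csv)
demo <- data.frame(
  Hours            = c(1, 5, 5, 5, 6, 7, 8, 9),
  Questions.Posted = c(2, 0, 0, 1, 3, 1, 2, 4),
  Days.Missed      = c(0, 1, 2, 3, 0, 1, 0, 1)
)
demo$Score <- 50 + 4 * demo$Hours + demo$Questions.Posted - 2 * demo$Days.Missed
fit <- lm(Score ~ Hours + Questions.Posted + Days.Missed, data = demo)

new_student <- data.frame(Hours = 3, Questions.Posted = 50, Days.Missed = 2)

# predict() applies the fitted equation to the new row
from_predict <- unname(predict(fit, new_student))

# The same number by hand: intercept + slope * value for each predictor
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["Hours"] * 3 +
                  b["Questions.Posted"] * 50 + b["Days.Missed"] * 2)

print(c(from_predict, by_hand))
```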


Deterministic Regression Imputation

We have used linear regression to predict Score from the other three variables. The next step is to replace each NA with its predicted value.

In fact, predict() can be applied to the whole data table directly:

> p=predict(lmod,data)

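Next, we define a function impute() that replaces each NA with the corresponding prediction:

# Return a, with each NA replaced by the matching element of a.impute
impute <- function(a, a.impute) {
  ifelse(is.na(a), a.impute, a)
}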
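To see what impute() does on its own, here is a small sketch with made-up vectors. The values in `observed` and `predicted` are assumptions for illustration, not output from the model above.

```r
impute <- function(a, a.impute) {
  ifelse(is.na(a), a.impute, a)
}

observed  <- c(NA, 55, NA, 60)   # made-up scores with two gaps
predicted <- c(48, 56, 58, 68)   # made-up predictions, one per element

# Only the NA positions take the predicted value; observed values are kept
filled <- impute(observed, predicted)
print(filled)
```

The second and fourth elements keep their observed values even though a prediction exists for them.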

Replace each NA with its predicted value:

> data$Score=impute(data$Score,p)
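As a sanity check, the whole procedure can be run on made-up data where the true values are known. In this sketch, `demo` is hypothetical (Score follows an assumed exact linear rule), two scores are removed on purpose, and the NA positions are filled by direct indexing, which has the same effect as impute().

```r
# Hypothetical data built from an exact linear rule (not the real hours.csv)
demo <- data.frame(
  Hours            = c(1, 1, 3, 5, 5, 5, 6, 7, 8, 9),
  Questions.Posted = c(0, 2, 0, 0, 0, 1, 3, 1, 2, 4),
  Days.Missed      = c(3, 0, 2, 1, 2, 3, 0, 1, 0, 1)
)
demo$Score <- 50 + 4 * demo$Hours + demo$Questions.Posted - 2 * demo$Days.Missed
original <- demo$Score       # keep the true values for comparison
demo$Score[c(1, 3)] <- NA    # simulate two missing scores

fit <- lm(Score ~ Hours + Questions.Posted + Days.Missed, data = demo)
p   <- predict(fit, demo)    # one prediction per row, including the NA rows

was_missing <- is.na(demo$Score)
demo$Score[was_missing] <- p[was_missing]   # same effect as impute(demo$Score, p)

print(demo$Score[was_missing])              # the two imputed values
```

Because the made-up scores are exactly linear, the imputed values match the removed ones. With real data, the imputed values are estimates and will generally differ from the unknown true scores.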

Congratulations! You have fixed the missing data problem in your data table.


Practice Question

1. What are the 5 new values in the table after our regression imputation?


2. What is the benefit of regression imputation over random imputation?
