STAT3888 DATA CLEANING

鱼鱼冲鸭

已于 2022-08-03 23:30:07 修改

阅读量113

点赞数

文章标签：统计学

于 2022-08-03 10:01:25 首次发布

本文链接：https://blog.csdn.net/m0_55541117/article/details/126134244

版权

Raw data: The first state that the data comes in when you obtain it.

data dictionary describing what every variable means

data integrity at least one copy of the raw data should be made where the copy is maintained externally and/or regularly backed up

raw data should not be modified and (if possible) should be locked from editing.

Problems:

No variable names for some/all variables,
Consistent variable naming convention;
Empty rows and/or columns, or columns with zero variance;
Duplicate rows or columns;
Variables that represent integers, numeric values, text, and factors be assigned the correct type,

unknown or unexpected character encoding
data may not be in a rectangular form

Factors are not coded appropriately. Problems include
- factors stored as strings or numbers
- spelling of a category label is incorrect
- inconsistent labels, e.g. "m", "male", "Female"...

Missing values not encoded as NA
Dates be inconsistently coded
Strings are not normalized (removing trailing spaces for example)

到这里Technically correct，接下来deals with the internal consistency of the data

Variables are all of the same scale (e.g., heights are all in cm, or meters or feet)
Logical consistencies hold (dead people can't exercise 3 hours a day, and babies can't have 3 children).

Consistent data: The stage where data is ready for statistical inference, but could still contain statistical issues including (but not necessarily limited to):

missing values
measurement error
measurement censoring
non-normality of variables
low-variance variables
outliers/inliers and
high leverage points.

再通过EDA to identify these issues, these issues are dealt with almost always depends on context (how data is collected, meaning, questions, relevance) and purpose (EDA, prediction, inferences) of the analysis.

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

library(tidyverse)
library(ggplot2)
library(here)


The "here" package then references subdirectories relative to the ".Rproj" file adjusting for the operating system as necessary. In this case the ".Rproj" file is in the directory

here::here()
"C:/Users/jormerod/Desktop/STAT3888/Lectures/L2_DataCleaning/"

To access the file "dirtyIris.csv" in the subirectory "data":
here::here("data","dirtyIris.csv")

#Reading data
library(readr)
dirtyIris = read_csv(here::here("data","dirtyIris.csv")) #df

#cleaning variable name
library(janitor)
better = clean_names(dirtyIris, "old_janitor") 

#remove empty rows and columns
evenBetter = remove_empty(better,"rows")
evenBetter = remove_empty(better,"cols") #结果是有NA的

# add a constant var.
tib <- tibble(const=rep(1,nrow(iris)))
worseIris <- bind_cols(iris, tib) 

#remove low variance var. by janitor package
betterAgain <- worseIris %>% 
  remove_constant()  

#remove categorical var. with few instances
gender1 = c(rep(c("male","female"),99),rep("non-binary",2))
tab <- table(gender1)
percent <- 100*tab/length(gender1)
keep <- names(tab)[percent>=5]
gender2 <- gender1[gender1 %in% keep]

# removing duplicates
library(dplyr)
unique_tib <- evenBetter %>%
  distinct() 

#去掉奇怪的字符导致的data type不对
df$sevens <- as.numeric(gsub("[^0-9\\.\\-]", "", df$sevens))

#Recoding vectors
gender = c(2,1,0)
recode = c(male = 1, female = 2)
gender <- factor(gender, 
                 levels = recode, 
                 labels = names(recode))

#magic value to NA
magic_vals <- c(999,9999)
age[age%in%magic_vals] <- NA

outliers

Treat the outlier as a missing value.
Cap the outlier to a sensible value, e.g. cap the value to be within the 1.5 ×× IQR limits.
Fit a model and replace the outlier with a predicted value.

res1 <- lm(value ~ group, data=tib3)
res2 <- lm(value ~ group, data=tib3%>%filter(outlier==FALSE))
res3 <- MASS::rlm(value ~ group, data=tib3)

library(stargazer)
stargazer(res1,res2,res3, type="html", omit.stat=c("rsq","adj.rsq","f"))

Down-weight the outlier when fitting models.
Use robust methods to fit the model.

#get outlier
boxplot.stats(x)$out\

#加一列判断是否是outlier
is_outlier <- function(x) { return(x%in%boxplot(x, plot = FALSE)$out); }
out_name <- function(x) { return(paste0("out_",x)); }
tib_out  <- tib %>% mutate_all(.funs = is_outlier) %>%rename_with(.fn=out_name)
tib2 <- bind_cols(tib,tib_out)

#多列变两列
tib_long <- tib %>%
  pivot_longer(cols=1:ncol(y), 
               names_to="group", 
               values_to = "value")
    #加label
tib_out_long <- tib_out %>%
  pivot_longer(cols=1:ncol(y), 
               names_prefix="out_",
               values_to = "outlier") %>%
  select(-name)
tib3 <- bind_cols(tib_long,tib_out_long)

Missing value

remove all rows that contain missing values

a subset of variables are deleted first before removing all samples that contain missing values.

remove combinations of subsets of rows or columns that contain missing values in order to keep as much of the data as possible.

tib <- as_tibble(PimaIndiansDiabetes)
zero_to_NA <- function(x) { x[x==0] <- NA; return(x) }
impl_mis_cols <- c("glucose","pressure","triceps","insulin","mass")#这些列中的0转为NA
# Replace implicit missing values with NAs
dat <- tib %>%
  mutate_at(.vars = all_of(impl_mis_cols),
            .funs = zero_to_NA)

#missing value统计
library(naniar)
library(knitr)
kable(miss_var_summary(dat),digits=2, format = "html")