Raw data: The first state that the data comes in when you obtain it.
data dictionary describing what every variable means
data integrity at least one copy of the raw data should be made where the copy is maintained externally and/or regularly backed up
raw data should not be modified and (if possible) should be locked from editing.
Problems:
-
No variable names for some/all variables,
-
Consistent variable naming convention;
-
Empty rows and/or columns, or columns with zero variance;
-
Duplicate rows or columns;
-
Variables that represent integers, numeric values, text, and factors be assigned the correct type,
-
unknown or unexpected character encoding
-
data may not be in a rectangular form
-
Factors are not coded appropriately. Problems include
- factors stored as strings or numbers
- spelling of a category label is incorrect
- inconsistent labels, e.g. "m", "male", "Female"...
-
Missing values not encoded as
NA
-
Dates be inconsistently coded
-
Strings are not normalized (removing trailing spaces for example)
到这里Technically correct,接下来deals with the internal consistency of the data
-
Variables are all of the same scale (e.g., heights are all in cm, or meters or feet)
-
Logical consistencies hold (dead people can't exercise 3 hours a day, and babies can't have 3 children).
Consistent data: The stage where data is ready for statistical inference, but could still contain statistical issues including (but not necessarily limited to):
-
missing values
-
measurement error
-
measurement censoring
-
non-normality of variables
-
low-variance variables
-
outliers/inliers and
-
high leverage points.
再通过EDA to identify these issues, these issues are dealt with almost always depends on context (how data is collected, meaning, questions, relevance) and purpose (EDA, prediction, inferences) of the analysis.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
library(tidyverse)
library(ggplot2)
library(here)
The "here" package then references subdirectories relative to the ".Rproj" file adjusting for the operating system as necessary. In this case the ".Rproj" file is in the directory
here::here()
"C:/Users/jormerod/Desktop/STAT3888/Lectures/L2_DataCleaning/"
To access the file "dirtyIris.csv" in the subirectory "data":
here::here("data","dirtyIris.csv")
#Reading data
library(readr)
dirtyIris = read_csv(here::here("data","dirtyIris.csv")) #df
#cleaning variable name
library(janitor)
better = clean_names(dirtyIris, "old_janitor")
#remove empty rows and columns
evenBetter = remove_empty(better,"rows")
evenBetter = remove_empty(better,"cols") #结果是有NA的
# add a constant var.
tib <- tibble(const=rep(1,nrow(iris)))
worseIris <- bind_cols(iris, tib)
#remove low variance var. by janitor package
betterAgain <- worseIris %>%
remove_constant()
#remove categorical var. with few instances
gender1 = c(rep(c("male","female"),99),rep("non-binary",2))
tab <- table(gender1)
percent <- 100*tab/length(gender1)
keep <- names(tab)[percent>=5]
gender2 <- gender1[gender1 %in% keep]
# removing duplicates
library(dplyr)
unique_tib <- evenBetter %>%
distinct()
#去掉奇怪的字符导致的data type不对
df$sevens <- as.numeric(gsub("[^0-9\\.\\-]", "", df$sevens))
#Recoding vectors
gender = c(2,1,0)
recode = c(male = 1, female = 2)
gender <- factor(gender,
levels = recode,
labels = names(recode))
#magic value to NA
magic_vals <- c(999,9999)
age[age%in%magic_vals] <- NA
outliers
-
Treat the outlier as a missing value.
-
Cap the outlier to a sensible value, e.g. cap the value to be within the 1.5 ×× IQR limits.
-
Fit a model and replace the outlier with a predicted value.
res1 <- lm(value ~ group, data=tib3)
res2 <- lm(value ~ group, data=tib3%>%filter(outlier==FALSE))
res3 <- MASS::rlm(value ~ group, data=tib3)
library(stargazer)
stargazer(res1,res2,res3, type="html", omit.stat=c("rsq","adj.rsq","f"))
-
Down-weight the outlier when fitting models.
-
Use robust methods to fit the model.
#get outlier
boxplot.stats(x)$out\
#加一列判断是否是outlier
is_outlier <- function(x) { return(x%in%boxplot(x, plot = FALSE)$out); }
out_name <- function(x) { return(paste0("out_",x)); }
tib_out <- tib %>% mutate_all(.funs = is_outlier) %>%rename_with(.fn=out_name)
tib2 <- bind_cols(tib,tib_out)
#多列变两列
tib_long <- tib %>%
pivot_longer(cols=1:ncol(y),
names_to="group",
values_to = "value")
#加label
tib_out_long <- tib_out %>%
pivot_longer(cols=1:ncol(y),
names_prefix="out_",
values_to = "outlier") %>%
select(-name)
tib3 <- bind_cols(tib_long,tib_out_long)
Missing value
remove all rows that contain missing values
a subset of variables are deleted first before removing all samples that contain missing values.
remove combinations of subsets of rows or columns that contain missing values in order to keep as much of the data as possible.
tib <- as_tibble(PimaIndiansDiabetes)
zero_to_NA <- function(x) { x[x==0] <- NA; return(x) }
impl_mis_cols <- c("glucose","pressure","triceps","insulin","mass")#这些列中的0转为NA
# Replace implicit missing values with NAs
dat <- tib %>%
mutate_at(.vars = all_of(impl_mis_cols),
.funs = zero_to_NA)
#missing value统计
library(naniar)
library(knitr)
kable(miss_var_summary(dat),digits=2, format = "html")