STAT3888 DATA CLEANING

Raw data: The state the data are in when you first obtain them.

Data dictionary: A document describing what every variable means.

Data integrity: At least one copy of the raw data should be made, with the copy maintained externally and/or regularly backed up.

Raw data should not be modified and (if possible) should be locked from editing.

Problems:

  • No variable names for some/all variables;

  • Inconsistent variable naming convention;

  • Empty rows and/or columns, or columns with zero variance;

  • Duplicate rows or columns;

  • Variables that represent integers, numeric values, text, and factors are not assigned the correct type;

  • Unknown or unexpected character encoding;

  • Data may not be in a rectangular form;

  • Factors are not coded appropriately. Problems include

    • factors stored as strings or numbers
    • spelling of a category label is incorrect
    • inconsistent labels, e.g. "m", "male", "Female"...
  • Missing values are not encoded as NA;

  • Dates are inconsistently coded;

  • Strings are not normalized (e.g. trailing spaces have not been removed).
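
Several of the problems above (untrimmed strings, inconsistent category labels, factors stored as strings) can be handled together. The following is a minimal sketch in base R; the vector `raw` and the lookup table `lookup` are made up for illustration and are not part of the lecture data.

```r
# Hypothetical messy category labels
raw <- c(" m", "male ", "Female", "F", "MALE", "unknown")

# Normalize strings: trim surrounding whitespace and lower the case
clean <- tolower(trimws(raw))

# Map every accepted spelling to one canonical label;
# anything not in the lookup (here "unknown") becomes NA
lookup <- c(m = "male", male = "male", f = "female", female = "female")
sex <- factor(unname(lookup[clean]), levels = c("male", "female"))

print(table(sex, useNA = "ifany"))
```

Labels that do not appear in the lookup end up as NA, so they show up in a missing-value summary instead of silently forming an extra category.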

Once the problems above are fixed the data are technically correct. The next stage deals with the internal consistency of the data:

  • Variables are all of the same scale (e.g., heights are all in cm, or meters or feet)

  • Logical consistencies hold (dead people can't exercise 3 hours a day, and babies can't have 3 children).
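
Both kinds of consistency check can be sketched in a few lines of base R. The data frame `people`, its column names, the conversion table `to_cm`, and the age threshold of 12 are all assumptions made up for this illustration.

```r
# Hypothetical data with heights recorded in mixed units
people <- data.frame(
  height   = c(172, 1.80, 5.9),
  unit     = c("cm", "m", "ft"),
  age      = c(34, 0, 51),
  children = c(2, 3, 1),
  stringsAsFactors = FALSE
)

# Scale consistency: convert every height to centimetres
to_cm <- c(cm = 1, m = 100, ft = 30.48)
people$height_cm <- people$height * unname(to_cm[people$unit])

# Logical consistency: flag rows where a young child is recorded as having children
# (the threshold of 12 years is an arbitrary choice for this sketch)
people$implausible <- people$age < 12 & people$children > 0

print(people)
```

Flagging implausible rows rather than deleting them keeps the decision about what to do with them separate from the check itself.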

Consistent data: The stage where data is ready for statistical inference, but could still contain statistical issues including (but not necessarily limited to):

  • missing values

  • measurement error

  • measurement censoring

  • non-normality of variables

  • low-variance variables

  • outliers/inliers and

  • high leverage points.

EDA is then used to identify these issues. How these issues are dealt with almost always depends on the context (how the data were collected, their meaning, the questions asked, relevance) and the purpose (EDA, prediction, inference) of the analysis.

R CODE

library(tidyverse)
library(ggplot2)
library(here)


The "here" package references subdirectories relative to the ".Rproj" file, adjusting for the operating system as necessary. In this case the ".Rproj" file is in the directory

here::here()
"C:/Users/jormerod/Desktop/STAT3888/Lectures/L2_DataCleaning/"

To access the file "dirtyIris.csv" in the subdirectory "data":
here::here("data","dirtyIris.csv")

#Reading data
library(readr)
dirtyIris = read_csv(here::here("data","dirtyIris.csv")) #df

#cleaning variable name
library(janitor)
better = clean_names(dirtyIris, "old_janitor") 

#remove empty rows and columns
evenBetter = remove_empty(better, "rows")
evenBetter = remove_empty(evenBetter, "cols") # the result still contains NAs

# add a constant var.
tib <- tibble(const=rep(1,nrow(iris)))
worseIris <- bind_cols(iris, tib) 

#remove low variance var. by janitor package
betterAgain <- worseIris %>% 
  remove_constant()  

#remove categorical var. with few instances
gender1 = c(rep(c("male","female"),99),rep("non-binary",2))
tab <- table(gender1)
percent <- 100*tab/length(gender1)
keep <- names(tab)[percent>=5]
gender2 <- gender1[gender1 %in% keep]

# removing duplicates
library(dplyr)
unique_tib <- evenBetter %>%
  distinct() 

#strip stray characters that give the column the wrong data type
df$sevens <- as.numeric(gsub("[^0-9\\.\\-]", "", df$sevens))
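
As a small worked example of the pattern above, the sketch below applies the same regular expression to a made-up character vector (`sevens_raw` is hypothetical, standing in for `df$sevens`).

```r
# Hypothetical raw column: numbers stored as text with stray characters
sevens_raw <- c("7", "7.0?", " -7 ", "$77")

# Keep only digits, decimal points, and minus signs, then convert to numeric
sevens <- as.numeric(gsub("[^0-9\\.\\-]", "", sevens_raw))

print(sevens)
```

An entry with no digits at all is reduced to an empty string and converts to NA, so it is worth counting NAs before and after the conversion.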

#Recoding vectors
gender = c(2,1,0)
recode = c(male = 1, female = 2)
gender <- factor(gender, 
                 levels = recode, 
                 labels = names(recode))

#magic value to NA
magic_vals <- c(999,9999)
age[age%in%magic_vals] <- NA



Outliers

  • Treat the outlier as a missing value.

  • Cap the outlier to a sensible value, e.g. cap the value to be within the 1.5 × IQR limits.

  • Fit a model and replace the outlier with a predicted value.

  • Down-weight the outlier when fitting models.

  • Use robust methods to fit the model.

#compare three fits: all data, outliers removed, and a robust fit (tib3 is built further below)
res1 <- lm(value ~ group, data=tib3)
res2 <- lm(value ~ group, data=tib3%>%filter(outlier==FALSE))
res3 <- MASS::rlm(value ~ group, data=tib3)

library(stargazer)
stargazer(res1,res2,res3, type="html", omit.stat=c("rsq","adj.rsq","f"))
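
The capping option listed above can be sketched as follows. The helper `cap_outliers` and the vector `x` are hypothetical names for this illustration. The fences here are computed from `quantile()`, which can differ slightly from the hinges used by `boxplot.stats()`.

```r
# Cap values to the 1.5 x IQR fences (hypothetical helper, base R only)
cap_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  lower <- q[1] - fence
  upper <- q[2] + fence
  # Pull values below the lower fence up, and values above the upper fence down
  pmin(pmax(x, lower), upper)
}

x <- c(10, 12, 11, 13, 12, 95)
capped <- cap_outliers(x)
print(capped)
```

Capping keeps the observation in the data set, which preserves the sample size, but it also understates the spread of the variable, so whether it is appropriate depends on the purpose of the analysis.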

#get outliers
boxplot.stats(x)$out

#add one column per variable flagging whether each value is an outlier
is_outlier <- function(x) { return(x %in% boxplot.stats(x)$out); }
out_name <- function(x) { return(paste0("out_",x)); }
tib_out  <- tib %>% mutate(across(everything(), is_outlier)) %>% rename_with(.fn=out_name)
tib2 <- bind_cols(tib,tib_out)

#pivot the many columns into two (long format)
tib_long <- tib %>%
  pivot_longer(cols=everything(), 
               names_to="group", 
               values_to = "value")
#add the outlier labels
tib_out_long <- tib_out %>%
  pivot_longer(cols=everything(), 
               names_prefix="out_",
               values_to = "outlier") %>%
  select(-name)
tib3 <- bind_cols(tib_long,tib_out_long)

Missing values

Remove all rows that contain missing values.

Delete a subset of variables first, then remove all samples that still contain missing values.

Remove combinations of subsets of rows or columns that contain missing values, in order to keep as much of the data as possible.
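
The difference between the first two strategies can be seen on a toy example. The data frame `toy` and the 50% missingness threshold are assumptions chosen for this sketch.

```r
# Hypothetical data: column b is mostly missing
toy <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(NA, NA, NA, 8),
  c = c(5, 6, 7, 8)
)

# Strategy 1: drop every row that contains a missing value
listwise <- toy[complete.cases(toy), , drop = FALSE]

# Strategy 2: first drop columns with more than 50% missing values
# (arbitrary threshold), then drop the remaining incomplete rows
miss_frac <- colMeans(is.na(toy))
keep_cols <- names(miss_frac)[miss_frac <= 0.5]
reduced <- toy[, keep_cols, drop = FALSE]
reduced <- reduced[complete.cases(reduced), , drop = FALSE]

print(nrow(listwise))
print(nrow(reduced))
```

Dropping the sparse column first retains more rows, at the cost of losing that variable entirely.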

library(mlbench)            # provides the PimaIndiansDiabetes data set
data(PimaIndiansDiabetes)
tib <- as_tibble(PimaIndiansDiabetes)
zero_to_NA <- function(x) { x[x==0] <- NA; return(x) }
impl_mis_cols <- c("glucose","pressure","triceps","insulin","mass") # zeros in these columns mean "missing"
# Replace implicit missing values with NAs
dat <- tib %>%
  mutate(across(all_of(impl_mis_cols), zero_to_NA))

#summarise missing values per variable
library(naniar)
library(knitr)
kable(miss_var_summary(dat),digits=2, format = "html")
