[R] Feature Engineering: A Roundup of Data Exploration Functions

Article link: https://blog.csdn.net/jianlin0402/article/details/104645791

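Preface

Before analyzing data, we need to explore the quality of what we have collected. "Quality" here really means both quality and quantity: row and column information, missing values, central tendency, dispersion, distribution density, correlation, outliers, and so on.

R has many packages that can do this job. In this post we keep the best and skip the rest, covering only the functions I find practical and performant.

Packages used:

skimr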

DataExplorer

mice

PerformanceAnalytics

psych

dlookr

Rlof
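
Before running the examples, it can help to check which of the packages listed above are already installed. The following is a minimal sketch using only base R; the variable names `pkgs` and `missing_pkgs` are my own and do not come from the original post.

```r
# Packages named in this post
pkgs <- c("skimr", "DataExplorer", "mice", "PerformanceAnalytics",
          "psych", "dlookr", "Rlof")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
```

Since the post already loads packages with `pacman::p_load()`, the same list could also be passed to that function, which installs missing packages as it loads them.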

Data Preparation

We use a credit card transaction dataset from Kaggle as the example.

Baidu Netdisk download:

Link: https://pan.baidu.com/s/1Qv3nAJxfo7hxjdTOGoLOXA

Extraction code: f4ks

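pacman::p_load(tidyverse, data.table)

set.seed(1) # seed first so the sampling below is reproducible as well

# read the file, convert it to a tibble, and keep a 10% sample
data <- fread("D://contest//transactions.csv") %>% as_tibble() %>% sample_frac(.1)

data_lof <- data %>% sample_frac(.1) %>% select_if(is.numeric) # used in the outlier section

# insert 100 NAs at random rows of every column
for (i in seq_len(ncol(data))) {
  data[sample(nrow(data), size = 100), i] <- NA
}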

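#Here we take 10% of the dataset's records as a lightweight sample, then insert 100 NA values into each variable for use in the later demonstrations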
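
To see what the NA-insertion step does without downloading the dataset, here is a small sketch on a toy data frame. The data frame `toy`, its columns, and the name `n_na` are hypothetical stand-ins, not part of the original post. It uses `nrow()` rather than a hard-coded row count and then checks the number of missing values per column with base R.

```r
# Toy stand-in for the sampled transactions data (hypothetical columns)
set.seed(1)
toy <- data.frame(
  amount = rnorm(1000),
  city   = sample(letters, 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

n_na <- 100 # number of NAs to inject per column, as in the post

for (i in seq_len(ncol(toy))) {
  # sample() draws without replacement, so n_na distinct rows are blanked
  toy[sample(nrow(toy), size = n_na), i] <- NA
}

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Because the rows are drawn without replacement and the toy data starts with no missing values, each column ends up with exactly `n_na` missing entries.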

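Data Overview

#skim is probably the most powerful overview function; its output covers almost everything needed for exploratory data analysis

#The n_unique column in the character section helps decide which string variables are candidates for conversion to factor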

skimr::skim(data)


-- Data Summary ------------------------
                           Values
Name                       data  
Number of rows             18513 
Number of columns          14    
_______________________          
Column type frequency:           
  character                6     
  numeric                  8     
________________________         
Group variables            None  


-- Variable type: character ------------------------------------------------------------------------------
# A tibble: 6 x 8
  skim_variable   n_missing complete_rate   min   max empty n_unique whitespace
* <chr>               <int>         <dbl> <int> <int> <int>    <int>      <int>
1 authorized_flag       100         0.995     1     1     0        1          0
2 card_id               100         0.995    15    15     0    17368          0
3 category_1            100         0.995     1     1     0        1          0
4 category_3            100         0.995     0     1   525        4          0
5 merchant_id           100         0.995     0    15   253    13363          0
6 purchase_date         100         0.995    19    19     0    18274          0
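
The n_unique column in the character table above can drive a simple factor conversion. The sketch below uses base R on a toy data frame; the data, the threshold `max_levels`, and the variable names are hypothetical choices of mine. It counts distinct non-missing values per column, which I assume matches what n_unique reports.

```r
# Toy character data (hypothetical): one low-cardinality flag, one ID column
toy <- data.frame(
  flag = c("Y", "N", "Y", "Y", NA, "N"),
  id   = c("C_1", "C_2", "C_3", "C_4", "C_5", "C_6"),
  stringsAsFactors = FALSE
)

max_levels <- 5 # hypothetical cutoff for "few enough distinct values"

# Number of distinct non-missing values in each column
n_unique <- vapply(toy, function(x) length(unique(x[!is.na(x)])), integer(1))

# Convert only the columns at or below the cutoff
to_factor <- names(n_unique)[n_unique <= max_levels]
toy[to_factor] <- lapply(toy[to_factor], factor)

str(toy)
```

Here `flag` becomes a factor while `id` stays a character column, which mirrors how columns such as card_id (17368 distinct values) would be left alone.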


-- Variable type: numeric --------------------------------------------------------------------------------
# A tibble: 8 x 11
  skim_variable        n_missing complete_rate    mean      sd     p0     p25     p50     p75  p100 hist 
* <chr>                    <int>         <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>
1 city_id                    100         0.995 141.     99.0    1      69     128     213     347   ▇▃▃▃▃
2 installments