The skimr package: compact and flexible data summaries

R语言数据分析视界

Last modified 2024-06-25 18:07:31

First published 2024-05-10 11:10:16

Article link: https://blog.csdn.net/a852232394/article/details/138657090

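`skimr` provides summary statistics about variables in data frames, tibbles, data tables, and vectors.

In base R, the closest equivalents are summary() (for vectors and data frames) and fivenum() (for numeric vectors).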

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900
fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9
summary(iris$Species)
##     setosa versicolor  virginica 
##         50         50         50

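The skim() function

The core function of skimr is skim(). It is designed to work with (grouped) data frames, and it tries to coerce other objects to data frames where possible. Like summary(), the data frame method of skim() presents results for every column; the statistics it reports depend on the class of the variable.

Skimming data frames

By design, skimr focuses on data frames. It is meant to fit well into a data pipeline and relies heavily on tidyverse vocabulary, which itself centers on data frames.

The results of skim() are printed horizontally, with one section per variable type and one row per variable.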

library(skimr)
skim(iris)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique
## 1 Species               0             1 FALSE          3
##   top_counts               
## 1 set: 50, ver: 50, vir: 50
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃

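The result is a single wide data frame that combines all the results, with some additional attributes and two metadata columns:

skim_variable: the name of the original variable
skim_type: the class of the variable

Unlike many other objects in R, these columns are intrinsic to the skim_df class. Dropping them results in coercion to a tibble. The is_skim_df() function tests whether an object is a skim_df.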

skim(iris) %>% is_skim_df()
## [1] TRUE
## attr(,"message")
## character(0)
skim(iris) %>%
  dplyr::select(-skim_type, -skim_variable) %>% is_skim_df()
## [1] FALSE
## attr(,"message")
## [1] "Object is not a `skim_df`: missing column `skim_type`; missing column `skim_variable`"
skim(iris) %>%
  dplyr::select(-n_missing) %>% is_skim_df()
## [1] TRUE
## attr(,"message")
## character(0)

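To avoid type coercion, the summary statistic columns for each type are prefixed with the corresponding skim_type. As a result, the columns of a skim_df are somewhat sparse and contain quite a few missing values. This is because some statistics are represented differently for different variable types. For example, the mean of a Date variable and the mean of a numeric variable print differently, and a single vector cannot hold both. The exceptions are n_missing and complete_rate (the share of non-missing observations), which are the same for all variable types.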

skim(iris) %>%
  tibble::as_tibble()
## # A tibble: 5 × 15
##   skim_type skim_variable n_missing complete_rate factor.ordered factor.n_unique
##   <chr>     <chr>             <int>         <dbl> <lgl>                    <int>
## 1 factor    Species               0             1 FALSE                        3
## 2 numeric   Sepal.Length          0             1 NA                          NA
## 3 numeric   Sepal.Width           0             1 NA                          NA
## 4 numeric   Petal.Length          0             1 NA                          NA
## 5 numeric   Petal.Width           0             1 NA                          NA
## # … with 9 more variables: factor.top_counts <chr>, numeric.mean <dbl>,
## #   numeric.sd <dbl>, numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
## #   numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>
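
Because iris has no missing values, the output above cannot show how n_missing and complete_rate relate to each other. The following is a minimal sketch on a made-up data frame; the object toy and its column v are hypothetical and do not come from the article.

```r
library(skimr)

# Hypothetical toy data: four observations, one of them missing.
toy <- data.frame(v = c(1, NA, 3, 4))

sk <- skim(toy)

# Both base skimmers are plain, unprefixed columns of the skim_df.
sk$n_missing      # count of NA values in v
sk$complete_rate  # share of non-missing values: 1 - n_missing / number of rows
```

With one missing value out of four, the complete rate is the fraction of rows that are present, not the fraction that is missing.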

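This contrasts with summary.data.frame(), which stores its statistics in a table. The distinction matters because a skim_df can be piped and is easy to manipulate further: for example, the user could select the means of all variables, or all summary statistics for one specific variable.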

skim(iris) %>%
  dplyr::filter(skim_variable == "Petal.Length")
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0 p25  p50 p75 p100 hist 
## 1 Petal.Length          0             1 3.76 1.77  1 1.6 4.35 5.1  6.9 ▇▁▆▇▂

Most dplyr verbs are supported.

skim(iris) %>%
  dplyr::select(skim_type, skim_variable, n_missing)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing
## 1 Species               0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing
## 1 Sepal.Length          0
## 2 Sepal.Width           0
## 3 Petal.Length          0
## 4 Petal.Width           0

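The base skimmers n_missing and complete_rate are computed for all columns in the data. All other type-based skimmers are namespaced, so you need the skim_type prefix to refer to their columns correctly.

For example, the following selects the metadata columns together with the mean of the numeric columns of iris: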

skim(iris) %>%
  dplyr::select(skim_type, skim_variable, numeric.mean)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable mean
## 1 Sepal.Length  5.84
## 2 Sepal.Width   3.06
## 3 Petal.Length  3.76
## 4 Petal.Width   1.20

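skim() also supports grouped data created by dplyr::group_by(). In this case, the skim_df gains one additional column for each grouping variable, so the statistics for every group appear in the same skim_df and are easy to compare.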

iris %>%
  dplyr::group_by(Species) %>%
  skim()
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  4         
## ________________________             
## Group variables            Species   
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable Species    n_missing complete_rate  mean    sd  p0  p25  p50
##  1 Sepal.Length  setosa             0             1 5.01  0.352 4.3 4.8  5   
##  2 Sepal.Length  versicolor         0             1 5.94  0.516 4.9 5.6  5.9 
##  3 Sepal.Length  virginica          0             1 6.59  0.636 4.9 6.22 6.5 
##  4 Sepal.Width   setosa             0             1 3.43  0.379 2.3 3.2  3.4 
##  5 Sepal.Width   versicolor         0             1 2.77  0.314 2   2.52 2.8 
##  6 Sepal.Width   virginica          0             1 2.97  0.322 2.2 2.8  3   
##  7 Petal.Length  setosa             0             1 1.46  0.174 1   1.4  1.5 
##  8 Petal.Length  versicolor         0             1 4.26  0.470 3   4    4.35
##  9 Petal.Length  virginica          0             1 5.55  0.552 4.5 5.1  5.55
## 10 Petal.Width   setosa             0             1 0.246 0.105 0.1 0.2  0.2 
## 11 Petal.Width   versicolor         0             1 1.33  0.198 1   1.2  1.3 
## 12 Petal.Width   virginica          0             1 2.03  0.275 1.4 1.8  2   
##     p75 p100 hist 
##  1 5.2   5.8 ▃▃▇▅▁
##  2 6.3   7   ▂▇▆▃▃
##  3 6.9   7.9 ▁▃▇▃▂
##  4 3.68  4.4 ▁▃▇▅▂
##  5 3     3.4 ▁▅▆▇▂
##  6 3.18  3.8 ▂▆▇▅▁
##  7 1.58  1.9 ▁▃▇▃▁
##  8 4.6   5.1 ▂▂▇▇▆
##  9 5.88  6.9 ▃▇▇▃▂
## 10 0.3   0.6 ▇▂▂▁▁
## 11 1.5   1.8 ▅▇▃▆▁
## 12 2.3   2.5 ▂▇▆▅▇

Individual columns of a data frame can be selected using tidyverse-style selectors.

skim(iris, Sepal.Length, Species)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique
## 1 Species               0             1 FALSE          3
##   top_counts               
## 1 set: 50, ver: 50, vir: 50
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25 p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8 6.4  7.9 ▆▇▇▅▂

Or with the common select helpers.

skim(iris, starts_with("Sepal"))
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   numeric                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25 p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8 6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3   3.3  4.4 ▁▆▇▂▁

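Skimming vectors

In skimr v2, skim() attempts to coerce non-data frames (such as vectors and matrices) to data frames. In most cases with vectors, the object being evaluated should be equivalent to wrapping the object in as.data.frame().

For example, the lynx data set is of class ts.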

skim(lynx)
## ── Data Summary ────────────────────────
##                            Values
## Name                       lynx  
## Number of rows             114   
## Number of columns          1     
## _______________________          
## Column type frequency:           
##   ts                       1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: ts ───────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate start  end frequency deltat  mean    sd
## 1 x                     0             1  1821 1934         1      1 1538. 1586.
##   min  max median line_graph
## 1  39 6991    771 ⡈⢄⡠⢁⣀⠒⣀⠔

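This is the same as coercing to a data frame first; the only difference that all.equal() reports is the df_name attribute.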

all.equal(skim(lynx), skim(as.data.frame(lynx)))
## [1] "Attributes: < Component \"df_name\": 1 string mismatch >"

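Skimming matrices

skimr does not support skimming matrices directly but coerces them to data frames. Columns in the matrix become variables. This behavior is similar to summary.matrix(). The three possible ways to skim a matrix correspond to the three variants of the mean function for matrices.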

m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

Skimming the matrix produces results similar to colMeans(): each column of the matrix becomes a variable, so the mean column of the output matches the column means.

colMeans(m)
## [1]  2.5  6.5 10.5
skim(m) # Similar to summary.matrix and colMeans()
## ── Data Summary ────────────────────────
##                            Values
## Name                       m     
## Number of rows             4     
## Number of columns          3     
## _______________________          
## Column type frequency:           
##   numeric                  3     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0  p25  p50   p75 p100 hist 
## 1 V1                    0             1  2.5 1.29  1 1.75  2.5  3.25    4 ▇▇▁▇▇
## 2 V2                    0             1  6.5 1.29  5 5.75  6.5  7.25    8 ▇▇▁▇▇
## 3 V3                    0             1 10.5 1.29  9 9.75 10.5 11.2    12 ▇▇▁▇▇

Skimming the transpose of the matrix will give row-wise results.

rowMeans(m)
## [1] 5 6 7 8
skim(t(m))
## ── Data Summary ────────────────────────
##                            Values
## Name                       t(m)  
## Number of rows             3     
## Number of columns          4     
## _______________________          
## Column type frequency:           
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist 
## 1 V1                    0             1    5  4  1   3   5   7    9 ▇▁▇▁▇
## 2 V2                    0             1    6  4  2   4   6   8   10 ▇▁▇▁▇
## 3 V3                    0             1    7  4  3   5   7   9   11 ▇▁▇▁▇
## 4 V4                    0             1    8  4  4   6   8  10   12 ▇▁▇▁▇

And call c() on the matrix to get results across all columns.

skim(c(m))
## ── Data Summary ────────────────────────
##                            Values
## Name                       c(m)  
## Number of rows             12    
## Number of columns          1     
## _______________________          
## Column type frequency:           
##   numeric                  1     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0  p25 p50  p75 p100 hist 
## 1 data                  0             1  6.5 3.61  1 3.75 6.5 9.25   12 ▇▅▅▅▇
mean(m)
## [1] 6.5

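Skimming without modification

skim_tee() produces the same printed output as skim() but returns the original, unmodified data frame. This allows piping to continue on the original data, which is useful for a quick look at summary statistics in the middle of a pipeline without breaking the chain.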

iris_setosa <- iris %>%
  skim_tee() %>%
  dplyr::filter(Species == "setosa")
## ── Data Summary ────────────────────────
##                            Values
## Name                       data  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique
## 1 Species               0             1 FALSE          3
##   top_counts               
## 1 set: 50, ver: 50, vir: 50
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃
head(iris_setosa)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

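Note that skim_tee() is customized differently from skim() itself. See below for details.

Reshaping the results from skim()

As noted above, skim() returns a wide data frame. This is usually the most sensible format for most operations when investigating data, but the package has some other functions to help with edge cases.

First, partition() returns a named list of the wide data frames for each data type. Unlike the original data, the partitioned data only has columns corresponding to the skimming functions used for that data type. These data frames are therefore not skim_df objects.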

iris %>%
  skim() %>%
  partition()
## $factor
## 
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts             
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir:…
## 
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃
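
The statement that the partitioned pieces are no longer skim_df objects can be checked with is_skim_df(), which was introduced earlier. This is only a small sketch; the variable name parts is illustrative.

```r
library(skimr)

parts <- partition(skim(iris))

# One list element per variable type found in iris.
names(parts)

# Each element lacks the skim_type metadata column,
# so is_skim_df() should not accept it as a skim_df.
is_skim_df(parts$numeric)
```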

Alternatively, yank() selects only the subtable for a specific type. Think of it as the analogue of dplyr::select for column types in the original data. Again, unsuitable columns are dropped.

iris %>%
  skim() %>%
  yank("numeric")
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃

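to_long() returns a single long data frame with the columns skim_type, skim_variable, stat, and formatted. This is similar, but not identical, to the skim_df object in skimr v1. Because every statistic sits in one standardized structure, the long format is convenient for comparing and aggregating across statistics and variable types.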

iris %>%
  skim() %>%
  to_long() %>% 
  head()
## # A tibble: 6 × 4
##   skim_type skim_variable stat          formatted
##   <chr>     <chr>         <chr>         <chr>    
## 1 factor    Species       n_missing     0        
## 2 numeric   Sepal.Length  n_missing     0        
## 3 numeric   Sepal.Width   n_missing     0        
## 4 numeric   Petal.Length  n_missing     0        
## 5 numeric   Petal.Width   n_missing     0        
## 6 factor    Species       complete_rate 1

Because the skim_variable and skim_type columns are a core component of the skim_df class, using dplyr::select() can have unwanted side effects. Instead, use focus() to select columns of the skimmed results and keep them as a skim_df; it always retains the metadata columns.

iris %>%
  skim() %>%
  focus(n_missing, numeric.mean)
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   factor                   1         
##   numeric                  4         
## ________________________             
## Group variables            None      
## 
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing
## 1 Species               0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing mean
## 1 Sepal.Length          0 5.84
## 2 Sepal.Width           0 3.06
## 3 Petal.Length          0 3.76
## 4 Petal.Width           0 1.20
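
To make the difference between focus() and dplyr::select() concrete, the sketch below compares the two with is_skim_df(). The object names skimmed, focused, and dropped are illustrative only.

```r
library(skimr)

skimmed <- skim(iris)

# focus() keeps skim_type and skim_variable even though they were not requested.
focused <- focus(skimmed, n_missing, numeric.mean)
names(focused)
is_skim_df(focused)

# Dropping the metadata columns with dplyr::select() loses the skim_df status.
dropped <- dplyr::select(skimmed, -skim_type, -skim_variable)
is_skim_df(dropped)
```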

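Rendering the results of skim()

The skim_df object is a wide data frame. By default, the display is created by print.skim_df(); users can specify additional options by explicitly calling print([skim_df object], ...).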
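
As one illustration of passing an option through print(), the sketch below assumes the include_summary argument described in ?print.skim_df; check the help page of your installed skimr version for the full list of options.

```r
library(skimr)

skimmed <- skim(iris)

# Print only the per-type tables and leave out the "Data Summary" header block.
print(skimmed, include_summary = FALSE)
```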

For documents rendered by knitr, the package provides a custom knit_print method. To use it, the final line of your code chunk should evaluate to a skim_df object.

skim(Orange)

Name	Orange
Number of rows	35
Number of columns	3
_______
Column type frequency:
factor	1
numeric	2
________
Group variables	None

factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Tree	0	1	TRUE	5	3: 7, 1: 7, 5: 7, 2: 7

numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	0	1	922.14	491.86	118	484.0	1004	1372.0	1582	▃▇▁▇▇
circumference	0	1	115.86	57.49	30	65.5	115	161.5	214	▇▃▇▇▅

The same type of rendering is available for reshaped skim_df objects, in particular those produced by partition() and yank().

skim(Orange) %>%
  yank("numeric")

numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	0	1	922.14	491.86	118	484.0	1004	1372.0	1582	▃▇▁▇▇
circumference	0	1	115.86	57.49	30	65.5	115	161.5	214	▇▃▇▇▅

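Customizing print options

Although it is not a common use case outside of writing vignettes about skimr, you can fall back to the default printing method by adding the chunk option render = knitr::normal_print.

You can also disable the skimr summary by setting the chunk option skimr_include_summary = FALSE.

You can change the number of digits displayed in the generated statistic columns by changing the skimr_digits chunk option.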
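
The chunk options named above can be written in a single chunk header, or, as a sketch, set once for the whole document. The snippet assumes that knitr hands these skimr-specific options to the knit_print method in the same way as its built-in options; the values 3 and FALSE are arbitrary examples.

```r
library(knitr)

# Set the skimr-specific chunk options for every later chunk in the document.
# The same names can instead go in a single chunk header, for example:
#   {r, skimr_digits = 3, skimr_include_summary = FALSE}
opts_chunk$set(
  skimr_digits = 3,               # digits shown in the statistic columns
  skimr_include_summary = FALSE   # drop the summary block above the tables
)
```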

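Modifying skim()

skimr is opinionated in its choice of defaults, but users can easily add, replace, or remove the statistics for a class. For interactive use, you can create your own skimming function with the skim_with() factory. skimr also offers an API for extensions in other packages; how to use it is covered later.

To add a statistic for a data type, create an sfl() (a skimr function list) for each class that you want to change: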

my_skim <- skim_with(numeric = sfl(new_mad = mad))
my_skim(faithful)

Name	faithful
Number of rows	272
Number of columns	2
_______
Column type frequency:
numeric	2
________
Group variables	None

numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist	new_mad
eruptions	0	1	3.49	1.14	1.6	2.16	4	4.45	5.1	▇▂▂▇▇	0.95
waiting	0	1	70.90	13.59	43.0	58.00	76	82.00	96.0	▃▃▂▇▂	11.86

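As the previous example shows, the default is to append new summary statistics to the existing set. This is not always desirable, especially when you want to make many changes. To stop appending, set append = FALSE, which gives you full control over which statistics end up in the final skim_df.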
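
A minimal sketch of both ideas follows: replacing the defaults with append = FALSE, and removing a single default statistic by setting it to NULL inside sfl(). The function names mad_only and no_hist are made up for illustration.

```r
library(skimr)

# Replace the default numeric skimmers entirely: only new_mad is computed,
# alongside the base skimmers n_missing and complete_rate.
mad_only <- skim_with(numeric = sfl(new_mad = mad), append = FALSE)
res <- mad_only(faithful)
names(res)

# Remove a single default statistic by setting it to NULL
# (here the inline histogram), keeping the other defaults.
no_hist <- skim_with(numeric = sfl(hist = NULL))
names(no_hist(faithful))
```

Comparing the two sets of column names shows which prefixed statistics survive in each case.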