summarise()
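The summarise() function in dplyr computes summary statistics for a data frame. The examples below use a data frame df whose first rows look like this: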
maingroup subgroup value norm
1 AAA Y.1 0.152138053 10.717161
2 AAA Y.1 0.129267562 8.940263
3 BBB X.1 -0.360278960 9.723177
...
# Inspect the variable types
> df %>% str()
'data.frame': 3000 obs. of 4 variables:
$ maingroup: chr "AAA" "AAA" "AAA" "BBB" ...
$ subgroup: chr "Y.1" "Y.1" "X.1" "X.1" ...
$ value : num 0.152 0.129 0.892 -0.36 -0.224 ...
$ norm : num 10.72 8.94 9.67 9.72 10.59 ...
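The original data set is not provided, so the following is only a sketch of how a data frame with the same shape could be generated in order to try the examples. The seed, the group labels, and the two distributions (uniform for value, normal for norm) are assumptions chosen to resemble the printed output, not the author's actual data.

```r
library(dplyr)

set.seed(42)   # arbitrary seed, only so the sketch is reproducible
n <- 3000      # same number of observations as reported by str()

df <- data.frame(
  maingroup = sample(c("AAA", "BBB", "CCC"), n, replace = TRUE),
  subgroup  = sample(c("X.1", "Y.1"), n, replace = TRUE),
  value     = runif(n, min = -1, max = 1),   # assumed distribution
  norm      = rnorm(n, mean = 10, sd = 1),   # assumed distribution
  stringsAsFactors = FALSE
)

df %>% str()
```

Numbers produced from this synthetic df will differ from the outputs printed in the sections below.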
1. Count the missing values (NA) in every column
df %>% summarise_all( ~ sum(is.na(.)) )
>...
maingroup subgroup value norm
1 0 0 0 0
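Since dplyr 1.0.0 the scoped verbs such as summarise_all() are superseded by across(). A minimal sketch of the same NA count written with across() follows; the toy data frame and its values are hypothetical and only serve to show non-zero counts.

```r
library(dplyr)

# Toy data with a few deliberate NAs (hypothetical values)
toy <- data.frame(
  maingroup = c("AAA", "AAA", NA, "BBB"),
  value     = c(0.15, NA, NA, -0.36),
  norm      = c(10.7, 8.9, 9.7, 9.7)
)

# One row, one column per input column, each holding the NA count
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```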
2. Count the records in each group with n()
df %>% group_by(maingroup, subgroup) %>% summarise( n = n() )
>...
# A tibble: 6 x 3
# Groups: maingroup [3]
maingroup subgroup n
<chr> <chr> <int>
1 AAA X.1 494
2 AAA Y.1 506
3 BB X.1 504
4 BB Y.1 496
5 CCC X.1 522
6 CCC Y.1 478
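The "# Groups: maingroup [3]" line appears because summarise() by default drops only the last grouping level, so the result is still grouped by maingroup. A small sketch of two ways to get an ungrouped count, using a hypothetical toy data frame: passing .groups = "drop" to summarise(), or using the count() shortcut.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  maingroup = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  subgroup  = c("X.1", "Y.1", "Y.1", "X.1", "X.1")
)

# Explicit version: group, count, then drop all grouping
by_summarise <- toy %>%
  group_by(maingroup, subgroup) %>%
  summarise(n = n(), .groups = "drop")

# Shortcut: count() groups, counts, and ungroups in one step
by_count <- toy %>% count(maingroup, subgroup)

by_summarise
by_count
```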
3. Sum every column except the grouping columns, by group
df %>% group_by(maingroup, subgroup) %>% summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
>...
# A tibble: 6 x 4
# Groups: maingroup [3]
maingroup subgroup value norm
<chr> <chr> <dbl> <dbl>
1 AAA X.1 -4.57 4969.
2 AAA Y.1 10.3 5063.
3 BB X.1 7.11 4906.
4 BB Y.1 -6.47 5074.
5 CCC X.1 0.411 4754.
6 CCC Y.1 -7.94 5239.
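Inside a grouped summarise(), everything() selects all non-grouping columns, which works here because value and norm are both numeric. If the data also held a non-numeric column, sum() would fail on it. The sketch below, on a hypothetical toy data frame with an extra label column, restricts the selection with where(is.numeric).

```r
library(dplyr)

toy <- data.frame(
  maingroup = c("AAA", "AAA", "BBB", "BBB"),
  label     = c("a", "b", "c", "d"),   # hypothetical non-numeric column
  value     = c(1, NA, 3, 4),
  norm      = c(10, 12, 9, 11)
)

# Only numeric, non-grouping columns are summed; label is left out
sums <- toy %>%
  group_by(maingroup) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")

sums
```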
4. Compute the mean and sum of every column (excluding the grouping columns), by group
df %>% group_by(maingroup, subgroup) %>% summarise(across(everything(), list(mean = ~ mean(.x, na.rm = TRUE), sum = ~ sum(.x, na.rm = TRUE))))
>...
# A tibble: 6 x 6
# Groups: maingroup [3]
maingroup subgroup value_mean value_sum norm_mean norm_sum
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AAA X.1 -0.00708 -3.58 10.0 5059.
2 AAA Y.1 -0.0312 -15.4 10.1 4984.
3 BB X.1 0.0358 18.4 9.94 5108.
4 BB Y.1 -0.0328 -16.0 10.1 4894.
5 CCC X.1 0.0270 13.3 10.0 4945.
6 CCC Y.1 -0.00361 -1.83 9.97 5054.
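The output names above (value_mean, value_sum, ...) come from across()'s default naming pattern "{.col}_{.fn}" when a named list of functions is supplied. A minimal sketch, on hypothetical toy data, of changing that pattern through the .names argument:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  maingroup = c("AAA", "AAA", "BBB", "BBB"),
  value     = c(1, 3, NA, 5),
  norm      = c(10, 12, 9, 11)
)

stats <- toy %>%
  group_by(maingroup) %>%
  summarise(
    across(
      everything(),
      list(mean = ~ mean(.x, na.rm = TRUE),
           sum  = ~ sum(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"   # function name first, then column name
    ),
    .groups = "drop"
  )

names(stats)
```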
5. Compute the quantiles of selected columns
df %>% summarise(across(c(value, norm), ~ quantile(.x, na.rm = TRUE)))
>...
value norm
1 -0.999550791 6.331832
2 -0.484380580 9.351605
3 -0.009638345 10.029124
4 0.487067557 10.691726
5 0.999956779 13.868918
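The five rows above are the 0%, 25%, 50%, 75% and 100% quantiles, which are quantile()'s default probabilities. From dplyr 1.1.0 onward, returning more than one row per group from summarise() is deprecated in favour of reframe(). The sketch below assumes dplyr 1.1.0 or later and uses randomly generated toy data; the q_probs vector and the prob column are names introduced here for illustration.

```r
library(dplyr)

set.seed(1)   # arbitrary seed for this sketch
toy <- data.frame(
  value = runif(100, min = -1, max = 1),
  norm  = rnorm(100, mean = 10, sd = 1)
)

q_probs <- c(0, 0.25, 0.5, 0.75, 1)

# reframe() allows any number of rows per group; the prob column
# labels each row with the probability it corresponds to
q <- toy %>%
  reframe(
    prob = q_probs,
    across(c(value, norm), ~ quantile(.x, probs = q_probs, na.rm = TRUE))
  )

q
```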
6. Summarise columns of a given type
df %>% summarise( across(where(is.numeric), ~scale(.x)) ) %>% colSums()
>...
value norm
0 0
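Type-based selectors can be combined with other selection helpers using & and !. The following sketch, on hypothetical toy data, summarises the numeric columns while leaving out norm, so only value is processed.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  maingroup = c("AAA", "AAA", "BBB", "BBB"),
  value     = c(1, 2, 3, 4),
  norm      = c(10, 12, 9, 11)
)

# where(is.numeric) picks value and norm; !norm then removes norm
stats <- toy %>%
  summarise(across(where(is.numeric) & !norm, list(mean = mean, sd = sd)))

stats
```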