Discretizing Continuous Variables into Categorical Variables

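Base R offers several functions that help turn a numeric variable into a factor, including cut, split, quantile, and .bincode. This article focuses on the binning helpers provided by ggplot2.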
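For comparison, here is a minimal sketch of the base R route, using only cut and quantile. The sample data and the variable names (x, eq_width, eq_count) are assumptions made for illustration.

```r
# Illustrative data: 100 uniform random values (seed fixed for repeatability)
set.seed(1)
x <- runif(100)

# Equal-width bins: cut() picks 4 intervals spanning the range of x
eq_width <- cut(x, breaks = 4)

# Equal-count bins: use the quartiles of x as break points.
# include.lowest = TRUE keeps the minimum value inside the first bin.
q <- quantile(x, probs = seq(0, 1, by = 0.25))
eq_count <- cut(x, breaks = q, include.lowest = TRUE)

table(eq_width)
table(eq_count)
```

The ggplot2 helpers described below wrap this kind of break computation, so the breaks do not have to be built by hand.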

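cut_interval() splits x into n groups of equal range; cut_number() splits x into n groups with (approximately) equal numbers of observations; cut_width() splits x into groups of the width given by the width argument.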

The signatures are:

# cut_interval(x, n = NULL, length = NULL, ...)
# 
# cut_number(x, n = NULL, ...)
# 
# cut_width(
#   x,
#   width,
#   center = NULL,
#   boundary = NULL,
#   closed = c("right", "left"),
#   ...
# )
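To make the difference between the three helpers concrete, the following sketch applies each of them to the same small vector. The vector (eleven small values plus one outlier) and the chosen n and width are assumptions picked for illustration.

```r
library(ggplot2)

# Illustrative data: eleven small values and one outlier
x <- c(1:11, 100)

# Equal range: three intervals of the same length across [1, 100]
by_range <- cut_interval(x, 3)

# Equal count: three groups holding the same number of observations
by_count <- cut_number(x, 3)

# Fixed width: bins of width 50, with a bin edge placed at 0
by_width <- cut_width(x, 50, boundary = 0)

table(by_range)
table(by_count)
table(by_width)
```

With an outlier present, the equal-range bins leave almost everything in the first group, while the equal-count bins stay balanced.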

cut_interval example

Split the values into 6 groups of equal range, and use table to count the observations in each group as a check:

table(cut_interval(1:10, 6))

 # [1,2.5]  (2.5,4]  (4,5.5]  (5.5,7]  (7,8.5] (8.5,10] 
 #       2        2        1        2        1        2 

table(cut_interval(1:10, 5))
  # [1,2.8] (2.8,4.6] (4.6,6.4] (6.4,8.2]  (8.2,10] 
  #       2         2         2         2         2 
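As a cross-check, the equal-range breaks can be built by hand with seq and passed to base cut. This is a sketch under the assumption that cut_interval(x, n) places n + 1 evenly spaced breaks between min(x) and max(x) and includes the lowest value; the names manual_breaks, manual and helper are illustrative.

```r
library(ggplot2)

x <- 1:10

# n + 1 evenly spaced break points from min(x) to max(x), here n = 5
manual_breaks <- seq(min(x), max(x), length.out = 5 + 1)
manual <- cut(x, breaks = manual_breaks, include.lowest = TRUE)

# The ggplot2 helper applied to the same data
helper <- cut_interval(x, 5)

# Compare the group assigned to each observation
identical(as.character(manual), as.character(helper))
```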

cut_number example

Split the values so that each group contains (approximately) the same number of elements:

table(cut_number(runif(100), 10))
# [0.00693,0.17]   (0.17,0.305]   (0.305,0.38]   (0.38,0.477]   (0.477,0.58]   (0.58,0.688]  (0.688,0.771] 
#             10             10             10             10             10             10             10 
#   (0.771,0.83]   (0.83,0.922]  (0.922,0.993] 
#             10             10             10 
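Equal counts do not imply equal widths. The sketch below uses a skewed sample to show that cut_number keeps the groups the same size while the intervals themselves vary in length. The exponential sample, the seed, and the names groups and spans are assumptions for illustration.

```r
library(ggplot2)

# Illustrative skewed data: 100 exponential draws (seed fixed for repeatability)
set.seed(42)
x <- rexp(100)

# Four groups with the same number of observations each
groups <- cut_number(x, 4)
table(groups)

# Spread of the values inside each group: later groups cover a wider range
spans <- tapply(x, groups, function(v) diff(range(v)))
spans
```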

cut_width example

Bin 100 uniformly distributed values into groups of width 0.1:

table(cut_width(runif(100), 0.1))

# [-0.05,0.05]  (0.05,0.15]  (0.15,0.25]  (0.25,0.35]  (0.35,0.45]  (0.45,0.55]  (0.55,0.65]  (0.65,0.75] 
#            3           14           10            6           11           11           13            8 
#  (0.75,0.85]  (0.85,0.95]  (0.95,1.05] 
#            9            7            8 
table(cut_width(runif(100), 0.1, boundary = 0))
  # [0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1] 
  #      10         6        13        11         8        11         9        11        11        10 

table(cut_width(runif(100), 0.1, center = 0))
# [-0.05,0.05]  (0.05,0.15]  (0.15,0.25]  (0.25,0.35]  (0.35,0.45]  (0.45,0.55]  (0.55,0.65]  (0.65,0.75] 
#            5           16           12           11            8           11            7            8 
#  (0.75,0.85]  (0.85,0.95]  (0.95,1.05] 
#           13            5            4 

table(cut_width(runif(100), 0.1, labels = FALSE))
# 1  2  3  4  5  6  7  8  9 10 11 
# 9  8 13 12  7 10  9 11  8  9  4 

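boundary sets the position of a bin edge, and center sets the position of a bin center; only one of the two may be given. Because all bins are aligned, fixing the position of a single bin determines the position of all the others. If neither is given, boundary defaults to half of width. For example, width = 1 with center = 0 centers the bins on the integers.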
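The effect of boundary and center is easier to see on deterministic data. In this sketch the input 1:6 and width = 2 are assumptions chosen so that the bin edges are whole numbers.

```r
library(ggplot2)

x <- 1:6

# Default: boundary is width / 2 = 1, so the bin edges fall on 1, 3, 5, 7
default_bins <- cut_width(x, 2)

# boundary = 0 puts a bin edge at 0, so the edges fall on 0, 2, 4, 6
edge_bins <- cut_width(x, 2, boundary = 0)

# center = 0 puts a bin center at 0, which again gives edges at odd numbers
center_bins <- cut_width(x, 2, center = 0)

table(default_bins)
table(edge_bins)
table(center_bins)
```

In this example the default and center = 0 produce the same bins, which matches the outputs shown above, where both start with the interval [-0.05,0.05].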

labels
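labels sets the labels for the levels of the result; it is passed through ... to cut. By default the levels are labelled in "(a,b]" interval notation. With labels = FALSE, plain integer codes are returned instead of a factor.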
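A short sketch of both uses of labels follows. The input 1:6 and the label names "low", "mid" and "high" are assumptions for illustration; the argument itself is forwarded to base cut.

```r
library(ggplot2)

x <- 1:6

# labels = FALSE returns integer bin codes instead of a factor
codes <- cut_width(x, 2, boundary = 0, labels = FALSE)
codes

# A character vector supplies custom level names, one per bin
named <- cut_width(x, 2, boundary = 0, labels = c("low", "mid", "high"))
table(named)
```

Integer codes are convenient when the bin index feeds into further arithmetic; named levels read better in tables and plot legends.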

Application example

Count how diamond weight (carat) is distributed in the diamonds data set:

library("dplyr")
library("ggplot2")  # provides the diamonds data set and cut_width()
diamonds %>% count(cut_width(carat, 0.5)) 

# A tibble: 11 x 2
#    `cut_width(carat, 0.5)`     n
#    <fct>                   <int>
#  1 [-0.25,0.25]              785
#  2 (0.25,0.75]             29498
#  3 (0.75,1.25]             15977
#  4 (1.25,1.75]              5313
#  5 (1.75,2.25]              2002
#  6 (2.25,2.75]               322
#  7 (2.75,3.25]                32
#  8 (3.25,3.75]                 5
#  9 (3.75,4.25]                 4
# 10 (4.25,4.75]                 1
# 11 (4.75,5.25]                 1
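The generated column name `cut_width(carat, 0.5)` is awkward to refer to later. One possible variant, sketched here with the assumed names carat_bin and carat_counts, stores the bin in a named column before counting.

```r
library(dplyr)
library(ggplot2)

# Store the bin in a named column first, then count the rows per bin
carat_counts <- diamonds %>%
  mutate(carat_bin = cut_width(carat, 0.5)) %>%
  count(carat_bin)

carat_counts
```

The counts are the same as in the table above; only the column name changes.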

ggplot(data = diamonds, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.5)

geom_histogram can also set the bin width, through its binwidth argument.
