1. question
给定一个hive表tmp,表中有一个字段result(double型);给定一个区间分类个数m,统计每个区间内的count(*)
的数量。
2. answer
在linux终端下,
# step1 设定区间分类个数,此处设置为10
m=10
# step2 求解字段 result 的最大值和最小值
section=`hive -e "
select max(result) as max_num,
min(result) as min_num
from tmp
"`
max_num=`echo -e "${section}" | cut -f1`
min_num=`echo -e "${section}" | cut -f2`
# step3 求解区间的长度
len_section=`hive -e "
select (${max_num} - ${min_num}) / ${m}
from dual
"`
# step4 统计每个区间的个数
hive -e "
select ${min_num} + floor((result - ${min_num}) / ${len_section}) * ${len_section} as section_flag,
count(*) as num
from tmp
"
最后的运行结果
0.0 1197
0.1 1587
0.2 1495
0.3 981
0.4 1074
0.5 947
0.6 424
0.7 173
0.8 114
0.9 59
1.0 2