Distinct Count on Big Data (Part 1): Introduction


小狼_百度


Distinct Count is a common database operation. For example, to count how many students are enrolled in each elective course:

select course, count(distinct sid)
from stu_table
group by course;
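To make the semantics of the query concrete, here is a minimal in-memory sketch in Python of what `count(distinct ...) ... group by` computes. The function name `distinct_count_by_key` and the sample rows are made up for illustration; `None` stands in for SQL NULL, which `count(distinct ...)` does not count.

```python
from collections import defaultdict

def distinct_count_by_key(rows):
    """Return {key: number of distinct non-null values} for (key, value) pairs."""
    seen = defaultdict(set)
    for key, value in rows:
        bucket = seen[key]          # make sure the group exists even if value is None
        if value is not None:       # NULL values are ignored by count(distinct ...)
            bucket.add(value)
    return {key: len(values) for key, values in seen.items()}

# Hypothetical (course, sid) rows; "s1" enrolled in "math" appears twice.
rows = [("math", "s1"), ("math", "s2"), ("math", "s1"), ("art", "s3")]
print(distinct_count_by_key(rows))  # {'math': 2, 'art': 1}
```

This keeps one set per group in memory, which is exactly what becomes expensive once the number of distinct values is large.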

Hive

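In big-data reporting, one of the most important metrics is UV (Unique Visitors), that is, the number of distinct users within a given time period. For example, to see how users are distributed across apps over one week, the HiveQL looks like this: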

select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app
order by uv desc
limit 20

Pig

In most cases Hive runs rather slowly, so I prefer Pig:

-- all users
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;
    unique_B = distinct B;
    C = group unique_B all;
    $dist = foreach C generate SIZE(unique_B);
}
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);

-- <app, users>
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- or
D = foreach C generate group as app, SIZE($1) as uv;

DataFu provides a cardinality-estimation UDF for Pig, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm to compute an approximate Distinct Count much faster:

define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
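To give a feel for how this kind of estimator works, below is a minimal sketch of plain HyperLogLog in Python. It is not the HyperLogLog++ variant that DataFu uses (there is no sparse mode and no empirical bias correction), and the class name, the SHA-1 based hash, and the default precision `p=12` are all assumptions chosen for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (not HyperLogLog++). Requires p >= 7."""

    def __init__(self, p=12):
        self.p = p                        # number of bits used as register index
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash from the first 8 bytes of SHA-1 (an arbitrary choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

# Usage: duplicates do not change the estimate.
hll = HyperLogLog()
for uid in ["u1", "u2", "u3", "u1", "u2"]:
    hll.add(uid)
print(hll.count())
```

The sketch stores only `2**p` small integers however many items are added, which is why the estimate is cheap compared with keeping a full set of uids per app. The price is a relative error of roughly `1.04 / sqrt(2**p)`.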

Spark

In Spark, once the data is loaded, Distinct Count can be done with a chain of RDD transformations (map, distinct, reduceByKey):

rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map { line => (line._1, 1) }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .mapValues{ _ => 1 }
  .reduceByKey(_ + _)

// or 
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map(_._1)
  .countByValue()
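All three variants follow the same two steps: deduplicate the (app, uid) pairs, then count how many pairs remain per app. A plain-Python sketch of that logic, without Spark and with a hypothetical function name and sample data, might look like this:

```python
from collections import Counter

def uv_per_app(rows):
    """Count distinct uids per app from an iterable of (app, uid) tuples."""
    unique_pairs = set(rows)                              # plays the role of .distinct()
    return Counter(app for app, _uid in unique_pairs)     # like .map(_._1).countByValue()

# Hypothetical log rows: user "u1" opened "maps" twice.
logs = [("maps", "u1"), ("maps", "u1"), ("maps", "u2"), ("mail", "u1")]
uv = uv_per_app(logs)
print(uv["maps"], uv["mail"])  # 2 1
```

In Spark the deduplication step is the costly one, because `distinct()` shuffles every (app, uid) pair across the cluster.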

Spark also provides an API for approximate Distinct Count:

rdd.map { row => (row.app, row.uid) }
    .countApproxDistinctByKey(0.001)

The implementation is based on the HyperLogLog algorithm:

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
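The argument passed to `countApproxDistinctByKey` is the target relative standard deviation, and smaller values need more memory. As a rough guide only, the sketch below uses the textbook HyperLogLog error bound `1.04 / sqrt(m)` to estimate how many registers a given accuracy requires. The function name is hypothetical, and Spark's actual mapping from the argument to a precision may differ from this formula.

```python
import math

def hll_precision_for(relative_sd):
    """Smallest p such that 1.04 / sqrt(2**p) <= relative_sd (textbook HLL bound)."""
    m_needed = (1.04 / relative_sd) ** 2       # registers needed for this accuracy
    return math.ceil(math.log2(m_needed))

for sd in (0.05, 0.01, 0.001):
    p = hll_precision_for(sd)
    print(sd, p, 2 ** p)   # relative SD, precision bits, registers in a dense sketch
```

Under this bound, a relative standard deviation of 0.001 corresponds to about two million registers per key in a dense representation, so a looser accuracy target is often worth considering when there are many keys.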

Alternatively, convert an RDD that has a schema into a DataFrame, call registerTempTable, and then run a SQL statement:

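import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // needed for rdd.toDF(); rdd must hold case-class instances or tuples

val df = rdd.toDF()
df.registerTempTable("app_table")

val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")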