Some Notes on Big Data Deduplication
Preface
At the company where I previously interned, one of our pipelines fetched data from HBase and dumped it into MongoDB. The records to fetch were determined by a queue of IDs stored in Redis, and since that queue was populated entirely by another project team, duplicate IDs were inevitable. The approach the project took at the time was to deduplicate using Redis's HyperLogLog data structure.
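To make the flow concrete, here is a minimal sketch of that dedup step using redis-py. The key names ("id_queue", "seen_ids") and the consumer shape are hypothetical stand-ins for illustration, not the original project's code; the core idea is that PFADD's return value signals whether an ID is likely new.

```python
# Sketch of the dedup step described above, assuming redis-py.
# Key names "id_queue" and "seen_ids" are illustrative, not the real ones.
import redis

r = redis.Redis(host="localhost", port=6379)

def consume_new_ids():
    """Pop IDs from the queue, yielding only those not seen before."""
    while True:
        item = r.lpop("id_queue")  # the ID queue filled by the other team
        if item is None:
            break
        record_id = item.decode()
        # PFADD returns 1 if the HLL's internal registers changed,
        # i.e. the ID is (probably) new; 0 means (probably) a duplicate.
        if r.pfadd("seen_ids", record_id) == 1:
            yield record_id  # new ID: proceed with HBase -> MongoDB transfer
        # duplicates are silently skipped

for record_id in consume_new_ids():
    print("transfer", record_id)
```

Note that because HyperLogLog is probabilistic, a return value of 0 can occasionally be a false positive (a genuinely new ID misjudged as seen), so this design trades a small error rate for constant memory.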
HyperLogLog
Redis HyperLogLog is a randomized algorithm that uses only a small, fixed amount of memory to provide an approximate count of the unique elements in a set.
Definition
A HyperLogLog is a probabilistic data structure used in order to count unique things (technically this is referred to as estimating the cardinality of a set). Usually counting unique items requires using an amount of memory proportional to the number of items you want to count, because you need to remember the elements you have already seen in the past in order to avoid counting them multiple times. However, there is a set of algorithms that trade memory for precision: you end up with an estimated measure with a standard error, which in the case of the Redis implementation is less than 1%. The magic of this algorithm is that you no longer need to use an amount of memory proportional to the number of items counted, and can instead use a constant amount of memory: 12k bytes in the worst case.
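A small demonstration of the constant-memory counting the definition describes, again assuming redis-py against a local Redis; the key name "uv:hll" is illustrative.

```python
# Demonstrate approximate cardinality counting with PFADD / PFCOUNT.
import redis

r = redis.Redis(host="localhost", port=6379)
r.delete("uv:hll")

# Add 100,000 distinct elements to the HyperLogLog.
for i in range(100_000):
    r.pfadd("uv:hll", f"user:{i}")

# PFCOUNT returns an estimate within roughly 1% of the true count (100000),
# while the HLL itself occupies at most 12 KB no matter how many distinct
# elements have been added.
print(r.pfcount("uv:hll"))
```

Compare this with an exact approach such as a Redis SET, whose memory grows linearly with the number of distinct elements; the HLL caps memory at 12 KB in exchange for the ~1% standard error.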