Some Notes on Big Data Deduplication
Preface
At the company where I previously interned, one of our pipelines fetched data from HBase and dumped it into MongoDB. The records to fetch were determined by a queue of IDs stored in Redis, and since that queue was populated entirely by another project team, duplicate IDs were inevitable. The approach the project took at the time was to deduplicate using Redis's HyperLogLog data structure.
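To make the flow concrete, here is a minimal sketch of that dedup step using redis-py. The key names ("id_queue", "seen_ids") and the consumer shape are hypothetical stand-ins for illustration, not the original project's code; the core idea is that PFADD's return value signals whether an ID is likely new.

```python
# Sketch of the dedup step described above, assuming redis-py.
# Key names "id_queue" and "seen_ids" are illustrative, not the real ones.
import redis

r = redis.Redis(host="localhost", port=6379)

def consume_new_ids():
    """Pop IDs from the queue, yielding only those not seen before."""
    while True:
        item = r.lpop("id_queue")  # the ID queue filled by the other team
        if item is None:
            break
        record_id = item.decode()
        # PFADD returns 1 if the HLL's internal registers changed,
        # i.e. the ID is (probably) new; 0 means (probably) a duplicate.
        if r.pfadd("seen_ids", record_id) == 1:
            yield record_id  # new ID: proceed with HBase -> MongoDB transfer
        # duplicates are silently skipped

for record_id in consume_new_ids():
    print("transfer", record_id)
```

Note that because HyperLogLog is probabilistic, a return value of 0 can occasionally be a false positive (a genuinely new ID misjudged as seen), so this design trades a small error rate for constant memory.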
HyperLogLog
Redis HyperLogLog is a randomized algorithm that uses only a small, fixed amount of memory to provide an approximate count of the unique elements in a set.
Definition
A HyperLogLog is a probabilistic data structure used in order to count unique things (technically this is referred to as estimating the cardinality of a set). Usually counting unique items requires using an amount of memory proportional to the number of items you want to count, because you need to remember the elements you have already seen in the past in order to avoid counting them multiple times. However, there is a set of algorithms that trade memory for precision: you end up with an estimated measure with a standard error, which in the case of the Redis implementation is less than 1%. The magic of this algorithm is that you no longer need to use an amount of memory proportional to the number of items counted, and can instead use a constant amount of memory: 12k bytes in the worst case.
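A small demonstration of the constant-memory counting the definition describes, again assuming redis-py against a local Redis; the key name "uv:hll" is illustrative.

```python
# Demonstrate approximate cardinality counting with PFADD / PFCOUNT.
import redis

r = redis.Redis(host="localhost", port=6379)
r.delete("uv:hll")

# Add 100,000 distinct elements to the HyperLogLog.
for i in range(100_000):
    r.pfadd("uv:hll", f"user:{i}")

# PFCOUNT returns an estimate within roughly 1% of the true count (100000),
# while the HLL itself occupies at most 12 KB no matter how many distinct
# elements have been added.
print(r.pfcount("uv:hll"))
```

Compare this with an exact approach such as a Redis SET, whose memory grows linearly with the number of distinct elements; the HLL caps memory at 12 KB in exchange for the ~1% standard error.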