HyperLogLog algorithm: This is why the HyperLogLog algorithm is my new favorite

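HyperLogLog is a simple yet powerful cardinality estimation algorithm, widely used by internet giants such as Google and Reddit. It estimates the number of unique elements by tracking the longest run of zeroes observed in the data set, which makes it an efficient way to count unique values in large volumes of data while keeping relatively high accuracy. The article explains the basic idea behind HLL through a scenario about collecting phone numbers, and mentions real-world uses in BigQuery and at Reddit.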

by Alex Nadalin

This is why the HyperLogLog algorithm is my new favorite

Every now and then I bump into a concept so simple and powerful that I wish I’d come up with such an incredible and beautiful idea myself.

I discovered HyperLogLog (HLL) a couple of years ago, and fell in love with it right after reading how Redis decided to add an HLL data structure.

The idea behind HLL is devastatingly simple but extremely powerful. This is what makes it such a widespread algorithm, used by giants of the internet such as Google and Reddit.

Collecting phone numbers

My friend Tommy and I planned to go to a conference. While heading to its location, we decided to wager on who would meet the most new people. Once we reached the place, we’d start striking up conversations and keep a count of how many people we talked to.

At the end of the event, Tommy comes to me with his figure — say, 17 — and I tell him that I had a word with 46 people.

Clearly, I am the winner, but Tommy’s frustrated as he thinks I’ve counted the same people multiple times. He believes he only saw me talking to maybe 15–20 people in total.

So, the wager’s off. We decide that for our next event, we’ll be taking down names instead, to be sure we’re counting unique people, and not just the total number of conversations.

At the end of the following conference, we each show up with a very long list of names and, guess what? Tommy had a couple more encounters than I did! We laugh it off, and while discussing our approach to counting uniques, Tommy comes up with a great idea:

“Alex, you know what? We can’t go around with pen and paper keeping a list of names, it’s really impractical! Today I spoke to 65 different people, and counting their names on this paper was a real pain. I lost count 3 times and had to start from scratch!”

“Yeah, I know, but do we even have an alternative?”

“What if, for our next conference, instead of asking for names, we ask people for the last 5 digits of their phone number? Instead of winning by counting names, the winner will be the one who spoke to someone with the longest sequence of leading zeroes in those digits.”
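As a quick aside, here is a minimal Python sketch of how a game like Tommy's could be scored. The helper names (`leading_zeros`, `longest_zero_run`) and the sample digits are made up for illustration; they are not part of the conversation, and this is only the scoring rule, not a full HyperLogLog implementation.

```python
def leading_zeros(digits: str) -> int:
    """Count the zeroes at the start of a digit string such as '00417'."""
    return len(digits) - len(digits.lstrip("0"))


def longest_zero_run(answers) -> int:
    """Return the longest run of leading zeroes seen across all answers."""
    return max((leading_zeros(d) for d in answers), default=0)


# Hypothetical answers collected at a conference (invented numbers).
# "00417" appears twice: the same person was asked two times.
alex = ["51234", "00417", "90210", "00417", "00042", "07734"]
tommy = ["12345", "07001", "33010"]

print(longest_zero_run(alex))   # prints 3
print(longest_zero_run(tommy))  # prints 1
```

Notice that asking the same person twice cannot change the score, which is exactly the double-counting problem that sank our first wager. And if the digits are roughly uniformly random, each extra leading zero is ten times rarer, so a long run hints that many different people were asked.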

“Wait Tommy, you’re going too fast! Slow down a second and give me an example…”

“Sure, just ask each person for tho
