- HyperLogLog
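- An advanced Redis data structure that can count UV (unique visitors) in user-access statistics, i.e. the number of distinct visitors, where a repeat visitor is counted only once per day.
- HyperLogLog provides an approximate deduplicated count, with a standard error of 0.81%.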
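- How it works:
- Record the longest run of consecutive zero bits at the low end of each value and use it to estimate the total count. Experiments show a clear linear relationship between the logarithm of the number of random values and the length of the longest zero run.
- Use multiple buckets and average their estimates to get a more accurate result.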
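The first idea above can be sketched with a single register. This is a minimal illustration, assuming uniformly distributed 64-bit random values stand in for hashed user IDs; the class name `LowZerosSketch`, its method names, and the seed are hypothetical and not part of the original notes.

```java
import java.util.Random;

public class LowZerosSketch {
    // Number of consecutive zero bits at the low end of value (64 when value is 0).
    public static int lowZeros(long value) {
        return Long.numberOfTrailingZeros(value);
    }

    // Longest low-zero run seen across all values.
    public static int maxLowZeros(long[] values) {
        int max = 0;
        for (long v : values) {
            max = Math.max(max, lowZeros(v));
        }
        return max;
    }

    // Single-register estimate: a longest run of k zeros suggests roughly 2^k values.
    public static double roughEstimate(int maxLowZeros) {
        return Math.pow(2, maxLowZeros);
    }

    public static void main(String[] args) {
        Random random = new Random(42); // arbitrary fixed seed (assumption)
        for (int n = 1_000; n <= 1_000_000; n *= 10) {
            long[] values = new long[n];
            for (int i = 0; i < n; i++) {
                values[i] = random.nextLong();
            }
            int k = maxLowZeros(values);
            double log2n = Math.log(n) / Math.log(2);
            System.out.printf("n=%d log2(n)=%.1f maxLowZeros=%d estimate=%.0f%n",
                    n, log2n, k, roughEstimate(k));
        }
    }
}
```

A single register like this is very noisy, since one lucky value can double the estimate. That is the motivation for spreading values over many buckets and averaging.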
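- Usage code: the following code simulates 10000 users visiting from multiple threads (two threads each add the same 10000 user IDs); the final count printed was 10064.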
public static void testHyperLogLog() throws InterruptedException {
    // A Jedis instance is not thread-safe, so each thread gets its own connection.
    Jedis jedis1 = new Jedis("192.168.198.128");
    Jedis jedis2 = new Jedis("192.168.198.128");
    // Both threads add the same IDs user0 .. user9999 to the key "codehole".
    Thread thread1 = new Thread(() -> {
        for (int i = 0; i < 10000; i++) {
            jedis1.pfadd("codehole", "user" + i);
        }
    });
    Thread thread2 = new Thread(() -> {
        for (int i = 0; i < 10000; i++) {
            jedis2.pfadd("codehole", "user" + i);
        }
    });
    thread1.start();
    thread2.start();
    thread1.join();
    thread2.join();
    // PFCOUNT returns the approximate number of distinct elements.
    System.out.println(jedis1.pfcount("codehole"));
    jedis1.close();
    jedis2.close();
}
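As a rough sanity check on the 10064 figure, the relative error of that run can be worked out and set against the quoted standard error. The formula $1.04/\sqrt{m}$ and the register count $m = 16384$ come from the HyperLogLog algorithm as implemented in Redis, not from the code above.

$$
\frac{|10064 - 10000|}{10000} = 0.0064 = 0.64\%,
\qquad
\frac{1.04}{\sqrt{16384}} = \frac{1.04}{128} \approx 0.0081 = 0.81\%
$$

This run therefore landed within one standard error. The standard error is a statistical spread, not a hard bound, so other inputs may deviate more.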
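- Simulating the principle in code
- BitKeeper class: each BitKeeper acts as one bucket. Each new value is assigned to a randomly chosen bucket, which records the longest run of low zero bits it has seen. At the end, the harmonic mean of the per-bucket maxima is taken. Raising 2 to that mean estimates the count per bucket, and multiplying by the number of buckets gives the final result.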
package com.xliu.chapter1;

import redis.clients.jedis.Jedis;

import java.util.concurrent.ThreadLocalRandom;

public class HyperLogLogTest {
    private int n;               // number of random values fed into the buckets
    private BitKeeper[] keepers; // one BitKeeper per bucket

    public HyperLogLogTest(int n) {
        this.n = n;
        this.keepers = new BitKeeper[1024];
        for (int i = 0; i < keepers.length; i++) {
            keepers[i] = new BitKeeper();
        }
    }

    // Runs the experiment for n = 100000, 200000, ..., 900000 and prints
    // n, the estimate, and the relative error (step matches the results listed below).
    public static void testPF() {
        for (int i = 100000; i < 1000000; i += 100000) {
            HyperLogLogTest hyperLogLogTest = new HyperLogLogTest(i);
            hyperLogLogTest.work();
            double estimate = hyperLogLogTest.estimate();
            System.out.printf("%d %.2f %.2f\n", i, estimate, Math.abs(estimate - i) / i);
        }
    }
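    // Harmonic mean of the per-bucket maxbits, then 2^avgBits * number of buckets.
    private double estimate() {
        double sumBitsInverse = 0.0;
        for (BitKeeper keeper : keepers) {
            // A bucket with maxbits == 0 would make this term infinite; with the
            // sample sizes used here every bucket receives many values.
            sumBitsInverse += 1.0 / keeper.maxbits;
        }
        double avgBits = keepers.length / sumBitsInverse;
        return Math.pow(2, avgBits) * keepers.length;
    }

    // Feeds n random values, each into a bucket chosen from bits 16..27 of the value.
    private void work() {
        for (int i = 0; i < n; i++) {
            long m = ThreadLocalRandom.current().nextLong(1L << 32);
            BitKeeper keeper = keepers[(int) (((m & 0xfff0000) >> 16) % keepers.length)];
            keeper.random();
        }
    }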
    public static void testHyperLogLog() throws InterruptedException {
        // A Jedis instance is not thread-safe, so each thread gets its own connection.
        Jedis jedis1 = new Jedis("192.168.198.128");
        Jedis jedis2 = new Jedis("192.168.198.128");
        // Both threads add the same IDs user0 .. user9999 to the key "codehole".
        Thread thread1 = new Thread(() -> {
            for (int i = 0; i < 10000; i++) {
                jedis1.pfadd("codehole", "user" + i);
            }
        });
        Thread thread2 = new Thread(() -> {
            for (int i = 0; i < 10000; i++) {
                jedis2.pfadd("codehole", "user" + i);
            }
        });
        thread1.start();
        thread2.start();
        thread1.join();
        thread2.join();
        // PFCOUNT returns the approximate number of distinct elements.
        System.out.println(jedis1.pfcount("codehole"));
        jedis1.close();
        jedis2.close();
    }

    public static void main(String[] args) throws InterruptedException {
        testPF();
    }
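    // One bucket: remembers the longest run of low zero bits it has seen.
    private static class BitKeeper {
        private int maxbits;

        // Draws a fresh 32-bit random value and updates the maximum.
        public void random() {
            long value = ThreadLocalRandom.current().nextLong(1L << 32);
            int bits = lowZeros(value);
            maxbits = Math.max(bits, maxbits);
        }

        // Counts consecutive zero bits at the low end of value, capped at 31.
        private int lowZeros(long value) {
            int i = 1;
            for (; i < 32; i++) {
                // Clearing the lowest i bits changes the value once one of them is set.
                if (value >> i << i != value) {
                    break;
                }
            }
            return i - 1;
        }
    }
}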
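- Experiment results: each row lists the actual count, the estimate, and the relative error. The estimates are fairly accurate, with relative errors in the single-digit percent range. A real implementation is of course more complex and more precise.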
100000 93113.60 0.07
200000 192753.58 0.04
300000 309455.21 0.03
400000 399014.89 0.00
500000 488323.85 0.02
600000 614030.15 0.02
700000 726671.02 0.04
800000 819612.06 0.02
900000 893945.77 0.01
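The estimation step can also be isolated as a pure function of the per-bucket maxima, which makes the role of the harmonic mean easier to see. This is a minimal sketch; `HarmonicEstimateSketch` and its method names are hypothetical, and every maximum is assumed to be positive.

```java
public class HarmonicEstimateSketch {
    // Harmonic mean of the per-bucket maxima (all entries assumed > 0).
    public static double harmonicMean(int[] maxBits) {
        double sumInverse = 0.0;
        for (int bits : maxBits) {
            sumInverse += 1.0 / bits;
        }
        return maxBits.length / sumInverse;
    }

    // Arithmetic mean, shown only for comparison.
    public static double arithmeticMean(int[] maxBits) {
        double sum = 0.0;
        for (int bits : maxBits) {
            sum += bits;
        }
        return sum / maxBits.length;
    }

    // Same formula as the simulation: 2^(harmonic mean) * number of buckets.
    public static double estimate(int[] maxBits) {
        return Math.pow(2, harmonicMean(maxBits)) * maxBits.length;
    }

    public static void main(String[] args) {
        // Three typical buckets and one outlier with an unusually long zero run.
        int[] maxBits = {10, 10, 10, 20};
        System.out.println("harmonic mean:   " + harmonicMean(maxBits));
        System.out.println("arithmetic mean: " + arithmeticMean(maxBits));
        System.out.println("estimate:        " + estimate(maxBits));
    }
}
```

The harmonic mean is pulled up less by a single large outlier than the arithmetic mean. Because the mean sits in an exponent, that damping has a large effect on the final estimate.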