Bloom Filter

ADCQprogrammer

Last modified 2023-04-24 16:35:56

Category: distribute. Tags: java

First published 2023-04-24 16:34:03

Article link: https://blog.csdn.net/ADCQprogrammer/article/details/130346250

References:
https://doc.bulkall.top/spring/bloom-filter/

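1. Overview:

1.1 Background and definition:

Proposed by Burton H. Bloom in 1970
Essence: a binary vector (an array of 0/1 bits) plus a series of random mapping functions (hash functions)
Purpose: test whether an element is a member of a set
Key property: if the filter says an element is absent, it is definitely absent; if it says the element is present, the element may still be absent (a false positive)
Advantages:
1. Space efficiency and query time are better than those of general-purpose structures: the memory footprint is fixed up front, and insert/query time depends only on the number of hash functions, not on the number of stored elements
2. The hash functions are independent of one another, so they are easy to compute in parallel in hardware
3. The elements themselves are not stored, which suits scenarios with confidentiality requirements
4. It can represent the universal set (every bit set to 1)
Disadvantages:
1. False positives: the false positive rate grows as more elements are inserted. A whitelist of elements known to be misjudged can compensate (if there are only a few elements, a plain hash table is the simpler choice). This limitation has led to several variants.
2. Deletion is hard: in general, elements cannot be removed from a Bloom filter.
  1. The bitmap can be replaced by an array of integers (counters): each insert increments the corresponding counters, and a delete decrements them.
  2. Deleting safely is still not that simple. The element being deleted must actually be in the filter, and the filter alone cannot guarantee that.
  3. Counter wraparound (overflow) also causes problems.
  4. The Cuckoo filter is an alternative that supports deletion.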
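
The counter-based deletion idea above can be sketched as follows. This is a minimal illustration, not a production implementation: the class name `CountingBloomFilter`, the 8-bit counters, and the double-hashing scheme built on `String.hashCode()` are all assumptions chosen for brevity. Counters saturate at 255 and are then never decremented, which sidesteps the wraparound problem at the cost of leaving those slots permanently set.

```java
/** Sketch of a counting Bloom filter: each slot is a small counter instead of a single bit. */
class CountingBloomFilter {
    private static final int MAX = 255;   // 8-bit counters saturate here instead of wrapping around
    private final byte[] counters;
    private final int hashCount;

    CountingBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.counters = new byte[size];
        this.hashCount = hashCount;
    }

    // Illustrative double hashing: index_i = (h1 + i * h2) mod size.
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = ((h1 >>> 16) ^ (h1 * 31)) | 1;   // force an odd step
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            int idx = index(value, i);
            int c = counters[idx] & 0xFF;
            if (c < MAX) {
                counters[idx] = (byte) (c + 1);
            }
        }
    }

    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (counters[index(value, i)] == 0) {
                return false;
            }
        }
        return true;
    }

    /**
     * Decrements the counters for value. Returns false when the filter is certain the
     * value was never added. A true result may still be a false positive, which is
     * exactly why deletion cannot be made fully safe with the filter alone.
     */
    boolean remove(String value) {
        if (!mightContain(value)) {
            return false;
        }
        for (int i = 0; i < hashCount; i++) {
            int idx = index(value, i);
            int c = counters[idx] & 0xFF;
            if (c > 0 && c < MAX) {               // saturated counters stay stuck
                counters[idx] = (byte) (c - 1);
            }
        }
        return true;
    }
}
```

Each slot now costs 8 bits instead of 1, so the structure is roughly eight times larger than a plain Bloom filter with the same number of slots.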

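1.2 Applications:

URL deduplication for web pages
Spam email detection
Detecting duplicate elements in a set
Query acceleration (for example, in key-value storage systems)
Protecting the database from lookups for data that does not exist: a Bloom filter reduces disk lookups for nonexistent rows or columns (cache penetration)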

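Goals of caching:

Reduce the access pressure on the database
Improve response time and concurrency

Why cache entries are given an expiry time:

To keep the cache consistent with the database
To keep cold cached data from occupying too much memory

When the cache expires or fails to absorb the traffic, requests flood straight into the database. Under high concurrency this can overwhelm the database and bring down the whole system.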

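Normal flow:

1. Receive the request and look in the cache
2. If the cache has a value, return it; otherwise go to step 3
3. Query the database. If a value is found, refresh the cache and return it; otherwise go to step 4
4. Return an error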

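Cache penetration: neither the cache nor the database holds the requested value, so every such request reaches the database. Under high concurrency, or when someone repeatedly attacks with keys that do not exist, the database can go down. Typical causes are accidentally deleted data and malicious requests for keys known not to exist.

Cache a default or empty (null) value

Analyze the business requests. If cache penetration occurs during legitimate business requests, then for the affected business data, cache an empty value (null) or a default value whenever the database query finds nothing.
Note that the expiry time for cached empty values should not be long; it is usually set to under 5 minutes. When new data for that key is written or updated in the database, the cache must be refreshed at the same time to avoid inconsistency.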
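
The null-caching approach described above might look like the following sketch. It uses an in-memory map in place of Redis so that it stays self-contained; the class name `NullCachingLookup`, the `database` function, and the explicit `nowMillis` parameter (used instead of the system clock to make the behavior easy to follow) are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch: cache "not found" results with a short TTL so repeated misses do not reach the database. */
class NullCachingLookup<K, V> {
    private static final class Entry<V> {
        final V value;               // null means "known to be absent from the database"
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database;   // hypothetical lookup; returns null when the row is missing
    private final long hitTtlMillis;         // TTL for real values
    private final long nullTtlMillis;        // shorter TTL for cached nulls

    NullCachingLookup(Function<K, V> database, long hitTtlMillis, long nullTtlMillis) {
        this.database = database;
        this.hitTtlMillis = hitTtlMillis;
        this.nullTtlMillis = nullTtlMillis;
    }

    V get(K key, long nowMillis) {
        Entry<V> entry = cache.get(key);
        if (entry != null && entry.expiresAtMillis > nowMillis) {
            return entry.value;              // may be a cached null
        }
        V loaded = database.apply(key);
        long ttl = (loaded == null) ? nullTtlMillis : hitTtlMillis;
        cache.put(key, new Entry<>(loaded, nowMillis + ttl));
        return loaded;
    }

    /** Call on every database write so a cached null never hides freshly written data. */
    void onDatabaseWrite(K key, V newValue, long nowMillis) {
        cache.put(key, new Entry<>(newValue, nowMillis + hitTtlMillis));
    }
}
```

With Redis the same idea is usually expressed by storing a sentinel value for "absent" under the key and giving it a short expiry.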

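Pre-validation in business logic

Validate requests at the business entry point: check whether the parameters are reasonable, whether they contain illegal values, whether the request is malicious, and so on, so that illegal requests are blocked early.

Bloom filter as a request whitelist

When data is written, mark it in a Bloom filter (in effect, a whitelist). When a business request finds no data in the cache, it first queries the Bloom filter to check whether the data is on the whitelist; if it is not, return empty or fail immediately.

User blacklist

When anomalies occur, monitor the accessed objects and data in real time and analyze user behavior, then restrict specific users such as deliberate abusers, crawlers, or attackers.

Adapt to the situation: apply whichever measures fit the specific case.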

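Cache avalanche: a large number of hot keys expire at the same moment (another possibility is that the cache server itself goes down). When many hot entries share the same expiry time, they all become invalid at once, every request is forwarded to the database, and database load spikes, possibly to the point of an outage.

Control expiry times so that entries do not expire together: add a random offset to each TTL so that expirations are spread out evenly
Serialize cache writes through a single thread (using a queue or a lock); this lowers concurrency
Update the cache asynchronously; suitable when strict cache consistency is not required
Double-key strategy: the primary key has an expiry time and the backup key does not; when the primary key expires, return the backup key's value directly
Build a highly available cache cluster (for cache service failures)
When an avalanche does occur, fall back on circuit breaking, rate limiting, and degradation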
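
The first measure in the list above, adding a random offset to each expiry time, can be sketched in a few lines. The helper name `TtlJitter.withJitter` and the choice of seconds as the unit are assumptions made for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: spread expirations out by adding a random offset to a base TTL. */
class TtlJitter {
    /** Returns baseSeconds plus a uniformly random offset in [0, maxJitterSeconds]. */
    static long withJitter(long baseSeconds, long maxJitterSeconds) {
        if (baseSeconds < 0 || maxJitterSeconds < 0) {
            throw new IllegalArgumentException("TTL values must be non-negative");
        }
        // nextLong(bound) is exclusive of bound, hence the + 1
        return baseSeconds + ThreadLocalRandom.current().nextLong(maxJitterSeconds + 1);
    }
}
```

For example, `TtlJitter.withJitter(3600, 300)` yields a TTL between 60 and 65 minutes, so keys written in the same batch no longer expire in the same second.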

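Cache breakdown: a single hot key expires while it is under heavy concurrent access (a special case of cache avalanche)

Mutex key: let only one thread rebuild the cache; the other threads wait until the rebuild finishes and then read from the cache again. On a single machine use synchronized or a Lock; in a distributed environment use a distributed lock.
Update the cache asynchronously; suitable when strict cache consistency is not required
Use the mutex "ahead of time": store inside the value a logical expiry timestamp that is shorter than the actual cache (Redis) expiry. When an asynchronous thread sees that the value is about to expire, it immediately extends this embedded timestamp, reloads the data from the database, and writes it back to the cache.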
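
The single-machine mutex approach described above can be sketched with a per-key lock and a second cache check inside the lock. The class name `MutexCacheLoader` and the `database` function are hypothetical, and an in-memory map stands in for the real cache; a distributed deployment would replace the `synchronized` block with a distributed lock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch: only one thread per key rebuilds an expired cache entry; the others wait and re-read. */
class MutexCacheLoader<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, Object> locks = new ConcurrentHashMap<>();
    private final Function<K, V> database;   // hypothetical slow lookup

    MutexCacheLoader(Function<K, V> database) {
        this.database = database;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            // Re-check: another thread may have rebuilt the entry while this one waited
            cached = cache.get(key);
            if (cached != null) {
                return cached;
            }
            V loaded = database.apply(key);
            if (loaded != null) {
                cache.put(key, loaded);
            }
            return loaded;
        }
    }

    /** Simulates expiry of a key. */
    void expire(K key) {
        cache.remove(key);
    }
}
```

Because the lock is per key, a rebuild of one hot key does not block reads of other keys. The `locks` map is never cleaned up in this sketch, which a real implementation would need to address.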

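If the database layer can scale out horizontally, it can absorb the extra load, which mitigates all of the problems above.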

1.3 API integration

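Redisson

import javax.annotation.Resource;
import org.junit.jupiter.api.Test;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
public class RedissonDemoTest {
    @Resource
    RedissonClient redissonClient;

    @Test
    void contextLoads() {
        RBloomFilter<String> bloomFilter = redissonClient.getBloomFilter("phoneList");
        // Initialize the filter: 1,000,000 expected insertions, 3% false positive rate.
        // tryInit returns false if the filter has already been initialized.
        bloomFilter.tryInit(1000000L, 0.03);
        // Insert the number 10086 into the filter
        bloomFilter.add("10086");
        // Check whether the following numbers are in the filter
        System.out.println(bloomFilter.contains("123456")); // false (barring a false positive)
        System.out.println(bloomFilter.contains("10086"));  // true
    }
}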
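
To give a feel for what the two initialization parameters (expected insertions and false positive rate) imply, the sketch below applies the standard textbook Bloom filter formulas. The class and method names are made up for illustration, and a given library may round or cap these values differently.

```java
/** Sketch: standard Bloom filter sizing formulas (n = expected insertions, p = target false positive rate). */
class BloomSizing {
    /** Optimal number of bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2, at least 1. */
    static int optimalHashCount(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Approximate false positive rate after n insertions: (1 - e^(-k * n / m))^k. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }
}
```

For n = 1,000,000 and p = 0.03 these formulas give roughly 7.3 million bits (a little under 1 MB) and 5 hash functions. The last formula also shows why the false positive rate climbs once more than n elements are inserted.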

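Guava

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;
import org.junit.jupiter.api.Test;

public class GuavaBloomFilterTest {
    @Test
    public void test() {
        // Number of elements to insert
        int insertions = 1000000;
        // Desired false positive rate
        double fpp = 0.02;
        // Create a Bloom filter for strings. Guava defaults to 0.03 when no rate is given;
        // here 0.02 is passed explicitly.
        BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), insertions, fpp);
        // Holds every key that really exists; used to verify membership
        Set<String> sets = new HashSet<>(insertions);
        // Holds every key that really exists; used to fetch keys by index
        List<String> lists = new ArrayList<>(insertions);
        // Insert random strings
        for (int i = 0; i < insertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bf.put(uuid);
            sets.add(uuid);
            lists.add(uuid);
        }
        int rightNum = 0;
        int wrongNum = 0;
        for (int i = 0; i < 10000; i++) {
            // In [0, 10000) there are 100 multiples of 100: those 100 probes use existing keys,
            // the other 9900 use fresh random keys
            String data = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();
            // The method is called "might" contain for a reason: a positive answer
            // still has to be confirmed against the real set
            if (bf.mightContain(data)) {
                if (sets.contains(data)) {
                    rightNum++;
                    continue;
                }
                wrongNum++;
            }
        }
        BigDecimal percent = new BigDecimal(wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
        BigDecimal bingo = new BigDecimal(9900 - wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
        System.out.println("Out of 1,000,000 elements, probing 100 that really exist, the filter reported present: " + rightNum);
        System.out.println("Out of 1,000,000 elements, probing 9900 that do not exist, wrongly reported present: " + wrongNum + ", hit rate: " + bingo + ", false positive rate: " + percent);
    }
}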

hutool

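Custom implementation

import java.util.BitSet;
// Assumed to be Apache Commons Lang 3; the original does not say which StringUtils it uses
import org.apache.commons.lang3.StringUtils;

/**
 * Custom Bloom filter:
 * n hash functions plus a bit vector.
 */
public class CustomBloomFilter {
    /**
     * Number of bits. 256 << 22 would give 2^30 (about one billion) bits;
     * the demo uses one million.
     */
//    private static final int DEFAULT_SIZE = 256 << 22;
    private static final int DEFAULT_SIZE = 1000000;
    /**
     * Seeds for the different hash functions; primes are the usual choice.
     * Eight seeds give eight polynomial hash functions, which lowers the error rate.
     */
    private static final int[] SEEDS = {3, 5, 7, 11, 13, 31, 37, 61};
    /**
     * Eight different hash functions. Adding functions lowers the false positive rate
     * only up to an optimum for a given number of bits per element, and each one costs time.
     */
    public static final HashFunction[] FUNCTIONS = new HashFunction[SEEDS.length];
    /**
     * The BitSet backing the Bloom filter.
     * A BitSet is a bitmap: a long sequence of 0/1 values.
     */
    public static final BitSet BIT_SET = new BitSet(DEFAULT_SIZE);

    static {
        // Build the hash functions once, so add/contains work without running main first
        for (int i = 0; i < SEEDS.length; i++) {
            FUNCTIONS[i] = new HashFunction(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    public static void add(String value) {
        if (StringUtils.isNotBlank(value)) {
            for (HashFunction f : FUNCTIONS) {
                // Hash the value and set the corresponding bit
                BIT_SET.set(f.hash(value), true);
            }
        }
    }

    public static boolean contains(String value) {
        if (StringUtils.isBlank(value)) {
            return false;
        }
        for (HashFunction f : FUNCTIONS) {
            // A single unset bit proves the value was never added
            if (!BIT_SET.get(f.hash(value))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        // Add data
        for (int i = 0; i < DEFAULT_SIZE / 100; i++) {
            add(String.valueOf(i));
        }
        System.err.println(System.currentTimeMillis() - startTime);

        String id = "123456789";
        add(id);
        System.err.println(contains(id));
        // Never added, so normally false; true would be a false positive
        System.err.println(contains("100000000"));
    }

    static class HashFunction {
        private final int size; // number of bits; hash results fall in [0, size)

        private final int seed; // seed of this hash function; primes are the usual choice

        public HashFunction(int size, int seed) {
            this.seed = seed;
            this.size = size;
        }

        /**
         * Computes a polynomial hash of the value, reduced to [0, size).
         */
        public int hash(String value) {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++) {
                result = seed * result + value.charAt(i);
            }
            // The original masked with (size - 1), which is only uniform when size is a
            // power of two; floorMod works for any size and handles negative results.
            return Math.floorMod(result, size);
        }
    }
}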

2. Business usage

2.1 Usage flow:

2.2 Demo:

Cache: Redis

Client library: Redisson

pom dependencies:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis</artifactId>
    </dependency>

    <!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>31.1-jre</version>
    </dependency>

    <dependency>
        <groupId>org.redisson</groupId>
        <artifactId>redisson</artifactId>
        <version>3.16.1</version>
    </dependency>
    <dependency>
        <groupId>org.redisson</groupId>
        <!-- for Spring Data Redis v.2.2.x -->
        <artifactId>redisson-spring-data-22</artifactId>
        <version>3.16.1</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
    </dependency>
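    <!-- Lombok dependency -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <!-- Note: the demo code also uses MyBatis-Plus (BaseMapper, IService) and Hutool (BeanUtil, JSONUtil, StrUtil); their dependencies are not listed here. -->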
</dependencies>

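Configuration file:

server:
  port: 8045
spring:
  redis:
    # Timeout
    timeout: 1000000ms
    # Server address
    host: 127.0.0.1
    # Port
    port: 6379
    # Database index
    database: 0
    # Password
    password: ljc123456
    lettuce:
      pool:
        # Maximum number of connections (default 8)
        max-active: 1024
        # Maximum time to block waiting for a connection (default -1ms)
        max-wait: 1000000ms
        # Maximum idle connections
        max-idle: 200
        # Minimum idle connections
        min-idle: 20
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/auth?allowPublicKeyRetrieval=true&useSSL=false&serverTimezone=Asia/Shanghai&characterEncoding=utf-8
    username: root
    password: 950824
    hikari:
      # Connection pool name (HikariCP ships with Spring Boot)
      pool-name: DateHikariCP
      # Minimum number of idle connections
      minimum-idle: 5
      # Maximum lifetime of an idle connection, default 600000 (10 minutes)
      idle-timeout: 180000
      # Maximum number of connections, default 10
      maximum-pool-size: 10
      # Auto-commit for connections returned from the pool
      auto-commit: true
      # Maximum connection lifetime; 0 means unlimited, default 1800000 (30 minutes)
      max-lifetime: 1800000
      # Connection timeout, default 30000 (30 seconds)
      connection-timeout: 30000
      # Query used to test whether a connection is alive
      connection-test-query: SELECT 1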

Main business classes:

Application entry point:

/**
 * Spring Boot Bloom filter integration demo
 */
@SpringBootApplication
@EnableScheduling // enables the scheduled job that rebuilds the Bloom filter
@MapperScan("top.jkxljc.bloom.mapper")
public class SpringBootBloomFilterApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringBootBloomFilterApplication.class, args);
    }

}

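Scheduled rebuild:

/**
 * Scheduled job that rebuilds the data in the Bloom filter
 */
@Component
public class ProductSchedule {
    @Resource
    ProductService productService;
    /**
     * Runs every day at 01:00.
     * The purpose is to drop products that have been deleted, since a Bloom filter
     * cannot remove individual elements.
     */
    @Scheduled(cron = "0 0 1 * * ?")
    public void refreshBloom() {
        productService.refreshBloom();
    }
}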

Mapper and entity

/**
 * Entity class for the product demo (Product) table
 */
@Data
public class Product implements Serializable {
    /**
     * Primary key
     **/
    private Integer id;
    /**
     * Name
     **/
    private String productName;
    /**
     * Price
     **/
    private Double productPrice;
    /**
     * Quantity
     **/
    private Integer productNum;
    /**
     * Creation time
     **/
    private LocalDateTime addTime;
    /**
     * Created by
     **/
    private String addBy;
    /**
     * Update time
     **/
    private LocalDateTime updateTime;
    /**
     * Updated by
     **/
    private String updateBy;
}
/**
 * Data access layer for the product demo (Product) table
 */
public interface ProductMapper extends BaseMapper<Product> {
}

Controller

@RestController
@RequestMapping("/product")
public class ProductController {
    @Resource
    ProductService productService;
    @GetMapping("/{id}")
    public String getProduct(@PathVariable Integer id) {
        Product product = productService.getProductById(id);
        if (BeanUtil.isEmpty(product)) {
            return "Product not found!";
        }
        return JSONUtil.toJsonStr(product);
    }

    @PostMapping("/add")
    public Boolean addProduct(@RequestBody Product product) {
        return productService.addProduct(product);
    }

    @DeleteMapping("/{id}")
    public Boolean delProduct(@PathVariable Integer id) {
        // Note: this removes the database row only. The Redis cache entry is not evicted
        // here, and the Bloom filter keeps the id until the next scheduled rebuild.
        return productService.removeById(id);
    }
}

Service classes:

/**
 * Service interface for the product demo (Product) table
 */
public interface ProductService extends IService<Product> {
    Product getProductById(Integer id);

    Boolean addProduct(Product product);
    void refreshBloom();
}
@Service
@Slf4j
public class ProductServiceImpl extends ServiceImpl<ProductMapper, Product> implements ProductService {
    private static final String BLOOM_STR = "product_list_bloom";

    private static final String REDIS_CACHE = "product_list";
    @Resource
    RedissonClient redissonClient;

    RBloomFilter<Integer> bloomFilter;
    /**
     * On startup, load the products into the Bloom filter
     */
    @PostConstruct
    public void init() {
        bloomFilter = redissonClient.getBloomFilter(BLOOM_STR, new JsonJacksonCodec());
        this.refreshBloom();
    }

    @Override
    public void refreshBloom() {
        // Note: between delete() and the end of the reload the filter is empty or incomplete,
        // so concurrent lookups in that window can be wrongly rejected.
        bloomFilter.delete();
        // Initialize the filter: 1,000,000 expected insertions (adjust to the real data volume), 3% false positive rate
        bloomFilter.tryInit(1000000L, 0.03);
        List<Integer> productIdList = this.list(new LambdaQueryWrapper<Product>().select(Product::getId))
                .stream().map(Product::getId).collect(Collectors.toList());
        productIdList.forEach(bloomFilter::add);
    }

    @Override
    public Product getProductById(Integer id) {
        // Check the Bloom filter first to guard against cache penetration
        boolean contains = bloomFilter.contains(id);
        // If the filter says the product id may exist, check the cache and then the database
        if (contains) {
            // Look in the cache first
            RMap<Integer, String> productCache = redissonClient.getMap(REDIS_CACHE);
            String cacheProduct = productCache.get(id);
            if (StrUtil.isNotEmpty(cacheProduct)) {
                // Cache hit: return it
                return JSONUtil.toBean(cacheProduct, Product.class);
            }
            Product product = this.getById(id);
            // If the database has the row, store a copy in Redis
            if (BeanUtil.isNotEmpty(product)) {
                productCache.put(id, JSONUtil.toJsonStr(product));
                return product;
            }
        } else {
            log.info("Bloom filter has no data for product id: {}", id);
        }
        return null;
    }

    @Override
    public Boolean addProduct(Product product) {
        boolean success = this.save(product);
        // After the row is saved, add it to the Redis cache and to the Bloom filter
        if (success) {
            final Integer id = product.getId();
            RMap<Integer, String> redisCache = redissonClient.getMap(REDIS_CACHE);
            redisCache.put(id, JSONUtil.toJsonStr(product));
            bloomFilter.add(id);
        }
        return success;
    }
}