【rocksdb源码解析】1.histogram

本文解析了RocksDB中的Histogram数据结构,它用于实时监控大量数据的分位数,虽看似复杂,实则基于估算和动态调整。核心在于HistogramStat的实现,通过估算确保在大数据场景下提供有效统计信息。
摘要由CSDN通过智能技术生成

2021-10-26挖坑待施工
基于版本6.25.0 (2021-09-20)

看了histogram最基本的实现。实际上在monitor目录下有相当多的文件,都与监控有关,即使是histogram开头的也有好几个。想必以rocksdb面面俱到的风格,即使是histogram也实现了多种子类。

这个结构用来实时保存巨大数据流当中的百分比分位数,如90分位数,99分位数
最初接触到这个概念我毫无头绪,完全想不到是怎么样的鬼斧神工能在巨大的qps下保存90分位数。如果说中位数可以用两个一样大小的堆来保持的话,百分比分位数是完全做不到的。
在看了实现后,我完全没有想到,实现出来的值仅仅是一个估算值。也就是说该结构必须保证数据量足够多,才能发挥出应有的作用。这完全出乎我的认知。
不过在看到skiplist对于建立索引的方法(随机选取层数),也意识到,诸多结构都是为了应对大数据的情况,如果数据不多,设计新结构也就没有意义,简单的计算完全可以满足。所以才产生出了诸如估算、随机等处理办法。

monitoring/histogram.h

//  Copyright (c) 2011-present, Facebook, Inc.  All rights reserved.
//  This source code is licensed under both the GPLv2 (found in the
//  COPYING file in the root directory) and Apache 2.0 License
//  (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once
#include "rocksdb/statistics.h"

#include <cassert>
#include <string>
#include <vector>
#include <map>
#include <mutex>

namespace ROCKSDB_NAMESPACE {

class HistogramBucketMapper {
 public:

  HistogramBucketMapper();

  // converts a value to the bucket index.
  size_t IndexForValue(uint64_t value) const;
  // number of buckets required.

  size_t BucketCount() const {
    return bucketValues_.size();
  }

  uint64_t LastValue() const {
    return maxBucketValue_;
  }

  uint64_t FirstValue() const {
    return minBucketValue_;
  }

  uint64_t BucketLimit(const size_t bucketNumber) const {
    assert(bucketNumber < BucketCount());
    return bucketValues_[bucketNumber];
  }

 private:
  std::vector<uint64_t> bucketValues_;
  uint64_t maxBucketValue_;
  uint64_t minBucketValue_;
};

struct HistogramStat {
  HistogramStat();
  ~HistogramStat() {}

  HistogramStat(const HistogramStat&) = delete;
  HistogramStat& operator=(const HistogramStat&) = delete;

  void Clear();
  bool Empty() const;
  void Add(uint64_t value);
  void Merge(const HistogramStat& other);

  inline uint64_t min() const { return min_.load(std::memory_order_relaxed); }
  inline uint64_t max() const { return max_.load(std::memory_order_relaxed); }
  inline uint64_t num() const { return num_.load(std::memory_order_relaxed); }
  inline uint64_t sum() const { return sum_.load(std::memory_order_relaxed); }
  inline uint64_t sum_squares() const {
    return sum_squares_.load(std::memory_order_relaxed);
  }
  inline uint64_t bucket_at(size_t b) const {
    return buckets_[b].load(std::memory_order_relaxed);
  }

  double Median() const;
  double Percentile(double p) const;
  double Average() const;
  double StandardDeviation() const;
  void Data(HistogramData* const data) const;
  std::string ToString() const;

  // To be able to use HistogramStat as thread local variable, it
  // cannot have dynamic allocated member. That's why we're
  // using manually values from BucketMapper
  std::atomic_uint_fast64_t min_;
  std::atomic_uint_fast64_t max_;
  std::atomic_uint_fast64_t num_;
  std::atomic_uint_fast64_t sum_;
  std::atomic_uint_fast64_t sum_squares_;
  std::atomic_uint_fast64_t buckets_[109]; // 109==BucketMapper::BucketCount()
  const uint64_t num_buckets_;
};

class Histogram {
public:
  Histogram() {}
  virtual ~Histogram() {};

  virtual void Clear() = 0;
  virtual bool Empty() const = 0;
  virtual void Add(uint64_t value) = 0;
  virtual void Merge(const Histogram&) = 0;

  virtual std::string ToString() const = 0;
  virtual const char* Name() const = 0;
  virtual uint64_t min() const = 0;
  virtual uint64_t max() const = 0;
  virtual uint64_t num() const = 0;
  virtual double Median() const = 0;
  virtual double Percentile(double p) const = 0;
  virtual double Average() const = 0;
  virtual double StandardDeviation() const = 0;
  virtual void Data(HistogramData* const data) const = 0;
};

class HistogramImpl : public Histogram {
 public:
  HistogramImpl() { Clear(); }

  HistogramImpl(const HistogramImpl&) = delete;
  HistogramImpl& operator=(const HistogramImpl&) = delete;

  virtual void Clear() override;
  virtual bool Empty() const override;
  virtual void Add(uint64_t value) override;
  virtual void Merge(const Histogram& other) override;
  void Merge(const HistogramImpl& other);

  virtual std::string ToString() const override;
  virtual const char* Name() const override { return "HistogramImpl"; }
  virtual uint64_t min() const override { return stats_.min(); }
  virtual uint64_t max() const override { return stats_.max(); }
  virtual uint64_t num() const override { return stats_.num(); }
  virtual double Median() const override;
  virtual double Percentile(double p) const override;
  virtual double Average() const override;
  virtual double StandardDeviation() const override;
  virtual void Data(HistogramData* const data) const override;

  virtual ~HistogramImpl() {}

 private:
  HistogramStat stats_;
  std::mutex mutex_;
};

}  // namespace ROCKSDB_NAMESPACE

其中Histogram 没什么用。HistogramImpl 包裹着最为重要的结构HistogramStat 。而HistogramStat 又依赖于静态结构HistogramBucketMapper实现。

// include/rocksdb/statistics.h
// 作为参数使用
struct HistogramData {
  double median;
  double percentile95;
  double percentile99;
  double average;
  double standard_deviation;
  // zero-initialize new members since old Statistics::histogramData()
  // implementations won't write them.
  double max = 0.0;
  uint64_t count = 0;
  uint64_t sum = 0;
  double min = 0.0;
};
// monitoring/histogram.cc 重要实现
HistogramBucketMapper::HistogramBucketMapper() {
  // If you change this, you also need to change
  // size of array buckets_ in HistogramImpl
  bucketValues_ = {1, 2};
  double bucket_val = static_cast<double>(bucketValues_.back());
  // std::atomic_uint_fast64_t rocksdb::HistogramStat::buckets_[109]
  // 长度定义为109的原因
  while ((bucket_val = 1.5 * bucket_val) <= static_cast<double>(port::kMaxUint64)) {
    bucketValues_.push_back(static_cast<uint64_t>(bucket_val));
    // Extracts two most significant digits to make histogram buckets more
    // human-readable. E.g., 172 becomes 170.
    uint64_t pow_of_ten = 1;
    while (bucketValues_.back() / 10 > 10) {
      bucketValues_.back() /= 10;
      pow_of_ten *= 10;
    }
    bucketValues_.back() *= pow_of_ten;
  }
  maxBucketValue_ = bucketValues_.back();
  minBucketValue_ = bucketValues_.front();
}

// 实际并没有保存每个数字
void HistogramStat::Add(uint64_t value) {
  // This function is designed to be lock free, as it's in the critical path
  // of any operation. Each individual value is atomic and the order of updates
  // by concurrent threads is tolerable.
  const size_t index = bucketMapper.IndexForValue(value);
  assert(index < num_buckets_);
  buckets_[index].store(buckets_[index].load(std::memory_order_relaxed) + 1,
                        std::memory_order_relaxed);

  uint64_t old_min = min();
  if (value < old_min) {
    min_.store(value, std::memory_order_relaxed);
  }

  uint64_t old_max = max();
  if (value > old_max) {
    max_.store(value, std::memory_order_relaxed);
  }

  num_.store(num_.load(std::memory_order_relaxed) + 1,
             std::memory_order_relaxed);
  sum_.store(sum_.load(std::memory_order_relaxed) + value,
             std::memory_order_relaxed);
  sum_squares_.store(
      sum_squares_.load(std::memory_order_relaxed) + value * value,
      std::memory_order_relaxed);
}

// 核中核 如何利用桶计算百分数
double HistogramStat::Percentile(double p) const {
  double threshold = num() * (p / 100.0);
  uint64_t cumulative_sum = 0;
  for (unsigned int b = 0; b < num_buckets_; b++) {
    uint64_t bucket_value = bucket_at(b);
    cumulative_sum += bucket_value;
    if (cumulative_sum >= threshold) {
      // Scale linearly within this bucket
      uint64_t left_point = (b == 0) ? 0 : bucketMapper.BucketLimit(b-1);
      uint64_t right_point = bucketMapper.BucketLimit(b);
      uint64_t left_sum = cumulative_sum - bucket_value;
      uint64_t right_sum = cumulative_sum;
      double pos = 0;
      uint64_t right_left_diff = right_sum - left_sum;
      if (right_left_diff != 0) {
       pos = (threshold - left_sum) / right_left_diff;
      }
      double r = left_point + (right_point - left_point) * pos;
      uint64_t cur_min = min();
      uint64_t cur_max = max();
      if (r < cur_min) r = static_cast<double>(cur_min);
      if (r > cur_max) r = static_cast<double>(cur_max);
      return r;
    }
  }
  return static_cast<double>(max());
}
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
如果您想将 `descriptor1.histogram[k]` 和 `descriptor2.histogram[k]` 改为布尔类型,可以按照以下步骤进行修改: 1. 首先,您需要将 `descriptor1` 和 `descriptor2` 中的 `histogram` 数组的类型修改为 `bool` 类型,以便将直方图的每个分量存储为布尔值。您可以使用以下代码定义 `SHOT352` 类型: ``` struct SHOT352 { bool histogram[352]; }; ``` 这个代码定义了一个名为 `SHOT352` 的结构体类型,其中包含一个名为 `histogram` 的布尔类型数组,用于存储 352 维的直方图。 2. 接下来,您需要将 `descriptor1.histogram[k]` 和 `descriptor2.histogram[k]` 的类型都修改为 `bool` 类型,以便进行异或操作并计算 Hamming 距离。您可以使用以下代码计算两个布尔类型的值之间的异或操作: ``` bool diff = descriptor1.histogram[k] ^ descriptor2.histogram[k]; ``` 这个代码将 `descriptor1.histogram[k]` 和 `descriptor2.histogram[k]` 两个布尔值进行异或操作,并将结果存储在 `diff` 变量中。 3. 最后,您需要相应地修改计算 Hamming 距离的代码,以便使用布尔类型的值进行计算。您可以使用以下代码计算两个布尔类型的值之间的 Hamming 距离: ``` uint64_t distance = 0; if (diff) { ++distance; } ``` 这个代码将 `diff` 变量的值与 0 进行比较,如果不为 0,则说明 `descriptor1.histogram[k]` 和 `descriptor2.histogram[k]` 两个布尔值不同,因此将 `distance` 变量加 1。这个代码将 Hamming 距离计算为两个布尔类型的值不同的数量。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值