RocksDB Troubleshooting and Related Study Notes

心中的亚雷泽

Article link: https://blog.csdn.net/gunri_tianjin/article/details/106192139

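While using RocksDB in a real deployment under highly concurrent writes, we ran into blocked data writes.

[Scenario]

Data volume per round: more than 300 tables, each holding roughly 4K key-value pairs.

Flush-to-disk interval: once every 60 s.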

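[Log analysis]

The RocksDB log showed "Stall write" messages followed by "Stop write" messages, and the Java threads operating on RocksDB were blocked.

An explanation found online:

As we know, when flush/compaction cannot keep up with the write rate, RocksDB lowers the write rate or even stops writes outright. What would go wrong without this policy?

Mainly two things:

Space amplification increases, which can exhaust disk space.
Read amplification increases, which severely degrades read performance.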

[Resolution]

To find the root cause of the problem, I downloaded a copy of the RocksDB source code (https://github.com/facebook/rocksdb/).

The source tree contains an example options file (rocksdb-master\examples\rocksdb_option_file_example.ini):

# This is a RocksDB option file.
#
# A typical RocksDB options file has four sections, which are
# Version section, DBOptions section, at least one CFOptions
# section, and one TableOptions section for each column family.
# The RocksDB options file in general follows the basic INI
# file format with the following extensions / modifications:
#
#  * Escaped characters
#    We escaped the following characters:
#     - \n -- line feed - new line
#     - \r -- carriage return
#     - \\ -- backslash \
#     - \: -- colon symbol :
#     - \# -- hash tag #
#  * Comments
#    We support # style comments.  Comments can appear at the ending
#    part of a line.
#  * Statements
#    A statement is of the form option_name = value.
#    Each statement contains a '=', where extra white-spaces
#    are supported. However, we don't support multi-lined statement.
#    Furthermore, each line can only contain at most one statement.
#  * Sections
#    Sections are of the form [SecitonTitle "SectionArgument"],
#    where section argument is optional.
#  * List
#    We use colon-separated string to represent a list.
#    For instance, n1:n2:n3:n4 is a list containing four values.
#
# Below is an example of a RocksDB options file:
[Version]
  rocksdb_version=4.3.0
  options_file_version=1.1

[DBOptions]  # RocksDB database-wide options
  stats_dump_period_sec=600
  max_manifest_file_size=18446744073709551615
  bytes_per_sync=8388608
  delayed_write_rate=2097152
  WAL_ttl_seconds=0
  WAL_size_limit_MB=0
  max_subcompactions=1
  wal_dir=
  wal_bytes_per_sync=0
  db_write_buffer_size=0
  keep_log_file_num=1000
  table_cache_numshardbits=4
  max_file_opening_threads=1
  writable_file_max_buffer_size=1048576
  random_access_max_buffer_size=1048576
  use_fsync=false
  max_total_wal_size=0
  max_open_files=-1
  skip_stats_update_on_db_open=false
  max_background_compactions=16  # number of threads that run compaction
  manifest_preallocation_size=4194304
  max_background_flushes=7  # number of threads that run flush (writing memtables to disk)
  is_fd_close_on_exec=true
  max_log_file_size=0
  advise_random_on_open=true
  create_missing_column_families=false
  paranoid_checks=true
  delete_obsolete_files_period_micros=21600000000
  log_file_time_to_roll=0
  compaction_readahead_size=0
  create_if_missing=false
  use_adaptive_mutex=false
  enable_thread_tracking=false
  allow_fallocate=true
  error_if_exists=false
  recycle_log_file_num=0
  skip_log_error_on_recovery=false
  db_log_dir=
  new_table_reader_for_compaction_inputs=true
  allow_mmap_reads=false
  allow_mmap_writes=false
  use_direct_reads=false
  use_direct_writes=false


[CFOptions "default"]  # Column Family options
  compaction_style=kCompactionStyleLevel
  compaction_filter=nullptr
  num_levels=6
  table_factory=BlockBasedTable
  comparator=leveldb.BytewiseComparator
  max_sequential_skip_in_iterations=8
  soft_rate_limit=0.000000
  max_bytes_for_level_base=1073741824
  memtable_prefix_bloom_probes=6
  memtable_prefix_bloom_bits=0
  memtable_prefix_bloom_huge_page_tlb_size=0
  max_successive_merges=0
  arena_block_size=16777216
  min_write_buffer_number_to_merge=1
  target_file_size_multiplier=1
  source_compaction_factor=1
  max_bytes_for_level_multiplier=8
  max_bytes_for_level_multiplier_additional=2:3:5
  compaction_filter_factory=nullptr
  max_write_buffer_number=8
  level0_stop_writes_trigger=20  # when this many Level-0 files have piled up, writes are stopped
  compression=kSnappyCompression
  level0_file_num_compaction_trigger=4
  purge_redundant_kvs_while_flush=true
  max_write_buffer_size_to_maintain=0
  memtable_factory=SkipListFactory
  max_grandparent_overlap_factor=8
  expanded_compaction_factor=25
  hard_pending_compaction_bytes_limit=137438953472
  inplace_update_num_locks=10000
  level_compaction_dynamic_level_bytes=true
  level0_slowdown_writes_trigger=12  # when this many Level-0 files have piled up, writes are slowed down
  filter_deletes=false
  verify_checksums_in_compaction=true
  min_partial_merge_operands=2
  paranoid_file_checks=false
  target_file_size_base=134217728
  optimize_filters_for_hits=false
  merge_operator=PutOperator
  compression_per_level=kNoCompression:kNoCompression:kNoCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression
  compaction_measure_io_stats=false
  prefix_extractor=nullptr
  bloom_locality=0
  write_buffer_size=134217728  # size of each memtable
  disable_auto_compactions=false  # whether automatic compaction is disabled
  inplace_update_support=false



[TableOptions/BlockBasedTable "default"]
  format_version=2
  whole_key_filtering=true
  no_block_cache=false
  checksum=kCRC32c
  filter_policy=rocksdb.BuiltinBloomFilter
  block_size_deviation=10
  block_size=8192
  block_restart_interval=16
  cache_index_and_filter_blocks=false
  pin_l0_filter_and_index_blocks_in_cache=false
  pin_top_level_index_and_filter=false
  index_type=kBinarySearch
  hash_index_allow_collision=true
  flush_block_policy_factory=FlushBlockBySizePolicyFactory

* All of the parameters above can also be set when RocksDB is embedded and called from the Java layer.
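To make the file format described in the header comments concrete, here is a minimal Java sketch that reads sections, `name = value` statements, trailing `#` comments, and colon-separated lists. `OptionFileSketch` and its methods are hypothetical names of my own, not part of RocksDB or RocksJava, and the sketch deliberately ignores the escaped characters (`\#`, `\:` and so on) that the real format supports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of RocksDB: reads "name = value" statements
// from option-file text and groups them by section header.
class OptionFileSketch {

    // Returns section title -> (option name -> raw value), in file order.
    static Map<String, Map<String, String>> parse(String text) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String rawLine : text.split("\n")) {
            // Drop a trailing "#" comment. Simplification: an escaped "\#"
            // is NOT recognised here.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            // Section header such as [DBOptions] or [CFOptions "default"].
            if (line.startsWith("[") && line.endsWith("]")) {
                current = new LinkedHashMap<>();
                sections.put(line.substring(1, line.length() - 1), current);
                continue;
            }
            int eq = line.indexOf('=');
            if (eq < 0 || current == null) {
                continue; // not a statement, or a statement outside any section
            }
            current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return sections;
    }

    // Splits a colon-separated list value such as "2:3:5".
    // Simplification: an escaped "\:" is NOT recognised here.
    static List<String> splitList(String value) {
        return new ArrayList<>(Arrays.asList(value.split(":")));
    }
}
```

Calling `OptionFileSketch.parse(text).get("DBOptions").get("max_background_flushes")` on the listing above would give the string `7`, which can then be compared against the thresholds discussed in the rest of this post.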

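Trigger conditions for RocksDB's "slow write" (stall) and "stop write" states, and how to mitigate them:

Write stalls usually show up in the following places.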

Too many memtables

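When the number of memtables waiting to be flushed to level 0 reaches or exceeds max_write_buffer_number, RocksDB stops writes completely until the flush finishes. In addition, when max_write_buffer_number is greater than 3 and the number of memtables waiting to be flushed is at least max_write_buffer_number - 1, RocksDB stalls (slows down) writes. LevelDB has no such condition, because it only ever has one memtable and one immutable memtable.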
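The memtable rule can be written down as a small decision function. This is a hypothetical model of the rule in plain Java, not RocksDB code and not a RocksJava API; the names are mine, and the "more than 3 write buffers" precondition for the slowdown branch follows RocksDB's write-stall documentation.

```java
// Hypothetical model of the memtable-based check; it mirrors the rule in
// prose and does not call into RocksDB.
class MemtableStallSketch {

    enum WriteState { NORMAL, STALL, STOP }

    // The slowdown branch only applies with more than this many write buffers
    // (threshold as given in RocksDB's write-stall documentation).
    static final int MIN_BUFFERS_FOR_SLOWDOWN = 3;

    static WriteState classify(int unflushedMemtables, int maxWriteBufferNumber) {
        // Every write buffer is full and waiting for flush: writes stop.
        if (unflushedMemtables >= maxWriteBufferNumber) {
            return WriteState.STOP;
        }
        // Only one free write buffer left: writes are slowed down.
        if (maxWriteBufferNumber > MIN_BUFFERS_FOR_SLOWDOWN
                && unflushedMemtables >= maxWriteBufferNumber - 1) {
            return WriteState.STALL;
        }
        return WriteState.NORMAL;
    }
}
```

With max_write_buffer_number=8 as in the listing, 7 unflushed memtables would be classified as STALL and 8 as STOP.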

Too many level-0 SST files

When the number of level 0 SST files reaches level0_slowdown_writes_trigger, RocksDB stalls writes. When it reaches level0_stop_writes_trigger, RocksDB stops writes until the level 0 to level 1 compaction completes and the number of level 0 SST files has dropped.

Too many pending compaction bytes

When the estimated number of bytes pending compaction reaches soft_pending_compaction_bytes_limit, RocksDB stalls writes; when it reaches hard_pending_compaction_bytes_limit, RocksDB stops writes. LevelDB does not have this mechanism.
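The level 0 file count and the pending compaction bytes follow the same two-threshold pattern: slow down at the lower limit, stop at the upper one. A minimal sketch of that pattern, assuming nothing beyond what is described above (the class and method names are hypothetical, and the real check inside RocksDB has more detail than this):

```java
// Hypothetical two-threshold model shared by the level 0 file check and the
// pending-compaction-bytes check. Not RocksDB code.
class ThresholdStallSketch {

    enum WriteState { NORMAL, STALL, STOP }

    // current:    level 0 file count, or estimated pending compaction bytes
    // slowdownAt: level0_slowdown_writes_trigger, or the soft byte limit
    // stopAt:     level0_stop_writes_trigger, or the hard byte limit
    static WriteState classify(long current, long slowdownAt, long stopAt) {
        if (current >= stopAt) {
            return WriteState.STOP;
        }
        if (current >= slowdownAt) {
            return WriteState.STALL;
        }
        return WriteState.NORMAL;
    }
}
```

With the level 0 triggers from the listing, `classify(15, 12, 20)` falls in the STALL range. For pending compaction bytes, the listing only sets the hard limit (137438953472 bytes, i.e. 128 GiB), so any soft limit passed in would be a value you choose yourself.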

Mitigate Stall

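Stalls cannot be eliminated entirely; configuration can only reduce them.
When a stall occurs, RocksDB lowers the write rate to delayed_write_rate, and possibly even lower. Note also that the slowdown/stop triggers and the pending compaction limits are evaluated per column family (CF), whereas a stall applies to the whole DB: if the program uses several CFs and one of them stalls, the entire DB stalls.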
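Two consequences of this paragraph can be sketched in a few lines of Java: the DB-wide state is the most severe state of any column family, and a throttled write rate puts a lower bound on how long a given amount of data takes to be accepted. The class, the method names, and the column family names in the usage note are hypothetical; this is a model of the behaviour described above, not RocksDB's implementation.

```java
import java.util.Map;

// Hypothetical model: per-CF conditions roll up into one DB-wide state.
class DbStallSketch {

    // Declared in order of severity so that compareTo() ranks them.
    enum WriteState { NORMAL, STALL, STOP }

    // The whole DB takes the most severe state found in any column family.
    static WriteState dbState(Map<String, WriteState> perColumnFamily) {
        WriteState worst = WriteState.NORMAL;
        for (WriteState state : perColumnFamily.values()) {
            if (state.compareTo(worst) > 0) {
                worst = state;
            }
        }
        return worst;
    }

    // Lower bound, in seconds, on accepting `bytes` while writes are throttled
    // to `delayedWriteRate` bytes per second. The real rate may be lower still,
    // so the real time can only be longer.
    static double minSecondsWhileStalled(long bytes, long delayedWriteRate) {
        return (double) bytes / delayedWriteRate;
    }
}
```

For example, a map with a "default" CF in NORMAL and a second CF in STALL yields STALL for the whole DB. With the values from the listing, `minSecondsWhileStalled(134217728L, 2097152L)` is 64.0: filling one write_buffer_size worth of data at delayed_write_rate takes at least 64 seconds.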

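If the stall is caused by pending memtable flushes not finishing in time, we can try:

Increase max_background_flushes, so that more threads can flush memtables concurrently.
Increase max_write_buffer_number, so that smaller memtables can be used and each flush finishes faster.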

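If the stall is caused by too many level 0 files or too many pending compaction bytes, we need to speed up compaction. Reducing write amplification also helps, because the lower the write amplification, the less data has to be compacted. So we can try:
Increase max_background_compactions, so that more threads run compaction.
Increase write_buffer_size, so that memtables are larger, which reduces write amplification.
Increase min_write_buffer_number_to_merge, so that memtables are merged before being flushed, which reduces the number of keys written; note that this hurts the performance of reads served from memtables.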
