How the etcd Backend Storage Engine Works

惜暮

Article link: https://blog.csdn.net/u010853261/article/details/109630223

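This article takes a close look at how etcd's KV storage layer works: the BoltDB-based Backend design, the implementation of read and write transactions (ReadTx and batchTx), in particular the optimization of concurrent read transactions, and details of MVCC indexing and storage such as the Revision concept, keyIndex and treeIndex. Through the Backend interface and the MVCC layer, etcd provides high-performance concurrent reads and writes, along with efficient data recovery and index management.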

How the etcd KV storage layer works

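Some background helps in understanding the etcd storage layer:

A read does not go through the Raft log. It is served by a read transaction directly against the KV storage layer (a linearizable read still confirms its read index with the leader first).
A write must first go through the Raft consensus algorithm. Once the cluster has agreed on it, the write's log entry is applied, and that is the point at which it enters etcd's KV storage layer.
The KV storage layer is built on BoltDB, an embedded KV database. BoltDB already provides transactional atomicity, consistency, isolation and durability, and it supports one writer running concurrently with many readers. Its writes rely on copy-on-write (COW), which limits write throughput, so BoltDB also offers batched write transactions to raise it. etcd's mvcc/backend layer builds on BoltDB's one-writer, N-reader concurrency and optimizes it further to improve write throughput.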
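
To make the copy-on-write cost concrete, here is a minimal sketch in plain Go. It is not BoltDB code: cowStore, Get and PutBatch are hypothetical names, and copying a whole map stands in for BoltDB's page-level copying. It only illustrates the shape of the trade-off, assuming a single writer at a time: readers never block, while every commit pays for a copy, which is why grouping many writes into one commit helps.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowStore is a hypothetical copy-on-write map: readers load an immutable
// snapshot, and the single writer publishes a modified copy.
type cowStore struct {
	writeMu sync.Mutex   // serializes writers (one writer at a time)
	snap    atomic.Value // holds a map[string]string that is never mutated after Store
}

func newCowStore() *cowStore {
	s := &cowStore{}
	s.snap.Store(map[string]string{})
	return s
}

// Get reads from the current snapshot without taking any lock.
func (s *cowStore) Get(k string) (string, bool) {
	m := s.snap.Load().(map[string]string)
	v, ok := m[k]
	return v, ok
}

// PutBatch copies the snapshot once, applies every update, then publishes
// the copy. It returns how many existing entries had to be copied.
func (s *cowStore) PutBatch(kvs map[string]string) int {
	s.writeMu.Lock()
	defer s.writeMu.Unlock()
	old := s.snap.Load().(map[string]string)
	next := make(map[string]string, len(old)+len(kvs))
	for k, v := range old {
		next[k] = v
	}
	for k, v := range kvs {
		next[k] = v
	}
	s.snap.Store(next)
	return len(old)
}

func main() {
	s := newCowStore()
	s.PutBatch(map[string]string{"a": "1", "b": "2"})
	copied := s.PutBatch(map[string]string{"c": "3"})
	v, _ := s.Get("a")
	fmt.Println(v, copied)
}
```

Two updates sent in one PutBatch call cost one copy; sent separately they would cost two. The same reasoning motivates batching on top of a COW store.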

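The etcd KV storage layer has two parts. One is the revision-based MVCC layer implemented by etcd (https://github.com/etcd-io/etcd/tree/master/server/mvcc). The other is the backend (https://github.com/etcd-io/etcd/tree/master/server/mvcc/backend), which hides the details of the KV storage engine and gives the layers above it a uniform interface.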

The Backend storage engine

Backend wraps the storage engine behind a generic interface:

type Backend interface {
	// ReadTx returns a read transaction. It is replaced by ConcurrentReadTx in the main data path, see #10523.
	ReadTx() ReadTx
	BatchTx() BatchTx
	// ConcurrentReadTx returns a non-blocking read transaction.
	ConcurrentReadTx() ReadTx

	Snapshot() Snapshot
	Hash(ignores map[IgnoreKey]struct{}) (uint32, error)
	// Size returns the current size of the backend physically allocated.
	// The backend can hold DB space that is not utilized at the moment,
	// since it can conduct pre-allocation or spare unused space for recycling.
	// Use SizeInUse() instead for the actual DB size.
	Size() int64
	// SizeInUse returns the current size of the backend logically in use.
	// Since the backend can manage free space in a non-byte unit such as
	// number of pages, the returned value can be not exactly accurate in bytes.
	SizeInUse() int64
	// OpenReadTxN returns the number of currently open read transactions in the backend.
	OpenReadTxN() int64
	Defrag() error
	ForceCommit()
	Close() error
}

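Only the core path is covered here, with the focus on the ReadTx() ReadTx and BatchTx() BatchTx methods:

ReadTx() ReadTx returns a read transaction.
BatchTx() BatchTx returns a read-write transaction; "batch" refers to committing write transactions in batches.

By default etcd implements the Backend interface on top of BoltDB. The backend struct is defined as follows: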

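// backend is etcd's KV store implemented on top of BoltDB.
// BoltDB natively supports one writer concurrent with many readers;
// backend wraps BoltDB transactions to improve concurrent read performance.
type backend struct {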
	// size and commits are used with atomic operations so they must be
	// 64-bit aligned, otherwise 32-bit tests will crash

	// size is the number of bytes allocated in the backend
	size int64
	// sizeInUse is the number of bytes actually used in the backend
	sizeInUse int64
	// commits counts number of commits since start
	commits int64
	// openReadTxN is the number of currently open read transactions in the backend
	openReadTxN int64

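	// mu guards the db object below.
	// Creating an ordinary bolt.DB transaction only needs the read lock,
	// but defrag takes the write lock for exclusive access.
	mu sync.RWMutex
	db *bolt.DB

	// defaults to 100ms
	batchInterval time.Duration
	// defaults to defaultBatchLimit = 10000
	batchLimit int
	// the object backend uses to run write transactions
	batchTx *batchTxBuffered

	// the object backend uses to run read-only transactions
	readTx *readTx

	stopc chan struct{}
	donec chan struct{}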

	lg *zap.Logger
}

The wrapped fields make it clear that backend uses *bolt.DB as its storage engine.
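
The batchInterval and batchLimit fields suggest a two-trigger commit policy: commit once enough writes are pending, or when a timer fires. The sketch below illustrates that idea under stated assumptions. The batcher type and its put, tick and commit methods are hypothetical, not etcd's code, and tick stands in for a real timer so the example stays deterministic.

```go
package main

import "fmt"

// batcher is a hypothetical stand-in for a batching write transaction.
type batcher struct {
	limit   int // commit once this many writes are pending (cf. batchLimit)
	pending int // writes buffered since the last commit
	commits int // number of commits performed so far
}

// put records one write and commits early once the limit is reached.
func (b *batcher) put() {
	b.pending++
	if b.pending >= b.limit {
		b.commit()
	}
}

// tick models the periodic timer (cf. batchInterval): commit whatever is
// pending, and do nothing when there is nothing to commit.
func (b *batcher) tick() {
	if b.pending > 0 {
		b.commit()
	}
}

// commit stands in for the expensive storage-engine commit.
func (b *batcher) commit() {
	b.commits++
	b.pending = 0
}

func main() {
	b := &batcher{limit: 3}
	for i := 0; i < 7; i++ {
		b.put()
	}
	b.tick() // flushes the one write still pending
	fmt.Println("writes: 7, commits:", b.commits)
}
```

With a limit of 3, seven writes cost three commits instead of seven. The limit bounds how much work sits uncommitted, and the timer bounds how long a write can wait.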

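Read transactions: ReadTx

The Backend interface has two methods that return a read transaction:

ReadTx() ReadTx: returns the backend's shared read transaction, which the main data path no longer uses.
ConcurrentReadTx() ReadTx: returns a concurrent read transaction; this is the one used throughout the main data path.

The ReadTx interface is defined as:

type ReadTx interface {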
	Lock()
	Unlock()
	RLock()
	RUnlock()

	UnsafeRange(bucketName []byte, key, endKey []byte, limit int64) (keys [][]byte, vals [][]byte)
	UnsafeForEach(bucketName []byte, visitor func(k, v []byte) error) error
}

The way callers use a read transaction follows directly:

ReadTx.RLock()
ReadTx.UnsafeRange()
ReadTx.RUnlock()
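
The three-step pattern above can be sketched with a toy read transaction. Everything here is an illustrative assumption: memReadTx is a hypothetical in-memory type, its UnsafeRange drops the bucketName parameter, and the range semantics shown (half-open interval [key, endKey), with a limit of 0 or less meaning no limit) are chosen for the sketch rather than taken from etcd.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"sync"
)

type kv struct{ key, val []byte }

// memReadTx is a hypothetical in-memory read transaction over sorted pairs.
type memReadTx struct {
	mu   sync.RWMutex
	data []kv // must be kept sorted by key
}

func (t *memReadTx) RLock()   { t.mu.RLock() }
func (t *memReadTx) RUnlock() { t.mu.RUnlock() }

// UnsafeRange returns the pairs in [key, endKey), at most limit of them
// (limit <= 0 means no limit). "Unsafe" means it takes no lock itself:
// the caller must already hold one.
func (t *memReadTx) UnsafeRange(key, endKey []byte, limit int64) (keys, vals [][]byte) {
	start := sort.Search(len(t.data), func(i int) bool {
		return bytes.Compare(t.data[i].key, key) >= 0
	})
	for _, p := range t.data[start:] {
		if bytes.Compare(p.key, endKey) >= 0 {
			break
		}
		if limit > 0 && int64(len(keys)) >= limit {
			break
		}
		keys = append(keys, p.key)
		vals = append(vals, p.val)
	}
	return keys, vals
}

// readRange shows the calling convention: lock, read, unlock.
func readRange(t *memReadTx, key, endKey []byte, limit int64) ([][]byte, [][]byte) {
	t.RLock()
	defer t.RUnlock()
	return t.UnsafeRange(key, endKey, limit)
}

func main() {
	tx := &memReadTx{data: []kv{
		{[]byte("a"), []byte("1")},
		{[]byte("b"), []byte("2")},
		{[]byte("c"), []byte("3")},
	}}
	keys, vals := readRange(tx, []byte("a"), []byte("c"), 0)
	for i := range keys {
		fmt.Printf("%s=%s\n", keys[i], vals[i])
	}
}
```

Keeping the locking in the caller rather than inside UnsafeRange lets one lock acquisition cover several reads, which is what the Unsafe prefix signals.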

Concurrent read transactions: ConcurrentReadTx

// ConcurrentReadTx creates and returns a new ReadTx, which:
// A) creates and keeps a copy of backend.readTx.txReadBuffer,
// B) references the boltdb read Tx (and its bucket cache) of current batch interval.
func (b *backend) ConcurrentReadTx() ReadTx {
	// readTx's buffer is read here, so the read lock is needed.
	// The lock is held only briefly.
	b.readTx.RLock()
	defer b.readTx.RUnlock()
	// prevent boltdb read Tx from being rolled back until store read Tx is done. Needs to be called when holding readTx.RLock().
	// Registers one more concurrent ReadTx on the wait group;
	// concurrentReadTx.RUnlock() releases it again.
	b.readTx.txWg.Add(1)
	// TODO: might want to copy the read buffer lazily - create copy when A) end of a write transaction B) end of a batch interval.
	return &concurrentReadTx{
		baseReadTx: baseReadTx{
			// Copy backend's readTx.buf so the buffer no longer needs readTx.mu to protect it, which speeds up reads.
			// This trades space for time (less lock contention).
			buf:     b.readTx.buf.unsafeCopy(),
			txMu:    b.readTx.txMu,
			tx:      b.readTx.tx,
			buckets: b.readTx.buckets,
			txWg:    b.readTx.txWg,
		},
	}
}

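backend creates a concurrent read transaction through ConcurrentReadTx(). The creation path shows two important points:

Creation takes the read lock on backend.readTx, but the lock is released as soon as the concurrentReadTx object has been built, so it is held only briefly.
The new concurrentReadTx makes a full copy of the buffer inside backend.readTx (this buffer is tied to batched commits and is covered later with batchTx). Each read transaction therefore reads its own buf, and nothing inside a concurrentReadTx needs a lock.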
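
The copy-then-read-without-locks idea can be shown in isolation. This is a minimal sketch, not etcd's buffer: sharedBuf, snapshot and write are hypothetical names, and a plain map stands in for txReadBuffer.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedBuf is a hypothetical stand-in for the shared read buffer.
type sharedBuf struct {
	mu  sync.RWMutex
	buf map[string]string
}

// snapshot copies the buffer, holding the read lock only for the duration
// of the copy. The returned map belongs to the caller alone.
func (s *sharedBuf) snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cp := make(map[string]string, len(s.buf))
	for k, v := range s.buf {
		cp[k] = v
	}
	return cp
}

// write models a write transaction publishing into the shared buffer.
func (s *sharedBuf) write(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.buf[k] = v
}

func main() {
	s := &sharedBuf{buf: map[string]string{"k": "v1"}}
	private := s.snapshot() // the reader's own copy
	s.write("k", "v2")      // a later write does not touch that copy
	fmt.Println(private["k"], s.snapshot()["k"])
}
```

The reader pays for the copy once, in memory and CPU, and in return a long-running read can never hold up a writer that needs the exclusive lock.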

Here is part of the concurrentReadTx implementation:

type concurrentReadTx struct {
	baseReadTx
}

func (rt *concurrentReadTx) Lock()   {}
func (rt *concurrentReadTx) Unlock() {}

// RLock is no-op. concurrentReadTx does not need to be locked after it is created.
func (rt *concurrentReadTx) RLock() {}

// RUnlock signals the end of concurrentReadTx.
func (rt *concurrentReadTx) RUnlock() { rt.txWg.Done() }

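Because concurrentReadTx copied buf when it was created, its Lock, Unlock and RLock are all no-ops. Only RUnlock does any work: it signals txWg that this read has finished.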
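
The role of txWg is easier to see in a stripped-down model. In the sketch below, sharedTx, reader, newReader, rollback and isRolledBack are all hypothetical names, and the rollback is reduced to setting a flag. What it demonstrates is only the ordering guarantee a sync.WaitGroup provides: the rollback cannot proceed until every registered reader has called RUnlock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sharedTx is a hypothetical stand-in for the underlying read transaction
// that several concurrent readers reference.
type sharedTx struct {
	wg         sync.WaitGroup
	rolledBack int32 // set to 1 once the transaction has been rolled back
}

type reader struct{ tx *sharedTx }

// newReader registers the reader on the wait group before handing it out.
func (t *sharedTx) newReader() *reader {
	t.wg.Add(1)
	return &reader{tx: t}
}

func (r *reader) RLock()   {}                // no-op: the reader owns its data
func (r *reader) RUnlock() { r.tx.wg.Done() } // signals that the read is over

// rollback blocks until every registered reader has finished.
func (t *sharedTx) rollback() {
	t.wg.Wait()
	atomic.StoreInt32(&t.rolledBack, 1)
}

func (t *sharedTx) isRolledBack() bool {
	return atomic.LoadInt32(&t.rolledBack) == 1
}

func main() {
	tx := &sharedTx{}
	r := tx.newReader()

	done := make(chan struct{})
	go func() {
		tx.rollback() // waits for r
		close(done)
	}()

	r.RLock()
	openDuringRead := !tx.isRolledBack() // the tx cannot be rolled back yet
	r.RUnlock()

	<-done
	fmt.Println(openDuringRead, tx.isRolledBack())
}
```

Note that Add must happen before the reader is handed out, which is why the original code calls txWg.Add(1) while still holding the read lock.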

For comparison, here is the non-concurrent readTx implementation:

// Base type for readTx and concurrentReadTx to eliminate duplicate functions between these
// baseReadTx is accessed concurrently, so read-write locks protect it.
type baseReadTx struct {
	// mu protects accesses to the txReadBuffer
	// When a write transaction ends, it takes this write lock and writes its updates into baseReadTx's buffer;
	// creating a concurrentReadTx takes the read lock, because the buffer has to be copied.
	mu  sync.RWMutex
	buf txReadBuffer

	// TODO: group and encapsulate {txMu, tx, buckets, txWg}, as they share the same lifecycle.
	// txMu protects accesses to buckets and tx on Range requests.
	// This read-write lock protects tx and buckets below.
	txMu    *sync.RWMutex
	tx      *bolt.Tx
	buckets map[string]*bolt.Bucket
	// txWg protects tx from being rolled back at the end of a batch interval until all reads using this tx are done.
	txWg *sync.WaitGroup
}

type readTx struct {
	baseReadTx
}

func (rt *readTx) Lock()    { rt.mu.Lock() }
func (rt *readTx) Unlock()  { rt.mu.Unlock() }
func (rt *readTx) RLock(