HBase source code: MultiVersionConsistencyControl

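MultiVersionConsistencyControl, MVCC for short, is the mechanism HBase uses to give transactions a consistent view of data.

For the underlying theory, see: mvcc wiki

In short, every operation is assigned a number: a write gets a writeNumber and a read gets a readNumber. (This version of MVCC uses the sequenceId as the writeNumber, which is a big change from the 0.98.7 code I read earlier. No wonder getMvccVersion() on Cell is marked deprecated: calling getSequenceId() directly gives the same result.)

Notably, the reads and writes that MVCC governs all take place in the memStore. More precisely, MVCC is the consistency mechanism designed for transactions operating on the memStore.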
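
To make the numbering concrete, here is a minimal sketch of how a reader could use a read point to decide which entries to ignore. VersionedCell and visibleTo are hypothetical names made up for this illustration, not HBase classes; the only assumption carried over from HBase is that a cell whose sequenceId is larger than the reader's read point is skipped.

```java
import java.util.ArrayList;
import java.util.List;

final class VisibilitySketch {
  // Hypothetical stand-in for a memStore cell; not an HBase class.
  static final class VersionedCell {
    final String value;
    final long sequenceId; // plays the role of the writeNumber

    VersionedCell(String value, long sequenceId) {
      this.value = value;
      this.sequenceId = sequenceId;
    }
  }

  // Returns the values a reader holding the given read point may see:
  // cells written with a sequenceId above the read point are ignored.
  static List<String> visibleTo(List<VersionedCell> cells, long readPoint) {
    List<String> visible = new ArrayList<>();
    for (VersionedCell cell : cells) {
      if (cell.sequenceId <= readPoint) {
        visible.add(cell.value);
      }
    }
    return visible;
  }
}
```

A reader that took its read point at 2 sees the cells written with sequenceId 1 and 2, but not a cell written with sequenceId 3, even if that cell is already in the memStore.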

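For an illustrated walkthrough of the HBase implementation, see: Hbase MVCC (do read it; it will deepen your understanding, and the committers explain it better than I can).

In any case, MVCC and locking work together to keep transactions consistent.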


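MVCC has four private members: NO_WRITE_NUMBER, the default writeNumber used for initialization; the volatile memstoreRead, which plays the role of the readNumber; readWaiters, an Object used as a monitor to wake up waiting threads; and writeQueue, a linked list that holds pending writes in arrival order.

There is also a public nested class, WriteEntry. As the name suggests, it is the entry point for writing to the memStore: a write must create an instance of this class, which is assigned a writeNumber on construction. WriteEntry is therefore the basic unit of a write, and it reports whether the write has completed and what its writeNumber is.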


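To ensure consistency, a write generally has to wait for earlier operations to finish; otherwise updates could be lost.

So when a write comes in, it is appended to the tail of the write queue described above and waits there until the writes ahead of it have completed.

Finished writes are marked completed and removed one by one from the head of the list.

The largest writeNumber among the removed entries is recorded in the volatile memstoreRead, which tells all waiters the latest sequenceId that has been applied and guarantees that they see the most recent data.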
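
The head-of-queue rule above can be sketched as a small model. This is a simplification written for illustration, loosely following advanceMemstore in the source below: WriteQueueSketch and its method names are hypothetical, and unlike the real class it never blocks, it only marks an entry completed and drains whatever completed entries sit at the head.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class WriteQueueSketch {
  // Hypothetical, simplified counterpart of WriteEntry.
  static final class Entry {
    final long writeNumber;
    boolean completed;

    Entry(long writeNumber) {
      this.writeNumber = writeNumber;
    }
  }

  private final Deque<Entry> queue = new ArrayDeque<>();
  private long readPoint = 0;

  // A new write always joins the tail of the queue.
  synchronized Entry begin(long writeNumber) {
    Entry e = new Entry(writeNumber);
    queue.addLast(e);
    return e;
  }

  // Mark e completed, then remove completed entries from the head only.
  // The read point moves to the largest write number removed so far.
  synchronized void complete(Entry e) {
    e.completed = true;
    while (!queue.isEmpty() && queue.peekFirst().completed) {
      readPoint = Math.max(readPoint, queue.pollFirst().writeNumber);
    }
  }

  synchronized long readPoint() {
    return readPoint;
  }
}
```

If writes 1 and 2 are started in that order and write 2 completes first, the read point stays at 0 because write 1 still sits at the head. Once write 1 completes, both entries are drained and the read point jumps to 2.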


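Given the steps described above, the class should expose roughly the following methods.

For WriteEntry: construction, setting and getting the writeNumber, marking the entry completed, and checking whether it is completed.

Besides initialization, MVCC itself provides:

a method to wait for earlier writes to complete (waitForPreviousTransactionsComplete),

a method to insert a new write (beginMemstoreInsert, beginMemstoreInsertWithSeqNum),

a method to advance the write queue by removing completed entries from its head one by one (advanceMemstore),

a method to commit a write, meaning it is finished and the transaction can commit (completeMemstoreInsert, completeMemstoreInsertWithSeqNum),

and methods to update or read memstoreRead (advanceMemstoreReadPointIfNeeded, memstoreReadPoint).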
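
The begin, wait, commit lifecycle listed above can be tried out with a small blocking model. This is only a sketch under stated assumptions: BlockingMvccSketch, begin, complete and demo are hypothetical names, not the HBase API, and on interruption this version keeps waiting and restores the interrupt flag afterwards, whereas the real waitForPreviousTransactionsComplete leaves its loop.

```java
import java.util.LinkedList;

final class BlockingMvccSketch {
  // Hypothetical, simplified counterpart of WriteEntry.
  static final class Entry {
    final long writeNumber;

    Entry(long writeNumber) {
      this.writeNumber = writeNumber;
    }
  }

  private final LinkedList<Entry> writeQueue = new LinkedList<>();
  private volatile long readPoint = 0;

  // Insert a new write at the tail of the queue.
  Entry begin(long writeNumber) {
    Entry e = new Entry(writeNumber);
    synchronized (writeQueue) {
      writeQueue.add(e);
    }
    return e;
  }

  // Commit: block until e is at the head, then remove it, advance the
  // read point and wake the other waiters. Assumes each entry is
  // completed exactly once.
  void complete(Entry e) {
    boolean interrupted = false;
    synchronized (writeQueue) {
      while (writeQueue.getFirst() != e) {
        try {
          writeQueue.wait();
        } catch (InterruptedException ie) {
          interrupted = true; // keep waiting; flag is restored below
        }
      }
      writeQueue.removeFirst();
      readPoint = Math.max(readPoint, e.writeNumber);
      writeQueue.notifyAll();
    }
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
  }

  long readPoint() {
    return readPoint;
  }

  // Demo: the thread committing write 2 cannot finish before write 1.
  static long demo() {
    BlockingMvccSketch mvcc = new BlockingMvccSketch();
    Entry first = mvcc.begin(1);
    Entry second = mvcc.begin(2);
    Thread later = new Thread(() -> mvcc.complete(second));
    later.start();
    mvcc.complete(first); // lets the waiting thread proceed
    try {
      later.join();
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    return mvcc.readPoint();
  }
}
```

Whichever thread gets scheduled first, write 2 only commits after write 1 has left the head of the queue, so demo() ends with the read point at 2.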


How these methods are actually used will be covered later, when I write about the HBase write path.


/**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hbase.regionserver;

import java.io.IOException;
import java.util.LinkedList;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.classification.InterfaceAudience;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.ClassSize;

/**
 * Manages the read/write consistency within memstore. This provides
 * an interface for readers to determine what entries to ignore, and
 * a mechanism for writers to obtain new write numbers, then "commit"
 * the new writes for readers to read (thus forming atomic transactions).
 */
@InterfaceAudience.Private
public class MultiVersionConsistencyControl {
  private static final long NO_WRITE_NUMBER = 0;
  private volatile long memstoreRead = 0;
  private final Object readWaiters = new Object();

  // This is the pending queue of writes.
  private final LinkedList<WriteEntry> writeQueue =
      new LinkedList<WriteEntry>();

  /**
   * Default constructor. Initializes the memstoreRead/Write points to 0.
   */
  public MultiVersionConsistencyControl() {
  }

  /**
   * Initializes the memstoreRead/Write points appropriately.
   * @param startPoint
   */
  public void initialize(long startPoint) {
    synchronized (writeQueue) {
      writeQueue.clear();
      memstoreRead = startPoint;
    }
  }

  /**
   * Starts a write with the placeholder write number NO_WRITE_NUMBER, which is expected to be
   * reset later.
   * @return WriteEntry instance.
   */
  WriteEntry beginMemstoreInsert() {
    return beginMemstoreInsertWithSeqNum(NO_WRITE_NUMBER);
  }

  /**
   * Get a mvcc write number before an actual one(its log sequence Id) being assigned
   * @param sequenceId
   * @return long a faked write number which is bigger enough not to be seen by others before a real
   *         one is assigned
   */
  public static long getPreAssignedWriteNumber(AtomicLong sequenceId) {
    // the 1 billion is just an arbitrary big number to guard no scanner will reach it before
    // current MVCC completes. Theoretically the bump only needs to be 2 * the number of handlers
    // because each handler could increment sequence num twice and max concurrent in-flight
    // transactions is the number of RPC handlers.
    // we can't use Long.MAX_VALUE because we still want to maintain the ordering when multiple
    // changes touch same row key
    // If for any reason, the bumped value isn't reset due to failure situations, we'll reset
    // curSeqNum to NO_WRITE_NUMBER in order NOT to advance memstore read point at all
    return sequenceId.incrementAndGet() + 1000000000;
  }

  /**
   * This function starts a MVCC transaction with current region's log change sequence number. Since
   * we set change sequence number when flushing current change to WAL(late binding), the flush
   * order may differ from the order to start a MVCC transaction. For example, a change begins a
   * MVCC firstly may complete later than a change which starts MVCC at a later time. Therefore, we
   * add a safe bumper to the passed in sequence number to start a MVCC so that no other concurrent
   * transactions will reuse the number till current MVCC completes(success or fail). The "faked"
   * big number is safe because we only need it to prevent current change being seen and the number
   * will be reset to real sequence number(set in log sync) right before we complete a MVCC in order
   * for MVCC to align with flush sequence.
   * @param curSeqNum
   * @return WriteEntry a WriteEntry instance with the passed in curSeqNum
   */
  public WriteEntry beginMemstoreInsertWithSeqNum(long curSeqNum) {
    WriteEntry e = new WriteEntry(curSeqNum);
    synchronized (writeQueue) {
      writeQueue.add(e);
      return e;
    }
  }

  /**
   * Complete a {@link WriteEntry} that was created by
   * {@link #beginMemstoreInsertWithSeqNum(long)}. At the end of this call, the global read
   * point is at least as large as the write point of the passed in WriteEntry. Thus, the write is
   * visible to MVCC readers.
   * @throws IOException
   */
  public void completeMemstoreInsertWithSeqNum(WriteEntry e, SequenceId seqId)
      throws IOException {
    if(e == null) return;
    if (seqId != null) {
      e.setWriteNumber(seqId.getSequenceId());
    } else {
      // set the value to NO_WRITE_NUMBER in order NOT to advance memstore readpoint inside
      // function beginMemstoreInsertWithSeqNum in case of failures
      e.setWriteNumber(NO_WRITE_NUMBER);
    }
    waitForPreviousTransactionsComplete(e);
  }

  /**
   * Complete a {@link WriteEntry} that was created by {@link #beginMemstoreInsert()}. At the
   * end of this call, the global read point is at least as large as the write point of the passed
   * in WriteEntry. Thus, the write is visible to MVCC readers.
   */
  public void completeMemstoreInsert(WriteEntry e) {
    waitForPreviousTransactionsComplete(e);
  }

  /**
   * Mark the {@link WriteEntry} as complete and advance the read point as
   * much as possible.
   *
   * How much is the read point advanced?
   * Let S be the set of all write numbers that are completed and where all previous write numbers
   * are also completed.  Then, the read point is advanced to the supremum of S.
   *
   * @param e
   * @return true if e is visible to MVCC readers (that is, readpoint >= e.writeNumber)
   */
  boolean advanceMemstore(WriteEntry e) {
    long nextReadValue = -1;
    synchronized (writeQueue) {
      e.markCompleted();

      while (!writeQueue.isEmpty()) {
        WriteEntry queueFirst = writeQueue.getFirst();
        if (queueFirst.isCompleted()) {
          // Using Max because Edit complete in WAL sync order not arriving order
          nextReadValue = Math.max(nextReadValue, queueFirst.getWriteNumber());
          writeQueue.removeFirst();
        } else {
          break;
        }
      }

      if (nextReadValue > memstoreRead) {
        memstoreRead = nextReadValue;
      }

      // notify waiters on writeQueue before return
      writeQueue.notifyAll();
    }

    if (nextReadValue > 0) {
      synchronized (readWaiters) {
        readWaiters.notifyAll();
      }
    }

    if (memstoreRead >= e.getWriteNumber()) {
      return true;
    }
    return false;
  }

  /**
   * Advances the current read point to be given seqNum if it is smaller than
   * that.
   */
  void advanceMemstoreReadPointIfNeeded(long seqNum) {
    synchronized (writeQueue) {
      if (this.memstoreRead < seqNum) {
        memstoreRead = seqNum;
      }
    }
  }

  /**
   * Wait for all previous MVCC transactions complete
   */
  public void waitForPreviousTransactionsComplete() {
    WriteEntry w = beginMemstoreInsert();
    waitForPreviousTransactionsComplete(w);
  }

  public void waitForPreviousTransactionsComplete(WriteEntry waitedEntry) {
    boolean interrupted = false;
    WriteEntry w = waitedEntry;

    try {
      WriteEntry firstEntry = null;
      do {
        synchronized (writeQueue) {
          // writeQueue won't be empty at this point, the following is just a safety check
          if (writeQueue.isEmpty()) {
            break;
          }
          firstEntry = writeQueue.getFirst();
          if (firstEntry == w) {
            // all previous in-flight transactions are done
            break;
          }
          try {
            writeQueue.wait(0);
          } catch (InterruptedException ie) {
            // We were interrupted... finish the loop -- i.e. cleanup --and then
            // on our way out, reset the interrupt flag.
            interrupted = true;
            break;
          }
        }
      } while (firstEntry != null);
    } finally {
      if (w != null) {
        advanceMemstore(w);
      }
    }
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
  }

  public long memstoreReadPoint() {
    return memstoreRead;
  }

  public static class WriteEntry {
    private long writeNumber;
    private volatile boolean completed = false;

    WriteEntry(long writeNumber) {
      this.writeNumber = writeNumber;
    }
    void markCompleted() {
      this.completed = true;
    }
    boolean isCompleted() {
      return this.completed;
    }
    long getWriteNumber() {
      return this.writeNumber;
    }
    void setWriteNumber(long val){
      this.writeNumber = val;
    }
  }

  public static final long FIXED_SIZE = ClassSize.align(
      ClassSize.OBJECT +
      2 * Bytes.SIZEOF_LONG +
      2 * ClassSize.REFERENCE);

}



