Let's start by reviewing the read path in DFSInputStream. It was covered earlier, but a quick recap is useful here. The overall flow is shown in Figure 1:
![Client read flow diagram](https://i-blog.csdnimg.cn/blog_migrate/75bd383adb6ae6fbd747658700b59957.png)
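Only the newer short-circuit read implementation uses shared memory. This is the case where a BlockReaderLocal object is created, which is obtained by calling getBlockReaderLocal() from the build function of the BlockReaderFactory class. We will walk through getBlockReaderLocal later; before that, let's sort out the relevant class structure, shown in Figure 2: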
![Short-circuit read class structure diagram](https://i-blog.csdnimg.cn/blog_migrate/dd89884b8cd3e883fe83301a8f9602b4.png)
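A ShortCircuitReplica object holds the data block file input stream and the checksum (meta) file input stream of the short-circuit replica, the replica's slot in shared memory, the replica's reference count, and related information.
The flow of getBlockReaderLocal is shown in Figure 3: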
![Short-circuit read operation flow diagram](https://i-blog.csdnimg.cn/blog_migrate/56d60ca60dc2d5aef269226b4b2d88e4.png)
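Every newly created ShortCircuitReplicaInfo object is stored in the replicaInfoMap variable (a Map) of the ShortCircuitCache class. The key is an ExtendedBlockId object (the block id blockId plus the block pool id bpId), and the value is of type Waitable<ShortCircuitReplicaInfo>. Whenever a ShortCircuitReplicaInfo object is requested, replicaInfoMap is searched first. If an entry is found it is used directly; otherwise a new object is created and then put into replicaInfoMap.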
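The lookup-or-create behaviour described above can be sketched in a few lines. This is only a simplified illustration, not the Hadoop implementation: `BlockKey` and `ReplicaInfoCacheSketch` are hypothetical names, the key simply mirrors the (blockId, bpId) pair, and the real cache stores `Waitable` wrappers and uses its own lock rather than `synchronized`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical stand-in for ExtendedBlockId: block id plus block pool id. */
final class BlockKey {
    final long blockId;
    final String bpId;

    BlockKey(long blockId, String bpId) {
        this.blockId = blockId;
        this.bpId = bpId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BlockKey)) return false;
        BlockKey other = (BlockKey) o;
        return blockId == other.blockId && Objects.equals(bpId, other.bpId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(blockId, bpId);
    }
}

/** Hypothetical, simplified cache: reuse the entry for a key, or create and remember it. */
final class ReplicaInfoCacheSketch<V> {
    private final Map<BlockKey, V> replicaInfoMap = new HashMap<>();

    synchronized V fetchOrCreate(BlockKey key, Function<BlockKey, V> creator) {
        V existing = replicaInfoMap.get(key);
        if (existing != null) {
            return existing;              // found in the map: use it directly
        }
        V created = creator.apply(key);   // not found: create a new object
        replicaInfoMap.put(key, created); // and remember it for later callers
        return created;
    }

    synchronized int size() {
        return replicaInfoMap.size();
    }
}
```

Two requests with the same block id and block pool id get the same object back, while a different block pool id produces a separate entry.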
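Next, let's look at what the memory slot (Slot) is for. What is a Slot, and how is it obtained? A Slot is a small region of shared memory that holds status information about a short-circuit replica, including whether the slot is currently valid and how many times the replica has been anchored. The Slot object lives inside the ShortCircuitReplica object and marks that replica's state. When the Datanode has cached a block replica in memory (via the mlock() system call), the slot for that replica is set to the anchorable state. A block in the anchorable state can be read without checksum verification, and it can also be read with true zero-copy. The anchorable state is set on the Datanode side: the Datanode verifies the file's checksum, and once that succeeds the client no longer needs to verify it.

A Slot operates on a 64-bit word of memory that carries the state flags. This memory is mapped memory: the client obtains a file descriptor from the Datanode and memory-maps it. The mapped region is a whole multiple of 64 bytes in size. A Slot array and a BitSet object are then created, both sized as the mapped memory size divided by 64. The BitSet records which indexes are already in use, so an unused index can be taken from it. That index is used to compute an offset address into the mapped memory, a Slot object is created from the offset address and the block id, and the new Slot is stored in the Slot array at the index taken from the BitSet. The BitSet is needed to guarantee that each small piece of mapped memory is owned by exactly one Slot object; otherwise two replicas would overwrite each other's state. Through its Slot object, a ShortCircuitReplica marks its state in the corresponding piece of mapped memory. Within ShortCircuitCache, a ShortCircuitReplica can be in one of the following states:
normal: the normal state of a ShortCircuitReplica.
evictable: the replica can be evicted. When ShortCircuitCache is the only thing in the DFSClient that still references the replica, the replica is evictable. ShortCircuitCache periodically cleans up evictable replicas in the cache.
purged: the replica has been removed from the cache, and the DFSClient can no longer obtain it through the cache. A purged replica is not necessarily closed, since some operations may still hold references to it; once the reference count drops to 0 the replica can be closed.
closed: the replica has been closed, and its Slot has been released from shared memory.
stale: the replica is out of date. This can happen because the replica's Slot is invalid, because one of the replica's input streams is broken, or because the replica has been in the cache longer than the allowed lifetime.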
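Before turning to the code for these states, the slot bookkeeping described earlier (a status word per slot, plus a BitSet that hands out unused indexes) can be sketched as follows. This is a toy model under stated assumptions: `SlotTableSketch` is a hypothetical name, a plain `long[]` stands in for the mapped shared memory, only the status word of each slot is modelled, and the exact bit positions of the flags are assumptions made for illustration. The real code updates the mapped memory atomically, which this sketch does not attempt.

```java
import java.util.BitSet;

/** Hypothetical model of a shared-memory segment: one 64-bit status word per slot. */
final class SlotTableSketch {
    static final long VALID_FLAG = 1L << 63;           // assumed bit position
    static final long ANCHORABLE_FLAG = 1L << 62;      // assumed bit position
    static final long ANCHOR_COUNT_MASK = 0x7fffffffL; // assumed width of the anchor counter

    private final long[] words;     // stands in for the mmapped memory
    private final BitSet allocated; // bit i set means slot i is in use

    SlotTableSketch(int numSlots) {
        this.words = new long[numSlots];
        this.allocated = new BitSet(numSlots);
    }

    /** Claims the lowest unused index and marks it valid; returns -1 when the table is full. */
    int allocSlot() {
        int idx = allocated.nextClearBit(0);
        if (idx >= words.length) {
            return -1;
        }
        allocated.set(idx);
        words[idx] = VALID_FLAG;
        return idx;
    }

    /** Returns the index to the pool and clears its status word. */
    void freeSlot(int idx) {
        allocated.clear(idx);
        words[idx] = 0L;
    }

    boolean isValid(int idx) {
        return (words[idx] & VALID_FLAG) != 0;
    }

    /** What the Datanode side would do once the block is cached and verified. */
    void makeAnchorable(int idx) {
        words[idx] |= ANCHORABLE_FLAG;
    }

    /** Anchoring succeeds only for a slot that is both valid and anchorable. */
    boolean addAnchor(int idx) {
        long w = words[idx];
        if ((w & VALID_FLAG) == 0 || (w & ANCHORABLE_FLAG) == 0) return false;
        if ((w & ANCHOR_COUNT_MASK) == ANCHOR_COUNT_MASK) return false; // counter is full
        words[idx] = w + 1;
        return true;
    }

    long anchorCount(int idx) {
        return words[idx] & ANCHOR_COUNT_MASK;
    }
}
```

Because every index comes from the BitSet, two callers can never be handed the same status word, which is the guarantee the paragraph above attributes to the BitSet.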
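Next, let's go through the code behind these states.
normal: a caller invokes fetchOrCreate() to obtain a replica (that is, a ShortCircuitReplica object). The cache is checked first; if the replica is not there, it is created and then stored in the replicaInfoMap of the ShortCircuitCache object. Either way, the replica that is returned is in the normal state.
stale: the ShortCircuitReplica class has a function isStale() that determines whether the current ShortCircuitReplica object is stale. The code is as follows: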
/**
* Check if the replica is stale.
*
* Must be called with the cache lock held.
*/
boolean isStale() {
if (slot != null) {
// Check staleness by looking at the shared memory area we use to
// communicate with the DataNode.
// If the value behind this object's slot is no longer valid, treat the replica as stale.
boolean stale = !slot.isValid();
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": checked shared memory segment. isStale=" + stale);
}
return stale;
} else {
// Fall back to old, time-based staleness method.
// Elapsed time between now and the creation of this object.
long deltaMs = Time.monotonicNow() - creationTimeMs;
long staleThresholdMs = cache.getStaleThresholdMs();
// If the elapsed time exceeds the configured staleness threshold, the replica is stale.
if (deltaMs > staleThresholdMs) {
if (LOG.isTraceEnabled()) {
LOG.trace(this + " is stale because it's " + deltaMs +
" ms old, and staleThresholdMs = " + staleThresholdMs);
}
return true;
} else {
if (LOG.isTraceEnabled()) {
LOG.trace(this + " is not stale because it's only " + deltaMs +
" ms old, and staleThresholdMs = " + staleThresholdMs);
}
return false;
}
}
}
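As the code shows, a ShortCircuitReplica object is stale if it has a slot and that slot is invalid, or, when it has no slot, if the time since it was created exceeds the staleness threshold.
stale-->purged: let's step into the fetch() function of the ShortCircuitCache class. The code is as follows: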
/**
* Fetch an existing ReplicaInfo object.
*
* @param key The key that we're using.
* @param waitable The waitable object to wait on.
* @return The existing ReplicaInfo object, or null if there is
* none.
*
* @throws RetriableException If the caller needs to retry.
*/
private ShortCircuitReplicaInfo fetch(ExtendedBlockId key,
Waitable<ShortCircuitReplicaInfo> waitable) throws RetriableException {
// Another thread is already in the process of loading this
// ShortCircuitReplica. So we simply wait for it to complete.
ShortCircuitReplicaInfo info;
try {
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": found waitable for " + key);
}
// Wait for the ShortCircuitReplicaInfo object held by the Waitable.
info = waitable.await();
} catch (InterruptedException e) {
LOG.info(this + ": interrupted while waiting for " + key);
Thread.currentThread().interrupt();
throw new RetriableException("interrupted");
}
if (info.getInvalidTokenException() != null) {
LOG.warn(this + ": could not get " + key + " due to InvalidToken " +
"exception.", info.getInvalidTokenException());
return info;
}
ShortCircuitReplica replica = info.getReplica();
if (replica == null) {
LOG.warn(this + ": failed to get " + key);
return info;
}
// If the replica has already been purged from the cache, throw so that the caller retries.
if (replica.purged) {
// Ignore replicas that have already been purged from the cache.
throw new RetriableException("Ignoring purged replica " +
replica + ". Retrying.");
}
// Check if the replica is stale before using it.
// If it is, purge it and retry.
// If the replica is stale, call purge() to remove it from the cache.
if (replica.isStale()) {
LOG.info(this + ": got stale replica " + replica + ". Removing " +
"this replica from the replicaInfoMap and retrying.");
// Remove the cache's reference to the replica. This may or may not
// trigger a close.
purge(replica);
throw new RetriableException("ignoring stale replica " + replica);
}
// Increment the replica's reference count and remove it from whichever eviction map it is in: it is in use again and must not be evicted.
ref(replica);
return info;
}
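This function obtains a ShortCircuitReplica object, either one that is already in the cache or one that has just been created. Here we focus on the following fragment: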
// Check if the replica is stale before using it.
// If it is, purge it and retry.
// If the replica is stale, call purge() to remove it from the cache.
if (replica.isStale()) {
LOG.info(this + ": got stale replica " + replica + ". Removing " +
"this replica from the replicaInfoMap and retrying.");
// Remove the cache's reference to the replica. This may or may not
// trigger a close.
purge(replica);
throw new RetriableException("ignoring stale replica " + replica);
}
If the ShortCircuitReplica object referred to by replica is stale, purge is called. Stepping into that function, the code is as follows:
/**
* Purge a replica from the cache.
*
* This doesn't necessarily close the replica, since there may be
* outstanding references to it. However, it does mean the cache won't
* hand it out to anyone after this.
*
* You must hold the cache lock while calling this function.
*
* @param replica The replica being removed.
*/
private void purge(ShortCircuitReplica replica) {
boolean removedFromInfoMap = false;
String evictionMapName = null;
Preconditions.checkArgument(!replica.purged);
replica.purged = true;
// Look up the corresponding Waitable<ShortCircuitReplicaInfo> object in the cache.
Waitable<ShortCircuitReplicaInfo> val = replicaInfoMap.get(replica.key);
if (val != null) {
ShortCircuitReplicaInfo info = val.getVal();
// Make sure the cached entry refers to this very ShortCircuitReplica object.
if ((info != null) && (info.getReplica() == replica)) {
// Remove the corresponding entry from the cache.
replicaInfoMap.remove(replica.key);
removedFromInfoMap = true;
}
}
// A non-null evictableTimeNs means the replica is currently in one of the eviction maps.
Long evictableTimeNs = replica.getEvictableTimeNs();
if (evictableTimeNs != null) {
// Remove the replica from its eviction map:
// 1. if it is mmapped (zero-copy), remove it from the evictableMmapped map of ShortCircuitCache;
// 2. otherwise, remove it from the evictable map of ShortCircuitCache.
evictionMapName = removeEvictable(replica);
}
if (LOG.isTraceEnabled()) {
StringBuilder builder = new StringBuilder();
builder.append(this).append(": ").append(": purged ").
append(replica).append(" from the cache.");
if (removedFromInfoMap) {
builder.append(" Removed from the replicaInfoMap.");
}
if (evictionMapName != null) {
builder.append(" Removed from ").append(evictionMapName);
}
LOG.trace(builder.toString());
}
// The cache's own reference was counted when the replica was constructed, so removing the replica from the cache must unref() it.
unref(replica);
}
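As the code shows, purge first removes the replica from the cache (replicaInfoMap), then removes it from evictableMmapped or evictable (both are maps) if it is queued there, and finally calls unref() to decrement the replica's reference count. The code of unref() is as follows: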
/**
* Unreference a replica.
*
* You must hold the cache lock while calling this function.
*
* @param replica The replica being unreferenced.
*/
void unref(ShortCircuitReplica replica) {
lock.lock();
try {
// If the replica is stale or unusable, but we haven't purged it yet,
// let's do that. It would be a shame to evict a non-stale replica so
// that we could put a stale or unusable one into the cache.
// If the replica has not been purged yet
if (!replica.purged) {
String purgeReason = null;
// If the data block file channel is closed, the replica must be purged.
if (!replica.getDataStream().getChannel().isOpen()) {
purgeReason = "purging replica because its data channel is closed.";
} else if (!replica.getMetaStream().getChannel().isOpen()) { // The checksum (meta) file channel is closed, so the replica must be purged.
purgeReason = "purging replica because its meta channel is closed.";
} else if (replica.isStale()) { // The replica is stale.
purgeReason = "purging replica because it is stale.";
}
if (purgeReason != null) {
LOG.debug(this + ": " + purgeReason);
// Purge the replica.
purge(replica);
}
}
String addedString = "";
boolean shouldTrimEvictionMaps = false;
int newRefCount = --replica.refCount;
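// Nothing references the replica any more (no reader is using it) and it has already been purged from the cache, so close it: this releases the slot and the memory mapping of the block file.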
if (newRefCount == 0) {
// Close replica, since there are no remaining references to it.
Preconditions.checkArgument(replica.purged,
"Replica " + replica + " reached a refCount of 0 without " +
"being purged");
replica.close();
} else if (newRefCount == 1) {
Preconditions.checkState(null == replica.getEvictableTimeNs(),
"Replica " + replica + " had a refCount higher than 1, " +
"but was still evictable (evictableTimeNs = " +
replica.getEvictableTimeNs() + ")");
// Exactly one reference to the replica remains, and the replica has not been purged.
if (!replica.purged) {
// Add the replica to the end of an eviction list.
// Eviction lists are sorted by time.
// If the replica's block file has been mapped into memory
if (replica.hasMmap()) {
// Store the replica in evictableMmapped.
insertEvictable(System.nanoTime(), replica, evictableMmapped);
addedString = "added to evictableMmapped, ";
} else {
// Store the replica in evictable.
insertEvictable(System.nanoTime(), replica, evictable);
addedString = "added to evictable, ";
}
shouldTrimEvictionMaps = true;
}
} else {
Preconditions.checkArgument(replica.refCount >= 0,
"replica's refCount went negative (refCount = " +
replica.refCount + " for " + replica + ")");
}
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": unref replica " + replica +
": " + addedString + " refCount " +
(newRefCount + 1) + " -> " + newRefCount +
StringUtils.getStackTrace(Thread.currentThread()));
}
if (shouldTrimEvictionMaps) {
// Trim the eviction maps: replicas beyond the maximum cache size are purged.
trimEvictionMaps();
}
} finally {
lock.unlock();
}
}
The code of trimEvictionMaps is as follows:
/**
* Trim the eviction lists.
*/
private void trimEvictionMaps() {
long now = Time.monotonicNow();
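// Replicas that are too old, or that must go because evictableMmapped is over its size limit, are removed from evictableMmapped and inserted into evictable.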
demoteOldEvictableMmaped(now);
while (true) {
long evictableSize = evictable.size();
long evictableMmappedSize = evictableMmapped.size();
if (evictableSize + evictableMmappedSize <= maxTotalSize) {
return;
}
ShortCircuitReplica replica;
if (evictableSize == 0) {
replica = evictableMmapped.firstEntry().getValue();
} else {
replica = evictable.firstEntry().getValue();
}
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": trimEvictionMaps is purging " + replica +
StringUtils.getStackTrace(Thread.currentThread()));
}
purge(replica);
}
}
The code of demoteOldEvictableMmaped is as follows:
/**
* Demote old evictable mmaps into the regular eviction map.
*
* You must hold the cache lock while calling this function.
*
* @param now Current time in monotonic milliseconds.
* @return Number of replicas demoted.
*/
private int demoteOldEvictableMmaped(long now) {
int numDemoted = 0;
boolean needMoreSpace = false;
Long evictionTimeNs = Long.valueOf(0);
while (true) {
Entry<Long, ShortCircuitReplica> entry =
evictableMmapped.ceilingEntry(evictionTimeNs);
if (entry == null) break;
evictionTimeNs = entry.getKey();
// Convert evictionTimeNs from nanoseconds to milliseconds.
long evictionTimeMs =
TimeUnit.MILLISECONDS.convert(evictionTimeNs, TimeUnit.NANOSECONDS);
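// If this entry has not yet exceeded its maximum lifespan
if (evictionTimeMs + maxEvictableMmapedLifespanMs >= now) {
// and there is still room in the map, break without demoting anything.
if (evictableMmapped.size() < maxEvictableMmapedSize) {
break;
}
// The entry has not expired, but the map is over its size limit, so more space is needed.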
needMoreSpace = true;
}
ShortCircuitReplica replica = entry.getValue();
if (LOG.isTraceEnabled()) {
String rationale = needMoreSpace ? "because we need more space" :
"because it's too old";
LOG.trace("demoteOldEvictable: demoting " + replica + ": " +
rationale + ": " +
StringUtils.getStackTrace(Thread.currentThread()));
}
// Remove the replica from evictableMmapped.
removeEvictable(replica, evictableMmapped);
// Release the memory mapping.
munmap(replica);
// Put the replica into evictable.
insertEvictable(evictionTimeNs, replica, evictable);
numDemoted++;
}
return numDemoted;
}
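As shown above, unref may purge the replica. If the reference count reaches 0 the replica is closed directly (by calling close); if it reaches 1 and the replica has not been purged, the replica is moved into evictableMmapped or evictable, and the sizes of these two maps are then trimmed.
The counterpart of unref is ref. The code is as follows: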
/**
* Increment the reference count of a replica, and remove it from any free
* list it may be in.
*
* You must hold the cache lock while calling this function.
*
* @param replica The replica we're removing.
*/
private void ref(ShortCircuitReplica replica) {
lock.lock();
try {
Preconditions.checkArgument(replica.refCount > 0,
"can't ref " + replica + " because its refCount reached " +
replica.refCount);
// Check whether the replica has been added to one of the eviction maps.
Long evictableTimeNs = replica.getEvictableTimeNs();
replica.refCount++;
if (evictableTimeNs != null) {
// If so, remove it from that map so that it is not evicted while in use.
String removedFrom = removeEvictable(replica);
if (LOG.isTraceEnabled()) {
LOG.trace(this + ": " + removedFrom +
" no longer contains " + replica + ". refCount " +
(replica.refCount - 1) + " -> " + replica.refCount +
StringUtils.getStackTrace(Thread.currentThread()));
}
} else if (LOG.isTraceEnabled()) {
LOG.trace(this + ": replica refCount " +
(replica.refCount - 1) + " -> " + replica.refCount +
StringUtils.getStackTrace(Thread.currentThread()));
}
} finally {
lock.unlock();
}
}
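Besides incrementing the reference count, this function removes the replica from evictableMmapped or evictable.
ShortCircuitCache also contains the CacheCleaner thread, whose run function is as follows: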
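The interplay of ref, unref, and purge is easier to follow on a stripped-down model. The sketch below is an illustration only: `ReplicaSketch` and `RefCountCacheSketch` are hypothetical names, a logical counter replaces the real timestamps, there is a single eviction map instead of two, locking and trimming are omitted, and the initial reference count of 2 (one for the cache, one for the creating caller) is an assumption of the sketch.

```java
import java.util.TreeMap;

/** Hypothetical, minimal replica: just the fields needed to follow the state changes. */
final class ReplicaSketch {
    final String name;
    int refCount = 2;   // assumed: one reference for the cache, one for the creator
    boolean purged;
    boolean closed;
    Long evictableTime; // non-null only while the replica sits in the eviction map

    ReplicaSketch(String name) {
        this.name = name;
    }
}

/** Hypothetical, single-threaded model of the ref/unref/purge bookkeeping. */
final class RefCountCacheSketch {
    private final TreeMap<Long, ReplicaSketch> evictable = new TreeMap<>();
    private long clock = 0; // logical clock, keeps the map keys unique

    /** A reader starts using the replica: take it off the eviction map. */
    void ref(ReplicaSketch r) {
        if (r.evictableTime != null) {
            evictable.remove(r.evictableTime);
            r.evictableTime = null;
        }
        r.refCount++;
    }

    /** A reference is dropped: close at 0, become evictable at 1 if not purged. */
    void unref(ReplicaSketch r) {
        int newRefCount = --r.refCount;
        if (newRefCount == 0) {
            if (!r.purged) {
                throw new IllegalStateException("refCount 0 without being purged");
            }
            r.closed = true;
        } else if (newRefCount == 1 && !r.purged) {
            r.evictableTime = ++clock;
            evictable.put(r.evictableTime, r);
        }
    }

    /** The cache stops handing the replica out and drops its own reference. */
    void purge(ReplicaSketch r) {
        r.purged = true;
        if (r.evictableTime != null) {
            evictable.remove(r.evictableTime);
            r.evictableTime = null;
        }
        unref(r);
    }

    int evictableSize() {
        return evictable.size();
    }
}
```

A typical sequence would be: the reader calls unref (the replica becomes evictable), another reader calls ref (it leaves the eviction map), the cache purges it (purged, but still open because a reader holds it), and the last unref closes it.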
/**
* Run the CacheCleaner thread.
*
* Whenever a thread requests a ShortCircuitReplica object, we will make
* sure it gets one. That ShortCircuitReplica object can then be re-used
* when another thread requests a ShortCircuitReplica object for the same
* block. So in that sense, there is no maximum size to the cache.
*
* However, when a ShortCircuitReplica object is unreferenced by the
* thread(s) that are using it, it becomes evictable. There are two
* separate eviction lists-- one for mmaped objects, and another for
* non-mmaped objects. We do this in order to avoid having the regular
* files kick the mmaped files out of the cache too quickly. Reusing
* an already-existing mmap gives a huge performance boost, since the
* page table entries don't have to be re-populated. Both the mmap
* and non-mmap evictable lists have maximum sizes and maximum lifespans.
*/
@Override
public void run() {
ShortCircuitCache.this.lock.lock();
try {
if (ShortCircuitCache.this.closed) return;
long curMs = Time.monotonicNow();
if (LOG.isDebugEnabled()) {
LOG.debug(this + ": cache cleaner running at " + curMs);
}
// Demote replicas from evictableMmapped into evictable when they are too old or the map is over its size limit.
int numDemoted = demoteOldEvictableMmaped(curMs);
int numPurged = 0;
Long evictionTimeNs = Long.valueOf(0);
while (true) {
// Get the entry in the TreeMap with the smallest key greater than or equal to evictionTimeNs, i.e. the next replica in time order.
Entry<Long, ShortCircuitReplica> entry =
evictable.ceilingEntry(evictionTimeNs);
if (entry == null) break;
evictionTimeNs = entry.getKey();
// Convert evictionTimeNs from nanoseconds to milliseconds.
long evictionTimeMs =
TimeUnit.MILLISECONDS.convert(evictionTimeNs, TimeUnit.NANOSECONDS);
// Stop at the first replica that has not yet exceeded maxNonMmappedEvictableLifespanMs; replicas older than that are purged below.
if (evictionTimeMs + maxNonMmappedEvictableLifespanMs >= curMs) break;
ShortCircuitReplica replica = entry.getValue();
if (LOG.isTraceEnabled()) {
LOG.trace("CacheCleaner: purging " + replica + ": " +
StringUtils.getStackTrace(Thread.currentThread()));
}
// Purge the replica.
purge(replica);
numPurged++;
}
if (LOG.isDebugEnabled()) {
LOG.debug(this + ": finishing cache cleaner run started at " +
curMs + ". Demoted " + numDemoted + " mmapped replicas; " +
"purged " + numPurged + " replicas.");
}
} finally {
ShortCircuitCache.this.lock.unlock();
}
}
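This thread calls demoteOldEvictableMmaped to move timed-out replicas from evictableMmapped into evictable, then scans evictable and removes the timed-out replicas in it from the cache (by calling purge).
When a replica's refCount reaches 0, the replica's close function is called to close it. The code of close is as follows: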
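The scan over the time-ordered map can be isolated into a small example. This is a sketch under assumptions, not the cleaner itself: `EvictionScanSketch` and `purgeExpired` are hypothetical names, replicas are represented by plain strings, and "purging" just removes the entry and records its name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper showing the oldest-first expiry scan over a TreeMap. */
final class EvictionScanSketch {
    /**
     * Removes entries whose insertion time (key, in nanoseconds) plus
     * maxLifespanMs lies before nowMs, and returns their values oldest first.
     */
    static List<String> purgeExpired(TreeMap<Long, String> evictable,
                                     long nowMs, long maxLifespanMs) {
        List<String> purged = new ArrayList<>();
        Long cursorNs = 0L;
        while (true) {
            // Smallest key >= cursor, i.e. the oldest entry still in the map.
            Map.Entry<Long, String> entry = evictable.ceilingEntry(cursorNs);
            if (entry == null) break;
            cursorNs = entry.getKey();
            long insertedMs =
                TimeUnit.MILLISECONDS.convert(cursorNs, TimeUnit.NANOSECONDS);
            // Keys are sorted by time, so once one entry is still within its
            // lifespan, every later entry is too.
            if (insertedMs + maxLifespanMs >= nowMs) break;
            purged.add(entry.getValue());
            evictable.remove(cursorNs);
        }
        return purged;
    }
}
```

Because the map is ordered by insertion time, the loop can stop at the first entry that is still fresh instead of visiting every entry.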
/**
* Close the replica.
*
* Must be called after there are no more references to the replica in the
* cache or elsewhere.
*/
void close() {
String suffix = "";
Preconditions.checkState(refCount == 0,
"tried to close replica with refCount " + refCount + ": " + this);
refCount = -1;
Preconditions.checkState(purged,
"tried to close unpurged replica " + this);
if (hasMmap()) {
// Release the memory mapping.
munmap();
suffix += " munmapped.";
}
IOUtils.cleanup(LOG, dataStream, metaStream);
if (slot != null) {
// Schedule the release of the shared-memory slot.
cache.scheduleSlotReleaser(slot);
suffix += " scheduling " + slot + " for later release.";
}
if (LOG.isTraceEnabled()) {
LOG.trace("closed " + this + suffix);
}
}
Next, let's look at scheduleSlotReleaser, the function that releases the shared-memory slot. The code is as follows:
/**
* Schedule a shared memory slot to be released.
*
* @param slot The slot to release.
*/
public void scheduleSlotReleaser(Slot slot) {
Preconditions.checkState(shmManager != null);
// Submit the release task to the thread pool.
releaserExecutor.execute(new SlotReleaser(slot));
}
Stepping into the run function of the SlotReleaser thread class, the code is as follows:
@Override
public void run() {
if (LOG.isTraceEnabled()) {
LOG.trace(ShortCircuitCache.this + ": about to release " + slot);
}
final DfsClientShm shm = (DfsClientShm)slot.getShm();
final DomainSocket shmSock = shm.getPeer().getDomainSocket();
DomainSocket sock = null;
DataOutputStream out = null;
final String path = shmSock.getPath();
boolean success = false;
try {
// Connect to the Datanode.
sock = DomainSocket.connect(path);
out = new DataOutputStream(
new BufferedOutputStream(sock.getOutputStream()));
// Send the request to release the slot in shared memory.
new Sender(out).releaseShortCircuitFds(slot.getSlotId());
// Build the input stream.
DataInputStream in = new DataInputStream(sock.getInputStream());
// Read the response sent back by the Datanode.
ReleaseShortCircuitAccessResponseProto resp =
ReleaseShortCircuitAccessResponseProto.parseFrom(
PBHelper.vintPrefixed(in));
// If the Datanode failed to release the slot in its shared memory, throw an exception.
if (resp.getStatus() != Status.SUCCESS) {
String error = resp.hasError() ? resp.getError() : "(unknown)";
throw new IOException(resp.getStatus().toString() + ": " + error);
}
if (LOG.isTraceEnabled()) {
LOG.trace(ShortCircuitCache.this + ": released " + slot);
}
success = true;
} catch (IOException e) {
LOG.error(ShortCircuitCache.this + ": failed to release " +
"short-circuit shared memory slot " + slot + " by sending " +
"ReleaseShortCircuitAccessRequestProto to " + path +
". Closing shared memory segment.", e);
} finally {
if (success) {
// The Datanode released the slot successfully, so free the slot on the client side as well.
shmManager.freeSlot(slot);
} else {
// Otherwise, shut down this shared memory segment.
shm.getEndpointShmManager().shutdown(shm);
}
IOUtils.cleanup(LOG, sock, out);
}
}
}
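This function first calls releaseShortCircuitFds to tell the Datanode to release the slot in the Datanode-side shared memory, then calls shmManager.freeSlot(slot) to release the slot in the client-side shared memory. If the exchange with the Datanode fails, the client does not free the individual slot; it shuts down the whole shared memory segment instead.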
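The control flow of that run function boils down to a success flag that selects between two cleanup actions. A minimal sketch of the pattern, with all names (`SlotReleaseSketch`, `DataNodeCall`, `Outcome`) invented for illustration and the socket exchange replaced by a callback:

```java
import java.io.IOException;

/** Hypothetical reduction of the release flow to its success/failure branches. */
final class SlotReleaseSketch {
    /** Stands in for the request/response exchange with the Datanode. */
    interface DataNodeCall {
        void releaseRemoteSlot() throws IOException;
    }

    enum Outcome { SLOT_FREED, SEGMENT_SHUT_DOWN }

    static Outcome release(DataNodeCall call) {
        boolean success = false;
        try {
            call.releaseRemoteSlot(); // ask the Datanode to release its side first
            success = true;
        } catch (IOException e) {
            // The real code logs the error here; the sketch only records the failure.
        }
        // Success: free just this slot on the client side.
        // Failure: the segment can no longer be trusted, so shut it down as a whole.
        return success ? Outcome.SLOT_FREED : Outcome.SEGMENT_SHUT_DOWN;
    }
}
```

For example, `SlotReleaseSketch.release(() -> {})` yields `SLOT_FREED`, while a callback that throws an `IOException` yields `SEGMENT_SHUT_DOWN`.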