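Understanding
- Lease-duration trade-off: with a short lease the server has less client state to maintain, but the frequent renewals add overhead.
- In essence, a lease is an agreement that grants its holder a specific right for a limited period. Its defining property is the time limit.
- If the agreement is that the server confirms the client is still alive, the lease acts as a heartbeat;
- if the agreement is that the server guarantees the content will not be modified, the lease acts as a read lock;
- if the agreement is that the server guarantees the content can only be modified by this client, the lease acts as a write lock.
- Put simply, a Lease is a lock with a time limit. A client must obtain a Lease before writing a file, and only the client holding that lease may add blocks to the file.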
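The "lock with a time limit" idea can be sketched in a few lines. This is a hypothetical illustration, not HDFS code: the class name TimedLockSketch, the acquire method and the explicit nowMs clock parameter are all assumptions made so the behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a write lease as an expiring lock. Not HDFS code. */
final class TimedLockSketch {
    private static final class Grant {
        final String holder;
        final long expiresAtMs;
        Grant(String holder, long expiresAtMs) {
            this.holder = holder;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final long termMs;
    private final Map<String, Grant> grants = new HashMap<>();

    TimedLockSketch(long termMs) {
        this.termMs = termMs;
    }

    /**
     * Grants the lease on a resource, or renews it for the current holder.
     * Fails while a different holder still has an unexpired lease.
     */
    synchronized boolean acquire(String resource, String holder, long nowMs) {
        Grant g = grants.get(resource);
        if (g != null && nowMs < g.expiresAtMs && !g.holder.equals(holder)) {
            return false; // someone else holds a live lease
        }
        grants.put(resource, new Grant(holder, nowMs + termMs));
        return true;
    }
}
```

A holder that keeps calling acquire before the term runs out keeps the lock; a holder that stops renewing loses it automatically once the term has passed, which is the property that separates a lease from a plain lock.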
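Lease-related classes
Server side
- LeaseManager – manages the leases for files being written
- LeaseManager.Monitor – checks whether leases have expired (mainly against hardLimit)
- LeaseManager.Lease – the lease entity; tracks all write locks held by one client
Client side
- LeaseRenewer – renews leases on the client side
The sections below briefly describe the internals of each class, based on Hadoop 3.2.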
Lease
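The class has only three fields: the holder (holder, i.e. the client), the update time (lastUpdate), and the file list (files).
Each client has exactly one lease. A client can write many files at the same time, and these files are kept in files. The lease covers the write permission for all of them and is renewed for all of them together, not file by file. When a file no longer needs to be written, it is simply removed from files, and once files is empty the lease is reclaimed.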
class Lease {
  // Lease holder (name of the client holding the lease)
  private final String holder;
  // Time of the last lease renewal
  private long lastUpdate;
  // Files covered by this lease (all files opened by the client holding it)
  private final HashSet<Long> files = new HashSet<>();
  /** Only LeaseManager object can create a lease */
  private Lease(String holder) {
    this.holder = holder;
    renew();
  }
  /** Only LeaseManager object can renew a lease */
  private void renew() {
    this.lastUpdate = monotonicNow();
  }
  /** @return true if the Hard Limit Timer has expired */
  public boolean expiredHardLimit() {
    return monotonicNow() - lastUpdate > hardLimit;
  }
  /** @return true if the Soft Limit Timer has expired */
  public boolean expiredSoftLimit() {
    return monotonicNow() - lastUpdate > softLimit;
  }
  /** Does this lease contain any path? */
  boolean hasFiles() {return !files.isEmpty();}
  boolean removeFile(long inodeId) {
    return files.remove(inodeId);
  }
  @Override
  public String toString() {
    return "[Lease.  Holder: " + holder
        + ", pending creates: " + files.size() + "]";
  }
  @Override
  public int hashCode() {
    return holder.hashCode();
  }
  private Collection<Long> getFiles() {
    return Collections.unmodifiableCollection(files);
  }
  String getHolder() {
    return holder;
  }
  @VisibleForTesting
  long getLastUpdate() {
    return lastUpdate;
  }
}
LeaseManager
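LeaseManager is the lease management class. Internally it maintains three collections (leases, sortedLeases and leasesById) and two variables (softLimit and hardLimit).
Within the softLimit period, the client has exclusive access to the file, and no other client can take away its exclusive right to write it.
Once softLimit has expired, any other client can reclaim the lease, obtain its own lease on the file, and gain exclusive access to it.
Once hardLimit has expired, the NameNode forcibly closes the file and revokes the lease.
sortedLeases holds every lease issued by the NameNode, ordered by last update time, so when Monitor checks hardLimit it only has to take leases from sortedLeases in order.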
// Soft limit: the lease timeout that applies while a file is being written; 60 s by default
private long softLimit = HdfsConstants.LEASE_SOFTLIMIT_PERIOD;
// Hard limit: after this the lease is forcibly reclaimed, covering the case where
// the lease was not released in time when the file was closed; 1 h by default
private long hardLimit = HdfsConstants.LEASE_HARDLIMIT_PERIOD;
// Mapping: leaseHolder -> Lease
private final SortedMap<String, Lease> leases;
// Set of: Lease (every lease issued by the NameNode)
private final NavigableSet<Lease> sortedLeases;
// Mapping: INodeID -> Lease
private final TreeMap<Long, Lease> leasesById;
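Why the ordering of sortedLeases matters for the hard-limit check can be shown with a small sketch. It is only an illustration under assumptions: HardLimitScanSketch, its simplified Lease class and the expiredHolders helper are hypothetical names, and the clock is passed in explicitly instead of being read from the system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: scanning leases in last-update order. Not HDFS code. */
final class HardLimitScanSketch {
    static final class Lease {
        final String holder;
        final long lastUpdateMs;
        Lease(String holder, long lastUpdateMs) {
            this.holder = holder;
            this.lastUpdateMs = lastUpdateMs;
        }
    }

    /** Oldest renewal first; ties are broken by holder so distinct leases never compare equal. */
    static final Comparator<Lease> OLDEST_FIRST = (a, b) -> {
        int c = Long.compare(a.lastUpdateMs, b.lastUpdateMs);
        return c != 0 ? c : a.holder.compareTo(b.holder);
    };

    /** Collects holders past the hard limit and stops at the first lease that is still live. */
    static List<String> expiredHolders(TreeSet<Lease> sorted, long nowMs, long hardLimitMs) {
        List<String> expired = new ArrayList<>();
        for (Lease lease : sorted) {
            if (nowMs - lease.lastUpdateMs <= hardLimitMs) {
                break; // every later entry was renewed more recently
            }
            expired.add(lease.holder);
        }
        return expired;
    }
}
```

Because the set is ordered by last update, the scan can stop at the first unexpired lease instead of visiting every lease on each pass.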
Monitor
Monitor is a Runnable whose main job is to detect leases that have exceeded the hardLimit period. Its run method calls LeaseManager.checkLeases to do the check, once every 2 s.
class Monitor implements Runnable {
  final String name = getClass().getSimpleName();
  /** Check leases periodically. */
  @Override
  public void run() {
    for(; shouldRunMonitor && fsnamesystem.isRunning(); ) {
      boolean needSync = false;
      try {
        fsnamesystem.writeLockInterruptibly();
        try {
          if (!fsnamesystem.isInSafeMode()) {
            needSync = checkLeases();
          }
        } finally {
          fsnamesystem.writeUnlock("leaseManager");
          // lease reassignments should be sync'ed.
          if (needSync) {
            fsnamesystem.getEditLog().logSync();
          }
        }
        // recheck interval, 2 s
        Thread.sleep(fsnamesystem.getLeaseRecheckIntervalMs());
      } catch(InterruptedException ie) {
        LOG.debug("{} is interrupted", name, ie);
      } catch(Throwable e) {
        LOG.warn("Unexpected throwable: ", e);
      }
    }
  }
}
LeaseRenewer
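See the lease renewal section below for the detailed analysis; only a summary is given here.
LeaseRenewer is how the client renews its own leases. It runs a thread that watches the softLimit period: roughly once per second, the loop in LeaseRenewer.run() checks whether half of the soft limit has elapsed and, if so, renews the leases.
- Server side: Monitor performs the hard-limit check
- Client side: LeaseRenewer performs the soft-limit check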
Write-lock flow
Both FSNamesystem.startFileInternal() and FSNamesystem.appendFileInternal() call LeaseManager.addLease() to add a lease for the client.
/**
 * Adds (or re-adds) the lease for the specified file.
 */
synchronized Lease addLease(String holder, long inodeId) {
  Lease lease = getLease(holder);
  if (lease == null) {
    // Create the lease object
    lease = new Lease(holder);
    // Register it in LeaseManager.leases
    leases.put(holder, lease);
    // Register it in LeaseManager.sortedLeases
    sortedLeases.add(lease);
  } else {
    renewLease(lease);
  }
  // Map the inode ID to the lease in LeaseManager.leasesById
  leasesById.put(inodeId, lease);
  lease.files.add(inodeId);
  return lease;
}
On the NameNode, one Lease corresponds to one DFSClient. A Lease is identified by its holder, whose value is DFSClient.clientName. clientName is initialized in the DFSClient constructor as follows:
taskId = conf.get("mapreduce.task.attempt.id", "NONMAPREDUCE");
this.clientName = "DFSClient_" + dfsClientConf.taskId + "_" +
    DFSUtil.getRandom().nextInt() + "_" + Thread.currentThread().getId();
clientName is built from taskId, a random number and the current thread ID, so every DFSClient instance gets a different clientName and therefore a different Lease.
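To make the shape of the resulting name concrete, here is a small standalone sketch of the same concatenation. ClientNameSketch and its two helpers are hypothetical; only the "DFSClient_" prefix and the order of the three parts are taken from the snippet above.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper that mirrors the naming pattern quoted above. Not HDFS code. */
final class ClientNameSketch {
    /** Deterministic variant: all three parts are supplied by the caller. */
    static String clientName(String taskId, int random, long threadId) {
        return "DFSClient_" + taskId + "_" + random + "_" + threadId;
    }

    /** Draws a fresh random number and uses the calling thread's ID. */
    static String newClientName(String taskId) {
        return clientName(taskId,
            ThreadLocalRandom.current().nextInt(),
            Thread.currentThread().getId());
    }
}
```

For example, clientName("NONMAPREDUCE", -42, 1L) yields "DFSClient_NONMAPREDUCE_-42_1". The random part is what makes two clients created by the same task on the same thread distinguishable.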
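addLease first looks up the holder in LeaseManager.leases (the holder-to-lease map). If no lease exists, LeaseManager creates one; otherwise the existing lease is renewed.
A newly created lease is added to leases and sortedLeases. In both cases the inode ID of the file is then mapped to the lease in leasesById and added to the lease's files. That completes the addition.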
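The bookkeeping described above, together with the rule that a lease is reclaimed once it no longer covers any file, can be sketched as follows. This is a simplified, hypothetical model: LeaseBookkeepingSketch, removeLease, leaseCount and leaseOf are illustrative names, plain HashMaps stand in for the real collections, and the sorted set is left out to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of per-holder lease bookkeeping. Not HDFS code. */
final class LeaseBookkeepingSketch {
    static final class Lease {
        final String holder;
        final Set<Long> files = new HashSet<>();
        Lease(String holder) {
            this.holder = holder;
        }
    }

    private final Map<String, Lease> leases = new HashMap<>();   // holder -> lease
    private final Map<Long, Lease> leasesById = new HashMap<>(); // inode ID -> lease

    /** Reuses the holder's lease if there is one, then records the file under it. */
    synchronized Lease addLease(String holder, long inodeId) {
        Lease lease = leases.get(holder);
        if (lease == null) {
            lease = new Lease(holder);
            leases.put(holder, lease);
        }
        leasesById.put(inodeId, lease);
        lease.files.add(inodeId);
        return lease;
    }

    /** Drops one file; the lease itself is discarded once it covers no files. */
    synchronized void removeLease(String holder, long inodeId) {
        Lease lease = leases.get(holder);
        if (lease == null) {
            return;
        }
        leasesById.remove(inodeId);
        lease.files.remove(inodeId);
        if (lease.files.isEmpty()) {
            leases.remove(holder);
        }
    }

    synchronized int leaseCount() {
        return leases.size();
    }

    synchronized Lease leaseOf(long inodeId) {
        return leasesById.get(inodeId);
    }
}
```

Opening a second file with the same holder returns the same Lease object, which is why renewing that one lease covers every file the client has open.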
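Lease renewal
When a client opens a file for writing or appending, LeaseManager protects that client's lease on the file. The client starts a LeaseRenewer that renews the lease periodically so that it does not expire.
Note: lease renewal is initiated by the client.
In dfs.create(), the client calls beginFileLease() to register the file with the LeaseRenewer and start automatic renewal.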
/** Get a lease and start automatic renewal */
private void beginFileLease(final long inodeId, final DFSOutputStream out)
    throws IOException {
  getLeaseRenewer().put(inodeId, out, this);
}
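Client-side renewal is implemented by LeaseRenewer. A LeaseRenewer is instantiated from authority, which carries the NameNode information, and ugi, which carries the user information.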
// DFSClient.class
public LeaseRenewer getLeaseRenewer() throws IOException {
  return LeaseRenewer.getInstance(authority, ugi, this);
}
// LeaseRenewer.class
static LeaseRenewer getInstance(final String authority,
    final UserGroupInformation ugi, final DFSClient dfsc) throws IOException {
  final LeaseRenewer r = Factory.INSTANCE.get(authority, ugi);
  r.addClient(dfsc);
  return r;
}
// LeaseRenewer.Factory.class
private synchronized LeaseRenewer get(final String authority,
    final UserGroupInformation ugi) {
  final Key k = new Key(authority, ugi);
  LeaseRenewer r = renewers.get(k);
  if (r == null) {
    r = new LeaseRenewer(k);
    renewers.put(k, r);
  }
  return r;
}
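LeaseRenewer instances are created through Factory. Factory first looks in renewers for a LeaseRenewer matching the current NameNode authority and user. If there is none it creates one, otherwise it returns the existing one. getInstance then adds the DFSClient instance dfsc to the LeaseRenewer's dfsclients list. At that point the LeaseRenewer for this user is fully initialized.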
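The get-or-create pattern described above depends on the key having value semantics. The sketch below is a hypothetical, simplified model: RenewerRegistrySketch, Key and Renewer are illustrative classes, a plain String user stands in for UserGroupInformation, and client names stand in for DFSClient instances.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of one shared renewer per (authority, user). Not HDFS code. */
final class RenewerRegistrySketch {
    static final class Key {
        final String authority;
        final String user;
        Key(String authority, String user) {
            this.authority = authority;
            this.user = user;
        }
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return authority.equals(other.authority) && user.equals(other.user);
        }
        @Override
        public int hashCode() {
            return Objects.hash(authority, user);
        }
    }

    static final class Renewer {
        final Key key;
        final List<String> clients = new ArrayList<>();
        Renewer(Key key) {
            this.key = key;
        }
    }

    private final Map<Key, Renewer> renewers = new HashMap<>();

    /** Returns the renewer for the key, creating it on first use, and registers the client. */
    synchronized Renewer getInstance(String authority, String user, String clientName) {
        Key k = new Key(authority, user);
        Renewer r = renewers.get(k);
        if (r == null) {
            r = new Renewer(k);
            renewers.put(k, r);
        }
        if (!r.clients.contains(clientName)) {
            r.clients.add(clientName);
        }
        return r;
    }
}
```

Two clients of the same user talking to the same NameNode end up sharing one renewer, so a single background thread can renew for both.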
Next, put is called to hand the file's inode ID, the corresponding output stream and the DFSClient instance to the LeaseRenewer:
synchronized void put(final long inodeId, final DFSOutputStream out,
    final DFSClient dfsc) {
  if (dfsc.isClientRunning()) {
    // Check whether the daemon is running, or whether dfsclients has been
    // empty for longer than gracePeriod. In either case, start a new
    // daemon thread.
    if (!isRunning() || isRenewerExpired()) {
      //start a new daemon with a new id.
      final int id = ++currentId;
      daemon = new Daemon(new Runnable() {
        @Override
        public void run() {
          try {
            if (LOG.isDebugEnabled()) {
              LOG.debug("Lease renewer daemon for " + clientsString()
                  + " with renew id " + id + " started");
            }
            LeaseRenewer.this.run(id);
          } catch(InterruptedException e) {
            if (LOG.isDebugEnabled()) {
              LOG.debug(LeaseRenewer.this.getClass().getSimpleName()
                  + " is interrupted.", e);
            }
          } finally {
            synchronized(LeaseRenewer.this) {
              Factory.INSTANCE.remove(LeaseRenewer.this);
            }
            if (LOG.isDebugEnabled()) {
              LOG.debug("Lease renewer daemon for " + clientsString()
                  + " with renew id " + id + " exited");
            }
          }
        }
        @Override
        public String toString() {
          return String.valueOf(LeaseRenewer.this);
        }
      });
      daemon.start();
    }
    dfsc.putFileBeingWritten(inodeId, out);
    emptyTime = Long.MAX_VALUE;
  }
}
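put starts a daemon thread that calls LeaseRenewer.run, which checks the lease and then renews it. The limit checked here is softLimit. A new daemon thread is created only if the daemon is not running, or if dfsclients has been empty for longer than gracePeriod.
LeaseRenewer.this.run(id);
This calls run on the enclosing LeaseRenewer instance: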
private void run(final int id) throws InterruptedException {
  for(long lastRenewed = Time.now(); !Thread.interrupted();
      Thread.sleep(getSleepPeriod())) {
    final long elapsed = Time.now() - lastRenewed;
    // Has half of softLimit elapsed since the last renewal?
    if (elapsed >= getRenewalTime()) {
      try {
        // Renew the leases
        renew();
        ...
        // Record the time of this renewal
        lastRenewed = Time.now();
      } catch (SocketTimeoutException ie) {
        ...
        break;
      } catch (IOException ie) {
        ...
      }
    }
    ...
  }
}
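The timing rule in the loop above can be isolated into a small simulation. This is a sketch under assumptions: RenewalTimingSketch, shouldRenew and simulate are hypothetical names, the 60 s soft limit and the half-of-soft-limit threshold are taken from the comments in this document, and a fake clock replaces Thread.sleep.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of the "renew at half the soft limit" rule. Not HDFS code. */
final class RenewalTimingSketch {
    static final long SOFT_LIMIT_MS = 60_000L;        // soft limit quoted earlier: 60 s
    static final long RENEWAL_MS = SOFT_LIMIT_MS / 2; // renew once half of it has elapsed

    static boolean shouldRenew(long nowMs, long lastRenewedMs) {
        return nowMs - lastRenewedMs >= RENEWAL_MS;
    }

    /** Advances a fake clock in steps of tickMs and returns the times at which a renewal fires. */
    static List<Long> simulate(long durationMs, long tickMs) {
        List<Long> renewals = new ArrayList<>();
        long lastRenewed = 0L;
        for (long now = tickMs; now <= durationMs; now += tickMs) {
            if (shouldRenew(now, lastRenewed)) {
                renewals.add(now);
                lastRenewed = now;
            }
        }
        return renewals;
    }
}
```

With a one-second tick over 90 seconds, simulate(90_000L, 1_000L) fires at 30000, 60000 and 90000 ms, so the lease is always refreshed well before the 60 s soft limit can expire.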
run calls renew() to do the renewal. This renews the leases of all DFSClients belonging to the current user (that is, all of that user's Leases).
private void renew() throws IOException {
  final List<DFSClient> copies;
  synchronized(this) {
    copies = new ArrayList<DFSClient>(dfsclients);
  }
  //sort the client names for finding out repeated names.
  Collections.sort(copies, new Comparator<DFSClient>() {
    @Override
    public int compare(final DFSClient left, final DFSClient right) {
      return left.getClientName().compareTo(right.getClientName());
    }
  });
  String previousName = "";
  for(int i = 0; i < copies.size(); i++) {
    final DFSClient c = copies.get(i);
    //skip if current client name is the same as the previous name.
    if (!c.getClientName().equals(previousName)) {
      // Renew the lease for this client name
      if (!c.renewLease()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Did not renew lease for client " +
              c);
        }
        continue;
      }
      previousName = c.getClientName();
      ...
    }
  }
}
renew first sorts the DFSClients in dfsclients, mainly so that duplicate clientNames end up next to each other. Only one client per clientName is renewed, by calling c.renewLease:
boolean renewLease() throws IOException {
  if (clientRunning && !isFilesBeingWrittenEmpty()) {
    try {
      // RPC that ends up in LeaseManager.renewLease on the NameNode
      namenode.renewLease(clientName);
      updateLastLeaseRenewal();
      return true;
    } catch (IOException e) {
      // Abort if the lease has already expired.
      final long elapsed = Time.now() - getLastLeaseRenewal();
      if (elapsed > HdfsConstants.LEASE_HARDLIMIT_PERIOD) {
        LOG.warn("Failed to renew lease for " + clientName + " for "
            + (elapsed/1000) + " seconds (>= hard-limit ="
            + (HdfsConstants.LEASE_HARDLIMIT_PERIOD/1000) + " seconds.) "
            + "Closing all files being written ...", e);
        closeAllFilesBeingWritten(true);
      } else {
        // Let the lease renewer handle it and retry.
        throw e;
      }
    }
  }
  return false;
}
renewLease makes a remote call that ends up in LeaseManager.renewLease. The call chain is NameNodeRpcServer.renewLease --> FSNamesystem.renewLease --> LeaseManager.renewLease(holder):
// LeaseManager.class
synchronized void renewLease(String holder) {
  renewLease(getLease(holder));
}
synchronized void renewLease(Lease lease) {
  if (lease != null) {
    sortedLeases.remove(lease);
    lease.renew();
    sortedLeases.add(lease);
  }
}
// LeaseManager.Lease.class
private void renew() {
  this.lastUpdate = monotonicNow();
}
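The client renews through LeaseRenewer, which ends up calling LeaseManager.renewLease. The renewal logic first gets the lease for the clientName from leases, then removes that lease from sortedLeases, calls lease.renew to update its lastUpdate, and finally puts the lease back into sortedLeases. Leases in sortedLeases are ordered by lastUpdate. This completes the client renewal flow.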
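The remove, update, re-add sequence is needed because a sorted set locates elements by their sort key, so the key must not change while the element is inside the set. The sketch below illustrates this with hypothetical names (ReorderOnRenewSketch, its simplified Lease class and the BY_LAST_UPDATE comparator are assumptions made for the example).

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch of re-keying an element of a sorted set. Not HDFS code. */
final class ReorderOnRenewSketch {
    static final class Lease {
        final String holder;
        long lastUpdate;
        Lease(String holder, long lastUpdate) {
            this.holder = holder;
            this.lastUpdate = lastUpdate;
        }
    }

    /** Orders by last update, then by holder so distinct leases never compare equal. */
    static final Comparator<Lease> BY_LAST_UPDATE = (a, b) -> {
        int c = Long.compare(a.lastUpdate, b.lastUpdate);
        return c != 0 ? c : a.holder.compareTo(b.holder);
    };

    /** Takes the lease out under its old timestamp, then inserts it under the new one. */
    static void renew(TreeSet<Lease> sorted, Lease lease, long now) {
        sorted.remove(lease);   // must happen before lastUpdate changes
        lease.lastUpdate = now;
        sorted.add(lease);
    }
}
```

If lastUpdate were changed first, the later remove would search the tree using the new key and could fail to find the element, leaving a stale entry in the wrong position.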
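Lease recovery
If a client fails and can no longer renew its lease, lease recovery takes place.
- While a file is being written, the lease limit is 60 s (the soft limit; not configurable).
- After a failure, the lease limit is 60 min (the hard limit; not configurable); once it has passed, the expired lease is removed.
The states a block under construction can be in, including the recovery state, are defined by BlockUCState: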
/**
 * States, which a block can go through while it is under construction.
 */
static public enum BlockUCState {
  /**
   * Block construction completed.<br>
   * The block has at least the configured minimal replication number
   * of {@link ReplicaState#FINALIZED} replica(s), and is not going to be
   * modified.
   * NOTE, in some special cases, a block may be forced to COMPLETE state,
   * even if it doesn't have required minimal replications.
   */
  COMPLETE,
  /**
   * The block is under construction.<br>
   * It has been recently allocated for write or append.
   */
  UNDER_CONSTRUCTION,
  /**
   * The block is under recovery.<br>
   * When a file lease expires its last block may not be {@link #COMPLETE}
   * and needs to go through a recovery procedure,
   * which synchronizes the existing replicas contents.
   */
  UNDER_RECOVERY,
  /**
   * The block is committed.<br>
   * The client reported that all bytes are written to data-nodes
   * with the given generation stamp and block length, but no
   * {@link ReplicaState#FINALIZED}
   * replicas has yet been reported by data-nodes themselves.
   */
  COMMITTED;
}