Cannot obtain block length for LocatedBlock: fault analysis and fix

哈哈-bazinga


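Repost 1: https://www.cnblogs.com/cssdongl/p/6700512.html
A few days ago I tried to cat an HDFS file from a particular day and got a "Cannot obtain block length for LocatedBlock" exception; get failed the same way. HDFS files that cannot be read are a problem that has to be fixed, so here are the background and the steps I took.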

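I. Background

The likely cause: during Hadoop cluster maintenance a few days earlier, the ops team shut things down in the wrong order. They stopped the Hadoop cluster first and the Flume agents only afterwards, which left the HDFS files that were being written in an inconsistent state. The investigation and fix follow.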

II. Resolution

1. Since HDFS files are the problem, start with fsck

hdfs fsck /

You can of course point it at a specific HDFS path. The report showed nothing abnormal, no damaged or corrupt blocks, so I kept digging.

2. Check again with more options

hdfs fsck /user/flume/data/kcxp/20180529 -openforwrite

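OK, this time fsck listed quite a few files in the OPENFORWRITE state, and the ones I tested indeed could not be read. That is not normal: files Flume had finished writing were still open for write and could be read with neither cat nor get.

So, read literally, "Cannot obtain block length for LocatedBlock" means that the file is still in the being-written state and has not been closed, and the client cannot reach the corresponding DataNodes to determine the length of its last, still-open block.

Possible causes, for example:

1> The network connection Flume was using to write the HDFS file was closed abnormally

or

2> The Flume client's write to HDFS failed, and the block's replicas were lost as well

My case is probably the first one. In short, the HDFS files Flume was writing were never closed properly for some reason, so they were left in an inconsistent state and could no longer be read. On with the investigation.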

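3. Hypothesis: the HDFS file leases were never released

For background on the HDFS lease mechanism, see http://www.cnblogs.com/cssdongl/p/6699919.html

Having read up on HDFS leases, we know that a client obtains a lease when it opens an HDFS file for writing (readers do not need one), keeps renewing it while it writes, and releases it when the file is closed. Only then is the file in the closed state.

In this case, though, the Hadoop cluster was shut down first and the Flume HDFS sink afterwards. Flume was still writing to HDFS files when the cluster went away, so were the leases ever released? Certainly not.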

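4. Recover the lease
Repost 2:
An earlier post described a problem caused by HDFS leases: Spark applications could not read certain files, and the job could only continue after the bad files had been found and deleted.
Deleting files is a last resort, though, and the files themselves are not damaged. The only problem is that a client that has already gone away never released its lease in time.
According to the official Hadoop documentation, HDFS runs a dedicated thread that handles leases that were not released in time and automatically releases any lease that has passed the "hard limit" (1 hour by default). Judging from the symptoms, that thread was not doing its job, and I even wonder whether it was running at all. I am on a CDH cluster, so this may depend on its settings; that still needs to be confirmed.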
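
To make the hard-limit idea concrete, here is a toy model of lease expiry. It is only a sketch of the concept described above, not HDFS's actual lease manager: the class name ToyLeaseTable and its methods are made up for illustration, and the one-hour constant is simply the default quoted in the text.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of lease expiry; NOT the real HDFS implementation. */
class ToyLeaseTable {
    // Hard limit quoted in the text: one hour, expressed in milliseconds.
    static final long HARD_LIMIT_MS = 60L * 60L * 1000L;

    // file path -> time (ms) at which the writing client last renewed its lease
    private final Map<String, Long> lastRenewal = new HashMap<>();

    /** Record that the writer of path renewed its lease at nowMs. */
    void renew(String path, long nowMs) {
        lastRenewal.put(path, nowMs);
    }

    /** The writer closed the file normally, so the lease goes away. */
    void release(String path) {
        lastRenewal.remove(path);
    }

    /** True when a lease exists and has not been renewed within the hard limit. */
    boolean isHardExpired(String path, long nowMs) {
        Long renewedAt = lastRenewal.get(path);
        return renewedAt != null && nowMs - renewedAt > HARD_LIMIT_MS;
    }
}
```

In this model a writer that dies without calling release leaves an entry behind, and a background sweep is expected to reclaim it once isHardExpired returns true. The files in this post behave as if that sweep never happened.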

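If Hadoop does not clean up the leases automatically, can we refresh them by hand? Yes.
While searching online I found that the DistributedFileSystem class in the HDFS source provides a method called recoverLease, which triggers lease recovery on demand. Strangely, although this interface is exposed, hadoop-2.6.0 (the version I use) offers no shell command for it, so it can only be called from code. Later releases do add one: from Hadoop 2.7 on there is hdfs debug recoverLease -path <path> [-retries <num-retries>].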

Here is the source of that method:

/**
 * Start the lease recovery of a file
 *
 * @param f a file
 * @return true if the file is already closed
 * @throws IOException if an error occurs
 */
public boolean recoverLease(final Path f) throws IOException {
  Path absF = fixRelativePart(f);
  return new FileSystemLinkResolver<Boolean>() {
    @Override
    public Boolean doCall(final Path p)
        throws IOException, UnresolvedLinkException {
      return dfs.recoverLease(getPathName(p));
    }
    @Override
    public Boolean next(final FileSystem fs, final Path p)
        throws IOException {
      if (fs instanceof DistributedFileSystem) {
        DistributedFileSystem myDfs = (DistributedFileSystem)fs;
        return myDfs.recoverLease(p);
      }
      throw new UnsupportedOperationException("Cannot recoverLease through" +
          " a symlink to a non-DistributedFileSystem: " + f + " -> " + p);
    }
  }.resolve(this, absF);
}

If you are interested, download the Hadoop source and study the internals. Here I will only show how to call the method to solve our problem:

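import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Ask the NameNode to start lease recovery for a single file.
public static void recoverLease(String path) throws IOException {
    // try-with-resources closes the file system even if recovery throws
    try (DistributedFileSystem fs = new DistributedFileSystem()) {
        fs.initialize(URI.create(path), new Configuration());
        fs.recoverLease(new Path(path));
    }
}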
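This is a simple wrapper I wrote around that interface. Note that the path passed in must be a full URI that includes the file system scheme, the NameNode host and the port, for example: hdfs://namenode1:9000/xxx/xxx.log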
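
Since a bare path such as /xxx/xxx.log is an easy mistake to make, a small pre-check can catch it before the call. The following is only a sketch using java.net.URI: HdfsPathCheck and isFullyQualified are hypothetical names, not part of Hadoop, and the rule it enforces (hdfs scheme, host, explicit port) is simply the requirement stated above. It assumes you address the NameNode by host and port; a URI without a port would be rejected by this check.

```java
import java.net.URI;

/** Hypothetical pre-check for the path argument; not part of Hadoop. */
class HdfsPathCheck {
    /** True if path has the hdfs scheme, a host, an explicit port and a non-empty file path. */
    static boolean isFullyQualified(String path) {
        if (path == null) {
            return false;
        }
        URI uri;
        try {
            uri = URI.create(path);
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        return "hdfs".equals(uri.getScheme())
                && uri.getHost() != null
                && uri.getPort() != -1
                && uri.getPath() != null
                && !uri.getPath().isEmpty();
    }
}
```

Calling HdfsPathCheck.isFullyQualified(path) before the wrapper above turns a confusing failure inside initialize into an explicit argument error.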

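If only a single file needs recovering, calling the method above is enough. Usually, however, we need to process a directory recursively, that is, recover every file under it whose lease is stuck.

To do that, first collect all files under the directory with abnormal leases into a Set or List, then iterate over that collection and recover each file.

The method that builds the file list looks like this: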

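import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.web.URLConnectionFactory;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authentication.client.AuthenticationException;

public static Set<String> getOpenforwriteFileList(String dir) throws IOException {
    /* Build the URL that is sent to the NameNode's dfs.namenode.http-address port.
       The ugi parameter is the user the check runs as; "dev" is hard-coded here. */
    StringBuilder url = new StringBuilder();
    url.append("/fsck?ugi=").append("dev");
    url.append("&openforwrite=1");

    /* Get the NameNode host and dfs.namenode.http-address port, e.g. http://hadoopnode1:50070.
       getResolvedPath and getDFSHttpAddress are other helpers of this HDFSUtil class, not shown here. */
    Path dirpath = HDFSUtil.getResolvedPath(dir);
    URI namenodeAddress = HDFSUtil.getDFSHttpAddress(dirpath);

    url.insert(0, namenodeAddress);
    try {
        url.append("&path=").append(URLEncoder.encode(
            Path.getPathWithoutSchemeAndAuthority(new Path(dir)).toString(), "UTF-8"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    Configuration conf = new Configuration();
    URLConnectionFactory connectionFactory = URLConnectionFactory.newDefaultURLConnectionFactory(conf);
    URL path;
    try {
        path = new URL(url.toString());
    } catch (MalformedURLException e) {
        e.printStackTrace();
        return null; // without a valid URL there is nothing to query
    }

    BufferedReader input = null;
    try {
        URLConnection connection = connectionFactory.openConnection(path, UserGroupInformation.isSecurityEnabled());
        InputStream stream = connection.getInputStream();
        input = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
    } catch (IOException | AuthenticationException e) {
        e.printStackTrace();
    }

    if (input == null) {
        System.err.println("Cannot get response from namenode, url = " + url);
        return null;
    }

    /* Every token that starts with '/' and contains no space is treated as a file path */
    Pattern pattern = Pattern.compile("/[^ ]*");
    Set<String> resultSet = new HashSet<>();
    String line;
    try {
        while ((line = input.readLine()) != null) {
            /* Only lines reporting missing blocks or open-for-write files are of interest */
            if (line.contains("MISSING") || line.contains("OPENFORWRITE")) {
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    resultSet.add(matcher.group().replaceAll(":", ""));
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        input.close();
    }

    return resultSet;
}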

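I modeled this method on org.apache.hadoop.hdfs.tools.DFSck in the HDFS source: it talks to the NameNode on the dfs.namenode.http-address port to fetch the list of files in the openforwrite state, then extracts the paths with a regular expression and some string trimming.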
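
The two pure-string steps of that method, building the fsck URL and pulling the path out of a report line, can be tried without a cluster. The sketch below uses only the JDK. FsckSketch and its method names are hypothetical, and the sample report lines mentioned afterwards are illustrative assumptions about what fsck prints, not captured output. Unlike the regex above, it takes only the leading token of a line and strips only a trailing colon, so a colon inside a path survives.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Optional;

/** JDK-only sketch of the URL building and line parsing steps; names are hypothetical. */
class FsckSketch {
    /** Builds <namenode>/fsck?ugi=<user>&openforwrite=1&path=<encoded dir>. */
    static String buildFsckUrl(String namenodeHttpAddress, String user, String dir) {
        try {
            return namenodeHttpAddress + "/fsck?ugi=" + URLEncoder.encode(user, "UTF-8")
                    + "&openforwrite=1&path=" + URLEncoder.encode(dir, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Returns the leading path of a MISSING or OPENFORWRITE report line, if the line has one. */
    static Optional<String> extractPath(String line) {
        boolean relevant = line.contains("OPENFORWRITE") || line.contains("MISSING");
        if (!relevant || !line.startsWith("/")) {
            return Optional.empty();
        }
        // The path is assumed to be the first whitespace-delimited token.
        int end = 0;
        while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
            end++;
        }
        String token = line.substring(0, end);
        if (token.endsWith(":")) {
            token = token.substring(0, token.length() - 1); // drop the separator after the path
        }
        return Optional.of(token);
    }
}
```

For a line such as "/user/flume/a.tmp 0 bytes, 1 block(s), OPENFORWRITE: " this yields /user/flume/a.tmp. Like the original regex, it does not handle paths that contain spaces.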

As an aside, this is Java code and the returned Set is a java.util.Set. To call it from Scala you need to convert it to a scala.collection.immutable.Set, like this:

/* Get the list of files whose leases need recovering; the return type is java.util.Set */
val javaFilesSet = HDFSUtil.getOpenforwriteFileList(hdfsPrefix + recoverDirPath)
if (null == javaFilesSet || javaFilesSet.isEmpty) {
  println("No files need to recover lease : " + hdfsPrefix + recoverDirPath)
  return
}

/* Convert the java.util.Set into a scala.collection.immutable.Set */
import scala.collection.JavaConverters._
val filesSet = javaFilesSet.asScala.toSet

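With these two methods you can obtain every file under a directory whose lease is stuck and then call the lease recovery interface on each one, which gives you batch recovery.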
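
The batch loop itself might look like the sketch below. Because recoverLease returns true only when the file is already closed, the sketch retries each file a few times before giving up. BatchLeaseRecovery, LeaseRecoverer and recoverAll are hypothetical names. The LeaseRecoverer interface is a stand-in for DistributedFileSystem.recoverLease(Path), so the loop can be read and tested without a cluster, and the retry count and pause are arbitrary assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class BatchLeaseRecovery {
    /** Hypothetical stand-in for DistributedFileSystem.recoverLease(Path). */
    interface LeaseRecoverer {
        /** @return true if the file is already closed */
        boolean recoverLease(String path) throws IOException;
    }

    /**
     * Calls recoverLease on every path, retrying until it reports the file closed
     * or maxAttempts is used up. Returns the paths that are still not closed.
     */
    static List<String> recoverAll(Collection<String> paths, LeaseRecoverer recoverer,
                                   int maxAttempts, long pauseMs) {
        List<String> unrecovered = new ArrayList<>();
        for (String path : paths) {
            boolean closed = false;
            for (int attempt = 1; attempt <= maxAttempts && !closed; attempt++) {
                try {
                    closed = recoverer.recoverLease(path);
                } catch (IOException e) {
                    System.err.println("recoverLease failed for " + path + ": " + e.getMessage());
                }
                if (!closed && attempt < maxAttempts) {
                    pause(pauseMs); // wait before asking again
                }
            }
            if (!closed) {
                unrecovered.add(path);
            }
        }
        return unrecovered;
    }

    private static void pause(long pauseMs) {
        try {
            Thread.sleep(pauseMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            throw new IllegalStateException("interrupted while waiting for lease recovery", e);
        }
    }
}
```

Against a real cluster the recoverer argument would be something like p -> fs.recoverLease(new Path(p)), with fs an initialized DistributedFileSystem. Any paths in the returned list are worth checking again with fsck.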
How to recover HDFS files with unreleased leases