Resolving the "File has reached the limit on maximum number of blocks" error

晓之木初

Last modified 2023-01-29 10:13:25

First published 2022-06-04 20:45:00

Article link: https://blog.csdn.net/u014454538/article/details/125124280

1. Background

Recently, while importing the lineitem data of a 3000x TPC-H dataset, I found that loading it directly with Hive's LOAD DATA LOCAL INPATH failed for no obvious reason:

hive> LOAD DATA LOCAL INPATH '/data1/tpch/tpch_tools/dbgen/lineitem.tbl' INTO TABLE tpch_3000x_orc.lineitem_text;
Loading data to table tpch_3000x_orc.lineitem_text
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. org.apache.hadoop.hive.ql.metadata.HiveException: 
	Unable to move source file:/data1/tpch/tpc-h_tools_v3.0.0/dbgen/lineitem.tbl to destination hdfs://.../da_lineitem_text

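lineitem.tbl is a single large file of roughly 2.3 TB, so I suspected it was simply too big to upload.
Looking at the HDFS directories of other tables showed that LOAD DATA LOCAL INPATH into a Hive table really just copies the file into the table's HDFS directory.
So I tried uploading lineitem.tbl to that HDFS directory directly with hdfs dfs -put.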

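The upload failed with a message saying that the file had reached the maximum number of blocks per file, which is controlled by the configuration item dfs.namenode.fs-limits.max-blocks-per-file: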

hive> dfs -put /data1/tpch/tpc-h_tools_v3.0.0/dbgen/lineitem.tbl hdfs:///.../da_lineitem_text/;
put: File has reached the limit on maximum number of blocks (dfs.namenode.fs-limits.max-blocks-per-file): 10000 >= 10000
Command -put /data1/tpch/tpc-h_tools_v3.0.0/dbgen/lineitem.tbl hdfs://.../da_lineitem_text/ failed with exit code = 1
Query returned non-zero code: 1, cause: null
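
As a rough check on why the upload stops where it does, the arithmetic below estimates how many blocks a 2.3 TB file needs. The 128 MiB block size (the Hadoop default for dfs.blocksize) and the decimal reading of 2.3 TB are assumptions, since the post states neither; the limit of 10000 is taken from the error message.

```shell
#!/usr/bin/env bash
# Rough block-count arithmetic for the failed upload.
# ASSUMPTION: "2.3T" means 2.3e12 bytes (the exact size is not given in the post).
file_bytes=2300000000000
# ASSUMPTION: the cluster uses a 128 MiB block size; the real value may differ.
block_bytes=$((128 * 1024 * 1024))
# Limit reported in the error message.
max_blocks=10000

# Ceiling division: number of blocks the whole file would occupy.
blocks_needed=$(( (file_bytes + block_bytes - 1) / block_bytes ))
# Largest file that fits under the limit at this block size.
max_file_bytes=$(( max_blocks * block_bytes ))

echo "blocks needed:  ${blocks_needed}"
echo "max file bytes: ${max_file_bytes}"
```

Under these assumptions the file needs about 17,000 blocks, while 10000 blocks cover only about 1.34 TB, so the write would stop partway through the file.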

2. Splitting the file with split

2.1 Checking the HDFS configuration

After seeing the error, my first thought was to check the value of dfs.namenode.fs-limits.max-blocks-per-file.

The command is:

 hdfs getconf -confKey dfs.namenode.fs-limits.max-blocks-per-file

The result is 1048576, not the 10000 shown in the error message.

In the Hive CLI, the set command also shows 1048576:

set dfs.namenode.fs-limits.max-blocks-per-file;

2.2 Operations suggested a smaller dataset

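This was beyond what I could work out myself, so I asked a colleague on the HDFS operations team.
The suggestion was to use a smaller dataset, but I really do need one this large.
My guess: dfs.namenode.fs-limits.max-blocks-per-file is set to 10000 on the NameNode, and the HDFS configuration on this server may not be in sync with the NameNode's (I will confirm this with the operations team later).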

2.3 Splitting the file by lines with split

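Searching online turned up the split command, which can break a file into pieces.
Since every line of lineitem.tbl is one record, I split it by line count, at 2 billion lines per file.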
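
A back-of-the-envelope check of the 2-billion-line choice can be sketched as follows. The row count of about 18 billion is an assumption (roughly 6 million lineitem rows per unit of TPC-H scale factor, times 3000), as are the decimal 2.3 TB file size and the 128 MiB block size; none of these figures comes from the post.

```shell
#!/usr/bin/env bash
# Estimate the size and block count of each split piece.
# ASSUMPTIONS (not from the post): ~18 billion rows in lineitem.tbl,
# a file size of 2.3e12 bytes, and a 128 MiB HDFS block size.
total_bytes=2300000000000
total_lines=18000000000
lines_per_piece=2000000000          # value passed to split -l in the post
block_bytes=$((128 * 1024 * 1024))

# Integer average line length in bytes (rounded down).
avg_line_bytes=$(( total_bytes / total_lines ))
# Approximate size of one full piece.
piece_bytes=$(( avg_line_bytes * lines_per_piece ))
# Number of pieces, rounding up for a partial last piece.
pieces=$(( (total_lines + lines_per_piece - 1) / lines_per_piece ))
# Blocks needed for one full piece, rounding up.
blocks_per_piece=$(( (piece_bytes + block_bytes - 1) / block_bytes ))

echo "pieces:           ${pieces}"
echo "bytes per piece:  ${piece_bytes}"
echo "blocks per piece: ${blocks_per_piece}"
```

With these numbers each piece is roughly 254 GB and needs on the order of 1,900 blocks, well below the limit of 10000 from the error message.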

As before, the split takes a long time, so I submitted it as a background job with nohup:

nohup bash -c 'split -l 2000000000 lineitem.tbl /data7/lineitem' > split.log 2>&1 &

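The resulting files use /data7/lineitem as the prefix and are suffixed aa, ab, ac, and so on in order.
For more on split, see the blog post Split Command in Linux with Examples.
For the full details, though, the help text is more complete: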
```
split --help
```
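
To see the naming scheme on a small scale, here is a minimal sketch that splits a 10-line sample file into 4-line pieces, checks that no lines were lost, and prints (without running) the upload command for each piece. The file names sample.tbl and piece_ are hypothetical, and the target directory is left elided exactly as in the post.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical 10-line stand-in for lineitem.tbl.
seq 1 10 > "$workdir/sample.tbl"

# Split into pieces of 4 lines each: piece_aa, piece_ab, piece_ac.
split -l 4 "$workdir/sample.tbl" "$workdir/piece_"

# Sanity check: the pieces together hold as many lines as the original.
original_lines=$(wc -l < "$workdir/sample.tbl" | tr -d ' ')
pieces_lines=$(cat "$workdir"/piece_* | wc -l | tr -d ' ')
echo "original: ${original_lines} lines, pieces: ${pieces_lines} lines"

# Dry run: print one upload command per piece instead of executing it.
# The destination is elided in the post, so it stays a placeholder here.
target_dir='hdfs://.../da_lineitem_text/'
for f in "$workdir"/piece_*; do
  echo hdfs dfs -put "$f" "$target_dir"
done
```

Replacing echo with the real command (and the placeholder with the actual table directory) would upload the pieces one at a time; whether the pieces then fit under the block limit depends on the cluster's block size.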
