HDFS small-file merging methods (Hive / Spark / historical files)

Latest recommended article published 2024-08-21 11:56:03

孙小思思

Category: hdfs  Tags: hive spark hdfs

Article link: https://blog.csdn.net/qq_42616974/article/details/108579809

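Merging newly written data in Hive
1. Before loading the data, set the session-level parameter set mapred.reduce.tasks=10; Estimate the value from the size of the table so that 128 MB * mapred.reduce.tasks ≈ table size (in MB). For example, a table of about 1280 MB needs 10 reducers, so each reducer writes roughly one 128 MB file.

2. Add DISTRIBUTE BY rand() to the SQL statement so that rows are spread evenly across the reducers:
insert overwrite table bhy.cp_test partition(date_no='20170503',hour_no='03') select t.source_type,t.starttime,t.endtime,t.acc_nbr,t.classid,t.appid,t.ruleid,t.duration,t.send_traffic,t.recv_traffic from dic_khxw.tb_dpi_cw_class_app_info_h t where date_no='20170503' and hour_no='03' DISTRIBUTE BY rand();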
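The formula in step 1 can be worked out with a few lines of shell. This is only a sketch: the function name reducers_for_size is made up for illustration, and it assumes you already know the table size in bytes and that 128 MB is the target file size.

```shell
#!/usr/bin/env bash
# Target output file size: 128 MB, the value used in the formula above.
TARGET_BYTES=$((128 * 1024 * 1024))

# reducers_for_size BYTES
# Prints ceil(BYTES / TARGET_BYTES), never less than 1.
# (Hypothetical helper, not part of Hive or Hadoop.)
reducers_for_size() {
  local bytes=$1
  local n=$(( (bytes + TARGET_BYTES - 1) / TARGET_BYTES ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# Example: a 1280 MB table
reducers_for_size $((1280 * 1024 * 1024))   # prints 10
```

The printed number is the value to use in set mapred.reduce.tasks=...; rounding up keeps every output file at or below the target size.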

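Merging newly written data in Spark
1. Before loading the data, set the session-level parameters. The names below are the older (Spark 2.x era) adaptive-execution properties; in Spark 3.x the target size is set with spark.sql.adaptive.advisoryPartitionSizeInBytes instead, and some of these properties (for example targetPostShuffleRowCount) may not exist in every distribution, so check the configuration reference for your version.
-- Set the shuffle parallelism to 10
set spark.sql.shuffle.partitions=10;
-- Enable adaptive partition adjustment. When it is on, the partitions configured by spark.sql.shuffle.partitions may be merged and run in a single reducer
set spark.sql.adaptive.enabled=true;
-- Target amount of data read by each reducer, in bytes. The default is 64 MB; it is usually changed to the cluster block size
set spark.sql.adaptive.shuffle.targetPostShuffleInputSize=128000000;
-- Target number of records read by each reducer, in rows
set spark.sql.adaptive.shuffle.targetPostShuffleRowCount=10000000;
-- Minimum number of partitions once spark.sql.adaptive.enabled is on
set spark.sql.adaptive.minNumPostShufflePartitions=1;
-- Maximum number of partitions once spark.sql.adaptive.enabled is on
set spark.sql.adaptive.maxNumPostShufflePartitions=100;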

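2. Add even distribution to the SQL statement
If the original SQL is unusual and involves no shuffle, the settings above have nothing to act on. Rewrite the SQL by appending distribute by rand() at the end to force a shuffle.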
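As a rough illustration of the rewrite in step 2, the shell function below appends the clause to an existing statement. The name with_forced_shuffle is hypothetical, and the sketch assumes a single simple INSERT ... SELECT statement with no ORDER BY, LIMIT, or existing DISTRIBUTE BY clause at the end.

```shell
#!/usr/bin/env bash
# with_forced_shuffle SQL
# Prints SQL with " DISTRIBUTE BY rand();" appended.
# (Hypothetical helper; it does not parse SQL, it only edits the tail.)
with_forced_shuffle() {
  local sql=$1
  # Trim trailing whitespace, drop one trailing semicolon, trim again.
  sql=$(printf '%s' "$sql" | sed -e 's/[[:space:]]*$//' -e 's/;$//' -e 's/[[:space:]]*$//')
  printf '%s DISTRIBUTE BY rand();\n' "$sql"
}

# Example with the tables used later in this article:
with_forced_shuffle 'insert overwrite table bhy.test select * from bhy.tb_day;'
# prints: insert overwrite table bhy.test select * from bhy.tb_day DISTRIBUTE BY rand();
```

The result could then be passed to the SQL client, for example spark-sql -e "$(with_forced_shuffle "$sql")", after the set statements from step 1.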

Merging historical data

1. Set the Hive parameters that enable file merging
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = true;

2. Create a backup table with the same structure as the original table
create table bhy.test like bhy.tb_day;

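3. Query the data from the original table and write it into the backup table
Non-partitioned table: insert overwrite table bhy.test select * from bhy.tb_day;
Partitioned table: insert overwrite table bhy.test PARTITION(day,latn_id) select * from bhy.tb_day;
Note: for a partitioned table you must add PARTITION(day,latn_id), where the columns in parentheses are the ones declared in PARTITIONED BY. This form relies on the dynamic-partition settings from step 1.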

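4. Check that the original table and the backup table hold the same data (the two row counts should be equal)
select count(*) from bhy.test;
select count(*) from bhy.tb_day;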

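5. Check the number of files in the backup table (the output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME)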
hdfs dfs -count /user/hive/warehouse/bhy/test

6. Once the checks in steps 4 and 5 pass, drop the original table and rename the backup table to the original name
drop table bhy.tb_day;
alter table bhy.test rename to bhy.tb_day;
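Because step 6 is destructive, it can help to script the checks from steps 4 and 5 and run them before dropping anything. The following is a minimal sketch, not a finished tool: the function names are invented for illustration, and the commented usage assumes the hive and hdfs command-line clients are on the PATH.

```shell
#!/usr/bin/env bash
# is_count VALUE
# Succeeds only if VALUE is a non-empty string of digits.
is_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# counts_match SRC DST
# Returns 0 if both row counts are valid and equal,
# 1 if they differ, 2 if either value is not a number.
counts_match() {
  is_count "$1" || return 2
  is_count "$2" || return 2
  [ "$1" -eq "$2" ]
}

# avg_file_bytes
# Reads one line of `hdfs dfs -count` output on stdin
# (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME) and prints the
# average file size in bytes, truncated to an integer.
avg_file_bytes() {
  awk '{ if ($2 > 0) printf "%.0f\n", int($3 / $2); else print 0 }'
}

# Hypothetical usage before running step 6:
# src=$(hive -S -e 'select count(*) from bhy.tb_day;')
# dst=$(hive -S -e 'select count(*) from bhy.test;')
# counts_match "$src" "$dst" || { echo "row counts differ" >&2; exit 1; }
# hdfs dfs -count /user/hive/warehouse/bhy/test | avg_file_bytes
```

An average file size close to the 128 MB target suggests the merge worked; a value of a few MB or less means the backup table still consists mostly of small files.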