During a data migration we found that the old Hive tables were all uncompressed, so to improve efficiency and save storage we add compression to the Hive tables as part of the move.
Background: the source cluster is CDH 5.16 with Hive 1.1; the new cluster is CDH 6.0.1 with Hive 2.1.
Basic approach
1. distcp the Hive data from the source cluster to the local cluster
2. Import the data into Hive first, changing only the table names so they serve as temporary tables
3. Create the compressed tables
4. Load the data from the temporary tables into the compressed tables
Original table DDL and structure
CREATE TABLE `virtual_payment_cp` (
...
) PARTITIONED BY(`DATE` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE `virtual_payment_cp`(
...
PARTITIONED BY (
`date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hadoop-master:8020/user/hive/warehouse/cp_data.db/virtual_payment_cp'
TBLPROPERTIES (
'transient_lastDdlTime'='1565767429')
Final table DDL and structure
CREATE TABLE `virtual_payment_cp` (
...
) PARTITIONED BY(`DATE` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE `tb_sdk_user_payment_cp`(
...
PARTITIONED BY (
`date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://nameservice2/user/hive/warehouse/cp_data.db/tb_sdk_user_payment_cp'
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'transient_lastDdlTime'='1565864810')
1. Back up the table schemas
hive-create-schema.sh; the argument is the Hive database name.
#!/bin/bash
## Dump the CREATE TABLE statement for every table in a database
# database name is the first argument
DB=$1
# list the tables, then fetch each table's DDL
tables=$(hive -e "use $DB; show tables;")
for table in $tables;
do
echo -e "============== db: $DB , table: $table ============\n"
echo "$(hive -e "use $DB; show create table $table;");" >> /user/bak/hive_create_schema.hql
done
2. Export the data
Single-table export
hive -e "use cp_data; export table virtual_payment_cp to '/tmp/hive-export/virtual_payment_cp';"
Batch export
export-hive.sh; the argument is the Hive database name.
#!/bin/bash
## Export Hive tables to HDFS
# database name is the first argument
DB=$1
# generate one EXPORT statement per table; '|' is a placeholder
# that the sed pass swaps for a single quote
hive -e "use $DB; show tables;" | awk '{printf "export table %s to |/tmp/hive-export/%s|;\n",$1,$1}' \
| sed "s/|/'/g" > /user/bak/hive/hive-export.hql
# run the generated EXPORT statements
hive -database $DB -f "/user/bak/hive/hive-export.hql"
hive-export.hql
export table sdk_behavior_daily to '/tmp/hive-export/sdk_behavior_daily';
export table tb_ad_account_conf to '/tmp/hive-export/tb_ad_account_conf';
export table tb_ad_channel_conf to '/tmp/hive-export/tb_ad_channel_conf';
...
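The awk/sed pipeline in export-hive.sh can be sanity-checked offline by feeding it a couple of placeholder table names (tb_a and tb_b here are made up, standing in for `show tables` output):

```shell
# Push two fake table names through the same pipeline export-hive.sh uses;
# '|' is a stand-in for the single quote, swapped back in by sed.
printf 'tb_a\ntb_b\n' \
  | awk '{printf "export table %s to |/tmp/hive-export/%s|;\n",$1,$1}' \
  | sed "s/|/'/g"
# prints:
# export table tb_a to '/tmp/hive-export/tb_a';
# export table tb_b to '/tmp/hive-export/tb_b';
```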
3. Add hosts entries
Add the new cluster's hosts to the source cluster:
vim /etc/hosts
# add
192.168.100.45 cdh-master
192.168.100.46 cdh-slave01
192.168.100.47 cdh-slave02
4. Copy the data
Copy the exported Hive data to the new cluster. The source cluster (CDH 5.16) and the new cluster (CDH 6.0.1) use different WebHDFS ports (50070 vs 9870). The data goes to the /tmp/hive-export directory on HDFS.
hadoop distcp webhdfs://hadoop-master:50070/tmp/hive-export/ webhdfs://cdh-master:9870/tmp/hive-export/
5. Import the data
Because the old Hive data was uncompressed, the tables are recreated with ORC + Snappy compression.
1. Build the import HQL
Copy the export HQL hive-export.hql to a server in the new cluster and rewrite it into import HQL, renaming each target table to the temporary name below:
cp hive-export.hql hive-import-middle.hql
sed -i 's/export table/import table/g' hive-import-middle.hql
sed -i 's/ to /2 from /g' hive-import-middle.hql
hive-import-middle.hql
import table sdk_behavior_daily2 from '/tmp/hive-export/sdk_behavior_daily';
import table tb_ad_account_conf2 from '/tmp/hive-export/tb_ad_account_conf';
import table tb_ad_channel_conf2 from '/tmp/hive-export/tb_ad_channel_conf';
...
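The "2" suffix on the table names comes from the second sed pass: since " to " immediately follows the table name on every EXPORT line, replacing " to " with "2 from " renames the table and flips the keyword in one stroke. This can be checked on a single sample line:

```shell
# One export line pushed through the same two substitutions used above;
# the table name picks up the "2" suffix because " to " directly follows it.
echo "export table sdk_behavior_daily to '/tmp/hive-export/sdk_behavior_daily';" \
  | sed 's/export table/import table/g; s/ to /2 from /g'
# prints: import table sdk_behavior_daily2 from '/tmp/hive-export/sdk_behavior_daily';
```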
2. Create the temporary tables
Using the backed-up DDL, change each table name (suffix 2) and create the tables. The temporary tables are not compressed:
CREATE TABLE `virtual_payment_cp2` (
...
) PARTITIONED BY(`DATE` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
3. Import the data into the temporary tables
Load the data into the temporary tables first; the compressed tables are filled from them afterwards.
Single-table import
use cp_data;
import table virtual_payment_cp2 from '/tmp/hive-export/virtual_payment_cp';
Batch import
time hive -database cp_data -f hive-import-middle.hql
4. Create the compressed tables
Restore the original table names and add ORC + Snappy compression:
CREATE TABLE `virtual_payment_cp` (
...
) PARTITIONED BY(`DATE` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
5. Load the temporary tables into the compressed tables
Enable compression and dynamic partitioning first:
-- enable compressed output
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- max dynamic partitions overall
set hive.exec.max.dynamic.partitions=10000;
-- max dynamic partitions per node
set hive.exec.max.dynamic.partitions.pernode=2000;
Single-table load
insert into table virtual_payment_cp
partition (`DATE`)
SELECT * FROM virtual_payment_cp2;
Batch load
time hive -database cp_data -f hive-import-final.hql
The HQL statements can be generated from the exported table list; take care, because partitioned tables need a PARTITION clause while unpartitioned ones do not.
hive-import-final.hql
insert into table sdk_behavior_daily partition (`DATE`) SELECT * FROM sdk_behavior_daily2;
insert into table tb_ad_account_conf SELECT * FROM tb_ad_account_conf2;
insert into table tb_ad_channel_conf SELECT * FROM tb_ad_channel_conf2;
...
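A minimal sketch of how hive-import-final.hql could be generated; the gen_insert helper and the hard-coded table/partition names are illustrative assumptions, not part of the original migration scripts. In practice the table list and partition columns would come out of the backed-up hive_create_schema.hql.

```shell
#!/bin/bash
# Sketch only: build hive-import-final.hql from known table names.
# gen_insert is a hypothetical helper; the second argument, if present,
# is the table's partition column.
gen_insert() {
  local table=$1 part_col=$2
  if [ -n "$part_col" ]; then
    # partitioned table: needs a dynamic-partition INSERT
    echo "insert into table ${table} partition (\`${part_col}\`) SELECT * FROM ${table}2;"
  else
    # unpartitioned table: plain INSERT
    echo "insert into table ${table} SELECT * FROM ${table}2;"
  fi
}

gen_insert sdk_behavior_daily date > hive-import-final.hql
gen_insert tb_ad_account_conf     >> hive-import-final.hql
gen_insert tb_ad_channel_conf     >> hive-import-final.hql
```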