Sqoop把mysql数据导入hive，hdfs采用parquet+snappy存储

最新推荐文章于 2023-08-30 19:48:29 发布

Rudolf__

最新推荐文章于 2023-08-30 19:48:29 发布

阅读量1.9k

点赞数 1

分类专栏： hive cdh 文章标签： hive mysql 大数据

本文链接：https://blog.csdn.net/Rudolf__/article/details/110533579

版权

hive 同时被 2 个专栏收录

4 篇文章 0 订阅

订阅专栏

cdh

2 篇文章 0 订阅

订阅专栏

背景：
1.业务和业务数据量不断增加，需要用到hadoop进行数据分析和存储，现在将mysql数据到到hive，每个mysql表结构都差不多，现在以order_node为例，它的字段类型有bigint,varchar,int,tinyint(1),tinyint(4),datetime,decimal(18,6);
2.mysql表有300多张，30多张表数据量超过1000万，有5个表数据量已经过亿;
3.hdfs副本数为3，为了节省存储空间，需要用parquet存储，snappy压缩,parquet+snappy具有最好的查询+写入性能;
4.数据导入没有太多逻辑转换，全量表导入或者根据创建时间，所有选择sqoop;
5.适合一次性导入，增量导入这里不做介绍
6.总资源情况，hadoop集群ha，3个节点，机器资源，cpu=8+8+4=20，mem=16+16+8=40g；
7.yarn资源分配情况
am.reduce=1g,am.resource.cpu-vcores=1,nodemanager.memory=12+12+6=30g,
nodemanager.resource.cpu-vcores=8+8+4=20(全部了)，scheduler.minimum=1g,scheduler.maximum=2g
scheduler.minimum-allocation-vcores=1,scheduler.maximum-allocation-vcores=8,
采用dfr调度，因为是大数据平台是cdh

最终导入语句如下，为什么是下面的样子？后面再讲解

 sqoop import \
-D sqoop.parquet.logical_types.decimal.enable=true \
-D parquetjob.configurator.implementation=hadoop \
-D sqoop.avro.decimal_padding.enable=true \
-D sqoop.avro.logical_types.decimal.default.precision=18 \
-D sqoop.avro.logical_types.decimal.default.scale=6 \
--connect "jdbc:mysql://106.13.128.83:3306/oms?useSSL=false&useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&dontTrackOpenResources=true&defaultFetchSize=50000&useCursorFetch=true" \
--username root \
--password 123456 \
--table order_node \
--delete-target-dir \
--target-dir /user/hive/warehouse/testdb.db/order_node \
--hive-drop-import-delims \
--hive-import \
--hive-overwrite \
--fetch-size 50000 \
--null-non-string '\\N' \
--null-string '\\N' \
--fields-terminated-by "\001" \
--table order_node \
--hive-database testdb \
--as-parquetfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--split-by order_node_id \
-m 20

参数说明；
1.数据库连接中，useSSL=false数据库连接不用ssl加密，useUnicode=true&characterEncoding=utf-8编码规范，zeroDateTimeBehavior=convertToNull时间类型为空值需要，不然可能有问题;不指定tinyInt1isBit=false,mysql varchar(1)类型在hive自动建表就会变为tinyint
2.defaultFetchSize=50000&useCursorFetch=true是每次拉取mysql数据量大小，这个设置对应下面的fetch-size,设置这个可能会出现 GC overhead limit exceeded或者 Java heap space，yarn的资源配置不够，或者导入的数量太大(字段内容太大)，都有可能产生问题，所以要根据以下四个方面调整fetch-size
01.yarn总资源、yarn的mapreduce资源和yarn的container资源(参考上面资源，灵活调整自己的资源配置)
02.数据量大小(字段多少个，每一个字段内容大小)
03.数据库压力大小
04.mapreduce数量，对应设置参数–split-by和–num-mappers(-m)
3.–delete-target-dir删掉目标目录，若是增量抽取，就不能有这个参数
4.–hive-overwrite覆盖数据，一般一次性导入需要这个参数，提升速度，若是增量抽取，就不能有这个参数
5.–hive-database这是指定hive数据库，以前经常使用–hive-table=db.table来指定数据库和表，但是在导入文件为parquet格式时，不能这样用，因为我们的表一般有’’,使用db.table导入parquet格式并不支持，会报错：Namespace test.db is not alphanumeric (plus '’)；这里有一点说明，对应参数–hive-table,若有表，则导入表schema和该表schema必须相同，不然会报错，尤其注意字段类型datatime，decimal，若没有表，它会自己建表
6.–compress一般都需要压缩，parquet常用snappy压缩格式

注意：1.不指定–map-column-java来映射mysql<==>hdfs时，datetime(msyql)=>bigint(hive),存储为txt文件时为string(hive)
如果想要datetime为字符串格式(行业通用)，那么就需要先用sqoop建表(datetime默认会表为string)，再导入数据，并指定datetim->string，具体步骤
01.sqoop建表
sqoop create-hive-table --connect $mysqlconn --username root --password 123 --table order_node --hive-table oms.order_node
执行多次，建多个表,注意，并不能建表的时候指定为parquet格式
02.运行脚本，抽取多个表到txt

#! /bin/bash
 database=$1
 hive -e "use ${database};show tables;">>tables.txt
     cat tables.txt | while read eachline
	 do
	    hive -e "show create table ${database}.${eachline};">>${database}-tables.txt
	 done
 rm tables.txt

03.修改表结构，增加如下语句，其他多余的属性去掉
stored as parquet tblproperties(‘parquet.compress’=‘SNAPPY’)
04.根据03做成的表文件，存储为.sql，执行这个文件实现批量建parquet表
hive -f create_tables.sql
05.sqoop导数据过来,指定表名，增加–map-column-java date_field=String,datatime_field2=String

2.sqoop没有 -D 那些参数时，hive自动建表decimal(mysql)=>string(hive),所以mysql中有decimal类型时需要指定这些参数

参考：https://blog.csdn.net/weixin_35852328/article/details/86509506
https://www.jianshu.com/p/25e424bc4150
https://blog.csdn.net/weixin_42496757/article/details/88096925
https://www.cnblogs.com/piperck/p/10234102.html

Rudolf__

关注

1
点赞
踩
11

收藏

觉得还不错? 一键收藏
1
评论
Sqoop把mysql数据导入hive，hdfs采用parquet+snappy存储

背景：1.业务和业务数据量不断增加，需要用到hadoop进行数据分析和存储，现在将mysql数据到到hive，每个mysql表结构都差不多，现在以order_node为例，它的字段类型有bigint,varchar,int,tinyint(1),tinyint(4),datetime,decimal(18,6);2.mysql表有300多张，30多张表数据量超过1000万，有5个表数据量已经过亿;3.hdfs副本数为3，为了节省存储空间，需要用parquet存储，snappy压缩,parquet+sn
复制链接

扫一扫