Sqoop Study Notes

date-date

Article link: https://blog.csdn.net/learner_up/article/details/85089090

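I have been using Sqoop a lot at work recently, so these notes collect what I have learned for future reference. They cover the Sqoop commands for moving data between Hive and MySQL in both directions, and the pitfalls to watch for.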

Exporting Hive table data to MySQL:

sqoop export --connect "${jdbcUrl}" --username ${username} --password ${password} --table tb_adslot_hour_report_tmp --columns date,hour,user_id,app_id,adslot_id,os_type,is_asdk,adslot_type,req_dau,dwnl_cnt,actv_cnt --fields-terminated-by '\001' --update-key date,hour,user_id,app_id,adslot_id,os_type,is_asdk,adslot_type --update-mode allowinsert --export-dir /user/hive/warehouse/yin.db/adslot_report_tmp --input-null-string '\\N'

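--table specifies the MySQL table to export into

--columns specifies the MySQL columns to populate; the fields in the Hive data files must appear in this order

--fields-terminated-by specifies the field delimiter of the Hive table's data files

--update-key specifies the column(s) used to match existing rows when updating

--export-dir specifies the HDFS directory to export; every file under it is exported

--null-string '\\N' applies to string columns on import: a NULL value is written as the given string (the export-side counterpart is --input-null-string)

--null-non-string applies to non-string columns on import: a NULL value is written as the given string (the export-side counterpart is --input-null-non-string)

--delete-target-dir deletes the import target directory if it already exists

-m 1 sets the number of map tasks to run

--update-mode has two modes, updateonly (the default) and allowinsert

updateonly: updates the fields that differ for rows whose key exists in both MySQL and Hive; records that exist in Hive but not in MySQL are not exported.

allowinsert: if the key of an exported row already exists in the table, the row is updated; otherwise it is inserted (so it covers everything updateonly does).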
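
The difference between the two update modes can be illustrated with a toy model. This is only a sketch of the semantics described above, not Sqoop's implementation: the sample rows, the key/value layout, and the apply_mode helper are all made up for illustration.

```shell
#!/bin/sh
# Toy model, not Sqoop itself: each row is "<key> <value>".
mysql_rows='k1 10
k2 20'
hive_rows='k2 25
k3 30'

# apply_mode MODE prints the MySQL table as it would look after the export,
# where MODE is updateonly or allowinsert.
apply_mode() {
  {
    printf '%s\n' "$mysql_rows" | sed 's/^/T /'   # T = row already in the target table
    printf '%s\n' "$hive_rows"  | sed 's/^/S /'   # S = row coming from the Hive source
  } | awk -v mode="$1" '
    $1 == "T" { val[$2] = $3; order[++n] = $2 }
    # Source row: update when the key exists, insert only in allowinsert mode.
    $1 == "S" { if ($2 in val) { val[$2] = $3 } else if (mode == "allowinsert") { val[$2] = $3; order[++n] = $2 } }
    END { for (i = 1; i <= n; i++) print order[i], val[order[i]] }'
}

echo "updateonly:";  apply_mode updateonly
echo "allowinsert:"; apply_mode allowinsert
```

In this model both modes update k2, but only allowinsert adds the row k3 that exists in Hive and not in MySQL.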

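Notes:

1. If a field is NULL in the Hive table and the corresponding MySQL column is of type int, the export fails unless Sqoop is told how NULL is encoded (for non-string columns, with --input-null-non-string '\\N').

2. If a Hive field is longer than the length defined for the MySQL column, the export fails.

3. A Sqoop export is not a single atomic transaction. When several sqoop statements are run and one of them hits an error, some of the data ends up exported and some does not. It is therefore best to create a temporary table, export into it with Sqoop first, and only then sync the temporary table into the production table.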
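
The sync step in note 3 could look roughly like the sketch below. It is not the author's script: the production table name tb_adslot_hour_report and the sync_sql helper are assumptions for illustration, and it assumes the production table has a unique key over the --update-key columns so that MySQL's INSERT ... ON DUPLICATE KEY UPDATE behaves as an upsert. The script only prints the SQL.

```shell
#!/bin/sh
# Hypothetical names: only the temporary table name comes from the article.
TMP_TABLE="tb_adslot_hour_report_tmp"
REAL_TABLE="tb_adslot_hour_report"   # assumed name of the production table

# Build the statement that copies the temporary table into the production table.
# Assumes REAL_TABLE has a unique key covering the --update-key columns, so
# existing rows are updated and new rows are inserted by one statement.
sync_sql() {
  cat <<EOF
INSERT INTO $REAL_TABLE (date, hour, user_id, app_id, adslot_id, os_type, is_asdk, adslot_type, req_dau, dwnl_cnt, actv_cnt)
SELECT date, hour, user_id, app_id, adslot_id, os_type, is_asdk, adslot_type, req_dau, dwnl_cnt, actv_cnt
FROM $TMP_TABLE
ON DUPLICATE KEY UPDATE
  req_dau = VALUES(req_dau),
  dwnl_cnt = VALUES(dwnl_cnt),
  actv_cnt = VALUES(actv_cnt);
EOF
}

# Print the SQL. In a real job it would be sent to the mysql client only
# after the sqoop export into TMP_TABLE has exited with status 0.
sync_sql
```

Because the sync is a single statement, a failed export leaves only the temporary table in a partial state, which can be truncated and reloaded.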

Importing MySQL into a Hive table (full import)

sqoop import --connect "jdbc:mysql://10.10.15.4:3306/cpd_server_test?useUnicode=true&characterEncoding=utf-8" --username root --password *** --table tmp_vivo_idea_minutely_report_20170331  --fields-terminated-by '\001' -m 1  --columns plat_idea_id,utime,download_cnt,price,cost,imp_cnt,ctime,mtime,hour,minute --target-dir /jutou/original/vivo_idea_minutely_report/day=20170331  --where "date=20170331"  --delete-target-dir  --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N'

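--fields-terminated-by specifies the field delimiter of the Hive table. It must match the delimiter used when the Hive table was created; if it does not, the whole line is loaded into the first column. Also, when the sqoop command is written in a file, --fields-terminated-by \t makes the delimiter \t, whereas --fields-terminated-by '\t' makes the delimiter the quoted text '\t'.

--target-dir specifies the HDFS directory to import into, i.e. the directory backing the Hive table

--delete-target-dir deletes the target directory if it already exists

--hive-drop-import-delims removes special characters such as \n, \r and \01 from string fields during import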
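
Quoting of the delimiter also matters on a shell command line. The sketch below does not call Sqoop; show_arg is a hypothetical helper that only prints the argument a program would receive, to show why the delimiter is usually written in single quotes as '\001'.

```shell
#!/bin/sh
# show_arg prints its first argument exactly as a program such as sqoop would receive it.
show_arg() { printf '%s\n' "$1"; }

unquoted=$(show_arg \001)    # unquoted: the shell removes the backslash
quoted=$(show_arg '\001')    # single-quoted: the backslash is preserved

printf 'unquoted -> %s\n' "$unquoted"
printf 'quoted   -> %s\n' "$quoted"
```

The unquoted form reaches the program as 001, so the escape sequence is lost before Sqoop ever sees it.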

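Notes:

1. When importing into Hive, the Hive column types should match the MySQL ones. If they differ, the field may come out as NULL in Hive; for example, bigint in MySQL with int in Hive can leave the value in that row NULL.

2. If you create a partitioned table with an explicit location, then after importing MySQL data into the directory for one partition you must alter the Hive table so that the partition points at the --target-dir; otherwise queries against the Hive table always return nothing.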

ALTER TABLE vivo_idea_minutely_report ADD IF NOT EXISTS PARTITION(day='${day}') LOCATION '/jutou/original/vivo_idea_minutely_report/day=${day}';
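
For a daily job, the target directory and the matching ALTER TABLE statement can be derived from the same day value so that they never drift apart. A minimal sketch, assuming the table name and base path shown above; the helper names partition_dir and add_partition_sql are hypothetical, and the script only prints the statement.

```shell
#!/bin/sh
# Table name and base path are taken from the article; the helper names are hypothetical.
TABLE="vivo_idea_minutely_report"
BASE_DIR="/jutou/original/vivo_idea_minutely_report"

# Directory that sqoop import should use as --target-dir for a given day (yyyymmdd).
partition_dir() { printf '%s/day=%s\n' "$BASE_DIR" "$1"; }

# Statement that registers that directory as the location of the day's partition.
add_partition_sql() {
  printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION(day='%s') LOCATION '%s';\n" \
    "$TABLE" "$1" "$(partition_dir "$1")"
}

day=20170331
add_partition_sql "$day"
# After the import succeeds, the statement could be run with:
#   hive -e "$(add_partition_sql "$day")"
```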

MySQL to Hive (incremental import)

1. Append mode

sqoop import --connect "jdbc:mysql://10.10.15.4:1063/server_test?useUnicode=true&characterEncoding=utf-8" --username root --password test*** --table tb_vivo_idea_daily_report --hive-import --fields-terminated-by '\001' --target-dir  /jutou/original/vivo_idea_daily_report/day=20190111  --incremental append --check-column date --last-value '20190110'

In effect (the bounding query that Sqoop issues):

SELECT MIN(`id`), MAX(`id`) FROM `tb_vivo_idea_daily_report` WHERE ( `date` > 20190110 AND `date` <= 20190111 )

append imports into Hive every record whose --check-column value is greater than --last-value.
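
The selection rule of append mode, and the value to carry into the next run, can be mimicked on a few sample rows. This is a sketch of the rule only: the rows are made up, and awk stands in for the WHERE clause that Sqoop sends to MySQL.

```shell
#!/bin/sh
# Made-up sample rows: "<date> <id>". The real data lives in MySQL.
rows='20190109 1
20190110 2
20190111 3
20190111 4'

last_value=20190110

# Rows an append import would pick up: check column strictly greater than --last-value.
new_rows=$(printf '%s\n' "$rows" | awk -v last="$last_value" '$1 > last')

# Value to pass as --last-value on the next run: the largest check-column value seen.
next_last_value=$(printf '%s\n' "$rows" | awk 'NR == 1 || $1 > max { max = $1 } END { print max }')

printf '%s\n' "$new_rows"
printf 'next --last-value: %s\n' "$next_last_value"
```

Rows dated 20190110 are skipped because the comparison is strict, so the next run should start from the largest value already imported.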

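2. lastmodified mode

This mode requires the source table to have a time column. The column given to --check-column must be of type timestamp or date.

Option                        Description
--incremental lastmodified    incremental import based on a time column (imports into Hadoop every row whose time column is greater than or equal to the threshold)
--check-column                the time column (timestamp or date)
--last-value                  the threshold (a value of the time column)
--merge-key                   the merge column (the primary key; records with the same key value are merged)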
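
Put together, the four options could be combined as in the sketch below. It is an illustration under assumptions, not a tested job: the connection settings are placeholders, and it assumes the table has an mtime timestamp column and an id primary key. The run wrapper is a hypothetical dry-run helper, so by default the command is only printed.

```shell
#!/bin/sh
# Placeholder connection settings (assumptions, not from the article).
JDBC_URL="jdbc:mysql://localhost:3306/server_test"
DB_USER="root"
DB_PASSWORD="secret"

# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# lastmodified_import LAST_VALUE: import rows modified since LAST_VALUE
# and merge them into the existing files by primary key.
lastmodified_import() {
  run sqoop import \
    --connect "$JDBC_URL" --username "$DB_USER" --password "$DB_PASSWORD" \
    --table tb_vivo_idea_daily_report \
    --fields-terminated-by '\001' \
    --target-dir /jutou/original/vivo_idea_daily_report \
    --incremental lastmodified \
    --check-column mtime \
    --last-value "$1" \
    --merge-key id
}

lastmodified_import "2019-01-10 00:00:00"
```

With --merge-key, a row that was re-imported because its mtime changed replaces the older copy with the same id instead of appearing twice.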

Complete guide to sqoop commands

NULL values when exporting from Hive to MySQL, incremental import, and handling Parquet date types
