Sqoop command reference and common pitfalls

qzWsong

Last modified 2022-06-09 17:26:22

First published 2021-02-28 16:10:25

Article link: https://blog.csdn.net/java_creatMylief/article/details/114223268

Import (MySQL to Hive/HDFS)

Parameter reference

------------DBMS options------------

--connect✳✳

--table --where --columns

--query✳

--m

--split-by id✳✳

------------HDFS options------------

--as-textfile

--compress

--compression-codec gzip

--null-non-string✳✳

--null-string✳✳

--fields-terminated-by✳

--lines-terminated-by

--target-dir✳

--delete-target-dir✳

------------Hive options------------

--hive-import✳✳

--hive-database

--hive-table✳✳

--hive-partition-key✳

--hive-partition-value✳

--hive-overwrite✳

--hcatalog

------------Incremental import options------------

--incremental lastmodified/append

--check-column

--last-value

Export (Hive/HDFS to MySQL)

Parameter reference

------------DBMS options------------

--connect✳✳

--columns

--table✳✳

------------HDFS options------------

--export-dir✳✳

--input-fields-terminated-by✳

--input-null-string ✳✳

--input-null-non-string✳✳

------------Duplicate-data handling------------

--update-mode✳

--update-key✳

Import (MySQL to Hive/HDFS)

bin/sqoop import \

--connect jdbc:mysql://dream3:3306/test \

--username root \

--password root \

--hive-import \

--hive-table test.src_test_ws \

--as-textfile \

--split-by id \

--null-string '\\N' \

--null-non-string '\\N' \

--query 'select * from demo where $CONDITIONS' \

--hive-partition-key part_id \

--hive-partition-value 7 \

--target-dir 'hdfs://dream1:9000/sqoop/test.db/src_test_ws/part_id=7/' \

--delete-target-dir \

-m 2

Parameter reference

------------DBMS options------------

--connect✳✳

JDBC connection string of the source database, together with the credentials

--connect jdbc:mysql://dream3:3306/test \

--username root \

--password root \

--table --where --columns

Specify the MySQL table, the WHERE condition, and the columns to migrate. [Avoid these; use --query below instead.]

--query✳

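# Free-form query import: --query

# When --query is used, do not also pass --table, --where, or --columns. They conflict with it, and --query alone covers everything they do.

# The $CONDITIONS placeholder:

The statement must contain where $CONDITIONS, or, when you have filters of your own, something like where id>20 and $CONDITIONS

# Why?

Sqoop hands your SQL statement to several map tasks, and each map task has to add the range condition for its own share of the data when it runs the statement,

so $CONDITIONS is provided as the placeholder where that condition is spliced in.

# No aggregate functions

Keep the SQL simple and do not use aggregate functions, because the query is executed separately for each split (unless m=1).

# Quoting

Either wrap the query in single quotes and use double quotes inside, in which case the $ of $CONDITIONS needs no escaping,

or wrap the query in double quotes and use single quotes inside, in which case the $ of $CONDITIONS must be escaped as \$

# A destination must also be given with --target-dir, otherwise Sqoop reports:

#Must specify destination with --target-dir

# --query can add extra columns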

--query 'select *,"xxx" as xx,"ttt" as tt from demo where $CONDITIONS'
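The quoting rule is easiest to see by letting the shell print what Sqoop would actually receive. The following is a small sketch in plain POSIX shell that does not need Sqoop installed; the variable names q_single, q_escaped, and q_broken are made up for illustration.

```shell
# Make sure no CONDITIONS variable leaks in from the environment.
unset CONDITIONS

# Outer single quotes: the shell leaves $CONDITIONS untouched.
q_single='select * from demo where id > 20 and $CONDITIONS'

# Outer double quotes with an escaped \$: same result as above.
q_escaped="select * from demo where id > 20 and \$CONDITIONS"

# Outer double quotes without escaping: the shell expands $CONDITIONS
# (empty here), so the placeholder never reaches Sqoop.
q_broken="select * from demo where id > 20 and $CONDITIONS"

printf '%s\n' "$q_single" "$q_escaped" "$q_broken"
```

The first two lines of output keep $CONDITIONS intact. The third ends right after "and", which is the form that makes the import fail because the placeholder is missing.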

--m

Number of map tasks to run in parallel (documented as -m or --num-mappers)

--split-by id✳✳

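With --m 1 this option has no effect.

With --m greater than 1 and no --split-by, a --table import falls back to the primary key of the MySQL table as the split column (a --query import has no such fallback and needs --split-by explicitly). If the table has no primary key, the import fails:

#2022-06-08 02:05:33,676 ERROR tool.ImportTool: Import failed: No primary key could be found for table demo.

--split-by works best on a numeric column (a date column is also fine, anything for which max and min are meaningful). For a string column, pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true when running the command, otherwise it fails with: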

#ERROR tool.ImportTool: Import failed: java.io.IOException:

#Generating splits for a textual index column allowed only in case of

#"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
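To make the role of --split-by and $CONDITIONS concrete, here is a simplified sketch of how a numeric range can be cut into one condition per map task. It is an illustration only, not Sqoop's actual splitter code: the function name print_splits and the exact boundary arithmetic are assumptions. Sqoop itself first queries the minimum and maximum of the split column and then derives ranges of this kind, substituting one of them for $CONDITIONS in each task.

```shell
# print_splits MIN MAX M
# Prints one range condition per task for an integer split column "id".
print_splits() {
  min=$1; max=$2; m=$3
  # Size of each slice, rounded up so that M slices cover MIN..MAX.
  size=$(( (max - min + m) / m ))
  lo=$min
  i=1
  while [ "$i" -le "$m" ]; do
    hi=$(( lo + size ))
    # The last slice (or one that reaches past MAX) is closed at MAX.
    if [ "$i" -eq "$m" ] || [ "$hi" -gt "$max" ]; then
      echo "id >= $lo AND id <= $max"
      break
    fi
    echo "id >= $lo AND id < $hi"
    lo=$hi
    i=$(( i + 1 ))
  done
}

# Example: ids from 1 to 100, two map tasks.
print_splits 1 100 2
```

This also shows why a column with a meaningful min and max is preferred: the ranges are computed arithmetically, which does not carry over naturally to arbitrary strings.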

------------HDFS options------------

--as-textfile

# File format to generate; the default is --as-textfile

--as-avrodatafile

--as-parquetfile

--as-sequencefile

--as-textfile

--compress

--compression-codec gzip

## To compress the output, enable compression and choose a codec

--compress

--compression-codec gzip

--null-non-string✳✳

--null-string✳✳

What to write to the output files for a NULL value in the database

--null-non-string '\\N'

--null-string '\\N'

By default Hive reads \N in HDFS files as NULL
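The doubled backslash matters because Sqoop places the value into generated code and un-escapes it there, so it has to receive \\N in order to write \N. A small sketch of what the shell passes on with each quoting style (the variable names sq and dq are made up for illustration):

```shell
# Single quotes: both backslashes survive, Sqoop receives the 3 characters \\N.
sq='\\N'

# Double quotes: the shell collapses \\ to \, Sqoop receives only \N.
dq="\\N"

printf '%s has %s characters\n' "$sq" "${#sq}"
printf '%s has %s characters\n' "$dq" "${#dq}"
```

With the single-quoted form used throughout this article, the value arrives at Sqoop unchanged.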

--fields-terminated-by✳

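Field delimiter. With --hive-import and no delimiter given, Sqoop uses Hive's default '\001'; for a plain HDFS import the default is a comma.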

--lines-terminated-by

Record (line) delimiter. The default is already \n (\012), so there is normally no need to set it.

--target-dir✳

/user/<username>/<table name>

Destination path for the data. When --target-dir is not given, the data goes under /user/<username> by default.

#Output directory hdfs://dream1:9000/user/root/demo already exist

--delete-target-dir✳

# The import fails if the HDFS target path already exists, so add this flag to delete the target directory automatically first.

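------------Hive options------------

--hive-import✳✳

Import the data from this Sqoop run directly into Hive.

If --hive-table is given as well, Sqoop creates the Hive table automatically, but it does not create a database that does not exist. The generated statement looks like this: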

CREATE TABLE IF NOT EXISTS `test.demo`(

`id` INT, `name` STRING, `value` STRING,

`create_date` STRING, `create_by` BIGINT,

`update_date` STRING, `update_by` BIGINT,

`delete_flag` STRING, `versions` INT

)

COMMENT 'Imported by sqoop on 2022/06/08 03:59:10'

PARTITIONED BY (part_id STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'

LINES TERMINATED BY '\012'

STORED AS TEXTFILE

--hive-database

--hive-table✳✳

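Hive database and table names. --hive-database is rarely needed because --hive-table test.demo sets both at once; for multi-partition imports (HCatalog), however, the database must be given explicitly.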

--hive-partition-key✳

If the Hive target table is partitioned, give the partition key here. Only one key can be given; for more than one, use HCatalog.

--hive-partition-value✳

If the Hive target table is partitioned, give the partition value here.

--hive-overwrite✳

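Truncate first, then import. For a partitioned table only the given partition is cleared and the other partitions are untouched. Without this flag the default is to append.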

--hcatalog

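# Handles multi-partition imports. It does not support overwrite and cannot be combined with --hive-import. --hcatalog-database must be specified, otherwise the default database is used.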

bin/sqoop import \

--connect jdbc:mysql://dream3:3306/test \

--username root \

--password root \

--table demo \

--null-string '\\N' \

--null-non-string '\\N' \

--hcatalog-database test \

--hcatalog-table demo \

--hcatalog-partition-keys part_id,part_name \

--hcatalog-partition-values 1,2 \

--m 1

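------------Incremental import options------------

# An incremental import can also be done with --query. All these options do is help identify which rows are new; if you can write SQL that identifies them yourself, you do not need them.

--incremental lastmodified/append

# --incremental lastmodified/append

# lastmodified: also picks up modified rows, typically used with a modify_time column; not supported when importing into Hive

# append: only picks up newly added rows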

--check-column

Column to check for new or changed rows

--check-column create_date

--last-value

Value of the check column reached by the previous import

#--last-value '2022-05-27 10:21:07.0'
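As noted at the top of this section, the same effect can be achieved with --query. A minimal sketch, assuming your own script keeps track of the last imported value; the variable names last_value and query are hypothetical, and the snippet only builds and prints the query rather than running Sqoop.

```shell
# Highest create_date seen by the previous run (tracked by your own script).
last_value='2022-05-27 10:21:07'

# Double quotes let ${last_value} expand; \$ keeps $CONDITIONS literal for Sqoop.
query="select * from demo where create_date > '${last_value}' and \$CONDITIONS"

# Print the query for review. In a real script it would be passed to
# bin/sqoop import as: --query "$query"
printf '%s\n' "$query"
```

Building the query in a variable keeps the date filter and the $CONDITIONS placeholder in one place, so only last_value changes between runs.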

Export (Hive/HDFS to MySQL)

bin/sqoop export \

--connect jdbc:mysql://dream3:3306/test \

--username root \

--password root \

--columns "id,versions" \

--table demo2 \

--export-dir "/user/hive/warehouse/test.db/demo/part_id=14" \

--input-null-string '\\N' \

--input-null-non-string '\\N' \

--update-mode allowinsert \

--update-key id \

--batch

Parameter reference

------------DBMS options------------

--connect✳✳

JDBC connection string of the target database, together with the credentials

--connect jdbc:mysql://dream3:3306/test \

--username root \

--password root \

--columns

Columns to export; if omitted, all columns are exported

--columns "id,versions"

--table✳✳

Table to export into. It must be created beforehand, otherwise the export fails.

--table demo2

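------------HDFS options------------

--export-dir✳✳

Path of the files to export

This is also how you choose which partition to export, so a partitioned table has to be exported one partition at a time.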

--export-dir "/user/hive/warehouse/test.db/demo/part_id=14"
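Exporting one partition at a time is easy to script with a loop. A minimal sketch, assuming the connection settings and table from the export example above; the partition values 13 and 15 are made up, and the commands are only printed so they can be reviewed first.

```shell
# Hive warehouse directory of the partitioned table.
table_dir=/user/hive/warehouse/test.db/demo

# Partition values to export; 13 and 15 are made-up examples.
for part_id in 13 14 15; do
  export_dir="${table_dir}/part_id=${part_id}"
  # Print the command for review; remove the leading "echo" to run it.
  echo bin/sqoop export \
    --connect jdbc:mysql://dream3:3306/test \
    --username root --password root \
    --table demo2 \
    --export-dir "$export_dir"
done
```

The remaining options from the export example (null handling, update mode, and so on) would be appended to each command in the same way.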

--input-fields-terminated-by✳

Field delimiter of the Hive data files being read. The default is ',' (comma).

--input-fields-terminated-by '\001'

--input-null-string ✳✳

--input-null-non-string✳✳

What in the HDFS files should be treated as NULL

--input-null-non-string '\\N'

--input-null-string '\\N'

------------Duplicate-data handling------------

--update-mode✳

Selects the update mode.

# --update-mode updateonly: only rows whose id already exists in MySQL are updated; rows with a new id are not inserted

# --update-mode allowinsert: rows with an existing id are updated and rows with a new id are inserted

--update-mode allowinsert

--update-key✳

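Key column used to recognize two rows as the same record.

1. If data in this run already exists in the DBMS and no update mode is specified, the export fails with a duplicate primary key error.

2. If the Hive data itself contains duplicates and no update mode is specified, it fails with the same duplicate primary key error.

So although this parameter is optional, it is best to set it.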

Caused by: java.sql.BatchUpdateException: Duplicate entry '2' for key 'PRIMARY'
