Sqoop Common Commands Reference

江湖人称涛哥
Article link: https://blog.csdn.net/coderblack/article/details/103423279
## Test commands: list all databases and tables in MySQL
sqoop list-databases \
--connect jdbc:mysql://doit03:3306 \
--username root \
--password root

sqoop list-tables \
--connect jdbc:mysql://doit03:3306/doit_mall \
--username root \
--password root
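
Passing `--password` on the command line leaves the password in the shell history and the process list. Sqoop also accepts `-P` (prompt on the console) and `--password-file`. The sketch below shows the password-file variant for the listing command above. The temporary directory is only there to keep the demo self-contained; in practice a fixed, access-controlled path would be used. The `file://` prefix assumes the file lives on the local filesystem rather than on HDFS.

```shell
# Throwaway private directory for this demo (mktemp -d creates it with mode 700).
PW_DIR=$(mktemp -d)
PW_FILE="$PW_DIR/mysql.pw"

# printf writes the password without a trailing newline; a newline would
# otherwise be read as part of the password.
printf '%s' 'root' > "$PW_FILE"
chmod 400 "$PW_FILE"

# Only call sqoop when it is actually installed on this machine.
if command -v sqoop >/dev/null 2>&1; then
  sqoop list-databases \
    --connect jdbc:mysql://doit03:3306 \
    --username root \
    --password-file "file://${PW_FILE}"
fi
```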


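## Test command: import data from MySQL into a specified HDFS directory
## A note on parallelism: a single map task pulls data from MySQL at roughly 4-5 MB/s, while the MySQL server can sustain roughly 40-50 MB/s
## So when the MySQL tables are large, consider raising the number of parallel map tasks to speed up the transfer
## -m sets the number of parallel map tasks
## Question to consider: once there are several map tasks, how is the work divided among them?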
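
In rough terms, the answer to the question above is that Sqoop asks the database for the minimum and maximum of the `--split-by` column and hands each map task one contiguous sub-range. The sketch below imitates that for an integer column using plain shell arithmetic. The function name `split_ranges` and the sample bounds are made up for illustration, and the real splitter handles rounding and non-integer columns differently, so treat this as a simplified model rather than Sqoop's actual code.

```shell
# Simplified model of integer range splitting (not Sqoop's real implementation).
# Usage: split_ranges COLUMN MIN MAX NUM_MAPPERS
# Prints one WHERE condition per map task.
split_ranges() {
  col=$1; min=$2; max=$3; n=$4
  span=$(( max - min + 1 ))
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( min + span * i / n ))
    hi=$(( min + span * (i + 1) / n ))
    if [ "$i" -eq $(( n - 1 )) ]; then
      # The last range is closed so the row holding MAX is included.
      echo "$col >= $lo AND $col <= $max"
    else
      echo "$col >= $lo AND $col < $hi"
    fi
    i=$(( i + 1 ))
  done
}

# Two map tasks over stu_id values 1..100.
split_ranges stu_id 1 100 2
```

Because the ranges are cut by value rather than by row count, a split column whose values are unevenly distributed gives some map tasks far more rows than others.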

## To make Sqoop treat the target directory as an HDFS path, this setting must be correct:
# core-site.xml
# <property>
# <name>fs.defaultFS</name>
# <value>hdfs://h1:8020/</value>
# </property>

## To make Sqoop submit its MapReduce job to YARN, this setting must be correct:
# mapred-site.xml
# <property>
# <name>mapreduce.framework.name</name>
# <value>yarn</value>
# </property>

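# --delete-target-dir removes the target path first if it already exists
# (a comment line cannot sit between the continuation lines, or it cuts the command short)
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--table doit_jw_stu_base \
--target-dir /sqoopdata/doit_jw_stu_base \
--fields-terminated-by ',' \
--delete-target-dir \
--split-by stu_id \
-m 2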

# Optionally choose the output file format by adding one of these to the import command
--as-avrodatafile 
--as-parquetfile  
--as-sequencefile 
--as-textfile     

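## To compress the output
--compression-codec gzip

## Null handling
# Input side (export: the string in the HDFS files that should be read as NULL):
--input-null-non-string   <null-str>
--input-null-string  <null-str>
# Output side (import: the string to write to HDFS for a database NULL):
--null-non-string   <null-str>
--null-string  <null-str>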


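## Without a numeric primary key, a text column can also be used to split the work among map tasks, but this requires an extra -D option, as follows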
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password root \
--table noid \
--target-dir /sqooptest3  \
--fields-terminated-by ',' \
--split-by name \
-m 2 



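## Import MySQL data into Hive
## What actually happens: the data is first imported from MySQL into HDFS; Sqoop then uses Hive's metadata JARs to create the corresponding metadata in the Hive metastore and moves the data files into the Hive table directory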
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--table doit_jw_stu_base \
--hive-import \
--hive-table yiee_dw.doit_jw_stu_base \
--delete-target-dir \
--as-textfile \
--fields-terminated-by ',' \
--compress   \
--compression-codec gzip \
--split-by stu_id \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
-m 2
# --hive-database xdb 



## Conditional import: --where
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--table doit_jw_stu_base \
--hive-import \
--hive-table yiee_dw.doit_jw_stu_base2 \
--delete-target-dir \
--as-textfile \
--fields-terminated-by ',' \
--compress   \
--compression-codec gzip \
--split-by stu_id \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
--where "stu_age>25"  \
-m 2


## Conditional import: --columns selects the columns to import
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--table doit_jw_stu_base \
--hive-import \
--hive-table yiee_dw.doit_jw_stu_base3 \
--delete-target-dir \
--as-textfile \
--fields-terminated-by ',' \
--compress   \
--compression-codec gzip \
--split-by stu_id \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
--where "stu_age>25"  \
--columns "stu_id,stu_name,stu_phone"   \
-m 2

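## Query import: --query
#  When --query is used, do not also pass --table, --where, or --columns

## A free-form --query import must include the $CONDITIONS placeholder in the SQL: either "where $CONDITIONS" or, for example, "where id>20 and $CONDITIONS"
## Why? Sqoop hands your SQL statement to several map tasks, and each map task has to add the range condition for its own share of the work,
## so $CONDITIONS is provided as the placeholder where that condition is spliced in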
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--hive-import \
--hive-table yiee_dw.doit_jw_stu_base4  \
--as-textfile \
--fields-terminated-by ',' \
--compress   \
--compression-codec gzip \
--split-by stu_id \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite  \
--query 'select stu_id,stu_name,stu_age,stu_term from doit_jw_stu_base where stu_createtime>"2019-09-24 23:59:59" and stu_sex="1" and $CONDITIONS'  \
--target-dir '/user/root/tmp'   \
-m 2
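
The placeholder only works if it reaches Sqoop as the literal text `$CONDITIONS`, so shell quoting matters. The sketch below shows the two quoting styles and then fills the placeholder with a made-up range, purely to illustrate what a single map task's query might look like. The helper `fill_conditions` and the sample range are hypothetical, and the exact condition text Sqoop generates may differ.

```shell
# Single quotes: the shell leaves $CONDITIONS untouched.
q_single='select stu_id from doit_jw_stu_base where stu_sex="1" and $CONDITIONS'

# Double quotes: the dollar sign must be escaped to get the same string.
q_double="select stu_id from doit_jw_stu_base where stu_sex=\"1\" and \$CONDITIONS"

# Illustration only: splice a range condition into the query template.
# $1 = query template, $2 = condition text
fill_conditions() {
  printf '%s\n' "$1" | sed 's/\$CONDITIONS/('"$2"')/'
}

fill_conditions "$q_single" 'stu_id >= 1 AND stu_id < 51'
```

With double quotes and no backslash, the shell would expand `$CONDITIONS` itself (usually to an empty string) before Sqoop ever sees the query.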



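## --query supports complex queries (joins, subqueries, GROUP BY), but think carefully about how the intended logic of your SQL interacts with the fact that each map task runs it on its own slice in parallel!
# The same query in double quotes (the $ must be escaped) and in single quotes (no escaping needed):
# --query "select id,member_id,order_sn,receiver_province from doit_mall.oms_order where id>20 and \$CONDITIONS"
# --query 'select id,member_id,order_sn,receiver_province from doit_mall.oms_order where id>20 and $CONDITIONS'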
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--hive-import \
--hive-table yiee_dw.doit_jw_stu_base6 \
--as-textfile \
--fields-terminated-by ',' \
--compress   \
--compression-codec gzip \
--split-by id \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite  \
--query 'select b.id,a.stu_id,a.stu_name,a.stu_phone,a.stu_sex,b.stu_zsroom from doit_jw_stu_base a join doit_jw_stu_zsgl b on a.stu_id=b.stu_id where $CONDITIONS' \
--target-dir '/user/root/tmp'   \
-m 2


## Incremental import 1: use a monotonically increasing column to identify the new rows
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--table doit_jw_stu_zsgl \
--hive-import \
--hive-table yiee_dw.doit_jw_stu_zsgl \
--split-by id \
--incremental append \
--check-column id \
--last-value 40 \
-m 2 
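
Passing `--last-value` by hand means someone has to remember the new value after every run. Sqoop's saved jobs keep track of it instead: after each execution the job stores the updated last value. The sketch below wraps the import above in a saved job. The job name and the password-file path are assumptions made for this example (saved jobs do not store a plain `--password` by default), and the function `job_args` is just a local helper.

```shell
# Hypothetical job name and password-file path; both are assumptions.
JOB_NAME=zsgl_incr
PW_FILE=/user/root/.mysql.pw

# Print the saved-job arguments, one per line, so they can be inspected.
job_args() {
  printf '%s\n' job --create "$JOB_NAME" -- import \
    --connect jdbc:mysql://h3:3306/ry \
    --username root \
    --password-file "$PW_FILE" \
    --table doit_jw_stu_zsgl \
    --hive-import \
    --hive-table yiee_dw.doit_jw_stu_zsgl \
    --split-by id \
    --incremental append \
    --check-column id \
    --last-value 40 \
    -m 2
}

if command -v sqoop >/dev/null 2>&1; then
  # None of the arguments contain spaces, so plain word splitting is safe here.
  sqoop $(job_args)               # define the job once
  sqoop job --exec "$JOB_NAME"    # run it; repeat this line for each later increment
else
  echo "sqoop $(job_args | tr '\n' ' ')"   # dry run when sqoop is not installed
fi
```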

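## Incremental import 2: use a modification timestamp to identify new and changed rows; this requires a timestamp column that is updated whenever the row is modified
## Incremental import in lastmodified mode does not support Hive import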
sqoop import \
--connect jdbc:mysql://h3:3306/ry \
--username root \
--password haitao.211123 \
--table doit_jw_stu_zsgl \
--target-dir '/sqoopdata/doit_jw_stu_zsgl'  \
--incremental lastmodified \
--check-column stu_updatetime \
--last-value '2019-09-30 23:59:59'  \
--fields-terminated-by ',' \
--merge-key id   \
-m 1 

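# Whether the imported rows are simply appended or merged with the existing data; two options:
--append          # append the incremental rows to the target as-is
--merge-key id    # do not simply append; merge new and old rows on the key column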
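
The difference between the two options can be shown on two small CSV files. The sketch below is only an emulation with `cat` and `awk`; the file names and sample rows are invented, and the real merge runs as a MapReduce job rather than like this.

```shell
# Illustration only: emulate --append vs --merge-key on small CSV files.
work=$(mktemp -d)
printf '%s\n' '1,roomA,2019-09-01' '2,roomB,2019-09-02' > "$work/old.csv"
printf '%s\n' '2,roomC,2019-10-01' '3,roomD,2019-10-02' > "$work/new.csv"

# --append: new rows are added next to the old ones, so id 2 appears twice.
cat "$work/old.csv" "$work/new.csv" > "$work/appended.csv"

# --merge-key id: for each id, the row from the newer dataset wins.
# The new file is read last, so its rows overwrite old rows with the same key.
awk -F',' '{ row[$1] = $0 } END { for (k in row) print row[k] }' \
  "$work/old.csv" "$work/new.csv" | sort -t',' -k1,1n > "$work/merged.csv"

cat "$work/merged.csv"
```

After an append, downstream queries have to deduplicate id 2 themselves; after a merge, the target holds exactly one row per key.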


## Appendix: complete list of import arguments
Table 3. Import control arguments:
Argument	Description
--append	Append data to an existing dataset in HDFS
--as-avrodatafile	Imports data to Avro Data Files
--as-sequencefile	Imports data to SequenceFiles
--as-textfile	Imports data as plain text (default)
--as-parquetfile	Imports data to Parquet Files
--boundary-query <statement>	Boundary query to use for creating splits
--columns <col,col,col…>	Columns to import from table
--delete-target-dir	Delete the import target directory if it exists
--direct	Use direct connector if exists for the database
--fetch-size <n>	Number of entries to read from database at once.
--inline-lob-limit <n>	Set the maximum size for an inline LOB
-m,--num-mappers <n>	Use n map tasks to import in parallel
-e,--query <statement>	Import the results of statement.
--split-by <column-name>	Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--split-limit <n>	Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
--autoreset-to-one-mapper	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.
--table <table-name>	Table to read
--target-dir <dir>	HDFS destination dir
--temporary-rootdir <dir>	HDFS directory for temporary files created during import (overrides default "_sqoop")
--warehouse-dir <dir>	HDFS parent for table destination
--where <where clause>	WHERE clause to use during import
-z,--compress	Enable compression
--compression-codec <c>	Use Hadoop codec (default gzip)
--null-string <null-string>	The string to be written for a null value for string columns
--null-non-string <null-string>	The string to be written for a null value for non-string columns



## Exporting data with Sqoop
sqoop  export \
--connect jdbc:mysql://h3:3306/dicts \
--username root \
--password haitao.211123 \
--table dau_t \
--export-dir '/user/hive/warehouse/dau_t' \
--batch   # execute the SQL statements in batch mode


## Choosing the update mode when exporting a mix of new and existing rows to MySQL
sqoop  export \
--connect jdbc:mysql://h3:3306/doit_mall \
--username root \
--password root \
--table person \
--export-dir '/export3/' \
--input-null-string 'NaN' \
--input-null-non-string 'NaN' \
--update-mode allowinsert  \
--update-key id \
--batch


## Appendix: export control arguments
Table 29. Export control arguments:

Argument	Description
--columns <col,col,col…>	Columns to export to table
--direct	Use direct export fast path
--export-dir <dir>	HDFS source path for the export
-m,--num-mappers <n>	Use n map tasks to export in parallel
--table <table-name>	Table to populate
--call <stored-proc-name>	Stored Procedure to call
--update-key <col-name>	Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode>	Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>	The string to be interpreted as null for string columns
--input-null-non-string <null-string>	The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>	The table in which data will be staged before being inserted into the destination table.
--clear-staging-table	Indicates that any data present in the staging table can be deleted.
--batch	Use batch mode for underlying statement execution.




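## Appendix:
-- Changing the character set of a MySQL database or table
Change the character set of a database:
mysql> alter database db_name character set utf8;
Change the character set of a table:
mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;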