sqoop command handbook

Installing sqoop

Upload the installation package and extract it.
Go into the conf directory and edit sqoop-env.sh:

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/app/hadoop-2.8.5/

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/app/hadoop-2.8.5/share/hadoop/mapreduce

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/app/hive-2.1.0/

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
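Before testing against MySQL, the JDBC driver has to be on sqoop's classpath, and the install can be smoke-tested with sqoop's built-in commands. A minimal sketch, assuming sqoop is unpacked under /opt/app/sqoop-1.4.7 (the path and connector jar version are examples, not from the original notes):

# put the MySQL JDBC driver on sqoop's classpath (jar name/version is an example)
cp mysql-connector-java-5.1.47.jar /opt/app/sqoop-1.4.7/lib/

# smoke-test the installation
/opt/app/sqoop-1.4.7/bin/sqoop version
/opt/app/sqoop-1.4.7/bin/sqoop help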

################ How sqoop works ##############################################################################

sqoop turns an import or export command into a MapReduce job and runs it;
the MR job reads from the source and writes to the target.

Import: the MR job reads a MySQL table and writes HDFS files.
Export: the MR job reads HDFS files and writes to a MySQL table.

Incremental import:
1. How to decide which rows are new: a) by a monotonically increasing column, or b) by a last-modified timestamp.
2. How to load the increment: --append simply appends it; --merge-key id merges the new rows with the existing data on the id column. (Full examples are in the "Incremental import" section below.)

Conditional import (full examples in the "Conditional import" section below):
1. --where
2. --columns
3. --query 'select id,name,age from t_stu where age>30 and $CONDITIONS'

############################ Test commands ##############################################

Test command: list all databases in MySQL

sqoop list-databases \
  --connect jdbc:mysql://doit03:3306 \
  --username root \
  --password root

List the tables in a given database

sqoop list-tables \
  --connect jdbc:mysql://doit03:3306/doit_mall \
  --username root \
  --password root

Test command: import data from MySQL into a given HDFS directory

A note on parallelism: a single map task pulls data from MySQL at roughly 4-5 MB/s, while a MySQL server can typically sustain 40-50 MB/s,

so when the MySQL table is large, you can raise the map-task parallelism to speed up the transfer.

-m is the option that sets the map-task parallelism.

Question: once there are several map tasks, how is the work divided between them?
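A sketch of the answer (my annotation, not in the original notes): sqoop runs a boundary query on the --split-by column and hands each map task one sub-range of it. Assuming a hypothetical table t_stu split on a numeric id column, with made-up boundary values:

# boundary query issued by sqoop (it can be overridden with --boundary-query):
#   SELECT MIN(id), MAX(id) FROM t_stu
# suppose it returns (1, 1000); with -m 2, each map task then reads one sub-range:
#   map task 0:  ... WHERE id >= 1   AND id < 500
#   map task 1:  ... WHERE id >= 500 AND id <= 1000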

To make sure sqoop treats the target directory as an HDFS path, core-site.xml must be configured correctly:

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://h1:8020/</value>
</property>

To make sure sqoop submits its MR job to YARN, mapred-site.xml must be configured correctly:

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

#################### Full import of a table from MySQL into HDFS #######################################################

Basic command

# --connect                JDBC connection string for the source database
# --username / --password  database credentials
# --table                  table to import
# --target-dir             HDFS directory to write the imported data to
# --fields-terminated-by   field delimiter for the output records
# --delete-target-dir      delete the target directory if it already exists
# --split-by               column used to split the work among map tasks
# -m 2                     run 2 map tasks in parallel
sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --target-dir /sqoopdata/doit_jw_stu_base \
  --fields-terminated-by ',' \
  --delete-target-dir \
  --split-by stu_id \
  -m 2

Detail 1: you can choose the format of the generated files

--as-avrodatafile
--as-parquetfile
--as-sequencefile
--as-textfile
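For example, a Parquet variant of the basic import above; a minimal sketch, where the _parquet suffix on the target directory is just an invented name. --fields-terminated-by is dropped because field delimiters only apply to text output:

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --target-dir /sqoopdata/doit_jw_stu_base_parquet \
  --delete-target-dir \
  --as-parquetfile \
  --split-by stu_id \
  -m 2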

Detail 2: if you want compression, enable it and pick a codec (the examples below pair the two options):

--compress
--compression-codec gzip

Detail 3: NULL handling

Used on import, i.e. controlling the output (the files written to hive/hdfs): which string to write in place of a real NULL
(the Hive imports below use '\N', which Hive reads back as NULL):

--null-string
--null-non-string

Used on export, i.e. the input direction: which string in the hive/hdfs files should be treated as NULL when reading:

--input-null-string
--input-null-non-string

Detail 4: if there is no suitable numeric column, a text column can also be used to split the tasks, but you have to add a -D property, as follows:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password root \
  --table noid \
  --target-dir /sqooptest3 \
  --fields-terminated-by ',' \
  --split-by name \
  -m 2

####################### Importing MySQL data into Hive ######################################################################################

Importing MySQL data into Hive

How it actually works: the data is first imported from MySQL into HDFS; sqoop then uses Hive's metadata-handling jars to create the corresponding metadata in the Hive metastore and moves the data files into the Hive table's directory.

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --hive-import \
  --hive-table titan.ODS_doit_jw_stu_base \
  --delete-target-dir \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  -m 2

(The target Hive database can also be given separately with --hive-database xdb instead of prefixing it in --hive-table.)

####################### Conditional import ######################################################################################

Conditional import: --where

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --hive-import \
  --hive-table titan.doit_jw_stu_base2 \
  --delete-target-dir \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --where "stu_age>25" \
  -m 2

Conditional import: --columns selects which columns to import

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --hive-import \
  --hive-table titan.doit_jw_stu_base3 \
  --delete-target-dir \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --where "stu_age>25" \
  --columns "stu_id,stu_name,stu_phone" \
  -m 2

Query import: --query lets you supply a full query

Once you use --query, do not also use --table, --where, or --columns.

With a free-form --query import, the SQL must contain $CONDITIONS: either "where $CONDITIONS" on its own, or combined with your own filter, e.g. "where id>20 and $CONDITIONS".

Why? Because sqoop hands your SQL to several different map tasks, and each map task has to add a range condition matching its share of the work,

so $CONDITIONS is provided as the placeholder where that condition will be spliced in.
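To illustrate (my annotation; the real boundaries come from sqoop's boundary query on the --split-by column), with -m 2 each map task runs the query with $CONDITIONS replaced by its own range, e.g. for "where id>20 and $CONDITIONS" with made-up boundaries (1, 1000):

#   map task 0:  select ... where id>20 and ( id >= 1   AND id < 500 )
#   map task 1:  select ... where id>20 and ( id >= 500 AND id <= 1000 )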

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --hive-import \
  --hive-table titan.doit_jw_stu_base3 \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --delete-target-dir \
  --query 'select stu_id,stu_name,stu_age,stu_term from doit_jw_stu_base where stu_createtime>"2019-09-24 23:59:59" and stu_sex="1" and $CONDITIONS' \
  --target-dir '/user/root/tmp' \
  -m 2

--query also supports complex queries (joins, subqueries, group-bys), but think hard about how the intended semantics of your SQL interact with the fact that the work is split across map tasks, each running the query over only part of the data!

It is usually better not to do this; it can produce results you did not expect.

Also mind the quoting: inside double quotes the shell would expand $CONDITIONS, so it must be escaped; inside single quotes it is passed through literally:

--query "select id,member_id,order_sn,receiver_province from doit_mall.oms_order where id>20 and \$CONDITIONS"

--query 'select id,member_id,order_sn,receiver_province from doit_mall.oms_order where id>20 and $CONDITIONS'

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --hive-import \
  --hive-table yiee_dw.doit_jw_stu_base6 \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --query 'select b.id,a.stu_id,a.stu_name,a.stu_phone,a.stu_sex,b.stu_zsroom from doit_jw_stu_base a join doit_jw_stu_zsgl b on a.stu_id=b.stu_id where $CONDITIONS' \
  --target-dir '/user/root/tmp' \
  -m 2

################################### Incremental import ##################################################################################

Incremental import 1: identify new rows by a monotonically increasing column

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_zsgl \
  --hive-import \
  --hive-table yiee_dw.doit_jw_stu_zsgl \
  --split-by id \
  --incremental append \
  --check-column id \
  --last-value 40 \
  -m 2
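One thing the command above leaves open is who remembers the new --last-value for the next run. A common approach (not part of the original notes) is to wrap the import in a saved sqoop job, which stores and updates the last value in sqoop's metastore; a minimal sketch with an invented job name:

# create the saved job once (note the bare "--" before "import");
# in practice a saved job would use --password-file rather than a plain --password
sqoop job --create inc_doit_jw_stu_zsgl -- import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_zsgl \
  --hive-import \
  --hive-table yiee_dw.doit_jw_stu_zsgl \
  --split-by id \
  --incremental append \
  --check-column id \
  --last-value 0 \
  -m 2

# run it repeatedly; sqoop updates the stored last-value after each successful run
sqoop job --exec inc_doit_jw_stu_zsgl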

Incremental import 2: identify new rows by their last-modified time; this requires a timestamp column that is updated whenever a row is modified.

lastmodified-mode incremental imports do not support importing directly into Hive.

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_zsgl \
  --target-dir '/sqoopdata/doit_jw_stu_zsgl' \
  --incremental lastmodified \
  --check-column stu_updatetime \
  --last-value '2019-09-30 23:59:59' \
  --fields-terminated-by ',' \
  --merge-key id \
  -m 1

Whether the imported increment is simply appended or merged with the existing data is a choice between two options:

--append         # the incremental data is appended to the target storage as-is
--merge-key id   # the incremental data is not simply appended; old and new rows are merged on the id column

Appendix: full list of import control arguments

Table 3. Import control arguments:
Argument    Description
--append    Append data to an existing dataset in HDFS
--as-avrodatafile    Imports data to Avro Data Files
--as-sequencefile    Imports data to SequenceFiles
--as-textfile    Imports data as plain text (default)
--as-parquetfile    Imports data to Parquet Files
--boundary-query <statement>    Boundary query to use for creating splits
--columns <col,col,col…>    Columns to import from table
--delete-target-dir    Delete the import target directory if it exists
--direct    Use direct connector if exists for the database
--fetch-size <n>    Number of entries to read from database at once.
--inline-lob-limit <n>    Set the maximum size for an inline LOB
-m,--num-mappers <n>    Use n map tasks to import in parallel
-e,--query <statement>    Import the results of statement.
--split-by <column-name>    Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--split-limit <n>    Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
--autoreset-to-one-mapper    Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by option.
--table <table-name>    Table to read
--target-dir <dir>    HDFS destination dir
--temporary-rootdir <dir>    HDFS directory for temporary files created during import (overrides default "_sqoop")
--warehouse-dir <dir>    HDFS parent for table destination
--where <where clause>    WHERE clause to use during import
-z,--compress    Enable compression
--compression-codec <c>    Use Hadoop codec (default gzip)
--null-string <null-string>    The string to be written for a null value for string columns
--null-non-string <null-string>    The string to be written for a null value for non-string columns

################################### Data export ##############################################################################

Exporting data with sqoop

sqoop export \
  --connect jdbc:mysql://h3:3306/dicts \
  --username root \
  --password haitao.211123 \
  --table dau_t \
  --export-dir '/user/hive/warehouse/dau_t' \
  --batch            # run the underlying SQL statements in batch mode
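The files under --export-dir have to be parseable into the target table's columns. If the Hive table was written as ','-delimited text with NULLs stored as \N (as in the imports above), the export would typically also declare that; a sketch, not from the original notes:

sqoop export \
  --connect jdbc:mysql://h3:3306/dicts \
  --username root \
  --password haitao.211123 \
  --table dau_t \
  --export-dir '/user/hive/warehouse/dau_t' \
  --input-fields-terminated-by ',' \
  --input-null-string '\N' \
  --input-null-non-string '\N' \
  --batch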

Controlling how new vs. existing rows are written to MySQL: choose an update mode

sqoop export \
  --connect jdbc:mysql://h3:3306/doit_mall \
  --username root \
  --password root \
  --table person \
  --export-dir '/export3/' \
  --input-null-string 'NaN' \
  --input-null-non-string 'NaN' \
  --update-mode allowinsert \
  --update-key id \
  --batch

Appendix: export control arguments

Table 29. Export control arguments:

Argument    Description
--columns <col,col,col…>    Columns to export to table
--direct    Use direct export fast path
--export-dir <dir>    HDFS source path for the export
-m,--num-mappers <n>    Use n map tasks to export in parallel
--table <table-name>    Table to populate
--call <stored-proc-name>    Stored Procedure to call
--update-key <col-name>    Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode>    Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>    The string to be interpreted as null for string columns
--input-null-non-string <null-string>    The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>    The table in which data will be staged before being inserted into the destination table.
--clear-staging-table    Indicates that any data present in the staging table can be deleted.
--batch    Use batch mode for underlying statement execution.

Appendix:

-- changing the character set of a MySQL database or table
Change a database's character set:
mysql> alter database db_name character set utf8;
Change a table's character set:
mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
