sqoop command handbook

Installing sqoop

Upload the installation package and extract it.
Go into the conf directory and edit sqoop-env.sh:

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/app/hadoop-2.8.5/

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/app/hadoop-2.8.5/share/hadoop/mapreduce

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/app/hive-2.1.0/

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
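Before testing against MySQL, the JDBC driver has to be on sqoop's classpath, and the install can be smoke-tested with sqoop's built-in commands. A minimal sketch, assuming sqoop is unpacked under /opt/app/sqoop-1.4.7 (the path and connector jar version are examples, not from the original notes):

# put the MySQL JDBC driver on sqoop's classpath (jar name/version is an example)
cp mysql-connector-java-5.1.47.jar /opt/app/sqoop-1.4.7/lib/

# smoke-test the installation
/opt/app/sqoop-1.4.7/bin/sqoop version
/opt/app/sqoop-1.4.7/bin/sqoop help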

################ How sqoop works ##############################################################################

sqoop turns an import or export command into a MapReduce job and runs it;
the MR job reads from the source and writes to the target.

Import: the MR job reads a MySQL table and writes HDFS files.
Export: the MR job reads HDFS files and writes to a MySQL table.

Incremental import:
1. How to decide which rows are new: a) by a monotonically increasing column, or b) by a last-modified timestamp.
2. How to load the increment: --append simply appends it; --merge-key id merges the new rows with the existing data on the id column. (Full examples are in the "Incremental import" section below.)

Conditional import (full examples in the "Conditional import" section below):
1. --where
2. --columns
3. --query 'select id,name,age from t_stu where age>30 and $CONDITIONS'

############################ Test commands ##############################################

Test command: list all databases in MySQL

sqoop list-databases \
  --connect jdbc:mysql://doit03:3306 \
  --username root \
  --password root

List the tables in a given database

sqoop list-tables \
  --connect jdbc:mysql://doit03:3306/doit_mall \
  --username root \
  --password root

Test command: import data from MySQL into a given HDFS directory

A note on parallelism: a single map task pulls data from MySQL at roughly 4-5 MB/s, while a MySQL server can typically sustain 40-50 MB/s,

so when the MySQL table is large, you can raise the map-task parallelism to speed up the transfer.

-m is the option that sets the map-task parallelism.

Question: once there are several map tasks, how is the work divided between them?
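A sketch of the answer (my annotation, not in the original notes): sqoop runs a boundary query on the --split-by column and hands each map task one sub-range of it. Assuming a hypothetical table t_stu split on a numeric id column, with made-up boundary values:

# boundary query issued by sqoop (it can be overridden with --boundary-query):
#   SELECT MIN(id), MAX(id) FROM t_stu
# suppose it returns (1, 1000); with -m 2, each map task then reads one sub-range:
#   map task 0:  ... WHERE id >= 1   AND id < 500
#   map task 1:  ... WHERE id >= 500 AND id <= 1000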

To make sure sqoop treats the target directory as an HDFS path, core-site.xml must be configured correctly:

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://h1:8020/</value>
</property>

To make sure sqoop submits its MR job to YARN, mapred-site.xml must be configured correctly:

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

#################### Full import of a table from MySQL into HDFS #######################################################

Basic command

# --connect                JDBC connection string for the source database
# --username / --password  database credentials
# --table                  table to import
# --target-dir             HDFS directory to write the imported data to
# --fields-terminated-by   field delimiter for the output records
# --delete-target-dir      delete the target directory if it already exists
# --split-by               column used to split the work among map tasks
# -m 2                     run 2 map tasks in parallel
sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --target-dir /sqoopdata/doit_jw_stu_base \
  --fields-terminated-by ',' \
  --delete-target-dir \
  --split-by stu_id \
  -m 2

Detail 1: you can choose the format of the generated files

--as-avrodatafile
--as-parquetfile
--as-sequencefile
--as-textfile
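For example, a Parquet variant of the basic import above; a minimal sketch, where the _parquet suffix on the target directory is just an invented name. --fields-terminated-by is dropped because field delimiters only apply to text output:

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --target-dir /sqoopdata/doit_jw_stu_base_parquet \
  --delete-target-dir \
  --as-parquetfile \
  --split-by stu_id \
  -m 2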

Detail 2: if you want compression, enable it and pick a codec (the examples below pair the two options):

--compress
--compression-codec gzip

Detail 3: NULL handling

Used on import, i.e. controlling the output (the files written to hive/hdfs): which string to write in place of a real NULL
(the Hive imports below use '\N', which Hive reads back as NULL):

--null-string
--null-non-string

Used on export, i.e. the input direction: which string in the hive/hdfs files should be treated as NULL when reading:

--input-null-string
--input-null-non-string

Detail 4: if there is no suitable numeric column, a text column can also be used to split the tasks, but you have to add a -D property, as follows:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password root \
  --table noid \
  --target-dir /sqooptest3 \
  --fields-terminated-by ',' \
  --split-by name \
  -m 2

####################### Importing MySQL data into Hive ######################################################################################

Importing MySQL data into Hive

How it actually works: the data is first imported from MySQL into HDFS; sqoop then uses Hive's metadata-handling jars to create the corresponding metadata in the Hive metastore and moves the data files into the Hive table's directory.

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --hive-import \
  --hive-table titan.ODS_doit_jw_stu_base \
  --delete-target-dir \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  -m 2

(The target Hive database can also be given separately with --hive-database xdb instead of prefixing it in --hive-table.)

####################### Conditional import ######################################################################################

Conditional import: --where

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --hive-import \
  --hive-table titan.doit_jw_stu_base2 \
  --delete-target-dir \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --where "stu_age>25" \
  -m 2

Conditional import: --columns selects which columns to import

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_base \
  --hive-import \
  --hive-table titan.doit_jw_stu_base3 \
  --delete-target-dir \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --where "stu_age>25" \
  --columns "stu_id,stu_name,stu_phone" \
  -m 2

Query import: --query lets you supply a full query

Once you use --query, do not also use --table, --where, or --columns.

With a free-form --query import, the SQL must contain $CONDITIONS: either "where $CONDITIONS" on its own, or combined with your own filter, e.g. "where id>20 and $CONDITIONS".

Why? Because sqoop hands your SQL to several different map tasks, and each map task has to add a range condition matching its share of the work,

so $CONDITIONS is provided as the placeholder where that condition will be spliced in.
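To illustrate (my annotation; the real boundaries come from sqoop's boundary query on the --split-by column), with -m 2 each map task runs the query with $CONDITIONS replaced by its own range, e.g. for "where id>20 and $CONDITIONS" with made-up boundaries (1, 1000):

#   map task 0:  select ... where id>20 and ( id >= 1   AND id < 500 )
#   map task 1:  select ... where id>20 and ( id >= 500 AND id <= 1000 )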

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --hive-import \
  --hive-table titan.doit_jw_stu_base3 \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by stu_id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --delete-target-dir \
  --query 'select stu_id,stu_name,stu_age,stu_term from doit_jw_stu_base where stu_createtime>"2019-09-24 23:59:59" and stu_sex="1" and $CONDITIONS' \
  --target-dir '/user/root/tmp' \
  -m 2

--query also supports complex queries (joins, subqueries, group-bys), but think hard about how the intended semantics of your SQL interact with the fact that the work is split across map tasks, each running the query over only part of the data!

It is usually better not to do this; it can produce results you did not expect.

Also mind the quoting: inside double quotes the shell would expand $CONDITIONS, so it must be escaped; inside single quotes it is passed through literally:

--query "select id,member_id,order_sn,receiver_province from doit_mall.oms_order where id>20 and \$CONDITIONS"

--query 'select id,member_id,order_sn,receiver_province from doit_mall.oms_order where id>20 and $CONDITIONS'

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --hive-import \
  --hive-table yiee_dw.doit_jw_stu_base6 \
  --as-textfile \
  --fields-terminated-by ',' \
  --compress \
  --compression-codec gzip \
  --split-by id \
  --null-string '\N' \
  --null-non-string '\N' \
  --hive-overwrite \
  --query 'select b.id,a.stu_id,a.stu_name,a.stu_phone,a.stu_sex,b.stu_zsroom from doit_jw_stu_base a join doit_jw_stu_zsgl b on a.stu_id=b.stu_id where $CONDITIONS' \
  --target-dir '/user/root/tmp' \
  -m 2

################################### Incremental import ##################################################################################

Incremental import 1: identify new rows by a monotonically increasing column

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_zsgl \
  --hive-import \
  --hive-table yiee_dw.doit_jw_stu_zsgl \
  --split-by id \
  --incremental append \
  --check-column id \
  --last-value 40 \
  -m 2
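One thing the command above leaves open is who remembers the new --last-value for the next run. A common approach (not part of the original notes) is to wrap the import in a saved sqoop job, which stores and updates the last value in sqoop's metastore; a minimal sketch with an invented job name:

# create the saved job once (note the bare "--" before "import");
# in practice a saved job would use --password-file rather than a plain --password
sqoop job --create inc_doit_jw_stu_zsgl -- import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_zsgl \
  --hive-import \
  --hive-table yiee_dw.doit_jw_stu_zsgl \
  --split-by id \
  --incremental append \
  --check-column id \
  --last-value 0 \
  -m 2

# run it repeatedly; sqoop updates the stored last-value after each successful run
sqoop job --exec inc_doit_jw_stu_zsgl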

Incremental import 2: identify new rows by their last-modified time; this requires a timestamp column that is updated whenever a row is modified.

lastmodified-mode incremental imports do not support importing directly into Hive.

sqoop import \
  --connect jdbc:mysql://h3:3306/ry \
  --username root \
  --password haitao.211123 \
  --table doit_jw_stu_zsgl \
  --target-dir '/sqoopdata/doit_jw_stu_zsgl' \
  --incremental lastmodified \
  --check-column stu_updatetime \
  --last-value '2019-09-30 23:59:59' \
  --fields-terminated-by ',' \
  --merge-key id \
  -m 1

Whether the imported increment is simply appended or merged with the existing data is a choice between two options:

--append         # the incremental data is appended to the target storage as-is
--merge-key id   # the incremental data is not simply appended; old and new rows are merged on the id column

Appendix: full list of import control arguments

Table 3. Import control arguments:
Argument    Description
--append    Append data to an existing dataset in HDFS
--as-avrodatafile    Imports data to Avro Data Files
--as-sequencefile    Imports data to SequenceFiles
--as-textfile    Imports data as plain text (default)
--as-parquetfile    Imports data to Parquet Files
--boundary-query <statement>    Boundary query to use for creating splits
--columns <col,col,col…>    Columns to import from table
--delete-target-dir    Delete the import target directory if it exists
--direct    Use direct connector if exists for the database
--fetch-size <n>    Number of entries to read from database at once.
--inline-lob-limit <n>    Set the maximum size for an inline LOB
-m,--num-mappers <n>    Use n map tasks to import in parallel
-e,--query <statement>    Import the results of statement.
--split-by <column-name>    Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--split-limit <n>    Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
--autoreset-to-one-mapper    Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by option.
--table <table-name>    Table to read
--target-dir <dir>    HDFS destination dir
--temporary-rootdir <dir>    HDFS directory for temporary files created during import (overrides default "_sqoop")
--warehouse-dir <dir>    HDFS parent for table destination
--where <where clause>    WHERE clause to use during import
-z,--compress    Enable compression
--compression-codec <c>    Use Hadoop codec (default gzip)
--null-string <null-string>    The string to be written for a null value for string columns
--null-non-string <null-string>    The string to be written for a null value for non-string columns

################################### Data export ##############################################################################

Exporting data with sqoop

sqoop export \
  --connect jdbc:mysql://h3:3306/dicts \
  --username root \
  --password haitao.211123 \
  --table dau_t \
  --export-dir '/user/hive/warehouse/dau_t' \
  --batch            # run the underlying SQL statements in batch mode
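The files under --export-dir have to be parseable into the target table's columns. If the Hive table was written as ','-delimited text with NULLs stored as \N (as in the imports above), the export would typically also declare that; a sketch, not from the original notes:

sqoop export \
  --connect jdbc:mysql://h3:3306/dicts \
  --username root \
  --password haitao.211123 \
  --table dau_t \
  --export-dir '/user/hive/warehouse/dau_t' \
  --input-fields-terminated-by ',' \
  --input-null-string '\N' \
  --input-null-non-string '\N' \
  --batch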

Controlling how new vs. existing rows are written to MySQL: choose an update mode

sqoop export \
  --connect jdbc:mysql://h3:3306/doit_mall \
  --username root \
  --password root \
  --table person \
  --export-dir '/export3/' \
  --input-null-string 'NaN' \
  --input-null-non-string 'NaN' \
  --update-mode allowinsert \
  --update-key id \
  --batch

Appendix: export control arguments

Table 29. Export control arguments:

Argument    Description
--columns <col,col,col…>    Columns to export to table
--direct    Use direct export fast path
--export-dir <dir>    HDFS source path for the export
-m,--num-mappers <n>    Use n map tasks to export in parallel
--table <table-name>    Table to populate
--call <stored-proc-name>    Stored Procedure to call
--update-key <col-name>    Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode>    Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>    The string to be interpreted as null for string columns
--input-null-non-string <null-string>    The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>    The table in which data will be staged before being inserted into the destination table.
--clear-staging-table    Indicates that any data present in the staging table can be deleted.
--batch    Use batch mode for underlying statement execution.

Appendix:

-- changing the character set of a MySQL database or table
Change a database's character set:
mysql> alter database db_name character set utf8;
Change a table's character set:
mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
