Sqoop setup and usage: conditional and incremental imports, MySQL to HDFS, Hive, and HBase, HDFS to MySQL, and generating a jar locally

房石阳明i

Last modified: 2023-03-03 08:35:06

Tags: sqoop, apache, big data, hive, mysql

First published: 2022-11-07 16:11:00

Article link: https://blog.csdn.net/Mogeko1/article/details/127689103

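Sqoop is a tool that moves data between a relational database (e.g. MySQL) and HDFS, in both directions.

Typical Sqoop use cases:
Data warehouse sources: event-tracking data and business data
Business data (produced by business systems) is imported into the data warehouse
Data in the warehouse's ADS layer (in Hive) is exported to MySQL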

Official site: https://sqoop.apache.org/#
Apache Attic page (Sqoop has been retired there): attic.apache.org/projects/sqoop.html
Download: archive.apache.org/dist/sqoop

Official syntax reference: sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_list_databases_literal

Command walkthrough:
https://www.cnblogs.com/LIAOBO/p/13667044.html

linux>tar -xf sqoop.xxx.tar.gz
linux>mv sqoop-xxx  /opt/install/sqoop
linux>cd  /opt/install/sqoop

Sqoop configuration:
Create sqoop-env.sh from the template in Sqoop's conf directory and add the exports below:

linux>cd conf
linux>cp sqoop-env-template.sh sqoop-env.sh

linux>vi sqoop-env.sh
export HADOOP_COMMON_HOME=/opt/install/hadoop
export HADOOP_MAPRED_HOME=/opt/install/hadoop
export HIVE_HOME=/opt/install/hive
export ZOOKEEPER_HOME=/opt/install/zookeeper
export ZOOCFGDIR=/opt/install/zookeeper
export HBASE_HOME=/opt/install/hbase
export  HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

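Run the test commands from the Sqoop home directory (/opt/install/sqoop). The MySQL JDBC driver jar must already be in Sqoop's lib directory.
# The commands below list the databases and tables in MySQL
# A trailing \ continues the command on the next line

# List all databases in MySQL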
linux>bin/sqoop list-databases \
--connect jdbc:mysql://192.168.58.203:3306 \
--username root \
--password 123
# List all tables in the specified MySQL database
linux>bin/sqoop list-tables \
--connect jdbc:mysql://192.168.58.203:3306/test \
--username root \
--password 123

Example: mysql ---> hdfs

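Annotated option reference (the runnable commands follow below):
bin/sqoop import                      run an import
--connect jdbc:mysql://192.168.58.203:3306/database_name      JDBC URL of the source database
--username db_user                    database user name
--password db_password                database user password
--table table_name                    source table to import
--target-dir /sqoopdata/xxxx          HDFS directory to import into
--fields-terminated-by ','            field delimiter in the output files
--delete-target-dir                   delete the HDFS target directory if it already exists (usually left off)
--split-by FieldName                  column used to split the work; it must exist in the table, otherwise the job fails
-m 1                                  number of map tasks, which is also the number of output files

# -m sets the number of concurrent map tasks (1 here). If it is not set, Sqoop starts 4 map tasks by default, and it then needs a column for dividing the work among them (the primary key, or the column given by --split-by).
# How --split-by divides the data depends on the column type. For an integer column, Sqoop reads the minimum and maximum of the split column and divides that range by the number of mappers (--num-mappers, -m). For example, if select min(split_col), max(split_col) from the table returns 1 and 1000 and -m is 2, the range is divided into 1-500 and 501-1000, and two queries are handed to two map tasks: select XXX from table where split_col >= 1 and split_col < 501, and select XXX from table where split_col >= 501 and split_col <= 1000. Each map task then imports the rows returned by its own query.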
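
The range arithmetic described above can be sketched in plain bash. This is a simplified illustration under stated assumptions, not Sqoop's actual splitter code: the function name split_ranges and the column name id are made up for the example, and spreading the remainder over the first ranges is an assumption chosen so that the ranges cover every value from the minimum to the maximum without a gap.

```shell
#!/usr/bin/env bash
# Simplified sketch of integer range splitting (hypothetical helper, not Sqoop source code).
# Usage: split_ranges MIN MAX NUM_MAPPERS COLUMN
split_ranges() {
  local min=$1 max=$2 n=$3 col=$4
  local size=$(( (max - min) / n ))   # base width of each range
  local rem=$(( (max - min) % n ))    # leftover values, given to the first ranges
  local lo=$min hi i
  for (( i = 0; i < n; i++ )); do
    hi=$(( lo + size + (i < rem ? 1 : 0) ))
    if (( i == n - 1 )); then
      # The last range is closed so the maximum value is included
      echo "$col >= $lo AND $col <= $max"
    else
      echo "$col >= $lo AND $col < $hi"
    fi
    lo=$hi
  done
}

split_ranges 1 1000 2 id
```

With min 1, max 1000 and 2 mappers this prints one predicate per map task; each line corresponds to the where clause one task would run.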


linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/databasesname \
--username root \
--password 123 \
--table tablename \
--target-dir /sqoopdata/dirname  \
--fields-terminated-by ',' \
--delete-target-dir \
-m 1
# Inspect the imported data
linux>hdfs dfs -ls /sqoopdata/
linux>hdfs dfs -cat /sqoopdata/dirname/part-m-00000


linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/databasesname \
--username root \
--password 123 \
--table tablename \
--target-dir /sqoopdata/dirname  \
--fields-terminated-by ',' \
--delete-target-dir \
--split-by FieldName \
-m 2
# Count the rows in each output file
linux>hdfs  dfs -cat /sqoopdata/dirname/part-m-00000  | wc -l
linux>hdfs  dfs -cat /sqoopdata/dirname/part-m-00001  | wc -l

You can specify the format of the generated files:

--as-avrodatafile 
--as-parquetfile  
--as-sequencefile 
--as-textfile

To compress the output:

--compress   
--compression-codec gzip

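Gzip
Pros:
Fast compression and decompression, high compression ratio, supported by Hadoop out of the box
Compressed files are handled the same way as plain text
Most Linux systems ship with the gzip command
Cons:
Not splittable
When to use:
Files that stay within about 130 MB after compression (roughly one HDFS block) can all use Gzip, since its only real drawback is that it cannot be split
Summary: use it when splitting is not needed
BZip2
Pros:
High compression ratio (higher than Gzip)
Splittable
Ships with Hadoop
Cons:
Very slow compression and decompression
When to use:
When compression speed does not matter but compression ratio does, e.g. archiving history records or backup files
When the output is large, should take less disk space, and will rarely be read afterwards (so it is rarely compressed or decompressed)
When a single file is large, disk space matters, and compatibility with existing applications must be kept
Summary: use it when compression and decompression speed is not a concern
Lzo
Pros:
Fairly fast compression and decompression, reasonable compression ratio
Splittable; a popular compression format in Hadoop
The lzop command can be installed on Linux
Cons:
Compression ratio is somewhat lower than Gzip
Not bundled with Hadoop; must be installed separately
LZO files need special handling: an index must be built to support splitting, and the InputFormat must be set to an LZO format
When to use:
Files still larger than 200 MB after compression; the larger the file, the greater LZO's advantage
Summary: recommended when the compressed file is still large and needs to be split
Snappy
Pros:
Very fast compression and decompression, acceptable compression ratio
Cons:
Not splittable
Compression ratio is lower than Gzip
Not bundled with older Hadoop releases; may need to be installed separately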

Compression speed (typical): Snappy > Lzo > Gzip > BZip2
Splittable: BZip2, Lzo
Compression ratio: BZip2 > Gzip > Lzo > Snappy

Null handling
# Import direction: which string is written to the HDFS files in place of a MySQL NULL (the default is the string "null")

--null-non-string   '\\N'
--null-string  '\\N'
Example:
linux>bin/sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://192.168.58.203:3306/databasesname \
--username root \
--password 123 \
--table tablename \
--target-dir /sqoopdata/dirname  \
--delete-target-dir \
--fields-terminated-by ',' \
--null-non-string   '\\N' \
--null-string  '\\N' \
--split-by emp_no \
-m 2 
Note: pay attention to the column types. --null-string applies to string-type columns, and --null-non-string applies to columns of every other type.
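
The quoting of '\\N' is easy to get wrong, so here is a small shell-only check of what actually reaches Sqoop. Single quotes stop the shell from touching the backslashes, so Sqoop receives the three characters \\N. As far as I understand it, Sqoop then treats that value as an escaped string and writes \N, which is the text Hive reads as NULL by default in text tables. The variable names are only for illustration.

```shell
#!/usr/bin/env bash
# What the shell hands over for --null-string, depending on the quoting style.
single='\\N'   # single quotes: both backslashes are kept as typed
double="\\N"   # double quotes: the shell collapses \\ into one backslash

printf 'single-quoted: %s (%d chars)\n' "$single" "${#single}"
printf 'double-quoted: %s (%d chars)\n' "$double" "${#double}"
```

The single-quoted form is 3 characters long and the double-quoted form only 2, which is why the commands in this article use single quotes.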

# Verify
hive>create table test(
emp_no    string,
name      string
)
row format delimited
fields terminated by ',';
hive>load data inpath '/sqoopdata/test/part-m-00000' into table test;
hive>select * from test where name is null;

If the table has no numeric primary key, a text column can also be used to split the work among map tasks, but an extra -D option is required:

linux>bin/sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://192.168.58.203:3306/databasesname \
--username root \
--password 123 \
--table tablename \
--target-dir /sqoopdata/dirname  \
--delete-target-dir \
--fields-terminated-by ',' \
--split-by FieldName \
-m 2

mysql ---> hive (the Hive table is created automatically)

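# --hive-import loads the data into Hive (Hive's default delimiters apply unless --fields-terminated-by is given)
# --hive-table sets the target Hive table
# --as-textfile sets the output file format
# --hive-overwrite overwrites the existing data in the Hive table
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/databasesname \
--username root \
--password 123 \
--table tablename \
--hive-import \
--hive-table hivedatabasesname.hivetablename \
--delete-target-dir \
--as-textfile \
--fields-terminated-by ',' \
--hive-overwrite \
-m 1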

If the import fails, copy Hive's configuration file hive-site.xml into Sqoop's conf directory:

linux>cp /opt/install/hive/conf/hive-site.xml /opt/install/sqoop/conf/

Conditional import: --where

mysql --> hive (the Hive table is created automatically)
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--hive-import \
--hive-table hive_db.hive_table \
--delete-target-dir \
--as-textfile \
--fields-terminated-by ',' \
--compress \
--compression-codec gzip \
--split-by column_name \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
--where "column_name='value'" \
-m 2

Conditional import: --columns selects the columns to import

mysql --> hive (the Hive table is created automatically)
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--hive-import \
--hive-table hive_db.hive_table \
--delete-target-dir \
--as-textfile \
--fields-terminated-by ',' \
--compress \
--compression-codec gzip \
--split-by column_name \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
--where "column_name='value'" \
--columns "column1,column2,column3" \
-m 2

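Query import: --query
With a free-form --query import, the SQL statement must contain the $CONDITIONS placeholder: either where $CONDITIONS, or where id>10 and $CONDITIONS.
Why? Sqoop hands your SQL statement to several map tasks, and each map task has to add its own range condition when it runs the statement,
so $CONDITIONS is provided as the placeholder where that condition is later spliced in.
When --query is used, do not also use --table, --where, or --columns. A --query import also needs --target-dir, and --split-by when more than one map task is used. Inside double quotes write \$CONDITIONS so the shell does not expand it; inside single quotes write $CONDITIONS as-is.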
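
To make the placeholder idea concrete, here is a minimal bash sketch that mimics the substitution. It is an illustration only, not how Sqoop is implemented: the helper expand_conditions, the table emp, and the columns id, name, and dept are all hypothetical, and the range predicates are simply the ones from the 1 to 1000 example earlier.

```shell
#!/usr/bin/env bash
# Illustration only: replace the literal $CONDITIONS placeholder with one task's range predicate.
# Table and column names (emp, id, name, dept) are hypothetical.
query='select id, name from emp where dept = 10 and $CONDITIONS'

expand_conditions() {
  local sql=$1 predicate=$2
  local placeholder='$CONDITIONS'
  # Substitute every occurrence of the placeholder text with the predicate in parentheses
  printf '%s\n' "${sql//$placeholder/($predicate)}"
}

# One statement per map task
expand_conditions "$query" "id >= 1 AND id < 501"
expand_conditions "$query" "id >= 501 AND id <= 1000"

# Quoting check: escaped in double quotes and bare in single quotes give the same text
dq="where \$CONDITIONS"
sq='where $CONDITIONS'
[ "$dq" = "$sq" ] && echo "quoting forms match"
```

Each call prints the statement one map task would run, which shows why the placeholder has to sit in the where clause of the query.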

mysql --> hive (the Hive table is created automatically)
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--hive-import \
--hive-table hive_db.hive_table \
--as-textfile \
--fields-terminated-by ',' \
--compress \
--compression-codec gzip \
--split-by column_name \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
--query "select column1,column2,column3 from table_name where column3='value' and \$CONDITIONS" \
--target-dir '/sqoopdata/tmp' \
-m 2

--query also supports complex queries (joins, subqueries, group-by queries)

mysql --> hive (the Hive table is created automatically)
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 12345 \
--hive-import \
--hive-table hive_db.hive_table \
--as-textfile \
--fields-terminated-by ',' \
--compress \
--compression-codec gzip \
--split-by e.column_name \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite \
--query 'select e.column1,d.column2,e.column2,e.column3,e.column4 from table_a e join table_b d on e.column1=d.column1 where $CONDITIONS' \
--target-dir '/sqoopdata/tmp' \
-m 2

hdfs ---> mysql (the MySQL table must be created manually beforehand)
-- The target table must already exist in MySQL; create it first

linux>bin/sqoop export \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--input-fields-terminated-by ',' \
--export-dir '/sqoopdata/dir_with_table_data' \
--batch

hdfs ---> mysql (the MySQL table must be created manually beforehand)
linux>bin/sqoop export \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--export-dir '/sqoopdata/test' \
--update-mode allowinsert \
--update-key id \
--batch
# With --update-mode updateonly, only rows whose id already exists in MySQL are updated; rows with new ids are not inserted
# With --update-mode allowinsert, rows with existing ids are updated and rows with new ids are inserted as well
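
The difference between the two update modes can be shown with a toy simulation. This is only a sketch of the semantics described above, using a bash associative array in place of the MySQL table (so it needs bash 4 or later); the function apply_export and the sample values are invented for the example and nothing here talks to Sqoop or MySQL.

```shell
#!/usr/bin/env bash
# Toy model of export semantics keyed by id (hypothetical helper, bash 4+ required).
declare -A table=( [1]="old-a" [2]="old-b" )   # stands in for the existing MySQL rows

apply_export() {
  local mode=$1 id=$2 value=$3
  if [[ -n ${table[$id]+set} ]]; then
    # The id already exists: both modes update the row
    table[$id]=$value
  elif [[ $mode == allowinsert ]]; then
    # The id is new: only allowinsert adds it
    table[$id]=$value
  fi
}

apply_export updateonly 1 new-a    # existing id, updated
apply_export updateonly 3 new-c    # new id, silently skipped
echo "after updateonly: ${#table[@]} rows, id 1 = ${table[1]}"

apply_export allowinsert 3 new-c   # new id, inserted
echo "after allowinsert: ${#table[@]} rows, id 3 = ${table[3]}"
```

The row count stays at 2 after the updateonly calls and becomes 3 only once allowinsert is used.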

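Sqoop incremental import
mysql --> hive

mysql --> hive (the Hive table is created automatically)
# --incremental append selects append mode (only new rows are added)
# --check-column is the column checked for new rows
# --last-value is the largest value already imported, e.g. --last-value 599999
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--hive-import \
--hive-table hive_db.hive_table \
--split-by column1 \
--incremental append \
--check-column column1 \
--last-value 599999 \
-m 2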

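# Alternative: lastmodified mode
# --incremental lastmodified picks up rows modified since the last import
# --check-column is the timestamp column to check
# --last-value is the timestamp reached by the previous import

linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--hive-import \
-m 1 \
--hive-table hive_db.hive_table \
--incremental lastmodified \
--check-column time_column \
--last-value "2021-12-31 23:59:59"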

mysql --> hdfs
# If the target directory already exists, lastmodified mode also needs --append or --merge-key
linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--target-dir '/sqoopdata/dir_name' \
--incremental lastmodified \
--check-column time_column \
--last-value 'time_column_value' \
--fields-terminated-by ',' \
-m 1
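
To compare the two incremental modes, here is a small bash sketch of the row filter each one implies. The boundary operators are my assumption about how the modes behave (append takes values strictly above the last value, lastmodified takes timestamps from the last value up to the start of the current job) and should be checked against your Sqoop version; the function incremental_where, the upper-bound argument, and the column names id and update_time are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the row filter implied by each incremental mode (assumed shape, not Sqoop source code).
# Usage: incremental_where MODE COLUMN LAST_VALUE UPPER_BOUND
incremental_where() {
  local mode=$1 col=$2 last=$3 upper=$4
  case "$mode" in
    append)
      # New rows only: strictly greater than the last imported value
      echo "$col > $last AND $col <= $upper"
      ;;
    lastmodified)
      # Rows touched since the last run, up to the start of this run
      echo "$col >= '$last' AND $col < '$upper'"
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}

incremental_where append id 599999 650000
incremental_where lastmodified update_time "2021-12-31 23:59:59" "2022-01-02 00:00:00"
```

Whatever upper bound a run reaches becomes the --last-value of the next run, so it has to be recorded somewhere between runs.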

mysql ---> hbase

linux>bin/sqoop import \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--hbase-table hbase_table \
--column-family base \
--hbase-create-table \
--hbase-row-key column1

Generate a jar on the local Linux machine

linux>bin/sqoop codegen \
--connect jdbc:mysql://192.168.58.203:3306/mysql_db \
--username root \
--password 123 \
--table table_name \
--bindir /root/test \
--class-name jarname \
--fields-terminated-by ","
# Result: the jar (jarname.jar) is generated in the local directory /root/test
