使用Sqoop将MySQL数据导入Hive

最新推荐文章于 2024-06-23 17:38:41 发布

刘李404not found

最新推荐文章于 2024-06-23 17:38:41 发布

阅读量887

点赞数 1

分类专栏： Hadoop

本文链接：https://blog.csdn.net/qq_39680564/article/details/100098025

版权

Hadoop 专栏收录该内容

39 篇文章 1 订阅

订阅专栏

文章目录

一、说明
二、Hive参数

https://zhuanlan.zhihu.com/p/39113492

一、说明

MySQL数据导入Hive方法与导入HDFS类似：https://blog.csdn.net/qq_39680564/article/details/100081659，Sqoop只是将数据导入HDFS后再转成Hive表

二、Hive参数

Hive除了自己的参数，同样也支持import的参数，见导入HDFS文章

--hive-home <dir>			Hive 的安装目录，可以通过该参数覆盖掉默认的 hive 目录
--hive-import				将表导入Hive
--hive-overwrite			覆盖Hive表中的现有数据
--create-hive-table			默认是 false,如果目标表已经存在了，那么创建任务会失败
--hive-table <table-name>	设置导入hive后的表名
--hive-drop-import-delims	导入到Hive时从字符串字段中 删除\ n，\ r和\ 01
--hive-delims-replacement	导入到Hive时，使用用户定义的字符串 替换字符串字段中的\ n，\ r和\ 01
--map-column-hive <arg>		生成 hive 表时，可以更改生成字段的数据类型

参考模版

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--columns "id,fname,lname" \
--hive-import \
--hive-overwrite  \
--hive-partition-key dt  \
--hive-partition-value %(yesterday)s  \
--hive-table f_fuwu.%(hive_table)s  \
--hive-drop-import-delims  \
--fields-terminated-by '%(fields_terminated)s'  \
--lines-terminated-by '\n'  \
--null-string '\\\\N'  \
--null-non-string '\\\\N'  \
-m 1

2.1 单表导入

2.1.1 默认参数

命令

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import

HDFS目录
默认在hive的家目录/user/hive/warehouse，有一个以源表名命名的目录，同样没有设置-m参数，生成了4个文件

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/person
Found 4 items
-rwxrwxrwx   3 hdfs hive   62664632 2019-08-27 15:38 /user/hive/warehouse/person/part-m-00000
-rwxrwxrwx   3 hdfs hive   63776345 2019-08-27 15:38 /user/hive/warehouse/person/part-m-00001
-rwxrwxrwx   3 hdfs hive   63778151 2019-08-27 15:38 /user/hive/warehouse/person/part-m-00002
-rwxrwxrwx   3 hdfs hive   63776312 2019-08-27 15:39 /user/hive/warehouse/person/part-m-00003

HDFS文件内容

[root@master ~]# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/person/part-m-00000 | head -n 20
1甲子鼠180
2乙丑牛null1
3丙寅虎640
4丁卯兔490
5戊辰龙null0
6己巳蛇660
7庚午马491
8辛未羊141
9壬申猴450
10癸酉鸡881
11戌狗240
12亥猪780
13311
14llkkbllkkbd300
15llkkllkk540
16llkkbllkkbd131
17llkkllkk71
18llkkllkk120
19llkkcllkkc841
20llkkfllkkf471
cat: Unable to write to output stream.

Hive表内容
生成了相同名称的表

hive> show tables;
OK
person
Time taken: 0.028 seconds, Fetched: 1 row(s)
hive> select * from person where id <13;
OK
1	甲	子鼠		18		0
2	乙	丑牛		NULL	1
3	丙	寅虎		64		0
4	丁	卯兔		49		0
5	戊	辰龙		NULL	0
6	己	巳蛇		66		0
7	庚	午马		49		1
8	辛	未羊		14		1
9	壬	申猴		45		0
10	癸	酉鸡		88		1
11		戌狗		24		0
12		亥猪		78		0
Time taken: 16.497 seconds, Fetched: 12 row(s)

2.1.2 根据数据筛选条件导入

导入id小于11的数据
命令

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import \
--hive-overwrite \
-m 1 \
--where "id<11"

HDFS目录

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/person
Found 2 items
-rwxrwxrwx   3 hdfs hive          0 2019-08-27 16:34 /user/hive/warehouse/person/_SUCCESS
-rwxrwxrwx   3 hdfs hive        185 2019-08-27 16:34 /user/hive/warehouse/person/part-m-00000

HDFS文件内容

[root@master ~]# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/person/part-m-00000 | head -n 20
1甲子鼠180
2乙丑牛null1
3丙寅虎640
4丁卯兔490
5戊辰龙null0
6己巳蛇660
7庚午马491
8辛未羊141
9壬申猴450
10癸酉鸡881

Hive表内容

hive> select * from person;
OK
1	甲	子鼠		18		0
2	乙	丑牛		NULL	1
3	丙	寅虎		64		0
4	丁	卯兔		49		0
5	戊	辰龙		NULL	0
6	己	巳蛇		66		0
7	庚	午马		49		1
8	辛	未羊		14		1
9	壬	申猴		45		0
10	癸	酉鸡		88		1
Time taken: 0.357 seconds, Fetched: 10 row(s)

2.1.3 设置导入后的表名

设置表名为person_hive： --hive-table person_hive
命令

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import \
--hive-overwrite \
-m 1 \
--where "id<11" \
--hive-table person_hive

HDFS目录

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/
Found 2 items
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:18 /user/hive/warehouse/person
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:04 /user/hive/warehouse/person_hive

HDFS文件内容

[root@master ~]# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/person/part-m-00000 | head -n 20
1甲子鼠180
2乙丑牛null1
3丙寅虎640
4丁卯兔490
5戊辰龙null0
6己巳蛇660
7庚午马491
8辛未羊141
9壬申猴450
10癸酉鸡881

Hive表内容

hive> show tables;
OK
person
person_hive
Time taken: 0.131 seconds, Fetched: 2 row(s)
hive> select * from person_hive;
OK
1	甲	子鼠		18		0
2	乙	丑牛		NULL	1
3	丙	寅虎		64		0
4	丁	卯兔		49		0
5	戊	辰龙		NULL	0
6	己	巳蛇		66		0
7	庚	午马		49		1
8	辛	未羊		14		1
9	壬	申猴		45		0
10	癸	酉鸡		88		1
Time taken: 0.066 seconds, Fetched: 10 row(s)

2.1.4 更改表的某个字段类型

更改sex字段的类型为STRING

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--map-column-hive sex=STRING \
--hive-import

2.1.5 导入指定库

使用--hive-database参数

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import \
--hive-database liuli \
-m 1

2.1.6 增量导入

--check-column：指定增量导入的依赖字段，通常为自增的主键id或者时间戳
--incremental：指定导入的模式（append或lastmodified）
--last-value：指定导入的上次最大值也就是这次开始的值

2.1.6.1 Append模式

指定一个自增字段，根据指定的自增字段值增量导入(经测试，指定超前的ID不会覆盖)

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table cars \
--hive-import \
--hive-database liuli \
--incremental append \
--check-column id \
--last-value 5 \
-m 1

2.1.6.2 lastModify方式（经测试，导入hive不支持时间戳，导入HDFS时可以）

指定一个时间戳字段，根据指定的时间戳增量导入

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table cars \
--hive-import \
--hive-database liuli \
--incremental lastmodified  \
--check-column dtime \
--last-value "2014-11-09 21:00:00" \
-m 1

2.1.7 根据指定字段导入

指定导入id,dname

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table cars \
--hive-import \
--hive-database liuli \
--hive-overwrite \
--columns id,dname \
-m 1

2.1.8 根据SQL查询结果导入

使用了自定义sql就不能指定--table
自定义sql语句的where条件中必须包含字符串$CONDITIONS，$CONDITIONS是一个变量，用于给多个map任务划分任务范围
使用自定义sql时，如果通过参数-m指定多个map任务，由于自定义sql中可能存在多表查询，因此必须使用参数“--split-by 表名.字段名”指定多个map任务分割数据的根据字段，如--split-by users.id
提示必须要加--target-dir 参数，但是加上之后有没有导入指定的目录，暂时未解

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-database liuli \
--hive-table cars \
--query 'SELECT id,dname,cname FROM cars where id < 8 AND $CONDITIONS' \
--target-dir /usr/1 \
-m 1

2.1.8.1 两表链接导入

会将查询出的数据导入一张表
注意给字段取别名，不然会报错字段重复

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-database liuli \
--hive-table cars \
--query 'SELECT a.id AS aid,a.dname AS adname,a.cname AS acname,b.id AS bid,b.dname AS bdname,b.cname AS bcname FROM cars a LEFT JOIN cars2 b ON a.id = b.id  AND $CONDITIONS' \
--target-dir /usr/1 \
-m 1

2.2 全表导入

满足三个条件：
1、每个表必须都只有一个列作为主键；
2、必须将每个表中所有的数据导入，而不是部分；
3、你必须使用默认分隔列，且WHERE子句无任何强加的条件

不能使用的参数

--table
--split-by
--columns
--where

--exclude-tables：可以用来排除导入某个表

命令

sudo -u hdfs \
sqoop import-all-tables \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-overwrite \
-m 1

HDFS目录

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/
Found 3 items
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:17 /user/hive/warehouse/cars
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:18 /user/hive/warehouse/person
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:04 /user/hive/warehouse/person_hive

Hive表内容

hive> show tables;
OK
cars
person
person_hive
Time taken: 0.955 seconds, Fetched: 3 row(s)

2.2.1 全表导入，并排除某张表

命令
如需排除多表，用英文逗号隔开

sudo -u hdfs \
sqoop import-all-tables \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-overwrite \
--exclude-tables person \
-m 1

2.3 sqoop脚本

使用--options-file参数可以执行已经编写好的脚本

脚本示例

import
--connect
jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false
--username
root
--password
xxxxxxxx
--table
person
--hive-import
--hive-overwrite
-m
1
--hive-table
person_hive

命令示例

一定确定文件权限HDFS用户可用

sudo -u hdfs sqoop --options-file /hive/hive.txt

刘李404not found

关注

1
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
使用Sqoop将MySQL数据导入Hive

一、说明MySQL数据导入Hive方法与导入HDFS类似：https://blog.csdn.net/qq_39680564/article/details/100081659，Sqoop只是将数据导入HDFS后再转成Hive表二、Hive参数Hive除了自己的参数，同样也支持import的参数，见导入HDFS文章--hive-home <dir> 覆盖 $HIVE_HOM...
复制链接

扫一扫

专栏目录