Sqoop: Importing into HDFS, Hive, and HBase

人生有如两个橘子

Article link: https://blog.csdn.net/qq_37706484/article/details/102503802

Parameter reference

Official parameter documentation: http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_incremental_imports

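--password-file: path to the password file on HDFS. Use this option when the password must not appear in plain text, or to avoid typing the password by hand every time a job is submitted.

-m n: the number of mappers, which is also the number of files generated on HDFS. If n > 1, add --split-by <col> (it is required with --query, or when the table has no primary key).

--split-by <col>: (1) the column must not contain NULL, otherwise the rows whose value is NULL are lost; (2) avoid string-typed columns where possible, because values such as '00..00' (n zeros) can be extracted more than once. The column must also appear in the --query select list or in --columns.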

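--hive-partition-key 'version': version is the partition column. The name is arbitrary, but if it matches an existing column of the source table, that column must not appear in the --query select list or in --columns. The resulting Hive table then has no regular column of that name; it becomes the partition column, and its value is the one given by --hive-partition-value.
--hive-partition-value '1': the partition value. The import creates a version=1 directory under the Hive table's HDFS path.

--hive-overwrite: overwrite mode; the table (a partitioned table) is created automatically. Without this option, data is appended.

--null-string '\\N' --null-non-string '\\N': how NULL values are represented in Hive.

--fields-terminated-by '\007': the Hive field delimiter. Avoid the comma ",", which may occur inside fields such as addresses; \002 and \003 can also be used.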
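
To make the delimiter advice concrete, here is a small shell sketch that builds one row by hand and splits it both ways. The row contents and variable names are made up for illustration, and nothing here calls Sqoop; it only shows why a comma inside an address field is a problem.

```shell
# Illustrative only: a hand-built row, not real Sqoop output.
sep="$(printf '\007')"                        # the BEL character, octal 007
row="$(printf '1\007Main St, Apt 4\00730')"   # id, address (contains a comma), age

# Splitting on BEL keeps the address intact.
addr="$(printf '%s\n' "$row" | cut -d "$sep" -f 2)"
echo "$addr"

# Splitting the same values on commas yields one field too many.
comma_fields="$(printf '1,Main St, Apt 4,30\n' | awk -F ',' '{ print NF }')"
echo "$comma_fields"
```

The comma-separated version reports four fields for a three-column row, which is the kind of misalignment the control-character delimiter avoids.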

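--target-dir: the HDFS path where the results are stored. Notes: 1. Set this path to the table name or to some other unique value. If several tasks run at the same time with the same path, they all write their temporary files there; with --delete-target-dir, the task that finishes first deletes the data and the tasks still running fail. 2. If the path is the HDFS location of the target Hive table, do not add --delete-target-dir.

--delete-target-dir: deletes the results already present under --target-dir. It cannot be combined with --append, nor with --incremental append.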

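--hbase-create-table: creates the HBase table if it does not exist

--hbase-table <table>: the HBase table name

--hbase-row-key <col>: the primary key or composite primary key to use as the row key. A single key is written as "id", a composite key as "id1,id2"

--column-family <family>: the column family

--hbase-bulkload: enables bulk loading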

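--incremental <mode>: how the incremental data is selected; in every mode the data lands in the target as an append. There are two modes, append and lastmodified. append is intended for selecting increments by an auto-increment id; lastmodified selects increments by an update timestamp and, combined with --merge-key, can apply incremental updates.

--check-column <col>: the column examined when deciding which rows to import. (Its type should not be CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)

--last-value <value>: the maximum --check-column value from the previous import, that is, the lower bound of the current import.

--merge-key <col>: the name of the column used as the merge key, that is, the table's primary key. A MapReduce job is run to merge the data; it can be combined with the lastmodified incremental mode.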

--create <job-id>: creates a Sqoop job

--delete <job-id>: deletes a Sqoop job

--exec <job-id>: runs a Sqoop job

--show <job-id>: shows a Sqoop job

--list: lists all saved Sqoop jobs
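
As a rough sketch of how the password file for --password-file might be prepared: Sqoop uses the entire file content as the password, trailing newline included, so the file is written without one. The password 'secret' and the HDFS path /user/etl/mysql.pwd are placeholders, and the upload commands are left as comments because they need a Hadoop client.

```shell
# Placeholder password and paths, for illustration only.
pw_file="$(mktemp)"
printf '%s' 'secret' > "$pw_file"    # printf '%s' writes no trailing newline
chmod 400 "$pw_file"                 # readable by the owner only

pw_bytes="$(wc -c < "$pw_file" | tr -d ' ')"
echo "password file holds ${pw_bytes} bytes"

# Upload and use, assuming an HDFS client is available (not run here):
# hdfs dfs -put "$pw_file" /user/etl/mysql.pwd
# hdfs dfs -chmod 400 /user/etl/mysql.pwd
# sqoop import ... --username root --password-file /user/etl/mysql.pwd
```

The byte count should equal the password length; one byte more usually means an editor or echo added a newline.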

Importing into HDFS

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age from tb where \$CONDITIONS" \
--target-dir /tmp/sqoop_import/$tb \
-m 1
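
Two shell details in the command above are easy to trip over, so here is a minimal sketch of both. The command refers to $tb, which must be set in the calling shell or it expands to nothing, and $CONDITIONS must reach Sqoop literally so that Sqoop can substitute its own predicate. The table name tb is just the example value used throughout this post.

```shell
tb="tb"                                   # example table name
target_dir="/tmp/sqoop_import/${tb}"

# Inside double quotes the shell would expand $CONDITIONS, so it is escaped.
query_dq="select id, name, age from ${tb} where \$CONDITIONS"
# Inside single quotes nothing is expanded, so no escape is needed.
query_sq='select id, name, age from tb where $CONDITIONS'

echo "$target_dir"
echo "$query_dq"
```

Both query strings end up identical; the double-quoted form is only needed when the query itself contains shell variables.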

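Importing into Hive

Sqoop imports relational data into Hive in two steps: first it writes the query result set to HDFS, then it loads the HDFS data into Hive with Hive's LOAD command.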

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age from tb where \$CONDITIONS" \
--target-dir /tmp/sqoop_import/$tb \
--hive-import \
--hive-overwrite \
--hive-database test \
--hive-table tb \
--hive-partition-key 'version' \
--hive-partition-value '1' \
--fields-terminated-by '\007' \
--null-string '\\N' \
--null-non-string '\\N' \
--split-by 'age' \
-m 10
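
To illustrate what --split-by 'age' with several mappers does, the sketch below is a toy model of range splitting on an integer column. It is not Sqoop's actual splitter; the bounds 1 and 100 and the mapper count 4 are made-up values. Each mapper gets one predicate in place of $CONDITIONS.

```shell
# Toy model of range splitting on an integer column; not Sqoop's actual code.
make_splits() (
  col="$1"; min="$2"; max="$3"; n="$4"
  i=1
  lo="$min"
  while [ "$i" -le "$n" ]; do
    hi=$(( min + (max - min) * i / n ))
    if [ "$i" -lt "$n" ]; then
      echo "${col} >= ${lo} AND ${col} < ${hi}"
    else
      echo "${col} >= ${lo} AND ${col} <= ${hi}"   # last range includes max
    fi
    lo="$hi"
    i=$(( i + 1 ))
  done
)

splits="$(make_splits age 1 100 4)"
echo "$splits"
```

A row whose age is NULL satisfies none of these range predicates, which is consistent with the warning that such rows are lost when the split column contains NULL.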

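Importing into HBase
Intermediate results are written under /user/${executing_user}/_sqoop/ by default.
Adding --target-dir /tmp produces no results under that path, but it resolves the error Import failed: Can not create a Path from a null string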

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age from tb where \$CONDITIONS" \
--hbase-create-table \
--hbase-table "h_tb" \
--hbase-row-key "id,name" \
--column-family "f" \
--hbase-bulkload \
--split-by 'age' \
-m 10 \
--target-dir /tmp

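Note: whether the row key is a single primary key or a composite one, the key columns are no longer stored as columns once they are used as the row key. When the business needs the key attributes, or when mapping the table into Hive requires them, they must be stored redundantly. To do so, add the following to sqoop-site.xml: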

<property>
    <name>sqoop.hbase.add.row.key</name>
    <value>true</value>
</property>
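
For the composite key "id,name" used in the HBase command, the following sketch shows how the row key and the stored columns might look. It assumes, as the Sqoop user guide describes for composite keys, that the key values are joined with an underscore; the row values are invented.

```shell
# Hypothetical source row.
id="1"
name="alice"
age="30"

# With --hbase-row-key "id,name", the key values are joined with "_" (assumed per the user guide).
rowkey="${id}_${name}"

echo "row key:        ${rowkey}"
echo "stored columns: f:age=${age}"
# With sqoop.hbase.add.row.key=true, f:id and f:name would be stored as well.
```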

The following two forms can be used interchangeably

--table 'tb' \
--columns "id,name,age" \
--where "age > 20" \

--query "select name, age from tb where \$CONDITIONS" \

SQL Server

--driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--connect "jdbc:sqlserver://ip:1433;username=sa;password=123456" \

Oracle

--driver "oracle.jdbc.driver.OracleDriver" \
--connect "jdbc:oracle:thin:@//OracleServer:OraclePort/OracleService" \

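Incremental import into Hive
--incremental append

Primary key id, --check-column id: no way to configure a composite primary key was found

Use case: importing into Hive when the table has an auto-increment primary key and rows are never updated. Incremental data is appended to the table (a partitioned table). The increment is selected over the interval (startId, endId], open on the left and closed on the right, where the upper bound is the largest id in the table. Recommended.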

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age, updatetime from tb where \$CONDITIONS" \
--target-dir /tmp/sqoop_import/$tb \
--hive-import \
--hive-database test \
--hive-table tb \
--hive-partition-key 'version' \
--hive-partition-value '1' \
--fields-terminated-by '\007' \
--null-string '\\N' \
--null-non-string '\\N' \
--split-by 'age' \
-m 10 \
--incremental append \
--check-column id \
--last-value 1

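Timestamp, --check-column updatetime: equivalent in effect to --incremental lastmodified --append. The increment is selected over the interval (start, end], open on the left and closed on the right, where the upper bound is the largest timestamp in the table. Recommended.

Use case: importing into Hive when rows are updated and their update timestamp changes with them. Incremental data is appended to the table (a partitioned table).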

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age, updatetime from tb where \$CONDITIONS" \
--target-dir /tmp/sqoop_import/$tb \
--hive-import \
--hive-database test \
--hive-table tb \
--hive-partition-key 'version' \
--hive-partition-value '1' \
--fields-terminated-by '\007' \
--null-string '\\N' \
--null-non-string '\\N' \
--split-by 'age' \
-m 10 \
--incremental append \
--check-column updatetime \
--last-value '2019-08-08 08:08:08'

--incremental lastmodified

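lastmodified does not support --hive-import. To load into Hive, import into HDFS and point --target-dir at the Hive table's path.

Append, --append: equivalent in effect to --incremental append --check-column updatetime. The increment is selected over the interval [start, end), closed on the left and open on the right; rows at exactly end are missed by the current run and picked up by the next one. The upper bound is the current system time. Not recommended.

Use case: importing into Hive when rows are updated and their update timestamp <updatetime> changes with them. Incremental data is appended to the table (a partitioned table); for a partitioned table, point --target-dir at the partition directory.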

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age, updatetime from tb where \$CONDITIONS" \
--target-dir /user/hive/warehouse/test.db/tb/version=1 \
--fields-terminated-by '\007' \
--null-string '\\N' \
--null-non-string '\\N' \
--split-by 'age' \
-m 10 \
--incremental lastmodified \
--check-column updatetime \
--last-value "2018-08-08 08:08:08" \
--append
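
The difference between the two interval shapes can be sketched as the predicate each mode effectively applies to the check column. This is a toy builder that mirrors the intervals described in this post, not Sqoop's generated SQL; the function name and the upper-bound values are hypothetical.

```shell
# Toy predicate builder; the bounds mirror the intervals described in the text.
incremental_where() (
  mode="$1"; col="$2"; last="$3"; upper="$4"
  case "$mode" in
    append)       echo "${col} > '${last}' AND ${col} <= '${upper}'" ;;
    lastmodified) echo "${col} >= '${last}' AND ${col} < '${upper}'" ;;
    *)            echo "unknown mode: ${mode}" >&2; exit 1 ;;
  esac
)

# append: the upper bound is the largest value currently in the table (made up here).
w_append="$(incremental_where append updatetime '2018-08-08 08:08:08' '2019-08-08 08:08:08')"
# lastmodified: the upper bound is the time at which the import starts.
w_lastmod="$(incremental_where lastmodified updatetime '2018-08-08 08:08:08' "$(date '+%Y-%m-%d %H:%M:%S')")"

echo "$w_append"
echo "$w_lastmod"
```

With lastmodified, a row stamped exactly at the upper bound falls outside the current run and is only picked up by the next one.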

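Update, --merge-key: no way to configure a composite primary key was found

Use case: importing into Hive when the table has a primary key and rows are updated along with their timestamp column <updatetime>. Incremental data is merged into the table (a partitioned table) by a MapReduce job; for a partitioned table, point --target-dir at the partition directory. The increment is selected over the interval [start, end), closed on the left and open on the right; rows at exactly end are missed by the current run and picked up by the next one. The upper bound is the current system time. Not recommended.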

sqoop import \
--driver "com.mysql.jdbc.Driver" \
--connect "jdbc:mysql://ip:3306/db?useUnicode=true&characterEncoding=utf8" \
--username root \
--password pwd \
--query "select id, name, age, updatetime from tb where \$CONDITIONS" \
--target-dir /user/hive/warehouse/test.db/tb/version=1 \
--fields-terminated-by '\007' \
--null-string '\\N' \
--null-non-string '\\N' \
--split-by 'age' \
-m 10 \
--incremental lastmodified \
--check-column updatetime \
--last-value "2018-08-08 08:08:08" \
--merge-key id
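
What merging on a key amounts to can be shown with a toy model in plain shell and awk: for each key, the row from the newer data set replaces the older one. This is only an illustration of the idea, not the MapReduce job Sqoop runs; the rows and the comma delimiter are invented for readability.

```shell
# Existing data and a new increment; the first field is the merge key (id).
old="$(printf '1,alice,30\n2,bob,25')"
new="$(printf '2,bob,26\n3,carol,41')"

# Feed old rows first, then new rows: a later row overwrites an earlier one with the same id.
merged="$(printf '%s\n' "$old" "$new" \
  | awk -F ',' '{ latest[$1] = $0 } END { for (k in latest) print latest[k] }' \
  | sort)"

echo "$merged"
```

Row 2 appears once with the updated age, row 1 is kept unchanged, and row 3 is added.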

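Incremental import into HBase

The incremental options below can be appended directly to the command in "Importing into HBase" to get an incremental import

--incremental append: selects the increment by id or by update time. The interval is (start, end], open on the left and closed on the right; in both cases end is the maximum value in the table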

--incremental append \
--check-column id \
--last-value 1 \

--incremental append \
--check-column updatetime \
--last-value "2018-08-08 08:08:08" \

--incremental lastmodified: selects the increment by update time. The interval is [start, end), closed on the left and open on the right; end is the current system time

--incremental lastmodified \
--check-column updatetime \
--last-value "2018-08-08 08:08:08" \
--append \

Jobs
Creating a job

sqoop job --create myjob -- import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://ip:3306/test \
--username root \
--password pwd \
--table tb \
--hive-import \
--hive-database test \
--hive-table tb \
--hive-partition-key 'version' \
--hive-partition-value '1' \
--fields-terminated-by '\007' \
--null-string '\\N' \
--null-non-string '\\N' \
--split-by 'age' \
-m 10 \
--incremental lastmodified \
--check-column updatetime \
--last-value '2018-08-08 08:08:08' \
--merge-key id
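
Once a job is saved, the remaining job options from the parameter reference can be used to manage it. The wrapper below is a hedged sketch: the run helper is hypothetical and only prints each command unless RUN=1 is set and sqoop is on the PATH, so it is safe to try on a machine without Sqoop.

```shell
job="myjob"

# Hypothetical helper: print the command; execute it only when RUN=1 and sqoop is installed.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ] && command -v sqoop >/dev/null 2>&1; then
    "$@"
  fi
}

run sqoop job --list             # list saved jobs
run sqoop job --show "$job"      # show the saved parameters of the job
run sqoop job --exec "$job"      # run the job
run sqoop job --delete "$job"    # remove the job
```

For an incremental job, running it through --exec lets Sqoop update the stored last value after each successful run, so --last-value does not have to be tracked by hand.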