Sqoop

  • Installation

    • Download, unpack, and set the environment variables.
    • Nothing under conf needs to change. If ZooKeeper and HBase are not installed, comment out all of the ZooKeeper- and HBase-related checks in the configure-sqoop script (sketched below); if they are installed, the script can stay as it is.
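
A sketch of that edit, assuming a Sqoop 1.4.x configure-sqoop; the exact wording of the checks varies by version:

  ## in $SQOOP_HOME/bin/configure-sqoop, prefix the HBase and ZooKeeper
  ## environment checks with '#', roughly like this:
  #if [ ! -d "${HBASE_HOME}" ]; then
  #  echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
  #  echo 'Please set $HBASE_HOME to the root of your HBase installation.'
  #fi
  #if [ ! -d "${ZOOKEEPER_HOME}" ]; then
  #  echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
  #  echo 'Please set $ZOOKEEPER_HOME to the root of your ZooKeeper installation.'
  #fi
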
  • Import, and a MySQL pitfall

We import the DBS table of Hive's metastore database:

  sqoop git:(master)  sqoop import --connect jdbc:mysql://localhost:3306/hive --table DBS --username root -password root


Warning: /Users/chenxiaokang/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /Users/chenxiaokang/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/08/07 10:52:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/08/07 10:52:24 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/07 10:52:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/07 10:52:24 INFO tool.CodeGenTool: Beginning code generation
18/08/07 10:52:25 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `DBS` AS t LIMIT 1
18/08/07 10:52:25 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3901d134 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3901d134 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

This is a bug in the MySQL JDBC driver: replace the connector jar mysql-connector-java-5.1.13-bin.jar (under the lib directory) with mysql-connector-java-5.1.32.jar and the import works.
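
A sketch of the swap, assuming the newer driver has already been downloaded (the download path is illustrative):

  cd $SQOOP_HOME/lib
  # move the buggy driver out of the way and drop in the newer one
  mv mysql-connector-java-5.1.13-bin.jar /tmp/
  cp ~/Downloads/mysql-connector-java-5.1.32.jar .
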

  lib git:(master)  sqoop import --connect jdbc:mysql://localhost:3306/hive --table DBS --username root -password root 
Warning: /Users/chenxiaokang/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /Users/chenxiaokang/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/08/07 11:01:47 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/08/07 11:01:47 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/07 11:01:47 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/07 11:01:47 INFO tool.CodeGenTool: Beginning code generation
18/08/07 11:01:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `DBS` AS t LIMIT 1
18/08/07 11:01:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `DBS` AS t LIMIT 1
18/08/07 11:01:48 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /Users/chenxiaokang/hadoop-2.7.6
Note: /tmp/sqoop-chenxiaokang/compile/3ecfbbea71dfb1dd1314eba358b9a7d7/DBS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/08/07 11:01:52 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-chenxiaokang/compile/3ecfbbea71dfb1dd1314eba358b9a7d7/DBS.jar
18/08/07 11:01:52 WARN manager.MySQLManager: It looks like you are importing from mysql.
18/08/07 11:01:52 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
18/08/07 11:01:52 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
18/08/07 11:01:52 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
18/08/07 11:01:52 INFO mapreduce.ImportJobBase: Beginning import of DBS
18/08/07 11:01:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 11:01:53 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/08/07 11:01:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/08/07 11:01:56 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
18/08/07 11:02:02 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/07 11:02:02 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`DB_ID`), MAX(`DB_ID`) FROM `DBS`
18/08/07 11:02:02 INFO mapreduce.JobSubmitter: number of splits:4
18/08/07 11:02:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533537460397_0001
18/08/07 11:02:04 INFO impl.YarnClientImpl: Submitted application application_1533537460397_0001
18/08/07 11:02:05 INFO mapreduce.Job: The url to track the job: http://172.20.10.3:8088/proxy/application_1533537460397_0001/
18/08/07 11:02:05 INFO mapreduce.Job: Running job: job_1533537460397_0001
18/08/07 11:02:23 INFO mapreduce.Job: Job job_1533537460397_0001 running in uber mode : false
18/08/07 11:02:23 INFO mapreduce.Job:  map 0% reduce 0%
18/08/07 11:02:38 INFO mapreduce.Job:  map 50% reduce 0%
18/08/07 11:02:39 INFO mapreduce.Job:  map 100% reduce 0%
18/08/07 11:02:40 INFO mapreduce.Job: Job job_1533537460397_0001 completed successfully
18/08/07 11:02:40 INFO mapreduce.Job: Counters: 31
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=565736
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=417
        HDFS: Number of bytes written=158
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters 
        Killed map tasks=1
        Launched map tasks=4
        Other local map tasks=4
        Total time spent by all maps in occupied slots (ms)=50716
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=50716
        Total vcore-milliseconds taken by all map tasks=50716
        Total megabyte-milliseconds taken by all map tasks=51933184
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=417
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=454
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=440926208
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=158
18/08/07 11:02:40 INFO mapreduce.ImportJobBase: Transferred 158 bytes in 44.4693 seconds (3.553 bytes/sec)
18/08/07 11:02:40 INFO mapreduce.ImportJobBase: Retrieved 2 records.

We can see that the MySQL data we imported is now in HDFS (an equivalent check with the HDFS CLI is sketched after the listing):

0: jdbc:hive2://localhost:10000> dfs -ls /user/hdfs;
+----------------------------------------------------------------------------+--+
|                                 DFS Output                                 |
+----------------------------------------------------------------------------+--+
| Found 1 items                                                              |
| drwxr-xr-x   - hdfs supergroup          0 2018-08-07 11:02 /user/hdfs/DBS  |
+----------------------------------------------------------------------------+--+
2 rows selected (0.01 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hdfs/DBS/part-m-00000;
+----------------------------------------------------------------------------------------+--+
|                                       DFS Output                                       |
+----------------------------------------------------------------------------------------+--+
| 1,Default Hive database,hdfs://localhost:9000/user/hive/warehouse,default,public,ROLE  |
+----------------------------------------------------------------------------------------+--+
1 row selected (0.024 seconds)
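
The same verification can also be done from an ordinary shell with the HDFS CLI, without going through beeline:

  hdfs dfs -ls /user/hdfs/DBS
  hdfs dfs -cat /user/hdfs/DBS/part-m-00000
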
  • The import process

    • Sqoop performs the import through a MapReduce job that reads records from the table row by row and writes them into HDFS:
      • Before the import starts, Sqoop fetches the metadata it needs from the database over JDBC, such as the imported table's column names and data types;
      • The database data types (varchar, number, and so on) are then mapped to Java types (String, int, and so on); from this information Sqoop generates a class, named after the table, that handles deserialization and holds one row of the table (this step can also be run on its own, as sketched after this sub-list);
      • Sqoop launches the MapReduce job;
      • During the job's input phase, the table contents are read over JDBC and deserialized using the Sqoop-generated class;
      • Finally, the records are written to HDFS; the same generated class handles serialization during the write.
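
A minimal sketch of running the code-generation step by itself with Sqoop's codegen tool, reusing the DBS table and credentials from the import above:

  # generate (but do not run) the table class for DBS; -P prompts for the password
  sqoop codegen --connect jdbc:mysql://localhost:3306/hive \
      --table DBS --username root -P
  # the tool writes DBS.java and DBS.jar under /tmp/sqoop-$USER/compile/,
  # the same artifacts the import job compiled in the log above
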
    • A Sqoop import is usually carried out by more than one map task, each task fetching a portion of the table. If a single map task did the whole import, step four would simply execute SELECT col1, col2, ... FROM table;
    • With multiple map tasks, the table must be split horizontally, usually on its primary key. When launching the MapReduce job, Sqoop first queries the minimum and maximum of the split column over JDBC, then divides the data among the tasks (the task count is set with -m), e.g. SELECT col1, col2, ... FROM table WHERE id >= 0 AND id < 50000; and SELECT col1, col2, ... FROM table WHERE id >= 50000 AND id < 100000;
    • The data distribution of the split column strongly affects the performance of a parallel import: a uniform distribution performs best, while heavily skewed data performs badly. It is therefore worth sampling the split column before importing to understand its distribution; a quick check and an explicit parallel import are sketched below.
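
A hypothetical sketch, assuming DB_ID as the split column (which matches the BoundingValsQuery in the import log above):

  # inspect the split column's range and row count before importing
  mysql -uroot -p -e 'SELECT MIN(DB_ID), MAX(DB_ID), COUNT(*) FROM hive.DBS;'
  # import with an explicit split column and four map tasks
  sqoop import --connect jdbc:mysql://localhost:3306/hive \
      --table DBS --split-by DB_ID -m 4 --username root -P
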
    • Sqoop gives fine-grained control over the import; you do not have to import every column of a table each time. Sqoop lets us specify the table's columns, add a WHERE clause to the query, or supply a free-form query SQL statement, in which any function supported by the database being queried can be used (examples are sketched below).
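
Hypothetical invocations for each case; note that --query requires --target-dir and a literal $CONDITIONS placeholder, plus --split-by when more than one mapper is used:

  # selected columns plus a WHERE filter
  sqoop import --connect jdbc:mysql://localhost:3306/hive --table DBS \
      --columns "DB_ID,NAME,DB_LOCATION_URI" --where "DB_ID >= 1" \
      --username root -P
  # free-form query; Sqoop substitutes its split predicate for $CONDITIONS
  sqoop import --connect jdbc:mysql://localhost:3306/hive \
      --query 'SELECT DB_ID, NAME FROM DBS WHERE $CONDITIONS' \
      --split-by DB_ID --target-dir /user/hdfs/DBS_query \
      --username root -P
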
    • When importing into HDFS, we can create the table in Hive beforehand: sqoop create-hive-table --connect jdbc:mysql://master:3306/hive --table DBS --fields-terminated-by ',' --username [username] --password [password], and then LOAD the data (sketched below).
    • Sqoop's default output format is comma-separated, so in the Sqoop table-creation command we use --fields-terminated-by ',' to declare the column delimiter of the DBS table in Hive.
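
A sketch of the LOAD step via beeline, reusing the HDFS path from the import above (the Hive table name is assumed to match):

  beeline -u jdbc:hive2://localhost:10000 \
      -e "LOAD DATA INPATH '/user/hdfs/DBS' INTO TABLE DBS;"
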
    • Importing into HDFS, creating the table, and loading the data can also be merged into a single step: sqoop import --connect jdbc:mysql://master:3306/hive --table DBS --username [username] --password [password] -m [num] --hive-import
  • The export process

    • To export a Hive table to a database, you must first create a table in the database to receive the data.
    • As with imports, Sqoop generates a Java class from the target table's structure (steps one and two) to handle serialization and deserialization. It then launches a MapReduce job (step three) in which the generated class reads the data from HDFS (step four) and produces a batch of INSERT statements, each inserting multiple records into the MySQL target table (step five). Reads and writes thus both run in parallel, with write throughput bounded by the target database's write performance.

Example: export the sc table to MySQL

0: jdbc:hive2://localhost:10000> DESC sc;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| id        | bigint     |          |
| courseid  | bigint     |          |
| account   | string     |          |
+-----------+------------+----------+--+
3 rows selected (0.185 seconds)
0: jdbc:hive2://localhost:10000> 
mysql> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create table sc(id bigint, courseid bigint, account varchar(32));
Query OK, 0 rows affected (0.03 sec)
➜  bin git:(master) ✗ sqoop export --connect jdbc:mysql://localhost:3306/test --table sc --export-dir /user/hive/warehouse/sc --username root --password root -m 1 --fields-terminated-by ',';
Warning: /Users/chenxiaokang/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /Users/chenxiaokang/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/08/07 12:23:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/08/07 12:23:32 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/07 12:23:32 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/07 12:23:32 INFO tool.CodeGenTool: Beginning code generation
18/08/07 12:23:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `sc` AS t LIMIT 1
18/08/07 12:23:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `sc` AS t LIMIT 1
18/08/07 12:23:33 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /Users/chenxiaokang/hadoop-2.7.6
Note: /tmp/sqoop-chenxiaokang/compile/fb350fa941a369d077323a7b646b5380/sc.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/08/07 12:23:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-chenxiaokang/compile/fb350fa941a369d077323a7b646b5380/sc.jar
18/08/07 12:23:35 INFO mapreduce.ExportJobBase: Beginning export of sc
18/08/07 12:23:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 12:23:35 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/08/07 12:23:37 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
18/08/07 12:23:37 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
18/08/07 12:23:37 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/08/07 12:23:37 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
18/08/07 12:23:42 INFO input.FileInputFormat: Total input paths to process : 1
18/08/07 12:23:42 INFO input.FileInputFormat: Total input paths to process : 1
18/08/07 12:23:43 INFO mapreduce.JobSubmitter: number of splits:1
18/08/07 12:23:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
18/08/07 12:23:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533537460397_0002
18/08/07 12:23:44 INFO impl.YarnClientImpl: Submitted application application_1533537460397_0002
18/08/07 12:23:44 INFO mapreduce.Job: The url to track the job: http://172.20.10.3:8088/proxy/application_1533537460397_0002/
18/08/07 12:23:44 INFO mapreduce.Job: Running job: job_1533537460397_0002
18/08/07 12:23:57 INFO mapreduce.Job: Job job_1533537460397_0002 running in uber mode : false
18/08/07 12:23:57 INFO mapreduce.Job:  map 0% reduce 0%
18/08/07 12:24:04 INFO mapreduce.Job:  map 100% reduce 0%
18/08/07 12:24:05 INFO mapreduce.Job: Job job_1533537460397_0002 completed successfully
18/08/07 12:24:05 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=141082
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1044
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4797
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=4797
        Total vcore-milliseconds taken by all map tasks=4797
        Total megabyte-milliseconds taken by all map tasks=4912128
    Map-Reduce Framework
        Map input records=82
        Map output records=82
        Input split bytes=132
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=74
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=121110528
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
18/08/07 12:24:05 INFO mapreduce.ExportJobBase: Transferred 1.0195 KB in 28.176 seconds (37.0528 bytes/sec)
18/08/07 12:24:05 INFO mapreduce.ExportJobBase: Exported 82 records.