A detailed look at the --split-by and -m parameters when importing data with Sqoop

--split-by is handled differently depending on the column's type. For an int column, Sqoop fetches the maximum and minimum values of the split column and divides that range into as many slices as the number of mappers passed via --num-mappers (-m). For example, if SELECT MAX(split_by), MIN(split_by) FROM <table> returns 1000 and 1, and -m is 2, the range is cut into two slices, [1, 500] and [501, 1000], and two SQL statements are generated, one per map task: SELECT ... FROM table WHERE split_by >= 1 AND split_by < 501, and SELECT ... FROM table WHERE split_by >= 501 AND split_by <= 1000. Each map task then runs its own statement and imports the rows it returns.
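A minimal sketch of those two steps, assuming a hypothetical int split column empno whose values run from 1 to 1000 (the column name and value range are illustrative, not taken from the output below):

```sql
-- Step 1: Sqoop first fetches the boundaries of the split column.
SELECT MIN(empno), MAX(empno) FROM emp;   -- assume this returns 1 and 1000

-- Step 2: with -m 2, each map task is handed one range query, roughly:
SELECT * FROM emp WHERE empno >= 1   AND empno < 501;     -- map task 0
SELECT * FROM emp WHERE empno >= 501 AND empno <= 1000;   -- map task 1
```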
```sh
sqoop import \
--connect jdbc:mysql://hadoop01:3306/feidata \
--username xxx \
--password xxx \
--table emp \
--hive-overwrite \
--delete-target-dir \
--hive-import \
--hive-database fei_hive \
--hive-table emp_import_partition \
--fields-terminated-by '\t' \
--split-by deptno
```

If -m is not set, it defaults to 4 mappers:

```
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00000
20/01/13 16:34:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369  SMITH   CLERK      7902  1980-12-17 00:00:00.0  800.00   null     1
7499  ALLEN   SALESMAN   7698  1981-02-20 00:00:00.0  1600.00  300.00   2
7521  WARD    SALESMAN   7698  1981-02-22 00:00:00.0  1250.00  500.00   3
7876  ADAMS   CLERK      7788  1983-01-12 00:00:00.0  1100.00  null     1
7900  JAMES   CLERK      7698  1981-12-03 00:00:00.0  950.00   null     2
7902  FORD    ANALYST    7566  1981-12-03 00:00:00.0  3000.00  null     3
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00001
20/01/13 16:34:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7566  JONES   MANAGER    7839  1981-04-02 00:00:00.0  2975.00  null     4
7654  MARTIN  SALESMAN   7698  1981-09-28 00:00:00.0  1250.00  1400.00  5
7934  MILLER  CLERK      7782  1982-01-23 00:00:00.0  1300.00  null     4
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00002
20/01/13 16:34:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7698  BLAKE   MANAGER    7839  1981-05-01 00:00:00.0  2850.00  null     6
7782  CLARK   MANAGER    7839  1981-06-09 00:00:00.0  2450.00  null     7
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00003
20/01/13 16:34:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7788  SCOTT   ANALYST    7566  1982-12-09 00:00:00.0  3000.00  null     8
7839  KING    PRESIDENT  null  1981-11-17 00:00:00.0  5000.00  null     9
7844  TURNER  SALESMAN   7698  1981-09-08 00:00:00.0  1500.00  0.00     10
```

The four map ranges above work out to (0,3], (3,5], (5,7], (7,10] (in this demo table deptno runs from 1 to 10, as the last column of the output shows).
The problem above: when the split-by column is an int but its values are not a contiguous auto-increment sequence, the data handed to each map task is unbalanced. Some map tasks end up very busy while others have almost nothing to process, and with a large table the heavily loaded range queries can crash the source database during extraction.
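Before settling on a split column, it can help to eyeball its value distribution with a quick aggregation; a minimal sketch against the example table (table and column names taken from the import above):

```sql
-- Rows per distinct split-column value; a heavily uneven result
-- means the per-mapper ranges will be unbalanced too.
SELECT deptno, COUNT(*) AS cnt
FROM emp
GROUP BY deptno
ORDER BY cnt DESC;
```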
The same problem appears when the split-by column is not an int. The workaround found so far is to set -m to 1 and leave --split-by unset, so only a single map task runs. The drawback is that the import loses map-level parallelism (note that when -m is greater than 1, --split-by must specify a column, unless the table has a primary key Sqoop can split on), and with a large table a single heavy extraction can still crash the source database.
```sh
sqoop import \
--connect jdbc:mysql://hadoop01:3306/feidata \
--username xxx \
--password xxx \
--table emp \
--hive-overwrite \
--delete-target-dir \
--hive-import \
--hive-database fei_hive \
--hive-table emp_import_partition \
--fields-terminated-by '\t' \
--split-by hiredate \
--m 1
```

```
[hadoop@hadoop01 ~]$ hdfs dfs -cat /user/hive/warehouse/fei_hive3.db/emp_import_partition/part-m-00000
20/01/13 17:07:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369  SMITH   CLERK      7902  1980-12-17 00:00:00.0  800.00   null     20
7499  ALLEN   SALESMAN   7698  1981-02-20 00:00:00.0  1600.00  300.00   30
7521  WARD    SALESMAN   7698  1981-02-22 00:00:00.0  1250.00  500.00   30
7566  JONES   MANAGER    7839  1981-04-02 00:00:00.0  2975.00  null     20
7654  MARTIN  SALESMAN   7698  1981-09-28 00:00:00.0  1250.00  1400.00  30
7698  BLAKE   MANAGER    7839  1981-05-01 00:00:00.0  2850.00  null     30
7782  CLARK   MANAGER    7839  1981-06-09 00:00:00.0  2450.00  null     10
7788  SCOTT   ANALYST    7566  1982-12-09 00:00:00.0  3000.00  null     20
7839  KING    PRESIDENT  null  1981-11-17 00:00:00.0  5000.00  null     10
7844  TURNER  SALESMAN   7698  1981-09-08 00:00:00.0  1500.00  0.00     30
7876  ADAMS   CLERK      7788  1983-01-12 00:00:00.0  1100.00  null     20
7900  JAMES   CLERK      7698  1981-12-03 00:00:00.0  950.00   null     30
7902  FORD    ANALYST    7566  1981-12-03 00:00:00.0  3000.00  null     20
7934  MILLER  CLERK      7782  1982-01-23 00:00:00.0  1300.00  null     10
```
A time column can, however, serve as the partition field (by month or by day), with a query condition like --query "select * from t where date='xxx'". The dates to load can be kept in a file, and a shell script can loop over them, issuing one import per date with -m 1 (a sketch follows). Pulling one slice at a time this way keeps the extraction from dragging the database down.
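A minimal sketch of that loop, reusing the connection settings from the examples above; the file name dates.txt and the target directory are assumptions, and the table t comes from the query condition quoted above. Note that Sqoop requires the literal $CONDITIONS token inside a --query statement, and --query also requires an explicit --target-dir:

```sh
#!/bin/bash
# Import one date slice per iteration with a single mapper.
# dates.txt holds one partition date per line, e.g. 2020-01-13 (file name assumed).
while read dt; do
  sqoop import \
    --connect jdbc:mysql://hadoop01:3306/feidata \
    --username xxx \
    --password xxx \
    --query "select * from t where date='${dt}' and \$CONDITIONS" \
    --target-dir /user/hadoop/t_import/dt=${dt} \
    --fields-terminated-by '\t' \
    -m 1
done < dates.txt
```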