Importing Data from a Database into HDFS with Sqoop

Omnipotent-Youth

 
    # Oracle connection string: contains the Oracle host address, port, and SID
    CONNECTURL=jdbc:oracle:thin:@20.135.60.21:1521:DWRAC2

    # User name
    ORACLENAME=kkaa

    # Password
    ORACLEPASSWORD=kkaa123

    # Name of the Oracle table to import
    oralceTableName=tt

    # Columns of the Oracle table to import
    columns=AREA_ID,TEAM_NAME

    # HDFS path where the data imported from Oracle is stored
    hdfsPath=apps/as/hive/$oralceTableName

    # Run the import: copy the data from Oracle into HDFS
    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" \
        --num-mappers 1 \
        --table "$oralceTableName" \
        --columns "$columns" \
        --fields-terminated-by '\001'
 
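Once the script above has run, the import is complete.
Next, you can create an external table yourself and point its location at the HDFS path that holds the Oracle data.

Note: the data this script writes to HDFS is in text format, so when you create the Hive external table you do not need to specify RCFile as the file format; the default TextFile is enough. Fields are separated by '\001'. If you import the same table several times, the data is appended to the HDFS directory.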
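As a rough sketch of the external-table step described above, the shell function below only builds the DDL text. The function name, the table name `tt_ext`, the STRING column types, and the absolute location `/user/hadoop/apps/as/hive/tt` are all assumptions made for illustration; only the column names, the TextFile format, and the '\001' delimiter come from the script above.

```shell
# Build the DDL for an external Hive table over the Sqoop output directory.
# Assumptions (not from the original post): the helper name and the STRING
# column types. Adjust both to match the real Oracle schema.
build_external_table_ddl() {
    table_name="$1"   # Hive table name (hypothetical, e.g. tt_ext)
    location="$2"     # absolute HDFS path of the Sqoop target directory
    cat <<EOF
CREATE EXTERNAL TABLE ${table_name} (
  AREA_ID STRING,
  TEAM_NAME STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '${location}';
EOF
}

# Possible usage (needs a working Hive client, so it is left commented out):
# hive -e "$(build_external_table_ddl tt_ext /user/hadoop/apps/as/hive/tt)"
```

Keeping the DDL in a function makes it easy to print and review the statement before handing it to Hive.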
    
Parallel import
Suppose we have the following sqoop command, which imports data from Oracle into HDFS:
 
    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" \
        -m 1 --table "$oralceTableName" --columns "$columns" \
        --fields-terminated-by '\001' \
        --where "data_desc='2011-02-26'"
 
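Note the "-m" parameter in this command: it sets how many parallel tasks to use. Its value here is 1, which means parallelism is not enabled.

Now we can raise the value of "-m" to use parallel import, as in this command: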
  
    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" -m 4 \
        --table "$oralceTableName" --columns "$columns" \
        --fields-terminated-by '\001' \
        --where "data_desc='2011-02-26'"
  
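In general, Sqoop then starts 4 map tasks that import the data at the same time.
However, if the table being imported from Oracle has no primary key, the following error appears: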
  
ERROR tool.ImportTool: Error during import: No primary key could be found for table creater_user.popt_cas_redirect_his. Please specify one with --split-by or perform a sequential import with '-m 1'.
 
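In this case, to make better use of Sqoop's parallel import, we need to understand how Sqoop implements it.
If the primary key of the Oracle table is id and the degree of parallelism is 4, Sqoop first runs a query like this: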
  
    select min(id), max(id) from table [where <condition, if a where clause was specified>];
  
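This query returns the minimum and maximum values of the split column (id); suppose they are 1 and 1000.
Sqoop then splits the query according to the number of parallel tasks. In the example above, the import is split into roughly the following 4 SQL statements, which run at the same time: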
  
    select * from table where id >= 1 and id < 250;

    select * from table where id >= 250 and id < 500;

    select * from table where id >= 500 and id < 750;

    select * from table where id >= 750 and id <= 1000;
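The range arithmetic behind this kind of split can be sketched in a few lines of shell. This is a simplified illustration, not Sqoop's actual splitter code: the function name is hypothetical, it only handles integer bounds, and because it rounds the step up, its boundaries can differ by one from the approximate figures used in the article.

```shell
# Print one WHERE condition per parallel task for an integer split column.
# Simplified sketch under stated assumptions; real Sqoop splitters may round
# differently and also support non-integer column types.
print_split_conditions() {
    column="$1"; min="$2"; max="$3"; mappers="$4"
    # Integer step, rounded up so that at most $mappers ranges cover min..max.
    step=$(( (max - min + mappers) / mappers ))
    lo="$min"
    i=1
    while [ "$i" -le "$mappers" ] && [ "$lo" -le "$max" ]; do
        hi=$(( lo + step ))
        if [ "$i" -eq "$mappers" ] || [ "$hi" -gt "$max" ]; then
            # Last range: close it with <= so the maximum value is included.
            echo "$column >= $lo AND $column <= $max"
            break
        fi
        echo "$column >= $lo AND $column < $hi"
        lo="$hi"
        i=$(( i + 1 ))
    done
}

# Example: print_split_conditions id 1 1000 4
# first line printed: id >= 1 AND id < 251
# last line printed:  id >= 751 AND id <= 1000
```

The point to notice is that only the last range uses `<=`, so the row holding the maximum value is not dropped.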
 
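Note that this kind of range split works best when the split column is an integer.
The example above also shows what to do when the table has no primary key: we have to pick a suitable split column by hand and choose a suitable number of parallel tasks.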
   
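Here is a real example.
We want to import creater_user.popt_cas_redirect_his from Oracle.
The table has no primary key, so we need to pick a suitable split column by hand.
First, look at which columns the table has.
Then, I assumed that the ds_name column could serve as the split column and ran the following SQL to check: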
            
select min(ds_name), max(ds_name) from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'
 
The result was not good: min and max were equal, so this column is not suitable as a split column.
Next, test another column, CLIENTIP:
  
select min(CLIENTIP), max(CLIENTIP) from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'
 
This result is much better, so we use CLIENTIP as the split column.
We therefore run the parallel import with the following command:
  
    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" -m 12 --split-by CLIENTIP \
        --table "$oralceTableName" --columns "$columns" \
        --fields-terminated-by '\001' \
        --where "data_desc='2011-02-26'"
 
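This run took 20mins, 35sec and imported 33,222,896 rows.
If this kind of split still does not meet our needs, we can run several Sqoop commands at the same time and put the split rule in the where parameter of each one. For example: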
  
    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" -m 1 \
        --table "$oralceTableName" --columns "$columns" \
        --fields-terminated-by '\001' \
        --where "data_desc='2011-02-26' and logtime<'10:00:00'"

    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" -m 1 \
        --table "$oralceTableName" --columns "$columns" \
        --fields-terminated-by '\001' \
        --where "data_desc='2011-02-26' and logtime>='10:00:00'"
 
This also achieves a parallel import.
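One way to actually launch such commands at the same time is with shell background jobs. The sketch below assumes the variables from the script at the top of the article (CONNECTURL, ORACLENAME, ORACLEPASSWORD, hdfsPath, oralceTableName, columns) are already set; the two function names are hypothetical helpers, not part of Sqoop.

```shell
# Import one slice of the table; the slice is selected by the WHERE clause in $1.
# Assumes the connection variables from the earlier script are already set.
import_slice() {
    sqoop import --append --connect "$CONNECTURL" \
        --username "$ORACLENAME" --password "$ORACLEPASSWORD" \
        --target-dir "$hdfsPath" -m 1 \
        --table "$oralceTableName" --columns "$columns" \
        --fields-terminated-by '\001' \
        --where "$1"
}

# Start every slice as a background job, then wait for all of them.
# Returns non-zero if any slice failed.
import_slices_in_parallel() {
    pids=""
    for where_clause in "$@"; do
        import_slice "$where_clause" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Possible usage (needs a working Sqoop installation, so it is commented out):
# import_slices_in_parallel \
#     "data_desc='2011-02-26' and logtime<'10:00:00'" \
#     "data_desc='2011-02-26' and logtime>='10:00:00'"
```

Waiting on each process ID separately, rather than using a bare `wait`, lets the caller detect that one of the slices failed.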
