Sqoop Import from MySQL into Hive: the Python Scripts


Contents

1. Current business data collection details
2. Script deployment tree
3. Syncing the sqoop + Python management scripts to the inf environment
4. Scheduling the scripts with crontab
5. Testing
5.1 Failed runs
5.2 Successful run


1. Current business data collection details

  • The current business data collection is implemented with sqoop plus a set of Python management scripts
  • It is deployed at sqoop@yz-bi-web01.lehe.com:/home/sqoop/sqoop
  • The scripts are launched periodically by crontab
  • The core script /home/sqoop/sqoop/bin/sqoop_import.py splits the work into day-level and hour-level tasks; crontab drives them through /home/sqoop/sqoop/bin/sqoop_import_hour.sh and /home/sqoop/sqoop/bin/sqoop_import_day.sh
  • /home/sqoop/sqoop/conf/confset records the names of the per-database configurations to import
  • Files such as /home/sqoop/sqoop/conf/higo_account.conf hold the contents of those import configurations
  • An import of any configured database can also be forced manually (a hedged sketch of what such a run boils down to follows this list):
  • cd /home/sqoop/sqoop && python bin/sqoop_import.py -t ods_higo_pandora_order
  • This forces a one-off import of t_pandora_order into ods_higo_pandora_order in Hive
  • The MySQL databases that can be imported into Hadoop are configured in the t_dolphin_stat_db_info table of the higo_bilog database on the yz-bi-store00 host
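
Each configured task eventually boils down to a sqoop import invocation like the one logged in section 5.2. The snippet below is only a rough sketch of that call path, assuming a task dict parsed from the conf files; it is not the real bin/sqoop_import.py, and the field names are illustrative.

# A minimal sketch (not the real bin/sqoop_import.py) of how a configured task
# maps to a sqoop invocation; option values follow the command logged in 5.2.
import subprocess

def build_sqoop_cmd(task):
    # 'task' is a dict assumed to come from conf/<db>.conf for one table
    return [
        '/hadoop/sqoop/bin/sqoop', 'import',
        '--connect', 'jdbc:mysql://%s:%s/%s?zeroDateTimeBehavior=convertToNull'
                     % (task['host'], task['port'], task['db']),
        '--username', task['user'],
        '--password', task['passwd'],
        '--fields-terminated-by', '\\001',
        '--hive-import', '--hive-overwrite',
        '--num-mappers', '1',
        '--table', task['mysql_table'],
        '--hive-table', 'default.%s' % task['hive_table'],
    ]

def run_task(task):
    # The real script also handles tmp dirs, column mappings, status logging
    # and mail/SMS alerts; those steps are omitted here.
    return subprocess.call(build_sqoop_cmd(task))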

2. Script deployment tree

.
├── bin
│   ├── find.sh
│   ├── logger.py
│   ├── mailer.py
│   ├── mailman.py
│   ├── mailssl2.py
│   ├── mailssl.py
│   ├── make_time_flag.py
│   ├── monitor.conf
│   ├── sendSMS.py
│   ├── sqoop_import_day.sh
│   ├── sqoop_import_hour.sh
│   ├── sqoop_import.py
│   ├── sqoop-monitor-day.py
│   ├── sqoop-monitor-day.sh
│   ├── statutil.py
│   └── table_info.py
├── conf
│   ├── brand_library.conf
│   ├── column.conf
│   ├── confset
│   ├── gen.py
│   ├── higo_account.conf
│   ├── higo_adx.conf
│   ├── higo_base.conf
│   ├── higo_comment.conf
│   ├── higo_content.conf
│   ├── higo_coupon.conf
│   ├── higo_crm.conf
│   ├── higo_data.conf
│   ├── higo_front.conf
│   ├── higo_goods.conf
│   ├── higo_order.conf
│   ├── higo_push.conf
│   ├── higo_risk.conf
│   ├── higo_trolley.conf
│   ├── hive_meta.conf
│   ├── im.conf
│   ├── kefu.conf
│   ├── metric_meta.conf
│   └── xz_store.conf
├── failed_log
├── log
│   ├── sqoop_import_day_2020-05-20.log
│   ├── sqoop_import_hour_2020-05-20_10.log
│   └── sqoop_info.2020-05-20_10:05:01
├── script
│   ├── acaleph_sign.py
│   ├── binlog_sqoop
│   │   ├── gen_cmd.py
│   │   ├── map.conf
│   │   ├── run_hql.sh
│   │   ├── run.py
│   │   └── transform_binlog.py
│   ├── compare_all_task.py
│   ├── sqoop_import.test.py
│   ├── sqoop_mysql.txt
│   └── trans_ini2info.py
└── test
    ├── conf
    │   ├── bat.conf
    │   ├── cheetah.conf
    │   ├── confset
    │   ├── db_processesnum.conf
    │   └── mlspay_account.conf
    ├── table_cfg_maker.py
    └── test_connect.py

3. Syncing the sqoop + Python management scripts to the inf environment

# cd /home/sqoop
# rsync -avprP -e ssh 10.20.2.24:/home/sqoop/galaxy ./
# Create the symlink
# ln -s galaxy/trunk/libra/allstars/sqoop sqoop
# chown -R sqoop.sqoop /home/sqoop/

4. Scheduling the scripts with crontab

$ crontab -l
5 * * * * sh /home/sqoop/sqoop/bin/sqoop_import_hour.sh > /tmp/sqoop_import_hour.log 2>&1 &
20 0 * * * sh /home/sqoop/sqoop/bin/sqoop_import_day.sh > /tmp/sqoop_import_day.log 2>&1 &

5. Testing

5.1 Failed runs

# Issue 01 (raised while sqoop was importing data): ERROR tool.ImportTool: Import failed: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
    at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
    at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.connect(Job.java:1256)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:200)
    at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:173)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:270)
    at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
    at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:127)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
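
The post does not record how issue 01 was resolved. Since the message itself points at mapreduce.framework.name, one quick check (a hypothetical helper, not part of the deployed scripts) is to print that property from mapred-site.xml under $HADOOP_HOME:

# Hypothetical diagnostic, not part of the deployed scripts: print
# mapreduce.framework.name from mapred-site.xml, the property the error
# message asks to check ("yarn" is expected on a YARN cluster).
import os
import xml.etree.ElementTree as ET

conf = os.path.join(os.environ.get('HADOOP_HOME', '/hadoop/hadoop-2.7.1'),
                    'etc/hadoop/mapred-site.xml')
for prop in ET.parse(conf).getroot().findall('property'):
    if prop.findtext('name') == 'mapreduce.framework.name':
        print(prop.findtext('value'))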
 
# Issue 02 (raised while sqoop was importing data): Unsupported major.minor version 52.0
# Root cause: the script explicitly runs sqoop with JDK 1.8, while the CDH cluster runs JDK 1.7, so sqoop fails with "Unsupported major.minor version 52.0".
# The underlying mismatch is between Cloudera Manager and the operating system: CM uses the older JDK (1.7 here), while java -version on the OS reports the newer one (1.8 here).
# Sqoop auto-generates code, compiles it into a job jar, and submits that jar to MapReduce, so a jar compiled from the shell with JDK 1.8 ends up running on the JDK 1.7 runtime that this CM 5.3.0 cluster was installed with; classes compiled by a newer JDK cannot run on an older JVM, hence the error.
# Fix: switch back to JDK 1.7
# ll
drwxr-xr-x  8 root  root  4096 Dec 19  2014 jdk1.7.0_76
drwxr-xr-x  8 root  root  4096 Jul 22  2017 jdk1.8.0_144
# vim /etc/profile.d/java.sh
    #export JAVA_HOME=/data/jdk1.8.0_144
    #export PATH=$JAVA_HOME/bin:$PATH
    #export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
    export JAVA_HOME=/data/jdk1.7.0_76
    export PATH=$JAVA_HOME/bin:$PATH
# source /etc/profile
  
# Update the JAVA_HOME environment variable in hadoop-env.sh under $HADOOP_HOME/etc/hadoop
# echo $HADOOP_HOME
# vim /hadoop/hadoop-2.7.1/etc/hadoop/hadoop-env.sh
    export JAVA_HOME=/data/jdk1.7.0_76
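
To confirm which JDK actually compiled the generated job classes, the class-file major version can be read directly (52 means Java 8, 51 means Java 7), for example against a class unpacked from the jar under /tmp/sqoop-sqoop/compile/. This is a hypothetical helper, not part of the original scripts:

# Hypothetical helper, not part of the deployed scripts: read the class-file
# major version to see which JDK compiled it (52 = Java 8, 51 = Java 7).
import struct
import sys

def class_major_version(path):
    with open(path, 'rb') as f:
        magic, minor, major = struct.unpack('>IHH', f.read(8))
    if magic != 0xCAFEBABE:
        raise ValueError('%s is not a .class file' % path)
    return major

if __name__ == '__main__':
    print(class_major_version(sys.argv[1]))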

5.2 Successful run

$ cd /home/sqoop/sqoop && python bin/sqoop_import.py -t ods_data_visual_table_wf
2019-06-25 11:55:33,447-bin/sqoop_import.py-clean_env-1272-INFO-source /etc/bashrc; hadoop fs -rmr /tmp/sqoop_tmp/20190622_* 2>/dev/null
[INFO][get_config_db]db:higo_hive, master=0
[INFO][init_db_info]self.db: higo_hive, self.master: 0
[INFO]Connect: {'db': u'higo_hive', 'host': u'10.20.2.22', 'master': 0, 'user': u'mlsreader', 'port': 3412}
[INFO][get_config_db]db:higo_hive, master=0
[INFO][init_db_info]self.db: higo_hive, self.master: 0
[INFO]Connect: {'db': u'higo_hive', 'host': u'10.20.2.22', 'master': 0, 'user': u'mlsreader', 'port': 3412}
[INFO][get_config_db]db:higo_bilog, master=0
[INFO][init_db_info]self.db: higo_bilog, self.master: 0
[INFO]Connect: {'db': u'higo_bilog', 'host': u'10.20.2.22', 'master': 0, 'user': u'mlsreader', 'port': 3411}
  
.....
  
2019-06-25 11:55:33,690-bin/sqoop_import.py-worker-873-INFO-cmd:/hadoop/sqoop/bin/sqoop import --connect "jdbc:mysql://10.20.2.22:3411/metric_meta?zeroDateTimeBehavior=convertToNull&socketTimeout=0" --username mlsreader --password 'RMlSxs&^c6OpIAQ1' --fields-terminated-by '\001' --warehouse-dir /tmp/sqoop_tmp/20190625_86116640 --outdir /tmp/sqoop/20190625_86116640 --hive-import --hive-overwrite --hive-delims-replacement=' '  --num-mappers 1 --table t_visual_table_wf  --columns id,cn_name,project,creater,flag,type,modify_user,chinese_name,create_date  --map-column-java id=Long,flag=Integer,type=Integer,create_date=String  --map-column-hive id=BIGINT,flag=TINYINT,type=TINYINT,create_date=TIMESTAMP  --hive-table default.ods_data_visual_table_wf
2019-06-25 11:55:33,690-bin/sqoop_import.py-worker-874-INFO-overwrite_cmd:None
19/06/25 11:55:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
19/06/25 11:55:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/06/25 11:55:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/06/25 11:55:34 INFO tool.CodeGenTool: Beginning code generation
19/06/25 11:55:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_visual_table_wf` AS t LIMIT 1
19/06/25 11:55:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_visual_table_wf` AS t LIMIT 1
19/06/25 11:55:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop/hadoop/share/hadoop/mapreduce
Note: /tmp/sqoop-sqoop/compile/cda9524fa9816307b96174b1316369d5/t_visual_table_wf.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/06/25 11:55:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-sqoop/compile/cda9524fa9816307b96174b1316369d5/t_visual_table_wf.jar
19/06/25 11:55:35 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/06/25 11:55:35 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/06/25 11:55:35 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/06/25 11:55:35 INFO mapreduce.ImportJobBase: Beginning import of t_visual_table_wf
19/06/25 11:55:36 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/06/25 11:55:36 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/06/25 11:55:37 INFO impl.TimelineClientImpl: Timeline service address: http://yz-higo-jt.lehe.com:8188/ws/v1/timeline/
19/06/25 11:55:37 INFO client.RMProxy: Connecting to ResourceManager at yz-higo-jt.lehe.com/10.20.3.47:8032
19/06/25 11:55:38 INFO db.DBInputFormat: Using read commited transaction isolation
19/06/25 11:55:38 INFO mapreduce.JobSubmitter: number of splits:1
19/06/25 11:55:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1554441177879_240989
19/06/25 11:55:39 INFO impl.YarnClientImpl: Submitted application application_1554441177879_240989
19/06/25 11:55:39 INFO mapreduce.Job: The url to track the job: http://yz-higo-jt.lehe.com:8088/proxy/application_1554441177879_240989/
19/06/25 11:55:39 INFO mapreduce.Job: Running job: job_1554441177879_240989
19/06/25 11:55:45 INFO mapreduce.Job: Job job_1554441177879_240989 running in uber mode : false
19/06/25 11:55:45 INFO mapreduce.Job:  map 0% reduce 0%
19/06/25 11:55:51 INFO mapreduce.Job:  map 100% reduce 0%
19/06/25 11:55:51 INFO mapreduce.Job: Job job_1554441177879_240989 completed successfully
19/06/25 11:55:51 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=138498
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=17660
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=6170400
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3428
        Total vcore-seconds taken by all map tasks=3428
        Total megabyte-seconds taken by all map tasks=6170400
    Map-Reduce Framework
        Map input records=189
        Map output records=189
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=23
        CPU time spent (ms)=1080
        Physical memory (bytes) snapshot=377196544
        Virtual memory (bytes) snapshot=2446786560
        Total committed heap usage (bytes)=1019609088
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=17660
19/06/25 11:55:51 INFO mapreduce.ImportJobBase: Transferred 17.2461 KB in 14.9276 seconds (1.1553 KB/sec)
19/06/25 11:55:51 INFO mapreduce.ImportJobBase: Retrieved 189 records.
19/06/25 11:55:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_visual_table_wf` AS t LIMIT 1
19/06/25 11:55:51 WARN hive.TableDefWriter: Column create_date had to be cast to a less precise type in Hive
19/06/25 11:55:51 INFO hive.HiveImport: Loading uploaded data into Hive
19/06/25 11:55:55 INFO hive.HiveImport:
19/06/25 11:55:55 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/hadoop/hive-2.1.0/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
19/06/25 11:55:58 INFO hive.HiveImport: OK
19/06/25 11:55:58 INFO hive.HiveImport: Time taken: 1.751 seconds
19/06/25 11:55:58 INFO hive.HiveImport: Loading data to table default.ods_data_visual_table_wf
19/06/25 11:55:58 INFO hive.HiveImport: Moved: 'hdfs://yz-higo-nn1:9000/user/hive/warehouse/ods_data_visual_table_wf/part-m-00000' to trash at: hdfs://yz-higo-nn1:9000/user/sqoop/.Trash/Current
19/06/25 11:55:59 INFO hive.HiveImport: OK
19/06/25 11:55:59 INFO hive.HiveImport: Time taken: 0.804 seconds
19/06/25 11:55:59 INFO hive.HiveImport: Hive import complete.
[INFO][get_config_db]db:higo_bilog, master=0
2019-06-25 11:56:02,039-bin/sqoop_import.py-exec_stat_sql-167-INFO-replace_sql:replace into t_dolphin_sqoop_table_status(`date`, `tablename`, `hpath`, `hsize`,                 `mysqltbl`, `host`, `port`, `user`, `passwd`, `mysqldb`, `begintime`, `costtime`)                 value('2019-06-25', 'ods_data_visual_table_wf', '/user/hive/warehouse/ods_data_visual_table_wf', 17660, 't_visual_table_wf', '10.20.2.22', 3411, 'mlsreader', 'RMlSxs&^c6OpIAQ1', 'metric_meta', '01:10', 26)
[INFO][init_db_info]self.db: higo_bilog, self.master: 1
[INFO]Connect: {'db': u'higo_bilog', 'host': u'10.20.2.22', 'master': 1, 'user': u'mlswriter', 'port': 3411}
 /hadoop/hadoop/bin/hadoop fs -touchz /user/data_ready_tag/default_ods_data_visual_table_wf/ods_data_visual_table_wf_2019-06-25_day
touchz: `/user/data_ready_tag/default_ods_data_visual_table_wf/ods_data_visual_table_wf_2019-06-25_day': No such file or directory
2019-06-25 11:56:04,139-bin/sqoop_import.py-manager-1046-INFO-cmd:source /etc/bashrc; hadoop fs -dus /user/hive/warehouse/ods_data_visual_table_wf 2>/dev/null | awk '{print $1}'
2019-06-25 11:56:08,080-bin/sqoop_import.py-manager-1138-INFO-37920 40401 执行任务数:1 用时:34 second
2019-06-25 11:56:08,146-bin/sqoop_import.py-report_mail-1209-INFO-ts:ods_data_visual_table_wf,mts:t_visual_table_wf,hsize:17660
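
Each successful run is also recorded in the t_dolphin_sqoop_table_status table of higo_bilog, as the replace_sql line above shows. A quick way to read back the latest status of a table is a query like the following; this is a hypothetical check rather than part of the deployed code, with the connection details taken from the log and the password left out:

# Hypothetical check, not part of the deployed scripts: read back the latest
# status row that sqoop_import.py wrote for a given Hive table.
import MySQLdb

conn = MySQLdb.connect(host='10.20.2.22', port=3411, user='mlsreader',
                       passwd='***', db='higo_bilog')
cur = conn.cursor()
cur.execute(
    "SELECT `date`, `tablename`, `hsize`, `begintime`, `costtime` "
    "FROM t_dolphin_sqoop_table_status "
    "WHERE `tablename` = %s ORDER BY `date` DESC LIMIT 1",
    ('ods_data_visual_table_wf',))
print(cur.fetchone())
cur.close()
conn.close()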