Table of Contents
1. Current business data collection details
2. Script deployment tree
3. Syncing the sqoop + python management scripts to the inf environment
4. Setting up the crontab schedule
5. Testing
1. Current business data collection details
- The current business data collection is implemented with sqoop plus a set of python management scripts
- Deployed at sqoop@yz-bi-web01.lehe.com:/home/sqoop/sqoop
- The scripts are launched periodically via crontab
- The core script /home/sqoop/sqoop/bin/sqoop_import.py splits tasks into day-level and hour-level; crontab periodically invokes /home/sqoop/sqoop/bin/sqoop_import_hour.sh and /home/sqoop/sqoop/bin/sqoop_import_day.sh to drive them
- /home/sqoop/sqoop/conf/confset lists the names of the per-database import configurations
- Files such as /home/sqoop/sqoop/conf/higo_account.conf hold the contents of those import configurations
- An import of any configured database can also be forced manually:
  - cd /home/sqoop/sqoop && python bin/sqoop_import.py -t ods_higo_pandora_order
  - this forces one import of t_pandora_order into ods_higo_pandora_order in hive
- The MySQL databases eligible for import into Hadoop are registered in the t_dolphin_stat_db_info table of the higo_bilog database on the yz-bi-store00 machine
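The forced-import flow above boils down to turning a per-table configuration into a `sqoop import` command line. A minimal sketch of that assembly step follows; the function name, config fields, and defaults are assumptions for illustration, not the actual internals of bin/sqoop_import.py:

```python
# Hypothetical sketch: assemble a `sqoop import` command line from a
# per-table config dict. Field names and defaults are assumptions, not
# the real internals of bin/sqoop_import.py.
def build_sqoop_cmd(cfg):
    jdbc = ("jdbc:mysql://{host}:{port}/{db}"
            "?zeroDateTimeBehavior=convertToNull").format(**cfg)
    parts = [
        "sqoop", "import",
        "--connect", '"%s"' % jdbc,
        "--username", cfg["user"],
        "--hive-import", "--hive-overwrite",
        "--num-mappers", str(cfg.get("mappers", 1)),
        "--table", cfg["mysql_table"],
        "--hive-table", "default.%s" % cfg["hive_table"],
    ]
    return " ".join(parts)

# Values taken from the successful test run in section 5.2:
cfg = {
    "host": "10.20.2.22", "port": 3411, "db": "metric_meta",
    "user": "mlsreader", "mysql_table": "t_visual_table_wf",
    "hive_table": "ods_data_visual_table_wf",
}
print(build_sqoop_cmd(cfg))
```

The real script additionally passes column mappings (`--map-column-java`, `--map-column-hive`), a warehouse dir, and the password, as the cmd line in the 5.2 log shows.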
2. Script deployment tree
.
├── bin
│ ├── find.sh
│ ├── logger.py
│ ├── mailer.py
│ ├── mailman.py
│ ├── mailssl2.py
│ ├── mailssl.py
│ ├── make_time_flag.py
│ ├── monitor.conf
│ ├── sendSMS.py
│ ├── sqoop_import_day.sh
│ ├── sqoop_import_hour.sh
│ ├── sqoop_import.py
│ ├── sqoop-monitor-day.py
│ ├── sqoop-monitor-day.sh
│ ├── statutil.py
│ └── table_info.py
├── conf
│ ├── brand_library.conf
│ ├── column.conf
│ ├── confset
│ ├── gen.py
│ ├── higo_account.conf
│ ├── higo_adx.conf
│ ├── higo_base.conf
│ ├── higo_comment.conf
│ ├── higo_content.conf
│ ├── higo_coupon.conf
│ ├── higo_crm.conf
│ ├── higo_data.conf
│ ├── higo_front.conf
│ ├── higo_goods.conf
│ ├── higo_order.conf
│ ├── higo_push.conf
│ ├── higo_risk.conf
│ ├── higo_trolley.conf
│ ├── hive_meta.conf
│ ├── im.conf
│ ├── kefu.conf
│ ├── metric_meta.conf
│ └── xz_store.conf
├── failed_log
├── log
│ ├── sqoop_import_day_2020-05-20.log
│ ├── sqoop_import_hour_2020-05-20_10.log
│ └── sqoop_info.2020-05-20_10:05:01
├── script
│ ├── acaleph_sign.py
│ ├── binlog_sqoop
│ │ ├── gen_cmd.py
│ │ ├── map.conf
│ │ ├── run_hql.sh
│ │ ├── run.py
│ │ └── transform_binlog.py
│ ├── compare_all_task.py
│ ├── sqoop_import.test.py
│ ├── sqoop_mysql.txt
│ └── trans_ini2info.py
└── test
├── conf
│ ├── bat.conf
│ ├── cheetah.conf
│ ├── confset
│ ├── db_processesnum.conf
│ └── mlspay_account.conf
├── table_cfg_maker.py
└── test_connect.py
3. Syncing the sqoop + python management scripts to the inf environment
# cd /home/sqoop
# rsync -avprP -e ssh 10.20.2.24:/home/sqoop/galaxy ./
# create the symlink
# ln -s galaxy/trunk/libra/allstars/sqoop sqoop
# chown -R sqoop.sqoop /home/sqoop/
4. Setting up the crontab schedule
$ crontab -l
5 * * * * sh /home/sqoop/sqoop/bin/sqoop_import_hour.sh > /tmp/sqoop_import_hour.log 2>&1 &
20 0 * * * sh /home/sqoop/sqoop/bin/sqoop_import_day.sh > /tmp/sqoop_import_day.log 2>&1 &
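After a table is imported, the jobs stamp a data-ready tag in HDFS (the `touchz .../ods_data_visual_table_wf_2019-06-25_day` line in section 5.2 shows the naming). Below is a hedged sketch of how such day/hour flag names can be generated, in the spirit of bin/make_time_flag.py; the real script's interface is unknown, so the function name and the hour-level format are assumptions (the day-level format matches the touchz line):

```python
from datetime import datetime

# Hypothetical helper in the spirit of bin/make_time_flag.py. The day tag
# format matches the touchz line observed in section 5.2; the hour tag
# format is an assumption.
def make_time_flag(table, when, granularity="day"):
    if granularity == "day":
        return "%s_%s_day" % (table, when.strftime("%Y-%m-%d"))
    if granularity == "hour":
        return "%s_%s_%02d_hour" % (table, when.strftime("%Y-%m-%d"), when.hour)
    raise ValueError("unknown granularity: %r" % granularity)

print(make_time_flag("ods_data_visual_table_wf", datetime(2019, 6, 25)))
# -> ods_data_visual_table_wf_2019-06-25_day
```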
5. Testing
5.1 Failed run
# Problem 01 (raised during sqoop data import): ERROR tool.ImportTool: Import failed: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1256)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:200)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:173)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:270)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
# Problem 02 (raised during sqoop data import): Unsupported major.minor version 52.0
# Root cause analysis: the scripts explicitly run sqoop under JDK 1.8, while the CDH cluster runs JDK 1.7, so executing sqoop fails with "Unsupported major.minor version 52.0". The underlying mismatch is between the JDK used by Cloudera Manager and the system JDK: CM runs on the lower version (1.7 here), while `java -version` on the OS reports the higher one (1.8 here). Sqoop auto-generates code, compiles it into a job jar, and submits it to map-reduce, so a jar built from the shell is compiled with JDK 1.8, but this CM 5.3.0 installation was set up with JDK 1.7; bytecode produced by a newer JDK cannot run on an older JVM, which is exactly this error.
# Fix: switch back to JDK 1.7
# ll
drwxr-xr-x 8 root root 4096 Dec 19 2014 jdk1.7.0_76
drwxr-xr-x 8 root root 4096 Jul 22 2017 jdk1.8.0_144
# vim /etc/profile.d/java.sh
#export JAVA_HOME=/data/jdk1.8.0_144
#export PATH=$JAVA_HOME/bin:$PATH
#export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME=/data/jdk1.7.0_76
export PATH=$JAVA_HOME/bin:$PATH
# source /etc/profile
# Update the JAVA_HOME variable in hadoop-env.sh under "$HADOOP_HOME/etc/hadoop"
# echo $HADOOP_HOME
# vim /hadoop/hadoop-2.7.1/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/data/jdk1.7.0_76
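The "major.minor version 52.0" in problem 02 comes straight from the class-file header: bytes 6-7 of a `.class` file hold the major version, and 52 maps to Java 8 while 51 maps to Java 7. A small standalone diagnostic (not part of the deployed scripts) that reports which JDK a compiled class targets:

```python
import struct

# Class-file major version -> JDK release (52 = Java 8, 51 = Java 7, ...).
MAJOR_TO_JDK = {49: "5", 50: "6", 51: "7", 52: "8"}

def class_target_jdk(data):
    """Return the JDK release a compiled .class file targets.

    The header layout is magic(4) minor(2) major(2), big-endian;
    pass the raw bytes of the file, e.g. open(path, "rb").read(8).
    """
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return MAJOR_TO_JDK.get(major, "unknown (major=%d)" % major)

# Header of a class compiled by JDK 1.8 (major version 52):
print(class_target_jdk(struct.pack(">IHH", 0xCAFEBABE, 0, 52)))  # -> 8
```

Running this over the jar sqoop writes to /tmp/sqoop-sqoop/compile/ would confirm whether the job classes were built with the wrong JDK.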
5.2 Successful run
$ cd /home/sqoop/sqoop && python bin/sqoop_import.py -t ods_data_visual_table_wf
2019-06-25 11:55:33,447-bin/sqoop_import.py-clean_env-1272-INFO-source /etc/bashrc; hadoop fs -rmr /tmp/sqoop_tmp/20190622_* 2>/dev/null
[INFO][get_config_db]db:higo_hive, master=0
[INFO][init_db_info]self.db: higo_hive, self.master: 0
[INFO]Connect: {'db': u'higo_hive', 'host': u'10.20.2.22', 'master': 0, 'user': u'mlsreader', 'port': 3412}
[INFO][get_config_db]db:higo_hive, master=0
[INFO][init_db_info]self.db: higo_hive, self.master: 0
[INFO]Connect: {'db': u'higo_hive', 'host': u'10.20.2.22', 'master': 0, 'user': u'mlsreader', 'port': 3412}
[INFO][get_config_db]db:higo_bilog, master=0
[INFO][init_db_info]self.db: higo_bilog, self.master: 0
[INFO]Connect: {'db': u'higo_bilog', 'host': u'10.20.2.22', 'master': 0, 'user': u'mlsreader', 'port': 3411}
.....
2019-06-25 11:55:33,690-bin/sqoop_import.py-worker-873-INFO-cmd:/hadoop/sqoop/bin/sqoop import --connect "jdbc:mysql://10.20.2.22:3411/metric_meta?zeroDateTimeBehavior=convertToNull&socketTimeout=0" --username mlsreader --password 'RMlSxs&^c6OpIAQ1' --fields-terminated-by '\001' --warehouse-dir /tmp/sqoop_tmp/20190625_86116640 --outdir /tmp/sqoop/20190625_86116640 --hive-import --hive-overwrite --hive-delims-replacement=' ' --num-mappers 1 --table t_visual_table_wf --columns id,cn_name,project,creater,flag,type,modify_user,chinese_name,create_date --map-column-java id=Long,flag=Integer,type=Integer,create_date=String --map-column-hive id=BIGINT,flag=TINYINT,type=TINYINT,create_date=TIMESTAMP --hive-table default.ods_data_visual_table_wf
2019-06-25 11:55:33,690-bin/sqoop_import.py-worker-874-INFO-overwrite_cmd:None
19/06/25 11:55:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
19/06/25 11:55:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/06/25 11:55:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/06/25 11:55:34 INFO tool.CodeGenTool: Beginning code generation
19/06/25 11:55:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_visual_table_wf` AS t LIMIT 1
19/06/25 11:55:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_visual_table_wf` AS t LIMIT 1
19/06/25 11:55:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop/hadoop/share/hadoop/mapreduce
Note: /tmp/sqoop-sqoop/compile/cda9524fa9816307b96174b1316369d5/t_visual_table_wf.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/06/25 11:55:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-sqoop/compile/cda9524fa9816307b96174b1316369d5/t_visual_table_wf.jar
19/06/25 11:55:35 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/06/25 11:55:35 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/06/25 11:55:35 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/06/25 11:55:35 INFO mapreduce.ImportJobBase: Beginning import of t_visual_table_wf
19/06/25 11:55:36 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/06/25 11:55:36 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/06/25 11:55:37 INFO impl.TimelineClientImpl: Timeline service address: http://yz-higo-jt.lehe.com:8188/ws/v1/timeline/
19/06/25 11:55:37 INFO client.RMProxy: Connecting to ResourceManager at yz-higo-jt.lehe.com/10.20.3.47:8032
19/06/25 11:55:38 INFO db.DBInputFormat: Using read commited transaction isolation
19/06/25 11:55:38 INFO mapreduce.JobSubmitter: number of splits:1
19/06/25 11:55:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1554441177879_240989
19/06/25 11:55:39 INFO impl.YarnClientImpl: Submitted application application_1554441177879_240989
19/06/25 11:55:39 INFO mapreduce.Job: The url to track the job: http://yz-higo-jt.lehe.com:8088/proxy/application_1554441177879_240989/
19/06/25 11:55:39 INFO mapreduce.Job: Running job: job_1554441177879_240989
19/06/25 11:55:45 INFO mapreduce.Job: Job job_1554441177879_240989 running in uber mode : false
19/06/25 11:55:45 INFO mapreduce.Job: map 0% reduce 0%
19/06/25 11:55:51 INFO mapreduce.Job: map 100% reduce 0%
19/06/25 11:55:51 INFO mapreduce.Job: Job job_1554441177879_240989 completed successfully
19/06/25 11:55:51 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=138498
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=17660
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=6170400
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3428
Total vcore-seconds taken by all map tasks=3428
Total megabyte-seconds taken by all map tasks=6170400
Map-Reduce Framework
Map input records=189
Map output records=189
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=23
CPU time spent (ms)=1080
Physical memory (bytes) snapshot=377196544
Virtual memory (bytes) snapshot=2446786560
Total committed heap usage (bytes)=1019609088
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=17660
19/06/25 11:55:51 INFO mapreduce.ImportJobBase: Transferred 17.2461 KB in 14.9276 seconds (1.1553 KB/sec)
19/06/25 11:55:51 INFO mapreduce.ImportJobBase: Retrieved 189 records.
19/06/25 11:55:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_visual_table_wf` AS t LIMIT 1
19/06/25 11:55:51 WARN hive.TableDefWriter: Column create_date had to be cast to a less precise type in Hive
19/06/25 11:55:51 INFO hive.HiveImport: Loading uploaded data into Hive
19/06/25 11:55:55 INFO hive.HiveImport:
19/06/25 11:55:55 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/hadoop/hive-2.1.0/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
19/06/25 11:55:58 INFO hive.HiveImport: OK
19/06/25 11:55:58 INFO hive.HiveImport: Time taken: 1.751 seconds
19/06/25 11:55:58 INFO hive.HiveImport: Loading data to table default.ods_data_visual_table_wf
19/06/25 11:55:58 INFO hive.HiveImport: Moved: 'hdfs://yz-higo-nn1:9000/user/hive/warehouse/ods_data_visual_table_wf/part-m-00000' to trash at: hdfs://yz-higo-nn1:9000/user/sqoop/.Trash/Current
19/06/25 11:55:59 INFO hive.HiveImport: OK
19/06/25 11:55:59 INFO hive.HiveImport: Time taken: 0.804 seconds
19/06/25 11:55:59 INFO hive.HiveImport: Hive import complete.
[INFO][get_config_db]db:higo_bilog, master=0
2019-06-25 11:56:02,039-bin/sqoop_import.py-exec_stat_sql-167-INFO-replace_sql:replace into t_dolphin_sqoop_table_status(`date`, `tablename`, `hpath`, `hsize`, `mysqltbl`, `host`, `port`, `user`, `passwd`, `mysqldb`, `begintime`, `costtime`) value('2019-06-25', 'ods_data_visual_table_wf', '/user/hive/warehouse/ods_data_visual_table_wf', 17660, 't_visual_table_wf', '10.20.2.22', 3411, 'mlsreader', 'RMlSxs&^c6OpIAQ1', 'metric_meta', '01:10', 26)
[INFO][init_db_info]self.db: higo_bilog, self.master: 1
[INFO]Connect: {'db': u'higo_bilog', 'host': u'10.20.2.22', 'master': 1, 'user': u'mlswriter', 'port': 3411}
/hadoop/hadoop/bin/hadoop fs -touchz /user/data_ready_tag/default_ods_data_visual_table_wf/ods_data_visual_table_wf_2019-06-25_day
touchz: `/user/data_ready_tag/default_ods_data_visual_table_wf/ods_data_visual_table_wf_2019-06-25_day': No such file or directory
2019-06-25 11:56:04,139-bin/sqoop_import.py-manager-1046-INFO-cmd:source /etc/bashrc; hadoop fs -dus /user/hive/warehouse/ods_data_visual_table_wf 2>/dev/null | awk '{print $1}'
2019-06-25 11:56:08,080-bin/sqoop_import.py-manager-1138-INFO-37920 40401 tasks executed: 1, elapsed: 34 seconds
2019-06-25 11:56:08,146-bin/sqoop_import.py-report_mail-1209-INFO-ts:ods_data_visual_table_wf,mts:t_visual_table_wf,hsize:17660
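The run ends by recording import status into t_dolphin_sqoop_table_status via the replace_sql line above; note that the password is interpolated into the SQL string in plain text. A hedged sketch of building the same statement with bound parameters instead (the column list is taken from that log line; the helper itself is hypothetical and deliberately omits the `passwd` column):

```python
# Hypothetical sketch: build a parameterized REPLACE INTO for
# t_dolphin_sqoop_table_status. Column list mirrors the replace_sql in the
# log above, minus `passwd`, which should not be stored in plain text.
def build_status_sql(row):
    cols = ["date", "tablename", "hpath", "hsize", "mysqltbl", "host",
            "port", "user", "mysqldb", "begintime", "costtime"]
    col_list = ", ".join("`%s`" % c for c in cols)
    placeholders = ", ".join(["%s"] * len(cols))  # MySQLdb/pymysql style
    sql = "replace into t_dolphin_sqoop_table_status(%s) value(%s)" % (
        col_list, placeholders)
    return sql, tuple(row[c] for c in cols)

# Values taken from the replace_sql line in the log:
row = {
    "date": "2019-06-25", "tablename": "ods_data_visual_table_wf",
    "hpath": "/user/hive/warehouse/ods_data_visual_table_wf", "hsize": 17660,
    "mysqltbl": "t_visual_table_wf", "host": "10.20.2.22", "port": 3411,
    "user": "mlsreader", "mysqldb": "metric_meta",
    "begintime": "01:10", "costtime": 26,
}
sql, args = build_status_sql(row)
print(sql)
```

The sql/args pair would be passed to a DB-API `cursor.execute(sql, args)` so the driver handles escaping.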