Linux Script Development
Part 1: Cleaning entity-class data with scripts (awk, grep) and loading it into a database
👊1. Introduction
Some databases, unlike MySQL, do not let you attach a comment to a field directly in the table definition. Adding comments one at a time with `comment on column dbname.tablename.fieldname is '';` is tedious, and the project code usually already documents every field, so the plan is to process the entity classes with grep and awk on Linux and store the results in a database table for easy lookup.
Of the three classic Linux text tools, grep excels at searching, sed at selecting lines and substituting, and awk at extracting columns.
Below, a few simple entity-class files are used as a demonstration; the steps are identical in real scenarios.
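As a quick illustration of that division of labor, here is a minimal sketch using a throwaway sample file (the file name and contents are made up for demonstration):

```shell
# build a small sample file
printf 'private Integer id;//ID\npublic class User {\nprivate String email;//mail\n' > demo.txt

grep 'private' demo.txt       # grep: print only the matching lines
sed -n '2p' demo.txt          # sed: select line 2
awk '{print $1}' demo.txt     # awk: print the first column of every line

rm -f demo.txt                # clean up
```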
👊2. Step-by-step processing
🙇♀2.1 cat the entity class
The `cat` command prints a file's contents straight to the terminal.
[root@zxy_master dao]# cat ./User.java
package com.zxy.pojo;
import java.util.Date;
public class User {
private Integer id;//ID
private String email;//邮箱
private String phonenum;//电话
private String password;//密码
private String code;//编码
private String nickname;//昵称
...................
🙇♀2.2 grep: keep only the field lines
Pipe the result of `cat` to `grep`, keeping only the lines that contain `private`.
[root@zxy_master dao]# cat ./User.java | grep private
private Integer id;//ID
private String email;//邮箱
private String phonenum;//电话
private String password;//密码
private String code;//编码
private String nickname;//昵称
private String sex;//性别
private String birthday;//出生日期
private String address;//地址
private String imgurl;//头像URL
private Date createtime;//创建时间
🙇♀2.3 awk: split on spaces and take the third column
Pipe the grep results to `awk`. The `-F` option tells awk what to split the data on; `print` writes the output, and `$3` refers to the third field after splitting.
[root@zxy_master dao]# cat ./User.java | grep private | awk -F ' ' '{print $3}'
id;//ID
email;//邮箱
phonenum;//电话
password;//密码
code;//编码
nickname;//昵称
sex;//性别
birthday;//出生日期
address;//地址
imgurl;//头像URL
createtime;//创建时间
🙇♀2.4 awk: split on the delimiter and print field and remark
After the previous step, the field and its remark are separated only by the characters `;//`, so run awk once more with that delimiter and print the first and second fields. To simplify the later database import, the field and remark are separated by a tab.
[root@zxy_master dao]# cat ./User.java | grep private | awk -F ' ' '{print $3}' | awk -F ';//' '{print $1"\t"$2}'
id ID
email 邮箱
phonenum 电话
password 密码
code 编码
nickname 昵称
sex 性别
birthday 出生日期
address 地址
imgurl 头像URL
createtime 创建时间
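As an aside, the grep filter and both awk passes can be collapsed into a single awk program using its `/pattern/` matching and `split()` function; this is only an alternative sketch, not the command used in the rest of the article (`Demo.java` here is a made-up sample file):

```shell
# build a tiny sample entity class
printf 'public class Demo {\nprivate Integer id;//ID\nprivate String email;//mail\n}\n' > Demo.java

# /private/ filters lines like grep; split() cuts the third
# space-separated field at ";//" into field and remark
awk '/private/ {split($3, a, ";//"); print a[1]"\t"a[2]}' Demo.java

rm -f Demo.java
```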
🙇♀2.5.1 Writing the result to a file
Write the result of the previous step to a file, here using `>>` to append.
[root@zxy_master dao]# cat ./User.java | grep private | awk -F ' ' '{print $3}' | awk -F ';//' '{print $1"\t"$2}' >> JavaETLFile.txt
[root@zxy_master dao]# cat JavaETLFile.txt
id ID
email 邮箱
phonenum 电话
password 密码
code 编码
nickname 昵称
sex 性别
birthday 出生日期
address 地址
imgurl 头像URL
createtime 创建时间
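A note on the two redirection operators, since the choice matters once the command is run repeatedly: `>` truncates the target file on every run, while `>>` appends to it. A small sketch with a throwaway file name:

```shell
echo one >  redir_demo.txt    # > overwrites: the file holds one line
echo two >> redir_demo.txt    # >> appends: the file now holds two lines
wc -l < redir_demo.txt
echo three > redir_demo.txt   # > again: back to a single line
wc -l < redir_demo.txt
rm -f redir_demo.txt
```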
🙇♀2.5.2 Adding a header row
As the previous step shows, the output contains only data; if it is to be imported into a table later, the columns need labels.
To make the effect clear, delete the file generated in the previous step first, then add `BEGIN {print "field""\t""remark"}` to the awk program, which prints the given text as the first line of output.
[root@zxy_master dao]# rm -r JavaETLFile.txt
rm: remove regular file ‘JavaETLFile.txt’? y
[root@zxy_master dao]# cat ./User.java | grep private | awk -F ' ' '{print $3}' | awk -F ';//' 'BEGIN {print "field""\t""remark"} {print $1"\t"$2}' >> JavaETLFile.txt
[root@zxy_master dao]# cat JavaETLFile.txt
field remark
id ID
email 邮箱
phonenum 电话
password 密码
code 编码
nickname 昵称
sex 性别
birthday 出生日期
address 地址
imgurl 头像URL
createtime 创建时间
👊3. Refining the process, part one
🙇♀3.1 The script
The steps above cover the whole flow, but typing the commands for each entity class by hand is tedious, so this step moves them into a script that takes the entity-class name as an argument.
`$1` receives the script's first argument and is assigned to `FileName`, which is then used in two places. The first is `./${FileName}.java`, to read the source file. The second is `-v filename="${FileName}"`: awk's `-v` option passes the shell variable into the awk program, so that `{print filename"\t"$1"\t"$2}` can prepend the class name to every output row.
[root@zxy_master dao]# vim JavaToTxt.sh
#!/bin/bash
# filename: JavaToTxt.sh
# author: zxy
# date: 2022-09-04
# receive the first argument
FileName=$1
cat ./${FileName}.java | grep private | awk -F ' ' '{print $3}' | awk -F ';//' -v filename="${FileName}" 'BEGIN {print "tablename""\t""field""\t""remark"} {print filename"\t"$1"\t"$2}' >> JavaETLFile.txt
🙇♀3.2 Testing the script
[root@zxy_master dao]# sh JavaToTxt.sh User
[root@zxy_master dao]# cat JavaETLFile.txt
tablename field remark
User id ID
User email 邮箱
User phonenum 电话
User password 密码
User code 编码
User nickname 昵称
User sex 性别
User birthday 出生日期
User address 地址
User imgurl 头像URL
User createtime 创建时间
👊4. Refining the process, part two
With the script above, an entity-class name can be passed in on each run; the next step generates the commands for all entity classes in one go. The command itself is not explained in detail here; refer back to section 2 if anything is unclear. Note that each run appends its own header row; that is dealt with in section 7.
[root@zxy_master dao]# ls
Admin.java Course.java JavaToTxt.sh QueryVo.java Speaker.java User.java
[root@zxy_master dao]# ll | grep '.java' | awk -F ' ' '{print $9}' | awk -F '.' '{print "sh JavaToTxt.sh "$1}'
sh JavaToTxt.sh Admin
sh JavaToTxt.sh Course
sh JavaToTxt.sh QueryVo
sh JavaToTxt.sh Speaker
sh JavaToTxt.sh User
[root@zxy_master dao]# sh JavaToTxt.sh Admin
[root@zxy_master dao]# sh JavaToTxt.sh Course
[root@zxy_master dao]# sh JavaToTxt.sh QueryVo
[root@zxy_master dao]# sh JavaToTxt.sh Speaker
[root@zxy_master dao]# sh JavaToTxt.sh User
[root@zxy_master dao]#
[root@zxy_master dao]# cat JavaETLFile.txt
tablename field remark
Admin id ID
Admin username 用户名
Admin password 密码
Admin roles 角色
tablename field remark
Course id ID
Course courseTitle 课程标题
Course subjectId 主题ID
Course courseDesc 课程详情
Course videoList;
tablename field remark
QueryVo title 标题
QueryVo courseId 课程ID
QueryVo speakerId 讲师ID
tablename field remark
Speaker id ID
Speaker speakerName 讲师名
Speaker speakerJob 讲师工作
Speaker headImgUrl 头像URL
Speaker speakerDesc 讲师介绍
tablename field remark
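Rather than generating and pasting one command per file, the batch run can also be written directly as a shell loop over the `.java` files; a sketch, assuming JavaToTxt.sh and the entity classes sit in the current directory:

```shell
# for each entity class, strip the .java suffix and hand the name to the script
for f in *.java; do
    [ -e "$f" ] || continue           # skip when no .java files match
    sh JavaToTxt.sh "${f%.java}"      # ${f%.java} removes the suffix
done
```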
👊5. Complete processing script
#!/bin/bash
# filename: JavaToTxt.sh
# author: zxy
# date: 2022-09-04
# receive the first argument
FileName=$1
# help function
usage() {
    echo "usage:"
    echo "sh JavaToTxt.sh filename"
    echo "to generate commands for batch processing, run:"
    echo "ll | grep '.java' | awk -F ' ' '{print \$9}' | awk -F '.' '{print \"sh JavaToTxt.sh \"\$1}'"
    exit 0
}
# print the usage when no argument is given
[ -z "${FileName}" ] && usage
cat ./${FileName}.java | grep private | awk -F ' ' '{print $3}' | awk -F ';//' -v filename="${FileName}" 'BEGIN {print "tablename""\t""field""\t""remark"} {print filename"\t"$1"\t"$2}' >> JavaETLFile.txt
👊6. Importing the data
Download the generated txt file and import it into the database. There are many ways to do this; the above is offered for reference only.
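For example, with a MySQL target the tab-separated file can be loaded in one statement (the table name `java_field_remark` and its column list are made up for illustration); the snippet below only prints the statement so it can be reviewed before piping it into the `mysql` client:

```shell
# print a LOAD DATA statement for the generated TSV;
# pipe the output into `mysql -u user -p dbname` to execute it
cat <<'SQL'
LOAD DATA LOCAL INFILE 'JavaETLFile.txt'
INTO TABLE java_field_remark
FIELDS TERMINATED BY '\t'
IGNORE 1 LINES
(tablename, field, remark);
SQL
```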
👊7. Simple data cleanup
Every execution of the script prints the awk `BEGIN` header again, i.e. the `tablename`, `field`, `remark` row, so remove the duplicates after importing. Separately, when a field has no comment (such as `videoList` above), the split yields a row with an empty remark; handle those as well.
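Both cleanups can also be done with one more awk pass before the import; a sketch over made-up sample rows (keep the first header, drop repeated headers, drop rows whose remark column is empty):

```shell
# sample output from several script runs: a repeated header and an uncommented field
printf 'tablename\tfield\tremark\nAdmin\tid\tID\ntablename\tfield\tremark\nCourse\tvideoList;\t\n' > sample.txt

# NR==1 keeps the first header; later "tablename" rows are repeated headers;
# $3!="" drops rows with an empty remark
awk -F '\t' 'NR==1 {print; next} $1=="tablename" {next} $3!="" {print}' sample.txt

rm -f sample.txt
```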
Part 2: Managing the log-collection pipeline with a script
The goal is to load the data into HDFS via Hudi and create Hive tables to store it.
1. Process overview:
First, the data is tunneled through frp to OpenResty.
From OpenResty, a Lua script produces the data into Kafka.
From Kafka, application code persists the data to HDFS via Hudi and Structured Streaming.
Finally, tables are created in Hive for later analysis.
2.Real_time_data_warehouse.sh
#!/bin/bash
# filename: Real_time_data_warehouse.sh
# author: zxy
# date: 2021-07-19
# This script starts and stops the hdfs and yarn daemons, the hive service, the frp tunnel,
# the openresty collector, the Kafka services, and the log2hudi job
# 1. start-dfs.sh
#    start-yarn.sh
#    stop-yarn.sh
#    stop-dfs.sh
# 2. sh /opt/apps/scripts/hiveScripts/start-supervisord-hive.sh start
#    sh /opt/apps/scripts/hiveScripts/start-supervisord-hive.sh stop
# 3. sh /opt/apps/scripts/frpScripts/frp-agent.sh start
#    sh /opt/apps/scripts/frpScripts/frp-agent.sh stop
# 4. openresty -p /opt/apps/collect-app/
#    openresty -p /opt/apps/collect-app/ -s stop
# 5. sh /opt/apps/scripts/kafkaScripts/start-kafka.sh zookeeper
#    sh /opt/apps/scripts/kafkaScripts/start-kafka.sh kafka
#    sh /opt/apps/scripts/kafkaScripts/start-kafka.sh stopz / stopk
# 6. sh /opt/apps/scripts/log2hudi.sh start
#    sh /opt/apps/scripts/log2hudi.sh stop
# JDK home
JAVA_HOME=/data/apps/jdk1.8.0_261
# Hadoop home
HADOOP_HOME=/data/apps/hadoop-2.8.1
# Hive home
HIVE_HOME=/data/apps/hive-1.2.1
# Kafka home
KAFKA_HOME=/data/apps/kafka_2.11-2.4.1
# frp tunnel home
FRP_HOME=/data/apps/frp
# OpenResty collector home
COLLECT_HOME=/data/apps/collect-app
# script directory
SCRIPTS=/opt/apps/scripts
# receive the arguments
CMD=$1
DONE=$2
## help function
usage() {
echo "usage:"
echo "Real_time_data_warehouse.sh hdfs start/stop"
echo "Real_time_data_warehouse.sh yarn start/stop"
echo "Real_time_data_warehouse.sh hive start/stop"
echo "Real_time_data_warehouse.sh frp start/stop"
echo "Real_time_data_warehouse.sh openresty start/stop"
echo "Real_time_data_warehouse.sh zookeeper start/stop"
echo "Real_time_data_warehouse.sh kafka start/stop"
echo "Real_time_data_warehouse.sh log2hudi start/stop"
echo "description:"
echo " hdfs:start/stop hdfsService"
echo " yarn:start/stop yarnService"
echo " hive:start/stop hiveService"
echo " frp:start/stop frpService"
echo " openresty:start/stop openrestyService"
echo " zookeeper:start/stop zookeeperService"
echo " kafka:start/stop kafkaService"
echo " log2hudi:start/stop log2hudiService"
exit 0
}
# 1. manage the hdfs service
if [ "${CMD}" == "hdfs" ] && [ "${DONE}" == "start" ];then
# start hdfs
start-dfs.sh
echo " hdfs started "
elif [ "${CMD}" == "hdfs" ] && [ "${DONE}" == "stop" ];then
# stop hdfs
stop-dfs.sh
echo " hdfs stopped "
# 2. manage the yarn service
elif [ "${CMD}" == "yarn" ] && [ "${DONE}" == "start" ];then
# start yarn
start-yarn.sh
echo " yarn started "
elif [ "${CMD}" == "yarn" ] && [ "${DONE}" == "stop" ];then
# stop yarn
stop-yarn.sh
echo " yarn stopped "
# 3. manage the hive service
elif [ "${CMD}" == "hive" ] && [ "${DONE}" == "start" ];then
# start hive
sh ${SCRIPTS}/hiveScripts/start-supervisord-hive.sh start
echo " hive started "
elif [ "${CMD}" == "hive" ] && [ "${DONE}" == "stop" ];then
# stop hive
sh ${SCRIPTS}/hiveScripts/start-supervisord-hive.sh stop
echo " hive stopped "
# 4. manage the frp service
elif [ "${CMD}" == "frp" ] && [ "${DONE}" == "start" ];then
# start frp
sh ${SCRIPTS}/frpScripts/frp-agent.sh start
echo " frp started "
elif [ "${CMD}" == "frp" ] && [ "${DONE}" == "stop" ];then
# stop frp
sh ${SCRIPTS}/frpScripts/frp-agent.sh stop
echo " frp stopped "
# 5. manage the openresty service
elif [ "${CMD}" == "openresty" ] && [ "${DONE}" == "start" ];then
# start openresty
openresty -p ${COLLECT_HOME}
echo " openresty started "
elif [ "${CMD}" == "openresty" ] && [ "${DONE}" == "stop" ];then
# stop openresty
openresty -p ${COLLECT_HOME} -s stop
echo " openresty stopped "
# 6. manage the zookeeper and kafka services
elif [ "${CMD}" == "zookeeper" ] && [ "${DONE}" == "start" ];then
# start zookeeper
sh ${SCRIPTS}/kafkaScripts/start-kafka.sh zookeeper
echo " zookeeper started "
elif [ "${CMD}" == "zookeeper" ] && [ "${DONE}" == "stop" ];then
# stop zookeeper
sh ${SCRIPTS}/kafkaScripts/start-kafka.sh stopz
echo " zookeeper stopped "
elif [ "${CMD}" == "kafka" ] && [ "${DONE}" == "start" ];then
# start kafka
sh ${SCRIPTS}/kafkaScripts/start-kafka.sh kafka
echo " Kafka started "
elif [ "${CMD}" == "kafka" ] && [ "${DONE}" == "stop" ];then
# stop kafka
sh ${SCRIPTS}/kafkaScripts/start-kafka.sh stopk
echo " Kafka stopped "
# 7. manage the log2hudi service
elif [ "${CMD}" == "log2hudi" ] && [ "${DONE}" == "start" ];then
# start log2hudi
sh ${SCRIPTS}/log2hudi.sh start
echo " log2hudi started "
elif [ "${CMD}" == "log2hudi" ] && [ "${DONE}" == "stop" ];then
# stop log2hudi
sh ${SCRIPTS}/log2hudi.sh stop
echo " log2hudi stopped "
else
usage
fi
3. Start-up test run
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh hdfs start
Starting namenodes on [hadoop]
hadoop: starting namenode, logging to /data/apps/hadoop-2.8.1/logs/hadoop-root-namenode-hadoop.out
localhost: starting datanode, logging to /data/apps/hadoop-2.8.1/logs/hadoop-root-datanode-hadoop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /data/apps/hadoop-2.8.1/logs/hadoop-root-secondarynamenode-hadoop.out
hdfs started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh yarn start
starting yarn daemons
starting resourcemanager, logging to /data/apps/hadoop-2.8.1/logs/yarn-root-resourcemanager-hadoop.out
localhost: starting nodemanager, logging to /data/apps/hadoop-2.8.1/logs/yarn-root-nodemanager-hadoop.out
yarn started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh hive start
● supervisord.service - Process Monitoring and Control Daemon
Loaded: loaded (/usr/lib/systemd/system/supervisord.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2021-07-19 20:24:02 CST; 9ms ago
Process: 88087 ExecStart=/usr/bin/supervisord -c /etc/supervisord.conf (code=exited, status=0/SUCCESS)
Main PID: 88090 (supervisord)
CGroup: /system.slice/supervisord.service
└─88090 /usr/bin/python /usr/bin/supervisord -c /etc/supervisord.conf
Jul 19 20:24:02 hadoop systemd[1]: Starting Process Monitoring and Control Daemon...
Jul 19 20:24:02 hadoop systemd[1]: Started Process Monitoring and Control Daemon.
hive started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh openresty start
openresty started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh frp start
root 80039 0.0 0.0 113184 1460 pts/0 S+ 20:11 0:00 sh Real_time_data_warehouse.sh frp start
root 80040 0.0 0.0 113184 1432 pts/0 S+ 20:11 0:00 sh /opt/apps/scripts/frpScripts/frp-agent.sh start
root 80041 0.0 0.0 11644 4 pts/0 D+ 20:11 0:00 [frpc]
root 80043 0.0 0.0 112728 952 pts/0 S+ 20:11 0:00 grep frp
frp started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh zookeeper start
zookeeper started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh kafka start
Kafka started
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh log2hudi start
starting -------------->
log2hudi started
[root@hadoop MyScripts]# yarn application --list
21/07/19 20:13:57 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.130.111:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1626695833179_0001 log2hudi SPARK root root.root RUNNING UNDEFINED 10% http://hadoop:32795
- the yarn process is running
- logs are successfully saved to hdfs files
4. Shutdown test run
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh log2hudi stop
21/07/19 20:17:50 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.130.111:8032
21/07/19 20:17:52 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.130.111:8032
Killing application application_1626695833179_0001
21/07/19 20:17:53 INFO impl.YarnClientImpl: Killed application application_1626695833179_0001
stop ok
log2hudi stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh kafka stop
Kafka stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh zookeeper stop
No zookeeper server to stop
zookeeper stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh openresty stop
openresty stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh frp stop
root 80041 0.3 0.1 713160 12480 pts/0 Sl 20:11 0:01 frpc http --sd zxy -l 8802 -s frp.qfbigdata.com:7001 -u zxy
root 84665 0.0 0.0 113184 1460 pts/0 S+ 20:18 0:00 sh Real_time_data_warehouse.sh frp stop
root 84666 0.0 0.0 113184 1432 pts/0 S+ 20:18 0:00 sh /opt/apps/scripts/frpScripts/frp-agent.sh start
root 84667 0.0 0.0 714312 8124 pts/0 Sl+ 20:18 0:00 frpc http --sd zxy -l 8802 -s frp.qfbigdata.com:7001 -u zxy
root 84669 0.0 0.0 112728 952 pts/0 S+ 20:18 0:00 grep frp
frp stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh hive stop
● supervisord.service - Process Monitoring and Control Daemon
Loaded: loaded (/usr/lib/systemd/system/supervisord.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Jul 19 17:33:23 hadoop systemd[1]: Stopping Process Monitoring and Control Daemon...
Jul 19 17:33:34 hadoop systemd[1]: Stopped Process Monitoring and Control Daemon.
Jul 19 17:33:43 hadoop systemd[1]: Starting Process Monitoring and Control Daemon...
Jul 19 17:33:43 hadoop systemd[1]: Started Process Monitoring and Control Daemon.
Jul 19 19:46:17 hadoop systemd[1]: Stopping Process Monitoring and Control Daemon...
Jul 19 19:46:27 hadoop systemd[1]: Stopped Process Monitoring and Control Daemon.
Jul 19 19:57:25 hadoop systemd[1]: Starting Process Monitoring and Control Daemon...
Jul 19 19:57:25 hadoop systemd[1]: Started Process Monitoring and Control Daemon.
Jul 19 20:19:03 hadoop systemd[1]: Stopping Process Monitoring and Control Daemon...
Jul 19 20:19:14 hadoop systemd[1]: Stopped Process Monitoring and Control Daemon.
hive stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh yarn stop
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop
yarn stopped
[root@hadoop MyScripts]# sh Real_time_data_warehouse.sh hdfs stop
Stopping namenodes on [hadoop]
hadoop: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
hdfs stopped
[root@hadoop MyScripts]# jps
85651 Jps
5. Quick reference
# start the project
sh Real_time_data_warehouse.sh hdfs start
sh Real_time_data_warehouse.sh yarn start
sh Real_time_data_warehouse.sh hive start
sh Real_time_data_warehouse.sh openresty start
sh Real_time_data_warehouse.sh frp start
sh Real_time_data_warehouse.sh zookeeper start
sh Real_time_data_warehouse.sh kafka start
sh Real_time_data_warehouse.sh log2hudi start
# stop the project
sh Real_time_data_warehouse.sh log2hudi stop
sh Real_time_data_warehouse.sh kafka stop
sh Real_time_data_warehouse.sh zookeeper stop
sh Real_time_data_warehouse.sh openresty stop
sh Real_time_data_warehouse.sh frp stop
sh Real_time_data_warehouse.sh hive stop
sh Real_time_data_warehouse.sh yarn stop
sh Real_time_data_warehouse.sh hdfs stop
6. Related scripts
- start-supervisord-hive.sh: how to use the hive script
- frp-agent.sh: adapt to your own setup
- start-kafka.sh: how to use the Kafka script
- log2hudi.sh: how to use the log2hudi script