Data Collection Project
Project overview: the collection project runs on five servers (HA).
Project plan:
1. Prepare the jump server
2. Remember to open port 3306 on the servers
Note: if it is not open, Navicat on your PC will not be able to connect.
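As a quick check (assuming nc is available on your PC), you can probe the port before trying Navicat:
nc -zv 39.98.58.145 3306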
3. Prepare the servers
1. Configure the local (PC) hosts file
vim /etc/hosts
39.98.58.145 hadoop100
39.98.58.198 hadoop101
39.98.60.160 hadoop102
47.92.108.72 hadoop103
39.98.54.179 hadoop104
Note: these are the public IPs.
2. Configure the servers' hosts file
172.19.21.168 hadoop101
172.19.21.169 hadoop102
172.28.19.250 hadoop103
172.19.21.167 hadoop104
172.28.19.251 hadoop100
Note: these are the private IPs.
3. Change each server's hostname
vim /etc/hostname
hadoop100
4. Install basic tools
yum install -y epel-release
yum install -y net-tools
yum install -y vim
yum install -y lrzsz
yum install -y psmisc nc rsync lrzsz ntp libzstd openssl-static tree iotop git
5. Disable the firewall
systemctl stop firewalld
systemctl disable firewalld.service
Note: in enterprise development, the firewall on individual servers is usually disabled; the company sets up a very secure firewall at the perimeter instead.
6. Add a user
1. Create the atguigu user
[root@hadoop100 ~]# useradd atguigu
[root@hadoop100 ~]# passwd atguigu
2. Give the atguigu user root privileges
[root@hadoop100 ~]# vim /etc/sudoers
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
## Allows people in group wheel to run all commands
%wheel ALL=(ALL) ALL
atguigu ALL=(ALL) NOPASSWD:ALL
Note: do not put the atguigu line directly below the root line. All users belong to the wheel group, so if you configure passwordless sudo for atguigu first, the %wheel line processed afterwards overrides it and a password is required again. The atguigu line must therefore go below the %wheel line.
3. Create the module and software directories under /opt
[root@hadoop100 ~]# mkdir /opt/module
[root@hadoop100 ~]# mkdir /opt/software
[root@hadoop100 ~]# chown atguigu:atguigu /opt/module
[root@hadoop100 ~]# chown atguigu:atguigu /opt/software
4. Configure passwordless SSH
[atguigu@hadoop102 .ssh]$ ssh-keygen -t rsa
Then press Enter three times; two files are generated: id_rsa (private key) and id_rsa.pub (public key).
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop100
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop101
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104
Note: configure passwordless login on whichever machine needs it; a quick check is shown below.
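To verify, run a remote command (hostname here is just an illustrative command) and confirm no password prompt appears:
[atguigu@hadoop102 .ssh]$ ssh hadoop103 hostname
hadoop103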
Cluster distribution script: xsync
1. Create a bin directory under the home directory /home/atguigu
[atguigu@hadoop102 ~]$ mkdir bin
2. Write the script
[atguigu@hadoop102 ~]$ cd /home/atguigu/bin
[atguigu@hadoop102 bin]$ vim xsync
#!/bin/bash
#1. Check the argument count
if [ $# -lt 1 ]
then
echo Not Enough Arguments!
exit;
fi
#2. Iterate over all machines in the cluster
for host in hadoop100 hadoop101 hadoop102 hadoop103 hadoop104
do
echo ==================== $host ====================
#3. Iterate over all files/directories and send them one by one
for file in "$@"
do
#4. Check that the file exists
if [ -e "$file" ]
then
#5. Get the parent directory
pdir=$(cd -P $(dirname "$file"); pwd)
#6. Get the file name
fname=$(basename "$file")
ssh $host "mkdir -p $pdir"
rsync -av $pdir/$fname $host:$pdir
else
echo $file does not exist!
fi
done
done
3. Add execute permission
[atguigu@hadoop102 bin]$ chmod 777 xsync
4. Test the script
[atguigu@hadoop102 bin]$ xsync xsync
4. Install the JDK
1. Remove any pre-installed JDK
Note: run this on all five machines.
[atguigu@hadoop102 opt]$ sudo rpm -qa | grep -i java | xargs -n1 sudo rpm -e --nodeps
2. Upload
[atguigu@hadoop102 software]$ ls /opt/software/
3. Extract
[atguigu@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
[atguigu@hadoop102 module]$ mv jdk1.8.0_212/ jdk
4. Environment variables
[atguigu@hadoop102 module]$ sudo vim /etc/profile.d/my_env.sh
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk
export PATH=$PATH:$JAVA_HOME/bin
[atguigu@hadoop102 software]$ source /etc/profile.d/my_env.sh
5. Test
[atguigu@hadoop102 module]$ java -version
6. Distribute
[atguigu@hadoop102 module]$ xsync /opt/module/jdk
[atguigu@hadoop102 module]$ sudo /home/atguigu/bin/xsync /etc/profile.d/my_env.sh
7. Refresh the environment on the other machines
[atguigu@hadoop103 module]$ source /etc/profile.d/my_env.sh
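A small loop sketch to confirm the JDK on every node (non-interactive ssh does not load the login profile, so the env file is sourced explicitly):
for host in hadoop100 hadoop101 hadoop102 hadoop103 hadoop104
do
echo ==== $host ====
ssh $host "source /etc/profile.d/my_env.sh; java -version"
done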
5. Install MySQL
1. Prepare the installation packages (MySQL goes on hadoop100, which the rest of this guide connects to)
[atguigu@hadoop100 ~]$ mkdir /opt/software/mysql
[atguigu@hadoop100 software]$ cd /opt/software/mysql/
install_mysql.sh
mysql-community-client-8.0.31-1.el7.x86_64.rpm
mysql-community-client-plugins-8.0.31-1.el7.x86_64.rpm
mysql-community-common-8.0.31-1.el7.x86_64.rpm
mysql-community-icu-data-files-8.0.31-1.el7.x86_64.rpm
mysql-community-libs-8.0.31-1.el7.x86_64.rpm
mysql-community-libs-compat-8.0.31-1.el7.x86_64.rpm
mysql-community-server-8.0.31-1.el7.x86_64.rpm
mysql-connector-j-8.0.31.jar
2. Install dependencies
Note: the Alibaba Cloud servers run a minimal Linux install that lacks the tools below, so install them (installing them all does no harm).
(1) Remove the stock MySQL libraries. Even though MySQL is not installed on the machine, this step must not be skipped.
[atguigu@hadoop100 mysql]$ sudo yum remove mysql-libs
(2) Download and install the dependencies
[atguigu@hadoop100 mysql]$ sudo yum install libaio
[atguigu@hadoop100 mysql]$ sudo yum -y install autoconf
3. Install
[atguigu@hadoop100 mysql]$ su root
[root@hadoop100 mysql]# sh install_mysql.sh
[root@hadoop100 mysql]# exit
6. Prepare mock data (not needed in a production install)
1. Upload
cd /opt/module/data_mocker
2. Edit the application.yml configuration
3. Create the database
7. Install Zookeeper
Note: install on hadoop102, hadoop103, and hadoop104.
1. Extract
[atguigu@hadoop102 software]$ tar -zxvf apache-zookeeper-3.7.1-bin.tar.gz -C /opt/module/
[atguigu@hadoop102 module]$ mv apache-zookeeper-3.7.1-bin/ zookeeper
2. Configure the server ID
[atguigu@hadoop102 zookeeper]$ mkdir zkData
[atguigu@hadoop102 zkData]$ vim myid
2
Note: myid must be unique; change it to 3 on hadoop103 and 4 on hadoop104.
3. Configure zoo.cfg
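After distributing zookeeper (see the xsync step below), a loop sketch like this can write the correct myid on each node (assumes passwordless ssh):
for host in hadoop102 hadoop103 hadoop104
do
id=${host#hadoop10}
ssh $host "mkdir -p /opt/module/zookeeper/zkData && echo $id > /opt/module/zookeeper/zkData/myid"
done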
[atguigu@hadoop102 conf]$ mv zoo_sample.cfg zoo.cfg
[atguigu@hadoop102 conf]$ vim zoo.cfg
Add the following configuration:
dataDir=/opt/module/zookeeper/zkData
#######################cluster##########################
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888
Note: there must be no spaces, and the number after server. must match the myid configured above.
Distribute:
[atguigu@hadoop102 module]$ xsync zookeeper/
4. Cluster operations
[atguigu@hadoop102 zookeeper]$ bin/zkServer.sh start
[atguigu@hadoop103 zookeeper]$ bin/zkServer.sh start
[atguigu@hadoop104 zookeeper]$ bin/zkServer.sh start
[atguigu@hadoop102 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[atguigu@hadoop103 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Mode: leader
[atguigu@hadoop104 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Mode: follower
Note: always check the status after starting; the cluster is only healthy once one node reports leader.
5. Start/stop script
[atguigu@hadoop102 ~]$ cd
[atguigu@hadoop102 ~]$ cd bin/
[atguigu@hadoop102 bin]$ vim myzk.sh
#!/bin/bash
case $1 in
"start"){
for i in hadoop102 hadoop103 hadoop104
do
echo ---------- starting zookeeper on $i ------------
ssh $i "/opt/module/zookeeper/bin/zkServer.sh start"
done
};;
"stop"){
for i in hadoop102 hadoop103 hadoop104
do
echo ---------- stopping zookeeper on $i ------------
ssh $i "/opt/module/zookeeper/bin/zkServer.sh stop"
done
};;
"status"){
for i in hadoop102 hadoop103 hadoop104
do
echo ---------- zookeeper status on $i ------------
ssh $i "/opt/module/zookeeper/bin/zkServer.sh status"
done
};;
esac
Add execute permission:
[atguigu@hadoop102 bin]$ chmod 777 myzk.sh
Start the client:
[atguigu@hadoop103 zookeeper]$ bin/zkCli.sh
Command basic syntax | Description |
help | Show all commands |
ls path | List the children of the current znode; -w watch for child changes, -s include secondary (stat) info |
create | Create a node; -s sequential, -e ephemeral (removed on restart or session timeout) |
get path | Get a node's value; -w watch for content changes, -s include secondary (stat) info |
set | Set a node's value |
stat | Show node status |
delete | Delete a node |
deleteall | Recursively delete a node |
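A short zkCli session illustrating a few of these commands (the /test path is just an example):
[zk: localhost:2181(CONNECTED) 0] create /test "hello"
[zk: localhost:2181(CONNECTED) 1] get -s /test
[zk: localhost:2181(CONNECTED) 2] ls /
[zk: localhost:2181(CONNECTED) 3] delete /test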
8. Install Hadoop
Plan:
1. Configuration files (5)
1.core-site.xml
<configuration>
<!-- Combine the addresses of the NameNodes into one cluster: mycluster -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!-- Hadoop data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop/data</value>
</property>
<!-- Static user for HDFS web UI logins: atguigu -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>atguigu</value>
</property>
<property>
<name>hadoop.proxyuser.atguigu.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.atguigu.groups</name>
<value>*</value>
</property>
<!-- Allow the atguigu user to proxy any user -->
<property>
<name>hadoop.proxyuser.atguigu.users</name>
<value>*</value>
</property>
<!-- zkServer addresses that zkfc connects to -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop102:2181,hadoop103:2181,hadoop104:2181</value>
</property>
<property>
<name>ipc.client.connect.max.retries</name>
<value>100</value>
<description>
Indicates the number of retries a client will make to establish a server connection.
</description>
</property>
<property>
<name>ipc.client.connect.retry.interval</name>
<value>10000</value>
<description>Indicates the number of milliseconds a client will wait for
before retrying to establish a server connection.
</description>
</property>
</configuration>
2.hdfs-site.xml
<configuration>
<!-- NameNode data directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/name</value>
</property>
<!-- DataNode data directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/data</value>
</property>
<!-- JournalNode edits directory -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>${hadoop.tmp.dir}/jn</value>
</property>
<!-- Name of the fully distributed cluster -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<!-- NameNodes in the cluster -->
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- NameNode RPC addresses -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop100:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop101:8020</value>
</property>
<!-- NameNode HTTP addresses -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop100:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop101:9870</value>
</property>
<!-- Where NameNode metadata is stored on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop102:8485;hadoop103:8485;hadoop104:8485/mycluster</value>
</property>
<!-- Proxy provider: how the client determines which NameNode is active -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing: only one NameNode may serve clients at a time -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- SSH fencing requires key-based login -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/atguigu/.ssh/id_rsa</value>
</property>
<!-- Enable automatic NameNode failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
</configuration>
3. workers (DataNode hosts)
hadoop102
hadoop103
hadoop104
4. yarn-site.xml
<configuration>
<!-- Use the MapReduce shuffle service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Enable resourcemanager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Logical id of the ResourceManager cluster -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster-yarn1</value>
</property>
<!-- Logical ids of the ResourceManagers -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<!-- ========== rm1 configuration ========== -->
<!-- Hostname of rm1 -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop100</value>
</property>
<!-- Web UI address of rm1 -->
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>hadoop100:8088</value>
</property>
<!-- Internal RPC address of rm1 -->
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>hadoop100:8032</value>
</property>
<!-- Address AMs use to request resources from rm1 -->
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>hadoop100:8030</value>
</property>
<!-- Address NodeManagers connect to -->
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>hadoop100:8031</value>
</property>
<!-- ========== rm2 configuration ========== -->
<!-- Hostname of rm2 -->
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop101</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop101:8088</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>hadoop101:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>hadoop101:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>hadoop101:8031</value>
</property>
<!-- Zookeeper cluster address -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>hadoop102:2181,hadoop103:2181,hadoop104:2181</value>
</property>
<!-- Enable automatic recovery -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- Store ResourceManager state in the Zookeeper cluster -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- Environment variable inheritance -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Log aggregation server address -->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop100:19888/jobhistory/logs</value>
</property>
<!-- Retain logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
5.mapred-site.xml
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop100:10020</value>
</property>
<!-- JobHistory web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop100:19888</value>
</property>
</configuration>
2. Environment variables
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
3. Check that Zookeeper is running (it must be)!
4. Initialize the JournalNodes
1. On nodes 102, 103, and 104, start the journalnode service (run it on whichever nodes host a JournalNode):
hdfs --daemon start journalnode
5. Format the NameNode
2. On hadoop100, format the NameNode and start it:
hdfs namenode -format
hdfs --daemon start namenode
6. Sync the standby NameNode
3. On hadoop101, sync nn1's metadata, then start the namenode:
hdfs namenode -bootstrapStandby
hdfs --daemon start namenode
7. Start the DataNodes
4. On nodes 102, 103, and 104, start the datanode:
hdfs --daemon start datanode
8. Format zkfc
5. Format zkfc in Zookeeper (running it once, on hadoop100, is enough):
hdfs zkfc -formatZK
9. Start zkfc
6. On nodes 100 and 101, start zkfc:
hdfs --daemon start zkfc
7. Open hadoop100:9870 and hadoop101:9870 and confirm one NameNode shows active.
10. Start/stop script
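The same check can be done on the command line (nn1/nn2 are the ids configured in hdfs-site.xml above):
[atguigu@hadoop100 ~]$ hdfs haadmin -getServiceState nn1
active
[atguigu@hadoop100 ~]$ hdfs haadmin -getServiceState nn2
standby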
#!/bin/bash
if [ $# -lt 1 ]
then
echo "No Args Input..."
exit ;
fi
case $1 in
"start")
echo " =================== 启动 hadoop集群 ==================="
echo " --------------- 启动 hdfs ---------------"
ssh hadoop100 "/opt/module/hadoop/sbin/start-dfs.sh"
echo " --------------- 启动 yarn ---------------"
ssh hadoop100 "/opt/module/hadoop/sbin/start-yarn.sh"
echo " --------------- 启动 historyserver ---------------"
ssh hadoop100 "/opt/module/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
echo " =================== 关闭 hadoop集群 ==================="
echo " --------------- 关闭 historyserver ---------------"
ssh hadoop100 "/opt/module/hadoop/bin/mapred --daemon stop historyserver"
echo " --------------- 关闭 yarn ---------------"
ssh hadoop100 "/opt/module/hadoop/sbin/stop-yarn.sh"
echo " --------------- 关闭 hdfs ---------------"
ssh hadoop100 "/opt/module/hadoop/sbin/stop-dfs.sh"
;;
*)
echo "Input Args Error..."
;;
esac
11. If the install failed: how to re-format
[atguigu@hadoop100 hadoop]$ rm -fr data/
[atguigu@hadoop100 hadoop]$ rm -fr logs/
[atguigu@hadoop100 tmp]$ cd /tmp/
[atguigu@hadoop100 tmp]$ rm -fr hadoop*
9. Install Kafka
Plan:
1. Extract
[atguigu@hadoop102 software]$ tar -zxvf kafka_2.12-3.3.1.tgz -C /opt/module/
[atguigu@hadoop102 module]$ mv kafka_2.12-3.3.1/ kafka
2. Edit the configuration
[atguigu@hadoop102 kafka]$ cd config/
[atguigu@hadoop102 config]$ vim server.properties
# Globally unique broker id; must be a number and must not repeat.
broker.id=0
# Number of threads handling network requests
num.network.threads=3
# Number of threads handling disk I/O
num.io.threads=8
# Socket send buffer size
socket.send.buffer.bytes=102400
# Socket receive buffer size
socket.receive.buffer.bytes=102400
# Maximum socket request size
socket.request.max.bytes=104857600
# Path(s) where Kafka stores its run logs (the data); Kafka creates the path automatically, and multiple disk paths can be configured, separated by ","
log.dirs=/opt/module/kafka/datas
# Number of partitions per topic on this broker
num.partitions=1
# Threads used to recover and clean up data under the data dirs
num.recovery.threads.per.data.dir=1
# Replication factor of the internal offsets topic (default is 1)
offsets.topic.replication.factor=1
# Maximum time a segment file is retained; expired segments are deleted
log.retention.hours=168
# Maximum size of each segment file, default 1G
log.segment.bytes=1073741824
# How often to check whether data has expired; default every 5 minutes
log.retention.check.interval.ms=300000
# Zookeeper connection string (uses a /kafka chroot for easier management)
zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka
Note: broker.id is the broker's globally unique id; it must be a number and must not repeat.
Remember to create the log.dirs path /opt/module/kafka/datas (a loop sketch follows)!
3. Distribute
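A small sketch to create the data directory on all three brokers over ssh (safe to run before or after the distribution step below):
for host in hadoop102 hadoop103 hadoop104
do
ssh $host "mkdir -p /opt/module/kafka/datas"
done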
[atguigu@hadoop102 module]$ xsync kafka/
4. Change broker.id
[atguigu@hadoop103 module]$ vim kafka/config/server.properties
Edit:
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
[atguigu@hadoop104 module]$ vim kafka/config/server.properties
Edit:
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=2
5. Environment variables
[atguigu@hadoop102 module]$ sudo vim /etc/profile.d/my_env.sh
#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
[atguigu@hadoop102 module]$ sudo /home/atguigu/bin/xsync /etc/profile.d/my_env.sh
[atguigu@hadoop103 module]$ source /etc/profile
[atguigu@hadoop104 module]$ source /etc/profile
6. Check that Zookeeper is running; it must be started first!
7. Start and stop Kafka
[atguigu@hadoop102 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties
[atguigu@hadoop103 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties
[atguigu@hadoop104 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties
[atguigu@hadoop102 kafka]$ bin/kafka-server-stop.sh
[atguigu@hadoop103 kafka]$ bin/kafka-server-stop.sh
[atguigu@hadoop104 kafka]$ bin/kafka-server-stop.sh
8. Start/stop script
#! /bin/bash
case $1 in
"start"){
for i in hadoop102 hadoop103 hadoop104
do
echo " --------启动 $i Kafka-------"
ssh $i "/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties"
done
};;
"stop"){
for i in hadoop102 hadoop103 hadoop104
do
echo " --------停止 $i Kafka-------"
ssh $i "/opt/module/kafka/bin/kafka-server-stop.sh "
done
};;
esac
Note: when stopping the Kafka cluster, wait until every Kafka broker process has fully exited before stopping the Zookeeper cluster. Zookeeper stores the Kafka cluster's metadata; if Zookeeper goes down first, Kafka can no longer coordinate its shutdown and the broker processes have to be killed manually.
9. Topic command-line operations
Parameter | Description |
--bootstrap-server <String: server to connect to> | Kafka broker host:port to connect to. |
--topic <String: topic> | Topic to operate on. |
--create | Create a topic. |
--delete | Delete a topic. |
--alter | Alter a topic. |
--list | List all topics. |
--describe | Describe a topic in detail. |
--partitions <Integer: # of partitions> | Set the number of partitions. |
--replication-factor <Integer: replication factor> | Set the replication factor. |
--config <String: name=value> | Override a default configuration value. |
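For example, the topic_log topic used later can be created and listed like this (the partition and replication values are just an illustration):
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --create --topic topic_log --partitions 3 --replication-factor 2
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --list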
10. Producer command-line operations
Parameter | Description |
--bootstrap-server <String: server to connect to> | Kafka broker host:port to connect to. |
--topic <String: topic> | Topic to operate on. |
11. Consumer command-line operations
Parameter | Description |
--bootstrap-server <String: server to connect to> | Kafka broker host:port to connect to. |
--topic <String: topic> | Topic to operate on. |
--from-beginning | Consume from the beginning. |
--group <String: consumer group id> | Consumer group name. |
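A quick end-to-end test with the console tools (run the producer and consumer in separate terminals):
[atguigu@hadoop102 kafka]$ bin/kafka-console-producer.sh --bootstrap-server hadoop102:9092 --topic topic_log
[atguigu@hadoop103 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_log --from-beginning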
10. Install Flume
1. Extract
[atguigu@hadoop100 software]$ tar -zxvf /opt/software/apache-flume-1.10.1-bin.tar.gz -C /opt/module/
[atguigu@hadoop100 module]$ mv /opt/module/apache-flume-1.10.1-bin /opt/module/flume
2. Configure
[atguigu@hadoop100 conf]$ vim log4j2.xml
<Properties>
<Property name="LOG_DIR">/opt/module/flume/log</Property>
</Properties>
. . . . . .
At the bottom, add the console appender so logs are easy to watch while learning:
<Root level="INFO">
<AppenderRef ref="LogFile" />
<AppenderRef ref="Console" />
</Root>
11. Log collection (log files to Kafka)
1. Write the configuration
[atguigu@hadoop100 flume]$ mkdir job
[atguigu@hadoop100 flume]$ vim job/file_to_kafka.conf
a1.sources = r1
a1.channels = c1
#configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
#files to collect
a1.sources.r1.filegroups.f1 = /opt/module/data_mocker/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
#configure the channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
#Kafka cluster
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
#Kafka topic
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false
#wiring
a1.sources.r1.channels = c1
2. Start
Note: start the Zookeeper and Kafka clusters first!
[atguigu@hadoop100 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/file_to_kafka.conf
3. Test
1. Start the data generator (not needed in production)
2. On hadoop102 (any machine with Kafka works), start a consumer:
[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_log
4. Start/stop script
#!/bin/bash
case $1 in
"start"){
echo " --------启动 $i 采集flume-------"
ssh hadoop100 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf/ -f /opt/module/flume/job/file_to_kafka.conf >/dev/null 2>&1 &"
};;
"stop"){
echo " --------停止 $i 采集flume-------"
ssh hadoop100 "ps -ef | grep file_to_kafka | grep -v grep | awk '{print \$2}' | xargs -n1 kill -9 "
}
;;
esac
12. Install Maxwell
1. Install
[atguigu@hadoop100 software]$ tar -zxvf maxwell-1.29.2.tar.gz -C /opt/module/
[atguigu@hadoop100 module]$ mv maxwell-1.29.2/ maxwell
2. Configure MySQL
This is the MySQL-to-Kafka path.
[atguigu@hadoop100 ~]$ sudo vim /etc/my.cnf
Add the following:
#server id
server-id = 1
#Enable binlog; this value becomes the binlog file name prefix
log-bin=mysql-bin
#Binlog format; Maxwell requires row
binlog_format=row
#Database(s) to write to the binlog; adjust to your setup
binlog-do-db=atguigu
Restart MySQL:
[atguigu@hadoop100 ~]$ sudo systemctl restart mysqld
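To confirm binlog is on after the restart (standard MySQL status variables):
mysql> SHOW VARIABLES LIKE 'log_bin';
mysql> SHOW VARIABLES LIKE 'binlog_format';
log_bin should be ON and binlog_format should be ROW.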
3. Create the database and user that Maxwell needs
mysql> CREATE DATABASE maxwell;
Create the Maxwell user and grant the necessary privileges:
mysql> CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'%';
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
4. Configure Maxwell
[atguigu@hadoop100 maxwell]$ cd /opt/module/maxwell
[atguigu@hadoop100 maxwell]$ cp config.properties.example config.properties
# tl;dr config
log_level=info
producer=kafka
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092
kafka_topic=topic_db
# mysql login info
host=hadoop100
user=maxwell
password=maxwell
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai
5. Start/stop
Note: start Zookeeper and Kafka first!
Start:
[atguigu@hadoop100 ~]$ /opt/module/maxwell/bin/maxwell --config /opt/module/maxwell/config.properties --daemon
Stop:
[atguigu@hadoop100 ~]$ ps -ef | grep maxwell | grep -v grep | awk '{print $2}' | xargs kill -9
6. Test
1. Start the data generator
2. On hadoop102 (any machine with Kafka works), start a consumer:
[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_db
3. If messages come through, it is working.
7. Start/stop script
#!/bin/bash
MAXWELL_HOME=/opt/module/maxwell
status_maxwell(){
result=`ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | wc -l`
return $result
}
start_maxwell(){
status_maxwell
if [[ $? -lt 1 ]]; then
echo "启动Maxwell"
/opt/module/maxwell/bin/maxwell --config /opt/module/maxwell/config.properties --daemon
else
echo "Maxwell正在运行"
fi
}
stop_maxwell(){
status_maxwell
if [[ $? -gt 0 ]]; then
echo "停止Maxwell"
ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | awk '{print $2}' | xargs kill -9
else
echo "Maxwell未在运行"
fi
}
case $1 in
start )
start_maxwell
;;
stop )
stop_maxwell
;;
restart )
stop_maxwell
start_maxwell
;;
esac
8. Full sync of historical data
Note: used once when the project is first set up (not needed afterwards).
[atguigu@hadoop100 maxwell]$ /opt/module/maxwell/bin/maxwell-bootstrap --database atguigu --table base_province --config /opt/module/maxwell/config.properties
13. Log-consuming Flume configuration (Kafka to HDFS, log data)
This Flume agent can be installed on any machine.
1. Configure
[atguigu@hadoop101 flume]$ vim job/kafka_to_hdfs_log.conf
#define the components
a1.sources=r1
a1.channels=c1
a1.sinks=k1
#configure source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
#a1.sources.r1.interceptors.i1.type = com.atguigu.gmall.flume.interceptor.TimestampInterceptor$Builder
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.TimestampInterceptor$Builder
#configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
#configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall_remake/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
#control the output file type
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
#wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. Fix data drift
This fix is not a general standard; in production, write it according to your business logic.
1. pom.xml
<dependencies>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.10.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.62</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<!-- The packaging plugin may show errors in the IDE, but it does not affect use; continue with the steps below -->
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
2. Implement the interceptor
Note: there is also a static Builder class.
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class TimestampInterceptor implements Interceptor {
@Override
public void initialize() {
}
@Override
public Event intercept(Event event) {
// Copy the ts field from the JSON body into the timestamp header,
// so the HDFS sink buckets the event by its event time.
Map<String, String> headers = event.getHeaders();
String s = new String(event.getBody(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = JSONObject.parseObject(s);
String ts = jsonObject.getString("ts");
headers.put("timestamp",ts);
return event;
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
@Override
public List<Event> intercept(List<Event> events) {
// Drop events whose body is not valid JSON (intercept returns null for them)
Iterator<Event> iterator = events.iterator();
while (iterator.hasNext()){
Event next = iterator.next();
if(intercept(next) == null){
iterator.remove();
}
}
return events;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder{
@Override
public Interceptor build() {
return new TimestampInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
3. Package
Create plugins.d:
[atguigu@hadoop101 flume]$ mkdir plugins.d
Create myTimestampInterceptor (any name works):
[atguigu@hadoop101 plugins.d]$ mkdir myTimestampInterceptor/
Create three directories: lib, libext, native (fixed names, do not change):
[atguigu@hadoop101 myTimestampInterceptor]$ mkdir lib
[atguigu@hadoop101 myTimestampInterceptor]$ mkdir libext
[atguigu@hadoop101 myTimestampInterceptor]$ mkdir native
Place the built jar in lib.
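For example (the jar name below is hypothetical; use whatever your build produced):
[atguigu@hadoop101 myTimestampInterceptor]$ cp /opt/software/flume-interceptor-1.0-jar-with-dependencies.jar lib/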
4. Start the log-consuming Flume
1. Start the Zookeeper and Kafka clusters
2. Start the log-collection Flume
[atguigu@hadoop100 ~]$ f1.sh start
3. Start the log-consuming Flume
[atguigu@hadoop101 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_log.conf
4. Start the data generator
5. Check HDFS; if the data is there, it worked.
5. Script
#!/bin/bash
case $1 in
"start")
echo " --------启动 hadoop101 日志数据flume-------"
ssh hadoop101 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------停止 hadoop101 日志数据flume-------"
ssh hadoop101 "ps -ef | grep kafka_to_hdfs_log | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
14. Install and use DataX
Install on hadoop100.
1. Install
[atguigu@hadoop100 software]$ tar -zxvf datax.tar.gz -C /opt/module/
2. Self-test
[atguigu@hadoop100 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json
If it runs successfully, the install is good.
3. Generate the DataX configuration files
[atguigu@hadoop100 ~]$ mkdir /opt/module/gen_datax_config
[atguigu@hadoop100 ~]$ cd /opt/module/gen_datax_config
Edit configuration.properties:
mysql.username=root
mysql.password=000000
mysql.host=hadoop100
mysql.port=3306
mysql.database=atguigu
mysql.tables=base_province
#mysql.tables=activity_info,activity_rule,base_trademark,cart_info,base_category1,base_category2,base_category3,coupon_info,sku_attr_value,sku_sale_attr_value,base_dic,sku_info,base_province,spu_info, base_region,promotion_pos,promotion_refer
hdfs.uri=hdfs://hadoop100:8020
import_outdir=/opt/module/datax/job/import
#export_outdir=/opt/module/datax/job/export
Run:
[atguigu@hadoop100 ~]$ java -jar datax-config-generator-1.0.1-jar-with-dependencies.jar
Check the result:
[atguigu@hadoop100 ~]$ cd /opt/module/datax/job/import
[atguigu@hadoop100 import]$ ll
4. Run
[atguigu@hadoop100 bin]$ python /opt/module/datax/bin/datax.py -p"-Dtargetdir=/origin_data/gmall_remake/db/dataX/2022-06-08" /opt/module/datax/job/import/atguigu.base_province.json
5. Full-sync script
This is an example; adjust it before using it in production.
#!/bin/bash
DATAX_HOME=/opt/module/datax
# If a date is passed in, use it as do_date; otherwise use yesterday's date
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d "-1 day" +%F`
fi
# Handle the target path: create it if it does not exist; if it exists, empty it, so the sync task can be re-run safely
handle_targetdir() {
hadoop fs -test -e $1
if [[ $? -eq 1 ]]; then
echo "路径$1不存在,正在创建......"
hadoop fs -mkdir -p $1
else
echo "路径$1已经存在"
fs_count=$(hadoop fs -count $1)
content_size=$(echo $fs_count | awk '{print $3}')
if [[ $content_size -eq 0 ]]; then
echo "路径$1为空"
else
echo "路径$1不为空,正在清空......"
hadoop fs -rm -r -f $1/*
fi
fi
}
#data sync
import_data() {
datax_config=$1
target_dir=$2
handle_targetdir $target_dir
python $DATAX_HOME/bin/datax.py -p"-Dtargetdir=$target_dir" $datax_config
}
case $1 in
"activity_info")
import_data /opt/module/datax/job/import/gmall_remake.activity_info.json /origin_data/gmall_remake/db/activity_info_full/$do_date
;;
"activity_rule")
import_data /opt/module/datax/job/import/gmall_remake.activity_rule.json /origin_data/gmall_remake/db/activity_rule_full/$do_date
;;
"base_category1")
import_data /opt/module/datax/job/import/gmall_remake.base_category1.json /origin_data/gmall_remake/db/base_category1_full/$do_date
;;
"base_category2")
import_data /opt/module/datax/job/import/gmall_remake.base_category2.json /origin_data/gmall_remake/db/base_category2_full/$do_date
;;
"base_category3")
import_data /opt/module/datax/job/import/gmall_remake.base_category3.json /origin_data/gmall_remake/db/base_category3_full/$do_date
;;
"base_dic")
import_data /opt/module/datax/job/import/gmall_remake.base_dic.json /origin_data/gmall_remake/db/base_dic_full/$do_date
;;
"base_province")
import_data /opt/module/datax/job/import/gmall_remake.base_province.json /origin_data/gmall_remake/db/base_province_full/$do_date
;;
"base_region")
import_data /opt/module/datax/job/import/gmall_remake.base_region.json /origin_data/gmall_remake/db/base_region_full/$do_date
;;
"base_trademark")
import_data /opt/module/datax/job/import/gmall_remake.base_trademark.json /origin_data/gmall_remake/db/base_trademark_full/$do_date
;;
"cart_info")
import_data /opt/module/datax/job/import/gmall_remake.cart_info.json /origin_data/gmall_remake/db/cart_info_full/$do_date
;;
"coupon_info")
import_data /opt/module/datax/job/import/gmall_remake.coupon_info.json /origin_data/gmall_remake/db/coupon_info_full/$do_date
;;
"sku_attr_value")
import_data /opt/module/datax/job/import/gmall_remake.sku_attr_value.json /origin_data/gmall_remake/db/sku_attr_value_full/$do_date
;;
"sku_info")
import_data /opt/module/datax/job/import/gmall_remake.sku_info.json /origin_data/gmall_remake/db/sku_info_full/$do_date
;;
"sku_sale_attr_value")
import_data /opt/module/datax/job/import/gmall_remake.sku_sale_attr_value.json /origin_data/gmall_remake/db/sku_sale_attr_value_full/$do_date
;;
"spu_info")
import_data /opt/module/datax/job/import/gmall_remake.spu_info.json /origin_data/gmall_remake/db/spu_info_full/$do_date
;;
"promotion_pos")
import_data /opt/module/datax/job/import/gmall_remake.promotion_pos.json /origin_data/gmall_remake/db/promotion_pos_full/$do_date
;;
"promotion_refer")
import_data /opt/module/datax/job/import/gmall_remake.promotion_refer.json /origin_data/gmall_remake/db/promotion_refer_full/$do_date
;;
"all")
import_data /opt/module/datax/job/import/gmall_remake.activity_info.json /origin_data/gmall_remake/db/activity_info_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.activity_rule.json /origin_data/gmall_remake/db/activity_rule_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_category1.json /origin_data/gmall_remake/db/base_category1_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_category2.json /origin_data/gmall_remake/db/base_category2_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_category3.json /origin_data/gmall_remake/db/base_category3_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_dic.json /origin_data/gmall_remake/db/base_dic_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_province.json /origin_data/gmall_remake/db/base_province_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_region.json /origin_data/gmall_remake/db/base_region_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.base_trademark.json /origin_data/gmall_remake/db/base_trademark_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.cart_info.json /origin_data/gmall_remake/db/cart_info_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.coupon_info.json /origin_data/gmall_remake/db/coupon_info_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.sku_attr_value.json /origin_data/gmall_remake/db/sku_attr_value_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.sku_info.json /origin_data/gmall_remake/db/sku_info_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.sku_sale_attr_value.json /origin_data/gmall_remake/db/sku_sale_attr_value_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.spu_info.json /origin_data/gmall_remake/db/spu_info_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.promotion_pos.json /origin_data/gmall_remake/db/promotion_pos_full/$do_date
import_data /opt/module/datax/job/import/gmall_remake.promotion_refer.json /origin_data/gmall_remake/db/promotion_refer_full/$do_date
;;
esac
15. Incremental table data sync
Database data, from Kafka to HDFS.
1. Configure
[atguigu@hadoop101 flume]$ vim job/kafka_to_hdfs_db.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics = topic_db
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
#a1.sources.r1.interceptors.i1.type = com.atguigu.gmall.flume.interceptor.TimestampAndTableNameInterceptor$Builder
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.TimestampAndTableNameInterceptor$Builder
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall_remake/db/%{tableName}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
## wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
2. Fix data drift
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class TimestampAndTableNameInterceptor implements Interceptor {
private SimpleDateFormat dateFormat = null;
@Override
public void initialize() {
dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
}
/*
* Put the table field from the body into the tableName header,
* and the event time from the body into the timestamp header.
* */
@Override
public Event intercept(Event event) {
Map<String, String> headers = event.getHeaders();
// 1. Get the header and body
byte[] body = event.getBody();
String log = new String(body, StandardCharsets.UTF_8);
// 2. Parse the data in the body
try {
JSONObject jsonObject = JSONObject.parseObject(log);
//parse the table name
String table = jsonObject.getString("table");
// parse the change type
String type = jsonObject.getString("type");
// parse the data field
String dataJSONString = jsonObject.getString("data");
String time = null;
String formatTime = null;
if ("insert".equals(type)) {
time = JSONObject.parseObject(dataJSONString).getString("create_time");
formatTime = String.valueOf(dateFormat.parse(time).getTime());
headers.put("timestamp", formatTime);
} else if ("update".equals(type)) {
time = JSONObject.parseObject(dataJSONString).getString("operate_time");
formatTime = String.valueOf(dateFormat.parse(time).getTime());
headers.put("timestamp", formatTime);
} else if ("bootstrap-insert".equals(type)) {
String ts = jsonObject.getString("ts") + "000";
headers.put("timestamp", ts);
}
headers.put("tableName",table);
return event;
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
@Override
public List<Event> intercept(List<Event> list) {
Iterator<Event> iterator = list.iterator();
while (iterator.hasNext()) {
Event event = iterator.next();
if (intercept(event) == null) {
iterator.remove();
}
}
return list;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder {
@Override
public Interceptor build() {
return new TimestampAndTableNameInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
3. Package
Create plugins.d (skip any mkdir steps already done for the section-13 interceptor):
[atguigu@hadoop101 flume]$ mkdir plugins.d
Create myTimestampInterceptor (any name works):
[atguigu@hadoop101 plugins.d]$ mkdir myTimestampInterceptor/
Create three directories: lib, libext, native (fixed names, do not change):
[atguigu@hadoop101 myTimestampInterceptor]$ mkdir lib
[atguigu@hadoop101 myTimestampInterceptor]$ mkdir libext
[atguigu@hadoop101 myTimestampInterceptor]$ mkdir native
Place the built jar in lib.
4. Test
Start Zookeeper and Kafka, then:
[atguigu@hadoop101 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_db.conf
5. Script
#!/bin/bash
case $1 in
"start")
echo " --------启动 hadoop101 业务数据flume-------"
ssh hadoop101 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------停止 hadoop101 业务数据flume-------"
ssh hadoop101 "ps -ef | grep kafka_to_hdfs_db | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
6. Collection pipeline start/stop script
Not used in production; don't mess around with it there.
#!/bin/bash
case $1 in
"start"){
echo ================== starting the cluster ==================
#start the Zookeeper cluster
myzk.sh start
#start the Hadoop cluster
myhadoop start
#start the Kafka cluster
mykafka.sh start
#start the log-collection Flume
f1.sh start
#start the log-consuming Flume
f2.sh start
#start the business-data Flume
f3.sh start
#start Maxwell
mymxw.sh start
};;
"stop"){
echo ================== stopping the cluster ==================
#stop Maxwell
mymxw.sh stop
#stop the business-data Flume
f3.sh stop
#stop the log-consuming Flume
f2.sh stop
#stop the log-collection Flume
f1.sh stop
#stop the Kafka cluster
mykafka.sh stop
#stop the Hadoop cluster
myhadoop stop
#stop the Zookeeper cluster
myzk.sh stop
};;
esac
16. Install Hive
1. Extract
[atguigu@hadoop100 software]$ tar -zxvf /opt/software/apache-hive-3.1.3-bin.tar.gz -C /opt/module/
[atguigu@hadoop100 software]$ mv /opt/module/apache-hive-3.1.3-bin/ /opt/module/hive
2. Environment variables
[atguigu@hadoop100 software]$ sudo vim /etc/profile.d/my_env.sh
#HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin
[atguigu@hadoop100 software]$ source /etc/profile.d/my_env.sh
Resolve the logging jar conflict; go to /opt/module/hive/lib:
[atguigu@hadoop100 lib]$ mv log4j-slf4j-impl-2.17.1.jar log4j-slf4j-impl-2.17.1.jar.bak
3. Point the Hive metastore at MySQL
Copy the JDBC driver:
[atguigu@hadoop100 lib]$ cp /opt/software/mysql/mysql-connector-j-8.0.31.jar /opt/module/hive/lib/
Configure the Metastore to use MySQL:
[atguigu@hadoop100 conf]$ vim hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop100:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>000000</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop100</value>
</property>
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
</configuration>
4. Start Hive
1. Log in to MySQL
[atguigu@hadoop100 conf]$ mysql -uroot -p000000
2. Create the Hive metastore database
mysql> create database metastore;
3. Initialize the Hive metastore
[atguigu@hadoop100 conf]$ schematool -initSchema -dbType mysql -verbose
4. Change the metadata character set
mysql>use metastore;
mysql> alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
mysql> alter table TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;
mysql> quit;
5. Start the Hive client
[atguigu@hadoop100 hive]$ bin/hive
6. To connect with a client tool, start hiveserver2:
[atguigu@hadoop100 bin]$ hiveserver2
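Once hiveserver2 is up, you can also connect with the bundled Beeline client (10000 is the thrift port set in hive-site.xml above):
[atguigu@hadoop100 hive]$ bin/beeline -u jdbc:hive2://hadoop100:10000 -n atguigu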