Data Collection Project (HA) (Five Servers)

Data Collection Project


Project description: this collection project runs on five servers configured for HA.
Project plan:

1. Prepare the jump server (bastion host)

2. Remember to open port 3306 on the servers


Note: if the port is not opened, Navicat on your PC will not be able to connect.
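
A quick way to confirm the port is reachable before troubleshooting Navicat (a minimal check, assuming hadoop100's public IP from the hosts section below and that nc is available on the machine you test from):

nc -zv 39.98.58.145 3306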

3. Server preparation


1. Configure the hosts file on your local PC

vim /etc/hosts

39.98.58.145 hadoop100

39.98.58.198 hadoop101

39.98.60.160 hadoop102

47.92.108.72 hadoop103

39.98.54.179 hadoop104

Note: these are the public IP addresses

2. Configure the hosts file on the servers

172.19.21.168 hadoop101

172.19.21.169 hadoop102

172.28.19.250 hadoop103

172.19.21.167 hadoop104

172.28.19.251 hadoop100

Note: these are the private (internal) IP addresses

3. Change each server's hostname

vim /etc/hostname

hadoop100
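
Repeat on each of the five servers with its own name. As a sketch of an equivalent one-step alternative on CentOS 7 (re-log in to see the new name in the prompt):

sudo hostnamectl set-hostname hadoop100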

4. Install common packages

yum install -y epel-release

yum install -y net-tools

yum install -y vim

yum install -y lrzsz

yum install -y psmisc nc rsync lrzsz ntp libzstd openssl-static tree iotop git

5. Disable the firewall

systemctl stop firewalld

systemctl disable firewalld.service

Note: in enterprise environments the firewall on individual servers is usually disabled; the company deploys a hardened firewall at the network perimeter instead.

6. Add a user

1. Create the atguigu user

[root@hadoop100 ~]# useradd atguigu

[root@hadoop100 ~]# passwd atguigu

2. Give the atguigu user root privileges

[root@hadoop100 ~]# vim /etc/sudoers

## Allow root to run any commands anywhere

root ALL=(ALL) ALL

## Allows people in group wheel to run all commands

%wheel ALL=(ALL) ALL

atguigu ALL=(ALL) NOPASSWD:ALL

Note: do not put the atguigu line directly under the root line. Because atguigu is also matched by the wheel group rule, configuring passwordless sudo for atguigu first gets overridden back to requiring a password when sudo reaches the %wheel line. The atguigu line therefore has to go below the %wheel line.

3. Create the module and software directories under /opt

[root@hadoop100 ~]# mkdir /opt/module

[root@hadoop100 ~]# mkdir /opt/software

[root@hadoop100 ~]# chown atguigu:atguigu /opt/module

[root@hadoop100 ~]# chown atguigu:atguigu /opt/software

4. Configure passwordless SSH

[atguigu@hadoop102 .ssh]$ ssh-keygen -t rsa

Then press Enter three times; two files are generated: id_rsa (private key) and id_rsa.pub (public key)

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop100

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop101

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104

Note: repeat this on every machine that needs passwordless access to the others, i.e. generate a key and run ssh-copy-id on that machine.
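
A minimal sketch for pushing the public key to all five hosts in one loop (assumes ssh-keygen has already been run on the current machine):

for host in hadoop100 hadoop101 hadoop102 hadoop103 hadoop104; do ssh-copy-id $host; done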

Cluster distribution script xsync
1. Create a bin directory under the home directory /home/atguigu

[atguigu@hadoop102 ~]$ mkdir bin

2. Write the script

[atguigu@hadoop102 ~]$ cd /home/atguigu/bin

[atguigu@hadoop102 bin]$ vim xsync

#!/bin/bash

#1. Check the number of arguments

if [ $# -lt 1 ]

then

echo Not Enough Arguments!

exit;

fi

#2. Iterate over every machine in the cluster

for host in hadoop102 hadoop103 hadoop104

do

echo ==================== $host ====================

#3. Iterate over all files/directories and send them one by one

for file in $@

do

#4 Check whether the file exists

if [ -e $file ]

then

#5. Get the parent directory

pdir=$(cd -P $(dirname $file); pwd)

#6. Get the file name

fname=$(basename $file)

ssh $host "mkdir -p $pdir"

rsync -av $pdir/$fname $host:$pdir

else

echo $file does not exist!

fi

done

done

3. Add execute permission

[atguigu@hadoop102 bin]$ chmod 777 xsync

4. Test the script

[atguigu@hadoop102 bin]$ xsync xsync

4. Install the JDK

1. Remove any pre-installed JDK

Note: run this on all five machines

[atguigu@hadoop102 opt]# sudo rpm -qa | grep -i java | xargs -n1 sudo rpm -e --nodeps

2. Upload the installer

[atguigu@hadoop102 software]# ls /opt/software/

3. Extract

[atguigu@hadoop102 software]# tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

[atguigu@hadoop102 module]$ mv jdk1.8.0_212/ jdk

4. Configure environment variables

[atguigu@hadoop102 module]# sudo vim /etc/profile.d/my_env.sh

#JAVA_HOME

export JAVA_HOME=/opt/module/jdk

export PATH=$PATH:$JAVA_HOME/bin

[atguigu@hadoop102 software]$ source /etc/profile.d/my_env.sh

5. Verify

[atguigu@hadoop102 module]# java -version

6. Distribute

[atguigu@hadoop102 module]$ xsync /opt/module/jdk

[atguigu@hadoop102 module]$ sudo /home/atguigu/bin/xsync /etc/profile.d/my_env.sh

7. Refresh the environment on the other machines

[atguigu@hadoop103 module]$ source /etc/profile.d/my_env.sh
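
To confirm the JDK is usable on every node that received it in one pass, a small sketch (assumes the profile script has been distributed as above):

for host in hadoop102 hadoop103 hadoop104; do ssh $host "source /etc/profile.d/my_env.sh; java -version"; done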

5. Install MySQL

1. Prepare the installation packages

[atguigu@hadoop102 ~]$ mkdir /opt/software/mysql

[atguigu@hadoop102 software]$ cd /opt/software/mysql/

install_mysql.sh

mysql-community-client-8.0.31-1.el7.x86_64.rpm

mysql-community-client-plugins-8.0.31-1.el7.x86_64.rpm

mysql-community-common-8.0.31-1.el7.x86_64.rpm

mysql-community-icu-data-files-8.0.31-1.el7.x86_64.rpm

mysql-community-libs-8.0.31-1.el7.x86_64.rpm

mysql-community-libs-compat-8.0.31-1.el7.x86_64.rpm

mysql-community-server-8.0.31-1.el7.x86_64.rpm

mysql-connector-j-8.0.31.jar

2. Install dependencies

Note: the Aliyun servers run a minimal Linux install that lacks the tools below, so install them. (Installing them even if already present does no harm.)

(1) Remove the mysql-libs dependency. Even though MySQL is not installed on the machine yet, this step must not be skipped.

[atguigu@hadoop102 mysql]$ sudo yum remove mysql-libs

(2) Download and install the dependencies

[atguigu@hadoop102 mysql]$ sudo yum install libaio

[atguigu@hadoop102 mysql]$ sudo yum -y install autoconf

3. Install

[atguigu@hadoop102 mysql]$ su root

[root@hadoop102 mysql]# sh install_mysql.sh

[root@hadoop102 mysql]# exit
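
A quick sanity check that the server is up and the root account works (assuming install_mysql.sh sets the root password to 000000, the value used later in this guide):

mysql -uroot -p000000 -e "SELECT VERSION();"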

6. Data preparation (not needed for an enterprise installation)

1. Upload the data mocker

cd /opt/module/data_mocker

2. Edit the application.yml configuration
3. Create the database

7. Install Zookeeper

Note: install it on hadoop102, hadoop103, and hadoop104

1. Extract

[atguigu@hadoop102 software]$ tar -zxvf apache-zookeeper-3.7.1-bin.tar.gz -C /opt/module/

[atguigu@hadoop102 module]$ mv apache-zookeeper-3.7.1-bin/ zookeeper

2. Configure the server ID

[atguigu@hadoop102 zookeeper]$ mkdir zkData

[atguigu@hadoop102 zkData]$ vim myid

2

Note: myid must be unique; change it to 3 on hadoop103 and to 4 on hadoop104

3. Configure zoo.cfg

[atguigu@hadoop102 conf]$ mv zoo_sample.cfg zoo.cfg

[atguigu@hadoop102 conf]$ vim zoo.cfg

Add the following configuration:

dataDir=/opt/module/zookeeper/zkData

#######################cluster##########################

server.2=hadoop102:2888:3888

server.3=hadoop103:2888:3888

server.4=hadoop104:2888:3888

Note: there must be no stray spaces, and the number after server. must match the myid configured above

Distribute:

[atguigu@hadoop102 module]$ xsync zookeeper/

4. Cluster operations

[atguigu@hadoop102 zookeeper]$ bin/zkServer.sh start

[atguigu@hadoop103 zookeeper]$ bin/zkServer.sh start

[atguigu@hadoop104 zookeeper]$ bin/zkServer.sh start

[atguigu@hadoop102 zookeeper]# bin/zkServer.sh status

JMX enabled by default

Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg

Mode: follower

[atguigu@hadoop103 zookeeper]# bin/zkServer.sh status

JMX enabled by default

Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg

Mode: leader

[atguigu@hadoop104 zookeeper]# bin/zkServer.sh status

JMX enabled by default

Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg

Mode: follower

Note: always check the status after starting; the cluster is only healthy once one node reports itself as leader

5. Start/stop script

[atguigu@hadoop102 ~]$ cd

[atguigu@hadoop102 ~]$ cd bin/

[atguigu@hadoop102 bin]$ vim myzk.sh

#!/bin/bash

case $1 in

"start"){

for i in hadoop102 hadoop103 hadoop104

do

echo ---------- zookeeper $i start ------------

ssh $i "/opt/module/zookeeper/bin/zkServer.sh start"

done

};;

"stop"){

for i in hadoop102 hadoop103 hadoop104

do

echo ---------- zookeeper $i stop ------------

ssh $i "/opt/module/zookeeper/bin/zkServer.sh stop"

done

};;

"status"){

for i in hadoop102 hadoop103 hadoop104

do

echo ---------- zookeeper $i status ------------

ssh $i "/opt/module/zookeeper/bin/zkServer.sh status"

done

};;

esac

Add execute permission:

[atguigu@hadoop102 bin]$ chmod 777 myzk.sh

Start the client:

[atguigu@hadoop103 zookeeper]$ bin/zkCli.sh

Basic commands and what they do (a short example session follows this list):

help : show all available commands

ls path : list the children of the given znode; -w watches for child changes, -s shows additional metadata

create : create a znode; -s creates a sequential node, -e creates an ephemeral node (removed on restart or session timeout)

get path : get the value of a znode; -w watches for content changes, -s shows additional metadata

set : set the value of a znode

stat : show the status of a znode

delete : delete a znode

deleteall : recursively delete a znode
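
A short illustrative zkCli.sh session exercising a few of these commands (the /test node is only an example):

[zk: localhost:2181(CONNECTED) 0] create /test "hello"
[zk: localhost:2181(CONNECTED) 1] ls /
[zk: localhost:2181(CONNECTED) 2] get -s /test
[zk: localhost:2181(CONNECTED) 3] set /test "world"
[zk: localhost:2181(CONNECTED) 4] delete /test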

8. Install Hadoop

Cluster plan:

1. Configuration files (5)

1.core-site.xml

<configuration>

<!-- Combine the addresses of multiple NameNodes into one cluster named mycluster -->

<property>

<name>fs.defaultFS</name>

<value>hdfs://mycluster</value>

</property>

<!-- Hadoop data storage directory -->

<property>

<name>hadoop.tmp.dir</name>

<value>/opt/module/hadoop/data</value>

</property>

<!-- Static user for the HDFS web UI: atguigu -->

<property>

<name>hadoop.http.staticuser.user</name>

<value>atguigu</value>

</property>

<property>

<name>hadoop.proxyuser.atguigu.hosts</name>

<value>*</value>

</property>

<property>

<name>hadoop.proxyuser.atguigu.groups</name>

<value>*</value>

</property>

<!-- Allow the atguigu user to proxy any user -->

<property>

<name>hadoop.proxyuser.atguigu.users</name>

<value>*</value>

</property>

<!-- ZooKeeper quorum that zkfc connects to -->

<property>

<name>ha.zookeeper.quorum</name>

<value>hadoop102:2181,hadoop103:2181,hadoop104:2181</value>

</property>

<property>

<name>ipc.client.connect.max.retries</name>

<value>100</value>

<description>

Indicates the number of retries a client will make to establish a server connection.

</description>

</property>

<property>

<name>ipc.client.connect.retry.interval</name>

<value>10000</value>

<description>Indicates the number of milliseconds a client will wait for

before retrying to establish a server connection.

</description>

</property>

</configuration>

2.hdfs-site.xml

<configuration>

<!-- NameNode data storage directory -->

<property>

<name>dfs.namenode.name.dir</name>

<value>file://${hadoop.tmp.dir}/name</value>

</property>

<!-- DataNode data storage directory -->

<property>

<name>dfs.datanode.data.dir</name>

<value>file://${hadoop.tmp.dir}/data</value>

</property>

<!-- JournalNode data storage directory -->

<property>

<name>dfs.journalnode.edits.dir</name>

<value>${hadoop.tmp.dir}/jn</value>

</property>

<!-- Name of the fully distributed cluster (nameservice) -->

<property>

<name>dfs.nameservices</name>

<value>mycluster</value>

</property>

<!-- NameNodes in the cluster -->

<property>

<name>dfs.ha.namenodes.mycluster</name>

<value>nn1,nn2</value>

</property>

<!-- NameNode RPC addresses -->

<property>

<name>dfs.namenode.rpc-address.mycluster.nn1</name>

<value>hadoop100:8020</value>

</property>

<property>

<name>dfs.namenode.rpc-address.mycluster.nn2</name>

<value>hadoop101:8020</value>

</property>

<!-- NameNode HTTP addresses -->

<property>

<name>dfs.namenode.http-address.mycluster.nn1</name>

<value>hadoop100:9870</value>

</property>

<property>

<name>dfs.namenode.http-address.mycluster.nn2</name>

<value>hadoop101:9870</value>

</property>

<!-- Location of the NameNode edits shared on the JournalNodes -->

<property>

<name>dfs.namenode.shared.edits.dir</name>

<value>qjournal://hadoop102:8485;hadoop103:8485;hadoop104:8485/mycluster</value>

</property>

<!-- Failover proxy provider: lets clients determine which NameNode is Active -->

<property>

<name>dfs.client.failover.proxy.provider.mycluster</name>

<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

</property>

<!-- Fencing method, so that only one NameNode serves requests at a time -->

<property>

<name>dfs.ha.fencing.methods</name>

<value>sshfence</value>

</property>

<!-- sshfence requires key-based SSH login -->

<property>

<name>dfs.ha.fencing.ssh.private-key-files</name>

<value>/home/atguigu/.ssh/id_rsa</value>

</property>

<!-- Enable automatic NameNode failover -->

<property>

<name>dfs.ha.automatic-failover.enabled</name>

<value>true</value>

</property>

</configuration>

3.workers (DataNode hosts)

hadoop102

hadoop103

hadoop104

4.yarn-site.xml

<configuration>

<!-- Use mapreduce_shuffle as the auxiliary service -->

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<!-- Enable ResourceManager HA -->

<property>

<name>yarn.resourcemanager.ha.enabled</name>

<value>true</value>

</property>

<!-- Cluster id covering the two ResourceManagers -->

<property>

<name>yarn.resourcemanager.cluster-id</name>

<value>cluster-yarn1</value>

</property>

<!-- Logical list of ResourceManager ids -->

<property>

<name>yarn.resourcemanager.ha.rm-ids</name>

<value>rm1,rm2</value>

</property>

<!-- ========== rm1 configuration ========== -->

<!-- Hostname of rm1 -->

<property>

<name>yarn.resourcemanager.hostname.rm1</name>

<value>hadoop100</value>

</property>

<!-- Web UI address of rm1 -->

<property>

<name>yarn.resourcemanager.webapp.address.rm1</name>

<value>hadoop100:8088</value>

</property>

<!-- Internal communication address of rm1 -->

<property>

<name>yarn.resourcemanager.address.rm1</name>

<value>hadoop100:8032</value>

</property>

<!-- Address the ApplicationMaster uses to request resources from rm1 -->

<property>

<name>yarn.resourcemanager.scheduler.address.rm1</name>

<value>hadoop100:8030</value>

</property>

<!-- Address NodeManagers use to connect -->

<property>

<name>yarn.resourcemanager.resource-tracker.address.rm1</name>

<value>hadoop100:8031</value>

</property>

<!-- ========== rm2 configuration ========== -->

<!-- Hostname of rm2 -->

<property>

<name>yarn.resourcemanager.hostname.rm2</name>

<value>hadoop101</value>

</property>

<property>

<name>yarn.resourcemanager.webapp.address.rm2</name>

<value>hadoop101:8088</value>

</property>

<property>

<name>yarn.resourcemanager.address.rm2</name>

<value>hadoop101:8032</value>

</property>

<property>

<name>yarn.resourcemanager.scheduler.address.rm2</name>

<value>hadoop101:8030</value>

</property>

<property>

<name>yarn.resourcemanager.resource-tracker.address.rm2</name>

<value>hadoop101:8031</value>

</property>

<!-- ZooKeeper cluster address -->

<property>

<name>yarn.resourcemanager.zk-address</name>

<value>hadoop102:2181,hadoop103:2181,hadoop104:2181</value>

</property>

<!-- Enable automatic recovery -->

<property>

<name>yarn.resourcemanager.recovery.enabled</name>

<value>true</value>

</property>

<!-- Store ResourceManager state in the ZooKeeper cluster -->

<property>

<name>yarn.resourcemanager.store.class</name>

<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>

</property>

<!-- Environment variable inheritance -->

<property>

<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

</property>

<!-- Enable log aggregation -->

<property>

<name>yarn.log-aggregation-enable</name>

<value>true</value>

</property>

<!-- Log aggregation server address -->

<property>

<name>yarn.log.server.url</name>

<value>http://hadoop100:19888/jobhistory/logs</value>

</property>

<!-- Keep aggregated logs for 7 days -->

<property>

<name>yarn.log-aggregation.retain-seconds</name>

<value>604800</value>

</property>

</configuration>

5.mapred-site.xml

<configuration>

<!-- Run MapReduce jobs on YARN -->

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

<!-- JobHistory server address -->

<property>

<name>mapreduce.jobhistory.address</name>

<value>hadoop100:10020</value>

</property>

<!-- JobHistory server web UI address -->

<property>

<name>mapreduce.jobhistory.webapp.address</name>

<value>hadoop100:19888</value>

</property>

</configuration>

2. Environment variables

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
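
As with the JDK, this profile change has to reach the other nodes and be re-sourced there; a sketch reusing the earlier pattern (run from wherever my_env.sh was edited):

sudo /home/atguigu/bin/xsync /etc/profile.d/my_env.sh
source /etc/profile.d/my_env.sh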

3. Check that Zookeeper is running (it must be running)!!!
4. Initialize the JournalNodes

1. On nodes 102, 103 and 104, run the following command to start the journalnode service (run it on whichever nodes host a JournalNode)

hdfs --daemon start journalnode

5. Format the NameNode

2. On hadoop100, format the NameNode and start it

hdfs namenode -format

hdfs --daemon start namenode

6. Synchronize the standby NameNode

3. On hadoop101, synchronize nn1's metadata and start the namenode

hdfs namenode -bootstrapStandby

hdfs --daemon start namenode

7. Start the DataNodes

4. On nodes 102, 103 and 104, start the datanode

hdfs --daemon start datanode

8. Format ZKFC

5. On nodes 100 and 101, format zkfc

hdfs zkfc -formatZK

9. Start ZKFC

6. On nodes 100 and 101, start zkfc

hdfs --daemon start zkfc

7. Open hadoop100:9870 and hadoop101:9870 in a browser and confirm that one NameNode is active
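
The same check can be done from the command line (nn1/nn2 are the ids defined in hdfs-site.xml above):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2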

10. Start/stop script

#!/bin/bash

if [ $# -lt 1 ]

then

echo "No Args Input..."

exit ;

fi

case $1 in

"start")

echo " =================== 启动 hadoop集群 ==================="

echo " --------------- 启动 hdfs ---------------"

ssh hadoop100 "/opt/module/hadoop/sbin/start-dfs.sh"

echo " --------------- 启动 yarn ---------------"

ssh hadoop100 "/opt/module/hadoop/sbin/start-yarn.sh"

echo " --------------- 启动 historyserver ---------------"

ssh hadoop100 "/opt/module/hadoop/bin/mapred --daemon start historyserver"

;;

"stop")

echo " =================== 关闭 hadoop集群 ==================="

echo " --------------- 关闭 historyserver ---------------"

ssh hadoop100 "/opt/module/hadoop/bin/mapred --daemon stop historyserver"

echo " --------------- 关闭 yarn ---------------"

ssh hadoop100 "/opt/module/hadoop/sbin/stop-yarn.sh"

echo " --------------- 关闭 hdfs ---------------"

ssh hadoop100 "/opt/module/hadoop/sbin/stop-dfs.sh"

;;

*)

echo "Input Args Error..."

;;

esac

11. If the installation goes wrong: how to wipe and reformat

[atguigu@hadoop100 hadoop]$ rm -fr data/

[atguigu@hadoop100 hadoop]$ rm -fr logs/

[atguigu@hadoop100 tmp]$ cd /tmp/

[atguigu@hadoop100 tmp]$ rm -fr hadoop*

9. Install Kafka

Plan:

1. Extract

[atguigu@hadoop102 software]$ tar -zxvf kafka_2.12-3.3.1.tgz -C /opt/module/

[atguigu@hadoop102 module]$ mv kafka_2.12-3.3.1/ kafka

2. Edit the configuration

[atguigu@hadoop102 kafka]$ cd config/

[atguigu@hadoop102 config]$ vim server.properties

#Globally unique broker id; must not be duplicated and must be a number.

broker.id=0

#Number of threads handling network requests

num.network.threads=3

#Number of threads handling disk I/O

num.io.threads=8

#Send socket buffer size

socket.send.buffer.bytes=102400

#Receive socket buffer size

socket.receive.buffer.bytes=102400

#Maximum request size the socket accepts

socket.request.max.bytes=104857600

#Path where Kafka stores its log (data) files; it does not need to be created in advance, Kafka creates it automatically; multiple disk paths can be configured, separated by ","

log.dirs=/opt/module/kafka/datas

#Default number of partitions per topic on this broker

num.partitions=1

#Number of threads used to recover and clean data under the data dirs

num.recovery.threads.per.data.dir=1

# Number of replicas when each topic is created; the default is 1 replica

offsets.topic.replication.factor=1

#Maximum time a segment file is retained; it is deleted after this expires

log.retention.hours=168

#Maximum size of each segment file; the default maximum is 1G

log.segment.bytes=1073741824

# Interval for checking expired data; by default expiry is checked every 5 minutes

log.retention.check.interval.ms=300000

#Zookeeper cluster connection address (a /kafka chroot is created under the ZK root for easier management)

zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka

Note: broker.id must be globally unique and numeric (broker.id=0 on this node).

log.dirs=/opt/module/kafka/datas (remember to create this path!!!!!)
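
A small sketch for creating the data path up front per the note above (the empty directory is then carried to the other brokers by the xsync in the next step):

[atguigu@hadoop102 kafka]$ mkdir -p /opt/module/kafka/datas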

3. Distribute

[atguigu@hadoop102 module]$ xsync kafka/

4. Change broker.id

[atguigu@hadoop103 module]$ vim kafka/config/server.properties

Change:

# The id of the broker. This must be set to a unique integer for each broker.

broker.id=1

[atguigu@hadoop104 module]$ vim kafka/config/server.properties

Change:

# The id of the broker. This must be set to a unique integer for each broker.

broker.id=2

5. Environment variables

[atguigu@hadoop102 module]$ sudo vim /etc/profile.d/my_env.sh

#KAFKA_HOME

export KAFKA_HOME=/opt/module/kafka

export PATH=$PATH:$KAFKA_HOME/bin

[atguigu@hadoop102 module]$ sudo /home/atguigu/bin/xsync /etc/profile.d/my_env.sh

[atguigu@hadoop103 module]$ source /etc/profile

[atguigu@hadoop104 module]$ source /etc/profile

6. Check that Zookeeper is running; it must be started!!!!!!
7. Start and stop Kafka

[atguigu@hadoop102 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties

[atguigu@hadoop103 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties

[atguigu@hadoop104 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties

[atguigu@hadoop102 kafka]$ bin/kafka-server-stop.sh

[atguigu@hadoop103 kafka]$ bin/kafka-server-stop.sh

[atguigu@hadoop104 kafka]$ bin/kafka-server-stop.sh

8. Start/stop script

#! /bin/bash

case $1 in

"start"){

for i in hadoop102 hadoop103 hadoop104

do

echo " --------启动 $i Kafka-------"

ssh $i "/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties"

done

};;

"stop"){

for i in hadoop102 hadoop103 hadoop104

do

echo " --------停止 $i Kafka-------"

ssh $i "/opt/module/kafka/bin/kafka-server-stop.sh "

done

};;

esac

*Note:* when stopping the Kafka cluster, wait until every Kafka broker process has fully stopped before stopping the Zookeeper cluster. Zookeeper stores the Kafka cluster's metadata; if Zookeeper is stopped first, the Kafka brokers can no longer coordinate their shutdown and the processes have to be killed manually.

9. Topic command-line operations

Parameters (an example follows this list):

--bootstrap-server <String: server to connect to> : host name and port of the Kafka broker to connect to.

--topic <String: topic> : the topic to operate on.

--create : create a topic.

--delete : delete a topic.

--alter : alter a topic.

--list : list all topics.

--describe : show a detailed description of a topic.

--partitions <Integer: # of partitions> : set the number of partitions.

--replication-factor <Integer: replication factor> : set the partition replication factor.

--config <String: name=value> : override the default configuration.
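
As an illustration of these flags (the topic name first is only an example; any broker in the cluster can serve as the bootstrap server):

[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --create --topic first --partitions 3 --replication-factor 2
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --list
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --describe --topic first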

10. Producer command-line operations

Parameters (an example follows this list):

--bootstrap-server <String: server to connect to> : host name and port of the Kafka broker to connect to.

--topic <String: topic> : the topic to operate on.
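
For example (a sketch reusing the first topic created above; type messages interactively once the prompt appears):

[atguigu@hadoop102 kafka]$ bin/kafka-console-producer.sh --bootstrap-server hadoop102:9092 --topic first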

11. Consumer command-line operations

Parameters (an example follows this list):

--bootstrap-server <String: server to connect to> : host name and port of the Kafka broker to connect to.

--topic <String: topic> : the topic to operate on.

--from-beginning : consume from the beginning of the topic.

--group <String: consumer group id> : specify the consumer group name.
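
For example (a sketch matching the producer above):

[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic first --from-beginning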

10. Install Flume

1. Extract

[atguigu@hadoop100 software]$ tar -zxvf /opt/software/apache-flume-1.10.1-bin.tar.gz -C /opt/module/

[atguigu@hadoop100 module]$ mv /opt/module/apache-flume-1.10.1-bin /opt/module/flume

2. Configure logging

[atguigu@hadoop100 conf]$ vim log4j2.xml

<Properties>

<Property name="LOG_DIR">/opt/module/flume/log</Property>

</Properties>

. . . . . .

# Add the Console appender at the bottom so logs also print to the console, which is convenient while learning

<Root level="INFO">

<AppenderRef ref="LogFile" />

<AppenderRef ref="Console" />

</Root>

11. Log collection (log files to Kafka)

1. Write the configuration

[atguigu@hadoop100 flume]$ mkdir job

[atguigu@hadoop100 flume]$ vim job/file_to_kafka.conf

a1.sources = r1

a1.channels = c1

#Configure the source

a1.sources.r1.type = TAILDIR

a1.sources.r1.filegroups = f1

#Files to collect

a1.sources.r1.filegroups.f1 = /opt/module/data_mocker/log/app.*

a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json

#Configure the channel

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel

#Kafka cluster

a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092

#Kafka topic

a1.channels.c1.kafka.topic = topic_log

a1.channels.c1.parseAsFlumeEvent = false

#Bind the components

a1.sources.r1.channels = c1

2. Start

Note: start the Zookeeper and Kafka clusters first!!!!!!!!!

[atguigu@hadoop100 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/file_to_kafka.conf

3. Test

1. Start the data generator (not needed in a production environment)

2. Start a consumer on hadoop102 (any machine with Kafka installed works)

[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_log

4. Start/stop script

#!/bin/bash

case $1 in

"start"){

echo " --------启动 $i 采集flume-------"

ssh hadoop100 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf/ -f /opt/module/flume/job/file_to_kafka.conf >/dev/null 2>&1 &"

};;

"stop"){

echo " --------停止 $i 采集flume-------"

ssh hadoop100 "ps -ef | grep file_to_kafka | grep -v grep | awk '{print \$2}' | xargs -n1 kill -9 "

}

;;

esac

12. Install Maxwell

1. Install

[atguigu@hadoop100 maxwell]$ tar -zxvf maxwell-1.29.2.tar.gz -C /opt/module/

[atguigu@hadoop100 module]$ mv maxwell-1.29.2/ maxwell

2. Configure MySQL

This is the MySQL-to-Kafka path

[atguigu@hadoop100 ~]$ sudo vim /etc/my.cnf

Add the following configuration

#Database (server) id

server-id = 1

#Enable the binlog; this value is used as the binlog file name prefix

log-bin=mysql-bin

#Binlog format; Maxwell requires row format

binlog_format=row

#Database for which the binlog is enabled; adjust to your actual setup

binlog-do-db=atguigu

Restart MySQL

[atguigu@hadoop100 ~]$ sudo systemctl restart mysqld

3. Create the database and user that Maxwell needs

mysql> CREATE DATABASE maxwell;

Create the Maxwell user and grant it the necessary privileges

mysql> CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';

mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'%';

mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';

4. Configure Maxwell

[atguigu@hadoop100 maxwell]$ cd /opt/module/maxwell

[atguigu@hadoop100 maxwell]$ cp config.properties.example config.properties

# tl;dr config

log_level=info

producer=kafka

kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092

kafka_topic=topic_db

# mysql login info

host=hadoop100

user=maxwell

password=maxwell

jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai

5. Start and stop

Note: start Zookeeper and Kafka first!!!!!!!

Start

[atguigu@hadoop100 ~]$ /opt/module/maxwell/bin/maxwell --config /opt/module/maxwell/config.properties --daemon

Stop

[atguigu@hadoop100 ~]$ ps -ef | grep maxwell | grep -v grep | grep maxwell | awk '{print $2}' | xargs kill -9

6. Test

1. Start the data generator

2. Start a consumer on hadoop102 (any machine with Kafka installed works)

[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_db

3. If the consumer receives messages, everything is working
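
If the data generator is not available, a hedged way to trigger output is to touch any table in the monitored atguigu database; Maxwell then publishes the row change to topic_db. For illustration only (maxwell_test is a throwaway table, not part of the project schema):

mysql> USE atguigu;
mysql> CREATE TABLE maxwell_test (id INT PRIMARY KEY, msg VARCHAR(20));
mysql> INSERT INTO maxwell_test VALUES (1, 'hello');
mysql> DROP TABLE maxwell_test;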

7. Start/stop script

#!/bin/bash

MAXWELL_HOME=/opt/module/maxwell

status_maxwell(){

result=`ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | wc -l`

return $result

}

start_maxwell(){

status_maxwell

if [[ $? -lt 1 ]]; then

echo "启动Maxwell"

/opt/module/maxwell/bin/maxwell --config /opt/module/maxwell/config.properties --daemon

else

echo "Maxwell正在运行"

fi

}

stop_maxwell(){

status_maxwell

if [[ $? -gt 0 ]]; then

echo "停止Maxwell"

ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | awk '{print $2}' | xargs kill -9

else

echo "Maxwell未在运行"

fi

}

case $1 in

start )

start_maxwell

;;

stop )

stop_maxwell

;;

restart )

stop_maxwell

start_maxwell

;;

esac

8. Full synchronization of historical data

Note: only used the first time the project is set up (not needed afterwards)

[atguigu@hadoop100 maxwell]$ /opt/module/maxwell/bin/maxwell-bootstrap --database atguigu --table base_province --config /opt/module/maxwell/config.properties

13. Log-consumption Flume configuration (Kafka to HDFS, log data)

This Flume agent can be installed on any machine

1. Configure

[atguigu@hadoop101 flume]$ vim job/kafka_to_hdfs_log.conf

#Define the components

a1.sources=r1

a1.channels=c1

a1.sinks=k1

#Configure source1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource

a1.sources.r1.batchSize = 5000

a1.sources.r1.batchDurationMillis = 2000

a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092

a1.sources.r1.kafka.topics=topic_log

a1.sources.r1.interceptors = i1

#a1.sources.r1.interceptors.i1.type = com.atguigu.gmall.flume.interceptor.TimestampInterceptor$Builder

a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.TimestampInterceptor$Builder

#Configure the channel

a1.channels.c1.type = file

a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1

a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1

a1.channels.c1.maxFileSize = 2146435071

a1.channels.c1.capacity = 1000000

a1.channels.c1.keep-alive = 6

#Configure the sink

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = /origin_data/gmall_remake/log/topic_log/%Y-%m-%d

a1.sinks.k1.hdfs.filePrefix = log

a1.sinks.k1.hdfs.round = false

a1.sinks.k1.hdfs.rollInterval = 10

a1.sinks.k1.hdfs.rollSize = 134217728

a1.sinks.k1.hdfs.rollCount = 0

#Control the output file type

a1.sinks.k1.hdfs.fileType = CompressedStream

a1.sinks.k1.hdfs.codeC = gzip

#Bind the components

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

2. Handle data drift (timestamp interceptor)

This solution is not a standard one; in a company, write it according to your own business logic

1. pom

<dependencies>

<dependency>

<groupId>org.apache.flume</groupId>

<artifactId>flume-ng-core</artifactId>

<version>1.10.1</version>

<scope>provided</scope>

</dependency>

<dependency>

<groupId>com.alibaba</groupId>

<artifactId>fastjson</artifactId>

<version>1.2.62</version>

</dependency>

</dependencies>

<build>

<plugins>

<plugin>

<!-- The packaging plugin may show as unresolved (red) in the IDE, but this does not affect usage; continue with the steps below -->

<artifactId>maven-compiler-plugin</artifactId>

<version>2.3.2</version>

<configuration>

<source>1.8</source>

<target>1.8</target>

</configuration>

</plugin>

<plugin>

<artifactId>maven-assembly-plugin</artifactId>

<configuration>

<descriptorRefs>

<descriptorRef>jar-with-dependencies</descriptorRef>

</descriptorRefs>

</configuration>

<executions>

<execution>

<id>make-assembly</id>

<phase>package</phase>

<goals>

<goal>single</goal>

</goals>

</execution>

</executions>

</plugin>

</plugins>

</build>

2. Implement the interceptor

Note that it also contains a static Builder class

package com.atguigu.flume.interceptor;

import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Copies the ts field of each JSON log line into the event's timestamp header,
 * so the HDFS sink writes the event into the directory of its real event date.
 */
public class TimestampInterceptor implements Interceptor {

@Override

public void initialize() {

}

@Override

public Event intercept(Event event) {

Map<String, String> headers = event.getHeaders();

String s = new String(event.getBody(), StandardCharsets.UTF_8);

try {

JSONObject jsonObject = JSONObject.parseObject(s);

String ts = jsonObject.getString("ts");

headers.put("timestamp",ts);

return event;

} catch (Exception e) {

e.printStackTrace();

return null;

}

}

@Override

public List<Event> intercept(List<Event> events) {

Iterator<Event> iterator = events.iterator();

while (iterator.hasNext()){

Event next = iterator.next();

if(intercept(next) == null){

iterator.remove();

}

}

return events;

}

@Override

public void close() {

}

public static class Builder implements Interceptor.Builder{

@Override

public Interceptor build() {

return new TimestampInterceptor();

}

@Override

public void configure(Context context) {

}

}

}

3. Package and deploy

Create plugins.d

[atguigu@hadoop101 flume]$ mkdir plugins.d

Create myTimestampInterceptor (the name is arbitrary)

[atguigu@hadoop101 plugins.d]$ mkdir myTimestampInterceptor/

Create the three directories lib, libext and native (these names are fixed and must not be changed)

[atguigu@hadoop101 myTimestampInterceptor]$ mkdir lib

[atguigu@hadoop101 myTimestampInterceptor]$ mkdir libext

[atguigu@hadoop101 myTimestampInterceptor]$ mkdir native

Put the packaged jar into lib
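
For example, assuming the artifact built by the assembly plugin is named flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar (the actual name depends on your pom), upload it and copy it into place:

[atguigu@hadoop101 myTimestampInterceptor]$ cp /opt/software/flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar lib/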

4. Start the log-consumption Flume

1. Start the Zookeeper and Kafka clusters

2. Start the log-collection Flume

[atguigu@hadoop100 ~]$ f1.sh start

3. Start the log-consumption Flume

[atguigu@hadoop101 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_log.conf

4. Start the data generator

5. Check HDFS; if the data is there, it worked
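
A quick way to check from the command line (the date directory follows the hdfs.path pattern in the sink configuration above):

hadoop fs -ls /origin_data/gmall_remake/log/topic_log/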

5. Start/stop script

#!/bin/bash

case $1 in

"start")

echo " --------启动 hadoop101 日志数据flume-------"

ssh hadoop101 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"

;;

"stop")

echo " --------停止 hadoop101 日志数据flume-------"

ssh hadoop101 "ps -ef | grep kafka_to_hdfs_log | grep -v grep |awk '{print \$2}' | xargs -n1 kill"

;;

esac

14. Install and use DataX

Install it on hadoop100

1. Install

[atguigu@hadoop100 software]$ tar -zxvf datax.tar.gz -C /opt/module/

2. Self-test

[atguigu@hadoop100 ~]$ python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json

If this job runs successfully, the installation works

3. Generate the DataX configuration files

[atguigu@hadoop100 ~]$ mkdir /opt/module/gen_datax_config

[atguigu@hadoop100 ~]$ cd /opt/module/gen_datax_config

Edit configuration.properties

mysql.username=root

mysql.password=000000

mysql.host=hadoop100

mysql.port=3306

mysql.database=atguigu

mysql.tables=base_province

#mysql.tables=activity_info,activity_rule,base_trademark,cart_info,base_category1,base_category2,base_category3,coupon_info,sku_attr_value,sku_sale_attr_value,base_dic,sku_info,base_province,spu_info, base_region,promotion_pos,promotion_refer

hdfs.uri=hdfs://hadoop100:8020

import_outdir=/opt/module/datax/job/import

#export_outdir=/opt/module/datax/job/export

Run:

[atguigu@hadoop100 ~]$ java -jar datax-config-generator-1.0.1-jar-with-dependencies.jar

Check the result:

[atguigu@hadoop100 ~]$ cd /opt/module/datax/job/import

[atguigu@hadoop100 import]$ ll

4. Run an import

[atguigu@hadoop100 bin]$ python /opt/module/datax/bin/datax.py -p"-Dtargetdir=/origin_data/gmall_remake/db/dataX/2022-06-08" /opt/module/datax/job/import/atguigu.base_province.json

5. Full-load import script

This is an example; remember to adapt it for company use

#!/bin/bash

DATAX_HOME=/opt/module/datax

# If a date is passed in, do_date is that date; otherwise it is yesterday's date

if [ -n "$2" ] ;then

do_date=$2

else

do_date=`date -d "-1 day" +%F`

fi

#Handle the target path: if it does not exist, create it; if it exists, empty it, so that the sync job can be re-run safely

handle_targetdir() {

hadoop fs -test -e $1

if [[ $? -eq 1 ]]; then

echo "路径$1不存在,正在创建......"

hadoop fs -mkdir -p $1

else

echo "路径$1已经存在"

fs_count=$(hadoop fs -count $1)

content_size=$(echo $fs_count | awk '{print $3}')

if [[ $content_size -eq 0 ]]; then

echo "路径$1为空"

else

echo "路径$1不为空,正在清空......"

hadoop fs -rm -r -f $1/*

fi

fi

}

#Data synchronization

import_data() {

datax_config=$1

target_dir=$2

handle_targetdir $target_dir

python $DATAX_HOME/bin/datax.py -p"-Dtargetdir=$target_dir" $datax_config

}

case $1 in

"activity_info")

import_data /opt/module/datax/job/import/gmall_remake.activity_info.json /origin_data/gmall_remake/db/activity_info_full/$do_date

;;

"activity_rule")

import_data /opt/module/datax/job/import/gmall_remake.activity_rule.json /origin_data/gmall_remake/db/activity_rule_full/$do_date

;;

"base_category1")

import_data /opt/module/datax/job/import/gmall_remake.base_category1.json /origin_data/gmall_remake/db/base_category1_full/$do_date

;;

"base_category2")

import_data /opt/module/datax/job/import/gmall_remake.base_category2.json /origin_data/gmall_remake/db/base_category2_full/$do_date

;;

"base_category3")

import_data /opt/module/datax/job/import/gmall_remake.base_category3.json /origin_data/gmall_remake/db/base_category3_full/$do_date

;;

"base_dic")

import_data /opt/module/datax/job/import/gmall_remake.base_dic.json /origin_data/gmall_remake/db/base_dic_full/$do_date

;;

"base_province")

import_data /opt/module/datax/job/import/gmall_remake.base_province.json /origin_data/gmall_remake/db/base_province_full/$do_date

;;

"base_region")

import_data /opt/module/datax/job/import/gmall_remake.base_region.json /origin_data/gmall_remake/db/base_region_full/$do_date

;;

"base_trademark")

import_data /opt/module/datax/job/import/gmall_remake.base_trademark.json /origin_data/gmall_remake/db/base_trademark_full/$do_date

;;

"cart_info")

import_data /opt/module/datax/job/import/gmall_remake.cart_info.json /origin_data/gmall_remake/db/cart_info_full/$do_date

;;

"coupon_info")

import_data /opt/module/datax/job/import/gmall_remake.coupon_info.json /origin_data/gmall_remake/db/coupon_info_full/$do_date

;;

"sku_attr_value")

import_data /opt/module/datax/job/import/gmall_remake.sku_attr_value.json /origin_data/gmall_remake/db/sku_attr_value_full/$do_date

;;

"sku_info")

import_data /opt/module/datax/job/import/gmall_remake.sku_info.json /origin_data/gmall_remake/db/sku_info_full/$do_date

;;

"sku_sale_attr_value")

import_data /opt/module/datax/job/import/gmall_remake.sku_sale_attr_value.json /origin_data/gmall_remake/db/sku_sale_attr_value_full/$do_date

;;

"spu_info")

import_data /opt/module/datax/job/import/gmall_remake.spu_info.json /origin_data/gmall_remake/db/spu_info_full/$do_date

;;

"promotion_pos")

import_data /opt/module/datax/job/import/gmall_remake.promotion_pos.json /origin_data/gmall_remake/db/promotion_pos_full/$do_date

;;

"promotion_refer")

import_data /opt/module/datax/job/import/gmall_remake.promotion_refer.json /origin_data/gmall_remake/db/promotion_refer/$do_date

;;

"all")

import_data /opt/module/datax/job/import/gmall_remake.activity_info.json /origin_data/gmall_remake/db/activity_info_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.activity_rule.json /origin_data/gmall_remake/db/activity_rule_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_category1.json /origin_data/gmall_remake/db/base_category1_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_category2.json /origin_data/gmall_remake/db/base_category2_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_category3.json /origin_data/gmall_remake/db/base_category3_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_dic.json /origin_data/gmall_remake/db/base_dic_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_province.json /origin_data/gmall_remake/db/base_province_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_region.json /origin_data/gmall_remake/db/base_region_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.base_trademark.json /origin_data/gmall_remake/db/base_trademark_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.cart_info.json /origin_data/gmall_remake/db/cart_info_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.coupon_info.json /origin_data/gmall_remake/db/coupon_info_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.sku_attr_value.json /origin_data/gmall_remake/db/sku_attr_value_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.sku_info.json /origin_data/gmall_remake/db/sku_info_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.sku_sale_attr_value.json /origin_data/gmall_remake/db/sku_sale_attr_value_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.spu_info.json /origin_data/gmall_remake/db/spu_info_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.promotion_pos.json /origin_data/gmall_remake/db/promotion_pos_full/$do_date

import_data /opt/module/datax/job/import/gmall_remake.promotion_refer.json /origin_data/gmall_remake/db/promotion_refer/$do_date

;;

esac

15. Incremental table data synchronization

Database data from Kafka to HDFS

1. Configure

[atguigu@hadoop101 flume]$ vim job/kafka_to_hdfs_db.conf

a1.sources = r1

a1.channels = c1

a1.sinks = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource

a1.sources.r1.batchSize = 5000

a1.sources.r1.batchDurationMillis = 2000

a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092

a1.sources.r1.kafka.topics = topic_db

a1.sources.r1.kafka.consumer.group.id = flume

a1.sources.r1.setTopicHeader = true

a1.sources.r1.topicHeader = topic

a1.sources.r1.interceptors = i1

#a1.sources.r1.interceptors.i1.type = com.atguigu.gmall.flume.interceptor.TimestampAndTableNameInterceptor$Builder

a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.TimestampAndTableNameInterceptor$Builder

a1.channels.c1.type = file

a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior2

a1.channels.c1.dataDirs = /opt/module/flume/data/behavior2/

a1.channels.c1.maxFileSize = 2146435071

a1.channels.c1.capacity = 1000000

a1.channels.c1.keep-alive = 6

## sink1

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = /origin_data/gmall_remake/db/%{tableName}_inc/%Y-%m-%d

a1.sinks.k1.hdfs.filePrefix = db

a1.sinks.k1.hdfs.round = false

a1.sinks.k1.hdfs.rollInterval = 10

a1.sinks.k1.hdfs.rollSize = 134217728

a1.sinks.k1.hdfs.rollCount = 0

a1.sinks.k1.hdfs.fileType = CompressedStream

a1.sinks.k1.hdfs.codeC = gzip

## Bind the components

a1.sources.r1.channels = c1

a1.sinks.k1.channel= c1

2. Handle data drift (timestamp and table-name interceptor)

package com.atguigu.flume.interceptor;

import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Copies the source table name and the event time from each Maxwell JSON record
 * into the tableName and timestamp headers used by the HDFS sink path.
 */
public class TimestampAndTableNameInterceptor implements Interceptor {

private SimpleDateFormat dateFormat = null;

@Override

public void initialize() {

dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

}

/*

* Put the table field from the body into the tableName header

* Put the event time from the body into the timestamp header

* */

@Override

public Event intercept(Event event) {

Map<String, String> headers = event.getHeaders();

// 1. Get the headers and body

byte[] body = event.getBody();

String log = new String(body, StandardCharsets.UTF_8);

// 2. Parse the data in the body

try {

JSONObject jsonObject = JSONObject.parseObject(log);

//Parse the table name

String table = jsonObject.getString("table");

// Parse the change type

String type = jsonObject.getString("type");

// Parse the data field

String dataJSONString = jsonObject.getString("data");

String time = null;

String formatTime = null;

if ("insert".equals(type)) {

time = JSONObject.parseObject(dataJSONString).getString("create_time");

formatTime = String.valueOf(dateFormat.parse(time).getTime());

headers.put("timestamp", formatTime);

} else if ("update".equals(type)) {

time = JSONObject.parseObject(dataJSONString).getString("operate_time");

formatTime = String.valueOf(dateFormat.parse(time).getTime());

headers.put("timestamp", formatTime);

} else if ("bootstrap-insert".equals(type)) {

String ts = jsonObject.getString("ts") + "000";

headers.put("timestamp", ts);

}

headers.put("tableName",table);

return event;

} catch (Exception e) {

e.printStackTrace();

return null;

}

}

@Override

public List<Event> intercept(List<Event> list) {

Iterator<Event> iterator = list.iterator();

while (iterator.hasNext()) {

Event event = iterator.next();

if (intercept(event) == null) {

iterator.remove();

}

}

return list;

}

@Override

public void close() {

}

public static class Builder implements Interceptor.Builder {

@Override

public Interceptor build() {

return new TimestampAndTableNameInterceptor();

}

@Override

public void configure(Context context) {

}

}

}

3. Package and deploy

Create plugins.d

[atguigu@hadoop101 flume]$ mkdir plugins.d

Create myTimestampInterceptor (the name is arbitrary)

[atguigu@hadoop101 plugins.d]$ mkdir myTimestampInterceptor/

Create the three directories lib, libext and native (these names are fixed and must not be changed)

[atguigu@hadoop101 myTimestampInterceptor]$ mkdir lib

[atguigu@hadoop101 myTimestampInterceptor]$ mkdir libext

[atguigu@hadoop101 myTimestampInterceptor]$ mkdir native

Put the packaged jar into lib

4. Test

Start Zookeeper and Kafka first

[atguigu@hadoop101 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_db.conf

5. Start/stop script

#!/bin/bash

case $1 in

"start")

echo " --------启动 hadoop101 业务数据flume-------"

ssh hadoop101 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_db.conf >/dev/null 2>&1 &"

;;

"stop")

echo " --------停止 hadoop101 业务数据flume-------"

ssh hadoop101 "ps -ef | grep kafka_to_hdfs_db | grep -v grep |awk '{print \$2}' | xargs -n1 kill"

;;

esac

6. Collection-pipeline start/stop script

Not used in companies; do not use it carelessly in production

#!/bin/bash

case $1 in

"start"){

echo ================== 启动 集群 ==================

#启动 Zookeeper集群

myzk.sh start

#启动 Hadoop集群

myhadoop start

#启动 Kafka采集集群

mykafka.sh start

#启动采集 Flume

f1.sh start

#启动日志消费 Flume

f2.sh start

#启动业务消费 Flume

f3.sh start

#启动 maxwell

mymxw.sh start

"stop"){

echo ================== 停止 集群 ==================

#停止 Maxwell

mymxw.sh stop

#停止 业务消费Flume

f3.sh stop

#停止 日志消费Flume

f2.sh stop

#停止 日志采集Flume

f1.sh stop

#停止 Kafka采集集群

mykafka.sh stop

#停止 Hadoop集群

myhadoop stop

#停止 Zookeeper集群

myzk.sh stop

};;

esac

16. Install Hive

1. Extract

[atguigu@hadoop100 software]$ tar -zxvf /opt/software/apache-hive-3.1.3.tar.gz -C /opt/module/

[atguigu@hadoop100 software]$ mv /opt/module/apache-hive-3.1.3-bin/ /opt/module/hive

2. Environment variables

[atguigu@hadoop100 software]$ sudo vim /etc/profile.d/my_env.sh

#HIVE_HOME

export HIVE_HOME=/opt/module/hive

export PATH=$PATH:$HIVE_HOME/bin

[atguigu@hadoop100 software]$ source /etc/profile.d/my_env.sh

Resolve the logging jar conflict; go into /opt/module/hive/lib

[atguigu@hadoop100 lib]$ mv log4j-slf4j-impl-2.17.1.jar log4j-slf4j-impl-2.17.1.jar.bak

3. Configure the Hive metastore in MySQL

Copy the JDBC driver

[atguigu@hadoop102 lib]$ cp /opt/software/mysql/mysql-connector-j-8.0.31.jar /opt/module/hive/lib/

Configure the Metastore to use MySQL

[atguigu@hadoop102 conf]$ vim hive-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://hadoop100:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>

</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.cj.jdbc.Driver</value>

</property>

<property>

<name>javax.jdo.option.ConnectionUserName</name>

<value>root</value>

</property>

<property>

<name>javax.jdo.option.ConnectionPassword</name>

<value>000000</value>

</property>

<property>

<name>hive.metastore.warehouse.dir</name>

<value>/user/hive/warehouse</value>

</property>

<property>

<name>hive.metastore.schema.verification</name>

<value>false</value>

</property>

<property>

<name>hive.server2.thrift.port</name>

<value>10000</value>

</property>

<property>

<name>hive.server2.thrift.bind.host</name>

<value>hadoop100</value>

</property>

<property>

<name>hive.metastore.event.db.notification.api.auth</name>

<value>false</value>

</property>

<property>

<name>hive.cli.print.header</name>

<value>true</value>

</property>

<property>

<name>hive.cli.print.current.db</name>

<value>true</value>

</property>

</configuration>

4. Start Hive

1. Log in to MySQL

[atguigu@hadoop100 conf]$ mysql -uroot -p000000

2. Create the Hive metastore database

mysql> create database metastore;

3. Initialize the Hive metastore schema

[atguigu@hadoop100 conf]$ schematool -initSchema -dbType mysql -verbose

4. Change the metastore character set

mysql>use metastore;

mysql> alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;

mysql> alter table TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;

mysql> quit;

5. Start the Hive client

[atguigu@hadoop100 hive]$ bin/hive

6. To connect with a client tool, start hiveserver2

[atguigu@hadoop100 bin]$ hiveserver2
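
Once hiveserver2 is up, the connection can be tested with Beeline (a sketch; the host and port come from hive.server2.thrift.bind.host and hive.server2.thrift.port in hive-site.xml above):

[atguigu@hadoop100 hive]$ bin/beeline -u jdbc:hive2://hadoop100:10000 -n atguigu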
