Big Data Cluster Deployment: ZooKeeper + Hadoop + Flink + Kafka + Canal, with a Flink Application Demo
Preparation
Server Preparation
IP | Hostname | OS | Components |
---|---|---|---|
192.168.213.131 | node1 | CentOS Linux release 7.5 | Zookeeper+hadoop (master)+flink+kafka (master)+canal |
192.168.213.132 | node2 | CentOS Linux release 7.5 | Zookeeper+hadoop (slave)+flink+kafka (slave) |
192.168.213.133 | node3 | CentOS Linux release 7.5 | Zookeeper+hadoop (slave)+flink+kafka (slave) |
Set Hostnames
Change the hostname of each host to node1, node2, and node3 respectively, then configure /etc/hosts as follows:
[root@node1 ~]# echo 'node1' > /etc/hostname
[root@node1 ~]# vim /etc/hosts
192.168.213.131 node1
192.168.213.132 node2
192.168.213.133 node3
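The same hosts entries are needed on every node; they can be pushed from node1, for example:
[root@node1 ~]# scp /etc/hosts node2:/etc/hosts
[root@node1 ~]# scp /etc/hosts node3:/etc/hosts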
Passwordless SSH Setup
Generate an SSH key pair on each host:
[root@node1 ~]# ssh-keygen -t rsa
Copy the public key to every host:
[root@node1 ~]# ssh-copy-id node1
[root@node1 ~]# ssh-copy-id node2
[root@node1 ~]# ssh-copy-id node3
[root@node1 ~]# scp ~/.ssh/authorized_keys node2:~/.ssh/
[root@node1 ~]# scp ~/.ssh/authorized_keys node3:~/.ssh/
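A quick check that passwordless login now works from node1:
[root@node1 ~]# ssh node2 hostname
[root@node1 ~]# ssh node3 hostname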
Disable the Firewall and SELinux
On each host, disable the firewall, swap, and SELinux:
[root@node1 ~]# systemctl stop firewalld && systemctl disable firewalld
[root@node1 ~]# swapoff -a && sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
[root@node1 ~]# setenforce 0 && sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
Create a User
Create a hadoop user on each host:
[root@node1 ~]# adduser hadoop
[root@node1 ~]# usermod -g root hadoop
[root@node1 ~]# passwd hadoop
Passwordless SSH for the hadoop User
Generate an SSH key pair on each host:
[root@node1 ~]# su - hadoop
[hadoop@node1 ~]# ssh-keygen -t rsa
Copy the public key to every host:
[hadoop@node1 ~]# ssh-copy-id node1
[hadoop@node1 ~]# ssh-copy-id node2
[hadoop@node1 ~]# ssh-copy-id node3
[hadoop@node1 ~]# scp ~/.ssh/authorized_keys node2:~/.ssh/
[hadoop@node1 ~]# scp ~/.ssh/authorized_keys node3:~/.ssh/
Environment Setup
Install Java (as root)
Switch to the root user and install Java on each host with yum, choosing the devel package:
[hadoop@node1 ~]# su root
[root@node1 ~]# yum search java-1.8.0
[root@node1 ~]# yum install -y java-1.8.0-openjdk-devel.x86_64
[root@node1 ~]# java -version
Configure the environment variables, then sync the file to the other hosts with scp:
[root@node1 ~]# vim /etc/profile
# JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin
[root@node1 ~]# source /etc/profile
[root@node1 ~]# scp /etc/profile node2:/etc/profile
[root@node1 ~]# ssh node2 source /etc/profile
[root@node1 ~]# scp /etc/profile node3:/etc/profile
[root@node1 ~]# ssh node3 source /etc/profile
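Verify that Java is available on every node, for example:
[root@node1 ~]# ssh node2 java -version
[root@node1 ~]# ssh node3 java -version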
Install ZooKeeper 3.6.4 (as the hadoop user)
Download the package from the Aliyun mirror site: https://developer.aliyun.com/mirror/
- Download, extract, and rename the directory to zookeeper
[hadoop@node1 ~]# wget https://mirrors.aliyun.com/apache/zookeeper/zookeeper-3.6.4/apache-zookeeper-3.6.4-bin.tar.gz
[hadoop@node1 ~]# tar -zxvf apache-zookeeper-3.6.4-bin.tar.gz
[hadoop@node1 ~]# mv apache-zookeeper-3.6.4-bin zookeeper
- Copy zoo_sample.cfg under /home/hadoop/zookeeper/conf/ to zoo.cfg
[hadoop@node1 ~]# cp zookeeper/conf/zoo_sample.cfg zookeeper/conf/zoo.cfg
- Edit zoo.cfg
[hadoop@node1 ~]# vim zookeeper/conf/zoo.cfg
# Change the data directory
dataDir=/home/hadoop/zookeeper/tmp
# Add the cluster nodes
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
- Create the data directory and the myid file (the myid value must be different on each host; adjust accordingly)
[hadoop@node1 ~]# mkdir zookeeper/tmp
[hadoop@node1 ~]# vim zookeeper/tmp/myid
# The id value must be unique for each host in the cluster
1
- Sync the installation to the other hosts
[hadoop@node1 ~]# scp -r zookeeper node2:~/
[hadoop@node1 ~]# scp -r zookeeper node3:~/
- Start the service (run on every host)
[hadoop@node1 ~]# ./zookeeper/bin/zkServer.sh start
[hadoop@node1 ~]# ssh node2
[hadoop@node2 ~]# echo '2' > zookeeper/tmp/myid
[hadoop@node2 ~]# ./zookeeper/bin/zkServer.sh start
[hadoop@node2 ~]# exit
[hadoop@node1 ~]# ssh node3
[hadoop@node3 ~]# echo '3' > zookeeper/tmp/myid
[hadoop@node3 ~]# ./zookeeper/bin/zkServer.sh start
- Check that the process is running
[hadoop@node1 ~]# jps
4020 Jps
4001 QuorumPeerMain
- Check the status
[hadoop@node1 ~]# ./zookeeper/bin/zkServer.sh status
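- A quick connectivity check with the ZooKeeper CLI (any of the three nodes can be used as the server address):
[hadoop@node1 ~]# ./zookeeper/bin/zkCli.sh -server node1:2181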
Install Hadoop 2.10.1 (as the hadoop user)
Download the package from the Aliyun mirror site: https://developer.aliyun.com/mirror/
- Download, extract, and rename the directory to hadoop
[hadoop@node1 ~]# wget https://mirrors.aliyun.com/apache/hadoop/core/hadoop-2.10.1/hadoop-2.10.1.tar.gz
[hadoop@node1 ~]# tar -zxvf hadoop-2.10.1.tar.gz
[hadoop@node1 ~]# mv hadoop-2.10.1 hadoop
- Create the data storage directories
[hadoop@node1 ~]# mkdir -p hadoop/hdfs/tmp
[hadoop@node1 ~]# mkdir -p hadoop/hdfs/name
[hadoop@node1 ~]# mkdir -p hadoop/hdfs/data
- Edit the configuration files
[hadoop@node1 ~]# vim hadoop/etc/hadoop/core-site.xml
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/hadoop/hdfs/tmp</value>
</property>
<property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9000</value>
</property>
[hadoop@node1 ~]# vim hadoop/etc/hadoop/hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hadoop/hdfs/name</value>
    <final>true</final>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/hadoop/hdfs/data</value>
    <final>true</final>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node1:9001</value>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
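start-all.sh below also starts YARN; this guide does not show a yarn-site.xml, so the following is only a minimal sketch assuming the ResourceManager should run on node1:
[hadoop@node1 ~]# vim hadoop/etc/hadoop/yarn-site.xml
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>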
- Add the Java environment variable
[hadoop@node1 ~]# vim hadoop/etc/hadoop/hadoop-env.sh
# Add the Java environment variable
export JAVA_HOME=/usr/lib/jvm/java
[hadoop@node1 ~]# vim hadoop/etc/hadoop/yarn-env.sh
# Add the Java environment variable
export JAVA_HOME=/usr/lib/jvm/java
[hadoop@node1 ~]# vim hadoop/etc/hadoop/slaves
# Worker (slave) host list; in newer Hadoop versions this file is named workers
node2
node3
- Sync the installation to the other hosts
[hadoop@node1 ~]# scp -r hadoop node2:~/
[hadoop@node1 ~]# scp -r hadoop node3:~/
- Add environment variables (switch to the root user)
[hadoop@node1 ~]# su root
[root@node1 ~]# vim /etc/profile
# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
[root@node1 ~]# source /etc/profile
[root@node1 ~]# scp /etc/profile node2:/etc/profile
[root@node1 ~]# ssh node2 source /etc/profile
[root@node1 ~]# scp /etc/profile node3:/etc/profile
[root@node1 ~]# ssh node3 source /etc/profile
- Start the services
[hadoop@node1 ~]# hadoop namenode -format
[hadoop@node1 ~]# mr-jobhistory-daemon.sh start historyserver
[hadoop@node1 ~]# start-all.sh
# Check the DataNode status
[hadoop@node1 ~]# hadoop dfsadmin -report
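- Optional smoke test: write a directory to HDFS and list it
[hadoop@node1 ~]# hadoop fs -mkdir /test
[hadoop@node1 ~]# hadoop fs -ls /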
- Stop the services (optional)
[hadoop@node1 ~]# stop-all.sh
Install Flink (as the hadoop user)
- Download, extract, and rename the directory to flink
[hadoop@node1 ~]# wget https://mirrors.aliyun.com/apache/flink/flink-1.12.5/flink-1.12.5-bin-scala_2.12.tgz
[hadoop@node1 ~]# tar -xvf flink-1.12.5-bin-scala_2.12.tgz
[hadoop@node1 ~]# mv flink-1.12.5-bin-scala_2.12 flink
- Configure flink-conf.yaml
[hadoop@node1 ~]# vim flink/conf/flink-conf.yaml
jobmanager.rpc.address: node1
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: node1:2181,node2:2181,node3:2181
taskmanager.numberOfTaskSlots: 2 # set to the number of CPU cores
- Configure masters and slaves (in recent Flink versions the slaves file is named workers)
[hadoop@node1 ~]# vim flink/conf/masters
node1:8081
[hadoop@node1 ~]# vim flink/conf/slaves
node2
node3
- Copy to the other machines
[hadoop@node1 ~]# scp -r flink node2:~/
[hadoop@node1 ~]# scp -r flink node3:~/
- Start the service on each node
[hadoop@node1 ~]# ./flink/bin/start-cluster.sh
[hadoop@node2 ~]# ./flink/bin/start-cluster.sh
[hadoop@node3 ~]# ./flink/bin/start-cluster.sh
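- Check the processes with jps (for a standalone cluster the JobManager typically appears as StandaloneSessionClusterEntrypoint and the TaskManagers as TaskManagerRunner)
[hadoop@node1 ~]# jps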
- Access the web UI and run a test job
Open http://192.168.213.131:8081/ in a browser.
Use the Submit New Job page to upload the example: flink/examples/streaming/WordCount.jar
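The same example can also be submitted from the command line, for instance:
[hadoop@node1 ~]# ./flink/bin/flink run flink/examples/streaming/WordCount.jar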
Install Kafka (as the hadoop user)
- Download, extract, and rename the directory to kafka
[hadoop@node1 ~]# wget https://mirrors.aliyun.com/apache/kafka/3.0.0/kafka_2.12-3.0.0.tgz
[hadoop@node1 ~]# tar -xvf kafka_2.12-3.0.0.tgz
[hadoop@node1 ~]# mv kafka_2.12-3.0.0 kafka
- Configure server.properties
[hadoop@node1 ~]# vim kafka/config/server.properties
# Starts at 1; each node must use a different value
broker.id=1
log.dirs=/home/hadoop/kafka/logs
zookeeper.connect=node1:2181,node2:2181,node3:2181
# Number of partitions (one per broker here)
num.partitions=3
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://node1:9092
- Copy to the other machines
[hadoop@node1 ~]# scp -r kafka node2:~/
[hadoop@node1 ~]# scp -r kafka node3:~/
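- Adjust the per-node settings: broker.id and advertised.listeners must differ on each broker, for example:
[hadoop@node1 ~]# ssh node2 "sed -i 's/^broker.id=1/broker.id=2/; s/node1:9092/node2:9092/' kafka/config/server.properties"
[hadoop@node1 ~]# ssh node3 "sed -i 's/^broker.id=1/broker.id=3/; s/node1:9092/node3:9092/' kafka/config/server.properties"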
- Start the service on each node
[hadoop@node1 ~]# ./kafka/bin/kafka-server-start.sh -daemon ./kafka/config/server.properties
[hadoop@node2 ~]# ./kafka/bin/kafka-server-start.sh -daemon ./kafka/config/server.properties
[hadoop@node3 ~]# ./kafka/bin/kafka-server-start.sh -daemon ./kafka/config/server.properties
- Stop the service (optional)
[hadoop@node1 ~]# ./kafka/bin/kafka-server-stop.sh
- Create and list topics
[hadoop@node1 ~]# ./kafka/bin/kafka-topics.sh --bootstrap-server node1:9092 --create --topic zljd
[hadoop@node1 ~]# ./kafka/bin/kafka-topics.sh --bootstrap-server node1:9092 --list
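- Optional end-to-end check with the console producer and consumer (run in two terminals):
[hadoop@node1 ~]# ./kafka/bin/kafka-console-producer.sh --bootstrap-server node1:9092 --topic zljd
[hadoop@node1 ~]# ./kafka/bin/kafka-console-consumer.sh --bootstrap-server node1:9092 --topic zljd --from-beginning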
Install Canal (as the hadoop user)
- Enable binlog in the MySQL configuration and restart the service
[hadoop@node1 ~]# vim /etc/my.cnf
[mysqld]
log-bin=mysql-bin # enable binlog
binlog-format=ROW # use ROW format
server_id=1 # required for MySQL replication; must not clash with canal's slaveId
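- Restart MySQL and confirm that binlog is enabled (the service name is assumed to be mysqld; adjust to your installation):
[root@node1 ~]# systemctl restart mysqld
[root@node1 ~]# mysql -uroot -p -e "SHOW VARIABLES LIKE 'log_bin';"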
- Download and extract into canal
[hadoop@node1 ~]# wget https://github.com/alibaba/canal/releases/download/canal-1.1.5/canal.deployer-1.1.5.tar.gz
[hadoop@node1 ~]# mkdir canal
[hadoop@node1 ~]# tar -zxvf canal.deployer-1.1.5.tar.gz -C canal
- Edit the global configuration canal.properties
[hadoop@node1 ~]# vim canal/conf/canal.properties
# Server mode: tcp by default; kafka, rocketmq, etc. are also supported
canal.serverMode = kafka
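# Canal also needs the Kafka broker list; in canal 1.1.5 it is configured in this same file (the property name may differ in other versions)
kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092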
- Edit the instance configuration instance.properties
[hadoop@node1 ~]# vim canal/conf/example/instance.properties
# Database address and credentials
canal.instance.dbUsername=root
canal.instance.dbPassword=zwp123456+
canal.instance.master.address=node1:3306
# Kafka topic to write to
canal.mq.topic=zljd
- Start the service
[hadoop@node1 ~]# ./canal/bin/startup.sh
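- Check the instance log to confirm it is running (path assumes the default example instance):
[hadoop@node1 ~]# tail -n 50 canal/logs/example/example.log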
- Stop the service
[hadoop@node1 ~]# ./canal/bin/stop.sh
Canal + Flink Application Demo
Official documentation: https://nightlies.apache.org/flink/flink-docs-release-1.12
- Bootstrap the application with the Maven archetype, setting -DarchetypeVersion to 1.12.5
mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-java -DarchetypeVersion=1.12.5 -DgroupId=com.zwp -DartifactId=flink-java -Dversion=1.0 -Dpackage=com.zwp
- Add Kafka and MySQL support to pom.xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.31</version>
</dependency>
<dependency>
    <groupId>com.alibaba.otter</groupId>
    <artifactId>canal.protocol</artifactId>
    <version>1.1.6</version>
</dependency>
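The MysqlSink class below also parses the Canal messages with fastjson2, which is not pulled in by the dependencies above, so a dependency along these lines is needed as well (the version shown is only an example):
<dependency>
    <groupId>com.alibaba.fastjson2</groupId>
    <artifactId>fastjson2</artifactId>
    <version>2.0.32</version>
</dependency>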
- Modify the main method of the StreamingJob.java entry point
public static void main(String[] args) throws Exception {
    // 1. Get the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // 2. Configure the Kafka source
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "node1:9092");
    properties.setProperty("group.id", "flink");
    FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("zljd", new SimpleStringSchema(), properties);
    DataStream<String> stream = env.addSource(kafkaConsumer);
    // 3. Add the MySQL sink that processes the Kafka messages
    stream.addSink(new MysqlSink());
    // execute program
    env.execute("Flink Streaming Java API Skeleton");
}
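For reference, a sketch of the imports this main method relies on (the MysqlSink import matches the package used by the class shown below):
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import com.lzm.sink.MysqlSink;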
- Add the MySQL sink MysqlSink.java
package com.lzm.sink;

import com.alibaba.fastjson2.JSON;
import com.alibaba.otter.canal.protocol.FlatMessage;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.Map;

public class MysqlSink extends RichSinkFunction<String> {

    private Connection connection;
    private PreparedStatement preparedStatement;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // Load the JDBC driver
        Class.forName("com.mysql.cj.jdbc.Driver");
        // Open the connection
        connection = DriverManager.getConnection("jdbc:mysql://node1:3306/zljd", "root", "zwp2018");
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (connection != null) {
            connection.close();
        }
    }

    @Override
    public void invoke(String message, Context context) throws Exception {
        try {
            System.out.println("FlatMessage:" + message);
            FlatMessage value = null;
            if (JSON.isValid(message)) {
                value = JSON.parseObject(message, FlatMessage.class);
            }
            if (value == null || !"zljd".equals(value.getDatabase()) || !"zljk_bd_check_unit_ability".equals(value.getTable())) {
                return;
            }
            List<Map<String, String>> data = value.getData();
            List<Map<String, String>> old = value.getOld();
            // Build and execute the SQL
            switch (value.getType()) {
                case "INSERT":
                    preparedStatement = connection.prepareStatement("INSERT INTO zljk_bd_check_unit (id, ability) VALUES (?, ?)");
                    preparedStatement.setString(1, data.get(0).get("id"));
                    preparedStatement.setString(2, data.get(0).get("ability"));
                    preparedStatement.executeUpdate();
                    break;
                case "UPDATE":
                    preparedStatement = connection.prepareStatement("UPDATE zljk_bd_check_unit SET id=?,ability=? WHERE id=?");
                    String id = data.get(0).get("id");
                    if (old != null && old.size() > 0 && old.get(0).get("id") != null) {
                        id = old.get(0).get("id");
                    }
                    preparedStatement.setString(1, data.get(0).get("id"));
                    preparedStatement.setString(2, data.get(0).get("ability"));
                    preparedStatement.setString(3, id);
                    preparedStatement.executeUpdate();
                    break;
                case "DELETE":
                    preparedStatement = connection.prepareStatement("DELETE FROM zljk_bd_check_unit WHERE id=?");
                    preparedStatement.setString(1, data.get(0).get("id"));
                    preparedStatement.executeUpdate();
                    break;
                default:
            }
            if (preparedStatement != null) {
                preparedStatement.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
- Package the Flink application
Edit pom.xml to comment out the provided scope on the following dependencies so they are bundled into the jar, then build with mvn:
mvn clean package
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
- Upload the jar to Flink and run it
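This can be done on the web UI's Submit New Job page or from the CLI, for example (the jar path and main class are assumptions based on the artifactId, version, and package used above):
[hadoop@node1 ~]# ./flink/bin/flink run -c com.zwp.StreamingJob flink-java/target/flink-java-1.0.jar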