Install CentOS 7.9 on a Mac M1 via a VM, and set up Hadoop/Hive/Kafka/Flink/Iceberg locally for data lake testing.
Notes:
Parallels Desktop: no free version was found, so VM was used instead (VM can also be found online).
The official CentOS 7.9 image does not install successfully in the VM, so a community build was used: 在m1芯片的MacBook上安装centos7 (Installing CentOS 7 on an M1 MacBook).
JDK: the arm64 OpenJDK 1.8.0.332 build, installed via yum (see the sketch below).
MySQL: the arm64 build downloaded from the official site.
Big data components: official binary tarballs.
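For reference, a minimal sketch of the JDK install on CentOS 7.9 aarch64 (package names assume the stock CentOS OpenJDK packages; the resulting JAVA_HOME matches the path used in the environment variables section below):
# Install the aarch64 OpenJDK 1.8 build from the CentOS repositories
sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
# Verify the version and architecture
java -version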
Cluster information
Hostname | Internal IP |
---|---|
datalake-01 | 10.0.0.10 |
datalake-02 | 10.0.0.11 |
datalake-03 | 10.0.0.12 |
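So the nodes can reach each other by hostname, the mapping from the table above can be added to /etc/hosts on every node (and on the Mac, if resolving the hostnames there as well):
# /etc/hosts entries for the cluster (values taken from the table above)
10.0.0.10 datalake-01
10.0.0.11 datalake-02
10.0.0.12 datalake-03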
Node configuration
CPU | Memory | OS |
---|---|---|
4 cores | 8 GB | CentOS 7.9 aarch64 |
Component versions
Component | Version |
---|---|
Java | 1.8.0.332 aarch64 |
Scala | 2.12.15 |
Hadoop | 3.2.3 |
Zookeeper | 3.5.9 |
Hive | 3.1.3 |
Kafka | 3.1.1 |
Flink | 1.14.4 |
Iceberg | 0.13.1 |
MySQL | 8.0.15 aarch64 |
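The binary packages are unpacked under /opt/apps, matching the *_HOME paths in the environment variables section below; for example (Hadoop shown, the tarball name is assumed from the version table):
# Unpack a release tarball into /opt/apps (directory layout assumed from the env vars below)
mkdir -p /opt/apps
tar -zxf hadoop-3.2.3.tar.gz -C /opt/apps/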
Component layout
Component | Service | Nodes |
---|---|---|
Zookeeper | | all 3 nodes |
Hadoop HA | NameNode | nodes 01, 02 |
Hadoop HA | DataNode | all 3 nodes |
YARN | ResourceManager | nodes 01, 02 |
YARN | NodeManager | all 3 nodes |
Hive | Metastore | nodes 01, 02 |
Hive | HiveServer2 | nodes 01, 02 |
Kafka | Broker | all 3 nodes |
Flink | JobManager | nodes 01, 02 |
Flink | TaskManager | all 3 nodes |
MySQL | | node 01 |
Install CentOS 7 in the VM
Connect to the virtual servers remotely with Tabby
Connect to MySQL with Navicat
Environment variables
Both CentOS 7 and the Mac put their configuration here; on the Mac only the JDK and Maven entries are needed.
~/.bash_profile
# java
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.332.b09-1.el7_9.aarch64/jre
# scala
export SCALA_HOME=/opt/apps/scala-2.12.15
# zookeeper
export ZK_HOME=/opt/apps/zookeeper-3.5.9
# hadoop
export HADOOP_HOME=/opt/apps/hadoop-3.2.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CLASSPATH=`hadoop classpath`
# hive
export HIVE_HOME=/opt/apps/hive-3.1.3
export HIVE_CONF_DIR=$HIVE_HOME/conf
# kafka
export KAFKA_HOME=/opt/apps/kafka-3.1.1
# maven
export M2_HOME=/opt/apps/maven-3.6.3
# flink
export FLINK_HOME=/opt/apps/flink-1.14.4
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$ZK_HOME/bin:$HIVE_HOME/bin:$KAFKA_HOME/bin:$M2_HOME/bin:$FLINK_HOME/bin
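After editing, reload the profile so the variables take effect in the current shell:
# Apply the updated environment variables to the current session
source ~/.bash_profile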
sql-client testing
Flink configuration
For the test environment only; these checkpointing defaults typically go in flink-conf.yaml:
execution.checkpointing.interval: 10s
execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION #[DELETE_ON_CANCELLATION, RETAIN_ON_CANCELLATION]
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.min-pause: 0
state.checkpoints.num-retained: 20
execution.checkpointing.mode: EXACTLY_ONCE #[EXACTLY_ONCE, AT_LEAST_ONCE]
execution.checkpointing.timeout: 10min
execution.checkpointing.tolerable-failed-checkpoints: 3
execution.checkpointing.unaligned: false
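Note: retaining externalized checkpoints also needs a checkpoint storage location; a minimal sketch, assuming checkpoints are kept on the HDFS cluster above (the backend and path are example values, not from the original setup):
# Example checkpoint storage for the test cluster (path is an assumption)
state.backend: hashmap
state.checkpoints.dir: hdfs:///flink/checkpoints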
Start the Flink standalone cluster
start-cluster.sh
# stop
stop-cluster.sh
Web
http://datalake-01:8081/
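Registering the Iceberg catalogs below assumes the Iceberg Flink runtime jar is already on the SQL client's classpath; a sketch of how that might be done (jar name derived from the Flink/Iceberg versions above):
# Put the Iceberg connector on Flink's classpath before starting the SQL client
cp iceberg-flink-runtime-1.14-0.13.1.jar $FLINK_HOME/lib/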
SQL file
sql-client-conf.sql
create catalog hive_catalog with (
'type'='iceberg',
'catalog-type'='hive',
'uri'='thrift://datalake-01:9083',
'clients'='5',
'property-version'='2',
'warehouse'='/user/hive/warehouse/'
);
create catalog hadoop_catalog with (
'type' = 'iceberg',
'catalog-type' = 'hadoop',
'property-version' = '2',
'warehouse' = '/user/hive/warehouse/'
);
Start the sql-client
sql-client.sh -i ../sql-client-conf.sql
show catalogs;
+-----------------+
| catalog name |
+-----------------+
| default_catalog |
| hadoop_catalog |
| hive_catalog |
+-----------------+
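With the catalogs registered, a quick smoke test can be run in the SQL client; the database and table names below are made up for illustration:
-- Create, write, and read a small Iceberg table through the Hive catalog (hypothetical names)
use catalog hive_catalog;
create database if not exists iceberg_db;
create table if not exists iceberg_db.sample (id bigint, data string);
insert into iceberg_db.sample values (1, 'a'), (2, 'b');
select * from iceberg_db.sample;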
Test
# CDC: Iceberg 0.13 supports CDC writes but not CDC streaming reads; that is, streaming reads only pick up appended data, not updates.
drop table if exists default_catalog.default_database.cdc_source_table;
create table if not exists default_catalog.default_database.cdc_source_table (
id int,
data string,
dt string,
primary key (id) not enforced
) with (
'connector' = 'mysql-cdc',
'hostname'