Overview
As a rising star of real-time computing, Flink is being adopted by more and more companies. Its unified stream-batch processing model solves a long-standing problem: instead of building and maintaining separate real-time and offline platforms, both workloads can run on a single Flink stack, which lowers cost. Combined with a data lake, Flink can serve both the real-time and the offline data warehouse: a single data lake stores diverse data and supports both offline and real-time queries. The two main data lake options today are Hudi and Iceberg. Hudi is the relatively mature solution, focused on incremental data processing, and is tightly coupled with Spark; the Flink-plus-Hudi combination is not widely used yet. Iceberg is the newer entrant; it offers high-performance analytics and reliable data management, and its Flink integration is currently the stronger of the two.
Environment Setup
Environment:
hadoop 2.7.7
hive 2.3.6
Flink 1.11.3
iceberg 0.11.1
jdk 1.8
macOS
Download the Software
Hadoop: https://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/
Hive: https://archive.apache.org/dist/hive/hive-2.3.6/
Flink: https://archive.apache.org/dist/flink/flink-1.11.3/flink-1.11.3-bin-scala_2.11.tgz
Iceberg: https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime/0.11.1/
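The four downloads can also be scripted. A minimal sketch, assuming the standard archive.apache.org / Maven Central URL layout for these versions (swap the echo for `curl -LO "$url"` to actually download):

```shell
#!/bin/sh
# Sketch: compose download URLs from the component versions used in this
# guide. The mirror layout is an assumption; verify before fetching.
HADOOP_VERSION=2.7.7
HIVE_VERSION=2.3.6
FLINK_VERSION=1.11.3
ICEBERG_VERSION=0.11.1

HADOOP_URL="https://archive.apache.org/dist/hadoop/core/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
HIVE_URL="https://archive.apache.org/dist/hive/hive-${HIVE_VERSION}/apache-hive-${HIVE_VERSION}-bin.tar.gz"
FLINK_URL="https://archive.apache.org/dist/flink/flink-${FLINK_VERSION}/flink-${FLINK_VERSION}-bin-scala_2.11.tgz"
ICEBERG_URL="https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime/${ICEBERG_VERSION}/iceberg-flink-runtime-${ICEBERG_VERSION}.jar"

for url in "$HADOOP_URL" "$HIVE_URL" "$FLINK_URL" "$ICEBERG_URL"; do
  echo "$url"   # replace with: curl -LO "$url"
done
```

Keeping the versions in variables makes it easy to bump a single component later without hunting through URLs.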
Check the Environment
Installation and Configuration
Install the Software
Extract the Hadoop archive:
tar -xvf hadoop-2.7.7.tar.gz -C /Users/xxx/work
Extract the Hive archive:
tar -xvf apache-hive-2.3.6-bin.tar.gz -C /Users/xxx/work/hadoop-2.7.7
Rename the directory:
cd /Users/xxx/work/hadoop-2.7.7/
mv apache-hive-2.3.6-bin hive
Extract the Flink archive:
tar -xvf flink-1.11.3-bin-scala_2.11.tgz -C /Users/xxx/work
Configure Environment Variables
Open the profile file (on macOS):
cd ~
vim .bash_profile
Add the environment variables:
export HADOOP_HOME=/Users/xxx/work/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HIVE_HOME=/Users/xxx/work/hadoop-2.7.7/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin:$HIVE_HOME/conf
Run source:
source .bash_profile
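A generic way to sanity-check the exports before moving on is sketched below; the variable names come from the exports above, and the helper simply checks that each one is set and points at an existing directory.

```shell
#!/bin/sh
# Sketch: print OK or MISSING for each required environment variable.
check_var() {
  name="$1"
  eval "value=\${$name}"
  if [ -n "$value" ] && [ -d "$value" ]; then
    echo "$name OK: $value"
  else
    echo "$name MISSING (unset or not a directory)"
  fi
}

check_var HADOOP_HOME
check_var HIVE_HOME
```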
Verify that the configuration took effect:
xxx@jiacunxu ~ % hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
xxx@jiacunxu ~ % hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/xxx/work/hadoop-2.7.7/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/xxx/work/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/Users/xxx/work/hadoop-2.7.7/hive/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
If you see output like the above, the hadoop and hive environment variables are configured correctly and have taken effect.
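The SLF4J "multiple bindings" warning in the hive output is harmless, but it can be silenced by leaving only one binding jar on the classpath. A sketch, using the jar name reported in the warning above; it moves the jar aside rather than deleting it so the change is reversible:

```shell
#!/bin/sh
# Sketch: move Hive's duplicate SLF4J binding out of its lib directory so
# only Hadoop's binding remains. Defaults assume the layout in this guide.
dedup_slf4j() {
  hive_lib="${1:-$HIVE_HOME/lib}"
  binding="$hive_lib/log4j-slf4j-impl-2.6.2.jar"
  if [ -f "$binding" ]; then
    mv "$binding" "$binding.bak"   # keep a backup instead of deleting
    echo "moved: $binding"
  else
    echo "not found: $binding"
  fi
}

dedup_slf4j
```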
Configure Hadoop
Enter the Hadoop configuration directory:
cd /Users/xxx/work/hadoop-2.7.7/etc/hadoop
In hadoop-env.sh, set the following line:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home
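On macOS the JDK path does not have to be typed from memory: the system helper /usr/libexec/java_home prints the home of the selected JDK. A sketch, with the path from this guide used as an assumed fallback for machines where the helper is unavailable:

```shell
#!/bin/sh
# Sketch: ask macOS for the JDK 1.8 home; fall back to the install path
# used in this guide (an assumption about where the JDK was installed).
JAVA_HOME_CANDIDATE=$(/usr/libexec/java_home -v 1.8 2>/dev/null \
  || echo "/Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home")
echo "export JAVA_HOME=$JAVA_HOME_CANDIDATE"
```

The printed line can be pasted directly into hadoop-env.sh.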
Configure core-site.xml:
<configuration>
<property>