Guide to Setting Up a HiBench Environment for Testing Flink

Latest recommended article published on 2024-08-14 19:29:22

BoRoBoRoMe

Category: 手把手 (Step by Step). Tags: flink, big data, kafka, zookeeper

Article link: https://blog.csdn.net/boroborome/article/details/104643178

HiBench official site

This guide does not set up the full HiBench suite, only the components related to Flink:

ZooKeeper
Kafka
Flink

The following must be available locally beforehand:

maven
git
Python 2.x (>=2.6)
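
Before going further, it may help to confirm that the prerequisites are on the PATH. The sketch below is a hypothetical helper (check_prereqs is not part of HiBench) that reports any command it cannot find. It does not check versions, so the Python 2.x requirement still has to be verified by hand.

```shell
# Hypothetical helper, not part of HiBench: report which of the given
# commands cannot be found on the PATH. Returns 1 if any is missing.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this guide it could be called as: check_prereqs mvn git python2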

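Install ZooKeeper

Download the ZooKeeper archive from the official site

Official site; version 3.5.7 is recommended.
Extract it to a local directory.

Configure and start

Run the following commands in the ZooKeeper directory

# Copy the sample configuration file to the file ZooKeeper actually reads
cp conf/zoo_sample.cfg conf/zoo.cfg

# Start ZooKeeper
bin/zkServer.sh start

# Stop ZooKeeper with the command below when it is no longer needed
bin/zkServer.sh stop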
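
Since Kafka depends on ZooKeeper, it can be useful to wait until ZooKeeper is actually up before continuing. The following is a minimal sketch of a hypothetical helper (wait_for_zookeeper is my own name). It assumes that bin/zkServer.sh status exits with a non-zero code while the server is not reachable.

```shell
# Hypothetical helper: poll "zkServer.sh status" until it succeeds.
# $1 = ZooKeeper directory, $2 = maximum number of attempts (default 10).
# Assumption: "status" exits non-zero while the server is not reachable.
wait_for_zookeeper() {
    zk_home=$1
    attempts=${2:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$zk_home/bin/zkServer.sh" status >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

From inside the ZooKeeper directory this could be used as: bin/zkServer.sh start && wait_for_zookeeper . 10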

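Install Kafka

Download the Kafka archive from the official site

Official site; kafka_2.11-2.4.0.tgz is recommended.
Extract it to a local directory.

Start

The ZooKeeper address that Kafka needs is already set by default in config/server.properties, so Kafka can be started directly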
```
bin/kafka-server-start.sh config/server.properties
```

Other useful commands for reference

# List all current topics
bin/kafka-topics.sh --list --bootstrap-server localhost:9092

# View the data in the topic identity
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic identity --from-beginning
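
The list command above can also be wrapped in a small check, for example to confirm that the identity topic exists before consuming from it. This is a sketch only; kafka_topic_exists is a hypothetical helper name, and the default broker address simply repeats the localhost:9092 used in this guide.

```shell
# Hypothetical helper: succeed if the topic appears in the broker's topic list.
# $1 = Kafka directory, $2 = topic name,
# $3 = bootstrap server (default localhost:9092, as used in this guide).
kafka_topic_exists() {
    kafka_home=$1
    topic=$2
    server=${3:-localhost:9092}
    "$kafka_home/bin/kafka-topics.sh" --list --bootstrap-server "$server" |
        grep -Fxq -- "$topic"
}
```

From the Kafka directory it could be called as: kafka_topic_exists . identity && echo "topic found"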

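Install Hadoop

See this document:

Installing Hadoop on Mac

Install Flink

Download the archive from the official site

Official site; flink-1.10.0 for scala 2.11 is recommended.
We deploy it as a Standalone Cluster.

Official guide

Configure ssh

Passwordless ssh localhost is required. I have not verified whether the cluster works without it.

See the Hadoop setup for the configuration steps.

Configure Flink

conf/flink-conf.yaml

# The default is 1, which is not enough in practice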
taskmanager.numberOfTaskSlots: 30
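
If the setting has to be applied repeatedly (for example after re-extracting Flink), the edit can be scripted. The sketch below is a hypothetical helper (set_task_slots is not a Flink tool) that removes any existing taskmanager.numberOfTaskSlots line and appends the new value. It assumes the key starts at the beginning of a line, as it does in the flat flink-conf.yaml format.

```shell
# Hypothetical helper: set taskmanager.numberOfTaskSlots in a flink-conf.yaml.
# $1 = path to flink-conf.yaml, $2 = number of slots.
set_task_slots() {
    conf=$1
    slots=$2
    tmp=$(mktemp) || return 1
    # Drop any existing (uncommented) setting, then append the new value.
    grep -v '^taskmanager\.numberOfTaskSlots:' "$conf" > "$tmp"
    echo "taskmanager.numberOfTaskSlots: $slots" >> "$tmp"
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

From the Flink directory: set_task_slots conf/flink-conf.yaml 30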

Start the cluster

# Start the cluster
bin/start-cluster.sh

# Stop the cluster
bin/stop-cluster.sh
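
After starting the cluster, one way to confirm that the slot setting took effect is to query Flink's REST API. The sketch below only covers the parsing step: flink_slots_total is a hypothetical helper that extracts the "slots-total" field from the JSON returned by the /overview endpoint. It assumes the REST endpoint listens on localhost:8081, the same address this guide later uses for hibench.flink.master.

```shell
# Hypothetical helper: read the JSON returned by Flink's /overview REST
# endpoint on stdin and print the value of "slots-total".
flink_slots_total() {
    sed -n 's/.*"slots-total"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}
```

With curl installed, it could be used as: curl -s http://localhost:8081/overview | flink_slots_total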

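See the Flink configuration section of the HiBench official site

Install Flink from the download page; the newest version at the time of writing, flink-1.10.0, can be chosen.

Use the local cluster deployment; see the official instructions.

In addition to the official setup, add the following configuration

conf/flink-conf.yaml

# The default is 1, which is not enough in practice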
taskmanager.numberOfTaskSlots: 30

Build the HiBench project

Clone the project locally with the command below

git clone git@github.com:Intel-bigdata/HiBench.git

Build the project

Official instructions

Only the Flink module is built here

mvn -Pflinkbench -Dspark=2.1 -Dscala=2.11 clean package

Configure Hadoop

Run the following command

cp conf/hadoop.conf.template conf/hadoop.conf

Edit the configuration file as follows

# Hadoop home directory; set this according to your own installation
hibench.hadoop.home     /Users/<username>/Downloads/2020-03/hadoop-2.10.0

# The path of hadoop executable
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop

# Hadoop configuration directory
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop

# The root HDFS path to store HiBench data. Specify a directory that already exists; if it does not, create it with the hadoop command
hibench.hdfs.master       hdfs://localhost:9000/user/<username>


# Hadoop release provider. Supported value: apache, cdh5, hdp
hibench.hadoop.release    apache
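
The HDFS root directory mentioned in the configuration comments has to exist before HiBench runs. A minimal sketch of creating it only when needed is shown below; ensure_hdfs_dir is a hypothetical helper, and the directory passed to it should be the path part of hibench.hdfs.master.

```shell
# Hypothetical helper: create an HDFS directory if it does not exist yet.
# $1 = path to the hadoop executable, $2 = HDFS directory.
ensure_hdfs_dir() {
    hadoop_bin=$1
    dir=$2
    # "fs -test -d" succeeds only if the path exists and is a directory.
    if ! "$hadoop_bin" fs -test -d "$dir"; then
        "$hadoop_bin" fs -mkdir -p "$dir"
    fi
}
```

With the values above this would be: ensure_hdfs_dir /Users/<username>/Downloads/2020-03/hadoop-2.10.0/bin/hadoop /user/<username>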

Configure Kafka

In conf/hibench.conf, modify the following entries

# Your local Kafka installation directory
hibench.streambench.kafka.home                  /Users/<username>/Tools/kafka_2.11-2.4.0
# zookeeper host:port of kafka cluster, host1:port1,host2:port2...
hibench.streambench.zkHost  localhost:2181
# Kafka broker lists, written in mode host:port,host:port,..
hibench.streambench.kafka.brokerList    localhost:9092

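Configure data generation

These are the settings in conf/hibench.conf that start with hibench.streambench.datagen.

They all have default values and can be left as they are.

Configure Flink in HiBench

Official configuration page

Run the command

cp conf/flink.conf.template conf/flink.conf

The configuration file contents are as follows

# Adjust to wherever Flink is installed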
hibench.streambench.flink.home                   /Users/<username>/Tools/flink-1.10.0

hibench.flink.master                             localhost:8081

# Default parallelism of flink job. This number must not exceed the number of slots configured in Flink
hibench.streambench.flink.parallelism            20
hibench.streambench.flink.bufferTimeout          10
hibench.streambench.flink.checkpointDuration     1000
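
The constraint between the parallelism and the slot count can be checked mechanically. The sketch below is a hypothetical helper (check_parallelism is my own name) that reads both files and compares the two numbers. It assumes a single TaskManager, so that taskmanager.numberOfTaskSlots equals the total number of slots.

```shell
# Hypothetical helper: compare HiBench's Flink parallelism with Flink's slots.
# $1 = HiBench conf/flink.conf, $2 = Flink conf/flink-conf.yaml.
# Assumption: one TaskManager, so slots per TaskManager == total slots.
check_parallelism() {
    parallelism=$(awk '$1 == "hibench.streambench.flink.parallelism" { print $2; exit }' "$1")
    slots=$(awk '$1 == "taskmanager.numberOfTaskSlots:" { print $2; exit }' "$2")
    if [ -z "$parallelism" ] || [ -z "$slots" ]; then
        echo "could not read parallelism or slot count" >&2
        return 2
    fi
    if [ "$parallelism" -gt "$slots" ]; then
        echo "parallelism $parallelism exceeds $slots slots" >&2
        return 1
    fi
}
```

From the HiBench directory: check_parallelism conf/flink.conf /Users/<username>/Tools/flink-1.10.0/conf/flink-conf.yaml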

Generate the data

Run the commands below. If they fail, see the error handling section for fixes.

bin/workloads/streaming/identity/prepare/genSeedDataset.sh
bin/workloads/streaming/identity/prepare/dataGen.sh

Run the Flink job

bin/workloads/streaming/identity/flink/run.sh

Generate the report

# Run the report generation script
bin/workloads/streaming/identity/common/metrics_reader.sh

# The script above lists topic names like the following
FLINK_identity_1_5_50_1583118115848
FLINK_identity_1_5_50_1583118729972
FLINK_identity_1_5_50_1583119730761
FLINK_identity_1_5_50_1583120900468
FLINK_identity_1_5_50_1583121043536
FLINK_identity_1_5_50_1583131260923
FLINK_identity_1_5_50_1583207113628
__consumer_offsets
identity
test

# At the prompt below, enter a topic that starts with FLINK_identity
Please input the topic:FLINK_identity_1_5_50_1583118115848

Collected 0 results for partition: 11
# Finally, the console output shows the name of the report file
written out metrics to 
/Users/<username>/Projects/HiBench/report/FLINK_identity_1_5_50_15831181...
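
When many runs have accumulated, picking the most recent topic by eye gets tedious. The sketch below is a hypothetical helper (latest_flink_topic is not part of HiBench) that reads topic names on stdin and prints the FLINK_identity topic with the largest trailing number. It assumes that this trailing number is a millisecond timestamp, which the sample names above suggest but which I have not confirmed in the HiBench source.

```shell
# Hypothetical helper: read topic names on stdin, one per line, and print the
# FLINK_identity topic whose trailing number is the largest.
# Assumption: the trailing number is a creation timestamp in milliseconds.
latest_flink_topic() {
    awk -F_ '/^FLINK_identity_/ && $NF + 0 > max { max = $NF + 0; name = $0 }
             END { if (name != "") print name }'
}
```

From the Kafka directory: bin/kafka-topics.sh --list --bootstrap-server localhost:9092 | latest_flink_topic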

Error handling

workload_functions.sh: line 57: N=0: command not found

The error message is as follows:

/Users/ysgao/Projects/me/HiBench/bin/functions/workload_functions.sh: line 57: N=0: command not found
expr: not a decimal number: 'N'
expr: syntax error

In this case, do the following

Install MacPorts

https://www.macports.org/install.php

https://distfiles.macports.org/MacPorts/MacPorts-2.6.2-10.15-Catalina.pkg

Run the commands

sudo xcode-select -switch /Applications/Xcode.app/Contents/Developer # version 11.3.1
sudo port install coreutils @8.31

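The gdate command is now available.

Modify the code

In HiBench/bin/functions/workload_functions.sh, change date to gdate. The fallback assignment is also corrected here from $nanosec=0 to nanosec=0: on macOS, date +%N prints the literal N, so $nanosec=0 expands to the command N=0, which is exactly what the error message reports.

function timestamp(){		# get current timestamp in milliseconds
    sec=`gdate +%s`
    nanosec=`gdate +%N`
    re='^[0-9]+$'
    # fall back to 0 if %N did not produce a number
    if ! [[ $nanosec =~ $re ]] ; then
        nanosec=0
    fi
    tmp=`expr $sec \* 1000 `
    msec=`expr $nanosec / 1000000 `
    echo `expr $tmp + $msec`
}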
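
As an alternative to hard-coding gdate, the function can pick whichever date command is available and fall back to zero nanoseconds when %N is not supported. The following is a sketch under that assumption, not the code HiBench ships; timestamp_ms is a hypothetical name chosen so that it does not clash with the original function.

```shell
# Sketch of a timestamp function that works with or without GNU date.
# Prints the current time in milliseconds since the epoch.
timestamp_ms() {
    # Prefer gdate (GNU coreutils, e.g. from MacPorts) when it is installed.
    if command -v gdate >/dev/null 2>&1; then
        date_cmd=gdate
    else
        date_cmd=date
    fi
    sec=$("$date_cmd" +%s)
    nanosec=$("$date_cmd" +%N)
    # A date without %N support prints non-digits; fall back to 0 then.
    case "$nanosec" in
        '' | *[!0-9]*) nanosec=0 ;;
    esac
    # Strip leading zeros so the shell does not read the value as octal.
    nanosec=$(echo "$nanosec" | sed 's/^0*//')
    echo $(( sec * 1000 + ${nanosec:-0} / 1000000 ))
}
```

Without GNU date the result only has one-second resolution, which may be too coarse for latency measurements; installing coreutils as described above remains the more accurate option.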

Before rerunning the failed command, close the terminal window and open a new one.

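env:python2: No such file or directory
This happens because the Python scripts are set to run with python2, but no python2 executable is available locally (for example, there is no /usr/bin/python2)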
```
#!/usr/bin/env python2
```
A simple fix is to change the affected lines to
```
#!/usr/bin/env python
```
The relevant scripts are all in the directory bin/functions
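
The shebang change can be applied to every affected file in one pass. The sketch below is a hypothetical helper (fix_python_shebang is my own name) that rewrites only files whose first line is exactly the python2 shebang. It only helps if python resolves to a Python 2 interpreter, since the prerequisites call for Python 2.x.

```shell
# Hypothetical helper: rewrite "#!/usr/bin/env python2" on the first line of
# every regular file directly inside the given directory.
fix_python_shebang() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if [ "$(head -n 1 "$f")" = '#!/usr/bin/env python2' ]; then
            tmp=$(mktemp) || return 1
            sed '1s|python2$|python|' "$f" > "$tmp"
            # Write back through cat so the file keeps its executable bit.
            cat "$tmp" > "$f"
            rm -f "$tmp"
        fi
    done
}
```

From the HiBench directory: fix_python_shebang bin/functions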
Some errors may come from Hadoop; see Installing Hadoop on Mac