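HiBench is a big data benchmark suite for measuring the speed, throughput, and system resource utilization of big data frameworks.
It supports the following frameworks: hadoopbench, sparkbench, stormbench, flinkbench, and gearpumpbench.
References: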
https://github.com/intel-hadoop/HiBench
https://github.com/intel-hadoop/HiBench/blob/master/docs/build-hibench.md
https://github.com/intel-hadoop/HiBench/blob/master/docs/run-hadoopbench.md
Problem: the test run does not generate hibench.report
Cause: the bc tool is not installed on the node that runs the test.
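A quick pre-flight check can catch this before a run. The following is a minimal sketch and not part of HiBench: missing_tools is a hypothetical helper that prints every command from its argument list that cannot be found on PATH.

```shell
# Hypothetical helper (not part of HiBench): print each command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# bc and python are the tools this guide names as required on the test node.
missing=$(missing_tools bc python)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so several names print on one line.
  echo "Missing tools:" $missing
fi
```

If bc is reported, install it with the distribution's package manager before rerunning the workload.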
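Problem: org.apache.hadoop.dfs.SafeModeException: Cannot delete xxxxxx. Name node is in safe mode
HDFS starts up in safe mode. While it is in safe mode, the contents of the file system can be neither modified nor deleted until safe mode ends. Safe mode exists mainly so that, at startup, the validity of the data blocks on each DataNode can be checked and blocks can be replicated or deleted as the policy requires. Safe mode can also be entered at runtime by command. In practice, modifying or deleting files right after startup triggers this error, and waiting a short while is usually enough.
To take Hadoop out of safe mode immediately instead of waiting, run the following from the Hadoop directory:
bin/hadoop dfsadmin -safemode leave
This turns safe mode off and resolves the error. On Hadoop 2.x, hadoop dfsadmin is deprecated in favor of the equivalent bin/hdfs dfsadmin -safemode leave.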
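The check-then-leave sequence can be wrapped in a small function. This is a sketch under stated assumptions: leave_safemode_if_on is a hypothetical name, the hdfs command is assumed to be on PATH, and the status text printed by hdfs dfsadmin -safemode get is assumed to contain "Safe mode is ON" when safe mode is active.

```shell
# Hypothetical helper: leave safe mode only when the NameNode reports that it is ON.
# Assumes hdfs is on PATH and that the status text contains "Safe mode is ON".
leave_safemode_if_on() {
  state=$(hdfs dfsadmin -safemode get) || return 1
  case "$state" in
    *"Safe mode is ON"*) hdfs dfsadmin -safemode leave ;;
    *) echo "Safe mode is already off" ;;
  esac
}

# Usage on a cluster node:
#   leave_safemode_if_on
# To block until HDFS leaves safe mode on its own instead of forcing it:
#   hdfs dfsadmin -safemode wait
```

Waiting is the gentler option, since HDFS leaves safe mode by itself once its startup checks are done.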
Download
$ git clone https://github.com/intel-hadoop/HiBench.git
Build
Maven must be installed, and the build machine needs network access.
Build all frameworks and modules
mvn clean package
Build specific frameworks
mvn -Phadoopbench -Psparkbench clean package
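Each framework is selected with its own -P profile flag, so the command line can be assembled from a list of framework names. The sketch below only prints the command and does not run it; hibench_build_cmd is a hypothetical helper name, not something HiBench provides.

```shell
# Hypothetical helper: print (do not run) the Maven command for the given framework profiles.
hibench_build_cmd() {
  cmd="mvn"
  for profile in "$@"; do
    cmd="$cmd -P$profile"
  done
  echo "$cmd clean package"
}

hibench_build_cmd hadoopbench sparkbench
# prints: mvn -Phadoopbench -Psparkbench clean package
```

Called with no arguments, it prints the build-everything command, mvn clean package.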
Build a single module
mvn -Phadoopbench -Dmodules -Psql -Dspark=2.1 -Dscala=2.11 clean package
Supported modules: micro, ml (machine learning), sql, websearch, graph, streaming, structuredStreaming (Spark 2.0 or 2.1).
Build Structured Streaming
Structured Streaming is not built by default and is only supported on Spark 2.0 and Spark 2.1.
mvn -Psparkbench -Dmodules -PstructuredStreaming clean package
Run hadoopbench
Prerequisites
The node that runs the tests needs the following:
+ Python 2.x (>=2.6) is required.
+ bc is required to generate the HiBench report.
+ Supported Hadoop versions: Apache Hadoop 2.x, CDH5.x, HDP.
+ Build HiBench according to the build instructions (build-hibench.md).
+ Start HDFS and YARN in the cluster.
Configuration
Create and edit conf/hadoop.conf
cp conf/hadoop.conf.template conf/hadoop.conf
Property | Meaning |
---|---|
hibench.hadoop.home | The Hadoop installation location |
hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username |
hibench.hadoop.release | Hadoop release provider. Supported values: apache, cdh5, hdp |
My configuration:
# Hadoop home
hibench.hadoop.home /usr/local/hadoop
# The path of hadoop executable
hibench.hadoop.executable ${hibench.hadoop.home}/bin/hadoop
# Hadoop configuration directory
hibench.hadoop.configure.dir ${hibench.hadoop.home}/etc/hadoop
# The root HDFS path to store HiBench data
hibench.hdfs.master hdfs://172.17.0.2:9000
#hibench.hdfs.master hdfs://localhost:50070
# Hadoop release provider. Supported value: apache, cdh5, hdp
hibench.hadoop.release apache
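A common mistake in this file is pointing hibench.hdfs.master at the wrong host or port. As a sanity check, the configured value can be read back and compared with the file system URI that Hadoop itself reports. The sketch below assumes the conf file uses whitespace-separated "key value" lines as shown above; hibench_prop is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of a key from a HiBench-style "key value" conf file.
# Comment lines are skipped naturally because their first field never equals the key.
# It does not expand ${...} references such as ${hibench.hadoop.home}.
hibench_prop() {
  awk -v key="$1" '$1 == key { print $2 }' "$2"
}

# Usage from the HiBench root:
#   hibench_prop hibench.hdfs.master conf/hadoop.conf
# On a node with Hadoop configured, compare it against the URI Hadoop itself uses:
#   hdfs getconf -confKey fs.defaultFS
```

The host and port of the two values would normally agree. If they differ, the HiBench setting is the first thing to recheck.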
Run
Run a single workload
bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/hadoop/run.sh
Run all workloads configured in conf/benchmarks.lst and conf/frameworks.lst.
bin/run_all.sh
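When running a few selected workloads by hand, it helps to chain the prepare and run steps so that the benchmark is skipped if data preparation fails. This is a minimal sketch: run_workload is a hypothetical helper, and it assumes other workloads follow the same path layout as the wordcount scripts above.

```shell
# Hypothetical helper: run the prepare step and then the benchmark step for one workload.
# $1 = category/name (for example micro/wordcount), $2 = framework (for example hadoop).
# Assumes other workloads follow the same path layout as wordcount.
run_workload() {
  "bin/workloads/$1/prepare/prepare.sh" && "bin/workloads/$1/$2/run.sh"
}

# Usage from the HiBench root:
#   run_workload micro/wordcount hadoop
```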
View the report
<HiBench_Root>/report/hibench.report: Summarized workload report.
<workload>/hadoop/bench.log: Raw logs on client side.
<workload>/hadoop/monitor.html: System utilization monitor results.
<workload>/hadoop/conf/<workload>.conf: Generated environment variable configurations for this workload.
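To pull a single metric out of the summary report, the column can be looked up by its header name instead of by position. This sketch assumes that hibench.report is a whitespace-separated table with a header row and the workload type in the first column. The header name used in the usage line, Duration(s), is an assumption to check against the actual file. report_column is a hypothetical helper name.

```shell
# Hypothetical helper: print the first column plus the column whose header matches $1.
# Assumes a whitespace-separated report with a header row on line 1.
report_column() {
  awk -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx     { print $1, $idx }
  ' "$2"
}

# Usage from the HiBench root (the header name is an assumption):
#   report_column "Duration(s)" report/hibench.report
```

If the header name is not found, the function prints nothing, which makes a mistyped column name easy to spot.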
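Configure the input data size
Set the hibench.scale.profile property in conf/hibench.conf.
Available values are tiny, small, large, huge, gigantic, and bigdata.
The definition of each value is in the configuration file of the corresponding workload. For example, for wordcount it is in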
conf/workloads/micro/wordcount.conf
Configure parallelism
conf/hibench.conf
Property | Meaning |
---|---|
hibench.default.map.parallelism | Mapper number in hadoop |
hibench.default.shuffle.parallelism | Reducer number in hadoop |
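The scale profile and the parallelism properties are plain key/value lines in conf/hibench.conf, so they can be edited by hand or scripted when sweeping over several settings. The following is a sketch, assuming one "key value" pair per line; set_hibench_prop is a hypothetical helper, and the values in the usage lines are only examples.

```shell
# Hypothetical helper: set "key value" in a HiBench-style conf file,
# replacing an existing line for the key or appending one if none exists.
set_hibench_prop() {
  _tmp=$(mktemp) || return 1
  awk -v key="$1" -v val="$2" '
    $1 == key { print key " " val; found = 1; next }
    { print }
    END { if (!found) print key " " val }
  ' "$3" > "$_tmp" && cat "$_tmp" > "$3"
  _rc=$?
  rm -f "$_tmp"
  return $_rc
}

# Usage from the HiBench root (the values are only examples):
#   set_hibench_prop hibench.scale.profile small conf/hibench.conf
#   set_hibench_prop hibench.default.map.parallelism 8 conf/hibench.conf
```

The result is written back with cat instead of mv, so the conf file keeps its original permissions.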