2020/11/27 sunhaiqi@bonc.com.cn
Hadoop Benchmarking
1. Debugging the Cluster
Before starting any benchmarks, the HDFS and YARN services should be running.
When starting the YARN service, the ResourceManager failed to come up. The log revealed the error:
2020-11-23 15:56:44,775 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: registered UNIX signal handlers for [TERM, HUP, INT]
2020-11-23 15:56:45,408 INFO org.apache.hadoop.conf.Configuration: found resource core-site.xml at file:/home/bduser101/modules/hadoop/etc/hadoop/core-site.xml
2020-11-23 15:56:45,931 FATAL org.apache.hadoop.conf.Configuration: error parsing conf java.io.BufferedInputStream@74294adb
org.xml.sax.SAXParseException; lineNumber: 19; columnNumber: 38; An 'include' failed, and no 'fallback' element was found.
This error comes from the include file added to core-site.xml when HDFS federation was configured. After copying the properties from mountTable.xml directly into core-site.xml and syncing that file to all nodes, the YARN service started normally.
[Source of the error]
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="mountTable.xml"/>
After the correction:
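The corrected file itself is not reproduced above; presumably it drops the XInclude and carries the mountTable.xml properties inline. A hedged sketch (the mount point shown is hypothetical; the mount-table name and NameNode address are the ones used elsewhere in this post):

```xml
<configuration>
  <!-- Former contents of mountTable.xml, copied inline so the XInclude
       (and its failure mode) goes away. Mount points are examples only. -->
  <property>
    <name>fs.viewfs.mounttable.my-cluser.link./user</name>
    <value>hdfs://node101:8020/user</value>
  </property>
  <!-- ...remaining mount-table entries and the original core-site properties... -->
</configuration>
```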
2. Benchmark Components
After deploying a new cluster, upgrading one, or tuning its performance parameters, we want to observe how cluster performance changes, and for that we need benchmarking tools.
Hadoop ships with a test jar that contains many such tools; DFSCIOTest, mrbench, and nnbench are the most widely used. Running the jar without arguments lists the available programs:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.3-tests.jar
- DFSCIOTest: Distributed i/o benchmark of libhdfs. (libhdfs is a shared library that provides HDFS file services to C/C++ applications.)
- DistributedFSCheck: Distributed checkup of the file system consistency.
- JHLogAnalyzer: Job History Log analyzer.
- MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures.
- NNdataGenerator: Generate the data to be used by NNloadGenerator.
- NNloadGenerator: Generate load on Namenode using NN loadgenerator run WITHOUT MR.
- NNloadGeneratorMR: Generate load on Namenode using NN loadgenerator run as MR job.
- NNstructureGenerator: Generate the structure to be used by NNdataGenerator.
- SliveTest: HDFS Stress Test and Live Data Verification.
- TestDFSIO: Distributed i/o benchmark.
- fail: a job that always fails.
- filebench: Benchmark SequenceFile(Input|Output)Format (block, record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed).
- largesorter: Large-Sort tester.
- loadgen: Generic map/reduce load generator.
- mapredtest: A map/reduce test check.
- minicluster: Single process HDFS and MR cluster.
- mrbench: A map/reduce benchmark that can create many small jobs.
- nnbench: A benchmark that stresses the namenode.
- sleep: A job that sleeps at each map and reduce task.
- testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce.
- testfilesystem: A test for FileSystem read/write.
- testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
- testsequencefile: A test for flat files of binary key value pairs.
- testsequencefileinputformat: A test for sequence file input format.
- testtextinputformat: A test for text input format.
- threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill.
2.1 TestDFSIO
TestDFSIO tests the I/O performance of HDFS. It uses a MapReduce job to run reads and writes in parallel: each map task reads or writes one file, the map output collects statistics about the files processed, and the reduce task accumulates those statistics and produces the summary.
Running it without arguments prints the usage:
$>hadoop jar hadoop-mapreduce-client-jobclient-2.7.6-tests.jar TestDFSIO
Usage: TestDFSIO [genericOptions] -read | -write | -append | -clean [-nrFiles N] [-fileSize Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]
When the test finishes, a file named TestDFSIO_results.log is written to the local working directory; it records the results of each run.
2.1.1 Writing 10 files of 100MB each to HDFS
$>cd /home/bduser101/modules/hadoop/share/hadoop/mapreduce
$>hadoop jar hadoop-mapreduce-client-jobclient-2.7.6-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100
20/11/23 17:03:48 INFO fs.TestDFSIO: TestDFSIO.1.8
20/11/23 17:03:48 INFO fs.TestDFSIO: nrFiles = 10
20/11/23 17:03:48 INFO fs.TestDFSIO: nrBytes (MB) = 100.0
20/11/23 17:03:48 INFO fs.TestDFSIO: bufferSize = 1000000
20/11/23 17:03:48 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/11/23 17:03:50 INFO fs.TestDFSIO: creating control file: 104857600 bytes, 10 files
If you hit the warning WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method), don't panic; judging from most reports online, this is a Hadoop bug.
20/11/23 17:04:39 INFO mapreduce.Job: map 0% reduce 0%
20/11/23 17:05:03 INFO mapreduce.Job: map 13% reduce 0%
20/11/23 17:05:20 INFO mapreduce.Job: map 17% reduce 0%
20/11/23 17:05:21 INFO mapreduce.Job: map 20% reduce 0%
20/11/23 17:05:35 INFO mapreduce.Job: map 20% reduce 7%
20/11/23 17:05:44 INFO mapreduce.Job: map 27% reduce 7%
20/11/23 17:05:50 INFO mapreduce.Job: map 30% reduce 10%
20/11/23 17:05:58 INFO mapreduce.Job: map 77% reduce 10%
20/11/23 17:06:52 INFO mapreduce.Job: map 80% reduce 10%
Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-112231132-192.168.159.101-1584624837518:blk_1073742201_1382 does not exist or is not under Constructionnull
If you hit the error org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-112231132-192.168.159.101-1584624837518:blk_1073742201_1382 does not exist or is not under Constructionnull, this is a bug related to the balancer; see https://issues.apache.org/jira/browse/hdfs-8093 for details.
Things to check when troubleshooting:
- Does the system / HDFS have free space? Here, core-site.xml was modified, changing fs.default.name from viewfs://my-cluser to hdfs://node101:8020, and then the report was checked:
  hadoop/bin$>./hdfs dfsadmin -report
  The report showed that the cluster still had plenty of free space.
- Is the number of live datanodes normal?
- Is the cluster in safe mode?
- Is the firewall disabled?
- Check the configuration.
- Clear the NameNode's tmp directory and reformat the NameNode.
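The core-site.xml change mentioned in the first check can be sketched as follows (fs.default.name is the legacy name of fs.defaultFS; the NameNode address is the one used by this cluster):

```xml
<!-- core-site.xml: point the default filesystem at the NameNode directly
     instead of at the viewfs federation mount table -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://node101:8020</value>
</property>
```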
20/11/23 17:06:53 INFO mapreduce.Job: map 77% reduce 10%
20/11/23 17:07:14 INFO mapreduce.Job: map 80% reduce 10%
20/11/23 17:07:27 INFO mapreduce.Job: map 90% reduce 13%
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /benchmarks/TestDFSIO/io_data/test_io_6 (inode 16834): File does not exist. Holder DFSClient_attempt_1606119502234_0004_m_000006_0_-1509478354_1 does not have any open files.
If you hit the LeaseExpiredException above ("No lease on ... File does not exist ... does not have any open files"):
- The root cause is that a file is deleted while a data-stream operation on it is still in progress, usually because several MapReduce tasks operate on the same file and one task deletes it after finishing.
- The error is tied to a Hadoop feature: Hadoop does not try to diagnose and repair slow-running tasks; instead it detects them (speculation) and runs backup tasks for them. When a task runs slowly, Hadoop launches a second task doing the same work (in this case, writing data into HDFS). Whichever of the two identical tasks finishes first deletes some temporary files; when the other task finishes, it tries to delete the same files, which produces this error.
- The error itself does not affect the results of the benchmark and can be ignored. It can also be avoided by turning off speculative execution in Hadoop (and in Spark, where applicable).
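A minimal sketch of the Hadoop side of that fix (property names are those of Hadoop 2.x's mapred-site.xml; adjust for your version):

```xml
<!-- mapred-site.xml: disable speculative execution so no backup task
     deletes temporary files out from under a still-running attempt -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```

The same properties can be passed per-job on the command line via -D, which avoids changing cluster-wide configuration just for a benchmark run.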
When the job finishes, the following information is printed, including the MapReduce counters, throughput, and I/O rates for the run:
20/11/23 17:07:28 INFO mapreduce.Job: map 87% reduce 13%
20/11/23 17:07:31 INFO mapreduce.Job: map 90% reduce 13%
20/11/23 17:07:32 INFO mapreduce.Job: map 93% reduce 13%
20/11/23 17:07:33 INFO mapreduce.Job: map 97% reduce 13%
20/11/23 17:07:34 INFO mapreduce.Job: map 100% reduce 13%
20/11/23 17:07:36 INFO mapreduce.Job: map 100% reduce 100%
20/11/23 17:07:37 INFO mapreduce.Job: Job job_1606119502234_0004 completed successfully
20/11/23 17:07:38 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=857
FILE: Number of bytes written=1377714
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2330
HDFS: Number of bytes written=1048576078
HDFS: Number of read operations=43
HDFS: Number of large read operations=0
HDFS: Number of write operations=12
VIEWFS: Number of bytes read=0
VIEWFS: Number of bytes written=0
VIEWFS: Number of read operations=0
VIEWFS: Number of large read operations=0
VIEWFS: Number of write operations=0
Job Counters
Failed map tasks=2
Killed map tasks=6
Launched map tasks=19
Launched reduce tasks=1
Other local map tasks=1
Data-local map tasks=18
Total time spent by all maps in occupied slots (ms)=1723294
Total time spent by all reduces in occupied slots (ms)=133402
Total time spent by all map tasks (ms)=1723294
Total time spent by all reduce tasks (ms)=133402
Total vcore-milliseconds taken by all map tasks=1723294
Total vcore-milliseconds taken by all reduce tasks=133402
Total megabyte-milliseconds taken by all map tasks=1764653056
Total megabyte-milliseconds taken by all reduce tasks=136603648
Map-Reduce Framework
Map input records=10
Map output records=50
Map output bytes=751
Map output materialized bytes=911
Input split bytes=1210
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=911
Reduce input records=50
Reduce output records=5
Spilled Records=100
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=71333
CPU time spent (ms)=66340
Physical memory (bytes) snapshot=1764884480
Virtual memory (bytes) snapshot=22712225792
Total committed heap usage (bytes)=2045894656
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
Test log:
20/11/23 17:07:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
20/11/23 17:07:38 INFO fs.TestDFSIO: Date & time: Mon Nov 23 17:07:38 CST 2020
20/11/23 17:07:38 INFO fs.TestDFSIO: Number of files: 10
20/11/23 17:07:38 INFO fs.TestDFSIO: Total MBytes processed: 1000
20/11/23 17:07:38 INFO fs.TestDFSIO: Throughput mb/sec: 2.09
20/11/23 17:07:38 INFO fs.TestDFSIO: Average IO rate mb/sec: 3.48
20/11/23 17:07:38 INFO fs.TestDFSIO: IO rate std deviation: 2.52
20/11/23 17:07:38 INFO fs.TestDFSIO: Test exec time sec: 226.16
20/11/23 17:07:38 INFO fs.TestDFSIO:
Statistical analysis after running the same test 10 times on the company's test cluster:
2.1.2 Reading 10 files of 100MB each from HDFS
The write test above should be run first, so that the data to be read exists.
$>cd /home/bduser101/modules/hadoop/share/hadoop/mapreduce
$>hadoop jar hadoop-mapreduce-client-jobclient-2.7.6-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100
20/11/24 15:16:24 INFO fs.TestDFSIO: TestDFSIO.1.8
20/11/24 15:16:24 INFO fs.TestDFSIO: nrFiles = 10
20/11/24 15:16:24 INFO fs.TestDFSIO: nrBytes (MB) = 100.0
20/11/24 15:16:24 INFO fs.TestDFSIO: bufferSize = 1000000
20/11/24 15:16:24 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/11/24 15:16:26 INFO fs.TestDFSIO: creating control file: 104857600 bytes, 10 files
The same exception seen earlier during the write test appears again: WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException at java.lang.Object.wait(Native Method)