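I have been using Hadoop for a little over a year, but all I really know is some MapReduce and the basic Hadoop commands. Today a question occurred to me: if I add machines to an existing cluster, how much does the cluster's performance improve, and how can I measure it?
I searched online and found that Hadoop ships with a set of benchmark tools for testing a cluster. Below is a brief walkthrough of how to use them.
- First, a note on where the benchmark jar lives and which Hadoop version I am using. Some posts I found online give a path that I could not find on my machine, so to keep newcomers from falling into the same trap: if your version differs from the one described here, locate the jar yourself. The way the tests are run is the same.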
[hadoop@hadoop1 ~]$ hadoop version
Hadoop 2.6.0-cdh5.4.8
Subversion Unknown -r Unknown
Compiled by hadoopcdh532 on 2015-10-29T06:22Z
Compiled with protoc 2.5.0
From source with checksum e3ea30a354dfe490b21f10ab2e3693
This command was run using /home/hadoop/yarn/hadoop-2.6.0-cdh5.4.8/share/hadoop/common/hadoop-common-2.6.0-cdh5.4.8.jar
// The benchmark jar is in the following directory
[hadoop@hadoop1]$ cd $HADOOP_HOME/share/hadoop/mapreduce
[hadoop@hadoop1 mapreduce]$ ls
hadoop-mapreduce-client-app-2.6.0-cdh5.4.8.jar
hadoop-mapreduce-client-hs-plugins-2.6.0-cdh5.4.8.jar hadoop-mapreduce-client-shuffle-2.6.0-cdh5.4.8.jar
hadoop-mapreduce-client-common-2.6.0-cdh5.4.8.jar
hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8.jar hadoop-mapreduce-examples-2.6.0-cdh5.4.8.jar
hadoop-mapreduce-client-core-2.6.0-cdh5.4.8.jar
**hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar**
hadoop-mapreduce-client-hs-2.6.0-cdh5.4.8.jar
hadoop-mapreduce-client-nativetask-2.6.0-cdh5.4.8.jar
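- The jar used for benchmarking is hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar (note the -tests suffix; the jar without it does not contain the benchmark programs)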
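Since the version string in the jar name changes from one distribution to the next, here is a minimal sketch of how the tests jar could be located with a glob instead of a hard-coded name. The function name `find_tests_jar` is my own, and it assumes the same `$HADOOP_HOME/share/hadoop/mapreduce` layout shown above; other distributions may place the jar elsewhere.

```shell
# Print the path of the first jobclient "-tests" jar found in a directory.
# Defaults to $HADOOP_HOME/share/hadoop/mapreduce (layout assumed from this post).
find_tests_jar() {
  dir="${1:-${HADOOP_HOME:-}/share/hadoop/mapreduce}"
  for jar in "$dir"/hadoop-mapreduce-client-jobclient-*-tests.jar; do
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  # Nothing matched: signal failure so callers can stop early.
  return 1
}

# Example use (needs a real Hadoop installation):
#   hadoop jar "$(find_tests_jar)"
```

The function returns a non-zero status when no jar matches, so a script can bail out before calling `hadoop jar` with an empty path.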
- How to run the benchmarks
Run the following command to see which benchmarks are available:
[hadoop@hadoop1 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar
It prints the following usage help:
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
JHLogAnalyzer: Job History Log analyzer.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
SliveTest: HDFS Stress Test and Live Data Verification.
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
largesorter: Large-Sort tester
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
TestDFSIO, mrbench, and nnbench are three widely used tests.
TestDFSIO
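TestDFSIO measures the I/O performance of HDFS. It uses a MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output collects statistics about the file it processed, and the reduce step accumulates those statistics and produces a summary. TestDFSIO is used as follows:
The following example writes 10 files of 1000 MB each to HDFS: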
hadoop jar hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
The results are written to a local file, TestDFSIO_results.log.
The following example reads 10 files of 1000 MB each from HDFS:
hadoop jar hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
Use the following command to delete the test data:
hadoop jar hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar TestDFSIO -clean
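To compare the cluster before and after adding machines, it helps to pull the throughput figures out of TestDFSIO_results.log rather than reading them by eye. The sketch below is only that, a sketch: it assumes each run in the log starts with a header line like `----- TestDFSIO ----- : write` and contains a line like `Throughput mb/sec: <number>`, which is what I have seen on Hadoop 2.x; check your own log before relying on it. The function name `dfsio_throughput` is my own.

```shell
# Print "<operation> <throughput in MB/s>" for every run recorded in a
# TestDFSIO results log. The line formats matched here are assumptions
# about the log layout; adjust the patterns if your version differs.
dfsio_throughput() {
  awk -F':' '
    /TestDFSIO -----/   { op = $2; gsub(/[ \t]/, "", op) }
    /Throughput mb\/sec/ { tp = $2; gsub(/[ \t]/, "", tp); print op, tp }
  ' "$1"
}

# Example use, after running the -write and -read tests above:
#   dfsio_throughput TestDFSIO_results.log
```

Running it once before and once after the expansion (with the same -nrFiles and -fileSize) gives two sets of numbers that can be compared directly.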
nnbench
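nnbench load-tests the NameNode: it generates a large number of HDFS-related requests to put the NameNode under heavy pressure. The test can simulate creating, reading, renaming, and deleting files on HDFS.
The following example uses 12 mappers and 6 reducers to create 1000 files: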
hadoop jar hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
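The command above only covers file creation. As a rough sketch, the same flags can be reused for the other operations mentioned in the description. I am assuming here that the operation names are `create_write`, `open_read`, `rename` and `delete`, and that the later operations work on the files left behind by `create_write` in the same base directory; verify both against the nnbench usage output on your version. The function `nnbench_cmds` is my own and only prints the commands, it does not run them.

```shell
# Print one nnbench command per operation, reusing the flags from the
# example in this post. Dry run only: nothing is submitted to the cluster.
# The operation names are an assumption; check nnbench's own help text.
nnbench_cmds() {
  jar=$1
  base=$2
  for op in create_write open_read rename delete; do
    echo "hadoop jar $jar nnbench -operation $op -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir $base"
  done
}

# Example use:
#   nnbench_cmds hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar "/benchmarks/NNBench-$(hostname -s)"
```

Printing first makes it easy to review the commands, and then paste them one at a time so that each operation's result can be recorded separately.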
mrbench
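mrbench runs a small job many times in a row to check whether small jobs on the cluster run repeatably and efficiently.
The following example runs a small job 50 times: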
hadoop jar hadoop-mapreduce-client-jobclient-2.6.0-cdh5.4.8-tests.jar mrbench -numRuns 50
The results look like this:
DataLines | Maps | Reduces | AvgTime (milliseconds) |
---|---|---|---|
1 | 2 | 1 | 14237 |
The result above shows an average job completion time of about 14 seconds (14237 ms).
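When collecting mrbench numbers from several runs, the millisecond-to-second conversion can be scripted. A minimal sketch follows; the function name `mrbench_avg_seconds` is my own. It accepts the pipe-separated table as shown in this post and also plain whitespace-separated rows, on the assumption that the fourth column is always AvgTime in milliseconds.

```shell
# Read mrbench result rows on stdin and print AvgTime converted to seconds.
# Header and separator rows are skipped because their first field is not a number.
mrbench_avg_seconds() {
  awk '{
    gsub(/\|/, " ")                 # treat table pipes as plain separators
    if (NF >= 4 && $1 ~ /^[0-9]+$/)
      printf "%.3f\n", $4 / 1000    # milliseconds -> seconds
  }'
}

# The row from the table above:
echo '1 | 2 | 1 | 14237 |' | mrbench_avg_seconds   # prints 14.237
```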
Hadoop Examples
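Besides the tests above, Hadoop also ships with some examples, such as WordCount and TeraSort, in hadoop-mapreduce-examples-2.6.0-cdh5.4.8.jar. Running the jar without arguments lists all the example programs:
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.4.8.jar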
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
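Of the programs listed, teragen, terasort and teravalidate are usually run as a trio. As a rough sketch of how they fit together: teragen takes a row count and an output directory, and each row is 100 bytes, so the row count follows from the amount of data you want. The helper names and the `/benchmarks/TeraSort` base directory below are my own choices, not something Hadoop prescribes, and the second function only prints the commands.

```shell
# teragen writes fixed-size 100-byte rows: rows = target bytes / 100.
teragen_rows() {
  echo $(( $1 / 100 ))
}

# Print the three commands of a TeraSort run (dry run, nothing is executed).
# $1 = examples jar, $2 = number of rows, $3 = HDFS base directory (hypothetical).
terasort_plan() {
  jar=$1
  rows=$2
  base=$3
  echo "hadoop jar $jar teragen $rows $base/input"
  echo "hadoop jar $jar terasort $base/input $base/output"
  echo "hadoop jar $jar teravalidate $base/output $base/report"
}

# Example: plan a run over 1 GB (10^9 bytes) of generated data.
terasort_plan hadoop-mapreduce-examples-2.6.0-cdh5.4.8.jar "$(teragen_rows 1000000000)" /benchmarks/TeraSort
```

Timing the terasort step before and after adding machines, with the same row count, gives one more data point alongside the TestDFSIO and mrbench numbers.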
For details on how to use the examples, see the blog post below.