Hadoop benchmark performance-testing tools: TestDFSIO / TeraSort

TestDFSIO
   
  # Usage
  hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]
TestDFSIO launches one map task per file, so -nrFiles also determines the number of map tasks.

Write test: generate 10 files of 100 MB each
   
  pwd
  /home/mr/yarn/share/hadoop/mapreduce
  hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100
Test results on the vmaxspark1 cluster:
   
  15/06/29 22:59:34 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
  15/06/29 22:59:34 INFO fs.TestDFSIO: Date & time: Mon Jun 29 22:59:34 CST 2015
  15/06/29 22:59:34 INFO fs.TestDFSIO: Number of files: 10
  15/06/29 22:59:34 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
  15/06/29 22:59:34 INFO fs.TestDFSIO: Throughput mb/sec: 2.2699105201272967
  15/06/29 22:59:34 INFO fs.TestDFSIO: Average IO rate mb/sec: 11.470916748046875
  15/06/29 22:59:34 INFO fs.TestDFSIO: IO rate std deviation: 15.038400232638908
  15/06/29 22:59:34 INFO fs.TestDFSIO: Test exec time sec: 80.936
  15/06/29 22:59:34 INFO fs.TestDFSIO:
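Besides the console output, TestDFSIO also appends each run's summary to a local results file, TestDFSIO_results.log by default (the -resFile option in the usage above overrides the name), which makes comparing runs easy:

  # Summaries accumulate across runs in the local results file
  cat TestDFSIO_results.log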

Read test: read 10 files of 100 MB each (TestDFSIO -read consumes the files created by the preceding write test)
   
  hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100
  15/06/29 23:02:28 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
  15/06/29 23:02:28 INFO fs.TestDFSIO: Date & time: Mon Jun 29 23:02:28 CST 2015
  15/06/29 23:02:28 INFO fs.TestDFSIO: Number of files: 10
  15/06/29 23:02:28 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
  15/06/29 23:02:28 INFO fs.TestDFSIO: Throughput mb/sec: 1540.8320493066255
  15/06/29 23:02:28 INFO fs.TestDFSIO: Average IO rate mb/sec: 1566.176025390625
  15/06/29 23:02:28 INFO fs.TestDFSIO: IO rate std deviation: 207.60517212156435
  15/06/29 23:02:28 INFO fs.TestDFSIO: Test exec time sec: 19.235

Cleaning up test data: by default the test files are written under /benchmarks/TestDFSIO
   
  hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -clean
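To verify the cleanup, listing the default benchmark directory should now fail:

  # Expect "No such file or directory" after -clean
  hadoop fs -ls /benchmarks/TestDFSIO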

Notes:

Throughput mb/sec for a TestDFSIO job using N map tasks is defined as follows. The index 1 <= i <= N denotes the individual map tasks:

$$\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i}{\sum_{i=1}^{N} \mathrm{time}_i}$$

Average IO rate mb/sec is defined as:

$$\mathrm{Average\ IO\ rate}(N) = \frac{\sum_{i=1}^{N} \mathrm{rate}_i}{N} = \frac{1}{N}\sum_{i=1}^{N} \frac{\mathrm{filesize}_i}{\mathrm{time}_i}$$
From this definition, with 10 map tasks the effective aggregate throughput is roughly 10 × 2.27 ≈ 22.7 MB/sec. The reason is that the denominator sums the individual task times rather than using wall-clock time, so the reported Throughput is a per-task average; multiplying by the number of concurrently running tasks estimates the cluster-wide rate.
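As a worked check against the write run above (a sketch using only the reported totals; the per-task times are not printed, so their sum is inferred from the definition):

$$\sum_{i=1}^{10} \mathrm{time}_i = \frac{1000\ \mathrm{MB}}{2.27\ \mathrm{MB/s}} \approx 440\ \mathrm{s}$$

If all 10 map tasks ran concurrently, each averaged about 44 s of IO, so the aggregate rate comes out to roughly $1000\ \mathrm{MB} / 44\ \mathrm{s} \approx 22.7\ \mathrm{MB/s} = 10 \times 2.27$.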

Another factor with a large impact on results is the HDFS replication factor, which can be adjusted via the dfs.replication property. In the run below, lowering it from the default 3 to 2 raised write throughput from 2.27 to 4.05 MB/sec.
   
  hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -D dfs.replication=2 -write -nrFiles 10 -fileSize 100
  15/06/29 23:32:47 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
  15/06/29 23:32:47 INFO fs.TestDFSIO: Date & time: Mon Jun 29 23:32:47 CST 2015
  15/06/29 23:32:47 INFO fs.TestDFSIO: Number of files: 10
  15/06/29 23:32:47 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
  15/06/29 23:32:47 INFO fs.TestDFSIO: Throughput mb/sec: 4.046895424175344
  15/06/29 23:32:47 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.856432914733887
  15/06/29 23:32:47 INFO fs.TestDFSIO: IO rate std deviation: 14.509421322080607
  15/06/29 23:32:47 INFO fs.TestDFSIO: Test exec time sec: 138.103


TeraSort Benchmark
Generating input data: TeraGen
    
  hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar teragen 1000000 /test/input100M
The number after teragen is a row count. Each row is 100 bytes, so 100 MB takes 1,000,000 rows, and generating 1 TB takes 1 TB / 100 B = 10,000,000,000 rows (a one followed by ten zeros).
The dfs.block.size property adjusts the HDFS block size, e.g. teragen -D dfs.block.size=536870912 ...
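Putting the row math and the block-size option together, a 1 TB generation run might look like this (a sketch; the 512 MB block size and the /test/input1TB output path are illustrative):

  # 10,000,000,000 rows x 100 bytes/row = 1 TB
  hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar \
      teragen -D dfs.block.size=536870912 10000000000 /test/input1TB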

Running TeraSort
    
  hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar terasort /test/input100M /test/output100M
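TeraSort accepts generic -D options as well. Note that the 10 GB run further below ended up with a single reduce task, which bottlenecks the sort; on a real cluster you would normally raise the reducer count, e.g. (mapreduce.job.reduces is the standard Hadoop 2 property; the value 16 is illustrative):

  # Spread the sort across 16 reducers instead of one
  hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar \
      terasort -D mapreduce.job.reduces=16 /test/input100M /test/output100M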

Validating the results: TeraValidate
    
  hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar teravalidate /test/output100M /test/validate100M
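TeraValidate writes its report into the second directory; if the output was correctly sorted, the part file should contain only a checksum record, while out-of-order keys produce error records. A quick inspection might look like this (the part filename assumes the usual single validate reducer):

  # Error records here would indicate a broken sort order
  hadoop fs -cat /test/validate100M/part-r-00000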

Test results on vmaxspark1
   
  [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar teragen 100000000 /gx/tera/input10G
  15/07/02 20:47:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
  15/07/02 20:47:47 INFO terasort.TeraSort: Generating 100000000 using 2
  15/07/02 20:47:47 INFO mapreduce.JobSubmitter: number of splits:2
  15/07/02 20:47:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1435833769205_0028
  15/07/02 20:47:48 INFO impl.YarnClientImpl: Submitted application application_1435833769205_0028
  15/07/02 20:47:48 INFO mapreduce.Job: The url to track the job: http://vmaxspark3:8088/proxy/application_1435833769205_0028/
  15/07/02 20:47:48 INFO mapreduce.Job: Running job: job_1435833769205_0028
  15/07/02 20:47:54 INFO mapreduce.Job: Job job_1435833769205_0028 running in uber mode : false
  15/07/02 20:47:54 INFO mapreduce.Job: map 0% reduce 0%
  ...
  15/07/02 21:08:20 INFO mapreduce.Job: map 99% reduce 0%
  15/07/02 21:08:41 INFO mapreduce.Job: map 100% reduce 0%
  15/07/02 21:08:45 INFO mapreduce.Job: Job job_1435833769205_0028 completed successfully
  15/07/02 21:08:45 INFO mapreduce.Job: Counters: 31
      File System Counters
          FILE: Number of bytes read=0
          FILE: Number of bytes written=191996
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=170
          HDFS: Number of bytes written=10000000000
          HDFS: Number of read operations=8
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=4
      Job Counters
          Launched map tasks=2
          Other local map tasks=2
          Total time spent by all maps in occupied slots (ms)=7422519
          Total time spent by all reduces in occupied slots (ms)=0
          Total time spent by all map tasks (ms)=2474173
          Total vcore-seconds taken by all map tasks=2474173
          Total megabyte-seconds taken by all map tasks=3800329728
      Map-Reduce Framework
          Map input records=100000000
          Map output records=100000000
          Input split bytes=170
          Spilled Records=0
          Failed Shuffles=0
          Merged Map outputs=0
          GC time elapsed (ms)=39813
          CPU time spent (ms)=347490
          Physical memory (bytes) snapshot=975953920
          Virtual memory (bytes) snapshot=3772518400
          Total committed heap usage (bytes)=1979842560
      org.apache.hadoop.examples.terasort.TeraGen$Counters
          CHECKSUM=214760662691937609
      File Input Format Counters
          Bytes Read=0
      File Output Format Counters
          Bytes Written=10000000000
  [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar terasort /gx/tera/input10G /gx/tera/output10G
  15/07/02 21:54:06 INFO terasort.TeraSort: starting
  15/07/02 21:54:07 INFO input.FileInputFormat: Total input paths to process : 2
  Spent 125ms computing base-splits.
  Spent 4ms computing TeraScheduler splits.
  Computing input splits took 130ms
  Sampling 10 splits of 76
  Making 1 from 100000 sampled records
  Computing parititions took 409ms
  Spent 542ms computing partitions.
  15/07/02 21:54:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
  15/07/02 21:54:07 INFO mapreduce.JobSubmitter: number of splits:76
  15/07/02 21:54:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1435833769205_0038
  15/07/02 21:54:08 INFO impl.YarnClientImpl: Submitted application application_1435833769205_0038
  15/07/02 21:54:08 INFO mapreduce.Job: The url to track the job: http://vmaxspark3:8088/proxy/application_1435833769205_0038/
  15/07/02 21:54:08 INFO mapreduce.Job: Running job: job_1435833769205_0038
  15/07/02 21:54:14 INFO mapreduce.Job: Job job_1435833769205_0038 running in uber mode : false
  15/07/02 21:54:30 INFO mapreduce.Job: map 3% reduce 0%
  ...
  15/07/02 21:55:16 INFO mapreduce.Job: map 84% reduce 25%
  ...
  15/07/02 21:57:36 INFO mapreduce.Job: map 100% reduce 76%
  15/07/02 21:58:25 INFO mapreduce.Job: map 100% reduce 100%
  15/07/02 21:58:25 INFO mapreduce.Job: Job job_1435833769205_0038 completed successfully
  15/07/02 21:58:25 INFO mapreduce.Job: Counters: 50
      File System Counters
          FILE: Number of bytes read=8746650762
          FILE: Number of bytes written=13195291236
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=10000008436
          HDFS: Number of bytes written=10000000000
          HDFS: Number of read operations=231
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=2
      Job Counters
          Launched map tasks=76
          Launched reduce tasks=1
          Data-local map tasks=64
          Rack-local map tasks=12
          Total time spent by all maps in occupied slots (ms)=8175426
          Total time spent by all reduces in occupied slots (ms)=906864
          Total time spent by all map tasks (ms)=2725142
          Total time spent by all reduce tasks (ms)=226716
          Total vcore-seconds taken by all map tasks=2725142
          Total vcore-seconds taken by all reduce tasks=226716
          Total megabyte-seconds taken by all map tasks=4185818112
          Total megabyte-seconds taken by all reduce tasks=464314368
      Map-Reduce Framework
          Map input records=100000000
          Map output records=100000000
          Map output bytes=10200000000
          Map output materialized bytes=4406714080
          Input split bytes=8436
          Combine input records=0
          Combine output records=0
          Reduce input groups=100000000
          Reduce shuffle bytes=4406714080
          Reduce input records=100000000
          Reduce output records=100000000
          Spilled Records=299321120
          Shuffled Maps =76
          Failed Shuffles=0
          Merged Map outputs=76
          GC time elapsed (ms)=466626
          CPU time spent (ms)=2438220
          Physical memory (bytes) snapshot=47665377280
          Virtual memory (bytes) snapshot=145613328384
          Total committed heap usage (bytes)=79307472896
      Shuffle Errors
          BAD_ID=0
          CONNECTION=0
          IO_ERROR=0
          WRONG_LENGTH=0
          WRONG_MAP=0
          WRONG_REDUCE=0
      File Input Format Counters
          Bytes Read=10000000000
      File Output Format Counters
          Bytes Written=10000000000
  15/07/02 21:58:25 INFO terasort.TeraSort: done
  [mr@vmaxspark1 mapreduce]$




Available benchmark and testing tools:
   
  [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-*test*
  An example program must be given as the first argument.
  Valid program names are:
    DFSCIOTest: Distributed i/o benchmark of libhdfs.
    DistributedFSCheck: Distributed checkup of the file system consistency.
    JHLogAnalyzer: Job History Log analyzer.
    MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
    SliveTest: HDFS Stress Test and Live Data Verification.
    TestDFSIO: Distributed i/o benchmark.
    fail: a job that always fails
    filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
    largesorter: Large-Sort tester
    loadgen: Generic map/reduce load generator
    mapredtest: A map/reduce test check.
    minicluster: Single process HDFS and MR cluster.
    mrbench: A map/reduce benchmark that can create many small jobs
    nnbench: A benchmark that stresses the namenode.
    sleep: A job that sleeps at each map and reduce task.
    testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
    testfilesystem: A test for FileSystem read/write.
    testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
    testsequencefile: A test for flat files of binary key value pairs.
    testsequencefileinputformat: A test for sequence file input format.
    testtextinputformat: A test for text input format.
    threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
  [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-*example*
  An example program must be given as the first argument.
  Valid program names are:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
    dbcount: An example job that count the pageview counts from a database.
    distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort
    terasort: Run the terasort
    teravalidate: Checking results of terasort
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
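Of the test-jar programs above, mrbench (small-job latency) and nnbench (NameNode stress) are common companions to TestDFSIO. Two invocation sketches follow; the flag values are illustrative, so run each tool without arguments to see its full option list:

  # mrbench: run 50 tiny MapReduce jobs and report the average runtime
  hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar mrbench -numRuns 50

  # nnbench: load the NameNode with many small-file create/write operations
  hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar nnbench -operation create_write \
      -maps 12 -numberOfFiles 1000 -baseDir /benchmarks/NNBench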











Reposted from: https://my.oschina.net/guanxun/blog/517889
