Hadoop 3.0 Installation and Performance Study

Hadoop 3.0 Installation

Environment: Ubuntu 14.04, 64-bit

1. Add the user and group: adduser advhadoop

2. Give the new hadoop user sudo privileges:

sudo gedit /etc/sudoers
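
For reference, a line like the following grants the advhadoop user full sudo rights (modeled on the root entry; editing with visudo is safer, since it validates the file before saving):

advhadoop ALL=(ALL:ALL) ALL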

3. Install SSH

sudo apt-get install openssh-server

After installation, start the SSH server:

sudo /etc/init.d/ssh start

Check whether the SSH service is running:

ps -e | grep ssh

If sshd appears in the output, the service started successfully.

4. Set up passwordless login by generating a key pair (-P "" means an empty passphrase):

ssh-keygen -t rsa -P ""

Append the public key to authorized_keys, which stores the public keys of all clients allowed to log in via SSH as the current user:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

5. Log in via SSH:

ssh localhost

To exit:

exit

6. Install JDK 1.8

Download jdk-8u131-linux-x64.tar.gz, extract it to /usr/lib/jvm/jdk1.8.0_131, and set JAVA_HOME and PATH in ~/.bashrc.
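
A minimal sketch of the ~/.bashrc entries, assuming the JDK was extracted to the path above:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131
export PATH=$JAVA_HOME/bin:$PATH

Run source ~/.bashrc (or open a new shell) for the change to take effect.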

update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_131/bin/java 300

update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_131/bin/javac 300

update-alternatives --config java

update-alternatives --config javac
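
To confirm the selected JDK, both of the following should report version 1.8.0_131:

java -version
javac -version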

7. Download hadoop-3.0.0-alpha1.tar.gz and extract it to /usr/local/advhadoop.

8. chmod +x hadoop-env.sh

./hadoop-env.sh
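
Note that hadoop-env.sh is normally edited rather than executed: the launcher scripts source it, and JAVA_HOME must be set inside it for the daemons to find the JDK. A minimal line to add to etc/hadoop/hadoop-env.sh, assuming the JDK path from step 6:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131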

9. Running bin/hadoop will display the usage documentation for the hadoop script.

10. Pseudo-distributed configuration

Use the following etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

11. Restrict permissions on the key file: chmod 0600 ~/.ssh/authorized_keys

12. Start Hadoop

Format a new distributed filesystem: bin/hdfs namenode -format

Start the NameNode and DataNode daemons: sbin/start-dfs.sh

The Hadoop daemon logs are written to the $HADOOP_LOG_DIR directory (which defaults to $HADOOP_HOME/logs).
Browse the web interface for the NameNode; by default it is available at:

NameNode: http://localhost:9870/
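
As a quick sanity check (an addition to the original steps, assuming this pseudo-distributed setup), jps should list the three HDFS daemons:

jps

Expected to list, with varying PIDs: NameNode, DataNode, SecondaryNameNode.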

13. Run the grep example from the MapReduce examples jar

$ bin/hdfs dfs -mkdir /user

$ bin/hdfs dfs -mkdir /user/hduser

$ bin/hdfs dfs -mkdir /user/hduser/input

$ bin/hdfs dfs -put etc/hadoop/*.xml /user/hduser/input

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar grep /user/hduser/input output 'dfs[a-z.]+'

$ bin/hdfs dfs -get output output

$ cat output/*

$ bin/hdfs dfs -cat output/*

14. Stop Hadoop: sbin/stop-dfs.sh

 


 

Terminal output:

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs namenode -format

2017-05-02 00:00:53,219 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   user = advhadoop

STARTUP_MSG:   host = happy-Lenovo-IdeaPad-Y480/127.0.1.1

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 3.0.0-alpha1

…...

2017-05-02 00:01:09,112 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0

2017-05-02 00:01:09,118 INFO util.ExitUtil: Exiting with status 0

2017-05-02 00:01:09,122 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at happy-Lenovo-IdeaPad-Y480/127.0.1.1

************************************************************/

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ sbin/start-dfs.sh

Starting namenodes on [localhost]

localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.

Starting datanodes

localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.

Starting secondary namenodes [happy-Lenovo-IdeaPad-Y480]

happy-Lenovo-IdeaPad-Y480: Warning: Permanently added 'happy-lenovo-ideapad-y480' (ECDSA) to the list of known hosts.

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -mkdir /user

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -mkdir /user/hduser

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -mkdir /user/hduser/input

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -put etc/hadoop/*.xml /user/hduser/input

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar grep /user/hduser/input output 'dfs[a-z.]+'

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -get output output

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat output/*

cat: output/output: Is a directory

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat output/output

cat: output/output: Is a directory

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat output/output/*

1 dfsadmin

1 dfs.replication

 

Issues:

1. JDK 1.8 must be installed; otherwise Hadoop 3.0 will not start properly.

2. After re-running the format, the NameNode reports an error. The fix:

First stop Hadoop;
clear the files under the path configured by hadoop.tmp.dir (default: /tmp/hadoop-${user.name});
then run hadoop namenode -format;
finally restart Hadoop.
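
A sketch of that recovery sequence, assuming the default hadoop.tmp.dir and the advhadoop user from this walkthrough:

sbin/stop-dfs.sh
rm -rf /tmp/hadoop-advhadoop/*
bin/hdfs namenode -format
sbin/start-dfs.sh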

 

 

Hadoop 3.0 Performance Study

New features

1. The minimum required Java version is now Java 8; users on Java 7 or lower must upgrade.

2. HDFS supports erasure coding (EC). EC protects against data loss while avoiding the multiplied storage cost of full replication. Its drawbacks: recovering data incurs extra network traffic, because the parity blocks must be read in addition to the surviving data blocks; and storing or recovering files requires encoding and decoding, which costs CPU. EC storage is therefore recommended for cold data: cold data is large in volume, so cutting replicas saves substantial space, and it is stable, so the occasional recovery has little business impact.
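
As a hedged illustration of applying EC per directory, using the hdfs ec subcommand as it appears in Hadoop 3.x releases (the /cold-data path and the RS-6-3-1024k policy name are placeholders, and the exact flags may differ in the alpha builds):

bin/hdfs ec -listPolicies
bin/hdfs ec -setPolicy -path /cold-data -policy RS-6-3-1024k
bin/hdfs ec -getPolicy -path /cold-data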

Changes in Hadoop Common

The core has been slimmed down: deprecated APIs and implementations were removed, and hftp has been dropped in favor of webhdfs.
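
For instance, the WebHDFS REST API that replaces hftp can be queried directly over HTTP (a sketch assuming the NameNode web port 9870 used in this install):

curl -i "http://localhost:9870/webhdfs/v1/user?op=LISTSTATUS"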

Classpath isolation prevents conflicts between different versions of a jar, e.g. Google Guava conflicts that arise when mixing Hadoop, HBase, and Spark. MapReduce has a parameter to ignore the jars in the Hadoop environment in favor of user-submitted third-party jars, but that does not help for submitted Spark jobs, which previously required shading the needed third-party classes into your own jar or upgrading the whole Spark environment. With classpath isolation, users can conveniently choose their own third-party dependencies. See HADOOP-11656.

The Hadoop shell scripts have been rewritten, fixing many bugs, adding new features, and supporting dynamic commands.

The Hadoop NameNode now supports deployments with one active and multiple standby nodes. In hadoop-2.x, the ResourceManager already supported this.
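
A minimal hdfs-site.xml sketch of declaring such a NameNode set (mycluster and nn1, nn2, nn3 are placeholder names; a real HA setup also needs the RPC/HTTP address and failover properties, omitted here):

<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2,nn3</value>
</property>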

MapReduce task-level native optimization: MapReduce gains a native implementation of the map output collector, which can improve execution efficiency by about 30% for shuffle-intensive jobs.
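
The collector implementation is selectable per job. As a sketch to verify against your build (job.jar and MyJob are placeholders, and the delegator class below comes from the native-task module; treat it as an assumption):

hadoop jar job.jar MyJob -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator ...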

Automatic inference of memory parameters. In Hadoop 2.0, setting memory parameters for a MapReduce job was tedious, involving two parameters: mapreduce.{map,reduce}.memory.mb and mapreduce.{map,reduce}.java.opts. Setting them inconsistently wastes memory badly; for example, with the former set to 4096 MB but the latter set to "-Xmx2g", the remaining 2 GB can never be used by the Java heap.
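
The mismatch described above looks like this in mapred-site.xml (the values are illustrative):

<property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx2g</value>
</property>

Only 2 GB of the 4096 MB container can be used by the Java heap; the rest is stranded unless the two values are kept consistent.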

Hadoop YARN: cgroups add memory and disk I/O isolation; Timeline Service v2; YARN container resizing; and more.

 

Benchmarks

1. TestDFSIO

Running the tests jar with no program name lists the available benchmarks:

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar

An example program must be given as the first argument.

Valid program names are:

  DFSCIOTest: Distributed i/o benchmark of libhdfs.

  DistributedFSCheck: Distributed checkup of the file system consistency.

  JHLogAnalyzer: Job History Log analyzer.

  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures

  NNdataGenerator: Generate the data to be used by NNloadGenerator

  NNloadGenerator: Generate load on Namenode using NN loadgenerator run WITHOUT MR

  NNloadGeneratorMR: Generate load on Namenode using NN loadgenerator run as MR job

  NNstructureGenerator: Generate the structure to be used by NNdataGenerator

  SliveTest: HDFS Stress Test and Live Data Verification.

  TestDFSIO: Distributed i/o benchmark.

  fail: a job that always fails

  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)

  largesorter: Large-Sort tester

  loadgen: Generic map/reduce load generator

  mapredtest: A map/reduce test check.

  minicluster: Single process HDFS and MR cluster.

  mrbench: A map/reduce benchmark that can create many small jobs

  nnbench: A benchmark that stresses the namenode w/ MR.

  nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.

  sleep: A job that sleeps at each map and reduce task.

  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce

  testfilesystem: A test for FileSystem read/write.

  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.

  testsequencefile: A test for flat files of binary key value pairs.

  testsequencefileinputformat: A test for sequence file input format.

  testtextinputformat: A test for text input format.

  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

  timelineperformance: A job that launches mappers to test timline service performance.

 

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 10MB

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB

 

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat TestDFSIO_results.log

----- TestDFSIO ----- : write

            Date & time: Tue May 02 11:50:08 CST 2017

        Number of files: 10

 Total MBytes processed: 100

      Throughput mb/sec: 103.84

Total Throughput mb/sec: 0.03

 Average IO rate mb/sec: 112.82

  IO rate std deviation: 26.36

     Test exec time sec: 3.97

 

----- TestDFSIO ----- : read

            Date & time: Tue May 02 11:52:09 CST 2017

        Number of files: 10

 Total MBytes processed: 100

      Throughput mb/sec: 546.45

Total Throughput mb/sec: 0.04

 Average IO rate mb/sec: 647.18

  IO rate std deviation: 257.8

     Test exec time sec: 2.82

 

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar TestDFSIO -clean

2017-05-02 18:25:01,647 INFO fs.TestDFSIO: TestDFSIO.1.8

2017-05-02 18:25:01,652 INFO fs.TestDFSIO: nrFiles = 1

2017-05-02 18:25:01,652 INFO fs.TestDFSIO: nrBytes (MB) = 1.0

2017-05-02 18:25:01,652 INFO fs.TestDFSIO: bufferSize = 1000000

2017-05-02 18:25:01,652 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO

2017-05-02 18:25:02,500 INFO fs.TestDFSIO: Cleaning up test files

 

2. TeraSort

A complete TeraSort benchmark runs in three steps:

1) TeraGen generates the random input data;
2) TeraSort is run on the input data;
3) TeraValidate verifies the sorted output.

The input data does not need to be regenerated for every test; once generated, subsequent runs can skip the first step.

TeraGen is used as follows:

hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>

The following command runs TeraGen to generate 1 GB of input data (10,000,000 100-byte rows) into the directory /examples/terasort-input:

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar teragen 10000000 /examples/terasort-input

 

The following command runs TeraSort on the generated data and writes the sorted result to /examples/terasort-output:

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar terasort /examples/terasort-input /examples/terasort-output

 

The following command runs TeraValidate to verify that the TeraSort output is sorted; if problems are detected, the out-of-order keys are written to /examples/terasort-validate:

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar teravalidate /examples/terasort-output /examples/terasort-validate

 

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop fs -count /examples/terasort-validate

           1            3         1000000000 /examples/terasort-validate

 

3. nnbench

nnbench tests NameNode load: it generates many HDFS-related requests, placing heavy stress on the NameNode. The test can simulate creating, reading, renaming, and deleting files on HDFS. It is invoked as in the example below.

The following example uses 12 mappers and 6 reducers to create 1000 files:

advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /bench/NNBench-`hostname -s`

…...

DataLines Maps Reduces AvgTime (milliseconds)

1 2 1 1124

 

The result above indicates an average job completion time of 11 seconds.

 


 

Comparison with Hadoop 2.7.1

1. TestDFSIO

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar

 

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 10MB

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB

 

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ cat TestDFSIO_results.log

----- TestDFSIO ----- : write

           Date & time: Wed May 03 00:14:30 CST 2017

       Number of files: 10

Total MBytes processed: 100.0

     Throughput mb/sec: 121.50668286755771

Average IO rate mb/sec: 128.30081176757812

 IO rate std deviation: 23.2361211216607

    Test exec time sec: 4.6

 

----- TestDFSIO ----- : read

           Date & time: Wed May 03 00:15:21 CST 2017

       Number of files: 10

Total MBytes processed: 100.0

     Throughput mb/sec: 257.0694087403599

Average IO rate mb/sec: 282.7495422363281

 IO rate std deviation: 65.24276389107759

    Test exec time sec: 2.828

 

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -clean

 

2. TeraSort

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teragen 10000000 /examples/terasort-input

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort /examples/terasort-input /examples/terasort-output

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teravalidate /examples/terasort-output /examples/terasort-validate

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop fs -count /examples/terasort-validate

           1            3         1000000000 /examples/terasort-validate

 

 

 

3. nnbench

hadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /bench/NNBench-`hostname -s`

…...

17/05/03 00:32:40 INFO hdfs.NNBench: -------------- NNBench -------------- :

17/05/03 00:32:40 INFO hdfs.NNBench:                                Version: NameNode Benchmark 0.4

17/05/03 00:32:40 INFO hdfs.NNBench:                            Date & time: 2017-05-03 00:32:40,41

17/05/03 00:32:40 INFO hdfs.NNBench:

17/05/03 00:32:40 INFO hdfs.NNBench:                         Test Operation: create_write

17/05/03 00:32:40 INFO hdfs.NNBench:                             Start time: 2017-05-03 00:32:34,942

17/05/03 00:32:40 INFO hdfs.NNBench:                            Maps to run: 12

17/05/03 00:32:40 INFO hdfs.NNBench:                         Reduces to run: 6

17/05/03 00:32:40 INFO hdfs.NNBench:                     Block Size (bytes): 1

17/05/03 00:32:40 INFO hdfs.NNBench:                         Bytes to write: 0

17/05/03 00:32:40 INFO hdfs.NNBench:                     Bytes per checksum: 1

17/05/03 00:32:40 INFO hdfs.NNBench:                        Number of files: 1000

 

The result above indicates an average job completion time of 14 seconds.

 

From the results, Hadoop 3.0 shows a large gain over Hadoop 2.7.1 in TestDFSIO read throughput (546 vs 257 MB/s), with write throughput in the same range (104 vs 122 MB/s); the TeraSort results show no difference; and the nnbench NameNode load test also shows a clear improvement.


Installing Hadoop 3.0 shows that the process is largely the same as for Hadoop 2.7.1. On the performance side, this exercise covered the standard Hadoop benchmarking methods, and overall Hadoop 3.0 shows a fair improvement. Note also that the two versions use different default NameNode web UI ports (9870 in 3.0 versus 50070 in 2.7.1).

