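Installing Hadoop on Ubuntu (Single Node)
Make sure Java is installed
Hadoop is written in Java, so a Java environment must be installed first. Either Oracle JDK or OpenJDK works. For the supported versions, see the official wiki: https://wiki.apache.org/hadoop/HadoopJavaVersions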
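Before going further, it can help to confirm which Java version is actually installed. The following is a minimal sketch, not part of the original steps: java_major_version is a hypothetical helper name, and it assumes the version appears in double quotes on the first line printed by java -version, as current Oracle and OpenJDK builds do.

```shell
# Print the major Java version found in one line of `java -version` output.
# Handles the legacy "1.8.0_191" scheme and the newer "11.0.2" scheme.
# (java_major_version is a hypothetical helper, not a Hadoop or JDK tool.)
java_major_version() (
    version_line=$1
    # Pull out the quoted version string, e.g. 1.8.0_191
    ver=$(printf '%s\n' "$version_line" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    [ -n "$ver" ] || exit 1
    case $ver in
        1.*) ver=${ver#1.} ;;   # legacy scheme: 1.8.0_191 -> 8.0_191
    esac
    # Keep only the digits before the first '.', '_' or '-'
    printf '%s\n' "${ver%%[._-]*}"
)
```

One way to use it is java_major_version "$(java -version 2>&1 | head -n 1)", since java -version writes to standard error; compare the result with the versions listed on the wiki page above.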
Set up passwordless ssh login to localhost
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Once this is set up, ssh localhost should log in without asking for a password:
$ ssh localhost
Exit the ssh login shell:
$ exit
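Running the cat command above more than once appends the same key to authorized_keys each time. The sketch below shows one way to make that step idempotent. ensure_authorized_key is a hypothetical helper name, and the paths are passed in as arguments rather than assumed.

```shell
# Append a public key to an authorized_keys file only if it is not already
# present, then restrict the file's permissions as the steps above do.
# (ensure_authorized_key is a hypothetical helper, not an OpenSSH tool.)
ensure_authorized_key() (
    pub_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -x: whole-line match, -F: fixed string, -f: read the pattern from pub_file
    if ! grep -qxFf "$pub_file" "$auth_file"; then
        cat "$pub_file" >> "$auth_file"
    fi
    chmod 0600 "$auth_file"
)
```

For the setup described here the call would be ensure_authorized_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys.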
Install Hadoop
- Download the release archive
$ wget https://www-us.apache.org/dist/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz
- Move it to /usr/local and extract it there
$ sudo mv hadoop-2.9.1.tar.gz /usr/local/
$ sudo tar -zxvf /usr/local/hadoop-2.9.1.tar.gz -C /usr/local
- From the Hadoop home directory, run
$ bin/hadoop
The usage message is printed:
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
- Edit ~/.bashrc to add Hadoop to the PATH, so that the hadoop command can be run directly from a terminal
$ gedit ~/.bashrc
Append the following at the end of the file (set JAVA_HOME to match your installation):
export JAVA_HOME=/usr/local/jdk1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export HADOOP_HOME=/usr/local/hadoop-2.9.1
export PATH=.:${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH
Apply the change:
$ source ~/.bashrc
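After reloading ~/.bashrc, a quick check that the Hadoop directories really ended up on the PATH can save confusion later. This is a small sketch under stated assumptions: path_contains is a hypothetical helper, and the directories to check are the HADOOP_HOME/bin and HADOOP_HOME/sbin values exported above.

```shell
# Succeed if the given directory is one of the entries in a PATH-style string.
# The second argument defaults to the current $PATH.
# (path_contains is a hypothetical helper name.)
path_contains() (
    dir=$1
    path_value=${2:-$PATH}
    # Wrap both sides in ':' so only whole entries match
    case ":$path_value:" in
        *":$dir:"*) exit 0 ;;
        *) exit 1 ;;
    esac
)
```

For example, path_contains "$HADOOP_HOME/bin" && path_contains "$HADOOP_HOME/sbin" should succeed in a new terminal; if it does not, the export lines were probably not appended or not sourced.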
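Running Hadoop (pseudo-distributed mode)
- Change the owner of the Hadoop home directory to the current user ($USER is the current user; name a different user or group explicitly if needed):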
$ sudo chown -R $USER:$USER /usr/local/hadoop-2.9.1
- Edit /usr/local/hadoop-2.9.1/etc/hadoop/core-site.xml so that it contains:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.9.1/mydata/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
- Edit /usr/local/hadoop-2.9.1/etc/hadoop/hdfs-site.xml so that it contains:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
- Format the HDFS filesystem:
$ hdfs namenode -format
- Update hadoop-env.sh
$ sudo gedit /usr/local/hadoop-2.9.1/etc/hadoop/hadoop-env.sh
Change the JAVA_HOME line as follows:
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/local/jdk1.8
- Start the NameNode and DataNode daemons:
$ start-dfs.sh
- Open http://localhost:50070/ to see the NameNode information in the web interface
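Besides the web interface, the JDK's jps tool lists running Java processes, which gives a quick way to confirm the daemons came up. The sketch below reads jps output and reports what is missing. missing_hdfs_daemons is a hypothetical helper, and the expectation that SecondaryNameNode also appears after start-dfs.sh is an assumption about a default single-node setup, not something the steps above state.

```shell
# Read `jps` output on stdin and print each expected HDFS daemon that is absent.
# Exit status is 0 only when nothing is missing.
# (missing_hdfs_daemons is a hypothetical helper; the daemon list is an
# assumption for a default single-node setup.)
missing_hdfs_daemons() (
    jps_output=$(cat)
    missing=0
    for daemon in NameNode DataNode SecondaryNameNode; do
        # jps prints "<pid> <name>"; require an exact match on the second field
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
            missing=1
        fi
    done
    exit "$missing"
)
```

It would be invoked as jps | missing_hdfs_daemons. If a name is printed, the matching log file under the Hadoop logs directory is usually the first place to look.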
Running a MapReduce job, using the example programs bundled with Hadoop
- Go to the Hadoop home directory
$ cd /usr/local/hadoop-2.9.1/
- Create the HDFS directories
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/$USER
- Copy files into HDFS
$ hdfs dfs -put etc/hadoop input
- Run the example program
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar grep input output 'dfs[a-z.]+'
- The command above creates an output directory alongside input. View the generated files with:
$ hdfs dfs -cat output/*
The output looks like this:
6 dfs.audit.logger
4 dfs.class
3 dfs.logger
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfs.log
1 dfs.file
1 dfs.servers
1 dfsadmin
1 dfsmetrics.log
1 dfs.replication
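To get a feel for what the grep example computes, the same counts can be approximated on the local filesystem with standard tools: extract every match of the regular expression, then count occurrences and sort by count. This is only a local sketch of the idea, not how the MapReduce job is implemented. count_matches is a hypothetical helper, it assumes GNU grep, and the order of entries with equal counts may differ from the Hadoop output.

```shell
# Count every match of an extended regex in the files under a directory and
# print "count<TAB>match", highest count first (ties ordered by match text).
# (count_matches is a hypothetical helper; assumes GNU grep for -r/-o/-h.)
count_matches() (
    dir=$1
    regex=$2
    LC_ALL=C
    export LC_ALL
    # -r: recurse, -h: no file names, -o: print only the matched text
    grep -rhoE "$regex" "$dir" | sort | uniq -c |
        awk '{ print $1 "\t" $2 }' | sort -k1,1nr -k2,2
)
```

Run from the Hadoop home directory, count_matches etc/hadoop 'dfs[a-z.]+' should give counts comparable to the listing above.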
Stopping HDFS
To stop HDFS, run:
$ stop-dfs.sh
Running YARN
- Edit etc/hadoop/mapred-site.xml (Hadoop 2.x ships only mapred-site.xml.template, so copy it to mapred-site.xml first if the file does not exist):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- Edit etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
- Start YARN (make sure HDFS is already running)
$ start-yarn.sh
Open http://localhost:8088/ in a browser to reach the ResourceManager web interface
- Run a MapReduce job. This is the same example as before, so delete the output directory first:
$ hadoop fs -rm -r output
$ cd /usr/local/hadoop-2.9.1/
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar grep input output 'dfs[a-z.]+'
The console output differs from the earlier run: this time the client connects to the ResourceManager on local port 8032:
18/11/22 23:12:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/11/22 23:12:51 INFO input.FileInputFormat: Total input files to process : 29
18/11/22 23:12:51 INFO mapreduce.JobSubmitter: number of splits:29
18/11/22 23:12:52 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
Refresh http://localhost:8088/ to see the job's status and result
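When the example is rerun often, removing the output directory by hand is easy to forget, and the job refuses to start if the output path already exists. A minimal sketch of a guard, using the hdfs dfs -test -d and hdfs dfs -rm -r shell commands, follows. remove_output_if_present is a hypothetical helper name.

```shell
# Remove an HDFS output directory if it exists, so that a rerun does not fail
# because the output path is already present.
# (remove_output_if_present is a hypothetical helper name.)
remove_output_if_present() {
    out_dir=$1
    # `hdfs dfs -test -d` exits 0 when the path exists and is a directory
    if hdfs dfs -test -d "$out_dir"; then
        hdfs dfs -rm -r "$out_dir"
    fi
}
```

Calling remove_output_if_present output before the hadoop jar command makes the rerun safe whether or not an earlier run left results behind.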
Stopping YARN
To stop YARN, run:
$ stop-yarn.sh