Installing and Configuring a Pseudo-Distributed Hadoop 2.6.0 Environment on Ubuntu 15.10, and Trying Out Hadoop Streaming

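The Ubuntu release I used is Ubuntu 15.10 Beta 2; the final release is apparently not due until the 22nd of this month.
My main references were http://www.powerxing.com/install-hadoop-cluster/ and the book 《Hadoop基础教程》.
My username is wuyouwulv, so wherever wuyouwulv appears in the code below, just replace it with your own username.
Setting up a pseudo-distributed Hadoop environment does not require creating a new group and user, so I use the wuyouwulv account throughout.
All the files I need are in the hadoop2.6 directory at the root of my USB drive. They are: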
    core-site.xml
    hadoop-2.6.0.tar.gz
    hadoop-env.sh
    hdfs-site.xml
    mapred-site.xml
    onenodeinstall.sh
    readme.txt
The contents of the main files are as follows:
core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/wuyouwulv/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hadoop-env.sh (the only real change here is JAVA_HOME)

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/default-java

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/wuyouwulv/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/wuyouwulv/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
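
The site files above all share the same <configuration>/<property> layout, so one quick way to double-check what Hadoop will read is to parse them. The following is only a minimal Python sketch and not part of the original setup: the helper name read_hadoop_properties is hypothetical, and the sample text simply repeats the two properties from core-site.xml above (with the XML declaration left out).

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            # A missing <value> is treated as an empty string.
            props[name.strip()] = (value or "").strip()
    return props


# Sample input: the properties from core-site.xml above (XML declaration omitted).
SAMPLE_CORE_SITE = """
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/wuyouwulv/hadoop/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
"""

if __name__ == "__main__":
    for key, val in sorted(read_hadoop_properties(SAMPLE_CORE_SITE).items()):
        print(key + " = " + val)
```

For a file on disk, ET.parse(path).getroot() could be used in place of ET.fromstring.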

onenodeinstall.sh

#!/bin/bash

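# enable ssh localhost (append the key so existing authorized keys are kept)
ssh-keygen
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh

# install hadoop (the relative paths below assume the home directory)
cd /home/wuyouwulv
tar -zxvf /mnt/usb/hadoop2.6/hadoop-2.6.0.tar.gz -C /home/wuyouwulv/
mv hadoop-2.6.0/ hadoop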

# hadoop environment setting
cp /mnt/usb/hadoop2.6/core-site.xml hadoop/etc/hadoop/core-site.xml
cp /mnt/usb/hadoop2.6/hdfs-site.xml hadoop/etc/hadoop/hdfs-site.xml 
cp /mnt/usb/hadoop2.6/mapred-site.xml hadoop/etc/hadoop/mapred-site.xml
cp /mnt/usb/hadoop2.6/hadoop-env.sh hadoop/etc/hadoop/hadoop-env.sh

hadoop-2.6.0.tar.gz can be downloaded from the Apache Hadoop website.

Java must be installed before Hadoop. I installed the default JDK version:
$ echo $JAVA_HOME
/usr/lib/jvm/default-java
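SSH must be configured so that "ssh localhost" works.
Because all my materials are in the hadoop2.6 directory on the USB drive, I first need to mount the drive at /mnt/usb/ before actually installing Hadoop: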
$ sudo mkdir /mnt/usb
$ sudo mount -t vfat /dev/sdb1 /mnt/usb/
I am going to install Hadoop under ~/hadoop. The installation commands are:
# install hadoop
~$ tar -zxvf /mnt/usb/hadoop2.6/hadoop-2.6.0.tar.gz -C /home/wuyouwulv/
~$ mv hadoop-2.6.0/ hadoop

# hadoop environment setting
~$ cp /mnt/usb/hadoop2.6/core-site.xml hadoop/etc/hadoop/core-site.xml
~$ cp /mnt/usb/hadoop2.6/hdfs-site.xml hadoop/etc/hadoop/hdfs-site.xml
~$ cp /mnt/usb/hadoop2.6/mapred-site.xml hadoop/etc/hadoop/mapred-site.xml
~$ cp /mnt/usb/hadoop2.6/hadoop-env.sh hadoop/etc/hadoop/hadoop-env.sh

Hadoop is now installed, and we can start it:
~$ cd ~/hadoop
~/hadoop$ bin/hdfs namenode -format       # format the NameNode
~/hadoop$ sbin/start-dfs.sh               # start the HDFS daemons
~/hadoop$ jps                             # check whether startup succeeded

If startup succeeded, the following processes are listed: NameNode, DataNode, and SecondaryNameNode.
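
The jps check can also be done programmatically. The following is a minimal Python sketch under stated assumptions: the helper name missing_daemons is hypothetical, and the sample jps output (including the process IDs) is made up for illustration.

```python
EXPECTED_DAEMONS = ("NameNode", "DataNode", "SecondaryNameNode")


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # jps prints "<pid> <name>"; skip blank or pid-only lines.
        if len(parts) >= 2:
            running.add(parts[1])
    return [name for name in expected if name not in running]


# Illustrative jps output; the process IDs are made up.
SAMPLE = """\
4821 NameNode
4963 DataNode
5170 SecondaryNameNode
5302 Jps
"""

if __name__ == "__main__":
    print(missing_daemons(SAMPLE))  # prints [] because nothing is missing
```

On a live machine the text could come from subprocess.check_output(["jps"]).decode() instead of the sample string. Names are compared exactly, so SecondaryNameNode is not mistaken for NameNode.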

Running a Python WordCount MapReduce program with Hadoop Streaming:
~/hadoop$ bin/hdfs dfs -mkdir -p /user/wuyouwulv     # create the user's HDFS home directory
~/hadoop$ bin/hdfs dfs -mkdir input
~/hadoop$ bin/hdfs dfs -copyFromLocal test.txt input     # test.txt contains some words
~/hadoop$ bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
> -file wcmapper.py -mapper wcmapper.py -file wcreducer.py -reducer wcreducer.py \
> -input input -output output
Once the job finishes, the results are written to the output directory.
~/hadoop$ bin/hdfs dfs -cat output/*    # view the output

wcmapper.py:

#!/usr/bin/python
import sys

# Emit "<word>\t1" for every whitespace-separated word read from stdin.
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

wcreducer.py:

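#!/usr/bin/python
import sys

# Input arrives sorted by key, so lines for the same word are adjacent.
current = None
count = 0

for line in sys.stdin:
    word, c = line.rstrip("\n").split("\t", 1)
    if word == current:
        count += int(c)
    else:
        if current is not None:
            print(current + "\t" + str(count))
        current = word
        count = int(c)

# Flush the last word, unless the input was empty.
if current is not None:
    print(current + "\t" + str(count))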

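Note that the wuyouwulv in "bin/hdfs dfs -mkdir -p /user/wuyouwulv" must be the current user, because relative HDFS paths such as input are resolved against /user/<current user>; see http://stackoverflow.com/questions/20821584/hadoop-2-2-installation-no-such-file-or-directory
input and output refer to directories in HDFS, not to local directories.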
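
To make the path rule concrete, here is a minimal Python sketch of how a relative HDFS path maps to an absolute one. It assumes the default /user home-directory prefix, and the helper name resolve_hdfs_path is hypothetical; it only illustrates the rule and does not talk to HDFS.

```python
import posixpath


def resolve_hdfs_path(path, user):
    """Resolve an HDFS path as described above.

    Absolute paths are kept as they are; relative paths are taken
    relative to /user/<user>, the default HDFS home directory.
    """
    if path.startswith("/"):
        return path
    return posixpath.join("/user", user, path)


if __name__ == "__main__":
    for p in ("input", "output", "/user/wuyouwulv"):
        print(p + " -> " + resolve_hdfs_path(p, "wuyouwulv"))
```

This is why the job fails with "No such file or directory" when /user/<current user> has not been created first.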
In the end, this program implements WordCount.
The two Python scripts must be made executable before the job is submitted:
~/hadoop$ chmod a+x *.py
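
The map, sort, and reduce flow that Hadoop Streaming applies to the two scripts can be tried out without a cluster. The following is a minimal in-memory Python sketch, not the scripts themselves: the function names and the sample input are made up for illustration, and sorted() stands in for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: yield (word, 1) for each whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reducer step: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Map, sort by key (standing in for the shuffle), then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    sample = ["hello hadoop", "hello streaming hello"]  # made-up input
    for word, count in word_count(sample):
        print(word + "\t" + str(count))
```

The real scripts can be checked the same way from a shell with: cat test.txt | ./wcmapper.py | sort | ./wcreducer.py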

Reposted from: https://www.cnblogs.com/wuyouwulv/p/hadoop_pseudo_distributed_install_and_hadoop_streaming_python.html
