Hadoop Quick Start

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). The two core pieces of the Hadoop framework are HDFS and MapReduce: HDFS provides storage for massive data sets, while MapReduce provides computation over them.

Download

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/core/hadoop-2.8.1/hadoop-2.8.1.tar.gz

GitHub

Version: 2.8.1
hadoop-getstarted

Env

export HADOOP_HOME=/opt/apache/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Config

Single node (pseudo-distributed)

hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

hadoop/etc/hadoop/mapred-site.xml

<configuration>
</configuration>
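
An empty mapred-site.xml means MapReduce jobs run with the local job runner. To have them submitted to YARN instead (so they appear in the 8088 Web UI later in this guide), a common optional addition, not part of the original minimal setup, is mapreduce.framework.name; written from the shell:

cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
        <property>
                <!-- optional: run MapReduce jobs on YARN instead of the local runner -->
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>
EOF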

hadoop/etc/hadoop/core-site.xml

<configuration>
        <property>
                <!-- fs.default.name is the deprecated alias; fs.defaultFS is the current key in Hadoop 2.x -->
                <name>fs.defaultFS</name>
                <value>hdfs://127.0.0.1:9000</value>
        </property>
</configuration>

hadoop/etc/hadoop/hdfs-site.xml

<configuration>
</configuration>
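
An empty hdfs-site.xml leaves the default block replication factor of 3, which a single node cannot satisfy. A common single-node tweak (an addition here, not in the original) is dfs.replication=1:

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
        <property>
                <!-- single node: keep one copy of each block -->
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>
EOF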

Start

$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/bin/hdfs getconf -namenodes
$HADOOP_HOME/sbin/start-all.sh
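
Note that start-all.sh is deprecated in Hadoop 2.x; it still works, but it simply delegates to the two explicit scripts, which can be run instead:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh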

Check status

jps
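
On a healthy single-node setup, jps should list the five Hadoop daemons; the process IDs below are illustrative:

12001 NameNode
12002 DataNode
12003 SecondaryNameNode
12004 ResourceManager
12005 NodeManager
12006 Jps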

Example

# run from $HADOOP_HOME (/opt/apache/hadoop)
## Usage (running without arguments prints the usage message)
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount

## Put the file for processing (creating the HDFS home directory first if needed)
hadoop fs -mkdir -p /user/$USER
hadoop fs -put LICENSE.txt

## Submit the job
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount LICENSE.txt LICENSE.wc
hadoop fs -get LICENSE.wc
cat LICENSE.wc/part-r-00000
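
The result can also be inspected directly on HDFS, without copying it locally first:

hadoop fs -cat LICENSE.wc/part-r-00000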

Web Client

# ResourceManager Web UI
http://desert:8088/cluster/cluster

# NameNode
http://desert:50070/dfshealth.html#tab-overview

# Job history server
# http://www.cnblogs.com/luogankun/p/4019303.html
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
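
Once started, the history server serves its own web UI, by default on port 19888 (hostname assumed to match the UIs above):

http://desert:19888/jobhistory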

Workflow

(Workflow diagram: https://static-public.chatopera.com/backlog/chatbot/images/2017/07/hadoop2.png)


Streaming

Hadoop Streaming lets us process line-oriented data streams with any executable script: input is read from Unix standard input (STDIN) and results are written to standard output (STDOUT).
https://hadoop.apache.org/docs/r2.7.3/hadoop-streaming/HadoopStreaming.html

Example

http://www.cnblogs.com/dandingyy/archive/2013/03/01/2938442.html

Download data

wget http://www.nber.org/patents/Cite75_99.zip -O data/Cite75_99.zip
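
The archive has to be unpacked before use; assuming it extracts to cite75_99.txt, which the streaming job below expects, and uploading it to HDFS as configured above:

unzip data/Cite75_99.zip -d data
hadoop fs -mkdir -p data
hadoop fs -put data/cite75_99.txt data/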

Python Streaming, RandomSample.py

#!/usr/bin/env python
# Mapper: emit each input line with probability argv[1] percent.
import random
import sys

sample_pct = int(sys.argv[1])
for line in sys.stdin:
    if random.randint(1, 100) <= sample_pct:
        print(line.strip())
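
Because a streaming mapper is just a line filter, the script can be sanity-checked locally before submitting, here keeping roughly 10% of the first thousand lines (paths assumed from the download step above):

chmod +x RandomSample.py
head -n 1000 data/cite75_99.txt | ./RandomSample.py 10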

Submit Job

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar \
        -D mapred.reduce.tasks=1 \
        -input data/cite75_99.txt \
        -output cite75_99_sample \
        -mapper 'RandomSample.py 10' \
        -file RandomSample.py

Generic options such as -D must come before streaming-specific options like -input and -mapper.

By default the IdentityReducer is used; after the job is finished, use getmerge to get the final result.
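
For example (output directory from the job above; the local filename is arbitrary):

hadoop fs -getmerge cite75_99_sample cite75_99_sample.txt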

Breaking changes

TaskTracker and JobTracker have been replaced.

In Hadoop 2.0, the JobTracker and TaskTracker no longer exist and have been replaced by three components:

ResourceManager: a scheduler that allocates available resources in the cluster amongst the competing applications.

NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing resources available on a single node.

ApplicationMaster: an instance of a framework-specific library, an ApplicationMaster runs a specific YARN job and is responsible for negotiating resources from the ResourceManager and also working with the NodeManager to execute and monitor Containers.

So as long as you see the ResourceManager (on the NameNode host) and NodeManager (on the DataNode hosts) processes, you are good to go.
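
A quick check on each host, assuming jps is on the PATH:

jps | egrep 'ResourceManager|NodeManager'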