Hadoop Quick Start

Chatopera R&D Team

Published 2020-10-11 12:40:39


Article link: https://blog.csdn.net/samurais/article/details/109010363


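Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). The core of the Hadoop framework's design is the pair HDFS and MapReduce: HDFS provides storage for massive data sets, and MapReduce provides computation over them.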
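To make the MapReduce half of that sentence concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps for a word count. It is only an illustration of the programming model, not Hadoop code: the function names and the sample lines are assumptions made up for this example, and nothing here touches HDFS.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, the step the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the list of values for each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


# Made-up sample input, standing in for lines of a file stored in HDFS.
sample = ["hdfs stores data", "mapreduce computes over data"]
print(word_count(sample))
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data over the network; the sketch keeps everything in one process so the data flow is easy to follow.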

Download

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/core/hadoop-2.8.1/hadoop-2.8.1.tar.gz

Github

Version / 2.8.1
hadoop-getstarted

Env

export HADOOP_HOME=/opt/apache/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Config

Standalone

hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

hadoop/etc/hadoop/mapred-site.xml

<configuration>
</configuration>

hadoop/etc/hadoop/core-site.xml

<configuration>
        <property>
                <!-- fs.defaultFS is the Hadoop 2.x name; fs.default.name is its deprecated alias -->
                <name>fs.defaultFS</name>
                <value>hdfs://127.0.0.1:9000</value>
        </property>
</configuration>
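A quick way to sanity-check a *-site.xml file is to load its properties into a dictionary. The sketch below does that with the Python standard library. The embedded XML is a stand-in for the core-site.xml above, and the helper names parse_hadoop_conf and default_fs are made up for this example. Hadoop 2.x names the key fs.defaultFS and still accepts the older fs.default.name, so the lookup checks both.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of core-site.xml; in practice read the file from $HADOOP_CONF_DIR.
CORE_SITE = """<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://127.0.0.1:9000</value>
        </property>
</configuration>"""


def parse_hadoop_conf(xml_text):
    """Return {name: value} for every <property> element in a Hadoop config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def default_fs(props):
    """Look up the default file system URI under either the new or the deprecated key."""
    return props.get("fs.defaultFS") or props.get("fs.default.name")


conf = parse_hadoop_conf(CORE_SITE)
print(default_fs(conf))
```

An empty <configuration> element, like the mapred-site.xml and hdfs-site.xml shown in this section, simply yields an empty dictionary.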

hadoop/etc/hadoop/hdfs-site.xml

<configuration>
</configuration>

Start

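# Format the NameNode (first run only; this erases existing HDFS metadata)
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/bin/hdfs getconf -namenodes
# start-all.sh is deprecated in Hadoop 2.x; start HDFS and YARN separately
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh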

Check status

jps
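jps prints one line per running JVM in the form "<pid> <main class>". The sketch below checks such output for the daemons a single-node HDFS plus YARN setup normally runs. The sample output is illustrative (the PIDs are made up), and the helper names are assumptions for this example.

```python
# Daemons expected on a single-node setup once HDFS and YARN are started.
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"}

# Illustrative jps output; the PIDs are invented for this example.
SAMPLE = """12001 NameNode
12145 DataNode
12340 SecondaryNameNode
12502 ResourceManager
12650 NodeManager
12901 Jps
"""


def running_daemons(jps_output):
    """Collect the process names from jps output, ignoring lines without a name."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemons that do not appear in the jps output, sorted by name."""
    return sorted(set(expected) - running_daemons(jps_output))


print(missing_daemons(SAMPLE))
```

To run the check against a live machine, the sample string could be replaced by the output of subprocess.check_output(["jps"], text=True) on Python 3.7 or later.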

Example

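# Run from $HADOOP_HOME (/opt/apache/hadoop)
## Usage
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount

## Put file for processing (a relative destination resolves to the HDFS home directory /user/<username>)
hadoop fs -mkdir -p /user/$(whoami)
hadoop fs -put LICENSE.txt LICENSE.txt

## Submit the job
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount LICENSE.txt LICENSE.wc

## Fetch and view the result
hadoop fs -get LICENSE.wc
cat LICENSE.wc/part-r-00000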
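The wordcount job writes one "word<TAB>count" record per line, so the part-r-00000 file is easy to post-process. Below is a small sketch that parses such lines and lists the most frequent words. The sample records and their counts are made up for illustration, and the function names are assumptions for this example.

```python
def parse_wordcount(lines):
    """Parse 'word<TAB>count' records into a {word: count} dictionary."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up records in the same shape as LICENSE.wc/part-r-00000.
sample = ["Apache\t5\n", "License\t9\n", "the\t12\n", "of\t9\n"]
print(top_words(parse_wordcount(sample), 2))
```

With the real output, the sample list could be replaced by an open file handle on LICENSE.wc/part-r-00000.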

Web Client

# YARN ResourceManager web UI (replace "desert" with your own host name)
http://desert:8088/cluster/cluster

# NameNode (HDFS) web UI
http://desert:50070/dfshealth.html#tab-overview

# Job history server
# http://www.cnblogs.com/luogankun/p/4019303.html
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver

Workflow

[Figure: Hadoop 2 workflow diagram, https://static-public.chatopera.com/backlog/chatbot/images/2017/07/hadoop2.png]

Streaming

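Hadoop Streaming lets us process line-oriented data streams with any executable script: the script reads its input from Unix standard input (STDIN) and writes its output to standard output (STDOUT).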
https://hadoop.apache.org/docs/r2.7.3/hadoop-streaming/HadoopStreaming.html
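As a rough sketch of that contract, the functions below play the role of a streaming mapper and reducer for a word count. They take and return lines instead of touching STDIN and STDOUT directly so they can be tried locally; the names mapper, reducer, and run_locally are assumptions for this example. The reducer relies on records arriving sorted by key, which is what the framework provides between the map and reduce stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print to STDOUT."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    keyed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


def run_locally(lines):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["a b a", "b c"]))
```

In an actual streaming job each function would live in its own script that iterates over sys.stdin and prints every record it produces.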

Example

http://www.cnblogs.com/dandingyy/archive/2013/03/01/2938442.html

Download data

wget http://www.nber.org/patents/Cite75_99.zip -O data/Cite75_99.zip

Python Streaming, RandomSample.py

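#!/usr/bin/env python
# Mapper for Hadoop Streaming: emit roughly N percent of the input lines,
# where N is the first command-line argument.
import random
import sys

percent = int(sys.argv[1])

for line in sys.stdin:
    if random.randint(1, 100) <= percent:
        # print(...) with a single argument works on both Python 2 and Python 3
        print(line.strip())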
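Before submitting the job it can help to check the sampling logic locally. The sketch below wraps the same test used in RandomSample.py in a function and runs it over synthetic lines with a fixed seed. The function name sample_lines, the seed, and the generated rows are assumptions for this example, not part of the original script or data set.

```python
import random


def sample_lines(lines, percent, rng=random):
    """Keep a line when a draw from 1..100 is <= percent, the same test RandomSample.py uses."""
    for line in lines:
        if rng.randint(1, 100) <= percent:
            yield line.strip()


rng = random.Random(42)  # fixed seed so the local check is repeatable
lines = ["row %d\n" % i for i in range(10000)]  # synthetic stand-in for the citation data
kept = list(sample_lines(lines, 10, rng))

# The kept fraction should be close to 0.10 for percent=10.
print(len(kept) / float(len(lines)))
```

Because each line is kept independently, the sample size varies from run to run; the percentage is an expected rate, not an exact count.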

Submit Job

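# Generic options such as -D must come before the streaming options.
# mapreduce.job.reduces is the Hadoop 2.x name for mapred.reduce.tasks.
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar \
        -D mapreduce.job.reduces=1 \
        -input data/cite75_99.txt \
        -output cite75_99_sample \
        -mapper 'RandomSample.py 10' \
        -file RandomSample.py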

No reducer is specified, so the job uses IdentityReducer by default. After the job finishes, use hadoop fs -getmerge to collect the final result.

Breaking changes

TaskTracker and JobTracker are replaced.

In Hadoop 2.0, the JobTracker and TaskTracker no longer exist and have been replaced by three components:

ResourceManager: a scheduler that allocates available resources in the cluster amongst the competing applications.

NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing resources available on a single node.

ApplicationMaster: an instance of a framework-specific library, an ApplicationMaster runs a specific YARN job and is responsible for negotiating resources from the ResourceManager and also working with the NodeManager to execute and monitor Containers.

So as long as you see the ResourceManager process (on the NameNode host) and the NodeManager processes (on the DataNode hosts), you are good to go.
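The division of labour between these components can be pictured with a toy model: an ApplicationMaster asks for containers of a given size, and the ResourceManager places each one on a NodeManager that still has room. The sketch below uses a simple first-fit rule purely for illustration. It is not how YARN's actual schedulers decide, and the node names, memory sizes, and the allocate function are all hypothetical.

```python
def allocate(node_capacity_mb, requests_mb):
    """Toy first-fit placement of container requests onto nodes.

    Returns (placements, pending): placements maps request index to node name,
    pending lists the indexes of requests that did not fit on any node.
    """
    free = dict(node_capacity_mb)  # copy so the caller's dictionary is not modified
    placements, pending = {}, []
    for index, need in enumerate(requests_mb):
        for node in sorted(free):
            if free[node] >= need:
                free[node] -= need
                placements[index] = node
                break
        else:
            # No node had enough free memory for this request.
            pending.append(index)
    return placements, pending


nodes = {"nm1": 4096, "nm2": 2048}   # hypothetical NodeManagers and their memory in MB
requests = [1024, 3072, 2048, 1024]  # hypothetical container requests from an ApplicationMaster
print(allocate(nodes, requests))
```

Requests that end up in the pending list would, in a real cluster, wait until running containers finish and free their resources.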