Hadoop Study Notes (3): Python Frameworks and Hadoop Streaming

狻猊来当程序媛

Published 2023-03-03 14:26:57


Article link: https://blog.csdn.net/qq_44274736/article/details/129315632


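Typically, a MapReduce application consists of three Java classes: Job, Mapper, and Reducer.

The latter two handle the details of the key-value computation and are connected by the shuffle and sort phase.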

I. Hadoop Streaming

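Hadoop Streaming is a utility packaged as a JAR file that ships with the Hadoop MapReduce distribution.

Like an ordinary Hadoop job, a Streaming job is submitted to the cluster through the job client.

It uses standard Unix streams for input and output: input is always read from stdin, which Python accesses through the sys module.

When Streaming runs a job:

Each mapper task launches the supplied executable in its own process.
The input data is converted to lines of text and piped to the external process's stdin, while output is collected from its stdout.
After the mapper output has been shuffled and sorted, the reducer tasks launch their executable.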
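The data flow described above can be sketched locally in plain Python. This only illustrates the flow and is not how Hadoop implements it: `mapper`, `reducer`, and `run_streaming` are hypothetical names, the word-count logic is just a stand-in computation, and the in-memory `sorted` call takes the place of the distributed shuffle and sort.

```python
import io


def mapper(stdin, stdout):
    # Emit "word<TAB>1" for every word on every input line.
    for line in stdin:
        for word in line.split():
            stdout.write("{}\t1\n".format(word))


def reducer(stdin, stdout):
    # Input arrives sorted by key, so equal keys are adjacent.
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                stdout.write("{}\t{}\n".format(current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write("{}\t{}\n".format(current, total))


def run_streaming(text):
    # 1) Feed the text lines to the mapper's stdin and collect its stdout.
    map_out = io.StringIO()
    mapper(io.StringIO(text), map_out)
    # 2) Stand-in for shuffle and sort: order the mapper output lines.
    shuffled = "".join(sorted(map_out.getvalue().splitlines(keepends=True)))
    # 3) Feed the sorted lines to the reducer and collect its stdout.
    reduce_out = io.StringIO()
    reducer(io.StringIO(shuffled), reduce_out)
    return reduce_out.getvalue()


if __name__ == "__main__":
    print(run_streaming("a b a\nb a\n"), end="")
```

Because each stage only reads lines and writes lines, either function could be replaced by any executable that honors the same stdin/stdout contract.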

1. Running a computation on CSV data with Streaming

# Create a rita-transtats folder under file_system_example
mkdir -p rita-transtats

# Download the data file
wget https://github.com/bbengfort/hadoop-fundamentals/raw/master/data/flight_data.zip

# Unzip the file
unzip flight_data.zip

# Create the Python scripts
vi mapper.py
vi reducer.py

mapper.py

#!/usr/bin/env python
import sys
import csv

SEP = '\t'

class Mapper(object):
    def __init__(self, stream, sep=SEP):
        self.stream = stream
        self.sep = sep

    # Write the key and value, separated by sep, to stdout as a single line
    def emit(self, key, value):
        sys.stdout.write("{}{}{}\n".format(key, self.sep, value))

    def map(self):
        for row in self:
            self.emit(row[3], row[6])

    # Special method: a generator (via yield) that makes the class iterable
    def __iter__(self):
        reader = csv.reader(self.stream)
        for row in reader:
            yield row


if __name__ == "__main__":
    mapper = Mapper(sys.stdin)
    mapper.map()

reducer.py

#!/usr/bin/env python
import sys

# groupby: a memory-safe iterator function (expects input sorted by key)
from itertools import groupby
# itemgetter: an operator function that fetches an item by index
from operator import itemgetter

SEP = '\t'

class Reducer(object):
    def __init__(self, stream, sep=SEP):
        self.stream = stream
        self.sep = sep

    def emit(self, key, value):
        sys.stdout.write("{}{}{}\n".format(key, self.sep, value))

    def reduce(self):
        for current, group in groupby(self, itemgetter(0)):
            total = 0
            count = 0

            for item in group:
                total += item[1]
                count += 1
            # Emit the average once per key, after the whole group is consumed
            self.emit(current, float(total) / float(count))

    def __iter__(self):
        for line in self.stream:
            try:
                parts = line.split(self.sep)
                yield parts[0], float(parts[1])
            except (ValueError, IndexError):  # skip malformed lines
                continue

if __name__ == "__main__":
    reducer = Reducer(sys.stdin)
    reducer.reduce()
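reducer.py relies on `itertools.groupby`, which only groups adjacent equal keys. That is why its input must already be sorted by key, which Hadoop takes care of in the shuffle and sort phase. The sketch below shows the difference; `average_by_key` is a hypothetical helper and the airport codes and delays are made up.

```python
from itertools import groupby
from operator import itemgetter


def average_by_key(pairs):
    # pairs must already be sorted by key for groupby to merge equal keys.
    result = []
    for key, group in groupby(pairs, itemgetter(0)):
        values = [value for _, value in group]
        result.append((key, sum(values) / len(values)))
    return result


unsorted_pairs = [("JFK", 10.0), ("LAX", 4.0), ("JFK", 20.0)]
print(average_by_key(unsorted_pairs))          # JFK shows up twice
print(average_by_key(sorted(unsorted_pairs)))  # one average per airport
```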

2. Running the Streaming job

# Grant execute permission
chmod +x mapper.py
chmod +x reducer.py

# Test locally: average delay per airport
cat flights.csv | ./mapper.py | sort | ./reducer.py

# Run the Streaming job on the cluster
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input flights.csv \
    -output average_delay \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

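II. Advanced MapReduce

1. Combiner

The combiner is the primary MapReduce optimization technique.

A combiner reduces network traffic by performing a mapper-local reduce before the data is sent to the reducers.

As long as the reduce operation is commutative and associative, the combiner can be the same as the reducer.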
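The average computed by reducer.py is a case where the reducer cannot simply be reused as a combiner, because a mean of means is not the overall mean. One common workaround is to have the combiner emit partial (sum, count) pairs, which do merge associatively. A minimal sketch, with hypothetical function names and made-up delay values:

```python
def combine(values):
    # Mapper-local step: collapse raw values into one (sum, count) pair.
    return (sum(values), len(values))


def reduce_partials(partials):
    # Reducer step: merge (sum, count) pairs from all mappers, then divide.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


mapper_a = [10.0, 20.0, 30.0]  # delays seen by one mapper
mapper_b = [50.0]              # delays seen by another mapper

# Correct: merge the partial sums and counts, then divide once.
correct = reduce_partials([combine(mapper_a), combine(mapper_b)])
# Naive: average the per-mapper averages, which weights the mappers equally.
naive = (sum(mapper_a) / len(mapper_a) + sum(mapper_b) / len(mapper_b)) / 2
print(correct, naive)
```

The two results differ because the naive version gives the single record from the second mapper the same weight as the three records from the first.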

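2. Partitioner

The partitioner controls how keys and values are sent to each reducer by dividing up the key space. The default, HashPartitioner, is sufficient for most needs.

It computes a hash of the key and assigns the key to a slice of the key space determined by the number of reducers, which distributes the keys evenly across the reducers.

Given an evenly distributed key space, each reducer receives a roughly equal workload.

[Note: a partitioner can only be created with the Java API.]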
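The arithmetic behind hash partitioning can still be illustrated in Python. This is only a sketch of the idea, not a partitioner that Hadoop can load: `java_string_hash` and `partition` are hypothetical helpers. The hash uses the formula of Java's `String.hashCode` as a stand-in (Hadoop key types such as Text compute their own hash), and the partition step follows the usual pattern of clearing the sign bit and taking the remainder modulo the number of reducers.

```python
def java_string_hash(key):
    # Stand-in hash: h = 31 * h + char code, truncated to 32 bits.
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key, num_reducers):
    # Clear the sign bit so the value is non-negative, then take the modulus.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


keys = ["JFK", "LAX", "ORD", "SFO", "ATL", "JFK"]
assignment = {key: partition(key, 3) for key in keys}
print(assignment)
```

The same key always maps to the same reducer index, which is what guarantees that all values for one key end up in a single reduce call.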

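3. Job chaining

There are two forms: linear job chains and data-flow job chains.

A job chain is a combination of many small jobs, each depending on the output of the previous one; intermediate steps also carry values from the previous job.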
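A linear job chain can be sketched in plain Python by feeding one job's output into the next. Here `run_job` is a hypothetical in-memory stand-in for submitting a MapReduce job; on a cluster each step would be a separate job that reads the previous job's output directory. The flight records and the 15-minute threshold are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, map_fn, reduce_fn):
    # Map every record, sort by key (stand-in for shuffle and sort),
    # then reduce each group of values that share a key.
    mapped = [pair for record in records for pair in map_fn(record)]
    mapped.sort(key=itemgetter(0))
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(mapped, itemgetter(0))]


# Job 1: average delay per airport.
def map_delay(record):
    airport, delay = record
    yield airport, delay


def reduce_mean(airport, delays):
    return airport, sum(delays) / len(delays)


# Job 2: count airports per delay bucket, using job 1's output as input.
def map_bucket(record):
    airport, mean_delay = record
    yield ("late" if mean_delay > 15 else "on_time"), 1


def reduce_count(bucket, ones):
    return bucket, sum(ones)


flights = [("JFK", 10.0), ("LAX", 40.0), ("JFK", 30.0), ("ORD", 5.0)]
averages = run_job(flights, map_delay, reduce_mean)
buckets = run_job(averages, map_bucket, reduce_count)
print(averages)
print(buckets)
```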
