7. MapReduce Example: Computing the Maximum Temperature (Hadoop Streaming + Python)

1. Goal

  • Use MapReduce to compute the maximum-temperature example from Hadoop: The Definitive Guide, 4th Edition;
  • Use the Hadoop Streaming API;
  • Write the code in Python 3.5.4;

2. Steps

2.1. Downloading the data
import os
from urllib import request


def get_data(remote="ftp://ftp.ncdc.noaa.gov/pub/data/gsod/", local="d:/data/"):
    """Download the NOAA GSOD archives for 1929-1950."""
    if not os.path.exists(local):
        os.makedirs(local)

    start, end = 1929, 1950
    for year in range(start, end + 1):
        file = "gsod_{}.tar".format(year)
        path = "{0}/{1}".format(year, file)
        resp = request.urlretrieve("{0}{1}".format(remote, path), local + file)
        print(resp)
  • Manually extract all archives (each archive extracts into its own folder);
  • Merge each year's data into a single file
import gzip
import os


def merge(year, dir="D:/data/gsod_1929", savedir="D:/data/1/"):
    """Merge the contents of a folder of .gz files into a single text file."""
    if not os.path.exists(savedir):
        os.makedirs(savedir)
    files = os.listdir(dir)

    with open('{0}{1}.txt'.format(savedir, year), 'w') as newfile:
        for i, file in enumerate(files):
            # Text mode ('rt') yields str lines instead of raw bytes.
            with gzip.open(dir + "/" + file, 'rt') as pf:
                for line in pf:
                    # Keep the header row from the first file only.
                    if i and ("YEARMODA" in line):
                        continue

                    newfile.write(line.rstrip("\r\n") + "\n")


def merges(dir="D:/data/"):
    """Merge every extracted gsod_YYYY folder under dir."""
    files = os.listdir(dir)
    for file in files:
        if file != '1':
            merge(file[-4:], dir + file)
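A hypothetical driver call (it assumes the archives were extracted into folders named gsod_1929 through gsod_1950 directly under D:/data/, so that the last four characters of each folder name are the year):

merges()  # writes one merged D:/data/1/<year>.txt per gsod_<year> folder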
2.2. Uploading the files to HDFS
  • Use the Python hdfs module;
    • pip install hdfs;
  • Upload script: PythonDemo -> hadoop -> myhdfs.py
    # HDFS operations
    import os
    from hdfs import InsecureClient

    # url is the NameNode web address: http://host:port
    client = InsecureClient("http://yh001:50070", user="hadoop", root="/")


    def upload(remote_dir="/pings/wordcount/", local_dir="D:/data/1/"):
        """Upload the local files to HDFS, replacing any previous contents."""
        client.delete(remote_dir, True)
        client.makedirs(remote_dir)
        for file in os.listdir(local_dir):
            client.upload(remote_dir, local_dir + file)

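To sanity-check the upload, a minimal sketch using the same client (the paths match the defaults above):

    upload()
    print(client.list("/pings/wordcount/"))  # should show one <year>.txt per merged file
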
2.3. MapReduce
  • How it works: Hadoop Streaming uses the Hadoop streaming API to pass data between the map and reduce functions over stdin (standard input) and stdout (standard output);
  • Implementation: read the input from Python's sys.stdin and write our output to sys.stdout;
2.3.1. map
  • Raw data: split each line on runs of spaces (line = re.split(r" +", line)) and take the elements at indexes 2 and 17 as the date and the daily maximum temperature (the year is the first 4 characters of YEARMODA; Fahrenheit is converted to Celsius);
STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN   PRCP   SNDP   FRSHTT
030050 99999  19291001    45.3  4    40.0  4  1001.6  4  9999.9  0   17.1  4    4.5  4    8.9  999.9    51.1    44.1*  0.00I 999.9  000000
030050 99999  19291002    49.5  4    45.2  4   977.6  4  9999.9  0    9.3  4   17.5  4   29.9  999.9    53.1*   44.1  99.99  999.9  010000
030050 99999  19291003    49.0  4    41.7  4   975.7  4  9999.9  0   10.9  4   10.0  4   23.9  999.9    53.1    46.0  99.99  999.9  010000
  • Code
#!/usr/bin/python3
# mapper
import sys
import re


def toC(qty):
    """Convert Fahrenheit to Celsius (C = 5/9 * (F - 32))."""
    return round(5 / 9 * (qty - 32), 2)


def max_temperature_mapper():
    """Emit a (year, daily maximum temperature in Celsius) pair per record."""
    for line in sys.stdin:
        line = re.split(r" +", line)
        if len(line) > 17:
            qty = line[17]
            if qty != '9999.9':  # 9999.9 marks a missing value
                qty = qty[:-1] if qty.endswith("*") else qty  # drop the quality flag
                print("{0}\t{1}".format(line[2][:4], toC(float(qty))))


max_temperature_mapper()
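Fed the three sample rows above (the header line is skipped because it splits into fewer than 18 fields), the mapper should emit, after stripping the * quality flag and converting to Celsius:

1929	10.61
1929	11.72
1929	11.72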
2.3.2. reduce
  • The data the reducer receives is already sorted by year, but the values are not merged by key the way they are for a Java MapReduce program (hence the groupby below);
#!/usr/bin/python3
# reducer
import sys
from itertools import groupby
from operator import itemgetter


def read1():
    """Test data source."""
    datas = [
             '1929 19.61',
             '1929 19.11',
             '1930 20.72',
             '1930 24.61',
             '1931 21.28'
            ]

    for line in datas:
        yield line.split(" ", 1)


def read2():
    """Hadoop data source."""
    for line in sys.stdin:
        yield line.split("\t", 1)


def max_temperature_reducer():
    """Print the maximum temperature for each year."""
    for year, temperature_list in groupby(read2(), itemgetter(0)):
        # Start at -inf (not 0) so years whose maximum is below zero work too.
        max_temp = float("-inf")
        for t in temperature_list:
            if len(t) > 1:
                temp = float(t[1])
                max_temp = temp if temp > max_temp else max_temp
        print("{0}\t{1}".format(year, max_temp))


max_temperature_reducer()

# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -file max_temperature_mapper.py -mapper max_temperature_mapper.py -file max_temperature_reducer.py -reducer max_temperature_reducer.py -input /pings/1929.txt -output /pings/out/
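
Before submitting to the cluster, the whole pipeline can be sanity-checked on a single machine by simulating the shuffle with sort (a sketch that assumes both scripts are executable and a merged file such as 1929.txt sits in the current directory); alternatively, swapping read2() for read1() in max_temperature_reducer() exercises the reducer against the built-in test data alone.

# cat 1929.txt | ./max_temperature_mapper.py | sort | ./max_temperature_reducer.py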

3. Results

1929	31.11
1930	32.22
1931	42.39
1932	39.11
1933	43.5
1934	44.61
1935	47.78
1936	47.39
1937	46.11
1938	41.11
1939	43.28
1940	43.72
1941	46.22
1942	47.89
1943	48.5
1944	50.72
1945	49.61
1946	46.28
1947	49.0
1948	49.0
1949	48.89
1950	48.89

4. Problems encountered

  • File or directory not found / no execute permission / PipeMapRed.waitOutputThreads(): subprocess failed with code 126
    • Fix:
      • Add #!/usr/bin/python3 at the top of each Python file;
        • Python 3 must be installed on every machine in the cluster;
        • With this shebang, the file or symlink /usr/bin/python3 must actually exist (I installed Python from source and added it to the PATH, but no /usr/bin/python3 file existed, which kept producing PipeMapRed.waitOutputThreads(): subprocess failed with code 126); see the sketch after this list;
      • Add -file max_temperature_mapper.py to the command;
  • PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    • Cause: a bug in the code;
    • Fix: debug and fix the code;
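
A minimal sketch of the symlink fix, assuming a source build placed the interpreter at /usr/local/bin/python3 (adjust to wherever python3 actually lives), run on every node:

# ln -s /usr/local/bin/python3 /usr/bin/python3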

5. Source code

PythonDemo -> hadoop
