Processing data with Hadoop MapReduce (Python)

1. Hadoop client environment

1. Use a machine that already runs Hadoop services; that way you are talking to the local cluster, and no extra client configuration is needed.

2. To access a remote Hadoop cluster, you need to configure the client files the same way you would when setting up the cluster itself.

For cluster setup, see https://blog.csdn.net/xzpdxz/article/details/86692631 and adjust the configuration accordingly.

Also make sure Java is available in your environment.

2. MapReduce

mapper: think of it as computing over each data split independently

reducer: think of it as merging the per-split results into a combined result

The canonical example is counting word frequencies.
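Conceptually, the whole job computes nothing more than a word-frequency table. As a local sketch of what the cluster will produce (the sample text here is made up, standing in for input.txt):

```python
from collections import Counter

# Made-up sample text standing in for input.txt
text = "the quick brown fox jumps over the lazy dog the fox"

# Counter over the whitespace-split tokens is exactly a word-frequency table
counts = Counter(text.split())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

MapReduce distributes this same computation: mappers tokenize the splits, reducers sum the counts.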

a) Prepare a long English text as input.txt

Upload input.txt to HDFS:

[hadoop_test@hserver1 hadoop_test] # hadoop fs -put input.txt /user/hadoop_test/xxxx/input.txt

b) mapper.py

It splits each line on whitespace and emits one "word<TAB>1" record per word:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:
    # emit "word<TAB>1" for every whitespace-separated token
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
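The mapper's per-line step can be sanity-checked without a cluster. This sketch (with a made-up sample line) replicates the split-and-emit logic as a plain function:

```python
def map_line(line):
    # Same logic as mapper.py: one (word, 1) pair per whitespace-separated token
    return [(word, 1) for word in line.strip().split()]

for word, count in map_line("hello world hello\n"):
    print('%s\t%s' % (word, count))
# hello	1
# world	1
# hello	1
```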

c) reducer.py

It sums the per-word counts emitted by the mapper:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # each input line is "word<TAB>count"; Hadoop sorts lines by key
    # before they reach the reducer, so equal words arrive consecutively
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# flush the last word
if current_word == word and current_word is not None:
    print('%s\t%s' % (current_word, current_count))
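The full mapper → sort → reducer pipeline can be simulated in plain Python; the `sorted` call below stands in for Hadoop's shuffle/sort phase between the two scripts (the sample lines are made up):

```python
import itertools

def map_line(line):
    # mapper.py logic: one (word, 1) pair per token
    return [(w, 1) for w in line.strip().split()]

def reduce_pairs(pairs):
    # sorted() plays the role of Hadoop's shuffle/sort: the reducer
    # sees its input grouped by key, so groupby can sum each run
    pairs = sorted(pairs)
    return [(word, sum(c for _, c in group))
            for word, group in itertools.groupby(pairs, key=lambda p: p[0])]

lines = ["hello world", "hello hadoop"]
pairs = [p for line in lines for p in map_line(line)]
print(reduce_pairs(pairs))  # [('hadoop', 1), ('hello', 2), ('world', 1)]
```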

d) run.sh

Submits the MapReduce job to Hadoop:

#!/bin/bash

HADOOP=hadoop       # the hadoop command; use its full path if hadoop is not on PATH
STREAM=~/hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar  # streaming jar
task_name="lijiacai"      # job name
mapper_num=2              # number of mapper tasks
reducer_num=2             # number of reducer tasks
priority=HIGH             # job priority
capacity_mapper=5000      # mapper capacity limit
capacity_reducer=1000     # reducer capacity limit

mapper_file=./mapper.py   # mapper script (any name works; mapper.py matches the code above)
reducer_file=./reducer.py # reducer script

input_path=/user/hadoop_test/xxxx/input.txt    # input data on HDFS
output_path=/user/hadoop_test/xxx/output       # reducer output directory on HDFS

name="hadoop_test"        # hadoop user name
passwd="123456"           # hadoop user password

# remove the previous output directory first; otherwise the job cannot write to it
$HADOOP fs -rm -r $output_path

$HADOOP jar $STREAM \
        -D mapred.job.name="$task_name" \
        -D mapred.job.priority=$priority \
        -D mapred.map.tasks=$mapper_num \
        -D mapred.reduce.tasks=$reducer_num \
        -D mapred.job.map.capacity=$capacity_mapper \
        -D mapred.job.reduce.capacity=$capacity_reducer \
        -D hadoop.job.ugi="${name},${passwd}" \
        -input ${input_path} \
        -output ${output_path} \
        -mapper $mapper_file \
        -reducer $reducer_file \
        -file $mapper_file \
        -file $reducer_file

Running it produces output like the following:

[hadoop_test@hserver1 hadoop_test] # sh run.sh  
Deleted /user/hadoop_test/lijiacai/output
19/03/20 17:21:05 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-unjar3119144331866952236/] [] /tmp/streamjob6424092896909659797.jar tmpDir=null
19/03/20 17:21:06 INFO client.RMProxy: Connecting to ResourceManager at hserver1/10.58.107.38:8032
19/03/20 17:21:06 INFO client.RMProxy: Connecting to ResourceManager at hserver1/10.58.107.38:8032
19/03/20 17:21:07 INFO mapred.FileInputFormat: Total input files to process : 1
19/03/20 17:21:07 INFO mapreduce.JobSubmitter: number of splits:2
19/03/20 17:21:07 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
19/03/20 17:21:07 INFO Configuration.deprecation: mapred.job.priority is deprecated. Instead, use mapreduce.job.priority
19/03/20 17:21:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/20 17:21:07 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/03/20 17:21:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1548924494331_0031
19/03/20 17:21:07 INFO impl.YarnClientImpl: Submitted application application_1548924494331_0031
19/03/20 17:21:07 INFO mapreduce.Job: The url to track the job: http://hserver1:8088/proxy/application_1548924494331_0031/
19/03/20 17:21:07 INFO mapreduce.Job: Running job: job_1548924494331_0031
19/03/20 17:21:15 INFO mapreduce.Job: Job job_1548924494331_0031 running in uber mode : false
19/03/20 17:21:15 INFO mapreduce.Job:  map 0% reduce 0%
19/03/20 17:21:22 INFO mapreduce.Job:  map 100% reduce 0%
19/03/20 17:21:27 INFO mapreduce.Job:  map 100% reduce 100%
19/03/20 17:21:28 INFO mapreduce.Job: Job job_1548924494331_0031 completed successfully
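The output directory then holds part-* files of tab-separated "word count" lines, which can be loaded back into Python. The sample lines below are made up; real ones would come from `hadoop fs -cat` on the output path:

```python
# Made-up sample of streaming output; real lines come from something like
# "hadoop fs -cat /user/hadoop_test/xxx/output/part-*"
sample = "hadoop\t2\nhello\t5\n"

counts = {}
for line in sample.splitlines():
    word, count = line.split('\t')   # same "word<TAB>count" format reducer.py emits
    counts[word] = int(count)

print(counts)  # {'hadoop': 2, 'hello': 5}
```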
