Word Frequency Counting with the MapReduce API

MapReduce API Operations


The MapReduce Workflow

Reference article: MapReduce Workflow
[Figure: MapReduce workflow diagram]
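As a quick illustration of that flow (using a made-up input line, not data from the referenced article): for the line "hello world hello", the map phase emits ("hello", 1), ("world", 1), ("hello", 1); the shuffle phase sorts and groups these pairs by key into ("hello", [1, 1]) and ("world", [1]); and the reduce phase sums each group, producing ("hello", 2) and ("world", 1).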

Word Frequency Counting API Implementation

Step 1. Environment setup: refer to the HDFS API Operations article.
Step 2. Coding:
Create three classes: a Mapper, a Reducer, and a Driver.
[Figure: project structure with WordCountMapper, WordCountReducer, and WordCountDriver]

  1. Create the Map-phase class WordCountMapper
    WordCountMapper code:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * KEYIN:    input key type of the map phase   (LongWritable, the byte offset of the line)
 * VALUEIN:  input value type of the map phase (Text, the line itself)
 * KEYOUT:   output key type of the map phase  (Text, a single word)
 * VALUEOUT: output value type of the map phase (IntWritable, the count 1)
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Allocate the output key/value objects once at class level and reuse them,
    // which avoids creating new objects inside the loop
    Text outK = new Text();
    IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1 Get one line of input
        String line = value.toString();

        // 2 Split the line into words
        String[] words = line.split(" ");

        // 3 Write out one (word, 1) pair per word
        for (String word : words) {
            // Set the output key to the current word
            outK.set(word);

            // Emit the pair
            context.write(outK, outV);
        }

    }
}
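One detail worth noting: `split(" ")` treats every single space as a delimiter, so runs of consecutive spaces would produce empty tokens that get counted as "words". If the input may contain irregular whitespace (an assumption; the article's sample file is not shown), a slightly more forgiving split is a common tweak:

```java
// Optional variant (not in the original article): split on any run of whitespace
// so that tabs and repeated spaces do not produce empty tokens on non-empty lines.
String[] words = line.trim().split("\\s+");
```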

  2. Create the Reduce-phase class WordCountReducer
    WordCountReducer code:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        // The values for one key arrive grouped together,
        // e.g. for the key "abc" the values are (1, 1)
        for (IntWritable value : values) {
            sum += value.get();
        }

        outV.set(sum);

        // Write out the (word, total count) pair
        context.write(key, outV);
    }
}
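The framework sorts the intermediate keys and guarantees that every value for a given word is delivered to a single call of this reduce method, so the sum computed here is the word's total count over the whole input.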
  3. Write the Driver class WordCountDriver that configures and submits the job
    WordCountDriver code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar by locating the driver class
        job.setJarByClass(WordCountDriver.class);

        // 3 Associate the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4 Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths (the output directory must not exist yet)
        FileInputFormat.setInputPaths(job, new Path("F:\\hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\hdfs\\output1"));

        // 7 Submit the job and exit with 0 on success, 1 otherwise
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
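The input and output paths are hardcoded here for a local test run. A common refinement (shown below as a sketch, not part of the original article) is to take the paths from the command-line arguments and, because summing counts is associative, to reuse the reducer as a combiner so that partial sums are already computed on the map side:

```java
// Optional driver tweaks; the rest of the job setup stays the same.

// Reuse the reducer as a combiner: word counts can be partially summed on the map side.
job.setCombinerClass(WordCountReducer.class);

// Take the paths from the program arguments instead of hardcoding them,
// e.g. args[0] = input file or directory, args[1] = output directory (must not exist yet).
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```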

The input text used here is:
[Figure: contents of the sample input file F:\hello.txt]

The result is written to the file part-r-00000 under the specified output directory; each line of that file is a word and its count, separated by a tab.
[Figure: word count results in part-r-00000]

### Word Frequency Counting with Python and Hadoop Streaming

In a typical implementation using the Hadoop Streaming API with Python, two primary components are involved: a mapper script and a reducer script.

For word frequency counting, the mapper reads input lines from standard input (stdin), splits each line into words, and outputs key-value pairs where the key is an individual word and the value is "1":

```python
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # Remove leading/trailing whitespace characters such as '\n'
    line = line.strip()
    # Split the line into words on whitespace
    words = line.split()
    # Write one tab-separated (word, 1) pair per word to stdout
    for word in words:
        print(f"{word}\t1")
```

The reducer receives these intermediate results through stdin again, but now sorted so that identical keys (words) arrive grouped together. Its task is to sum the counts associated with every unique word and print the final tallies:

```python
#!/usr/bin/env python3
import sys

current_word = None
count_sum = 0
word = None

# Read data from stdin one line at a time
for line in sys.stdin:
    # Strip off any extra spaces/newlines
    line = line.strip()
    # Parse the incoming 'key\tvalue' pair
    try:
        word, count = line.rsplit('\t', 1)
        # Convert the string count back into an integer
        count = int(count)
    except ValueError:
        continue

    if current_word == word:
        count_sum += count
    else:
        if current_word:
            print(f'{current_word}\t{count_sum}')
        current_word = word
        count_sum = count

# Emit the tally for the last word
if current_word == word:
    print(f'{current_word}\t{count_sum}')
```

To run this as a Hadoop Streaming job, save the two scripts as `mapper.py` and `reducer.py`, make sure they have executable permissions (for example via `chmod +x mapper.py`), and pass their paths as the mapper and reducer, together with the input and output paths, when invoking the hadoop streaming jar.
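As a quick sanity check outside Hadoop (assuming a Unix shell and a hypothetical sample file named input.txt), the two scripts can also be chained directly, with a sort step standing in for the shuffle phase: `cat input.txt | ./mapper.py | sort | ./reducer.py` should print the same word counts the streaming job would produce.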