Running the WordCount Example on Hadoop
cd /opt/moudle/hadoop-2.7.3/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount ~/temp/a.txt ~/temp/out
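Here `~/temp/a.txt` is the input file and `~/temp/out` is the output directory. The output directory must not exist before the job is submitted, otherwise the job aborts with a FileAlreadyExistsException; the results are written to `part-r-00000`-style files inside it.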
The WordCount source code from hadoop-mapreduce-examples-2.7.3.jar
package org.apache.hadoop.examples;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                this.word.set(itr.nextToken());
                context.write(this.word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[(otherArgs.length - 1)]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Source Code Analysis
1. Mapper stage:
   - Convert the incoming Text value to a String.
   - Split the string into individual words.
   - Attach a count of 1 to each word.
   - Emit the (word, 1) pairs to the Reducer stage via the shuffle (a concrete walk-through follows this list).
2. Reducer stage:
   - Aggregate the values by key.
   - Output the total number of occurrences of each key. (Note that the Driver also registers IntSumReducer as a combiner, so partial sums are already computed on the map side.)
3. Driver stage:
   - Create the job.
   - Register the Mapper and Reducer classes to use.
   - Specify the key/value types of the Mapper output.
   - Specify the key/value types of the Reducer output.
   - Specify the input and output paths.
   - Submit the job.
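As a concrete illustration (worked by hand, not the output of an actual run): given the input line `hello world hello`, the Mapper emits `(hello, 1)`, `(world, 1)`, `(hello, 1)`. After the shuffle, the Reducer receives `hello -> [1, 1]` and `world -> [1]`, and writes `hello 2` and `world 1` to the output.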
Hadoop Data Types
| Java data type | Hadoop data type |
| --- | --- |
| byte | ByteWritable |
| short | ShortWritable |
| int | IntWritable |
| long | LongWritable |
| float | FloatWritable |
| double | DoubleWritable |
| boolean | BooleanWritable |
| String | Text |
The Hadoop types implement Hadoop's own serialization interface, Writable, so they can be written to and read back from byte streams as records move between the map and reduce stages.
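As a minimal sketch of what that means (this demo class is not part of the original post; the name WritableDemo is made up), every Writable can serialize itself to a DataOutput and read itself back from a DataInput:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        // Serialize a (word, count) pair the way Hadoop serializes intermediate records.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        new Text("hadoop").write(out);   // Writable.write(DataOutput)
        new IntWritable(1).write(out);

        // Deserialize back into reusable objects.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        Text word = new Text();
        IntWritable count = new IntWritable();
        word.readFields(in);             // Writable.readFields(DataInput)
        count.readFields(in);
        System.out.println(word + "\t" + count.get());   // prints: hadoop	1
    }
}
```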
Writing Your Own Hadoop WordCount
Dependency
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
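The hadoop-client version is pinned to 2.7.3 to match the Hadoop installation used above; mismatched client and cluster versions are a common source of hard-to-diagnose runtime errors.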
Mapper
package wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The key is the byte offset of the line in the input file; the value is the line itself.
        String line = value.toString();
        // Split on single spaces (see the note on tokenization after this class).
        String[] words = line.split(" ");
        for (String w : words) {
            // Emit (word, 1) for every occurrence.
            context.write(new Text(w), new IntWritable(1));
        }
    }
}
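One subtlety worth flagging: `split(" ")` only splits on single spaces, while the built-in example's StringTokenizer treats any run of whitespace as a delimiter. The standalone sketch below (the class name SplitDemo is invented for illustration) shows the difference:

```java
import java.util.Arrays;
import java.util.StringTokenizer;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "hello  world\thello";

        // split(" ") leaves an empty token for the double space and never cuts the tab:
        // prints [hello, , world	hello]
        System.out.println(Arrays.toString(line.split(" ")));

        // StringTokenizer's default delimiter set (" \t\n\r\f") yields: hello, world, hello
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());
        }
    }
}
```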
Reducer
package wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // All counts for a given word arrive together; add them up.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Driver
package wordcount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Hand the Configuration to the job so that -D options and site configs take effect.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
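To run it, package the three classes into a jar (for example with `mvn package`) and submit it the same way as the built-in example, e.g. `hadoop jar wordcount.jar wordcount.WordCountDriver ~/temp/a.txt ~/temp/out` (the jar name here is illustrative); the input path lands in `args[0]` and the output path in `args[1]`.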
Problems Encountered When Running

java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

1. Check that the HADOOP_HOME environment variable is configured and that %HADOOP_HOME%\bin is on PATH.
2. Check that %HADOOP_HOME%\bin contains winutils.exe and hadoop.dll.
3. Check that hadoop.dll is also present in C:\Windows\System32.

Then restart the machine.
If the following error appears instead:

java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V

the hadoop.dll version does not match the Hadoop version. Prebuilt binaries can be downloaded from:

https://github.com/steveloughran/winutils

For example, I am running Hadoop 2.7.3; the hadoop.dll from the hadoop-2.7.1 folder works and resolves the problem.