WordCount example:
1. Requirement: given a text file, count the total number of occurrences of each word in it.
2. Sample data:
// file contents:
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
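Before writing any Hadoop code, the expected result for the sample data above can be checked with a plain-Java sketch (hypothetical class name, no Hadoop needed; the file contents are inlined as a string):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedCounts {
    // Count word occurrences exactly as the MapReduce job will:
    // split each line on spaces, add 1 per occurrence
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The twelve sample lines above, inlined
        String text = "hello world\natguigu atguigu\nhadoop\nspark\n"
                    + "hello world\natguigu atguigu\nhadoop\nspark\n"
                    + "hello world\natguigu atguigu\nhadoop\nspark";
        System.out.println(count(text));
        // prints {hello=3, world=3, atguigu=6, hadoop=3, spark=3}
    }
}
```

These are the totals the finished job should produce in its part-r-00000 output file.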
3. pom.xml configuration in IDEA:
<!-- source encoding -->
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<!-- dependencies -->
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.8.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.8.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.8.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.8.4</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.16.10</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.7</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<!-- compiler plugin: set the Java version and encoding -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.5.1</version>
<configuration>
<encoding>UTF-8</encoding>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
4. Analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
The MapReduce program:
1) Write the Mapper class:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * <LongWritable, Text, Text, IntWritable>
 *
 * @author Jds
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text k = new Text();
    private final IntWritable v = new IntWritable(1);

    /**
     * In IDEA, Ctrl + O brings up the inherited methods to override.
     */
    @Override
    protected void map(LongWritable key,
                       Text value,
                       Context context) throws IOException, InterruptedException {
        // 1. Convert the Text value to a String
        String line = value.toString();
        // 2. Split the line on spaces
        String[] words = line.split(" ");
        // 3. Emit <word, 1> for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
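For a single input line such as "atguigu atguigu", the map step above emits one <word, 1> pair per token. The per-line logic can be sketched in plain Java (hypothetical class name, standing in for the Hadoop types):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapStep {
    // Mirror of the map() body: split on spaces, emit (word, 1) per token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("atguigu atguigu"));
        // prints [atguigu=1, atguigu=1]
    }
}
```

Note that the mapper does no summing at all; duplicate keys are emitted as separate pairs, and the framework groups them before the reduce step.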
2) Write the Reducer class:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * <Text, IntWritable, Text, IntWritable>
 *
 * @author Jds
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key,
                          Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        // 1. Initialize the count
        int count = 0;
        // 2. Sum the counts for this key
        for (IntWritable value : values) {
            count += value.get();
        }
        v.set(count);
        // 3. Write out the total count
        context.write(key, v);
    }
}
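Between map and reduce, the framework's shuffle/sort phase groups all values by key, so reduce() receives, e.g., <"atguigu", [1, 1, 1, 1, 1, 1]> with keys arriving in sorted order. That grouping-plus-summing can be sketched in plain Java (hypothetical class name; the TreeMap stands in for the framework's sorted grouping):

```java
import java.util.List;
import java.util.TreeMap;

public class ReduceStep {
    // Mirror of reduce(): sum the grouped 1s for one key
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        // The framework delivers keys in sorted order, like a TreeMap
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("atguigu", List.of(1, 1, 1, 1, 1, 1));
        grouped.put("hello", List.of(1, 1, 1));
        grouped.forEach((k, vs) -> System.out.println(k + "\t" + reduce(vs)));
        // prints:
        // atguigu	6
        // hello	3
    }
}
```

The tab-separated key/value lines are also the format the real job writes to part-r-00000.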
3) Write the Driver class:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author Jds
 */
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Windows input path and output path, hard-coded for a local run
        args = new String[]{"C:\\Users\\Jds\\Desktop\\mapreduce\\aaa.txt",
                "C:\\Users\\Jds\\Desktop\\mapreduce\\A1"};
        // Get the configuration and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Register the three classes by reflection
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Map output K,V types (Text, IntWritable) are the Reduce input types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Reduce output K,V types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output directories
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job: waitForCompletion() calls submit() internally, then
        // blocks until the job finishes and returns whether it succeeded
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
Running the MR jar on Linux:
1. A job run on Linux must not use the Windows paths, so comment out the code that hard-codes the Windows args.
2. Package the project: in the Maven panel on the right --> project name --> Lifecycle --> double-click clean --> double-click package.
3. When the build finishes, a target directory is created under the project, containing a project-name-version.jar file,
e.g. MapReduce-1.0-SNAPSHOT.jar.
4. Copy the .jar file to the Linux machine.
5. In the directory on Linux that holds the .jar, run:
hadoop jar MapReduce-1.0-SNAPSHOT.jar WordCount.WordCountDriver /aaa /aaa1
/aaa: the input text on HDFS to process; /aaa1: the output directory (must not already exist).
6. View the results:
hadoop fs -cat /aaa1/part-r-00000