Implementing Word Count with Hadoop MapReduce (WordCount)

Original post, 2018-04-16 20:59:02



Environment: CentOS 7 + IntelliJ IDEA


This program is built with IDEA's Maven support, mainly because Maven spares you the trouble of setting up a local Hadoop environment: the required libraries are simply declared in the project configuration file. If you have not installed IDEA yet, see the companion post on installing IntelliJ IDEA on Linux.

(1) Create a new Java project named WordCount. If you are not sure how to create a Maven-based Java project in IDEA, see the companion post on creating your first Java program with IDEA's Maven support.

Add the dependencies the project needs to pom.xml; the full file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.miaozhen.lyf</groupId>
    <artifactId>Test</artifactId>
    <version>1.0-SNAPSHOT</version>
<!-- Everything above is generated when the project is created; the rest is added by hand -->
    <repositories>
        <repository>
            <id>apache</id>
            <url>https://repo.maven.apache.org/maven2</url>
        </repository>
    </repositories>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

        <hadoop.version>2.6.4</hadoop.version>
        <parquet.version>1.9.0</parquet.version>
        <fastjson.version>1.2.29</fastjson.version>
        <commons.version>3.5</commons.version>
        <junit.version>4.12</junit.version>

        <shade.plugin.version>3.0.0</shade.plugin.version>
        <compiler.plugin.version>3.6.1</compiler.plugin.version>
    </properties>

    <dependencies>
        <!-- Use one consistent Hadoop version (the hadoop.version property)
             for all Hadoop artifacts; mixing versions causes classpath conflicts. -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>wordcount</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>${shade.plugin.version}</version>
                <configuration>
                    <outputDirectory>/tmp</outputDirectory>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${compiler.plugin.version}</version>
                <configuration>
                    <source>${maven.compiler.source}</source>
                    <target>${maven.compiler.target}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>


</project>


(2) Create WCMapper.java with the following code:

package com.miaozhen.dmp.test.wordcount;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text outputKey = new Text();
    private final LongWritable outputValue = new LongWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on whitespace and emit a (word, 1) pair per token.
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            outputKey.set(st.nextToken());
            context.write(outputKey, outputValue);
        }
    }
}
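The mapper emits one (word, 1) pair for every whitespace-separated token in the line. Outside of Hadoop, the tokenization step can be tried on its own with plain `StringTokenizer` (a standalone sketch, not the mapper class itself):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // Same tokenization the mapper applies to each input line:
        // split on whitespace, then pair every token with a count of 1.
        StringTokenizer st = new StringTokenizer("hello world hello hadoop");
        List<String> pairs = new ArrayList<>();
        while (st.hasMoreTokens()) {
            pairs.add(st.nextToken() + "=1");
        }
        System.out.println(pairs);
        // [hello=1, world=1, hello=1, hadoop=1]
    }
}
```

Note that duplicate words produce duplicate pairs here; grouping them by key is the framework's job, not the mapper's.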


(3) Create WCReducer.java with the following code:

package com.miaozhen.dmp.test.wordcount;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable outputValue = new LongWritable();

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all the 1s the mappers emitted for this word.
        long count = 0L;  // primitive long avoids boxing on every iteration
        for (LongWritable value : values) {
            count += value.get();
        }
        outputValue.set(count);
        context.write(key, outputValue);
    }
}
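By the time the reducer runs, the framework has grouped all values for a key together, so the reduce body is just a summation loop. The aggregation can be sketched without Hadoop types (an illustrative standalone example):

```java
import java.util.Arrays;
import java.util.List;

public class SumDemo {
    public static void main(String[] args) {
        // For the key "hello" the framework would hand the reducer the
        // grouped values [1, 1, 1]; the reduce body simply sums them.
        List<Long> values = Arrays.asList(1L, 1L, 1L);
        long count = 0L;
        for (long v : values) {
            count += v;
        }
        System.out.println("hello\t" + count);
    }
}
```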


(4) Create WCRunner.java with the following code:

package com.miaozhen.dmp.test.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCRunner extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        Job wcjob = Job.getInstance(conf);

        wcjob.setJarByClass(WCRunner.class);
        wcjob.setMapperClass(WCMapper.class);
        wcjob.setReducerClass(WCReducer.class);

        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(LongWritable.class);
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(LongWritable.class);

        // args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(wcjob, new Path(args[0]));
        FileOutputFormat.setOutputPath(wcjob, new Path(args[1]));

        return wcjob.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Propagate the job's success or failure to the JVM exit code.
        System.exit(ToolRunner.run(conf, new WCRunner(), args));
    }
}


(5) Run

First choose Run -> Edit Configurations and set the program arguments, chiefly the input and output paths. Note that the output directory is created by the job itself: do not create it yourself, and if it already exists the program will fail. My input path is shown below:

(screenshot of the run configuration omitted)

Finally, click Run -> Run "wordcount" to launch the job.
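For intuition about what the framework does end to end, the whole job can be simulated in plain Java with no Hadoop at all: tokenize every line (map), group identical words (shuffle), and sum the 1s per word (reduce). This is only an illustrative sketch; the real job runs the classes above on the cluster.

```java
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    public static void main(String[] args) {
        String[] lines = { "hello hadoop", "hello world" };
        // TreeMap keeps keys sorted, mirroring the sorted keys the
        // shuffle phase delivers to the reducer.
        TreeMap<String, Long> counts = new TreeMap<>();
        for (String line : lines) {                   // map: one call per line
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                // merge() groups and sums in one step (shuffle + reduce)
                counts.merge(st.nextToken(), 1L, Long::sum);
            }
        }
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
        // hadoop	1
        // hello	2
        // world	1
    }
}
```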

(6) Results

The input text and the resulting counts are shown below. Each output line is a word, a tab, and its total count; for example, an input containing "hello world hello" yields the lines "hello 2" and "world 1".

(screenshots of the input file and the job output omitted)



Copyright notice: this is an original post by the author; do not repost without permission. https://blog.csdn.net/YF_Li123/article/details/79966259
