Computing Word Frequencies in Files

Requirements:

  1. Count how many times each word appears in every file, sorted in descending order
  2. Output the word with the highest count across all files

Test files:

Any few articles found online will do.
Here are the three files used:

news1:
don't know what I do now is right, those are wrong, and when I finally Laosi when I know these.
So I can do now is to try to do well in everything, and then wait to die a natural death.Sometimes
I can be very happy to talk to everyone, can be very presumptuous, but no one knows, it is but very
deliberatelycamouflage, camouflage; I can make him very happy very happy,
but couldn't find the source of happiness, just giggle.

news2:
If not to the sun for smiling, warm is still in the sun there, but wewill laugh more confident calm;
if turned to found his own shadow, appropriate escape, the sun will be through the heart,warm each place
behind the corner; if an outstretched palm cannot fall butterfly, then clenched waving arms, given power;
if I can't have bright smile, it will face to the sunshine, and sunshine smile together, in full bloom.

news3:
Time is like a river, the left bank is unable to forget the memories, right is
worth grasp the youth, the middle of the fast flowing, is the sad young faint.
There are many good things, buttruly belong to own but not much. See the
courthouse blossom,honor or disgrace not Jing, hope heaven Yunjuanyunshu,
has no intention to stay. In this round the world, all can learn to use a
normal heart to treat all around, is also a kind of realm!

Test code:

Put all the files in the same folder (I used word_test). Note: do not keep any other files in that folder, or they may interfere with the job.
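In case it helps, here is a minimal sketch of uploading the local word_test folder into HDFS with the FileSystem API. The class name UploadTestFiles is hypothetical, and the NameNode address is assumed to match the job configuration further down; adjust both for your cluster:

package word_count;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: copies the local word_test directory into HDFS so the
// job can read it from /input/word_test/. Adjust the NameNode URI as needed.
public class UploadTestFiles {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop5:9000"); // assumed cluster address
        FileSystem fs = FileSystem.get(conf);
        // copyFromLocalFile copies the directory and its files recursively
        fs.copyFromLocalFile(new Path("word_test"), new Path("/input/word_test/"));
        fs.close();
    }
}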

package word_count;

import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class MRMapper extends Mapper<LongWritable, Text, WordBean, NullWritable> {

        public static HashMap<String, Integer> words1 = new HashMap<String, Integer>(); // key: "filename-word", value: count

        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // InputSplit.toString() yields something like "hdfs://.../news1.txt:0+512",
            // so cut out just the "newsN.txt" part as the file name
            String filename = context.getInputSplit().toString();

            filename = filename.substring(filename.indexOf("news"), filename.indexOf(".txt")) + ".txt";

            // lowercase, expand "n't" to " not" ("can't" becomes "ca not", which is
            // why the token "ca" appears in the results), replace non-word characters
            // with spaces, collapse whitespace, then split into words
            String[] split = value.toString().toLowerCase().replaceAll("n't", " not").trim().replaceAll("\\W", " ")
                    .replaceAll("\\s+", " ").split(" ");

            for (String s : split) {
                // key is "filename-word"
                String file_word = filename + "-" + s;

                // increment this word's count for the current file
                if (words1.get(file_word) != null) {
                    words1.put(file_word, words1.get(file_word) + 1);
                } else {
                    words1.put(file_word, 1);
                }
            }
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {

            // emit one WordBean per (file, word) pair; the key was built as
            // "filename-word", so splitting on "-" recovers both parts (this
            // assumes neither the file name nor the word contains a '-')
            for (HashMap.Entry<String, Integer> m : words1.entrySet()) {
                String[] split = m.getKey().split("-");
                context.write(new WordBean(split[0], split[1], m.getValue()), NullWritable.get());
            }
            // cleanup() runs once a split (here, one file) has been fully read, so
            // clear the map after writing it out to prevent duplicate output
            words1.clear();
        }
    }

    public static class MRReducer extends Reducer<WordBean, NullWritable, Text, NullWritable> {

        public static HashMap<String, Integer> words = new HashMap<String, Integer>(); // filename -> highest count seen so far
        public static HashMap<String, String> file = new HashMap<String, String>(); // filename -> "word-count" of that file's top word

        public static String word_filename = ""; // file holding the overall most frequent word
        public static int max_word_num = 0; // its count

        protected void reduce(WordBean bean, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {

            // keep a running record of each file's most frequent word; beans for a
            // file arrive sorted by descending count, so the first one seen wins
            Integer best = words.get(bean.getFilename());
            if (best == null || bean.getNum() > best) {
                words.put(bean.getFilename(), bean.getNum());
                file.put(bean.getFilename(), bean.getWord() + "-" + bean.getNum());
            }

            // track the global maximum across all files
            if (bean.getNum() > max_word_num) {
                word_filename = bean.getFilename();
                max_word_num = bean.getNum();
            }
            context.write(new Text(bean.toString()), NullWritable.get());
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {

            // per-file summary: the stored value is "word-count", so replacing the
            // "-" turns it into readable text
            for (HashMap.Entry<String, String> m : file.entrySet()) {

                String str = "Most frequent word in " + m.getKey() + ": " + m.getValue().replaceAll("-", ", appearing ") + " times";

                context.write(new Text(str), NullWritable.get());
            }

            // global summary across all files
            String str = "Most frequent word across all files: "
                    + file.get(word_filename).replaceAll("-", ", in " + word_filename + ", appearing ") + " times";

            context.write(new Text(str), NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        // fs.default.name is deprecated; fs.defaultFS is the current key
        conf.set("fs.defaultFS", "hdfs://hadoop5:9000");

        Job job = Job.getInstance(conf, "Word sort");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(MRMapper.class);
        job.setReducerClass(MRReducer.class);

        job.setMapOutputKeyClass(WordBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path("/input/word_test/"));
        FileOutputFormat.setOutputPath(job, new Path("/output/put1"));
        // prints 1 on success, 0 on failure
        System.out.println(job.waitForCompletion(true) ? 1 : 0);
    }
}
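One quirk of the tokenization above is worth a quick look: replaceAll("n't", " not") runs before punctuation is stripped, so "can't" becomes "ca not", which is why a lone "ca" token shows up in the news2 results. A small standalone demo of the same regex chain (the class name TokenizeDemo and the sample sentence are hypothetical):

package word_count;

// Hypothetical demo, not part of the job: shows the mapper's tokenization on a
// sample line so the "ca" token in the results below is no surprise.
public class TokenizeDemo {
    public static void main(String[] args) {
        String line = "If I can't have a bright smile, don't worry.";
        String[] tokens = line.toLowerCase()
                .replaceAll("n't", " not")   // "can't" -> "ca not", "don't" -> "do not"
                .trim()
                .replaceAll("\\W", " ")      // punctuation -> spaces
                .replaceAll("\\s+", " ")     // collapse runs of spaces
                .split(" ");
        for (String t : tokens) {
            System.out.println(t);
        }
        // output tokens: if, i, ca, not, have, a, bright, smile, do, not, worry
    }
}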

There are many ways to sort; this is just a fairly simple and common one:
a JavaBean implementing the WritableComparable interface to define a custom sort order.
Note: the Writable part handles serialization, while Comparable's compareTo() method drives the sorting.

package word_count;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class WordBean implements WritableComparable<WordBean> {

    private String filename;
    private String word;
    private Integer num;

    public void readFields(DataInput in) throws IOException {
        this.filename = in.readUTF();
        this.word = in.readUTF();
        this.num = in.readInt();
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(filename);
        out.writeUTF(word);
        out.writeInt(num);
    }

    public int compareTo(WordBean o) {
        // group by file name, then by count in descending order; fall back to the
        // word so compareTo never returns 0 for distinct (file, word) pairs --
        // otherwise Hadoop would merge them into one reduce() call
        int cmp = filename.compareTo(o.filename);
        if (cmp != 0) {
            return cmp;
        }
        cmp = Integer.compare(o.num, num); // descending by count
        if (cmp != 0) {
            return cmp;
        }
        return word.compareTo(o.word);
    }

    public String toString() {

        return filename + "\t" + word + "\t" + num;
    }

    public WordBean() {
        super();
    }
    public WordBean(String filename, String word, Integer num) {
        this.filename = filename;
        this.word = word;
        this.num = num;
    }

    public String getFilename() {
        return filename;
    }

    public void setFilename(String filename) {
        this.filename = filename;
    }

    public String getWord() {
        return word;
    }
    public void setWord(String word) {
        this.word = word;
    }
    public Integer getNum() {
        return num;
    }
    public void setNum(Integer num) {
        this.num = num;
    }
}
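As a quick sanity check of the ordering, a small driver can sort a few WordBean instances locally with Collections.sort, since the bean's compareTo() is what the framework uses anyway (the class name WordBeanSortDemo and the sample data are hypothetical):

package word_count;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local check of WordBean's sort order: beans group by file name,
// and within a file the counts come out in descending order.
public class WordBeanSortDemo {
    public static void main(String[] args) {
        List<WordBean> beans = new ArrayList<WordBean>();
        beans.add(new WordBean("news1.txt", "very", 5));
        beans.add(new WordBean("news2.txt", "the", 6));
        beans.add(new WordBean("news1.txt", "i", 6));
        beans.add(new WordBean("news1.txt", "to", 5));

        Collections.sort(beans);

        for (WordBean b : beans) {
            System.out.println(b); // filename \t word \t num
        }
        // expected order:
        // news1.txt  i     6
        // news1.txt  to    5
        // news1.txt  very  5
        // news2.txt  the   6
    }
}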

Test results:

news1.txt   i   6
news1.txt   very    5
news1.txt   to  5
news1.txt   can 4
news1.txt   do  4
news1.txt   happy   3
news1.txt   but 3
news1.txt   is  3
news1.txt   now 2
news1.txt   when    2
news1.txt   know    2
news1.txt   be  2
news1.txt   and 2
news1.txt   not 2
news1.txt   a   1
news1.txt   then    1
news1.txt   the 1
news1.txt   him 1
news1.txt   well    1
news1.txt   no  1
news1.txt   talk    1
news1.txt   laosi   1
news1.txt   just    1
news1.txt   knows   1
news1.txt   of  1
news1.txt   right   1
news1.txt   what    1
news1.txt   everything  1
news1.txt   make    1
news1.txt   happiness   1
news1.txt   it  1
news1.txt   those   1
news1.txt   die 1
news1.txt   wait    1
news1.txt   so  1
news1.txt   find    1
news1.txt   sometimes   1
news1.txt   death   1
news1.txt   deliberatelycamouflage  1
news1.txt   are 1
news1.txt   source  1
news1.txt   could   1
news1.txt   natural 1
news1.txt   in  1
news1.txt   giggle  1
news1.txt   one 1
news1.txt   camouflage  1
news1.txt   finally 1
news1.txt   wrong   1
news1.txt   everyone    1
news1.txt   these   1
news1.txt   presumptuous    1
news1.txt   try 1
news2.txt   the 6
news2.txt   if  4
news2.txt   to  3
news2.txt   sun 3
news2.txt   smile   2
news2.txt   sunshine    2
news2.txt   will    2
news2.txt   in  2
news2.txt   warm    2
news2.txt   not 2
news2.txt   and 1
news2.txt   be  1
news2.txt   there   1
news2.txt   it  1
news2.txt   bloom   1
news2.txt   heart   1
news2.txt   escape  1
news2.txt   through 1
news2.txt   but 1
news2.txt   calm    1
news2.txt   have    1
news2.txt   butterfly   1
news2.txt   is  1
news2.txt   cannot  1
news2.txt   waving  1
news2.txt   own 1
news2.txt   an  1
news2.txt   found   1
news2.txt   ca  1
news2.txt   corner  1
news2.txt   face    1
news2.txt   more    1
news2.txt   laugh   1
news2.txt   for 1
news2.txt   arms    1
news2.txt   then    1
news2.txt   confident   1
news2.txt   clenched    1
news2.txt   wewill  1
news2.txt   power   1
news2.txt   i   1
news2.txt   shadow  1
news2.txt   full    1
news2.txt   turned  1
news2.txt   place   1
news2.txt   together    1
news2.txt   given   1
news2.txt   behind  1
news2.txt   his 1
news2.txt   each    1
news2.txt   still   1
news2.txt   bright  1
news2.txt   appropriate 1
news2.txt   palm    1
news2.txt   fall    1
news2.txt   smiling 1
news2.txt   outstretched    1
news3.txt   the 8
news3.txt   to  5
news3.txt   is  5
news3.txt   a   3
news3.txt   of  2
news3.txt   all 2
news3.txt   not 2
news3.txt   bank    1
news3.txt   belong  1
news3.txt   disgrace    1
news3.txt   there   1
news3.txt   no  1
news3.txt   has 1
news3.txt   faint   1
news3.txt   courthouse  1
news3.txt   but 1
news3.txt   flowing 1
news3.txt   yunjuanyunshu   1
news3.txt   treat   1
news3.txt   also    1
news3.txt   normal  1
news3.txt   stay    1
news3.txt   youth   1
news3.txt   kind    1
news3.txt   much    1
news3.txt   intention   1
news3.txt   unable  1
news3.txt   around  1
news3.txt   fast    1
news3.txt   heart   1
news3.txt   right   1
news3.txt   honor   1
news3.txt   jing    1
news3.txt   things  1
news3.txt   world   1
news3.txt   many    1
news3.txt   worth   1
news3.txt   memories    1
news3.txt   heaven  1
news3.txt   forget  1
news3.txt   hope    1
news3.txt   time    1
news3.txt   realm   1
news3.txt   use 1
news3.txt   round   1
news3.txt   good    1
news3.txt   grasp   1
news3.txt   own 1
news3.txt   river   1
news3.txt   or  1
news3.txt   can 1
news3.txt   this    1
news3.txt   sad 1
news3.txt   see 1
news3.txt   left    1
news3.txt   are 1
news3.txt   blossom 1
news3.txt   like    1
news3.txt   buttruly    1
news3.txt   young   1
news3.txt   learn   1
news3.txt   middle  1
news3.txt   in  1
Most frequent word in news3.txt: the, appearing 8 times
Most frequent word in news2.txt: the, appearing 6 times
Most frequent word in news1.txt: i, appearing 6 times
Most frequent word across all files: the, in news3.txt, appearing 8 times

The test results above are provided for reference.
