PTA Word Frequency Statistics (Java)

小黑喵了个咪

Last modified: 2023-08-17 18:17:22

Tags: java, programming languages, algorithms, data structures

First published: 2023-08-17 16:01:43

Article link: https://blog.csdn.net/xiao_hei_a/article/details/132342222

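Write a program that, given a passage of English text, counts the number of distinct words and reports the words whose frequency is in the top 10%.

A "word" is a contiguous string of at most 80 word characters, but any word longer than 15 characters is truncated to its first 15 word characters. The valid word characters are uppercase and lowercase letters, digits, and the underscore; every other character is treated as a word separator.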
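The word rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `WordSplitDemo` and the helper `tokenize` are names made up for this example and are not part of the solution in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordSplitDemo {
    // Hypothetical helper: splits text into words normalized by the problem's rules.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        // Any run of characters other than letters, digits, and '_' acts as a separator.
        for (String token : text.split("[^A-Za-z0-9_]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            if (token.length() > 15) {
                token = token.substring(0, 15); // keep only the first 15 word characters
            }
            words.add(token.toLowerCase(Locale.ROOT)); // words are case-insensitive
        }
        return words;
    }

    public static void main(String[] args) {
        // "Longlonglonglongword" has 20 characters, so it is cut to 15.
        System.out.println(tokenize("Longlonglonglongword, this_8 and THIS."));
    }
}
```

Note that `this_8` stays a single word because the underscore and digits are word characters, while `THIS` is folded into `this`.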

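Input format:

The input is a non-empty text terminated by the symbol #. The input is guaranteed to contain at least 10 distinct words.

Output format:

On the first line, print the number of distinct words in the text. Words are case-insensitive; for example, "PAT" and "pat" count as the same word.

Then, in order of decreasing frequency, print the top 10% most frequent words in the format frequency:word. Ties are broken by ascending lexicographic order.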

Sample input:

This is a test.

The word "this" is the word with the highest frequency.

Longlonglonglongword should be cut off, so is considered as the same as longlonglonglonee. But this_8 is different than this, and this, and this...#
this line should be ignored.

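Sample output: (Note: although the word `the` also appears 4 times, only the top 10% of the words are printed (that is, the first 2 of the 23 words), and `the` ranks 3rd in alphabetical order, so it is not printed.)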

23
5:this
4:is
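The note on the sample output can be checked with a small ranking sketch. The class `TopTenPercentDemo`, the helper `topTenPercent`, and the filler words `w0` to `w19` are assumptions made for illustration; the filler words only stand in for the sample's 20 other distinct words and are not the real ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopTenPercentDemo {
    // Hypothetical helper: returns the "frequency:word" lines for the top 10% of the words.
    static List<String> topTenPercent(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // higher frequency first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // ties: ascending word
        });
        int limit = entries.size() / 10; // integer division rounds down, so 23 words give 2
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            lines.add(entries.get(i).getValue() + ":" + entries.get(i).getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("this", 5);
        counts.put("is", 4);
        counts.put("the", 4);
        for (int i = 0; i < 20; i++) {
            counts.put("w" + i, 1); // placeholder words, 23 distinct words in total
        }
        for (String line : topTenPercent(counts)) {
            System.out.println(line);
        }
    }
}
```

With these counts, `is` and `the` tie at 4, and `is` sorts before `the`, so only `this` and `is` fit into the 2 available slots.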

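Approach:

Use a regular expression to replace every character that the problem does not treat as a word character, then process the tokens that were read.

Finally, sort the words and print the result.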
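The counting step of this approach can also be sketched with a `HashMap`, which looks a word up in roughly constant time instead of scanning a list. This is an alternative sketch and not the code of this post: the class `WordCountDemo` and the helper `countWords` are hypothetical names, and the input is passed as a list of lines to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordCountDemo {
    // Hypothetical helper: counts words line by line until a '#' is seen.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            int end = line.indexOf('#');
            String text = end >= 0 ? line.substring(0, end) : line; // drop '#' and what follows it
            for (String token : text.split("[^A-Za-z0-9_]+")) {
                if (token.isEmpty()) {
                    continue; // blank line or leading separator
                }
                String word = token.length() > 15 ? token.substring(0, 15) : token;
                counts.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum); // insert 1 or add 1
            }
            if (end >= 0) {
                break; // the text ends at '#', later lines are ignored
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(List.of(
                "This is a test.",
                "",
                "But this_8 is different than this, and THIS...#",
                "this line should be ignored."));
        System.out.println(counts.get("this") + " " + counts.get("is") + " " + counts.size());
    }
}
```

Cutting each line at the first `#` before tokenizing means the terminator is recognized even when it directly follows a word.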

Code:

import java.util.*;

public class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        ArrayList<vocabulary> list = new ArrayList<>(); // list of word entries
        String regex = "[^a-zA-Z0-9#_]"; // matches every character other than letters, digits, '#', and '_'
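        while (sc.hasNextLine()) {
            String str = sc.nextLine(); // read one line of input
            int end = str.indexOf('#'); // '#' ends the text, even when it is attached to a word
            boolean finished = end >= 0;
            if (finished) {
                str = str.substring(0, end); // ignore '#' and everything after it
            }
            // Step by step:
            // replaceAll() turns every character matched by the regex into a space
            // split() cuts the line into an array at each space
            String[] allStr = str.replaceAll(regex, " ").split(" ");

            for (String s : allStr) {
                if (s.equals("")) { // empty tokens come from blank lines or consecutive separators
                    continue;
                }
                if (s.length() > 15) { // words longer than 15 keep only their first 15 characters
                    s = s.substring(0, 15);
                }
                int i;
                for (i = 0; i < list.size(); i++) {
                    vocabulary vocabulary = list.get(i); // entry already in the list
                    if (s.equalsIgnoreCase(vocabulary.getName())) { // same word, ignoring case
                        vocabulary.setNum(vocabulary.getNum() + 1); // increase its frequency
                        break;
                    }
                }
                if (i == list.size()) { // no matching word in the list
                    list.add(new vocabulary(s.toLowerCase(), 1)); // start a new entry
                }
            }
            if (finished) { // the terminating '#' has been read
                printfList(list); // print the statistics in the required format
                return;
            }
        }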
    }

    private static void printfList(ArrayList<vocabulary> list) { // prints the statistics in the required format
        System.out.println(list.size()); // number of distinct words

        list.sort((o1, o2) -> o2.getNum() == o1.getNum() ? o1.getName().compareTo(o2.getName()) : o2.getNum() - o1.getNum());
        // Step by step:
        // the lambda passed to sort() is the comparator, and its ternary expression picks the rule
        // equal frequencies: o1.getName().compareTo(o2.getName()) sorts in ascending dictionary order
        // otherwise: o2.getNum() - o1.getNum() sorts by descending frequency

        int num = list.size() / 10; // size of the top 10%, rounded down
        for (int i = 0; i < num; i++) { // print the top entries
            vocabulary vocabulary = list.get(i);
            System.out.println(vocabulary);
        }
    }
}

class vocabulary {
    private String name; // the word
    private int num; // its frequency

    public vocabulary() {
    }

    public vocabulary(String name, int num) {
        this.name = name;
        this.num = num;
    }

    /**
     * Gets the word.
     *
     * @return name
     */
    public String getName() {
        return name;
    }

    /**
     * Sets the word.
     *
     * @param name
     */
    public void setName(String name) {
        this.name = name;
    }

    /**
     * Gets the frequency.
     *
     * @return num
     */
    public int getNum() {
        return num;
    }

    /**
     * Sets the frequency.
     *
     * @param num
     */
    public void setNum(int num) {
        this.num = num;
    }

    @Override
    public String toString() { // output format required by the problem: frequency:word
        return num + ":" + name;
    }
}
