[Java Office Automation (8)] -- Automatic News Classification with Naive Bayes

艳学网

Article link: https://blog.csdn.net/sinat_15153911/article/details/104635968

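Automatic news classification is simple: it only takes a hundred million little details, and two thousand years later the data is all classified. I was terrified.

We have already used naive Bayes to filter spam and to guess gender from a person's name. In the same way, today we classify articles automatically. First, we need plenty, really plenty, of text data...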

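1. Feature representation

The words that appear in a news article can serve as its feature vector, for example X = {yesterday, is, domestic, investment, market…}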
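To make the feature vector concrete, here is a minimal sketch that turns a list of words into a word-to-count map. It assumes the article has already been split into words by a segmenter; the class name `BagOfWords`, the method `toFeatures`, and the English sample words are illustrative stand-ins, not part of the original project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BagOfWords {
    /** Counts how often each word occurs; the resulting map is the article's feature vector. */
    static Map<String, Integer> toFeatures(List<String> words) {
        Map<String, Integer> features = new LinkedHashMap<>();
        for (String word : words) {
            // Start at 1 on first occurrence, otherwise add 1 to the existing count
            features.merge(word, 1, Integer::sum);
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical segmenter output for one article
        List<String> words = Arrays.asList("yesterday", "is", "domestic", "investment", "market", "investment");
        System.out.println(toFeatures(words));
    }
}
```

Keeping counts rather than a plain set of words matters for the multinomial model, which weights each word by how often it occurs.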

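2. Feature selection

Some words do little to help classification and can even add noise,
so we need to trim a hundred million little details...
We drop words such as "is" and "yesterday"; after selection the features might be X = {domestic, investment, market…}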
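One simple way to do this selection is a stop-word list. The sketch below assumes the features are a word-to-count map; the class `StopWordFilter` and the tiny `STOP_WORDS` set are hypothetical, and a real list would be much longer and loaded from a file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop-word list, English stand-ins for the words named in the text
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "yesterday", "the", "of"));

    /** Returns a copy of the features with all stop words removed. */
    static Map<String, Integer> select(Map<String, Integer> features) {
        Map<String, Integer> selected = new HashMap<>();
        for (Map.Entry<String, Integer> e : features.entrySet()) {
            if (!STOP_WORDS.contains(e.getKey())) {
                selected.put(e.getKey(), e.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> features = new HashMap<>();
        features.put("yesterday", 1);
        features.put("is", 1);
        features.put("domestic", 1);
        features.put("investment", 2);
        features.put("market", 1);
        // Only domestic, investment and market remain
        System.out.println(select(features));
    }
}
```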

3. Model selection

Hands-on steps:
Create the folders, create the files, crawl with multiple threads, and simulate clicks in JavaScript to load the content.

for(var i = 0;i<100;i++){
    $(".more").click();
}

// Next file number: the count of files already in the folder, plus one
int b = f33.length + 1;
// Zero-pad to four digits and add the extension, e.g. 7 -> "0007.txt"
key44 = String.format("%04d", b) + ".txt";

 * Multinomial naive Bayes classification result
 * P(C_i|w_1,w_2...w_n) = P(w_1,w_2...w_n|C_i) * P(C_i) / P(w_1,w_2...w_n)
 * = P(w_1|C_i) * P(w_2|C_i)...P(w_n|C_i) * P(C_i) / (P(w_1) * P(w_2) ...P(w_n))
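The denominator in the formula does not depend on C_i, so it can be dropped when comparing classes, and multiplying many small probabilities is usually replaced by adding their logarithms to avoid underflow. The sketch below scores each class that way. It is not the article's classifier: the class `NaiveBayesSketch`, its method names, and the add-one (Laplace) smoothing for unseen words are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // category -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // category -> total word count
    private final Map<String, Integer> docCounts = new HashMap<>();               // category -> number of documents
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one training document, given as word -> frequency. */
    void train(String category, Map<String, Integer> doc) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
            totalWords.merge(category, e.getValue(), Integer::sum);
            vocabulary.add(e.getKey());
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the category maximizing log P(C_i) + sum over words of n_k * log P(w_k|C_i). */
    String classify(Map<String, Integer> doc) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Prior: share of training documents in this category
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                int count = counts.getOrDefault(e.getKey(), 0);
                // Add-one smoothing (an assumption here) keeps unseen words from zeroing the product
                double p = (count + 1.0) / (total + vocabulary.size());
                score += e.getValue() * Math.log(p);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        Map<String, Integer> financeDoc = new HashMap<>();
        financeDoc.put("investment", 3);
        financeDoc.put("market", 2);
        Map<String, Integer> sportsDoc = new HashMap<>();
        sportsDoc.put("match", 3);
        sportsDoc.put("team", 2);
        nb.train("finance", financeDoc);
        nb.train("sports", sportsDoc);

        Map<String, Integer> query = new HashMap<>();
        query.put("investment", 1);
        System.out.println(nb.classify(query)); // prints "finance"
    }
}
```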

Tricky part:

Depth-first traversal

Files.walkFileTree(Paths.get(trainFileDir.getAbsolutePath()), new SimpleFileVisitor<Path>() {
    @Override // called for every file encountered during the walk
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        musicList2.add(file.toFile());
        String filePath = file.toFile().getAbsolutePath();
        // Segment the text to get the words and word frequencies of each training document
        Map<String, Integer> contentSegs = null;
        try {
            contentSegs = IKWordSegmentation.segString(FileOptionUtil.readFile(filePath));
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Store the per-file word counts under the category (training folder) name
        if (allTrainFileSegsMap.containsKey(trainFileDir.getName())) {
            Map<String, Map<String, Integer>> allSegsMap = allTrainFileSegsMap.get(trainFileDir.getName());
            allSegsMap.put(filePath, contentSegs);
            allTrainFileSegsMap.put(trainFileDir.getName(), allSegsMap);
        } else {
            Map<String, Map<String, Integer>> allSegsMap = new HashMap<String, Map<String, Integer>>();
            allSegsMap.put(filePath, contentSegs);
            allTrainFileSegsMap.put(trainFileDir.getName(), allSegsMap);
        }
        // visitFile must return a result; CONTINUE moves on to the next file
        return FileVisitResult.CONTINUE;
    }
});
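For readers who want to try the traversal on its own, here is a self-contained sketch of the same idea without the segmentation step. It walks a root folder depth-first with `Files.walkFileTree` and groups the `.txt` files by the name of their parent folder, which is taken to be the category. The class `TrainingFileWalker` and the one-folder-per-category layout are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrainingFileWalker {
    /** Walks root depth-first and groups .txt files by the name of their parent folder. */
    static Map<String, List<Path>> collect(Path root) throws IOException {
        final Map<String, List<Path>> byCategory = new TreeMap<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (file.getFileName().toString().endsWith(".txt")) {
                    // Assumed layout: root/<category>/<file>.txt
                    String category = file.getParent().getFileName().toString();
                    byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(file);
                }
                return FileVisitResult.CONTINUE; // keep walking
            }
        });
        return byCategory;
    }
}
```

Calling `TrainingFileWalker.collect(Paths.get("Sample"))` on a folder with one subfolder per category would return one list of paths per category; the folder name `Sample` is only an example.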

Bonus functions:

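/**
     * Word frequency count
     *
     * @param content     the text to segment
     * @param frequencies word frequencies; key: word, value: number of occurrences
     * @return the updated frequency map
     * @throws IOException if the segmenter fails to read the content
     */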
    public static Map<String, Integer> count(String content, Map<String, Integer> frequencies) throws IOException {
        if (frequencies == null) {
            frequencies = new HashMap<>();
        }
        if (StringUtils.isBlank(content)) {
            return frequencies;
        }

        IKSegmenter ikSegmenter = new IKSegmenter(new StringReader(content), true);

        Lexeme lexeme;
        while ((lexeme = ikSegmenter.next()) != null) {
            final String text = lexeme.getLexemeText();

            if (text.length() > 1) {
                // Seen before: increment the count
                if (frequencies.containsKey(text)) {
                    frequencies.put(text, frequencies.get(text) + 1);
                } else { // first occurrence
                    frequencies.put(text, 1);
                }
            }
        }

        return frequencies;
    }

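    /**
     * Sorts by number of occurrences, from highest to lowest
     *
     * @param data word frequencies; key: word, value: number of occurrences
     * @return the entries in descending order of count
     */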
    public static List<Map.Entry<String, Integer>> order(Map<String, Integer> data) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>(data.entrySet());
        Collections.sort(result, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue() - o1.getValue();
            }
        });
        return result;
    }

Bonus class:

public class FileOptionUtil {
    public static void main(String[] args) {
        readDirs("C:\\Users\\yanhui\\Desktop\\课件和代码\\第2课\\Lecture_2\\Lecture_2\\Naive-Bayes-Text-Classifier\\Database\\SogouC\\Sample");
    }

    public static List<String> readDirs(String absolutePath) {
        List<String> paths = new ArrayList<>();
        File dir = new File(absolutePath);
        // List the files and folders directly under this directory
        File[] arr = dir.listFiles();
        if (arr == null) { // not a directory, or it could not be read
            return paths;
        }
        for (File file : arr) {
            if (file.isFile() && file.getName().endsWith(".txt")) { // collect .txt files
                paths.add(file.getAbsolutePath());
            } else if (file.isDirectory()) { // recurse into subfolders and keep their results
                paths.addAll(readDirs(file.getAbsolutePath()));
            }
        }
        return paths;
    }

    public static String readFile(String filePath) {
        return _txtUtils.readTxtFile(filePath);
    }
}
