Finding the most frequent IPs in a very large log file and recording their counts

Latest recommended article published on 2024-06-09 18:30:23

lijiahangmax

Article link: https://blog.csdn.net/qq_41011894/article/details/88538872

This article is part of the "Interview Questions" column.

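Problem

Given a 100 GB log file, find the IPs with the most visits and return the top 3. Memory is limited to 1 GB, MapReduce may not be used, and the solution must be implemented in Java.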

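Analysis

With only 1 GB of memory, we cannot count the whole file in a single HashMap. A MapReduce framework is off limits, but we can borrow its idea. First split the log into shards by the hash code of each line, so that every occurrence of the same IP is guaranteed to land in the same file. The shards should be neither too large nor too small; 1000 shards works here, which puts each one at about 100 MB on average. Next, count the IPs in each shard and keep its three most frequent ones, append those candidates to one merged file, and finally rank the merged file to get the overall top 3. Because each IP lives in exactly one shard, its count within that shard is its total count, so the global top 3 must be among the per-shard top 3s.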
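The partitioning rule described above can be sketched in a few lines. This is a minimal illustration, not the article's code: the class name `ShardDemo` and the helper `shardOf` are hypothetical, and it uses `Math.floorMod` on `String.hashCode()` as one reasonable way to get a non-negative shard index.

```java
import java.util.Arrays;
import java.util.List;

class ShardDemo {
    /** Hypothetical helper: maps an IP string to a shard index in [0, shards). */
    static int shardOf(String ip, int shards) {
        // floorMod never returns a negative index, even when hashCode() is negative
        return Math.floorMod(ip.hashCode(), shards);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("192.168.0.1", "192.168.7.9", "192.168.0.1");
        for (String ip : lines) {
            // Equal strings have equal hash codes, so repeated IPs always print the same shard
            System.out.println(ip + " -> shard " + shardOf(ip, 1000));
        }
    }
}
```

Since the index depends only on the content of the line, no lookup table is needed to remember where an IP went, which is what keeps the splitting pass within a small, fixed amount of memory.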

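Code snippet 1: generating the log

    // Imports needed: java.io.File, java.io.IOException, java.util.Random, org.apache.commons.io.FileUtils

    /**
     * Generates a simulated log file.
     */
    public static void createFile() {
        // One Random instance is enough; there is no need to create a new one per iteration
        final Random random = new Random();
        // Write 200,000 IP records to the file
        for (int i = 0; i < 200000; i++) {
            try {
                // FileUtils (Apache Commons IO) makes appending a line convenient
                FileUtils.write(new File("D:\\temp\\log.txt"), "192.168." + random.nextInt(256) + "." + random.nextInt(256) + "\n", "UTF-8", true);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println("IPs generated: " + (i + 1));
        }
    }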

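Code snippet 2: splitting the data into shards

    // Imports needed: java.io.BufferedReader, java.io.File, java.io.FileInputStream, java.io.IOException,
    // java.io.InputStreamReader, java.nio.charset.StandardCharsets, org.apache.commons.io.FileUtils

    /**
     * Splits the data and distributes it across small files.
     */
    private static void cutBlock() {
        // Read the log line by line; try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("D:\\temp\\log.txt"), StandardCharsets.UTF_8))) {
            int i = 0;
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(++i);
                // Map the line's hash code to a file index in [0, 1000); floorMod never returns a negative value
                int fileIndex = Math.floorMod(line.hashCode(), 1000);
                // Append the line to its shard inside the block directory, which map() scans later
                FileUtils.write(new File("D:\\temp\\block\\" + fileIndex + ".txt"), line + "\n", "UTF-8", true);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }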
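
Appending through `FileUtils.write` reopens the target file for every line, which gets slow on a large log. A possible alternative, shown here only as a sketch, keeps one `BufferedWriter` per shard open for the whole pass. The class `BufferedPartitioner`, its `partition` method, and the `<index>.txt` naming are assumptions for illustration; keeping many files open at once can also run into the operating system's open-file limit, so the shard count may need adjusting.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BufferedPartitioner {
    /** Hypothetical helper: splits input into up to `shards` files named <index>.txt under outDir. */
    static void partition(Path input, Path outDir, int shards) throws IOException {
        Files.createDirectories(outDir);
        // Writers are created lazily, so empty shards never produce a file
        BufferedWriter[] writers = new BufferedWriter[shards];
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                int idx = Math.floorMod(line.hashCode(), shards);
                if (writers[idx] == null) {
                    writers[idx] = Files.newBufferedWriter(outDir.resolve(idx + ".txt"), StandardCharsets.UTF_8);
                }
                writers[idx].write(line);
                writers[idx].newLine();
            }
        } finally {
            // Flush and close every shard writer that was opened
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }
    }
}
```

Each writer holds only a small buffer, so even 1000 of them stay far below the 1 GB budget.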

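Code snippet 3: Map stage, counting each small file and taking its top values

    // Imports needed: java.io.BufferedReader, java.io.File, java.io.FileInputStream, java.io.IOException,
    // java.io.InputStreamReader, java.nio.charset.StandardCharsets, java.util.Collections, java.util.HashMap,
    // java.util.Map, org.apache.commons.io.FileUtils

    /**
     * Counts the IPs in each small file and keeps the top three.
     */
    private static void map() {
        // Skip the output files, so results left over from an earlier run are not read back as shards
        File[] blocks = new File("D:\\temp\\block").listFiles((dir, name) -> !name.equals("map.txt") && !name.equals("reduce.txt"));
        if (blocks == null) {
            return; // the directory does not exist or cannot be read
        }
        int f = 0;
        for (File file : blocks) {
            System.out.println("Files processed: " + (++f));
            // Count how many times each IP occurs
            Map<String, Integer> map = new HashMap<>();
            // Read line by line rather than loading the whole shard into memory
            try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    map.merge(line, 1, Integer::sum);
                }
                // Sort the entries by count in descending order with a Stream and keep the three most frequent IPs
                map.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(3).forEachOrdered(e -> {
                    try {
                        // Append the top three to the map output file
                        FileUtils.write(new File("D:\\temp\\block\\map.txt"), e.getKey() + "\t" + e.getValue() + "\n", "UTF-8", true);
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                });
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }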
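
Sorting every entry just to keep three of them does more work than necessary. One alternative, which is a swapped-in technique rather than what the snippet above does, is a bounded min-heap that only ever holds the current best `k` entries. The class `TopKDemo` and method `topK` below are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKDemo {
    /** Hypothetical helper: returns the k entries with the largest counts, highest first. */
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        // Min-heap ordered by count: the head is the weakest of the current candidates
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest so at most k entries remain
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        // Only the k survivors are sorted, in descending order of count
        result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("192.168.0.1", 5);
        counts.put("192.168.0.2", 1);
        counts.put("192.168.0.3", 9);
        counts.put("192.168.0.4", 3);
        for (Map.Entry<String, Integer> e : topK(counts, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With n distinct IPs in a shard this takes roughly O(n log k) time instead of O(n log n), though for k = 3 and shard-sized maps either approach is likely fast enough.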

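Code snippet 4: Reduce stage, ranking the map output to get the final result

    // Imports needed: java.io.File, java.io.IOException, java.util.Collections, java.util.HashMap,
    // java.util.Map, org.apache.commons.io.FileUtils

    /**
     * Ranks the map output and takes the final result.
     */
    private static void reduce() {
        try {
            Map<String, Integer> map = new HashMap<>();
            // Load each candidate IP and its count from the map output
            FileUtils.readLines(new File("D:\\temp\\block\\map.txt"), "UTF-8").forEach(s -> {
                String[] split = s.split("\t");
                map.put(split[0], Integer.valueOf(split[1]));
            });
            // Sort the entries by count in descending order with a Stream and keep the three most frequent IPs
            map.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(3).forEachOrdered(e -> {
                try {
                    // Append the top three to the reduce output file
                    FileUtils.write(new File("D:\\temp\\block\\reduce.txt"), e.getKey() + "\t" + e.getValue() + "\n", "UTF-8", true);
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }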

Code snippet 5: testing with the main method

    public static void main(String[] args) {
        createFile();
        cutBlock();
        map();
        reduce();
    }
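
Before running the pipeline on real files, it can help to check on a small in-memory sample that the shard-then-merge approach gives the same answer as a direct count. The sketch below is such a check under stated assumptions: the class `ShardedTopCheck`, its method names, the fixed seed, and the sample sizes are all made up for illustration. It compares the top counts rather than the IPs themselves, because IPs with tied counts may legitimately come out in a different order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

class ShardedTopCheck {
    /** Counts how many times each IP occurs. */
    static Map<String, Integer> count(List<String> ips) {
        Map<String, Integer> counts = new HashMap<>();
        for (String ip : ips) {
            counts.merge(ip, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the k largest counts in descending order. */
    static List<Integer> topCounts(Map<String, Integer> counts, int k) {
        return counts.values().stream()
                .sorted((a, b) -> Integer.compare(b, a))
                .limit(k)
                .collect(Collectors.toList());
    }

    /** Shards by hash, keeps each shard's top k, then ranks the merged candidates. */
    static List<Integer> shardedTopCounts(List<String> ips, int shards, int k) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String ip : ips) {
            buckets.get(Math.floorMod(ip.hashCode(), shards)).add(ip);
        }
        Map<String, Integer> candidates = new HashMap<>();
        for (List<String> bucket : buckets) {
            count(bucket).entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(k)
                    .forEach(e -> candidates.put(e.getKey(), e.getValue()));
        }
        return topCounts(candidates, k);
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed so repeated runs use the same sample
        List<String> ips = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            ips.add("192.168." + random.nextInt(4) + "." + random.nextInt(16));
        }
        System.out.println("direct : " + topCounts(count(ips), 3));
        System.out.println("sharded: " + shardedTopCounts(ips, 10, 3));
    }
}
```

The two printed lists should match, since every IP's occurrences fall into a single bucket and its bucket-level count is therefore already its total.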

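Related topics

1. How MapReduce works

2. Sorting a HashMap with the Stream API

Conclusion

That concludes the walkthrough of this problem. It may not be the best approach. If you found it useful, please give it a like; if you spot mistakes or know a better method, please point them out in the comments so we can learn from each other and improve together.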
