Optimizing an algorithm for finding the 90th percentile

月亮124073734

Tags: algorithm, java, programming languages

Article link: https://blog.csdn.net/u010002517/article/details/130805688

Problem

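Given a set of files, each containing a single line of comma-separated numbers, find the TP90 line of that data stream, that is, the 90th percentile: the value that sits at the 90% position once the data is sorted.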

Notes

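For example, if the file contains 2,3,4,5,10,8,9,1,6,7, then after sorting, the 90% position is the 9th element, which is 9. If 90% of the length is not an integer, round down: for a stream of 115 values, 115 * 90% = 103.5, so take the 103rd value.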
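The position rule above can be written down as a one-line helper. This is only a sketch: the class name `Tp90Position` and the method `position` are made up for illustration and are not part of the original solution.

```java
final class Tp90Position {
    // 1-based position of the 90th percentile in the sorted data:
    // floor(count * 90%), computed with integer arithmetic.
    // Hypothetical helper, not taken from the original solution.
    static int position(int count) {
        return (int) ((long) count * 9 / 10);
    }

    public static void main(String[] args) {
        System.out.println(position(10));  // 9   (the 9th of 10 values)
        System.out.println(position(115)); // 103 (103.5 rounded down)
    }
}
```

A 0-based array index is this position minus one.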

Approach

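The simplest approach is straightforward: load the values into an array, sort them with the JDK's built-in sort (TimSort, which the JDK uses for object arrays and lists), and read off the value at the target position. With 10 million values on a laptop (SSD, 8 GB RAM, i5 processor), this takes about 7 s.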
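A baseline of this kind might look roughly like the sketch below. The original post does not show its baseline code, so the class name `NaiveTp90`, the use of `List<Integer>` with `Collections.sort`, and passing the line as a `String` are all assumptions made for illustration. The input is assumed to hold at least two non-negative integers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class NaiveTp90 {
    // Baseline sketch: parse every value, sort the whole list, index into it.
    // Assumes the line holds at least two comma-separated integers.
    static int tp90(String line) {
        List<Integer> values = new ArrayList<>();
        for (String token : line.trim().split(",")) {
            values.add(Integer.parseInt(token.trim())); // boxes every value
        }
        Collections.sort(values); // full sort of all values
        int position = (int) ((long) values.size() * 9 / 10); // 1-based, rounded down
        return values.get(position - 1);
    }

    public static void main(String[] args) {
        System.out.println(tp90("2,3,4,5,10,8,9,1,6,7")); // 9
    }
}
```

Every value is boxed and the whole data set is sorted, although only one element of the result is needed.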

Optimization

Optimization ideas

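There is no need to sort everything. It is enough to locate the approximate range that contains the target and then sort only that range.
Below roughly the hundred-million scale there is no need for multithreading: its overhead outweighs the gain (it is unfriendly to the CPU's L2 cache, for example). This is an important optimization point.
Avoid the JDK collection classes and wrapper classes wherever possible. Even something as small as a string-to-number conversion leaves a good deal of room for optimization, and avoiding them is again cache-friendly.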
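To make the last point concrete, here is a minimal sketch of parsing comma-separated non-negative integers straight from raw bytes, with no `String`, `Integer`, or collection objects involved. The class `ByteParsing` and its `parse` method are hypothetical names chosen for this example. Bytes other than digits and commas, such as a trailing newline, are skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteParsing {
    // Parses comma-separated non-negative integers directly from ASCII bytes.
    // Hypothetical helper for illustration; no overflow checking is done.
    static int[] parse(byte[] data) {
        if (data.length == 0) {
            return new int[0];
        }
        int count = 1; // number of values = number of commas + 1
        for (byte b : data) {
            if (b == ',') count++;
        }
        int[] out = new int[count];
        int n = 0;
        int current = 0;
        for (byte b : data) {
            if (b == ',') {
                out[n++] = current; // a comma ends the current number
                current = 0;
            } else if (b >= '0' && b <= '9') {
                current = current * 10 + (b - '0'); // primitive arithmetic only
            }
        }
        out[n] = current; // the last number has no trailing comma
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "12,0,345".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(parse(data))); // [12, 0, 345]
    }
}
```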

The code

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.Arrays;

public class Solution {
    public static final int ASCII_0 = 48;
    public int getTp90Line(File file){
        int[] record = new int[1024*10];// buckets of width 1024: record[k] counts the values that fall in bucket k
        byte[] buffer = new byte[1024*1024*10];// I/O buffer
        int all = 0;
        int temp = 0;
        try(BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))){
            int i;
            while((i = bis.read(buffer,0,buffer.length)) > 0){
                for(int a = 0;a<i;a++){
                    if(buffer[a] == ','){
                        all++;
                        int index = temp >> 10;// divide by 1024 using a bit shift
                        if(index >= record.length){
                            // grow the bucket array
                            int[] newRecord = new int[index+1];
                            System.arraycopy(record,0,newRecord,0,record.length);
                            record = newRecord;
                        }
                        record[index]++;
                        temp = 0;
                    }else{
                        int n = buffer[a] - ASCII_0;
                        temp = temp*10+n;// string-to-number conversion using primitive arithmetic only
                    }
                }
            }
            all++;
            int index = temp >> 10;
            if(index >= record.length){
                int[] newRecord = new int[index+1];
                System.arraycopy(record,0,newRecord,0,record.length);
                record = newRecord;
            }
            record[index]++;
        } catch (Exception e){
            e.printStackTrace();
        }
        int tp90 = (int) (all*0.9d)-1;// 0-based index of the target in sorted order: floor(all * 90%) - 1
        int start = 0;
        int lessThanTp90 = 0;
        while(start < record.length){
            int t = lessThanTp90 + record[start];
            if(t > tp90){
                break;
            }else{
                lessThanTp90 = t;
            }
            start++;
        }
        // start is now the bucket that contains the target
        int[] targetSection = new int[record[start]];
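        int len = 0;
        temp = 0;// reset the parser state: temp still holds the last number from the first pass
        try(BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))){
            int i;// second pass: the bucket holding the 90th percentile is known, so only the values in that bucket are collected and sorted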
            while((i = bis.read(buffer,0,buffer.length)) > 0){
                for(int a = 0;a<i;a++){
                    if(buffer[a] == ','){
                        int index = temp >> 10;
                        if(index == start ){
                            targetSection[len++]=temp;
                        }
                        temp = 0;
                    }else{
                        int n = buffer[a] - ASCII_0;
                        temp = temp*10+n;
                    }
                }
            }
            int index = temp >> 10;
            if(index == start ){
                targetSection[len++]=temp;
            }
        } catch (Exception e){
            e.printStackTrace();
        }
        Arrays.sort(targetSection);
        return targetSection[tp90 - lessThanTp90];
    }
}
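Stripped of the file I/O, the two-pass idea in the listing above can be sketched on an in-memory array. This is an illustrative reduction, not the author's code: the class name `BucketTp90` is made up, and the sketch assumes at least two non-negative values.

```java
import java.util.Arrays;

final class BucketTp90 {
    // Two-pass bucket selection on an in-memory array (illustrative sketch).
    // Assumes at least two values, all non-negative.
    static int tp90(int[] values) {
        int max = 0;
        for (int v : values) max = Math.max(max, v);

        // Pass 1: count how many values fall into each bucket of width 1024.
        int[] counts = new int[(max >> 10) + 1];
        for (int v : values) counts[v >> 10]++;

        // Walk the buckets until the one containing the target rank is found.
        int target = (int) ((long) values.length * 9 / 10) - 1; // 0-based rank
        int bucket = 0;
        int below = 0; // number of values in all earlier buckets
        while (below + counts[bucket] <= target) {
            below += counts[bucket];
            bucket++;
        }

        // Pass 2: collect only the values of that bucket and sort them.
        int[] section = new int[counts[bucket]];
        int len = 0;
        for (int v : values) {
            if ((v >> 10) == bucket) section[len++] = v;
        }
        Arrays.sort(section);
        return section[target - below];
    }

    public static void main(String[] args) {
        System.out.println(tp90(new int[]{2, 3, 4, 5, 10, 8, 9, 1, 6, 7})); // 9
    }
}
```

Only one bucket is ever sorted, so the gain depends on how evenly the values spread across buckets. If almost all values land in the same bucket, the approach degrades to a full sort.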

Results

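On the same machine, 10 million values take roughly 300 to 400 ms, and memory usage is far lower. That is about a 20x improvement in time; the space improvement (which should also exceed 20x) is not examined here. Note that the approach has to be adapted to the distribution of the data: this algorithm is only a set of optimization ideas. In other words, in some scenarios, writing your own fairly bare-bones routines, tailored to your own data, can improve application performance dramatically. After all, going from 7 seconds to 350 milliseconds makes a world of difference to the user experience, whereas going from 350 milliseconds to 35 milliseconds is far less noticeable.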