Reading a File's Lines in Reverse, Detecting the File Encoding, and Multithreading

米修斯的橘子

Published 2021-03-02 17:12:51

Categories: java8, file statistics. Tags: java, spring

Article link: https://blog.csdn.net/qq_40980455/article/details/114289672

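1. Ways to read a file in reverse

I went through a lot of blog posts on reading a file's lines in reverse; most of them build on RandomAccessFile. Since a ready-made open-source solution exists, I start with the ReversedLinesFileReader class, which ships in Apache's commons-io dependency.

With encoding in mind, I recommend version 2.7: judging from the 2.4 source, that version does not support GBK.

Dependency: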

    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.7</version>
    </dependency>
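
For comparison, the RandomAccessFile approach mentioned at the start of this section can be sketched roughly as follows. This is a minimal illustration rather than the code from any of those posts: the class and method names are made up, it assumes a UTF-8 file with \n or \r\n line endings, and it seeks one byte at a time, which is simple but slow on large files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of commons-io.
final class ReverseReadSketch {

    /** Returns up to maxLines lines of a UTF-8 file, last line first. */
    static List<String> lastLines(String filePath, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(filePath, "r")) {
            long length = raf.length();
            long pos = length;
            // Bytes of the line currently being collected, stored back to front
            ByteArrayOutputStream reversed = new ByteArrayOutputStream();
            // True until the first byte is read, so a terminator ending the file adds no empty line
            boolean atFileEnd = true;
            while (pos > 0 && lines.size() < maxLines) {
                raf.seek(--pos);
                int b = raf.read();
                if (b == '\n') {
                    if (!atFileEnd) {
                        lines.add(decode(reversed));
                        reversed.reset();
                    }
                } else if (b != '\r') {
                    // In UTF-8 the byte 0x0A never occurs inside a multi-byte character,
                    // so splitting on it byte by byte is safe
                    reversed.write(b);
                }
                atFileEnd = false;
            }
            // Reaching the start of the file completes the first line
            if (length > 0 && pos == 0 && lines.size() < maxLines) {
                lines.add(decode(reversed));
            }
        }
        return lines;
    }

    /** Reverses the collected bytes back into reading order and decodes them. */
    private static String decode(ByteArrayOutputStream reversed) {
        byte[] bytes = reversed.toByteArray();
        for (int i = 0, j = bytes.length - 1; i < j; i++, j--) {
            byte tmp = bytes[i];
            bytes[i] = bytes[j];
            bytes[j] = tmp;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Reading in blocks instead of single bytes, which is what ReversedLinesFileReader does with its blockSize parameter, avoids one seek per byte.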

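2. The reading code

    // Needs: java.io.File, java.io.IOException, java.nio.charset.StandardCharsets,
    //        org.apache.commons.io.input.ReversedLinesFileReader
    public void reverseReadFileContent(String filePath) {
        File file = new File(filePath);
        // Block size is file length / 1024 with a floor of 2; divide first, then narrow to int
        int blockSize = (int) Math.max(2, file.length() / 1024);
        // try-with-resources closes the reader automatically
        try (ReversedLinesFileReader reader =
                     new ReversedLinesFileReader(file, blockSize, StandardCharsets.UTF_8)) {
            String line;
            // readLine() returns null once the start of the file has been reached
            while ((line = reader.readLine()) != null) {
                System.out.println("Line content: " + line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }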
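
A side note on deriving the block size from the file length: in Java a cast binds more tightly than division, so `(int) length / 1024` narrows the long to int before dividing. The sketch below, with made-up names and a hypothetical 3 GiB length, shows how the two orderings differ once the length no longer fits in an int.

```java
// Illustrative only; the class and method names are assumptions.
final class BlockSizeDemo {

    /** Narrows to int first, then divides: overflows for lengths above Integer.MAX_VALUE. */
    static int castFirst(long length) {
        int kb = (int) length / 1024;
        return kb < 2 ? 2 : kb;
    }

    /** Divides as long first, then narrows the much smaller quotient. */
    static int divideFirst(long length) {
        return (int) Math.max(2, length / 1024);
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024;
        System.out.println(castFirst(threeGiB));   // prints 2
        System.out.println(divideFirst(threeGiB)); // prints 3145728
    }
}
```

For files below 2 GiB both forms give the same result, so the difference only shows up on very large files.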

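3. Detecting the file's encoding

The code above only reads UTF-8 files, which does not work once it is integrated with the other modules of an existing project. The encoding therefore has to be detected before reading. Another round of searching on Baidu turned up approaches that inspect the first few bytes of the file, but here too a mature module already covers what we need, so as before I go with the mature option.

Dependency: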

    <dependency>
        <groupId>net.sourceforge.cpdetector</groupId>
        <artifactId>cpdetector</artifactId>
        <version>1.0.4</version>
    </dependency>
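
To make the "first few bytes" idea concrete, here is a minimal sketch of byte-order-mark (BOM) sniffing using only the JDK. The class and method names are assumptions for illustration. It recognizes just three BOMs and returns empty for everything else, which covers most GBK files and many UTF-8 files, and that gap is the reason a statistical detector such as cpdetector is still needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper; not part of cpdetector.
final class BomSniffer {

    /** Maps a leading byte-order mark to its charset, or empty when no known BOM is present. */
    static Optional<Charset> fromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // Note: FF FE followed by 00 00 would be UTF-32LE; this sketch does not check for that
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    /** Reads at most the first three bytes of the file and checks them for a BOM. */
    static Optional<Charset> detect(Path file) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < head.length) {
                int read = in.read(head, n, head.length - n);
                if (read < 0) {
                    break; // file is shorter than three bytes
                }
                n += read;
            }
        }
        return fromBom(Arrays.copyOf(head, n));
    }
}
```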

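4. Reading the file's encoding

    /**
     * Detects the file's encoding. Synchronized: without the lock, concurrent callers run into threading problems.
     *
     * @param filePath path of the file to inspect
     * @return the detected charset name, or "UTF-8" when detection fails
     */
    // Needs: java.io.File, java.nio.charset.Charset and, from info.monitorenter.cpdetector.io,
    //        CodepageDetectorProxy, ParsingDetector, JChardetFacade, ASCIIDetector
    public static synchronized String getFileEncode(String filePath) {
        // Default up front: an empty name would make Charset.forName throw later on
        String charsetName = "UTF-8";
        try {
            File file = new File(filePath);
            CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
            detector.add(new ParsingDetector(false));
            // Register the remaining detectors
            detector.add(JChardetFacade.getInstance());
            detector.add(ASCIIDetector.getInstance());
//            detector.add(UnicodeDetector.getInstance());
            Charset charset = detector.detectCodepage(file.toURI().toURL());
            if (charset != null) {
                charsetName = charset.name();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return charsetName;
    }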

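My read operation runs on multiple threads, each reading a different file to search its content, so there are concurrency issues; taking the lock resolves them.

5. GBK support in ReversedLinesFileReader

The constructor below shows that this version of the reader does support GBK. Our Chinese files, however, may be detected as GB2312 or Big5, and the constructor throws UnsupportedEncodingException if either of those is passed in directly. The workaround here puts GB2312 and BIG5 into a collection and, when the detected name is in it, reads the file as GBK instead. GBK is a superset of GB2312, so that substitution is safe. Big5 is a separate encoding that GBK does not cover, so a Big5 file read as GBK will decode incorrectly; x-windows-950, which the constructor also accepts, is the closer match for Big5.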

/**
     * Creates a ReversedLinesFileReader with the given block size and encoding.
     *
     * @param file
     *            the file to be read
     * @param blockSize
     *            size of the internal buffer (for ideal performance this should
     *            match with the block size of the underlying file system).
     * @param encoding
     *            the encoding of the file
     * @throws IOException  if an I/O error occurs
     * @since 2.7
     */
    public ReversedLinesFileReader(final Path file, final int blockSize, final Charset encoding) throws IOException {
        this.blockSize = blockSize;
        this.encoding = encoding;

        // --- check & prepare encoding ---
        final Charset charset = Charsets.toCharset(encoding);
        final CharsetEncoder charsetEncoder = charset.newEncoder();
        final float maxBytesPerChar = charsetEncoder.maxBytesPerChar();
        if (maxBytesPerChar == 1f) {
            // all one byte encodings are no problem
            byteDecrement = 1;
        } else if (charset == StandardCharsets.UTF_8) {
            // UTF-8 works fine out of the box, for multibyte sequences a second UTF-8 byte can never be a newline byte
            // http://en.wikipedia.org/wiki/UTF-8
            byteDecrement = 1;
        } else if(charset == Charset.forName("Shift_JIS") || // Same as for UTF-8
                // http://www.herongyang.com/Unicode/JIS-Shift-JIS-Encoding.html
                charset == Charset.forName("windows-31j") || // Windows code page 932 (Japanese)
                charset == Charset.forName("x-windows-949") || // Windows code page 949 (Korean)
                charset == Charset.forName("gbk") || // Windows code page 936 (Simplified Chinese)
                charset == Charset.forName("x-windows-950")) { // Windows code page 950 (Traditional Chinese)
            byteDecrement = 1;
        } else if (charset == StandardCharsets.UTF_16BE || charset == StandardCharsets.UTF_16LE) {
            // UTF-16 new line sequences are not allowed as second tuple of four byte sequences,
            // however byte order has to be specified
            byteDecrement = 2;
        } else if (charset == StandardCharsets.UTF_16) {
            throw new UnsupportedEncodingException("For UTF-16, you need to specify the byte order (use UTF-16BE or " +
                    "UTF-16LE)");
        } else {
            throw new UnsupportedEncodingException("Encoding " + encoding + " is not supported yet (feel free to " +
                    "submit a patch)");
        }

        // NOTE: The new line sequences are matched in the order given, so it is important that \r\n is BEFORE \n
        newLineSequences = new byte[][] { "\r\n".getBytes(encoding), "\n".getBytes(encoding), "\r".getBytes(encoding) };

        avoidNewlineSplitBufferSize = newLineSequences[0].length;

        // Open file
        channel = Files.newByteChannel(file, StandardOpenOption.READ);
        totalByteLength = channel.size();
        int lastBlockLength = (int) (totalByteLength % blockSize);
        if (lastBlockLength > 0) {
            totalBlockCount = totalByteLength / blockSize + 1;
        } else {
            totalBlockCount = totalByteLength / blockSize;
            if (totalByteLength > 0) {
                lastBlockLength = blockSize;
            }
        }
        currentFilePart = new FilePart(totalBlockCount, lastBlockLength, null);

    }

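    // Detected charset names that are replaced by GBK before the reader is opened
    private static final List<String> gbkCharsets = new ArrayList<>();

    static {
        gbkCharsets.add("GB2312");
        // Caution: Big5 is not a subset of GBK; it is listed here as in the original approach
        gbkCharsets.add("BIG5");
//        gbkCharsets.add("GBK");
    }

    /**
     * Maps a detected charset name to the one handed to the reader: names in gbkCharsets
     * (GB2312 for Simplified Chinese, BIG5 for Traditional Chinese) become "gbk".
     *
     * @param fileCharset detected charset name, upper-cased
     * @return "gbk" for the listed charsets, otherwise the name unchanged
     */
    private String getFileCharset(String fileCharset) {
        return gbkCharsets.contains(fileCharset) ? "gbk" : fileCharset;
    }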
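
Because Big5 and GBK are different encodings, one possible refinement is to map each detected name to its own reader-supported charset instead of sending everything to GBK. The sketch below is an assumption on my part, not something taken from commons-io: the class and method names are made up, and x-windows-950 is Microsoft's code page 950 variant of Big5, which differs from plain Big5 in a small number of code points.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper illustrating a per-charset substitution table.
final class ReaderCharsets {

    private static final Map<String, String> SUBSTITUTES = new HashMap<>();

    static {
        SUBSTITUTES.put("GB2312", "GBK");         // GBK is a superset of GB2312
        SUBSTITUTES.put("BIG5", "x-windows-950"); // code page 950, Traditional Chinese
    }

    /** Returns the charset name to pass to the reader for a detected charset name. */
    static String forReader(String detected) {
        // Locale.ROOT keeps the upper-casing independent of the default locale
        return SUBSTITUTES.getOrDefault(detected.toUpperCase(Locale.ROOT), detected);
    }
}
```

With this table, `forReader("Big5")` yields `x-windows-950` and `forReader("UTF-8")` is passed through unchanged.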

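The file-reading code from section 2 can then be changed to:

    public void reverseReadFileContent(String filePath) {
        File file = new File(filePath);
        // Block size is file length / 1024 with a floor of 2; divide first, then narrow to int
        int blockSize = (int) Math.max(2, file.length() / 1024);
        // Detect the encoding, then map it to a charset the reader supports
        String charset = getFileEncode(filePath);
        charset = getFileCharset(charset.toUpperCase());
        // try-with-resources closes the reader automatically
        try (ReversedLinesFileReader reader =
                     new ReversedLinesFileReader(file, blockSize, Charset.forName(charset))) {
            String line;
            // readLine() returns null once the start of the file has been reached
            while ((line = reader.readLine()) != null) {
                System.out.println("Line content: " + line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }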

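6. Finally, here is my multithreaded reading code. I need the value each thread returns, so I use a thread pool together with Callable.

    // Needs: java.util.ArrayList, java.util.List, java.util.concurrent.*
    // Assumes `paths` (the non-empty list of files to read) and `log` are fields of the enclosing class
    public Object readMultiFileContent() {
        int maxPoolSize = 4; // placeholder: load the default pool size from configuration
        List<String> result = new ArrayList<>();
        ExecutorService executor = Executors.newFixedThreadPool(Math.min(paths.size(), maxPoolSize));
        ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executor);
        try {
            for (int i = 0; i < paths.size(); i++) {
                String path = String.valueOf(paths.get(i));
                completionService.submit(() -> {
                    reverseReadFileContent(path); // the per-file result can be captured here
                    return "the method's return value";
                });
            }
            for (int i = 0; i < paths.size(); i++) {
                // take() blocks until the next task completes; get() rethrows a task failure
                Future<String> future = completionService.take();
                if (future != null) {
                    result.add(future.get());
                }
            }
        } catch (Exception e) {
            log.error("Thread error: " + e);
        } finally {
            executor.shutdown();
        }
        return result;
    }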

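Those are all the steps it took to meet the requirement. Next I will try other approaches to see whether reading can be made faster. My use case is to read, from the end of the file, the newest n lines (n is a constant: 10, 30, or 50) that contain an input string, so in the worst case every line is traversed. The reverse reader from commons-io runs into problems in that case, so I am looking for a more efficient way to read. Corrections, new ideas, and alternative solutions are welcome.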
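
The "newest n matching lines" requirement described above can be sketched as a loop that stops reading as soon as n matches are collected. The names below are made up for illustration; ReverseLineSource is a hypothetical one-method interface that a method reference such as `reader::readLine` on a ReversedLinesFileReader would satisfy. It does not remove the worst case, where fewer than n lines match and the whole file is still traversed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not the code used in this post.
final class TailGrepSketch {

    /** Any reader that yields lines from the end of a file, returning null when exhausted. */
    interface ReverseLineSource {
        String readLine() throws IOException;
    }

    /** Collects up to n lines containing keyword, newest first, and stops reading once it has n. */
    static List<String> newestMatches(ReverseLineSource source, String keyword, int n)
            throws IOException {
        List<String> matches = new ArrayList<>();
        String line;
        // The size check comes first, so no further line is read after the n-th match
        while (matches.size() < n && (line = source.readLine()) != null) {
            if (line.contains(keyword)) {
                matches.add(line);
            }
        }
        return matches;
    }
}
```

Inside the try-with-resources block of reverseReadFileContent, a call could look like `TailGrepSketch.newestMatches(reader::readLine, keyword, 10)`, where `keyword` stands for the input string.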
