Fixing Garbled Chinese Text

Rover.x

Last modified: 2023-07-09 18:04:26

First published: 2023-06-25 22:57:35

Article link: https://blog.csdn.net/Yingtaozi_0/article/details/131387282

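I recently ran into a garbled-Chinese-text problem in a project: the files we receive contain special characters and rare characters. The first time the text came out garbled, I changed the encoding from GBK to GB18030. The second time, I switched from character streams to byte streams.

First, the business background. We receive files from system xxx, convert them, and load them into a database whose encoding is UTF-8. After loading, the data is handled differently depending on the case. The incoming files are exported from Oracle with GBK specified as the export encoding. We first split each file, then process the data in the resulting sub-files.

Version 1 worked like this: when splitting, the file was read as GBK and written as UTF-8 (the default); during data processing, both reading and writing used the default UTF-8. Most files went through fine, but files containing rare characters failed with garbled-text errors during data processing. Digging into the cause, I found that these characters take a different number of bytes under the two encodings: two bytes under GBK, four bytes under GB18030. (GB18030 adds four-byte codes for characters that GBK cannot represent, so a GBK decoder misreads those sequences.)

That led to version 2: the read encoding used when splitting was changed from GBK to GB18030. This ran for a while with no data problems. Then, just as I was relaxing, data processing failed again: a field containing "凃?" raised an error when it was truncated at the specified length. Tracing further, I tried replacing the character streams with byte streams, and sure enough, reading and writing byte by byte at the specified lengths went through.

Below is the code for versions 1 and 3: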

...
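// Version 1: decode the incoming file as GBK while splitting
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "GBK"));
BufferedWriter[] out = new BufferedWriter[splitNum];
// FileWriter encodes with the platform default charset (UTF-8 here)
out[i] = new BufferedWriter(new FileWriter(name, false));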
...
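
To make the byte-width difference concrete, here is a minimal sketch rather than code from the project. The class name `EncodingWidthDemo` and its helpers are made up for illustration, and U+20000 is only a stand-in for the rare characters in the real files, which are not reproduced here.

```java
import java.nio.charset.Charset;

final class EncodingWidthDemo {
    /** Number of bytes the text occupies once encoded with the named charset. */
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    /** Whether the named charset can represent every character in the text. */
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    /** Encodes with one charset and decodes with another to expose a mismatch. */
    static String transcode(String text, String writeCharset, String readCharset) {
        byte[] bytes = text.getBytes(Charset.forName(writeCharset));
        return new String(bytes, Charset.forName(readCharset));
    }

    public static void main(String[] args) {
        String common = "\u4E2D";      // U+4E2D, present in both GBK and GB18030
        String rare = "\uD840\uDC00";  // U+20000, a stand-in rare character (assumption)

        System.out.println("common, GBK bytes:     " + byteLength(common, "GBK"));
        System.out.println("common, GB18030 bytes: " + byteLength(common, "GB18030"));
        System.out.println("rare, GB18030 bytes:   " + byteLength(rare, "GB18030"));
        System.out.println("rare encodable in GBK: " + canEncode(rare, "GBK"));

        // Bytes written as GB18030 but read as GBK do not decode back to the original.
        System.out.println("round trip via GBK ok: "
                + transcode(rare, "GB18030", "GBK").equals(rare));
    }
}
```

On a standard JDK the common character takes two bytes in both charsets, while the stand-in rare character takes four bytes in GB18030 and cannot be encoded in GBK at all.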

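// Version 1: cut a substring out of str by byte offsets [startIndex, endIndex) in GBK.
// The offsets must fall on character boundaries; a character GBK cannot encode
// changes the byte count and shifts every later offset.
public static String subStrByBytes(String str, int startIndex, int endIndex) {
    String subStr = "";
    try {
        byte[] bytes = str.getBytes("GBK");
        int subLen = endIndex - startIndex;
        byte[] subBytes = new byte[subLen];
        int i = 0;
        // Copy the requested byte range
        while (startIndex < endIndex) {
            subBytes[i++] = bytes[startIndex++];
        }
        subStr = new String(subBytes, "GBK");
    } catch (Exception e) {
        // Covers an unsupported charset name as well as out-of-range offsets
        logger.error(e.getMessage());
    }
    return subStr;
}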
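
One plausible way the truncation error arises, sketched under assumptions: field offsets are defined against the GB18030 bytes in the source file, but the string is re-encoded with GBK before slicing, so the byte array is shorter than the offsets expect. The class `ByteOffsetDrift`, its helpers, and the sample string are hypothetical; this is not the project's actual data.

```java
import java.nio.charset.Charset;

final class ByteOffsetDrift {
    /** Encodes the text with the named charset. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    /** Slices the encoded form of text by byte offsets [start, end) and decodes the slice. */
    static String sliceByBytes(String text, int start, int end, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        byte[] bytes = text.getBytes(charset);
        // Throws IndexOutOfBoundsException when the range exceeds the byte array
        return new String(bytes, start, end - start, charset);
    }

    public static void main(String[] args) {
        // "A", then U+20000 as a stand-in rare character (assumption), then "B"
        String value = "A\uD840\uDC00B";

        System.out.println("GB18030 length: " + encode(value, "GB18030").length);
        System.out.println("GBK length:     " + encode(value, "GBK").length);

        // Offsets 5..6 address "B" in the GB18030 layout
        System.out.println("GB18030 slice:  " + sliceByBytes(value, 5, 6, "GB18030"));

        try {
            sliceByBytes(value, 5, 6, "GBK");
        } catch (IndexOutOfBoundsException e) {
            // GBK replaces the unencodable character, so the array is too short
            System.out.println("GBK slice failed: " + e.getMessage());
        }
    }
}
```

The same offsets that work against the GB18030 bytes run past the end of the GBK bytes, which is why slicing the raw file bytes directly sidesteps the problem.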

...
// Version 3: split with raw byte streams, so nothing is decoded during the split
FileInputStream in = new FileInputStream(file);
FileOutputStream[] out = new FileOutputStream[splitNum];
out[i] = new FileOutputStream(name);
...
// Slice each fixed-length field by byte range first, then decode it as GB18030
byte[] bytes = new byte[xxx];
String field1 = new String(bytes, 0, 10, "GB18030");
String field2 = new String(bytes, 10, 10, "GB18030");
...
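
The version 3 idea can be sketched end to end as follows. This is an illustration under assumptions, not the project's code: the class `FixedWidthRecords`, the helpers `padField` and `readField`, the two 10-byte fields padded with spaces, and the sample values are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;

final class FixedWidthRecords {
    static final Charset GB18030 = Charset.forName("GB18030");

    /** Encodes a value as GB18030 and right-pads it with spaces to exactly width bytes. */
    static byte[] padField(String value, int width) {
        byte[] encoded = value.getBytes(GB18030);
        if (encoded.length > width) {
            throw new IllegalArgumentException(
                    "value needs " + encoded.length + " bytes, field holds " + width);
        }
        byte[] field = new byte[width];
        Arrays.fill(field, (byte) ' ');
        System.arraycopy(encoded, 0, field, 0, encoded.length);
        return field;
    }

    /** Slices a field out of the record by byte range, then decodes and trims it. */
    static String readField(byte[] record, int offset, int width) {
        return new String(record, offset, width, GB18030).trim();
    }

    public static void main(String[] args) throws IOException {
        // Build an in-memory "file" holding one 20-byte record with two 10-byte fields
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(padField("\u51C3\uD840\uDC00", 10)); // U+51C3 plus stand-in U+20000
        file.write(padField("abc", 10));

        // Read the record as raw bytes; no charset is involved at this stage
        byte[] record = new byte[20];
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            in.readFully(record);
        }

        // Decode only after the byte ranges have been cut
        System.out.println(readField(record, 0, 10));
        System.out.println(readField(record, 10, 10));
    }
}
```

Because the field boundaries are applied to the original bytes, a four-byte character inside a field never shifts the offsets of the fields that follow.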