How to Detect the Character Encoding of a CSV File

牛十二

Published 2024-08-14 16:44:46

Tags: java, programming languages

Article link: https://blog.csdn.net/s_ongfei/article/details/141194679

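I searched for a long time, and none of the answers that turned up on Baidu did what I needed. In the end I fell back on the AI assistant Tongyi Qianwen and found icu4j among its suggested follow-up questions. The name of this jar is rather intimidating for a software engineer, so I am recording it here to save others the same detour.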

Detecting the encoding by checking the BOM with Apache's IO package is unreliable. I recommend IBM's icu4j package instead.
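
To illustrate why a BOM check alone falls short, here is a minimal sketch using only the JDK. `BomSniffer` and `charsetFromBom` are hypothetical names made up for this example, not part of Apache Commons IO or icu4j. It recognizes an encoding only when the file starts with a byte order mark, so a GBK file or a UTF-8 file saved without a BOM yields no answer at all.

```java
/** Hypothetical helper for illustration; not an Apache or ICU4J class. */
class BomSniffer {
    /** Returns the encoding named by a leading BOM, or null if there is none. */
    static String charsetFromBom(byte[] data) {
        // Test the 4-byte UTF-32 marks first: the UTF-32LE BOM (FF FE 00 00)
        // begins with the same two bytes as the UTF-16LE BOM (FF FE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the encoding cannot be decided this way
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};   // "中文" in GBK
        byte[] utf8NoBom = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};          // "中" in UTF-8
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        System.out.println(charsetFromBom(gbk));       // null
        System.out.println(charsetFromBom(utf8NoBom)); // null
        System.out.println(charsetFromBom(utf8Bom));   // UTF-8
    }
}
```

A statistical detector such as icu4j looks at the byte patterns of the content itself, which is why it can still make a guess in the two `null` cases.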

Add the icu4j dependency

        <dependency>
            <groupId>com.ibm.icu</groupId>
            <artifactId>icu4j</artifactId>
            <version>70.1</version>
        </dependency>

Write a utility class

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CharsetEncodingUtils {
    /**
     * Detects the character encoding of the given stream with ICU4J.
     * The stream is consumed and closed by this method.
     */
    public static String getCharset(InputStream in) throws IOException {
        // CharsetDetector.setText(InputStream) needs mark/reset support,
        // which BufferedInputStream provides. Closing bis also closes in.
        try (BufferedInputStream bis = new BufferedInputStream(in)) {
            CharsetDetector cd = new CharsetDetector();
            cd.setText(bis);
            // detect() returns the best match, or null if nothing matches.
            CharsetMatch cm = cd.detect();
            if (cm == null) {
                // The original threw UnsupportedCharsetException here; IOException
                // fits the declared signature and carries a readable message.
                throw new IOException("Failed to detect the file encoding");
            }
            return cm.getName();
        }
    }

    public static void main(String[] args) {
        File file = new File("/Users/xxxxx/Documents/批量添加验证样本模板111/上传文件样例-表格 1.txt");
        // try-with-resources closes the stream even if detection fails.
        try (InputStream inputStream = new FileInputStream(file)) {
            String charset = getCharset(inputStream);
            System.out.println("charset:" + charset);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

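Test cases
Below are the 16 cases I ran, using files saved in 16 different character encodings and letting icu4j detect each file's encoding. From the tests so far, as long as the Chinese text in the source file displays correctly, the encoding reported by icu4j decodes the file correctly, and writing the decoded content to a new file as UTF-8 also works without problems.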
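
The transcoding step described above could look roughly like the following sketch, which uses only the JDK. `Utf8Transcoder` and `transcodeToUtf8` are hypothetical names for this example, and the charset name passed in is assumed to be whatever the detection step returned. The `main` method uses temporary files and UTF-16LE purely as a self-contained demonstration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration. */
class Utf8Transcoder {
    /** Decodes source with the detected charset and rewrites it as UTF-8. */
    static void transcodeToUtf8(Path source, String detectedCharset, Path target)
            throws IOException {
        // Throws if the JVM does not know the charset name.
        Charset from = Charset.forName(detectedCharset);
        try (BufferedReader reader = Files.newBufferedReader(source, from);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            // Copy decoded characters; the writer re-encodes them as UTF-8.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("sample", ".csv");
        Path target = Files.createTempFile("sample-utf8", ".csv");
        // "\u4e2d\u6587" is the Chinese word for "Chinese", written with escapes
        // so the example does not depend on the source file's own encoding.
        Files.write(source, "id,\u4e2d\u6587\n".getBytes(StandardCharsets.UTF_16LE));

        transcodeToUtf8(source, "UTF-16LE", target);

        String result = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(result.equals("id,\u4e2d\u6587\n"));
    }
}
```

The reader returned by `Files.newBufferedReader` fails with an exception on byte sequences that are invalid in the given charset, so a wrong detection tends to surface as an error instead of silently producing garbled output. If the UTF-8 input starts with a BOM, the JDK decoder keeps it as the character U+FEFF, so it may need to be stripped separately.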

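Input file encoding	Encoding detected by ICU4J	UTF-8 output	Notes
ANSI	GB18030	Chinese transcoding passed
BOM UTF-8	UTF-8	Chinese transcoding passed
UTF-16BE	UTF-16BE	Chinese transcoding passed
UTF-16BE with BOM	UTF-16BE	Chinese transcoding passed
UTF-16LE	UTF-16LE	Chinese transcoding passed
UTF-16LE with BOM	UTF-16LE	Chinese transcoding passed
UTF-8	UTF-8	Chinese transcoding passed
UTF-7	ISO-8859-1	Chinese transcoding failed	Chinese in the source file was already garbled
UTF-32	UTF-32LE	Chinese transcoding passed
UTF-32BE	UTF-32BE	Chinese transcoding passed
UTF-32LE	UTF-32LE	Chinese transcoding passed
GB 18030	GB18030	Chinese transcoding passed
GBK	GB18030	Chinese transcoding passed
ISO 2022-CN	ISO-2022-CN	Chinese transcoding passed	Chinese in the source file was already garbled
DOS Latin 2	ISO-8859-1	Chinese transcoding failed	Chinese in the source file was already garbled
ASCII	ISO-8859-1	Chinese transcoding failed	Chinese in the source file could not be displayed; it had been encoded as ASCII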
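
One way to read the rows where the detected name differs from the name the file was saved under (GBK reported as GB18030, ASCII reported as ISO-8859-1): the reported encoding is a superset of the original, so ordinary text decodes to the same characters either way. The sketch below checks this on small samples with the JDK only. `SupersetCheck` and `decodesIdentically` are hypothetical names, and the GBK part runs only if the JVM ships those charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration. */
class SupersetCheck {
    /** True if the bytes decode to the same text under both charsets. */
    static boolean decodesIdentically(byte[] data, Charset a, Charset b) {
        return new String(data, a).equals(new String(data, b));
    }

    public static void main(String[] args) {
        // Pure ASCII bytes mean the same thing in ISO-8859-1.
        byte[] ascii = "id,name".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodesIdentically(
                ascii, StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1)); // true

        // GBK and GB18030 are optional charsets, so check before using them.
        if (Charset.isSupported("GBK") && Charset.isSupported("GB18030")) {
            Charset gbk = Charset.forName("GBK");
            Charset gb18030 = Charset.forName("GB18030");
            byte[] sample = "\u4e2d\u6587".getBytes(gbk); // the word "中文"
            System.out.println(decodesIdentically(sample, gbk, gb18030)); // true
        }
    }
}
```

This only shows agreement on the sample bytes. It does not recover text that was already lost when the file was saved, which is what happened in the rows marked as failed.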