[Original] What You Need to Know About UTF-8 Encoding

renminzdb2

Category: JAVA BASE. Tags: java, operating system

Article link: https://blog.csdn.net/renminzdb2/article/details/84421486


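I have always preferred UTF-8 as the system encoding. In one project, however, we had an upload feature that stored an XML string directly in the database.
The stream was read as UTF-8 and the uploaded file was also UTF-8, so why was the result garbled after the upload? The damage was not dramatic:
a single extra "?" character appeared at the very beginning of the file. Yet the log output before the upload showed nothing wrong. Where did the "?" come from?
After some searching on Baidu, it turned out that UTF-8 files come in two flavors: with a BOM and without one.
BOM: Byte Order Mark
The UTF-8 BOM is also called the UTF-8 signature. UTF-8 has no byte-order ambiguity, so the BOM serves no byte-order purpose there; the mark exists to support UTF-16 and UTF-32. In a UTF-8 file it only acts as a signature that tells an editor which encoding the file uses,
which makes detection easier. Editors do not display the BOM, but it is still part of the file and shows up in whatever reads or outputs the content.
A UTF-8 file written by Java can be read back correctly by Java. If the same content is saved as UTF-8 with Notepad, however, a program reading the file gets one extra invisible character at the start.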

 
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.nio.charset.StandardCharsets;

 // The class name is arbitrary; the original snippet only showed main().
 public class ReadUtf8 {
  public static void main(String[] args) throws IOException {
   // Read the file explicitly as UTF-8; try-with-resources closes the stream.
   try (BufferedReader br = new BufferedReader(new InputStreamReader(
     new FileInputStream("C:/utf.txt"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
     // Re-encode the decoded line and print every byte as two hex digits.
     // Masking with 0xFF removes the sign extension of negative bytes, and
     // %02X pads values below 0x10, which the substring approach could not handle.
     for (byte b : line.getBytes(StandardCharsets.UTF_8)) {
      System.out.print(String.format("%02X ", b & 0xFF));
     }
     System.out.println();
     System.out.println(line);
    }
   }
  }
 }

The output is as follows:

[quote]
EF BB BF 54 68 69 73 20 69 73 20 74 68 65 20 66 69 72 73 74 20 6C 69 6E 65 2E
?This is the first line.
54 68 69 73 20 69 73 20 73 65 63 6F 6E 64 20 6C 69 6E 65 2E
This is second line. [/quote]

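The leading "EF BB BF" is exactly the UTF-8 BOM. This shows that Java does not handle the UTF-8 BOM when reading a file: it decodes the first three bytes as part of the text (the single character U+FEFF), which is the "?" seen at the start of the first line.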
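The same behavior can be reproduced without a file on disk. The following is a minimal sketch, assuming an in-memory byte array in place of `C:/utf.txt`; the class name `BomDecodeDemo` and the helper `firstLine` are made up for illustration and do not appear in the original post.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class BomDecodeDemo {

    // Decodes the given bytes as UTF-8 and returns the first line.
    static String firstLine(byte[] data) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF (the UTF-8 BOM) followed by the ASCII bytes of "Hi".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        String line = firstLine(withBom);

        // The decoder keeps the BOM as one char, so the line has 3 chars, not 2.
        System.out.println(line.length());
        // The extra char is U+FEFF.
        System.out.println(Integer.toHexString(line.charAt(0)));
    }
}
```

Checking `line.charAt(0) == '\uFEFF'` is therefore a simple way to confirm that a mysterious leading "?" really is a BOM.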

Solutions:
[quote] JDK Bug 4508058

Java's InputStreamReader supports the BOM for UTF-16 files, but for some reason it does not recognize the UTF-8 BOM. This is very unfortunate for all Windows (>win2k) users if text files are saved with Notepad using UTF-8 format. Notepad will add BOM bytes at the start of the file, but Java's InputStreamReader does not skip them.

The UnicodeInputStream.java class will help you to auto-recognize and skip BOMs. This will support UTF-8 as well.

The UnicodeReader.java class will do everything even more transparently. Just instantiate it and read text. [/quote]
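1. As the description above shows, InputStreamReader does not handle the UTF-8 BOM, but UnicodeInputStream and UnicodeReader solve the problem.

  // UnicodeReader is the helper class mentioned in the bug report, not a JDK class.
  // The second argument is the encoding to fall back on when no BOM is present.
  BufferedReader br = new BufferedReader(new UnicodeReader(in, "UTF-8"));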

2. We can also write our own code to skip the BOM.

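 /**
  * Checks whether the stream starts with the UTF-8 BOM (EF BB BF) and, if so,
  * reads it and throws it away.
  *
  * @param in the raw input stream
  * @return a stream positioned after the BOM, or at the start if there is none
  * @throws IOException if reading from the stream fails
  */
 public static InputStream trimBOM(InputStream in) throws IOException {
  // The pushback buffer must hold 3 bytes; the default size of 1 would
  // make the second unread() fail with "Push back buffer is full".
  PushbackInputStream testin = new PushbackInputStream(in, 3);
  byte[] head = new byte[3];
  int n = 0;
  // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
  while (n < 3) {
   int r = testin.read(head, n, 3 - n);
   if (r < 0) {
    break;
   }
   n += r;
  }
  boolean hasBom = n == 3 && (head[0] & 0xFF) == 0xEF
    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
  if (!hasBom && n > 0) {
   // Not a BOM (for example EF BB 80 is a valid character): put the bytes back.
   testin.unread(head, 0, n);
  }
  return testin;
 }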
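The BOM can also be skipped one level higher, after decoding, where it appears as the single character U+FEFF. The following is a minimal sketch of that variant, assuming the input has already been wrapped in a UTF-8 `Reader`; the class name `BomSkippingReader` and the method `skipBom` are hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of the original article.
class BomSkippingReader {

    // Wraps the reader and consumes a leading U+FEFF if there is one.
    static BufferedReader skipBom(Reader reader) throws IOException {
        BufferedReader br = new BufferedReader(reader);
        br.mark(1);                 // remember the start, allow 1 char of look-ahead
        int first = br.read();
        if (first != 0xFEFF) {      // also true at end of stream (-1)
            br.reset();             // not a BOM: rewind to the marked position
        }
        return br;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for an InputStreamReader over an uploaded file.
        BufferedReader br = skipBom(new StringReader("\uFEFF<xml/>"));
        System.out.println(br.readLine());
    }
}
```

Working on characters instead of bytes keeps the check independent of how the BOM was encoded, at the cost of requiring the caller to have chosen the right charset already.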

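Editor notes:

Windows Notepad saves UTF-8 files with a BOM.
Notepad++ also saves UTF-8 files with a BOM, but it offers a separate encoding option: "UTF-8 without BOM".
EditPlus saves UTF-8 files without a BOM and offers a separate encoding option: "UTF-8 + BOM".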
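Since editors differ in this respect, it can help to check what a given file actually starts with. The following is a rough sketch, assuming the first few bytes of the file have already been read into an array; the class `BomSniffer` and the method `detectBom` are illustrative names, and the result is only a heuristic based on the well-known BOM byte sequences.

```java
// Hypothetical helper, not part of the original article.
class BomSniffer {

    // Returns the encoding suggested by a leading BOM, or null if none is found.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the beginning of data with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] notepadStyle = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        byte[] noBom = {'<', '?'};
        System.out.println(detectBom(notepadStyle));
        System.out.println(detectBom(noBom));
    }
}
```

A `null` result does not prove that the file is not UTF-8; it only means the editor saved it without a signature.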
