How to use UTF-8_with_BOM, XML and Java together

UTF_BOM FAQ
www Escapes
Wikipedia UTF-8
How to get ääkköset (Finnish umlaut letters) working in a servlet (in Finnish)

Use UTF8 for your html files
You should use UTF-8 for all your HTML files; it just makes life easier. There are two things to keep in mind, see the example HTML below. If you follow these simple rules, your site's readers should not have problems displaying text.

  • Save your .html files as UTF-8 encoded text
  • Add a "meta http-equiv" meta tag to the head section of your HTML files
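
The two rules above can also be followed when a page is generated from Java. The following is only a minimal sketch: the class name Utf8HtmlWriter, the file name index.html and the page content are made up for illustration and are not from the original example. Non-ASCII letters are written as \u escapes so the sketch does not depend on the encoding of the Java source file itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: class, method and file names are illustrative only.
final class Utf8HtmlWriter {

    // Minimal page whose meta tag declares the same charset the file is saved in.
    static final String PAGE =
        "<html>\n"
        + "<head>\n"
        + "  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"
        + "  <title>UTF-8 test</title>\n"
        + "</head>\n"
        + "<body>J\u00e4ttil\u00e4inen meni keitti\u00f6\u00f6n</body>\n"
        + "</html>\n";

    // Encode explicitly as UTF-8 instead of relying on the platform default charset.
    static void write(Path file) throws IOException {
        Files.write(file, PAGE.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        write(Paths.get("index.html"));
    }
}
```

The important part is the explicit StandardCharsets.UTF_8: the bytes on disk then match what the meta tag promises, whatever the default charset of the machine is.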
 

 

XML
You should put a BOM marker at the start of text files if possible. To be even safer, add an XML declaration as the first row and specify the encoding used in the document.
    <?xml version="1.0" encoding="UTF-8"?>

Windows Notepad (Win2k, XP) can save files with a BOM marker. Switch to another text editor if your favourite one cannot cope with the standard BOM markers.

  • UTF-8
  • UTF-16BE (big endian)
  • UTF-16LE (little endian)
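
Each of the three encodings above has a fixed BOM byte sequence at the start of the file: EF BB BF for UTF-8, FE FF for UTF-16BE and FF FE for UTF-16LE. As a rough sketch of how a program could tell them apart, the hypothetical BomSniffer class below checks the first bytes of a file. The class and method names are assumptions for this example only, and UTF-32 markers are deliberately ignored.

```java
// Hypothetical helper, not part of the JDK or of the UnicodeReader class.
final class BomSniffer {

    // Returns the encoding implied by a leading BOM, or null if none of the three matches.
    // UTF-32 is not handled: a UTF-32LE file (FF FE 00 00) would be reported as UTF-16LE.
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares unsigned byte values, because Java bytes are signed.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?'};
        System.out.println(detect(head)); // prints UTF-8
    }
}
```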

Windows WordPad (Win2k, XP) cannot save files using the UTF-8 charset.

Here is a small example XML document.

 

You should see this after unescaping the document.


   Jättiläinen meni keittiöön 
  ja kaatoi kaikki kattilat.    Hiiri
            meni puutarhaan
   ja söi kaikki puut.

   char entities: < > & " '
   safe xml chars: /O/
   Decimal Numeric Character Reference: Ä €
   Hex Numeric Character Reference: Ä €
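
To check the unescaping yourself, any XML parser will do. The sketch below uses the DOM parser bundled with the JDK. The XML string is a made-up stand-in for the example document, and the class name XmlUnescapeDemo is an assumption for this example. It covers the five predefined entities and both the decimal and hex numeric character references for Ä and €.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the XML passed in is illustrative, not the original test file.
final class XmlUnescapeDemo {

    // Parses the XML and returns the text of the root element with all
    // entities and numeric character references already resolved by the parser.
    static String textOf(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<text>&lt; &gt; &amp; &quot; &apos; &#196; &#8364; &#xC4; &#x20AC;</text>";
        // What the console shows depends on the console's own charset.
        System.out.println(textOf(xml));
    }
}
```

Because the input here is a Java String, the parser receives characters rather than bytes, so the encoding named in the XML declaration plays no role in this particular sketch.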

 

 

 

Java BOM recognition
UnicodeReader class
JDK bug 4508058

Java's default I/O readers do not recognize all BOM markers. This is reported to be fixed in JDK 6, but I haven't tested it yet. You can use the UnicodeReader class to work around the problem and auto-recognize BOM markers. It wraps the underlying input stream and behaves transparently.

Example code using UnicodeReader class
Here is an example method for reading a text file. It recognizes the BOM marker and skips it while reading.
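
If you cannot add the UnicodeReader class to your project, a similar effect can be approximated with plain JDK classes, at least for UTF-8. The sketch below is not the UnicodeReader class and does not auto-detect UTF-16. It simply decodes the stream as UTF-8 and drops one leading U+FEFF character if the decoder passed it through. The class name BomSkippingRead and the method name readUtf8 are assumptions for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for illustration; handles UTF-8 input only.
final class BomSkippingRead {

    // Reads a whole UTF-8 stream as text, dropping one leading U+FEFF if present.
    static String readUtf8(InputStream in) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            reader.mark(1);
            if (reader.read() != 0xFEFF) {
                reader.reset(); // first char was real content (or end of stream): rewind
            }
            StringBuilder text = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
            return text.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a UTF-8 text file as the first argument.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(readUtf8(in));
        }
    }
}
```

The check works whether or not the JDK in use strips the marker itself: if the first character is not U+FEFF, the reader is rewound and nothing is lost.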

 

Example code to write UTF-8 with bom marker
Write the BOM marker bytes at the start of an empty file, and any proper text editor will pick the correct charset when reading it. Java's OutputStreamWriter does not write the UTF-8 BOM marker bytes itself.
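
A minimal sketch of that idea, assuming a hypothetical BomWriter class (the class and method names are made up for this example): the three marker bytes EF BB BF go to the raw output stream first, and only then is the stream wrapped in an OutputStreamWriter for the actual text.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only.
final class BomWriter {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Creates (or overwrites) the file, writes the BOM bytes, then the text as UTF-8.
    static void writeWithBom(Path file, String text) throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(UTF8_BOM); // raw bytes first, before any encoded text
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writer.write(text);
            writer.flush();      // push buffered characters out before the stream closes
        }
    }

    public static void main(String[] args) throws IOException {
        writeWithBom(Paths.get("data.txt"), "J\u00e4ttil\u00e4inen meni keitti\u00f6\u00f6n\n");
    }
}
```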


 

 

XML Test Application, Config Test Application
An example application using the UnicodeReader class, with full sources. It reads various Unicode XML text files and outputs the values to a UTF-8_with_BOM text file. The application uses the UnicodeReader class to auto-recognize Unicode BOM markers.
TestXML = reads and writes an XML file
TestConfig = reads and writes a properties file

javaXMLTest.zip
Reference image of xml file output
Html test page


Run the test application, then open the data.txt.rtf file in WordPad or any text editor that can use Unicode TrueType/OpenType fonts. I have found the Arial Unicode MS font to be very good. The file is just a plain text file even though it has an .rtf suffix. You may open it in Notepad, but it might not show all characters properly by default. You can still use Notepad; just do not edit the unknown black-box characters before saving the file.

 

 

Notes:

    1. This article is reposted from: http://koti.mbnet.fi/akini/java/java_utf8_xml/

    2. UnicodeReader and UnicodeInputStream download: http://koti.mbnet.fi/akini/java/unicodereader/

 

 
