UTF_BOM FAQ
www Escapes
Wikipedia UTF-8
How to get umlaut characters (ääkköset) working in a servlet (in Finnish)
Use UTF-8 for your HTML files
You should use UTF-8 for all your HTML files; it just makes life easier. There are two things to keep in mind, listed below. If you follow these simple rules, your site's readers should not have problems displaying text.
- Save your .html files as UTF-8 encoded text files
- Add a "meta http-equiv" tag that declares the charset to the head section of your HTML files
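As a rough illustration of the two rules, here is a minimal sketch in Java that builds a small page with the meta tag in its head and saves it as UTF-8. The class name `Utf8HtmlWriter`, its method names, and the page text are assumptions made up for this example; they do not come from the original article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class Utf8HtmlWriter {

    // Rule 2: the head declares the charset with a "meta http-equiv" tag.
    // Note: bodyText is inserted as-is, without HTML escaping.
    static String buildPage(String bodyText) {
        return "<html>\n"
             + "<head>\n"
             + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"
             + "<title>UTF-8 test</title>\n"
             + "</head>\n"
             + "<body><p>" + bodyText + "</p></body>\n"
             + "</html>\n";
    }

    // Rule 1: the file content is stored as UTF-8 encoded bytes.
    static void save(Path file, String html) throws IOException {
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-test", ".html");
        // Unicode escapes keep this source file independent of the editor's encoding.
        save(file, buildPage("J\u00e4ttil\u00e4inen meni keitti\u00f6\u00f6n"));
        System.out.println("Wrote " + file);
    }
}
```

The declared charset and the actual byte encoding have to agree; declaring UTF-8 while saving in another charset is the usual cause of garbled characters.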
XML
You should put a BOM (byte order mark) at the start of text files if possible. To be even safer, add an XML header row that specifies the encoding used within the document.
<?xml version="1.0" encoding="UTF-8"?>
Windows Notepad (Win2k, XP) can save files with a BOM marker. Switch to another text editor if your favourite one cannot cope with standard BOM markers.
- UTF-8
- UTF-16BE (big endian)
- UTF-16LE (little endian)
Windows WordPad (Win2k, XP) cannot save files in the UTF-8 charset.
Here is a small example XML document.
After unescaping, the document should read as follows.
Jättiläinen meni keittiöön
ja kaatoi kaikki kattilat. Hiiri
meni puutarhaan
ja söi kaikki puut.
char entities: < > & " '
safe xml chars: /O/
Decimal Numeric Character Reference: Ä €
Hex Numeric Character Reference: Ä €
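The escaped forms are not visible in the lines above, since they are already shown unescaped. As a hedged illustration of the unescaping step, the sketch below lets the JDK's standard DOM parser resolve the five predefined entities and the decimal and hex numeric character references. The class name `XmlUnescapeDemo` and the one-element input document are assumptions for this example, not the article's original sample file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class for illustration only.
class XmlUnescapeDemo {

    // Pure ASCII input: entities, then decimal and hex references for A-umlaut and the euro sign.
    static final String XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<text>&lt; &gt; &amp; &quot; &apos; &#196; &#8364; &#xC4; &#x20AC;</text>";

    // Parses the document and returns the root element's text with all references resolved.
    static String unescape(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // How the result looks on screen depends on the console's charset and font.
        System.out.println(unescape(XML));
    }
}
```

Because the escaped document contains only ASCII bytes, it survives editors and transports that mishandle other charsets, which is the main reason to use numeric references.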
Java BOM recognition
UnicodeReader class
JDK bug 4508058
Java's default IO reader does not recognize all BOM markers. It is reportedly fixed in JDK 6, but I haven't tested it yet. You can use the UnicodeReader class to work around the problem and auto-recognize BOM markers. It behaves transparently on top of the underlying input streams.
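A quick way to check this on your own runtime: the sketch below feeds a UTF-8 BOM followed by the letter a to the standard `InputStreamReader` and reports the first decoded character. The class name `BomLeakDemo` is made up for illustration. If the reader does not recognize the BOM, the first character is U+FEFF rather than U+0061.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class BomLeakDemo {

    // Returns the first char decoded from the bytes by the JDK's plain UTF-8 reader.
    static int firstChar(byte[] data) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM (EF BB BF) followed by the letter 'a'.
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a' };
        System.out.printf("first char: U+%04X%n", firstChar(withBom));
    }
}
```

When the BOM is passed through like this, the stray U+FEFF ends up glued to the first line or token of the file, which is the problem a BOM-aware reader avoids.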
Example code using UnicodeReader class
Here is an example method for reading a text file. It recognizes the BOM marker and skips it while reading.
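The article's own listing is not reproduced in this copy. As a stand-in, here is a self-contained sketch of the same idea that does not depend on the UnicodeReader class: it peeks at the first bytes through a `PushbackInputStream`, picks the charset from the BOM, and pushes back anything that was not part of the BOM. The class name `BomAwareRead` and its methods are hypothetical, and only the UTF-8, UTF-16BE and UTF-16LE markers are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration; this is not the UnicodeReader class itself.
class BomAwareRead {

    // Opens a reader whose charset is chosen from the BOM; the BOM itself is skipped.
    // Falls back to defaultCharset when no BOM is found. UTF-32 BOMs are not handled.
    static BufferedReader open(InputStream in, Charset defaultCharset) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;          // stream is shorter than three bytes
            n += r;
        }
        Charset cs = defaultCharset;
        int bomLen = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLen = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLen = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLen = 2;
        }
        // Return the bytes that were read past the BOM to the stream.
        if (n > bomLen) pin.unread(head, bomLen, n - bomLen);
        return new BufferedReader(new InputStreamReader(pin, cs));
    }

    // Reads the whole stream into a String, without the BOM.
    static String readAll(InputStream in, Charset defaultCharset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = open(in, defaultCharset)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }
}
```

A typical call would be `BomAwareRead.readAll(new FileInputStream("data.xml"), StandardCharsets.ISO_8859_1)`, where the file name and the fallback charset are placeholders to replace with your own.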
Example code to write UTF-8 with bom marker
Write the BOM bytes at the start of an empty file, and all proper text editors will pick the correct charset when reading it. Java's OutputStreamWriter does not write the UTF-8 BOM bytes.
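The original listing is not included in this copy, so the following is only a minimal sketch of the approach described: write the three BOM bytes (EF BB BF) to the raw stream first, then wrap the same stream in an `OutputStreamWriter` for the text. The class name `BomWrite` and the temp-file name are assumptions for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class BomWrite {

    // The UTF-8 encoding of U+FEFF.
    static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Writes the BOM bytes first, then the text encoded as UTF-8.
    // The caller keeps ownership of the stream and is responsible for closing it.
    static void write(OutputStream out, String text) throws IOException {
        out.write(UTF8_BOM);
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        w.write(text);
        w.flush();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bom-test", ".txt");
        try (OutputStream out = Files.newOutputStream(file)) {
            write(out, "J\u00e4ttil\u00e4inen\n");
        }
        System.out.println("Wrote " + file);
    }
}
```

The BOM must be written exactly once, at offset zero; when appending to an existing file, skip it.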
XML Test Application, Config Test Application
Example application using the UnicodeReader class, with full sources. It reads various Unicode XML text files and outputs the values to a UTF-8-with-BOM text file. The application uses the UnicodeReader class to auto-recognize Unicode BOM markers.
TestXML = read and write xml file
TestConfig = read and write properties file
javaXMLTest.zip
Reference image of xml file output
Html test page
Run the test application, then open the data.txt.rtf file in WordPad or any text editor able to use Unicode TrueType/OpenType fonts. I have found the Arial Unicode MS font to be very good. The file is just a text file even though it has an .rtf suffix. You may open it in Notepad, but by default it might not show all characters properly. You can still use Notepad; just do not edit the unknown black-box characters before saving the file.
Notes:
1. This article is reposted from: http://koti.mbnet.fi/akini/java/java_utf8_xml/
2. Download location for UnicodeReader and UnicodeInputStream: http://koti.mbnet.fi/akini/java/unicodereader/