CP1252 and ISO-8859-1

 http://www.herongyang.com/Unicode/Java-charset-EncodingSampler-Test-encode-Method.html

http://www.herongyang.com/Unicode/Java-charset-Example-of-CP1252-ISO-8859-1-Encoding.html

The JDK offers four ways to encode characters:

  • CharsetEncoder.encode()
  • Charset.encode()
  • String.getBytes()
  • OutputStreamWriter.write()

Here is a program that demonstrates how to encode characters with each of the four methods above:

/**
 * EncodingSampler.java
 * Copyright (c) 2002 by Dr. Herong Yang
 */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class EncodingSampler {
   static String dfltCharset = null;
   static char[] chars={0x0000, 0x003F, 0x0040, 0x007F, 0x0080, 0x00BF,
                        0x00C0, 0x00FF, 0x0100, 0x3FFF, 0x4000, 0x7FFF,
                        0x8000, 0xBFFF, 0xC000, 0xEFFF, 0xF000, 0xFFFF};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      String charset = null;
      if (arg.length>0) charset = arg[0];
      OutputStreamWriter o = new OutputStreamWriter(
         new ByteArrayOutputStream());
      dfltCharset = o.getEncoding();
      if (charset==null) System.out.println("Default ("+dfltCharset
         +") encoding:");
      else System.out.println(charset+" encoding:");
      System.out.println("Char, String, Writer, Charset, Encoder");
      for (int i=0; i<chars.length; i++) {
         char c = chars[i];
         byte[] b1 = encodeByString(c,charset);
         byte[] b2 = encodeByWriter(c,charset);
         byte[] b3 = encodeByCharset(c,charset);
         byte[] b4 = encodeByEncoder(c,charset);
         System.out.print(charToHex(c)+",");
         printBytes(b1);
         System.out.print(",");
         printBytes(b2);
         System.out.print(",");
         printBytes(b3);
         System.out.print(",");
         printBytes(b4);
         System.out.println("");
      }
   }
   public static byte[] encodeByCharset(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {   	
         if (cs==null) cso = Charset.forName(dfltCharset);
         else cso = Charset.forName(cs);
         ByteBuffer bb = cso.encode(String.valueOf(c));
         b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalArgumentException e) { // also covers UnsupportedCharsetException
         System.out.println(e.toString());
      }      	
      return b;
   }
   public static byte[] encodeByEncoder(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {   	
         if (cs==null) cso = Charset.forName(dfltCharset);
         else cso = Charset.forName(cs);
         CharsetEncoder e =  cso.newEncoder();
         e.reset();
         ByteBuffer bb = e.encode(CharBuffer.wrap(new char[] {c}));
         b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalArgumentException e) { // also covers UnsupportedCharsetException
         System.out.println(e.toString());
      } catch (CharacterCodingException e) {
         //System.out.println(e.toString());
         b = new byte[] {(byte)0x00};
      }      	
      return b;
   }
   public static byte[] encodeByString(char c, String cs) {
      String s = String.valueOf(c);
      byte[] b = null;
      if (cs==null) {
         b = s.getBytes();
      } else {
         try {
            b = s.getBytes(cs);
         } catch (UnsupportedEncodingException e) {
            System.out.println(e.toString());
         }
      }
      return b;
   }
   public static byte[] encodeByWriter(char c, String cs) {
      byte[] b = null;
      ByteArrayOutputStream bs = new ByteArrayOutputStream();
      OutputStreamWriter o = null;
      if (cs==null) {
         o = new OutputStreamWriter(bs);
      } else {
         try {
            o = new OutputStreamWriter(bs, cs);
         } catch (UnsupportedEncodingException e) {
            System.out.println(e.toString());
            return b; // no writer could be created; avoid a NullPointerException below
         }
      }
      String s = String.valueOf(c);
      try {
         o.write(s);
         o.flush();
         b = bs.toByteArray();
         o.close();
      } catch (IOException e) {
         System.out.println(e.toString());
      }
      return b;
   }
   public static byte[] copyBytes(byte[] a, int l) {
      byte[] b = new byte[l];
      for (int i=0; i<Math.min(l,a.length); i++) b[i] = a[i];
      return b;
   }
   public static void printBytes(byte[] b) {
      if (b == null) return; // encoding failed; the error was already printed
      for (int j=0; j<b.length; j++)
         System.out.print(" "+byteToHex(b[j]));
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}

Note that:

  • If the same encoding is used, each of the encode methods in the program should return exactly the same byte sequence.
  • getEncoding() is called on the OutputStreamWriter class to get the name of the default encoding.
  • The String class offers no way to get the name of the default encoding.
  • The JDK of that time (1.4) had no default instance of Charset or CharsetEncoder. Charset.defaultCharset() was added later, in Java 5.
  • In encodeByEncoder(), 0x00 is used as the output when the given character cannot be encoded by the encoder.
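The last note reflects how CharsetEncoder behaves by default: it reports an unmappable character as an exception, whereas the other three methods silently substitute a replacement byte. The following is only a minimal sketch of that difference, not part of the original program. The class name UnmappableDemo and its helper methods are made up for illustration, and returning null for a failed encode is a choice made for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
   // Default encoder: unmappable input is reported as an exception.
   // This sketch returns null in that case.
   static byte[] strictEncode(char c, Charset cs) {
      try {
         return toArray(cs.newEncoder().encode(CharBuffer.wrap(new char[] {c})));
      } catch (CharacterCodingException e) {
         return null;
      }
   }

   // Encoder configured to substitute its replacement bytes instead of failing.
   static byte[] lenientEncode(char c, Charset cs) {
      CharsetEncoder enc = cs.newEncoder()
         .onMalformedInput(CodingErrorAction.REPLACE)
         .onUnmappableCharacter(CodingErrorAction.REPLACE);
      try {
         return toArray(enc.encode(CharBuffer.wrap(new char[] {c})));
      } catch (CharacterCodingException e) {
         throw new IllegalStateException(e); // not expected with REPLACE
      }
   }

   // Copies the readable bytes of the buffer (position up to limit).
   static byte[] toArray(ByteBuffer bb) {
      byte[] b = new byte[bb.remaining()];
      bb.get(b);
      return b;
   }

   public static void main(String[] args) {
      Charset latin1 = StandardCharsets.ISO_8859_1;
      // U+0100 has no ISO-8859-1 byte: strict fails, lenient substitutes.
      System.out.println("strict failed: " + (strictEncode('\u0100', latin1) == null));
      System.out.println("lenient byte:  " + lenientEncode('\u0100', latin1)[0]);
   }
}
```

With the REPLACE action set, the encoder column of the tables would be expected to match the other three columns, since the encoder would then emit its replacement byte rather than throw.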

Running the test program above, EncodingSampler.java, without any argument uses the JVM's default encoding:

Default (Cp1252) encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

The results show that:

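  • The String class appears to use the same default encoding as OutputStreamWriter: Cp1252.
  • A number of characters cannot be encoded by Cp1252. The String, OutputStreamWriter, and Charset classes return 0x3F ('?') for those characters, while the Encoder column shows the 0x00 placeholder set in encodeByEncoder().
  • Every character that Cp1252 could encode in this sample lies in the 0x0000 - 0x00FF range, but the range is not covered completely: 0x0080 cannot be encoded. The sample also does not test characters such as the euro sign (0x20AC), which Cp1252 maps to byte 0x80.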
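The sample characters above do not include any of the characters that Cp1252 assigns to bytes 0x80 - 0x9F, so the table cannot show them. The sketch below probes one of them, the euro sign. It assumes that the JVM provides the charset under the name "windows-1252"; the class name Cp1252EuroDemo and the helper encodeHex() are hypothetical.

```java
import java.nio.charset.Charset;

class Cp1252EuroDemo {
   // Assumption: this JVM ships windows-1252 (Cp1252 is an alias for it).
   static final Charset CP1252 = Charset.forName("windows-1252");

   // Hex dump of the bytes that String.getBytes(Charset) produces for one char.
   static String encodeHex(char c) {
      StringBuilder sb = new StringBuilder();
      for (byte b : String.valueOf(c).getBytes(CP1252)) {
         sb.append(String.format("%02X", b));
      }
      return sb.toString();
   }

   public static void main(String[] args) {
      char euro = '\u20AC';
      System.out.println("U+20AC encodable: " + CP1252.newEncoder().canEncode(euro));
      System.out.println("U+20AC bytes:     " + encodeHex(euro));
      System.out.println("U+0080 encodable: " + CP1252.newEncoder().canEncode('\u0080'));
   }
}
```

If the charset is available, this should show that a character well outside 0x0000 - 0x00FF can still have a single-byte Cp1252 encoding.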

Running the program again with 'CP1252' as argument should give us the same output as the previous run:

CP1252 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

Let's try another encoding, ISO-8859-1:

ISO-8859-1 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 80, 80, 80, 80
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

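The output is almost the same as for CP1252. The one difference in this sample is 0x0080, which ISO-8859-1 encodes as byte 0x80 while CP1252 cannot encode it.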
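Only one sample character falls in the 0x0080 - 0x009F range, so the tables give a limited view of where the two encodings agree. A rough way to compare them over whole ranges is sketched below. It assumes the JVM provides "windows-1252"; the class name Latin1VsCp1252 and its methods are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1VsCp1252 {
   static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
   static final Charset CP1252 = Charset.forName("windows-1252"); // assumed available

   // Counts the chars in [from, to] that the given charset can encode.
   static int countEncodable(Charset cs, int from, int to) {
      CharsetEncoder enc = cs.newEncoder();
      int n = 0;
      for (int c = from; c <= to; c++) {
         if (enc.canEncode((char) c)) n++;
      }
      return n;
   }

   // True if both charsets produce identical bytes for every char in [from, to].
   static boolean sameBytes(int from, int to) {
      for (int c = from; c <= to; c++) {
         String s = String.valueOf((char) c);
         if (!Arrays.equals(s.getBytes(LATIN1), s.getBytes(CP1252))) return false;
      }
      return true;
   }

   public static void main(String[] args) {
      System.out.println("Same bytes for 0x0000-0x007F: " + sameBytes(0x00, 0x7F));
      System.out.println("Same bytes for 0x0080-0x009F: " + sameBytes(0x80, 0x9F));
      System.out.println("Same bytes for 0x00A0-0x00FF: " + sameBytes(0xA0, 0xFF));
      System.out.println("ISO-8859-1 encodable in 0x0080-0x009F: "
         + countEncodable(LATIN1, 0x80, 0x9F));
      System.out.println("CP1252 encodable in 0x0080-0x009F:     "
         + countEncodable(CP1252, 0x80, 0x9F));
   }
}
```

The comparison uses String.getBytes(Charset), so unmappable characters show up as the replacement byte 0x3F, just as in the String column of the tables.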

 
