Unicode Codes of Chinese Characters

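Java uses the Unicode character set. A char is a 16-bit UTF-16 code unit (2 bytes), so it can hold 65,536 distinct values (\u0000 to \uFFFF); Unicode itself extends to U+10FFFF, and characters above U+FFFF are stored as a pair of chars.
English letters, Russian letters, Greek letters, Japanese kana, Arabic numerals, punctuation marks, and Chinese characters are all characters in Unicode.
Specifically, the commonly cited range for Chinese characters is \u4E00~\u9FA5. 0x9FA5 - 0x4E00 = 0x51A5 = 20901, so the range holds 20,902 code points when both ends are counted; about 7,000 of those characters are in everyday use.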
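The size of the \u4E00~\u9FA5 range and the number of values a char can hold can be checked with a few lines of Java. This is only a minimal sketch; the class name CjkRangeSize and the helper count are made up for illustration.

```java
class CjkRangeSize {
    // Number of code points in the inclusive range [first, last].
    static int count(int first, int last) {
        return last - first + 1;
    }

    public static void main(String[] args) {
        int first = 0x4E00, last = 0x9FA5;
        System.out.println("difference  = " + (last - first));      // 20901
        System.out.println("code points = " + count(first, last));  // 20902
        // Character.MAX_VALUE is the largest char value (0xFFFF).
        System.out.println("char values = " + (Character.MAX_VALUE + 1)); // 65536
    }
}
```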
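The following program looks up the Unicode codes of two Chinese characters:
package cn.ch.da;

public class ChaToZhongwen {
	public static void main(String[] args) {
		// Assigning a char literal to an int yields its numeric code.
		int s = '中', t = '国';
		System.out.println("Unicode code of '中': " + s + " (U+" + String.format("%04X", s) + ")");
		System.out.println("Unicode code of '国': " + t + " (U+" + String.format("%04X", t) + ")");
		// An in-range int constant can be assigned to a char directly.
		// 20013 and 22269 are decimal; in hex they are 4E2D and 56FD.
		char ch1 = 20013, ch2 = 22269;
		System.out.println("Character for \\u4E2D (decimal 20013): " + ch1);
		System.out.println("Character for \\u56FD (decimal 22269): " + ch2);
	}
}

Output:

Unicode code of '中': 20013 (U+4E2D)
Unicode code of '国': 22269 (U+56FD)
Character for \u4E2D (decimal 20013): 中
Character for \u56FD (decimal 22269): 国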

Note: (distribution of Unicode blocks)

  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  0100..017F; Latin Extended-A
  0180..024F; Latin Extended-B
  0250..02AF; IPA Extensions
  02B0..02FF; Spacing Modifier Letters
  0300..036F; Combining Diacritical Marks
  0370..03FF; Greek
  0400..04FF; Cyrillic
  0530..058F; Armenian
  0590..05FF; Hebrew
  0600..06FF; Arabic
  0700..074F; Syriac
  0780..07BF; Thaana
  0900..097F; Devanagari
  0980..09FF; Bengali
  0A00..0A7F; Gurmukhi
  0A80..0AFF; Gujarati
  0B00..0B7F; Oriya
  0B80..0BFF; Tamil
  0C00..0C7F; Telugu
  0C80..0CFF; Kannada
  0D00..0D7F; Malayalam
  0D80..0DFF; Sinhala
  0E00..0E7F; Thai
  0E80..0EFF; Lao
  0F00..0FFF; Tibetan
  1000..109F; Myanmar
  10A0..10FF; Georgian
  1100..11FF; Hangul Jamo
  1200..137F; Ethiopic
  13A0..13FF; Cherokee
  1400..167F; Unified Canadian Aboriginal Syllabics
  1680..169F; Ogham
  16A0..16FF; Runic
  1780..17FF; Khmer
  1800..18AF; Mongolian
  1E00..1EFF; Latin Extended Additional
  1F00..1FFF; Greek Extended
  2000..206F; General Punctuation
  2070..209F; Superscripts and Subscripts
  20A0..20CF; Currency Symbols
  20D0..20FF; Combining Marks for Symbols
  2100..214F; Letterlike Symbols
  2150..218F; Number Forms
  2190..21FF; Arrows
  2200..22FF; Mathematical Operators
  2300..23FF; Miscellaneous Technical
  2400..243F; Control Pictures
  2440..245F; Optical Character Recognition
  2460..24FF; Enclosed Alphanumerics
  2500..257F; Box Drawing
  2580..259F; Block Elements
  25A0..25FF; Geometric Shapes
  2600..26FF; Miscellaneous Symbols
  2700..27BF; Dingbats
  2800..28FF; Braille Patterns
  2E80..2EFF; CJK Radicals Supplement
  2F00..2FDF; Kangxi Radicals
  2FF0..2FFF; Ideographic Description Characters
  3000..303F; CJK Symbols and Punctuation
  3040..309F; Hiragana
  30A0..30FF; Katakana
  3100..312F; Bopomofo
  3130..318F; Hangul Compatibility Jamo
  3190..319F; Kanbun
  31A0..31BF; Bopomofo Extended
  3200..32FF; Enclosed CJK Letters and Months
  3300..33FF; CJK Compatibility
  3400..4DB5; CJK Unified Ideographs Extension A
  4E00..9FFF; CJK Unified Ideographs
  A000..A48F; Yi Syllables
  A490..A4CF; Yi Radicals
  AC00..D7A3; Hangul Syllables
  D800..DB7F; High Surrogates
  DB80..DBFF; High Private Use Surrogates
  DC00..DFFF; Low Surrogates
  E000..F8FF; Private Use
  F900..FAFF; CJK Compatibility Ideographs
  FB00..FB4F; Alphabetic Presentation Forms
  FB50..FDFF; Arabic Presentation Forms-A
  FE20..FE2F; Combining Half Marks
  FE30..FE4F; CJK Compatibility Forms
  FE50..FE6F; Small Form Variants
  FE70..FEFE; Arabic Presentation Forms-B
  FEFF..FEFF; Specials
  FF00..FFEF; Halfwidth and Fullwidth Forms
  FFF0..FFFD; Specials
  10300..1032F; Old Italic
  10330..1034F; Gothic
  10400..1044F; Deseret
  1D000..1D0FF; Byzantine Musical Symbols
  1D100..1D1FF; Musical Symbols
  1D400..1D7FF; Mathematical Alphanumeric Symbols
  20000..2A6D6; CJK Unified Ideographs Extension B
  2F800..2FA1F; CJK Compatibility Ideographs Supplement
  E0000..E007F; Tags
  F0000..FFFFD; Private Use
  100000..10FFFD; Private Use

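Explanation: CJK characters in Unicode are spread over several ranges. The table
above lists the blocks of the whole Unicode code space, and every block with CJK
in its name contains Chinese characters. The most commonly used range is
U+4E00~U+9FA5, which lies in the block named CJK Unified Ideographs. When this
table was compiled, U+9FA6~U+9FFF were still unassigned; later Unicode versions
have assigned characters in that range, so a hard-coded upper bound of U+9FA5
does not cover them.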
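Which block a given code point belongs to can be looked up at run time with Character.UnicodeBlock.of, which is part of the JDK. The sketch below is only an illustration; the class name CjkBlockLookup, the helper inCjkBlock, and the sample code points are chosen for this example, and the result depends on the Unicode version the running JDK supports.

```java
class CjkBlockLookup {
    // True when the code point lies in a block whose name contains "CJK".
    static boolean inCjkBlock(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        // of() returns null when the code point is in no known block.
        return block != null && block.toString().contains("CJK");
    }

    public static void main(String[] args) {
        // Samples: CJK Unified Ideographs, Extension A, Extension B, Basic Latin.
        int[] samples = {0x4E2D, 0x3400, 0x20000, 0x0041};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s (CJK block: %b)%n",
                    cp, Character.UnicodeBlock.of(cp), inCjkBlock(cp));
        }
        // U+20000 is above U+FFFF, so it occupies two chars in a String.
        System.out.println("chars needed for U+20000: "
                + Character.toChars(0x20000).length);
    }
}
```

Note that the name test is the rule of thumb from the explanation above, not a complete definition of "Chinese character": it also accepts blocks such as CJK Symbols and Punctuation.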

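Also: writing [\u4e00-\u9fa5] in a regular expression hard-codes the range, so it
cannot adapt to the character set the platform actually provides. That is good
enough when the requirements are modest. If the character set has to be handled
more strictly, the Unicode block form below can be used instead: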

String regex = "[\\p{InCJK Unified Ideographs}&&\\P{Cn}]";
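The two approaches can be compared side by side. This is a minimal sketch, not a definitive implementation: the class name CjkRegexDemo and the helper methods are hypothetical, the block is spelled InCJK_Unified_Ideographs (the underscore form of the block name), and the test string "\u4E2D\u56FD" is the word 中国 written with escapes so the source stays ASCII-only.

```java
import java.util.regex.Pattern;

class CjkRegexDemo {
    // Hard-coded range: fixed at U+4E00..U+9FA5 regardless of the JDK.
    static final Pattern FIXED_RANGE = Pattern.compile("[\\u4e00-\\u9fa5]+");

    // Block-based class, minus unassigned code points (category Cn);
    // what it matches follows the Unicode data shipped with the JDK.
    static final Pattern BLOCK_BASED =
            Pattern.compile("[\\p{InCJK_Unified_Ideographs}&&\\P{Cn}]+");

    static boolean matchesFixed(String s) {
        return FIXED_RANGE.matcher(s).matches();
    }

    static boolean matchesBlock(String s) {
        return BLOCK_BASED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String zhongGuo = "\u4E2D\u56FD";
        System.out.println(matchesFixed(zhongGuo)); // true
        System.out.println(matchesBlock(zhongGuo)); // true
        System.out.println(matchesBlock("China"));  // false
    }
}
```

The block-based pattern is limited to the single CJK Unified Ideographs block, so characters from the extension blocks in the table above would need their own \p{In...} terms.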




 
