Distribution of Unicode code blocks

2012-03-19, compiled from: http://topic.csdn.net/u/20080629/00/2f669f44-6e30-4e2e-9cce-08889dba2ae2.html

------------------------------------------------------------------------------------------------------------

  0000..007F; Basic Latin  
  0080..00FF; Latin-1 Supplement  
  0100..017F; Latin Extended-A  
  0180..024F; Latin Extended-B  
  0250..02AF; IPA Extensions  
  02B0..02FF; Spacing Modifier Letters  
  0300..036F; Combining Diacritical Marks  
  0370..03FF; Greek  
  0400..04FF; Cyrillic  
  0530..058F; Armenian  
  0590..05FF; Hebrew  
  0600..06FF; Arabic  
  0700..074F; Syriac  
  0780..07BF; Thaana  
  0900..097F; Devanagari  
  0980..09FF; Bengali  
  0A00..0A7F; Gurmukhi  
  0A80..0AFF; Gujarati  
  0B00..0B7F; Oriya  
  0B80..0BFF; Tamil  
  0C00..0C7F; Telugu  
  0C80..0CFF; Kannada  
  0D00..0D7F; Malayalam  
  0D80..0DFF; Sinhala  
  0E00..0E7F; Thai  
  0E80..0EFF; Lao  
  0F00..0FFF; Tibetan  
  1000..109F; Myanmar  
  10A0..10FF; Georgian  
  1100..11FF; Hangul Jamo  
  1200..137F; Ethiopic  
  13A0..13FF; Cherokee  
  1400..167F; Unified Canadian Aboriginal Syllabics  
  1680..169F; Ogham  
  16A0..16FF; Runic  
  1780..17FF; Khmer  
  1800..18AF; Mongolian  
  1E00..1EFF; Latin Extended Additional  
  1F00..1FFF; Greek Extended  
  2000..206F; General Punctuation  
  2070..209F; Superscripts and Subscripts  
  20A0..20CF; Currency Symbols  
  20D0..20FF; Combining Marks for Symbols  
  2100..214F; Letterlike Symbols  
  2150..218F; Number Forms  
  2190..21FF; Arrows  
  2200..22FF; Mathematical Operators  
  2300..23FF; Miscellaneous Technical  
  2400..243F; Control Pictures  
  2440..245F; Optical Character Recognition  
  2460..24FF; Enclosed Alphanumerics  
  2500..257F; Box Drawing  
  2580..259F; Block Elements  
  25A0..25FF; Geometric Shapes  
  2600..26FF; Miscellaneous Symbols  
  2700..27BF; Dingbats  
  2800..28FF; Braille Patterns  
  2E80..2EFF; CJK Radicals Supplement  
  2F00..2FDF; Kangxi Radicals  
  2FF0..2FFF; Ideographic Description Characters  
  3000..303F; CJK Symbols and Punctuation  
  3040..309F; Hiragana  
  30A0..30FF; Katakana  
  3100..312F; Bopomofo  
  3130..318F; Hangul Compatibility Jamo  
  3190..319F; Kanbun  
  31A0..31BF; Bopomofo Extended  
  3200..32FF; Enclosed CJK Letters and Months  
  3300..33FF; CJK Compatibility                                     // Chinese characters start here
  3400..4DB5; CJK Unified Ideographs Extension A  
  4E00..9FFF; CJK Unified Ideographs                           // Chinese characters end here
  A000..A48F; Yi Syllables  
  A490..A4CF; Yi Radicals  
  AC00..D7A3; Hangul Syllables  
  D800..DB7F; High Surrogates  
  DB80..DBFF; High Private Use Surrogates  
  DC00..DFFF; Low Surrogates  
  E000..F8FF; Private Use  
  F900..FAFF; CJK Compatibility Ideographs  
  FB00..FB4F; Alphabetic Presentation Forms  
  FB50..FDFF; Arabic Presentation Forms-A  
  FE20..FE2F; Combining Half Marks  
  FE30..FE4F; CJK Compatibility Forms  
  FE50..FE6F; Small Form Variants  
  FE70..FEFE; Arabic Presentation Forms-B  
  FEFF..FEFF; Specials  
  FF00..FFEF; Halfwidth and Fullwidth Forms  
  FFF0..FFFD; Specials  
  10300..1032F; Old Italic  
  10330..1034F; Gothic  
  10400..1044F; Deseret  
  1D000..1D0FF; Byzantine Musical Symbols  
  1D100..1D1FF; Musical Symbols  
  1D400..1D7FF; Mathematical Alphanumeric Symbols  
  20000..2A6D6; CJK Unified Ideographs Extension B  
  2F800..2FA1F; CJK Compatibility Ideographs Supplement  
  E0000..E007F; Tags  
  F0000..FFFFD; Private Use  
  100000..10FFFD; Private Use  

---------------------------------------------------------------------------------------------------------------
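
The block list above can also be queried from Java itself. The following is a minimal sketch, assuming only the standard `Character.UnicodeBlock.of(int)` call; the class name `BlockLookup` and the helper `blockNameOf` are made up for illustration. The names it prints are the JDK's constant names (for example `BASIC_LATIN`), not the spellings used in the list, and the set of known blocks depends on the Unicode version the running JDK supports.

```java
final class BlockLookup {

    /**
     * Returns the JDK's name for the Unicode block that contains the given
     * code point, or "(no block)" if the JDK knows of no block for it.
     */
    static String blockNameOf(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "(no block)" : block.toString();
    }

    public static void main(String[] args) {
        // 'A', U+4E2D (a Han character), U+3042 (Hiragana "a")
        String sample = "A\u4E2D\u3042";
        // Iterate by code point rather than by char, so that characters
        // outside the BMP (such as 20000..2A6D6) would be handled as well.
        sample.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, blockNameOf(cp)));
    }
}
```

Iterating with `codePoints()` instead of `charAt` matters for the entries above FFFF in the list, because Java stores those as surrogate pairs.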

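Han characters (CJK) in Unicode are spread across several ranges, chiefly the blocks with CJK in their name. The most commonly used range is U+4E00~U+9FA5, which lies in the block named CJK Unified Ideographs.
In the Unicode version the list above is based on, U+9FA6~U+9FFF were still unassigned. There was no guarantee they would stay that way, and later Unicode versions have in fact assigned code points in this range.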
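
To see the difference between the fixed range and the block as a whole, one can compare a hard-coded bounds check with a check against the running JDK's Unicode data. This is only a sketch: `HanRangeCheck` and both method names are hypothetical, and what it reports for U+9FA6~U+9FFF depends on the Unicode version shipped with the JDK in use, so no output is shown.

```java
final class HanRangeCheck {

    /** True if the code point lies in the commonly used range U+4E00..U+9FA5. */
    static boolean inCommonHanRange(int cp) {
        return cp >= 0x4E00 && cp <= 0x9FA5;
    }

    /**
     * True if the code point lies in the CJK Unified Ideographs block and is
     * assigned according to the Unicode data of the running JDK.
     */
    static boolean isAssignedInBlock(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                && Character.isDefined(cp);
    }

    public static void main(String[] args) {
        int[] probes = {0x4E00, 0x9FA5, 0x9FA6, 0x9FFF};
        for (int cp : probes) {
            System.out.printf("U+%04X fixed-range=%b assigned-in-block=%b%n",
                    cp, inCommonHanRange(cp), isAssignedInBlock(cp));
        }
    }
}
```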

PS: the code chart for U+4E00~U+9FFF in Unicode:
http://www.unicode.org/charts/PDF/U4E00.pdf

Any character can be looked up by its Unicode code point here:
http://www.unicode.org/cgi-bin/GetUnihanData.pl


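Also: using [\u4e00-\u9fa5] in a regular expression is hard-coded; it does not change
when the character set range provided by the platform changes. It is good enough when the
requirements are not strict. If the requirements on the character set are strict, the
Unicode block form below can be used instead: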

Java code: String regex = "[\\p{InCJK Unified Ideographs}&&\\P{Cn}]";

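In the JDK version current at the time of writing, this has the same meaning as [\u4e00-\u9fa5].
However, it matches whatever defined characters the Java platform supports in the Unicode block
named CJK Unified Ideographs, so it is "living" code: once a later JDK release defines characters
from \u9fa6 onward, the same pattern matches them as well.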
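
The two patterns can be compared side by side. The sketch below is illustrative only: `HanRegexDemo` and its method names are invented here, and it uses the whitespace-free spelling `InCJKUnifiedIdeographs` of the block name, which `java.util.regex.Pattern` also accepts. `\P{Cn}` excludes unassigned code points, and `&&` is Java's character class intersection. Whether U+9FA6 matches the block-based pattern depends on the JDK's Unicode version, so that result is printed rather than stated.

```java
import java.util.regex.Pattern;

final class HanRegexDemo {

    // Hard-coded range: always exactly U+4E00..U+9FA5.
    static final Pattern FIXED = Pattern.compile("[\\u4e00-\\u9fa5]");

    // Block-based: the CJK Unified Ideographs block, intersected with
    // "not unassigned" (\P{Cn}), as defined by the running JDK.
    static final Pattern BLOCK =
            Pattern.compile("[\\p{InCJKUnifiedIdeographs}&&\\P{Cn}]");

    static boolean matchesFixed(String s) {
        return FIXED.matcher(s).matches();
    }

    static boolean matchesBlock(String s) {
        return BLOCK.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] probes = {"\u4E2D", "a", "\u9FA6"};
        for (String s : probes) {
            System.out.printf("U+%04X fixed=%b block=%b%n",
                    s.codePointAt(0), matchesFixed(s), matchesBlock(s));
        }
    }
}
```

Both patterns match a single character; append `+` to the class to match a run of Han characters.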
