1. Unicode code point table for Chinese characters
No. | Character set | Count | Unicode range (hex) |
---|---|---|---|
1 | Basic ideographs | 20902 | 4E00-9FA5 |
2 | Basic ideographs supplement | 74 | 9FA6-9FEF |
3 | Extension A | 6582 | 3400-4DB5 |
4 | Extension B | 42711 | 20000-2A6D6 |
5 | Extension C | 4149 | 2A700-2B734 |
6 | Extension D | 222 | 2B740-2B81D |
7 | Extension E | 5762 | 2B820-2CEA1 |
8 | Extension F | 7473 | 2CEB0-2EBE0 |
9 | Kangxi radicals | 214 | 2F00-2FD5 |
10 | Radicals supplement | 115 | 2E80-2EF3 |
11 | Compatibility ideographs | 477 | F900-FAD9 |
12 | Compatibility supplement | 542 | 2F800-2FA1D |
13 | PUA (GBK) components | 81 | E815-E86F |
14 | Components extension | 452 | E400-E5E8 |
15 | PUA supplement | 207 | E600-E6CF |
16 | CJK strokes | 36 | 31C0-31E3 |
17 | Ideographic description characters | 12 | 2FF0-2FFB |
18 | Bopomofo | 43 | 3105-312F |
19 | Bopomofo extended | 22 | 31A0-31BA |
20 | 〇 | 1 | 3007 |
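The ranges in the table are inclusive hexadecimal code points, so the width of a range is `last - first + 1`. As a small sanity check (a sketch only; `SAMPLE_ROWS` and `span_size` are hypothetical names, not part of the original post), this reproduces the listed counts for rows 1, 3, and 4:

```python
# Hypothetical helper names; rows 1, 3 and 4 copied from the table above
# as (row number, first code point, last code point, listed count).
SAMPLE_ROWS = [
    (1, 0x4E00, 0x9FA5, 20902),
    (3, 0x3400, 0x4DB5, 6582),
    (4, 0x20000, 0x2A6D6, 42711),
]


def span_size(first, last):
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


for row, first, last, listed in SAMPLE_ROWS:
    # Each of these three rows prints True.
    print(row, span_size(first, last) == listed)
```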
2. Python implementation
# Return False as soon as a single non-Chinese character is found.
# The if condition is long; there must be a simpler way to write it. More to learn!
# Code points above U+FFFF need the 8-digit \U escape: '\u20000' would be parsed
# as '\u2000' followed by the character '0', which makes the comparison wrong.
# 〇 (U+3007, row 20 of the table) is not checked here.
def is_ch(word):
    for ch in word:
        if (not ('\u4e00' <= ch <= '\u9fef') and not ('\u3400' <= ch <= '\u4db5')
                and not ('\U00020000' <= ch <= '\U0002a6d6') and not ('\U0002a700' <= ch <= '\U0002b734')
                and not ('\U0002b740' <= ch <= '\U0002b81d') and not ('\U0002b820' <= ch <= '\U0002cea1')
                and not ('\U0002ceb0' <= ch <= '\U0002ebe0') and not ('\u2f00' <= ch <= '\u2fd5')
                and not ('\u2e80' <= ch <= '\u2ef3') and not ('\uf900' <= ch <= '\ufad9')
                and not ('\U0002f800' <= ch <= '\U0002fa1d') and not ('\ue815' <= ch <= '\ue86f')
                and not ('\ue400' <= ch <= '\ue5e8') and not ('\ue600' <= ch <= '\ue6cf')
                and not ('\u31c0' <= ch <= '\u31e3') and not ('\u2ff0' <= ch <= '\u2ffb')
                and not ('\u3105' <= ch <= '\u312f') and not ('\u31a0' <= ch <= '\u31ba')):
            return False
    return True
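The comment in the code suspects that a simpler form exists. One possible sketch, not necessarily the best one, keeps the ranges as integer pairs and compares `ord(ch)` against them. The names `CHINESE_RANGES` and `is_ch_table` are hypothetical; the ranges are copied from the table in section 1, and this variant also includes 〇 (U+3007). Like the loop version, it returns True for an empty string.

```python
# Hypothetical table-driven variant; ranges copied from the table in section 1.
CHINESE_RANGES = [
    (0x4E00, 0x9FEF),    # basic ideographs + supplement (rows 1-2)
    (0x3400, 0x4DB5),    # Extension A
    (0x20000, 0x2A6D6),  # Extension B
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
    (0x2B820, 0x2CEA1),  # Extension E
    (0x2CEB0, 0x2EBE0),  # Extension F
    (0x2F00, 0x2FD5),    # Kangxi radicals
    (0x2E80, 0x2EF3),    # radicals supplement
    (0xF900, 0xFAD9),    # compatibility ideographs
    (0x2F800, 0x2FA1D),  # compatibility supplement
    (0xE815, 0xE86F),    # PUA (GBK) components
    (0xE400, 0xE5E8),    # components extension
    (0xE600, 0xE6CF),    # PUA supplement
    (0x31C0, 0x31E3),    # CJK strokes
    (0x2FF0, 0x2FFB),    # ideographic description characters
    (0x3105, 0x312F),    # Bopomofo
    (0x31A0, 0x31BA),    # Bopomofo extended
    (0x3007, 0x3007),    # 〇 (row 20 of the table)
]


def is_ch_table(word):
    """Return True if every character falls inside one of CHINESE_RANGES."""
    return all(
        any(first <= ord(ch) <= last for first, last in CHINESE_RANGES)
        for ch in word
    )


print(is_ch_table("汉字"))     # True
print(is_ch_table("汉字abc"))  # False
```

Using integers avoids the escape-length pitfall entirely, because `0x20000` is just a number and needs no `\u` or `\U` prefix.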
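3. Possible extensions when time allows
(1) For example: when every character is Chinese, return True together with the original string; otherwise return False together with the non-Chinese characters.
(2) The if statement carries a long pile of conditions. Find a simpler or more elegant way to write it.
(3) Is there a simpler implementation? Are these Unicode ranges contiguous? Worth investigating when there is time.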
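Items (1) and (3) can be sketched together. This is one possible approach under stated assumptions, not a definitive implementation: `RANGES`, `in_ranges`, `check_chinese`, and `merge_adjacent` are hypothetical names, and the ranges are the 20 rows of the table in section 1 taken at face value.

```python
# Hypothetical sketch; (first, last) pairs copied from the table in section 1.
RANGES = [
    (0x4E00, 0x9FA5), (0x9FA6, 0x9FEF), (0x3400, 0x4DB5),
    (0x20000, 0x2A6D6), (0x2A700, 0x2B734), (0x2B740, 0x2B81D),
    (0x2B820, 0x2CEA1), (0x2CEB0, 0x2EBE0), (0x2F00, 0x2FD5),
    (0x2E80, 0x2EF3), (0xF900, 0xFAD9), (0x2F800, 0x2FA1D),
    (0xE815, 0xE86F), (0xE400, 0xE5E8), (0xE600, 0xE6CF),
    (0x31C0, 0x31E3), (0x2FF0, 0x2FFB), (0x3105, 0x312F),
    (0x31A0, 0x31BA), (0x3007, 0x3007),
]


def in_ranges(ch):
    """True if the character's code point lies in any listed range."""
    code = ord(ch)
    return any(first <= code <= last for first, last in RANGES)


def check_chinese(word):
    """Item (1): return (True, word) or (False, the non-Chinese characters)."""
    rejected = "".join(ch for ch in word if not in_ranges(ch))
    if rejected:
        return False, rejected
    return True, word


def merge_adjacent(ranges):
    """Item (3): sort the ranges and merge any that touch or overlap."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


print(check_chinese("汉字"))      # (True, '汉字')
print(check_chinese("汉字 abc"))  # (False, ' abc')
print(len(RANGES), len(merge_adjacent(RANGES)))  # 20 19
```

With the ranges as listed, only rows 1 and 2 (4E00-9FA5 and 9FA6-9FEF) are directly adjacent. Every other pair leaves a gap, so the table does not collapse into one or two ranges.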