Python, UTF-8, and Chinese characters: what are the upper and lower bounds of Chinese characters in UTF-8?

From the Unicode Standard (v6.0, section 12.1):

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

Besides these blocks, there are a few small additions:

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

To build a set of the ordinal values of these characters, you can do something like this:

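from itertools import chain

# One range() per block in Table 12-2; the end value is exclusive,
# so each end is the block's last code point plus one.
# chain() works on Python 2 and 3; range() + range() only works on Python 2.
chinese = set(chain(
    range(0x4E00, 0xA000),    # CJK Unified Ideographs
    range(0x3400, 0x4DC0),    # Extension A
    range(0x20000, 0x2A6E0),  # Extension B
    range(0x2A700, 0x2B740),  # Extension C
    range(0x2B740, 0x2B820),  # Extension D
    range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
    range(0x9FA6, 0x9FCC),    # Table 12-3 additions (already inside the first range)
))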

Note, however, that this set contains more than 75,000 code points, so it may not be the most compact or efficient data structure.
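If the size of the set is a concern, one alternative is to keep only the block boundaries and compare a character's code point against them. The following is a minimal sketch rather than part of the original answer; the names `HAN_RANGES` and `is_han` are made up for illustration, and the bounds are copied from Table 12-2 (Unicode 6.0), so blocks added in later Unicode versions are not covered.

```python
# Inclusive (first, last) code point pairs taken from Table 12-2.
# HAN_RANGES and is_han are illustrative names, not a library API.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),  # CJK Unified Ideographs Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch falls in one of HAN_RANGES."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)
```

With only seven ranges a linear scan is cheap; a sorted list plus `bisect` would be the next step if the table grew much larger.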

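Also, if you insist on calling ord() on a character literal, code points above FFFF need the 32-bit Unicode literal form, that is, \U followed by exactly eight hex digits: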

>>> ord(u'\U0002F800')

194560
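As a related check, the number of bytes a character occupies in UTF-8 follows from its code point: code points from U+0800 to U+FFFF encode to three bytes, and code points from U+10000 upward encode to four. The sketch below assumes Python 3 (where `chr()` accepts any code point), and `utf8_length` is a hypothetical helper name, not something from the original answer.

```python
def utf8_length(code_point):
    """Return how many bytes the given code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


# Sample the first code point of a few blocks from Table 12-2.
for cp in (0x3400, 0x4E00, 0xF900, 0x20000, 0x2F800):
    print("U+%04X: %d bytes" % (cp, utf8_length(cp)))
```

So the blocks in the Basic Multilingual Plane (4E00–9FFF, 3400–4DBF, F900–FAFF) take three bytes per character, while Extensions B through D and the Compatibility Supplement take four.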
