Unicode character classes: General_Category

https://www.unicode.org/versions/Unicode10.0.0/ch04.pdf


The Unicode Character Database defines a General_Category property for all Unicode code points. The General_Category value for a character serves as a basic classification of that character, based on its primary usage. The property extends the widely used subdivision of ASCII characters into letters, digits, punctuation, and symbols, a useful classification that needs to be elaborated and further subdivided to remain appropriate for the larger and more comprehensive scope of the Unicode Standard.

Each Unicode code point is assigned a normative General_Category value. Each value of the General_Category is given a two-letter property value alias, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass “other” merely collects the remaining characters of the major class. For example, the subclass “No” (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
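The two-letter alias scheme can be explored with a short sketch. It assumes Python's standard `unicodedata` module, whose results depend on the UCD version bundled with the interpreter; the `MAJOR_CLASS` table and the `describe` helper are hypothetical names introduced here for illustration only.

```python
import unicodedata

# Hypothetical lookup table: first letter of the alias -> major class name.
MAJOR_CLASS = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def describe(ch: str) -> tuple[str, str]:
    """Return (two-letter alias, major class name) for a single character."""
    gc = unicodedata.category(ch)
    return gc, MAJOR_CLASS[gc[0]]

# U+00BD VULGAR FRACTION ONE HALF falls into the catch-all subclass "No".
for ch in ["A", "a", "5", "\u00bd", "_", " "]:
    alias, major = describe(ch)
    print(f"U+{ord(ch):04X} {alias} {major}")
```

The first letter alone is enough to group characters by major class, which is why many tools accept a one-letter shorthand such as P for all punctuation subclasses.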

Table 4-4 enumerates the General_Category values, giving a short description of each value. See Table 2-3 for the relationship between General_Category values and basic types of code points.

There are several other conventions for how General_Category values are assigned to Unicode characters. Many characters have multiple uses, and not all such uses can be captured by a single, simple partition property such as General_Category. Thus, many letters often serve dual functions as numerals in traditional numeral systems. Examples can be found in the Roman numeral system, in Greek usage of letters as numbers, in Hebrew, and similarly for many scripts. In such cases the General_Category is assigned based on the primary letter usage of the character, even though it may also have numeric values, occur in numeric expressions, or be used symbolically in mathematical expressions, and so on.

The General_Category gc=Nl is reserved primarily for letterlike number forms which are not technically digits. For example, the compatibility Roman numeral characters, U+2160..U+217F, all have gc=Nl. Because of the compatibility status of these characters, the recommended way to represent Roman numerals is with regular Latin letters (gc=Ll or gc=Lu). These letters derive their numeric status from conventional usage to express Roman numerals, rather than from their General_Category value.
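The contrast between the compatibility Roman numerals and ordinary Latin letters can be checked with a minimal sketch, again assuming Python's `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

roman_one = "\u2160"  # ROMAN NUMERAL ONE, a compatibility character
latin_i = "I"         # LATIN CAPITAL LETTER I, the recommended representation

roman_gc = unicodedata.category(roman_one)
latin_gc = unicodedata.category(latin_i)

# The second argument is the default returned when the character
# carries no numeric value in the Unicode Character Database.
roman_value = unicodedata.numeric(roman_one, None)
latin_value = unicodedata.numeric(latin_i, None)

print(roman_gc, roman_value)
print(latin_gc, latin_value)
```

The Latin letter has no numeric value in the database, so any numeric reading of "I" has to come from the application's own Roman numeral parsing rather than from character properties.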

Currency symbols (gc=Sc), by contrast, are given their General_Category value based entirely on their function as symbols for currency, even though they are often derived from letters and may appear similar to other diacritic-marked letters that get assigned one of the letter-related General_Category values.

Pairs of opening and closing punctuation are given their General_Category values (gc=Ps for opening and gc=Pe for closing) based on the most typical usage and orientation of such pairs. Occasional usage of such punctuation marks unpaired or in opposite orientation certainly occurs, however, and is in no way prevented by their General_Category values.


Lu = Letter, uppercase
Ll = Letter, lowercase
Lt = Letter, titlecase
Lm = Letter, modifier
Lo = Letter, other
Mn = Mark, nonspacing
Mc = Mark, spacing combining
Me = Mark, enclosing
Nd = Number, decimal digit
Nl = Number, letter
No = Number, other
Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open
Pe = Punctuation, close
Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po = Punctuation, other
Sm = Symbol, math
Sc = Symbol, currency
Sk = Symbol, modifier
So = Symbol, other
Zs = Separator, space
Zl = Separator, line
Zp = Separator, paragraph
Cc = Other, control
Cf = Other, format
Cs = Other, surrogate
Co = Other, private use
Cn = Other, not assigned (including noncharacters)

Similarly, characters whose General_Category identifies them primarily as a symbol or as a mathematical symbol may function in other contexts as punctuation or even paired punctuation. The most obvious such case is for U+003C “<” less-than sign and U+003E “>” greater-than sign. These are given the General_Category gc=Sm because their primary identity is as mathematical relational signs. However, as is obvious from HTML and XML, they also serve ubiquitously as paired bracket punctuation characters in many formal syntaxes.
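A sketch of what this means in practice: a bracket detector that relies only on gc=Ps and gc=Pe will not see the angle brackets of HTML or XML. The `bracket_role` helper below is a hypothetical name, and the example assumes Python's `unicodedata` module.

```python
import unicodedata

def bracket_role(ch: str) -> str:
    """Classify a character as paired punctuation using only gc=Ps / gc=Pe."""
    gc = unicodedata.category(ch)
    if gc == "Ps":
        return "open"
    if gc == "Pe":
        return "close"
    return "neither"

# "<" and ">" are gc=Sm, so this category-only test reports "neither" for them.
for ch in "()[]{}<>":
    print(ch, unicodedata.category(ch), bracket_role(ch))
```

A parser for a markup syntax would therefore need its own explicit list of delimiters in addition to, or instead of, the General_Category test.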

A common use of the General_Category of a Unicode character is in the derivation of properties for the determination of text boundaries, as in Unicode Standard Annex #29, “Unicode Text Segmentation.” Other common uses include determining language identifiers for programming, scripting, and markup, as in Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax,” and in regular expression languages such as Perl. For more information, see Unicode Technical Standard #18, “Unicode Regular Expressions.”

This property is also used to support common APIs such as isDigit(). Common functions such as isLetter() and isUppercase() do not extend well to the larger and more complex repertoire of Unicode. While it is possible to naively extend these functions to Unicode using the General_Category and other properties, they will not work for the entire range of Unicode characters and the kinds of tasks for which people intend them. For more appropriate approaches, see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax”; Unicode Standard Annex #29, “Unicode Text Segmentation”; Section 5.18, Case Mappings; and Section 4.10, Letters, Alphabetic, and Ideographic.
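As one illustration of how such APIs diverge, the sketch below compares a naive gc=Nd test with Python's built-in `str.isdigit()` and `str.isnumeric()`. Python is used here only as a convenient example; the standard does not prescribe these methods, and `is_decimal_digit_gc` is a hypothetical helper.

```python
import unicodedata

def is_decimal_digit_gc(ch: str) -> bool:
    """Naive isDigit(): true only for characters with gc=Nd."""
    return unicodedata.category(ch) == "Nd"

samples = [
    "7",       # DIGIT SEVEN
    "\u0663",  # ARABIC-INDIC DIGIT THREE
    "\u00b2",  # SUPERSCRIPT TWO
    "\u2160",  # ROMAN NUMERAL ONE
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        is_decimal_digit_gc(ch),
        ch.isdigit(),
        ch.isnumeric(),
    )
```

The three tests disagree on the superscript and the Roman numeral, so the right choice depends on whether the task is parsing decimal numbers, validating input, or something else.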

Although the General_Category property is normative, and its values are used in the derivation of many other properties referred to by Unicode algorithms, it does not follow that the General_Category always provides the most appropriate classification of a character for any given purpose. Implementations are not required to treat characters solely according to their General_Category values when classifying them in various contexts. The following examples illustrate some typical cases in which an implementation might reasonably diverge from General_Category values for a character when grouping characters as “punctuation,” “symbols,” and so forth.

• A character picker application might classify U+0023 # number sign among symbols, or perhaps under both symbols and punctuation.

• An “Ignore Punctuation” option for a search might choose not to ignore U+0040 @ commercial at.

• A layout engine might treat U+0021 ! exclamation mark as a mathematical operator in the context of a mathematical equation, and lay it out differently than if the same character were used as terminal punctuation in text.

• A regular expression syntax could provide an operator to match all punctuation, but include characters other than those limited to gc=P (for example, U+00A7 § section sign).

The general rule is that if an implementation purports to be using the Unicode General_Category property, then it must use the exact values specified in the Unicode Character Database for that claim to be conformant. Thus, if a regular expression syntax explicitly supports the Unicode General_Category property and matches gc=P, then that match must be based on the precise UCD values.
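The distinction between a conformant gc=P test and an application-specific notion of punctuation can be sketched as two separate functions. This is a minimal illustration assuming Python's `unicodedata` module, which reflects whichever UCD version ships with the interpreter; `matches_gc_p`, `KEEP_IN_SEARCH`, and `strip_ignorable` are hypothetical names, and the choice to keep "@" and "#" is an example policy, not a recommendation from the standard.

```python
import unicodedata

def matches_gc_p(ch: str) -> bool:
    """gc=P test based only on the UCD value (any of Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")

# Hypothetical application policy: punctuation that a search should NOT ignore.
KEEP_IN_SEARCH = {"@", "#"}

def strip_ignorable(text: str) -> str:
    """An "Ignore Punctuation" filter that deliberately diverges from gc=P."""
    return "".join(
        ch for ch in text
        if not matches_gc_p(ch) or ch in KEEP_IN_SEARCH
    )

print(strip_ignorable("user@example.com, hi!"))
```

Keeping the two functions apart means the first can still be documented as matching gc=P, while the second is presented as the application's own grouping.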
