Unicode character classes General Category

Latest recommended article published on 2022-10-26 22:37:53

taogez

https://www.unicode.org/versions/Unicode10.0.0/ch04.pdf

The Unicode Character Database defines a General_Category property for all Unicode code points. The General_Category value for a character serves as a basic classification of that character, based on its primary usage. The property extends the widely used subdivision of ASCII characters into letters, digits, punctuation, and symbols, a useful classification that needs to be elaborated and further subdivided to remain appropriate for the larger and more comprehensive scope of the Unicode Standard.

Each Unicode code point is assigned a normative General_Category value. Each value of the General_Category is given a two-letter property value alias, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass “other” merely collects the remaining characters of the major class. For example, the subclass “No” (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
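The two-letter alias structure can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch; the `MAJOR_CLASSES` table and the `describe` helper are hypothetical names introduced only for this illustration, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Hypothetical lookup table for this sketch: first letter of the alias
# mapped to the major class name used in the text.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def describe(ch):
    """Return (alias, major class name, subclass letter) for one character."""
    alias = unicodedata.category(ch)  # two-letter alias such as "Lu"
    return alias, MAJOR_CLASSES[alias[0]], alias[1]

for ch in ["A", "5", "\u00bd", " "]:  # U+00BD is VULGAR FRACTION ONE HALF
    print(f"U+{ord(ch):04X}", describe(ch))
```

U+00BD falls into “No” because it belongs to the Number class but is neither a decimal digit nor a letterlike number.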

Table 4-4 enumerates the General_Category values, giving a short description of each value. See Table 2-3 for the relationship between General_Category values and basic types of code points.

There are several other conventions for how General_Category values are assigned to Unicode characters. Many characters have multiple uses, and not all such uses can be captured by a single, simple partition property such as General_Category. Thus, many letters often serve dual functions as numerals in traditional numeral systems. Examples can be found in the Roman numeral system, in Greek usage of letters as numbers, in Hebrew, and similarly for many scripts. In such cases the General_Category is assigned based on the primary letter usage of the character, even though it may also have numeric values, occur in numeric expressions, or be used symbolically in mathematical expressions, and so on.

The General_Category gc=Nl is reserved primarily for letterlike number forms which are not technically digits. For example, the compatibility Roman numeral characters, U+2160..U+217F, all have gc=Nl. Because of the compatibility status of these characters, the recommended way to represent Roman numerals is with regular Latin letters (gc=Ll or gc=Lu). These letters derive their numeric status from conventional usage to express Roman numerals, rather than from their General_Category value.
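The difference between a compatibility Roman numeral and an ordinary Latin letter can be observed with Python's `unicodedata` module. This is only a sketch; the variable names and the `numeric_or_none` helper are made up for the illustration.

```python
import unicodedata

roman_compat = "\u2167"  # ROMAN NUMERAL EIGHT, inside U+2160..U+217F
latin_letter = "V"       # ordinary Latin capital letter

def numeric_or_none(ch):
    """Return the UCD numeric value of ch, or None when it has none."""
    return unicodedata.numeric(ch, None)

print(unicodedata.category(roman_compat), numeric_or_none(roman_compat))
print(unicodedata.category(latin_letter), numeric_or_none(latin_letter))
```

The compatibility character carries gc=Nl and a numeric value in the database, while the plain letter is gc=Lu with no numeric value; any Roman-numeral meaning of "V" has to come from the application.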

Currency symbols (gc=Sc), by contrast, are given their General_Category value based entirely on their function as symbols for currency, even though they are often derived from letters and may appear similar to other diacritic-marked letters that get assigned one of the letter-related General_Category values.

Pairs of opening and closing punctuation are given their General_Category values (gc=Ps for opening and gc=Pe for closing) based on the most typical usage and orientation of such pairs. Occasional usage of such punctuation marks unpaired or in opposite orientation certainly occurs, however, and is in no way prevented by their General_Category values.
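As a rough illustration of the Ps/Pe convention, the sketch below classifies a character by its General_Category alone. The `bracket_role` helper is a hypothetical name; it reports only the typical orientation recorded in the database, not how a character is actually used in a given text.

```python
import unicodedata

def bracket_role(ch):
    """Return 'open' for gc=Ps, 'close' for gc=Pe, and None otherwise."""
    gc = unicodedata.category(ch)
    if gc == "Ps":
        return "open"
    if gc == "Pe":
        return "close"
    return None

for ch in "()[]{}\u00ab":  # U+00AB is LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    print(ch, unicodedata.category(ch), bracket_role(ch))
```

Quotation marks such as U+00AB have gc=Pi rather than gc=Ps, so this helper returns None for them.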

Lu = Letter, uppercase

Ll = Letter, lowercase

Lt = Letter, titlecase

Lm = Letter, modifier

Lo = Letter, other

Mn = Mark, nonspacing

Mc = Mark, spacing combining

Me = Mark, enclosing

Nd = Number, decimal digit

Nl = Number, letter

No = Number, other

Pc = Punctuation, connector

Pd = Punctuation, dash

Ps = Punctuation, open

Pe = Punctuation, close

Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)

Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)

Po = Punctuation, other

Sm = Symbol, math

Sc = Symbol, currency

Sk = Symbol, modifier

So = Symbol, other

Zs = Separator, space

Zl = Separator, line

Zp = Separator, paragraph

Cc = Other, control

Cf = Other, format

Cs = Other, surrogate

Co = Other, private use

Cn = Other, not assigned (including noncharacters)
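To see these aliases in practice, one option is to tally the General_Category of every character in a string. This is a minimal sketch using Python's `unicodedata` module; `category_histogram` is a hypothetical helper name.

```python
from collections import Counter
import unicodedata

def category_histogram(text):
    """Count how many characters of text fall under each General_Category alias."""
    return Counter(unicodedata.category(ch) for ch in text)

print(category_histogram("Hi, 3!"))
# Private-use code points and noncharacters are reported under the C class.
print(category_histogram("\ue000\uffff"))
```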

Similarly, characters whose General_Category identifies them primarily as a symbol or as a mathematical symbol may function in other contexts as punctuation or even paired punctuation. The most obvious such case is for U+003C “<” less-than sign and U+003E “>” greater-than sign. These are given the General_Category gc=Sm because their primary identity is as mathematical relational signs. However, as is obvious from HTML and XML, they also serve ubiquitously as paired bracket punctuation characters in many formal syntaxes.
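A parser for such a syntax therefore cannot rely on gc=Ps alone to find opening delimiters. The sketch below shows one possible approach, assuming a hypothetical application-level table `EXTRA_PAIRS`; neither that table nor `is_opening_delimiter` comes from the Unicode Standard.

```python
import unicodedata

# Hypothetical application-level override: pairs that a markup parser treats
# as brackets even though their General_Category is Sm, not Ps/Pe.
EXTRA_PAIRS = {"<": ">"}

def is_opening_delimiter(ch):
    """True for gc=Ps characters and for the application's extra openers."""
    return unicodedata.category(ch) == "Ps" or ch in EXTRA_PAIRS

for ch in "(<>+":
    print(ch, unicodedata.category(ch), is_opening_delimiter(ch))
```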

A common use of the General_Category of a Unicode character is in the derivation of properties for the determination of text boundaries, as in Unicode Standard Annex #29, “Unicode Text Segmentation.” Other common uses include determining language identifiers for programming, scripting, and markup, as in Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax,” and in regular expression languages such as Perl. For more information, see Unicode Technical Standard #18, “Unicode Regular Expressions.”

This property is also used to support common APIs such as isDigit(). Common functions such as isLetter() and isUppercase() do not extend well to the larger and more complex repertoire of Unicode. While it is possible to naively extend these functions to Unicode using the General_Category and other properties, they will not work for the entire range of Unicode characters and the kinds of tasks for which people intend them. For more appropriate approaches, see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax”; Unicode Standard Annex #29, “Unicode Text Segmentation”; Section 5.18, Case Mappings; and Section 4.10, Letters, Alphabetic, and Ideographic.
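Python's string methods give a concrete picture of how "is it a digit?" splits into several different questions. In the sketch below, `digit_report` is a hypothetical helper; `str.isdecimal()` is documented as matching gc=Nd, while `str.isdigit()` and `str.isnumeric()` are based on the Numeric_Type property and accept more characters.

```python
import unicodedata

def digit_report(ch):
    """Return (General_Category, isdecimal, isdigit, isnumeric) for one character."""
    return (unicodedata.category(ch), ch.isdecimal(), ch.isdigit(), ch.isnumeric())

# DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO, ROMAN NUMERAL EIGHT
for ch in ["7", "\u0663", "\u00b2", "\u2167"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), digit_report(ch))
```

SUPERSCRIPT TWO passes `isdigit()` although its General_Category is No, and ROMAN NUMERAL EIGHT passes only `isnumeric()`. Which predicate is appropriate depends on the task, for example whether the result will be handed to `int()`.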

Although the General_Category property is normative, and its values are used in the derivation of many other properties referred to by Unicode algorithms, it does not follow that the General_Category always provides the most appropriate classification of a character for any given purpose. Implementations are not required to treat characters solely according to their General_Category values when classifying them in various contexts. The following examples illustrate some typical cases in which an implementation might reasonably diverge from General_Category values for a character when grouping characters as “punctuation,” “symbols,” and so forth.

• A character picker application might classify U+0023 # number sign among symbols, or perhaps under both symbols and punctuation.

• An “Ignore Punctuation” option for a search might choose not to ignore U+0040 @ commercial at.

• A layout engine might treat U+0021 ! exclamation mark as a mathematical operator in the context of a mathematical equation, and lay it out differently than if the same character were used as terminal punctuation in text.

• A regular expression syntax could provide an operator to match all punctuation, but include characters other than those limited to gc=P (for example, U+00A7 § section sign).

The general rule is that if an implementation purports to be using the Unicode General_Category property, then it must use the exact values specified in the Unicode Character Database for that claim to be conformant. Thus, if a regular expression syntax explicitly supports the Unicode General_Category property and matches gc=P, then that match must be based on the precise UCD values.
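One way to keep the two notions apart in code is to give them separate, clearly named predicates. The following is a sketch under stated assumptions: `is_gc_punctuation`, `is_punctuation_like`, and the `EXTRA_PUNCTUATION_LIKE` set are hypothetical names, and the extras are an arbitrary application choice rather than any Unicode property.

```python
import unicodedata

def is_gc_punctuation(ch):
    """True only when the UCD General_Category is one of the P* values (gc=P)."""
    return unicodedata.category(ch).startswith("P")

# Hypothetical, application-chosen extras; not part of any Unicode property.
EXTRA_PUNCTUATION_LIKE = frozenset("<>|~")

def is_punctuation_like(ch):
    """Looser, application-specific grouping that must not be advertised as gc=P."""
    return is_gc_punctuation(ch) or ch in EXTRA_PUNCTUATION_LIKE

for ch in "!#@<a":
    print(ch, unicodedata.category(ch), is_gc_punctuation(ch), is_punctuation_like(ch))
```

Only the first predicate could be documented as matching gc=P. Its results follow the Unicode version bundled with the interpreter, which `unicodedata.unidata_version` reports and which may differ from the 10.0 text quoted here.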
