Regular Expression Study Guide (22): XML Character Classes

Latest recommended article published on 2021-10-10 22:47:31

wushuai1346


XML Schema Character Classes

XML Schema regular expressions support the usual six shorthand character classes, plus four more. These four aren't supported by any other regular expression flavor. \i matches any character that may be the first character of an XML name, i.e. [_:A-Za-z] when restricted to ASCII. \c matches any character that may occur after the first character in an XML name, i.e. [-._:A-Za-z0-9] when restricted to ASCII. \I and \C are the respective negated shorthands. Note that the \c shorthand syntax conflicts with the control character syntax used in many other regex flavors.

You can use these four shorthands both inside and outside character classes using the bracket notation. They're very useful for validating XML references and values in your XML schemas. The regular expression \i\c* matches an XML name like xml:schema. In other regular expression flavors, you'd have to spell this out as [_:A-Za-z][-._:A-Za-z0-9]*. The latter regex also works with XML's regular expression flavor. It just takes longer to type.
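As a rough illustration of the spelled-out form, here is a minimal Python sketch. Python's `re` module does not support \i or \c, so the sketch uses the ASCII-only character classes quoted above; the names `NAME_START`, `NAME_CHAR` and `is_xml_name` are made up for this example.

```python
import re

# ASCII-only stand-ins for the XML Schema shorthands, as spelled out in the text.
NAME_START = r"[_:A-Za-z]"       # stands in for \i
NAME_CHAR = r"[-._:A-Za-z0-9]"   # stands in for \c

# Equivalent of \i\c* in a flavor without the shorthands.
XML_NAME = re.compile(NAME_START + NAME_CHAR + "*")

def is_xml_name(text: str) -> bool:
    """Return True if the whole string looks like an (ASCII) XML name."""
    return XML_NAME.fullmatch(text) is not None

print(is_xml_name("xml:schema"))  # a name with a namespace prefix
print(is_xml_name("1abc"))        # starts with a digit, so not a name
```

Because the classes are ASCII-only, this sketch rejects names with non-ASCII letters that a real XML Schema engine would likely accept.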

The regex <\i\c*\s*> matches an opening XML tag without any attributes. </\i\c*\s*> matches any closing tag. <\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*\s*> matches an opening tag with any number of attributes. Putting it all together, <(\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*|/\i\c*)\s*> matches either an opening tag with attributes or a closing tag.
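The combined tag regex can be tried out in a flavor without \i and \c by substituting the ASCII-only classes. The following Python sketch does that; the variable names are invented for the example, and non-capturing groups replace the capturing ones purely for tidiness.

```python
import re

# ASCII-only stand-ins for \i and \c (an approximation of the XML shorthands).
I = r"[_:A-Za-z]"
C = r"[-._:A-Za-z0-9]"
NAME = I + C + "*"

# One attribute: whitespace, a name, "=", then a double- or single-quoted value.
ATTR = r"\s+" + NAME + r"""\s*=\s*(?:"[^"]*"|'[^']*')"""

# Opening tag with any number of attributes, or a closing tag.
TAG = re.compile("<(?:" + NAME + "(?:" + ATTR + ")*|/" + NAME + r")\s*>")

def is_tag(text: str) -> bool:
    """Return True if the whole string is one opening or closing tag."""
    return TAG.fullmatch(text) is not None

print(is_tag('<a href="x" id=\'y\'>'))
print(is_tag("</ns:tag >"))
print(is_tag("<a href=x>"))  # unquoted attribute value is rejected
```

Like the original regex, this does not handle self-closing tags such as `<br/>`.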

Character Class Subtraction

While the regex flavor it defines is quite limited, XML Schema adds a new regular expression feature not previously seen in any (popular) regular expression flavor: character class subtraction. Currently, this feature is only supported by the JGsoft and .NET regex engines (in addition to those implementing the XML Schema standard).

Character class subtraction makes it easy to match any single character present in one list (the character class), but not present in another list (the subtracted class). The syntax for this is [class-[subtract]]. If the character after a hyphen is an opening bracket, XML regular expressions interpret the hyphen as the subtraction operator rather than the range operator. E.g. [a-z-[aeiuo]] matches a single letter that is not a vowel (i.e. a single consonant). Without the character class subtraction feature, the only way to do this would be to list all consonants: [b-df-hj-np-tv-z].
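Python's `re` module has no subtraction operator, so the sketch below emulates [a-z-[aeiuo]] with a negative lookahead, which is a different technique and not XML Schema syntax, and checks that it agrees with the spelled-out consonant class. The helper name `matching_letters` is made up for the example.

```python
import re
import string

# Emulated subtraction: the lookahead rejects the "subtracted" characters
# before [a-z] consumes one character.
EMULATED = re.compile(r"(?![aeiuo])[a-z]")

# The spelled-out consonant class quoted in the text.
SPELLED_OUT = re.compile(r"[b-df-hj-np-tv-z]")

def matching_letters(pattern: re.Pattern) -> str:
    """Return every ASCII lowercase letter the pattern matches on its own."""
    return "".join(ch for ch in string.ascii_lowercase if pattern.fullmatch(ch))

emulated_letters = matching_letters(EMULATED)
spelled_out_letters = matching_letters(SPELLED_OUT)
print(emulated_letters == spelled_out_letters)
```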

This feature is more than just a notational convenience, though. You can use the full character class syntax within the subtracted character class. E.g. to match all Unicode letters except ASCII letters (i.e. all non-English letters), you could easily use [\p{L}-[\p{IsBasicLatin}]].

Nested Character Class Subtraction

Since you can use the full character class syntax within the subtracted character class, you can subtract a class from the class being subtracted. E.g. [0-9-[0-6-[0-3]]] first subtracts 0-3 from 0-6, yielding [0-9-[4-6]], or [0-37-9], which matches any character in the string 0123789.
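The order of evaluation can be checked with plain set arithmetic. This Python sketch models each character class as a set of characters; it does not use a regex engine at all, it only mirrors the inner-first subtraction described above.

```python
# Model each character class as a set of characters.
outer = set("0123456789")   # [0-9]
middle = set("0123456")     # [0-6]
inner = set("0123")         # [0-3]

# [0-6-[0-3]] is evaluated first and leaves 4, 5 and 6.
subtracted = middle - inner

# [0-9-[4-6]] then removes those three digits from the outer class.
result = "".join(sorted(outer - subtracted))
print(result)  # 0123789
```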

The class subtraction must always be the last element in the character class. [0-9-[4-6]a-f] is not a valid regular expression. It should be rewritten as [0-9a-f-[4-6]]. The subtraction works on the whole class. E.g. [\p{Ll}\p{Lu}-[\p{IsBasicLatin}]] matches all uppercase and lowercase Unicode letters, except any ASCII letters. The \p{IsBasicLatin} is subtracted from the combination of \p{Ll}\p{Lu} rather than from \p{Lu} alone. This regex will not match abc.

While you can use nested character class subtraction, you cannot subtract two classes sequentially. To subtract ASCII letters and Greek letters from a class with all Unicode letters, combine the ASCII and Greek letters into one class, and subtract that, as in [\p{L}-[\p{IsBasicLatin}\p{IsGreek}]].
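The combined subtraction can be modelled without a regex engine. The Python sketch below assumes that \p{L} corresponds to Unicode general categories beginning with "L", that the Basic Latin block is U+0000 to U+007F, and that the Greek block is U+0370 to U+03FF; the function name `in_class` is invented for the example.

```python
import unicodedata

# Assumed block ranges for \p{IsBasicLatin} and \p{IsGreek}.
BASIC_LATIN = range(0x0000, 0x0080)
GREEK = range(0x0370, 0x0400)

def in_class(ch: str) -> bool:
    """Model [\\p{L}-[\\p{IsBasicLatin}\\p{IsGreek}]] for a single character."""
    is_letter = unicodedata.category(ch).startswith("L")
    # The two blocks are combined into one subtracted class.
    subtracted = ord(ch) in BASIC_LATIN or ord(ch) in GREEK
    return is_letter and not subtracted

for ch in ["a", "é", "α", "я", "5"]:
    print(ch, in_class(ch))
```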

Notational Compatibility with Other Regex Flavors

Note that a regex like [a-z-[aeiuo]] will not cause any errors in regex flavors that do not support character class subtraction. But it won't match what you intended either. E.g. in Perl, this regex consists of a character class followed by a literal ]. The character class matches a character that is either in the range a-z, or a hyphen, or an opening bracket, or a vowel. Since the a-z range and the vowels are redundant, you could write this character class as [a-z-[] or [-[a-z]. A hyphen after a range is treated as a literal character, just like a hyphen immediately after the opening bracket. This is true in all regex flavors, including XML. E.g. [a-z-_] matches a lowercase letter, a hyphen or an underscore in both Perl and XML Schema.
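Python's `re` module is another flavor without class subtraction, and as far as I can tell it reads this regex the same way the text describes for Perl. The sketch below uses Python as a stand-in for Perl, which is an assumption of the example; the helper name `full` is made up.

```python
import re

# Without subtraction support, this parses as the class [a-z-[aeiuo]
# followed by a literal "]".
pattern = re.compile(r"[a-z-[aeiuo]]")

def full(text: str) -> bool:
    """Return True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None

print(full("a]"))  # a vowel is accepted, followed by the literal "]"
print(full("[]"))  # an opening bracket is part of the class
print(full("b"))   # a lone consonant is not enough: the "]" is missing

# A hyphen after a range is a literal hyphen.
hyphen_class = re.compile(r"[a-z-_]")
print(hyphen_class.fullmatch("-") is not None)
```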

While the last paragraph strictly speaking means that the XML Schema character class syntax is incompatible with Perl and the majority of other regex flavors, in practice there's no difference. Using non-alphanumeric characters in character class ranges is very bad practice, as it relies on the order of characters in the ASCII character table, which makes the regular expression hard to understand for the programmer who inherits your work. E.g. while [A-[] would match any upper case letter or an opening square bracket in Perl, this regex is much clearer when written as [A-Z[]. The former regex would cause an error in XML Schema, because it interprets -[] as an empty subtracted class, leaving an unbalanced [.
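To see that the two spellings are equivalent in a flavor without class subtraction, this Python sketch (again using Python's `re` as a stand-in for Perl, which is an assumption) lists the ASCII characters each class matches. The helper name `ascii_matches` is invented for the example.

```python
import re

# The range A-[ relies on "[" directly following "Z" in the ASCII table.
RANGE_FORM = re.compile(r"[A-[]")
# The clearer spelling: an explicit A-Z range plus a literal "[".
EXPLICIT_FORM = re.compile(r"[A-Z[]")

def ascii_matches(pattern: re.Pattern) -> list:
    """Return the ASCII characters that the pattern matches on their own."""
    return [chr(cp) for cp in range(128) if pattern.fullmatch(chr(cp))]

print(ascii_matches(RANGE_FORM) == ascii_matches(EXPLICIT_FORM))
```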
