Notes: Regular Expressions

weixin_34218890

Original link: https://my.oschina.net/nirno/blog/76312

I took a fresh look at regular expressions; brief notes follow.
Reference: "Classic Shell Scripting", pp. 33-47

POSIX BRE and ERE metacharacters

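Character    BRE / ERE    Meaning in a pattern
\            Both         Usually turns off the special meaning of the following character;
                          occasionally turns one on, as in \(...\) and \{...\}.
.            Both         Matches any single character.
*            Both         Matches zero or more of the preceding single character (BRE)
                          or of the preceding regular expression (ERE).
^            Both         Anchors the match at the beginning of the line or string.
$            Both         Anchors the match at the end of the line or string.
[...]        Both         Bracket expression: matches any one of the enclosed characters;
                          a leading ^ negates the set.
\{n,m\}      BRE          Interval expression: matches n to m occurrences of the
                          preceding single character.
\( \)        BRE          Saves the enclosed subpattern for later replay (up to nine).
\n           BRE          Replays the nth saved subpattern, n from 1 to 9.
+            ERE          Matches one or more of the preceding regular expression.
?            ERE          Matches zero or one of the preceding regular expression.
|            ERE          Alternation: matches the expression before or after it.
( )          ERE          Grouping: applies an operator or alternation to the whole group.
{n,m}        ERE          Interval expression, like the BRE \{n,m\} but without backslashes.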
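
To make the BRE/ERE split in the table concrete, here is a small sketch that runs the same input through grep (BRE) and grep -E (ERE). The helper name sample and its three input lines are made up for illustration; they are not from the book.

```shell
#!/bin/sh
# Hypothetical sample input: three short lines.
sample() { printf '%s\n' 'ab' 'aab' 'a+b'; }

# BRE: the interval needs backslashes. Prints: aab
sample | grep 'a\{2\}b'

# BRE: + is an ordinary character. Prints: a+b
sample | grep 'a+b'

# ERE: + means "one or more". Prints: ab and aab
sample | grep -E 'a+b'

# ERE: the interval is written without backslashes. Prints: aab
sample | grep -E 'a{2}b'
```

The same pattern text, a+b, selects different lines depending only on whether -E is given.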

POSIX bracket expressions

Character classes

    Class          Matching characters
    [:alnum:]      Alphanumeric characters
    [:alpha:]      Alphabetic characters
    [:blank:]      Space and tab characters
    [:cntrl:]      Control characters
    [:digit:]      Numeric characters
    [:graph:]      Nonspace characters
    [:lower:]      Lowercase characters
    [:print:]      Printable characters
    [:punct:]      Punctuation characters
    [:space:]      Whitespace characters
    [:upper:]      Uppercase characters
    [:xdigit:]     Hexadecimal digits

Collating symbols

A collating symbol is a multicharacter sequence that should be treated as a unit.
It consists of the characters bracketed by [. and .]. Collating symbols are specific to
the locale in which they are used.

Equivalence classes

An equivalence class lists a set of characters that should be considered equivalent,
such as e and è. It consists of a named element from the locale, bracketed by [= and =].

All three of these constructs must appear inside the square brackets of a bracket
expression. For example, [[:alpha:]!] matches any single alphabetic character or the
exclamation mark, and [[.ch.]] matches the collating element ch, but does not match just
the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, ë, ê, or é.
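
As a rough illustration of character classes inside bracket expressions, the sketch below uses sed and grep. The function names and input strings are invented for this example, and the locale is pinned to C so that the classes cover ASCII only.

```shell
#!/bin/sh
# Pin the locale: in the C locale, [:alpha:] is exactly the ASCII letters.
LC_ALL=C
export LC_ALL

# Delete every character that is neither a letter nor "!".
# Note the double brackets: the class name sits inside a bracket expression.
keep_letters_and_bang() { sed 's/[^[:alpha:]!]//g'; }

# Keep only lines made up entirely of hexadecimal digits.
only_hex_lines() { grep '^[[:xdigit:]]*$'; }

echo 'Hi, 42 times!' | keep_letters_and_bang     # prints: Hitimes!
printf '%s\n' 'c0ffee' 'tea' | only_hex_lines    # prints: c0ffee
```

Writing [:alpha:] without the outer brackets would be read as an ordinary set containing the characters :, a, l, p, and h.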

Basic Regular Expressions

Matching single characters

    • Ordinary characters
    • Escaped metacharacters: put a \ in front of a metacharacter to match it literally
    • The . (dot) character
    • Bracket expressions: [] (such as [012345], [0-5], [^0-5]), which may contain character
classes (such as [:digit:]), equivalence classes (such as [=e=]), or collating symbols (such as [.ch.]).

        Within bracket expressions, all other metacharacters lose their special meanings. Thus,
[*\.] matches a literal asterisk, a literal backslash, or a literal period. To get a ] into the set,
place it first in the list: []*\.] adds the ] to the list. To get a minus character into the set,
place it first in the list: [-*\.]. If you need both a right bracket and a minus, make the right
bracket the first character, and make the minus the last one in the list: []*\.-].
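
The placement rules for ] and - can be checked with a short sketch. The helper bracket_sample and its six input lines are hypothetical; the pattern puts ] first and - last, as described above.

```shell
#!/bin/sh
# Hypothetical input: one line per candidate character between "a" and "b".
# printf's %s does not interpret the backslash in 'a\b', so it stays literal.
bracket_sample() { printf '%s\n' 'a]b' 'a-b' 'a*b' 'a\b' 'a.b' 'axb'; }

# ] comes first and - comes last, so both are literal members of the set;
# *, \ and . are ordinary characters inside the brackets.
# Counts the matching lines: every line except axb, so this prints 5.
bracket_sample | grep -c 'a[]*\.-]b'
```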

Backreferences

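Pattern                                    Matches
\(ab\)\(cd\)[def]*\2\1                     abcdcdab, abcdeeecdab, abcdddeeffcdab, ...
\(why\).*\1                                A line with two occurrences of why
\([[:alpha:]_][[:alnum:]_]*\) = \1;        A simple C/C++ assignment of a variable to itself
\(["']\).*\1                               Single- or double-quoted words, such as 'foo' or "bar"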
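
A quick way to try BRE backreferences is grep, which reads its pattern as a BRE by default. This is only a sketch; the function names and test lines are invented for the example.

```shell
#!/bin/sh
# \1 replays whatever the first \( \) matched, \2 the second.
has_mirrored_pairs() { grep '\(ab\)\(cd\)[def]*\2\1'; }

# A variable assigned to itself: the same identifier on both sides.
self_assignment() { grep '\([[:alpha:]_][[:alnum:]_]*\) = \1;'; }

# Prints abcdcdab and abcdeeecdab; abcdab has no second "cd" to replay.
printf '%s\n' 'abcdcdab' 'abcdeeecdab' 'abcdab' | has_mirrored_pairs

# Prints only: total = total;
printf '%s\n' 'total = total;' 'total = count;' | self_assignment
```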

Matching multiple characters with one expression

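*            Zero or more of the preceding single character
\{N\}        Exactly N occurrences of the preceding single character
\{N,\}       At least N occurrences
\{N,M\}      Between N and M occurrences
\{,M\}       At most M occurrences (a GNU extension; POSIX does not define this form)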
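
The interval forms can be compared side by side with a sketch like the following. The helper runs_of_a is hypothetical; the patterns are anchored with ^ and $ so that each one must account for the whole line.

```shell
#!/bin/sh
# Hypothetical input: one to four a's followed by b.
runs_of_a() { printf '%s\n' 'ab' 'aab' 'aaab' 'aaaab'; }

runs_of_a | grep '^a\{2\}b$'      # exactly 2:      prints aab
runs_of_a | grep '^a\{3,\}b$'     # at least 3:     prints aaab and aaaab
runs_of_a | grep '^a\{2,3\}b$'    # between 2 and 3: prints aab and aaab
```

Without the anchors, a\{2\}b would also select aaab and aaaab, because each of those lines contains aab.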

Anchoring text matches

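^    Anchors the match at the beginning of the line or string
$    Anchors the match at the end of the line or string

In BREs, ^ is an anchor only at the start of the pattern and $ only at the end;
elsewhere they are ordinary characters.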

BRE operator precedence

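Operator                Meaning
[..] [==] [::]          Bracket symbols for character collation
\metacharacter          Escaped metacharacters
[]                      Bracket expressions
\(\) \digit             Subexpressions and backreferences
* \{\}                  Repetition of the preceding single-character regular expression
no symbol               Concatenation
^ $                     Anchors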

Extended Regular Expressions

Matching single characters

Same as in BREs, with one notable exception: in awk, \ is special inside bracket
expressions. Thus, to match a left bracket, dash, right bracket, or backslash, you could
use [\[\-\]\\].

Backreferences do not exist in EREs

Matching multiple regular expressions with one expression

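*            Zero or more of the preceding regular expression
+            One or more of the preceding regular expression
?            Zero or one of the preceding regular expression
{N}          Exactly N occurrences of the preceding regular expression
{N,}         At least N occurrences
{N,M}        Between N and M occurrences
{,M}         At most M occurrences (a GNU extension; POSIX does not define this form)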

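Alternation

|     Matches either the regular expression before it or the one after it;
      read|write matches read or write

Grouping

()    Groups subexpressions so that a following operator applies to the whole group;
      (why)+ matches one or more copies of why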
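
The ERE operators listed in this section can be exercised with grep -E. This is a minimal sketch; the input words are arbitrary examples, not taken from the book.

```shell
#!/bin/sh
# ? makes the preceding "u" optional. Prints: color and colour
printf '%s\n' 'color' 'colour' 'colouur' | grep -E '^colou?r$'

# Grouping lets the interval apply to "ab" as a unit. Prints: abab
printf '%s\n' 'ab' 'abab' 'ababab' | grep -E '^(ab){2}$'

# Alternation inside a group, with anchors outside it. Prints: read and write
printf '%s\n' 'read' 'write' 'erase' | grep -E '^(read|write)$'
```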

Anchoring text matches

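Same as in BREs, with one significant difference: in EREs, ^ and $ are always
metacharacters. Thus, regular expressions such as ab^cd and ef$gh are valid, but cannot
match anything, because the text before the ^ and the text after the $ keep them from
matching the beginning and the end of the line.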
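
The difference in how BREs and EREs treat a ^ in the middle of a pattern can be observed with the sketch below. The helper names bre_matches and ere_matches are invented for this example, and the results in the comments are what GNU grep gives; other grep implementations may differ.

```shell
#!/bin/sh
# Both helpers take a pattern ($1) and a line of text ($2)
# and print yes or no depending on whether grep finds a match.
bre_matches() { printf '%s\n' "$2" | grep -q -e "$1" && echo yes || echo no; }
ere_matches() { printf '%s\n' "$2" | grep -E -q -e "$1" && echo yes || echo no; }

bre_matches 'ab^cd'  'ab^cd'    # yes: mid-pattern ^ is an ordinary character in a BRE
ere_matches 'ab^cd'  'ab^cd'    # no:  in an ERE, ^ is always an anchor
ere_matches 'ab\^cd' 'ab^cd'    # yes: escaping makes the ^ literal
```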

ERE operator precedence

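Operator                Meaning
[..] [==] [::]          Bracket symbols for character collation
\metacharacter          Escaped metacharacters
[]                      Bracket expressions
()                      Grouping
* + ? {}                Repetition of the preceding regular expression
no symbol               Concatenation
^ $                     Anchors
|                       Alternation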
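
Because alternation has the lowest precedence, anchors bind to the nearest alternative rather than to the whole pattern. The following sketch shows the effect; the helper precedence_sample and its four lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical input lines.
precedence_sample() { printf '%s\n' 'abcd' 'abcdX' 'Xefgh' 'efgh'; }

# Read as (^abcd) or (efgh$): starts with abcd, or ends with efgh.
# Every line qualifies, so the count is 4.
precedence_sample | grep -E -c '^abcd|efgh$'

# Grouping makes both anchors apply to both alternatives.
# Only abcd and efgh qualify, so the count is 2.
precedence_sample | grep -E -c '^(abcd|efgh)$'
```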

Additional GNU regular expression operators

Operator    Meaning
\w          Matches any word-constituent character. Equivalent to [[:alnum:]_].
\W          Matches any nonword-constituent character. Equivalent to [^[:alnum:]_].
\< \>       Matches the beginning and end of a word, respectively.
\b          Matches the null string found at either the beginning or the end of a word.
            This is a generalization of the \< and \> operators. Note: Because awk uses
            \b to represent the backspace character, GNU awk (gawk) uses \y.
\B          Matches the null string between two word-constituent characters.
\` \'       Matches the beginning and end of an emacs buffer, respectively. GNU
            programs (besides emacs) generally treat these as being equivalent to ^ and $.

        Finally, although POSIX explicitly states that the NUL character need not be matchable, GNU
programs have no such restriction. If a NUL character occurs in input data, it can be matched by
the . metacharacter or a bracket expression.
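
The word operators can be tried out with GNU grep. The sketch below assumes GNU grep specifically, since these operators are extensions; the helper name word_sample and its input are made up for the example.

```shell
#!/bin/sh
# Assumes GNU grep: \< \> \B and \w are extensions, not POSIX.
word_sample() { printf '%s\n' 'cat' 'concatenate' 'the cat sat'; }

# cat as a whole word: matches "cat" and "the cat sat". Prints: 2
word_sample | grep -c '\<cat\>'

# cat with word characters on both sides: only "concatenate". Prints: 1
word_sample | grep -c '\Bcat\B'

# Lines made up only of word-constituent characters. Prints: foo_bar1
printf '%s\n' 'foo_bar1' 'foo bar' | grep '^\w*$'
```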

Unix programs and their regular expression type

Type     grep    sed    ed    ex/vi    more    egrep    awk    lex
BRE       •       •     •       •       •
ERE                                              •       •      •
\< \>     •       •     •       •       •

Reposted from: https://my.oschina.net/nirno/blog/76312