A Beginner's Introduction to PCRE

qdujunjie

It all started with splitting a string into an array of characters:

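$str = "123我是谁ab1我和3356a&%%%??!!。。。~~~b……我是！，，，，";

// PHP 7.4+ ships a native mb_str_split(); define this fallback only on older versions.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string) {
        // Split at every position that is not the start (^) and not the end ($).
        return preg_split('/(?<!^)(?!$)/u', $string);
    }
}

// Call the function by name; the original "$mb_str_split(...)" referenced an undefined variable.
var_dump(mb_str_split($str));

Output: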

array
  0 => string '1' (length=1)
  1 => string '2' (length=1)
  2 => string '3' (length=1)
  3 => string '我' (length=3)
  4 => string '是' (length=3)
  5 => string '谁' (length=3)
  6 => string 'a' (length=1)
  7 => string 'b' (length=1)
  8 => string '1' (length=1)
  9 => string '我' (length=3)
  10 => string '和' (length=3)
  11 => string '3' (length=1)
  12 => string '3' (length=1)
  13 => string '5' (length=1)
  14 => string '6' (length=1)
  15 => string 'a' (length=1)
  16 => string '&' (length=1)
  17 => string '%' (length=1)
  18 => string '%' (length=1)
  19 => string '%' (length=1)
  20 => string '?' (length=1)
  21 => string '?' (length=1)
  22 => string '!' (length=1)
  23 => string '!' (length=1)
  24 => string '。' (length=3)
  25 => string '。' (length=3)
  26 => string '。' (length=3)
  27 => string '~' (length=1)
  28 => string '~' (length=1)
  29 => string '~' (length=1)
  30 => string 'b' (length=1)
  31 => string '…' (length=3)
  32 => string '…' (length=3)
  33 => string '我' (length=3)
  34 => string '是' (length=3)
  35 => string '！' (length=3)
  36 => string '，' (length=3)
  37 => string '，' (length=3)
  38 => string '，' (length=3)
  39 => string '，' (length=3)

The regular expression used here is /(?<!^)(?!$)/u

It is a PCRE pattern.

What is PCRE?

PCRE = Perl-Compatible Regular Expressions

Assertions in PCRE regular expressions:

<http://www.php.net/manual/zh/regexp.reference.assertions.php>

Assertions

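An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions are coded as \b, \B, \A, \Z, \z, ^ and $. More complicated assertions are coded as subpatterns. There are two kinds: lookahead assertions, which test the characters after the current position in the subject string, and lookbehind assertions, which test the characters before it.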

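An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to change. Lookahead assertions start with "(?=" for positive assertions and "(?!" for negative assertions. For example, \w+(?=;) matches a word followed by a semicolon but does not include the semicolon in the match, and foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern (?!foo)bar does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always TRUE when the next three characters are "bar". A lookbehind assertion is needed to achieve that effect.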
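The three lookahead patterns from the paragraph above can be tried directly. This is a minimal sketch; the subject strings and variable names are made up for illustration.

```php
<?php
// \w+(?=;) matches a word followed by a semicolon; the semicolon is not consumed.
preg_match('/\w+(?=;)/', 'return value;', $m);
$word = $m[0];
var_dump($word); // string(5) "value"

// foo(?!bar) matches "foo" only when it is not immediately followed by "bar".
preg_match_all('/foo(?!bar)/', 'foobar foobaz', $m, PREG_OFFSET_CAPTURE);
$fooOffsets = array_column($m[0], 1);
var_dump($fooOffsets); // only the second "foo", at byte offset 7

// (?!foo)bar matches every "bar", even the one preceded by "foo",
// because the lookahead inspects the characters AFTER the current position.
$barCount = preg_match_all('/(?!foo)bar/', 'foobar bar', $m);
var_dump($barCount); // int(2)
```

The last case is the trap the text warns about: the lookahead looks at "bar" itself, so it can never reject a "bar" based on what came before it.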

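Lookbehind assertions start with "(?<=" for positive assertions and "(?<!" for negative assertions. For example, (?<!foo)bar finds an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted so that every string it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus (?<=bullock|donkey) is permitted, but (?<!dogs?|cats?) causes a compile-time error. Branches that match strings of different lengths are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl 5.005, which requires all branches to match the same length of string. An assertion such as (?<=ab(c|de)) is not permitted, because its single top-level branch can match two different lengths, but it is acceptable if rewritten to use two top-level branches: (?<=abc|abde). Lookbehind assertions are implemented by, for each alternative, temporarily moving the current position back by the fixed width and then trying to match. If there are insufficient characters before the current position, the match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns can be particularly useful for matching at the ends of strings; an example is given at the end of the section on once-only subpatterns.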
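A short sketch of the lookbehind forms the paragraph above allows. The subject strings are invented. The variable-length case (?<!dogs?|cats?) is left out on purpose, since whether it is rejected depends on the PCRE version bundled with your PHP build.

```php
<?php
// (?<!foo)bar: "bar" that is not preceded by "foo".
preg_match_all('/(?<!foo)bar/', 'foobar xbar', $m, PREG_OFFSET_CAPTURE);
$barOffsets = array_column($m[0], 1);
var_dump($barOffsets); // only the second "bar", at byte offset 8

// Top-level alternatives may have different (but individually fixed) lengths.
preg_match('/(?<=bullock|donkey) cart/', 'a donkey cart', $m);
$afterAnimal = $m[0];
var_dump($afterAnimal); // string(5) " cart"

// (?<=ab(c|de)) rewritten as two fixed-length top-level branches.
$rewritten = preg_match('/(?<=abc|abde)X/', 'abdeX');
var_dump($rewritten); // int(1)
```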

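Several assertions (of any sort) may occur in succession. For example, (?<=\d{3})(?<!999)foo matches "foo" preceded by three digits that are not "999". Notice that each assertion is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, then a check that the same three characters are not "999". This pattern does not match "foo" preceded by six characters of which the first three are digits and the last three are not "999". For example, it does not match "123abcfoo". A pattern that does is (?<=\d{3}...)(?<!999)foo.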

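In this case the first assertion looks at the six characters before the current matching point and checks that the first three are digits; the second assertion then checks that the three characters immediately before the matching point are not "999".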
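To see how the two stacked-assertion patterns differ, the sketch below runs both against a few made-up subjects. The names $strict and $loose are only labels for this example.

```php
<?php
$strict = '/(?<=\d{3})(?<!999)foo/';     // the 3 characters before "foo": digits, but not 999
$loose  = '/(?<=\d{3}...)(?<!999)foo/';  // 6 characters before "foo": 3 digits, then 3 that are not 999

$results = array();
foreach (array('123foo', '999foo', '123abcfoo', '123999foo') as $subject) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$subject] = array(
        'strict' => preg_match($strict, $subject),
        'loose'  => preg_match($loose, $subject),
    );
}
print_r($results);
// 123foo     strict=1 loose=0  (only 3 characters precede "foo")
// 999foo     strict=0 loose=0
// 123abcfoo  strict=0 loose=1
// 123999foo  strict=0 loose=0
```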

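Assertions can be nested in any combination. For example, (?<=(?<!foo)bar)baz matches an occurrence of "baz" that is preceded by "bar", which in turn is not preceded by "foo". Another pattern, (?<=\d{3}...(?<!999))foo, matches "foo" preceded by three digits and then any three characters that are not "999".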
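The nested forms can be checked the same way. This is a small sketch with invented subjects.

```php
<?php
// "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
$nested = '/(?<=(?<!foo)bar)baz/';
$plainBar = preg_match($nested, 'barbaz');    // 1: "bar" sits at the start, nothing before it
$fooBar   = preg_match($nested, 'foobarbaz'); // 0: the "bar" is preceded by "foo"
$noBar    = preg_match($nested, 'xbaz');      // 0: no "bar" in front of "baz"

// "foo" preceded by three digits and then three characters that are not "999".
$inner = '/(?<=\d{3}...(?<!999))foo/';
$abc    = preg_match($inner, '123abcfoo'); // 1
$nines  = preg_match($inner, '123999foo'); // 0

var_dump($plainBar, $fooBar, $noBar, $abc, $nines);
```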

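Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. If any kind of assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions, because it does not make sense for negative assertions.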

Assertions count towards the maximum of 200 parenthesized subpatterns.

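From the section above we can read /(?<!^)(?!$)/u as follows. The part (?!$) is a negative lookahead on $: it succeeds only where the current position is not the end of the string. The part (?<!^) is a negative lookbehind on ^: it succeeds only where the current position is not the start of the string. Together they produce an empty match at every position strictly inside the string, and those are exactly the positions where preg_split() cuts.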
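The effect of the pattern is easiest to see by running it with and without the /u modifier on a short string. This is a minimal sketch: the subject "a我b" is written with explicit UTF-8 bytes so that the result does not depend on how the source file is saved, and the variable names are illustrative only.

```php
<?php
// "a", then U+6211 (我) encoded as 3 UTF-8 bytes, then "b".
$text = "a\xe6\x88\x91b";

// With /u the "positions" are between UTF-8 characters.
$chars = preg_split('/(?<!^)(?!$)/u', $text);
var_dump(count($chars));   // int(3)
var_dump(strlen($chars[1])); // int(3): the whole multi-byte character stays together

// Without /u the "positions" are between bytes, so the character is torn apart.
$bytes = preg_split('/(?<!^)(?!$)/', $text);
var_dump(count($bytes));   // int(5)

// A common alternative: split on the empty pattern and drop the empty pieces.
$alt = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
var_dump($alt === $chars); // bool(true)
```

On PHP 7.4 and later the native mb_str_split() covers the same use case without a regular expression.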

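The trailing /u is a PCRE pattern modifier: it makes both the pattern and the subject string be treated as UTF-8. The following note explains how it handles UTF-8 validity: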

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above - "UTF-8 validity of the pattern is checked since PHP 4.3.5")

2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically results in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )

4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn't;

<?php
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",

    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28", // only the 4th octet is invalid
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

echo "++Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
    echo "$name\n";
    preg_match("/".$str."/u",'Testing');
}

echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {

    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";

    if ( count($ar) == 0 ) {
      echo "Matched nothing!\n";
    } else {
      echo "Matched {$ar[0]}\n";
    }

}

echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";

    $num_utf8_chars = count($ar[0]);
    if ( $num_utf8_chars == 0 ) {
      echo "Matched nothing!\n";
    } else {
      echo "Matched $num_utf8_chars character(s)\n";
    }

}
?>
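The "quiet death" described in point 2 can be detected after the fact: the preg_* functions return false on such a failure, and preg_last_error() reports the reason. The helper below is a hypothetical sketch for illustration (utf8_match_status is not part of PHP or of the note quoted above).

```php
<?php
// Hypothetical helper: classify what happened when matching /./u against $subject.
function utf8_match_status($subject) {
    $result = preg_match('/./u', $subject);
    if ($result === false) {
        // preg_match() failed; ask PCRE why.
        return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'invalid UTF-8' : 'other error';
    }
    return $result === 1 ? 'matched' : 'no match';
}

echo utf8_match_status("a"), "\n";        // matched
echo utf8_match_status("\xc3\xb1"), "\n"; // matched (valid 2-octet sequence)
echo utf8_match_status("\xc3\x28"), "\n"; // invalid UTF-8
echo utf8_match_status(""), "\n";         // no match
```

Comparing the return value with === false matters here, because a legitimate "no match" returns 0, which is also falsy.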