Lua: The Frontier Pattern: %f


hiheasy


Article link: https://blog.csdn.net/hiheasy/article/details/7937831


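My project needs regular expressions, but the regex implementations in most open-source libraries are huge: the debug build of Boost.Regex (1.51) comes to 32.9 MB, which is hard to accept. Lua's pattern matching, by contrast, is implemented in only about 500 lines and still covers most needs, so I spent some time reading the Lua implementation. Lua's pattern matching lives in the string library, which is described in the Lua reference manual.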

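Once you know which features Lua patterns support, reading the source (lstrlib.c) follows naturally. Two pattern items, however, are poorly covered online: %b and %f. This post covers only those two.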

The key code for the %b and %f pattern items is as follows:

 
  case L_ESC: {
    switch (*(p+1)) {
      case 'b': {  /* balanced string? */
        s = matchbalance(ms, s, p+2);
        if (s == NULL) return NULL;
        p+=4; goto init;  /* else return match(ms, s, p+4); */
      }
      case 'f': {  /* frontier? */
        const char *ep; char previous;
        p += 2;
        if (*p != '[')
          luaL_error(ms->L, "missing " LUA_QL("[") " after "
                             LUA_QL("%%f") " in pattern");
        ep = classend(ms, p);  /* points to what is next */
        previous = (s == ms->src_init) ? '\0' : *(s-1);
        /* fail if the character before s is in the set,
           or if the character at s is not in the set */
        if (matchbracketclass(uchar(previous), p, ep-1) ||
           !matchbracketclass(uchar(*s), p, ep-1)) return NULL;
        /* p advances past the set, but s does not move */
        p=ep; goto init;  /* else return match(ms, s, ep); */
      }
      default: {
        if (isdigit(uchar(*(p+1)))) {  /* capture results (%0-%9)? */
          s = match_capture(ms, s, uchar(*(p+1)));
          if (s == NULL) return NULL;
          p+=2; goto init;  /* else return match(ms, s, p+2) */
        }
        goto dflt;  /* case default */
      }
    }
  }
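Both branches above call luaL_error when the pattern itself is malformed, so a bad %f or %b raises an error instead of returning nil. The following is a small sketch of how that surfaces from Lua code; the two malformed patterns are made up for illustration, and the message text is only printed because its wording is assumed to vary between Lua versions.

```lua
-- Each call is expected to raise a pattern error rather than return nil.
-- pcall turns the error into a (false, message) pair.
local ok_f, err_f = pcall(string.find, "abc", "%fa")  -- %f not followed by '['
local ok_b, err_b = pcall(string.find, "abc", "%b(")  -- %b given only one character

print(ok_f, err_f)
print(ok_b, err_b)
```

Wrapping the call in pcall is one way to validate a user-supplied pattern before using it.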

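(1) '%b' matches balanced delimiters. It is usually written '%bxy', where x and y are two distinct characters: x opens the match and y closes it. For example, '%b()' matches a substring that starts with '(' and ends with the matching ')'. In the example below, x is 'a' and y is 'g':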

print(string.gsub("123abcdefg456", "%bag", ""))
--> 123456  1   (the result string, then the number of substitutions)
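The example above has no nesting, so it does not show what "balanced" adds over a plain start-to-end match. Here is a short sketch with nested parentheses; the sample strings are invented for illustration.

```lua
local s = "f(a(b)c) + g(d)"

-- The first balanced "(...)" span, including the nested pair.
print(string.match(s, "%b()"))        --> (a(b)c)

-- Remove every balanced span; the extra parentheses drop gsub's count.
print((string.gsub(s, "%b()", "")))   --> f + g

-- No closing ')' ever brings the counter back to zero, so nothing matches.
print(string.match("(a(b", "%b()"))   --> nil
```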

 
  static const char *matchbalance (MatchState *ms, const char *s,
                                   const char *p) {
    if (*p == 0 || *(p+1) == 0)
      luaL_error(ms->L, "unbalanced pattern");
    if (*s != *p) return NULL;
    else {
      int b = *p;      /* opening character: the x in %bxy */
      int e = *(p+1);  /* closing character: the y in %bxy */
      int cont = 1;    /* number of x characters still open */
      while (++s < ms->src_end) {
        if (*s == e) {
          if (--cont == 0) return s+1;  /* found the y that closes the first x */
        }
        else if (*s == b) cont++;  /* found another x */
      }
    }
    return NULL;  /* string ends out of balance */
  }

The heart of it is matchbalance, which finds the part of the subject enclosed by the x and y of %bxy.
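The counter logic is easier to experiment with in Lua itself. The following is a rough sketch that mirrors the C function; match_balance and its parameters are hypothetical names introduced here, not part of the string library, and it uses 1-based indices where the C code uses pointers.

```lua
-- Hypothetical Lua rendering of the counter logic in matchbalance.
-- s: subject, i: 1-based start index, open/close: the x and y of %bxy.
-- Returns the index just past the balanced span, or nil if there is none.
local function match_balance(s, i, open, close)
  if s:sub(i, i) ~= open then return nil end  -- the span must start on x
  local depth = 1                             -- plays the role of cont
  for j = i + 1, #s do
    local c = s:sub(j, j)
    if c == close then                        -- y is tested first, as in C
      depth = depth - 1
      if depth == 0 then return j + 1 end
    elseif c == open then
      depth = depth + 1
    end
  end
  return nil                                  -- subject ended out of balance
end

print(match_balance("(a(b)c)d", 1, "(", ")"))  --> 8
print(match_balance("(a(b", 1, "(", ")"))      --> nil
```

The first result agrees with the built-in matcher: string.find("(a(b)c)d", "^%b()") ends at index 7, one before the value returned here.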

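(2) %f is a pattern item that was not listed in the Lua manual when this was written (it has been documented since Lua 5.2), yet it is remarkably useful. %f is known as the frontier pattern.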

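The test passes only when the current character belongs to the set and the previous character does not. %f is purely a test: it consumes nothing and does not change the current position in the subject string.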
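That %f consumes nothing can be checked directly by using it on its own. This is a minimal sketch with an invented sample string; each match is an empty string sitting at a non-letter to letter transition.

```lua
-- A frontier alone matches the empty string at each boundary.
-- find reports start 1 and end 0, which is how it signals an empty match.
print(string.find("hello world", "%f[%a]"))         --> 1  0

-- Inserting a marker at every boundary leaves the original text intact.
print((string.gsub("hello world", "%f[%a]", "<")))  --> <hello <world
```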

For example, to print every all-uppercase word in a line:

string.gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", print)

THE
QUICK
JUMPS

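%f[%a] tests that the current position holds a letter and the previous position holds a non-letter. The trailing %f[%A] tests the opposite: the current position holds a non-letter and the previous position holds a letter.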
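The two-sided test can be sketched in Lua to make the rule concrete. is_frontier is a hypothetical helper written for this illustration; it takes the bracketed set as a pattern string and, like the C code, treats a position before the start of the subject as '\0'. Treating the position past the end as '\0' too is an assumption that matches the output shown above, where JUMPS is matched at the end of the line.

```lua
-- Hypothetical helper: does a frontier for `set` hold at 1-based index i?
-- set is a bracketed class such as "[%a]" or "[%A]".
local function is_frontier(s, i, set)
  local prev = (i == 1) and "\0" or s:sub(i - 1, i - 1)
  local cur = (i > #s) and "\0" or s:sub(i, i)
  -- previous character must NOT be in the set, current character must be
  return not prev:find(set) and cur:find(set) ~= nil
end

local s = "THE (QUICK)"
print(is_frontier(s, 1, "[%a]"))  --> true   (start of string, then 'T')
print(is_frontier(s, 2, "[%a]"))  --> false  ('T' before 'H' is a letter)
print(is_frontier(s, 4, "[%A]"))  --> true   ('E', then a space)
```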

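Matching starts at offset 0 (counting from zero, as the C code does), which is the character T. There is no previous character, so it is taken to be '\0', a non-letter, and %f[%a] passes. Because %f does not move the position in the subject, %u+ still starts checking at offset 0. %u+ matches uppercase letters greedily, so it consumes T, H and E and stops at offset 3. Now %f[%A] is tested. It requires the current character to be a non-letter, and the current character is a space, so that holds. It also requires the previous character to be a letter, and the previous character is E, so that holds too. The %f[%A] test therefore succeeds, and THE is matched.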
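To see what the two frontiers contribute, it helps to run the same line with and without them. The sketch below uses a hypothetical collect helper that gathers gmatch results into a table; without the frontiers, the uppercase runs inside brOWN and FOx are picked up as well.

```lua
local line = "THE (QUICK) brOWN FOx JUMPS"

-- Hypothetical helper: collect every match of `pattern` in `s` into a list.
local function collect(s, pattern)
  local out = {}
  for word in string.gmatch(s, pattern) do
    out[#out + 1] = word
  end
  return out
end

-- With frontiers: only runs of uppercase letters that form a whole word.
print(table.concat(collect(line, "%f[%a]%u+%f[%A]"), " "))  --> THE QUICK JUMPS

-- Without frontiers: any run of uppercase letters, even inside a word.
print(table.concat(collect(line, "%u+"), " "))  --> THE QUICK OWN FO JUMPS

-- The first match spans 1-based indices 1 to 3, as in the walkthrough.
print(string.find(line, "%f[%a]%u+%f[%A]"))     --> 1  3
```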