[C][PCRE] Regular Expressions: From Getting Started to Hands-On Practice

时暑

Category: C++/C

Article link: https://blog.csdn.net/wjb123sw99/article/details/103653247

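Preface:

Regular expressions are used to search for text that follows a user-defined rule. They are very effective for tasks such as validating phone numbers or ID numbers entered by users, or checking whether a chosen password is strong enough. Their main drawback is equally clear: they are hard to read.

The examples in this article use PCRE, a regular expression engine for C, so that you can get started quickly through worked examples.

Getting started:

A regular expression is made up of ordinary characters (upper- and lowercase letters, digits, punctuation, and so on) and metacharacters. The commonly used metacharacters are listed below: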

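Common metacharacters
Metacharacter	Meaning
.	Matches any character except a newline
\b	Matches the start or end of a word
\d	Matches a digit
\w	Matches a letter, digit, or underscore (in PCRE, non-ASCII characters such as Chinese are included only when Unicode support is enabled)
\s	Matches any whitespace character, including spaces, tabs, and newlines
^	Matches the start of the string
$	Matches the end of the string
x|y	Matches x or y.
+	Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
(pattern)	Matches pattern and captures the match. In PCRE the offsets of each captured substring are returned in the ovector array (VBScript exposes captures through the SubMatches collection, JScript through the $0…$9 properties). To match a literal parenthesis, use '\(' or '\)'.
?	When this character immediately follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, on the string "oooo", 'o+?' matches a single "o", while 'o+' matches all of the 'o' characters.
{n}	n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob" but does match the two o's in "food".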
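
The '+' and '{n}' rows of the table can be checked with a few lines of C. This is only a sketch under stated assumptions: it uses the POSIX <regex.h> functions (regcomp, regexec, regfree) in extended mode instead of PCRE so that it builds without extra libraries, and the helper name is_match is made up for this illustration. POSIX extended syntax has no \d, so only metacharacters shared by both syntaxes appear here.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE or POSIX): returns 1 if pattern
   matches anywhere in text, 0 if it does not, and -1 if the pattern
   fails to compile. */
int is_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, is_match("zo+", "zoo") reports a match while is_match("zo+", "z") does not, and is_match("o{2}", "food") reports a match while is_match("o{2}", "Bob") does not, which agrees with the table.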

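Like arithmetic expressions, regular expressions are evaluated from left to right among operators of equal precedence, and operators of higher precedence are applied before those of lower precedence. The table below lists the regular expression operators in order of precedence.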

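Operator precedence
Operator	Description	Precedence
\	Escape character	Level 1, highest precedence
(), (?:), (?=), []	Parentheses and brackets	Level 2
*, +, ?, {n}, {n,}, {n,m}	Quantifiers	Level 3
^, $, \metacharacter, any character	Anchors and sequences (position and order)	Level 4
|	Alternation, the "or" operation	Level 5, lowest precedence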
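
The effect of these precedence levels shows up in how much text a pattern consumes. The sketch below again assumes the POSIX <regex.h> API rather than PCRE, and first_match_len is a hypothetical helper written for this illustration. It returns the length of the first match, so grouped and ungrouped variants of a pattern can be compared.

```c
#include <regex.h>

/* Hypothetical helper: length of the first match of pattern in text,
   or -1 if nothing matches or the pattern does not compile. */
int first_match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];   /* m[0] receives the offsets of the whole match */
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (int)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

Because | binds most loosely, "ab|cd" means "ab" or "cd", so on "abd" it matches only "ab" (length 2), whereas "a(b|c)d" matches all of "abd" (length 3). Because quantifiers bind more tightly than sequences, "zo+" matches all of "zoo" (length 3), whereas "(zo)+" matches only "zo" (length 2).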

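Practice:

Let's start with a first, simple regular expression: He|he

Explanation: this expression uses the x|y metacharacter and finds every occurrence of He or he in a string.

Example: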

#include <stdio.h>
#include <string.h>
#include "pcre.h"
#define OVECCOUNT 30 /* should be a multiple of 3 */
 
int main()
{
    pcre *re = 0;
    const char *error = 0;
    int erroffset = 0;
    int rc = 0;
    int ovector[OVECCOUNT];
    char buf[] = "He‘s a friend of mine.\
He‘s a friend of mine.\
I think of him.\
In the pain of the night\
My pain is fading away\
Because he is a friend of mine.";
    char *p  = 0;

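    /*
    pcre_compile API overview
    pcre *pcre_compile(const char *pattern, int options, const char **errptr,
     int *erroffset, const unsigned char *tableptr);
    
    pattern[in]: the regular expression
    options[in]: option bits, usually 0
    errptr[out]: error message if compilation fails
    erroffset[out]: offset in the pattern where the error was found
    tableptr[in]: character tables, usually NULL (use the built-in tables)

    return: NULL on failure; on success, a pointer to the compiled pattern,
            which must be released with pcre_free
    */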
    re = pcre_compile("He|he", 0, &error, &erroffset, NULL);
    if (re == NULL) 
    {
        printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
        return 1;
    }

    p = buf;
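    /*
    pcre_exec API overview
    int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length,
     int startoffset, int options, int *ovector, int ovecsize);

    code[in]: pointer to the compiled pattern
    extra[in]: extra study data, usually NULL
    subject[in]: the string to search
    length[in]: length of the subject string in bytes
    startoffset[in]: offset at which matching starts, usually 0
    options[in]: option bits, usually 0
    ovector[out]: int array that receives the match offsets
    ovecsize[in]: number of elements in ovector (not its size in bytes)

    On success returns one more than the highest-numbered capture group that
    matched (0 if ovector is too small). Returns PCRE_ERROR_NOMATCH if nothing
    matched, or another negative code on error.
    */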
    /* Scan repeatedly from p; the offsets in ovector are relative to p.
       Any negative return value (no match or an error) ends the loop. */
    while ( ( rc = pcre_exec(re, NULL, p, (int)strlen(p), 0, 0, ovector, OVECCOUNT)) >= 0 )
    {
        int i = 0;
        /* Pair 0 is the whole match; later pairs are capture groups. */
        for (i = 0; i < rc; i++)
        {
            char *substring_start = p + ovector[2*i];
            int substring_length = ovector[2*i+1] - ovector[2*i];
            /* %.*s prints exactly substring_length bytes, so no copy buffer is needed. */
            printf( "match[%d][%d][%d]:%.*s\n", i, ovector[2*i], substring_length, substring_length, substring_start );
        }
        /* Stop if the match made no progress or reached the end of the buffer. */
        if ( ovector[1] == 0 || p[ovector[1]] == '\0' )
        {
            break;
        }
        p += ovector[1];
    }
    pcre_free(re);
    return 0;
}

Output (first two matches shown; offsets are relative to the end of the previous match):

match[0][0][2]:He
match[0][22][2]:He

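Next, a more complex regular expression: (\d+)-(\d+)-(\d+)

Explanation: each \d+ matches a run of one or more digits. \d matches a single digit, and + extends the match over the consecutive digits that follow it.

The whole expression therefore finds strings made of three digit runs separated by the character -. Each pair of parentheses also captures its digit run as a separate group.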
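
To see what the three groups capture, here is a minimal sketch. It assumes the POSIX <regex.h> API instead of PCRE, writes [0-9] in place of \d because POSIX extended syntax lacks that shorthand, and uses a hypothetical helper named split_triplet that exists only for this illustration.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: finds the first "digits-digits-digits" run in text
   and stores the three captured numbers in out[0..2].
   Returns 1 on a match, 0 otherwise. */
int split_triplet(const char *text, long out[3])
{
    regex_t re;
    regmatch_t m[4];   /* m[0] = whole match, m[1..3] = the three groups */
    int i;
    int found = 0;

    if (regcomp(&re, "([0-9]+)-([0-9]+)-([0-9]+)", REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, text, 4, m, 0) == 0) {
        /* strtol stops at the first non-digit, i.e. at the end of each group. */
        for (i = 0; i < 3; i++)
            out[i] = strtol(text + m[i + 1].rm_so, NULL, 10);
        found = 1;
    }
    regfree(&re);
    return found;
}
```

Called on "2019-11-w    2019-18-8", the helper skips the incomplete first candidate and fills out with 2019, 18 and 8. In the PCRE listings of this article, the same group boundaries are available as the offset pairs ovector[2] through ovector[7].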

Example:

#include <stdio.h>
#include <string.h>
#include "pcre.h"
#define OVECCOUNT 30 /* should be a multiple of 3 */
 
int main()
{
    pcre *re = 0;
    const char *error = 0;
    int erroffset = 0;
    int rc = 0;
    int ovector[OVECCOUNT];
    char buf[] = "2019-11-w\
    2019-18-8\
    -22-22-\
    hwlw-ww-ww";
    char *p  = 0;

    re = pcre_compile("(\\d+)-(\\d+)-(\\d+)", 0, &error, &erroffset, NULL);
    if (re == NULL) 
    {
        printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
        return 1;
    }

    p = buf;
    /* Scan repeatedly from p; the offsets in ovector are relative to p.
       Any negative return value (no match or an error) ends the loop. */
    while ( ( rc = pcre_exec(re, NULL, p, (int)strlen(p), 0, 0, ovector, OVECCOUNT)) >= 0 )
    {
        int i = 0; /* pair 0 is the whole match; pairs 1..3 hold the three groups */
        char *substring_start = p + ovector[2*i];
        int substring_length = ovector[2*i+1] - ovector[2*i];
        /* %.*s prints exactly substring_length bytes, so no copy buffer is needed. */
        printf( "match[%d][%d][%d]:%.*s\n", i, ovector[2*i], substring_length, substring_length, substring_start );
        /* Stop if the match made no progress or reached the end of the buffer. */
        if ( ovector[1] == 0 || p[ovector[1]] == '\0' )
        {
            break;
        }
        p += ovector[1];
    }
    pcre_free(re);
    return 0;
}

Output:

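match[0][13][9]:2019-18-8

Finally, a regular expression with a practical use: ((25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)

Explanation: this expression shows the drawback of regular expressions, their poor readability. What it actually does is match IPv4 addresses inside a string.

Each of the four dot-separated fields of an IP address ranges from 0 to 255: [0, 255].[0, 255].[0, 255].[0, 255]

The main difficulty is writing a rule for the range 0 to 255, so we focus on (25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)

The alternatives are tried from left to right. First, 25[0-5] matches numbers from 250 to 255.

If that fails, 2[0-4]\d matches numbers from 200 to 249.

If that fails, 1\d{2} matches numbers from 100 to 199.

If that fails, [1-9]?\d matches numbers from 0 to 99.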
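
The four alternatives can also be written out as ordinary C conditions, which makes the ranges easier to verify. This is only an illustrative sketch: octet_branch is a hypothetical helper, it uses no regex library at all, and it tests a whole string against each alternative rather than searching inside a longer text.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: reports which alternative of
   (25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d) matches the whole string s.
   Returns 1..4 for the alternative, or 0 if s is not a value in 0..255. */
int octet_branch(const char *s)
{
    size_t n = strlen(s);
    size_t i;

    if (n == 0 || n > 3)
        return 0;
    for (i = 0; i < n; i++)
        if (!isdigit((unsigned char)s[i]))
            return 0;

    if (n == 3 && s[0] == '2' && s[1] == '5' && s[2] <= '5')
        return 1;                       /* 25[0-5]   : 250..255 */
    if (n == 3 && s[0] == '2' && s[1] <= '4')
        return 2;                       /* 2[0-4]\d  : 200..249 */
    if (n == 3 && s[0] == '1')
        return 3;                       /* 1\d{2}    : 100..199 */
    if (n == 1 || (n == 2 && s[0] != '0'))
        return 4;                       /* [1-9]?\d  : 0..99 */
    return 0;                           /* e.g. "256", "300", "07" */
}
```

For instance, "255" falls under the first alternative, "249" under the second, "127" under the third and "0" under the fourth, while "257" is rejected by all four. That is why 127.257.257.257 is not reported as an address.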

Example:

#include <stdio.h>
#include <string.h>
#include "pcre.h"
#define OVECCOUNT 30 /* should be a multiple of 3 */
 
int main()
{
    pcre *re = 0;
    const char *error = 0;
    int erroffset = 0;
    int rc = 0;
    int ovector[OVECCOUNT];
    char buf[] = "127.0.0.1-\
    127.257.257.257-\
    255.255.255.255";
    char *p  = 0;

    re = pcre_compile("((25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)\\.){3}(25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)", 0, &error, &erroffset, NULL);
    if (re == NULL) 
    {
        printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
        return 1;
    }

    p = buf;
    /* Scan repeatedly from p; the offsets in ovector are relative to p.
       Any negative return value (no match or an error) ends the loop. */
    while ( ( rc = pcre_exec(re, NULL, p, (int)strlen(p), 0, 0, ovector, OVECCOUNT)) >= 0 )
    {
        int i = 0; /* pair 0 is the whole matched address */
        char *substring_start = p + ovector[2*i];
        int substring_length = ovector[2*i+1] - ovector[2*i];
        /* %.*s prints exactly substring_length bytes, so no copy buffer is needed. */
        printf( "match[%d][%d][%d]:%.*s\n", i, ovector[2*i], substring_length, substring_length, substring_start );
        /* Stop if the match made no progress or reached the end of the buffer. */
        if ( ovector[1] == 0 || p[ovector[1]] == '\0' )
        {
            break;
        }
        p += ovector[1];
    }
    pcre_free(re);
    return 0;
}

Output:

match[0][0][9]:127.0.0.1
match[0][25][15]:255.255.255.255
