#Strongly recommending Boost regex# Performance comparison of the C, C++11, and Boost regex libraries

Latest recommended article published 2024-08-02 13:15:05

卡卡罗特Z

Tags: regular expressions, C, C++11, Boost, performance

Article link: https://blog.csdn.net/we_izheng/article/details/40859615

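In a recent project I found that our existing regex matching module performed badly on long strings, so I did a little research into regex matching on long strings, with examples. This post draws on the blog post http://www.cnblogs.com/pmars/archive/2012/10/24/2736831.html (Article 1 below), which also compares the three regex libraries. I see a few things differently, though, and list them here. Thanks to its author.

1. C regex library

Since the project has always used the C regex library, I looked at it first. For its interfaces and parameters, see Article 1; for a more authoritative reference, see http://pubs.opengroup.org/onlinepubs/000095399/functions/regcomp.html (Article 2 below).

The C regex performance test works as follows. As usual, we first wrap the C regex calls in a single function that takes a pattern string and a target string and returns 1 if they match, 0 otherwise.

As described in Article 1, the function first calls regcomp to compile the pattern string. Compilation takes flags that select extended or basic syntax (REG_EXTENDED), case-insensitive matching (REG_ICASE), whether match positions are reported (REG_NOSUB), and how newlines are treated (REG_NEWLINE; see Article 2). It then calls regexec to match the target string against the compiled expression. The remaining regexec arguments say where to store the match positions and whether the start and end of the string count as a line start for ^ and a line end for $ (REG_NOTBOL, REG_NOTEOL); see Articles 1 and 2 for details. Finally, it calls regfree to release the memory.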

/* Returns 1 if target matches pattern, 0 otherwise (including on a compile error). */
int match(const char* pattern, const char* target)
{
	regex_t oRegex;
	int nErrCode = 0;
	char szErrMsg[1024] = {0};

	/* cflags 0 selects POSIX basic syntax */
	if ((nErrCode = regcomp(&oRegex, pattern, 0)) != 0)
	{
		/* after a failed regcomp the regex_t must not be passed to regfree */
		regerror(nErrCode, &oRegex, szErrMsg, sizeof(szErrMsg));
		return 0;
	}

	/* nmatch 0 and pmatch NULL: we only ask whether a match exists */
	if ((nErrCode = regexec(&oRegex, target, 0, NULL, 0)) != 0)
	{
		/* regerror NUL-terminates szErrMsg itself; the message is not printed */
		regerror(nErrCode, &oRegex, szErrMsg, sizeof(szErrMsg));
	}

	regfree(&oRegex);
	return nErrCode == 0;
}
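The match function above passes 0 for every flag, so it uses basic syntax and never asks where the match is. As a rough sketch of the other options mentioned earlier, the helper below compiles with REG_EXTENDED | REG_ICASE and reads the match offsets through a regmatch_t array. The name first_match_icase is made up for illustration and is not part of the original test code; it is written as C++ so that it can return a std::string.

```cpp
#include <regex.h>   // POSIX regex API
#include <string>

// Hypothetical helper: returns the first substring of target matched by an
// extended, case-insensitive pattern. Returns an empty string if nothing
// matches or the pattern fails to compile (an empty match looks the same).
std::string first_match_icase(const char* pattern, const char* target)
{
    regex_t re;
    // REG_EXTENDED: extended syntax; REG_ICASE: ignore case.
    // REG_NOSUB is deliberately not set because we want match offsets.
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return std::string();

    regmatch_t m[1];  // m[0] receives the offsets of the whole match
    std::string result;
    if (regexec(&re, target, 1, m, 0) == 0)
        result.assign(target + m[0].rm_so, m[0].rm_eo - m[0].rm_so);

    regfree(&re);
    return result;
}
```

Asking for offsets may cost more than the plain yes/no query used in the benchmark, so this variant is for illustration rather than for timing.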

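After setting the pattern and target strings, the test program calls this function 10000 times and uses the difference between the start and end times as the performance measure. The C++11 and Boost tests below use the same method: simple and effective.

Since the call is repeated 10000 times, an obvious objection is that match compiles, matches, and frees the same pattern on every iteration, while all we want to measure is matching. Compiling once should be enough. So here is an improved wrapper that takes a pointer to an already compiled regex_t; the caller compiles before the loop and frees the memory after the 10000 iterations.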

/* Returns 1 if target matches the precompiled pattern, 0 otherwise. */
int match_pre_comp(regex_t * pattern, const char* target)
{
	int nErrCode = 0;
	char szErrMsg[1024] = {0};

	if ((nErrCode = regexec(pattern, target, 0, NULL, 0)) == 0)
	{
		return 1;
	}

	/* regerror NUL-terminates szErrMsg itself; the message is not printed */
	regerror(nErrCode, pattern, szErrMsg, sizeof(szErrMsg));

	return 0;
}

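To test both approaches, a macro selects which version gets compiled. The main program is below; it deliberately uses a long target string (about 1000 characters). With #define NOT_PRE_COMP the first approach is used, and with #define PRE_COMP the second; exactly one of the two must be defined. #define LOOP_COUNT (XXX) sets the number of iterations. Changing any of these requires recompiling.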

/*
 * Program: 
 *   Tests C regex performance
 * Platform
 *   Ubuntu14.04     gcc-4.8.2
 * History:
 *   weizheng     2014.11.05    1.0
 */

#include <sys/time.h>
#include <stdio.h>
#include <regex.h>
#define LOOP_COUNT ( 10000 )
#define PRE_COMP

/*
 * define exactly one of PRE_COMP or NOT_PRE_COMP to choose whether the
 * regular expression is precompiled
 */
#if defined(PRE_COMP) && defined(NOT_PRE_COMP)
	#error cannot define PRE_COMP and NOT_PRE_COMP at the same time
#elif !defined(NOT_PRE_COMP) && !defined(PRE_COMP)
	#error please define PRE_COMP or NOT_PRE_COMP
#endif

/* match() and match_pre_comp() as defined above go here */

/************************************ main ************************************/

int main(void)
{
	char pattern[] = "ywfuncFlag.*rzgw/rzdk_rzzl.jsp";
	char target[] = "<xml><param0>{\"bod\":{\"autoLoad\":false,\"keys\":[\"YWFUNC\"],\"state\":4,\"supportDynamic\":true},\"compId\":\"\",\"flag\":0,\"realUrl\":\"\",\"remoteIp\":\"\",\"ywType\":\"2\",\"ywfunc\":\"39\",\"ywfuncFlag\":\"/FMISWeb/faces/financing/dhkmanage/rzgw/rzdk_rzzl.jsp?^IRZDKServiceBC/RZFS%3d00000007%26amp;^IRZDKServiceBC/BEGINDATE%3d2014-10-01%26amp;^IRZDKServiceBC/ENDDATE%3d2014-10-31%26amp;^IRZDKServiceBC/ISTJTH%3d8\",\"ywfuncName\":\"\",\"ywfuncPageId\":\"DYHKRZZL\",\"RZWAIT_ZLLIST\":{\"bod\":{\"autoLoad\":false,\"keys\":[\"GID\"],\"state\":4,\"supportDynamic\":true},\"listItemClass\":\"com.soft.grm.investfinancing.service.masterdata.model.bo.RzptZltsOldModel\",\"listType\":0,\"masterProperty\":[],\"notUpdateField\":[],\"tableName\":\"\",\"useColumnNameToXml\":true,\"metaList\":{\"GID\":{\"columnCaption\":\"GID\",\"columnIndex\":0,\"columnName\":\"GID\",\"length\":0,\"mapName\":\"GID\",\"nullAble\":true,\"scale\":-127,\"sqlType\":2},\"ZLTS\":{\"columnCaption\":\"ZLTS\",\"columnIndex\":2,\"columnName\":\"ZLTS\",\"length\":4,\"mapName\":\"ZLTS\",\"nullAble\":true,\"scale\":0,\"sqlType\":";

#ifdef PRE_COMP
	regex_t oRegex;
	if (regcomp(&oRegex, pattern, 0))
	{
		/* do not go on to match with an uncompiled regex_t */
		printf("regex compile error\n");
		return 1;
	}
#endif

	/*
	 * record the start time
	 */
	struct timeval tv_start, tv_end;
	gettimeofday(&tv_start, NULL);

	/*
	 * matching
	 */
	int count = 0;
	int i; /* declared outside the loop so the file also builds in gcc's default C90 mode */
	for(i = 0; i < LOOP_COUNT; i++)
	{
#ifdef PRE_COMP
		if(match_pre_comp(&oRegex, target))
#endif

#ifdef NOT_PRE_COMP
		if(match(pattern, target))
#endif
		{
			count++;
		}
	}

	/*
	 * record the end time
	 */
	gettimeofday(&tv_end, NULL);
	unsigned long time_used = (tv_end.tv_sec * 1000000 + tv_end.tv_usec - (tv_start.tv_sec * 1000000 + tv_start.tv_usec))/1000;

#ifdef PRE_COMP
	regfree(&oRegex);
#endif
	printf("used:   %lu ms\n", time_used);
	printf("matched %d times\n", count);

	return 0;
}

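Compiling and running both versions gives the following results.

Without precompiling the pattern, i.e. recompiling it inside the loop every time: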

weizheng@weizheng-MS-7798:~/test$ ./c_regex_main 
used:   418 ms
matched 10000 times

With the pattern compiled once before the loop and the compiled expression passed in, so nothing is recompiled:

weizheng@weizheng-MS-7798:~/test$ ./c_regex_main 
used:   404 ms
matched 10000 times

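These are the C regex matching times. Whether that is fast or slow depends on what you compare it with; read on to find out. For C regex itself, though, precompiling makes little difference (418 ms versus 404 ms). Presumably, for a string this long, compilation takes far less time than matching, so compiling once or 10000 times hardly matters.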
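One way to check the guess that compilation is cheap is to time regcomp and regfree with no matching at all. The sketch below does that; compile_only_ms is a hypothetical helper that was not part of the original test, and std::chrono stands in for gettimeofday, so its numbers are not directly comparable with the figures above.

```cpp
#include <regex.h>   // POSIX regex API
#include <chrono>

// Hypothetical helper: total milliseconds spent in `loops` rounds of
// regcomp + regfree for one pattern, with no matching at all.
// Returns -1.0 if the pattern does not compile.
double compile_only_ms(const char* pattern, int loops)
{
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();

    for (int i = 0; i < loops; ++i)
    {
        regex_t re;
        if (regcomp(&re, pattern, 0) != 0)  // 0: basic syntax, as in the test
            return -1.0;
        regfree(&re);
    }

    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count();
}
```

Calling compile_only_ms("ywfuncFlag.*rzgw/rzdk_rzzl.jsp", 10000) and comparing the result with the 14 ms gap between the two runs would show whether compilation accounts for that difference.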

2. C++11 regex

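I'll admit that one of my main reasons for writing this post is to complain about C++11 regex, or more precisely about g++ 4.8. C++ provides two calls for regex matching, regex_match and regex_search: regex_match returns true if the entire input sequence matches the expression, and regex_search returns true if some substring of the input matches it (see C++ Primer, 5th edition, Chinese edition, p. 646). regex_match is fine, but regex_search had me cursing up a storm...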
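The difference between the two calls can be sketched in a few lines. This assumes a standard library with a complete <regex> implementation; the function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// True only if the whole of text matches pattern.
bool matches_whole(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex(pattern));
}

// True if any substring of text matches pattern.
bool matches_somewhere(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex(pattern));
}
```

With the text "<xml><param0>" and the pattern "xml", matches_somewhere is expected to return true and matches_whole false; the pattern ".*xml.*" makes matches_whole return true as well.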

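Take regex_search and give it a pattern that is just a single word, say any substring of the target string above, such as "xml". Trivial, obviously a match, and yet it returns false. Even with the target string set to exactly "xml" it still returns false. Was I holding it wrong? Where on earth was the problem?

So I pulled down the reference example from cplusplus.com and ran it, expecting that at least to work. Instead I got a regex_error (the pattern failed to compile). Even the reference example is wrong? I fell back on the bible, C++ Primer, which surely cannot be wrong, and got the same regex_error. My worldview was in ruins...

Still, there's always Google. It is blocked here, but I can get over the wall. I searched for "regex_search always return false", and Google delivered. Stack Overflow has similar questions: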

http://stackoverflow.com/questions/20027305/strange-results-when-using-c11-regexp-with-gcc-4-8-2-but-works-with-boost-reg
http://stackoverflow.com/questions/12279869/using-regex-search-from-the-c-regex-library
http://stackoverflow.com/questions/11628047/difference-between-regex-match-and-regex-search?lq=1

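That was the answer. You guessed it: C++11 was too new, and g++ 4.8 did not fully support it yet. People said VS2010 and VS2012 already did, so I booted into the other half of my dual-boot machine to sort this out on Windows.

So let's adapt the C regex code above to C++11 regex, as below. Note that gettimeofday is not available on Windows, so the timing uses the corresponding Windows calls. The constant LOOP_COUNT again sets the number of iterations. Constructing a std::regex object from the pattern string is the C++ counterpart of compiling the pattern in C regex, so there is no repeated compilation to worry about.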

/*
 * Program: 
 *   Tests C++11 regex performance
 * Platform
 *   windows8.1     VS2012
 * History:
 *   weizheng     2014.11.06    1.0
 */

#include <regex>
#include <string>
#include <windows.h>
#include <stdio.h>

const int LOOP_COUNT = 10000;

/************************************ main ************************************/

int main()
{
	std::regex pattern("ywfuncFlag.*rzgw/rzdk_rzzl.jsp");
	std::string target = "<xml><param0>{\"bod\":{\"autoLoad\":false,\"keys\":[\"YWFUNC\"],\"state\":4,\"supportDynamic\":true},\"compId\":\"\",\"flag\":0,\"realUrl\":\"\",\"remoteIp\":\"\",\"ywType\":\"2\",\"ywfunc\":\"39\",\"ywfuncFlag\":\"/FMISWeb/faces/financing/dhkmanage/rzgw/rzdk_rzzl.jsp?^IRZDKServiceBC/RZFS%3d00000007%26amp;^IRZDKServiceBC/BEGINDATE%3d2014-10-01%26amp;^IRZDKServiceBC/ENDDATE%3d2014-10-31%26amp;^IRZDKServiceBC/ISTJTH%3d8\",\"ywfuncName\":\"\",\"ywfuncPageId\":\"DYHKRZZL\",\"RZWAIT_ZLLIST\":{\"bod\":{\"autoLoad\":false,\"keys\":[\"GID\"],\"state\":4,\"supportDynamic\":true},\"listItemClass\":\"com.soft.grm.investfinancing.service.masterdata.model.bo.RzptZltsOldModel\",\"listType\":0,\"masterProperty\":[],\"notUpdateField\":[],\"tableName\":\"\",\"useColumnNameToXml\":true,\"metaList\":{\"GID\":{\"columnCaption\":\"GID\",\"columnIndex\":0,\"columnName\":\"GID\",\"length\":0,\"mapName\":\"GID\",\"nullAble\":true,\"scale\":-127,\"sqlType\":2},\"ZLTS\":{\"columnCaption\":\"ZLTS\",\"columnIndex\":2,\"columnName\":\"ZLTS\",\"length\":4,\"mapName\":\"ZLTS\",\"nullAble\":true,\"scale\":0,\"sqlType\":";

	/*
	 * record the start time
	 */
	LARGE_INTEGER lFreq, lStart, lEnd;
	QueryPerformanceFrequency(&lFreq);
	QueryPerformanceCounter(&lStart);

	int count = 0;
	for(int i = 0; i < LOOP_COUNT; i++)
	{
		if(std::regex_search(target, pattern))
		{
			count++;
		}
	}

	/*
	 * record the end time
	 */
	QueryPerformanceCounter(&lEnd);
	double time_used = (double)(lEnd.QuadPart - lStart.QuadPart) * 1000 / lFreq.QuadPart;

	printf("used:   %.2f ms\n", time_used);
	printf("matched %d times\n", count);

	return 0;
}

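The result is below (note that this run was on Windows 8.1 with VS2012, not on the Ubuntu machine used for the other two tests). It is hard to imagine anything slower; the tortoise would regret not racing this one. You may be thinking that C is really fast. If so, keep reading: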

used:   13754.17 ms
matched 10000 times
Press any key to continue. . .
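The timing code in this post is platform-specific: gettimeofday on Linux and QueryPerformanceCounter on Windows. A portable alternative, offered only as a sketch, is std::chrono::steady_clock from C++11. The helper timed_search below is hypothetical and was not used for any of the numbers in this post.

```cpp
#include <chrono>
#include <regex>
#include <string>

// Hypothetical helper: runs regex_search `loops` times and reports the
// elapsed wall-clock time in milliseconds through elapsed_ms.
// Returns how many of the searches matched.
int timed_search(const std::regex& pattern, const std::string& target,
                 int loops, double& elapsed_ms)
{
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();

    int count = 0;
    for (int i = 0; i < loops; ++i)
    {
        if (std::regex_search(target, pattern))
            ++count;
    }

    elapsed_ms = std::chrono::duration<double, std::milli>(clock::now() - start).count();
    return count;
}
```

Using the same harness on every platform removes one variable from the comparison, although the clock resolution still depends on the toolchain.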

3. Boost regex

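Now for Boost regex, which I strongly recommend. Boost is a proving ground for the C++ standard library, often called a "quasi-standard library", and is very well regarded. In my second year at university I wrote a system I was quite proud of; an older student then told me he could do the same thing in 10 lines with Boost, and I was stunned. After reading an introduction to Boost regex and a few examples, I found that its design is essentially the same as C++11's: changing the namespace of the regex calls from std to boost seems to be all it takes. That assumes Boost is installed and that you add the link option -lboost_regex when compiling.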
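Because the two interfaces line up so closely, the switch can be made with a namespace alias, roughly as follows. The macro USE_BOOST_REGEX and the function contains are assumptions for illustration; without the macro the code falls back to std::regex. The default pattern syntaxes of the two libraries are similar but not identical, so patterns more exotic than the ones in this post may behave differently.

```cpp
#include <string>

#ifdef USE_BOOST_REGEX   // hypothetical macro: build with -DUSE_BOOST_REGEX -lboost_regex
#include <boost/regex.hpp>
namespace rx = boost;
#else
#include <regex>
namespace rx = std;
#endif

// The same source compiles against either library through the rx alias.
bool contains(const std::string& target, const std::string& pattern)
{
    const rx::regex re(pattern);
    return rx::regex_search(target, re);
}
```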

OK, the test code is again simple and effective, but the result stunned me once more.

/*
 * Program: 
 *   Tests Boost regex performance
 * Platform
 *   Ubuntu14.04     g++-4.8.2
 * History:
 *   weizheng     2014.11.06    1.0
 */

#include <boost/regex.hpp>
#include <string>
#include <sys/time.h>
#include <cstdio>

const int LOOP_COUNT = 10000;

/************************************ main ************************************/

int main()
{
	boost::regex pattern("ywfuncFlag.*rzgw/rzdk_rzzl.jsp");
	std::string target = "<xml><param0>{\"bod\":{\"autoLoad\":false,\"keys\":[\"YWFUNC\"],\"state\":4,\"supportDynamic\":true},\"compId\":\"\",\"flag\":0,\"realUrl\":\"\",\"remoteIp\":\"\",\"ywType\":\"2\",\"ywfunc\":\"39\",\"ywfuncFlag\":\"/FMISWeb/faces/financing/dhkmanage/rzgw/rzdk_rzzl.jsp?^IRZDKServiceBC/RZFS%3d00000007%26amp;^IRZDKServiceBC/BEGINDATE%3d2014-10-01%26amp;^IRZDKServiceBC/ENDDATE%3d2014-10-31%26amp;^IRZDKServiceBC/ISTJTH%3d8\",\"ywfuncName\":\"\",\"ywfuncPageId\":\"DYHKRZZL\",\"RZWAIT_ZLLIST\":{\"bod\":{\"autoLoad\":false,\"keys\":[\"GID\"],\"state\":4,\"supportDynamic\":true},\"listItemClass\":\"com.soft.grm.investfinancing.service.masterdata.model.bo.RzptZltsOldModel\",\"listType\":0,\"masterProperty\":[],\"notUpdateField\":[],\"tableName\":\"\",\"useColumnNameToXml\":true,\"metaList\":{\"GID\":{\"columnCaption\":\"GID\",\"columnIndex\":0,\"columnName\":\"GID\",\"length\":0,\"mapName\":\"GID\",\"nullAble\":true,\"scale\":-127,\"sqlType\":2},\"ZLTS\":{\"columnCaption\":\"ZLTS\",\"columnIndex\":2,\"columnName\":\"ZLTS\",\"length\":4,\"mapName\":\"ZLTS\",\"nullAble\":true,\"scale\":0,\"sqlType\":";

	/*
	 * record the start time
	 */
	struct timeval tv_start, tv_end;
	gettimeofday(&tv_start, NULL);

	int count = 0;
	for(int i = 0; i < LOOP_COUNT; i++)
	{
		if(boost::regex_search(target, pattern))
		{
			count++;
		}
	}

	/*
	 * record the end time
	 */
	gettimeofday(&tv_end, NULL);
	unsigned long time_used = (tv_end.tv_sec * 1000000 + tv_end.tv_usec - (tv_start.tv_sec * 1000000 + tv_start.tv_usec))/1000;

	printf("used:   %lu ms\n", time_used);
	printf("matched %d times\n", count);

	return 0;
}

The result:

weizheng@weizheng-MS-7798:~/test$ ./boost_regex_main 
used:   11 ms
matched 10000 times

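That is almost 40 times faster than C (11 ms versus 404 ms), and that is with a pattern that does match the target string. In the non-matching case, Boost came out about 4000 times faster than C. I could hardly believe it, so here are the pattern and target strings used for that test. I'd like to know whether anyone else sees the same thing, or whether there is some other cause, for example something in the structure of the regular expression or of the target string. The test strings: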

	boost::regex pattern(".*CSZ.*XTCS.*CSMC.*NEWBB");
	std::string target ="<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?><SOAP-ENV:Envelope xmlns:SOAPSDK1=\"http://www.w3.org/2001/XMLSchema\" xmlns:SOAPSDK2=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:SOAPSDK3=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\"><SOAP-ENV:Body SOAP-ENV:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\"><SOAPSDK4:GetSystemDate xmlns:SOAPSDK4=\"http://svr.blf.common.fmis.ygsoft.com\"/></SOAP-ENV:Body></SOAP-ENV:Envelope>";
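One structural detail of this pattern that may be worth experimenting with: for a search, a leading ".*" is redundant, because regex_search already tries every starting position. The sketch below only checks that the two spellings give the same yes/no answer; it makes no claim about how any engine optimizes either form or about the timings reported here. It uses std::regex so that it builds without Boost, and same_search_result is a made-up name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the pattern with and without the leading ".*"
// gives the same regex_search answer for this target.
bool same_search_result(const std::string& target)
{
    const std::regex with_prefix(".*CSZ.*XTCS.*CSMC.*NEWBB");
    const std::regex without_prefix("CSZ.*XTCS.*CSMC.*NEWBB");
    return std::regex_search(target, with_prefix) ==
           std::regex_search(target, without_prefix);
}
```

Timing both spellings with each library, on both a matching and a non-matching target, would be a reasonable next experiment.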

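So in this comparison Boost regex looks extremely strong. Article 1, however, concluded that C regex is faster, which contradicts my results; I hope to see more measurements that clear this up. What I'm most curious about right now is what happens when the C regex in our project is replaced with Boost regex. We'll see.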
