Fixing Boost.Regex's Poor Support for Chinese

Latest recommended article published on 2021-06-21 12:46:57

Jinhill

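The problem:

Boost.Regex, Boost's implementation of regular expressions, is a commonly used pattern-matching tool in C++ development. While using it recently, however, I found that its support for Chinese is poor. When the pattern uses \w, matching can fail on strings that contain characters such as "数" or "节". A likely reason is that the narrow-character regex class tests \w against individual bytes, and in a multibyte encoding such as GBK each Chinese character occupies two bytes.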

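The solution:

Approach: convert the text to wide characters first, then match.
The following wide-character classes are needed:
1. wstring:
The STL counterpart of string for wide strings. Its member functions are the same as string's; the difference is that value_type is wchar_t. String literals that are assigned or appended to a wstring object must be prefixed with L to mark them as wide.
2. wregex:
The counterpart of regex for wide-character regular expressions. Functions such as regex_match() and regex_replace() work with it as well. The result of regex_match() must be stored in a wsmatch object.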
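Converting between narrow and wide strings:
1. The C runtime library (RTL) approach
// Convert a narrow string to a wide string
    setlocale( LC_CTYPE, "" ); // Important: without this call the conversion fails.
    int iWLen= mbstowcs( NULL, sToMatch.c_str(), sToMatch.length() ); // Length of the converted wide string (terminator not included).
    wchar_t *lpwsz= new wchar_t[iWLen+1];
    mbstowcs( lpwsz, sToMatch.c_str(), iWLen+1 ); // Convert. Allowing iWLen+1 wide characters ensures the terminator is written.
    wstring wsToMatch(lpwsz);
    delete []lpwsz;
// Convert a wide string back to a narrow string for output
    int iLen= wcstombs( NULL, wsm[1].str().c_str(), 0 ); // Length of the converted narrow string (terminator not included).
    char *lpsz= new char[iLen+1];
    wcstombs( lpsz, wsm[1].str().c_str(), iLen ); // Convert. Only iLen bytes are allowed, so no terminator is written.
    lpsz[iLen] = '\0'; // Terminate manually.
    string sToMatch(lpsz);
    delete []lpsz;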
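2. The Win32 SDK approach
// Convert a narrow string to a wide string
    int iWLen= MultiByteToWideChar( CP_ACP, 0, sToMatch.c_str(), (int)sToMatch.size(), NULL, 0 ); // Length of the converted wide string (terminator not included).
    wchar_t *lpwsz= new wchar_t [iWLen+1];
    MultiByteToWideChar( CP_ACP, 0, sToMatch.c_str(), (int)sToMatch.size(), lpwsz, iWLen ); // Actual conversion.
    lpwsz[iWLen] = L'\0'; // Terminate manually; release with delete []lpwsz after use.
// Convert a wide string back to a narrow string for output
    int iLen= WideCharToMultiByte( CP_ACP, 0, wsResult.c_str(), -1, NULL, 0, NULL, NULL ); // Length of the converted narrow string (terminator included).
    char *lpsz= new char[iLen];
    WideCharToMultiByte( CP_ACP, 0, wsResult.c_str(), -1, lpsz, iLen, NULL, NULL ); // Actual conversion, using the same code page as the length query.
    sResult.assign( lpsz, iLen-1 ); // Assign to the string object, dropping the terminator.
    delete []lpsz;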

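Example:

The program below shows that a \w match on a narrow string fails for certain characters, and that converting the string to a wide string is one way to work around the problem.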

#include <iostream>
using std::cout;
using std::endl;
#include <string>
using std::string;
using std::wstring;
#include <clocale>  // setlocale
#include <cstdlib>  // mbstowcs, wcstombs

#include <boost/regex.hpp>
using namespace boost;

// Narrow version: \w is applied to the bytes of the string.
void MatchWords(string sToMatch)
{
    regex rg("(\\w*)");
    smatch sm;
    regex_match( sToMatch, sm, rg );
    cout << "Match result: " << sm[1].str() << endl;
}

// Wide version: \w is applied to whole characters.
void MatchWords(wstring wsToMatch)
{
    wregex wrg(L"(\\w*)");
    wsmatch wsm;
    regex_match( wsToMatch, wsm, wrg );

    // Convert the captured wide string back to a narrow string for output.
    int iLen= wcstombs( NULL, wsm[1].str().c_str(), 0 );
    char *lpsz= new char[iLen+1];
    wcstombs( lpsz, wsm[1].str().c_str(), iLen );
    lpsz[iLen] = '\0';

    string sToMatch(lpsz);
    delete []lpsz;
    cout << "Match result: " << sToMatch << endl;
}

int main()
{
    string sToMatch("数超限");
    MatchWords( sToMatch );
    sToMatch = "节点数目超限";
    MatchWords( sToMatch );

    // Convert the second string to a wide string and match again.
    setlocale( LC_CTYPE, "" );
    int iWLen= mbstowcs( NULL, sToMatch.c_str(), sToMatch.length() );
    wchar_t *lpwsz= new wchar_t[iWLen+1];
    mbstowcs( lpwsz, sToMatch.c_str(), iWLen+1 ); // iWLen+1 so the terminator is written

    wstring wsToMatch(lpwsz);
    delete []lpwsz;
    MatchWords( wsToMatch );
    return 0;
}

Output after compiling and running:
    Match result: 数超限
    Match result:
    Match result: 节点数目超限
The first line shows that "数超限" matched. The second line shows that the narrow string "节点数目超限" matched nothing. Only after conversion to a wide string does the \w match on "节点数目超限" succeed.

------------------------------------------------------------

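Other references

C/C++ code

#include "stdafx.h"  // Visual C++ precompiled header; remove for other compilers
#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <locale>
using namespace std;
//using namespace boost;

// Inside [...] a "|" is a literal character, not alternation, so the class lists only the four characters.
boost::wregex expression(L"^\\s*我+\\s*[想爱恨扁]+\\s*你");

int main(int argc, char* argv[]) {
    locale loc("Chinese-simplified");  // Windows locale name; construction throws if it is unavailable
    wcout.imbue(loc);
    std::wstring in = L"我我我我爱爱爱爱爱你";
    static boost::wsmatch what;
    wcout << L"enter test string" << endl;
    //getline(wcin,in);
    // Pass the wstring itself: wsmatch holds wstring iterators, while in.c_str() (a const wchar_t*) would need wcmatch.
    if (boost::regex_match(in, what, expression))
    {
        for (size_t i = 0; i < what.size(); i++)
            wcout << L"str :" << what[i].str() << endl;
    }
    else {
        wcout << L"Error Input" << endl;
    }
    return 0;
}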

============ Modified as follows so the program runs under VC6.0 ==================

boost::wregex expression(L"^\\s*我+\\s*[想爱恨扁]+\\s*你");

  locale loc("Chinese-simplified");
  wcout.imbue(loc);
  std::wstring in = L"我我我我爱爱爱爱爱你";
  static boost::wsmatch what;

  if (boost::regex_match(in, what, expression))
  {
    for (size_t i = 0; i < what.size(); i++)
    {
      // string test = (LPCTSTR)what[i].str().c_str();
      std::wstring ws = what[i].str();
      int iLen= WideCharToMultiByte( CP_ACP, 0, ws.c_str(), -1, NULL, 0, NULL, NULL ); // Length of the converted narrow string (terminator included).
      char *lpsz= new char[iLen];
      WideCharToMultiByte( CP_ACP, 0, ws.c_str(), -1, lpsz, iLen, NULL, NULL ); // Same code page as the length query.
      // lpsz now holds the narrow string; use it here, then release it.
      delete []lpsz;
    }
  }
