Regular Expressions in C++ with Boost.Regex (4)

Searching

Matching and parsing a single string in its entirety does not address the important and ubiquitous use case of searching a string that contains a substring you want, but possibly a lot of other characters you don't.

As with matching, Boost.Regex lets you search a string for a regular expression in two ways. In the simplest case, you just want to know whether a given string contains a match for your regular expression. Example 3 is a trivial implementation of the grep program: it reads each line from a file and prints the line if it contains a substring that satisfies the regular expression.

#include <iostream>
#include <string>
#include <fstream>
#include <boost/regex.hpp>

using namespace std;
const int BUFSIZE = 10000;

int main(int argc, char** argv) {

   // Safety checks omitted...
   boost::regex re(argv[1]);   // pattern to search for
   string file(argv[2]);       // file to search
   char buf[BUFSIZE];

   ifstream in(file.c_str());

   // Loop until a read fails (end of file or error)
   while (in.getline(buf, BUFSIZE))
   {
      // True if any part of the line satisfies the pattern
      if (boost::regex_search(buf, re))
      {
         cout << buf << endl;
      }
   }
}

Example 3. Trivial grep

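You can see that you use regex_search in the same way as regex_match. The difference is that regex_match succeeds only if the entire string satisfies the expression, while regex_search succeeds if any substring does.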
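To experiment with the search step without a file on disk, the loop from Example 3 can be pulled into a function that reads from any input stream. This is only a sketch under a few assumptions: grep_lines is a hypothetical helper name, and I use std::regex (the standard library descendant of Boost.Regex) so that it compiles without Boost installed. Replacing std:: with boost:: and including <boost/regex.hpp> should give the Boost equivalent.

```cpp
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost.Regex or the standard library):
// returns every line of `in` that contains a match for `re`.
std::vector<std::string> grep_lines(std::istream& in, const std::regex& re) {
    std::vector<std::string> hits;
    std::string line;
    // std::getline returns the stream, which tests false once a read fails
    while (std::getline(in, line)) {
        // regex_search is true if any substring of the line matches
        if (std::regex_search(line, re)) {
            hits.push_back(line);
        }
    }
    return hits;
}
```

Passing a std::istringstream makes it easy to try a pattern against a few lines of text; passing a std::ifstream gives the behavior of Example 3.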

This comes in handy sometimes, but has limited appeal. More often, you will want to enumerate all substrings that match a given pattern. For example, maybe you are writing a web crawler and want to iterate over all anchor tags in a page. Craft a regular expression to grab anchor tags:

<a\s+href="([\-:\w\d\.\/]+)">

You don't want the whole line returned, though, as in the grep example above; you want the target URL. To do this, use the first parenthesized subexpression, which is the second element of match_results: matches[0] holds the whole match and matches[1] holds the URL. Example 4, a slightly modified version of Example 3, will do just that.

#include <iostream>
#include <string>
#include <fstream>
#include <boost/regex.hpp>

using namespace std;
const int BUFSIZE = 10000;

int main(int argc, char** argv) {

   // Safety checks omitted...
   boost::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
   string file(argv[1]);
   char buf[BUFSIZE];
   boost::smatch matches;   // match_results for std::string iterators
   string sbuf;
   string::const_iterator begin, end;
   ifstream in(file.c_str());

   // Loop until a read fails (end of file or error)
   while (in.getline(buf, BUFSIZE))
   {
      sbuf = buf;
      begin = sbuf.begin();
      end = sbuf.end();

      while (boost::regex_search(begin, end, matches, re))
      {
         // matches[1] is the captured URL
         string url(matches[1].first, matches[1].second);
         cout << "URL: " << url << endl;
         // Resume the search just past the captured URL
         begin = matches[1].second;
      }
   }
}

Example 4. Enumerating anchor tags

The hard-coded regular expression in Example 4 contains lots of backslashes. This is necessary because I am escaping certain characters twice: once for the compiler, and once for the regular expression engine.
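If your compiler supports C++11 raw string literals, you can avoid the double escaping. The sketch below writes the same pattern both ways; the variable names are mine, chosen only for illustration. Note the custom delimiter re in the raw literal: the pattern itself contains the sequence )" so a plain R"(...)" would end too early.

```cpp
#include <string>

// Escaped form: each backslash meant for the regular expression engine is
// doubled, and the embedded double quotes are escaped for the compiler.
const std::string escaped_pattern = "<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">";

// Raw string literal (C++11): everything between R"re( and )re" is taken
// verbatim, so the pattern reads exactly as the regex engine sees it.
const std::string raw_pattern = R"re(<a\s+href="([\-:\w\d\.\/]+)">)re";
```

Both strings hold identical characters, so either one can be passed to the regex constructor.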

Example 4 uses a different overload of regex_search than Example 3; this version takes two bidirectional iterator arguments that refer to the beginning and end of the range of characters to be searched. To access every matching substring, all I have to do is move begin past the text already consumed. I use matches[1].second, the end of the captured URL; matches[0].second, the end of the whole match, would work equally well.
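The same iterator-range loop can be tried on a string held in memory. This is a rough sketch, not a definitive implementation: extract_urls is a hypothetical helper, and I again assume std::regex in place of Boost.Regex so the code builds without Boost. Here I advance to matches[0].second, the end of the whole match.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the href target of every anchor tag in
// `html` that has the simple form matched by the pattern from Example 4.
std::vector<std::string> extract_urls(const std::string& html) {
    static const std::regex re(R"re(<a\s+href="([\-:\w\d\.\/]+)">)re");
    std::vector<std::string> urls;
    std::smatch matches;  // match_results for std::string iterators
    std::string::const_iterator begin = html.begin();

    // Each successful search starts where the previous match ended
    while (std::regex_search(begin, html.end(), matches, re)) {
        urls.push_back(matches[1].str());  // first capture group: the URL
        begin = matches[0].second;         // one past the end of the whole match
    }
    return urls;
}
```

Because html is a const reference, begin() and end() both return const_iterator, which is the iterator type smatch expects.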

This is not the only way to iterate over all occurrences of a pattern. If you prefer (or require) iterator semantics, use a regex_token_iterator, which is an iterator interface to the results from a regular expression search. In Example 4, you could just as easily have iterated over the results of the URL search:


   // Read the HTML file into the string s...
   boost::sregex_token_iterator p(s.begin(), s.end(), re, 0);
   boost::sregex_token_iterator end;
   int count = 0;   // number of matches seen

   for (; p != end; ++count, ++p)
   {
      string m(p->first, p->second);
      cout << m << endl;
   }

That's not all, though. The token iterator here passes a zero as the last argument to its constructor, which tells it to iterate over the substrings that match the whole regular expression (pass 1 instead to iterate over the first subexpression, here the URL). Change it to -1 and you get the opposite: iteration over the substrings that do not satisfy the expression. In other words, it tokenizes the string, where each delimiter is something that satisfies the regular expression. This is a cool feature, because it lets you tokenize a string of characters based on complex delimiters. In a web page, for example, you could break the document into sections by its headers, using header tags such as <h1>...</h1>, <h3>...</h3>, etc.
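Here is a small sketch of how the last constructor argument changes what the token iterator returns. The helper name tokens and the header pattern in the usage note are my own assumptions, and I use std::sregex_token_iterator so the code compiles without Boost; boost::sregex_token_iterator takes the same arguments.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the tokens produced by a regex token iterator.
//   submatch ==  0  -> each whole match
//   submatch ==  n  -> the n-th parenthesized subexpression of each match
//   submatch == -1  -> the text between matches (matches act as delimiters)
std::vector<std::string> tokens(const std::string& text,
                                const std::regex& re,
                                int submatch) {
    std::vector<std::string> result;
    std::sregex_token_iterator p(text.begin(), text.end(), re, submatch);
    std::sregex_token_iterator end;  // default-constructed: end of sequence
    for (; p != end; ++p) {
        result.push_back(p->str());
    }
    return result;
}
```

For example, with the pattern <h[1-6]>.*?</h[1-6]> and the text intro<h1>A</h1>body one<h3>B</h3>body two, a submatch of 0 yields the two header elements, while -1 yields the three sections around them.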

Stuff to Check Out

There is, of course, more to Boost.Regex than I've presented here, but this should give you a good idea of what you can do with regular expressions in C++. The documentation on the Boost.Regex page is comprehensive, and there are plenty of examples you can copy and experiment with. In addition to searching strings as I did above, you can:

  • Search and replace using different Perl and Sed-style formatting conventions.
  • Use POSIX basic and extended regular expression format.
  • Use Unicode strings and other non-standard string formats.
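To give a taste of the first item, here is a minimal search-and-replace sketch. It assumes std::regex_replace rather than boost::regex_replace so that it builds without Boost, and reorder_dates is a hypothetical helper of my own. Boost's default Perl-style format strings also understand $1, $2, and so on, so the Boost call should look much the same.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites every date of the form YYYY-MM-DD
// in `text` as DD/MM/YYYY.
std::string reorder_dates(const std::string& text) {
    static const std::regex iso_date(R"((\d{4})-(\d{2})-(\d{2}))");
    // In the format string, $1, $2 and $3 refer to the captured
    // year, month and day respectively.
    return std::regex_replace(text, iso_date, "$3/$2/$1");
}
```

Text that does not match the pattern is copied to the result unchanged.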

Above all, you should experiment with regular expression syntax. There are different ways to do the same thing, and it's fun to see how concise you can make an expression that does what you want. Once you're a pro at regular expressions, you will be surprised at how often you can use them to validate, search, or parse a string.

Conclusion

Boost.Regex is the library in the Boost project that implements a regular expression engine in C++. You can use it to match, search, or search and replace with regular expressions against a target string, instead of writing ugly and cumbersome string-parsing code. Boost.Regex has been accepted as part of the next C++ standard library, and you will see it appearing in implementations of TR1 (in the tr1 namespace) from standard library vendors very soon. Check out Boost.Regex to get a feel for how useful it is, and while you're at it, take a look at many of the other libraries in Boost--there's a lot of good stuff there.

Ryan Stephens is a software engineer, writer, and student living in Tempe, Arizona. He enjoys programming in virtually any language, especially C++.
