boost tokenizer

繁华都市的夜晚

Article link: https://blog.csdn.net/mmzsyx/article/details/8211480

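tokenizer:
The tokenizer library is a string-processing library dedicated to tokenization. It offers a simple way to break a string into a sequence of words (tokens).
#include <boost/tokenizer.hpp>
using namespace boost;
The tokenizer class is the core of the library. It presents the token sequence through a container-like interface.
Class synopsis: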
template<typename TokenizerFunc = char_delimiters_separator<char>,
         typename Iterator = std::string::const_iterator,
         typename Type = std::string>
class tokenizer
{
public:
    tokenizer(Iterator first, Iterator last, const TokenizerFunc& f = TokenizerFunc());
    template<typename Container>
    tokenizer(const Container& c, const TokenizerFunc& f = TokenizerFunc());
    void assign(Iterator first, Iterator last);
    void assign(Iterator first, Iterator last, const TokenizerFunc& f);
    template<typename Container> void assign(const Container& c);
    template<typename Container> void assign(const Container& c, const TokenizerFunc& f);
    iterator begin() const;
    iterator end() const;
};

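tokenizer takes three template type parameters:
【1】TokenizerFunc: the library's tokenizer function object. The default splits on whitespace and punctuation.
【2】Iterator: the iterator type of the character sequence.
【3】Type: the type used to store each token.
All three parameters have defaults, but in practice only the first two vary. The third can generally only be std::string or std::wstring, which is why it sits last in the template parameter list.
The tokenizer constructor accepts the string to tokenize, either as an iterator range or as a container that has begin() and end() member functions.
The assign() function sets a new string to tokenize, so that a tokenizer object can be reused.
tokenizer has an interface similar to a standard container. begin() starts the tokenization and returns an iterator to the first token; end() marks the end of the token sequence, where tokenization is finished.
Usage:
Using tokenizer is much like using the split iterator of string_algo, and it can be treated like a container: construct a tokenizer from the string to tokenize, obtain an iterator with begin(), and advance it repeatedly to walk through the tokens.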
Example:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>
using namespace boost;
using namespace std;
int main()
{
    string str("Link raise the master-sword.");
    tokenizer<> tok(str);  // tokenizer object with the default template arguments
    // Iterate over the tokens with a for loop, just like a container
    for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
        cout << "[" << *pos << "]";
    cout << endl;
    return 0;
}
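Note: by default tokenizer treats all whitespace and punctuation as delimiters, so the result contains only words. This differs from the meaning of string_algo's split algorithm, which splits a string on the delimiters you specify.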

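Tokenizer function objects:
As a token container, tokenizer itself is very simple to use. Its real power lies in the first template parameter, TokenizerFunc. TokenizerFunc is a function object that decides how the input is tokenized. It is also a concept: any function object with suitable operator() and reset() semantics can be used with tokenizer.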

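The tokenizer library provides four predefined tokenizer function objects:
【1】char_delimiters_separator: splits on punctuation. It is the default tokenizer function, but it has been declared deprecated and should be avoided where possible.
【2】char_separator: supports a set of characters as delimiters. Its default behavior is similar to char_delimiters_separator.
【3】escaped_list_separator: tokenizes CSV (comma-separated) text.
【4】offset_separator: tokenizes by offsets, which is useful for strings in flat-file formats.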

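char_separator:
char_separator splits on a set of characters, much like the split algorithm. Its constructor is declared as:
char_separator(const char* dropped_delims, const char* kept_delims = 0, empty_token_policy empty_tokens = drop_empty_tokens);
The constructor parameters mean the following:
【1】The first parameter, dropped_delims, is a set of delimiters whose characters never appear in the result.
【2】The second parameter, kept_delims, is also a set of delimiters, but each of these characters is kept in the result as a token of its own.
【3】The third parameter, empty_tokens, is similar to the eCompress parameter of the split algorithm and controls how consecutive delimiters are handled. With keep_empty_tokens, consecutive delimiters produce an empty token between them, which corresponds to token_compress_off in split. With drop_empty_tokens, empty tokens are not included in the result.
If the default constructor is used with no arguments, it behaves like char_separator(" ", punctuation, drop_empty_tokens): it splits on whitespace and punctuation, keeps the punctuation characters as tokens, and does not output empty tokens.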
Example:
#include <cstring>
#include <iostream>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>
using namespace boost;
using namespace std;
template<typename T>
void print(T &tok)
{
    // Iterate over the tokens with a for loop, just like a container.
    // BOOST_AUTO_TPL is the form of BOOST_AUTO for dependent types in templates.
    for (BOOST_AUTO_TPL(pos, tok.begin()); pos != tok.end(); ++pos)
        cout << "[" << *pos << "]";
    cout << endl;
}
int main()
{
    const char* str = "Link ;; the <master-sword>.zelda";
    char_separator<char> sep;  // a default-constructed char_separator
    // Construct the tokenizer from the character range and the char_separator
    tokenizer<char_separator<char>, const char*> tok(str, str + strlen(str), sep);
    print(tok);
    // Tokenize again: drop ' ', ';' and '-', keep '<' and '>' as tokens
    tok.assign(str, str + strlen(str), char_separator<char>(" ;-", "<>"));
    print(tok);
    // Tokenize again: drop all five delimiters and keep empty tokens
    tok.assign(str, str + strlen(str), char_separator<char>(" ;-<>", "", keep_empty_tokens));
    print(tok);
    return 0;
}
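This code changes the tokenizer template arguments slightly: the second argument becomes a character pointer type, which lets tokenizer process a C-style string. The constructor must then receive the begin and end positions of the string rather than just its start address, because a character array does not satisfy the container concept.
The first pass uses a default-constructed char_separator: it splits on whitespace and punctuation, keeps each punctuation character as a token of its own, and discards empty tokens. The second pass splits on the five characters in " ;-" and "<>", keeps < and > as tokens, and again discards empty tokens. The last pass also splits on " ;-<>", but none of the delimiters are kept as tokens, and empty tokens are retained.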

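escaped_list_separator:
escaped_list_separator is a tokenizer function dedicated to CSV (Comma-Separated Values) text. Its constructor is declared as:
escaped_list_separator(char e = '\\', char c = ',', char q = '\"');
The constructor arguments are usually left at their defaults. They mean the following:
【1】The first parameter, e, is the escape character used in the string. The default is the backslash \.
【2】The second parameter is the delimiter. The default is the comma.
【3】The third parameter is the quote character. The default is ".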
Example:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>
using namespace boost;
using namespace std;
template<typename T>
void print(T &tok)
{
    // Iterate over the tokens with a for loop, just like a container.
    // BOOST_AUTO_TPL is the form of BOOST_AUTO for dependent types in templates.
    for (BOOST_AUTO_TPL(pos, tok.begin()); pos != tok.end(); ++pos)
        cout << "[" << *pos << "]";
    cout << endl;
}
int main()
{
    string str = "id,100,name.Name;sex,\"mario\"";
    escaped_list_separator<char> sep;
    // "> >" with a space keeps the code valid for pre-C++11 compilers
    tokenizer<escaped_list_separator<char> > tok(str, sep);
    print(tok);
    return 0;
}

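offset_separator:
offset_separator differs from the previous two tokenizer functions: instead of searching for delimiters, it works with offsets. This is useful for text that uses fixed field widths rather than delimiters. Its constructor is declared as:
template<typename Iter>
offset_separator(Iter begin, Iter end, bool wrap_offsets = true, bool return_partial_last = true);
The constructor takes two iterators (array pointers also work), begin and end, which delimit the sequence of integer offsets used for tokenizing. Each element of the sequence is the width of one field.
The bool parameter wrap_offsets decides whether tokenizing continues from the first offset again once the offsets are used up. The bool parameter return_partial_last decides whether a final token shorter than its field width is returned. Both default to true.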
Example:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>
using namespace boost;
using namespace std;
template<typename T>
void print(T &tok)
{
    // Iterate over the tokens with a for loop, just like a container.
    // BOOST_AUTO_TPL is the form of BOOST_AUTO for dependent types in templates.
    for (BOOST_AUTO_TPL(pos, tok.begin()); pos != tok.end(); ++pos)
        cout << "[" << *pos << "]";
    cout << endl;
}
int main()
{
    string str = "2233344445";
    int offsets[] = {2, 3, 4};
    // Wrap around the offsets, but do not return a partial last token
    offset_separator sep(offsets, offsets + 3, true, false);
    tokenizer<offset_separator> tok(str, sep);
    print(tok);
    // Do not wrap: stop once the three offsets are used up
    tok.assign(str, offset_separator(offsets, offsets + 3, false));
    print(tok);
    str += "56667";
    // Longer input, wrapping again and dropping the partial last token
    tok.assign(str, offset_separator(offsets, offsets + 3, true, false));
    print(tok);
    return 0;
}

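Limitations of the tokenizer library:
【1】It only supports single-character delimiters. It cannot handle delimiters made of several characters, such as "||"; for those you have to define your own tokenizer function object or turn to other string libraries such as string_algo or regular expressions.
【2】Its support for wstring (Unicode) is not well thought out, and unlike string_algo it does not use std::locale(), which makes it inconvenient to use.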

For example, tokenizing a wstring with the string_algo library is simple:
wstring str(L"Link mario samus");
typedef split_iterator<wstring::iterator> string_split_iterator;
string_split_iterator p, endp;
for (p = make_split_iterator(str, first_finder(L" ", is_iequal())); p!= endp; ++p)
{wcout<<L"["<<*p<<L"]";}
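To use wstring, change the string type to wstring, mark the string literals as wide with the L prefix, and print with wcout. On the string_algo side, only a single template argument of split_iterator needs to change.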

With tokenizer, on top of the steps above, you also have to spell out all of its template arguments in full:
char_separator<wchar_t> sep(L" ");
tokenizer<char_separator<wchar_t>, wstring::const_iterator, wstring> tok(str, sep);
for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
{wcout<< L"["<< *pos << L"]";}
The same applies to the other tokenizer functions such as escaped_list_separator:
escaped_list_separator<wchar_t> sep;
tokenizer<escaped_list_separator<wchar_t>, wstring::const_iterator, wstring> tok(str, sep);

To address this, here is a wrapper class that partially solves the problem:
template<typename Func, typename string = std::string>
struct tokenizer_wrapper
{
//typedef typename Func::string_type string;
typedef tokenizer<Func, typename string::const_iterator, string> type;
};
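tokenizer_wrapper takes two template parameters: the first, Func, is the tokenizer function object, and the second is the string type being tokenized. Internally it uses the two to typedef the full tokenizer type, so the declarations above shrink to: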
tokenizer_wrapper<char_separator<wchar_t>, wstring>::type tok(str, sep);
tokenizer_wrapper<escaped_list_separator<wchar_t>, wstring>::type tok(str, sep);
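Note the commented-out line inside tokenizer_wrapper. Unfortunately, the tokenizer function objects in the library declare the string type only as a private internal typedef that outside code cannot use, and offset_separator provides no such typedef at all. Otherwise the wrapper could drop one template parameter, like this: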
template<typename Func>
struct tokenizer_wrapper
{
// Does not compile: string_type is a private typedef
typedef typename Func::string_type string;
typedef tokenizer<Func, typename string::const_iterator, string> type;
};
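The tokenizer library was not designed with these issues in mind, and they cannot be fixed without changing its source code. It is best not to modify it, though, because regular expressions and string_algo usually work better than tokenizer.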
