Using strtk to Tokenize Text

Let's assume we have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution is to break the problem down into a line-by-line tokenization problem. In this case we define a function object such as the following, which holds a reference to the container in which we plan on storing our tokens (words) and a predicate, and which inserts the tokens as strings into the container.

#include <deque>
#include <iostream>
#include <string>

#include "strtk.hpp"

template <typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   void operator() (const std::string& str)
   {
      strtk::split(predicate_,
                   str,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:

   Container& container_;
   const Predicate& predicate_;
};

Putting the whole thing together: the process opens the file and reads it line by line, invoking parse_line on each line, as follows:

template <typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n"
                                         "0123456789";

   strtk::multiple_char_delimiter_predicate predicate(delimiters);

   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";

   std::deque<std::string> word_list;

   parse_text(text_file_name,word_list);

   std::cout << "Token Count: " << word_list.size() << std::endl;

   return 0;
}

The definition of for_each_line is as follows:

   // the function parameter is typically a function object
   template <typename Function>
   inline std::size_t for_each_line(const std::string& file_name,
                                    Function function,
                                    const std::size_t& buffer_size = one_kilobyte)
   {
      std::ifstream stream(file_name.c_str());
      if (stream)
         return for_each_line(stream,function,buffer_size);
      else
         return 0;
   }

   template <typename Function>
   inline std::size_t for_each_line(std::istream& stream,
                                    Function function,
                                    const std::size_t& buffer_size = one_kilobyte)
   {
      std::string buffer;
      buffer.reserve(buffer_size);
      std::size_t line_count = 0;

      while (std::getline(stream,buffer))
      {
         function(buffer);
         ++line_count;
      }

      return line_count;
   }
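
For illustration, the istream overload can also be driven from an in-memory stream. The following is a small sketch (print_line and the sample text are illustrative additions, not part of the original article):

#include <iostream>
#include <sstream>
#include <string>

#include "strtk.hpp"

// Prints a single line; passed to for_each_line as the per-line function.
void print_line(const std::string& line)
{
   std::cout << line << std::endl;
}

int main()
{
   std::istringstream stream("first line\nsecond line\nthird line\n");

   // for_each_line returns the number of lines it processed.
   const std::size_t line_count = strtk::for_each_line(stream,print_line);

   std::cout << "Line Count: " << line_count << std::endl;

   return 0;
}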

The same requirement can be implemented with a C++11 lambda expression:

int main()
{
   std::string text_file_name = "text.txt";

   std::deque<std::string> word_list;

   strtk::for_each_line(text_file_name,
                        [&word_list](const std::string& line)
                        {
                           static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                                                 "`~!@#$%^&*|-_\"=+\t\r\n"
                                                                 "0123456789";

                           strtk::parse(line,delimiters,word_list);
                        });

   std::cout << "Token Count: " << word_list.size() << std::endl;

   return 0;
}

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a Stop Word List. In this example the stop-word list definition is as follows:

const std::string stop_word_list [] =
                  {
                     "as", "at", "but", "by", "for",
                     "in", "like", "next", "of", "on",
                     "opposite", "out", "past", "to",
                     "up", "via", ""
                  };

const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);

Some minor updates to the parse_line processor include using the filter_on_match predicate to determine whether the currently processed token is a preposition, and invoking the range-to-string back-inserter to convert tokens from their range-iterator representation into a type compatible with the user-defined container. For the new implementation to provide unique words of interest, the simplest change is to replace the deque used as the word_list container with a one-to-one associative container such as a std::set. The following is the improved version of the parse_line processor:

template <typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(" "),
     tokenizer_(tmp_,predicate_,true),
     filter_(stop_word_list,stop_word_list + stop_word_list_size,
             strtk::range_to_string_back_inserter_iterator<Container>(container_),
             true,false)
   {}

   void operator() (const std::string& s)
   {
      strtk::for_each_token(s,tokenizer_,filter_);
   }

private:

   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string::tokenizer<Predicate>::type tokenizer_;
   strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};
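
To tie things together, the improved parse_line can be driven the same way as before. The sketch below is illustrative rather than part of the original article: build_word_list is a hypothetical helper, and instead of assuming the range-to-string back-inserter supports associative containers directly, it collects the filtered words into a std::deque and then copies them into a std::set to obtain the unique lexicon.

#include <deque>
#include <iostream>
#include <set>
#include <string>

#include "strtk.hpp"

template <typename Container>
void build_word_list(const std::string& file_name, Container& c)
{
   // Same delimiter set as before: whitespace, punctuation and digits.
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n"
                                         "0123456789";

   strtk::multiple_char_delimiter_predicate predicate(delimiters);

   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::deque<std::string> word_list;

   build_word_list("text.txt",word_list);

   // Unique, stop-word filtered words form the lexicon.
   const std::set<std::string> lexicon(word_list.begin(),word_list.end());

   std::cout << "Lexicon Size: " << lexicon.size() << std::endl;

   return 0;
}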