Using strtk to Tokenize Text

Let's assume we have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution is to break the problem down into a line-by-line tokenization problem. In this case we define a function object such as the following, which holds a reference to the container in which we plan on storing our tokens (words) and a predicate, and which inserts the tokens as strings into the container.

#include <deque>
#include <iostream>
#include <string>

#include "strtk.hpp"

template <typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   void operator() (const std::string& str)
   {
      strtk::split(predicate_,
                   str,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:

   Container& container_;
   const Predicate& predicate_;
};

Putting the whole thing together: the process opens the file and reads it line by line, invoking parse_line on each line, as follows:

template <typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n"
                                         "0123456789";

   strtk::multiple_char_delimiter_predicate predicate(delimiters);

   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";

   std::deque<std::string> word_list;

   parse_text(text_file_name,word_list);

   std::cout << "Token Count: " << word_list.size() << std::endl;

   return 0;
}

The definition of for_each_line is as follows:

   // the function parameter is typically a function object
   template <typename Function>
   inline std::size_t for_each_line(const std::string& file_name,
                                    Function function,
                                    const std::size_t& buffer_size = one_kilobyte)
   {
      std::ifstream stream(file_name.c_str());
      if (stream)
         return for_each_line(stream,function,buffer_size);
      else
         return 0;
   }

   template <typename Function>
   inline std::size_t for_each_line(std::istream& stream,
                                    Function function,
                                    const std::size_t& buffer_size = one_kilobyte)
   {
      std::string buffer;
      buffer.reserve(buffer_size);
      std::size_t line_count = 0;

      while (std::getline(stream,buffer))
      {
         function(buffer);
         ++line_count;
      }

      return line_count;
   }
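
For illustration, the istream overload can also be driven from an in-memory stream. The following is a small sketch (print_line and the sample text are illustrative additions, not part of the original article):

#include <iostream>
#include <sstream>
#include <string>

#include "strtk.hpp"

// Prints a single line; passed to for_each_line as the per-line function.
void print_line(const std::string& line)
{
   std::cout << line << std::endl;
}

int main()
{
   std::istringstream stream("first line\nsecond line\nthird line\n");

   // for_each_line returns the number of lines it processed.
   const std::size_t line_count = strtk::for_each_line(stream,print_line);

   std::cout << "Line Count: " << line_count << std::endl;

   return 0;
}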

The same requirement can be implemented with a C++11 lambda expression:

int main()
{
   std::string text_file_name = "text.txt";

   std::deque<std::string> word_list;

   strtk::for_each_line(text_file_name,
                        [&word_list](const std::string& line)
                        {
                           static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                                                 "`~!@#$%^&*|-_\"=+\t\r\n"
                                                                 "0123456789";

                           strtk::parse(line,delimiters,word_list);
                        });

   std::cout << "Token Count: " << word_list.size() << std::endl;

   return 0;
}

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a Stop Word List. In this example the stop-word list definition is as follows:

const std::string stop_word_list [] =
                  {
                     "as", "at", "but", "by", "for",
                     "in", "like", "next", "of", "on",
                     "opposite", "out", "past", "to",
                     "up", "via", ""
                  };

const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);

Some minor updates to the parse_line processor include using the filter_on_match predicate to determine whether the currently processed token is a preposition, and invoking the range-to-string back-inserter to convert tokens from their range-iterator representation into a type compatible with the user-defined container. For the new implementation to provide unique words of interest, the simplest change is to replace the deque used as the word_list container with a one-to-one associative container such as a std::set. The following is the improved version of the parse_line processor:

template <typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(" "),
     tokenizer_(tmp_,predicate_,true),
     filter_(stop_word_list,stop_word_list + stop_word_list_size,
             strtk::range_to_string_back_inserter_iterator<Container>(container_),
             true,false)
   {}

   void operator() (const std::string& s)
   {
      strtk::for_each_token(s,tokenizer_,filter_);
   }

private:

   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string::tokenizer<Predicate>::type tokenizer_;
   strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};
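
To tie things together, the improved parse_line can be driven the same way as before. The sketch below is illustrative rather than part of the original article: build_word_list is a hypothetical helper, and instead of assuming the range-to-string back-inserter supports associative containers directly, it collects the filtered words into a std::deque and then copies them into a std::set to obtain the unique lexicon.

#include <deque>
#include <iostream>
#include <set>
#include <string>

#include "strtk.hpp"

template <typename Container>
void build_word_list(const std::string& file_name, Container& c)
{
   // Same delimiter set as before: whitespace, punctuation and digits.
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n"
                                         "0123456789";

   strtk::multiple_char_delimiter_predicate predicate(delimiters);

   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::deque<std::string> word_list;

   build_word_list("text.txt",word_list);

   // Unique, stop-word filtered words form the lexicon.
   const std::set<std::string> lexicon(word_list.begin(),word_list.end());

   std::cout << "Lexicon Size: " << lexicon.size() << std::endl;

   return 0;
}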