Accelerated C++: Extracting Strings with Library Algorithms

Kang_TJU

Category: C++ Learning

Article link: https://blog.csdn.net/Kang_TJU/article/details/53241749

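This post discusses string extraction. It gives three implementations: one without library algorithms, one with library algorithms, and an enhanced library-algorithm version.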

Problem

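Consider the following text and extract the US place names from it. Within each line, the names are separated by a comma followed by a space.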

//input.txt
Houston, San Antonio, Los Angeles
North Carolina, Oklahoma, Puerto Rico
Texas, Utah, Virgin Islands

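The approach is straightforward: process the file line by line. Within a line, first find the start of a name, which is the first alphabetic character, then find its end, which is the next ',' character (or the end of the line). Only the per-line parsing code is shown here; the complete program is given later in the post.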

Code

int process_line( const std::string& line, std::vector<std::string>& ret )
{
    if( line == "" )
        return -1;

    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while( b != e )
    {
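        // find the beginning of the next name: the first letter
        // (the cast avoids undefined behavior for negative char values)
        while( b != e && !std::isalpha( static_cast<unsigned char>(*b) ) ) ++b;
        if( b != e )
        {
            // find one past the end of the name: the next comma or end of line
            const_iter after = b;
            while( after != e && *after != ',' ) ++after;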

            // push the pattern to ret
            ret.push_back( std::string( b, after ) );

            b = after;
        }
    }

    return 0;
}

Next is the version implemented with library algorithms.

Code 1

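bool is_alpha( char c )
{
    // cast first: std::isalpha is undefined for negative char values
    return std::isalpha( static_cast<unsigned char>(c) ) != 0;
}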
bool is_comma( char c )
{
    return c == ',';
}
int process_line1( const std::string& line, std::vector<std::string>& ret )
{
    if( line == "" )
        return -1;

    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while( b != e )
    {
        // find the beginning of the next name
        b = std::find_if( b, e, is_alpha );
        if( b != e )
        {
            // find one past the end of the name
            const_iter after = std::find_if( b, e, is_comma );

            // push the pattern to ret
            ret.push_back( std::string(b, after) );

            b = after;
        }
    }
    return 0;
}

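Clearly, with library algorithms the implementation is much simpler. The algorithms hide the low-level iteration details, so only the logic itself needs attention.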
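As a further illustration of the same idea, here is a minimal sketch that writes the predicates inline as C++11 lambdas instead of separate functions. The function name `extract_names` and the return-by-value interface are assumptions made for this sketch; they are not part of the original code.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): same logic as
// process_line1, with the predicate written as a C++11 lambda.
std::vector<std::string> extract_names(const std::string& line)
{
    std::vector<std::string> names;
    auto b = line.begin();
    const auto e = line.end();

    while (b != e)
    {
        // Skip ahead to the first letter: the start of a name.
        b = std::find_if(b, e, [](char c) {
            return std::isalpha(static_cast<unsigned char>(c)) != 0;
        });
        if (b != e)
        {
            // The name runs up to the next comma (or the end of the line).
            auto after = std::find(b, e, ',');
            names.push_back(std::string(b, after));
            b = after;
        }
    }
    return names;
}
```

Calling `extract_names("Houston, San Antonio, Los Angeles")` should yield the three names Houston, San Antonio, and Los Angeles. Since the end of a name is a single fixed character, `std::find` is enough there and no predicate is needed.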

Going one step further, consider extracting strings from the following text:

//input1.txt
,.;:?!' Houston Rockets, .;?Los Angeles Lakers,
....San Antonio Spurs, Miami Heat,....;
Cleveland Cavaliers,.....

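The task is to extract the NBA team names from among these illegal characters. The same approach still applies, but there is a practical difference: the previous problem had only one illegal character, the comma, while this one has many. Writing a separate predicate function for each illegal character is not reasonable.
A workable solution is to put all the illegal characters into one string and, to test a character, search for it in that string. A set would in principle be a better container, since its lookups are more efficient. Note that the illegal set used below includes the space, so a multi-word name such as Houston Rockets is extracted as separate words. The implementation follows.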

Code 2

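// Despite its name, not_illegal returns true when c IS one of the illegal
// characters. It is defined first so that is_legal can call it.
bool not_illegal( char c )
{
    static const std::string illegal_str = ",.;:?!' ";
    return (std::find( illegal_str.begin(), illegal_str.end(), c ) != illegal_str.end() );
}
// is_legal returns true when c is not in the illegal set.
bool is_legal( char c )
{
    return !not_illegal(c);
}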
int process_line2( const std::string& line, std::vector< std::string >& ret )
{
    if( line == "" )
        return -1;
    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while( b != e )
    {
        // find the begin
        b = std::find_if( b, e, is_legal );
        if( b != e )
        {
            // find the end: the first illegal character after the token
            const_iter after = std::find_if( b, e, not_illegal );

            // push the pattern to the ret
            ret.push_back( std::string(b, after) );

            // continue scanning from the end of this token
            b = after;
        }
    }

    return 0;
}
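Because the space is in `illegal_str`, `process_line2` ends a token at every space, so "Houston Rockets" comes out as two entries. If whole team names are wanted, one possible variation is sketched below: a name starts at the first letter, ends at the next punctuation mark, and trailing spaces are trimmed. The names `is_letter`, `is_punctuation`, and `process_line_keep_spaces` are hypothetical and introduced only for this sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical predicates for this sketch (not part of the original post).
bool is_letter(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

bool is_punctuation(char c)
{
    // Same characters as illegal_str in the post, but without the space.
    static const std::string punctuation = ",.;:?!'";
    return punctuation.find(c) != std::string::npos;
}

// Extracts names that may contain inner spaces, e.g. "Houston Rockets".
int process_line_keep_spaces(const std::string& line, std::vector<std::string>& ret)
{
    if (line.empty())
        return -1;

    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while (b != e)
    {
        // A name starts at the first letter.
        b = std::find_if(b, e, is_letter);
        if (b != e)
        {
            // It ends at the next punctuation mark; spaces do not end it.
            const_iter after = std::find_if(b, e, is_punctuation);

            // Drop any spaces between the last word and the punctuation.
            const_iter last = after;
            while (last != b && *(last - 1) == ' ')
                --last;

            ret.push_back(std::string(b, last));
            b = after;
        }
    }
    return 0;
}
```

A limitation of this sketch is that the apostrophe still counts as punctuation, so a name containing one would be cut at that point.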

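Note the following line in the code above. I had never seen a good use for a static local variable before, and this is a good example. When I first wrote this, I used a global variable, simply to avoid constructing the string on every call. Logically, though, nothing else needs to share it, so it has no reason to be global. An ordinary local variable, on the other hand, would be constructed again on each call. A static local variable solves both problems: it is visible only inside the function and is constructed only once.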

static const std::string illegal_str = ",.;:?!' ";
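The earlier remark that a set would make lookups more efficient can be combined with the static local idea. The following is a minimal sketch, assuming a `std::set<char>` as the container; the function name `is_illegal_set` is hypothetical and not part of the original code.

```cpp
#include <set>
#include <string>

// Hypothetical set-based variant of the illegal-character check.
bool is_illegal_set(char c)
{
    static const std::string chars = ",.;:?!' ";
    // Built once, on the first call; later calls reuse the same set.
    static const std::set<char> illegal(chars.begin(), chars.end());
    return illegal.count(c) != 0;
}
```

With only eight characters, the difference between a linear scan of a string and a set lookup is probably negligible; the set becomes more attractive as the list of illegal characters grows.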

The complete code follows.

Code 3

/* Extract the names from the input file */
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <cctype>
#include <algorithm>

int process_line( const std::string& line, std::vector<std::string>& ret );

bool is_alpha( char c );
bool is_comma( char c );
int process_line1( const std::string& line, std::vector<std::string>& ret );

bool is_legal( char c );
bool not_illegal( char c );
int process_line2( const std::string& line, std::vector< std::string >& ret );

int main( void )
{
    std::ifstream fin;
    fin.open( "input1.txt" );
    if( !fin.is_open() )
    {
        std::cerr << "Can not open the file!" << std::endl;
        return -1;
    }

    std::string line;
    std::vector<std::string> ret;
    while( std::getline( fin, line) )
    {
        process_line1( line, ret );
    }
    fin.close();

    typedef std::vector<std::string>::const_iterator const_iter_vec;
    const_iter_vec b = ret.begin();
    const_iter_vec e = ret.end();

    while( b != e )
    {
        std::cout << *b << std::endl;
        ++b;
    }


    return 0;
}

int process_line( const std::string& line, std::vector<std::string>& ret )
{
    if( line == "" )
        return -1;

    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while( b != e )
    {
        // find the beginning of the next name: the first letter
        while( b != e && !std::isalpha( static_cast<unsigned char>(*b) ) ) ++b;
        if( b != e )
        {
            // find the after
            const_iter after = b;
            while( after != e && *after != ',' ) ++after;

            // push the pattern to ret
            ret.push_back( std::string( b, after ) );

            b = after;
        }
    }

    return 0;
}

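bool is_alpha( char c )
{
    // cast first: std::isalpha is undefined for negative char values
    return std::isalpha( static_cast<unsigned char>(c) ) != 0;
}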
bool is_comma( char c )
{
    return c == ',';
}
int process_line1( const std::string& line, std::vector<std::string>& ret )
{
    if( line == "" )
        return -1;

    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while( b != e )
    {
        // find the begin
        b = std::find_if( b, e, is_alpha );
        if( b != e )
        {
            // find the after
            const_iter after = std::find_if( b, e, is_comma );

            // push the pattern to ret
            ret.push_back( std::string(b, after) );

            b = after;
        }
    }
    return 0;
}

bool is_legal( char c )
{
    return !not_illegal(c);
}
// despite its name, returns true when c IS one of the illegal characters
bool not_illegal( char c )
{
    static const std::string illegal_str = ",.;:?!' ";
    return (std::find( illegal_str.begin(), illegal_str.end(), c ) != illegal_str.end() );
}
int process_line2( const std::string& line, std::vector< std::string >& ret )
{
    if( line == "" )
        return -1;
    typedef std::string::const_iterator const_iter;
    const_iter b = line.begin();
    const_iter e = line.end();

    while( b != e )
    {
        // find the begin
        b = std::find_if( b, e, is_legal );
        if( b != e )
        {
            // find the end
            const_iter after = std::find_if( b, e, not_illegal );

            // push the pattern to the ret
            ret.push_back( std::string(b, after) );

            // continue scanning from the end of this token
            b = after;
        }
    }

    return 0;
}

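One more note. The method above is already quite powerful. While reading Introduction to Programming with C++, I came across a getline-based way to extract the strings, which also works. However, it only suits text in which the strings are separated by a single non-space character, such as the comma in the text below.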

//input2.txt
Houston,San Antonio,Los Angeles
North Carolina,Oklahoma,Puerto Rico
Texas,Utah,Virgin Islands

Code 4

// requires #include <sstream> for std::stringstream
int process_line3( const std::string& line, std::vector< std::string >& ret )
{
    if( line == "" )
        return -1;
    std::stringstream ss(line);
    std::string token;
    while( std::getline( ss, token, ',' ) )
    {
        ret.push_back( token );
    }

    return 0;
}
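If the separator is a comma followed by spaces, as in the first input file, one possible adjustment is to skip leading whitespace before reading each token with the standard `std::ws` manipulator. This is only a sketch of that idea; the function name `process_line_ws` is hypothetical, and trailing whitespace inside a token is left untouched.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of process_line3: std::ws skips leading whitespace
// before each token, so ", " separators are handled as well as ",".
int process_line_ws(const std::string& line, std::vector<std::string>& ret)
{
    if (line.empty())
        return -1;

    std::istringstream ss(line);
    std::string token;
    // ss >> std::ws discards spaces, then getline reads up to the next comma.
    while (std::getline(ss >> std::ws, token, ','))
    {
        ret.push_back(token);
    }
    return 0;
}
```

For the line "Houston, San Antonio, Los Angeles" this should store Houston, San Antonio, and Los Angeles without the leading spaces that plain `std::getline` would keep.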
