Split a string

dyufei


The vast majority of questions about splitting strings are about tokenization: splitting a string into substrings containing only related characters, called, depending on context, tokens or fields.

But first, some basic concepts:

What is the difference between a token and a field?

What is the essential algorithm?

The reality is that there are so many ways to split strings it is overwhelming. Here are some, in no particular order, with examples:

                                                       Delimiter                                          Empty fields
Method                                Mode             char  string  function  quotable  offset  no case  elided  trailing
Boost String Algorithms: Split        all-at-once      Y     Y       Y         -         -       -        [4]     Y
Boost String Algorithms: Split Regex  all-at-once      [1]   Y       regex     [2]       -       [2]      [2]     ?
Boost Tokenizer                       iterated [3]     Y     Y       -         Y         Y       -        opt     Y
Trolltech Qt’s QString::split()       all-at-once      Y     Y       regex     [2]       -       Y        opt     Y
GNU String Utility Functions: Split   all-at-once      [1]   Y       -         -         -       -        -       Y
iostreams and getline()               iterated [3]     Y     -       -         -         -       -        Y       Y
string::find_first_of()               all-at-once [3]  -     Y       -         -         -       -        Y       Y
strtok()                              iterated [3]     [1]   Y       -         -         -       -        always  NO
Roll your own C tokenizer             iterated [3]     [1]   Y       -         -         -       -        opt     Y

(Mode is "iterated" or "all-at-once". A dash marks a cell that was left blank.)

[1] Searching for a single character is, of course, possible when searching for more than one is, but the function does not provide a direct way to do it.
[2] Via regex capabilities.
[3] While the primary method may be one or the other, it is easy enough to write a wrapper that provides the other behavior.
[4] Empty fields are not elided (as in ‘omitted’), but adjacent delimiters are (as in ‘combined’).

A simple Google search reveals many more for perusal.

What’s the difference between a token and a field?

Simply put, a token and a field are two different things, though you will often see them confused. Most of what is on this page concerns fields.

tokens

Tokens relate to lexing and parsing, where the text being decoded is an ordered list of tokens, not necessarily all of the same type.

A token is typically a string of characters that possess some common characteristic, such as being all alphabetic, or being a specific list of characters, like "->" and "<<". Tokens also have semantic value (or meaning) attached to them. (Fields do not.)

For example, in C++, the text string s = "Hello world!"; contains five tokens:

string	a type
s	an identifier
=	an operator
"Hello world!"	a constant string literal
;	a statement terminator

It is worth noting here that for the C++ example each token has different characteristics and whitespace is treated specially. This kind of tokenizing requires careful lexing; typically token-specific functions are used to determine whether or not a character belongs to the current token and to classify the token’s type.
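
To make "token-specific functions" a little more concrete, here is a minimal sketch of a lexer that copes with the example statement and little else. The Token struct, the three category names, is_word_char() and lex() are all invented for this illustration; a real C++ lexer is far more involved (escapes, multi-character operators, comments, and so on).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record: the token's text plus a rough classification.
struct Token
{
  std::string text;
  std::string kind;   // "word", "string literal", or "punctuation"
};

// Token-specific predicate: may c appear inside a word (keyword or identifier)?
inline bool is_word_char( char c )
{
  return std::isalnum( static_cast <unsigned char> ( c ) ) || c == '_';
}

std::vector <Token> lex( const std::string& s )
{
  std::vector <Token> tokens;
  std::size_t i = 0;
  while (i < s.size())
  {
    // Outside of string literals, whitespace only separates tokens
    if (std::isspace( static_cast <unsigned char> ( s[ i ] ) )) { ++i; continue; }

    std::size_t start = i;
    std::string kind;
    if (is_word_char( s[ i ] ))
    {
      while (i < s.size() && is_word_char( s[ i ] )) ++i;
      kind = "word";
    }
    else if (s[ i ] == '"')
    {
      ++i;                                          // opening quote
      while (i < s.size() && s[ i ] != '"') ++i;    // body (escapes are NOT handled)
      if (i < s.size()) ++i;                        // closing quote
      kind = "string literal";
    }
    else
    {
      ++i;                                          // any other single character
      kind = "punctuation";
    }
    tokens.push_back( Token{ s.substr( start, i - start ), kind } );
  }
  return tokens;
}
```

Calling lex() on the statement above yields five tokens; the space inside the string literal stays part of that token, while the spaces between tokens are discarded.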

fields

A field is typically a string of characters that do not include a set of special characters called delimiters. An ordered collection of fields, separated by delimiters, is called a record. A collection of records is called, variously: databases, tables, spreadsheets, etc.

Depending on your requirements specifications, it may be possible for some fields to include characters that would normally be considered delimiters; the characters are in some way encoded into the field to prevent them from being understood as delimiters.

For example, here is a record of six fields, delimited by commas and quoted by double-quote characters, where (unquoted) leading and trailing whitespace is ignored:

Nalleli Vaca Andrade, 12 Jun 1989,,"piña colada, long walks in the rain" ,ID 1589-73AYN,

The six fields are, in order:

1	Nalleli Vaca Andrade
2	12 Jun 1989
3
4	piña colada, long walks in the rain
5	ID 1589-73AYN
6

One of the most common questions about this kind of data structure concerns CSV files, specifically in relation to Microsoft Excel. The C++ I/O library actually makes handling this kind of thing relatively easy for simple data, but for more advanced handling, see the topic Parse CSV data?
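
As a rough sketch of the rules used for the example record (comma delimiters, double-quote quoting, unquoted leading and trailing whitespace ignored), something like the following would do. The function name split_record() is made up for this page, and the code is deliberately naive: it does not handle quote characters embedded inside a quoted field, which real CSV data may contain.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one record on commas, honoring double quotes.
std::vector <std::string> split_record( const std::string& record )
{
  std::vector <std::string> fields;
  std::string field;
  std::size_t keep = 0;     // length of 'field' that must survive trimming
  bool in_quotes = false;

  for (char c : record)
  {
    if (c == '"')
    {
      in_quotes = !in_quotes;             // the quotes themselves are not data
    }
    else if (in_quotes)
    {
      field += c;                         // delimiters and spaces are data here
      keep = field.size();
    }
    else if (c == ',')
    {
      field.resize( keep );               // drop unquoted trailing whitespace
      fields.push_back( field );
      field.clear();
      keep = 0;
    }
    else if (c == ' ' || c == '\t')
    {
      if (!field.empty()) field += c;     // skip leading, tentatively keep the rest
    }
    else
    {
      field += c;
      keep = field.size();
    }
  }
  field.resize( keep );
  fields.push_back( field );              // the last field has no delimiter after it
  return fields;
}
```

Fed the record above, this produces the six fields listed, including the empty third and sixth ones.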

What is the essential algorithm?

The split/tokenize algorithm involves very little:

You are given:

a string to split

a way of determining whether a character “is a token” or “is a delimiter”

You also need:

a “start index” into the string to remember where to start the next token

To get a token:

    index = start index
    while index < length( s )
    {
      if s[index] is a delimiter (or is not a token)
      {
        break out of the loop
      }
      index = index + 1
    }
    next start index = index + something (see comments below)
    result = substring( s, from: start index, to: index - 1 )

something

You must be careful when handling characters that are not part of the current token. For tokens, the last character we examined in the loop might actually be part of the next token. (Your something might be zero, but then you would have to guarantee never to return an empty token.) For fields, the last character is not part of any token. (Your something is at least one.) When you adjust your start index, make sure to account for these kinds of things.

Other things to account for are multi-character delimiters and elided delimiters.
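
For fields with single-character delimiters, the pseudocode translates almost line for line into C++. This is only a sketch: next_field() and all_fields() are names invented here, "something" is fixed at one, and neither multi-character nor elided delimiters are handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Extracts the field beginning at 'start' and moves 'start' past the
// delimiter that ended it. Returns false once nothing is left to extract.
bool next_field( const std::string& s, const std::string& delimiters,
                 std::size_t& start, std::string& result )
{
  if (start > s.size()) return false;

  std::size_t index = start;
  while (index < s.size())
  {
    if (delimiters.find( s[ index ] ) != std::string::npos) break;
    index = index + 1;
  }
  result = s.substr( start, index - start );
  start  = index + 1;     // "something" is one: skip the delimiter
  return true;
}

// Convenience wrapper that collects every field of s.
std::vector <std::string> all_fields( const std::string& s,
                                      const std::string& delimiters )
{
  std::vector <std::string> fields;
  std::size_t start = 0;
  std::string field;
  while (next_field( s, delimiters, start, field )) fields.push_back( field );
  return fields;
}
```

Because the start index is allowed to reach one past the end of the string, a trailing delimiter produces a final empty field: all_fields( "a,,b,", "," ) returns "a", "", "b" and "".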

The basic tokenize/split algorithm is illustrated with the strtok() FAQ. The only issue is that strtok() modifies the original string by sticking '\0's in it (which you should not do).

Other examples are found with the string::find_first_of() and roll your own C tokenizer examples below. (Be warned, though, that you should have a pretty solid understanding of the basic algorithm before you look at these examples, as they also provide the option to eliminate empty fields.)

Boost String Algorithms: Split

The Boost String Algorithms Library is a comprehensive library to do useful things to strings. Included is a nifty function called split(). Here are three simple examples of how to use it:

#include <boost/algorithm/string.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace std;
using namespace boost;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "a,b, c ,,e,f,";
  vector <string> fields;

  cout << "Original = \"" << s << "\"\n\n";

  cout << "Split on \',\' only\n";
  split( fields, s, is_any_of( "," ) );
  print( fields );

  cout << "Split on \" ,\"\n";
  split( fields, s, is_any_of( " ," ) );
  print( fields );

  cout << "Split on \" ,\" and compress adjacent delimiters\n";
  split( fields, s, is_any_of( " ," ), token_compress_on );
  print( fields );

  return 0;
}

Original = "a,b, c ,,e,f,"

Split on ',' only
"a"
"b"
" c "
""
"e"
"f"
""

Split on " ,"
"a"
"b"
""
"c"
""
""
"e"
"f"
""

Split on " ," and compress adjacent delimiters
"a"
"b"
"c"
"e"
"f"
""

Notice in that last example that the empty field at the end is still among the results? Multiple delimiters may be treated as if they were one delimiter (or ‘compressed’), which is different from removing empty fields.
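
If you really do want the empty fields gone, one option is to filter the results after splitting. The helper below is a sketch using only the standard library; remove_empty_fields() is a name made up here, not something Boost provides.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Removes every empty string from v, keeping the other fields in order.
void remove_empty_fields( std::vector <std::string> & v )
{
  v.erase( std::remove_if( v.begin(), v.end(),
                           []( const std::string& f ) { return f.empty(); } ),
           v.end() );
}
```

Applied to the last result above, it leaves just "a", "b", "c", "e" and "f".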

Boost String Algorithms: Split Regex

Again, the Boost String Algorithms Library – Regex Variants comes to the rescue. This version of split() is much more powerful than the one above... but it requires you to have compiled the Boost Regex library and to link it to your executable.

Here is an example of using it to split on a multi-character match:

#include <boost/regex.hpp>
#include <boost/algorithm/string/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace std;
using namespace boost;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "one->two->thirty-four";
  vector <string> fields;

  split_regex( fields, s, regex( "->" ) );
  print( fields );

  return 0;
}

"one"
"two"
"thirty-four"

Remember, we are finding delimiters by matching against a regular expression. And again, make sure to link with libboost_regex. (Need help Compiling and Linking?)
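
If linking against Boost.Regex is a problem and your compiler supports C++11, the standard <regex> header offers a comparable approach. To be clear, this is a different technique from split_regex(): the sketch below uses std::sregex_token_iterator, and split_on_regex() is a name invented for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the pieces of s that lie between matches of the given pattern.
std::vector <std::string> split_on_regex( const std::string& s,
                                          const std::string& pattern )
{
  std::regex delimiter( pattern );
  // Submatch index -1 selects the text between matches, not the matches
  std::sregex_token_iterator first( s.begin(), s.end(), delimiter, -1 );
  std::sregex_token_iterator last;
  return std::vector <std::string> ( first, last );
}
```

With the string from the example above, split_on_regex( s, "->" ) gives "one", "two" and "thirty-four". Be aware that this iterator does not report an empty field after a trailing delimiter, so it is not a drop-in replacement when trailing empty fields matter.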

Boost Tokenizer

The Boost.Tokenizer library is a small library designed to handle common tokenizing tasks. Unlike the split() functions, it is used with an iterator to work your way through the tokens in a string.

There are three general tokenizing methods that come with it:

Break on delimiter characters using char_separator.
It allows you to specify:
- delimiter characters
- delimiter characters to keep in the extracted field
- whether adjacent delimiter characters indicate an empty field or are treated as a single delimiter

Break on delimiter characters but allowing quoted fields using escaped_list_separator.
This method is nice if you are following the C/C++ CSV convention, but it unfortunately cannot handle the industry ‘standard’ Microsoft Excel CSV quoting conventions. (Nor can it be made to fully... a failure of the Boost Tokenizer library’s design, alas.)

Break based on position using offset_separator.
This method is unique: you can split a string based solely upon offsets (or, more accurately, field width counts) into the string. You get everything, though, so you must decide what to keep. Unfortunately, using iterators makes it a little more clumsy than just using std::string::substr() a few times...

I refer you to the Boost.Tokenizer site for full explanation and examples.

For properly dealing with Microsoft Excel CSV data, see the next FAQ.
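
For the fixed-width case mentioned under offset_separator, the "just use std::string::substr() a few times" alternative could look something like this. It is a sketch that does not use Boost at all, and split_by_widths() is a name made up for it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Cuts s into consecutive fields of the given widths. Fields that would
// run past the end of s come back shortened or empty.
std::vector <std::string> split_by_widths( const std::string& s,
                                           const std::vector <std::size_t> & widths )
{
  std::vector <std::string> fields;
  std::size_t pos = 0;
  for (std::size_t width : widths)
  {
    // substr() throws if the start position is past the end, so clamp it
    std::size_t start = (pos < s.size()) ? pos : s.size();
    fields.push_back( s.substr( start, width ) );
    pos += width;
  }
  return fields;
}
```

For example, split_by_widths( "12252001", { 2, 2, 4 } ) returns "12", "25" and "2001".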

Trolltech Qt’s QString

Remember that Qt (originally developed by Trolltech) is dual-licensed: it is free to use under its open-source licenses, but if your project cannot comply with their terms you are expected to buy a commercial license for the (very nicely designed) libraries.

Qt’s QString handles Unicode and regular expression parsing easily.

See QString::split() for full explanations and examples.

As nice as it is, however, the Qt Framework is not designed to handle the full Microsoft Excel CSV file format. Again, to properly handle Microsoft Excel CSV data, see the next FAQ.

GNU String Utility Functions: Split

GLib’s String Utility Functions also include useful routines:

g_strsplit()
Split a string on a delimiter string (which may be just a single character).

g_strsplit_set()
Split a string on any of the characters in the argument delimiters.

Here is a simple example using the first:

/* example.c */
#include <stdio.h>
#include <glib.h>

int main()
{
  const char* s = ",,three,,five,,";
  /* the delimiter is a string, not a char; 0 means no limit on the token count */
  char** fields = g_strsplit( s, ",", 0 );

  gint n = 0;
  /* the returned array is NULL-terminated, so test *field rather than field */
  for (char** field = fields; *field; ++field, ++n)
  {
    printf( "\"%s\"\n", *field );
  }
  printf( "%d tokens\n", n );

  g_strfreev( fields );
  fields = NULL;

  return 0;
}

""
""
"three"
""
"five"
""
""
7 tokens

Make sure to link with GLib when you compile! (If you are a beginner, this might be beyond your capacity on non-Linux systems. There do exist binary development packages for Windows, but GTK+ is not trivial to install... Good luck!)

iostreams and getline()

The simplest method of tokenizing strings in C++ is to use the standard iostream capabilities. The std::getline() function has a very rudimentary capacity to break strings up using a single delimiter character each time you call the function.

Because getline() is designed to ignore empty fields at the end of input, we must do something very brash, something that is often the Wrong Thing to do, and loop on EOF. In this case, though, it is actually the Right Thing to do, as we want the odd behavior and we are being very careful to get it.

The basic algorithm to print every field to the standard output is this:

string s = "string, to, split";
istringstream ss( s );
while (!ss.eof())
{
  string x;               // here's a nice, empty string
  getline( ss, x, ',' );  // try to read the next field into it
  cout << x << endl;      // print it out, even if we already hit EOF
}
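
To see what looping on EOF buys us, compare it with the usual idiom of testing the result of getline(). The two functions below are only a side-by-side sketch with invented names (split_testing_getline() and split_testing_eof()); the second one uses the same loop as the fragment above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// The usual idiom: stop as soon as getline() fails to extract anything.
std::vector <std::string> split_testing_getline( const std::string& s, char delimiter )
{
  std::vector <std::string> fields;
  std::istringstream ss( s );
  std::string field;
  while (std::getline( ss, field, delimiter ))
    fields.push_back( field );
  return fields;
}

// The loop used on this page: keep the field even if the read hit EOF.
std::vector <std::string> split_testing_eof( const std::string& s, char delimiter )
{
  std::vector <std::string> fields;
  std::istringstream ss( s );
  while (!ss.eof())
  {
    std::string field;
    std::getline( ss, field, delimiter );
    fields.push_back( field );
  }
  return fields;
}
```

For the input "a,b," the first version returns two fields, while the second returns three: "a", "b", and the empty field after the trailing comma.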

Let’s put that into a convenient function that splits a string into a container of your choice (such as a std::vector). Let’s also add the ability to elide (or omit) empty fields. Here is the complete code.

#include <sstream>
#include <string>

struct split
{
  enum empties_t { empties_ok, no_empties };
};

template <typename Container>
Container& split(
  Container&                                 result,
  const typename Container::value_type&      s,
  typename Container::value_type::value_type delimiter,
  split::empties_t                           empties = split::empties_ok )
{
  result.clear();
  std::istringstream ss( s );
  while (!ss.eof())
  {
    typename Container::value_type field;
    getline( ss, field, delimiter );
    if ((empties == split::no_empties) && field.empty()) continue;
    result.push_back( field );
  }
  return result;
}

(We could also add the ability to trim() leading and trailing whitespace from the field between the call to getline() and the check for empty fields, but we’ll leave that to you.) Here is an example demonstrating how to use it.

#include <iostream>
#include <vector>
using namespace std;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "One, two,, four , five,";

  vector <string> fields;
  split( fields, s, ',' );

  cout << "\"" << s << "\"\n\n";
  print( fields );
  cout << fields.size() << " fields.\n";

  return 0;
}

"One, two,, four , five,"

"One"
" two"
""
" four "
" five"
""

6 fields.
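
Several of those fields still carry their surrounding spaces. The trim() that was left as an exercise might be sketched as below; the default whitespace set is an assumption, so adjust it to whatever your data calls whitespace.

```cpp
#include <string>

// Returns a copy of s without leading and trailing whitespace characters.
std::string trim( const std::string& s,
                  const std::string& whitespace = " \t\r\n" )
{
  std::string::size_type first = s.find_first_not_of( whitespace );
  if (first == std::string::npos) return std::string();  // empty or all whitespace
  std::string::size_type last = s.find_last_not_of( whitespace );
  return s.substr( first, last - first + 1 );
}
```

With this, trim( " four " ) gives "four", and a field containing only spaces becomes empty. Whether trimming should happen before or after the empty-field check is a design decision that depends on whether you want all-blank fields to count as empty.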

By the way, how would you like to be able to deduce the return type and just say fields = split( s, ',' );? Read about it here.

If you are unsure how to actually use any of these functions in your own programs, make sure to read all about it at this other spot.

string::find_first_of()

The version of split() built using std::getline() was pretty slick, but we can actually do better, much better. We would also like to be able to match on any of a set of delimiters. That is much easier using the STL string find functions.

The basic algorithm is this:

string s = "string, to, split";
string delimiters = " ,";
size_t current;
size_t next = -1;
do
{
  current = next + 1;
  next = s.find_first_of( delimiters, current );
  cout << s.substr( current, next - current ) << endl;
}
while (next != string::npos);

Let’s do as we did above and put that into a function that fills a container of your choice, and add the ability to elide empty fields.

#include <cstddef>

struct split
{
  enum empties_t { empties_ok, no_empties };
};

template <typename Container>
Container& split(
  Container&                            result,
  const typename Container::value_type& s,
  const typename Container::value_type& delimiters,
  split::empties_t                      empties = split::empties_ok )
{
  result.clear();
  size_t current;
  size_t next = -1;
  do
  {
    if (empties == split::no_empties)
    {
      next = s.find_first_not_of( delimiters, next + 1 );
      if (next == Container::value_type::npos) break;
      next -= 1;
    }
    current = next + 1;
    next = s.find_first_of( delimiters, current );
    result.push_back( s.substr( current, next - current ) );
  }
  while (next != Container::value_type::npos);
  return result;
}

#include <iostream>
#include <string>
#include <vector>
using namespace std;

void print( vector <string> & v )
{
  for (size_t n = 0; n < v.size(); n++)
    cout << "\"" << v[ n ] << "\"\n";
  cout << endl;
}

int main()
{
  string s = "One, two,, four , five,";

  vector <string> fields;

  cout << "\"" << s << "\"\n\n";

  split( fields, s, "," );
  print( fields );
  cout << fields.size() << " fields.\n\n";

  split( fields, s, ",", split::no_empties );
  print( fields );
  cout << fields.size() << " fields.\n";

  return 0;
}

"One, two,, four , five,"

"One"
" two"
""
" four "
" five"
""

6 fields.

"One"
" two"
" four "
" five"

4 fields.

Remember to read up and learn how you can deduce the return type and call the function like this:

fields = split( s, ",", split::no_empties );

strtok()

This is the old C library function. There is an extensive overview in a later FAQ. Here is an example of using it.

/* This is C code */
#include <stdio.h>
#include <string.h>

int main()
{
  char s[] = "one, two,, four , five,"; /* mutable! */
  const char* p;

  for (p = strtok( s, "," ); p; p = strtok( NULL, "," ))
  {
    printf( "\"%s\"\n", p );
  }

  return 0;
}

"one"
" two"
" four "
" five"

Notice how strtok() is too stupid to treat adjacent delimiters as delimiting an empty field? And it misses that empty field at the end!

These problems are beyond fixing if you use strtok().
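
What you can avoid, at least, is letting strtok() scribble on data you want to keep: hand it a private copy. The wrapper below is a C++ sketch with an invented name (strtok_copy()); it still drops empty fields exactly as shown above, and it still relies on strtok()'s hidden internal state.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenizes a copy of s with std::strtok(), leaving s itself untouched.
std::vector <std::string> strtok_copy( const std::string& s, const char* delimiters )
{
  // strtok() needs a writable, null-terminated buffer
  std::vector <char> buffer( s.begin(), s.end() );
  buffer.push_back( '\0' );

  std::vector <std::string> tokens;
  for (char* p = std::strtok( buffer.data(), delimiters );
       p;
       p = std::strtok( nullptr, delimiters ))
  {
    tokens.push_back( p );
  }
  return tokens;
}
```

Calling strtok_copy( "one, two,, four , five,", "," ) returns the same four tokens as the C program above.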

Roll your own C tokenizer

You can always get better results by rolling your own tokenizer from the other functions available in <string.h>. Here’s a simple one you are free to use:

/* c_tokenizer.h */

#pragma once
#ifndef C_TOKENIZER_H
#define C_TOKENIZER_H

typedef struct
{
  char*       s;
  const char* delimiters;
  char*       current;
  char*       next;
  int         is_ignore_empties;
}
tokenizer_t;

enum { TOKENIZER_EMPTIES_OK, TOKENIZER_NO_EMPTIES };

tokenizer_t tokenizer( const char* s, const char* delimiters, int empties );
const char* free_tokenizer( tokenizer_t* tokenizer );
const char* tokenize( tokenizer_t* tokenizer );

#endif

/* c_tokenizer.c */

#include <stdlib.h>
#include <string.h>

#include "c_tokenizer.h"

#ifndef strdup
#define strdup sdup
static char* sdup( const char* s )
{
  size_t n = strlen( s ) + 1;
  char* p = malloc( n );
  return p ? memcpy( p, s, n ) : NULL;
}
#endif

tokenizer_t tokenizer( const char* s, const char* delimiters, int empties )
{
  char* strdup( const char* );

  tokenizer_t result;

  result.s                 = (s && delimiters) ? strdup( s ) : NULL;
  result.delimiters        = delimiters;
  result.current           = NULL;
  result.next              = result.s;
  result.is_ignore_empties = (empties != TOKENIZER_EMPTIES_OK);

  return result;
}

const char* free_tokenizer( tokenizer_t* tokenizer )
{
  free( tokenizer->s );
  return tokenizer->s = NULL;
}

const char* tokenize( tokenizer_t* tokenizer )
{
  if (!tokenizer->s) return NULL;

  if (!tokenizer->next)
    return free_tokenizer( tokenizer );

  tokenizer->current = tokenizer->next;
  tokenizer->next = strpbrk( tokenizer->current, tokenizer->delimiters );

  if (tokenizer->next)
  {
    *tokenizer->next = '\0';
    tokenizer->next += 1;

    if (tokenizer->is_ignore_empties)
    {
      tokenizer->next += strspn( tokenizer->next, tokenizer->delimiters );
      if (!(*tokenizer->current))
        return tokenize( tokenizer );
    }
  }
  else if (tokenizer->is_ignore_empties && !(*tokenizer->current))
    return free_tokenizer( tokenizer );

  return tokenizer->current;
}

And here is a simple example of how to use it.

#include <stdio.h>
#include "c_tokenizer.h"

int main()
{
  const char* s = ",,a,,b,,"; /* see notes with accompanying text below */

  tokenizer_t tok = tokenizer( s, ",", TOKENIZER_EMPTIES_OK );
  const char* token;
  unsigned n;

  n = 0;
  for (token = tokenize( &tok ); token; token = tokenize( &tok ))
  {
    printf( "\"%s\"\n", token );
    n += 1;
  }
  free_tokenizer( &tok );
  printf( "%u tokens\n", n );

  return 0;
}

""
""
"a"
""
"b"
""
""
7 tokens

Remember, if you need help using this stuff in your own programs, make sure to read all about it here.

The tokenizer here is smart enough to tokenize on more than one delimiter; it does not suffer state problems (including threading issues); it allows you to collapse adjacent delimiters (what strtok() does automatically) or to treat them as delimiting empty fields; and it works properly on all input. Try replacing the string to test with things like "" and NULL. Try tokenizing the examples using TOKENIZER_NO_EMPTIES too.

Source document <http://cplusplus.com/faq/sequences/strings/split/>