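Study Notes: Strings in Depth
This chapter examines the standard C++ string class. It begins with a brief look at what makes up a C++ string, then explains how C++ strings differ from traditional C character arrays. You will learn the operations available on string objects and see how elegantly the C++ string class handles different character sets and string data conversions.
1. What's in a string
In C, a string is essentially an array of characters whose last element is always a binary zero (usually called the null terminator).
C++ strings are markedly different from their C predecessors. First and most important, a C++ string hides the physical representation of the character sequence it contains. You do not need to be concerned about array dimensions or null terminators. A C++ string also holds "housekeeping" information about the size and storage location of its data. Specifically, a C++ string object knows its starting location in memory, its content, its length in characters, and the length in characters to which it can grow before it must resize its internal data buffer.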
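The housekeeping described above can be observed through the public interface. The following is a minimal sketch, not part of the original notes; the helper names bookkeepingHolds and appendGrowsBuffer are made up for illustration, and the exact capacity values depend on the library implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when the string's bookkeeping is
// self-consistent, i.e. length() equals size() and the capacity
// is at least as large as the current size.
inline bool bookkeepingHolds(const std::string& s) {
  return s.length() == s.size() && s.capacity() >= s.size();
}

// Hypothetical helper: appends n copies of c and reports whether
// the internal buffer had to grow to hold them.
inline bool appendGrowsBuffer(std::string& s, std::size_t n, char c) {
  std::size_t before = s.capacity();
  s.append(n, c);
  return s.capacity() > before;
}
```

Appending more characters than capacity() currently allows forces a reallocation, so appendGrowsBuffer(s, s.capacity() + 1, 'x') returns true for any string s.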
#ifndef STRINGSTORAGE_H
#define STRINGSTORAGE_H
#include <iostream>
#include <string>
#include "../TestSuite/Test.h"
using std::cout;
using std::endl;
using std::string;
class StringStorageTest : public TestSuite::Test {
public:
void run() {
string s1("12345");
// This may copy the first to the second or
// use reference counting to simulate a copy:
string s2 = s1;
test_(s1 == s2);
// Either way, this statement must ONLY modify s1:
s1[0] = '6';
cout << "s1 = " << s1 << endl; // 62345
cout << "s2 = " << s2 << endl; // 12345
test_(s1 != s2);
}
};
#endif // STRINGSTORAGE_H ///:~
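Creating separate copies only when a string is modified is an implementation strategy called copy-on-write. It saves time and space when strings are used only as value parameters or in other read-only situations. (The requirements that C++11 places on std::string effectively rule this strategy out, so with a modern library s2 is typically a real copy from the start. The observable behavior tested above is the same either way.)
2. Creating and initializing C++ strings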
#include <string>
#include <iostream>
using namespace std;
int main() {
string s1("What is the sound of one clam napping?");
string s2("Anything worth doing is worth overdoing.");
string s3("I saw Elvis in a UFO");
// Copy the first 8 chars:
string s4(s1, 0, 8);
cout << s4 << endl;
// Copy 6 chars from the middle of the source:
string s5(s2, 15, 6);
cout << s5 << endl;
// Copy from middle to end:
string s6(s3, 6, 15);
cout << s6 << endl;
// Copy many different things:
string quoteMe = s4 + "that" +
// substr() copies 10 chars at element 20
s1.substr(20, 10) + s5 +
// substr() copies up to either 100 char
// or eos starting at element 5
"with" + s3.substr(5, 100) +
// OK to copy a single char this way
s1.substr(37, 1);
cout << quoteMe << endl;
} ///:~
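The substr() member function of string takes a starting position as its first argument and the number of characters to select as its second.
You cannot initialize a C++ string with a single character, an ASCII code, or any other integer value. You can, however, initialize a string with a number of copies of a single character: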
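Two details of substr() are easy to miss: the count is clamped to the characters that actually remain, and a starting position beyond the end of the string throws std::out_of_range. The sketch below illustrates both; tailFrom is a hypothetical helper name chosen for this example.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the part of s that starts at pos,
// or an empty string when pos lies beyond the end of s.
inline std::string tailFrom(const std::string& s,
                            std::string::size_type pos) {
  try {
    // The count argument defaults to npos, meaning "to the end".
    return s.substr(pos);
  } catch(const std::out_of_range&) {
    // substr() throws when pos > s.size().
    return std::string();
  }
}
```

With s3 from the listing above, tailFrom(s3, 6) yields "Elvis in a UFO", the same result as s3.substr(6, 100), because the oversized count is clamped.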
#include <string>
#include <cassert>
using namespace std;
int main() {
// Error: no single char inits
//! string nothingDoing1('a');
// Error: no integer inits
//! string nothingDoing2(0x37);
// The following is legal:
string okay(5, 'a');
assert(okay == string("aaaaa"));
} ///:~
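The first argument is the number of copies of the second argument to place in the string. The second argument must be a single char, not a char array.
3. Operating on strings
The standard C char array tools have a second inherent pitfall: they all rely explicitly on the assumption that the character array includes a null terminator. If, through oversight or error, the null terminator is omitted or overwritten, the C char array functions will almost inevitably manipulate memory beyond the allocated space, sometimes with disastrous results.
3.1 Appending, inserting, and concatenating strings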
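By contrast, a C++ string tracks its own length and does not depend on a terminator to find its end. As a rough illustration (the helper names cLength and withEmbeddedNull are invented for this sketch), a string can even hold a '\0' in the middle, while a C routine such as std::strlen() stops at that character:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as a C routine sees it: counts up to the first '\0'.
inline std::size_t cLength(const std::string& s) {
  return std::strlen(s.c_str());
}

// Builds a 7-character string with a '\0' in the middle by
// passing an explicit character count to the constructor.
inline std::string withEmbeddedNull() {
  return std::string("abc\0def", 7);
}
```

For the string returned by withEmbeddedNull(), size() is 7 while cLength() reports 3.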
#include <string>
#include <iostream>
using namespace std;
int main() {
string bigNews("I saw Elvis in a UFO. ");
cout << bigNews << endl;
// How much data have we actually got?
cout << "Size = " << bigNews.size() << endl;
// How much can we store without reallocating?
cout << "Capacity = " << bigNews.capacity() << endl;
// Insert this string in bigNews immediately
// before bigNews[1]:
bigNews.insert(1, " thought I");
cout << bigNews << endl;
cout << "Size = " << bigNews.size() << endl;
cout << "Capacity = " << bigNews.capacity() << endl;
// Make sure that there will be this much space
bigNews.reserve(500);
// Add this to the end of the string:
bigNews.append("I've been working too hard.");
cout << bigNews << endl;
cout << "Size = " << bigNews.size() << endl;
cout << "Capacity = " << bigNews.capacity() << endl;
} ///:~
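size() returns the number of characters currently stored in the string and is identical to the length() member function. resize() changes that count: if the new size is smaller than the current one, the string is truncated; if it is larger, the string is padded at the end. By default the padding character is the null character char(), not a space, and an overload of resize() lets you specify a different fill character.
3.2 Replacing string characters
replace() has many overloaded versions. The simplest takes three arguments: the position in the string at which to start, the number of characters to remove from the original string, and the replacement string (which may contain a different number of characters than were removed). For example: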
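The listing above does not call resize(), so here is a small sketch of both overloads. The name padOrTruncate is hypothetical; it simply wraps resize() so the result can be inspected as a value.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper around resize(n): truncates when n is smaller
// than the current size, otherwise pads with char(), i.e. '\0'.
inline std::string padOrTruncate(std::string s, std::size_t n) {
  s.resize(n);
  return s;
}

// Hypothetical wrapper around resize(n, fill): same as above, but
// pads with the given fill character.
inline std::string padOrTruncate(std::string s, std::size_t n,
                                 char fill) {
  s.resize(n, fill);
  return s;
}
```

For example, padOrTruncate("abc", 5, '*') yields "abc**" and padOrTruncate("abcdef", 2) yields "ab".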
#include <cassert>
#include <string>
using namespace std;
int main() {
string s("A piece of text");
string tag("$tag$");
s.insert(8, tag + ' ');
assert(s == "A piece $tag$ of text");
int start = s.find(tag);
assert(start == 8);
assert(tag.size() == 5);
s.replace(start, tag.size(), "hello there");
assert(s == "A piece hello there of text");
} ///:~
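A string object looks like a container of characters: string::begin() yields the start of the range and string::end() yields the position one past its last character. The following example uses the std::replace() algorithm to replace every single character 'X' with 'Y':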
#include <algorithm>
#include <cassert>
#include <string>
using namespace std;
int main() {
string s("aaaXaaaXXaaXXXaXXXXaaa");
replace(s.begin(), s.end(), 'X', 'Y');
assert(s == "aaaYaaaYYaaYYYaYYYYaaa");
} ///:~
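3.3 Concatenation using nonmember overloaded operators
One of the most delightful discoveries awaiting a C programmer who is learning C++ string handling is how simply strings can be combined and appended using operator+ and operator+=. These operators make combining strings syntactically similar to adding numeric data: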
#include <string>
#include <cassert>
using namespace std;
int main() {
string s1("This ");
string s2("That ");
string s3("The other ");
// operator+ concatenates strings
s1 = s1 + s2;
assert(s1 == "This That ");
// Another way to concatenate strings
s1 += s3;
assert(s1 == "This That The other ");
// You can index the string on the right
s1 += s3 + s3[4] + "ooh lala";
assert(s1 == "This That The other The other oooh lala");
} ///:~
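4. Searching in strings
find(): searches a string for a specified single character or group of characters. If a match is found, it returns the starting position of the first match; if nothing matches, it returns npos.
4.1 Finding in reverse
If you need to search a string object from end to beginning (to find data in "last in / first out" order), you can use the string member function rfind():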
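A typical pattern is to call find() in a loop, resuming after each match until npos comes back. The following sketch counts non-overlapping occurrences of a word; countOccurrences is a hypothetical helper name, not something from the original notes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts non-overlapping occurrences of word
// in text by calling find() repeatedly until it returns npos.
inline std::size_t countOccurrences(const std::string& text,
                                    const std::string& word) {
  if(word.empty())
    return 0; // An empty word would match at every position
  std::size_t count = 0;
  std::string::size_type pos = text.find(word);
  while(pos != std::string::npos) {
    ++count;
    // Resume the search just past the match we found:
    pos = text.find(word, pos + word.size());
  }
  return count;
}
```

Comparing the result against string::npos, rather than against -1 or 0, is the reliable way to test for "not found".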
#ifndef RPARSE_H
#define RPARSE_H
#include <cstddef>
#include <string>
#include <vector>
#include "../TestSuite/Test.h"
using std::size_t;
using std::string;
using std::vector;
class RparseTest : public TestSuite::Test {
// To store the words:
vector<string> strings;
public:
void parseForData() {
// The ';' characters will be delimiters
string s("now.;sense;make;to;going;is;This");
// The last element of the string:
int last = s.size();
// The beginning of the current word:
size_t current = s.rfind(';');
// Walk backward through the string:
while(current != string::npos) {
// Push each word into the vector.
// Current is incremented before copying
// to avoid copying the delimiter:
++current;
strings.push_back(s.substr(current, last - current));
// Back over the delimiter we just found,
// and set last to the end of the next word:
current -= 2;
last = current + 1;
// Find the next delimiter:
current = s.rfind(';', current);
}
// Pick up the first word -- it's not
// preceded by a delimiter:
strings.push_back(s.substr(0, last));
}
void testData() {
// Test them in the new order:
test_(strings[0] == "This");
test_(strings[1] == "is");
test_(strings[2] == "going");
test_(strings[3] == "to");
test_(strings[4] == "make");
test_(strings[5] == "sense");
test_(strings[6] == "now.");
string sentence;
for(size_t i = 0; i < strings.size() - 1; i++)
sentence += strings[i] += " ";
// Manually put last word in to avoid an extra space:
sentence += strings[strings.size() - 1];
test_(sentence == "This is going to make sense now.");
}
void run() {
parseForData();
testData();
}
};
#endif // RPARSE_H ///:~
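A smaller, everyday use of rfind() is picking out a file extension, where the last '.' is the one that matters. This is only a sketch under simple assumptions: extensionOf is a hypothetical name, and the function treats both '/' and '\\' as directory separators.

```cpp
#include <string>

// Hypothetical helper: returns the text after the last '.' in path,
// or an empty string when the file name has no extension.
inline std::string extensionOf(const std::string& path) {
  std::string::size_type dot = path.rfind('.');
  if(dot == std::string::npos)
    return "";
  // Ignore a dot that belongs to a directory name:
  std::string::size_type slash = path.find_last_of("/\\");
  if(slash != std::string::npos && dot < slash)
    return "";
  return path.substr(dot + 1);
}
```

For "archive.tar.gz" this returns "gz", whereas a forward find('.') would have stopped at the first dot.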
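4.2 Finding first/last of a set of characters
The find_first_of() and find_last_of() member functions, together with their counterparts find_first_not_of() and find_last_not_of(), make small utilities easy to write, such as one that strips whitespace from both ends of a string. The trim() function below uses the not_of variants to locate the first and last non-whitespace characters. Note that it does not touch the original string but returns a new one: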
#ifndef TRIM_H
#define TRIM_H
#include <string>
#include <cstddef>
inline std::string trim(const std::string& s) {
if(s.length() == 0)
return s;
std::size_t beg = s.find_first_not_of(" \a\b\f\n\r\t\v");
std::size_t end = s.find_last_not_of(" \a\b\f\n\r\t\v");
if(beg == std::string::npos) // No non-spaces
return "";
return std::string(s, beg, end - beg + 1);
}
#endif // TRIM_H ///:~
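Since trim() only exercises the not_of variants, the following sketch shows find_first_of() working alongside find_first_not_of() to split a string on any character from a delimiter set. The function name splitOnAny is an assumption made for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits s into words separated by any of the
// characters in delims. Runs of delimiters produce no empty words.
inline std::vector<std::string> splitOnAny(const std::string& s,
                                           const std::string& delims) {
  std::vector<std::string> words;
  // Skip leading delimiters to find the start of the first word:
  std::string::size_type start = s.find_first_not_of(delims);
  while(start != std::string::npos) {
    // The word ends at the next delimiter, if there is one:
    std::string::size_type end = s.find_first_of(delims, start);
    if(end == std::string::npos) {
      words.push_back(s.substr(start)); // Last word runs to the end
      break;
    }
    words.push_back(s.substr(start, end - start));
    start = s.find_first_not_of(delims, end);
  }
  return words;
}
```

Calling splitOnAny("now.;sense, make", ";, ") produces the three words "now.", "sense", and "make".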
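4.3 Removing characters from strings
Removing characters with the erase() member function is simple and efficient. The function takes two arguments: where to start removing characters (default 0) and how many to remove (default string::npos). If you specify more characters than remain in the string, all the remaining characters are removed (so calling erase() with no arguments removes every character from the string).
Sometimes it is useful to take an HTML file and strip its tags and special characters so that you have something approximating the text that a web browser would display, only as plain text. The following example uses erase() to do the job: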
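The defaults described above can be tried out with a small wrapper. This is a sketch only; withoutRange is a hypothetical name, and it works on a copy so that the result is easy to compare.

```cpp
#include <string>

// Hypothetical wrapper: returns a copy of s with count characters
// removed starting at pos. The count defaults to npos, which means
// "everything from pos to the end".
inline std::string withoutRange(
    std::string s,
    std::string::size_type pos,
    std::string::size_type count = std::string::npos) {
  s.erase(pos, count); // An oversized count is clamped to what remains
  return s;
}
```

For example, withoutRange("A piece of text", 1, 6) yields "A of text", and withoutRange("abcdef", 3) yields "abc".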
#include <cassert>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include "ReplaceAll.h"
#include "../require.h"
using namespace std;
string& stripHTMLTags(string& s) {
static bool inTag = false;
bool done = false;
while(!done) {
if(inTag) {
// The previous line started an HTML tag
// but didn't finish. Must search for '>'.
size_t rightPos = s.find('>');
if(rightPos != string::npos) {
inTag = false;
s.erase(0, rightPos + 1);
}
else {
done = true;
s.erase();
}
}
else {
// Look for start of tag:
size_t leftPos = s.find('<');
if(leftPos != string::npos) {
// See if tag close is in this line:
size_t rightPos = s.find('>');
if(rightPos == string::npos) {
inTag = done = true;
s.erase(leftPos);
}
else
s.erase(leftPos, rightPos - leftPos + 1);
}
else
done = true;
}
}
// Remove all special HTML characters
replaceAll(s, "&lt;", "<");
replaceAll(s, "&gt;", ">");
replaceAll(s, "&amp;", "&");
replaceAll(s, "&nbsp;", " ");
// Etc...
return s;
}
int main(int argc, char* argv[]) {
requireArgs(argc, 1,
"usage: HTMLStripper InputFile");
ifstream in(argv[1]);
assure(in, argv[1]);
string s;
while(getline(in, s))
if(!stripHTMLTags(s).empty())
cout << s << endl;
} ///:~
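The listing above calls replaceAll() from ReplaceAll.h, which these notes do not reproduce. The following is a minimal sketch of what such a helper might look like, assuming the signature implied by the calls (a string to modify, the text to look for, and its replacement); the real header may differ.

```cpp
#include <cstddef>
#include <string>

// Assumed shape of the helper: replaces every occurrence of from
// with to inside context and returns context.
inline std::string& replaceAll(std::string& context,
                               const std::string& from,
                               const std::string& to) {
  if(from.empty())
    return context; // Nothing sensible to search for
  std::size_t lookHere = 0;
  std::size_t foundHere;
  while((foundHere = context.find(from, lookHere))
        != std::string::npos) {
    context.replace(foundHere, from.size(), to);
    // Continue after the inserted text so that a replacement which
    // contains from cannot be matched again:
    lookHere = foundHere + to.size();
  }
  return context;
}
```

Advancing lookHere past the replacement is what keeps a call such as replaceAll(s, "X", "XX") from looping forever.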
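4.4 Comparing strings
Comparing strings is inherently different from comparing numbers. Numbers have constant, universally meaningful values. To evaluate the relationship between the magnitudes of two strings, you must make a lexical comparison, in which the characters are compared one by one according to a collating sequence. Most often this is the ASCII collating sequence, which assigns the printable characters for English the sequential decimal numbers 32 through 126 (127 is the non-printing DEL control code).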
#ifndef COMPSTR_H
#define COMPSTR_H
#include <string>
#include "../TestSuite/Test.h"
using std::string;
class CompStrTest : public TestSuite::Test {
public:
void run() {
// Strings to compare
string s1("This");
string s2("That");
test_(s1 == s1);
test_(s1 != s2);
test_(s1 > s2);
test_(s1 >= s2);
test_(s1 >= s1);
test_(s2 < s1);
test_(s2 <= s1);
test_(s1 <= s1);
}
};
#endif // COMPSTR_H ///:~
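One consequence of comparing by character code is that every uppercase letter sorts before every lowercase letter in ASCII, which differs from dictionary order. The sketch below makes this visible; comesFirst is a hypothetical helper written for this example.

```cpp
#include <string>

// Hypothetical helper: returns whichever of the two strings sorts
// first under operator<, which compares character codes.
inline std::string comesFirst(const std::string& a,
                              const std::string& b) {
  return a < b ? a : b;
}
```

comesFirst("Zebra", "apple") yields "Zebra" because 'Z' (90) is less than 'a' (97), and comesFirst("app", "apple") yields "app" because a proper prefix sorts before the longer string.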
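The compare() member function offers far more sophisticated and precise comparison than the nonmember operator set. Its overloaded versions let you compare:
- Two complete strings
- Part of one string with all of another
- Subsets of two strings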
#include <cassert>
#include <string>
using namespace std;
int main() {
string first("This");
string second("That");
assert(first.compare(first) == 0);
assert(second.compare(second) == 0);
// Which is lexically greater?
assert(first.compare(second) > 0);
assert(second.compare(first) < 0);
first.swap(second);
assert(first.compare(second) < 0);
assert(second.compare(first) > 0);
} ///:~
#include <cassert>
#include <string>
using namespace std;
int main() {
string first("This is a day that will live in infamy");
string second("I don't believe that this is what "
"I signed up for");
// Compare "his is" in both strings:
assert(first.compare(1, 7, second, 22, 7) == 0);
// Compare "his is a" to "his is w":
assert(first.compare(1, 9, second, 22, 9) < 0);
} ///:~
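5. A string application
This section presents a program whose only job is to extract all the code listings so that a programmer can compile and inspect them by hand. You can use it to extract all the code in this book once the document has been saved as a text file.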
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
// Legacy non-standard C header for mkdir()
#if defined(__GNUC__) || defined(__MWERKS__)
#include <sys/stat.h>
#elif defined(__BORLANDC__) || defined(_MSC_VER) \
|| defined(__DMC__)
#include <direct.h>
#else
#error Compiler not supported
#endif
// Check to see if directory exists
// by attempting to open a new file
// for output within it.
bool exists(string fname) {
size_t len = fname.length();
if(fname[len-1] != '/' && fname[len-1] != '\\')
fname.append("/");
fname.append("000.tmp");
ofstream outf(fname.c_str());
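// Stream-to-bool conversion is explicit since C++11,
// so test the stream state directly:
bool existFlag = !outf.fail();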
if(outf) {
outf.close();
remove(fname.c_str());
}
return existFlag;
}
int main(int argc, char* argv[]) {
// See if input file name provided
if(argc == 1) {
cerr << "usage: extractCode file [dir]" << endl;
exit(EXIT_FAILURE);
}
// See if input file exists
ifstream inf(argv[1]);
if(!inf) {
cerr << "error opening file: " << argv[1] << endl;
exit(EXIT_FAILURE);
}
// Check for optional output directory
string root("./"); // current is default
if(argc == 3) {
// See if output directory exists
root = argv[2];
if(!exists(root)) {
cerr << "no such directory: " << root << endl;
exit(EXIT_FAILURE);
}
size_t rootLen = root.length();
if(root[rootLen-1] != '/' && root[rootLen-1] != '\\')
root.append("/");
}
// Read input file line by line
// checking for code delimiters
string line;
bool inCode = false;
bool printDelims = true;
ofstream outf;
while(getline(inf, line)) {
size_t findDelim = line.find("//" "/:~");
if(findDelim != string::npos) {
// Output last line and close file
if(!inCode) {
cerr << "Lines out of order" << endl;
exit(EXIT_FAILURE);
}
assert(outf);
if(printDelims)
outf << line << endl;
outf.close();
inCode = false;
printDelims = true;
} else {
findDelim = line.find("//" ":");
if(findDelim == 0) {
// Check for '!' directive
if(line[3] == '!') {
printDelims = false;
++findDelim; // To skip '!' for next search
}
// Extract subdirectory name, if any
size_t startOfSubdir =
line.find_first_not_of(" \t", findDelim+3);
findDelim = line.find(':', startOfSubdir);
if(findDelim == string::npos) {
cerr << "missing filename information\n" << endl;
exit(EXIT_FAILURE);
}
string subdir;
if(findDelim > startOfSubdir)
subdir = line.substr(startOfSubdir,
findDelim - startOfSubdir);
// Extract file name (better be one!)
size_t startOfFile = findDelim + 1;
size_t endOfFile =
line.find_first_of(" \t", startOfFile);
if(endOfFile == startOfFile) {
cerr << "missing filename" << endl;
exit(EXIT_FAILURE);
}
// We have all the pieces; build fullPath name
string fullPath(root);
if(subdir.length() > 0)
fullPath.append(subdir).append("/");
assert(fullPath[fullPath.length()-1] == '/');
if(!exists(fullPath))
#if defined(__GNUC__) || defined(__MWERKS__)
mkdir(fullPath.c_str(), 0); // Create subdir
#else
mkdir(fullPath.c_str()); // Create subdir
#endif
fullPath.append(line.substr(startOfFile,
endOfFile - startOfFile));
outf.open(fullPath.c_str());
if(!outf) {
cerr << "error opening " << fullPath
<< " for output" << endl;
exit(EXIT_FAILURE);
}
inCode = true;
cout << "Processing " << fullPath << endl;
if(printDelims)
outf << line << endl;
}
else if(inCode) {
assert(outf);
outf << line << endl; // Output middle code line
}
}
}
exit(EXIT_SUCCESS);
} ///:~
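6. Summary
C++ string objects offer advantages that their C counterparts can hardly match, which is a great convenience for developers. For the most part, the string class makes referring to strings through character pointers unnecessary. This eliminates at the root an entire class of software defects that arise from using uninitialized or incorrectly valued pointers.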