Fast file reading in C++

Latest recommended article published 2023-08-02 21:00:08

naruto2011sasuke

Category: C++ Learning

Article link: https://blog.csdn.net/naruto2011sasuke/article/details/24289061


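Problem description:

I have recently been writing a classification algorithm and need to compare its results against SVM. The SVM tool turned out to read files unbearably slowly, so I want to explore how fast file reading can get on Windows.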

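Input description:

1. The input may be several thousand small files of about 1 MB each

2. It may also be a single file of about 100 MB

3. The final data has tens of thousands of lines, each in the following format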

<label> <index>:<value> <index>:<value> <index>:<value> ...

+1 201:1 3148:1 3983:1 4882:1
-1 874:1 3652:1 3963:1 6179:1

Each line has roughly several thousand features. See the SVM dataset for details.
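The parsing code in this post fills an `Instance` (a `label` plus a list `fs`) with `Feathure` entries (`index` and `weight`), but their definitions are not shown. Below is a minimal sketch of what they might look like, with the field types assumed from how they are used. `makeSampleInstance` is a hypothetical helper that builds the first sample line above by hand.

```cpp
#include <vector>

// Hypothetical definitions; the post does not show the real ones.
struct Feathure {        // spelling follows the identifier used in the post
    int index;           // feature index (the part before ':')
    double weight;       // feature value (the part after ':')
};

struct Instance {
    int label = 0;               // +1 or -1
    std::vector<Feathure> fs;    // sparse list of non-zero features
};

// Builds the instance for the sample line "+1 201:1 3148:1 3983:1 4882:1".
inline Instance makeSampleInstance() {
    Instance in;
    in.label = 1;
    in.fs = { {201, 1.0}, {3148, 1.0}, {3983, 1.0}, {4882, 1.0} };
    return in;
}
```

Only the non-zero features are stored, so a line with a few thousand features costs a few thousand `push_back` calls regardless of how large the largest index is.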

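Analysis:

If the file were just space-separated numbers, even naive reading with ifstream would not be very slow; on Windows it easily reaches about 10 MB per second. Here, however, each line has to be parsed as it is read. Reading directly through C++ streams makes the speed drop sharply, which I attribute to ifstream's inefficient buffering (Test 1). Reading with fread or with asynchronous I/O should be faster.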
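As an illustration of the fread route mentioned above, here is a minimal sketch that loads a whole file into memory with a single `fread` call, so that parsing can then run over an in-memory buffer. The function name `readWholeFile` is hypothetical, and this is not the code measured in this post.

```cpp
#include <cstdio>
#include <string>

// Reads an entire file into `out` with one fread call.
// Returns false if the file cannot be opened or cannot be read completely.
inline bool readWholeFile(const char *path, std::string &out) {
    // Binary mode: no "\r\n" -> "\n" translation on Windows, so sizes match.
    std::FILE *fp = std::fopen(path, "rb");
    if (!fp) return false;
    bool ok = false;
    if (std::fseek(fp, 0, SEEK_END) == 0) {
        long size = std::ftell(fp);          // file size in bytes
        if (size >= 0 && std::fseek(fp, 0, SEEK_SET) == 0) {
            out.resize(static_cast<std::size_t>(size));
            std::size_t got = size > 0 ? std::fread(&out[0], 1, out.size(), fp) : 0;
            ok = (got == out.size());
        }
    }
    std::fclose(fp);
    return ok;
}
```

Because the file is opened in binary mode, lines written on Windows keep their `\r\n` endings and the parser has to skip the `\r`. `ftell` returns a `long`, which is 32 bits on Windows, so this sketch suits the 1 MB and 100 MB inputs described here but not files of 2 GB or more.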

Test 1: synchronous reading the C++ way

Reading 13 MB of data: 1800 lines with about 4,000 features per line:

1. Plain C++ getline, then parsing with istringstream:

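#include <sstream>
#include <string>

// Parses one "<label> <index>:<value> ..." line into `in`.
// Returns 0 on success, -1 if the label is missing or is not +1/-1.
int parserLine(const std::string &line, Instance &in) {
	std::istringstream is(line);
	int label = 0;
	// Reject the line if extraction fails OR the label is not +1/-1.
	if (!(is >> label) || (label != 1 && label != -1)) {
		arowlog::write("Parser Error");
		return -1;
	}
	in.label = label;
	int id = 0;
	char sep = 0;   // receives the ':' between index and value
	double val = 0;
	while (is >> id >> sep >> val) {
		Feathure fe;
		fe.index = id;
		fe.weight = val;
		in.fs.push_back(fe);
		//arowlog::testPrint("%d:%lf\n", id, val);
	}
	return 0;
}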

Total elapsed time:
[2014-04-22-10-19-20]Num:1184 |Total Read Time: 53.917999

2. Plain C++ getline, then hand-written parsing:

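#include <cstdlib>
#include <vector>

// Hand-written parser for one line of n characters (no trailing newline).
// Appends the parsed Instance to `instances` and returns the number of characters consumed.
// Assumes the label carries an explicit sign (+1 / -1) and feature values are non-negative.
int praserData(const char *str, int n, std::vector<Instance> &instances) {
	char tmp[15];        // characters of the number currently being read
	int t = 0;           // number of characters stored in tmp
	int pos = 0;
	bool is_v = false;   // true while reading the value after ':'
	Instance in;
	Feathure fe;
	while (pos < n) {
		if (str[pos] == '+') {
			in.label = 1;
			++pos;       // skip the '1' that follows the sign
		} else if (str[pos] == '-') {
			in.label = -1;
			++pos;       // skip the '1' that follows the sign
		} else if (str[pos] == ':') {
			tmp[t] = 0;
			fe.index = atoi(tmp);
			//arowlog::testPrint("%d\n", fe.index);
			t = 0;
			is_v = true;
		} else if ((str[pos] >= '0' && str[pos] <= '9') || str[pos] == '.') {
			// Leave room for the terminating 0 so tmp cannot overflow.
			if (t < (int)sizeof(tmp) - 1) {
				tmp[t++] = str[pos];
			}
		} else if (str[pos] == ' ' && is_v) {
			tmp[t] = 0;
			fe.weight = atof(tmp);
			//arowlog::testPrint("%lf\n", fe.weight);
			in.fs.push_back(fe);
			t = 0;
			is_v = false;
		}
		++pos;
	}
	// End of line: the last value has no trailing space, so flush it here.
	if (is_v) {
		tmp[t] = 0;
		fe.weight = atof(tmp);
		arowlog::testPrint("%lf\n", fe.weight);
		in.fs.push_back(fe);
	}
	instances.push_back(in);
	return n;   // every path returns a value
}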

Total elapsed time:

[2014-04-22-10-18-14]Num:1184 |Total Read Time: 24.702000