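A couple of days ago I wrote an information extractor for a 40 MB formatted text file. At first I used MFC's CFile to read the file into a char* buffer, then looped over every character to decide which information to extract. The idea was simple and clumsy, so I never expected it to be efficient. The actual run was still shocking: it had not finished after more than two hours. Looking at the code again, I saw that simple optimizations would not help, so I rewrote it in standard C++ using the fstream library with string as the buffer and a few string methods. The result was very good: it finished in a little over 30 seconds. I am posting the code here; comments from experts are welcome.
The text file is organized by line. The data on each line is separated by spaces, with 44 dimensions in total, each in key:value format. The task is to extract several of these dimensions (qid, 1, 13, ...) while also preserving the information at the beginning and end of each line.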
0 qid:1 1:11.512826 2:288.000000 3:195.000000 4:6.000000 5:2.000000 6:0.223063 7:0.000858 8:0.000677 9:0.000018 10:0.000826 11:0.001129 12:0.003413 13:10.546871 14:0.000772 15:-14.725600 16:0.017391 17:-12.865100 18:-9.148330 19:-8.108850 20:0.569967 21:-8.119490 22:-7.974970 23:-7.379150 24:1.563410 25:-7.253540 26:-7.252590 27:10.546905 28:2.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000035 33:0.000000 34:0.000000 35:0.000000 36:0.000249 37:0.143097 38:0.000607 39:0.004101 40:0.001772 41:0.001124 42:19.575701 43:12.622301 44:11.723003 #docid = 1032
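The line format above lends itself to per-line processing with std::getline and a few string operations. What follows is only a minimal sketch of that idea, not the code from this post: the names extractLine and extractStream are hypothetical, and it assumes the trailing information on each line starts at the first '#' and that every token without a ':' (such as the leading label) should be kept.

```cpp
#include <istream>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: keep the leading label, the requested key:value
// fields, and the trailing "#..." part of one line.
std::string extractLine(const std::string& line,
                        const std::set<std::string>& keys) {
    // Assumption: the trailing information starts at the first '#'.
    const std::string::size_type hash = line.find('#');
    const std::string body = line.substr(0, hash);
    const std::string tail =
        (hash == std::string::npos) ? "" : line.substr(hash);

    std::istringstream fields(body);
    std::string token, result;
    while (fields >> token) {
        const std::string::size_type colon = token.find(':');
        // Tokens without ':' are always kept; key:value tokens are kept
        // only when the key is in the requested set.
        const bool keep = (colon == std::string::npos) ||
                          keys.count(token.substr(0, colon)) > 0;
        if (keep) {
            if (!result.empty()) result += ' ';
            result += token;
        }
    }
    if (!tail.empty()) {
        if (!result.empty()) result += ' ';
        result += tail;
    }
    return result;
}

// Hypothetical driver: filter a whole stream line by line.
void extractStream(std::istream& in, std::ostream& out,
                   const std::set<std::string>& keys) {
    std::string line;
    while (std::getline(in, line)) {
        out << extractLine(line, keys) << '\n';
    }
}
```

With a std::ifstream opened on the input file and a std::ofstream opened on the output file, extractStream(in, out, keys) would filter the whole file in a single pass, one line at a time.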
First approach:
CExtractKeysDlg::CExtractKeysDlg(CWnd* pParent /*=NULL*/)
: CDialog(CExtractKeysDlg::IDD, pParent)
, openfileedit(_T( "" ))
, savefileedit(_T( "" ))
, saveDim(_T( "" ))
{
m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);
openfileedit="d:/d.txt";
savefileedit="d:/e.txt";
saveDim="qid,1,13,27,28,39,40,41,42,43,44";
//UpdateData(false);
}
void CExtractKeysDlg::OnBnClickedExtract()
{
// TODO: add your control notification handler code here
const int readlen=1024*64; // read the file in 64 KB chunks
char buf[readlen],wbuf[readlen];
int truelen(0),wlen(0);
int wflag(1); // flag recording how many writes are needed
UpdateData(true);
CFile cfopen(this->openfileedit,CFile::modeRead);
CFile cfsave(this->savefileedit,CFile::modeCreate|CFile::modeWrite);
do{
truelen=cfopen.Read(buf,readlen);
int i(0),wbegin(0