Search Engine Storage Design (the Google Approach)

yoki2009

Category: Search Engines. Tags: google, storage, search engine, compression, document, iterator

Article link: https://blog.csdn.net/yoki2009/article/details/4297114

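In its early days, Google calculated that downloading 24,000,000 web pages took 147 GB in total. Today thousands upon thousands of pages are updated every day, so when Google's spider downloads pages to local servers it has to store them compressed. Google stores the downloaded pages with zlib, which gives a compression ratio of about 3:1; level 6 can be used to balance compression ratio against speed. Documents are stored one after another in contiguous space, and each record in the repository has the form docID, length, URL, content, where docID is the ID of the compressed page.

For convenience I use plain text files here. While scanning a directory to collect all the files, we create an index record in the database for each one and compress each file with zlib, naming the compressed file after its docID. My directory-scanning code can be used for this step. Later, during file analysis, each compressed file is opened and processed in turn.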
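
The record layout described above (docID, length, URL, content) can be sketched roughly as follows. This is only an illustration, not Google's actual on-disk format: the names DocRecord, appendRecord and readRecord, the 32-bit little-endian fields, and the separate URL length field are all assumptions made for this example. The content field would hold the zlib-compressed page; compression is left out so that the sketch needs only the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type: one stored document.
struct DocRecord {
    std::uint32_t docID;
    std::string url;
    std::string content;  // would hold the zlib-compressed page bytes
};

// Append a 32-bit value in little-endian byte order.
static void putU32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
}

// Read a 32-bit little-endian value starting at pos.
static std::uint32_t getU32(const std::vector<unsigned char>& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    return v;
}

// Assumed layout: docID | urlLen | contentLen | url bytes | content bytes
void appendRecord(std::vector<unsigned char>& store, const DocRecord& r) {
    assert(r.url.size() <= 0xFFFFFFFFu && r.content.size() <= 0xFFFFFFFFu);
    putU32(store, r.docID);
    putU32(store, static_cast<std::uint32_t>(r.url.size()));
    putU32(store, static_cast<std::uint32_t>(r.content.size()));
    store.insert(store.end(), r.url.begin(), r.url.end());
    store.insert(store.end(), r.content.begin(), r.content.end());
}

// Read the record that starts at pos and advance pos to the next record.
// Returns false (leaving pos unchanged) if no complete record is left.
bool readRecord(const std::vector<unsigned char>& store, std::size_t& pos, DocRecord& r) {
    if (pos > store.size() || store.size() - pos < 12) return false;
    std::uint32_t urlLen = getU32(store, pos + 4);
    std::uint32_t contentLen = getU32(store, pos + 8);
    std::size_t body = pos + 12;
    if (store.size() - body < static_cast<std::uint64_t>(urlLen) + contentLen) return false;
    r.docID = getU32(store, pos);
    r.url.assign(store.begin() + body, store.begin() + body + urlLen);
    r.content.assign(store.begin() + body + urlLen,
                     store.begin() + body + urlLen + contentLen);
    pos = body + urlLen + contentLen;
    return true;
}
```

Because every record carries its own lengths, a reader can walk the store from the start by calling readRecord repeatedly until it returns false, without any separate index.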

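For example:
Suppose a web page contains two links, <a href="a.html"> and <a href="b.html">. Parsing it yields two documents, a.html and b.html, and each is assigned its own docID. If the source document's docID is 1, then a.html gets docID 2, b.html gets docID 3, and so on. An indexer is then built to index the docIDs.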
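
The numbering in this example could be sketched as below. DocIdAllocator is a hypothetical helper, not part of the program in this article, and it makes one extra assumption: a URL that has been seen before keeps its existing docID instead of receiving a new one.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: hands out sequential docIDs, one per distinct URL.
class DocIdAllocator {
public:
    // lastUsed is the highest docID already assigned (0 when starting fresh).
    explicit DocIdAllocator(int lastUsed = 0) : last_(lastUsed) {}

    // Return the docID already assigned to url, or assign the next free one.
    int idFor(const std::string& url) {
        assert(!url.empty());
        std::map<std::string, int>::iterator it = ids_.find(url);
        if (it != ids_.end()) return it->second;
        ++last_;
        ids_[url] = last_;
        return last_;
    }

private:
    int last_;
    std::map<std::string, int> ids_;
};
```

With a fresh allocator, asking for the source page first and then for a.html and b.html gives the IDs 1, 2 and 3 from the example.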

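I use an INI file here to map each DocID to its real file name; the file is called docid.pair. A second file stores the current DocID counter. DocIDs only ever grow, increasing by one with each new document.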
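
A portable sketch of such a counter file is shown here. The function name nextDocID is an assumption made for illustration, and the sketch stores the last ID handed out rather than following the exact read and write sequence of the program further down.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read the last assigned docID from counterPath (0 if the file is missing
// or does not start with a number), add one, write the new value back,
// and return it.
int nextDocID(const std::string& counterPath) {
    assert(!counterPath.empty());
    int last = 0;
    {
        std::ifstream in(counterPath.c_str());
        if (!(in >> last)) last = 0;
    }  // the input stream is closed here, before the file is rewritten
    int next = last + 1;
    std::ofstream out(counterPath.c_str(), std::ios::out | std::ios::trunc);
    out << next;
    return next;
}
```

Each call persists the counter immediately, so an interrupted run does not hand out the same DocID twice on restart.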

The DocID-to-file mapping has the following structure:
[maptable]
96=c:/vs/docid.pair
97=c:/vs/output2.txt
98=c:/vs/sybdir/wm.log
99=c:/vs/sybdir/_error_.txt
100=c:/vs/wm.log
101=c:/vs/_error_.txt
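
Reading this mapping back does not have to go through the Windows profile API. The following is a small portable sketch that parses the [maptable] section into a std::map; the names parseMapTable and exampleMapTableUsage are hypothetical, and the parser handles only the simple key=value form shown above.

```cpp
#include <cassert>
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parse the "[maptable]" section of INI-style text into docID -> file path.
// Lines outside that section and lines without a key before '=' are ignored.
std::map<int, std::string> parseMapTable(std::istream& in) {
    std::map<int, std::string> table;
    std::string line;
    bool inSection = false;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line[line.size() - 1] == '\r') line.erase(line.size() - 1);
        if (line.empty()) continue;
        if (line[0] == '[') {
            inSection = (line == "[maptable]");
            continue;
        }
        if (!inSection) continue;
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        int docID = std::atoi(line.substr(0, eq).c_str());
        table[docID] = line.substr(eq + 1);
    }
    return table;
}

// Usage sketch with two entries from the table above.
void exampleMapTableUsage() {
    std::istringstream in("[maptable]\n96=c:/vs/docid.pair\n97=c:/vs/output2.txt\n");
    std::map<int, std::string> table = parseMapTable(in);
    assert(table[96] == "c:/vs/docid.pair");
}
```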

// two.cpp : Defines the entry point for the console application.
//
//All rights reserved by yoki2009
//mailto:imj040144@tom.com
//Welcome to my blog: http://blog.csdn.net/yoki2009

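#include "stdafx.h"
#include "two.h"
#include <time.h>
#include <stdio.h>
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include "zlib.h"         // Z_OK, Z_DEFAULT_COMPRESSION
#include "DirSearcher.h"
// def() and zerr() are assumed to be declared in two.h; they carry the same
// names as the helpers in zlib's zpipe.c example.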

using namespace std;

#ifdef _DEBUG
#define new DEBUG_NEW
#endif

int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
    time_t start = time(NULL);

    // Read the last assigned docID (stays 0 if docID.dat does not exist yet),
    // then move on to the first free ID.
    int docID = 0;
    ifstream indocID("docID.dat");
    indocID >> docID;
    indocID.close();
    docID++;

    // Save the current directory; the directory search may change it.
    char currDir[MAX_PATH];
    GetCurrentDirectory(MAX_PATH, currDir);

    DirSearcher* pDirSearch = DirSearcher::getInstance();
    pDirSearch->setDirPath("c:/vs");
    pDirSearch->DoDirSearch();

    // Restore the original directory.
    SetCurrentDirectory(currDir);

    vector<CString>::iterator pos;
    for (pos = pDirSearch->_filepath.begin(); pos != pDirSearch->_filepath.end(); ++pos)
    {
        // Open the original file. Binary mode matters on Windows: text mode
        // would translate line endings and corrupt the compressed stream.
        FILE* oriFile = fopen((*pos), "rb");
        if (oriFile == NULL)
            continue;  // skip unreadable files without consuming a docID

        // The compressed file is named after the docID.
        CString tmp;
        tmp.Format("%d", docID);
        FILE* destFile = fopen(tmp, "wb");
        if (destFile == NULL)
        {
            fclose(oriFile);
            continue;
        }

        // Deflate the original file into the destination file.
        int ret = def(oriFile, destFile, Z_DEFAULT_COMPRESSION);
        if (ret != Z_OK)
            zerr(ret);
        fclose(oriFile);
        fclose(destFile);

        // Record the docID -> file path mapping.
        // Note: give lpFileName an explicit path; with a bare file name,
        // WritePrivateProfileString looks in the Windows directory.
        WritePrivateProfileString("maptable", tmp, (*pos), "./docid.pair");
        docID++;
    }

    // Save the last docID actually assigned, so the next run continues
    // from it without leaving a gap.
    ofstream outdocID("docID.dat", ios_base::out);
    outdocID << (docID - 1);
    outdocID.close();

    time_t stop = time(NULL);
    cout << "\nElapsed run time:" << showpoint << difftime(stop, start) << endl;
    system("PAUSE");
    return 0;
}

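For the directory traversal part, see my article on file directory traversal; for compression and decompression, see my article on implementing compression and decompression with zlib.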
