Learning Search Engines Top-Down: Analysis and Complete Annotation of the PKU Tianwang Search Engine TSE [5] - Building the Inverted Index and Its Files

author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy

Sorry to have kept everyone waiting; I was busy with exams for a while, but they are finally over. Without further ado, let's get started!

TSE packs all of the crawled web pages into one large file and then builds the index over the data in that single file as a whole. This involves several steps.

1.  The document index (Doc.idx) keeps information about each document. It is a fixed-width ISAM (Indexed Sequential Access Method) index, ordered by docID. The information stored in each entry includes a pointer into the repository, the document length, and a document checksum.



//Doc.idx  docID	document length	checksum (MD5 hash)

0	0	bc9ce846d7987c4534f53d423380ba70

1	76760	4f47a3cad91f7d35f4bb6b2a638420e5

2	141624	d019433008538f65329ae8e39b86026c

3	142350	5705b8f58110f9ad61b1321c52605795

//Doc.idx	end
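
Because the entries are fixed width and ordered by docID, the entry for a given document can be located with a direct seek instead of a scan. Below is a minimal C++ sketch of such an entry; the struct and field names are illustrative assumptions, not TSE's actual definitions.

//DocIdxEntry.h (illustrative sketch, not TSE source)
struct DocIdxEntry
{
	unsigned long m_nOffset;    // pointer into the repository (byte offset)
	unsigned long m_nLength;    // document length in bytes
	char m_szChecksum[33];      // 32 hex chars of MD5, NUL-terminated
};
// Fixed width means entry i sits at offset i * sizeof(DocIdxEntry),
// which is what makes the ISAM-style direct lookup by docID possible.
//DocIdxEntry.h end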



  The URL index (Url.idx) is used to convert URLs into docIDs.



//url.idx

5c36868a9c5117eadbda747cbdb0725f	0

3272e136dd90263ee306a835c6c70d77	1

6b8601bb3bb9ab80f868d549b5c5a5f3	2

3f9eba99fa788954b5ff7f35a5db6e1f	3

//url.idx	end



It is a list of URL checksums with their corresponding docIDs, sorted by checksum. To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID.
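
Below is a minimal C++ sketch of that binary search, assuming the sorted Url.idx pairs have been loaded into memory and the URL's MD5 checksum has already been computed; the names and layout are illustrative, not TSE's actual code.

//LookupDocId.cpp (illustrative sketch, not TSE source)
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// urlIdx holds (checksum, docID) pairs sorted by checksum, as in Url.idx.
int LookupDocId(const std::vector<std::pair<std::string, int> >& urlIdx,
                const std::string& urlChecksum)
{
	std::vector<std::pair<std::string, int> >::const_iterator it =
		std::lower_bound(urlIdx.begin(), urlIdx.end(),
		                 std::make_pair(urlChecksum, -1));
	if (it != urlIdx.end() && it->first == urlChecksum)
		return it->second;          // found: return its docID
	return -1;                          // URL was never crawled
}
//LookupDocId.cpp end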



	./DocIndex
		got Doc.idx, Url.idx, DocId2Url.idx	// Doc.idx, Url.idx, and DocId2Url.idx are written to the Data folder



//DocId2Url.idx

0	http://*.*.edu.cn/index.aspx

1	http://*.*.edu.cn/showcontent1.jsp?NewsID=118

2	http://*.*.edu.cn/0102.html

3	http://*.*.edu.cn/0103.html

//DocId2Url.idx	end
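
For rendering a result list, the reverse mapping can be loaded from DocId2Url.idx. A small sketch under the same caveat (not TSE's actual code):

//LoadDocId2Url.cpp (illustrative sketch, not TSE source)
#include <fstream>
#include <map>
#include <string>

// Reads "docID <tab> URL" lines into a map so a hit list can show URLs.
std::map<int, std::string> LoadDocId2Url(const char* path)
{
	std::map<int, std::string> id2url;
	std::ifstream in(path);
	int docId;
	std::string url;
	while (in >> docId >> url)
		id2url[docId] = url;
	return id2url;
}
//LoadDocId2Url.cpp end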



2.  sort Url.idx|uniq > Url.idx.sort_uniq	// produces Url.idx.sort_uniq in the Data folder



//Url.idx.sort_uniq

//sorted by hash value

000bfdfd8b2dedd926b58ba00d40986b	1111

000c7e34b653b5135a2361c6818e48dc	1831

0019d12f438eec910a06a606f570fde8	366

0033f7c005ec776f67f496cd8bc4ae0d	2103



3. Segment the documents into terms (locating each document by its URL)

	./DocSegment Tianwang.raw.2559638448		// Tianwang.raw.2559638448 is the crawled data; each page includes its HTTP headers
		got Tianwang.raw.2559638448.seg



//Tianwang.raw.2559638448	the raw crawled pages; inside the file, individual documents appear to be delimited by a "version" line, </html>, and a carriage return

version: 1.0

url: http://***.105.138.175/Default2.asp?lang=gb

origin: http://***.105.138.175/

date: Fri, 23 May 2008 20:01:36 GMT

ip: 162.105.138.175

length: 38413



HTTP/1.1 200 OK

Server: Microsoft-IIS/5.0

Date: Fri, 23 May 2008 11:17:49 GMT

Connection: keep-alive

Connection: Keep-Alive

Content-Length: 38088

Content-Type: text/html; Charset=gb2312

Expires: Fri, 23 May 2008 11:17:49 GMT

Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/

Cache-control: private







<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<title>Apabi数字资源平台</title>

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="DESCRIPTION" CONTENT="数字图书馆 方正数字图书馆 电子图书 电子书 ebook e书 Apabi 数字资源平台">

<link rel="stylesheet" type="text/css" href="css/common.css">



<style type="text/css">

<!--

.style4 {color: #666666}

-->

</style>



<script LANGUAGE="vbscript">

...

</script>



<Script Language="javascript">

...

</Script>

</head>

<body leftmargin="0" topmargin="0">

</body>

</html>

//Tianwang.raw.2559638448	end
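
Judging from this sample, a record can be split off by reading the "key: value" header block up to the blank line, then consuming exactly "length" bytes of raw HTTP response. The sketch below is an assumption drawn from the sample, not DocSegment's actual parser.

//TianwangRecord.cpp (illustrative sketch, not TSE source)
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

void ForEachRecord(const char* path)
{
	std::ifstream in(path, std::ios::binary);
	std::string line;
	while (std::getline(in, line))
	{
		if (line.compare(0, 8, "version:") != 0)
			continue;                       // skip until a record header starts
		std::string url;
		long length = 0;
		while (std::getline(in, line) && !line.empty())
		{
			if (line.compare(0, 4, "url:") == 0)
				url = line.substr(5);
			else if (line.compare(0, 7, "length:") == 0)
				length = std::atol(line.c_str() + 7);
		}
		std::vector<char> body(length);         // raw HTTP headers + HTML
		if (length > 0)
			in.read(&body[0], length);
		std::cout << url << "\t" << length << " bytes\n";
	}
}
//TianwangRecord.cpp end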



//Tianwang.raw.2559638448.seg	each page is segmented onto a single line, as below (note that no carriage return is used as a separator within a page)

1

...

...

...

2

...

...

...

//Tianwang.raw.2559638448.seg	end



//The following steps are non-essential parts of a tiny search engine

4. Create the forward index (docID --> termID)

	./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx



//Tianwang.raw.2559638448.seg	each page on a single line, as below
//DocID, then its segmented terms
1
三星/  s/  手机/  论坛/  ,/  手机/  铃声/  下载/  ,/  手机/  图片/  下载/  ,/  手机/
2
...
...
...
//Tianwang.raw.2559638448.seg end


//moon.fidx

//one line per term occurrence in each document:	term	DocID

都会	2391

使	2391

那些	2391

拥有	2391

它	2391

的	2391

人	2391

的	2391

视野	2391

变	2391

窄	2391

在	2180

研究生部	2180

主页	2180

培养	2180

管理	2180

栏目	2180

下载	2180

)	2180

、	2180

关于	2180

做好	2180

年	2180

国家	2180

公派	2180

研究生	2180

项目	2180

//moon.fidx	end
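
A simplified sketch of what CrtForwardIdx appears to do, inferred from the samples above: read a docID line, read the following line of "/"-separated terms, and emit one "term <tab> docID" pair per term. The real tool takes the .seg file as an argument and differs in detail; this sketch reads stdin.

//CrtForwardIdx.cpp (illustrative sketch, not TSE source)
#include <iostream>
#include <sstream>
#include <string>

int main()
{
	std::string docId, line;
	while (std::getline(std::cin, docId) && std::getline(std::cin, line))
	{
		std::istringstream terms(line);
		std::string term;
		while (std::getline(terms, term, '/'))
		{
			// strip the spaces that trail each "/" separator
			std::string::size_type b = term.find_first_not_of(" \t");
			if (b == std::string::npos)
				continue;
			std::string::size_type e = term.find_last_not_of(" \t");
			std::cout << term.substr(b, e - b + 1) << '\t' << docId << '\n';
		}
	}
	return 0;
}
//CrtForwardIdx.cpp end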



5. Sort the forward index by term, first forcing a locale under which sort collates consistently:

	# set | grep "LANG"
	LANG=en; export LANG;
	sort moon.fidx > moon.fidx.sort



6. Create the inverted index (termID --> docID)

	./CrtInvertedIdx moon.fidx.sort > sun.iidx



//sun.iidx	// roughly half the size of the forward index file

花工	 236

花海	 2103

花卉	 1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949

花蕾	 447 447

花木	 1061

花呢	 1430

花期	 447 447 447 447 447 525

花钱	 174 236

花色	 1730 1730

花色品种	 1660

花生	 450 526

花式	 1428 1430 1430 1430

花纹	 1430 1430

花序	 447 447 447 447 447 450

花絮	 136 137

花芽	 450 450

//sun.iidx	end
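
The inversion itself can be a single pass over the sorted forward index: since moon.fidx.sort groups all lines of a term together, consecutive equal terms just have their docIDs appended to one posting line. A minimal sketch under that assumption (not TSE's actual CrtInvertedIdx):

//CrtInvertedIdx.cpp (illustrative sketch, not TSE source)
#include <iostream>
#include <string>

int main()
{
	std::string term, docId, prev;
	while (std::cin >> term >> docId)
	{
		if (term != prev)               // new term starts a new posting line
		{
			if (!prev.empty())
				std::cout << '\n';
			std::cout << term << '\t';
			prev = term;
		}
		std::cout << ' ' << docId;      // append docID to the current line
	}
	if (!prev.empty())
		std::cout << '\n';
	return 0;
}
//CrtInvertedIdx.cpp end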



TSESearch	CGI program for queries

Snapshot	CGI program for page snapshots



For reference, here is the README that ships with the TSE package:

TSE (Tiny Search Engine)
========================
(Temporary) Web home: http://162.105.80.44/~yhf/Realcourse/

TSE is a free utility for non-interactive download of files from the Web.
It supports HTTP. According to a query word or URL, it retrieves results
from crawled pages. It can follow links in HTML pages and create output
files in Tianwang (http://e.pku.edu.cn/) format or ISAM format files.
Additionally, it provides link structures which can be used to rebuild
the web frame.

---------------------------
Main functions in TSE:
1) normal crawling, named SE, e.g. crawling all pages in PKU scope,
   and retrieving results from crawled pages according to a query word or URL;
2) crawling images and corresponding pages, named ImgSE.

---------------------------
INSTALL:
1) execute "tar xvfz tse.XXX.gz"

---------------------------
Note: the program defaults to normal crawling (SE). Before running it for
ImgSE, you should:
1. change the code as follows:
   1) In the "Page.cpp" file, find the two identical functions
      "CPage::IsFilterLink(string plink)". One is for ImgSE, whose URLs must
      include "tupian", "photo", "ttjstk", etc.; the other is for normal
      crawling. For ImgSE, comment out one paragraph and choose the right
      "CPage::IsFilterLink(string plink)"; for SE, uncomment it and choose
      the right "CPage::IsFilterLink(string plink)".
   2) In the Http.cpp file:
      i.   find "if( iPage.m_sContentType.find("image") != string::npos )"
           and comment out the right paragraph.
   3) In the Crawl.cpp file:
      i.   find "if( iPage.m_sContentType != "text/html" " and comment out
           the right paragraph.
      ii.  find "if(file_length < 40)" and choose the right line.
      iii. find "iMD5.GenerateMD5( (unsigned char*)iPage.m_sContent.c_str(), iPage.m_sContent.length() )"
           and comment out the right paragraph.
      iv.  find "if (iUrl.IsImageUrl(strUrl))" and comment out the right paragraph.
2. run "sh Clean;" (note: do not remove link4History.url -- comment out the
   "rm -f link4History.url" line first), then use "link4History.url" as the
   seed file. "link4History" is produced during normal crawling (SE).

---------------------------
EXECUTION:
execute "make clean; sh Clean; make".
1) for normal crawling and retrieving:
       ./Tse -c tse_seed.img
   to retrieve results from crawled pages according to a query word or URL:
       ./Tse -s
2) for ImgSE:
       ./Tse -c tse_seed.img
   after moving the Tianwang.raw.* data to a secure place, execute:
       ./Tse -c link4History.url

---------------------------
Detailed functions:
1) supports multithreaded page crawling
2) persistent HTTP connections
3) DNS cache
4) IP block
5) filtering of unreachable hosts
6) parsing hyperlinks from crawled pages
7) recursive page crawling
8) output in Tianwang format or ISAM format files

---------------------------
Files in the package:
Tse                  --- TSE executable
tse_unreachHost.list --- unreachable hosts according to the PKU IP block
tse_seed.pku         --- PKU seeds
tse_ipblock          --- PKU IP block
...
Directories in the package:
hlink, include, lib, stack, uri --- parsing links from a page

---------------------------
Please report bugs in TSE to MAINTAINERS: YAN Hongfei
* Created: YAN Hongfei, Network lab of Peking University.
* Created: July 15 2003. version 0.1.1
*   # can crawl web pages with a single process
* Updated: Aug 20 2003. version 1.0.0
*   # can crawl web pages with multiple threads
* Updated: Nov 08 2003. version 1.0.1
*   # more classes in the code
* Updated: Nov 16 2003. version 1.1.0
*   # integrated a new version of the link parser provided by XIE Han
*   # according to the MD5 values of page content, a new page is stored
*     for every page not seen before
* Updated: Nov 21 2003. version 1.1.1
*   # records all duplicate URLs in terms of content MD5
