How TSE Extracts Links from HTML

TSE extracts link URIs from HTML using Lex analysis.

The Lex-related files in TSE are hlink.l and uri.l.

 

uri.l is used to process a single extracted URI; hlink.l is used to extract the links from an HTML page.

Code flow:

 

In the DownloadFile() method of the Crawl class, once the content of a web page (HTML) has been fetched, it is stored in a CPage object.

Analysis of the CPage object then begins. First the page's own URL is parsed with uri_parse_string(iPage.m_sUrl.c_str(), &page_uri); the result is stored in page_uri.

Then it calls hlink_detect_string(iPage.m_sContent.c_str(), &page_uri, onfind, &p);

hlink_detect_string is the key function. Its meaning is: the page content (iPage.m_sContent.c_str()) is passed in, and whenever link information is found the onfind function is called, with &p as its argument.

hlink_detect_string is defined in hlink.l.

 

int hlink_detect_string(const char *string, const struct uri *pg_uri, onfind_t onfind, void *arg)

The parameter string is the page content passed in, pg_uri is the URI of the page itself, onfind is the function to execute once a link is found, and arg is a user argument that is carried along through the whole scan.
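Since onfind is just a callback, here is a minimal sketch of one. The signature is inferred from the call __onfind(__cur_elem->name, __cur_attr, result, __arg) shown later in this post; the actual onfind_t typedef is in TSE's hlink code, so check it for the real signature.

#include <cstdio>

struct uri;                       // opaque here; the real definition is in TSE's uri library

// Hypothetical callback: counts links and reports where each one was found.
// The int return type is an assumption; check onfind_t for the real signature.
static int onfind(const char *elem, const char *attr, struct uri *uri, void *arg)
{
    int *count = (int *)arg;      // the arg pointer handed to hlink_detect_string
    std::printf("link found in <%s %s=...>\n", elem, attr);
    (void)uri;                    // a real callback would merge/recombine the URI
    ++*count;
    return 0;
}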

 

if (buf = yy_scan_string(string))       /* copy string into a new scan buffer */
{
    yy_switch_to_buffer(buf);           /* make that buffer the scanner's input */
    __base_uri = (struct uri *)pg_uri;  /* stash the arguments in globals */
    __is_our_base = 0;
    __onfind = onfind;
    __arg = arg;

    BEGIN INITIAL;                      /* start in the INITIAL state */
    n = yylex();                        /* run the analysis */
}

 

yy_scan_string(string) and yy_switch_to_buffer() tell the scanner that the text to analyze is the parameter string, and the remaining arguments are saved into global variables.

The scanner then enters the INITIAL start state, and yylex() starts the analysis.
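This is the standard flex idiom for scanning an in-memory string instead of a file. A generic sketch of the same pattern (plain flex API, not TSE-specific code):

YY_BUFFER_STATE buf = yy_scan_string("<a href=\"http://e.pku.edu.cn/\">PKU</a>");
yy_switch_to_buffer(buf);   /* redirect yylex() from stdin to the buffer */
int n = yylex();            /* run the rules until the end of the buffer */
yy_delete_buffer(buf);      /* release the buffer when done */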

 

 

In the INITIAL state, when the expression "<"{cdata}/{blank}|">" matches — that is, on input like "<A>" or "<A " — the array entry for that element name is assigned to the global __cur_elem. For example, on finding "<A " the entry {"A", __elem_a_attr} is assigned to __cur_elem. (The /{blank}|">" part is flex trailing context: the element name must be followed by a blank or ">", but that lookahead is not consumed.) The relevant code:

/* Element names are case-insensitive. */
for (yyleng = 0; __elems[yyleng].name; yyleng++)   /* yyleng reused as a scratch index */
{
    if (strcasecmp(yytext + 1, __elems[yyleng].name) == 0)   /* +1 skips the '<' */
    {
        __cur_elem = __elems + yyleng;
        break;
    }
}

The array is defined as:

static struct __elem __elems[] = {
    {"A", __elem_a_attr},
    {"AREA", __elem_area_attr},
    {"BASE", __elem_base_attr},
    {"FRAME", __elem_frame_attr},
    {"IFRAME", __elem_iframe_attr},
    {"IMG", __elem_img_attr},
    {"LINK", __elem_link_attr},
    {"META", __elem_meta_attr},
    {NULL, }
};

static const struct __elem *__cur_elem;
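The attribute lists themselves are not shown in this post, but the loop over __cur_elem->attrs later below implies that each __elem_*_attr is a NULL-terminated array of the attribute names that can carry a URI. Presumably something along these lines (an illustrative guess, not the actual hlink.l source):

struct __elem {
    const char  *name;    /* element name, e.g. "A" */
    const char **attrs;   /* NULL-terminated list of URI-bearing attributes */
};

/* plausible contents: */
static const char *__elem_a_attr[]   = { "HREF", NULL };
static const char *__elem_img_attr[] = { "SRC", NULL };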

 

It then pushes the ATTRIBUTE state:

yy_push_state(ATTRIBUTE);

 

In the ATTRIBUTE state, the goal is to find the attribute of interest — for example the href in A href= . First href is obtained on its own (without the "="). The matched text yytext at this point is "href", i.e. the string with the blanks and the equals sign stripped off. The relevant code:

 

The matching regular expression: {cdata}{blank}{0,512}"="{blank}{0,512}

Stripping the blanks and the equals sign:

yyleng = 0;
while (!HLINK_ISBLANK(yytext[yyleng]) && yytext[yyleng] != '=')
    yyleng++;
yytext[yyleng] = '\0';
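A standalone illustration of what that loop does (HLINK_ISBLANK is presumably a whitespace-test macro; this demo substitutes std::isspace):

#include <cctype>
#include <cstdio>

int main()
{
    char text[] = "href  =";        // what yytext might hold after the match
    int  len = 0;
    while (!std::isspace((unsigned char)text[len]) && text[len] != '=')
        len++;
    text[len] = '\0';               // truncate at the first blank or '='
    std::printf("%s\n", text);      // prints: href
    return 0;
}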

 

The attribute name — the string "href" — is then stored in char *__cur_attr, and __curpos is set to mark the write position in __buffer, the array where the URI will be accumulated:

for (yyleng = 0; __cur_elem->attrs[yyleng]; yyleng++)
{
    if (strcasecmp(yytext, __cur_elem->attrs[yyleng]) == 0)
    {
        __curpos = __buffer;                       /* start of the URI buffer */
        __cur_attr = __cur_elem->attrs[yyleng];    /* remember the attribute name */
        break;
    }
}

It then enters the URI state, ready to extract the URI:

BEGIN URI;

 

In the URI state, a double quote switches the scanner into the DOUBLE_QUOTED state; likewise, a single quote switches it into SINGLE_QUOTED:

<URI>\"{blank}{0,512}  BEGIN DOUBLE_QUOTED;
<URI>"'"{blank}{0,512} BEGIN SINGLE_QUOTED;

That is, at this point the scanner has already read <A href=" and is about to read the actual URI.

So the action that usually fires is this one:

<UNQUOTED,DOUBLE_QUOTED,SINGLE_QUOTED,ENTITY>.|\n

.|\n matches each character read before the closing quote is reached (the quote means the xxx of href="xxx" has been read completely). It also excludes special HTML codes such as "&lt;" (presumably what the ENTITY state is for).

 

When a character is read, the key statement is *__curpos++ = *yytext; — the character is written into the buffer that __curpos points into, and the pointer is advanced by one. This repeats until the whole URI has been read, i.e. until a closing quote or the like is reached:

 

<DOUBLE_QUOTED>{blank}{0,512}\"  |
<SINGLE_QUOTED>{blank}{0,512}"'" |
<UNQUOTED>{blank}|">"            {

 

 

When reading is complete, the first step is to write the terminating character '\0' into both of the two bytes that follow the collected string:

*(__curpos + 1) = *__curpos = '\0';

 

A pointer is set to the start of the URI buffer:

ptr = __buffer;

The accumulated character buffer is then parsed into a uri struct (the + 2 accounts for the two terminating '\0' bytes written above):

yyleng = uri_parse_buffer(ptr, __curpos - ptr + 2, &uri);

The extracted URI is then merged with the URI of the page it was found on into one final URI, result:

uri_merge(&uri, __base_uri, result);
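Conceptually, the merge resolves a possibly relative URI against the page's base URI. The intended effect, with illustrative values (not actual TSE output):

/* base uri (the page) : http://e.pku.edu.cn/docs/index.html
   extracted uri       : ../img/logo.gif
   merged result       : http://e.pku.edu.cn/img/logo.gif */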

 

Finally the onfind function is called:

__onfind(__cur_elem->name, __cur_attr, result, __arg)
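To make the whole flow concrete, here is a hand trace of the scanner states for one small input (my own illustration, not TSE output):

/* Input: <A href="docs/x.html">

   "<A"           INITIAL        matches "<"{cdata}/{blank}|">"; __cur_elem = {"A", __elem_a_attr};
                                 yy_push_state(ATTRIBUTE)
   "href="        ATTRIBUTE      matches {cdata}{blank}{0,512}"="{blank}{0,512};
                                 __cur_attr = "href"; __curpos = __buffer; BEGIN URI
   '"'            URI            BEGIN DOUBLE_QUOTED
   "docs/x.html"  DOUBLE_QUOTED  each character appended via *__curpos++ = *yytext
   '"'            DOUBLE_QUOTED  closing rule fires: terminate the buffer, uri_parse_buffer,
                                 uri_merge with the base URI, then __onfind(...) */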

 

 

__onfind is called as soon as a valid URI is found, so the onfind function may be called many times while scanning a single HTML page.

 

As for what onfind does: if the obtained URI is of image type, it is stored in the m_ofsLink4HistoryFile file; URIs of any other type are handed to the AddUrl function for processing.
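A self-contained sketch of that dispatch (the names IsImageUrl, AddUrl and m_ofsLink4HistoryFile do appear in TSE's Crawl.cpp, but the bodies below are deliberately simplified guesses, not the actual source):

#include <fstream>
#include <string>

static std::ofstream m_ofsLink4HistoryFile("link4History.url");

static bool IsImageUrl(const std::string &url)   // guessed heuristic
{
    return url.find(".jpg") != std::string::npos
        || url.find(".gif") != std::string::npos;
}

static void AddUrl(const char *url)              // stub: would queue the URL for crawling
{
    (void)url;
}

static void dispatch(const std::string &url)
{
    if (IsImageUrl(url))
        m_ofsLink4HistoryFile << url << "\n";    // image links go to the history file
    else
        AddUrl(url.c_str());                     // everything else goes to AddUrl
}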

 

 

That is the rough flow of how TSE uses Lex to analyze HTML. How TSE's main code crawls web pages will be covered in a later post.

For reference, here is the README that ships with TSE:

TSE (Tiny Search Engine)
========================

(Temporary) Web home: http://162.105.80.44/~yhf/Realcourse/

TSE is a free utility for non-interactive download of files from the Web. It supports HTTP. According to a query word or URL, it retrieves results from crawled pages. It can follow links in HTML pages and create output files in Tianwang (http://e.pku.edu.cn/) format or ISAM format. Additionally, it provides link structures which can be used to rebuild the web frame.

---------------------------
Main functions in TSE:
1) normal crawling, named SE, e.g. crawling all pages in PKU scope, and retrieving results from crawled pages according to a query word or URL;
2) crawling images and corresponding pages, named ImgSE.

---------------------------
INSTALL:
1) execute "tar xvfz tse.XXX.gz"

---------------------------
Before running the program, note that it defaults to normal crawling (SE). For ImgSE, you should:
1. Change the code as follows:
   1) In the "Page.cpp" file, find the two identical functions "CPage::IsFilterLink(string plink)". One is for ImgSE, whose URLs must include "tupian", "photo", "ttjstk", etc.; the other is for normal crawling. For ImgSE, remember to comment out the paragraph and choose the right "CPage::IsFilterLink(string plink)". For SE, remember to uncomment the paragraph and choose the right "CPage::IsFilterLink(string plink)".
   2) In the Http.cpp file:
      i. find "if( iPage.m_sContentType.find("image") != string::npos )" and comment the right paragraph.
   3) In the Crawl.cpp file:
      i. find "if( iPage.m_sContentType != "text/html" " and comment the right paragraph.
      ii. find "if(file_length < 40)" and choose the right line.
      iii. find "iMD5.GenerateMD5( (unsigned char*)iPage.m_sContent.c_str(), iPage.m_sContent.length() )" and comment the right paragraph.
      iv. find "if (iUrl.IsImageUrl(strUrl))" and comment the right paragraph.
2. Run "sh Clean" (note: to keep link4History.url, comment out the "rm -f link4History.url" line first), then use "link4History.url" as the seed file. "link4History" is produced during normal crawling (SE).

---------------------------
EXECUTION:
execute "make clean; sh Clean; make".
1) For normal crawling and retrieving:
   ./Tse -c tse_seed.img
   To retrieve results from crawled pages according to a query word or URL:
   ./Tse -s
2) For ImgSE:
   ./Tse -c tse_seed.img
   After moving the Tianwang.raw.* data to a secure place, execute:
   ./Tse -c link4History.url

---------------------------
Detailed functions:
1) supporting multithreaded crawling of pages
2) persistent HTTP connections
3) DNS cache
4) IP block
5) filtering unreachable hosts
6) parsing hyperlinks from crawled pages
7) recursively crawling pages
8) outputting Tianwang format or ISAM format files

---------------------------
Files in the package:
Tse --- Tse executable
tse_unreachHost.list --- unreachable hosts according to the PKU IP block
tse_seed.pku --- PKU seeds
tse_ipblock --- PKU IP block
...

Directories in the package:
hlink, include, lib, stack, uri --- parse links from a page

---------------------------
Please report bugs in TSE to MAINTAINERS: YAN Hongfei

* Created: YAN Hongfei, Network lab of Peking University.
* Created: July 15 2003. version 0.1.1
*   # Can crawl web pages with a process
* Updated: Aug 20 2003. version 1.0.0 !!!!
*   # Can crawl web pages with multithreads
* Updated: Nov 08 2003. version 1.0.1
*   # more classes in the codes
* Updated: Nov 16 2003. version 1.1.0
*   # integrate a new version linkparser provided by XIE Han
*   # according to all MD5 values of page content, store a new page for every page not seen before
* Updated: Nov 21 2003. version 1.1.1
*   # record all duplicate urls in terms of content MD5
