Perl Web Crawler (Part 1)


高锦-生信


Example 9-1. Simple HTML

<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>

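My understanding is that every tag in an HTML document (html, head, li, table, and so on) is a node in the HTML tree. The nodes relate to each other like the root, trunks, branches, and leaves of a tree. In this example, html is the root, and head and body are the two main trunks (these three elements are not written in the example; the parser supplies them). ul is a branch growing from body, each li is a branch growing from ul, and the text "Ice cream." is a leaf.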
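The tree picture above can be made concrete with a short sketch. It assumes HTML::TreeBuilder 3 or later is installed; outline() is a hypothetical helper written for this illustration, not part of the library. It walks the parsed tree and returns one indented line per node, so the root, trunks, branches, and leaves become visible.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Parse the fragment from Example 9-1. By default the parser adds the
# implicit <html>, <head> and <body> elements around it.
my $root = HTML::TreeBuilder->new_from_content(
    '<ul><li>Ice cream.</li><li>Whipped cream.<li>Hot apple pie <br>(mmm pie)</li></ul>'
);

# outline() is a hypothetical helper: it returns one line per node,
# indented by depth. Elements are objects; text leaves are plain strings.
sub outline {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;
    return qq{$indent"$node"} unless ref $node;       # text leaf
    my @lines = ($indent . '<' . $node->tag . '>');   # element node
    push @lines, outline($_, $depth + 1) for $node->content_list;
    return @lines;
}

my @lines = outline($root, 0);
print "$_\n" for @lines;
$root->delete;
```

The built-in dump() method prints a similar view; the helper only shows that elements and text can be told apart with ref.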

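9.2. Using HTML::TreeBuilder
A program that uses HTML::TreeBuilder takes only five steps:

1. Create an HTML::TreeBuilder instance.
2. Choose the HTML file or HTML string to parse.
3. Parse the HTML.
4. Process the HTML as your task requires.
5. Delete the HTML::TreeBuilder instance, since it is no longer needed.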

Example 9-2. Simple HTML::TreeBuilder program

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3; # make sure the installed version is not too old
my $root = HTML::TreeBuilder->new;
$root->parse( # parse the HTML...
q{
 <ul>
 <li>Ice cream.</li>
 <li>Whipped cream.
 <li>Hot apple pie (mmm pie)</li>
 </ul>
});
$root->eof( ); # required once parsing is finished; as I understand it, this is
               # needed only with parse(), not with parse_file()
$root->dump; # print the whole HTML tree
$root->delete; # delete the tree, since we no longer need it

9.2.1. Constructors and Initialization

Use new() to create a new, empty tree:
  $root = HTML::TreeBuilder->new( );

To create a tree and parse HTML in one step, use new_from_content(), which parses one or more HTML strings:
  $root = HTML::TreeBuilder->new_from_content([string, ...]);

To parse an HTML file, use new_from_file(). Its argument can be a filename or a filehandle.
For example:
  $root = HTML::TreeBuilder->new_from_file(filename);
  $root = HTML::TreeBuilder->new_from_file(filehandle);

If you use new_from_file() or new_from_content(), the parser runs with its default options. To use non-default options, you must call the new() constructor and then parse_file() or parse().
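The difference between the one-step and multi-step constructors can be sketched as follows. This is a minimal illustration, assuming HTML::TreeBuilder 3 or later; the sample HTML string and variable names are made up for the example. The second tree sets ignore_text before parsing, which is only possible when starting from new().

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

my $html = '<p>Hello <b>world</b></p>';   # sample input for illustration

# One step, default parser options.
my $quick = HTML::TreeBuilder->new_from_content($html);

# Several steps, so that an option can be set before parsing.
my $custom = HTML::TreeBuilder->new;
$custom->ignore_text(1);   # non-default: leave text out of the tree
$custom->parse($html);
$custom->eof;

# Collect the text of each tree before the trees are deleted.
my $quick_text  = $quick->as_text;
my $custom_text = $custom->as_text;
print "default options: '$quick_text'\n";
print "ignore_text set: '$custom_text'\n";

$_->delete for $quick, $custom;
```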

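9.2.2. Parser Options
Parser options are set by calling methods on the HTML::TreeBuilder instance. Each of these methods returns the previous setting and, when given an argument, sets the new one.
For example:
   $comments = $root->strict_comment( );
   print "Strict comment processing is ";
   print $comments ? "on\n" : "off\n";
   $root->strict_comment(0); # turn strict comment processing off
Some options control whether the HTML standard is obeyed or ignored; others affect the parser's internal behavior. The parser options are listed below.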

$root->strict_comment([boolean]);
The HTML standard says that a comment is terminated by an even number of -- s between the opening < and the closing >, and there must be nothing but whitespace between even and odd -- s. That part of the HTML standard is little known, little understood, and little obeyed. So most browsers simply accept any --> as the end of a comment. If enabled via a true value, this option makes the HTML::TreeBuilder recognize only those comments that obey the HTML standard. By default, this option is off, so that HTML::TreeBuilder will parse comments as normal browsers do.

$root->strict_names([boolean]);
Some HTML has unquoted attribute values that include spaces, e.g., <img alt=big dog! src="dog.jpg">. If this option is enabled, that tag would be reported as text, because it doesn't obey the standard (dog! is not a valid attribute name). If the option is disabled, as it is by default, source such as this is parsed as a tag, with a Boolean attribute called dog! set.

$root->implicit_tags([boolean]);
Enabled by default, this option makes the parser create nodes for missing start- or end-tags. If disabled, the parse tree simply reflects the input text, which is rarely useful.

$root->implicit_body_p_tag([boolean]);
This option controls what happens to text or phrasal tags (such as ...) that are directly in a <body>, without a containing <p>. By default, the text or phrasal tag nodes are children of the <body>. If enabled, an implicit <p> is created to contain the text or phrasal tags.

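$root->ignore_unknown([boolean]);
By default, unknown tags are ignored; for example, <footer> is ignored. Set this option to a false value to create nodes for unknown tags in the parse tree.

$root->ignore_text([boolean]);
By default, text content appears in the parse tree. Enable this option to build a tree that contains no text content, that is, to ignore the text in the HTML.

$root->ignore_ignorable_whitespace([boolean]);
By default, whitespace between most tags is ignored, and runs of whitespace are collapsed into a single space. To keep the whitespace as it appears in the original HTML, turn this option off.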
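To see an option take effect, the sketch below parses the same input twice, once with ignore_unknown on (the default) and once with it off. It assumes HTML::TreeBuilder 3 or later. The <gadget> tag is deliberately made up so that it is unknown to any version of the library, and count_gadgets() is a hypothetical helper for this example only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# <gadget> is a made-up, nonstandard tag used only for this illustration.
my $html = '<p>before</p><gadget>inside</gadget><p>after</p>';

# count_gadgets() is a hypothetical helper: parse $html with
# ignore_unknown set to the given value and count the <gadget> nodes.
sub count_gadgets {
    my ($ignore) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown($ignore);   # must be set before parsing
    $tree->parse($html);
    $tree->eof;
    my @found = $tree->find_by_tag_name('gadget');
    my $count = scalar @found;
    $tree->delete;
    return $count;
}

my $with_default = count_gadgets(1);   # unknown tags are skipped
my $with_unknown = count_gadgets(0);   # unknown tags become nodes
print "ignore_unknown on:  $with_default <gadget> node(s)\n";
print "ignore_unknown off: $with_unknown <gadget> node(s)\n";
```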

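9.2.3. Parsing HTML
HTML can be parsed from either a file or a string.

The parse_file() method parses HTML from a filename or a filehandle:

$success = $root->parse_file(filename);
$success = $root->parse_file(filehandle);

For example, to parse HTML from standard input:

$root->parse_file(*STDIN) or die "Can't parse STDIN";
The parse_file() method returns the HTML::TreeBuilder instance, or undef if an error occurs.

The parse() method parses a chunk of an HTML stream. You may call parse() as many times as needed, and you must call eof() once after the last chunk. For example:
$success = $root->parse(chunk);
$success = $root->eof( );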

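This method is meant for situations where the HTML arrives one chunk at a time. It is also useful for a very large HTML document, because the source text can be fed to the parser piece by piece instead of being held in memory all at once. In many cases you could call new_from_content() instead, but remember that new_from_content() gives you no chance to set parser options.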
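Chunked parsing can be sketched like this, assuming HTML::TreeBuilder 3 or later. The @chunks array is a stand-in for data that would normally arrive from a network connection or a file read loop; the chunk boundaries are chosen to fall in the middle of tags and words to show that the parser copes with that.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Stand-in for data arriving piece by piece (for example from a socket).
# The boundaries deliberately split tags and words.
my @chunks = (
    '<html><body><h1>Ti',
    'tle</h1><p>First par',
    'agraph.</p></bo',
    'dy></html>',
);

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;   # feed each chunk as it arrives
$root->eof;                     # signal the end of input exactly once

# Read the heading text back out of the finished tree.
my $h1    = $root->find_by_tag_name('h1');
my $title = $h1 ? $h1->as_text : '';
print "Heading: $title\n";
$root->delete;
```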

9.2.4. Cleaning Up

The delete() method destroys the HTML tree and its elements, freeing the memory they used.
$root->delete( );

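9.3. Processing HTML

Once you have parsed some HTML, you need to process it. What you do depends on the kind of problem you are solving. Two common tasks are extracting information and modifying the original HTML (for example, removing advertising markup).
You will probably find that only a small part of the HTML is of interest: perhaps all the headings, all the bold-italic text, or all the paragraphs marked with class="blinking". HTML::Element provides several methods for searching the HTML tree.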

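9.3.1. Methods for Searching the Tree

In scalar context, these methods return the first node that satisfies the criteria. In list context, they return all matching nodes. The methods can be called on the root or on any other node.

$node->find_by_tag_name(tag [, ...])
Returns the nodes with any of the given tag names. For example, to get a list of all h1 and h2 nodes:
@headings = $root->find_by_tag_name('h1', 'h2');

$node->find_by_attribute(attribute, value)
Returns the nodes whose attribute has the given value. For example, to find all nodes with class="blinking":
@blinkers = $root->find_by_attribute("class", "blinking");

$node->look_down(...)
$node->look_up(...)
look_down() searches $node and its descendants (children, children of children, and so on).
look_up() searches $node and its ancestors (parent, parent of the parent, and so on).
Both find any node that matches the given criteria. Each criterion is either an attribute => value pair or a subroutine that is passed the current node and returns true to indicate that the node is of interest.
For example, to find all h2 nodes with class="blinking":
@blinkers = $root->look_down(_tag => 'h2', class => 'blinking');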
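The search methods can be tried together on a small document. The sketch below assumes HTML::TreeBuilder 3 or later; the sample HTML, the /pie link, and the variable names are invented for illustration. It uses both kinds of look_down() criteria (attribute pairs and a subroutine) and then look_up() to climb from a link to its enclosing paragraph.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Sample document, made up for this example.
my $root = HTML::TreeBuilder->new_from_content(<<'HTML');
<h1>Menu</h1>
<h2 class="blinking">Desserts</h2>
<p class="blinking">Ice cream, <a href="/pie">pie</a></p>
<h2>Drinks</h2>
HTML

# List context: every matching node is returned.
my @headings = $root->find_by_tag_name('h1', 'h2');
my @blinkers = $root->find_by_attribute('class', 'blinking');

# look_down() with attribute => value pairs; _tag stands for the tag name.
my @blinking_h2 = $root->look_down(_tag => 'h2', class => 'blinking');

# look_down() with a subroutine that receives the node and returns
# true for a match; here, links whose href starts with a slash.
my @links = $root->look_down(
    _tag => 'a',
    sub { ($_[0]->attr('href') || '') =~ m{^/} },
);

# look_up() climbs from a node through its ancestors.
my $para     = $links[0]->look_up(_tag => 'p');
my $para_tag = $para->tag;

printf "%d headings, %d blinkers, %d blinking h2, %d links\n",
    scalar @headings, scalar @blinkers, scalar @blinking_h2, scalar @links;
print "The link sits inside a <$para_tag> element\n";
$root->delete;
```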
