HTML::Parser: A Brief Explanation

Latest recommended article published on 2024-09-10 16:18:30

cnki_ok

Category: perl. Tags: html, api, documentation

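HTML::Parser is a very powerful module for parsing HTML.

The HTML::Parser documentation lacks a complete example, so I have pulled the example below out of ShellWeb, a program I wrote, and will explain it briefly.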

use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start, "self, tagname, attr, text"],
    text_h  => [\&text,  "self, dtext"],
    end_h   => [\&end,   "self, tagname"],
    ignore_elements => [qw(script style)],
);
$parser->parse($content);
$parser->eof;   # signal end of document so any buffered text is flushed

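new creates the parser instance. api_version => 3 selects the API version: HTML::Parser also has an older version 2 API, and newer versions may follow. Because each version has a different API, it is best to state the version explicitly.
start_h, text_h, and end_h register three event handlers. They correspond to a start tag, text (the content outside tags, excluding comments and the like), and an end tag.

The full set of events is text, start, end, declaration, comment, process, start_document, end_document, and default.
In practice you usually need only text, start, and end.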
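
As a rough illustration of the less common events, the sketch below logs which events fire for a tiny document. The sample HTML string and the @events array are made up for this example. Events with no handler (start, text, and end here) are simply ignored.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;   # hypothetical log of the events we see

my $p = HTML::Parser->new(
    api_version      => 3,
    # "event" passes the event name; "text" passes the raw source text
    start_document_h => [sub { push @events, $_[0] },          "event"],
    declaration_h    => [sub { push @events, "$_[0]: $_[1]" }, "event, text"],
    comment_h        => [sub { push @events, "$_[0]: $_[1]" }, "event, text"],
    end_document_h   => [sub { push @events, $_[0] },          "event"],
);

$p->parse('<!DOCTYPE html><!-- note --><p>hi</p>');
$p->eof;      # end_document fires here

print "$_\n" for @events;
```

start_document fires before any other event and end_document fires once eof is called, so they are convenient places for setup and cleanup.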

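There are several ways to specify the handler and arguments for an event:
The first, shown above, is to append _h to the event name, as in text_h and start_h.

The second is also passed as an argument to new.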

$p = HTML::Parser->new(api_version => 3,
                        handlers => { text => [\@array, "event,text"],
                                      comment => [\@array, "event,text"],
                                    });

This form uses the handlers option.
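
When the handler is an array reference rather than a subroutine, the parser pushes one array reference per event onto that array instead of calling code. Here is a minimal sketch of that behavior, with a made-up input string and an @events array standing in for @array:

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;   # each event is pushed here as [event, text]

my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => {
        text    => [\@events, "event,text"],
        comment => [\@events, "event,text"],
    },
);

$p->parse('<p>Hello<!-- hidden --> world</p>');
$p->eof;

# Walk the collected events after parsing has finished.
for my $e (@events) {
    my ($event, $text) = @$e;
    print "$event: $text\n";
}
```

This style suits cases where you would rather collect everything first and process it in an ordinary loop afterwards.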

The third is to leave the handlers out of new and register them afterwards with the handler method.

$p = HTML::Parser->new(...);
$p->handler(start => \&start, 'attr, attrseq, text');
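
To make the 'attr, attrseq, text' specification more concrete, here is a small sketch that registers a start handler after construction. The input tag and the @seen array are invented for illustration. attr is a hash reference of attribute values and attrseq is an array reference holding the attribute names in their original order.

```perl
use strict;
use warnings;
use HTML::Parser;

my @seen;   # one "name=value,..." string per start tag

my $p = HTML::Parser->new(api_version => 3);

$p->handler(start => sub {
    my ($attr, $attrseq, $text) = @_;
    # A hash has no order, so use attrseq to walk attributes in source order.
    push @seen, join ",", map { "$_=$attr->{$_}" } @$attrseq;
}, 'attr, attrseq, text');

$p->parse('<a href="/x" title="X">link</a>');
$p->eof;

print "$_\n" for @seen;
```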

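All three forms take the same two things: first the handler, then the argument specification.
For the full list of arguments, see perldoc HTML::Parser.
The most commonly used are self, tagname, attr, and text.
Note, however, that different events accept different arguments. For example, the start event can have attr, but the end event cannot (an end tag never carries attributes).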
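
The first example asks for dtext in its text handler rather than text. The difference is that text is the raw source, while dtext has character entities decoded. Here is a minimal sketch of the two side by side, using a made-up input string and a @pairs array:

```perl
use strict;
use warnings;
use HTML::Parser;

my @pairs;   # [raw text, decoded text] for each text event

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [sub { push @pairs, [@_] }, "text, dtext"],
);

$p->parse('<p>Tom &amp; Jerry</p>');
$p->eof;

for my $pair (@pairs) {
    my ($raw, $decoded) = @$pair;
    print "raw:     $raw\n";
    print "decoded: $decoded\n";
}
```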

In the first example we used start_h => [\&start, "self, tagname, attr, text"],
so next we define our own start subroutine:

# $buffer, $iter and $window are defined elsewhere in ShellWeb.
sub start {
    my ($self, $tag, $attr, $text) = @_;
    if ($tag eq 'img') {
        $buffer->insert($iter, "[IMG - $attr->{src}]");
    } elsif ($tag eq 'pre') {
        $self->{'_pre'} = 1;      # remember that we are inside <pre>
    } elsif ($tag eq 'title') {
        $self->{'_title'} = 1;    # remember that we are inside <title>
    } else {
        $buffer->insert($iter, $text);
    }
}

The first line of start unpacks the arguments named in start_h => [\&start, "self, tagname, attr, text"], in the same order: self, the tag name (a, img, and so on), the attributes (a hash reference; an img tag, for instance, normally has $attr->{src}), and text, the complete original source of the tag.
It really is that simple. Next we define the text and end subroutines:

sub text {
    my ($self, $text) = @_;
    # drop line breaks unless we are inside <pre>
    $text =~ s/\r\n//sg unless ( defined($self->{'_pre'}) );
    if ( defined($self->{'_title'}) ) {
        $window->set_title("$text - ShellWeb");
    } else {
        $buffer->insert($iter, $text);
    }
}

sub end {
    my ($self, $tag) = @_;
    if ($tag eq 'li') {
        $buffer->insert($iter, "\n");
    } elsif ($tag eq 'p') {
        $buffer->insert($iter, "\n");
    }
    # leaving <title> or <pre>: clear the matching flag
    $self->{'_title'} = undef if ($tag eq 'title');
    $self->{'_pre'} = undef   if ($tag eq 'pre');
}

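This program uses one small trick. The text event does not know which tag encloses the text (that is, text_h has no tagname argument).
So when a title or pre tag starts, we record that fact in $self->{'_title'} or $self->{'_pre'}, and we clear it again in end.
That way the text handler knows whether the current text is inside title or pre.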
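
The flag technique can also be tried without ShellWeb. The following is only a sketch under that assumption: it stores results in two made-up variables, $title and @body, instead of a window and a text buffer, and the input document is invented. Keys stored on the parser object are safe as long as they do not start with _hparser, which the module reserves.

```perl
use strict;
use warnings;
use HTML::Parser;

my $title = '';   # text found inside <title>
my @body;         # all other text

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [sub {
        my ($self, $tag) = @_;
        $self->{'_title'} = 1 if $tag eq 'title';      # entering <title>
    }, "self, tagname"],
    text_h => [sub {
        my ($self, $text) = @_;
        if ($self->{'_title'}) { $title .= $text }     # flag set: title text
        else                   { push @body, $text }   # otherwise body text
    }, "self, dtext"],
    end_h => [sub {
        my ($self, $tag) = @_;
        $self->{'_title'} = undef if $tag eq 'title';  # leaving <title>
    }, "self, tagname"],
);

$p->parse('<html><head><title>Demo</title></head><body><p>Hello</p></body></html>');
$p->eof;

print "title: $title\n";
print "body:  ", join('', @body), "\n";
```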

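That is roughly how parsing with HTML::Parser works. The most important thing is practice: pick a result you want and implement it with HTML::Parser.
If you get stuck or something goes wrong, consult perldoc HTML::Parser.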