html 筛选标签,Perl 实现简单的html 标签筛选

此程序提供简单的获取html 页面代码并筛选出以下标签和一些基本属性:

: 属性 href

: 属性 src

后续会添加一些更有用的功能,并逐步完善命令行接口。

使用方法:

perl filter_html.pl 

#!/usr/bin/perl

# --------------------------

# author : ez

# date : 2015/8/23

# describe : this script send http request for a http url and filter some special

# tag you input

# --------------------------

use strict;

use warnings;

use LWP::UserAgent;

use Data::Dumper;

use HTML::TreeBuilder;

# use HTML::Parser;

our $VERSION = 1.0;

my %disp_func = (

a => sub {

my $em = shift;

return if ! defined ($em) and $em -> tag () ne 'a';

my $href = $em -> attr ('href');

print "a url = "

. ($href ? $href : 'none') . "\n";

},

script => sub {

my $em = shift;

return if ! defined ($em) and $em -> tag () ne 'script';

my $type = $em -> attr ('type');

my $src = $em -> attr ('src');

print "script type = "

. ($type ? $type : 'none') . ", src = "

. ($src ? $src : 'none') . "\n";

},

img => sub {

my $em = shift;

return if ! defined ($em) and $em -> tag () ne 'img';

my $src = $em -> attr ('src');

print "img src = "

. ($src ? $src : 'none') . "\n";

}

);

&_usage () if @ARGV < 1;

my $url = shift @ARGV;

my @tags = qw(a script form img);

@tags = @ARGV if @ARGV >= 1;

my $useragent = LWP::UserAgent -> new;

my $request = HTTP::Request -> new ('GET' => $url);

$request -> content_type ('application/x-www-form-urlencoded');

$request -> header ('Accept-Language' => 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3');

print "[-] sending request to $url ...\n";

my $html = $useragent -> request ($request);

print "[-] get response !\n";

my $tree = new HTML::TreeBuilder;

$tree -> parse ($html -> content ());

$tree -> eof ();

my $html_tag = $tree -> elementify ();

# my @decendants = $html_tag -> descendants ();

# maybe the parameter could be more exciting :-)

my @find_tags = $html_tag -> find_by_tag_name ('a', 'script', 'img');

foreach (@find_tags) {

next if !defined ($_) and $_ -> tag () eq '';

&{$disp_func {$_ -> tag ()}} ($_);

}

$tree -> delete ();

sub get_tags {

my ($tag, $node) = @_;

return if ! $tag;

}

sub _usage {

print "usage: filter_html.pl ";

exit;

}

# debug

# my $tag = $val -> tag (); # get 'html'

# TODO : parse start

# my $items = $tree -> findnodes ('/html/body//a');

# for my $item ($items -> get_nodelist ()) {

# my $str = $item -> content -> [0];

# print "$str\n";

# }

# print $html -> as_string ();

# print $html -> content ();

# my @line = $html -> content ();

# /<(\S*?)[^>]>.*?<\/\1>|<.*?\/>/

# foreach (@line) {

# print "$_\n";

# }

__END__

注:perl中可能没有需要的HTML::TreeBuilder模块和Data::Dumper模块,可以CPAN自己下载安装。

运行环境: Linux 3.18.0-kail3-amd64 #1 SMP Debian x86_64 GNU / Linux

Perl: v5.14.2 built for x86_64-linux-gnu-thread-multi

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值