A demo script showing how to use the CPAN module HTML::LinkExtor


Last week I got a large book, the "Perl Cookbook". It mentions a useful module, HTML::LinkExtor, which looks handy. I wanted to crawl some documents from the MVS-OE archive web page, so I wrote a small script that demonstrates how to use the module.


 
 
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

binmode STDOUT, ':utf8';
my $url  = "http://www2.marist.edu/htbin/wlvindex?mvs-oe";
my $base = "http://www2.marist.edu/htbin";

# First pass: collect the links on the archive index page.
my $ref_links = extract_link($url, "", "a", "href");
foreach my $sub_url (@$ref_links)
{
     print "Parsing sub url: $sub_url\n";
     # Second pass: collect the links on each sub page, resolved against $base.
     my $thread_links = extract_link($sub_url, $base, "a", "href");
     foreach my $link (@$thread_links)
     {
         print "GET\n";
         print "$link\n";
         get($link);   # fetch the document; the content is discarded in this demo
     }
}


# Return a reference to an array of absolute URLs taken from the given
# attribute of the given tag on the page at $url.
sub extract_link
{
   my $url       = shift;
   my $base      = shift;   # base URL; if empty, the base of the response is used
   my $mytag     = shift;   # HTML tag name, such as a, form ...
   my $attr_name = shift;   # attribute that holds the link, such as href

   $base =~ s/\/$//;

   my $ua = LWP::UserAgent->new or die "Cannot create user agent";

   # Set up a callback that collects the links we are interested in.
   # It is an anonymous sub so that every call to extract_link gets a
   # closure over its own @links, $mytag and $attr_name. A named sub
   # nested here would keep pointing at the variables of the first call.
   my @links = ();
   my $callback = sub {
      my ($tag, %attr) = @_;
      return if $tag ne $mytag;   # we only look closer at <$mytag ...>
      push(@links, $attr{$attr_name}) if defined $attr{$attr_name};
   };

   # Make the parser. Unfortunately, we don't know the base yet
   # (it might be different from $url)
   my $p = HTML::LinkExtor->new($callback);

   # Request document and parse it as it arrives
   my $res = $ua->request(HTTP::Request->new(GET => $url),
                          sub { $p->parse($_[0]) });
   die "GET $url failed: " . $res->status_line unless $res->is_success;
   $p->eof;

   # Expand all URLs to absolute ones
   $base ||= $res->base;
   @links = map { url($_, $base)->abs } @links;

   return \@links;
}
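
The script above needs network access to do anything. To try HTML::LinkExtor on its own, a minimal offline sketch could look like the one below. It parses an HTML string instead of a live page, passes no callback so the parser stores the links itself, and gives the base URL to the constructor so the links come back already absolute. The sample HTML and the example.com base URL are made up for illustration and are not part of the MVS-OE archive.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page and base URL, used only for illustration.
my $html = <<'HTML';
<html><body>
<a href="/docs/one.html">One</a>
<img src="logo.png">
<a href="two.html">Two</a>
</body></html>
HTML
my $base = "http://example.com/archive/";

# With no callback the parser stores the links internally; with a base
# URL it converts relative links to absolute URI objects.
my $p = HTML::LinkExtor->new(undef, $base);
$p->parse($html);
$p->eof;

my @hrefs;
foreach my $link ($p->links) {
    my ($tag, %attr) = @$link;     # each entry is [tag, attr => url, ...]
    next unless $tag eq 'a' && defined $attr{href};
    push @hrefs, "$attr{href}";    # stringify the URI object
}
print "$_\n" for @hrefs;
```

Only the two <a> links are kept; the <img> link is seen by the parser but skipped by the tag check, which is the same filtering that extract_link does in its callback.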


