Heritrix Crawling: Advanced Topics

Kuiiiiiiie


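Crawling web pages with Heritrix is straightforward: spend half a day reading my earlier blog posts and you should be able to run a crawl job without trouble. During a crawl, however, you may run into the following issues:
1 Crawling only pages of a specific format or meeting specific requirements
What to do depends on the particular site, mainly on how its outgoing links are written. If a link looks like <a href="http://www.xxx.xxx.xx...." ..> and points directly at a concrete URL, then that complete URL is what should be added to the URIs to crawl. If the leading part such as http://www is omitted and the link simply points to a page within the same site, remember to prepend the missing prefix so that what you add is a complete URL. Based on the pages of the CCER website, I wrote CCERExtractor.java to do the filtering, so that only URLs meeting the conditions are crawled.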

    Java code
    
  
 package org.archive.crawler.extractor;  
   
 import java.io.IOException;  
 import java.util.regex.Matcher;  
 import java.util.regex.Pattern;  
   
 import org.apache.commons.httpclient.URIException;  
 import org.archive.crawler.datamodel.CrawlURI;  
 import org.archive.io.ReplayCharSequence;  
 import org.archive.util.HttpRecorder;  
   
 /** 
  * Extracts links from CCER pages and keeps only those that pass the filter. 
  * Logging is omitted for brevity. 
  * @author Administrator 
  * 
  */  
 public class CCERExtractor extends Extractor{  
       
     /** 
      * Rules applied to each href value: 
      * starts with http                      ----        keep it only if it is under http://www.pku.edu.cn 
      * else starts with mailto or javascript ----        skip it 
      * else (link relative to the site)      ----        keep it, after prepending the site base URL 
      */  
     public static final String pattern_ahref = "<[aA] href=\"([^\"]+)\"";// group(1) captures the href value  
       
     public CCERExtractor(String name){  
         super(name,"CCER Extractor");  
     }  
       
     public CCERExtractor(String name, String description) {  
         super(name, description);  
     }  
   
     @Override  
     protected void extract(CrawlURI curi) {  
         HttpRecorder hr = curi.getHttpRecorder();  
         if (hr == null) {// nothing was recorded for this URI, so there is nothing to parse  
             return;  
         }  
         ReplayCharSequence cs = null;  
         try {  
             cs = hr.getReplayCharSequence();  
         } catch (IOException e) {  
             e.printStackTrace();  
         }  
         if (cs == null) {  
             return;  
         }  
   
         try {  
             String content = cs.toString();  
             Matcher matcher = Pattern.compile(CCERExtractor.pattern_ahref).matcher(content);  
             while (matcher.find()) {  
                 String newUrl = matcher.group(1).trim();  
                 String lower = newUrl.toLowerCase();  
                 if (newUrl.startsWith("http")) {// absolute link  
                     if (newUrl.startsWith("http://www.pku.edu.cn")) {// case 1: absolute link under the allowed prefix  
                         createAndAddLinkRelativeToBase(curi, newUrl, Link.NAVLINK_HOP);  
                     }  
                 } else if (!lower.startsWith("mailto") && !lower.startsWith("javascript")) {// case 2: relative link; mailto and javascript hrefs are ignored  
                     if (newUrl.startsWith("/")) {  
                         newUrl = newUrl.substring(1).trim();  
                     }  
                     newUrl = "http://www.ccer.pku.edu.cn/cn/" + newUrl;// prepend the site base so the URL is complete  
                     createAndAddLinkRelativeToBase(curi, newUrl, Link.NAVLINK_HOP);  
                 }  
             }  
         } finally {  
             try {  
                 cs.close();// release the replay buffer once parsing is done  
             } catch (IOException e) {  
                 e.printStackTrace();  
             }  
         }  
     }  
   
     private void createAndAddLinkRelativeToBase(CrawlURI curi, String newUrl, char hopType){  
         try {  
             curi.createAndAddLinkRelativeToBase(newUrl, "", hopType);  
         } catch (URIException e) {  
             e.printStackTrace();  
         }  
     }  
 }  
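
The filtering rules can be tried out without a running crawler. Below is a minimal standalone sketch, assuming the same regex and the same two prefixes as CCERExtractor; the class name CcerLinkFilterSketch and the method acceptedLinks are made up for illustration and are not part of Heritrix. Unlike the extractor, it trims each href once up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch of the CCERExtractor filtering rules; uses no Heritrix classes.
class CcerLinkFilterSketch {
    // Same regex as CCERExtractor.pattern_ahref; group(1) is the href value.
    static final Pattern A_HREF = Pattern.compile("<[aA] href=\"([^\"]+)\"");
    // Prefixes copied from the extractor.
    static final String ABSOLUTE_PREFIX = "http://www.pku.edu.cn";
    static final String RELATIVE_BASE = "http://www.ccer.pku.edu.cn/cn/";

    // Returns, in document order, the URLs the rules would accept for the given HTML.
    static List<String> acceptedLinks(String html) {
        List<String> accepted = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            String lower = href.toLowerCase();
            if (href.startsWith("http")) {
                // Absolute link: keep it only under the allowed prefix.
                if (href.startsWith(ABSOLUTE_PREFIX)) {
                    accepted.add(href);
                }
            } else if (!lower.startsWith("mailto") && !lower.startsWith("javascript")) {
                // Relative link: drop a leading slash, then prepend the site base.
                if (href.startsWith("/")) {
                    href = href.substring(1).trim();
                }
                accepted.add(RELATIVE_BASE + href);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/paper/1.html\">p</a>"
                + "<a href=\"mailto:someone@example.com\">m</a>";
        for (String url : acceptedLinks(html)) {
            System.out.println(url);
        }
    }
}
```

Running the rules on a few sample pages this way makes it easier to spot links the regex misses (for example, anchors where href is not the first attribute) before starting a long crawl.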

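Add the new extractor to Processor.options under the modules directory, and it will then appear as an option when you configure a job. Note, however, that the crawler follows the configured processor order strictly, so CCERExtractor must come after ExtractorHTTP in the job configuration. Its position inside the options file does not matter: as long as it is listed among the extractors, the order there has no effect.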

2 The single-thread problem (see my earlier post: http://hanyuanbo.iteye.com/blog/788177 )

3 Machine memory limits (in the run configuration, set the VM arguments to -Xmx512m. See the attachment)
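
To confirm that the VM argument actually took effect, the JVM can report its own heap ceiling. A minimal sketch, assuming you run it with the same VM arguments as the crawler; the class name HeapCheckSketch is made up for illustration:

```java
// Prints the maximum heap size the running JVM is allowed to use.
class HeapCheckSketch {
    // Maximum heap in MiB, as reported by the JVM itself.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

With -Xmx512m the reported value should be close to 512, though some JVMs report slightly less because part of the heap is reserved internally.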

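4 The robots.txt restriction
robots.txt is a "gentlemen's agreement". Ignoring it can speed up a crawl, although some sites may not want you to crawl them so freely. Doing so is simple. In the class

    Java code
    
 org.archive.crawler.prefetch.PreconditionEnforcer

change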

    Java code
    
  
 private boolean considerRobotsPreconditions(CrawlURI curi) {  
          ...  
 }  

to

    Java code
    
  
 private boolean considerRobotsPreconditions(CrawlURI curi) {  
         return false;// never take the robots.txt restriction into account  
 }  
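
Hardcoding return false removes the check for every crawl until the source is changed back. A minimal sketch of a switchable variant follows, assuming a hypothetical system property named crawler.ignoreRobots (this is not a Heritrix setting) and a stand-in class rather than the real PreconditionEnforcer:

```java
// Stand-in for the robots.txt decision; illustrative only, not Heritrix code.
class RobotsSwitchSketch {
    // Hypothetical property name chosen for this example; not a Heritrix setting.
    static final String IGNORE_ROBOTS_PROPERTY = "crawler.ignoreRobots";

    // True when the robots.txt precondition should still be evaluated.
    static boolean shouldConsiderRobots() {
        // Boolean.getBoolean reads a system property and is true only for "true".
        return !Boolean.getBoolean(IGNORE_ROBOTS_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println("Consider robots.txt: " + shouldConsiderRobots());
    }
}
```

Under this assumption, the patched method would return false only when shouldConsiderRobots() is false and otherwise fall through to its original body, so a crawl started with -Dcrawler.ignoreRobots=true skips the check while the default behavior stays polite.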

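5 Sometimes the drop-down list of selectable options is missing during configuration! (In the Heritrix run configuration, go to User Entries under Classpath, click Advanced on the right, choose External Folder, and select the conf directory. See the attachment)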

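With that, the crawl can begin. Configure the job as described in my earlier posts, and when choosing extractors remember to select CCERExtractor and place it after ExtractorHTML. The crawl proceeds as follows: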
