Designing Scheduled Crawl Jobs in Heritrix

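Heritrix is a flexible, highly configurable crawler. Some deployments, however, should not let users change the crawl rules, so parts of the web UI have to be hidden. Others need crawls that run on a schedule and can be controlled (stopped and started). The idea is fairly simple: create a job on one page, read its details on another page with request.getParameter(), and use those details together with the handler to start a crawl task. The crawl task (CustomThread) implements Runnable, and its run method loops on while(f){}. To stop the task, send a request from the web UI that sets f=false; the thread sees that f is false and the crawl task ends. To start a crawl task again, create a new CustomThread. The required changes are as follows: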

1. Modify webapps.admin.jobs.jsp and save it as (webapps.admin.)planjobs.jsp

<%@include file="/include/handler.jsp"%>
<%@ page import="org.archive.crawler.datamodel.CrawlOrder,org.archive.crawler.admin.CrawlJob,java.util.List" %>
<%@ page pageEncoding="utf-8"%>
<%
  //  String sAction = request.getParameter("action");
 //   if(sAction != null){
        // Need to handle an action    
  //      if(sAction.equalsIgnoreCase("delete")){
  //          handler.deleteJob(request.getParameter("job"));
    //    }
   // }    

    String title = "Crawl jobs";
    int tab = 3;
%>

<%@include file="/include/head.jsp"%>

<% 
    if(request.getParameter("message") != null &&
        request.getParameter("message").length() > 0) {
%>
    <p>
        <span class="flashMessage" style="display:none"><b><%=request.getParameter("message")%></b></span>
<% } %>

<% if(handler.isCrawling()){ %>
    <h2>活动任务 - <i><%=handler.getCurrentJob().getJobName()%></i></h2>
 
<% } %>

<h2>创建新的计划任务</h2>
    <ul>
	
	<li><a href="<%=request.getContextPath()%>/jobs/plannew.jsp">
	新建计划任务</a></li>
    </ul>

The resulting page is served at:

http://localhost/planjobs.jsp

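Notes:

      The console (控制台) and jobs (任务) tabs are the original Heritrix functionality.

      The scheduled-job console (计划任务控制台) and scheduled-job (计划任务) tabs are what this post covers.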

2. Modify webapps.admin.jobs.new.jsp and save it as plannew.jsp

Clicking the "新建计划任务" (new scheduled job) link opens http://localhost/jobs/plannew.jsp

The corresponding code is:

<%@include file="/include/handler.jsp"%>

<%@ page import="org.archive.crawler.datamodel.CrawlOrder" %>
<%@ page import="org.archive.crawler.admin.ui.JobConfigureUtils" %>
<%@ page import="org.archive.crawler.settings.ComplexType" %>
<%@ page import="org.archive.crawler.settings.CrawlerSettings" %>
<%@ page import="org.archive.crawler.settings.XMLSettingsHandler" %>

<%@ page import="java.io.BufferedReader" %>
<%@ page import="java.io.FileReader" %>
<%@ page import="java.util.regex.Pattern" %>
<%@ page import="java.util.Iterator" %>
<%@ page import="java.io.File" %>
<%@ page pageEncoding="utf-8"%>

<%@include file="/include/head.jsp"%>

        <form name="frmNew" method="post" action="../../plancontrol.jsp">      
            <b>

            </b>
            <p>            
            <table>
              <tr>
                    <td>
                       时间间隔:
                    </td>
                    <td>
                        <input maxlength="38" name="invervaltime"  style="width: 50px">小时
                    </td>
                </tr>
                <tr>
                    <td>
                       计划任务名称:
                    </td>
                    <td>
                        <input maxlength="38" name="meta/name"  style="width: 440px">
                    </td>
                </tr>

                <tr>
                    <td>
                        计划任务描述:
                    </td>
                    <td>
                        <input name="meta/description"  style="width: 440px">
                    </td>
                </tr>
                <tr>
                    <td valign="top">
                        网站:
                    </td>
                    <td><font size="-1">在下面输入网址、IP或者IP段,每行一个</font></br>
                        <textarea name="seeds" style="width: 440px" rows="8"></textarea>
                    </td>
                </tr>
                <tr>
                <td colspan="2" align="center">

    <input type="submit" value="submit">

                </td>
                </tr>
            </table>
        </form>


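3. Modify webapps.admin.index.jsp and save it as webapps.admin.index.plancontrol.jsp

http://localhost/plancontrol.jsp shows a page similar to index.jsp. I no longer have the environment and the code is incomplete, so there is no screenshot.

The code of plancontrol.jsp is: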

<%@include file="/include/handler.jsp"%>
<%@ page import="org.archive.crawler.admin.CrawlJob" %>
<%@ page import="org.archive.crawler.Heritrix" %>

<%@ page import="org.archive.crawler.admin.CustomThread" %>
<%@ page import="org.archive.crawler.admin.StatisticsTracker" %>
<%@ page import="org.archive.util.ArchiveUtils" %>
<%@ page import="org.archive.util.TextUtils" %>
<%@ page import="javax.servlet.jsp.JspWriter" %>
<%@ page import="java.util.Iterator" %>
<%@ page import="java.util.regex.Pattern" %>
<%@ page import="org.archive.crawler.datamodel.CrawlOrder" %>
<%@ page import="org.archive.crawler.admin.ui.JobConfigureUtils" %>
<%@ page import="org.archive.crawler.settings.ComplexType" %>
<%@ page import="org.archive.crawler.settings.CrawlerSettings" %>
<%@ page import="org.archive.crawler.settings.XMLSettingsHandler" %>

<%@ page import="org.archive.crawler.admin.PlanJob" %>
<%@ page pageEncoding="utf-8"%> 
<%
/* if (session.getAttribute("login")==null)
{
response.sendRedirect(request.getContextPath() + "/login.jsp"); 
} */

%>
<%!
	private void printTime(final JspWriter out,long time)
    throws java.io.IOException {
	    out.println(ArchiveUtils.formatMillisecondsToConventional(time,false));
	}
%>

<%
    CrawlJob theJob = handler.getJob(request.getParameter("job"));
    boolean isProfile = "true".equals(request.getParameter("profile"));
     String recovery = request.getParameter("recover");
    if (theJob == null) {
        // Ok, use default profile then.
        theJob = handler.getDefaultProfile();
        if(theJob == null){
            // ERROR - This should never happen. There must always be at least
            // one (default) profile.
            out.println("ERROR: NO PROFILE FOUND");
            return;
        }
    } 
    
    XMLSettingsHandler settingsHandler = theJob.getSettingsHandler();
    CrawlOrder crawlOrder = settingsHandler.getOrder();
    CrawlerSettings orderfile = settingsHandler.getSettingsObject(null);
    //System.out.println("fpath"+settingsHandler.getOrderFile());
    String error = null;
    String metaName = request.getParameter("meta/name");
    String jobDescription = request.getParameter("meta/description");
    String intervaltime = request.getParameter("invervaltime");
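    // operatorname is used below, so it must be declared; fall back to "" when nobody is logged in.
    String operatorname = session.getAttribute("login") != null
        ? session.getAttribute("login").toString() : "";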
    CrawlJob newJob = null;

 // if(request.getParameter("action") != null) {
  if(true) {
   // Make new job.
        

      
   if(error == null) {
   
   if(isProfile) {
    CrawlJob test = handler.getJob(metaName);
               
                if(test == null) {    
                    // unique name
                    newJob = handler.newProfile(theJob, metaName,
                        jobDescription,
                        request.getParameter("seeds"),operatorname);
                       
                } else {      //test not null
                    // Need a unique name!
                    error = "Profile name must be unique!";
                }
    }
   
   }  
  }

%>

<%
    String sAction = request.getParameter("action");
    if(sAction != null) {
        if(sAction.equalsIgnoreCase("logout")) {
            // Logging out.
            session = request.getSession();
            if (session != null) {
                session.invalidate();
                // Redirect back to here and we'll get thrown to the login
                // page.
                response.sendRedirect(request.getContextPath() + "/login.jsp"); 
            }
        }
    }

    String title = "Admin Console";
    int tab = 2;
    
  
 //   PlanJob p = (PlanJob)application.getAttribute("planjob");
     CustomThread c = (CustomThread)application.getAttribute("planjob");
  
   if (c == null && request.getParameter("seeds")!=null)
   {
   
   c = new CustomThread(handler,newJob,metaName,jobDescription,theJob,recovery,request.getParameter("seeds"),CrawlJob.PRIORITY_AVERAGE,intervaltime,operatorname);
   application.setAttribute("planjob", c);
   }
   
   if (c != null && request.getParameter("seeds")!=null && c.getState().toString().equals("TERMINATED"))
   {
   
   c = new CustomThread(handler,newJob,metaName,jobDescription,theJob,recovery,request.getParameter("seeds"),CrawlJob.PRIORITY_AVERAGE,intervaltime,operatorname);
   application.setAttribute("planjob", c);
   }

%>

<%
String planstate = request.getParameter("saction");

if (planstate != null) {
    if (planstate.equals("stopplan")) {
        System.out.println("runstop");
        // c is null when no scheduled job has been created yet
        if (c != null) {
            c.setf(false);
        }
        response.sendRedirect("./plancontrol.jsp");
        return; // stop rendering after the redirect
    }

    if (planstate.equals("startplan")) {
        System.out.println("run");
        if (c == null || c.getSeeds() == null) {
            // "Please enter the complete scheduled-job details!"
            out.print("<script>alert('请输入完整计划任务信息!');</script>");
            return;
        }
        if (c.getState().toString().equals("RUNNABLE") || c.getState().toString().equals("TIMED_WAITING")) {
            // "The scheduled job is already running"
            out.print("<script>alert('计划任务已经开启');</script>");
            return;
        }
        c.start();
        response.sendRedirect("./plancontrol.jsp");
        return; // stop rendering after the redirect
    }
}

%>


<%@include file="/include/head.jsp"%>
<html>
<head>
 
    <script type="text/javascript">
        function doTerminateCurrentJob(){
            if(confirm("Are you sure you wish to terminate the job currently being crawled?")){
                document.location = '<%out.print(request.getContextPath());%>/console/action.jsp?action=terminate';
            }
        }    
    </script>
   </head>

    <body>
    <table border="0" cellspacing="0" cellpadding="0">
	<tr height="10"><td></td></tr>
	<tr><td>
    <fieldset style="width: 750px">
        <legend> 
        <p>
        <b><span class="legendTitle" >计划任务状态:</span> </p>
        <%
        if(c==null||c.getState().toString().equals("TERMINATED"))
        {
        out.print("无计划任务");
        
        }     
        
        else
        {
        out.print(        (c.getState().toString().equals("RUNNABLE")||c.getState().toString().equals("TIMED_WAITING"))
            ? "<span class='status crawling' style='font-size: 11pt'>计划任务进行中 |</span></b>"
              +"<a href='"+request.getContextPath()+"/plancontrol.jsp?saction=stopplan' style='font-size: 11pt'>终结</a>"
            : "<span class='status holding' style='font-size: 11pt'>计划任务终结 |</span></b>"
              +"<a href='"+request.getContextPath()+"/plancontrol.jsp?saction=startplan' style='font-size: 11pt'>开启</a>");
        
        }
         %>
        

       

        <p><b><span class="legendTitle" >爬取状态: 正在爬取任务</span> 
  <!--     <p> <%= handler.isRunning() 
            ? "<span class='status crawling' style='font-size: 11pt'>正在爬取任务 |</span></b>"
              +"<a href='"+request.getContextPath()+"/console/action.jsp?action=stop' style='font-size: 11pt'>挂起</a>"
            : "<span class='status holding' style='font-size: 11pt'>任务挂起 |</span></b>"
              +"<a href='"+request.getContextPath()+"/console/action.jsp?action=start' style='font-size: 11pt'>开始</a>"
        %> </b>  -->
        </legend>
 <!--       <div style="float:right;padding-right:50px; display:none">
	        <b>Memory</b><br>
	        <div style="padding-left:20px">
		        <%=(Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/1024%> KB 
		        used<br>
		        <%=(Runtime.getRuntime().totalMemory())/1024%> KB
		        current heap<br>
		        <%=(Runtime.getRuntime().maxMemory())/1024%> KB
		        max heap
	        </div>
	        -->
	    </div>
		
        <b></b>
        <div style="padding-left:20px; display:none">
			<%= handler.getCurrentJob()!=null
			    ? shortJobStatus+": <i>"
			      +handler.getCurrentJob().getJobName()+"</i>"
			    : ((handler.isRunning()) ? "None available" : "None running")
			 %><br>
	        <%= handler.getPendingJobs().size() %> pending,
	        <%= handler.getCompletedJobs().size() %> completed
        </div>

	        
         </fieldset>
            <%
            	long begin, end;
	            if(stats != null) {
	                begin = stats.successfullyFetchedCount();
	                end = stats.totalCount();
	                if(end < 1) {
	                    end = 1;
	                }
	            } else {
                    begin = 0;
                    end = 1;
	            }
                
                if(handler.getCurrentJob() != null)
                {
                    final long timeElapsed, timeRemain;
                    if(stats == null) {
                        timeElapsed= 0;
                        timeRemain = -1;
                    } else {
	                    timeElapsed = (stats.getCrawlerTotalElapsedTime());
	                    if(begin == 0) {
	                        timeRemain = -1;
	                    } else {
	                        timeRemain = ((long)(timeElapsed*end/(double)begin))-timeElapsed;
	                    }
                    }
            %>
            <fieldset style="width: 750px">
               <legend>
               <b><span class="legendTitle">任务状态:</span>
			 
               <%= 
               "<span class='status "
              +shortJobStatus+"' style='display:none'>"
               +shortJobStatus+"</span>"
               %>
               </b> 
<%      
    if(handler.isCrawling()) {
	    if ((handler.getCurrentJob().getStatus().
                equals(CrawlJob.STATUS_PAUSED) ||
            handler.getCurrentJob().getStatus().
			    equals(CrawlJob.STATUS_WAITING_FOR_PAUSE))) {
		//	out.println("暂停");
  //          out.println("| <a href='/console/action.jsp?action=resume'>" +
       //         "继续</a>");
            /*out.println(" | ");
            out.println("<a href=\"");
            out.println(request.getContextPath());
            out.println("/console/action.jsp?action=checkpoint\">" +
                "检查站</a>");**/
        } else if (!handler.getCurrentJob().isCheckpointing()) {
		//    out.println("正在运行");
    //        out.println("| <a href=\"");
    //        out.println(request.getContextPath());
    //        out.println("/console/action.jsp?action=pause\">暂停</a> ");
           /* if (!handler.getCurrentJob().getStatus().
                   equals(CrawlJob.STATUS_PENDING)) {
                out.println(" | ");
                out.println("<a href=\"");
                out.println(request.getContextPath());
                out.println("/console/action.jsp?action=checkpoint\">" +
                    "检查站</a>");
            }**/
        }
  //      out.println(" | <a href='javascript:doTerminateCurrentJob()'>" +
     //       "终止</a>");
    }
%>
               </legend>

                <%
                  if(handler.isCrawling() && stats != null)
                  {
                %>
                	<div style="float:right; padding-right:50px; display:none">
                	    <b>Load</b>
            			<div style="padding-left:20px">
			            	<%=stats.activeThreadCount()%> active of <%=stats.threadCount()%> threads
			            	<br>
			            	<%=ArchiveUtils.doubleToString((double)stats.congestionRatio(),2)%>
			            	congestion ratio
			            	<br>
			            	<%=stats.deepestUri()%> deepest queue
			            	<br>
			            	<%=stats.averageDepth()%> average depth
						</div>
					</div>
	     <!--           <b>速率</b>        
	    <!--            <div style="padding-left:20px">
		                <%=ArchiveUtils.doubleToString(stats.currentProcessedDocsPerSec(),2)%> 		                
		                URIs/秒
		                (平均速率<%=ArchiveUtils.doubleToString(stats.processedDocsPerSec(),2)%> )
		                <br>
		                <%=stats.currentProcessedKBPerSec()%>
						KB/秒
						(平均速率<%=stats.processedKBPerSec()%> )
					</div>

                    <b>时间</b>
                    <div class='indent'>已用时间
	                    <%= ArchiveUtils.formatMillisecondsToConventional(timeElapsed,false) %>
						
						<br>    -->
	                    <%
	                       if(timeRemain != -1) {
	                    %> 估计剩余时间
		                    <%= ArchiveUtils.formatMillisecondsToConventional(timeRemain,false) %>
		                   
		               	<%
	                       }
                   		%>
					</div>
                    <b>总计</b>
                	<%
                          }
                }
                if(stats != null)
                {
	                int ratio = (int) (100 * begin / end);
            %>
                            <center>
                            <table border="0" cellpadding="0" cellspacing= "0" width="600"> 
                                <tr>
                                    <td align='right' width="25%">已下载 <%= begin %> </td>
                                    <td class='completedBar' width="<%= (int)ratio/2 %>%" align="right">
                                    <%= ratio > 50 ? "<b>"+ratio+"</b>% " : "" %>
                                    </td>
                                    <td class='queuedBar' align="left" width="<%= (int) ((100-ratio)/2) %>%">
                                    <%= ratio <= 50 ? " <b>"+ratio+"</b>%" : "" %>
                                    </td>
                                    <td width="25%" nowrap> <%= stats.queuedUriCount() %> 排队</td>
                                </tr>
                            </table>
                            文件总数 <%= end %><br>      
                    		<!--<%=stats.crawledBytesSummary()%>-->
                            </center>
            <%
                }
                if (handler.getCurrentJob() != null &&
                	handler.getCurrentJob().getStatus().equals(CrawlJob.STATUS_PAUSED)) {
            %>
            		<!--<b>Paused Operations</b>
            		<div class='indent'>
	                	<a href='<%= request.getContextPath() %>/console/frontier.jsp'>View or Edit Frontier URIs</a>
	                </div>-->
	        <%
            	}
            %>
    </fieldset>
    </td></tr>
    <tr><td>
    
	<a href="plancontrol.jsp">刷新</a>
    </td></tr>
    <tr><td>
        <p>
            
        <p>
            
    </td></tr>
    <tr><td>
        <% if (heritrix.isCommandLine()) {  
            // Print the shutdown only if we were started from command line.
            // It makes no sense when in webcontainer mode.
         %>
        <a href="<%=request.getContextPath()%>/console/shutdown.jsp">关闭软件</a>
		 
        <% } %><p>
        <a href="<%=request.getContextPath()%>/index.jsp?action=logout">退出登录</a>
    </td></tr></table>
<p></p>
		 
</body>
</html>

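About the code:

The page uses a CustomThread class. Because the page calls start() and getState() on it, it has to be a Thread subclass. This is the class that runs the job on a schedule; its implementation is covered in step 4.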

c = new CustomThread(handler,newJob,metaName,jobDescription,theJob,recovery,request.getParameter("seeds"),CrawlJob.PRIORITY_AVERAGE,intervaltime,operatorname);
application.setAttribute("planjob", c);
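The page keeps the thread in the application scope and builds a new one when none exists or the previous one has reached TERMINATED, since a Java thread cannot be started twice. The sketch below isolates that rule in plain Java. PlanSlot and its methods are hypothetical names for illustration, not part of Heritrix or of the original project, and the field stands in for application.getAttribute("planjob").

```java
/** Hypothetical holder mirroring the JSP's "create if absent or TERMINATED" rule. */
class PlanSlot {
    private Thread current;   // plays the role of application.getAttribute("planjob")

    /** Returns the stored thread, replacing it when none exists or the old one has finished. */
    synchronized Thread getOrCreate(Runnable body) {
        if (current == null || current.getState() == Thread.State.TERMINATED) {
            current = new Thread(body, "planjob");
        }
        return current;
    }

    /** Starts the thread only if it has never been started; returns whether it did. */
    synchronized boolean startIfNew() {
        if (current != null && current.getState() == Thread.State.NEW) {
            current.start();
            return true;
        }
        return false;
    }
}

class PlanSlotDemo {
    public static void main(String[] args) throws InterruptedException {
        PlanSlot slot = new PlanSlot();
        Thread first = slot.getOrCreate(() -> {});
        System.out.println(slot.startIfNew());   // true: state was NEW
        first.join();
        System.out.println(slot.startIfNew());   // false: a finished thread cannot be restarted
        Thread second = slot.getOrCreate(() -> {});
        System.out.println(first != second);     // true: the TERMINATED thread was replaced
    }
}
```

Comparing against the Thread.State enum avoids the string comparisons used in the page, but it only works if the stored object really is a java.lang.Thread, which is an assumption here.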

4. CustomThread.java

private CrawlJobHandler handler;
	private CrawlJob newJob;
	private String  intervaltime;  //hours
	private String metaName;
	private String jobDescription;
	private CrawlJob theJob;
	private String recovery;
	private String seeds;
	private int PRIORITY_AVERAGE;
	private String operator;
	private volatile boolean f = true;  // volatile: written by the request thread, read by the crawl thread

	public CustomThread(CrawlJobHandler handler, CrawlJob newJob,
			String metaName, String jobDescription, CrawlJob theJob,
			String recovery, String seeds, int PRIORITY_AVERAGE,String intervaltime,String operator) {
		this.handler = handler;
		this.newJob = newJob;
		this.metaName = metaName;
		System.out.println(this.metaName);
		this.jobDescription = jobDescription;
		this.theJob = theJob;
		this.recovery = recovery;
		this.seeds = seeds;
		this.PRIORITY_AVERAGE = PRIORITY_AVERAGE;
		this.intervaltime = intervaltime;
		this.operator = operator;
		
		
	}


With the fields above, the job can be created and started:

this.newJob = handler.newJob(theJob, recovery, metaName,
     jobDescription, seeds, CrawlJob.PRIORITY_AVERAGE,operator);
handler.ensureNewJobWritten(newJob, metaName, jobDescription);
handler.addJob(newJob);
handler.startCrawler();

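Put these calls inside the loop of the run method:

public void run() {
    while (f) {
        // the handler calls above go here
    }
}

Then add a Thread.sleep() inside the loop, with intervaltime (entered in hours) converted to milliseconds, and the job runs on a schedule.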
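The loop described above can be sketched in plain Java as follows. This is a minimal sketch under assumptions, not the project's actual class: PlanThread, stopPlan and hoursToMillis are hypothetical names, the Runnable passed in stands in for the handler calls, and the interval is assumed to be a whole number of hours. It also interrupts the thread on stop, because a thread sleeping for hours would otherwise notice f=false only when it wakes up.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for CustomThread: repeats a task until told to stop. */
class PlanThread extends Thread {
    private final Runnable task;        // in the post: the handler.newJob(...) ... startCrawler() calls
    private final long intervalMillis;
    private volatile boolean f = true;  // volatile so the write from another thread is seen here

    PlanThread(Runnable task, long intervalMillis) {
        this.task = task;
        this.intervalMillis = intervalMillis;
    }

    /** Converts the form's interval (whole hours, as text) to milliseconds. */
    static long hoursToMillis(String hours) {
        return TimeUnit.HOURS.toMillis(Long.parseLong(hours.trim()));
    }

    /** Clears the flag and interrupts the sleep so the thread stops promptly. */
    void stopPlan() {
        f = false;
        interrupt();
    }

    @Override
    public void run() {
        while (f) {
            task.run();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // woken early; the loop condition decides whether to continue
            }
        }
    }
}

class PlanThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        PlanThread t = new PlanThread(runs::incrementAndGet, 50); // 50 ms instead of hours
        t.start();
        Thread.sleep(180);
        t.stopPlan();
        t.join();
        System.out.println("runs=" + runs.get() + ", state=" + t.getState());
    }
}
```

In the real page the interval would come from PlanThread.hoursToMillis(intervaltime), and a non-numeric value would need to be rejected before the thread is created.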

5. Controlling the job

Just follow the pattern of the original console.jsp; changing f is enough to control the job.

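Summary: the above comes from a project I worked on during my master's studies that monitored the content health of websites. It implemented page crawling, word segmentation of page text, detection and database storage of unhealthy pages (parent domain, subdomain, page snapshot), per-crawl counts of unhealthy pages, and capturing IPs and their corresponding domain names with jpcap.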

This post only offers one approach. The scheduling method may not be the best one, and corrections are welcome.
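One alternative worth considering is the JDK's ScheduledExecutorService, which replaces the hand-written flag and sleep with a scheduler that can be cancelled. The sketch below is not what the project used; PlanScheduler and its methods are hypothetical names, and the Runnable stands in for the handler calls from step 4.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical scheduler: one scheduled job at a time, run at a fixed delay. */
class PlanScheduler {
    private final ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> plan;

    /** Starts the plan unless one is already active; returns whether it was started. */
    synchronized boolean start(Runnable crawl, long interval, TimeUnit unit) {
        if (plan != null && !plan.isDone()) {
            return false;
        }
        // fixed delay: the next run is scheduled only after the previous call returns
        plan = pool.scheduleWithFixedDelay(crawl, 0, interval, unit);
        return true;
    }

    /** Cancels future runs; a run already in progress is allowed to finish. */
    synchronized void stop() {
        if (plan != null) {
            plan.cancel(false);
        }
    }

    synchronized boolean isActive() {
        return plan != null && !plan.isDone();
    }

    /** Releases the worker thread, for example when the web application shuts down. */
    void shutdown() {
        pool.shutdownNow();
    }
}

class PlanSchedulerDemo {
    public static void main(String[] args) throws InterruptedException {
        PlanScheduler scheduler = new PlanScheduler();
        scheduler.start(() -> System.out.println("crawl"), 1, TimeUnit.HOURS);
        Thread.sleep(100);
        scheduler.stop();
        scheduler.shutdown();
    }
}
```

Unlike a thread, the scheduler can be started again after stop(), so the page would not need to build a new object for each restart. Note that if the task throws an exception, the executor suppresses all later runs, so the task should catch and log its own errors.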

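Thanks to my senior labmate Chen Chen (陈忱) for leading the work, to my senior labmate Yang Jian (杨剑) for the web page changes, and to Jingwen (静雯) and Wenheng (文恒) for working on it together.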
