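Heritrix is a flexible, highly configurable crawler. Some deployments, however, should not let users modify the crawl rules, so the web UI has to hide certain pages. Others need scheduled (periodic) crawls that can be controlled (stopped and started). The idea behind the scheduling requirement is fairly simple: create a job on one page, read the submitted values on another page with request.getParameter(), and use those values together with handler to start a crawl task. The crawl task (CustomThread) implements Runnable (the JSP code below calls start() and getState() on it, so it is used as a Thread), and its run method loops on while(f){}. To stop the task, the web UI sends a request that sets f=false; the thread sees that f is false and ends the crawl task on its own. To start a task again, simply create a new CustomThread. The required changes are as follows: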
1. Modify webapps.admin.jobs.jsp and save the result as (webapps.admin.)planjobs.jsp
<%@include file="/include/handler.jsp"%>
<%@ page import="org.archive.crawler.datamodel.CrawlOrder,org.archive.crawler.admin.CrawlJob,java.util.List" %>
<%@ page pageEncoding="utf-8"%>
<%
// String sAction = request.getParameter("action");
// if(sAction != null){
// Need to handle an action
// if(sAction.equalsIgnoreCase("delete")){
// handler.deleteJob(request.getParameter("job"));
// }
// }
String title = "Crawl jobs";
int tab = 3;
%>
<%@include file="/include/head.jsp"%>
<%
if(request.getParameter("message") != null &&
request.getParameter("message").length() > 0) {
%>
<p>
<span class="flashMessage" style="display:none"><b><%=request.getParameter("message")%></b></span>
<% } %>
<% if(handler.isCrawling()){ %>
<h2>Active job - <i><%=handler.getCurrentJob().getJobName()%></i></h2>
<% } %>
<h2>Create a new scheduled job</h2>
<ul>
<li><a href="<%=request.getContextPath()%>/jobs/plannew.jsp">
New scheduled job</a></li>
</ul>
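The result looks like this:
Notes:
The Console and Jobs tabs are the original features.
The Scheduled Job Console and Scheduled Jobs tabs are what this post covers.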
2. Modify webapps.admin.jobs.new.jsp and save the result as plannew.jsp
Clicking the new-scheduled-job link on that page goes to http://localhost/jobs/plannew.jsp
The corresponding code is:
<%@include file="/include/handler.jsp"%>
<%@ page import="org.archive.crawler.datamodel.CrawlOrder" %>
<%@ page import="org.archive.crawler.admin.ui.JobConfigureUtils" %>
<%@ page import="org.archive.crawler.settings.ComplexType" %>
<%@ page import="org.archive.crawler.settings.CrawlerSettings" %>
<%@ page import="org.archive.crawler.settings.XMLSettingsHandler" %>
<%@ page import="java.io.BufferedReader" %>
<%@ page import="java.io.FileReader" %>
<%@ page import="java.util.regex.Pattern" %>
<%@ page import="java.util.Iterator" %>
<%@ page import="java.io.File" %>
<%@ page pageEncoding="utf-8"%>
<%@include file="/include/head.jsp"%>
<form name="frmNew" method="post" action="../../plancontrol.jsp">
<b>
</b>
<p>
<table>
<tr>
<td>
Interval:
</td>
<td>
<input maxlength="38" name="invervaltime" style="width: 50px"> hours
</td>
</tr>
<tr>
<td>
Scheduled job name:
</td>
<td>
<input maxlength="38" name="meta/name" style="width: 440px">
</td>
</tr>
<tr>
<td>
Scheduled job description:
</td>
<td>
<input name="meta/description" style="width: 440px">
</td>
</tr>
<tr>
<td valign="top">
Sites:
</td>
<td><font size="-1">Enter URLs, IPs, or IP ranges below, one per line</font><br>
<textarea name="seeds" style="width: 440px" rows="8"></textarea>
</td>
</tr>
<tr>
<td colspan="2" align="center">
<input type="submit" value="submit">
</td>
</tr>
</table>
</form>
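3. Modify webapps.admin.index.jsp and save the result as plancontrol.jsp (webapps.admin.plancontrol.jsp)
Opening http://localhost/plancontrol.jsp shows a page similar to index.jsp. I no longer have the environment and the code is incomplete, so there is no screenshot.
The relevant code of plancontrol.jsp is: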
<%@include file="/include/handler.jsp"%>
<%@ page import="org.archive.crawler.admin.CrawlJob" %>
<%@ page import="org.archive.crawler.Heritrix" %>
<%@ page import="org.archive.crawler.admin.CustomThread" %>
<%@ page import="org.archive.crawler.admin.StatisticsTracker" %>
<%@ page import="org.archive.util.ArchiveUtils" %>
<%@ page import="org.archive.util.TextUtils" %>
<%@ page import="javax.servlet.jsp.JspWriter" %>
<%@ page import="java.util.Iterator" %>
<%@ page import="java.util.regex.Pattern" %>
<%@ page import="org.archive.crawler.datamodel.CrawlOrder" %>
<%@ page import="org.archive.crawler.admin.ui.JobConfigureUtils" %>
<%@ page import="org.archive.crawler.settings.ComplexType" %>
<%@ page import="org.archive.crawler.settings.CrawlerSettings" %>
<%@ page import="org.archive.crawler.settings.XMLSettingsHandler" %>
<%@ page import="org.archive.crawler.admin.PlanJob" %>
<%@ page pageEncoding="utf-8"%>
<%
/* if (session.getAttribute("login")==null)
{
response.sendRedirect(request.getContextPath() + "/login.jsp");
} */
%>
<%!
private void printTime(final JspWriter out,long time)
throws java.io.IOException {
out.println(ArchiveUtils.formatMillisecondsToConventional(time,false));
}
%>
<%
CrawlJob theJob = handler.getJob(request.getParameter("job"));
boolean isProfile = "true".equals(request.getParameter("profile"));
String recovery = request.getParameter("recover");
if (theJob == null) {
// Ok, use default profile then.
theJob = handler.getDefaultProfile();
if(theJob == null){
// ERROR - This should never happen. There must always be at least
// one (default) profile.
out.println("ERROR: NO PROFILE FOUND");
return;
}
}
XMLSettingsHandler settingsHandler = theJob.getSettingsHandler();
CrawlOrder crawlOrder = settingsHandler.getOrder();
CrawlerSettings orderfile = settingsHandler.getSettingsObject(null);
//System.out.println("fpath"+settingsHandler.getOrderFile());
String error = null;
String metaName = request.getParameter("meta/name");
String jobDescription = request.getParameter("meta/description");
String intervaltime = request.getParameter("invervaltime");
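Object login = session.getAttribute("login"); // may be null because the login check above is commented out
String operatorname = (login == null) ? "" : login.toString(); // used by newProfile and CustomThread below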
CrawlJob newJob = null;
// if(request.getParameter("action") != null) {
if(true) {
// Make new job.
if(error == null) {
if(isProfile) {
CrawlJob test = handler.getJob(metaName);
if(test == null) {
// unique name
newJob = handler.newProfile(theJob, metaName,
jobDescription,
request.getParameter("seeds"),operatorname);
} else { //test not null
// Need a unique name!
error = "Profile name must be unique!";
}
}
}
}
%>
<%
String sAction = request.getParameter("action");
if(sAction != null) {
if(sAction.equalsIgnoreCase("logout")) {
// Logging out.
session = request.getSession();
if (session != null) {
session.invalidate();
// Redirect back to here and we'll get thrown to the login
// page.
response.sendRedirect(request.getContextPath() + "/login.jsp");
}
}
}
String title = "Admin Console";
int tab = 2;
// PlanJob p = (PlanJob)application.getAttribute("planjob");
CustomThread c = (CustomThread)application.getAttribute("planjob");
if (c == null && request.getParameter("seeds")!=null)
{
c = new CustomThread(handler,newJob,metaName,jobDescription,theJob,recovery,request.getParameter("seeds"),CrawlJob.PRIORITY_AVERAGE,intervaltime,operatorname);
application.setAttribute("planjob", c);
}
if (c != null && request.getParameter("seeds")!=null && c.getState().toString().equals("TERMINATED"))
{
c = new CustomThread(handler,newJob,metaName,jobDescription,theJob,recovery,request.getParameter("seeds"),CrawlJob.PRIORITY_AVERAGE,intervaltime,operatorname);
application.setAttribute("planjob", c);
}
%>
<%
String planstate = request.getParameter("saction");
if(planstate != null){
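if (planstate.equals("stopplan"))
{
System.out.println("runstop");
if (c != null) { // no scheduled thread may have been created yet
c.setf(false);
}
response.sendRedirect("./plancontrol.jsp");
return; // nothing more to render after the redirect
}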
if (planstate.equals("startplan"))
{
System.out.println("run");
if (c != null && (c.getState() == Thread.State.RUNNABLE || c.getState() == Thread.State.TIMED_WAITING))
{
out.print("<script>alert('The scheduled job is already running');</script>");
return;
}
if (c == null || c.getSeeds()== null)
{
out.print("<script>alert('Please enter the complete scheduled job information!');</script>");
return;
}
c.start();
//response.sendRedirect(request.getContextPath() +
// "/jobs.jsp?message=Job created");
response.sendRedirect("./plancontrol.jsp");
//
}
}
%>
<%@include file="/include/head.jsp"%>
<html>
<head>
<script type="text/javascript">
function doTerminateCurrentJob(){
if(confirm("Are you sure you wish to terminate the job currently being crawled?")){
document.location = '<%out.print(request.getContextPath());%>/console/action.jsp?action=terminate';
}
}
</script>
</head>
<body>
<table border="0" cellspacing="0" cellpadding="0">
<tr height="10"><td></td></tr>
<tr><td>
<fieldset style="width: 750px">
<legend>
<p>
<b><span class="legendTitle" >Scheduled job status:</span> </p>
<%
if(c==null||c.getState().toString().equals("TERMINATED"))
{
out.print("No scheduled job");
}
else
{
out.print( (c.getState().toString().equals("RUNNABLE")||c.getState().toString().equals("TIMED_WAITING"))
? "<span class='status crawling' style='font-size: 11pt'>Scheduled job running |</span></b>"
+"<a href='"+request.getContextPath()+"/plancontrol.jsp?saction=stopplan' style='font-size: 11pt'>Stop</a>"
: "<span class='status holding' style='font-size: 11pt'>Scheduled job stopped |</span></b>"
+"<a href='"+request.getContextPath()+"/plancontrol.jsp?saction=startplan' style='font-size: 11pt'>Start</a>");
}
%>
<p><b><span class="legendTitle" >Crawl status: crawling jobs</span>
<!-- <p> <%= handler.isRunning()
? "<span class='status crawling' style='font-size: 11pt'>Crawling jobs |</span></b>"
+"<a href='"+request.getContextPath()+"/console/action.jsp?action=stop' style='font-size: 11pt'>Hold</a>"
: "<span class='status holding' style='font-size: 11pt'>Holding jobs |</span></b>"
+"<a href='"+request.getContextPath()+"/console/action.jsp?action=start' style='font-size: 11pt'>Start</a>"
%> </b> -->
</legend>
<!-- <div style="float:right;padding-right:50px; display:none">
<b>Memory</b><br>
<div style="padding-left:20px">
<%=(Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/1024%> KB
used<br>
<%=(Runtime.getRuntime().totalMemory())/1024%> KB
current heap<br>
<%=(Runtime.getRuntime().maxMemory())/1024%> KB
max heap
</div>
-->
</div>
<b></b>
<div style="padding-left:20px; display:none">
<%= handler.getCurrentJob()!=null
? shortJobStatus+": <i>"
+handler.getCurrentJob().getJobName()+"</i>"
: ((handler.isRunning()) ? "None available" : "None running")
%><br>
<%= handler.getPendingJobs().size() %> pending,
<%= handler.getCompletedJobs().size() %> completed
</div>
</fieldset>
<%
long begin, end;
if(stats != null) {
begin = stats.successfullyFetchedCount();
end = stats.totalCount();
if(end < 1) {
end = 1;
}
} else {
begin = 0;
end = 1;
}
if(handler.getCurrentJob() != null)
{
final long timeElapsed, timeRemain;
if(stats == null) {
timeElapsed= 0;
timeRemain = -1;
} else {
timeElapsed = (stats.getCrawlerTotalElapsedTime());
if(begin == 0) {
timeRemain = -1;
} else {
timeRemain = ((long)(timeElapsed*end/(double)begin))-timeElapsed;
}
}
%>
<fieldset style="width: 750px">
<legend>
<b><span class="legendTitle">Job status:</span>
<%=
"<span class='status "
+shortJobStatus+"' style='display:none'>"
+shortJobStatus+"</span>"
%>
</b>
<%
if(handler.isCrawling()) {
if ((handler.getCurrentJob().getStatus().
equals(CrawlJob.STATUS_PAUSED) ||
handler.getCurrentJob().getStatus().
equals(CrawlJob.STATUS_WAITING_FOR_PAUSE))) {
// out.println("Paused");
// out.println("| <a href='/console/action.jsp?action=resume'>" +
// "Resume</a>");
/*out.println(" | ");
out.println("<a href=\"");
out.println(request.getContextPath());
out.println("/console/action.jsp?action=checkpoint\">" +
"Checkpoint</a>");**/
} else if (!handler.getCurrentJob().isCheckpointing()) {
// out.println("Running");
// out.println("| <a href=\"");
// out.println(request.getContextPath());
// out.println("/console/action.jsp?action=pause\">Pause</a> ");
/* if (!handler.getCurrentJob().getStatus().
equals(CrawlJob.STATUS_PENDING)) {
out.println(" | ");
out.println("<a href=\"");
out.println(request.getContextPath());
out.println("/console/action.jsp?action=checkpoint\">" +
"Checkpoint</a>");
}**/
}
// out.println(" | <a href='javascript:doTerminateCurrentJob()'>" +
// "Terminate</a>");
}
%>
</legend>
<%
if(handler.isCrawling() && stats != null)
{
%>
<div style="float:right; padding-right:50px; display:none">
<b>Load</b>
<div style="padding-left:20px">
<%=stats.activeThreadCount()%> active of <%=stats.threadCount()%> threads
<br>
<%=ArchiveUtils.doubleToString((double)stats.congestionRatio(),2)%>
congestion ratio
<br>
<%=stats.deepestUri()%> deepest queue
<br>
<%=stats.averageDepth()%> average depth
</div>
</div>
<!-- <b>Rates</b>
<!-- <div style="padding-left:20px">
<%=ArchiveUtils.doubleToString(stats.currentProcessedDocsPerSec(),2)%>
URIs/sec
(average rate <%=ArchiveUtils.doubleToString(stats.processedDocsPerSec(),2)%> )
<br>
<%=stats.currentProcessedKBPerSec()%>
KB/sec
(average rate <%=stats.processedKBPerSec()%> )
</div>
<b>Time</b>
<div class='indent'>Elapsed
<%= ArchiveUtils.formatMillisecondsToConventional(timeElapsed,false) %>
<br> -->
<%
if(timeRemain != -1) {
%> Estimated time remaining
<%= ArchiveUtils.formatMillisecondsToConventional(timeRemain,false) %>
<%
}
%>
</div>
<b>Totals</b>
<%
}
}
if(stats != null)
{
int ratio = (int) (100 * begin / end);
%>
<center>
<table border="0" cellpadding="0" cellspacing= "0" width="600">
<tr>
<td align='right' width="25%">Downloaded <%= begin %> </td>
<td class='completedBar' width="<%= (int)ratio/2 %>%" align="right">
<%= ratio > 50 ? "<b>"+ratio+"</b>% " : "" %>
</td>
<td class='queuedBar' align="left" width="<%= (int) ((100-ratio)/2) %>%">
<%= ratio <= 50 ? " <b>"+ratio+"</b>%" : "" %>
</td>
<td width="25%" nowrap> <%= stats.queuedUriCount() %> queued</td>
</tr>
</table>
Total files <%= end %><br>
<!--<%=stats.crawledBytesSummary()%>-->
</center>
<%
}
if (handler.getCurrentJob() != null &&
handler.getCurrentJob().getStatus().equals(CrawlJob.STATUS_PAUSED)) {
%>
<!--<b>Paused Operations</b>
<div class='indent'>
<a href='<%= request.getContextPath() %>/console/frontier.jsp'>View or Edit Frontier URIs</a>
</div>-->
<%
}
%>
</fieldset>
</td></tr>
<tr><td>
<a href="plancontrol.jsp">Refresh</a>
</td></tr>
<tr><td>
<p>
<p>
</td></tr>
<tr><td>
<% if (heritrix.isCommandLine()) {
// Print the shutdown only if we were started from command line.
// It makes no sense when in webcontainer mode.
%>
<a href="<%=request.getContextPath()%>/console/shutdown.jsp">Shut down the software</a>
<% } %><p>
<a href="<%=request.getContextPath()%>/index.jsp?action=logout">Log out</a>
</td></tr></table>
<p></p>
</body>
</html>
Code notes:
The page uses a CustomThread thread class; this is the class that runs the job on a schedule. See step 4 for how it is implemented.
c = new CustomThread(handler,newJob,metaName,jobDescription,theJob,recovery,request.getParameter("seeds"),CrawlJob.PRIORITY_AVERAGE,intervaltime,operatorname); application.setAttribute("planjob", c);
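The page keeps one thread object in application scope and only creates a replacement when none exists yet or the previous one has terminated. A minimal sketch of that rule in plain Java follows. The `PlanRegistry` class and the map standing in for the JSP `application` attributes are assumptions for illustration; they are not part of Heritrix or of the original code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class PlanRegistry {
    // Stand-in for the JSP "application" attribute store (an assumption for this sketch).
    private final Map<String, Thread> attributes = new ConcurrentHashMap<>();

    /** Returns the stored thread, creating a fresh one if none exists or the old one has finished. */
    public synchronized Thread getOrReplace(String key, Supplier<Thread> factory) {
        Thread current = attributes.get(key);
        if (current == null || current.getState() == Thread.State.TERMINATED) {
            current = factory.get();          // a finished Thread cannot be restarted
            attributes.put(key, current);
        }
        return current;
    }

    public static void main(String[] args) throws InterruptedException {
        PlanRegistry registry = new PlanRegistry();
        Thread first = registry.getOrReplace("planjob", () -> new Thread(() -> {}));
        Thread again = registry.getOrReplace("planjob", () -> new Thread(() -> {}));
        System.out.println("reused while not terminated: " + (first == again));

        first.start();
        first.join();                         // the thread is now TERMINATED
        Thread replaced = registry.getOrReplace("planjob", () -> new Thread(() -> {}));
        System.out.println("replaced after termination: " + (first != replaced));
    }
}
```

Comparing against the `Thread.State` enum avoids the string comparison on `getState().toString()` used in the JSP, but the decision it encodes is the same.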
4. CustomThread.java
private CrawlJobHandler handler;
private CrawlJob newJob;
private String intervaltime; //hours
private String metaName;
private String jobDescription;
private CrawlJob theJob;
private String recovery;
private String seeds;
private int PRIORITY_AVERAGE;
private String operator;
private volatile boolean f = true; // volatile so that a stop request from the web thread is visible to run()
public CustomThread(CrawlJobHandler handler, CrawlJob newJob,
String metaName, String jobDescription, CrawlJob theJob,
String recovery, String seeds, int PRIORITY_AVERAGE,String intervaltime,String operator) {
this.handler = handler;
this.newJob = newJob;
this.metaName = metaName;
System.out.println(this.metaName);
this.jobDescription = jobDescription;
this.theJob = theJob;
this.recovery = recovery;
this.seeds = seeds;
this.PRIORITY_AVERAGE = PRIORITY_AVERAGE;
this.intervaltime = intervaltime;
this.operator = operator;
}
With the fields above, a job can be created and started:
this.newJob = handler.newJob(theJob, recovery, metaName,
jobDescription, seeds, CrawlJob.PRIORITY_AVERAGE,operator);
handler.ensureNewJobWritten(newJob, metaName, jobDescription);
handler.addJob(newJob);
handler.startCrawler();
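Put these calls inside the loop of the run method:
run() {
while (f) {
}
}
Adding a Thread.sleep for intervaltime inside the loop (intervaltime is in hours, so it has to be converted to milliseconds) gives the periodic schedule.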
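The loop described above can be sketched roughly as follows. This is not the original CustomThread: the `PeriodicTask` class, the `work` Runnable (standing in for the handler.newJob / addJob / startCrawler calls) and the `hoursToMillis` helper are assumptions made for illustration. The sketch also interrupts the thread on stop, so it does not sit out the rest of a multi-hour sleep before noticing that f is false.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicTask extends Thread {
    private volatile boolean f = true;   // stop flag, same role as f in CustomThread
    private final Runnable work;         // stand-in for the Heritrix job creation calls
    private final long intervalMillis;

    public PeriodicTask(Runnable work, long intervalMillis) {
        this.work = work;
        this.intervalMillis = intervalMillis;
    }

    /** Converts the form value (hours, possibly fractional) to milliseconds. */
    public static long hoursToMillis(String intervaltime) {
        return (long) (Double.parseDouble(intervaltime.trim()) * TimeUnit.HOURS.toMillis(1));
    }

    public void setf(boolean f) {
        this.f = f;
        if (!f) {
            interrupt();                 // wake the thread if it is sleeping between runs
        }
    }

    @Override
    public void run() {
        while (f) {
            work.run();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Woken by setf(false); the loop condition decides whether to exit.
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        PeriodicTask task = new PeriodicTask(runs::incrementAndGet, 50);
        task.start();
        Thread.sleep(200);               // let a few iterations happen
        task.setf(false);
        task.join();
        System.out.println("runs=" + runs.get() + ", state=" + task.getState());
    }
}
```

In the real class the constructor would receive `hoursToMillis(intervaltime)` computed from the form field.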
5. Controlling the job
Just imitate the original console.jsp and change f to control the task.
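As a rough illustration of that control step, the stop/start decision can be written as a small plain-Java function. The `PlanControl` and `Worker` names and the returned messages are hypothetical; only the `saction` values `stopplan` and `startplan` and the `setf` method come from the code above. Unlike the JSP, this sketch treats every state other than NEW and TERMINATED as "running".

```java
public class PlanControl {
    /** Minimal worker with the same stop-flag idea as CustomThread (illustrative only). */
    public static class Worker extends Thread {
        private volatile boolean f = true;

        public void setf(boolean f) {
            this.f = f;
        }

        @Override
        public void run() {
            while (f) {
                try {
                    Thread.sleep(10);    // placeholder for one crawl cycle plus the wait
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    /** Applies a stop/start request to the worker and reports what happened. */
    public static String handle(String saction, Worker c) {
        if (c == null) {
            return "no scheduled job";
        }
        if ("stopplan".equals(saction)) {
            c.setf(false);               // run() leaves its loop at the next check
            return "stopping";
        }
        if ("startplan".equals(saction)) {
            Thread.State state = c.getState();
            if (state == Thread.State.NEW) {
                c.start();
                return "started";
            }
            if (state == Thread.State.TERMINATED) {
                return "finished; create a new thread";
            }
            return "already running";
        }
        return "unknown action";
    }

    public static void main(String[] args) throws InterruptedException {
        Worker w = new Worker();
        System.out.println(handle("startplan", w));
        System.out.println(handle("startplan", w));
        System.out.println(handle("stopplan", w));
        w.join();
        System.out.println(handle("startplan", w));
    }
}
```

Guarding on NEW before calling `start()` matters because `Thread.start()` throws `IllegalThreadStateException` when it is called on a thread that has already been started.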
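Summary: the above comes from a project I worked on during graduate school for monitoring the health of web page content. It implemented web page crawling, word segmentation of page text, storing unhealthy pages in a database (parent domain, subdomain, page snapshot), per-crawl counts of unhealthy pages, and capturing IPs and their corresponding domain names with jpcap.
This post only offers one approach. The scheduling method may not be the best one, and corrections are welcome.
Thanks to my senior labmate Chen Chen (陈忱) for leading the work, to my senior labmate Yang Jian (杨剑) for modifying the web pages, and to Jingwen (静雯) and Wenheng (文恒) for working on it together.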