Understanding the columns/fields in Nutch 2.0 Webpage


One of the great things about Nutch 2.0 and the move to GORA is that the datastore is much more easily accessible than it was under Nutch 1.x. In particular, if you are using Nutch 2.0 with MySQL you can leverage your existing SQL knowledge and SQL tools to see what is going on in the background. This makes both learning and debugging Nutch 2.0 significantly easier than with previous versions. The goal of this article is to give a brief explanation of each column in the webpage table so that you can better understand what is happening when running Nutch, so it is applicable to all Nutch 2.0 back end stores, not just SQL stores.

Before jumping in, it is helpful to quickly review, at a very high level, how Nutch crawls the web and stores the results, because the steps of the crawl are linked to the columns in the webpage table. First, an initial set of seed urls is injected; then there are repeated web crawling cycles. These crawl cycles consist of generate, fetch, parse and database update job steps. These steps use the various columns in the webpage table in the Nutch database.

The following list contains all of the columns in the webpage table, in the order they appear in the table. When looking at the webpage table in the Nutch database, remember that every row represents an individual url. Where applicable, it is noted which step of the crawl cycle primarily uses the column.

id – Generator field. This is used as the index of the table and consists of the url in a slightly different order (reversed domain name:protocol:port and path) from the order normally seen in your web browser, so that it can be searched more quickly. Nutch contains convenience utility methods, such as those for unreversing urls, at TableUtil. Note that using a url as the primary key means the default Nutch 2.0 design is to keep track of the current state of the crawl universe. Nutch 2.0 is not designed for keeping an archive of pages over time as they change (at least without a little modification).
example: org.creativecommons:http/press-releases/entry/5064
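
The key layout described above can be illustrated with a short Python sketch. This is not Nutch's TableUtil (which is Java); `reverse_url` and `unreverse_url` are hypothetical helper names, and the handling of ports and query strings is an assumption based on the "reversed domain name:protocol:port and path" description.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a webpage-style key: reversed host, then protocol, optional port, path."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:  # assumption: the port appears only when explicit
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


def unreverse_url(key):
    """Turn a key produced by reverse_url back into a normal url."""
    slash = key.find("/")
    head, path = (key, "") if slash == -1 else (key[:slash], key[slash:])
    fields = head.split(":")
    host = ".".join(reversed(fields[0].split(".")))
    port = f":{fields[2]}" if len(fields) > 2 else ""
    return f"{fields[1]}://{host}{port}{path}"


print(reverse_url("http://creativecommons.org/press-releases/entry/5064"))
```

Because the reversed domain comes first, all pages from the same domain (and its subdomains) sort next to each other in the table, which is what makes prefix scans on the key fast.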

headers – standard http headers including various non printing characters.
example:

Set-CookieXPHPSESSID=hmsgtrfqsnfql14aaj8lv0n2l6; path=/Age0X-Powered-By*PHP/5.3.3-7+squeeze13Content-Length7184Linkf<http://creativecommons.org/?p=5064>; rel=shortlinkAccept-Ranges
bytes Content-EncodinggzipX-PingbackJhttp://creativecommons.org/xmlrpc.phpConnection
closeVia1.1 varnishContent-Type0text/html; charset=UTF-signature8X-Varnish2185015582Date:Thu, 09 Aug 2012 01:50:38 GMTServer,Apache/2.2.16 (Debian)VaryAccept-Encoding


text – Parse field that is a conglomeration of various text fields for general search purposes. Given advances in Solr I suspect this is no longer really needed except possibly for performance reasons.
example:

Creative Commons Unique Search Tool Now Integrated into Firefox 1.0 - Creative Commons Skip Navigation Home Creative Commons Menu About Licenses Public Domain Support CC Projects News About About CC History Who Uses CC? Case Studies Videos about CC The Team Board of Directors ....


status – Fetch field used to store whether the link was actually fetched. The possible values are:

1 unfetched (links not yet fetched due to limits set in regex-urlfilter.txt, -TopN crawl parameters, etc.)
2 fetched (page was successfully fetched)
3 gone (that page no longer exists)
4 redir_temp (temporary redirection — see reprUrl below for more details)
5 redir_perm (permanent redirection — see reprUrl below for more details)
34 retry
38 not modified

example: 2
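
When inspecting rows exported from the table, the status codes listed above can be kept in a small lookup table. The sketch below is only an illustration: `CRAWL_STATUS`, `describe_status` and `summarize` are hypothetical names, and the codes are copied from the list in this article rather than from the Nutch source.

```python
from collections import Counter

# Codes and names as listed in this article.
CRAWL_STATUS = {
    1: "unfetched",
    2: "fetched",
    3: "gone",
    4: "redir_temp",
    5: "redir_perm",
    34: "retry",
    38: "not modified",
}


def describe_status(code):
    """Return the readable name for a status code, or a marker for unknown codes."""
    return CRAWL_STATUS.get(code, f"unknown ({code})")


def summarize(codes):
    """Count how many rows fall under each status name."""
    return Counter(describe_status(code) for code in codes)


print(summarize([1, 1, 2, 5]))
```

A summary like this gives a quick view of how much of the crawl universe is still unfetched after each cycle.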

markers – Contains the inject, generate, fetch and parse marks, with the batchId used as the value of each marker. See Nutch2Crawling.
example: _ftcmrk_(1344476998-339713229_gnmrk_(1344476998-339713229__prsmrk__(1344476998-339713229

parseStatus – Parse field, normally null until parsing is attempted. For a list of codes see ParseStatusCodes.html.
example (3 bytes): 02 00 00

modifiedTime – Fetch field – FetchSchedule sets this to the last time the content was modified according to the content source. This information comes from the protocol implementations. It is not the last time the database field was modified.
example: 1344597206693
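
The time columns (modifiedTime, prevFetchTime, fetchTime) hold values like the one above, which look like milliseconds since the Unix epoch. Assuming that interpretation, a minimal Python sketch for making them readable follows; `millis_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def millis_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    # Integer arithmetic avoids floating point rounding of the millisecond part.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(millis_to_utc(1344597206693).isoformat())
```

Under that assumption the example value falls on 10 August 2012, which fits the August 2012 dates seen in the headers example.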

score – DbUpdate field ranking a given url/page’s importance. Higher is better. See NewScoring
example: 0.0183057

typ – Fetch field containing the MIME type (see Internet_media_type) of the document, such as text/html or application/pdf. Note that some MIME types are excluded by default; this can be modified in conf/regex-urlfilter.txt.
example: text/html

baseUrl – Fetch field. The base url for relative links contained in the content. May be different from the url if the request was redirected.
example: http://gora.apache.org/

content – Fetch Field – content of the URL.
example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 
<html> 
<head> 
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"> 
<meta content="Apache Forrest" name="Generator"> 
<meta name="Forrest-version" content="0.9"> 
<meta name="Forrest-skin-name" content="nutch"> 
<title>Welcome to Apache Nutch&#153;</title> 
<link type="text/css" href="skin/basic.css" rel="stylesheet"> 
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet"> 
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
...


title – Parse field – The text in the title tags of the HTML head.
example: Welcome to Apache Nutch™

reprUrl – Fetch field for representative urls used for redirects. The default behaviour is that the fetcher won’t immediately follow redirected URLs; instead it will record them for fetching during the next round. The documentation indicates that this can be changed to immediately follow redirected urls by copying the http.redirect.max property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value to a value greater than 0. However, this is not yet implemented for Nutch 2.0, and every redirect is handled during the next fetch regardless of the value of http.redirect.max.
example: http://www.apachecon.eu/c/aceu2009/sessions/136

fetchInterval – Fetch field containing the interval until the next fetch, in seconds (defaults to 30 days). See the fetchTime field for an explanation of the default. It can be set at the url level when injecting, so the field is necessary (see nutch_inject).
example: 2592000

prevFetchTime – Fetch field – previous value of fetch time, or null if not available. This is the previous Nutch fetch time, not to be confused with modifiedTime which is the time the content was actually modified. See fetchTime field default explanation.
example: 1347093015591

inlinks – DbUpdate field with inbound links useful for Linkrank. See Webgraph at NewScoring
example: xhttp://blog.foofactory.fi/2007/03/twice-speed-half-size.htmlWebsite up

prevSignature – Parse field — previous signature. For more details see signature further down.
example (16 bytes): 25 59 5c 73 03 09 bb ed a0 98 5e b6 5e 0c 89 63

outlinks – DbUpdate field – outbound links
example: http://www.adobe.com/jp/products/acrobat/readstep2.html

fetchTime – Fetch field used by the Mapper to decide if it is time to fetch this url. See how-to-re-crawl-with-nutch for a well written overview. Also see the Nutch API documentation for AbstractFetchSchedule. The default re-fetch schedule is somewhat simplistic. Whether or not the page has changed, the fetchInterval remains unchanged, and the updated page fetchTime will always be set to fetchTime + fetchInterval * 1000. See DefaultFetchSchedule. A better implementation for most cases is the AdaptiveFetchSchedule. The FetchSchedule implementation can be changed by copying the db.fetch.schedule.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value.
example: 1347093160403
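
The default schedule arithmetic described above (fetchTime + fetchInterval * 1000) can be written out in a few lines of Python. This is a sketch of the formula only, not the DefaultFetchSchedule code; `next_fetch_time` and `is_due` are hypothetical names, and the "due" comparison is an assumption about how the generator uses fetchTime.

```python
def next_fetch_time(fetch_time_ms, fetch_interval_s):
    """fetchTime is in milliseconds, fetchInterval in seconds, hence the * 1000."""
    return fetch_time_ms + fetch_interval_s * 1000


def is_due(fetch_time_ms, now_ms):
    """Assumption: a url is eligible for fetching once its fetchTime has passed."""
    return fetch_time_ms <= now_ms


# Using the example values from this article: fetchTime and a 30 day interval.
upcoming = next_fetch_time(1347093160403, 2592000)
print(upcoming)
print(is_due(upcoming, now_ms=1347093160403))
```

With the example values the next fetchTime lands exactly 30 days after the current one, regardless of whether the page changed in between.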

retriesSinceFetch – Fetch field counter for number of retries to fetch due to (hopefully transient) errors since the last success. See AbstractFetchSchedule
example: 2

protocolStatus – Fetch field – see ProtocolStatusCodes
example (3 bytes): 02 00 00

ACCESS_DENIED 17
BLOCKED 23
EXCEPTION 16
FAILED 2
GONE 11
MOVED 12
NOTFETCHING 20
NOTFOUND 14
NOTMODIFIED 21
PROTO_NOT_FOUND 10
REDIR_EXCEEDED 19
RETRY 15
ROBOTS_DENIED 18
SUCCESS 1
TEMP_MOVED 13
WOULDBLOCK 22
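
The 3 byte example can be related to the code list with a small Python sketch. It rests on an assumption the article does not confirm: that the field is stored as Avro binary, where integers are written as zigzag varints. Under that assumption the first byte 02 decodes to 1, which matches SUCCESS. `read_zigzag_varint` and `PROTOCOL_STATUS` are hypothetical names, and the meaning of the two trailing 00 bytes is not interpreted here.

```python
# Codes and names as listed in this article.
PROTOCOL_STATUS = {
    1: "SUCCESS", 2: "FAILED", 10: "PROTO_NOT_FOUND", 11: "GONE",
    12: "MOVED", 13: "TEMP_MOVED", 14: "NOTFOUND", 15: "RETRY",
    16: "EXCEPTION", 17: "ACCESS_DENIED", 18: "ROBOTS_DENIED",
    19: "REDIR_EXCEEDED", 20: "NOTFETCHING", 21: "NOTMODIFIED",
    22: "WOULDBLOCK", 23: "BLOCKED",
}


def read_zigzag_varint(data, pos=0):
    """Decode one zigzag varint starting at pos; return (value, next position)."""
    shift = 0
    raw = 0
    while True:
        byte = data[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear means this was the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


# Assumption: the first value in the stored bytes is the status code.
code, _ = read_zigzag_varint(bytes.fromhex("020000"))
print(code, PROTOCOL_STATUS.get(code, "unknown"))
```

If the decoded value does not appear in the table for your own rows, the storage assumption probably does not hold for your back end.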

signature – This parse field contains a signature calculated every time a page is fetched so that Nutch knows whether a page has changed or not the next time it does a fetch. The default signature calculation implementation uses both content and header as information for calculating the signature. For various reasons (etags, etc.) the header can change without the actual content changing making the default implementation less than optimal for most requirements. For those looking to save some bandwidth on current status crawl or those implementing archival crawling (requires more changes than just this) the TextProfileSignature implementation is more appropriate. The signature calculation implementation can be changed by copying the db.signature.class property from conf/nutch-default.xml to conf/nutch-site.xml and changing the value to org.apache.nutch.crawl.TextProfileSignature.
example (16 bytes): e1 f7 cc cc 49 7a 45 6b e7 fc 05 68 9a e8 ea 93
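
The way signature and prevSignature are used for change detection can be sketched in Python. This is not the Nutch implementation: the 16 byte examples are consistent with an MD5 digest, so the sketch assumes MD5, and it hashes only whatever bytes are passed in. `page_signature`, `has_changed` and `as_hex` are hypothetical names.

```python
import hashlib


def page_signature(content):
    """Return a 16 byte digest of the given bytes (MD5 is an assumption)."""
    return hashlib.md5(content).digest()


def has_changed(prev_signature, content):
    """A page counts as changed if there is no previous signature or it differs."""
    return prev_signature is None or prev_signature != page_signature(content)


def as_hex(signature):
    """Format a signature in the same spaced hex style as the examples."""
    return " ".join(f"{byte:02x}" for byte in signature)


first = page_signature(b"<html>version 1</html>")
print(as_hex(first))
print(has_changed(first, b"<html>version 2</html>"))
```

The sketch also shows why the choice of input matters: any byte that differs, including ones the reader never sees, produces a different signature, which is the motivation for a text based implementation such as TextProfileSignature.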

metadata – This is a mixed catch-all field for metadata. The IndexMetatags plugin does not currently work in Nutch 2.0 or 2.1. metadata-package-summary.html has more information, but it is unclear how much of it works with 2.x.
example: _csh_:テつケ)elanguageen
