Nutch source code analysis: parse
The command "bin/nutch parse crawl/segments/*" ultimately invokes the main function of org.apache.nutch.parse.ParseSegment.
ParseSegment::main
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(NutchConfiguration.create(), new ParseSegment(),
args);
System.exit(res);
}
ToolRunner's run function in turn calls ParseSegment's run function.
ParseSegment::run
public int run(String[] args) throws Exception {
Path segment = new Path(args[0]);
parse(segment);
return 0;
}
public void parse(Path segment) throws IOException {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
JobConf job = new NutchJob(getConf());
job.setJobName("parse " + segment);
FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(ParseSegment.class);
job.setReducerClass(ParseSegment.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(ParseOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ParseImpl.class);
JobClient.runJob(job);
}
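ParseSegment's parse function creates a Job whose input is the content directory under crawl/segments/2*, whose processing functions are ParseSegment's own map and reduce, and whose output is handled by ParseOutputFormat. The reduce function does no substantive processing, so the rest of this article concentrates on map and on ParseOutputFormat.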
ParseSegment::map
public void map(WritableComparable<?> key, Content content,
OutputCollector<Text, ParseImpl> output, Reporter reporter)
throws IOException {
...
parseUtil = new ParseUtil(getConf());
ParseResult parseResult = parseUtil.parse(content);
for (Entry<Text, Parse> entry : parseResult) {
Text url = entry.getKey();
Parse parse = entry.getValue();
ParseStatus parseStatus = parse.getData().getStatus();
...
parse.getData().getContentMeta()
.set(Nutch.SEGMENT_NAME_KEY, getConf().get(Nutch.SEGMENT_NAME_KEY));
byte[] signature = SignatureFactory.getSignature(getConf()).calculate(
content, parse);
parse.getData().getContentMeta()
.set(Nutch.SIGNATURE_KEY, StringUtil.toHexString(signature));
...
output.collect(
url,
new ParseImpl(new ParseText(parse.getText()), parse.getData(), parse
.isCanonical()));
}
}
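The Content argument wraps the content of one fetched page. The map function creates a ParseUtil and calls its parse function to parse that content; the result is wrapped in a ParseResult. map then iterates over the ParseResult, stores the segment name and the content signature (as a hex string) in the content metadata, wraps each entry in a ParseImpl, and hands it to output.collect, from where it is eventually written to files.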
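The signature step above can be illustrated with a small standalone sketch. It assumes the signature is an MD5 digest of the raw page bytes; which implementation SignatureFactory actually returns depends on the Nutch configuration, and the class and method names below are hypothetical, not Nutch APIs.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: digest the page bytes and
// convert the digest to and from the hex form stored in the metadata.
class SignatureSketch {
  /** Returns the MD5 digest of the given bytes (MD5 is an assumption here). */
  static byte[] calculate(byte[] content) {
    try {
      return MessageDigest.getInstance("MD5").digest(content);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  /** Encodes bytes as a lowercase hex string, two characters per byte. */
  static String toHexString(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  /** Decodes a hex string produced by toHexString back into bytes. */
  static byte[] fromHexString(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }
}
```

Storing the digest as hex text lets it travel inside string-valued metadata; the writer side can decode it again, which is what the write function later in this article does with StringUtil.fromHexString.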
ParseSegment::map->ParseUtil::parse
public ParseResult parse(Content content) throws ParseException {
Parser[] parsers = null;
parsers = this.parserFactory.getParsers(content.getContentType(),
content.getUrl() != null ? content.getUrl() : "");
for (int i = 0; i < parsers.length; i++) {
ParseResult parseResult = runParser(parsers[i], content);
if (parseResult != null && !parseResult.isEmpty())
return parseResult;
}
}
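getParsers returns the parsers registered for the content type and URL; for an HTML page this ends up being org.apache.nutch.parse.html.HtmlParser (different Parsers are used for different kinds of fetched files). parse then calls runParser on each parser in turn and returns the first ParseResult that is neither null nor empty.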
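The "first parser that produces something wins" loop can be sketched without any Nutch types. Here a parser is modelled as a plain Function from content to a map of results, and the behaviour when every parser fails (returning an empty map) is an assumption, since the excerpt above omits that part.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for ParseUtil.parse: parsers are plain functions.
class ParserChainSketch {
  /** Tries each parser in order and returns the first non-empty result. */
  static Map<String, String> parse(
      List<Function<String, Map<String, String>>> parsers, String content) {
    for (Function<String, Map<String, String>> parser : parsers) {
      Map<String, String> result = parser.apply(content);
      if (result != null && !result.isEmpty()) {
        return result; // later parsers are never consulted
      }
    }
    // Assumption: signal "nothing could parse this" with an empty map.
    return Collections.emptyMap();
  }
}
```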
ParseSegment::map->ParseUtil::parse->runParser
private ParseResult runParser(Parser p, Content content) {
ParseCallable pc = new ParseCallable(p, content);
Future<ParseResult> task = executorService.submit(pc);
ParseResult res = null;
try {
res = task.get(maxParseTime, TimeUnit.SECONDS);
} catch (Exception e) {
task.cancel(true);
} finally {
pc = null;
}
return res;
}
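runParser wraps each Parser in a ParseCallable and submits it to an executor service, so parsing runs on a separate thread. It waits at most maxParseTime seconds for the result; on a timeout or any other exception the task is cancelled and null is returned. ParseCallable's call function is as follows: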
public ParseResult call() throws Exception {
return p.getParse(content);
}
So each thread ultimately calls HtmlParser's getParse function to do the parsing and return a ParseResult.
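The submit-then-wait-with-timeout pattern used by runParser is plain java.util.concurrent and can be tried in isolation. The sketch below is an illustration under assumptions: the class name, the daemon-thread pool, and the millisecond timeout are choices made here, not taken from Nutch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper mirroring the shape of runParser.
class TimedCallSketch {
  // Daemon threads, so a task that ignores interruption cannot keep the JVM alive.
  private static final ExecutorService EXECUTOR =
      Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-call");
        t.setDaemon(true);
        return t;
      });

  /** Runs the task on a worker thread; returns null on timeout or failure. */
  static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) {
    Future<T> future = EXECUTOR.submit(task);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      if (e instanceof InterruptedException) {
        Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      }
      future.cancel(true); // interrupt the worker if it is still running
      return null;
    }
  }
}
```

Returning null for every failure mode keeps the caller simple, at the cost of hiding why a parse failed; that matches what the excerpt above does.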
HtmlParser::getParse
public ParseResult getParse(Content content) {
HTMLMetaTags metaTags = new HTMLMetaTags();
URL base = new URL(content.getBaseUrl());
String text = "";
String title = "";
Outlink[] outlinks = new Outlink[0];
Metadata metadata = new Metadata();
DocumentFragment root;
byte[] contentInOctets = content.getContent();
InputSource input = new InputSource(new ByteArrayInputStream(
contentInOctets));
EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);
metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
input.setEncoding(encoding);
root = parse(input);
HTMLMetaProcessor.getMetaTags(metaTags, root, base);
if (!metaTags.getNoIndex()) {
StringBuffer sb = new StringBuffer();
utils.getText(sb, root);
text = sb.toString();
sb.setLength(0);
utils.getTitle(sb, root);
title = sb.toString().trim();
}
if (!metaTags.getNoFollow()) {
ArrayList<Outlink> l = new ArrayList<Outlink>();
URL baseTag = utils.getBase(root);
utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
outlinks = l.toArray(new Outlink[l.size()]);
}
ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
ParseData parseData = new ParseData(status, title, outlinks,
content.getMetadata(), metadata);
ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
new ParseImpl(text, parseData));
ParseResult filteredParse = this.htmlParseFilters.filter(content,
parseResult, metaTags, root);
return filteredParse;
}
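sniffCharacterEncoding inspects the leading bytes of the content and, together with guessEncoding, determines the encoding of the HTML file, for example UTF-8. The parse function that follows ends up calling parseNeko, which parses the HTML file with NekoHTML and returns the root node.
getMetaTags collects the document's meta information. DOMContentUtils's getText function extracts the text content of the HTML file and getTitle extracts the title (both are skipped when the page is marked noindex); getOutlinks extracts the links in the page (skipped when the page is marked nofollow). Finally a ParseResult wrapping all of this is created, passed through the HTML parse filters, and returned. The sections below go through these functions one by one.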
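To make the idea of sniffing an encoding from the first bytes concrete, here is a simplified sketch. It is not Nutch's sniffCharacterEncoding: the chunk size, the regular expression, and the class name are assumptions, and the real method may recognise more declaration forms.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified charset sniffer.
class CharsetSniffSketch {
  private static final int CHUNK_SIZE = 2000; // assumed size of the inspected prefix
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_\\-]+)",
      Pattern.CASE_INSENSITIVE);

  /** Returns the charset named in a meta tag near the start, or the default. */
  static String sniff(byte[] content, String defaultEncoding) {
    int length = Math.min(content.length, CHUNK_SIZE);
    // Decoding as ASCII is enough to read the markup around the declaration.
    String head = new String(content, 0, length, StandardCharsets.US_ASCII);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? m.group(1) : defaultEncoding;
  }
}
```

The sniffed value is only one clue; in getParse above it is added to the EncodingDetector alongside other clues before guessEncoding picks the final encoding.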
HtmlParser::getParse->parse->parseNeko
private DocumentFragment parseNeko(InputSource input) throws Exception {
DOMFragmentParser parser = new DOMFragmentParser();
try {
parser
.setFeature(
"http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe",
true);
parser.setFeature("http://cyberneko.org/html/features/augmentations",
true);
parser.setProperty(
"http://cyberneko.org/html/properties/default-encoding",
defaultCharEncoding);
parser
.setFeature(
"http://cyberneko.org/html/features/scanner/ignore-specified-charset",
true);
parser
.setFeature(
"http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
false);
parser.setFeature(
"http://cyberneko.org/html/features/balance-tags/document-fragment",
true);
parser.setFeature("http://cyberneko.org/html/features/report-errors",
LOG.isTraceEnabled());
} catch (SAXException e) {
}
HTMLDocumentImpl doc = new HTMLDocumentImpl();
doc.setErrorChecking(false);
DocumentFragment res = doc.createDocumentFragment();
DocumentFragment frag = doc.createDocumentFragment();
parser.parse(input, frag);
res.appendChild(frag);
while (true) {
frag = doc.createDocumentFragment();
parser.parse(input, frag);
if (!frag.hasChildNodes())
break;
res.appendChild(frag);
}
return res;
}
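NekoHTML scans and parses the HTML file and makes suitable corrections along the way, such as adding tags that are missing. Exactly how the file is processed is controlled through DOMFragmentParser's setFeature and setProperty calls. Finally DOMFragmentParser's parse function is called to parse the document (input); each DocumentFragment it produces represents one part of the document and is appended to the root DocumentFragment. Parsing is repeated until a pass yields an empty fragment, and the root fragment is then returned.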
HtmlParser::getParse->getMetaTags
public static final void getMetaTags(HTMLMetaTags metaTags, Node node,
URL currURL) {
metaTags.reset();
getMetaTagsHelper(metaTags, node, currURL);
}
private static final void getMetaTagsHelper(HTMLMetaTags metaTags, Node node,
URL currURL) {
if (node.getNodeType() == Node.ELEMENT_NODE) {
if ("body".equalsIgnoreCase(node.getNodeName())) {
return;
}
if ("meta".equalsIgnoreCase(node.getNodeName())) {
NamedNodeMap attrs = node.getAttributes();
Node nameNode = null;
Node equivNode = null;
Node contentNode = null;
for (int i = 0; i < attrs.getLength(); i++) {
Node attr = attrs.item(i);
String attrName = attr.getNodeName().toLowerCase();
if (attrName.equals("name")) {
nameNode = attr;
} else if (attrName.equals("http-equiv")) {
equivNode = attr;
} else if (attrName.equals("content")) {
contentNode = attr;
}
}
...
} else if ("base".equalsIgnoreCase(node.getNodeName())) {
...
}
}
NodeList children = node.getChildNodes();
if (children != null) {
int len = children.getLength();
for (int i = 0; i < len; i++) {
getMetaTagsHelper(metaTags, children.item(i), currURL);
}
}
}
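The overall idea of getMetaTagsHelper is to walk recursively over the nodes produced by NekoHTML and update HTMLMetaTags according to what each node contains.
For example, a body node makes the function return immediately. For a meta node, it looks up the node's name, http-equiv and content attributes and sets HTMLMetaTags from their values. The omitted code is exactly that value-based setup; it is not central to the main flow, so it is not examined in detail here.
At the end, getMetaTagsHelper fetches all children of the current node and calls itself recursively on each of them.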
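The recursion described above can be reproduced with the JDK's own DOM parser. This sketch assumes well-formed XHTML input (the JDK parser is not as forgiving as NekoHTML), collects meta tags into a plain map instead of HTMLMetaTags, and uses hypothetical class and method names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the recursive meta-tag walk.
class MetaTagSketch {
  /** Parses well-formed XHTML and returns name/http-equiv -> content pairs. */
  static Map<String, String> collect(String xhtml) {
    Node root;
    try {
      root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed", e);
    }
    Map<String, String> tags = new LinkedHashMap<>();
    collect(root, tags);
    return tags;
  }

  private static void collect(Node node, Map<String, String> tags) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      if ("body".equalsIgnoreCase(node.getNodeName())) {
        return; // do not descend into the body
      }
      if ("meta".equalsIgnoreCase(node.getNodeName())) {
        Element meta = (Element) node;
        String name = meta.hasAttribute("name")
            ? meta.getAttribute("name") : meta.getAttribute("http-equiv");
        if (!name.isEmpty() && meta.hasAttribute("content")) {
          tags.put(name.toLowerCase(), meta.getAttribute("content"));
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), tags); // recurse into every child
    }
  }
}
```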
HtmlParser::getParse->getText
public void getText(StringBuffer sb, Node node) {
getText(sb, node, false);
}
public boolean getText(StringBuffer sb, Node node,
boolean abortOnNestedAnchors) {
if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
return true;
}
return false;
}
private boolean getTextHelper(StringBuffer sb, Node node,
boolean abortOnNestedAnchors, int anchorDepth) {
boolean abort = false;
NodeWalker walker = new NodeWalker(node);
while (walker.hasNext()) {
Node currentNode = walker.nextNode();
String nodeName = currentNode.getNodeName();
short nodeType = currentNode.getNodeType();
if ("script".equalsIgnoreCase(nodeName)) {
walker.skipChildren();
}
if ("style".equalsIgnoreCase(nodeName)) {
walker.skipChildren();
}
if (nodeType == Node.COMMENT_NODE) {
walker.skipChildren();
}
if (nodeType == Node.TEXT_NODE) {
String text = currentNode.getNodeValue();
text = text.replaceAll("\\s+", " ");
text = text.trim();
if (text.length() > 0) {
if (sb.length() > 0)
sb.append(' ');
sb.append(text);
}
}
}
return abort;
}
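getText extracts the text content of the HTML file; internally it calls getTextHelper. NodeWalker wraps a node and walks the node tree depth-first. The loop visits every node, skips script and style elements and comment nodes together with all of their children, and for nodes of type TEXT_NODE collapses runs of whitespace and appends the text to the sb argument, separated by single spaces.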
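The same text extraction can be sketched against the JDK DOM. Two things differ from the Nutch code and are assumptions of this sketch: it recurses instead of using NodeWalker, and it requires well-formed XHTML because it uses the JDK parser rather than NekoHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of getText: depth-first walk, skipping script/style/comments.
class TextExtractSketch {
  /** Returns the visible text of a well-formed XHTML string. */
  static String extractText(String xhtml) {
    Node root;
    try {
      root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed", e);
    }
    StringBuilder sb = new StringBuilder();
    appendText(root, sb);
    return sb.toString();
  }

  private static void appendText(Node node, StringBuilder sb) {
    String name = node.getNodeName();
    if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)
        || node.getNodeType() == Node.COMMENT_NODE) {
      return; // skip this node and its whole subtree
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      // Collapse whitespace runs, then join fragments with single spaces.
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      appendText(c, sb);
    }
  }
}
```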
HtmlParser::getParse->getTitle
public boolean getTitle(StringBuffer sb, Node node) {
NodeWalker walker = new NodeWalker(node);
while (walker.hasNext()) {
Node currentNode = walker.nextNode();
String nodeName = currentNode.getNodeName();
short nodeType = currentNode.getNodeType();
if ("body".equalsIgnoreCase(nodeName)) {
return false;
}
if (nodeType == Node.ELEMENT_NODE) {
if ("title".equalsIgnoreCase(nodeName)) {
getText(sb, currentNode);
return true;
}
}
}
return false;
}
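This follows the same pattern as getText. NodeWalker visits the nodes in order; if the body element is reached before any title has been found, the function returns false at once. If a title element is found, getText is called to append all text under that node to sb, and the function returns true.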
HtmlParser::getParse->getOutlinks
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node) {
NodeWalker walker = new NodeWalker(node);
while (walker.hasNext()) {
Node currentNode = walker.nextNode();
String nodeName = currentNode.getNodeName();
short nodeType = currentNode.getNodeType();
NodeList children = currentNode.getChildNodes();
int childLen = (children != null) ? children.getLength() : 0;
if (nodeType == Node.ELEMENT_NODE) {
nodeName = nodeName.toLowerCase();
LinkParams params = (LinkParams) linkParams.get(nodeName);
if (params != null) {
if (!shouldThrowAwayLink(currentNode, children, childLen, params)) {
StringBuffer linkText = new StringBuffer();
getText(linkText, currentNode, true);
if (linkText.toString().trim().length() == 0) {
NodeWalker subWalker = new NodeWalker(currentNode);
while (subWalker.hasNext()) {
Node subNode = subWalker.nextNode();
if (subNode.getNodeType() == Node.ELEMENT_NODE) {
if (subNode.getNodeName().toLowerCase().equals("img")) {
NamedNodeMap subAttrs = subNode.getAttributes();
Node alt = subAttrs.getNamedItem("alt");
if (alt != null) {
String altTxt = alt.getTextContent();
if (altTxt != null && altTxt.trim().length() > 0) {
if (linkText.length() > 0)
linkText.append(' ');
linkText.append(altTxt);
}
}
} else {
}
} else if (subNode.getNodeType() == Node.TEXT_NODE) {
String txt = subNode.getTextContent();
if (txt != null && txt.length() > 0) {
if (linkText.length() > 0)
linkText.append(' ');
linkText.append(txt);
}
}
}
}
NamedNodeMap attrs = currentNode.getAttributes();
String target = null;
boolean noFollow = false;
boolean post = false;
for (int i = 0; i < attrs.getLength(); i++) {
Node attr = attrs.item(i);
String attrName = attr.getNodeName();
if (params.attrName.equalsIgnoreCase(attrName)) {
target = attr.getNodeValue();
} else if ("rel".equalsIgnoreCase(attrName)
&& "nofollow".equalsIgnoreCase(attr.getNodeValue())) {
noFollow = true;
} else if ("method".equalsIgnoreCase(attrName)
&& "post".equalsIgnoreCase(attr.getNodeValue())) {
post = true;
}
}
if (target != null && !noFollow && !post)
try {
URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString()
.trim()));
} catch (MalformedURLException e) {
}
}
if (params.childLen == 0)
continue;
}
}
}
}
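The overall idea of getOutlinks is to visit every node, find the address held in a specific attribute of specific tags (for example the href attribute of an a tag or the src attribute of an img tag), and look for accompanying information such as the alt text of an img tag or text inside the tag. The URL and the accompanying text are then wrapped in an Outlink and added to the outlinks list.
The linkParams member records which tags DOMContentUtils handles and which attribute of each tag holds the link. It is populated in DOMContentUtils's setConf function, covered below.
getText is called on the node to obtain the anchor text linkText for the URL. If that is empty, the child nodes are walked and the text is taken from the alt attribute of any img child, or from children of type TEXT_NODE.
After that, the URL is read from the node's attributes, resolved against the base URL, and added to the outlinks list. However, if the tag has a rel attribute with the value nofollow, or a method attribute with the value post (for example a form), the URL is not added.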
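The link-extraction rules above (tag-to-attribute table, nofollow and post exclusions, alt text as fallback anchor, resolution against a base URL) fit in a short sketch. Everything here is illustrative: the input must be well-formed XHTML, the table holds only three tags, URLs are resolved with java.net.URI rather than Nutch's URLUtil, and each result is a two-element array {url, anchor}.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of outlink extraction.
class OutlinkSketch {
  private static final Map<String, String> LINK_ATTRS = new HashMap<>();
  static {
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("img", "src");
    LINK_ATTRS.put("form", "action");
  }

  /** Returns {absoluteUrl, anchorText} pairs in document order. */
  static List<String[]> extract(String baseUrl, String xhtml) {
    Node root;
    try {
      root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed", e);
    }
    List<String[]> out = new ArrayList<>();
    walk(URI.create(baseUrl), root, out);
    return out;
  }

  private static void walk(URI base, Node node, List<String[]> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      String attr = LINK_ATTRS.get(el.getNodeName().toLowerCase());
      boolean skip = "nofollow".equalsIgnoreCase(el.getAttribute("rel"))
          || "post".equalsIgnoreCase(el.getAttribute("method"));
      if (attr != null && el.hasAttribute(attr) && !skip) {
        try {
          String url = base.resolve(el.getAttribute(attr)).toString();
          out.add(new String[] {url, anchorText(el)});
        } catch (IllegalArgumentException e) {
          // Malformed target: drop the link, as the original does.
        }
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      walk(base, c, out);
    }
  }

  /** Uses the element's text; falls back to the alt text of a nested img. */
  private static String anchorText(Element el) {
    String text = el.getTextContent().replaceAll("\\s+", " ").trim();
    if (!text.isEmpty()) {
      return text;
    }
    NodeList imgs = el.getElementsByTagName("img");
    for (int i = 0; i < imgs.getLength(); i++) {
      String alt = ((Element) imgs.item(i)).getAttribute("alt").trim();
      if (!alt.isEmpty()) {
        return alt;
      }
    }
    return "";
  }
}
```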
DOMContentUtils::setConf
public void setConf(Configuration conf) {
Collection<String> forceTags = new ArrayList<String>(1);
this.conf = conf;
linkParams.clear();
linkParams.put("a", new LinkParams("a", "href", 1));
linkParams.put("area", new LinkParams("area", "href", 0));
if (conf.getBoolean("parser.html.form.use_action", true)) {
linkParams.put("form", new LinkParams("form", "action", 1));
if (conf.get("parser.html.form.use_action") != null)
forceTags.add("form");
}
linkParams.put("frame", new LinkParams("frame", "src", 0));
linkParams.put("iframe", new LinkParams("iframe", "src", 0));
linkParams.put("script", new LinkParams("script", "src", 0));
linkParams.put("link", new LinkParams("link", "href", 0));
linkParams.put("img", new LinkParams("img", "src", 0));
String[] ignoreTags = conf.getStrings("parser.html.outlinks.ignore_tags");
for (int i = 0; ignoreTags != null && i < ignoreTags.length; i++) {
if (!forceTags.contains(ignoreTags[i]))
linkParams.remove(ignoreTags[i]);
}
}
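As this function shows, DOMContentUtils by default extracts URLs from the href of a tags, the href of area tags, the action of form tags, the src of frame tags, the src of iframe tags, the src of script tags, the href of link tags, and the src of img tags. To stop extracting URLs from a particular tag, list it in the parser.html.outlinks.ignore_tags property. The one exception is form: if parser.html.form.use_action has been set explicitly, form is kept even when it appears in the ignore list.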
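The interplay between the default table, the ignore list, and the forced form entry is easier to see in a standalone sketch. The method below is hypothetical: a null first argument stands for "parser.html.form.use_action is not set", and the map holds only tag-to-attribute pairs rather than LinkParams objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the table built in DOMContentUtils.setConf.
class LinkParamsSketch {
  /**
   * @param useFormAction null if the property is unset, otherwise its value
   * @param ignoreTags tags whose links should not be extracted
   */
  static Map<String, String> build(Boolean useFormAction, String... ignoreTags) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("a", "href");
    params.put("area", "href");
    boolean formForced = false;
    if (useFormAction == null || useFormAction) { // unset defaults to true, as in the excerpt
      params.put("form", "action");
      formForced = useFormAction != null; // explicitly enabled: cannot be ignored
    }
    params.put("frame", "src");
    params.put("iframe", "src");
    params.put("script", "src");
    params.put("link", "href");
    params.put("img", "src");
    for (String tag : ignoreTags) {
      if (!("form".equals(tag) && formForced)) {
        params.remove(tag);
      }
    }
    return params;
  }
}
```

For instance, build(null, "img", "script") yields a table without img and script, while build(Boolean.TRUE, "form") still contains form.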
With the map function covered, we now turn to ParseOutputFormat's getRecordWriter function, which creates a RecordWriter to handle the output.
ParseOutputFormat::getRecordWriter
public RecordWriter<Text, Parse> getRecordWriter(FileSystem fs, JobConf job,
String name, Progressable progress) throws IOException {
...
Path out = FileOutputFormat.getOutputPath(job);
Path text = new Path(new Path(out, ParseText.DIR_NAME), name);
Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);
...
final MapFile.Writer textOut = new MapFile.Writer(job, text,
tKeyClassOpt, tValClassOpt, tCompOpt, tProgressOpt);
...
final MapFile.Writer dataOut = new MapFile.Writer(job, data,
dKeyClassOpt, dValClassOpt, dCompOpt, dProgressOpt);
final SequenceFile.Writer crawlOut = SequenceFile.createWriter(job, SequenceFile.Writer.file(crawl),
SequenceFile.Writer.keyClass(Text.class),
SequenceFile.Writer.valueClass(CrawlDatum.class),
SequenceFile.Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size",4096)),
SequenceFile.Writer.replication(fs.getDefaultReplication(crawl)),
SequenceFile.Writer.blockSize(1073741824),
SequenceFile.Writer.compression(compType, new DefaultCodec()),
SequenceFile.Writer.progressable(progress),
SequenceFile.Writer.metadata(new Metadata()));
return new RecordWriter<Text, Parse>();
}
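getRecordWriter sets up the parse_text, parse_data and crawl_parse directories under crawl/segments/2* and creates the corresponding writers textOut, dataOut and crawlOut. When data arrives, the RecordWriter's write function is called to write it out.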
RecordWriter::write
public void write(Text key, Parse parse) throws IOException {
String fromUrl = key.toString();
String origin = null;
textOut.append(key, new ParseText(parse.getText()));
ParseData parseData = parse.getData();
String sig = parseData.getContentMeta().get(Nutch.SIGNATURE_KEY);
if (sig != null) {
byte[] signature = StringUtil.fromHexString(sig);
if (signature != null) {
CrawlDatum d = new CrawlDatum(CrawlDatum.STATUS_SIGNATURE, 0);
d.setSignature(signature);
crawlOut.append(key, d);
}
}
...
Outlink[] links = parseData.getOutlinks();
int outlinksToStore = Math.min(maxOutlinks, links.length);
int validCount = 0;
CrawlDatum adjust = null;
List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(
outlinksToStore);
List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
String toUrl = links[i].getToUrl();
if (!isParsing) {
toUrl = ParseOutputFormat.filterNormalize(fromUrl, toUrl, origin,
ignoreInternalLinks, ignoreExternalLinks, ignoreExternalLinksMode, filters, exemptionFilters, normalizers);
if (toUrl == null) {
continue;
}
}
CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
Text targetUrl = new Text(toUrl);
MapWritable outlinkMD = links[i].getMetadata();
if (outlinkMD != null) {
target.getMetaData().putAll(outlinkMD);
}
scfilters.initialScore(targetUrl, target);
targets.add(new SimpleEntry(targetUrl, target));
links[i].setUrl(toUrl);
outlinkList.add(links[i]);
validCount++;
}
adjust = scfilters.distributeScoreToOutlinks(key, parseData, targets,
null, links.length);
for (Entry<Text, CrawlDatum> target : targets) {
crawlOut.append(target.getKey(), target.getValue());
}
if (adjust != null)
crawlOut.append(key, adjust);
Outlink[] filteredLinks = outlinkList.toArray(new Outlink[outlinkList
.size()]);
parseData = new ParseData(parseData.getStatus(), parseData.getTitle(),
filteredLinks, parseData.getContentMeta(), parseData.getParseMeta());
dataOut.append(key, parseData);
...
}
}
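The write function first appends the page's text to parse_text, then appends the signature to crawl_parse as a CrawlDatum with status STATUS_SIGNATURE. crawl_parse also receives the other records that are later used to update the crawlDB.
Next it iterates over the extracted outlinks, storing at most maxOutlinks of them. When isParsing is false, ParseOutputFormat.filterNormalize is called to normalize each URL and drop unwanted ones (for example .css URLs; what is actually dropped depends on the configured filters and the internal/external link settings). For each surviving link a SimpleEntry wraps targetUrl and target, where targetUrl is the URL and target is a CrawlDatum holding information about it, such as its status, fetch interval and initial score.
The score adjustment adjust is then computed by distributeScoreToOutlinks, and both the per-URL target records and adjust are written to crawl_parse. Finally the ParseData is rebuilt so that its outlinks are replaced by the filtered list, and the new ParseData is written to parse_data.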
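The outlink loop in write combines two limits: rejected URLs are skipped, and counting stops once enough valid links have been kept. A minimal sketch of just that logic, with URLs as plain strings and the filter passed in as a function that returns null for a rejected URL (both are assumptions of this sketch, as is the class name):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the "filter, normalize, cap" loop in write.
class OutlinkLimitSketch {
  /** Keeps at most maxOutlinks links that survive filterNormalize. */
  static List<String> select(List<String> links, int maxOutlinks,
      UnaryOperator<String> filterNormalize) {
    int toStore = Math.min(maxOutlinks, links.size());
    List<String> kept = new ArrayList<>(toStore);
    for (int i = 0; i < links.size() && kept.size() < toStore; i++) {
      String url = filterNormalize.apply(links.get(i));
      if (url == null) {
        continue; // rejected by the filter: does not count against the cap
      }
      kept.add(url);
    }
    return kept;
  }
}
```

Because rejected links do not count towards the cap, the loop keeps scanning past them until it has collected the allowed number of valid links or runs out of input.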