A Study of the Nutch Search Engine Source Code, Part 1: Web Page Fetching (3)

iteye_14216

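Today we look at the data structures Nutch uses for web page fetching.
Two classes are mainly involved: FetchListEntry and Page.
We start with the FetchListEntry class:
public final class FetchListEntry implements Writable, Cloneable
It implements the Writable and Cloneable interfaces, as many Nutch classes do.
Letting each class handle its own reading and writing is a reasonable design; factoring the serialization out into separate classes would feel fragmented.
Its member variables are: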
[code]
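public static final String DIR_NAME = "fetchlist"; // name of the directory written to disk
private final static byte CUR_VERSION = 2;         // current version of the record format
private boolean fetch;                             // whether this page should be fetched
private Page page;                                 // the page this entry refers to
private String[] anchors;                          // anchor texts stored with this page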
[/code]
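Now let us see how the fields are read, that is, the method
public final void readFields(DataInput in) throws IOException
It first reads the version byte and throws a VersionMismatchException if that version is greater than the current one,
then reads the fetch and page fields.
If the version is greater than 1, the record already contains anchors, so they are read as well; otherwise anchors is set to an empty string array.
The code is as follows: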
[code]
byte version = in.readByte();                 // read version
if (version > CUR_VERSION)                    // check version
    throw new VersionMismatchException(CUR_VERSION, version);

fetch = in.readByte() != 0;                   // read fetch flag

page = Page.read(in);                         // read page

if (version > 1) {                            // anchors added in version 2
    anchors = new String[in.readInt()];       // read anchors
    for (int i = 0; i < anchors.length; i++) {
        anchors[i] = UTF8.readString(in);
    }
} else {
    anchors = new String[0];
}
[/code]
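The version branch above is what lets newer code keep reading records written before anchors existed. Below is a minimal, self-contained sketch of that idea. The class `VersionedAnchorsDemo` and its methods are hypothetical, the page field is left out, the exception is a plain IOException, and `readUTF`/`writeUTF` from the JDK stand in for Nutch's `UTF8` helper, so this is an illustration of the technique rather than Nutch's own code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for FetchListEntry; only the version and anchors logic is modelled.
public class VersionedAnchorsDemo {
    static final byte CUR_VERSION = 2;

    // Mirrors the branch in readFields: anchors are present only from version 2 on.
    static String[] readAnchors(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version > CUR_VERSION) {
            throw new IOException("record version " + version + " is newer than " + CUR_VERSION);
        }
        in.readByte(); // fetch flag: consumed to keep the layout, unused in this sketch
        if (version > 1) {
            String[] anchors = new String[in.readInt()];
            for (int i = 0; i < anchors.length; i++) {
                anchors[i] = in.readUTF(); // assumption: readUTF replaces UTF8.readString
            }
            return anchors;
        }
        return new String[0]; // version 1 records carry no anchors
    }

    // Writes a record in the layout of the given version, then reads it back.
    static String[] roundTrip(int version, String... anchors) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(version);
            out.writeByte(1); // fetch flag
            if (version > 1) {
                out.writeInt(anchors.length);
                for (String a : anchors) {
                    out.writeUTF(a);
                }
            }
            out.flush();
            return readAnchors(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1, "ignored").length);      // old layout: no anchors stored
        System.out.println(roundTrip(2, "home", "news").length); // new layout: anchors restored
    }
}
```

A record with a version above CUR_VERSION is rejected instead of being guessed at, which is the same safety check the original method performs first.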
The class also provides a static method that reads the fields and returns a newly built FetchListEntry:
[code]
public static FetchListEntry read(DataInput in) throws IOException {
    FetchListEntry result = new FetchListEntry();
    result.readFields(in);
    return result;
}
[/code]
The writing code is easy to read; it writes each field in turn:
[code]
public final void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);                 // store current version
    out.writeByte((byte)(fetch ? 1 : 0));       // write fetch flag
    page.write(out);                            // write page
    out.writeInt(anchors.length);               // write anchors
    for (int i = 0; i < anchors.length; i++) {
        UTF8.writeString(out, anchors[i]);
    }
}
[/code]
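The three pieces shown so far (write, readFields, and the static read factory) form one symmetric contract: whatever write emits, readFields must consume in the same order. The sketch below shows that contract end to end using in-memory streams. The `Writable` interface here is a hypothetical two-method reduction, `Entry` is an illustrative class rather than FetchListEntry itself, and JDK `writeUTF`/`readUTF` are assumed in place of Nutch's `UTF8` helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableRoundTrip {
    // Hypothetical minimal form of the contract: a class serializes itself.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Illustrative record, not the real FetchListEntry.
    static final class Entry implements Writable {
        boolean fetch;
        String[] anchors = new String[0];

        public void write(DataOutput out) throws IOException {
            out.writeByte(fetch ? 1 : 0);
            out.writeInt(anchors.length);
            for (String a : anchors) {
                out.writeUTF(a);
            }
        }

        // Reads the fields in exactly the order write() emitted them.
        public void readFields(DataInput in) throws IOException {
            fetch = in.readByte() != 0;
            anchors = new String[in.readInt()];
            for (int i = 0; i < anchors.length; i++) {
                anchors[i] = in.readUTF();
            }
        }

        // Static factory in the style of FetchListEntry.read.
        static Entry read(DataInput in) throws IOException {
            Entry result = new Entry();
            result.readFields(in);
            return result;
        }
    }

    // Serializes an entry to bytes and rebuilds a new object from them.
    static Entry copyViaBytes(Entry original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            return Entry.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Entry e = new Entry();
        e.fetch = true;
        e.anchors = new String[] {"home", "news"};
        Entry copy = copyViaBytes(e);
        System.out.println(copy.fetch + " " + copy.anchors.length);
    }
}
```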
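The remaining clone and equals methods are just as easy to follow.
Next, let us look at the code of the Page class:
public class Page implements WritableComparable, Cloneable
Like FetchListEntry, it is both Writable and Cloneable (WritableComparable extends Writable and adds comparison). The comment in the Nutch source makes the meaning of each field easy to understand: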
[code]
/*********************************************
 * A row in the Page Database.
 * <pre>
 *   type   name      description
 * ---------------------------------------------------------------
 *   byte   VERSION   - A byte indicating the version of this entry.
 *   String URL       - The url of a page.  This is the primary key.
 *   128bit ID        - The MD5 hash of the contents of the page.
 *   64bit  DATE      - The date this page should be refetched.
 *   byte   RETRIES   - The number of times we've failed to fetch this page.
 *   byte   INTERVAL  - Frequency, in days, this page should be refreshed.
 *   float  SCORE     - Multiplied into the score for hits on this page.
 *   float  NEXTSCORE - Multiplied into the score for hits on this page.
 * </pre>
 *
 * @author Mike Cafarella
 * @author Doug Cutting
 *********************************************/
[/code]
The fields:
[code]
private final static byte CUR_VERSION = 4;
private static final byte DEFAULT_INTERVAL =
(byte)NutchConf.get().getInt("db.default.fetch.interval", 30);

private UTF8 url;
private MD5Hash md5;
private long nextFetch = System.currentTimeMillis();
private byte retries;
private byte fetchInterval = DEFAULT_INTERVAL;
private int numOutlinks;
private float score = 1.0f;
private float nextScore = 1.0f;
[/code]
Again, let us see how it reads its own fields. With the comments that come with the code it is easy to follow, so we will not go into detail:
[code]
public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();                     // read version
    if (version > CUR_VERSION)                        // check version
        throw new VersionMismatchException(CUR_VERSION, version);

    url.readFields(in);
    md5.readFields(in);
    nextFetch = in.readLong();
    retries = in.readByte();
    fetchInterval = in.readByte();
    numOutlinks = (version > 2) ? in.readInt() : 0;   // added in Version 3
    score = (version>1) ? in.readFloat() : 1.0f;      // score added in version 2
    nextScore = (version>3) ? in.readFloat() : 1.0f;  // 2nd score added in V4
}
[/code]
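The last three lines of readFields gate each later-added field on the record's version and fall back to a default otherwise. The sketch below isolates just that tail so the defaults can be seen for each version. `PageVersionDemo`, `writeTail`, and `readTail` are hypothetical names, the url, md5, date, retries, and interval fields are omitted, and it assumes older writers emitted the fields in the same relative order that the read code expects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical reduction of Page: only the version-gated tail fields are modelled.
public class PageVersionDemo {
    static final byte CUR_VERSION = 4;

    // Writes the tail the way a writer of the given version is assumed to have done.
    static byte[] writeTail(int version, int numOutlinks, float score, float nextScore) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(version);
            if (version > 2) out.writeInt(numOutlinks);   // present from version 3
            if (version > 1) out.writeFloat(score);       // present from version 2
            if (version > 3) out.writeFloat(nextScore);   // present from version 4
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {numOutlinks, score, nextScore}, using the same gates as readFields.
    static float[] readTail(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte version = in.readByte();
            if (version > CUR_VERSION) {
                throw new IOException("unsupported version " + version);
            }
            int numOutlinks = (version > 2) ? in.readInt() : 0;
            float score = (version > 1) ? in.readFloat() : 1.0f;
            float nextScore = (version > 3) ? in.readFloat() : 1.0f;
            return new float[] {numOutlinks, score, nextScore};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (int v = 1; v <= 4; v++) {
            float[] t = readTail(writeTail(v, 7, 2.5f, 9.0f));
            System.out.println("v" + v + ": outlinks=" + (int) t[0] + " score=" + t[1] + " nextScore=" + t[2]);
        }
    }
}
```

Note that the read order (outlink count, then score, then next score) does not follow the order in which the fields were introduced; what matters is only that each gate matches what a writer of that version produced.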
Writing the fields is just as direct:
[code]
public void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);                 // store current version
    url.write(out);
    md5.write(out);
    out.writeLong(nextFetch);
    out.write(retries);                         // write(int) emits a single byte
    out.write(fetchInterval);                   // write(int) emits a single byte
    out.writeInt(numOutlinks);
    out.writeFloat(score);
    out.writeFloat(nextScore);
}
[/code]
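One detail worth checking in the write method is that `out.write(retries)` and `out.write(fetchInterval)` pair correctly with the `readByte()` calls on the reading side. `DataOutput.write(int)` emits only the low eight bits of its argument, so each call produces exactly one byte. The sketch below counts the bytes of the fixed-width fields with the JDK's `DataOutputStream`; the class and method names are hypothetical, and the variable-length url and md5 parts are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: measures the fixed-width part of a Page-like record.
public class PageFixedBytes {
    // Writes the fixed-width fields in the order used by Page.write and returns the byte count.
    static int fixedFieldBytes() {
        try {
            DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
            out.writeByte(4);      // version:       1 byte
            out.writeLong(0L);     // nextFetch:     8 bytes
            out.write(3);          // retries:       1 byte (low eight bits only)
            out.write(30);         // fetchInterval: 1 byte (low eight bits only)
            out.writeInt(0);       // numOutlinks:   4 bytes
            out.writeFloat(1.0f);  // score:         4 bytes
            out.writeFloat(1.0f);  // nextScore:     4 bytes
            return out.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Shows what write(int) keeps of a value wider than one byte.
    static int writtenByte(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(value);
            return bytes.toByteArray()[0] & 0xFF;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fixedFieldBytes());
        System.out.println(writtenByte(0x1FF));
    }
}
```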
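While we are here, note the FetcherOutput class, which makes it convenient to read and write fetched content. It delegates to the read and write methods of the two classes described above and so provides reading and writing of fetch results at
several levels of granularity. The code is straightforward, so we will not go through it in detail.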
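The delegation idea can be sketched as follows. The article does not show FetcherOutput's actual fields, so everything here is hypothetical: `EntryPart` and `PagePart` stand in for FetchListEntry and Page, `FetchRecord` stands in for the composite, and the `fetchDate` field is invented purely for illustration. The point is only that a composite record serializes itself by calling its parts in a fixed order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DelegatingRecordDemo {
    // Hypothetical part standing in for FetchListEntry.
    static final class EntryPart {
        boolean fetch;
        void write(DataOutput out) throws IOException { out.writeBoolean(fetch); }
        void readFields(DataInput in) throws IOException { fetch = in.readBoolean(); }
    }

    // Hypothetical part standing in for Page.
    static final class PagePart {
        String url = "";
        float score = 1.0f;
        void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeFloat(score);
        }
        void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            score = in.readFloat();
        }
    }

    // Hypothetical composite: delegates to its parts, then handles its own field.
    static final class FetchRecord {
        final EntryPart entry = new EntryPart();
        final PagePart page = new PagePart();
        long fetchDate; // invented field, for illustration only

        void write(DataOutput out) throws IOException {
            entry.write(out);
            page.write(out);
            out.writeLong(fetchDate);
        }

        void readFields(DataInput in) throws IOException {
            entry.readFields(in); // same order as write()
            page.readFields(in);
            fetchDate = in.readLong();
        }
    }

    // Builds a record, serializes it, and reads it back into a fresh object.
    static FetchRecord roundTrip(boolean fetch, String url, float score, long fetchDate) {
        try {
            FetchRecord original = new FetchRecord();
            original.entry.fetch = fetch;
            original.page.url = url;
            original.page.score = score;
            original.fetchDate = fetchDate;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            FetchRecord copy = new FetchRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FetchRecord r = roundTrip(true, "http://example.com/", 2.0f, 42L);
        System.out.println(r.page.url + " " + r.page.score + " " + r.fetchDate);
    }
}
```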
Next time we will look at the parse-html plugin and see how Nutch extracts content from HTML pages.
