Today we look at the data structures Nutch uses when fetching web pages:
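The main classes involved are FetchListEntry and Page.
First, the FetchListEntry class:
public final class FetchListEntry implements Writable, Cloneable
It implements the Writable and Cloneable interfaces, as many Nutch classes do.
Letting each class handle its own reading and writing is a sensible design; splitting that logic out into separate classes would feel fragmented.
Its member variables: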
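public static final String DIR_NAME = "fetchlist"; // directory name used on disk
private final static byte CUR_VERSION = 2; // current version number
private boolean fetch; // whether the page should be fetched so it can be updated later
private Page page; // the page being fetched
private String[] anchors; // anchor texts associated with this page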
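Next, how the fields are read, in the method
public final void readFields(DataInput in) throws IOException
It reads the version byte first and throws a version-mismatch exception if that value is greater than the current version,
then reads the fetch and page fields.
If the version is greater than 1, the anchors were stored, so they are read as well; otherwise anchors is set to an empty String array.
The code: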
byte version = in.readByte(); // read version
if (version > CUR_VERSION) // check version
    throw new VersionMismatchException(CUR_VERSION, version);
fetch = in.readByte() != 0; // read fetch flag
page = Page.read(in); // read page
if (version > 1) { // anchors added in version 2
    anchors = new String[in.readInt()]; // read anchors
    for (int i = 0; i < anchors.length; i++) {
        anchors[i] = UTF8.readString(in);
    }
} else {
    anchors = new String[0];
}
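To see what the version check buys, here is a minimal, self-contained sketch of the same read logic. It is not Nutch code: the class EntrySketch and the helpers record and parse are hypothetical, readUTF/writeUTF stand in for Nutch's UTF8 class, IOException stands in for VersionMismatchException, and the Page field is left out to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for FetchListEntry; the Page field is left out.
class EntrySketch {
    static final byte CUR_VERSION = 2;

    boolean fetch;
    String[] anchors;

    // Same version-gated logic as the readFields body shown above.
    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version > CUR_VERSION) {
            // Nutch throws VersionMismatchException; IOException stands in for it here.
            throw new IOException("version " + version + " is newer than " + CUR_VERSION);
        }
        fetch = in.readByte() != 0;
        if (version > 1) {                       // anchors exist from version 2 on
            anchors = new String[in.readInt()];
            for (int i = 0; i < anchors.length; i++) {
                anchors[i] = in.readUTF();
            }
        } else {
            anchors = new String[0];             // older records carry no anchors
        }
    }

    // Illustrative helper: the bytes a writer of the given version would have produced.
    static byte[] record(int version, boolean fetch, String... anchors) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(version);
            out.writeByte(fetch ? 1 : 0);
            if (version > 1) {
                out.writeInt(anchors.length);
                for (String a : anchors) {
                    out.writeUTF(a);
                }
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Illustrative helper: parses one record, wrapping IOException to keep callers short.
    static EntrySketch parse(byte[] bytes) {
        try {
            EntrySketch entry = new EntrySketch();
            entry.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return entry;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A version 1 record has no anchors section, so the array comes back empty.
        System.out.println(parse(record(1, true)).anchors.length);
        // A version 2 record carries its anchors.
        System.out.println(parse(record(2, true, "home", "news")).anchors.length);
    }
}
```

The point of the design is that a newer reader can still consume records written by an older writer, while a record from a newer, unknown format is rejected up front rather than misread.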
There is also a static function that reads the fields, builds a FetchListEntry object, and returns it:
public static FetchListEntry read(DataInput in) throws IOException {
    FetchListEntry result = new FetchListEntry();
    result.readFields(in);
    return result;
}
The write code is easier to follow; it writes each field in turn:
public final void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION); // store current version
    out.writeByte((byte)(fetch ? 1 : 0)); // write fetch flag
    page.write(out); // write page
    out.writeInt(anchors.length); // write anchors
    for (int i = 0; i < anchors.length; i++) {
        UTF8.writeString(out, anchors[i]);
    }
}
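The clone and equals implementations are just as easy to follow.
Now the Page class:
public class Page implements WritableComparable, Cloneable
Like FetchListEntry, it is Cloneable and handles its own serialization, here through WritableComparable, which extends Writable. The comment in the Nutch source makes the meaning of each field clear: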
/*********************************************
* A row in the Page Database.
* <pre>
* type name description
* ---------------------------------------------------------------
* byte VERSION - A byte indicating the version of this entry.
* String URL - The url of a page. This is the primary key.
* 128bit ID - The MD5 hash of the contents of the page.
* 64bit DATE - The date this page should be refetched.
* byte RETRIES - The number of times we've failed to fetch this page.
* byte INTERVAL - Frequency, in days, this page should be refreshed.
* float SCORE - Multiplied into the score for hits on this page.
* float NEXTSCORE - Multiplied into the score for hits on this page.
* </pre>
*
* @author Mike Cafarella
* @author Doug Cutting
*********************************************/
The fields:
private final static byte CUR_VERSION = 4;
private static final byte DEFAULT_INTERVAL =
    (byte)NutchConf.get().getInt("db.default.fetch.interval", 30);
private UTF8 url;
private MD5Hash md5;
private long nextFetch = System.currentTimeMillis();
private byte retries;
private byte fetchInterval = DEFAULT_INTERVAL;
private int numOutlinks;
private float score = 1.0f;
private float nextScore = 1.0f;
Next, how it reads its own fields. With the comments already in the source, the code is easy to follow, so I will not go into detail:
public void readFields(DataInput in) throws IOException {
    byte version = in.readByte(); // read version
    if (version > CUR_VERSION) // check version
        throw new VersionMismatchException(CUR_VERSION, version);
    url.readFields(in);
    md5.readFields(in);
    nextFetch = in.readLong();
    retries = in.readByte();
    fetchInterval = in.readByte();
    numOutlinks = (version > 2) ? in.readInt() : 0; // added in Version 3
    score = (version>1) ? in.readFloat() : 1.0f; // score added in version 2
    nextScore = (version>3) ? in.readFloat() : 1.0f; // 2nd score added in V4
}
Writing the fields is just as direct:
public void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION); // store current version
    url.write(out);
    md5.write(out);
    out.writeLong(nextFetch);
    out.write(retries);
    out.write(fetchInterval);
    out.writeInt(numOutlinks);
    out.writeFloat(score);
    out.writeFloat(nextScore);
}
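One detail in this write method is worth a quick check: retries and fetchInterval are written with out.write(int), yet readFields reads them back with readByte(). DataOutput.write(int) writes only the low-order eight bits of its argument, so the two calls are symmetric. The sketch below verifies this with plain java.io streams; the class name ByteRoundTrip and its helper are hypothetical and not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper class; it only exercises java.io behaviour.
class ByteRoundTrip {
    // Writes one byte with write(int) and reads it back with readByte().
    static byte roundTrip(byte value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.write(value);   // the byte is widened to int; only the low 8 bits are written
            out.flush();
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            return in.readByte();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip((byte) 30));   // a typical fetch interval in days
        System.out.println(roundTrip((byte) -1));   // negative bytes survive the trip as well
    }
}
```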
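While we are here, a quick look at FetcherOutput, the class that makes it convenient to read and write fetched content: it delegates to the read and write methods of the two classes described above, and so provides reading and writing of
the fetched data at several levels of granularity. The code is straightforward, so I will not go into detail.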
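The delegation idea can be sketched as follows. This is an illustration only, not the FetcherOutput source: the interface SketchWritable is a local redeclaration of the read/write contract, and EntryPart, OutputSketch, and their fields are hypothetical names chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local redeclaration of the read/write contract so the sketch compiles on its own.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical inner part that knows how to serialize itself.
class EntryPart implements SketchWritable {
    boolean fetch;
    String url = "";

    public void write(DataOutput out) throws IOException {
        out.writeBoolean(fetch);
        out.writeUTF(url);
    }

    public void readFields(DataInput in) throws IOException {
        fetch = in.readBoolean();
        url = in.readUTF();
    }
}

// Hypothetical composite: it serializes its part by delegation, then its own field.
class OutputSketch implements SketchWritable {
    final EntryPart entry = new EntryPart();
    int status;

    public void write(DataOutput out) throws IOException {
        entry.write(out);        // delegate to the part
        out.writeInt(status);    // then write the composite's own data
    }

    public void readFields(DataInput in) throws IOException {
        entry.readFields(in);    // must mirror the order used in write
        status = in.readInt();
    }

    // Illustrative helper: serializes any SketchWritable to a byte array.
    static byte[] toBytes(SketchWritable w) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            w.write(out);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Illustrative helper: rebuilds an OutputSketch from bytes produced by toBytes.
    static OutputSketch fromBytes(byte[] bytes) {
        try {
            OutputSketch result = new OutputSketch();
            result.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        OutputSketch original = new OutputSketch();
        original.entry.fetch = true;
        original.entry.url = "http://example.com/";
        original.status = 1;
        OutputSketch copy = fromBytes(toBytes(original));
        System.out.println(copy.entry.url + " " + copy.status);
    }
}
```

Because each part owns its format, the composite never needs to know how the part is laid out; it only has to call the parts in the same order when reading and writing.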
Next time we will look at the parse-html plugin and how Nutch extracts content from HTML pages.
Comments
public final class Content extends VersionedWritable
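As we can see, it extends VersionedWritable, which implements reading and writing of the version field.
First, the member variables:
DIR_NAME is the directory in which Content is stored,
VERSION is the version constant,
url is the URL of the page this Content belongs to,
base is the base URL of that page,
contentType is the content type of that page,
metadata is the metadata of that page.
Next, how Content reads and writes its own fields: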
public final void readFields(DataInput in) throws IOException
This method reads the object's own fields.
With comments added, the code is fairly clear.
super.readFields(in);
This call has the parent class VersionedWritable read and verify the version number.
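Moving the version check into a base class might look like the sketch below. VersionedSketch and ContentSketch are hypothetical stand-ins, not the Nutch classes; the strict equality check on the version and the use of IOException for the mismatch are assumptions made for the example, and only two of the fields listed above are modelled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical base class: owns the version byte so subclasses do not repeat the check.
abstract class VersionedSketch {
    abstract byte getVersion();

    void write(DataOutput out) throws IOException {
        out.writeByte(getVersion());
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != getVersion()) {   // assumption: only an exact match is accepted
            throw new IOException("expected version " + getVersion() + ", found " + version);
        }
    }
}

// Hypothetical subclass modelling two of the Content fields.
class ContentSketch extends VersionedSketch {
    static final byte VERSION = 1;

    String url = "";
    String contentType = "";

    @Override
    byte getVersion() {
        return VERSION;
    }

    @Override
    void write(DataOutput out) throws IOException {
        super.write(out);                // version byte first
        out.writeUTF(url);
        out.writeUTF(contentType);
    }

    @Override
    void readFields(DataInput in) throws IOException {
        super.readFields(in);            // read and verify the version
        url = in.readUTF();
        contentType = in.readUTF();
    }

    // Illustrative helper: serializes this object to a byte array.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            write(out);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Illustrative helper: parses bytes produced by toBytes.
    static ContentSketch fromBytes(byte[] bytes) {
        try {
            ContentSketch content = new ContentSketch();
            content.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ContentSketch content = new ContentSketch();
        content.url = "http://example.com/";
        content.contentType = "text/html";
        ContentSketch copy = fromBytes(content.toBytes());
        System.out.println(copy.url + " " + copy.contentType);
    }
}
```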
The write code is also simple:
In the end, these classes are mostly about their fields and about how the domain model is divided up.