Storing Nutch Crawl Data in a Database

vanadiumlin007

Tags: data structures, lucene, web applications, search engines, Hadoop

Article link: https://blog.csdn.net/vanadiumlin007/article/details/83731590

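This article was compiled by haipeng of FaceYe from hands-on development work. Please credit the source when reposting.

As most readers will know, Nutch is a web crawler plus search engine built on top of Lucene.

It was developed by the author of Lucene on top of Lucene and integrates Hadoop for distributed computing. It stores its data on HDFS (the Hadoop Distributed File System, whose design follows Google's GFS), which makes it a highly scalable, high-concurrency web crawler and search engine.

FaceYe has already integrated Nutch into its back end and, when the time is right, will start offering users a high-quality knowledge indexing service. A side note: in production Nutch does not run on Windows and needs Linux, mainly because Hadoop relies on shell scripts. A development environment can still be set up on Windows, but Cygwin must be installed to emulate a shell. Enough preamble; on to Nutch itself.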

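As mentioned above, Nutch stores its index files on HDFS and does not write the crawled data to a database. For an application like a search engine, HDFS is more efficient than a database and makes load balancing easier, whereas a database would severely limit performance. HDFS combined with an inverted index therefore delivers satisfactory performance. The search giants use the same kind of distributed file storage: GFS at Google and HDFS at Yahoo.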
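To make the term "inverted index" concrete, here is a minimal in-memory sketch in Java. It is not how Lucene or Nutch implement their index; the class name TinyInvertedIndex and its methods are made up for illustration, and the tokenizer is deliberately naive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, for illustration): term -> sorted set of document ids. */
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Lower-cases the text, splits it on non-letter/non-digit runs, and records each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents that contain the term, in ascending order. */
    List<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Nutch is a web crawler");
        index.add(2, "Lucene is a search library");
        index.add(3, "Nutch builds on Lucene");
        System.out.println(index.search("nutch"));
        System.out.println(index.search("lucene"));
    }
}
```

A lookup goes straight from a term to the documents that contain it, which is why a search engine does not need to scan every stored page or run a database query per term.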

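Even with HDFS in place, while crawling we would still like to store some key information, such as the page URL and the page title, in a database. Nutch does not provide this out of the box. What to do? Build the wheel ourselves.

Nutch supports a powerful plugin mechanism that works much like the plugin mechanism in Eclipse, so plugins can be added and removed just as easily.

The process of developing a Nutch plugin that stores crawl records in a database is as follows.

1. Define the main function of the Nutch plugin:

While Nutch crawls web resources, store the key information about each resource in a database.

2. Create the plugin package: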

org.apache.nutch.indexer.store

and write the StoreIndexingFilter class as follows:

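package org.apache.nutch.indexer.store;

// The imports below assume the Nutch 1.x package layout.
// Imports for the FaceYe classes (IResourceEntityService, ResourceEntity, SpringUtil)
// are omitted because the source does not give their packages.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.time.DateUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.net.protocols.HttpDateFormat;
import org.apache.nutch.net.protocols.Response;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.util.MimeUtil;
import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.MatchResult;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Pattern;
import org.apache.tika.mime.MimeType;

public class StoreIndexingFilter implements IndexingFilter
{
    public static final Log LOG = LogFactory.getLog(StoreIndexingFilter.class);

    /** A flag that tells if magic resolution must be performed */
    private boolean MAGIC;

    /** Get the MimeTypes resolver instance. */
    private MimeUtil MIME;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
    {
        // Look up the database-backed service from the Spring context.
        IResourceEntityService resourceEntityService = (IResourceEntityService) SpringUtil.getInstance().getBean("resourceEntityService");

        String _url = doc.getFieldValue("url");
        String _title = doc.getFieldValue("title");
        if (StringUtils.isNotEmpty(_url))
        {
            // Store each URL only once.
            if (!resourceEntityService.isExists(ResourceEntity.class, "url", _url))
            {
                ResourceEntity resourceEntity = new ResourceEntity();
                resourceEntity.setUrl(_url);
                // Keep the title within 255 characters.
                if (StringUtils.isNotEmpty(_title) && _title.length() > 255)
                {
                    _title = _title.substring(0, 255);
                }
                resourceEntity.setName(_title);
                resourceEntityService.saveResourceEntity(resourceEntity);
            }
        }
        // The document is returned unchanged, so indexing continues as usual.
        return doc;
    }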
    private NutchDocument addTime(NutchDocument doc, ParseData data, String url, CrawlDatum datum)
    {
        long time = -1;

        String lastModified = data.getMeta(Metadata.LAST_MODIFIED);
        if (lastModified != null)
        { // try parse last-modified
            time = getTime(lastModified, url); // use as time
            // store as string
            doc.add("lastModified", Long.toString(time));
        }

        if (time == -1)
        { // if no last-modified
            time = datum.getFetchTime(); // use fetch time
        }

        SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
        sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
        String dateString = sdf.format(new Date(time));

        // un-stored, indexed and un-tokenized
        doc.add("date", dateString);

        return doc;
    }

    private long getTime(String date, String url)
    {
        long time = -1;
        try
        {
            time = HttpDateFormat.toLong(date);
        } catch (ParseException e)
        {
            // try to parse it as date in alternative format
            try
            {
                Date parsedDate = DateUtils.parseDate(date, new String[] { "EEE MMM dd HH:mm:ss yyyy", "EEE MMM dd HH:mm:ss yyyy zzz",
                        "EEE, MMM dd HH:mm:ss yyyy zzz", "EEE, dd MMM yyyy HH:mm:ss zzz", "EEE,dd MMM yyyy HH:mm:ss zzz", "EEE, dd MMM yyyy HH:mm:sszzz",
                        "EEE, dd MMM yyyy HH:mm:ss", "EEE, dd-MMM-yy HH:mm:ss zzz", "yyyy/MM/dd HH:mm:ss.SSS zzz", "yyyy/MM/dd HH:mm:ss.SSS",
                        "yyyy/MM/dd HH:mm:ss zzz", "yyyy/MM/dd", "yyyy.MM.dd HH:mm:ss", "yyyy-MM-dd HH:mm", "MMM dd yyyy HH:mm:ss. zzz",
                        "MMM dd yyyy HH:mm:ss zzz", "dd.MM.yyyy HH:mm:ss zzz", "dd MM yyyy HH:mm:ss zzz", "dd.MM.yyyy; HH:mm:ss", "dd.MM.yyyy HH:mm:ss",
                        "dd.MM.yyyy zzz" });
                time = parsedDate.getTime();
                // if (LOG.isWarnEnabled()) {
                // LOG.warn(url + ": parsed date: " + date + " to:" + time);
                // }
            } catch (Exception e2)
            {
                if (LOG.isWarnEnabled())
                {
                    LOG.warn(url + ": can't parse erroneous date: " + date);
                }
            }
        }
        return time;
    }

    // Add Content-Length
    private NutchDocument addLength(NutchDocument doc, ParseData data, String url)
    {
        String contentLength = data.getMeta(Response.CONTENT_LENGTH);

        if (contentLength != null)
            doc.add("contentLength", contentLength);

        return doc;
    }

    private NutchDocument addType(NutchDocument doc, ParseData data, String url)
    {
        MimeType mimeType = null;
        String contentType = data.getMeta(Response.CONTENT_TYPE);
        if (contentType == null)
        {
            mimeType = MIME.getMimeType(url);
        } else
        {
            mimeType = MIME.forName(MimeUtil.cleanMimeType(contentType));
        }

        // Checks if we solved the content-type.
        if (mimeType == null)
        {
            return doc;
        }

        contentType = mimeType.getName();

        doc.add("type", contentType);

        String[] parts = getParts(contentType);

        for (String part : parts)
        {
            doc.add("type", part);
        }

        return doc;
    }

    static String[] getParts(String mimeType)
    {
        return mimeType.split("/");
    }

    private PatternMatcher matcher = new Perl5Matcher();

    private Configuration conf;
    static Perl5Pattern patterns[] = { null, null };
    static
    {
        Perl5Compiler compiler = new Perl5Compiler();
        try
        {
            // order here is important
            patterns[0] = (Perl5Pattern) compiler.compile("\\bfilename=['\"](.+)['\"]");
            patterns[1] = (Perl5Pattern) compiler.compile("\\bfilename=(\\S+)\\b");
        } catch (MalformedPatternException e)
        {
            // just ignore
        }
    }

    private NutchDocument resetTitle(NutchDocument doc, ParseData data, String url)
    {
        String contentDisposition = data.getMeta(Metadata.CONTENT_DISPOSITION);
        if (contentDisposition == null)
            return doc;

        MatchResult result;
        for (int i = 0; i < patterns.length; i++)
        {
            if (matcher.contains(contentDisposition, patterns[i]))
            {
                result = matcher.getMatch();
                doc.add("title", result.group(1));
                break;
            }
        }

        return doc;
    }

    public void addIndexBackendOptions(Configuration conf)
    {

        LuceneWriter.addFieldOptions("type", LuceneWriter.STORE.NO, LuceneWriter.INDEX.UNTOKENIZED, conf);

        LuceneWriter.addFieldOptions("primaryType", LuceneWriter.STORE.YES, LuceneWriter.INDEX.UNTOKENIZED, conf);
        LuceneWriter.addFieldOptions("subType", LuceneWriter.STORE.YES, LuceneWriter.INDEX.UNTOKENIZED, conf);

        LuceneWriter.addFieldOptions("contentLength", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);

        LuceneWriter.addFieldOptions("lastModified", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);

        // un-stored, indexed and un-tokenized
        LuceneWriter.addFieldOptions("date", LuceneWriter.STORE.NO, LuceneWriter.INDEX.UNTOKENIZED, conf);
    }

    public void setConf(Configuration conf)
    {
        this.conf = conf;
        MIME = new MimeUtil(conf);
    }

    public Configuration getConf()
    {
        return this.conf;
    }

}

The most important method is:

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
{
    // Look up the database-backed service from the Spring context.
    IResourceEntityService resourceEntityService = (IResourceEntityService) SpringUtil.getInstance().getBean("resourceEntityService");
    String _url = doc.getFieldValue("url");
    String _title = doc.getFieldValue("title");
    if (StringUtils.isNotEmpty(_url))
    {
        // Store each URL only once.
        if (!resourceEntityService.isExists(ResourceEntity.class, "url", _url))
        {
            ResourceEntity resourceEntity = new ResourceEntity();
            resourceEntity.setUrl(_url);
            // Keep the title within 255 characters.
            if (StringUtils.isNotEmpty(_title) && _title.length() > 255)
            {
                _title = _title.substring(0, 255);
            }
            resourceEntity.setName(_title);
            resourceEntityService.saveResourceEntity(resourceEntity);
        }
    }
    // The document is returned unchanged, so indexing continues as usual.
    return doc;
}

In other words, while Nutch builds the document for the index, the same resource is also stored in the database.

The code behind resourceEntityService.saveResourceEntity(resourceEntity), which writes to the database, is not shown in detail here; if you are interested, see the FaceYe open-source project.
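Since the FaceYe service code is not shown, the check-then-save logic of filter() can be tried out with an in-memory stand-in. The sketch below is only an illustration: InMemoryResourceStore, saveIfAbsent and the 255-character limit as a named constant are assumptions made here, not part of FaceYe or Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory stand-in for the database-backed service (hypothetical, for illustration). */
final class InMemoryResourceStore {
    /** Assumed maximum title length, mirroring the 255 check in filter(). */
    static final int MAX_TITLE_LENGTH = 255;

    private final Map<String, String> titleByUrl = new LinkedHashMap<>();

    /**
     * Stores the url/title pair unless the url is empty or already present.
     * Returns true only when a new record was written.
     */
    boolean saveIfAbsent(String url, String title) {
        if (url == null || url.isEmpty() || titleByUrl.containsKey(url)) {
            return false;
        }
        titleByUrl.put(url, truncate(title));
        return true;
    }

    /** Cuts titles longer than MAX_TITLE_LENGTH; shorter or null titles pass through. */
    static String truncate(String title) {
        if (title != null && title.length() > MAX_TITLE_LENGTH) {
            return title.substring(0, MAX_TITLE_LENGTH);
        }
        return title;
    }

    String titleOf(String url) {
        return titleByUrl.get(url);
    }

    int size() {
        return titleByUrl.size();
    }
}
```

A real implementation would replace the map with a database lookup and insert. Because the check and the insert are separate steps, concurrent crawler tasks could still both try to save the same URL, so a unique constraint on the URL column is a reasonable safeguard.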

The next step is to write the plugin descriptor file (plugin.xml) for this plugin. The complete configuration is:

<plugin id="index-store" name="Store Indexing Filter" version="1.0.0"
    provider-name="nutch.org">

    <runtime>
        <library name="index-store.jar">
            <export name="*" />
        </library>
    </runtime>

    <requires>
        <import plugin="nutch-extensionpoints" />
        <import plugin="query-more"></import>
    </requires>

    <extension id="org.apache.nutch.indexer.store" name="Nutch More Indexing Filter"
        point="org.apache.nutch.indexer.IndexingFilter">
        <implementation id="StoreIndexingFilter"
            class="org.apache.nutch.indexer.store.StoreIndexingFilter" />
    </extension>

</plugin>

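The main purpose of this XML file is to tell Nutch which jar to load and which class to use; the file itself describes this clearly.

The plugin that stores Nutch data in a database is now written. Next, use Ant to compile it into a jar file in preparation for enabling it.

Compiling the Nutch source and configuration into a jar is done mainly by modifying the Ant build file.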
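To show which two pieces of information the descriptor carries, the sketch below reads a trimmed copy of it with the standard Java DOM parser and pulls out the jar name and the implementation class. It is only an illustration and not how Nutch's own plugin loader is written; PluginDescriptorPeek and jarAndClass are names invented here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Reads the jar name and implementation class from a plugin descriptor (illustrative only). */
final class PluginDescriptorPeek {
    /** Trimmed copy of the descriptor shown above. */
    static final String DESCRIPTOR =
        "<plugin id=\"index-store\" name=\"Store Indexing Filter\" version=\"1.0.0\">"
        + "<runtime><library name=\"index-store.jar\"><export name=\"*\"/></library></runtime>"
        + "<extension id=\"org.apache.nutch.indexer.store\""
        + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
        + "<implementation id=\"StoreIndexingFilter\""
        + " class=\"org.apache.nutch.indexer.store.StoreIndexingFilter\"/>"
        + "</extension></plugin>";

    /** Returns { library name, implementation class } from the first matching elements. */
    static String[] jarAndClass(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element library = (Element) doc.getElementsByTagName("library").item(0);
            Element impl = (Element) doc.getElementsByTagName("implementation").item(0);
            return new String[] { library.getAttribute("name"), impl.getAttribute("class") };
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read plugin descriptor", e);
        }
    }

    public static void main(String[] args) {
        String[] result = jarAndClass(DESCRIPTOR);
        System.out.println("jar:   " + result[0]);
        System.out.println("class: " + result[1]);
    }
}
```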

To do this, open build.xml under nutch/src/plugin/, find the "deploy" target, and add

<ant dir="index-store" target="deploy"/>

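At this point the development work for storing Nutch crawl data in a database is essentially done. What remains is to enable the plugin, which is simple.

Open nutch/conf/nutch-site.xml.

Find the plugin.includes property and, in its value, replace index-(basic|anchor) with index-(basic|anchor|store). That enables the plugin that stores Nutch crawl data in the database.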
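The value of the property is a regular expression over plugin ids, which is why adding "store" to the alternation is enough. The sketch below checks this with java.util.regex, on the assumption that each plugin id has to match the whole expression; PluginIncludeCheck and isIncluded are illustrative names, not Nutch classes.

```java
import java.util.regex.Pattern;

/** Checks a plugin id against an include expression (illustrative, not Nutch code). */
final class PluginIncludeCheck {
    /** True when the whole plugin id matches the include regular expression. */
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String before = "index-(basic|anchor)";
        String after = "index-(basic|anchor|store)";
        System.out.println(isIncluded(before, "index-store"));
        System.out.println(isIncluded(after, "index-store"));
    }
}
```

With the old value the id index-store does not match, so the plugin stays inactive; with the new value it matches, while index-basic and index-anchor remain enabled.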
