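Compared with Nutch 1.x, Nutch 2.x splits the storage layer out into Apache Gora, so crawl data can be stored in a range of databases such as HBase, Cassandra, or MySQL. Nutch 1.7, by contrast, stores its data directly on HDFS.
1. Edit conf/nutch-site.xml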
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>MyNutchSpider</value>
</property>
<!-- plugin properties -->
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each element
may be a relative or absolute path. If absolute, it is used as is. If
relative, it is searched for on the classpath.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to include at least the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. In order to use HTTPS
please enable protocol-httpclient, but be aware of possible
intermittent problems with the underlying commons-httpclient library.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other
information is available.
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>The Gora DataStore class for storing and retrieving data.
</description>
</property>
<property>
<name>io.serializations</name>
<value>org.apache.hadoop.io.serializer.WritableSerialization</value>
<description>A list of serialization classes that can be used for
obtaining serializers and deserializers.
</description>
</property>
</configuration>
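The plugin.includes description says that any plugin whose directory name does not match the expression is excluded, so it can help to try a few names against the expression before crawling. The following is a minimal sketch, assuming the name must match the expression as a whole and that `grep -E -x` behaves the same way for simple alternations like these. `plugin_included` is a hypothetical helper, not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): succeed if a plugin directory
# name fully matches a plugin.includes-style expression.
# Assumption: grep -E -x approximates Nutch's whole-name regex match.
plugin_included() {
  includes=$1
  plugin=$2
  printf '%s\n' "$plugin" | grep -E -q -x -- "$includes"
}

# Same alternatives as the plugin.includes value in the configuration above.
INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-more|languageidentifier|subcollection|feed|creativecommons|tld'

# Report whether a few sample plugin names would be loaded.
for p in protocol-http protocol-httpclient parse-tika; do
  if plugin_included "$INCLUDES" "$p"; then
    echo "$p: included"
  else
    echo "$p: excluded"
  fi
done
```

Under these assumptions protocol-httpclient is reported as excluded, which is consistent with the description's note that HTTPS support has to be enabled explicitly.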
2. Edit ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-core" rev="0.6" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-compiler" rev="0.6" conf="*->default"/>
3. Integrate Nutch with Solr
Copy NUTCH_DIR/conf/schema-solr4.xml to SOLR_DIR/solr/collection1/conf/, rename it to schema.xml, and add the following line at the end of the <fields>...</fields> element:
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
Then restart Solr.
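The copy, rename, and field insertion can also be scripted. This is only a sketch: `install_schema` is a hypothetical helper, and it assumes the closing </fields> tag sits on its own line in schema-solr4.xml. If the schema already declares _version_, it falls back to a plain copy.

```shell
# Hypothetical helper (not part of Nutch or Solr): copy schema-solr4.xml
# into a Solr core's conf directory as schema.xml, adding the _version_
# field before the closing </fields> tag if it is not already present.
# Assumption: </fields> is on its own line in the source schema.
install_schema() {
  nutch_dir=$1   # e.g. NUTCH_DIR
  core_conf=$2   # e.g. SOLR_DIR/solr/collection1/conf
  src="$nutch_dir/conf/schema-solr4.xml"
  field='<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>'

  if grep -q 'name="_version_"' "$src"; then
    # The schema already declares the field; a plain copy is enough.
    cp "$src" "$core_conf/schema.xml"
  else
    # Print the new field once, just before the first </fields> line.
    awk -v f="$field" '
      /<\/fields>/ && !added { print "    " f; added = 1 }
      { print }
    ' "$src" > "$core_conf/schema.xml"
  fi
}
```

It would be called as `install_schema "$NUTCH_DIR" "$SOLR_DIR/solr/collection1/conf"`, followed by a Solr restart.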
4. Crawl pages with the crawl script
$ ./bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
$ ./bin/crawl ~/urls/ TestCrawl http://localhost:8983/solr/ 2
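Parameters:
- (1). ~/urls is the directory that holds the seed URLs.
- (2). TestCrawl is the crawlId. Nutch creates an HBase table prefixed with the crawlId, for example TestCrawl_webpage.
- (3). http://localhost:8983/solr/ is the URL of the Solr server.
- (4). 2 is numberOfRounds, the number of crawl iterations.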
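The seed directory has to exist before the crawl script is run. A minimal sketch of preparing it, assuming one URL per line in a plain text file; `make_seed_dir` and the file name seed.txt are hypothetical choices, not something Nutch requires.

```shell
# Hypothetical helper: create a seed directory and write the given URLs
# into it, one per line. The file name seed.txt is an arbitrary choice.
make_seed_dir() {
  seed_dir=$1
  shift
  # Refuse to create an empty seed list.
  [ "$#" -ge 1 ] || return 1
  mkdir -p "$seed_dir"
  printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

For the crawl command above this could be called as `make_seed_dir ~/urls http://nutch.apache.org/`, where the URL is only an example seed.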
After the crawl finishes, check the crawl statistics:
$ ./bin/nutch readdb -crawlId TestCrawl -stats
You can also inspect the table from the HBase shell:
hbase(main):001:0> scan 'TestCrawl_webpage'
If you are unsure what a column means, see conf/gora-hbase-mapping.xml, which defines the column families and the columns.
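HBase shows each column as family:qualifier, so a flat listing of the mapping file makes it easier to look up the field a column belongs to. The sketch below assumes each <field> element sits on one line with its attributes in the order name, family, qualifier. `list_mapped_columns` is a hypothetical helper, and a mapping file laid out differently would need a real XML parser instead.

```shell
# Hypothetical helper: print "family:qualifier fieldName" for each <field>
# element in a Gora HBase mapping file.
# Assumption: one <field> per line, attributes ordered name, family,
# qualifier; fields without a qualifier are printed as "family: fieldName".
list_mapped_columns() {
  sed -n \
    -e 's|.*<field name="\([^"]*\)" family="\([^"]*\)" qualifier="\([^"]*\)".*|\2:\3 \1|p' \
    -e 's|.*<field name="\([^"]*\)" family="\([^"]*\)" */>.*|\2: \1|p' \
    "$1"
}
```

From the Nutch directory it could be run as `list_mapped_columns conf/gora-hbase-mapping.xml`.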