Nutch 2.3 + HBase 0.98.6 + Solr 4.4

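Compared with Nutch 1.x, Nutch 2.x separates out the storage layer and delegates it to Gora, so crawl data can be stored in a range of databases such as HBase, Cassandra, and MySQL. Nutch 1.7, by contrast, stores its data directly on HDFS.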

1. Edit conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>MyNutchSpider</value>
</property>
<!-- plugin properties -->
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each element
may be a relative or absolute path. If absolute, it is used as is. If
relative, it is searched for on the classpath.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to include at least the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. In order to use HTTPS
please enable protocol-httpclient, but be aware of possible
intermittent problems with the underlying commons-httpclient library.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other
information is available.
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>The Gora DataStore class for storing and retrieving data.
</description>
</property>
<property>
<name>io.serializations</name>
<value>org.apache.hadoop.io.serializer.WritableSerialization</value>
<description>A list of serialization classes that can be used for
obtaining serializers and deserializers.
</description>
</property>
</configuration>
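A quick way to double-check the file after editing is to read a property back out of it. The following is only a minimal sketch: get_nutch_property is a hypothetical helper (not part of Nutch), and it assumes each <name> and <value> element sits on its own line, as in the file above.

```shell
#!/bin/sh
# Sketch: print the value of one property from a Hadoop-style site file.
# get_nutch_property is a hypothetical helper, not a Nutch command.
# Assumption: <name>...</name> and <value>...</value> are on separate lines.
get_nutch_property() {
    awk -v name="$2" '
        # Remember that we have seen the requested property name.
        index($0, "<name>" name "</name>") { found = 1; next }
        # The first <value> line after it holds the value.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For example, `get_nutch_property conf/nutch-site.xml storage.data.store.class` should print org.apache.gora.hbase.store.HBaseStore for the configuration shown above.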

2. Edit ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-core" rev="0.6" conf="*->default"/>

<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" /> 

<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-compiler" rev="0.6" conf="*->default"/>

3. Integrate Nutch with Solr

Copy NUTCH_DIR/conf/schema-solr4.xml to SOLR_DIR/solr/collection1/conf/ and rename it to schema.xml. Then add the following line at the end of the <fields>...</fields> block and restart Solr:

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
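The copy-and-patch step can also be scripted. The sketch below is one possible approach, not an official Nutch or Solr tool: install_nutch_schema is a hypothetical helper, and it assumes the closing </fields> tag sits on a line of its own. If the schema already declares _version_, the file is copied unchanged.

```shell
#!/bin/sh
# Sketch: copy Nutch's Solr 4 schema into the Solr core and make sure it
# declares the _version_ field. install_nutch_schema is a hypothetical helper.
install_nutch_schema() {
    nutch_dir=$1
    solr_dir=$2
    src="$nutch_dir/conf/schema-solr4.xml"
    dst="$solr_dir/solr/collection1/conf/schema.xml"
    field='<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>'

    [ -f "$src" ] || { echo "missing $src" >&2; return 1; }

    if grep -q 'name="_version_"' "$src"; then
        # The schema already declares the field: copy it as is.
        cp "$src" "$dst"
    else
        # Print the field just before the first line containing </fields>.
        awk -v field="$field" '
            /<\/fields>/ && !done { print "    " field; done = 1 }
            { print }
        ' "$src" > "$dst"
    fi
}
```

Called as `install_nutch_schema "$NUTCH_DIR" "$SOLR_DIR"`, where both variables are placeholders for your own install paths. Solr still has to be restarted afterwards.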

4. Crawl pages with the crawl script

$ ./bin/crawl

Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

$ ./bin/crawl ~/urls/ TestCrawl http://localhost:8983/solr/ 2

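  • Parameters
  • (1). ~/urls is the directory that holds the seed URLs
  • (2). TestCrawl is the crawlId; Nutch creates an HBase table prefixed with the crawlId, for example TestCrawl_webpage.
  • (3). http://localhost:8983/solr/ is the Solr server
  • (4). 2 is numberOfRounds, the number of crawl iterations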
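As a rough illustration of the first two parameters, the sketch below prepares a seed directory and derives the HBase table name from a crawlId. Both make_seed_dir and crawl_table_name are hypothetical helpers, and seed.txt is an assumed file name; any plain-text file with one URL per line in the seed directory should do.

```shell
#!/bin/sh
# Sketch: prepare the seed directory and work out the HBase table name.
# make_seed_dir and crawl_table_name are hypothetical helpers;
# the file name seed.txt is an assumption.

# make_seed_dir DIR URL...  writes one URL per line to DIR/seed.txt
make_seed_dir() {
    seed_dir=$1
    shift
    [ $# -gt 0 ] || { echo "no seed URLs given" >&2; return 1; }
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# crawl_table_name CRAWL_ID  prints the crawlId-prefixed table name
crawl_table_name() {
    printf '%s_webpage\n' "$1"
}
```

For instance, `make_seed_dir ~/urls http://nutch.apache.org/` creates the directory used in the crawl command above, and `crawl_table_name TestCrawl` prints TestCrawl_webpage.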
View the results

./bin/nutch readdb -crawlId TestCrawl -stats

You can also inspect the table from the HBase shell:

hbase(main):001:0> scan 'TestCrawl_webpage'

If you are unsure what a column means, check conf/gora-hbase-mapping.xml, which defines the column families and what each column holds.
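To get a compact overview of that mapping, something like the following sketch may help. list_hbase_columns is a hypothetical helper; it assumes each <field> element is on a single line with its attributes in the order name, family, qualifier, and it skips fields that have no qualifier attribute.

```shell
#!/bin/sh
# Sketch: print "family:qualifier fieldName" for each mapped field.
# list_hbase_columns is a hypothetical helper, not part of Nutch or Gora.
# Assumption: <field name="..." family="..." qualifier="..."/> on one line.
list_hbase_columns() {
    sed -n 's/.*<field name="\([^"]*\)" family="\([^"]*\)" qualifier="\([^"]*\)".*/\2:\3 \1/p' "$1"
}
```

Run it as `list_hbase_columns conf/gora-hbase-mapping.xml` and compare the family:qualifier pairs with the column names shown by scan.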




