Nutch 2.3 + HBase 0.98.6 + Solr 4.4

Article link: https://blog.csdn.net/verina/article/details/47777405

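Compared with Nutch 1.x, Nutch 2.x splits the storage layer out into Gora, so crawl data can be stored in a variety of databases such as HBase, Cassandra, or MySQL. Nutch 1.7, by contrast, stores its data directly on HDFS.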
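To make the pluggable storage layer concrete: the backend is selected by the storage.data.store.class property in conf/nutch-site.xml. The fragment below is a minimal sketch of pointing Nutch at Cassandra instead of HBase. It assumes the gora-cassandra module, which this post does not otherwise use, has been enabled in ivy/ivy.xml. The rest of this post uses HBase.

```xml
<!-- Sketch only: selects Cassandra as the Gora backend.
     Assumes the gora-cassandra dependency is on the classpath. -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>
```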

1. Edit conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>

  <property>
    <name>plugin.folders</name>
    <value>./src/plugin</value>
    <description>Directories where nutch plugins are located. Each
    element may be a relative or absolute path. If absolute, it is used
    as is. If relative, it is searched for on the classpath.
    </description>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other
    information is available
    </description>
  </property>

  <property>
    <name>generate.batch.id</name>
    <value>*</value>
  </property>

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>The Gora DataStore class for storing and retrieving data.
    </description>
  </property>

  <property>
    <name>io.serializations</name>
    <value>org.apache.hadoop.io.serializer.WritableSerialization</value>
    <description>A list of serialization classes that can be used for
    obtaining serializers and deserializers.
    </description>
  </property>
</configuration>
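The plugin.includes value above does not list an index writer. As an assumption that the original post does not state: in Nutch 2.3 the Solr writer ships as the indexer-solr plugin, so if the Solr indexing step later reports that no IndexWriters are activated, a variant of the property along these lines may be needed. This is a sketch, not the author's configuration.

```xml
<!-- Sketch only: same plugin list as above, with indexer-solr added.
     Whether this is required depends on the Nutch build in use. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
</property>
```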

2. Edit ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-core" rev="0.6" conf="*->default"/>

<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />

<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6" conf="*->default"/>

<dependency org="org.apache.gora" name="gora-compiler" rev="0.6" conf="*->default"/>

3. Integrate Nutch with Solr

Copy NUTCH_DIR/conf/schema-solr4.xml to SOLR_DIR/solr/collection1/conf/, rename it to schema.xml, and add one line at the end of the <fields>...</fields> element: