Hadoop 2.6.0
HBase 0.98.20
Nutch 2.3.1
Solr 6.0.1
vm10
CentOS 6.5
JDK 1.8
Tomcat 8
1、Hadoop environment (edit the local hosts file so that this machine is named zwhz)
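The hostname mapping can be checked before going any further. This is a minimal sketch, assuming the entry lives in /etc/hosts; the function name hosts_has_name is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# hosts_has_name FILE NAME: succeed if NAME appears as a hostname or alias
# on any non-comment line of a hosts-format FILE.
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"
}

# Example: report whether zwhz is mapped yet.
if hosts_has_name /etc/hosts zwhz; then
    echo "zwhz is mapped in /etc/hosts"
else
    echo "zwhz is missing; add a line of the form '<this machine's IP> zwhz'"
fi
```

Field 1 of each line is the IP address, so the loop starts at field 2 and only compares hostnames and aliases.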
a、Extract hadoop-2.6.0.tar.gz
b、cd /usr/local/app/hadoop-2.6.0/etc/hadoop
c、vi core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://zwhz:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/app/data/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
d、vi hadoop-env.sh
export JAVA_HOME=/usr/local/app/jdk1.8.0_91
e、vi hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/app/data/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/app/data/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
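The directories named in core-site.xml and hdfs-site.xml can be created up front so that ownership and permissions are under your control. Hadoop will normally create them itself when the parent directory is writable, so treat this as an optional sketch. PREFIX and make_hadoop_dirs are hypothetical names introduced here, not part of Hadoop.

```shell
#!/bin/sh
# PREFIX is a hypothetical knob for trying this out in a scratch location.
# Leave it empty to create the real paths (this usually requires root).
PREFIX="${PREFIX:-}"

# Create the three directories referenced by hadoop.tmp.dir,
# dfs.name.dir and dfs.data.dir in the configuration above.
make_hadoop_dirs() {
    for d in \
        /usr/local/app/data/hadoop/tmp \
        /usr/local/app/data/hadoop/dfs/name \
        /usr/local/app/data/hadoop/dfs/data
    do
        mkdir -p "$PREFIX$d" || return 1
    done
}
```

Call make_hadoop_dirs once as the user that will run the Hadoop daemons.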
f、vi mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>zwhz:9001</value>
<description>Host or IP and port of JobTracker.</description>
</property>
</configuration>
g、vi slaves
zwhz
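With the configuration files in place, HDFS can be formatted and started. The post does not show this step, so the following is only a sketch under assumptions: HADOOP_HOME points at the unpacked distribution, and passwordless SSH to zwhz is already set up (start-dfs.sh relies on it). The function name format_and_start_hdfs is hypothetical.

```shell
#!/bin/sh
# Assumption: HADOOP_HOME points at the unpacked distribution.
HADOOP_HOME="${HADOOP_HOME:-/usr/local/app/hadoop-2.6.0}"

# First start only: formatting wipes any existing HDFS metadata.
format_and_start_hdfs() {
    if [ ! -x "$HADOOP_HOME/bin/hdfs" ]; then
        echo "hdfs not found under $HADOOP_HOME" >&2
        return 1
    fi
    "$HADOOP_HOME/bin/hdfs" namenode -format &&
    "$HADOOP_HOME/sbin/start-dfs.sh" &&
    "$HADOOP_HOME/bin/hdfs" dfs -ls /
}
```

After it returns, running jps should list NameNode and DataNode among the Java processes.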
2、Nutch environment
tar zxvf apache-nutch-2.3.1-src.tar.gz
/usr/local/app/apache-nutch-2.3.1
a、Edit ivy/ivy.xml
<dependency org="org.apache.solr" name="solr-solrj" rev="6.0.1" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.6.0" conf="*->default">
<exclude org="hsqldb" name="hsqldb" />
<exclude org="net.sf.kosmosfs" name="kfs" />
<exclude org="net.java.dev.jets3t" name="jets3t" />
<exclude org="org.eclipse.jdt" name="core" />
<exclude org="org.mortbay.jetty" name="jsp-*" />
<exclude org="ant" name="ant" />
</dependency>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.6.0" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.6.0" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.6.0" conf="*->default"/>
<dependency org="org.apache.solr" name="solr-solrj" rev="6.0.1"
conf="*->default" />
<dependency org="org.apache.gora" name="gora-core" rev="0.6.1" conf="*->default"/>
<!-- Uncomment the following dependency -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-compiler" rev="0.6.1" conf="*->default"/>
<!-- Remove the Hadoop 1.2 dependencies, then add the following -->
<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.6.0" conf="*->default"/>
b、Edit ivy/ivysettings.xml
Some JARs may fail to download during the build. If that happens, change the following setting:
<property name="repository.apache.org" value="http://maven.restlet.org/" override="false"/>
Configure the environment variables: vi /etc/profile
JAVA_HOME=/usr/local/app/jdk1.8.0_91
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export PATH
export CLASSPATH
export ANT_HOME=/usr/local/app/apache-ant-1.9.7
export PATH=$ANT_HOME/bin:$PATH
export HADOOP_HOME=/usr/local/app/hadoop-2.6.0
export PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib
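After reloading the profile (source /etc/profile), it is worth confirming that the tools resolve on PATH before starting the build. A small sketch; missing_tools is a hypothetical helper name.

```shell
#!/bin/sh
# missing_tools NAME...: print each NAME that is not on PATH;
# succeed only if every one of them was found.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# All three should resolve once /etc/profile has been reloaded.
missing_tools java ant hadoop || echo "the tools listed above are not on PATH yet"
```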
c、Edit conf/nutch-site.xml
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>nutch_zwh</value>
</property>
<property>
<name>http.robots.agents</name>
<value>nutch_zwh,*</value>
</property>
<property>
<name>plugin.folders</name>
<value>plugins</value>
</property>
</configuration>
d、Edit conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
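The storage class is set in two places, nutch-site.xml and gora.properties, and the two values should agree. The sketch below compares them. It assumes the one-tag-per-line XML layout used in this post and is not a general XML parser; xml_prop, props_value and stores_match are hypothetical helper names.

```shell
#!/bin/sh
# xml_prop FILE NAME: print the <value> that follows <name>NAME</name>.
# Assumes <name> and <value> sit on separate lines, as in this post.
xml_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { want = 1; next }
        want && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
            print; exit
        }
    ' "$1"
}

# props_value FILE KEY: print the value of the first KEY=... line.
props_value() {
    awk -F= -v k="$2" '$1 == k { print substr($0, length(k) + 2); exit }' "$1"
}

# stores_match NUTCH_SITE GORA_PROPS: succeed if both files name the same store.
stores_match() {
    a=$(xml_prop "$1" storage.data.store.class)
    b=$(props_value "$2" gora.datastore.default)
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

From the Nutch source directory, a call would look like: stores_match conf/nutch-site.xml conf/gora.properties && echo "store classes agree"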
e、Build
ant runtime
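Once the build succeeds, you can crawl step by step from the command line:
1、The injector job initializes the crawl by injecting the seed URLs into the crawl queue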
cd runtime/local
mkdir urls
echo "http://nutch.apache.org/" > urls/seed.txt
bin/nutch inject urls -crawlId test
All of the steps above ran without problems: a new table, test_webpage, was created in HBase and the corresponding data was written to it.
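The post only shows the inject step. The remaining steps of one crawl round might look like the sketch below. The flags reflect Nutch 2.x usage as I understand it, and -topN 10 is an arbitrary value chosen here; running bin/nutch with a command name and no arguments prints the authoritative usage line. DRY_RUN, run and crawl_round are hypothetical names, and with DRY_RUN=1 (the default in this sketch) the commands are only printed.

```shell
#!/bin/sh
# Run from runtime/local. DRY_RUN=1 only prints the commands;
# set DRY_RUN=0 to execute them for real.
DRY_RUN="${DRY_RUN:-1}"
CRAWL_ID=test    # same crawl id as the inject step

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# One generate / fetch / parse / updatedb round.
crawl_round() {
    run bin/nutch generate -topN 10 -crawlId "$CRAWL_ID" &&
    run bin/nutch fetch -all -crawlId "$CRAWL_ID" &&
    run bin/nutch parse -all -crawlId "$CRAWL_ID" &&
    run bin/nutch updatedb -all -crawlId "$CRAWL_ID"
}

crawl_round
```

To look at what was written, a single row can be scanned from the HBase shell with: echo "scan 'test_webpage', {LIMIT => 1}" | hbase shell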
3、Solr environment
See http://blog.csdn.net/happyzwh/article/details/51741204
4、Add the following definitions to /usr/local/app/tomcat8/webapps/solr/solrhome/my_solr/conf/managed-schema
<dynamicField name="meta_*" type="string" stored="true" indexed="true"/>
<fieldType name="text_ik" class="solr.TextField">
<analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
<field name="text_ik" type="text_ik" indexed="true" stored="true" multiValued="false" />
<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
<field name="host" type="url" stored="false" indexed="true"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="contentLength" type="string" stored="true" indexed="false"/>
<field name="lastModified" type="date" stored="true" indexed="false"/>
<field name="date" type="tdate" stored="true" indexed="true"/>
<field name="lang" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="date" stored="true" indexed="true"/>
<field name="updatedDate" type="date" stored="true" indexed="true"/>
<field name="batchId" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="anchor" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
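After restarting Tomcat, the Solr Schema API can be used to confirm that the new fields were picked up. This sketch rests on two assumptions that the post does not state: Tomcat listens on localhost:8080, and the core is named my_solr (taken from the directory name). field_url and check_fields are hypothetical helper names.

```shell
#!/bin/sh
# Assumptions: Tomcat on localhost:8080, core named my_solr.
SOLR_BASE="${SOLR_BASE:-http://localhost:8080/solr}"
CORE="${CORE:-my_solr}"

# field_url NAME: print the Schema API URL for one field.
field_url() {
    echo "$SOLR_BASE/$CORE/schema/fields/$1"
}

# check_fields NAME...: report which fields Solr knows about.
# curl -f turns an HTTP error status into a non-zero exit code.
check_fields() {
    for f in "$@"; do
        if curl -sf "$(field_url "$f")" >/dev/null; then
            echo "ok      $f"
        else
            echo "missing $f"
        fi
    done
}
```

For example: check_fields host tstamp digest batchId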
Problems:
1、Replace the corresponding avro JAR under /usr/local/app/hadoop-2.6.0/lib with avro-1.7.7.jar
2、HBase database errors: see http://blog.csdn.net/happyzwh/article/details/51735785
3、Unable to load native-hadoop library for your platform... using builtin-java classes where applicable: see http://blog.csdn.net/happyzwh/article/details/51735753
4、java.lang.ClassNotFoundException: Class org.apache.gora.mapreduce.PersistentSerialization not found & WARN serializer.SerializationFactory: Serialization class not found:
Copy the JARs under solr4.7/dist and the JARs under solrj-lib to /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce
Copy gora-core-0.6.1.jar to /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce
Copy hadoop*-2.6.0.jar to /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce
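The copy steps for problem 4 can be wrapped in a small helper that reports what it copied. A sketch only: copy_jars is a hypothetical name, and the source directory in the usage line is an assumption, since the post does not say where the JARs are copied from.

```shell
#!/bin/sh
# copy_jars SRC_DIR DEST_DIR PATTERN...: copy the files in SRC_DIR that
# match any of the glob PATTERNs into DEST_DIR, reporting each one.
copy_jars() {
    src=$1; dest=$2; shift 2
    for pattern in "$@"; do
        for jar in "$src"/$pattern; do    # $pattern is unquoted so it globs
            [ -f "$jar" ] || continue     # the pattern matched nothing
            cp "$jar" "$dest"/ && echo "copied $(basename "$jar")"
        done
    done
}
```

Assuming the Nutch build left its dependencies in runtime/local/lib, a call might be: copy_jars runtime/local/lib /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce 'gora-core-0.6.1.jar' 'hadoop*-2.6.0.jar'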