Setting up a Hadoop 2.6 + HBase 0.98.20 + Nutch 2.3.1 + Solr 6.0.1 environment

I. Single-node environment
    Hadoop 2.6.0
    HBase 0.98.20
    Nutch 2.3.1
    Solr 6.0.1
    vm10
    CentOS 6.5
    JDK 1.8
    Tomcat 8
  1. Hadoop environment (set this machine's hostname to zwhz in the hosts file)
      a. Extract hadoop-2.6.0.tar.gz
      b. Go to /usr/local/app/hadoop-2.6.0/etc/hadoop
      c. vi core-site.xml
            <configuration>
                <property>
                    <name>fs.default.name</name>
                    <value>hdfs://zwhz:9000</value>
                </property>
                <property>
                    <name>hadoop.tmp.dir</name>
                    <value>/usr/local/app/data/hadoop/tmp</value>
                    <description>A base for other temporary directories.</description>
                </property>
            </configuration>
      d. vi hadoop-env.sh
            export JAVA_HOME=/usr/local/app/jdk1.8.0_91
      e. vi hdfs-site.xml
            <configuration>
                <property>
                    <name>dfs.name.dir</name>
                    <value>/usr/local/app/data/hadoop/dfs/name</value>
                </property>
                <property>
                    <name>dfs.data.dir</name>
                    <value>/usr/local/app/data/hadoop/dfs/data</value>
                </property>
                <property>
                    <name>dfs.replication</name>
                    <value>1</value>
                </property>
            </configuration>
      f. vi mapred-site.xml
            <configuration>
                <property>
                    <name>mapred.job.tracker</name>
                    <value>zwhz:9001</value>
                    <description>Host or IP and port of JobTracker.</description>
                </property>
            </configuration>
      g. vi slaves
            zwhz
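Before starting Hadoop it can help to double-check the values typed into the files above. The following is only a minimal sketch: get_prop is a hypothetical helper (not part of Hadoop), and it assumes every <name> and <value> element sits on a single line, as in the listings above.

```shell
# Hypothetical helper (not part of Hadoop): print the <value> that follows a
# given <name> in a Hadoop-style *-site.xml file.
# Assumption: each <name> and <value> element is written on one line.
get_prop() {
    file=$1
    want=$2
    awk -v want="$want" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n);  sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$file"
}

# Demo on a throwaway copy of the core-site.xml content from this article.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://zwhz:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/app/data/hadoop/tmp</value>
  </property>
</configuration>
EOF

get_prop "$demo" fs.default.name   # prints hdfs://zwhz:9000
get_prop "$demo" hadoop.tmp.dir    # prints /usr/local/app/data/hadoop/tmp
rm -f "$demo"
```

On the real machine the helper would be pointed at the files under /usr/local/app/hadoop-2.6.0/etc/hadoop. Once Hadoop is on the PATH, hdfs getconf -confKey fs.default.name should report the value Hadoop itself resolves, which is the more reliable check.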
  2. Nutch environment
      tar zxvf apache-nutch-2.3.1-src.tar.gz
      Source directory: /usr/local/app/apache-nutch-2.3.1
      a. Edit ivy/ivy.xml
            <dependency org="org.apache.solr" name="solr-solrj" rev="6.0.1" conf="*->default" />
            <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.6.0" conf="*->default">
                <exclude org="hsqldb" name="hsqldb" />
                <exclude org="net.sf.kosmosfs" name="kfs" />
                <exclude org="net.java.dev.jets3t" name="jets3t" />
                <exclude org="org.eclipse.jdt" name="core" />
                <exclude org="org.mortbay.jetty" name="jsp-*" />
                <exclude org="ant" name="ant" />
            </dependency>
            <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.6.0" conf="*->default"/>
            <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.6.0" conf="*->default"/>
            <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.6.0" conf="*->default"/>
            <dependency org="org.apache.gora" name="gora-core" rev="0.6.1" conf="*->default"/>
            <!-- Uncomment this dependency -->
            <dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
            <dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6.1" conf="*->default"/>
            <dependency org="org.apache.gora" name="gora-compiler" rev="0.6.1" conf="*->default"/>
            <!-- Remove the Hadoop 1.2 related dependencies, then add: -->
            <dependency org="org.apache.hadoop" name="hadoop-client" rev="2.6.0" conf="*->default"/>
      b. Edit ivy/ivysettings.xml
            Some JARs may fail to download during the build; if that happens, change the following property:
            <property name="repository.apache.org" value="http://maven.restlet.org/" override="false"/>

         Configure the environment variables: vi /etc/profile
            JAVA_HOME=/usr/local/app/jdk1.8.0_91
            PATH=$JAVA_HOME/bin:$PATH
            CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
            export JAVA_HOME
            export PATH
            export CLASSPATH
            export ANT_HOME=/usr/local/app/apache-ant-1.9.7
            export PATH=$ANT_HOME/bin:$PATH
            export HADOOP_HOME=/usr/local/app/hadoop-2.6.0
            export PATH=$HADOOP_HOME/bin:$PATH
            export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
            export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib
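The profile changes only take effect in a shell that has re-read the file (for example after source /etc/profile). A rough way to confirm that the three tools the build depends on are reachable is sketched below; missing_tools is a hypothetical helper name, not something shipped with any of the packages.

```shell
# Hypothetical helper: print every command name from the argument list that
# cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# After re-reading /etc/profile, java, ant and hadoop should all resolve.
missing=$(missing_tools java ant hadoop)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "not on PATH:" $missing
else
    echo "java, ant and hadoop are all on PATH"
fi
```

If any name is reported, the corresponding export line in /etc/profile is the first place to look.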
      c. Edit conf/nutch-site.xml
            <configuration>
                <property>
                    <name>storage.data.store.class</name>
                    <value>org.apache.gora.hbase.store.HBaseStore</value>
                    <description>Default class for storing data</description>
                </property>
                <property>
                    <name>http.agent.name</name>
                    <value>nutch_zwh</value>
                </property>
                <property>
                    <name>http.robots.agents</name>
                    <value>nutch_zwh,*</value>
                </property>
                <property>
                    <name>plugin.folders</name>
                    <value>plugins</value>
                </property>
            </configuration>
      d. Edit conf/gora.properties
            gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
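      e. Build
            ant runtime
         Once the build succeeds, the crawl can be run step by step from the command line:
         1. The injector job initializes the crawl queue with the seed URLs
            cd runtime/local
            mkdir urls
            echo "http://nutch.apache.org/" > urls/seed.txt
            bin/nutch inject urls -crawlId test
         The test above ran without problems: a new table, test_webpage, was created in HBase and the corresponding data was written to it.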
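The inject step is only the first stage of a round. As a sketch of how the remaining stages might be chained, the wrapper below calls generate, fetch, parse and updatedb in order. crawl_one_round and the NUTCH variable are hypothetical names, the -topN value is an assumption, and the flags are the ones I would expect from the Nutch 2.x usage messages; running bin/nutch with a command name and no arguments prints the exact usage for your build.

```shell
# Hypothetical wrapper: run one generate/fetch/parse/updatedb round with the
# Nutch 2.x launcher. NUTCH defaults to bin/nutch (run from runtime/local).
NUTCH=${NUTCH:-bin/nutch}

crawl_one_round() {
    crawl_id=$1      # e.g. "test", matching the inject step
    top_n=$2         # assumed cap on the number of URLs selected per round
    "$NUTCH" generate -topN "$top_n" -crawlId "$crawl_id" &&
    "$NUTCH" fetch -all -crawlId "$crawl_id" &&
    "$NUTCH" parse -all -crawlId "$crawl_id" &&
    "$NUTCH" updatedb -all -crawlId "$crawl_id"
}

# Usage (from runtime/local, after the inject step):
#   crawl_one_round test 10
```

The stages are joined with && so that a failing stage stops the round instead of feeding incomplete data to the next one.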
  3. Solr environment

      See http://blog.csdn.net/happyzwh/article/details/51741204

  4. Add the following definitions to /usr/local/app/tomcat8/webapps/solr/solrhome/my_solr/conf/managed-schema

    <dynamicField name="meta_*" type="string" stored="true" indexed="true"/>

    <fieldType name="text_ik" class="solr.TextField">
      <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    </fieldType>
    <field name="text_ik" type="text_ik" indexed="true" stored="true" multiValued="false"/>

    <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
      </analyzer>
    </fieldType>

    <field name="host" type="url" stored="false" indexed="true"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="string" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="tdate" stored="true" indexed="true"/>
    <field name="lang" type="string" stored="true" indexed="true"/>
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>

    <field name="batchId" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>

    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="anchor" type="text_general" stored="true" indexed="true" multiValued="true"/>
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

  5. Run bin/crawl /urldir jlc http://localhost:8080/solr/my_solr 2 to crawl and index in one step
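The bin/crawl arguments are positional, so it is easy to swap two of them. The sketch below simply labels each value from the command in step 5; the variable names and crawl_command are illustrative only, and bin/crawl itself sees just the four values in this order.

```shell
# Illustrative labels for the positional arguments of bin/crawl (Nutch 2.x).
SEED_DIR=/urldir                               # directory holding the seed URL files
CRAWL_ID=jlc                                   # crawl ID
SOLR_URL=http://localhost:8080/solr/my_solr    # Solr core served by Tomcat
ROUNDS=2                                       # number of crawl rounds

# Hypothetical helper: assemble the command line without running it.
crawl_command() {
    printf 'bin/crawl %s %s %s %s\n' "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
}

crawl_command    # prints: bin/crawl /urldir jlc http://localhost:8080/solr/my_solr 2
```

By analogy with the inject test earlier, where crawl ID test produced the HBase table test_webpage, crawl ID jlc would be expected to write to jlc_webpage.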

Problems:

1. Replace the corresponding Avro JAR under /usr/local/app/hadoop-2.6.0/lib with avro-1.7.7.jar
2. HBase database errors: see http://blog.csdn.net/happyzwh/article/details/51735785

3. "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable": see http://blog.csdn.net/happyzwh/article/details/51735753

4. "java.lang.ClassNotFoundException: Class org.apache.gora.mapreduce.PersistentSerialization not found" together with "WARN serializer.SerializationFactory: Serialization class not found:"

    Copy the JARs under solr4.7/dist and the JARs under solrj-lib to /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce

    Copy gora-core-0.6.1.jar to /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce

    Copy hadoop*-2.6.0.jar to /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce
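The copy steps for problem 4 can be scripted so that they are repeatable. This is a sketch only: copy_jars is a hypothetical helper, and skipping files that already exist in the destination is my own design choice, not something the original steps require.

```shell
# Hypothetical helper: copy the given JAR files into a destination directory,
# skipping paths that do not exist and never overwriting an existing file.
copy_jars() {
    dest=$1
    shift
    for jar in "$@"; do
        [ -f "$jar" ] || { echo "skip (not found): $jar" >&2; continue; }
        [ -e "$dest/$(basename "$jar")" ] && { echo "skip (already present): $jar" >&2; continue; }
        cp "$jar" "$dest"/ && echo "copied: $jar"
    done
}

# Usage sketch. The source location of gora-core-0.6.1.jar is not given in the
# article; runtime/local/lib under the Nutch build is an assumption.
#   copy_jars /usr/local/app/hadoop-2.6.0/share/hadoop/mapreduce \
#       /usr/local/app/apache-nutch-2.3.1/runtime/local/lib/gora-core-0.6.1.jar
```

Refusing to overwrite keeps the JARs that ship with Hadoop intact, which makes it easier to undo the change if the ClassNotFoundException turns out to have another cause.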

