Installing Apache Nutch on CentOS 6

http://www.paul4llen.com/installing-apache-nutch-on-centos-6/

This is also a great guide:

http://www.covert.io/post/18414889381/accumulo-nutch-and-gora

Introduction

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

  1. Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing. The Tomcat notes at the end of this post use this codebase (Nutch 1.0).
  2. Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. The install and build steps below use this codebase (Nutch 2.2.1).

Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.

Also see the Nutch wiki.

Preparation

See Laying the foundation in CentOS.

Install and Build Nutch 2.2.1

  1. Download the source code for Nutch 2.2.1 at https://www.apache.org/dyn/closer.cgi/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
  2. Upload source to /opt.
  3. Unpack source:
    cd /opt
    tar xvfz apache-nutch-2.2.1-src.tar.gz
  4. Build with ant:
    cd apache-nutch-2.2.1
    ant
  5. Verify installation:
    • ${NUTCH_RUNTIME_HOME} refers to the ready-to-use Nutch installation in /opt/apache-nutch-2.2.1/runtime/local.
    • Confirm that:
      cd /opt/apache-nutch-2.2.1/runtime/local/bin
      bash nutch

      yields something like:

      Usage: nutch COMMAND where command is one of:
      crawl             one-step crawler for intranets (DEPRECATED)
      readdb            read / dump crawl db
      mergedb           merge crawldb-s, with optional filtering
      readlinkdb        read / dump link db
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      freegen           generate new segments to fetch from text files
      fetch             fetch a segment's pages
  6. Note that:
    • config files should be modified in apache-nutch-2.2.1/runtime/local/conf/
    • ant clean will remove this directory (keep copies of modified config files)
  7. Edit /opt/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml:
    • Specify your Agent Name (say) “My Nutch Spider”:
      <property>
       <name>http.agent.name</name>
       <value>My Nutch Spider</value>
      </property>
    • Specify the Gora backend:
      <property>
       <name>storage.data.store.class</name>
       <value>org.apache.gora.hbase.store.HBaseStore</value>
       <description>Default class for storing data</description>
      </property>
  8. Edit /opt/apache-nutch-2.2.1/ivy/ivy.xml:
    • Ensure the gora-hbase dependency is uncommented, then rerun ant so that Ivy fetches it:
      <!-- Uncomment this to use HBase as Gora backend. -->
      <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
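
Step 6 notes that ant clean removes runtime/local/conf, so it can help to copy the edited config files somewhere safe first. The following is a minimal sketch of one way to do that. The backup_conf helper and the backup location /opt/nutch-conf-backups are assumptions made for illustration; they are not part of Nutch.

```shell
#!/bin/sh
# backup_conf SRC_DIR BACKUP_ROOT
# Hypothetical helper (not part of Nutch): copies SRC_DIR into a new
# timestamped directory under BACKUP_ROOT and prints the path of the copy.
backup_conf() {
    src=$1
    root=$2
    dest="$root/conf-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$root" && cp -R "$src" "$dest" && printf '%s\n' "$dest"
}

# Example call, left commented out so the sketch is safe to run anywhere.
# The source path comes from this guide; the backup root is an assumption.
# backup_conf /opt/apache-nutch-2.2.1/runtime/local/conf /opt/nutch-conf-backups
```

After a rebuild, the saved nutch-site.xml can be copied back into runtime/local/conf from the directory the helper printed.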


Crawl Your First Website

  1. Make a directory for the seed URLs. This guide puts it in the Nutch runtime directory:
    cd /opt/apache-nutch-2.2.1/runtime/local/
    mkdir -p urls
  2. Create seed.txt with one URL per line for each site you want to crawl, e.g.
    http://nutch.apache.org/
  3. Upload seed.txt to the urls directory created in step 1 (/opt/apache-nutch-2.2.1/runtime/local/urls).
  4. Edit the file conf/regex-urlfilter.txt and replace
    # accept anything else
    +.

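    with a regular expression matching the domain you wish to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should read:

    +^http://([a-z0-9]*\.)*nutch\.apache\.org/

    The leading + marks an accept rule, and the dots in the host name are escaped so that they match literal dots only. This rule accepts any URL on nutch.apache.org and its subdomains.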
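
Before starting a crawl, it can be useful to check which URLs a filter expression would accept. The sketch below does this with grep -E as a rough stand-in for Nutch's own matching: Nutch evaluates the rule as a Java regular expression, so results may differ for more exotic patterns. The pattern is the one from the rule above, without the leading + and with the dots in the host name escaped. The url_accepted helper and the sample URLs are assumptions for illustration.

```shell
#!/bin/sh
# Regex from the filter rule, without Nutch's leading "+" (accept) marker.
pattern='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

# url_accepted URL
# Hypothetical helper: succeeds if URL matches the pattern under grep -E.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Sample URLs chosen for illustration only.
for url in \
    'http://nutch.apache.org/' \
    'http://wiki.nutch.apache.org/page' \
    'http://example.com/'
do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```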

Install Apache HBase

https://hbase.apache.org/

https://hbase.apache.org/book/ – Reference manual

Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features

 

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy to use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Query predicate push down via server side Filters.
  • Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
  • Extensible jruby-based (JIRB) shell.
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.

Nutch 1.0 with Tomcat

The notes from here on refer to an older Nutch 1.0 installation under /opt/nutch-1.0, deployed in Tomcat.

        1. Deploy the Nutch web application in Tomcat:
    sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
    (Set the property "searcher.dir" to /opt/nutch-1.0/crawl/ and set the docBase attribute to the full path of your nutch-1.0 war file: docBase="nutch.war" path="/opt/tomcat/webapps/")
    sudo ant
    sudo ant war (this compiles with the new build/nutch.xml file)
    sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
    (A .war file is a zip/jar file known as a "web archive"; it is unpacked when Tomcat is started.)

        1. Edit /etc/profile:
    Add these lines just above: # ksh workaround
    
    sudo vi /etc/profile
    
    ##Tomcat 6 / Java##
    JAVA_HOME="/usr/java/jdk1.6.0_16"
    export JAVA_HOME
    CATALINA_HOME="/opt/tomcat"
    export CATALINA_HOME
    NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
    export NUTCH_JAVA_HOME
    ##End Tomcat 6 / Java##
        1. Configure Nutch to fetch URLs:
    cd /opt/nutch-1.0; sudo mkdir urls
    (Create a flat text file named "seed" in this directory and list the URLs to be crawled, one per line, for example http://www.example.com)
    
    sudo vi conf/nutch-default.xml
    Edit the following:
    http.agent.name <value>My Spider</value>
    http.robots.agents <value>My Spider</value>
    http.agent.description <value>My Bot</value>
    http.agent.url <value>http://www.example.com</value>
    http.agent.email <value>admin@example.com</value>
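    Leave all other values at their defaults unless you have a backup and know what you are doing. Note that nutch-default.xml itself advises copying the properties you change into conf/nutch-site.xml rather than editing it directly.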
        1. Nutch “deepcrawler” script:
    Put this script in /opt/nutch-1.0/bin
    chmod +x deepcrawler
    Note: This script assumes the urls you plan to inject are stored in /opt/nutch-1.0/urls/seed and will create a new dir in: /opt/nutch-1.0/crawl1 to store the new crawl.
        1. Fetch URLs with Nutch via command line:
    If you do not alter the deepcrawler script it will most likely run for many days or weeks depending on the amount of urls you inject, so you'll want to run it in screen so you can detach and reattach to check progress.
    
    screen -S nutch
    sudo service tomcat start
    cd /opt/nutch-1.0; su -c "bin/deepcrawler"
        1. Download & install Nutch-Gui 0.2:
    Note: if you use the script provided above, you can skip the GUI altogether.
    
    Download Nutch-Gui 0.2 from: http://github.com/101tec/nutch/downloads
    
    sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
    
    sudo ant clean package
    cd build/nutch-gui-0.2
    sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war
    
    Unsecured quick test, to make sure it is working:
    su -c "bin/nutch admin /opt/nutch-1.0 50060"
    
    http://example.com:50060/general
    
    More secure, with password protection:
    sudo vi conf/nutchguiUsers.properties
    (Edit the entry, which has the form user=password, admin where user is the username, password is the password you want, and admin is the role.)
    
    screen -S nutch-gui (since we'll probably run it for a while)
    su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
    
    http://example.com:50060/general

    Troubleshooting / How To Test

        1. Make sure the required packages are installed and JAVA_HOME path variable is set in /etc/profile:
    rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
    tomcat-6.0.16-0
    jdk-1.6.0_16-fcs
    ant-1.6.5-2jpp.2
    xml-commons-apis-1.3.02-0.b2.7jpp.10
    ant-trax-1.6.5-2jpp.2
    /usr/java/jdk1.6.0_16
    
    Try accessing each service in a browser, replacing "localhost" with your machine's IP address if you are connecting from another host:
    
    Tomcat: http://localhost:8080/
    Nutch: http://localhost:8080/nutch/
    Nutch-Gui: http://localhost:50060/general
        1. Set Tomcat to start on boot:
    sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
    tomcat          0:off   1:off   2:on    3:on    4:on    5:on    6:off
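
The JAVA_HOME check from the troubleshooting step can also be scripted. The sketch below only tests that the directory contains an executable bin/java, which is a cheap sanity check and not proof that the JDK version is the right one. The check_java_home helper is an assumption made for illustration.

```shell
#!/bin/sh
# check_java_home DIR
# Hypothetical helper: succeeds if DIR is non-empty and contains an
# executable bin/java.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Report on the JAVA_HOME of the current shell, if any.
if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks valid: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java: ${JAVA_HOME:-<unset>}"
fi
```

If the check fails in a fresh login shell, the export lines added to /etc/profile are the first thing to re-read.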
