http://www.paul4llen.com/installing-apache-nutch-on-centos-6/
THIS IS A GREAT COMPANION GUIDE:
http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
Introduction
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:
- Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing.
- Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistence mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. This page covers installing Nutch 2.2.1, with notes on a legacy Nutch 1.0 Tomcat deployment at the end.
Being pluggable and modular of course has its benefits. Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.
Also see the Nutch wiki.
Preparation
See Laying the foundation in CentOS.
Install and Build Nutch 2.2.1
- Download the latest source code for Nutch 2.2.1 at
  https://www.apache.org/dyn/closer.cgi/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
- Upload the source to /opt.
- Unpack the source:
  cd /opt
  tar xvfz apache-nutch-2.2.1-src.tar.gz
- Build with ant:
  cd apache-nutch-2.2.1
  ant
- Verify the installation: ${NUTCH_RUNTIME_HOME} refers to the ready-to-use Nutch installation in /opt/apache-nutch-2.2.1/runtime/local.
- Confirm that:
  cd /opt/apache-nutch-2.2.1/runtime/local/bin
  bash nutch
yields something like:
  Usage: nutch COMMAND
  where command is one of:
    crawl       one-step crawler for intranets (DEPRECATED)
    readdb      read / dump crawl db
    mergedb     merge crawldb-s, with optional filtering
    readlinkdb  read / dump link db
    inject      inject new urls into the database
    generate    generate new segments to fetch from crawl db
    freegen     generate new segments to fetch from text files
    fetch       fetch a segment's pages
- Note that config files should be modified in apache-nutch-2.2.1/runtime/local/conf/, and that ant clean will remove this directory (keep copies of modified config files).
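Since ant clean wipes runtime/local/conf/, it is worth making the backup-and-restore step a habit. A minimal sketch follows; a scratch directory stands in for /opt/apache-nutch-2.2.1 here so the steps can be tried anywhere, and the rm/mkdir pair simulates what "ant clean" and "ant" do to runtime/:

```shell
# Sketch of the backup-before-"ant clean" habit. A scratch directory stands
# in for /opt/apache-nutch-2.2.1; substitute the real paths on your install.
NUTCH=$(mktemp -d)
mkdir -p "$NUTCH/runtime/local/conf"
echo '<configuration/>' > "$NUTCH/runtime/local/conf/nutch-site.xml"

# 1. Save edited config files somewhere "ant clean" cannot touch.
BACKUP=$(mktemp -d)
cp "$NUTCH/runtime/local/conf/nutch-site.xml" "$BACKUP/"

# 2. "ant clean" removes runtime/ (simulated here); "ant" rebuilds it.
rm -rf "$NUTCH/runtime"
mkdir -p "$NUTCH/runtime/local/conf"

# 3. Restore the saved config into the rebuilt runtime.
cp "$BACKUP/nutch-site.xml" "$NUTCH/runtime/local/conf/"
```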
- Edit /opt/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml:
  - Specify your Agent Name, (say) “My Nutch Spider”:
    <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
    </property>
  - Specify the GORA backend:
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
      <description>Default class for storing data</description>
    </property>
- Edit /opt/apache-nutch-2.2.1/ivy/ivy.xml:
  - Ensure the gora-hbase dependency is available (uncomment it):
    <!-- Uncomment this to use HBase as Gora backend. -->
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
  - After changing ivy.xml, rerun ant so the new dependency is pulled in and the runtime is rebuilt.
Crawl Your First Website
- Make a directory for the URLs to crawl:
  cd /opt/apache-nutch-2.2.1/runtime/local/
  mkdir -p urls
- Create seed.txt with one URL per line for each site you want to crawl, e.g. http://nutch.apache.org/
- Upload seed.txt to the urls directory.
- Edit the file conf/regex-urlfilter.txt and replace
  # accept anything else
  +.
  with a regular expression matching the domain you wish to crawl. E.g., to limit the crawl to the nutch.apache.org domain, the line should read:
  +^http://([a-z0-9]*\.)*nutch.apache.org/
  This will include any URL in the domain nutch.apache.org.
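The seed-list and URL-filter steps above can be sketched and sanity-checked with plain shell tools; a scratch directory stands in for runtime/local here, and grep -E is used to confirm what the filter regex will and will not accept:

```shell
# Scratch-dir sketch of the seed list; in the real install the urls/
# directory lives under /opt/apache-nutch-2.2.1/runtime/local.
WORK=$(mktemp -d)
mkdir -p "$WORK/urls"
printf 'http://nutch.apache.org/\n' > "$WORK/urls/seed.txt"

# The regex-urlfilter.txt rule, checked with grep -E: it accepts URLs on
# the nutch.apache.org domain (including subdomains) and nothing else.
FILTER='^http://([a-z0-9]*\.)*nutch.apache.org/'
echo 'http://nutch.apache.org/bot.html' | grep -qE "$FILTER" && echo accepted
echo 'http://example.com/'             | grep -qE "$FILTER" || echo rejected
```

Testing the filter expression this way before a crawl is cheap insurance against a run that silently fetches nothing.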
Install Apache HBase
https://hbase.apache.org/
https://hbase.apache.org/book/ – Reference manual
Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy-to-use Java API for client access.
- Block cache and Bloom filters for real-time queries.
- Query predicate push-down via server-side Filters.
- Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.
- Extensible JRuby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
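The section above describes HBase but stops short of actual install steps. For a single-node HBase backing Gora, a minimal conf/hbase-site.xml might look like the following; the data paths are illustrative assumptions (standalone mode writes to the local filesystem), and note that the gora-hbase 0.3 dependency pinned in ivy.xml above was built against the older HBase 0.90.x line, so match versions carefully:

```xml
<configuration>
  <!-- Where HBase stores its data in standalone mode; local path is an
       illustrative choice. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///var/lib/hbase</value>
  </property>
  <!-- Where the bundled ZooKeeper keeps its state. -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/var/lib/zookeeper</value>
  </property>
</configuration>
```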
Deploy Nutch 1.0 in Tomcat
sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
(Modify the property “searcher.dir” to /opt/nutch-1.0/crawl/ and the docBase attribute to the full path of your nutch-1.0 war file: docBase=”nutch.war” path=”/opt/tomcat/webapps/”.)
sudo ant
sudo ant war (this compiles with the new build/nutch.xml file)
sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
(A .war file is a zip/jar file known as a “web archive” or war file; it is uncompressed when Tomcat is started.)
- Edit /etc/profile:
  sudo vi /etc/profile
  Add these lines just above "# ksh workaround":
  ##Tomcat 6 / Java##
  JAVA_HOME="/usr/java/jdk1.6.0_16"
  export JAVA_HOME
  CATALINA_HOME="/opt/tomcat"
  export CATALINA_HOME
  NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
  export NUTCH_JAVA_HOME
  ##End Tomcat 6 / Java##
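The /etc/profile additions can be sanity-checked without logging out by sourcing the fragment; a small sketch (written to a scratch file here so it can be tried anywhere, with the tutorial's paths; adjust the JDK path to your system):

```shell
# Write the tutorial's profile fragment to a scratch file and source it,
# then confirm the variables are exported as expected.
PROFILE=$(mktemp)
cat > "$PROFILE" <<'EOF'
##Tomcat 6 / Java##
JAVA_HOME="/usr/java/jdk1.6.0_16"
export JAVA_HOME
CATALINA_HOME="/opt/tomcat"
export CATALINA_HOME
NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
export NUTCH_JAVA_HOME
##End Tomcat 6 / Java##
EOF
. "$PROFILE"
echo "JAVA_HOME=$JAVA_HOME"
```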
- Configure Nutch to fetch URLs:
cd /opt/nutch-1.0; sudo mkdir urls
(Make a flat text file in here called "seed" and create a list of urls to be crawled, with each url on a new separate line: http://www.example.com)
sudo vi conf/nutch-default.xml
Edit the following:
  http.agent.name         <value>My Spider</value>
  http.robots.agents      <value>My Spider</value>
  http.agent.description  <value>My Bot</value>
  http.agent.url          <value>http://www.example.com</value>
  http.agent.email        <value>admin@example.com</value>
All other values remain as default; do not attempt to alter them unless you have a backup and/or you know what you're doing.
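A side note: editing nutch-default.xml works, but the usual convention is to leave the defaults alone and place overrides in conf/nutch-site.xml, which makes upgrades and rebuilds safer. An equivalent fragment, using the same placeholder values as above, would be:

```xml
<property>
  <name>http.agent.name</name>
  <value>My Spider</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>admin@example.com</value>
</property>
```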
- Nutch “deepcrawler” script:
Put this script in /opt/nutch-1.0/bin and make it executable:
  chmod +x deepcrawler
Note: this script assumes the urls you plan to inject are stored in /opt/nutch-1.0/urls/seed, and it will create a new directory, /opt/nutch-1.0/crawl1, to store the new crawl.
- Fetch URLs with Nutch via command line:
If you do not alter the deepcrawler script, it will most likely run for many days or weeks depending on the number of urls you inject, so you'll want to run it in screen so you can detach and reattach to check progress:
  screen -S nutch
  sudo service tomcat start
  cd /opt/nutch-1.0; su -c "bin/deepcrawler"
- Download & install Nutch-Gui 0.2:
Note: if you use the script provided above, you can skip the GUI altogether.
Download Nutch-Gui 0.2 from: http://github.com/101tec/nutch/downloads
  sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
  sudo ant clean package
  cd build/nutch-gui-0.2
  sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war
Unsecured quick test method, to assure it's working:
  su -c "bin/nutch admin /opt/nutch-1.0 50060"
  http://example.com:50060/general
More secure, with password protection:
  sudo vi conf/nutchguiUsers.properties
  (Edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role.)
  screen -S nutch-gui (since we'll probably run it for a while)
  su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
  http://example.com:50060/general
Troubleshooting / How To Test
This section covers basic troubleshooting checks and what output to expect.
- Make sure the required packages are installed and the JAVA_HOME variable is set in /etc/profile:
  rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
  tomcat-6.0.16-0
  jdk-1.6.0_16-fcs
  ant-1.6.5-2jpp.2
  xml-commons-apis-1.3.02-0.b2.7jpp.10
  ant-trax-1.6.5-2jpp.2
  /usr/java/jdk1.6.0_16
- Replace "localhost" with your machine's IP:
  Try accessing Tomcat here: http://localhost:8080/
  Try accessing Nutch here: http://localhost:8080/nutch/
  Try accessing Nutch-Gui here: http://localhost:50060/general
- Set Tomcat to start on boot:
  sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
  tomcat  0:off 1:off 2:on 3:on 4:on 5:on 6:off