http://www.paul4llen.com/installing-apache-nutch-on-centos-6/
THIS IS A GREAT COMPANION GUIDE:
http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
Introduction
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:
- Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing.
- Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistence mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. This page covers installing Nutch 2.2.1, with notes on a legacy Nutch 1.0 Tomcat deployment at the end.
Being pluggable and modular of course has its benefits. Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.
Also see the Nutch wiki.
Preparation
See Laying the foundation in CentOS.
Install and Build Nutch 2.2.1
- Download the latest source code for Nutch 2.2.1 at
  https://www.apache.org/dyn/closer.cgi/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
- Upload the source to /opt.
- Unpack the source:
  cd /opt
  tar xvfz apache-nutch-2.2.1-src.tar.gz
- Build with ant:
  cd apache-nutch-2.2.1
  ant
- Verify the installation: ${NUTCH_RUNTIME_HOME} refers to the ready-to-use Nutch installation in /opt/apache-nutch-2.2.1/runtime/local.
- Confirm that:
  cd /opt/apache-nutch-2.2.1/runtime/local/bin
  bash nutch
yields something like:
  Usage: nutch COMMAND
  where command is one of:
    crawl       one-step crawler for intranets (DEPRECATED)
    readdb      read / dump crawl db
    mergedb     merge crawldb-s, with optional filtering
    readlinkdb  read / dump link db
    inject      inject new urls into the database
    generate    generate new segments to fetch from crawl db
    freegen     generate new segments to fetch from text files
    fetch       fetch a segment's pages
- Note that config files should be modified in apache-nutch-2.2.1/runtime/local/conf/, and that ant clean will remove this directory (keep copies of modified config files).
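Since ant clean wipes runtime/local/conf/, it is worth making the backup-and-restore step a habit. A minimal sketch follows; a scratch directory stands in for /opt/apache-nutch-2.2.1 here so the steps can be tried anywhere, and the rm/mkdir pair simulates what "ant clean" and "ant" do to runtime/:

```shell
# Sketch of the backup-before-"ant clean" habit. A scratch directory stands
# in for /opt/apache-nutch-2.2.1; substitute the real paths on your install.
NUTCH=$(mktemp -d)
mkdir -p "$NUTCH/runtime/local/conf"
echo '<configuration/>' > "$NUTCH/runtime/local/conf/nutch-site.xml"

# 1. Save edited config files somewhere "ant clean" cannot touch.
BACKUP=$(mktemp -d)
cp "$NUTCH/runtime/local/conf/nutch-site.xml" "$BACKUP/"

# 2. "ant clean" removes runtime/ (simulated here); "ant" rebuilds it.
rm -rf "$NUTCH/runtime"
mkdir -p "$NUTCH/runtime/local/conf"

# 3. Restore the saved config into the rebuilt runtime.
cp "$BACKUP/nutch-site.xml" "$NUTCH/runtime/local/conf/"
```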
- Edit /opt/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml:
  - Specify your Agent Name, (say) “My Nutch Spider”:
    <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
    </property>
  - Specify the GORA backend:
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
      <description>Default class for storing data</description>
    </property>
- Edit /opt/apache-nutch-2.2.1/ivy/ivy.xml:
  - Ensure the gora-hbase dependency is available (uncomment it):
    <!-- Uncomment this to use HBase as Gora backend. -->
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
  - After changing ivy.xml, rerun ant so the new dependency is pulled in and the runtime is rebuilt.
Crawl Your First Website
- Make a directory for the URLs to crawl:
  cd /opt/apache-nutch-2.2.1/runtime/local/
  mkdir -p urls
- Create seed.txt with one URL per line for each site you want to crawl, e.g. http://nutch.apache.org/
- Upload seed.txt to the urls directory.
- Edit the file conf/regex-urlfilter.txt and replace
  # accept anything else
  +.
  with a regular expression matching the domain you wish to crawl. E.g., to limit the crawl to the nutch.apache.org domain, the line should read:
  +^http://([a-z0-9]*\.)*nutch.apache.org/
  This will include any URL in the domain nutch.apache.org.
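The seed-list and URL-filter steps above can be sketched and sanity-checked with plain shell tools; a scratch directory stands in for runtime/local here, and grep -E is used to confirm what the filter regex will and will not accept:

```shell
# Scratch-dir sketch of the seed list; in the real install the urls/
# directory lives under /opt/apache-nutch-2.2.1/runtime/local.
WORK=$(mktemp -d)
mkdir -p "$WORK/urls"
printf 'http://nutch.apache.org/\n' > "$WORK/urls/seed.txt"

# The regex-urlfilter.txt rule, checked with grep -E: it accepts URLs on
# the nutch.apache.org domain (including subdomains) and nothing else.
FILTER='^http://([a-z0-9]*\.)*nutch.apache.org/'
echo 'http://nutch.apache.org/bot.html' | grep -qE "$FILTER" && echo accepted
echo 'http://example.com/'             | grep -qE "$FILTER" || echo rejected
```

Testing the filter expression this way before a crawl is cheap insurance against a run that silently fetches nothing.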
Install Apache HBase
https://hbase.apache.org/
https://hbase.apache.org/book/ – Reference manual
Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy-to-use Java API for client access.
- Block cache and Bloom filters for real-time queries.
- Query predicate push-down via server-side Filters.
- Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.
- Extensible JRuby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
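The section above describes HBase but stops short of actual install steps. For a single-node HBase backing Gora, a minimal conf/hbase-site.xml might look like the following; the data paths are illustrative assumptions (standalone mode writes to the local filesystem), and note that the gora-hbase 0.3 dependency pinned in ivy.xml above was built against the older HBase 0.90.x line, so match versions carefully:

```xml
<configuration>
  <!-- Where HBase stores its data in standalone mode; local path is an
       illustrative choice. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///var/lib/hbase</value>
  </property>
  <!-- Where the bundled ZooKeeper keeps its state. -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/var/lib/zookeeper</value>
  </property>
</configuration>
```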
Deploy Nutch 1.0 in Tomcat
sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
(Modify the property “searcher.dir” to /opt/nutch-1.0/crawl/ and the docBase attribute to the full path of your nutch-1.0 war file: docBase=”nutch.war” path=”/opt/tomcat/webapps/”.)
sudo ant
sudo ant war (this compiles with the new build/nutch.xml file)
sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
(A .war file is a zip/jar file known as a “web archive” or war file; it is uncompressed when Tomcat is started.)
- Edit /etc/profile:
  sudo vi /etc/profile
  Add these lines just above "# ksh workaround":
  ##Tomcat 6 / Java##
  JAVA_HOME="/usr/java/jdk1.6.0_16"
  export JAVA_HOME
  CATALINA_HOME="/opt/tomcat"
  export CATALINA_HOME
  NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
  export NUTCH_JAVA_HOME
  ##End Tomcat 6 / Java##
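The /etc/profile additions can be sanity-checked without logging out by sourcing the fragment; a small sketch (written to a scratch file here so it can be tried anywhere, with the tutorial's paths; adjust the JDK path to your system):

```shell
# Write the tutorial's profile fragment to a scratch file and source it,
# then confirm the variables are exported as expected.
PROFILE=$(mktemp)
cat > "$PROFILE" <<'EOF'
##Tomcat 6 / Java##
JAVA_HOME="/usr/java/jdk1.6.0_16"
export JAVA_HOME
CATALINA_HOME="/opt/tomcat"
export CATALINA_HOME
NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
export NUTCH_JAVA_HOME
##End Tomcat 6 / Java##
EOF
. "$PROFILE"
echo "JAVA_HOME=$JAVA_HOME"
```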
- Configure Nutch to fetch URLs:
cd /opt/nutch-1.0; sudo mkdir urls
(Make a flat text file in here called "seed" and create a list of urls to be crawled, with each url on a new separate line: http://www.example.com)
sudo vi conf/nutch-default.xml
Edit the following:
  http.agent.name         <value>My Spider</value>
  http.robots.agents      <value>My Spider</value>
  http.agent.description  <value>My Bot</value>
  http.agent.url          <value>http://www.example.com</value>
  http.agent.email        <value>admin@example.com</value>
All other values remain as default; do not attempt to alter them unless you have a backup and/or you know what you're doing.
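A side note: editing nutch-default.xml works, but the usual convention is to leave the defaults alone and place overrides in conf/nutch-site.xml, which makes upgrades and rebuilds safer. An equivalent fragment, using the same placeholder values as above, would be:

```xml
<property>
  <name>http.agent.name</name>
  <value>My Spider</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>admin@example.com</value>
</property>
```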
- Nutch “deepcrawler” script:
Put this script in /opt/nutch-1.0/bin and make it executable:
  chmod +x deepcrawler
Note: this script assumes the urls you plan to inject are stored in /opt/nutch-1.0/urls/seed, and it will create a new directory, /opt/nutch-1.0/crawl1, to store the new crawl.
- Fetch URLs with Nutch via command line:
If you do not alter the deepcrawler script, it will most likely run for many days or weeks depending on the number of urls you inject, so you'll want to run it in screen so you can detach and reattach to check progress:
  screen -S nutch
  sudo service tomcat start
  cd /opt/nutch-1.0; su -c "bin/deepcrawler"
- Download & install Nutch-Gui 0.2:
Note: if you use the script provided above, you can skip the GUI altogether.
Download Nutch-Gui 0.2 from: http://github.com/101tec/nutch/downloads
  sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
  sudo ant clean package
  cd build/nutch-gui-0.2
  sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war
Unsecured quick test method, to assure it's working:
  su -c "bin/nutch admin /opt/nutch-1.0 50060"
  http://example.com:50060/general
More secure, with password protection:
  sudo vi conf/nutchguiUsers.properties
  (Edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role.)
  screen -S nutch-gui (since we'll probably run it for a while)
  su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
  http://example.com:50060/general
Troubleshooting / How To Test
This section covers basic troubleshooting checks and what output to expect.
- Make sure the required packages are installed and the JAVA_HOME variable is set in /etc/profile:
  rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
  tomcat-6.0.16-0
  jdk-1.6.0_16-fcs
  ant-1.6.5-2jpp.2
  xml-commons-apis-1.3.02-0.b2.7jpp.10
  ant-trax-1.6.5-2jpp.2
  /usr/java/jdk1.6.0_16
- Replace "localhost" with your machine's IP:
  Try accessing Tomcat here: http://localhost:8080/
  Try accessing Nutch here: http://localhost:8080/nutch/
  Try accessing Nutch-Gui here: http://localhost:50060/general
- Set Tomcat to start on boot:
  sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
  tomcat  0:off 1:off 2:on 3:on 4:on 5:on 6:off