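Original work by the author. If you repost it, please credit the source: http://blog.csdn.net/panjunbiao/article/details/12171147
Installing Nutch
Reference documentation: http://wiki.apache.org/nutch/NutchTutorial
Install the required packages: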
yum update
yum list java*
yum install java-1.7.0-openjdk-devel.x86_64
Find the Java installation path:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
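The linked answer is not reproduced here, so the following is only a minimal sketch of one common way to locate the installation: follow the symlinks behind the `java` command and strip the trailing `bin/java`. The helper name `java_home_from_binary` is made up for illustration, and the sketch assumes GNU `readlink` (which supports `-f`) as found on CentOS.

```shell
# Hypothetical helper: print a candidate JAVA_HOME for a given java executable.
java_home_from_binary() {
    # Follow every symlink (e.g. the /etc/alternatives chain) to the real file.
    real=$(readlink -f "$1") || return 1
    # Drop the trailing /bin/java to get the directory that contains bin/.
    home=${real%/bin/java}
    # OpenJDK packages often place the binary under <jdk>/jre/bin/java,
    # so drop a trailing /jre as well to land on the JDK directory.
    home=${home%/jre}
    printf '%s\n' "$home"
}

# Only query the real system if a java command is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    java_home_from_binary "$(command -v java)"
fi
```

The printed path may be a versioned directory under /usr/lib/jvm; whether /usr/lib/jvm/java points to the same place depends on the installed packages, so it is worth comparing the two before editing /etc/profile.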
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html
vi + /etc/profile
JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
source /etc/profile
Download the binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz
Unpack it:
tar -xvzf apache-nutch-1.7-bin.tar.gz
Verify the executable:
cd apache-nutch-1.7
bin/nutch
The usage help should now be printed, which means the installation succeeded.
Edit conf/nutch-site.xml to set the agent name sent in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>
Create the seed directory:
mkdir -p urls
Run the first crawl (the seed directory is still empty at this point, so nothing is fetched):
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Write the seed URL to the file urls/seed.txt:
http://www.36kr.com/
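As a small sketch, the seed file can also be written from the shell instead of an editor. It assumes the current directory is the unpacked apache-nutch-1.7 directory and that one URL per line is the intended layout.

```shell
# Make sure the seed directory exists, then write one URL per line.
mkdir -p urls
printf '%s\n' 'http://www.36kr.com/' > urls/seed.txt

# Print the file so the content the injector will read can be checked.
cat urls/seed.txt
```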
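Restrict the URL filter to the seed site (these lines appear to be the end of conf/regex-urlfilter.txt, with the default accept-all rule commented out):
# accept anything else
# +.
# added by panjunbiao
+36kr.com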
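To see what the filter lines above do to a given URL, here is a rough shell approximation. It assumes the filter reads rules top to bottom, lets the first rule whose pattern is found anywhere in the URL decide (`+` accepts, `-` rejects), and rejects URLs that match no rule. The helper `url_allowed` is hypothetical, and `grep -E` only approximates the Java regular expressions that Nutch uses.

```shell
# Hypothetical helper: succeed if the URL ($1) is accepted by the rules ($2).
url_allowed() {
    url=$1
    rules=$2
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;        # skip blank lines and comments
        esac
        sign=${rule%"${rule#?}"}        # first character: + or -
        pattern=${rule#?}               # the rest of the line is the regex
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            [ "$sign" = "+" ]           # first matching rule decides
            return
        fi
    done <<EOF
$rules
EOF
    return 1                            # no rule matched: reject
}

# Only the tail of the filter file shown in this post is modelled here.
rules='# accept anything else
# +.
# added by panjunbiao
+36kr.com'

url_allowed 'http://www.36kr.com/' "$rules" && echo "accepted: http://www.36kr.com/"
url_allowed 'http://example.org/' "$rules" || echo "rejected: http://example.org/"
```

Because the pattern is unanchored, any URL that merely contains `36kr.com` would also pass; a stricter pattern anchored on the host name may be preferable in practice.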
Run the crawler again; this time some seed sites turn out to be skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029