http://biaowen.javaeye.com/blog/420586
Note: the Tomcat and Nutch paths need to be changed to your own:
# Nutch root directory
NUTCH_HOME=/cygdrive/e/java/CoreJava/IndexSearchAbout/nutch-1.0
# Tomcat directory
CATALINA_HOME=/cygdrive/d/JavaTools/apache-tomcat-6.0.14
Also batch-replace crawled/ with your own index storage directory.
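For instance, the replacement can be done in one go with sed (a sketch; runbot is the file name chosen below, and /cygdrive/e/index/mycrawl/ is a placeholder for your own directory):

sed -i 's#crawled/#/cygdrive/e/index/mycrawl/#g' runbot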
Save the shell script below into your Nutch root directory under any name (e.g. runbot); you can then run it from Cygwin simply by entering the file name.
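For example (using the paths configured above; note that this version of the script hardcodes safe=yes rather than parsing the [safe] argument mentioned in its header):

$ cd /cygdrive/e/java/CoreJava/IndexSearchAbout/nutch-1.0
$ chmod +x runbot
$ ./runbot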
#!/bin/sh
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
#
# Take special care with incremental crawling: if the Crawl and the
# Searcher run on the same machine, the running Tomcat's threads hold
# the index files open. An incremental crawl has to delete the old
# index and regenerate it (the crawl/index directory); since Tomcat
# holds that directory, the crawler cannot touch it and the run fails.
# A simple way to tell whether crawl/index is locked is to try
# deleting it by hand (back the directory up first!); if it cannot be
# deleted, it is in use.
#
# The incremental-crawl script below resolves the locking problem in
# its logic, but because the machine may not shut java.exe down in
# time, it still often throws an exception like "dir ... exists".
# 1. Inject the crawl entry points (seed URLs)
# 2. Crawl, one depth at a time
# 3. Merge the fetched segments
# 4. Write the segment links into the linkdb
# 5. Generate the indexes
# 6. Deduplicate
# 7. Merge the indexes
# 8. Stop Tomcat to release the threads holding the index; restart
#    Tomcat after the index update
#
# Parameter settings
depth=5
threads=10
adddays=1
topN=30 # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
# Safe mode: "yes" backs the old index up during the index update;
# otherwise the index is replaced in place.
safe=yes

# Nutch root directory
NUTCH_HOME=/cygdrive/e/java/CoreJava/IndexSearchAbout/nutch-1.0
# Tomcat directory
CATALINA_HOME=/cygdrive/d/JavaTools/apache-tomcat-6.0.14
if [ -z "$NUTCH_HOME" ]
then
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi
steps=8

# 1. Inject the crawl entry points
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawled/crawldb urls/url.txt
# 2. Crawl, one depth at a time
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawled/crawldb crawled/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawled/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi
  $NUTCH_HOME/bin/nutch updatedb crawled/crawldb $segment
done
# 3. Merge the fetched segments
echo "----- Merge Segments (Step 3 of $steps) -----"
# Merge all segments into one and store it in MERGEDsegments
$NUTCH_HOME/bin/nutch mergesegs crawled/MERGEDsegments crawled/segments/*
#rm $RMARGS crawled/segments
rm $RMARGS crawled/BACKUPsegments
mv $MVARGS crawled/segments crawled/BACKUPsegments
mkdir crawled/segments
mv $MVARGS crawled/MERGEDsegments/* crawled/segments
rm $RMARGS crawled/MERGEDsegments
# 4. Write the segment links into the linkdb
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawled/linkdb crawled/segments/*

# 5. Generate the indexes
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawled/NEWindexes crawled/crawldb crawled/linkdb crawled/segments/*

# 6. Deduplicate
echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawled/NEWindexes
# 7. Merge the indexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawled/NEWindex crawled/NEWindexes

# 8. Stop Tomcat to release the threads holding the index; restart
#    Tomcat after the index update.
# Tomcat must be stopped first; otherwise it keeps the index directory
# open and the index cannot be updated (you get a "dir ... exists"
# style exception).
echo "----- Loading New Index (Step 8 of $steps) -----"
#${CATALINA_HOME}/bin/shutdown.sh

# In safe mode, back the old index up before deleting it
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawled/NEWindexes
  rm $RMARGS crawled/index
else
  rm $RMARGS crawled/BACKUPindexes
  rm $RMARGS crawled/BACKUPindex
  mv $MVARGS crawled/NEWindexes crawled/BACKUPindexes
  mv $MVARGS crawled/index crawled/BACKUPindex
  rm $RMARGS crawled/NEWindexes
  rm $RMARGS crawled/index
fi

# The old index must be removed (done above) before the new one is
# moved into place
mv $MVARGS crawled/NEWindex crawled/index

# Restart Tomcat once the index has been updated
#${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: Crawl completed!"
echo ""