Preface
First, a piece of news: Nutch 1.5 has been released! It goes straight to Tika 1.1 and Hadoop 1.0, so there is plenty more to play with.
That said, a quick look shows that even in 1.5, Nutch still does not ship a script for incremental crawling by default. The official Nutch wiki has a recrawl script written by Susam Pal
(http://wiki.apache.org/nutch/Crawl), but it cannot be used as-is, because:
- It only works when Nutch runs in local mode
- Its search side is Lucene-based, so it does not work when Solr is used for indexing
- Even with Lucene indexing, it only works with Nutch 1.2 and earlier; the command arguments changed in Nutch 1.3+
I previously used the official recrawl script to do incremental crawling with Nutch 1.2. Lately I have been working with Nutch 1.4, so I adapted that script into one for Nutch 1.4 + Solr 3.4. It currently runs successfully on a 1-master/9-slave Hadoop cluster, and I am writing it down here for future reference. I once googled everywhere for such a script without finding one, so I am sharing mine here; comments and suggestions for improvement are very welcome ^_^
Main Text
Compared with Nutch 1.2 and earlier, many commands in Nutch 1.3+ take different arguments; for example, the fetch command no longer accepts the -noParsing option.
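To make the change concrete, here is a rough before/after of the fetch-then-parse sequence (the `$segment` path and the thread count are illustrative, not taken from the original script):

```shell
# Nutch 1.2 and earlier: parsing could be skipped during the fetch
bin/nutch fetch $segment -noParsing
bin/nutch parse $segment

# Nutch 1.3+: -noParsing is gone; fetch no longer parses by default,
# and parsing is always run as a separate step
bin/nutch fetch $segment -threads 64
bin/nutch parse $segment
```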
First, the requirements for this script:
- Support running Nutch 1.4 on a Hadoop cluster, with Solr as the indexing backend
- Minimize the amount of data that gets re-indexed
- Keep the complete crawl data on HDFS
- Provide some degree of fault tolerance
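The second requirement (minimizing re-indexing) is handled by copying only the freshly fetched segments into a temporary directory and indexing just that directory. A toy local-filesystem sketch of the idea (the directory names are made up; the real script does this on HDFS with hadoop fs -cp):

```shell
#!/usr/bin/env bash
# Toy illustration: only the segment fetched in this pass is copied
# into a temp dir, and only that dir would be handed to the indexer.
workdir=$(mktemp -d)
mkdir -p "$workdir/segments/20120601100000" "$workdir/tmpsegmentsdir"
# after fetch+parse, copy just the new segment for indexing
cp -r "$workdir/segments/20120601100000" "$workdir/tmpsegmentsdir/"
indexed=$(ls "$workdir/tmpsegmentsdir")
echo "segments to index: $indexed"
rm -rf "$workdir"
```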
The script follows (my shell scripting skills are rather weak, so it is not pretty):
#!/usr/bin/env bash
# recrawl script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/recrawl
# Author: Joey (based on Susam Pal's work)
if [ $# != 3 ]
then
echo "Usage: recrawl DATADIR URLDIR SOLRSERVERADDRESS"
echo "where "
echo " DATADIR is the parent directory where crawled data will be stored "
echo " URLDIR is the parent directory of the injected url files "
echo " SOLRSERVERADDRESS is the address of solr index server "
echo "eg: recrawl hdfs://localhost:9000/user/root/crawleddatadir \
hdfs://localhost:9000/user/root/crawledurldir http://localhost:8983/solr"
exit 1
fi
# Crawl options
depth=3
threads=64
topN=128
# Temp segments dir in Hadoop DFS
TEMPSEGMENTSDIR="tmpsegmentsdir"
# Hadoop FsShell commands for ls/rm/cp/mv
LSCOMMAND="hadoop fs -ls"
RMCOMMAND="hadoop fs -rmr"
CPCOMMAND="hadoop fs -cp"
MVCOMMAND="hadoop fs -mv"
if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echo recrawl: $0 could not find environment variable NUTCH_HOME
echo recrawl: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo recrawl: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
# Loop forever (one iteration per recrawl count); stopped via the switcher file below
for((j=0;j>-1;j++))
do
# ----- Check the recrawl switcher to see whether to continue -----
switch=`head -n 1 $NUTCH_HOME/bin/recrawlswitcher`
if [ "$switch" == "off" ]
then
echo "--- Shutting down the recrawl because the recrawl switcher is off ---"
break
fi
echo "--- Beginning at count `expr $j + 1` ---"
steps=6
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $1/crawldb $2
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0;i<$depth;i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate $1/crawldb $1/segments -topN $topN
if [ $? -ne 0 ]
then
echo "recrawl: Stopping at depth `expr $i + 1`. No more URLs to fetch."
break
fi
# The newest segment is on the last line of the listing; its path is the 8th field
segment=`$LSCOMMAND $1/segments/ | tail -1 | awk '{print $8}'`
echo "--- Fetching segment $segment ---"
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "recrawl: fetch $segment at depth `expr $i + 1` failed."
echo "recrawl: Deleting segment $segment."
$RMCOMMAND $segment
continue
fi
echo "--- Beginning parsing ---"
$NUTCH_HOME/bin/nutch parse $segment
echo "--- Beginning updatedb ---"
$NUTCH_HOME/bin/nutch updatedb $1/crawldb $segment
echo "--- Beginning copy segment dir to temp segments dir ---"
$CPCOMMAND $segment $TEMPSEGMENTSDIR/
done
echo "--- Merge Segments (Step 3 of $steps) ---"
$NUTCH_HOME/bin/nutch mergesegs $1/MERGEDsegments -dir $1/segments
echo "--- Backup the old segments dir; the backup is deleted once this crawl count finishes ---"
$MVCOMMAND $1/segments $1/segmentsBackup
if [ $? -ne 0 ]
then
echo "recrawl: Failed to back up the current segments, exiting to avoid data loss"
exit 1
fi
echo "--- Move the MERGEDsegments to the segments dir ---"
$MVCOMMAND $1/MERGEDsegments $1/segments
if [ $? -ne 0 ]
then
echo "recrawl: Failed to move MERGEDsegments to segments, exiting to avoid data loss"
exit 1
fi
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $1/linkdb -dir $1/segments
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch solrindex $3 $1/crawldb -linkdb $1/linkdb -dir $TEMPSEGMENTSDIR
echo "----- Delete Duplicates (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch solrdedup $3
echo "--- The main recrawl process is done, now deleting the temp segments dir ---"
$RMCOMMAND $TEMPSEGMENTSDIR/*
echo "--- Delete the old segments backup ---"
$RMCOMMAND $1/segmentsBackup
echo "recrawl: FINISHED: Crawl count `expr $j + 1` completed!"
done
echo "All FINISHED after $j counts..."
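One line worth explaining is how the script finds the segment that generate just created: `hadoop fs -ls` lists the segments oldest-first with the path in the 8th column, so the last line's 8th field is the newest segment. A self-contained sketch with faked ls output (the paths are made up):

```shell
#!/usr/bin/env bash
# Two fake lines in `hadoop fs -ls` format; the path is field 8.
sample_ls='Found 2 items
drwxr-xr-x   - root supergroup          0 2012-06-01 10:00 /user/root/crawl/segments/20120601100000
drwxr-xr-x   - root supergroup          0 2012-06-02 10:00 /user/root/crawl/segments/20120602100000'
# Take the last line (newest segment) and extract its path
segment=$(echo "$sample_ls" | tail -1 | awk '{print $8}')
echo "$segment"   # -> /user/root/crawl/segments/20120602100000
```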
I have been testing this script for a few days and it runs fairly smoothly, but there is still room for improvement, for example:
- Currently only the generate and fetch steps check return codes; every Nutch stage should check its return value and handle errors accordingly
- On errors the script currently just exits without corrupting data; it should also alert the administrator, e.g. via sendmail
- The parameter handling could be tidied up: depth, topN and threads are defined inside the script, while datadir, urldir and solraddr are passed as arguments, which is a bit inconsistent; they could all be defined in the script
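For the first improvement, a minimal sketch of a per-stage wrapper (the name run_stage is hypothetical; true/false stand in for succeeding/failing nutch commands):

```shell
#!/usr/bin/env bash
# Run one crawl stage and report failure in a uniform way.
# $1 is a label for the stage, the rest is the command to run.
run_stage () {
  stage_name=$1
  shift
  "$@"
  if [ $? -ne 0 ]
  then
    echo "recrawl: stage '$stage_name' failed (an alert mail could be sent here)"
    return 1
  fi
}

# `true` stands in for a nutch command that succeeds,
# `false` for one that fails.
run_stage "inject-demo" true && echo "inject-demo ok"
run_stage "fetch-demo" false || echo "recrawl would abort after fetch-demo"
```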