IntranetRecrawl

Reposted from: http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03

  1. Version 0.7.2
    1. Example Usage
    2. Script
  2. Version 0.8.0 and 0.9.0
    1. Example Usage
    2. Changes for 0.9.0
    3. Code

Here are a couple of scripts for recrawling your Intranet.

Version 0.7.2

Place in the main nutch directory and run.

Example Usage

./recrawl crawl 10 31

(with adddays set to 31, every page appears due against Nutch's default 30-day re-fetch interval, so all pages are recrawled)
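
To run the recrawl unattended, a crontab entry along these lines should work; the schedule, install path, and log file are illustrative assumptions, and the cd matters because the script expects to be run from the main nutch directory:

# Hypothetical crontab line: recrawl every Sunday at 02:00,
# assuming Nutch is installed in /usr/local/nutch.
0 2 * * 0 cd /usr/local/nutch && ./recrawl crawl 10 31 >> /var/log/nutch-recrawl.log 2>&1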

Script
#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

Version 0.8.0 and 0.9.0

Place in the bin sub-directory within your Nutch install and run.

CALL THE SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK
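
This is because the script locates the nutch launcher with `dirname $0` (see the Code section below): the directory it computes mirrors however you invoked the script, so a relative invocation from the wrong working directory points at a launcher that does not exist. A quick illustration, assuming an install at /usr/local/nutch:

/usr/local/nutch/bin/recrawl ...   # nutch_dir=/usr/local/nutch/bin, launcher found
cd /tmp && bin/recrawl ...         # nutch_dir=bin, looks for /tmp/bin/nutch and fails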

Example Usage

/usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31

Setting adddays to 31 causes all pages to be recrawled, for the same reason as in the 0.7.2 example above.

Changes for 0.9.0

No changes are necessary for this script to run with Nutch 0.9.0.

Code
#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
#
# The script merges the new segments all into one segment to prevent redundant
# data. However, if your crawl/segments directory is becoming very large, I
# would suggest you delete it completely and generate a new crawl. This probably
# needs to be done every 6 months.
#
# Modified by Matthew Holt
# mholt at elon dot edu

if [ -n "$1" ]
then
tomcat_dir=$1
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc
at/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, i
e: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi

if [ -n "$2" ]
then
crawl_dir=$2
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc
at/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, i
e: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi

if [ -n "$3" ]
then
depth=$3
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc
at/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, i
e: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi

if [ -n "$4" ]
then
adddays=$4
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi

if [ -n "$5" ]
then
topn="-topN $5"
else
topn=""
fi

# Sets the path to bin
nutch_dir=`dirname $0`

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $webdb_dir $segment
done

# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

for segment in `ls -d $segments_dir/* | tail -$depth`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf $segment
done

cp -R $mergesegs_dir/* $segments_dir
rm -rf $mergesegs_dir

# Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment

# De-duplicate indexes
$nutch_dir/nutch dedup $new_indexes

# Merge indexes
$nutch_dir/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $tomcat_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes

echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
echo " that the crawl directory be deleted once every 6 months (or more"
echo " frequent depending on disk constraints) and a new crawl generated."

last edited 2007-08-16 17:21:55 by JamesVictor
