I deployed Nutch to run on Hadoop and started a crawl:
bin/crawl hdfs://localhost:9000/user/hadoop/urls data http://localhost:8983/solr/ 1
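For readers new to the script, the four positional arguments map onto variables roughly like this. This is an illustrative paraphrase, not the verbatim Nutch code; only CRAWL_PATH is confirmed by the script fragment quoted later in this post, the other names are my labels:

```shell
# Hypothetical paraphrase of how bin/crawl consumes its positional arguments;
# the values mirror the command above, the variable names are illustrative.
set -- hdfs://localhost:9000/user/hadoop/urls data http://localhost:8983/solr/ 1
SEEDDIR="$1"      # seed URL directory (here on HDFS)
CRAWL_PATH="$2"   # crawl dir -- the "data" whose segments/ the script later lists
SOLRURL="$3"      # Solr endpoint used for indexing
LIMIT="$4"        # number of generate/fetch rounds
echo "$CRAWL_PATH"   # -> data
```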
After the generator step finished, it printed:
ls: cannot access data/segments/: No such file or directory
Operating on segment :
Fetching :
Checking in HDFS, the directory clearly does exist.
I was baffled.
After searching Baidu and Google turned up nothing, it occurred to me to read the Nutch source.
Here is the relevant part of the crawl script:
# determines whether mode based on presence of job file
mode=local
if [ -f ../*nutch-*.job ]; then
  mode=distributed
fi
......
if [ $mode = "local" ]; then
  SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
else
  SEGMENT=`hadoop fs -ls $CRAWL_PATH/segments/ | grep segments | sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
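The mode check above is the suspicious part: `../*nutch-*.job` is a relative glob, so whether the script detects distributed mode depends on the directory you invoke it from, not on where the script lives. A minimal sketch of that behavior (the temp paths and the job-file name are hypothetical, for illustration only):

```shell
# Sketch: the job-file test from the crawl script, evaluated from two
# different working directories. Paths and file name are hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/bin"
touch "$tmp/apache-nutch-1.9.job"   # a *nutch-*.job next to bin/, as in a deployed runtime

cd "$tmp/bin"                       # run from inside bin/: ../ resolves to the job file's dir
mode=local
if [ -f ../*nutch-*.job ]; then mode=distributed; fi
echo "invoked from bin/:   $mode"   # prints "distributed"

cd "$tmp"                           # run as bin/crawl from the parent directory
mode=local
if [ -f ../*nutch-*.job ]; then mode=distributed; fi
echo "invoked from parent: $mode"   # prints "local" -- ../ now points one level too high
```

If this reading is right, invoking the script as `bin/crawl` from the Nutch home directory leaves mode=local, so the script runs a plain `ls` against `$CRAWL_PATH/segments/`, a path that exists only on HDFS, which would produce exactly the "No such file or directory" error above.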