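How to install the JDK and configure its environment variables is not covered here; there are plenty of examples online. After downloading Nutch 1.4, unpack it, for example into the /home/chenyanting/nutch directory, with the command: tar zxvf apache-nutch-1.4-bin.tar.gz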
In this directory, run the command ./bin/nutch. If the following output does not appear:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
then the runtime/local/bin/nutch script under the Nutch directory needs execute permission. In that bin directory, run: chmod 755 nutch
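The permission fix can also be scripted so that it only runs when it is needed. This is a minimal sketch, not part of Nutch: ensure_executable is a hypothetical helper, and the default path is simply the one mentioned in this post, so override NUTCH_SCRIPT if your launcher lives elsewhere.

```shell
#!/bin/sh
# Assumed launcher location (taken from this post); override NUTCH_SCRIPT if yours differs.
NUTCH_SCRIPT="${NUTCH_SCRIPT:-runtime/local/bin/nutch}"

# Hypothetical helper: add execute permission only when it is missing.
ensure_executable() {
    if [ ! -f "$1" ]; then
        echo "launcher not found: $1" >&2
        return 1
    fi
    if [ ! -x "$1" ]; then
        chmod 755 "$1"
        echo "made executable: $1"
    else
        echo "already executable: $1"
    fi
}

ensure_executable "$NUTCH_SCRIPT" || echo "check the path and try again" >&2
```

Run it from the Nutch directory; it leaves a launcher that is already executable untouched.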
export JAVA_HOME='/path/to/java'
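JAVA_HOME should point at the JDK installation directory, that is, the directory that contains bin/java. A quick sanity check could look like the sketch below; check_java_home is a hypothetical helper written for this post, not something shipped with Nutch or the JDK.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given directory contains an
# executable bin/java, which is what a usable JAVA_HOME should have.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java: '${JAVA_HOME:-}'" >&2
fi
```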
<value>My Nutch Spider</value>
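The value line above is presumably the crawler name that goes into the http.agent.name property in conf/nutch-site.xml, as in the official Nutch tutorial; that is an assumption, since the rest of the property is not shown here. The sketch below writes a complete sample to a scratch file (nutch-site.sample.xml is a made-up name) so that nothing under conf/ is overwritten.

```shell
#!/bin/sh
# OUT is a hypothetical scratch file name; nothing under conf/ is touched.
OUT="${OUT:-nutch-site.sample.xml}"

# Quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$OUT" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF

echo "wrote $OUT"
```

Copy the property element into your own conf/nutch-site.xml by hand once it looks right.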
mkdir -p urls
cd urls
- Create a file named seeds.txt in it
- Add the URLs you want to crawl to this file, for example:
- Edit the file conf/regex-urlfilter.txt and append at the end
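The seed and filter steps can be sketched as a small script. The site is an assumption: nutch.apache.org and the matching rule are borrowed from the official Nutch tutorial as placeholders and are not this post's values. The script is meant to be run from the Nutch directory (not from inside urls), and it overwrites urls/seeds.txt.

```shell
#!/bin/sh
# Hypothetical placeholders: replace both with the site you actually want to crawl.
SEED_URL="http://nutch.apache.org/"
FILTER_RULE='+^http://([a-z0-9]*\.)*nutch.apache.org/'
FILTER_FILE="${FILTER_FILE:-conf/regex-urlfilter.txt}"

# One seed URL per line.
mkdir -p urls
printf '%s\n' "$SEED_URL" > urls/seeds.txt

if [ -f "$FILTER_FILE" ]; then
    # Append the rule only if the exact line is not already present.
    grep -qxF -e "$FILTER_RULE" "$FILTER_FILE" || printf '%s\n' "$FILTER_RULE" >> "$FILTER_FILE"
else
    echo "filter file not found: $FILTER_FILE (run this from the Nutch directory)" >&2
fi
```

The filter uses the first rule that matches a URL, so if the file still contains the default catch-all line +. above the new rule, the new rule will never be reached; in that case replace the catch-all instead of appending after it.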
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
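For reference, here is the same crawl command with its options spelled out. It is only a sketch: the variable names are made up for this post, and it assumes you are back in the Nutch directory (the earlier cd urls has to be undone first).

```shell
#!/bin/sh
CRAWL_DIR="crawl"   # -dir: directory the crawl data is written to
DEPTH=3             # -depth: how many link levels to follow from the seed URLs
TOP_N=5             # -topN: maximum number of pages fetched at each level

CRAWL_CMD="bin/nutch crawl urls -dir $CRAWL_DIR -depth $DEPTH -topN $TOP_N"
echo "$CRAWL_CMD"

# Run the crawl only when the launcher and the seed directory are both present.
if [ -x bin/nutch ] && [ -d urls ]; then
    $CRAWL_CMD
else
    echo "bin/nutch or urls/ not found; run this from the Nutch directory" >&2
fi
```

Small values such as depth 3 and topN 5 keep a first test run short; raise them once the setup works.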
This article comes from the "陈砚羲" blog. Please contact the author before reposting!