Getting Started with Nutch 1.4 + Solr 3.5

This article walks through installing and configuring Nutch 1.4 with Solr 3.5: environment setup, configuration changes, a test crawl, and Solr integration. It also covers common errors such as the missing Nutch fetcher agent name and "input path does not exist".

1. Download Nutch 1.4 & Solr 3.5

http://mirror.bjtu.edu.cn/apache/lucene/solr/3.5.0/apache-solr-3.5.0.zip
http://mirror.bjtu.edu.cn/apache/nutch/apache-nutch-1.4-bin.zip


2. Extract and Install

2.1 Directory layout

Nutch install directory: /home/xcloud/spider/nutch
Solr install directory: /home/xcloud/spider/solr


2.2 Extract the files

Extract Nutch:
unzip apache-nutch-1.4-bin.zip
mv apache-nutch-1.4 nutch
chmod -R 775 nutch


Extract Solr:
unzip apache-solr-3.5.0.zip
mv apache-solr-3.5.0 solr
chmod -R 775 solr


3. Nutch Configuration and Test Run

3.1 Nutch directory layout

Nutch run directory: $NUTCH_HOME/runtime/local/bin
Nutch config directory: $NUTCH_HOME/runtime/local/conf
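Later commands refer to $NUTCH_HOME and ${SOLR_HOME}. These are conventions of this guide, not variables Nutch or Solr set for you, so export them once per shell (paths from section 2.1):

```shell
# Convenience variables for this guide (install paths from section 2.1)
export NUTCH_HOME=/home/xcloud/spider/nutch
export SOLR_HOME=/home/xcloud/spider/solr

# Sanity check: the run and conf directories referenced below
echo "$NUTCH_HOME/runtime/local/bin"
echo "$NUTCH_HOME/runtime/local/conf"
```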


3.2 Nutch test run

$NUTCH_HOME/runtime/local/bin/nutch


xcloud@xcloud:~/spider/nutch/runtime/local/bin$ ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


Expert: -core option is for developers only. It avoids building the job jar, 
        instead it simply includes classes compiled with ant compile-core. 
        NOTE: this works only for jobs executed in 'local' mode
xcloud@xcloud:~/spider/nutch/runtime/local/bin$


If this usage listing is printed, the installation works.


3.3 Nutch configuration changes

Edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>
  <property>
        <name>http.agent.name</name>
        <value>MyNutchAgent</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.
          NOTE: You should also check other related properties:
                http.robots.agents
                http.agent.description
                http.agent.url
                http.agent.email
                http.agent.version
                and set their values appropriately.
        </description>
  </property>
  <property>
        <name>http.agent.description</name>
        <value></value>
        <description>Further description of our bot- this text is used in
        the User-Agent header. It appears in parenthesis after the agent name.
        </description>
  </property>
  <property>
        <name>http.agent.url</name>
        <value></value>
        <description>A URL to advertise in the User-Agent header. This will
        appear in parenthesis after the agent name. Custom dictates that this
        should be a URL of a page explaining the purpose and behavior of this crawler.
        </description>
  </property>
  <property>
        <name>http.agent.email</name>
        <value></value>
        <description>An email address to advertise in the HTTP 'From' request
        header and User-Agent header. A good practice is to mangle this
        address (e.g. 'info at example dot com') to avoid spamming.
        </description>
  </property>
</configuration>


3.4 Create the seed directory

mkdir $NUTCH_HOME/runtime/local/bin/urls
echo http://nutch.apache.org/ >> $NUTCH_HOME/runtime/local/bin/urls/seed.txt
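To convince yourself the seed file came out right before pointing Nutch at it, the same two steps can be rehearsed in a scratch directory first (the mktemp path here is purely illustrative):

```shell
# Rehearse the seed setup in a throwaway directory
tmp=$(mktemp -d)
mkdir "$tmp/urls"
echo http://nutch.apache.org/ >> "$tmp/urls/seed.txt"

# One URL per line; the Injector reads every file under urls/
cat "$tmp/urls/seed.txt"
rm -rf "$tmp"
```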


3.5 Edit conf/regex-urlfilter.txt

sudo gedit $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt

Change the catch-all rule

# accept anything else
+.

to:

+^http://([a-z0-9]*\.)*nutch.apache.org/
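Nutch compiles these rules with Java's regex engine, but grep -E is a close enough approximation for a quick sanity check of the pattern body (everything after the leading + include marker):

```shell
# Pattern body of the rule above (the + include marker stripped off)
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Should match the seed host and its subdomains...
echo 'http://nutch.apache.org/'      | grep -Eq "$pattern" && echo 'seed: match'
echo 'http://wiki.nutch.apache.org/' | grep -Eq "$pattern" && echo 'subdomain: match'

# ...and reject everything else
echo 'http://example.com/'           | grep -Eq "$pattern" || echo 'other host: filtered'
```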




3.6 Crawl

Run from $NUTCH_HOME/runtime/local/bin, where the urls directory from 3.4 was created:

./nutch crawl urls -dir crawl -depth 3 -topN 5

Here -dir names the output crawl directory, -depth is the number of generate/fetch/update rounds, and -topN caps the URLs fetched per round.


3.7 solrindex

With Solr up and running (section 4), index the crawl data:

./nutch solrindex http://127.0.0.1:8983/solr/ mycrawl/crawldb -linkdb mycrawl/linkdb mycrawl/segments/*

(mycrawl here matches the fix from section 5.2; use crawl instead if you kept the directory name from 3.6.)


4. Solr Configuration and Test Run

4.1 Start Solr

Run from ${SOLR_HOME}/example:
java -jar start.jar


4.2 Verify in a browser

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp


4.3 Nutch & Solr integration

Replace Solr's example schema with the one that ships with Nutch:

cp $NUTCH_HOME/runtime/local/conf/schema.xml ${SOLR_HOME}/example/solr/conf/

Restart Solr:

java -jar start.jar

Then query for "nutch" in the admin page; the results are returned as XML.




5. Problems Encountered

5.1 Nutch Fetcher: No agents listed in 'http.agent.name' property


Fix: edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml and set http.agent.name, as shown in section 3.3.


5.2 Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist


Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405100520/parse_data
Input path does not exist: file:/home/xcloud/spider/nutch/runtime/local/bin/crawl/segments/20120405101222/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)


Original: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Fix: create the output directory with write permission first, then crawl into it:

mkdir mycrawl
chmod 777 mycrawl

bin/nutch crawl urls -dir mycrawl -depth 3 -topN 5


5.3 regex-urlfilter.txt regex rule problem

xcloud@xcloud:~/nutch$ nutch crawl urls -dir db -depth 2 -topN 2
solrUrl is not set, indexing will be skipped...
crawl started in: db
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-09 14:56:20
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Check logs/hadoop.log:

Caused by: java.io.IOException: Invalid first character: http://XXX/product/*
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:200)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
... 21 more
Original rule:

http://XXX/product/*

Change it to:

+^http://XXX/product/*
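The "Invalid first character" message comes from RegexURLFilterBase, which rejects any non-comment rule that does not begin with + (include) or - (exclude). A small shell sketch of that first-character check:

```shell
# Mimic RegexURLFilterBase's first-character validation of filter rules
check_rule() {
  case "$1" in
    [+-]*) echo "ok: $1" ;;
    *)     echo "invalid first character: $1" ;;
  esac
}

check_rule 'http://XXX/product/*'     # the broken rule from the log
check_rule '+^http://XXX/product/*'   # the corrected rule
```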


5.4 Timeout exceptions

2012-04-09 16:35:52,067 ERROR http.Http - java.net.SocketTimeoutException: Read timed out
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.socketRead0(Native Method)
2012-04-09 16:35:52,067 ERROR http.Http - at java.net.SocketInputStream.read(SocketInputStream.java:129)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
2012-04-09 16:35:52,067 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:116)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
2012-04-09 16:35:52,068 ERROR http.Http - at java.io.FilterInputStream.read(FilterInputStream.java:90)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:229)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
2012-04-09 16:35:52,068 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665)

Fix: raise http.timeout to 200000 (the value is in milliseconds) in conf/nutch-default.xml.
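Editing nutch-default.xml works, but the usual Nutch convention is to leave that file alone and override the property in nutch-site.xml instead; a fragment like the following (inside the <configuration> element) has the same effect:

```xml
<!-- In $NUTCH_HOME/runtime/local/conf/nutch-site.xml -->
<property>
  <name>http.timeout</name>
  <value>200000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
```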





2012-04-09 15:28:20,865 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://xxx.html: java.net.SocketTimeoutException: Read timed out

Fix: modify the RobotRulesParser.java source in the lib-http plugin:

source: /home/xcloud/iworkspace/nutch/plugin/lib-http/org/apache/nutch/protocol/http/api/RobotRulesParser.java
class: /home/xcloud/iworkspace/nutch/bin/org/apache/nutch/protocol/http/api
jar path: /home/xcloud/nutch/runtime/local/plugins/lib-http

